FLOSS Project Planets

OPTASY: How to Upgrade to Drupal 9: Just Identify and Remove Any Deprecated Code from Your Website

Planet Drupal - Fri, 2019-06-21 11:01

This is no longer news: preparing to upgrade to Drupal 9 is just a matter of... cleaning your website of all deprecated code.

No major disruption from Drupal 8. No more compatibility issues to expect (with dread)...

“Ok, but how do I know if my website's using any deprecated APIs or functions? How do I check for deprecations, identify them and then... update my code?”

Two legitimate questions that must be “haunting” you these days, whether you're a:
 

Categories: FLOSS Project Planets

Plasma Vision

Planet KDE - Fri, 2019-06-21 10:19

The Plasma Vision was written a couple of years ago: a short text saying what Plasma is and hopes to create, and defining our approach to making a useful and productive work environment for your computer. Because of creative differences it was never promoted or used properly, but in my quest to make KDE look as up to date in its presence on the web as it does on the desktop, I’ve got the Plasma sprinters who are meeting in Valencia this week to agree to adding it to the KDE Plasma webpage.

 

Categories: FLOSS Project Planets

OpenSense Labs: Drupal in the age of FinTech

Planet Drupal - Fri, 2019-06-21 09:41
"There are hundreds of startups with a lot of brains and money working on various alternatives to traditional banking" - Jamie Dimon, CEO, JPMorgan Chase

FinTech and the disruption it can cause to traditional banking systems is now a hot topic of debate at banking conferences. Global venture capital funds are super-bullish on this front and are accentuating investments in FinTech companies. Thanks to the burgeoning demand for FinTech in recent times, more crowdsourcing platforms are letting artists and fledgling entrepreneurs crowdsource capital from a large constituency of online donors or investors.


For instance, peer-to-peer (P2P) lending, the high-tech equivalent of borrowing money from friends, helps in raising a loan from an online community at a mutually negotiated interest rate. Also, digital wallet providers allow people to zip money across borders using handheld devices, even without any bank accounts.

Amalgamation of these technologies, which goes under the umbrella term FinTech, is expected to metamorphose the way all of us use banking and financial services. And Drupal can act as the perfect content management framework for building a great FinTech platform.

A portmanteau of financial technology


Financial technology, commonly referred to as FinTech, describes the evolving intersection of financial services and technology. FinTech allows people to innovate while transacting business, in areas ranging from digital money to double-entry bookkeeping.

The lines between technology and the financial services are blurring

Since the advent of the internet revolution, and later the mobile internet revolution, financial technology has grown multifold. Originally referring to technology applied to the back office of banks or trading firms, FinTech now covers a broad variety of technological interventions into personal and commercial finance.

According to EY’s FinTech Adoption Index, one-third of consumers leverage at least two FinTech services, and more and more of these consumers are also aware of FinTech being a part of their daily lives.

FinTech encompasses startups, technology companies and even legacy providers. Startups use technology to offer existing financial services at affordable costs and to provide new tech-driven solutions. Incumbent financial enterprises look to acquire or work with startups to drive digital innovation. Technology companies offer payment tools. All of these can be seen as FinTech. Clearly, the lines between technology and the financial services are blurring.

Origins of FinTech Source: 16Best

In broad strokes, the financial industry has seen a gargantuan shift over the years in the way it leverages rapid technological advancements. 16Best has compiled a brief history of FinTech which shows how the gap between financial services and technology has been bridged over the years.

The gap between financial services and technology has been bridged over the years.

In 1918, the Fedwire Funds Service began offering electronic funds transfer. And while the Great Depression was ravaging the world’s economies, IBM provided some solace with its 801 Bank Proof machine, which offered the means for faster cheque processing. Subsequently, credit cards and ATMs came into existence in the ‘50s and ‘60s.

In 1971, the first all-electronic trading emerged in the form of NASDAQ. And in 1973, SWIFT (the Society for Worldwide Interbank Financial Telecommunications) built a unified messaging framework between banks for handling money movement.

1997 was the year which saw the emergence of mobile payment, through a Coca-Cola vending machine. Fast forward to the 2000s and the present decade, and a slew of innovations crashed into the finance sector with the introduction of digital wallets, contactless payments and cryptocurrencies.

FinTech is definitely re-inventing a quicker and more durable wheel as the world continues to witness a superabundance of new ventures refining financial services with technology.

Merits of FinTech


Financial technology has taken the financial services to a whole new level with a cluster of merits that it offers. Here are some of the major benefits of FinTech:

  • Robo Advisors: They are one of the biggest areas of FinTech. These online investment services put users through a slew of questions and then rely on algorithms to come up with an investment plan for them.
  • Online Lending: It encompasses all aspects of borrowing, from personal loans to refinancing student loans, and improves access to lending.
  • Mobile payments: There is a growing demand for mobile payment options with the stupendous rise of mobile devices over the years.
Total revenue of global mobile payment market from 2015 to 2019 (in billion U.S. dollars) | Statista

Personal Finance and Savings: A plethora of FinTech organisations in the micro-saving space have been helping people save their spare change for a rainy day, and many of them reward customers for doing so. For instance, Digit allows you to automate the process of saving extra cash.

Source: Statista

Online Banking and Budgeting: Online banks like Simple reward users for using their ‘automatic savings’ service and also offer a cost-effective option over a traditional bank. Leveraging online tools, they assist users to plan budgets and handle their money smartly from their mobile devices with minimal effort to meet their savings goals.

Insurance: New insurance models have been strengthening the FinTech space. Metromile, an insurance model, sells pay per mile car insurance.

Source: Statista

Regtech: Regulation technology, which utilises IT to enhance regulatory processes, is one of the significant sectors where numerous FinTech app ideas have come to light. Regtech is useful for trading in financial markets, monitoring payment transactions and identifying clients, among other things. For instance, PassFort helps in standardising online compliance processes.

How is Drupal powering FinTech?

Organisations offering FinTech solutions need to maintain a robust online presence. Drupal has been powering the landscape of FinTech with its enormous capabilities.

The launch of TPG Capital


TPG Capital is one of the major enterprise-level FinTech companies which has leveraged the power of Drupal 8.

One of the primary objectives for TPG’s marketing circuit was to harness Drupal’s flexibility as a digital empowerment platform. They wanted the ability to make alterations to content on the fly and try out new messaging approaches. Simultaneously, the financial industry’s stringent legal and regulatory requirements called for a platform flexible enough to meet the specific needs of the sector while offering top-notch security.

Drupal came out as the right choice for a CMS that would facilitate TPG’s goal of mirroring their cutting-edge business practices and incorporating modern website design and branding.

A digital agency built a responsive, mobile-first site. It featured newer CSS features like Flexbox and CSS animations and minimised the site’s dependence on Compass by introducing Autoprefixer. Moreover, a Drupal 8 version of Swiftype was built for the search component and contributed back to the Drupal community.

The launch of Tech Coast Angels


Tech Coast Angels is one of the biggest angel investment organisations in the US.

Tech Coast Angels selected Drupal as their CMS of choice for its excellent features vis-à-vis user authentication, account management, roles and access control, custom dashboards, intricate web forms for membership and funding application, workflow management and email notifications.

Performance improvements were made by a digital agency to both the Drupal application and the server environments, which brought down costs to a huge extent by minimising the hardware requirements necessary to run the Drupal codebase in both staging and production environments.

With Drupal being one of the most security-focussed CMSs, it helped a great deal in making security-related amendments to the site. Views caching was enabled and unnecessary modules were turned off on the production server.

Market trends


The Pulse of FinTech 2018 by KPMG shows that global investment activity in FinTech companies has been steadily rising, with 2018 turning out to be the most profitable year yet. It is only going to grow more in the coming years.

In the coming years, the main trends in the asset and wealth management, banking, insurance and transactions and payments services industries can be seen in the illustration above.

Conclusion

FinTech is a great alternative to traditional banks. FinTech excels where traditional banks lag behind. In addition to offering robust financial services leveraging technological advancements, organisations offering FinTech solutions need to have a superb digital presence to offer a great digital experience. Drupal can be an awesome content store for an enterprise-level FinTech platform.

Drupal experts at Opensense Labs have been powering digital transformation pursuits of organisations offering a suite of services.

Contact us at hello@opensenselabs.com to build a FinTech web application for your business using Drupal.

Categories: FLOSS Project Planets

3D – Interactions with Qt, KUESA and Qt Design Studio, Part 1

Planet KDE - Fri, 2019-06-21 09:37

This is the first in a series of blog posts about 3D and the interaction with Qt, KUESA and Qt 3D Studio, and other things that pop up when we’re working on something.

I’m a 3D designer, mostly working in Blender. Sometimes I come across interesting problems and I’ll try to share those here. For example, trying to display things on low-end hardware – where memory is sometimes limited, meaning every polygon and triangle counts; where the renderer doesn’t do what the designer wants it to, that sort of thing. The problem that I’ll cover today is how to easily create a reflection in KUESA or Qt 3D Studio.

Neither KUESA nor Qt 3D Studio will give you free reflections. If you know a little about 3D, you know that requires ray-tracing software, not OpenGL. So, I wondered if there would be an easy way to create this effect. I mean, all that a reflection is, is a mirror of an object projected onto a plane, right? So, I wondered, could this be imitated?

To recreate this, I’d need to create an exact mirror of the object and duplicate it below the original, and have a floor that is partially transparent. I’ve created a simple scene to show you how this technique works – a scene with two cubes, a ground plane and a point light.
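If you want to script the duplicate-and-mirror step rather than doing it by hand, here is a minimal Python sketch using Blender's bpy API (Blender 2.8-style calls). The assumptions that the object to reflect is the active object and that the ground plane sits at Z = 0 are mine, not details from the scene used here.

import bpy

# Assumes the object to "reflect" is the active object and the floor lies at Z = 0.
src = bpy.context.active_object

# Duplicate the mesh data so the mirrored copy can be tweaked independently.
mirror = src.copy()
mirror.data = src.data.copy()
bpy.context.collection.objects.link(mirror)

# Flip the copy across the floor plane and place it below the original.
mirror.scale.z *= -1
mirror.location.z = -src.location.z

Flipping the Z scale keeps the copy's materials and UVs intact, which is all this trick needs.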

Here’s the result of this scene. It’s starting to look like something, but I want to compare it to a ‘real’ reflection.

For comparison, the above is a cube on a reflective, rough surface – showing the result using raytracing. You can see here the reflection is different from our example above – the main issue is that the reflection eventually fades out the further away it gets from the contact point. 

How to resolve this? This can be mimicked by creating an image texture for the alpha that fades out the model towards the top (or rather the bottom) of the reflection. I can also further enhance the illusion by ensuring that the floor is rough – allowing the texture of the surface to assist the illusion of a reflection.

Another difference between the shots is the blurriness on the edge of the mesh – this could be approximated by creating duplicates of the mesh and for each one, increasing the size and reducing the opacity. Depending on the complexity of the model, this may add too many polygons to render, while only adding a subtle effect.

So, given that this is a very simple example and not one that would translate well to something that a client might ask for, how can I translate this into a more complex model, such as the car below? I’ll chat about that in the next post.

The post 3D – Interactions with Qt, KUESA and Qt Design Studio, Part 1 appeared first on KDAB.

Categories: FLOSS Project Planets

parallel @ Savannah: GNU Parallel 20190622 ('HongKong') released

GNU Planet! - Fri, 2019-06-21 09:36

GNU Parallel 20190622 ('HongKong') has been released. It is available for download at: http://ftpmirror.gnu.org/parallel/

GNU Parallel will be 10 years old in a year, on 2020-04-22. You are hereby invited to a reception on Friday 2020-04-17.

See https://www.gnu.org/software/parallel/10-years-anniversary.html

Quote of the month:

  I want to make a shout-out for @GnuParallel, it's a work of beauty and power
    -- Cristian Consonni @CristianCantoro

New in this release:

  • --shard can now take a column name and optionally a perl expression. Similar to --group-by and replacement strings.
  • Bug fixes and man page updates.

Get the book: GNU Parallel 2018 http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html

GNU Parallel - For people who live life in the parallel lane.

About GNU Parallel

GNU Parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU Parallel can then split the input and pipe it into commands in parallel.

If you use xargs and tee today you will find GNU Parallel very easy to use as GNU Parallel is written to have the same options as xargs. If you write loops in shell, you will find GNU Parallel may be able to replace most of the loops and make them run faster by running several jobs in parallel. GNU Parallel can even replace nested loops.

GNU Parallel makes sure output from the commands is the same output as you would get had you run the commands sequentially. This makes it possible to use output from GNU Parallel as input for other programs.

You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/

You can install GNU Parallel in just 10 seconds with:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.

When using programs that use GNU Parallel to process data for publication please cite:

O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014.

If you like GNU Parallel:

  • Give a demo at your local user group/team/colleagues
  • Post the intro videos on Reddit/Diaspora*/forums/blogs/ Identi.ca/Google+/Twitter/Facebook/Linkedin/mailing lists
  • Get the merchandise https://gnuparallel.threadless.com/designs/gnu-parallel
  • Request or write a review for your favourite blog or magazine
  • Request or build a package for your favourite distribution (if it is not already there)
  • Invite me for your next conference

If you use programs that use GNU Parallel for research:

  • Please cite GNU Parallel in your publications (use --citation)

If GNU Parallel saves you money:

About GNU SQL

GNU sql aims to give a simple, unified interface for accessing databases through all the different databases' command line clients. So far the focus has been on giving a common way to specify login information (protocol, username, password, hostname, and port number), size (database and table size), and running queries.

The database is addressed using a DBURL. If commands are left out you will get that database's interactive shell.

When using GNU SQL for a publication please cite:

O. Tange (2011): GNU SQL - A Command Line Tool for Accessing Different Databases Using DBURLs, ;login: The USENIX Magazine, April 2011:29-32.

About GNU Niceload

GNU niceload slows down a program when the computer load average (or other system activity) is above a certain limit. When the limit is reached the program will be suspended for some time. If the limit is a soft limit the program will be allowed to run for short amounts of time before being suspended again. If the limit is a hard limit the program will only be allowed to run when the system is below the limit.

Categories: FLOSS Project Planets

mailutils @ Savannah: Version 3.7

GNU Planet! - Fri, 2019-06-21 09:15

Version 3.7 of GNU mailutils is available for download.

This version introduces a new format for mailboxes: dotmail. Dotmail is a replacement for traditional mbox format, proposed by
Kurt Hackenberg. A dotmail mailbox is a single disk file, where messages are stored sequentially. Each message ends with a single
dot (similar to the format used in the SMTP DATA command). A dot appearing at the start of a line is doubled, to prevent it from being interpreted as an end-of-message marker.

For a complete list of changes, please see the NEWS file.

Categories: FLOSS Project Planets

wishdesk.com: Responsive design in Drupal 8: great core & contributed modules

Planet Drupal - Fri, 2019-06-21 08:23
Drupal 8 has been built with mobile devices in mind. It has responsive default themes, responsive admin interfaces, and powerful opportunities for mobile-friendly design. Great Drupal 8 modules are very helpful in implementing any ideas in this area.
Categories: FLOSS Project Planets

New website, new company, new partners, new code

Planet KDE - Fri, 2019-06-21 06:00

The obvious change to announce is the new website design. But there is much more to talk about.

### Website overhaul

The old website, reachable primarily on the domain [subdiff.de][subdiff.de], was a pure blog built with Jekyll and the design was some random theme I picked up on GitHub. It was a quick thing to do back in the days when I needed a blog up fast for community interaction as a KWin and Plasma developer. But on the back burner my goal was already for quite some time to rebuild the website with a more custom and professional design. Additionally I wanted this website to not only be a blog but also a landing page with some general information about my work.

The opportunity arose now and after several months of research and coding I finished the website rebuild. This all needed longer because it seemed to me like an ideal occasion to learn about modern web development techniques and so I didn't settle for the first plain solution I came across but invested some more time into selecting and learning a suitable technology stack.

In the end I decided to use [Gridsome][gridsome], a static site generator leveraging [Vue.js][vue] for the frontend and [GraphQL][graphql] as data backend when generating the site. By that Gridsome is a prime example of the [JAMstack][jamstack], a most modern and very sensible way of building small to medium sized websites with only few selected dynamic elements through JavaScript APIs while keeping everything else static. After all that learning, decision taking and finally coding I'm now really happy with this solution and I definitely want to write in greater detail about it in the future.

Feature-wise the current website provides what I think are the necessary basics and it could still be extended in several ways, but as for now I will stick to these basics and only look into new features when I get an urge to do it.

### Freelancer business

Since January I work as a freelancer. This means in Germany that I basically had to start a company, so I did that. I called it *subdiff : software system*, and the brand is still the domain name you are currently browsing. I used it already before as this website's domain name and as an online nickname. It is derived from a mathematical concept and on the other side stands for a slogan I find sensible on a practical level in work and life:

> Subtract the nonsense, differentiate what's left.

### Part of Valve's Open Source Group

As a freelancer I am contracted by Valve to work on certain gaming-related XServer projects and improve KWin in this regard and for general desktop usage. In the XServer there are two main projects at the moment. The technical details of one of them are currently discussed on a work-in-progress patch series [on Gitlab][xserver-composite-accel-patch] but I want to write accessible articles about both projects here on the blog as well in the near future. In KWin I have several large projects I will look into, which would benefit KWin on X11 and Wayland alike. The most relevant one is [reworking the compositing pipeline][phab-comp-rework]. You can expect more info about this project and the other ones in KWin in future blog posts too.

### New code

While there are some big projects in the pipeline I was also able to commit some major changes in the last few months to KWin and Plasma. The largest one was for sure [XWayland drag-and-drop support][xwl-dnd] in KWin. But in best case scenario the user won't even notice this feature because drag-and-drop between any relevant windows will just work from now on in our Wayland session. Inside KWin though the technical solution enabling this was built up from the ground. And in a way such that we should be able to later support something like middle-click-paste between XWayland and Wayland native windows easily.

There were two other major initiatives by me that I was able to merge: the finalization of basing every display representation in KWin on the generic `AbstractOutput` class and in Plasma's display management library, daemon and settings panel to [save display-individual values][kscreen-patch] in a consistent way by introducing a new communication channel between these components. While the results of both enhancements are again supposed to be unnoticeable by the user but should improve the code structure and increase the overall stability there is more work lined up for display management which then will directly affect the interface. Take a look at [this task][display-further-work-task] to see what I have planned.

So there is interesting work ahead. Luckily this week I am with my fellow KWin and Plasma developers at the Plasma and Usability sprint in Valencia to discuss and plan work on such projects. The sprint officially started yesterday and the first day already was very productive. We strive to keep up that momentum till the end of the sprint next week and I plan on writing an article about the sprint results afterwards. In the meantime you can follow [@kdecommunity][twitter-kdecommunity] on Twitter if you want to receive timely updates on our sprint while it's happening.

### Final remarks and prospect

I try to keep the articles in this blog rather prosaic and technical but there are so many things moving forward and evolving right now that I want to spend a few paragraphs in the end on the opposite.

In every aspect there is just immense *potential* when looking at our open source graphics stack consisting of KDE Plasma with KWin, at the moment still good old X but in the future Wayland, and the Linux graphics drivers below. While the advantages of free and open source software for the people were always obvious, how rapidly this type of software became the backbone of our global economy signifies that it is immensely valuable for companies alike. In this context the opportunities on how to make use of our software offerings and improve them are endless while the technical challenges we face when doing that are interesting. By this we can do our part such that the open source community will grow and foster.

As a reader of these sentences you are already in a prime position to take part in this great journey as well by becoming an active member of the community through contributing. Maybe you already do this for example by coding, designing, researching, donating or just by giving us feedback on how our technology can become better. But if you are not yet, this is a great time to get involved and bring in your individual talents and motivation to build up something great together for ourselves and everybody. You can find out more on how to do that by visiting KDE's [Get Involved page][kde-involved] or join in on the ongoing discussion about KDE's [future goals][goals-blog].

[subdiff.de]: https://subdiff.de
[gridsome]: https://gridsome.org
[vue]: https://vuejs.org
[graphql]: https://graphql.org
[jamstack]: https://jamstack.org
[xserver-composite-accel-patch]: https://gitlab.freedesktop.org/xorg/xserver/merge_requests/211
[phab-comp-rework]: https://phabricator.kde.org/T11071
[xwl-dnd]: https://phabricator.kde.org/R108:548978bfe1f714e51af6082933a512d28504f7e3
[kscreen-patch]: https://phabricator.kde.org/T10028
[display-further-work-task]: https://phabricator.kde.org/T11095
[twitter-kdecommunity]: https://twitter.com/kdecommunity
[kde-involved]: https://community.kde.org/Get_Involved
[goals-blog]: http://blog.lydiapintscher.de/2019/06/09/evolving-kde-lets-set-some-new-goals-for-kde/

Categories: FLOSS Project Planets

Ruslan Spivak: Let’s Build A Simple Interpreter. Part 15.

Planet Python - Fri, 2019-06-21 05:45

“I am a slow walker, but I never walk back.” — Abraham Lincoln

And we’re back to our regularly scheduled programming! :)

Before moving on to topics of recognizing and interpreting procedure calls, let’s make some changes to improve our error reporting a bit. Up until now, if there was a problem getting a new token from text, parsing source code, or doing semantic analysis, a stack trace would be thrown right into your face with a very generic message. We can do better than that.

To provide better error messages pinpointing where in the code an issue happened, we need to add some features to our interpreter. Let’s do that and make some other changes along the way. This will make the interpreter more user friendly and give us an opportunity to flex our muscles after a “short” break in the series. It will also give us a chance to prepare for new features that we will be adding in future articles.

Goals for today:

  • Improve error reporting in the lexer, parser, and semantic analyzer. Instead of stack traces with very generic messages like “Invalid syntax”, we would like to see something more useful like “SyntaxError: Unexpected token -> Token(TokenType.SEMI, ‘;’, position=23:13)”
  • Add a "--scope" command line option to turn scope output on/off
  • Switch to Python 3. From here on out, all code will be tested on Python 3.7+ only

Let’s get cracking and start flexing our coding muscles by changing our lexer first.


Here is a list of the changes we are going to make in our lexer today:

  1. We will add error codes and custom exceptions: LexerError, ParserError, and SemanticError
  2. We will add new members to the Lexer class to help to track tokens’ positions: lineno and column
  3. We will modify the advance method to update the lexer’s lineno and column variables
  4. We will update the error method to raise a LexerError exception with information about the current line and column
  5. We will define token types in the TokenType enumeration class (Support for enumerations was added in Python 3.4)
  6. We will add code to automatically create reserved keywords from the TokenType enumeration members
  7. We will add new members to the Token class: lineno and column to keep track of the token’s line number and column number, respectively, in the text
  8. We will refactor the get_next_token method code to make it shorter and have a generic code that handles single-character tokens


1. Let’s define some error codes first. These codes will be used by our parser and semantic analyzer. Let’s also define the following error classes: LexerError, ParserError, and SemanticError for lexical, syntactic, and semantic errors, respectively:

from enum import Enum

class ErrorCode(Enum):
    UNEXPECTED_TOKEN = 'Unexpected token'
    ID_NOT_FOUND     = 'Identifier not found'
    DUPLICATE_ID     = 'Duplicate id found'

class Error(Exception):
    def __init__(self, error_code=None, token=None, message=None):
        self.error_code = error_code
        self.token = token
        # add exception class name before the message
        self.message = f'{self.__class__.__name__}: {message}'

class LexerError(Error):
    pass

class ParserError(Error):
    pass

class SemanticError(Error):
    pass


ErrorCode is an enumeration class, where each member has a name and a value:

>>> from enum import Enum
>>>
>>> class ErrorCode(Enum):
...     UNEXPECTED_TOKEN = 'Unexpected token'
...     ID_NOT_FOUND = 'Identifier not found'
...     DUPLICATE_ID = 'Duplicate id found'
...
>>> ErrorCode
<enum 'ErrorCode'>
>>>
>>> ErrorCode.ID_NOT_FOUND
<ErrorCode.ID_NOT_FOUND: 'Identifier not found'>


The Error base class constructor takes three arguments:

  • error_code: ErrorCode.ID_NOT_FOUND, etc

  • token: an instance of the Token class

  • message: a message with more detailed information about the problem

As I’ve mentioned before, LexerError is used to indicate an error encountered in the lexer, ParserError is for syntax related errors during the parsing phase, and SemanticError is for semantic errors.


2. To provide better error messages, we want to display the position in the source text where the problem happened. To be able to do that, we need to start tracking the current line number and column in our lexer as we generate tokens. Let’s add lineno and column fields to the Lexer class:

class Lexer(object):
    def __init__(self, text):
        ...
        # self.pos is an index into self.text
        self.pos = 0
        self.current_char = self.text[self.pos]
        # token line number and column number
        self.lineno = 1
        self.column = 1


3. The next change we need to make is to increment lineno and reset column in the advance method when encountering a newline, and also to increase the column value on each advance of the self.pos pointer:

def advance(self):
    """Advance the `pos` pointer and set the `current_char` variable."""
    if self.current_char == '\n':
        self.lineno += 1
        self.column = 0

    self.pos += 1
    if self.pos > len(self.text) - 1:
        self.current_char = None  # Indicates end of input
    else:
        self.current_char = self.text[self.pos]
        self.column += 1

With those changes in place, every time we create a token we will pass the current lineno and column from the lexer to the newly created token.
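As a quick illustration, here is a minimal sketch (my own condensation of the idea, not a verbatim excerpt from the article's spi.py, so the details may differ from the final code on GitHub) of how a multi-character token method such as number() can capture the lexer's position before consuming characters and hand it to the Token constructor. It uses the Token class and TokenType members that are introduced in changes 5 and 7 below.

def number(self):
    """Return a (multi-digit) integer or float token consumed from the input."""
    # Remember where the token starts so error messages can point at it.
    token = Token(type=None, value=None, lineno=self.lineno, column=self.column)

    result = ''
    while self.current_char is not None and self.current_char.isdigit():
        result += self.current_char
        self.advance()

    if self.current_char == '.':
        result += self.current_char
        self.advance()
        while self.current_char is not None and self.current_char.isdigit():
            result += self.current_char
            self.advance()
        token.type = TokenType.REAL_CONST
        token.value = float(result)
    else:
        token.type = TokenType.INTEGER_CONST
        token.value = int(result)

    return token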


4. Let’s update the error method to throw a LexerError exception with a more detailed error message telling us the current character that the lexer choked on and its location in the text.

def error(self):
    s = "Lexer error on '{lexeme}' line: {lineno} column: {column}".format(
        lexeme=self.current_char,
        lineno=self.lineno,
        column=self.column,
    )
    raise LexerError(message=s)


5. Instead of having token types defined as module level variables, we are going to move them into a dedicated enumeration class called TokenType. This will help us simplify certain operations and make some parts of our code a bit shorter.

Old style:

# Token types
PLUS = 'PLUS'
MINUS = 'MINUS'
MUL = 'MUL'
...

New style:

class TokenType(Enum):
    # single-character token types
    PLUS          = '+'
    MINUS         = '-'
    MUL           = '*'
    FLOAT_DIV     = '/'
    LPAREN        = '('
    RPAREN        = ')'
    SEMI          = ';'
    DOT           = '.'
    COLON         = ':'
    COMMA         = ','
    # block of reserved words
    PROGRAM       = 'PROGRAM'  # marks the beginning of the block
    INTEGER       = 'INTEGER'
    REAL          = 'REAL'
    INTEGER_DIV   = 'DIV'
    VAR           = 'VAR'
    PROCEDURE     = 'PROCEDURE'
    BEGIN         = 'BEGIN'
    END           = 'END'      # marks the end of the block
    # misc
    ID            = 'ID'
    INTEGER_CONST = 'INTEGER_CONST'
    REAL_CONST    = 'REAL_CONST'
    ASSIGN        = ':='
    EOF           = 'EOF'


6. We used to manually add items to the RESERVED_KEYWORDS dictionary whenever we had to add a new token type that was also a reserved keyword. If we wanted to add a new STRING token type, we would have to

  • (a) create a new module level variable STRING = ‘STRING’
  • (b) manually add it to the RESERVED_KEYWORDS dictionary

Now that we have the TokenType enumeration class, we can remove the manual step (b) above and keep token types in one place only. This is the “two is too many” rule in action - going forward, the only change you need to make to add a new keyword token type is to put the keyword between PROGRAM and END in the TokenType enumeration class, and the _build_reserved_keywords function will take care of the rest:

def _build_reserved_keywords():
    """Build a dictionary of reserved keywords.

    The function relies on the fact that in the TokenType
    enumeration the beginning of the block of reserved keywords is
    marked with PROGRAM and the end of the block is marked with
    the END keyword.

    Result:
        {'PROGRAM': <TokenType.PROGRAM: 'PROGRAM'>,
         'INTEGER': <TokenType.INTEGER: 'INTEGER'>,
         'REAL': <TokenType.REAL: 'REAL'>,
         'DIV': <TokenType.INTEGER_DIV: 'DIV'>,
         'VAR': <TokenType.VAR: 'VAR'>,
         'PROCEDURE': <TokenType.PROCEDURE: 'PROCEDURE'>,
         'BEGIN': <TokenType.BEGIN: 'BEGIN'>,
         'END': <TokenType.END: 'END'>}
    """
    # enumerations support iteration, in definition order
    tt_list = list(TokenType)
    start_index = tt_list.index(TokenType.PROGRAM)
    end_index = tt_list.index(TokenType.END)
    reserved_keywords = {
        token_type.value: token_type
        for token_type in tt_list[start_index:end_index + 1]
    }
    return reserved_keywords


RESERVED_KEYWORDS = _build_reserved_keywords()


As you can see from the function’s documentation string, the function relies on the fact that a block of reserved keywords in the TokenType enum is marked by PROGRAM and END keywords.

The function first turns TokenType into a list (the definition order is preserved), and then it gets the starting index of the block (marked by the PROGRAM keyword) and the end index of the block (marked by the END keyword). Next, it uses dictionary comprehension to build a dictionary where the keys are string values of the enum members and the values are the TokenType members themselves.

>>> from spi import _build_reserved_keywords
>>> from pprint import pprint
>>> pprint(_build_reserved_keywords())  # 'pprint' sorts the keys
{'BEGIN': <TokenType.BEGIN: 'BEGIN'>,
 'DIV': <TokenType.INTEGER_DIV: 'DIV'>,
 'END': <TokenType.END: 'END'>,
 'INTEGER': <TokenType.INTEGER: 'INTEGER'>,
 'PROCEDURE': <TokenType.PROCEDURE: 'PROCEDURE'>,
 'PROGRAM': <TokenType.PROGRAM: 'PROGRAM'>,
 'REAL': <TokenType.REAL: 'REAL'>,
 'VAR': <TokenType.VAR: 'VAR'>}


7. The next change is to add new members to the Token class, namely lineno and column, to keep track of a token’s line number and column number in the text:

class Token(object):
    def __init__(self, type, value, lineno=None, column=None):
        self.type = type
        self.value = value
        self.lineno = lineno
        self.column = column

    def __str__(self):
        """String representation of the class instance.

        Example:
            >>> Token(TokenType.INTEGER, 7, lineno=5, column=10)
            Token(TokenType.INTEGER, 7, position=5:10)
        """
        return 'Token({type}, {value}, position={lineno}:{column})'.format(
            type=self.type,
            value=repr(self.value),
            lineno=self.lineno,
            column=self.column,
        )

    def __repr__(self):
        return self.__str__()


8. Now, onto get_next_token method changes. Thanks to enums, we can reduce the amount of code that deals with single character tokens by writing a generic code that generates single character tokens and doesn’t need to change when we add a new single character token type:

Instead of a lot of code blocks like these:

if self.current_char == ';':
    self.advance()
    return Token(SEMI, ';')

if self.current_char == ':':
    self.advance()
    return Token(COLON, ':')

if self.current_char == ',':
    self.advance()
    return Token(COMMA, ',')
...

We can now use this generic code to take care of all current and future single-character tokens:

# single-character token
try:
    # get enum member by value, e.g.
    # TokenType(';') --> TokenType.SEMI
    token_type = TokenType(self.current_char)
except ValueError:
    # no enum member with value equal to self.current_char
    self.error()
else:
    # create a token with a single-character lexeme as its value
    token = Token(
        type=token_type,
        value=token_type.value,  # e.g. ';', '.', etc
        lineno=self.lineno,
        column=self.column,
    )
    self.advance()
    return token

Arguably it’s less readable than a bunch of if blocks, but it’s pretty straightforward once you understand what’s going on here. Python enums allow us to access enum members by values and that’s what we use in the code above. It works like this:

  • First we try to get a TokenType member by the value of self.current_char
  • If the operation throws a ValueError exception, that means we don’t support that token type
  • Otherwise we create a correct token with the corresponding token type and value.

This block of code will handle all current and new single character tokens. All we need to do to support a new token type is to add the new token type to the TokenType definition and that’s it. The code above will stay unchanged.

The way I see it, it’s a win-win situation with this generic code: we learned a bit more about Python enums, specifically how to access enumeration members by values; we wrote some generic code to handle all single character tokens, and, as a side effect, we reduced the amount of repetitive code to handle those single character tokens.

The next stop is parser changes.


Here is a list of changes we’ll make in our parser today:

  1. We will update the parser’s error method to throw a ParserError exception with an error code and current token
  2. We will update the eat method to call the modified error method
  3. We will refactor the declarations method and move the code that parses a procedure declaration into a separate method.

1. Let’s update the parser’s error method to throw a ParserError exception with some useful information

def error(self, error_code, token):
    raise ParserError(
        error_code=error_code,
        token=token,
        message=f'{error_code.value} -> {token}',
    )


2. And now let’s modify the eat method to call the updated error method

def eat(self, token_type):
    # compare the current token type with the passed token
    # type and if they match then "eat" the current token
    # and assign the next token to the self.current_token,
    # otherwise raise an exception.
    if self.current_token.type == token_type:
        self.current_token = self.get_next_token()
    else:
        self.error(
            error_code=ErrorCode.UNEXPECTED_TOKEN,
            token=self.current_token,
        )


3. Next, let’s update the declarations method’s documentation string and move the code that parses a procedure declaration into a separate method, procedure_declaration:

def declarations(self):
    """
    declarations : (VAR (variable_declaration SEMI)+)? procedure_declaration*
    """
    declarations = []
    if self.current_token.type == TokenType.VAR:
        self.eat(TokenType.VAR)
        while self.current_token.type == TokenType.ID:
            var_decl = self.variable_declaration()
            declarations.extend(var_decl)
            self.eat(TokenType.SEMI)

    while self.current_token.type == TokenType.PROCEDURE:
        proc_decl = self.procedure_declaration()
        declarations.append(proc_decl)

    return declarations

def procedure_declaration(self):
    """procedure_declaration :
         PROCEDURE ID (LPAREN formal_parameter_list RPAREN)? SEMI block SEMI
    """
    self.eat(TokenType.PROCEDURE)
    proc_name = self.current_token.value
    self.eat(TokenType.ID)
    params = []

    if self.current_token.type == TokenType.LPAREN:
        self.eat(TokenType.LPAREN)
        params = self.formal_parameter_list()
        self.eat(TokenType.RPAREN)

    self.eat(TokenType.SEMI)
    block_node = self.block()
    proc_decl = ProcedureDecl(proc_name, params, block_node)
    self.eat(TokenType.SEMI)
    return proc_decl

These are all the changes in the parser. Now, we’ll move onto the semantic analyzer.


And finally here is a list of changes we’ll make in our semantic analyzer:

  1. We will add a new error method to the SemanticAnalyzer class to throw a SemanticError exception with some additional information
  2. We will update visit_VarDecl to signal an error by calling the error method with a relevant error code and token
  3. We will also update visit_Var to signal an error by calling the error method with a relevant error code and token
  4. We will add a log method to both the ScopedSymbolTable and SemanticAnalyzer, and replace all print statements with calls to self.log in the corresponding classes
  5. We will add a command line option "--scope" to turn scope logging on and off (it will be off by default) to control how “noisy” we want our interpreter to be
  6. We will add empty visit_Num and visit_UnaryOp methods


1. First things first. Let’s add the error method to throw a SemanticError exception with a corresponding error code, token and message:

def error(self, error_code, token):
    raise SemanticError(
        error_code=error_code,
        token=token,
        message=f'{error_code.value} -> {token}',
    )


2. Next, let’s update visit_VarDecl to signal an error by calling the error method with a relevant error code and token

def visit_VarDecl(self, node):
    type_name = node.type_node.value
    type_symbol = self.current_scope.lookup(type_name)

    # We have all the information we need to create a variable symbol.
    # Create the symbol and insert it into the symbol table.
    var_name = node.var_node.value
    var_symbol = VarSymbol(var_name, type_symbol)

    # Signal an error if the table already has a symbol
    # with the same name
    if self.current_scope.lookup(var_name, current_scope_only=True):
        self.error(
            error_code=ErrorCode.DUPLICATE_ID,
            token=node.var_node.token,
        )

    self.current_scope.insert(var_symbol)


3. We also need to update the visit_Var method to signal an error by calling the error method with a relevant error code and token

def visit_Var(self, node):
    var_name = node.value
    var_symbol = self.current_scope.lookup(var_name)
    if var_symbol is None:
        self.error(error_code=ErrorCode.ID_NOT_FOUND, token=node.token)

Now semantic errors will be reported as follows:

SemanticError: Duplicate id found -> Token(TokenType.ID, 'a', position=21:4)

Or

SemanticError: Identifier not found -> Token(TokenType.ID, 'b', position=22:9)


4. Let’s add the log method to both the ScopedSymbolTable and SemanticAnalyzer, and replace all print statements with calls to self.log:

def log(self, msg):
    if _SHOULD_LOG_SCOPE:
        print(msg)

As you can see, the message will be printed only if the global variable _SHOULD_LOG_SCOPE is set to true. The --scope command line option that we will add in the next step will control the value of the _SHOULD_LOG_SCOPE variable.


5. Now, let’s update the main function and add a command line option "--scope" to turn scope logging on and off (it’s off by default)

parser = argparse.ArgumentParser(
    description='SPI - Simple Pascal Interpreter'
)
parser.add_argument('inputfile', help='Pascal source file')
parser.add_argument(
    '--scope',
    help='Print scope information',
    action='store_true',
)
args = parser.parse_args()

global _SHOULD_LOG_SCOPE
_SHOULD_LOG_SCOPE = args.scope

Here is an example with the switch on:

$ python spi.py idnotfound.pas --scope
ENTER scope: global
Insert: INTEGER
Insert: REAL
Lookup: INTEGER. (Scope name: global)
Lookup: a. (Scope name: global)
Insert: a
Lookup: b. (Scope name: global)
SemanticError: Identifier not found -> Token(TokenType.ID, 'b', position=6:9)

And with scope logging off (default):

$ python spi.py idnotfound.pas
SemanticError: Identifier not found -> Token(TokenType.ID, 'b', position=6:9)


6. Add empty visit_Num and visit_UnaryOp methods

def visit_Num(self, node):
    pass

def visit_UnaryOp(self, node):
    pass

These are all the changes to our semantic analyzer for now.

See GitHub for Pascal files with different errors to try your updated interpreter on and see what error messages the interpreter generates.


That is all for today. You can find the full source code for today’s article interpreter on GitHub. In the next article we’ll talk about how to recognize (i.e. how to parse) procedure calls. Stay tuned and see you next time!


Categories: FLOSS Project Planets

Candy Tsai: Outreachy Week 5: What is debci?

Planet Debian - Fri, 2019-06-21 05:33

The theme for this week in Outreachy is “Think About Your Audience”. So I’m currently thinking about you.

Or not?

After being asked sooo many times what am I doing for this internship, I think I never explained it well enough so that others could understand. Let me give it a try here.

debci is short for “Debian Continuous Integration”, so I’ll start with a short definition of what “Continuous Integration” is then!

Continuous Integration (CI)

Since there should be quite some articles talking about this topic, here is a quick explanation that I found on Microsoft Azure (link):

Continuous Integration (CI) is the process of automating the build and testing of code every time a team member commits changes to version control.

A scenario would be: whenever I push code to Debian's Salsa GitLab, it automatically runs the tests we have written in our code. This is to make sure that the new code changes don't break the stuff that used to work.

The debci Project

Before Debian puts out a new release, all of the packages that have tests written in them need to be tested. debci is a platform for testing packages and provides a UI to see whether they pass or not. The goal is to make sure that the packages will pass their tests before a major Debian release. For example, when the ruby-defaults package is updated, we not only want to test ruby-defaults but also all the packages that depend on it. In short, debci helps make sure the packages are working correctly.

For my internship, I am working on improving the user experience through the UI of the debci site. The biggest task is to let developers easily test their packages with packages in different suites and architectures.

The terms that keep popping up in debci are:

  • suite
  • architecture
  • pin-packages
  • trigger

There are three obvious suites on the debci site right now, namely unstable, testing and stable. There is also an experimental suite that a user can test their packages against. An architecture is something like amd64 or arm64.

The life of a package is something like this:

  1. When a package is updated/added, it goes into unstable
  2. After it has stayed 2-10 days in unstable without any big issues, it moves into testing
  3. It becomes stable after a release
Normally a package moves from unstable > testing > stable

Let’s say a user wants to test the ruby-defaults package in unstable on amd64 along with a package C from experimental. Here package C would be a pin-package, that is, a package the user wants to test alongside. Last but not least, the trigger is just the name of the test job; one can choose to use it or not.

Currently, there is an API that lets you make this request with curl or something similar, but it's not very friendly since not everyone is familiar with what the request should look like. Therefore, not a lot of people are willing to use it, and I have also seen requests for an improvement on this in the #debci channel. An easy-to-use UI might be the solution to make requesting these tests easier. Knowing that what I am working on is useful for others is an important key to keeping myself motivated.
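To give an idea of what such a self-service request involves, here is a rough Python sketch of submitting one test request. The endpoint path, header name, key value and JSON layout are placeholders of my own, not the authoritative debci interface, so please check the debci self-service API documentation for the real details; only the concepts (package, suite, architecture, pin-packages, trigger) come from the explanation above.

import requests

# All of the values below are illustrative placeholders, not the documented debci API.
API = "https://ci.debian.net/api/v1"
KEY = "your-debci-api-key"   # hypothetical personal API key

# One test request: the package to test, a trigger (a name for this job)
# and pin-packages (packages from another suite to test along with).
tests = [{
    "package": "ruby-defaults",
    "trigger": "demo-run",
    "pin-packages": [["package-c", "experimental"]],
}]

resp = requests.post(
    f"{API}/test/unstable/amd64",        # assumed layout: suite and architecture in the URL
    headers={"Auth-Key": KEY},
    json=tests,
)
print(resp.status_code, resp.text)

The goal of the UI work is that a developer can fill in exactly these fields in a form instead of hand-crafting a request like this.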

The debci Community

The debci community is very small but active. The people that are directly involved are my mentors: terceiro and elbrus. Sometimes people drop by the IRC channel to ask questions, but basically that's it. This works pretty well for me, because I'm usually not at ease and will keep a low profile if the community gets too large.

I’m not familiar with the whole Debian community, but I have also been hanging around the #debian-outreach channel. It felt warm to know that someone realized there was an intern from Taiwan for this round of Outreachy. As far as I have experienced, everyone I have chatted with has been nice and eager to share which Debian-related communities were close to me.

Week 5: Modularizing & Testing

This week I worked on adding tests and tried pulling out the authentication code to make the code a bit more DRY.

  • Learned how to setup tests in Ruby
  • Came up with test cases
  • Learned more about how classes work in Ruby
  • Separated the authentication code

And… probably also writing this blog post! I found that blogging takes up more time than I thought it should.

Categories: FLOSS Project Planets

Sven Hoexter: logstash json filter error

Planet Debian - Fri, 2019-06-21 04:08

If you've a logstash filter that contains a json filter/decoding step like

filter { json { source => "log" } }

this, and you end up with an error message like that:

[2019-06-21T09:47:58,243][WARN ][logstash.filters.json ] Error parsing json {:source=>"log", :raw=>{"file"=>{"path"=>"/var/lib/docker/containers/abdf3db21fca8e1dc17c888d4aa661fe16ae4371355215157cf7c4fc91b8ea4b/abdf3db21fca8e1dc17c888d4aa661fe16ae4371355215157cf7c4fc91b8ea4b-json.log"}}, :exception=>java.lang.ClassCastException}

It might be just telling you that the field log actually does contain valid json, and no decoding is required.

Categories: FLOSS Project Planets

Talk Python to Me: #217 Notebooks vs data science-enabled scripts

Planet Python - Fri, 2019-06-21 04:00
On this episode, I meet up with Rong Lu and Katherine Kampf from Microsoft while I was at BUILD this year. We cover a bunch of topics around data science and talk about two opposing styles of data science development and related tooling: Notebooks vs Python code files and editors.
Categories: FLOSS Project Planets

Agiledrop.com Blog: Burnout: Symptoms of developer burnout & ways to tackle it

Planet Drupal - Fri, 2019-06-21 03:06

Burnout is becoming an increasingly prevalent problem, especially in a field as fast-paced as development. In this post, we'll take a look at how you can spot the symptoms of burnout in your developers and what measures you can take to tackle it.

Categories: FLOSS Project Planets

OpenSense Labs: Disseminating Knowledge: Drupal for Education and E-learning

Planet Drupal - Fri, 2019-06-21 02:56
"Information is a source of learning. But unless it is organized, processed, and available to the right people in a format for decision making, it is a burden, not a benefit." - C. William Pollard, Chairman, Fairwyn Investment Company

Have you always secretly wanted to spend your evenings writing symphonies, learning about filmography or assessing climate change? Studying niche subjects has traditionally been for niche students. But e-learning platforms have changed all that with the provision for learning almost any subject online.


Corporate e-learning has witnessed a stupendous 900% growth in the last decade or so. With more and more e-learning platforms flourishing, organisations are striving to be the best to stand apart from the rest. Drupal has been a great asset in powering education and e-learning with its powerful capabilities that can help enterprises offer a wonderful digital experience. Let’s trace the roots of e-learning before diving deep into the ocean of possibilities with Drupal for building an amazing e-learning platform.

Before the internet era Source: eFront

A brief history of e-learning can be traced through the compilation made by eFront. Even before the internet existed, distance education was being offered. In 1840, Isaac Pitman taught shorthand via correspondence: completed assignments were sent to him by mail and he would then send his students more work.

Fast forward to the 20th century: the first testing machine, which enabled students to test themselves, was invented in 1924. The teaching machine was invented in 1954 by a Harvard professor to allow schools to administer programmed instruction to students. In 1960, the first computer-based training (CBT) program, called Programmed Logic for Automatic Teaching Operations (PLATO), was introduced.

At a CBT systems seminar in 1999, the term ‘e-learning’ was first used. Eventually, with the internet and computers becoming the core of businesses, the 2000s saw the adoption of e-learning by organisations to train employees. Today, a plenitude of e-learning solutions are available in the form of MOOCs (Massive Open Online Courses), social platforms and Learning Management Systems, among others.

E-learning: Learn anywhere, anytime

In essence, e-learning refers to a computer-based educational tool or system that allows you to learn anywhere and at any time. It is an online method of building skills and knowledge across the complete workforce and with customers and partners. It comes in numerous formats like self-paced courses, virtual live classrooms or informal learning.

E-learning refers to the computer-based educational tool or system that allows you to learn anywhere and at any time

Technological advancements have diminished the geographical gap with the use of tools that can make you feel as if you are inside the classroom. E-learning provides the ability to share material in all sorts of formats such as videos, slideshows, and PDFs. It is possible to conduct webinars (live online classes) and communicate with professors via chat and message forums.

There is a superabundance of different e-learning systems (otherwise known as Learning Management Systems, or LMSs) and methods which enable courses to be delivered. With the right kind of tools, several processes can be automated, like the marking of tests or the creation of engrossing content. E-learning offers learners the ability to fit learning around their lifestyles, thereby enabling even the busiest of persons to further a career and gain new qualifications.

Merits and Demerits

Some of the major benefits are outlined below:

  • No restrictions: E-learning facilitates learning without having to organise when and where everyone, who is interested in learning a course, can be present.
  • Interactive and fun: Designing a course to make it interactive and fun with the use of multimedia or gamification enhances engagement and the relative lifetime of the course.
  • Affordable: E-learning is cost-effective. For instance, while textbooks can become obsolete, the need to perpetually buy new editions at exorbitant prices does not exist in e-learning.

Some of the concerns that are needed to be taken care of:

  • Practical skills: It is considered tougher to pick up skills like building a wooden table, pottery, and car engineering from online resources as these require hands-on experience.
  • Isolation: Although e-learning enables a person to remotely access a classroom in his or her own time, learners may feel a sense of isolation. Tools such as video conferencing, social media and discussion forums can allow them to actively engage with professors or other students.
  • Health concerns: With the mandatory need of a computer or mobile devices, health-related issues like eyestrain, bad posture, and other physical problems may be troublesome. However, sending out proper guidelines beforehand to the learner like correct sitting posture, desk height, and recommendations for regular breaks can be done.
Building Yardstick LMS with Drupal

OpenSense Labs built Yardstick LMS, a learning management system, for Yardstick Educational Initiatives, which caters to students of various schools in Dubai.

Yardstick LMS Homepage

The architecture of the project involved a lot of custom development:

1. Yardstick Core

This is the core module of the Yardstick LMS, where the creation, updating and deletion of nodes takes place.

2. Yardstick Quiz

We built this custom module for the whole functionality of the quiz component. It generates a quiz, a quiz palette and, once the quiz is completed, a quiz report, subject to the report's visibility settings.


We could generate three kinds of reports: 

  • An individual-level quiz where one’s performance is evaluated
  • A sectional-level report where performance for each section is evaluated
  • Grade-level report where performance for all the sections is compared and evaluated.

For the quiz, we had different sub-components like questions, options, marks, the average time to answer, learning objective, skill level score, and concept. The same question could be used for different quizzes, thereby minimising data redundancy. Also, an image, video or text could be added to questions.


3. Yardstick Bulk User Import

This module was built to assist the administrators in creating users all at once by importing a CSV file. Also, there is an option to send invitation mail to all the users with login credentials.


4. Yardstick Custom Login

We provided a custom login feature where the same school login credentials could be used to log into the Yardstick system. That is, we provided an endpoint for verifying the login credentials and, upon success, users were logged in.

5. Yardstick Validation

This module handles validation across the site, whether it is related to access permissions or time-based checks.

6. Yardstick Challenge

It lets users submit tasks assigned to them through a text area and a file upload widget.

Yardstick LMS has an intricate structure

On the end user side, there is a seamless flow but as we go deeper, it becomes challenging. Yardstick LMS has an intricate structure.

We had two kinds of login:

  • Normal login using Yardstick credentials
  • And the other for school-specific login like the Delhi Public School (DPS) users.
Yardstick LMS custom login for DPS users

For DPS users, we used the same login form but a different functionality for validating credentials. DPS school gave us an endpoint where we sent a POST request with username and password. If the username and password were correct, then that endpoint returned the user information.

If user information was returned, we checked whether the username already exists on our Yardstick system. If it does not exist, we programmatically created a new user with the information received from the endpoint and started a user session. And if it does exist, we updated the password on our system. The flow is sketched below.
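The production code lives in a Drupal custom module; purely to illustrate the flow described above, here is a minimal Python sketch. The endpoint URL and response fields are hypothetical placeholders, not the actual DPS API.

import requests

DPS_ENDPOINT = "https://dps.example/api/verify"  # hypothetical endpoint

def dps_login(username, password, local_users):
    # Ask the school endpoint to verify the credentials
    resp = requests.post(DPS_ENDPOINT, data={"username": username, "password": password})
    if resp.status_code != 200:
        return None  # invalid credentials
    info = resp.json()  # assumed to carry the user's details
    if username not in local_users:
        # Create the user locally with the details returned by the endpoint
        local_users[username] = dict(info, password=password)
    else:
        # Keep the local password in sync with the school's
        local_users[username]["password"] = password
    return local_users[username]  # a session would be started at this point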

Yardstick LMS is designed to govern multiple schools at the same time

We designed Yardstick LMS in such a way that multiple schools can be governed at the same time. All the students of various schools will be learning the same content thereby building uniformity.

The core part of our system dwells in the modules. The module is a content type that can store numerous information like components, concept, description, objective, syllabus among others. 

Several different components can be added like Task, Quiz, Video task, Extension, Feedback, Inspiration, pdf lesson plan, Real life application, and Scientific principles.

Yardstick LMS Real life application component page

Schools could opt for different modules for different grades. When a school subscribed to a module, a clone of the master module was created and this school copy was visible only to that school. The school admin could modify the school version as per their needs and preferences, while the master module remained unchanged. While creating a subscription, the administrator had to provide a date from which the components became accessible to students. The school admin could set different dates for different components, and only components whose date had passed were accessible.

Flow Diagram of module subscription to school

Also, we provided an option to create a dynamic feedback form for the modules for analysis. Yardstick Admin had the option to design and create a feedback form as per their requirement and could assign it to a particular module. Different types of elements could be utilised for designing the form like rating, captcha, email, range slider, text field, checkboxes, radio buttons and so on.


Students and teachers need to submit their feedback for each of the modules. On the basis of this feedback, the Yardstick team tries to improve the content of the system.


Also, various roles were defined for users such as Yardstick Administrator, School Administrator, Teacher, and Student.

1. Yardstick Admin

Yardstick Admin can perform all the operations. He or she can create new users, grant permissions and revoke them as well.

2. School Admin

School Admins can handle all operations related to their own school. They manage the modules and their components and can import users for their school. All school reports and task submissions are visible to them.

3. Teachers

Teachers can view modules and components assigned to their classes, provide remarks to students for multiple components, and view all kinds of reports.

4. Students

They can attempt quizzes, submit tasks, view components and view their own reports.

What’s the future of e-learning?

According to a report on Research and Markets, the e-learning market is anticipated to generate revenue of $65.41 billion by 2023 with a growth rate of 7.07% during the forecast period.

The report goes on to state that with the advent of cloud infrastructure, peer-to-peer problem solving and open content creation, more business opportunities would pop up for service providers in the global e-learning market. The introduction of cloud-based learning and AR/VR mobile-based learning will be a major factor in driving the growth of e-learning.

The growth of the e-learning market is due to the learning process enhancements in the academic sector

According to Technavio, the growth of the market is due to the learning process enhancements in the academic sector.

Global self-paced e-learning market 2019-2023 | Source: Technavio

Following are major trends to look forward to:

  • Microlearning, which emphasises on the design of microlearning activities through micro-steps in digital media environments, will be on the rise.
  • Gamification, which is the use of game thinking and game mechanics in a non-game context to keep the users engrossed and help them solve more problems, will see increased adoption rates.
  • Personalised learning, which is the tailoring of pedagogy, curriculum and learning environments to meet the demands of learners, can be a driving force.
  • Automatic learning, like the one shown in the movie The Matrix where a person is strapped onto a high-tech chair and a series of martial arts training programs are downloaded into his brain, can be a possibility.
Conclusion

It’s a world which is replete with possibilities. As one of the most intelligent species to walk on this earth, we perpetually innovate with the way we want to lead a better lifestyle. We learn new things to gain more knowledge. And in the process, we find ways of improving our learning experience. E-learning is one such tech marvel that promises to be a force to reckon with. It is not a disrupting technology but something that is going to get bigger and bigger in the years to come.

As a content management framework, Drupal offers a magnificent platform to build a robust e-learning system. With years of experience in Drupal Development, OpenSense Labs can help in providing an amazing digital experience. 

Contact us at hello@opensenselabs.com to build an e-learning system using Drupal and transform the educational experience.

Categories: FLOSS Project Planets

Learn PyQt: What's the difference between PyQt5 & PySide2? What should you use, and how to migrate.

Planet Python - Fri, 2019-06-21 00:24

If you start building Python application with Qt5 you'll soon discover that there are in fact two packages which you can use to do this — PyQt5 and PySide2.

In this short guide I'll run through why exactly this is, whether you need to care (spoiler: you really don't), what the few differences are and how to work around them. By the end you should be comfortable re-using code examples from both PyQt5 and PySide2 tutorials to build your apps, regardless of which package you're using yourself.

Background

Why are there two packages?

PyQt has been developed by Phil Thompson of Riverbank Computing Ltd. for a very long time — supporting versions of Qt going back to 2.x. Back in 2009 Nokia, who owned the Qt toolkit at the time, wanted to have Python bindings for Qt available under the LGPL license (like Qt itself). Unable to come to agreement with Riverbank (who would lose money from this, so fair enough) they then released their own bindings as PySide (also, fair enough).

If you know why it's called PySide I would love to find out.

The two interfaces were comparable at first, but PySide development ultimately lagged behind PyQt. This was particularly noticeable following the release of Qt 5 — the Qt5 version of PyQt (PyQt5) was available from mid-2016, while the first stable release of PySide2 was 2 years later.

It is this delay which explains why many Qt 5 on Python examples use PyQt5 rather than PySide2 — it's not necessarily better, but it existed. However, the Qt project has recently adopted PySide as the official Qt for Python release, which should ensure its viability and increase its popularity going forward.

                                      PyQt5                      PySide2
Current stable version (2019-06-23)   5.12                       5.12
First stable release                  Apr 2016                   Jul 2018
Developed by                          Riverbank Computing Ltd.   Qt
License                               GPL or commercial          LGPL
Platforms                             Python 3                   Python 3 and Python 2.7
                                                                 (Linux and MacOS only)

Which should you use? Well, honestly, it doesn't really matter.

Both packages are wrapping the same library — Qt5 — and so have 99.9% identical APIs (see below for the few differences). Code that is written for one can often be used as-is with the other, simply changing the imports from PyQt5 to PySide2. Anything you learn for one library will be easily applied to a project using the other.

Also, no matter which one you choose to use, it's worth familiarising yourself with the other so you can make the best use of all available online resources — using PyQt5 tutorials to build your PySide2 applications, for example, and vice versa.

In this short chapter I'll run through the few notable differences between the two packages and explain how to write code which works seamlessly with both. After reading this you should be able to take any PyQt5 example online and convert it to work with PySide2.

Licensing

The key difference in the two versions — in fact the entire reason PySide2 exists — is licensing. PyQt5 is available under a GPL or commercial license, and PySide2 under a LGPL license.

If you are planning to release your software itself under the GPL, or you are developing software which will not be distributed, the GPL requirement of PyQt5 is unlikely to be an issue. However, if you plan to distribute your software commercially you will either need to purchase a commercial license from Riverbank for PyQt5 or use PySide2.

Qt itself is available under a Qt Commercial License, GPL 2.0, GPL 3.0 and LGPL 3.0 licenses.

Python versions
  • PyQt5 is Python 3 only
  • PySide2 is available for Python 3 and Python 2.7, but Python 2.7 builds are only available for 64-bit versions of MacOS and Linux; on Windows, only Python 3 is supported.
UI files

Both packages use slightly different approaches for loading .ui files exported from Qt Creator/Designer. PyQt5 provides the uic submodule which can be used to load UI files directly, to produce an object. This feels pretty Pythonic (if you ignore the camelCase).

import sys
from PyQt5 import QtWidgets, uic

app = QtWidgets.QApplication(sys.argv)
window = uic.loadUi("mainwindow.ui")
window.show()
app.exec()

The equivalent with PySide2 is one line longer, since you need to create a QUiLoader object first. Unfortunately the APIs of these two interfaces differ too (.load vs .loadUi) and take different parameters.

import sys
from PySide2 import QtCore, QtGui, QtWidgets
from PySide2.QtUiTools import QUiLoader

loader = QUiLoader()
app = QtWidgets.QApplication(sys.argv)
window = loader.load("mainwindow.ui", None)
window.show()
app.exec_()

To load a UI onto an object in PyQt5, for example in your QMainWindow.__init__, you can call uic.loadUi passing in self (the target widget) as the second parameter.

import sys
from PyQt5 import QtCore, QtGui, QtWidgets
from PyQt5 import uic

class MainWindow(QtWidgets.QMainWindow):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        uic.loadUi("mainwindow.ui", self)

app = QtWidgets.QApplication(sys.argv)
window = MainWindow()
window.show()
app.exec_()

The PySide2 loader does not support this — the second parameter to .load is the parent widget of the widget you're creating. This prevents you adding custom code to the __init__ block of the widget, but you can work around this with a separate function.

import sys
from PySide2 import QtWidgets
from PySide2.QtUiTools import QUiLoader

loader = QUiLoader()

def mainwindow_setup(w):
    w.setWindowTitle("MainWindow Title")

app = QtWidgets.QApplication(sys.argv)
window = loader.load("mainwindow.ui", None)
mainwindow_setup(window)
window.show()
app.exec()

Converting UI files to Python

Both libraries provide identical scripts to generate Python importable modules from Qt Designer .ui files. For PyQt5 the script is named pyuic5 —

pyuic5 mainwindow.ui -o MainWindow.py

You can then import the Ui_MainWindow object, subclass using multiple inheritance from the base class you're using (e.g. QMainWindow) and then call self.setupUi(self) to set the UI up.

import sys
from PyQt5 import QtWidgets
from MainWindow import Ui_MainWindow

class MainWindow(QtWidgets.QMainWindow, Ui_MainWindow):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.setupUi(self)

app = QtWidgets.QApplication(sys.argv)
window = MainWindow()
window.show()
app.exec()

For PySide2 it is named pyside2-uic —

pyside2-uic mainwindow.ui -o MainWindow.py

The subsequent setup is identical.

import sys
from PySide2 import QtWidgets
from MainWindow import Ui_MainWindow

class MainWindow(QtWidgets.QMainWindow, Ui_MainWindow):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.setupUi(self)

app = QtWidgets.QApplication(sys.argv)
window = MainWindow()
window.show()
app.exec_()

For more information on using Qt Designer with either PyQt5 or PySide2 see the Qt Creator tutorial.

exec() or exec_()

The .exec() method is used in Qt to start the event loop of your QApplication or dialog boxes. In Python 2.7 exec was a keyword, meaning it could not be used for variable, function or method names. The solution used in both PyQt4 and PySide was to rename uses of .exec() to .exec_() to avoid this conflict.

Python 3 removed the exec keyword, freeing the name up to be used. As PyQt5 targets only Python 3 it could remove the workaround, and .exec() calls are named just as in Qt itself. However, the .exec_() names are maintained for backwards compatibility.

PySide2 is available on both Python 3 and Python 2.7 and so still uses .exec_(). The Python 2.7 builds are, however, only available for 64-bit Linux and MacOS.

If you're targeting both PySide2 and PyQt5 use .exec_()

Slots and Signals

Defining custom slots and signals uses slightly different syntax between the two libraries. PySide2 provides this interface under the names Signal and Slot, while PyQt5 provides these as pyqtSignal and pyqtSlot respectively. Their behaviour is identical for defining both slots and signals.

The following PyQt5 and PySide2 examples are identical —

my_custom_signal = pyqtSignal()    # PyQt5
my_custom_signal = Signal()        # PySide2

my_other_signal = pyqtSignal(int)  # PyQt5
my_other_signal = Signal(int)      # PySide2

Or for a slot —

@pyqtSlot
def my_custom_slot():
    pass

@Slot
def my_custom_slot():
    pass

If you want to ensure consistency across PyQt5 and PySide2 you can use the following import pattern for PyQt5 to use the Signal and @Slot style there too.

from PyQt5.QtCore import pyqtSignal as Signal, pyqtSlot as Slot

You could of course do the reverse from PySide2.QtCore import Signal as pyqtSignal, Slot as pyqtSlot although that's a bit confusing.

Supporting both in libraries

You don't need to worry about this if you're writing a standalone app, just use whichever API you prefer.

If you're writing a library, widget or other tool you want to be compatible with both PyQt5 and PySide2 you can do so easily by adding both sets of imports.

import sys

if 'PyQt5' in sys.modules:
    # PyQt5
    from PyQt5 import QtGui, QtWidgets, QtCore
    from PyQt5.QtCore import pyqtSignal as Signal, pyqtSlot as Slot
else:
    # PySide2
    from PySide2 import QtGui, QtWidgets, QtCore
    from PySide2.QtCore import Signal, Slot

This is the approach used in our custom widgets library, where we support PyQt5 and PySide2 with a single library import. The only caveat is that you must import PyQt5 before importing this library (on the line above or earlier), so that it is already in sys.modules.

An alternative would be to use an environment variable to switch between them — see QtPy later.
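For instance, a minimal sketch of such a switch could look like the following; the QT_LIB variable name is just an example chosen here, not a standard.

import os

# Pick the binding from an environment variable; defaults to PyQt5.
# QT_LIB is an arbitrary name chosen for this sketch.
if os.environ.get('QT_LIB', 'PyQt5') == 'PySide2':
    from PySide2 import QtGui, QtWidgets, QtCore
    from PySide2.QtCore import Signal, Slot
else:
    from PyQt5 import QtGui, QtWidgets, QtCore
    from PyQt5.QtCore import pyqtSignal as Signal, pyqtSlot as Slot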

If you're doing this in multiple files it can get a bit cumbersome. A nice solution to this is to move the import logic to its own file, e.g. named qt.py in your project root. This module imports the Qt modules (QtCore, QtGui, QtWidgets, etc.) from one of the two libraries, and then you import into your application from there.

The contents of the qt.py are the same as we used earlier —

import sys

if 'PyQt5' in sys.modules:
    # PyQt5
    from PyQt5 import QtGui, QtWidgets, QtCore
    from PyQt5.QtCore import pyqtSignal as Signal, pyqtSlot as Slot
else:
    # PySide2
    from PySide2 import QtGui, QtWidgets, QtCore
    from PySide2.QtCore import Signal, Slot

You must remember to add any other PyQt5 modules you use (browser, multimedia, etc.) in both branches of the if block. You can then import Qt5 into your own application with —

from .qt import QtGui, QtWidgets, QtCore

…and it will work seamlessly across either library.

QtPy

If you need to target more than just Qt5 support (e.g. including PyQt4 and PySide v1) take a look at QtPy. This provides a standardised PySide2-like API for PyQt4, PySide, PyQt5 and PySide2. Using QtPy you can control which API to load from your application using the QT_API environment variable e.g.

import os
os.environ['QT_API'] = 'pyside2'

from qtpy import QtGui, QtWidgets, QtCore  # imports PySide2

That's really it

There's not much more to say — the two are really very similar. With the above tips you should feel comfortable taking code examples or documentation from PyQt5 and using it to write an app with PySide2. If you do stumble across any PyQt5 or PySide2 examples which you can't easily convert, drop a note in the comments and I'll update this page with advice.

Categories: FLOSS Project Planets

The Titler Tool – Onward with the 3rd week

Planet KDE - Fri, 2019-06-21 00:00

Hi! It’s been 3 weeks (more than that actually, couldn’t update yesterday due to some network glitches I was facing here) and the progress so far has been good – let’s get into it! In last week’s blog, I had reasoned why the rendering part is being developed as a library rather than directly starting the work with the framework (MLT), and one advantage was that the testing process becomes a whole lot easier. And that’s exactly what I have been doing the last week – writing the test module for the library, i.e. writing unit tests – and it has been quite interesting, as it gave me a perspective on how the code can break at points.

The crucial concept of unit tests is to be able to make sure that there is no regression – meaning your code will do some particular things that it is supposed to do when we know it works, and at whatever point in the future, it will for sure keep doing those things – Nice, eh? Unit testing, as the name suggests, is testing of the units – we take each functional unit of the code (or simply a function/method) and we test certain characteristics, making sure that these conditions are fulfilled. An example I can pick from one of my unit tests is the case of the method QmlRenderer::initializeRenderParams(…)

m_renderer->initialiseRenderParams(QDir::cleanPath(rootPath.currentPath() + "/../sampledata/test.qml"), "test_output", QDir::cleanPath(rootPath.currentPath() + "/../sampledata/output_lib/"), "jpg", QSize(1280,720), 1, 1000, 25);
QVERIFY2(m_renderer->getStatus() != m_renderer->Status::NotRunning, "STATUS ERROR:Not supposed to be running");
QVERIFY2(m_renderer->getActualFrames()!=0, "VALUE ERROR: Frames not supposed to be zero");
QVERIFY2(m_renderer->getSceneGraphStatus()!=false, "SCENE GRAPH ERROR: Scene graph not initialised");
QVERIFY2(m_renderer->getAnimationDriverStatus()==false, "ANIMATION DRIVER ERROR: Driver not supposed to be running");
QVERIFY2(m_renderer->getfboStatus()==true, "FRAME BUFFER OBJECT ERROR: FBO not bound");

What this method does is quite straightforward, as the name might suggest – it initialises the parameters (like FPS, duration, etc.) that we need for rendering, and what I do is verify that these parameters actually got initialised. I do this for each of the methods, running them in succession. The rendering flow goes like this –

initializeRenderParams(…) -> renderQml()  ->  renderEntireQml() or renderSingleFrame()

There are also integration tests which verify the QML content that is actually rendered. How do I do this?
By keeping a directory of correctly rendered reference frames and comparing them with what the library produces at various points in time.

That means we now have a complete library which can do all the rendering that the MLT producer (which I intend to write next) will need.

Next up: MLT Producer.

You can find the code here

Categories: FLOSS Project Planets

Montreal Python User Group: Montréal-Python 75: Funky Urgency

Planet Python - Fri, 2019-06-21 00:00

The summer has started and it's time for our last edition before the seasonal break. We are inviting you, for the occasion, to our friends at Anomaly, a co-working space in the Mile-End.

As usual, it's gonna be an opportunity to discover how people are pushing our favourite language farther, to understand how to identify the bad habits of most programmers and to have fun with data!

Join us on Wednesday, there's gonna be pizza and we're probably gonna continue the evening to share more about our latest discoveries.

Speakers Josh Reed - Put your Data in a Box

The talk would cover the very basics of Algebraic Data Types (ADTs) and available facilities in python for expressing things like this (namedtuple, attrs, dataclasses). The talk would focus on the advantage of using explicitly structured data over ad-hoc structures like dicts and tuples once programs moved past exploratory phases of development.

Greg Ward - Operator Overloading: You're Doing It Wrong

Some people hate operator overloading so much that they design whole programming languages (Java, Go) to rebel against the idea. And some language communities (C++, Python) are perfectly happy to have operator overloading. But we've all seen examples that make us wonder what the original programmer was thinking. I have discovered some key design principles that will help you avoid such traps.

David Taylor - Dataiku and pytabby demo

I had an idea to give a demo of Dataiku Data Science Studio (http://www.dataiku.com) which is made in Python and uses Python to bridge the gap for organizations that want to do quick-win machine learning without having to hire Ph.D.s. I was the Product Owner of Dataiku at my last job, where we used it to give actuaries who were more comfortable in SAS experience in Python and ML.

Where

Anomaly
5555 de Gaspé, Suite 118,
Montreal, Quebec H2T 2A3 https://goo.gl/maps/rqqAT7ez5dEQ19w27

When

Wednesday, June 26th at 6pm

Schedule
  • 6pm: door opens
  • 6:30pm: talks
  • 8pm: Waverly
Categories: FLOSS Project Planets

ListenData: Case Study : Sentiment analysis using Python

Planet Python - Thu, 2019-06-20 22:07
In this article, we will walk you through an application of topic modelling and sentiment analysis to solve a real-world business problem. This approach requires a one-time effort of building a robust taxonomy, which can then be regularly updated as new topics emerge. It is widely used in topic mapping tools. Please note that this is not a replacement for topic modelling methodologies such as Latent Dirichlet Allocation (LDA); it goes beyond them.
Text Mining Case Study using Python
Case Study : Topic Modeling and Sentiment Analysis
Suppose you are head of the analytics team at a leading hotel chain, “Tourist Hotel”. Each day, you receive hundreds of reviews of your hotel on the company’s website and multiple other social media pages. The business faces a challenge of scale in analysing such data and identifying areas of improvement. You use a taxonomy-based approach to identify topics and then use the built-in functionality of the Python NLTK package to attribute sentiment to the comments. This will help you identify what customers like or dislike about your hotel.

Data Structure
The customer review data consists of a serial number, an arbitrary identifier to identify each review uniquely and a text field that has the customer review.
Example : Sentiment Analysis
Steps to topic mapping and sentiment analysis

1. Identify Topics and Sub Topics
2. Build Taxonomy
3. Map customer reviews to topics
4. Map customer reviews to sentiment

Step 1 : Identifying Topics

The first step is to identify the different topics in the reviews. You can use simple approaches such as Term Frequency–Inverse Document Frequency (TF-IDF) or more popular methodologies such as LDA to identify the topics in the reviews. In addition, it is a good practice to consult a subject matter expert in that domain to identify the common topics. For example, the topics in the “Tourist Hotel” example could be “Room booking”, “Room Price”, “Room Cleanliness”, “Staff Courtesy”, “Staff Availability” etc. A minimal sketch of surfacing candidate keywords with TF-IDF is given below.
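The snippet below is a minimal sketch (not part of the article's downloadable code) of how TF-IDF can be used to surface candidate keywords; as in the rest of the article, the review text is assumed to sit in the second column of the reviews file.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("D:/customer_reviews.csv")
texts = df.iloc[:, 1].astype(str)

vectorizer = TfidfVectorizer(stop_words="english", max_features=500)
tfidf = vectorizer.fit_transform(texts)

# Terms with the highest average TF-IDF score are good starting points
# when drafting topics and keywords for the taxonomy
scores = tfidf.mean(axis=0).A1
top_terms = sorted(zip(vectorizer.get_feature_names(), scores), key=lambda t: -t[1])[:20]
for term, score in top_terms:
    print(term, round(score, 4))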

Step 2 : Build Taxonomy

I. Build Topic Hierarchy
Based on the topics from Step 1, build a taxonomy. A taxonomy can be considered as a network of topics, sub topics and key words.

Topic Hierarchy

II. Build Keywords
The taxonomy is built in a CSV file format. There are 3 levels of key words for each sub topic, namely Primary key words, Additional key words and Exclude key words. The keywords for the topics need to be manually identified and added to the taxonomy file. The TF-IDF, bigram frequency and LDA methodologies can help you in identifying the right set of keywords. Although there is no single best way of building key words, below is a suggested approach.

i. Primary key words are the key words that are mostly specific to the topic. These key words need to be mutually exclusive across different topics as far as possible.

ii. Additional key words are specific to the sub topic. These key words need not be mutually exclusive between topics, but it is advised to maintain exclusivity between sub topics under the same topic. To explain further, let us say there is a sub topic “Price” under both the topics “Room” and “Food”; then the additional key words will have an overlap. This will not create any issue as the primary key words are mutually exclusive.

iii. Exclude key words are used relatively less often than the other two types. If two sub topics have some overlap of additional words, or if, for example, the sub topic “booking” incorrectly maps comments about taxi bookings as room bookings, such key words can be used as exclude words to solve the problem.

Snapshot of sample taxonomy:
Sample Taxonomy
Note: while building the key word list, you can put an “*” at the end as it acts as a wildcard character. For example, all the different inflections of “clean” such as “cleaned”, “cleanly” and “cleanliness” can be handled by the single keyword “clean*”. If you need to add a phrase or any keyword with a special character in it, you can wrap it in quotes. For example, “online booking”, “Wi-Fi” etc. need to be in double quotes. A hypothetical illustration of such taxonomy rows is given below.
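Purely as an illustration (these rows are made up for this walkthrough, not taken from the actual taxonomy file), a couple of rows of such a CSV could look like the following. The column names match the ones the mapping code below expects (Subtopic, PrimaryKeywords, AdditionalKeywords, ExcludeKeywords).

Subtopic,PrimaryKeywords,AdditionalKeywords,ExcludeKeywords
Room Cleanliness,"clean*, dust*, smell*","housekeep*, towel*",
Room Booking,"book*, reserv*, ""online booking""","room*, suite*","taxi, cab"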

Benefits of using taxonomic approach
Topic modelling approaches identify topics based on the keywords that are present in the content. Novel keywords that are related to the topics but only come up in the future are not identified. There could also be use cases where businesses want to track certain topics that topic modelling approaches would not always surface as topics.
Step 3 : Map customer reviews to topic

Each customer comment is mapped to one or more sub topics. Some of the comments may not be mapped to any sub topic. Such instances need to be manually inspected to check if we missed any topics in the taxonomy so that it can be updated. Generally, about 90% of the comments have at least one topic. The rest of the comments could be vague. For example: “it was good experience” does not tell us anything specific and it is fine to leave it unmapped.

Snapshot of how the topics are mapped:

Topic Mapping
Below is the Python code that helps in mapping reviews to categories. Firstly, import all the libraries needed for this task. Install these libraries if needed.

import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
Download Datafiles
Customer Review
Taxonomy

Download Python Code
If you copy-paste the code from the article, some of the lines of code might not work, as Python follows indentation very strictly, so download the Python code from the link below. The code is built in Python 2.7.

Python - Sentiment Analysis
Import reviews data
df = pd.read_csv("D:/customer_reviews.csv");Import taxonomy
df_tx = pd.read_csv("D:/ taxonomy.csv");
Build functions for handling the various repetitive tasks during the mapping exercise. The first function identifies taxonomy words ending with (*) and treats them as wildcards. It takes the keywords and the comment words as input and uses regular expressions to flag whether any wildcard keyword matches.
def asterix_handler(asterixw, lookupw):
    mtch = "F"
    for word in asterixw:
        for lword in lookupw:
            if(word[-1:]=="*"):
                if(bool(re.search("^"+ word[:-1],lword))==True):
                    mtch = "T"
                    break
    return(mtch)
This function removes all punctuation, which is helpful for data cleaning. You can edit the punctuations string inside the function for your own custom punctuation removal.

def remov_punct(withpunct):
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    without_punct = ""
    char = 'nan'
    for char in withpunct:
        if char not in punctuations:
            without_punct = without_punct + char
    return(without_punct)

Function to remove just the quotes(""). This is different from the above as this only handles double quotes. Recall that we wrap phrases or key words with special characters in double quotes.
def remov_quote(withquote):
    quote = '"'
    without_quote = ""
    char = 'nan'
    for char in withquote:
        if char not in quote:
            without_quote = without_quote + char
    return(without_quote)

Split each document by sentences and append one below the other for sentence level topic mapping.
sentence_data = pd.DataFrame(columns=['slno','text'])
for d in range(len(df)):
    doc = (df.iloc[d,1].split('.'))
    for s in ((doc)):
        temp = {'slno': [df['slno'][d]], 'text': [s]}
        sentence_data = pd.concat([sentence_data,pd.DataFrame(temp)])
        temp = ""
Drop empty text rows if any and export data
sentence_data['text'].replace('',np.nan,inplace=True);
sentence_data.dropna(subset=['text'], inplace=True);


data = sentence_data
cat2list = list(set(df_tx['Subtopic']))
data['Category'] = 0
mapped_data = pd.DataFrame(columns = ['slno','text','Category']);
temp=pd.DataFrame()
for k in range(len(data)):
    comment = remov_punct(data.iloc[k,1])
    data_words = [str(x.strip()).lower() for x in str(comment).split()]
    data_words = filter(None, data_words)
    output = []

    for l in range(len(df_tx)):
        key_flag = False
        and_flag = False
        not_flag = False
        if (str(df_tx['PrimaryKeywords'][l])!='nan'):
            kw_clean = (remov_quote(df_tx['PrimaryKeywords'][l]))
        if (str(df_tx['AdditionalKeywords'][l])!='nan'):
            aw_clean = (remov_quote(df_tx['AdditionalKeywords'][l]))
        else:
            aw_clean = df_tx['AdditionalKeywords'][l]
        if (str(df_tx['ExcludeKeywords'][l])!='nan'):
            nw_clean = remov_quote(df_tx['ExcludeKeywords'][l])
        else:
            nw_clean = df_tx['ExcludeKeywords'][l]
        Key_words = 'nan'
        and_words = 'nan'
        and_words2 = 'nan'
        not_words = 'nan'
        not_words2 = 'nan'

        if(str(kw_clean)!='nan'):
            key_words = [str(x.strip()).lower() for x in kw_clean.split(',')]
            key_words2 = set(w.lower() for w in key_words)

        if(str(aw_clean)!='nan'):
            and_words = [str(x.strip()).lower() for x in aw_clean.split(',')]
            and_words2 = set(w.lower() for w in and_words)

        if(str(nw_clean)!= 'nan'):
            not_words = [str(x.strip()).lower() for x in nw_clean.split(',')]
            not_words2 = set(w.lower() for w in not_words)

        if(str(kw_clean) == 'nan'):
            key_flag = False
        else:
            if set(data_words) & key_words2:
                key_flag = True
            else:
                if(asterix_handler(key_words2, data_words)=='T'):
                    key_flag = True

        if(str(aw_clean)=='nan'):
            and_flag = True
        else:
            if set(data_words) & and_words2:
                and_flag = True
            else:
                if(asterix_handler(and_words2,data_words)=='T'):
                    and_flag = True
        if(str(nw_clean) == 'nan'):
            not_flag = False
        else:
            if set(data_words) & not_words2:
                not_flag = True
            else:
                if(asterix_handler(not_words2, data_words)=='T'):
                    not_flag = True
        if(key_flag == True and and_flag == True and not_flag == False):
            output.append(str(df_tx['Subtopic'][l]))
            temp = {'slno': [data.iloc[k,0]], 'text': [data.iloc[k,1]], 'Category': [df_tx['Subtopic'][l]]}
            mapped_data = pd.concat([mapped_data,pd.DataFrame(temp)])
    #data['Category'][k] = ','.join(output)
#output mapped data
mapped_data.to_csv("D:/mapped_data.csv",index = False)

Step 4: Map customer reviews to sentiment
#read category mapped data for sentiment mapping
catdata = pd.read_csv("D:/mapped_data.csv")#Build a function to leverage the built-in NLTK functionality of identifying sentiment. The output 1 means positive, 0 means neutral and -1 means negative. You can choose your own set of thresholds for positive, neutral and negative sentiment.

def findpolar(test_data):
    sia = SentimentIntensityAnalyzer()
    polarity = sia.polarity_scores(test_data)["compound"]
    if(polarity >= 0.1):
        foundpolar = 1
    if(polarity <= -0.1):
        foundpolar = -1
    if(polarity >= -0.1 and polarity <= 0.1):
        foundpolar = 0

    return(foundpolar)
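One step that has to happen before exporting the file below is actually applying this function to each mapped comment. A minimal sketch (the column name 'Sentiment' is an arbitrary choice made here; 'text' is the comment column created earlier):

# Apply the polarity function to every mapped comment before exporting
catdata['Sentiment'] = catdata['text'].apply(findpolar)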

Output the sentiment mapped data
catdata.to_csv("D:/sentiment_mapped_data.csv",index = False)Output : Sentiment Analysis
Additional Reading
Polarity Scoring Explained: 

NLTK offers Valence Aware Dictionary for sEntiment Reasoning(VADER) model that helps in identifying both the direction (polarity) as well as the magnitude(intensity) of the text. Below is the high-level explanation of the methodology.

VADER is a combination of lexical features and rules to identify sentiment and intensity; hence it does not need any training data. To explain further, if we take the example sentence “the food is good”, it is easy to identify that it is positive in sentiment. VADER goes a step further and identifies intensity based on rule-based cues such as punctuation, capitalised words and degree modifiers.

The polarity scores for the different variations of similar sentences are as follows:

Polarity Score
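If you would like to reproduce such scores yourself, a quick sketch is given below; the exact numbers depend on your NLTK/vader_lexicon version, so treat the output as indicative only.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

# The same sentence with increasing emphasis; the compound score should rise
for sentence in ["the food is good",
                 "the food is good!",
                 "the food is GOOD!",
                 "the food is extremely GOOD!!"]:
    print(sentence, "->", sia.polarity_scores(sentence)["compound"])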
Use cases where training sentiment models is suggested over Sentiment Intensity Analyzer:
Although VADER works well across multiple domains, there could be some domains where it is preferable to build one's own sentiment training models. Below are two examples of such use cases.
  1. Customer reviews on alcoholic beverages: It is common to observe people using otherwise negative sentiment words to describe a positive experience. For example, the sentence “this sh*t is fu**ing good” means that this drink is good, but the VADER approach gives it a “-10”, suggesting negative sentiment.

  2. Patient reviews regarding hospital treatment: A patient’s description of their problem is a neutral sentiment, but the VADER approach considers it negative. For example, the sentence “I had an unbearable back pain and your medication cured me in no time” is given “-0.67”, suggesting negative sentiment.
About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 8 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains.

Let's Get Connected: LinkedIn

Categories: FLOSS Project Planets

ListenData: Linear Regression in Python

Planet Python - Thu, 2019-06-20 22:07
Linear Regression is a supervised statistical technique where we try to estimate the dependent variable with a given set of independent variables. We assume the relationship to be linear and our dependent variable must be continuous in nature.
Python : Linear Regression

In the following diagram we can see that as horsepower increases, mileage decreases; thus we can think of fitting a linear regression. The red line is the fitted regression line and the points denote the actual observations.

The vertical distances between the points and the fitted line (line of best fit) are called errors. The main idea is to fit this line of regression by minimizing the sum of squares of these errors. This is also known as the principle of least squares.

Examples:
  • Estimating the price (Y) of a house on the basis of its Area (X1), Number of bedrooms (X2), proximity to market (X3) etc. 
  • Estimating the mileage of a car (Y) on the basis of its displacement (X1), horsepower(X2), number of cylinders(X3), whether it is automatic or manual (X4) etc. 
  • To find the treatment cost or to predict the treatment cost on the basis of factors like age, weight, past medical history, or even if there are blood reports, we can use the information from the blood report.
Simple Linear Regression Model: In this we try to predict the value of dependent variable (Y) with only one regressor or independent variable(X).

Multiple Linear Regression Model: Here we try to predict the value of dependent variable (Y) with more than one regressor or independent variables.

The linear regression model:
y = β0 + β1X1 + β2X2 + … + βkXk + ε

Here 'y' is the dependent variable to be estimated, X1…Xk are the independent variables and ε is the error term.
Assumptions of linear regression:
  • There must be a linear relationship between the dependent and independent variables.
  • Sample observations are independent.
  • Error terms are normally distributed with mean 0. 
  • No multicollinearity -  When the independent variables in my model are highly linearly related then such a situation is called multicollinearity.
  • Error terms are identically and independently distributed. (Independence means absence of autocorrelation).
  • Error terms have constant variance i.e. there is no heteroscedasticity.
  • No outliers are present in the data.


Important Model Performance Metrics
Coefficient of Determination (R square)
It suggests the proportion of variation in Y which can be explained by the independent variables. Mathematically, it is the ratio of the explained variation to the total variation, i.e.

R² = 1 − RSS/TSS

where RSS is the residual sum of squares and TSS is the total sum of squares.

If our fit is perfect then R² = 1, while R² = 0 indicates a poor fit. Thus it lies between 0 and 1.

If the value of R² is 0.912 then this suggests that 91.2% of the variation in Y can be explained with the help of the given explanatory variables in that model. In other words, it is the proportion of variation in the dependent variable that is explained by the independent variables.

R square alone is not such a good measure:
On addition of a new variable the error is sure to decrease, thus R square always increases whenever a new variable is added to our model. This may not reflect the importance of a variable. For example, in a model determining the price of a house, suppose we had the variables GDP, inflation rate and area. If we add a new, irrelevant variable such as the number of plane crashes, R square will still increase.
Adjusted R square:

Adjusted R square is given by:

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)

where n is the number of observations and k is the no. of regressors or predictors.
Hence adjusted R square will always be less than or equal to R square.
On addition of a variable, R square in the numerator increases and 'k' in the denominator increases by 1. If the variable is actually useful, R square will increase by a large amount, which compensates for the increase in 'k'. On the other hand, if a variable is irrelevant, then on its addition R square will not increase much and hence adjusted R square will eventually decrease.

Thus, as a general thumb rule, if adjusted R square increases when a new variable is added to the model, the variable should remain in the model. If the adjusted R square decreases when the new variable is added, then the variable should not remain in the model.

Why error terms should be normally distributed?
For parameter estimation (i.e. estimating the βi’s) we don't need that assumption. But if the errors are not normally distributed, some of the hypothesis tests which we will be doing as part of diagnostics may not be valid. For example, to check whether a Beta (regression coefficient) is significant or not, I'll do a t-test. If my error is not normally distributed, then the statistic I derive may not follow a t-distribution, so my diagnostic or hypothesis test is not valid. Similarly, the F-test for linear regression, which checks whether any of the independent variables in a multiple linear regression model are significant, will not be viable.

Why is expectation of error always zero?

The error term is the deviation between observed points and the fitted line. The observed points will be above and below the fitted line, so if I took the average of all the deviations, it should be 0 or near 0. The zero conditional mean assumption says that there are both negative and positive errors which cancel out on average. This helps us estimate the dependent variable precisely.

Why is multicollinearity a problem?

If my Xi’s are highly correlated then |X’X| will be close to 0 and hence the inverse of (X’X) will not exist or will be indefinitely large. Mathematically, Var(β̂) = σ²(X’X)⁻¹, which will be indefinitely large in the presence of multicollinearity. Long story short, multicollinearity inflates the estimated standard errors of the regression coefficients, which makes some variables statistically insignificant when they should be significant.
How can you detect multicollinearity?

1. Bunch Map Analysis: By plotting scatter plots between the various Xi's we can get a visual description of how the variables are related.
2. Correlation Method: By calculating the correlation coefficients between the variables we can get to know about the extent of multicollinearity in the data.
3.  VIF (Variance Inflation Factor) Method: Firstly we fit a model with all the variables and then calculate the variance inflation factor (VIF) for each variable. VIF measures how much the variance of an estimated regression coefficient increases if your predictors are correlated. The higher the value of VIF for the ith regressor, the more highly it is correlated with the other variables.

So what is the Variance Inflation Factor?

The variance inflation factor (VIF) for an explanatory variable is given by 1/(1 - R²). Here, we take that particular X as the response variable and all the other explanatory variables as independent variables. So, we run a regression of that explanatory variable on the remaining explanatory variables, as sketched below.
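Purely as an illustration of this definition (the article itself uses the built-in variance_inflation_factor from statsmodels later on), a minimal sketch of the auxiliary-regression computation could look like this, assuming x is a pandas DataFrame of predictors:

import statsmodels.api as sm

def vif_by_definition(x, column):
    # VIF of one column = 1 / (1 - R^2) from regressing it on the other columns
    y_aux = x[column]
    X_aux = sm.add_constant(x.drop(columns=[column]))
    r2 = sm.OLS(y_aux, X_aux).fit().rsquared
    return 1.0 / (1.0 - r2)

# Example (hypothetical call): vif_by_definition(x_train, "Cement")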
Detecting heteroscedasticity!
  1. Graphical Method: Firstly do the regression analysis and then plot the error terms against the predicted values (Ŷi). If there is a definite pattern (like linear, quadratic or funnel shaped) in the scatter plot, then heteroscedasticity is present. (A plotting sketch is given after this list.)
  2. Goldfeld Quandt (GQ) Test: It assumes that the heteroscedastic variance σi² is positively related to one of the explanatory variables, and errors are assumed to be normal. Thus if heteroscedasticity is present, the variance would be high for large values of X.
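For the graphical method above, a minimal sketch (assuming a fitted statsmodels result lm2, as built later in this article):

import matplotlib.pyplot as plt

# Residuals vs fitted values: a funnel or other clear pattern suggests heteroscedasticity
plt.scatter(lm2.fittedvalues, lm2.resid, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()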

Steps for GQ test:
  1. Order/ rank (ascending) the observations according to the value of Xi beginning with the lowest X value.
  2. Omit ‘c’ central observations and divide the remaining (n-c) observations into 2 groups of (n-c)/2 observations each.
  3. Fit separate OLS regression to both the groups and obtain residual sum of squares (RSS1 and RSS2) for both the groups.
  4. Obtain F = RSS2/ RSS1 
It follows F with ((n-c)/2 - k) d.f. in both the numerator and the denominator.
Where k is the no. of parameters to be estimated including the intercept.
If the errors are homoscedastic then the two variances RSS2 and RSS1 turn out to be equal, i.e. F will tend to 1.

Dataset used:

We have 1030 observations on 9 variables. We try to estimate the concrete compressive strength (CMS) using:

  1. Cement - kg in a m3 mixture
  2. Blast Furnace Slag - kg in a m3 mixture
  3. Fly Ash - kg in a m3 mixture
  4. Water - kg in a m3 mixture
  5. Superplasticizer - kg in a m3 mixture
  6. Coarse Aggregate - kg in a m3 mixture
  7. Fine Aggregate - kg in a m3 mixture
  8. Age - Day (1-365)

Dataset - Download Data 

Importing the libraries:
Numpy, pandas and matplotlib.pyplot are imported with aliases np, pd and plt respectively.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Loading the data

We load our data using pd.read_csv( )

data = pd.read_csv("Concrete_Data.csv")

Now the data is divided into independent (x) and dependent (y) variables.

x = data.iloc[:,0:8]
y = data.iloc[:,8:]
Splitting the data into training and test sets
Using sklearn we split 80% of our data into the training set and the rest into the test set. Setting random_state will give the same training and test sets every time the code is run.

from sklearn.model_selection import train_test_split  # sklearn.cross_validation in very old versions
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 100) 
Running linear regression using sklearn
Using sklearn linear regression can be carried out using LinearRegression( ) class. sklearn automatically adds an intercept term to our model.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm = lm.fit(x_train,y_train)   #lm.fit(input,output)

The coefficients are given by:

lm.coef_

array([[ 0.12415357, 0.10366839, 0.093371 , -0.13429401, 0.28804259,
         0.02065756, 0.02563037, 0.11461733]])
To store coefficients in a data frame along with their respective independent variables -
coefficients = pd.concat([pd.DataFrame(x_train.columns),pd.DataFrame(np.transpose(lm.coef_))], axis = 1)

0 Cement 0.124154
1 Blast 0.103668
2 Fly Ash 0.093371
3 Water -0.134294
4 Superplasticizer 0.288043
5 CA 0.020658
6 FA 0.025630
7 Age 0.114617
The intercept is:
lm.intercept_

array([-34.273527])

To predict the values of y on the test set we use lm.predict( )

y_pred = lm.predict(x_test)

Errors are the difference between the observed and predicted values.

y_error = y_test - y_pred

R square can be obtained using sklearn.metrics:

from sklearn.metrics import r2_score
r2_score(y_test,y_pred)

0.62252008774048395

Running linear regression using statsmodels:
It is to be noted that statsmodels does not add an intercept term automatically, thus we need to add an intercept to our model.

import statsmodels.api as sma
X_train = sma.add_constant(x_train) ## let's add an intercept (beta_0) to our model
X_test = sma.add_constant(x_test)

Linear regression can be run by using sm.OLS:

import statsmodels.formula.api as sm
lm2 = sm.OLS(y_train,X_train).fit()

The summary of our model can be obtained via:

lm2.summary()

"""
OLS Regression Results
==============================================================================
Dep. Variable: CMS R-squared: 0.613
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 161.0
Date: Wed, 03 Jan 2018 Prob (F-statistic): 4.37e-162
Time: 21:29:10 Log-Likelihood: -3090.4
No. Observations: 824 AIC: 6199.
Df Residuals: 815 BIC: 6241.
Df Model: 8
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
const -34.2735 29.931 -1.145 0.253 -93.025 24.478
Cement 0.1242 0.010 13.054 0.000 0.105 0.143
Blast 0.1037 0.011 9.229 0.000 0.082 0.126
Fly Ash 0.0934 0.014 6.687 0.000 0.066 0.121
Water -0.1343 0.046 -2.947 0.003 -0.224 -0.045
Superplasticizer 0.2880 0.102 2.810 0.005 0.087 0.489
CA 0.0207 0.011 1.966 0.050 2.79e-05 0.041
FA 0.0256 0.012 2.131 0.033 0.002 0.049
Age 0.1146 0.006 19.064 0.000 0.103 0.126
==============================================================================
Omnibus: 3.757 Durbin-Watson: 2.033
Prob(Omnibus): 0.153 Jarque-Bera (JB): 3.762
Skew: -0.165 Prob(JB): 0.152
Kurtosis: 2.974 Cond. No. 1.07e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.07e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
The predicted values for test set are given by:
y_pred2 = lm2.predict(X_test)

Note that both y_pred and y_pred2 are the same; they are just calculated via different packages.

Calculate R-Squared and Adjusted R-Squared Manually on Test data

We can also calculate r-squared and adjusted r-squared via formula without using any package.
import numpy as np
y_test = pd.to_numeric(y_test.CMS, errors='coerce')
RSS = np.sum((y_pred2 - y_test)**2)
y_mean = np.mean(y_test)
TSS = np.sum((y_test - y_mean)**2)
R2 = 1 - RSS/TSS
R2

n=X_test.shape[0]
p=X_test.shape[1] - 1

adj_rsquared = 1 - (1 - R2) * ((n - 1)/(n-p-1))
adj_rsquared
R-Squared : 0.6225
Adjusted RSquared : 0.60719

Detecting Outliers:
Firstly we try to get the studentized residuals using get_influence( ). The studentized residuals are saved in resid_student.

influence = lm2.get_influence()
resid_student = influence.resid_studentized_external

Combining the training set and the residuals we have:

   Cement Blast Fly Ash Water Superplasticizer CA FA Age \
0 540.0 0.0 0.0 162.0 2.5 1040.0 676.0 28.0
1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28.0
2 332.5 142.5 0.0 228.0 0.0 932.0 594.0 270.0
3 332.5 142.5 0.0 228.0 0.0 932.0 594.0 365.0
4 198.6 132.4 0.0 192.0 0.0 978.4 825.5 360.0

Studentized Residuals
0 1.559672
1 -0.917354
2 1.057443
3 0.637504
4 -1.170290
resid = pd.concat([x_train,pd.Series(resid_student,name = "Studentized Residuals")],axis = 1)
resid.head()

If the absolute value of the studentized residuals is more than 3 then that observation is considered an outlier and hence should be removed. We create a logical vector for the absolute studentized residuals greater than 3:
Cement Blast Fly Ash Water Superplasticizer CA FA Age \
649 166.8 250.2 0.0 203.5 0.0 975.6 692.6 3.0

Studentized Residuals
649 3.161183
resid.loc[np.absolute(resid["Studentized Residuals"]) > 3,:]The index of the outliers are given by ind:
ind = resid.loc[np.absolute(resid["Studentized Residuals"]) > 3,:].index
ind

Int64Index([649], dtype='int64')

Dropping Outlier 
Using the drop( ) function we remove the outlier from our training sets!
y_train.drop(ind,axis = 0,inplace = True)
x_train.drop(ind,axis = 0,inplace = True)  #Interept column is not there
X_train.drop(ind,axis = 0,inplace = True)  #Intercept column is there
Detecting and Removing Multicollinearity 
We use the statsmodels library to calculate VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
[variance_inflation_factor(x_train.values, j) for j in range(x_train.shape[1])]

[15.477582601956859,
3.2696650121931814,
4.1293255012993439,
82.210084751631086,
5.21853674386234,
85.866945489015535,
71.816336942930675,
1.6861600968467656]

We create a function to remove the collinear variables. We choose a threshold of 5 which means if VIF is more than 5 for a particular variable then that variable will be removed.
def calculate_vif(x):
    thresh = 5.0
    output = pd.DataFrame()
    k = x.shape[1]
    vif = [variance_inflation_factor(x.values, j) for j in range(x.shape[1])]
    for i in range(1,k):
        print("Iteration no.")
        print(i)
        print(vif)
        a = np.argmax(vif)
        print("Max VIF is for variable no.:")
        print(a)
        if vif[a] <= thresh :
            break
        if i == 1 :         
            output = x.drop(x.columns[a], axis = 1)
            vif = [variance_inflation_factor(output.values, j) for j in range(output.shape[1])]
        elif i > 1 :
            output = output.drop(output.columns[a],axis = 1)
            vif = [variance_inflation_factor(output.values, j) for j in range(output.shape[1])]
    return(output)
train_out = calculate_vif(x_train)

Now we view the training set:

train_out.head()

     Cement Blast Fly Ash Superplasticizer Age
337 275.1 0.0 121.4 9.9 56
384 516.0 0.0 0.0 8.2 28
805 393.0 0.0 0.0 0.0 90
682 183.9 122.6 0.0 0.0 28
329 246.8 0.0 125.1 12.0 3

Removing the variables from the test set.
x_test.head()
x_test.drop(["Water","CA","FA"],axis = 1,inplace = True)
x_test.head()

     Cement Blast Fly Ash Superplasticizer Age
173 318.8 212.5 0.0 14.3 91
134 362.6 189.0 0.0 11.6 28
822 322.0 0.0 0.0 0.0 28
264 212.0 0.0 124.8 7.8 3
479 446.0 24.0 79.0 11.6 7

Running linear regression again on our new training set (without multicollinearity)
import statsmodels.api as sma
import statsmodels.formula.api as sm
train_out = sma.add_constant(train_out) ## let's add an intercept (beta_0) to our model
x_test.drop(["Water","CA","FA"],axis = 1,inplace = True)
X_test = sma.add_constant(x_test)
lm2 = sm.OLS(y_train,train_out).fit()
lm2.summary()"""
OLS Regression Results
==============================================================================
Dep. Variable: CMS R-squared: 0.570
Model: OLS Adj. R-squared: 0.567
Method: Least Squares F-statistic: 216.3
Date: Wed, 10 Jan 2018 Prob (F-statistic): 6.88e-147
Time: 15:14:59 Log-Likelihood: -3128.8
No. Observations: 823 AIC: 6270.
Df Residuals: 817 BIC: 6298.
Df Model: 5
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
const -11.1119 1.915 -5.803 0.000 -14.871 -7.353
Cement 0.1031 0.005 20.941 0.000 0.093 0.113
Blast 0.0721 0.006 12.622 0.000 0.061 0.083
Fly Ash 0.0614 0.009 6.749 0.000 0.044 0.079
Superplasticizer 0.7519 0.077 9.739 0.000 0.600 0.903
Age 0.1021 0.006 16.582 0.000 0.090 0.114
==============================================================================
Omnibus: 0.870 Durbin-Watson: 2.090
Prob(Omnibus): 0.647 Jarque-Bera (JB): 0.945
Skew: 0.039 Prob(JB): 0.623
Kurtosis: 2.853 Cond. No. 1.59e+03
==============================================================================

Checking normality of residuals

We use the Shapiro-Wilk test from the scipy library to check the normality of residuals.
  1. Null Hypothesis: The residuals are normally distributed.
  2. Alternative Hypothesis: The residuals are not normally distributed.
from scipy import stats
stats.shapiro(lm2.resid)

(0.9983407258987427, 0.6269884705543518)

Since the p-value is 0.6269, at the 5% level of significance we can say that the residuals are normally distributed.

Checking for autocorrelation

To ensure the absence of autocorrelation we use the Ljung-Box test.
  1. Null Hypothesis: Autocorrelation is absent.
  2. Alternative Hypothesis: Autocorrelation is present.
from statsmodels.stats import diagnostic as diag
diag.acorr_ljungbox(lm2.resid , lags = 1)

(array([ 1.97177212]), array([ 0.16025989]))
Since the p-value is 0.1602, we fail to reject the null hypothesis and can say that autocorrelation is absent.

Checking heteroscedasticity

Using the Goldfeld-Quandt test we check for heteroscedasticity.
  1. Null Hypothesis: Error terms are homoscedastic
  2. Alternative Hypothesis: Error terms are heteroscedastic.
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(lm2.resid, lm2.model.exog)
lzip(name, test)

[('F statistic', 0.9903), ('p-value', 0.539)]

The p-value is 0.539, so we can say that the residuals have constant variance. All the assumptions of our linear regression model are therefore satisfied.

About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 8 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains.

Let's Get Connected: LinkedIn

Categories: FLOSS Project Planets

ListenData: Identify Person, Place and Organisation in content using Python

Planet Python - Thu, 2019-06-20 22:07
This article outlines the concept and python implementation of Named Entity Recognition using StanfordNERTagger. The technical challenges such as installation issues, version conflict issues, operating system issues that are very common to this analysis are out of scope for this article.

NER NLP using Python
Table of contents:

1. Named Entity Recognition defined
2. Business Use cases
3. Installation Pre-requisites
4. Python Code for implementation
5. Additional Reading: CRF model, Multiple models available in the package
6. Disclaimer

1. Named Entity Recognition Defined
The process of detecting and classifying proper names mentioned in a text can be defined as Named Entity Recognition (NER). In simple words, it locates person names, organisations, locations, etc. in the content. This is generally the first step in most Information Extraction (IE) tasks in Natural Language Processing.

[Image: NER Sample]
2. Business Use Cases
There is a need for NER across multiple domains. Below are a few sample business use cases for your reference.
  1. Investment research: To identify the various announcements of the companies, people’s reaction towards them and its impact on the stock prices, one needs to identify people and organisation names in the text
  2. Chat-bots in multiple domains: To identify places and dates for booking hotel rooms, air tickets etc.
  3. Insurance domain: Identify and mask people’s names in the feedback forms before analyzing them. This is needed to remain regulatory compliant (example: HIPAA).

3. Installation Prerequisites
  1. Download Stanford NER from http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip
  2. Unzip the zipped folder and save it in a drive.
  3. Copy the “stanford-ner.jar” from the folder and save it just outside the folder as shown in the image.
  4. Download the caseless models from https://stanfordnlp.github.io/CoreNLP/history.html by clicking on “caseless” as given below. The models in the first link work as well. However, the caseless models help in identifying named entities even when they are not capitalised as required by formal grammar rules.
  5. Save the folder in the same location as the Stanford NER folder for ease of access.

[Image: Stanford NER Installation - Step 1]

[Image: NER Installation - Step 2]
4. Python Code for implementation:

# Import all the required libraries.
import os
from nltk.tag import StanfordNERTagger
import pandas as pd

# Set environment variables programmatically.
# Set the classpath to the path where the jar file is located.
os.environ['CLASSPATH'] = "<path to the file>/stanford-ner-2015-04-20/stanford-ner.jar"

# Set the Stanford models to the path where the models are stored.
os.environ['STANFORD_MODELS'] = '<path to the file>/stanford-corenlp-caseless-2015-04-20-models/edu/stanford/nlp/models/ner'

# Set the Java JDK path.
java_path = "C:/Program Files/Java/jdk1.8.0_161/bin/java.exe"
os.environ['JAVAHOME'] = java_path
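# Optional sanity check (not in the original post): confirm that the jar, the model
# directory and the Java executable set above actually exist before building the
# tagger, since missing or mistyped paths are a common source of errors here.
for p in [os.environ['CLASSPATH'], os.environ['STANFORD_MODELS'], java_path]:
    if not os.path.exists(p):
        print("Warning: path not found -> " + p)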

# Set the path to the model that you would like to use.
stanford_classifier = '<path to the file>/stanford-corenlp-caseless-2015-04-20-models/edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz'

# Build the NER tagger object.
st = StanfordNERTagger(stanford_classifier)

# A sample text for NER tagging.
text = 'srinivas ramanujan went to the united kingdom. There he studied at cambridge university.'

# Tag the sentence and print the output.
tagged = st.tag(str(text).split())
print(tagged)
Output
[(u'srinivas', u'PERSON'),
(u'ramanujan', u'PERSON'),
(u'went', u'O'),
(u'to', u'O'),
(u'the', u'O'),
(u'united', u'LOCATION'),
(u'kingdom.', u'LOCATION'),
(u'There', u'O'),
(u'he', u'O'),
(u'studied', u'O'),
(u'at', u'O'),
(u'cambridge', u'ORGANIZATION'),
(u'university', u'ORGANIZATION')]
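The tagger returns one (token, label) pair per token, so multi-word names arrive in pieces. A small post-processing sketch (an illustrative addition, not part of the original post) groups consecutive tokens that share the same non-'O' label into single entities:

# Sketch: merge consecutive tokens with the same non-'O' tag into one entity.
def group_entities(tagged_tokens):
    entities = []
    current_words, current_label = [], None
    for word, label in tagged_tokens:
        if label != 'O' and label == current_label:
            current_words.append(word)
        else:
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = ([word], label) if label != 'O' else ([], None)
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities

print(group_entities(tagged))
# e.g. [('srinivas ramanujan', 'PERSON'), ('united kingdom.', 'LOCATION'),
#       ('cambridge university', 'ORGANIZATION')]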

5. Additional Reading
The StanfordNER algorithm leverages a general implementation of linear-chain Conditional Random Field (CRF) sequence models. CRFs may look very similar to Hidden Markov Models (HMMs) but are quite different.

Below are some key points to note about CRFs in general.
  1. It is a discriminative model, unlike the HMM, and thus models the conditional probability directly (a standard form of this probability is shown below).
  2. It does not assume independence of features, unlike the HMM. This means that the current word, the previous word and the next word can all be used as features in the model.
  3. Relative to HMMs or Maximum Entropy Markov Models, CRFs are the slowest.
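For point 1, the conditional probability modelled by a linear-chain CRF has the standard textbook form (the notation below is the usual one and is not taken from the original post):

P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big)

where the f_k are feature functions defined over adjacent labels and the observed sequence, the \lambda_k are their learned weights, and Z(x) is the normalising constant obtained by summing over all possible label sequences.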

6. Disclaimer
This article explains the implementation of the StanfordNER algorithm for research purposes and does not promote it for commercial gain. For any questions on the commercial aspects of implementing this algorithm, please contact Stanford University.

About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 8 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains.

Let's Get Connected: LinkedIn

Categories: FLOSS Project Planets
