FLOSS Project Planets

Carl Trachte: Powershell Encoded Command, sqlcmd, and csv Query Output

Planet Python - Sat, 2017-12-09 21:18
A while back I did a post on using sqlcmd and dumping data to Excel.  At the time I was using Microsoft SQL Server's bcp (bulk copy) utility to dump data to a csv file.

Use of bcp is blocked where I am working now.  But Powershell and sqlcmd are very much available on the Windows workstations we use.  Just as with bcp, smithing text for sqlcmd input can be a little tricky, same with Powershell.  But Powershell has an EncodedCommand feature which allows you to feed input to it as a base 64 string.  This will be a quick demo of the use of this feature and output of a faux comma delimited (csv) file with data.

Disclaimer:  scripts that rely extensively on os.system() calls from Python are indeed hacky and mousetrappy.  I think the saying goes "Necessity is a mother," or something similar.  Onward.

Getting the base 64 string from the original string:

First our SQL code that queries a mock table I made in my mock database:
USE test;
SELECT testpk,
       namex,
    [value]
FROM testtable
ORDER BY testpk;

We will call this file selectdata.sql.
Then the call to sqlcmd/Powershell:
sqlcmd -S localhost -i .\selectdata.sql -E -h -1 -s "," -W  | Tee-Object -FilePath .\testoutput
In Python (we have to use Python 2.7 in our environment, so this is Python 2.x specific):
Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import base64
>>> stringx = r'sqlcmd -S localhost -i .\selectdata.sql -E -h -1 -s "," -W | Tee-Object -FilePath .\testoutput'
>>> bytesx = stringx.encode('utf-16-le')
>>> encodedcommandx = base64.b64encode(bytesx)
>>> encodedcommandx
'cwBxAGwAYwBtAGQAIAAtAFMAIABsAG8AYwBhAGwAaABvAHMAdAAgAC0AaQAgAC4AXABzAGUAbABlAGMAdABkAGEAdABhAC4AcwBxAGwAIAAtAEUAIAAtAGgAIAAtADEAIAAtAHMAIAAiACwAIgAgAC0AVwAgAHwAIABUAGUAZQAtAE8AYgBqAGUAYwB0ACAALQBGAGkAbABlAFAAYQB0AGgAIAAuAFwAdABlAHMAdABvAHUAdABwAHUAdAA='
>>>

I had to type out my command in the Python interpreter.  When I pasted it in from GVim, it choked on the UTF encoding.
Now, Powershell:
PS C:\Users\ctrachte> $sqlcmdstring = 'sqlcmd -S localhost -i .\selectdata.sql -E -h -1 -s "," -W | Tee-Object -FilePath
 .\testoutput'
PS C:\Users\ctrachte> $encodedcommand = [Convert]::ToBase64String([Text.Encoding]::Unicode.GetBytes($sqlcmdstring))
PS C:\Users\ctrachte> $encodedcommand
cwBxAGwAYwBtAGQAIAAtAFMAIABsAG8AYwBhAGwAaABvAHMAdAAgAC0AaQAgAC4AXABzAGUAbABlAGMAdABkAGEAdABhAC4AcwBxAGwAIAAtAEUAIAAtAGgAIAAtADEAIAAtAHMAIAAiACwAIgAgAC0AVwAgAHwAIABUAGUAZQAtAE8AYgBqAGUAYwB0ACAALQBGAGkAbABlAFAAYQB0AGgAIAAuAFwAdABlAHMAdABvAHUAdABwAHUAdAA=
PS C:\Users\ctrachte>

OK, the two base 64 strings are the same, so we are good.
Command Execution from os.system() call:
>>> import os
>>> os.system(INVOKEPOWERSHELL.format(encodedcommandx))
Changed database context to 'test'.
000001,VOLUME,11.0
000002,YEAR,1999.0
(2 rows affected)
0
>>>
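Putting the pieces together, here is a minimal sketch of the whole Python 2 side of the workflow. The INVOKEPOWERSHELL format string is my assumption of what the constant above contains (the post never shows it); the sqlcmd string and file names are the same as above.

import base64
import os

# Assumed value -- the original constant is not shown in the post.
INVOKEPOWERSHELL = 'powershell -EncodedCommand {0}'

stringx = r'sqlcmd -S localhost -i .\selectdata.sql -E -h -1 -s "," -W | Tee-Object -FilePath .\testoutput'
# Powershell expects the command as UTF-16LE bytes, base 64 encoded.
encodedcommandx = base64.b64encode(stringx.encode('utf-16-le'))
os.system(INVOKEPOWERSHELL.format(encodedcommandx))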

And, thanks to Powershell's version of the tee command from UNIX-like systems, we have a faux csv file as well as output to the command line.
Stackoverflow gave me much of what I needed to know for this:
Links:
Powershell's encoded command:
https://blogs.technet.microsoft.com/heyscriptingguy/2015/10/27/powertip-encode-string-and-execute-with-powershell/
sqlcmd's output to faux csv:
https://stackoverflow.com/questions/425379/how-to-export-data-as-csv-format-from-sql-server-using-sqlcmd
The UTF encoding stuff just took some trial and error and fiddling.
Thanks for stopping by.




Categories: FLOSS Project Planets

Wyatt Baldwin: Finding the Oddity in a Sequence

Planet Python - Sat, 2017-12-09 18:36

After reading Ned Batchelder’s Iter-tools for puzzles: oddity post yesterday, my first thought was to use itertools.groupby(). That version is the fastest I could come up with (by quite a bit actually, especially for longer sequences), but it requires sorting the iterable first, which uses additional space and won’t work with infinite sequences.

My next thought was to use a set to keep track of seen elements, but that requires the keys to be hashable, so I scrapped that idea.

I figured using a list to keep track of seen elements wouldn’t be too bad if the seen list was never allowed to grow beyond two elements. After playing around with this version for a while, I finally came up with something on par with Ned’s version performance-wise while meeting the following objectives:

  • Don’t require keys to be hashable
  • Don’t store the elements read from the iterable
  • Support infinite sequences

Some interesting things:

  • Rearranging the branches to use in instead of not in sped things up more than I would have thought
  • Initially, I was checking the lengths of the seen and common lists on every iteration, which slowed things down noticeably (which isn’t all that surprising really)
  • Setting the key function to the identity function lambda v: v is noticeably slower than checking to see if it’s set on every iteration (also not surprising)
  • Using sets/dicts for seen, common, and uncommon didn’t make a noticeable difference (I tried adding a hashable option and setting things up accordingly, but it wasn’t worth the additional complexity)

Here’s the code (the gist includes a docstring, tests, and a simplistic benchmark):

def oddity(iterable, key=None, first=False):
    seen = []
    common = []
    uncommon = []
    for item in iter(iterable):
        item_key = key(item) if key else item
        if item_key in seen:
            if item_key in common:
                if first and uncommon:
                    return item_key, uncommon[1]
            else:
                common.append(item_key)
                if len(common) == 2:
                    raise TooManyCommonValuesError
                if item_key in uncommon:
                    i = uncommon.index(item_key)
                    j = i + 2
                    uncommon[i:j] = []
        else:
            seen.append(item_key)
            if len(seen) == 3:
                raise TooManyDistinctValuesError
            uncommon.extend((item_key, item))
    if len(seen) == 0:
        raise EmptyError
    if len(common) == 0:
        raise NoCommonValueError
    if len(uncommon) == 0:
        uncommon_value = None
    else:
        uncommon_value = uncommon[1]
    return common[0], uncommon_value
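For example, assuming the exception classes from the gist are defined, a quick check of the behaviour (these calls are my own illustration, not part of the gist):

>>> oddity([1, 1, 1, 2])                    # 1 is common, 2 is the odd one out
(1, 2)
>>> oddity([1, 1, 1])                       # no odd value at all
(1, None)
>>> oddity(['a', 'a', 'b'], key=str.upper)  # a key function works too
('A', 'b')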
Categories: FLOSS Project Planets

Dirk Eddelbuettel: #12: Know and Customize your OS and Work Environment

Planet Debian - Sat, 2017-12-09 17:30

Welcome to the twelfth post in the randomly relevant R recommendations series, or R4 for short. This post will insert a short diversion into what was planned as a sequence of posts on faster installations that started recently with this post, but we will return to it very shortly (for various definitions of "very" or "shortly").

Earlier today Davis Vaughn posted a tweet about a blog post of his describing a (term) paper he wrote modeling bitcoin volatility using Alexios's excellent rugarch package---and all that typeset with the styling James and I put together in our little pinp package which is indeed very suitable for such tasks of writing (R)Markdown + LaTeX + R code combinations conveniently in a single source file.

Leaving aside the need to celebrate a term paper with a blog post and tweet, pinp is indeed very nice and deserving of some additional exposure and tutorials. Now, Davis sets out to do all this inside RStudio---as folks these days seem to like to limit themselves to a single tool or paradigm. Older and wiser users prefer the flexibility of switching tools and approaches, but alas, we digress. While Davis manages of course to do all this in RStudio, which is indeed rather powerful and therefore rightly beloved, he closes on

I wish there was some way to have Live Rendering like with blogdown so that I could just keep a rendered version of the paper up and have it reload every time I save. That would be the dream!

and I can only add a forceful: Fear not, young man, for we can help thou!

Modern operating systems have support for epoll and inotify, which can be used from the shell. Just as your pdf application refreshes automagically when a pdf file is updated, we can hook into this from the shell to actually create the pdf when the (R)Markdown file is updated. I am going to use a tool readily available on my Linux systems; macOS will surely have something similar. The entr command takes one or more file names supplied on stdin and executes a command when one of them changes. Handy for invoking make whenever one of your header or source files changes, and usable here. E.g. the last markdown file I was working on was named comments.md and contained comments to a referee, and we can auto-process it on each save via

echo comments.md | entr render.r comments.md

which uses render.r from littler (new release soon too...; a simple Rscript -e 'rmarkdown::render("comments.md")' would probably work too but render.r is shorter and a little more powerful so I use it more often myself) on the input file comments.md, which also happens to be the (here sole) file being monitored.

And that is really all there is to it. I wanted / needed something like this a few months ago at work too, and may have used an inotify-based tool there but cannot find any notes. Python has something similar via watchdog, which is yet again more complicated / general.
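For the Python watchdog route, a minimal sketch (my illustration of the idea, not the workflow described above) that re-renders comments.md on every save could look like this:

import subprocess
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class RenderOnSave(FileSystemEventHandler):
    # Re-render the markdown file whenever it changes on disk.
    def on_modified(self, event):
        if event.src_path.endswith("comments.md"):
            subprocess.call(["Rscript", "-e", 'rmarkdown::render("comments.md")'])

observer = Observer()
observer.schedule(RenderOnSave(), path=".", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()

But as noted above, entr gets you the same result in a single shell line.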

It turns out that auto-processing is actually not that helpful as we often save before an expression is complete, leading to needless error messages. So at the end of the day, I often do something much simpler. My preferred editor has a standard interface to 'building': pressing C-x c prompts for a command (which it recalls) that defaults to make -k (i.e., make with error skipping). Simply replacing that with render.r comments.md (in this case) means we get an updated pdf file when we want, with a simple customizable command / key-combination.

So in sum: it is worth customizing your environments, learning about what your OS may have, and looking beyond a single tool / editor / approach. Even dreams may come true ...

Postscriptum: And Davis takes this in stride and almost immediately tweeted a follow-up with a nice screen capture mp4 movie showing that entr does indeed work just as well on his macbook.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Categories: FLOSS Project Planets

Freelock : A custom quantity price discount for Drupal Commerce

Planet Drupal - Sat, 2017-12-09 16:01

We're in the midst of a Commerce 2 build-out for a client, and a key requirement was to preserve their quantity pricing rules. With thousands of products, and different pricing rules for each one, they need the price for each item in the cart adjusted to the appropriate price for the quantity purchased. When validating our plan for Drupal Commerce 2, we stumbled upon some examples of a custom price resolver, and knew this would work perfectly for the need.

Drupal 8, Drupal Commerce, Drupal Planet, Field API
Categories: FLOSS Project Planets

Kracekumar Ramaraju: Debugging Python multiprocessing program with strace

Planet Python - Sat, 2017-12-09 14:19

Debugging is a time-consuming and brain-draining process. It's an essential part of learning and writing maintainable code. Every person has their own way of debugging, their own approaches and tools. Sometimes you can view the traceback, pull up the code from memory, and find a quick fix. Other times, you opt for different tricks like the print statement, the debugger, and the rubber duck method.

Debugging a multiprocessing bug in Python is hard for various reasons.

  • The print statement is simple to start with. When all the processes write their log to the same file, there is no guarantee of the order without synchronization. You need to use -u or sys.stdout.flush(), or synchronize the log statements using process IDs or some other identifier (a minimal logging sketch follows this list). Without the PID it's hard to know which process is stuck.
  • You can't put pdb.set_trace() inside a multiprocessing pool target function.
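As a small illustration of the first point, tagging each log line with the process id makes interleaved output from the pool readable (a sketch, not from the post):

import logging

logging.basicConfig(
    format="%(asctime)s pid=%(process)d %(message)s",
    level=logging.INFO,
)

def read_data(path):
    logging.info("start reading %s", path)   # %(process)d adds the worker's PID
    # ... actual work ...
    logging.info("done reading %s", path)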

Let's say a program reads a metadata file which contains a list of JSON files and the total record count. A file may have a total of 100 JSON records, but you may need only 5. In any case, the function can return the first five or a random five records.

Sample code

https://gist.github.com/kracekumar/20cfeec5dc99b084c879ce737f7c7214

The code above is just for demonstrating the workflow; the production code is not as simple as this.

Consider a multiprocessing.Pool.starmap call with read_data as the target function and the number of processes set to 40.

Let's say there are 80 files to process. Out of the 80, 5 are problematic files (the function takes forever to complete reading the data). Whatever the position of those five files, the processes handling them continue forever, while the other processes enter a sleep state after completing their tasks.
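To make the setup concrete, here is a rough sketch of what such a workflow looks like (file names, record counts, and the shape of read_data are my assumptions, not the gist's code); the while loop shows how a file with zero records keeps a worker spinning forever:

import json
import multiprocessing


def read_data(path, required_records):
    # Keep reading until we have the required number of records.
    # If the file holds zero records, this loop never makes progress.
    records = []
    while len(records) < required_records:
        with open(path) as handle:
            for record in json.load(handle):
                records.append(record)
                if len(records) == required_records:
                    break
    return records


if __name__ == '__main__':
    args = [('file_{}.json'.format(i), 5) for i in range(80)]
    with multiprocessing.Pool(processes=40) as pool:
        results = pool.starmap(read_data, args)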

Once you know the PID of the running process, you can look at what system calls the process makes using strace -p $PID. The process in the running state was calling the read system call with the same file name again and again. The while loop went on since the file had zero records and there was only one file left in the queue.

Strace looks like one below

https://gist.github.com/kracekumar/8cd44ff27b81e200343bc58d194dd595

You may argue a well-placed print would solve the problem. But you won't always have the luxury of modifying the running program or replicating the code in a local environment.

Categories: FLOSS Project Planets

Elisa 0.0.80 Released

Planet KDE - Sat, 2017-12-09 12:48

The Elisa team is happy to release the first alpha release of a future 0.1 version of the Elisa music player.

Elisa is a music player designed to be simple and nice to use.

Elisa lets you browse music by album, artist, or all tracks. The music is indexed using either a private indexer or an indexer based on Baloo. The private one can be configured to scan music in chosen paths. The Baloo one is much faster because Baloo provides all the needed data from its own database. You can build and play your own playlist.

It is available in source form only at https://download.kde.org/unstable/elisa/0.0.80/elisa-0.0.80.tar.xz. KDE also provides regular flatpak and Windows builds. The Windows builds are provided by binary-factory.kde.org and are built using Craft. Thanks to the people in the flatpak and Windows teams. A binary package is provided for Fedora thanks to Robert-André Mauchin. Other Linux distributions may provide binary packages from the git repository (as far as I know, there is an AUR package and a package for Neon).

The development is still continuing and we will regularly release an alpha version until we have completed review by other KDE developers. We plan to move to kdereview as soon as we have finished a few missing features.

For the curious reader, the phabricator workboard of Elisa is a good way to know what is happening (https://phabricator.kde.org/project/board/193/).

Any help is very much welcome. We have open tasks for code but also for design.


Categories: FLOSS Project Planets

Craig Small: WordPress 4.9.1

Planet Debian - Sat, 2017-12-09 01:23

After a much longer than expected break due to moving and the resulting lack of Internet, plus WordPress releasing a package with a non-free file, the Debian package for WordPress 4.9.1 has been uploaded!

WordPress 4.9 has a number of improvements, especially around the customiser components, so that looked pretty slick. The editor for the customiser now has a series of linters that will warn if you write something bad, which is a very good thing! Unfortunately the JavaScript linter is jshint, which uses a non-free license that the jshint team is attempting to fix. I have also reported the problem to WordPress upstream to have a look at.

While this was all going on, there were 4 security issues found in WordPress which resulted in the 4.9.1 release.

Finally I got the time to look into the jshint problem, and the Internet access to actually download the upstream files and upload the Debian packages. So version 4.9.1-1 of the packages has now been uploaded and should be in the mirrors soon. I'll start looking at the 4.9.1 patches to see what is relevant for Stretch and Jessie.

Categories: FLOSS Project Planets

Wim Leers: API-First Drupal — really!

Planet Drupal - Fri, 2017-12-08 16:54

This blog has been quiet for the last year and a half, because I don’t like to announce things until I feel comfortable recommending them. Until today!

Since July 2016, API-First Drupal became my primary focus, because Dries felt this was one of the most important areas for Drupal’s future. Together with the community, I triaged the issue queue, and helped determine the most important bugs to fix and improvements to add. That’s how we ended up with REST: top priorities for Drupal … plan issues for each Drupal 8 minor:

If you want to see what’s going on, start following that last issue. Whenever there’s news, I post a new comment there.

But enough background. This blog post is not an update on the entire API-First Initiative, it’s about a particular milestone.

100% integration test coverage!

The biggest problem we encountered while working on rest.module, serialization.module and hal.module was unknown BC breaks 1. Because in case of a REST API, the HTTP response is the API. What is a bug fix for person X is a BC break for person Y. The existing test coverage was rather thin, and was often only testing “the happy path”: the simplest possible case. That’s why we would often accidentally introduce BC breaks.

Hence the clear need for really thorough functional (integration) test coverage2, which was completed almost exactly a year ago. We added EntityResourceTestBase, which tests dozens of scenarios3 in a generic way4, and used that to test the 9 entity types that already had some REST test coverage more thoroughly than before.

But we had to bring this to all entity types in Drupal core … and covering all 41 entity types in Drupal core was completed exactly a week ago!

The test coverage revealed bugs for almost every entity type. (Most of them are fixed by now.)

Tip: Subclass that base test class for your custom entity types, and easily get full REST test coverage — 41 examples available!

Guaranteed to remain at 100%

We added EntityResourceRestTestCoverageTest, which verifies that we have test coverage for all permutations of:

  • entity type
  • format: json + xml + hal_json
  • authentication: cookie + basic_auth + anon

It is now impossible to add new entity types without also adding solid REST test coverage!

If you forget that test coverage, you’ll find an ASCII-art llama talking to you:

Good people of #Drupal, I present unto you the greatest method of all time. https://github.com/drupal/drupal/blob/8.5.x/core/modules/rest/tests/src/Functional/EntityResource/EntityResourceRestTestCoverageTest.php#L141 pic.twitter.com/TiWLPt7duH

— webcsillag (@webchick) December 8, 2017

That is why we can finally say that Drupal is really API-First!

This of course doesn’t help only core’s REST module, it also helps the contributed JSON API and GraphQL modules: they’ll encounter far fewer bugs!

Thanks

So many people have helped! In random order: rogierbom, alexpott, harings_rob, himanshu-dixit, webflo, tedbow, xjm, yoroy, timmillwood, gaurav.kapoor, Gábor Hojtsy, brentschuddinck, Sam152, seanB, Berdir, larowlan, Yogesh Pawar, jibran, catch, sumanthkumarc, amateescu, andypost, dawehner, naveenvalecha, tstoeckler — thank you all!5

Special thanks to three people I omitted above, because they’re not well known in the Drupal community, and totally deserve the spotlight here, for their impressive contribution to making this happen:

That’s thirty contributors without whom this would not have happened!

And of course thanks to my employer, Acquia, for allowing me to work on this full-time!

Next

What is going to be the next big milestone we hit? That’s impossible to say, because it depends on the chains of blocking issues that we encounter. It could be support for modifying and creating config entities, it could be support for translations, it could be that all major serialization gaps are fixed, it could be file uploads, or it could be ensuring all normalizers work in both rest.module & jsonapi.module

The future will tell, follow along!

  1. Backwards Compatibility. ↩︎

  2. Nowhere near 100% test coverage, definitely not every possible edge case is tested, and that is fine↩︎

  3. Including helpful error responses when unauthenticated, unauthorized or just a bad request. This vastly improves DX: no need to be a Drupal expert to talk to a REST API powered by Drupal! ↩︎

  4. It is designed to be subclassed for an entity type, and then there are subclasses of that for every format + authentication combination. ↩︎

  5. And this is just from all the per-entity type test issues, I didn’t look at the blockers and blockers of blockers. ↩︎

  • Acquia
  • Drupal
Categories: FLOSS Project Planets

pythonwise: Advent of Code 2017 #8 and Python's eval

Planet Python - Fri, 2017-12-08 14:37
I'm having fun solving Advent of Code 2017. Problem 8 reminded me of the power of Python's eval (and before you start commenting that "eval is evil," may I remind you of this :)

You can check the Go implementation, which doesn't have eval and needs to work harder.
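For readers who don't know the puzzle: each instruction looks like "b inc 5 if a > 1", and eval makes the conditional part almost free. A rough sketch of that trick (not the post's actual solution):

from collections import defaultdict

def run(lines):
    registers = defaultdict(int)
    for line in lines:
        reg, op, amount, _if, cond_reg, cmp_op, cmp_val = line.split()
        condition = '{} {} {}'.format(registers[cond_reg], cmp_op, cmp_val)
        if eval(condition):  # the "evil" part doing the heavy lifting
            registers[reg] += int(amount) if op == 'inc' else -int(amount)
    return max(registers.values())

print(run(['b inc 5 if a > 1', 'a inc 1 if b < 5',
           'c dec -10 if a >= 1', 'c inc -20 if c == 10']))  # -> 1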
Categories: FLOSS Project Planets

Colm O hEigeartaigh: SAML SSO support for the Apache Syncope web console

Planet Apache - Fri, 2017-12-08 12:09
Apache Syncope is a powerful open source Identity Management project that has recently celebrated 5 years as an Apache top level project. Until recently, a username and password had to be supplied to log onto either the admin or enduser web consoles of Apache Syncope. However, SAML SSO login is supported since the 2.0.3 release. Instead of supplying a username/password, the user is redirected to a third party IdP for login, before being redirected back to the Apache Syncope web console. In 2.0.5, support for the IdP-initiated flow of SAML SSO was added.

In this post we will show how to configure Apache Syncope to use SAML SSO as an alternative to logging in using a username and password. We will use Apache CXF Fediz as the SAML SSO IdP. In addition, we will show how to achieve IdP-initiated SSO using Okta. Please also refer to this tutorial on achieving SAML SSO with Syncope and Shibboleth.

1) Logging in to Apache Syncope using SAML SSO

In this section, we will cover setting up Apache Syncope to re-direct to a third party IdP so that the user can enter their credentials. The next section will cover the IdP-initiated case.

1.a) Enable SAML SSO support in Apache Syncope

First we will configure Apache Syncope to enable SAML SSO support. Download and extract the most recent standalone distribution release of Apache Syncope (2.0.6 was used in this post). Start the embedded Apache Tomcat instance and then open a web browser and navigate to "http://localhost:9080/syncope-console", logging in as "admin" and "password".

Apache Syncope is configured with some sample data to show how it can be used. Click on "Users" and add a new user called "alice" by clicking on the subsequent "+" button. Specify a password for "alice" and then select the default values wherever possible (you will need to specify some required attributes, such as "surname"). Now in the left-hand column, click on "Extensions" and then "SAML 2.0 SP". Click on the "Service Provider" tab and then "Metadata". Save the resulting Metadata document, as it will be required to set up the SAML SSO IdP.

1.b) Set up the Apache CXF Fediz SAML SSO IdP

Next we will turn our attention to setting up the Apache CXF Fediz SAML SSO IdP. Download the most recent source release of Apache CXF Fediz (1.4.3 was used for this tutorial). Unzip the release and build it using maven ("mvn clean install -DskipTests"). In the meantime, download and extract the latest Apache Tomcat 8.5.x distribution (tested with 8.5.24). Once Fediz has finished building, copy all of the "IdP" wars (e.g. in fediz-1.4.3/apache-fediz/target/apache-fediz-1.4.3/apache-fediz-1.4.3/idp/war/fediz-*) to the Tomcat "webapps" directory.

There are a few configuration changes to be made to Apache Tomcat before starting it. Download the HSQLDB jar and copy it to the Tomcat "lib" directory. Next edit 'conf/server.xml' and configure TLS on port 8443:

The two keys referenced here can be obtained from 'apache-fediz/target/apache-fediz-1.4.3/apache-fediz-1.4.3/examples/samplekeys/' and should be copied to the root directory of Apache Tomcat. Tomcat can now be started.

Next we have to configure Apache CXF Fediz to support Apache Syncope as a "service" via SAML SSO. Edit 'webapps/fediz-idp/WEB-INF/classes/entities-realma.xml' and add the following configuration:

In addition, we need to make some changes to the "idp-realmA" bean in this file:
  • Add a reference to this bean in the "applications" list: <ref bean="srv-syncope" />
  • Change the "idpUrl" property to: https://localhost:8443/fediz-idp/saml
  • Change the port for "stsUrl" from "9443" to "8443".
Now we need to configure Fediz to accept Syncope's signing cert. Edit the Metadata file you saved from Syncope in step 1.a. Copy the Base-64 encoded certificate in the "KeyDescriptor" section, and paste it (including line breaks) into 'webapps/fediz-idp/WEB-INF/classes/syncope.cert', enclosing it in between "-----BEGIN CERTIFICATE-----" and "-----END CERTIFICATE-----".

Now restart Apache Tomcat. Open a browser and save the Fediz metadata which is available at "http://localhost:8080/fediz-idp/metadata?protocol=saml", which we will require when configuring Apache Syncope.

1.c) Configure the Apache CXF Fediz IdP in Syncope

The final configuration step takes place in Apache Syncope again. In the "SAML 2.0 SP" configuration screen, click on the "Identity Providers" tab and click the "+" button and select the Fediz metadata that you saved in the previous step. Now logout and an additional login option can be seen:


Select the URL for the SAML SSO IdP and you will be redirected to Fediz. Select the IdP in realm "A" as the home realm and enter credentials of "alice/ecila" when prompted. You will be successfully authenticated to Fediz and redirected back to the Syncope admin console, where you will be logged in as the user "alice". 

2) Using IdP-initiated SAML SSO

Instead of the user starting with the Syncope web console, being redirected to the IdP for authentication, and then redirected back to Syncope - it is possible instead to start from the IdP. In this section we will show how to configure Apache Syncope to support IdP-initiated SAML SSO using Okta.

2.a) Configuring a SAML application in Okta

The first step is to create an account at Okta and configure a SAML application. This process is mapped out at the following link. Follow the steps listed on this page with the following additional changes:
  • Specify the following for the Single Sign On URL: http://localhost:9080/syncope-console/saml2sp/assertion-consumer
  • Specify the following for the audience URL: http://localhost:9080/syncope-console/
  • Specify the following for the default RelayState: idpInitiated
When the application is configured, you will see an option to "View Setup Instructions". Open this link in a new tab and find the section about the IdP Metadata. Save this to a local file and set it aside for the moment. Next you need to assign the application to the username that you have created at Okta.

2.b) Configure Apache Syncope to support IdP-Initiated SAML SSO

Log on to the Apache Syncope admin console using the admin credentials, and add a new IdP Provider in the SAML 2.0 SP extension as before, using the Okta metadata file that you have saved in the previous section. Edit the metadata and select the 'Support Unsolicited Logins' checkbox. Save the metadata and make sure that the Okta user is also a valid user in Apache Syncope.

Now go back to the Okta console and click on the application you have configured for Apache Syncope. You should seamlessly be logged into the Apache Syncope admin console.




Categories: FLOSS Project Planets

Will Kahn-Greene: html5lib-python 1.0 released!

Planet Python - Fri, 2017-12-08 12:00
html5lib-python v1.0 released!

Yesterday, Geoffrey released html5lib 1.0 [1]! The changes aren't wildly interesting.

The more interesting part for me is how the release happened. I'm going to spend the rest of this post talking about that.

[1] Technically there was a 1.0 release followed by a 1.0.1 release because the 1.0 release had issues.

The story of Bleach and html5lib

I work on Bleach which is a Python library for sanitizing and linkifying text from untrusted sources for safe usage in HTML. It relies heavily on another library called html5lib-python. Most of the work that I do on Bleach consists of figuring out how to make html5lib do what I need it to do.

Over the last few years, maintainers of the html5lib library have been working towards a 1.0. Those well-meaning efforts got them into a versioning model which had some unenthusing properties. I would often talk to people about how I was having difficulties with Bleach and html5lib 0.99999999 (8 9s) and I'd have to mentally count how many 9s I had said. It was goofy [2].

In an attempt to deal with the effects of the versioning, there's a parallel set of versions that start with 1.0b. Because there are two sets of versions, it was a total pain in the ass to correctly specify which versions of html5lib that Bleach worked with.

While working on Bleach 2.0, I bumped into a few bugs and upstreamed a patch for at least one of them. That patch sat in the PR queue for months. That's what got me wondering--is this project dead?

I tracked down Geoffrey and talked with him a bit on IRC. He seems to be the only active maintainer. He was really busy with other things, html5lib doesn't pay at all, there's a ton of stuff to do, he's burned out, and recently there have been spats of negative comments in the issues and PRs. Generally the project had a lot of stop energy.

Some time in August, I offered to step up as an interim maintainer and shepherd html5lib to 1.0. The goals being:

  1. land or close as many old PRs as possible
  2. triage, fix, and close as many issues as possible
  3. clean up testing and CI
  4. clean up documentation
  5. ship 1.0 which ends the versioning issues
[2] Many things in life are goofy.

Thoughts on being an interim maintainer

I see a lot of open source projects that are in trouble in the sense that they don't have a critical mass of people and energy. When the sole part-time volunteer maintainer burns out, the project languishes. Then the entitled users show up, complain, demand changes, and talk about how horrible the situation is and everyone should be ashamed. It's tough--people are frustrated and then do a bunch of things that make everything so much worse. How do projects escape the raging inferno death spiral?

For a while now, I've been thinking about a model for open source projects where someone else pops in as an interim maintainer for a short period of time with specific goals and then steps down. Maybe this alleviates users' frustrations? Maybe this gives the part-time volunteer burned-out maintainer a breather? Maybe this can get the project moving again? Maybe the temporary interim maintainer can make some of the hard decisions that a regular long-term maintainer just can't?

I wondered if I should try that model out here. In the process of convincing myself that stepping up as an interim maintainer was a good idea [3], I looked at projects that rely on html5lib [4]:

  • pip vendors it
  • Bleach relies upon it heavily, so anything that uses Bleach uses html5lib (jupyter, hypermark, readme_renderer, tensorflow, ...)
  • most web browsers (Firefox, Chrome, servo, etc) have it in their repositories because web-platform-tests uses it

I talked with Geoffrey and offered to step up with these goals in mind.

I started with cleaning up the milestones in GitHub. I bumped everything from the 0.9999999999 (10 9s) milestone which I determined will never happen into a 1.0 milestone. I used this as a bucket for collecting all the issues and PRs that piqued my interest.

I went through the issue tracker and triaged all the issues. I tried to get steps to reproduce and any other data that would help resolve the issue. I closed some issues I didn't think would ever get resolved.

I triaged all the pull requests. Some of them had been open for a long time. I apologized to people who had spent their time to upstream a fix that sat around for years. In some cases, the changes had bitrotted severely and had to be redone [5].

Then I plugged away at issues and pull requests for a couple of months and pushed anything out of the milestone that wasn't well-defined or something we couldn't fix in a week.

At the end of all that, Geoffrey released version 1.0 and here we are today!

[3] I have precious little free time, so this decision had sweeping consequences for my life, my work, and people around me.

[4] Recently, I discovered libraries.io--it's a pretty amazing project. They have a page for html5lib. I had written a (mediocre) tool that does vaguely similar things.

[5] This is what happens on projects that don't have a critical mass of energy/people. It sucks for everyone involved.

Conclusion and thoughts

I finished up as interim maintainer for html5lib. I don't think I'm going to continue actively as a maintainer. Yes, Bleach uses it, but I've got other things I should be doing.

I think this was an interesting experiment. I also think it was a successful experiment in regards to achieving my stated goals, but I don't know if it gave the project much momentum to continue forward.

I'd love to see other examples of interim maintainers stepping up, achieving specific goals, and then stepping down again. Does it bring in new people to the community? Does it affect the raging inferno death spiral at all? What kinds of projects would benefit from this the most? What kinds of projects wouldn't benefit at all?

Categories: FLOSS Project Planets

Aegir Dispatch: Re-positioning Ægir

Planet Drupal - Fri, 2017-12-08 11:39
Last week, Colan and I attended SaasNorth, a conference for Software-as-a-Service founders and investors, with a strong focus on marketing, finance and growth. To be honest, we were a little skeptical at the near total absence of technical content. However, we were pleasantly surprised by some of the truly excellent speakers we heard. The one that had the most impact on us was April Dunford’s OBVIOUSLY AWESOME – Using context to move your customers from “What?
Categories: FLOSS Project Planets

LakeDrops Drupal Consulting, Development and Hosting: 3 minutes: Starting Drupal 8 project with a composer template

Planet Drupal - Fri, 2017-12-08 11:11
Jürgen Haas - Fri, 12/08/2017 - 17:11

Starting a new Drupal 8 project with the LakeDrops composer template takes less than 3 minutes. And in that short period of time you're not only getting the code base but also a number of extra benefits:

Categories: FLOSS Project Planets

Dataquest: Using Excel with pandas

Planet Python - Fri, 2017-12-08 10:47

Excel is one of the most popular and widely-used data tools; it's hard to find an organization that doesn't work with it in some way. From analysts, to sales VPs, to CEOs, various professionals use Excel for both quick stats and serious data crunching.

With Excel being so pervasive, data professionals must be familiar with it. You'll also want a tool that can easily read and write Excel files — pandas is perfect for this.

Pandas has excellent methods for reading all kinds of data from Excel files. You can also export your results from pandas back to Excel, if that's preferred by your intended audience. Pandas is great for other routine data analysis tasks, such as:

  • quick Exploratory Data Analysis (EDA)
  • drawing attractive plots
  • feeding data into machine learning tools like scikit-learn
  • building machine learning models on your data
  • taking cleaned and processed data to any number of data tools

Pandas is better at automating data processing tasks than Excel, including processing Excel files.

In this tutorial, we are going to show you how to work with Excel files in pandas. We will cover the following concepts.

  • setting up your computer with the necessary software
  • reading in data from Excel files into pandas
  • data exploration in pandas
  • visualizing data in pandas using the matplotlib visualization library
  • manipulating and reshaping data in pandas
  • moving data from pandas into Excel

Note that this tutorial does not provide a deep dive into pandas. To explore pandas more, check out our course.

System prerequisites

We will use Python 3 and Jupyter Notebook to demonstrate the code in this tutorial.
In addition to Python and Jupyter Notebook, you will need the following Python modules:

  • matplotlib - data visualization
  • NumPy - numerical data functionality
  • OpenPyXL - read/write Excel 2010 xlsx/xlsm files
  • pandas - data import, clean-up, exploration, and analysis
  • xlrd - read Excel data
  • xlwt - write to Excel
  • XlsxWriter - write to Excel (xlsx) files

There are multiple ways to get set up with all the modules. We cover three of the most common scenarios below.

  • If you have Python installed via Anaconda package manager, you can install the required modules using the command conda install. For example, to install pandas, you would execute the command - conda install pandas.

  • If you already have a regular, non-Anaconda Python installed on the computer, you can install the required modules using pip. Open your command line program and execute command pip install <module name> to install a module. You should replace <module name> with the actual name of the module you are trying to install. For example, to install pandas, you would execute command - pip install pandas.

  • If you don't have Python already installed, you should get it through the Anaconda package manager. Anaconda provides installers for Windows, Mac, and Linux Computers. If you choose the full installer, you will get all the modules you need, along with Python and pandas within a single package. This is the easiest and fastest way to get started.

The data set

In this tutorial, we will use a multi-sheet Excel file we created from Kaggle's IMDB Scores data. You can download the file here.

Our Excel file has three sheets: '1900s,' '2000s,' and '2010s.' Each sheet has data for movies from those years.

We will use this data set to find the ratings distribution for the movies, visualize the movies with the highest ratings and net earnings, and calculate statistical information about the movies. We will be analyzing and exploring this data using pandas, thus demonstrating pandas' capabilities for working with Excel data.

Read data from the Excel file

We need to first import the data from the Excel file into pandas. To do that, we start by importing the pandas module.

import pandas as pd

We then use the pandas' read_excel method to read in data from the Excel file. The easiest way to call this method is to pass the file name. If no sheet name is specified then it will read the first sheet in the index (as shown below).

excel_file = 'movies.xls'
movies = pd.read_excel(excel_file)

Here, the read_excel method read the data from the Excel file into a pandas DataFrame object. Pandas defaults to storing data in DataFrames. We then stored this DataFrame into a variable called movies.

Pandas has a built-in DataFrame.head() method that we can use to easily display the first few rows of our DataFrame. If no argument is passed, it will display the first five rows. If a number is passed, it will display that number of rows from the top.

movies.head() Title Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings ... Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score 0 Intolerance: Love's Struggle Throughout the Ages 1916 Drama|History|War NaN USA Not Rated 123 1.33 385907.0 NaN ... 436 22 9.0 481 691 1 10718 88 69.0 8.0 1 Over the Hill to the Poorhouse 1920 Crime|Drama NaN USA NaN 110 1.33 100000.0 3000000.0 ... 2 2 0.0 4 0 1 5 1 1.0 4.8 2 The Big Parade 1925 Drama|Romance|War NaN USA Not Rated 151 1.33 245000.0 NaN ... 81 12 6.0 108 226 0 4849 45 48.0 8.3 3 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145 1.33 6000000.0 26435.0 ... 136 23 18.0 203 12000 1 111841 413 260.0 8.3 4 Pandora's Box 1929 Crime|Drama|Romance German Germany Not Rated 110 1.33 NaN 9950.0 ... 426 20 3.0 455 926 1 7431 84 71.0 8.0

5 rows × 25 columns

Excel files quite often have multiple sheets and the ability to read a specific sheet or all of them is very important. To make this easy, the pandas read_excel method takes an argument called sheetname that tells pandas which sheet to read in the data from. For this, you can either use the sheet name or the sheet number. Sheet numbers start with zero. If the sheetname argument is not given, it defaults to zero and pandas will import the first sheet.

By default, pandas will automatically assign a numeric index or row label starting with zero. You may want to leave the default index as such if your data doesn't have a column with unique values that can serve as a better index. In case there is a column that you feel would serve as a better index, you can override the default behavior by setting index_col property to a column. It takes a numeric value for setting a single column as index or a list of numeric values for creating a multi-index.

In the code below, we are choosing the first column, 'Title', as the index (index_col=0) by passing zero to the index_col argument.

movies_sheet1 = pd.read_excel(excel_file, sheetname=0, index_col=0) movies_sheet1.head() Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Director ... Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score Title Intolerance: Love's Struggle Throughout the Ages 1916 Drama|History|War NaN USA Not Rated 123 1.33 385907.0 NaN D.W. Griffith ... 436 22 9.0 481 691 1 10718 88 69.0 8.0 Over the Hill to the Poorhouse 1920 Crime|Drama NaN USA NaN 110 1.33 100000.0 3000000.0 Harry F. Millarde ... 2 2 0.0 4 0 1 5 1 1.0 4.8 The Big Parade 1925 Drama|Romance|War NaN USA Not Rated 151 1.33 245000.0 NaN King Vidor ... 81 12 6.0 108 226 0 4849 45 48.0 8.3 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145 1.33 6000000.0 26435.0 Fritz Lang ... 136 23 18.0 203 12000 1 111841 413 260.0 8.3 Pandora's Box 1929 Crime|Drama|Romance German Germany Not Rated 110 1.33 NaN 9950.0 Georg Wilhelm Pabst ... 426 20 3.0 455 926 1 7431 84 71.0 8.0

5 rows × 24 columns

As you noticed above, our Excel data file has three sheets. We already read the first sheet into a DataFrame above. Now, using the same syntax, we will read in the remaining two sheets too.

movies_sheet2 = pd.read_excel(excel_file, sheetname=1, index_col=0) movies_sheet2.head() Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Director ... Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score Title 102 Dalmatians 2000 Adventure|Comedy|Family English USA G 100.0 1.85 85000000.0 66941559.0 Kevin Lima ... 2000.0 795.0 439.0 4182 372 1 26413 77.0 84.0 4.8 28 Days 2000 Comedy|Drama English USA PG-13 103.0 1.37 43000000.0 37035515.0 Betty Thomas ... 12000.0 10000.0 664.0 23864 0 1 34597 194.0 116.0 6.0 3 Strikes 2000 Comedy English USA R 82.0 1.85 6000000.0 9821335.0 DJ Pooh ... 939.0 706.0 585.0 3354 118 1 1415 10.0 22.0 4.0 Aberdeen 2000 Drama English UK NaN 106.0 1.85 6500000.0 64148.0 Hans Petter Moland ... 844.0 2.0 0.0 846 260 0 2601 35.0 28.0 7.3 All the Pretty Horses 2000 Drama|Romance|Western English USA PG-13 220.0 2.35 57000000.0 15527125.0 Billy Bob Thornton ... 13000.0 861.0 820.0 15006 652 2 11388 183.0 85.0 5.8

5 rows × 24 columns

movies_sheet3 = pd.read_excel(excel_file, sheetname=2, index_col=0) movies_sheet3.head() Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Director ... Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score Title 127 Hours 2010.0 Adventure|Biography|Drama|Thriller English USA R 94.0 1.85 18000000.0 18329466.0 Danny Boyle ... 11000.0 642.0 223.0 11984 63000 0.0 279179 440.0 450.0 7.6 3 Backyards 2010.0 Drama English USA R 88.0 NaN 300000.0 NaN Eric Mendelsohn ... 795.0 659.0 301.0 1884 92 0.0 554 23.0 20.0 5.2 3 2010.0 Comedy|Drama|Romance German Germany Unrated 119.0 2.35 NaN 59774.0 Tom Tykwer ... 24.0 20.0 9.0 69 2000 0.0 4212 18.0 76.0 6.8 8: The Mormon Proposition 2010.0 Documentary English USA R 80.0 1.78 2500000.0 99851.0 Reed Cowan ... 191.0 12.0 5.0 210 0 0.0 1138 30.0 28.0 7.1 A Turtle's Tale: Sammy's Adventures 2010.0 Adventure|Animation|Family English France PG 88.0 2.35 NaN NaN Ben Stassen ... 783.0 749.0 602.0 3874 0 2.0 5385 22.0 56.0 6.1

5 rows × 24 columns

Since all three sheets have similar data but for different records/movies, we will create a single DataFrame from the three DataFrames we created above. We will use the pandas concat method for this, pass in the names of the three DataFrames we just created, and assign the result to a new DataFrame object, movies. By keeping the DataFrame name the same as before, we are overwriting the previously created DataFrame.

movies = pd.concat([movies_sheet1, movies_sheet2, movies_sheet3])

We can verify this concatenation by checking the number of rows in the combined DataFrame, calling the shape method on it to get the number of rows and columns.

movies.shape

(5042, 24)

Using the ExcelFile class to read multiple sheets

We can also use the ExcelFile class to work with multiple sheets from the same Excel file. We first wrap the Excel file using ExcelFile and then parse each sheet from it.

xlsx = pd.ExcelFile(excel_file)
movies_sheets = []
for sheet in xlsx.sheet_names:
    movies_sheets.append(xlsx.parse(sheet))
movies = pd.concat(movies_sheets)

If you are reading an Excel file with a lot of sheets and are creating a lot of DataFrames, ExcelFile is more convenient and efficient in comparison to read_excel. With ExcelFile, you only need to pass the Excel file once, and then you can use it to get the DataFrames. When using read_excel, you pass the Excel file every time and hence the file is loaded again for every sheet. This can be a huge performance drag if the Excel file has many sheets with a large number of rows.

Exploring the data

Now that we have read in the movies data set from our Excel file, we can start exploring it using pandas. A pandas DataFrame stores the data in a tabular format, just like the way Excel displays the data in a sheet. Pandas has a lot of built-in methods to explore the DataFrame we created from the Excel file we just read in.

We already introduced the head method in the previous section, which displays a few rows from the top of the DataFrame. Let's look at a few more methods that come in handy while exploring the data set.

We can use the shape method to find out the number of rows and columns for the DataFrame.

movies.shape

(5042, 25)

This tells us our Excel file has 5042 records (rows) and 25 columns. This can be useful for reporting the number of records and columns and comparing that with the source data set.

We can use the tail method to view the bottom rows. If no parameter is passed, only the bottom five rows are returned.

movies.tail() Title Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings ... Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score 1599 War & Peace NaN Drama|History|Romance|War English UK TV-14 NaN 16.00 NaN NaN ... 1000.0 888.0 502.0 4528 11000 1.0 9277 44.0 10.0 8.2 1600 Wings NaN Comedy|Drama English USA NaN 30.0 1.33 NaN NaN ... 685.0 511.0 424.0 1884 1000 5.0 7646 56.0 19.0 7.3 1601 Wolf Creek NaN Drama|Horror|Thriller English Australia NaN NaN 2.00 NaN NaN ... 511.0 457.0 206.0 1617 954 0.0 726 6.0 2.0 7.1 1602 Wuthering Heights NaN Drama|Romance English UK NaN 142.0 NaN NaN NaN ... 27000.0 698.0 427.0 29196 0 2.0 6053 33.0 9.0 7.7 1603 Yu-Gi-Oh! Duel Monsters NaN Action|Adventure|Animation|Family|Fantasy Japanese Japan NaN 24.0 NaN NaN NaN ... 0.0 NaN NaN 0 124 0.0 12417 51.0 6.0 7.0

5 rows × 25 columns

In Excel, you're able to sort a sheet based on the values in one or more columns. In pandas, you can do the same thing with the sort_values method. For example, let's sort our movies DataFrame based on the Gross Earnings column.

sorted_by_gross = movies.sort_values(['Gross Earnings'], ascending=False)

Since we have the data sorted by values in a column, we can do a few interesting things with it. For example, we can display the top 10 movies by Gross Earnings.

sorted_by_gross["Gross Earnings"].head(10) 1867 760505847.0 1027 658672302.0 1263 652177271.0 610 623279547.0 611 623279547.0 1774 533316061.0 1281 474544677.0 226 460935665.0 1183 458991599.0 618 448130642.0 Name: Gross Earnings, dtype: float64

We can also create a plot for the top 10 movies by Gross Earnings. Pandas makes it easy to visualize your data with plots and charts through matplotlib, a popular data visualization library. With a couple of lines of code, you can start plotting. Moreover, matplotlib plots work well inside Jupyter Notebooks since you can display the plots right under the code.

First, we import the matplotlib module and set matplotlib to display the plots right in the Jupyter Notebook.

import matplotlib.pyplot as plt
%matplotlib inline

We will draw a bar plot where each bar will represent one of the top 10 movies. We can do this by calling the plot method and setting the argument kind to barh. This tells matplotlib to draw a horizontal bar plot.

sorted_by_gross['Gross Earnings'].head(10).plot(kind="barh")
plt.show()

Let's create a histogram of IMDB Scores to check the distribution of IMDB Scores across all movies. Histograms are a good way to visualize the distribution of a data set. We use the plot method on the IMDB Scores series from our movies DataFrame and pass it the argument kind="hist".

movies['IMDB Score'].plot(kind="hist")
plt.show()

This data visualization suggests that most of the IMDB Scores fall between six and eight.

Getting statistical information about the data

Pandas has some very handy methods to look at the statistical data about our data set. For example, we can use the describe method to get a statistical summary of the data set.

movies.describe() Year Duration Aspect Ratio Budget Gross Earnings Facebook Likes - Director Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score count 4935.000000 5028.000000 4714.000000 4.551000e+03 4.159000e+03 4938.000000 5035.000000 5029.000000 5020.000000 5042.000000 5042.000000 5029.000000 5.042000e+03 5022.000000 4993.000000 5042.000000 mean 2002.470517 107.201074 2.220403 3.975262e+07 4.846841e+07 686.621709 6561.323932 1652.080533 645.009761 9700.959143 7527.457160 1.371446 8.368475e+04 272.770808 140.194272 6.442007 std 12.474599 25.197441 1.385113 2.061149e+08 6.845299e+07 2813.602405 15021.977635 4042.774685 1665.041728 18165.101925 19322.070537 2.013683 1.384940e+05 377.982886 121.601675 1.125189 min 1916.000000 7.000000 1.180000 2.180000e+02 1.620000e+02 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000e+00 1.000000 1.000000 1.600000 25% 1999.000000 93.000000 1.850000 6.000000e+06 5.340988e+06 7.000000 614.500000 281.000000 133.000000 1411.250000 0.000000 0.000000 8.599250e+03 65.000000 50.000000 5.800000 50% 2005.000000 103.000000 2.350000 2.000000e+07 2.551750e+07 49.000000 988.000000 595.000000 371.500000 3091.000000 166.000000 1.000000 3.437100e+04 156.000000 110.000000 6.600000 75% 2011.000000 118.000000 2.350000 4.500000e+07 6.230944e+07 194.750000 11000.000000 918.000000 636.000000 13758.750000 3000.000000 2.000000 9.634700e+04 326.000000 195.000000 7.200000 max 2016.000000 511.000000 16.000000 1.221550e+10 7.605058e+08 23000.000000 640000.000000 137000.000000 23000.000000 656730.000000 349000.000000 43.000000 1.689764e+06 5060.000000 813.000000 9.500000

The describe method displays the following information for each of the columns.

  • the count or number of values
  • mean
  • standard deviation
  • minimum, maximum
  • 25%, 50%, and 75% quantile

Please note that this information will be calculated only for the numeric values.

We can also use the corresponding methods to access these statistics one at a time. For example, to get the mean of a particular column, you can use the mean method on that column.

movies["Gross Earnings"].mean() 48468407.526809327

Just like mean, there are methods available for each of the statistical information we want to access. You can read about these methods in our free pandas cheat sheet.
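For instance, a few of the other per-statistic methods applied to the same column (standard pandas methods, shown here just as a pointer):

movies["Gross Earnings"].median()
movies["Gross Earnings"].std()
movies["Gross Earnings"].min()
movies["Gross Earnings"].max()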

Reading files with no header and skipping records

Earlier in this tutorial, we saw some ways to read a particular kind of Excel file that had headers and no rows that needed skipping. Sometimes, the Excel sheet doesn't have any header row. For such instances, you can tell pandas not to treat the first row as a header or column names. And if the Excel sheet's first few rows contain data that should not be read in, you can ask the read_excel method to skip a certain number of rows, starting from the top.

For example, look at the top few rows of this Excel file.

This file obviously has no header, and the first four rows are not actual records and hence should not be read in. We can tell read_excel there is no header by setting the argument header to None, and we can skip the first four rows by setting the argument skiprows to four.

movies_skip_rows = pd.read_excel("movies-no-header-skip-rows.xls", header=None, skiprows=4) movies_skip_rows.head(5) 0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24 0 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145 1.33 6000000.0 26435.0 ... 136 23 18.0 203 12000 1 111841 413 260.0 8.3 1 Pandora's Box 1929 Crime|Drama|Romance German Germany Not Rated 110 1.33 NaN 9950.0 ... 426 20 3.0 455 926 1 7431 84 71.0 8.0 2 The Broadway Melody 1929 Musical|Romance English USA Passed 100 1.37 379000.0 2808000.0 ... 77 28 4.0 109 167 8 4546 71 36.0 6.3 3 Hell's Angels 1930 Drama|War English USA Passed 96 1.20 3950000.0 NaN ... 431 12 4.0 457 279 1 3753 53 35.0 7.8 4 A Farewell to Arms 1932 Drama|Romance|War English USA Unrated 79 1.37 800000.0 NaN ... 998 164 99.0 1284 213 1 3519 46 42.0 6.6

5 rows × 25 columns

We skipped four rows from the sheet and used none of the rows as the header. Also, notice that one can combine different options in a single read statement. To skip rows at the bottom of the sheet, you can use option skip_footer, which works just like skiprows, the only difference being the rows are counted from the bottom upwards.
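For instance, if the sheet ended with a few summary rows that should not be imported (three is just an illustrative number here), the call might look like:

movies_no_footer = pd.read_excel("movies-no-header-skip-rows.xls", header=None,
                                 skiprows=4, skip_footer=3)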

The column names in the previous DataFrame are numeric and were allotted by default by pandas. We can rename the columns to descriptive names by setting the columns attribute on the DataFrame and passing the column names as a list.

movies_skip_rows.columns = ['Title', 'Year', 'Genres', 'Language', 'Country', 'Content Rating', 'Duration', 'Aspect Ratio', 'Budget', 'Gross Earnings', 'Director', 'Actor 1', 'Actor 2', 'Actor 3', 'Facebook Likes - Director', 'Facebook Likes - Actor 1', 'Facebook Likes - Actor 2', 'Facebook Likes - Actor 3', 'Facebook Likes - cast Total', 'Facebook likes - Movie', 'Facenumber in posters', 'User Votes', 'Reviews by Users', 'Reviews by Crtiics', 'IMDB Score'] movies_skip_rows.head() Title Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings ... Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score 0 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145 1.33 6000000.0 26435.0 ... 136 23 18.0 203 12000 1 111841 413 260.0 8.3 1 Pandora's Box 1929 Crime|Drama|Romance German Germany Not Rated 110 1.33 NaN 9950.0 ... 426 20 3.0 455 926 1 7431 84 71.0 8.0 2 The Broadway Melody 1929 Musical|Romance English USA Passed 100 1.37 379000.0 2808000.0 ... 77 28 4.0 109 167 8 4546 71 36.0 6.3 3 Hell's Angels 1930 Drama|War English USA Passed 96 1.20 3950000.0 NaN ... 431 12 4.0 457 279 1 3753 53 35.0 7.8 4 A Farewell to Arms 1932 Drama|Romance|War English USA Unrated 79 1.37 800000.0 NaN ... 998 164 99.0 1284 213 1 3519 46 42.0 6.6

5 rows × 25 columns

Now that we have seen how to read a subset of rows from the Excel file, we can learn how to read a subset of columns.

Reading a subset of columns

Although read_excel defaults to reading and importing all columns, you can choose to import only certain ones. By passing parse_cols=6, we are telling read_excel to read only the columns up to index six, that is, the first seven columns (the first column being index zero).

movies_subset_columns = pd.read_excel(excel_file, parse_cols=6)
movies_subset_columns.head()

                                              Title  Year               Genres Language  Country Content Rating  Duration
0  Intolerance: Love's Struggle Throughout the Ages  1916    Drama|History|War      NaN      USA      Not Rated       123
1                    Over the Hill to the Poorhouse  1920          Crime|Drama      NaN      USA            NaN       110
2                                    The Big Parade  1925    Drama|Romance|War      NaN      USA      Not Rated       151
3                                        Metropolis  1927         Drama|Sci-Fi   German  Germany      Not Rated       145
4                                     Pandora's Box  1929  Crime|Drama|Romance   German  Germany      Not Rated       110

Alternatively, you can pass in a list of numbers, which will let you import columns at particular indexes.
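
A minimal sketch of that, using the same excel_file as above; the particular column indexes here are an assumption chosen just for illustration (newer pandas releases use the usecols parameter instead of parse_cols):

movies_some_columns = pd.read_excel(excel_file, parse_cols=[0, 1, 8, 9])  # Title, Year, Budget, Gross Earnings
movies_some_columns.head()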

Applying formulas on the columns

One of the much-used features of Excel is to apply formulas to create new columns from existing column values. In our Excel file, we have Gross Earnings and Budget columns. We can get Net earnings by subtracting Budget from Gross earnings. We could then apply this formula in the Excel file to all the rows. We can do this in pandas also as shown below.

movies["Net Earnings"] = movies["Gross Earnings"] - movies["Budget"]

Above, we used pandas to create a new column called Net Earnings and populated it with the difference of Gross Earnings and Budget. It's worth noting the difference in how formulas are treated in Excel versus pandas. In Excel, a formula lives in the cell and updates whenever the data changes; with Python, the calculation happens once and only the resulting values are stored, so if Gross Earnings for one movie were later changed, Net Earnings would not be updated.

Let's use the sort_values method to sort the data by the new column we created and visualize the top 10 movies by Net Earnings.

sorted_movies = movies[['Net Earnings']].sort_values(['Net Earnings'], ascending=[False])
sorted_movies.head(10)['Net Earnings'].plot.barh()
plt.show()

Pivot Table in pandas

Advanced Excel users also often use pivot tables. A pivot table summarizes the data of another table by grouping the data on an index and applying operations such as sorting, summing, or averaging. You can use this feature in pandas too.

We need to first identify the column or columns that will serve as the index, and the column(s) on which the summarizing formula will be applied. Let's start small, by choosing Year as the index column and Gross Earnings as the summarization column and creating a separate DataFrame from this data.

movies_subset = movies[['Year', 'Gross Earnings']]
movies_subset.head()

     Year  Gross Earnings
0  1916.0             NaN
1  1920.0       3000000.0
2  1925.0             NaN
3  1927.0         26435.0
4  1929.0          9950.0

We now call pivot_table on this subset of data. The method pivot_table takes a parameter index. As mentioned, we want to use Year as the index.

earnings_by_year = movies_subset.pivot_table(index=['Year'])
earnings_by_year.head()

        Gross Earnings
Year
1916.0             NaN
1920.0       3000000.0
1925.0             NaN
1927.0         26435.0
1929.0       1408975.0

This gave us a pivot table grouped on Year and summarizing Gross Earnings with pivot_table's default aggregation, the mean (notice that the 1929 value above is the average of the two 1929 movies, not their sum). Also notice that we didn't need to specify the Gross Earnings column explicitly; pandas automatically identified it as the values on which the summarization should be applied. If you want a different aggregation, you can ask for it explicitly, as sketched below.
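
A minimal sketch of choosing the aggregation yourself, here summing instead of averaging; the explicit values and aggfunc arguments are the only additions beyond the DataFrame we already built:

total_earnings_by_year = movies_subset.pivot_table(index=['Year'], values='Gross Earnings', aggfunc='sum')
total_earnings_by_year.head()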

We can use this pivot table to create some data visualizations. We can call the plot method on the DataFrame to create a line plot and call the show method to display the plot in the notebook.

earnings_by_year.plot()
plt.show()

We saw how to pivot with a single column as the index. Things get more interesting when we use multiple columns. Let's create another DataFrame subset, but this time we will choose the columns Country, Language and Gross Earnings.

movies_subset = movies[['Country', 'Language', 'Gross Earnings']]
movies_subset.head()

   Country Language  Gross Earnings
0      USA      NaN             NaN
1      USA      NaN       3000000.0
2      USA      NaN             NaN
3  Germany   German         26435.0
4  Germany   German          9950.0

We will use the Country and Language columns as the index for the pivot table, and Gross Earnings as the summarization column; as we saw earlier, we do not need to specify it explicitly.

earnings_by_co_lang = movies_subset.pivot_table(index=['Country', 'Language'])
earnings_by_co_lang.head()

                         Gross Earnings
Country      Language
Afghanistan  Dari          1.127331e+06
Argentina    Spanish       7.230936e+06
Aruba        English       1.007614e+07
Australia    Aboriginal    6.165429e+06
             Dzongkha      5.052950e+05

Let's visualize this pivot table with a bar plot. Since there are still a few hundred records in this pivot table, we will plot just a few of them.

earnings_by_co_lang.head(20).plot(kind='bar', figsize=(20, 8))
plt.show()

Exporting the results to Excel

If you're going to be working with colleagues who use Excel, saving Excel files out of pandas is important. You can export or write a pandas DataFrame to an Excel file using the to_excel method. Internally, pandas delegates the writing to an Excel engine such as xlwt (for legacy .xls files) or openpyxl/XlsxWriter (for .xlsx files). The to_excel method is called on the DataFrame we want to export. We also need to pass a filename to which this DataFrame will be written.

movies.to_excel('output.xlsx')

By default, the index is also saved to the output file. However, sometimes the index doesn't provide any useful information. For example, the movies DataFrame has a numeric auto-increment index that was not part of the original Excel data.

movies.head()

                                              Title    Year               Genres Language  Country Content Rating  Duration  Aspect Ratio     Budget  Gross Earnings  ...  User Votes  Reviews by Users  Reviews by Crtiics  IMDB Score  Net Earnings
0  Intolerance: Love's Struggle Throughout the Ages  1916.0    Drama|History|War      NaN      USA      Not Rated     123.0          1.33   385907.0             NaN  ...       10718              88.0                69.0         8.0           NaN
1                    Over the Hill to the Poorhouse  1920.0          Crime|Drama      NaN      USA            NaN     110.0          1.33   100000.0       3000000.0  ...           5               1.0                 1.0         4.8     2900000.0
2                                    The Big Parade  1925.0    Drama|Romance|War      NaN      USA      Not Rated     151.0          1.33   245000.0             NaN  ...        4849              45.0                48.0         8.3           NaN
3                                        Metropolis  1927.0         Drama|Sci-Fi   German  Germany      Not Rated     145.0          1.33  6000000.0         26435.0  ...      111841             413.0               260.0         8.3    -5973565.0
4                                     Pandora's Box  1929.0  Crime|Drama|Romance   German  Germany      Not Rated     110.0          1.33        NaN          9950.0  ...        7431              84.0                71.0         8.0           NaN

5 rows × 26 columns

You can choose to skip the index by passing index=False.

movies.to_excel('output.xlsx', index=False)

We also want to be able to make our output files look nice before sending them out to our co-workers. We can use the pandas ExcelWriter class along with the XlsxWriter Python module to apply formatting.

We can use these advanced output options by creating an ExcelWriter object and using it to write to the Excel file.

writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')
movies.to_excel(writer, index=False, sheet_name='report')
workbook = writer.book
worksheet = writer.sheets['report']

We can apply customizations by calling add_format on the workbook we are writing to. Here we are setting the header row's format to bold.

header_fmt = workbook.add_format({'bold': True})
worksheet.set_row(0, None, header_fmt)

Finally, we save the output file by calling the method save on the writer object.

writer.save()

As an example, we saved the data with the column headers set in bold, and the saved file looks like the image below.

Like this, one can use XlsxWriter to apply various kinds of formatting to the output Excel file; another small sketch follows below.
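
For instance, here is a minimal, hedged sketch of formatting a pair of money columns; the column letters, width and number-format string are assumptions chosen only to illustrate the API, and these calls would go before writer.save():

money_fmt = workbook.add_format({'num_format': '$#,##0'})  # currency-style thousands separator, no decimals
worksheet.set_column('I:J', 18, money_fmt)                 # widen and format two numeric columns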

Conclusion

Pandas is not a replacement for Excel. Both tools have their place in the data analysis workflow and can be great companions. As we demonstrated, pandas can do a lot of complex data analysis and manipulation, which, depending on your needs and expertise, can go beyond what you can achieve with Excel alone. One of the major benefits of using Python and pandas over Excel is that they help you automate Excel file processing by writing scripts and integrating with your automated data workflow. Pandas also has excellent methods for reading all kinds of data from Excel files. You can export your results from pandas back to Excel too, if that's what your intended audience prefers.

On the other hand, Excel is such a widely used data tool that it's not wise to ignore it. Acquiring expertise in both pandas and Excel, and making them work together, gives you skills that can help you stand out in your organization.

Categories: FLOSS Project Planets

PythonClub - A Brazilian collaborative blog about Python: Functional programming with Python #1 - Functions

Planet Python - Fri, 2017-12-08 10:30
1. Functions

Since not everything is a bed of roses, let's start from the beginning and get to know some characteristics of Python functions (the function object), and take a quick review of a few function concepts so we don't get lost in the basics later. This first topic will therefore limit itself to the basic structure of functions in Python, without going deeply into each point. It will be a walkthrough of code, opening your mind to new opportunities for more Pythonic code that preferably produces fewer side effects. But relax, we are not going to teach you how to write functions; you've had plenty of that already.

1.1 Functions as first-class objects

Functions as first-class objects are functions that behave like any native type of a given language. For example:

# a list
lista = [1, 'str', [1, 2], (1, 2), {1, 2}, {1: 'um'}]

All of these examples are first-class object types in Python, and functions are too. How so? You can pass functions as a parameter to another function, you can store functions in variables, and you can define functions inside data structures:

# Functions as first-class objects
func = lambda x: x  # the anonymous (lambda) function was stored in a variable

def func_2(x):
    return x + 2

lista = [func, func_2]  # the variable holding the function was placed in a structure, as was a function created with def
lista_2 = [lambda x: x, lambda x: x + 1]  # here the functions were defined inside a structure

As you can see, in Python functions can be placed in any context and also created at runtime. That means that, besides putting functions into structures, we can return functions, pass functions as parameters (HOFs), define functions inside functions (closures), and so on. If you learned to program in a language where functions are not first-class objects, don't be alarmed; this is part of Python's everyday routine. Preferably, and almost obligatorily, it is better to write simple, small, low-complexity functions so that they are not affected by the outside world, require less maintenance and, best of all, can be combined into other functions (a quick closure sketch follows below before we move on). So let's go!
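
A minimal, hedged sketch of the "function defined inside a function and returned" case mentioned above (a closure); the names are invented for illustration:

def somador(n):
    # returns a new function that remembers n
    def soma(x):
        return x + n
    return soma

soma_dois = somador(2)
soma_dois(3)  # 5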

1.2 Pure functions

Pure functions are functions that are not affected by the external environment. Let's start with the bad example:

valor = 5

def mais_cinco(x):
    return x + valor

assert mais_cinco(5) == 10  # True
valor = 7
assert mais_cinco(5) == 10  # AssertionError

mais_cinco() is a clear example of a function that is subject to side effects. A pure function should work like a black box: every time it is given the same input, it has to return the same value. Now let's use the same example, changing only the return line:

valor = 5

def mais_cinco(x):
    return x + 5

assert mais_cinco(5) == 10  # True
valor = 7
assert mais_cinco(5) == 10  # True

It may seem trivial, but very often, out of convenience, we let the environment influence a function's behaviour. By definition, Python only allows reading external variables (we will talk about this in another topic). That is, inside the function's context external variables cannot be modified, but that doesn't prevent the external context from modifying them. If you are a smart person like Jaber, you should know that it is never a good idea to rely on external values. But, if it is really necessary, you can overwrite the value of a variable in the global context using the reserved word global, which looks something like this:

valor = 5

def teste():
    global valor  # the global declaration is made here
    valor = 7
    print(valor)  # 7

Just remember to always be sensible when doing this; the consequences can be unpredictable. Along this line of pure, tiny functions we can characterize, although this does not define them, higher-order functions, which are functions that receive a function as an argument, or return one, and call it within the context of the function that received it as a parameter. This results in a composition of functions, which adds much more value when the functions do not produce side effects.

1.3 Higher-order functions (HOFs)

Higher-order functions are functions that receive functions as argument(s) and/or return functions as their result. Many higher-order functions are built into Python, such as map, filter, zip and practically the whole functools module (import functools). However, nothing stops us from creating new higher-order functions. One point to remember is that map and filter lost much of their importance in Python with the arrival of comprehensions (although I adore them), which makes the choice purely a matter of taste; even though comprehensions are more readable (we will talk about this in another context), there are many cases where map and filter still make sense. But without going on too long, let's get to the code:

func = lambda x: x + 2  # a simple function: adds 2 to any integer

def func_mais_2(funcao, valor):
    """
    Runs the function passed as a parameter and returns its result plus two.
    In other words, it is a composition of functions:
    given that func(valor) is processed by func_mais_2:
    func_mais_2(func(valor)) == f(g(x))
    """
    return funcao(valor) + 2

One point worth touching on, and the one I find most beautiful, is that the function will return different answers for the same value, depending on which function is passed in. In this case, given an integer as input, it will be added to 2 and then to two more. But let's extend this example:

func = lambda x: x + 2  # a simple function: adds 2 to any integer

def func_mais_2(funcao, valor):
    """Function we used before."""
    return funcao(valor) + 2

assert func_mais_2(func, 2) == 6  # True

def func_quadrada(val):
    """Squares the input value."""
    return val * val

assert func_mais_2(func_quadrada, 2) == 6  # True

1.3.1 An example using built-in functions:

Many of Python's built-in functions are higher-order functions (HOFs), like map, which is one of my favourites. map receives a function that takes a single argument and gives us back a new list with that function applied to each element of the original list:

def func(arg):
    return arg + 2

lista = [2, 1, 0]
list(map(func, lista)) == [4, 3, 2]  # True

But don't worry, we will talk much more about this.

1.4 __call__

Why talk about classes? Remember, Python is a language built on classes, and every object that can be called/invoked implements the __call__ method:

In an anonymous function:

func = lambda x: x
'__call__' in dir(func)  # True

In regular functions:

def func(x):
    return x

'__call__' in dir(func)  # True

Does this mean we can create classes that behave like functions?

YESSSS. Take that, Haskell.

This is an interesting part of how Python is built, which we will look at more another time when we cover function introspection. But it is worth saying that classes, named functions, anonymous functions and generator functions all rely on a common base to work. This is one of the most beautiful things about Python, and at a certain point it hurts the language's orthogonality, since similar things behave differently, but it makes the language easier to learn. It is not our focus right now, though a tiny callable-class sketch follows below.
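
Still, as a quick hedged illustration of the point above, here is a minimal sketch of a class whose instances can be called like functions (the class and attribute names are invented for illustration):

class MaisN:
    # callable object: adds n to whatever it is called with
    def __init__(self, n):
        self.n = n

    def __call__(self, x):
        return x + self.n

mais_dois = MaisN(2)
mais_dois(3)                   # 5
'__call__' in dir(mais_dois)   # True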

1.5 Generator functions

Although we will have a topic focused entirely on generator functions, it doesn't hurt to give a little preview, right?

Generator functions are functions that return an iterable to us, but a lazy one (it is only computed when invoked). As for usage examples, many concepts need to be clarified before we deeply understand what happens with them, but I'll say it right away: they are beautiful functions <3

For a function to be a generator, in theory, we only need to swap return for yield:

def gen(lista):
    for elemento in lista:
        yield elemento

gerador = gen([1, 2, 3, 4, 5])
next(gerador)  # 1
next(gerador)  # 2
next(gerador)  # 3
next(gerador)  # 4
next(gerador)  # 5
next(gerador)  # StopIteration

Glossing over the details, a generator function returns a lazy iterable to us. That is, it only performs the computation when it is asked to; the little sketch below makes that laziness visible.
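
A minimal sketch to make the laziness visible; the print calls are just illustrative instrumentation, not part of the original example:

def gen_verboso(lista):
    for elemento in lista:
        print('computing', elemento)   # runs only when a value is requested
        yield elemento

gerador = gen_verboso([1, 2, 3])
# nothing printed yet: the body has not run
next(gerador)  # prints 'computing 1', returns 1
next(gerador)  # prints 'computing 2', returns 2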

1.6 Anonymous functions (lambda)

Anonymous functions, or lambda functions, are functions that can be declared in any context. OK... every kind of function in Python can be declared at runtime. But anonymous functions can be assigned to variables, can be defined inside sequences and can be declared as a function argument. Let's look at their syntax:

lambda argumento: argumento

The reserved word lambda defines the function, just like a def. But with a def we almost instinctively always break the line:

def func():
    pass

One of the trivial differences in Python is that anonymous functions have no name. OK, that was kind of obvious, but let's check:

def func():
    pass

func.__name__  # func

lambda_func = lambda arg: arg
lambda_func.__name__  # '<lambda>'

The result '<lambda>' will be the same for any lambda function, which makes debugging them practically impossible in Python. That's why Python users (and here I include all users, even those who like functional programming) do not encourage the use of lambda functions in every context of the language. But for functions that accept other functions it is something of a tradition, provided the function (the one that runs the code used by the lambda) is not already defined and is not reused in another context. I like to say that lambdas are very handy for partial application of functions. However, lambdas are nothing more than syntactic sugar in Python, since there is nothing a standard function (defined with def) cannot do just as well. Even introspection returns the same result:

def func():
    pass

type(func)  # function

lambda_func = lambda arg: arg
type(lambda_func)  # function

One thing worth remembering is that anonymous functions in Python execute only a single expression. That is, we cannot use loops (while, for) or exception handling (try, except, finally). A plain if with an elif cannot be defined either. Since syntactically only expressions are accepted, the only possible use of an if is the ternary one:

valor_1 if condicao else valor_2

Which, inside a lambda, would look like this:

func = lambda argumento: argumento + 2 if argumento > 0 else argumento - 2

Lambda functions can also take multiple arguments, although their processing can still only happen in a single expression:

func = lambda arg_1, arg_2, arg_3: True if sum([arg_1, arg_2, arg_3]) > 7 else min([arg_1, arg_2, arg_3])

Although this is only an initial explanation of anonymous functions, most of the upcoming topics make use of them, and we will be able to explore their endless possibilities better.

But that's all for today; in the next topic we will discuss, even if superficially, iterators and iterables and their relationship with functional programming.

Categories: FLOSS Project Planets

Colan Schwartz: Ægir Turns 10!

Planet Drupal - Fri, 2017-12-08 10:09
This blog post has been re-published from Aegir's blog and edited with permission from its author, Christopher Gervais.

My tenure with the Ægir Project only dates back about 7 or 8 years. I can’t speak first-hand about its inception and those early days. So, I’ll leave that to some of the previous core team members, many of whom are publishing blog posts of their own.

New look for www.aegirproject.org

As part of the run-up to Ægir’s 10-year anniversary, we built a new site for the project, which we released today. It will hopefully do more justice to Ægir’s capabilities. In addition to the re-vamped design, we added a blog section, to make it easier for the core team to communicate with the community. So keep an eye on this space for more news in the coming weeks.

Ægir has come a long way in 10 years

When I first tried to use Ægir (way back at version 0.3), I couldn’t even get all the way through the installation. Luckily, I happened to live in Montréal, not too far from Koumbit, which, at the time, was a hub of Ægir development. I dropped in and introduced myself to Antoine Beaupré; then one of Ægir’s lead developers.

One of his first questions was how far into the installation process I'd reached. As it turns out, he frequently asked this question when approached about Ægir. It helped him gauge how serious people were about running it. Back then, you pretty much needed to be a sysadmin to effectively operate it.

A few months later, Antoine had wrapped the installation scripts in Debian packaging, making installation a breeze. By that point I was hooked.

Free Software is a core value

Fast forward a couple years to DrupalCon Chicago. This was a time of upheaval in the Drupal community, as Drupal 7.0 was on the cusp of release, and Development Seed had announced their intention to leave the Drupal community altogether. This had far-reaching consequences for Ægir, since Development Seed had been the primary company sponsoring development, and employing project founder and lead developer Adrian Rossouw.

While in Chicago I met with Eric Gundersen, CEO of DevSeed, to talk about the future of Ægir. Whereas DevSeed had sold their flagship Drupal product, OpenAtrium, to another company, Eric was very clear that they wanted to pass on stewardship of Ægir to Koumbit, due in large part to our dedication to Free Software, and deep systems-level knowledge.

Community contributions are key

Since then Ægir has grown a lot. Here is one of the more interesting insights from OpenHub’s Ægir page:

Very large, active development team Over the past twelve months, 35 developers contributed new code to Aegir Hosting System. This is one of the largest open-source teams in the world, and is in the top 2% of all project teams on Open Hub.

To help visualize all these contributions, we produced a video using Gource. It represents the efforts of no less than 136 developers over the past 10 years. It spans the various components of Aegir Core, along with “Golden” contrib and other major sub-systems.

Of course, many other community members have contributed in other ways over the years. These include (but aren't limited to) filing bug reports and testing patches, improving our documentation, answering other users' questions on IRC, and giving presentations at local meetups.

The Future

The core team has been discussing options for re-architecting Ægir to modernize the codebase and address some structural issues. In the past few months, this activity has heated up. In order to simultaneously ensure ongoing maintenance of our stable product, and to simplify innovation of future ones, the core team decided to divide responsibilities across 3 branch maintainers.

Herman van Rink (helmo) has taken up maintenance of our stable 3.x branch. He’s handled the majority of release engineering for the project for the past couple years, so we can be very confident in the ongoing stability and quality of Ægir 3.

Jon Pugh, on the other hand, has adopted the 4.x branch, with the primary goal of de-coupling Ægir from Drush. This is driven largely by upstream decisions (in both Drupal and Drush) that would make continuing with our current approach increasingly difficult. He has made significant progress on porting Provision (Ægir’s back-end) to Symfony. Keep an eye out for further news on that front.

For my part, I’m pursuing a more radical departure from our current architecture, re-writing Ægir from scratch atop Drupal 8 and Ansible with a full-featured (Celery/RabbitMQ) task queue in between. This promises to make Ægir significantly more flexible, which is being borne out in recent progress on the new system. While Ægir 5 will be a completely new code-base, the most of the workflows, security model and default interface will be familiar. Once it has proven itself, we can start pursuing other exciting options, like Kubernetes and OpenStack support.

So, with 10 years behind us, the future certainly looks Bryght*.




* Bryght was the company where Adrian Rossouw began work on “hostmaster” that would eventually become the Ægir Hosting System.

Categories: FLOSS Project Planets

Peter Bengtsson: Really simple Django view function timer decorator

Planet Python - Fri, 2017-12-08 10:07

I use this sometimes to get insight into how long some view functions take. Perhaps you find it useful too:

import functools
import time


def view_function_timer(prefix='', writeto=print):

    def decorator(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            try:
                t0 = time.time()
                return func(*args, **kwargs)
            finally:
                t1 = time.time()
                writeto(
                    'View Function',
                    '({})'.format(prefix) if prefix else '',
                    func.__name__,
                    args[1:],
                    'Took',
                    '{:.2f}ms'.format(1000 * (t1 - t0)),
                    args[0].build_absolute_uri(),
                )
        return inner

    return decorator

And to use it:

from wherever import view_function_timer


@view_function_timer()
def homepage(request, thing):
    ...
    return render(request, template, context)

And then it prints something like this:

View Function homepage ('valueofthing',) Took 23.22ms http://localhost:8000/home/valueofthing

It's useful when you don't want a full-blown solution to measure all view functions with a middleware or something.
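
Since writeto is just a callable (it defaults to print), you can also point the timings at a logger instead of stdout. A minimal sketch, with the helper and logger names invented for illustration:

import logging

from wherever import view_function_timer  # the decorator defined above

logger = logging.getLogger('timings')

def log_timing(*bits):
    # the decorator passes several pieces print-style; join them into one log record
    logger.info(' '.join(str(bit) for bit in bits))

@view_function_timer(writeto=log_timing)
def homepage(request, thing):
    ...
    return render(request, template, context)
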
It can be useful also to see how a cache decorator might work:

from django.views.decorators.cache import cache_page
from wherever import view_function_timer


@view_function_timer('possibly cached')
@cache_page(60 * 60 * 2)  # two hours cache
@view_function_timer('not cached')
def homepage(request, thing):
    ...
    return render(request, template, context)

That way you can trace it, with tail -f or something, to see how (or whether) the caching decorator works.

There are better solutions that are more robust but might be a bigger investment. For example, I would recommend markus, which, if you don't have a statsd server, you can configure to log the timings with logger.info calls.

Categories: FLOSS Project Planets

Amazee Labs: Amazee Agile Agency Survey Results - Part 5

Planet Drupal - Fri, 2017-12-08 09:51
Amazee Agile Agency Survey Results - Part 5

Welcome to part five of our series, processing the results of the Amazee Agile Agency Survey. Previously I wrote about forming discovery and planning. This time let’s focus on team communication and process.

Josef Dabernig Fri, 12/08/2017 - 15:51 Team Communication

When it comes to ways of communicating, the ones that got the highest rating of "mostly practised" were "Written communication in tickets", "Written communication via (i.e. Slack)" as well as "Group meetings for the entire team". The options that most often got selected as "Not practised" were "Written communication in blog or wiki" and "Written communication in pull requests".

For us at Amazee Labs Zurich, a variety of communication channels is essential. Regular 1-on-1 meetings between managers and their employees allow us to continuously talk about what’s important to either side and work on improvements. We communicate a lot via Slack where we have various team channels, channels together with clients related to projects, channels for work-related topics or just channels to talk about fun stuff. Each morning, we start with a short team stand-up for the entire company where we check in with each other, and that’s followed by a more in-depth standup for the Scrum teams where we talk about “What has been done, What will be done and What’s blocking us”. Written communication happens between the team and customers in Jira tickets. As part of our 4-eyes-principle peer review process, we also give feedback on code within pull requests that are used to ensure the quality of the code and train each other.

Process

We talked about iteration length in part 1 of this series. Now let’s look into how much time we spend on which things.

According to the survey, the majority of standups take 15 minutes, followed by 5 minutes and 10 minutes, with a few taking up to 30 minutes.

This also reflects our own practice: we take 10 minutes for the company-wide stand-up amongst 24 team members and another 15 minutes for the Scrum-team-specific stand-ups.

For the review phase, teams equally often selected 2 hours and 1 hour as the top-rated options, followed closely by 30 minutes; 4 hours was chosen by a few other teams, and the last option was one day. For the retrospectives, the top-rated option was 30 minutes, followed by 1 hour; far fewer teams take 2 hours or even up to 4 hours. For planning, we saw the most significant gap among the top-rated options: 30 minutes was followed by 4 hours, and then 2 hours and 1 hour were selected.

In the teams I work with, we usually spend half a day doing sprint review, retrospective and planning altogether. Our reviews typically take 45 minutes, the retrospective about 1.5 hours and the planning another 30 minutes. We currently don't hold these meetings together with customers, because the Scrum teams are stable teams that usually work for multiple customers; instead, we do demos with each client individually outside of these meetings. Also, our plannings are quite fast because the team has already split up stories as part of grooming sessions beforehand, and we only estimate smaller tasks that don't get split up later on, as would usually be done in sprint planning 2.

When looking at how much time is being spent on Client work (billable, unbillable) and Internal work we got a good variety of results. The top-rated option for “Client work (billable)” was 50-75%, “Client work (unbillable)” was usually rated below 10% and “Internal work” defaulted to 10-25%. Our internal statistics match these options that have been voted by the industry most often.

I also asked what is most important to you and your team when it comes to scheduling time. "Providing value while keeping our tech debt in a reasonable place" was mentioned, which is also true for us. Over the last year, we started introducing our global maintenance team, which puts a dedicated focus on maintaining existing sites and keeping customer satisfaction high. By using a Kanban approach there, we can prioritise time-critical bug fixes when they are needed and work on maintenance-related tasks such as module updates in a coordinated way. We found it particularly helpful that the Scrum teams are well connected with the maintenance team to provide know-how transfer and domain knowledge where needed.

Another one mentioned, “We still need a good time tracker.” At Amazee we bill by the hour that we work so accurate time tracking is a must. We do so by using Tempo Timesheets for Jira combined with the Toggl app.

How do you communicate and what processes do you follow? Please leave us a comment below. If you are interested in Agile Scrum training, don’t hesitate to contact us.

Stay tuned for the next post where we’ll look at defining work.

   
Categories: FLOSS Project Planets

Ned Batchelder: Iter-tools for puzzles: oddity

Planet Python - Fri, 2017-12-08 08:52

It’s December, which means Advent of Code is running again. It provides a new two-part puzzle every day until Christmas. They are a lot of fun, and usually are algorithmic in nature.

One of the things I like about the puzzles is they often lend themselves to writing unusual but general-purpose helpers. As I have said before, abstraction of iteration is a powerful and under-used feature of Python, so I enjoy exploring it when the opportunity arises.

For yesterday’s puzzle I needed to find the one unusual value in an otherwise uniform list. This is the kind of thing that might be in itertools if itertools had about ten times more functions than it does now. Here was my definition of the needed function:

def oddity(iterable, key=None):
    """
    Find the element that is different.

    The iterable has at most one element different than the others. If a
    `key` function is provided, it is a function used to extract a comparison
    key from each element, otherwise the elements themselves are compared.

    Two values are returned: the common comparison key, and the different
    element.

    If all of the elements are equal, then the returned different element is
    None.  If there is more than one different element, an error is raised.

    """

The challenge I set for myself was to implement this function in as general and useful a way as possible. The iterable might not be a list, it could be a generator, or some other iterable. There are edge cases to consider, like if there are more than two different values.
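
Here is a minimal sketch of one way such a function could look. This is not the author's implementation (that one is linked below), just an illustration of the contract described in the docstring:

from collections import defaultdict

def oddity(iterable, key=None):
    key = key or (lambda value: value)
    groups = defaultdict(list)            # comparison key -> elements seen with it
    for element in iterable:
        groups[key(element)].append(element)
    if len(groups) == 1:                  # everything is the same
        return next(iter(groups)), None
    if len(groups) == 2:
        (k1, els1), (k2, els2) = groups.items()
        if len(els1) == 1 and len(els2) > 1:
            return k2, els1[0]
        if len(els2) == 1 and len(els1) > 1:
            return k1, els2[0]
    raise ValueError("More than one different element")

Like the author's version described below, this sketch stores every element and assumes the comparison keys are hashable.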

If you want to take a look, my code is on GitHub (with tests, natch). Fair warning: that repo has my solutions to all of the Advent of Code problems so far this year.

One problem with my implementation: it stores all the values from the iterable. For the actual Advent of Code puzzle, that was fine, since it only had to deal with fewer than 10 values. But how would you change the code so that it didn't store them all?

My code also assumes that the comparison values are hashable. What if you didn’t want to require that?

Suppose the iterable could be infinite? This changes the definition somewhat. You can’t detect the case of all the values being the same, since there’s no such thing as “all” the values. And you can’t detect having more than two distinct values, since you’d have to read values forever on the possibility that it might happen. How would you change the code to handle infinite iterables?

These are the kinds of considerations you have to take into account to write truly general-purpose itertools functions. It's an interesting programming exercise to work through how each version would differ.

BTW: it might be that there is a way to implement my oddity function with a clever combination of things already in itertools. If so, let me know!

Categories: FLOSS Project Planets