FLOSS Project Planets

Ritesh Raj Sarraf: Setting up appliances - the new way

Planet Debian - Mon, 2017-02-20 13:39

I own a Fitbit Surge. But Fitbit chose to remain exclusive in terms of interoperability, which means that to make any sense of the data the watch gathers, you need to stick with what Fitbit mandates. Fair enough, given today's trends. It is also part of their business model to restrict useful aspects of the report to Premium Membership. Again, fair enough, given today's business trends.

But a nice human chose to write a bridge to extract Fitbit data and feed it into Google Fit. The project is written in Python, so you can get it to work on most common computer platforms. I never bothered to package this tool for Debian, because I was never sure when I'd throw away the Fitbit. But until that happens, I decided to use the tool to sync my data to Google Fit. Which led me to requirements.txt.

This project's requirements.txt lists versioned module dependencies, and many of those modules were either older or newer in Debian than the versions the requirements called for. To get the tool working, I installed it the pip way. Three months later, something broke and I needed to revisit the installed modules. At that point, I realized that there's no such thing as: pip upgrade
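To see which installed modules have drifted from a requirements.txt, something like the following rough sketch could help. It is not part of the bridge project: the helper names (`parse_pins`, `find_mismatches`, `installed_versions`) and the sample package pins are made up for illustration, and only simple `name==version` lines are handled.

```python
import re
from importlib import metadata  # standard library, Python 3.8+


def parse_pins(requirements_text):
    """Return {name: version} for every simple 'name==version' line."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.fullmatch(r"([A-Za-z0-9._-]+)==([^\s;]+)", line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def installed_versions():
    """Return {name: version} for the distributions Python can see."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            found[name.lower()] = dist.version
    return found


def find_mismatches(pins, installed):
    """Return {name: (wanted, found)}; found is None if the module is missing."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


if __name__ == "__main__":
    # Hypothetical pins, purely for illustration.
    sample = "requests==2.12.0\nhttplib2==0.9.2  # pinned\n"
    print(find_mismatches(parse_pins(sample), installed_versions()))
```

This only reports drift; it deliberately does not try to upgrade anything.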

That led me to dig into why anyone wouldn't add something so simple, because today, in the days of pip, snap, flatpak and docker, distributions are predicted to go obsolete and irrelevant; users should get the SOURCES directly from the developers. But just looking at the date the bug was filed killed any further enthusiasm.

So, without packaging for Debian, and without installing through pip, I was happy that my init has the ability to create confined and containerized environments, something that I could use to get the job done.


rrs@chutzpah:~$ sudo machinectl login fitbit
[sudo] password for rrs:
Connected to machine fitbit. Press ^] three times within 1s to exit session.

Debian GNU/Linux 9 fitbit pts/0

fitbit login: root
Last login: Fri Feb 17 12:44:25 IST 2017 on pts/1

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
root@fitbit:~# tail -n 25 /var/tmp/lxc/fitbit-google.log
synced calories - 1440 data points

------------------------------   2017-02-19  -------------------------
synced steps - 1440 data points
synced distance - 1440 data points
synced heart_rate - 38215 data points
synced weight - 0 logs
synced body_fat - 0 logs
synced calories - 1440 data points

------------------------------   2017-02-20  -------------------------
synced steps - 1270 data points
synced distance - 1270 data points
synced heart_rate - 32547 data points
synced weight - 0 logs
synced body_fat - 0 logs
synced calories - 1271 data points

Synced 7 exercises between : 2017-02-15 -- 2017-02-20

--------------------------------------------------------------------------
     Like it ? star the repository : https://github.com/praveendath92/fitbit-googlefit
--------------------------------------------------------------------------

root@fitbit:~#


Categories: FLOSS Project Planets

Holger Levsen: How to use .ics files like it's 1997

Planet Debian - Mon, 2017-02-20 12:46
$ sudo apt install khal
…
Unpacking khal (0.8.4-3) ...
…
$ (echo 1;echo 0;echo y;echo 0; echo y; echo n; echo y; echo y) | khal configure
…
Do you want to write the config to /home/user/.config/khal/khal.conf? (Choosing `No` will abort) [y/N]: Successfully wrote configuration to /home/user/.config/khal/khal.conf
$ wget https://anonscm.debian.org/cgit/debconf-data/dc17.git/plain/misc/until-dc17.ics
…
HTTP request sent, awaiting response... 200 OK
Length: 6120 (6.0K) [text/plain]
Saving to: ‘until-dc17.ics’
…
$ khal import --batch -a private until-dc17.ics
$ khal agenda --days 14
Today:
16:30-17:30: DebConf Weekly Meeting ⟳
27-02-2017
16:30-17:30: DebConf Weekly Meeting ⟳

khal is available in stretch and newer, and is probably best run from cron, piping into '/usr/bin/mail'. Thanks to Gunnar Wolf for figuring it all out.


Jonathan Dowland: Blinkenlights, part 3

Planet Debian - Mon, 2017-02-20 11:31

red blinkenlights!

Part three of a series. part 1, part 2.

One morning last week I woke up to find the LED on my NAS a solid red. I've never been happier to have something fail.

I'd set up my backup jobs to fire off a systemd unit on failure.

This is a generator-service, which is used to fire off an email to me when something goes wrong (I followed these instructions on the Arch wiki to set it up). Once I got the blinkstick, I added an additional command to that service to light up the LED:

ExecStart=-/usr/local/bin/blinkstick --index 1 --limit 50 --set-color red

The actual failure was a simple thing to fix. But I never did get the email.

On further investigation, there are problems with using exim and systemd in Debian at the moment: it's possible for the exim4 daemon to exit and for systemd not to know that this is a failure, thus, the mail spool never gets processed. This should probably be fixed by the exim4 package providing a proper systemd service unit.


Jonathan Dowland: Blinkenlights, part 2

Planet Debian - Mon, 2017-02-20 11:31

Part two of a series. part 1, part 3.

To start configuring my NAS to use the new blinkenlights, I thought I'd begin with a really easy job: I plug in my iPod, a script runs to back it up, then the iPod gets unmounted. It's one of the simpler jobs to start with because the iPod is a simple block device and there's no encryption in play. For now, I'm also going to assume the LED is going to be used exclusively for this job. In the future I will want many independent jobs to perhaps use the LED to signal things, and figuring out how that will work is going to be much harder.

I'll skip over the journey and go straight to the working solution. I have a systemd job that is used to invoke a sync from the iPod as follows:

[Service]
Type=oneshot
ExecStart=/bin/mount /media/ipod
ExecStart=/usr/local/bin/blinkstick --index 1 --limit 10 --set-color 33c280
ExecStart=/usr/bin/rsync ...
ExecStop=/bin/umount /media/ipod
ExecStop=/usr/local/bin/blinkstick --index 1 --limit 10 --set-color green

[Install]
WantedBy=dev-disk-by\x2duuid-A2EA\x2d96ED.device

[Unit]
OnFailure=blinkstick-fail.service

/media/ipod is a classic mount configured in /etc/fstab. I've done this rather than use the newer systemd .mount units, which sadly don't give you enough hooks for running things after unmount or in the failure case. This feels quite unnatural: it would be much more "systemdy" to Requires= the mount unit, but I couldn't figure out an easy way to set the LED to green after the unmount. I'm sure it's possible, but convoluted.

The first blinkstick command sets the LED to a colour to indicate "in progress". I explored some of the blinkstick tool's options for a fading or throbbing colour but they didn't work very well. I'll take another look in the future. After the LED is set, the backup job itself runs. The last blinkstick command, which is only run if the previous umount has succeeded, sets the LED to indicate "safe to unplug".

The WantedBy here instructs systemd that when the iPod device-unit is activated, it should activate my backup service. I can refer to the iPod device-unit using this name based on the partition's UUID; this is not the canonical device name that you see if you run systemctl, but it's much shorter and, crucially, it's stable: the canonical name depends on exactly where you plugged it in and what other devices might have been connected at the same time.
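The escaped unit name above follows systemd's path-escaping rules, which the `systemd-escape --path` tool applies for you. As a rough illustration of where the `\x2d` sequences come from, here is a simplified Python sketch; the function name is made up, and it skips details the real tool handles (such as collapsing repeated slashes).

```python
def systemd_escape_path(path):
    """Simplified sketch of systemd's path escaping (not the real tool)."""
    trimmed = path.strip("/")
    if not trimmed:
        return "-"  # the root directory escapes to a single dash
    allowed = set(
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789:_."
    )
    out = []
    for position, char in enumerate(trimmed):
        if char == "/":
            out.append("-")  # path separators become dashes
        elif char in allowed and not (position == 0 and char == "."):
            out.append(char)
        else:
            # Everything else, including a literal dash, becomes \xXX.
            out.extend("\\x%02x" % byte for byte in char.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    print(systemd_escape_path("/dev/disk/by-uuid/A2EA-96ED") + ".device")
```

The literal dashes in the by-uuid path are what turn into `\x2d`, because a plain dash is reserved for the path separator.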

If something fails, a second unit blinkstick-fail.service gets activated. This is very short:

[Service]
ExecStart=/usr/local/bin/blinkstick --index 1 --limit 50 --set-color red

This simply sets the LED to be red.

Again it's a bit awkward that in 2 cases I'm setting the LED with a simple Exec but in the third I have to activate a separate systemd service: this seems to be the nature of the beast. At least when I come to look at concurrent jobs all interacting with the LED, the failure case should be simple: red trumps any other activity, user must go and check what's up.
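For the concurrent case, the "red trumps any other activity" rule could be sketched as a small priority lookup. This is only a thought experiment, not something I have running: the state names, the ranking of "running" above "idle", and the `led_colour` helper are all assumptions; only the colour values and the rule that failure wins come from the units above.

```python
# Colours as used in the units above; the state names are hypothetical.
STATE_COLOURS = {"idle": "green", "running": "33c280", "failed": "red"}

# Higher rank wins. "failed" outranks everything, per "red trumps".
STATE_RANK = {"idle": 0, "running": 1, "failed": 2}


def led_colour(job_states):
    """Pick one LED colour for a mapping of job name -> state."""
    if not job_states:
        return STATE_COLOURS["idle"]
    worst = max(job_states.values(), key=STATE_RANK.__getitem__)
    return STATE_COLOURS[worst]


if __name__ == "__main__":
    print(led_colour({"ipod-sync": "running", "nightly-backup": "failed"}))
```

Each job would then only report its own state, and a single place decides what the LED shows.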


Jonathan Dowland: Blinkenlights!

Planet Debian - Mon, 2017-02-20 11:31


Part one of a series. part 2, part 3.

Late last year, I was pondering how one might add a status indicator to a headless machine like my NAS to indicate things like failed jobs.

After a brief run through of some options (a USB-based custom device; a device pretending to be a keyboard attached to a PS/2 port; commandeering the HD activity LED; commandeering the PC speaker wire) I decided that I didn't have the time to learn the kind of skills needed to build something at that level and opted to buy a pre-assembled programmable USB thing instead, called the BlinkStick.

Little did I realise that my friend Jonathan McDowell thought that this was an interesting challenge and actually managed to design, code and build something! Here's his blog post outlining his solution and here's his code on github (or canonically)

Even though I've bought the blinkstick, given Jonathan's efforts (and the bill of materials) I'm going to have to try and assemble this for myself and give it a go. I've also managed to borrow an Arduino book from a colleague at work.

Either way, I still have some work to do on the software/configuration side to light the LEDs up at the right time and colour based on the jobs running on the NAS and their state.


Coding Diet: Flask and Pytest coverage

Planet Python - Mon, 2017-02-20 11:27

I have written before about Flask and obtaining test coverage results here and with an update here. This is pretty trivial if you're writing unit tests that directly call the application, but if you actually want to write tests which animate a browser, for example with selenium, then it's a little more complicated, because the browser/test code has to run concurrently with the server code.

Previously I would have the Flask server run in a separate process and run 'coverage' over that process. This was slightly unsatisfying, partly because you sometimes want coverage analysis of your actual tests. Test suites, just like application code, can grow in size with many utility functions and imports etc. which may eventually end up not actually being used. So it is good to know that you're not needlessly maintaining some test code which is not actually invoked.

We could probably get around this restriction by running coverage in both the server process and the test-runner's process and combine the results (or simply view them separately). However, this was unsatisfying simply because it felt like something that should not be necessary. Today I spent a bit of time setting up the scheme to test a Flask application without the need for a separate process.

I have now solved this by not using Flask's included Werkzeug server and instead using the WSGI server included in the standard-library wsgiref.simple_server module. Here is a minimal example:

import flask

class Configuration(object):
    TEST_SERVER_PORT = 5001

application = flask.Flask(__name__)
application.config.from_object(Configuration)


@application.route("/")
def frontpage():
    if False:
        pass # Should not be covered
    else:
        return 'I am the lizard queen!' # Should be in coverage.


# Now for some testing.
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import pytest
# Currently just used for the temporary hack to quit the phantomjs process
# see below in quit_driver.
import signal

import threading

import wsgiref.simple_server

class ServerThread(threading.Thread):
    def setup(self):
        application.config['TESTING'] = True
        self.port = application.config['TEST_SERVER_PORT']

    def run(self):
        self.httpd = wsgiref.simple_server.make_server('localhost', self.port, application)
        self.httpd.serve_forever()

    def stop(self):
        self.httpd.shutdown()

class BrowserClient(object):
    """Interacts with a running instance of the application via animating a
    browser."""
    def __init__(self, browser="phantom"):
        driver_class = {
            'phantom': webdriver.PhantomJS,
            'chrome': webdriver.Chrome,
            'firefox': webdriver.Firefox
            }.get(browser)
        self.driver = driver_class()
        self.driver.set_window_size(1200, 760)

    def finalise(self):
        self.driver.close()
        # A bit of hack this but currently there is some bug I believe in
        # the phantomjs code rather than selenium, but in any case it means that
        # the phantomjs process is not being killed so we do so explicitly here
        # for the time being. Obviously we can remove this when that bug is
        # fixed. See: https://github.com/SeleniumHQ/selenium/issues/767
        self.driver.service.process.send_signal(signal.SIGTERM)
        self.driver.quit()

    def log_current_page(self, message=None, output_basename=None):
        content = self.driver.page_source
        # This is frequently what we really care about so I also output it
        # here as well to make it convenient to inspect (with highlighting).
        basename = output_basename or 'log-current-page'
        file_name = basename + '.html'
        with open(file_name, 'w') as outfile:
            if message:
                outfile.write("<!-- {} --> ".format(message))
            outfile.write(content)
        filename = basename + '.png'
        self.driver.save_screenshot(filename)

def make_url(endpoint, **kwargs):
    with application.app_context():
        return flask.url_for(endpoint, **kwargs)

# TODO: Ultimately we'll need a fixture so that we can have multiple
# test functions that all use the same server thread and possibly the same
# browser client.
def test_server():
    server_thread = ServerThread()
    server_thread.setup()
    server_thread.start()

    client = BrowserClient()
    driver = client.driver

    try:
        port = application.config['TEST_SERVER_PORT']
        application.config['SERVER_NAME'] = 'localhost:{}'.format(port)

        driver.get(make_url('frontpage'))
        assert 'I am the lizard queen!' in driver.page_source

    finally:
        client.finalise()
        server_thread.stop()
        server_thread.join()

To run this you will of course need flask as well as pytest, pytest-cov, and selenium:

$ pip install flask pytest pytest-cov selenium

In addition you will need phantomjs:

$ npm install phantomjs
$ export PATH=$PATH:./node_modules/.bin/

Then to run it, the command is:

$ py.test --cov=./ app.py
$ coverage html

The coverage html step is of course optional; run it only if you wish to view the results in a friendly HTML format.


I've not used this extensively myself yet, so there may be some problems when using a more interesting flask application.

Don't put your virtual environment directory in the same directory as app.py because in that case it will perform coverage analysis over the standard library and dependencies.

In a real application you will probably want to make a pytest fixture out of the server thread and browser client, so that you can use each for multiple separate test functions. Essentially your test function should just be the part inside the try clause.
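One possible shape for the server half of such a fixture is a context manager that starts the wsgiref server on a background thread and always shuts it down afterwards. This is a sketch rather than code I have used in anger: `running_server`, `QuietHandler` and `demo_app` are names invented here, and a tiny plain WSGI callable stands in for the Flask application so that the block needs nothing outside the standard library.

```python
import contextlib
import threading
import urllib.request
import wsgiref.simple_server


class QuietHandler(wsgiref.simple_server.WSGIRequestHandler):
    """Request handler that does not log every request to stderr."""

    def log_message(self, format, *args):
        pass


@contextlib.contextmanager
def running_server(wsgi_app, host="127.0.0.1", port=0):
    """Serve wsgi_app on a thread; yield its base URL; always clean up."""
    # Port 0 asks the operating system for any free port.
    httpd = wsgiref.simple_server.make_server(
        host, port, wsgi_app, handler_class=QuietHandler)
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    try:
        yield "http://{}:{}".format(host, httpd.server_port)
    finally:
        httpd.shutdown()
        thread.join()
        httpd.server_close()


def demo_app(environ, start_response):
    """Stand-in for the Flask application (any WSGI callable works)."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"I am the lizard queen!"]


# With pytest installed, a module-scoped fixture could wrap it like so:
#
# @pytest.fixture(scope="module")
# def server_url():
#     with running_server(application) as url:
#         yield url

if __name__ == "__main__":
    with running_server(demo_app) as url:
        print(urllib.request.urlopen(url + "/").read())
```

Because the server runs in the same process as the tests, coverage sees both the application code and the test code in a single run.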

The example does not call the log_current_page method, but I frequently find it quite useful so I have included it here nonetheless.


GoDjango: Why You Should Pin Your Dependencies by My Mistakes

Planet Python - Mon, 2017-02-20 11:00

Have you ever been bitten by not pinning your dependencies in your django project? If not, be glad, and come learn from my problems.

Pinning your dependencies is important for heading off future unknown issues: better the devil you know, and all that.

In this week's video I talk about three times I had issues: not pinning my dependencies, a weird edge case with pinning and python, and not really understanding what I was doing with pinned dependencies.
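A quick way to spot unpinned entries is to scan the requirements file for lines without an exact `==` version. The sketch below is only illustrative: `unpinned_requirements` is a made-up helper, the package names are examples, and it ignores finer points such as URL requirements.

```python
def unpinned_requirements(requirements_text):
    """Return the requirement lines that lack an exact '==' pin."""
    loose = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):
            continue  # skip blanks and options such as '-r base.txt'
        if "==" not in line:
            loose.append(line)
    return loose


if __name__ == "__main__":
    # Example content only; not taken from any real project.
    sample = "Django==1.10.5\nrequests\ncelery>=4.0\n"
    print(unpinned_requirements(sample))
```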

Why You Should Pin Your Dependencies


Senthil Kumaran: CPython moved to Github

Planet Python - Mon, 2017-02-20 10:09

The CPython project moved its source code hosting from a self-hosted mercurial repository at hg.python.org to the Git version control system hosted at Github. The new location of the python project is http://www.github.com/python/cpython

This is the second big version control migration to happen since I got involved. The first one was when we moved from svn to mercurial. Branches were sub-optimal in svn and we used svn-merge.py to merge across branches. Mercurial helped there, and everyone got used to a distributed version control system written in python. It was interesting for me personally to compare mercurial with the other popular DVCS, git.

Over the years, Github has become a popular place for developers to host their projects. They have constantly improved their service offering. Many python developers got used to the git version control system and found its utility value too.

Two years ago, it was decided that Python would move to Git and Github. The effort was led by Brett Cannon, assisted by a number of other developers, and the migration happened on Feb 10, 2017.

I helped with the migration too, providing tooling for converting the hg repository to git using the facilities available from the hg-git mercurial plugin.

We made use of hg-git, and wrote some conversion scripts that could get us to the converted repo as we wanted.

  1. https://github.com/orsenthil/cpython-hg-to-git
  2. https://bitbucket.org/orsenthil/hg-git

Now that the migration is done, we are familiarizing ourselves with the new workflow.


Rene Dudfield: Is Type Tracing for Python useful? Some experiments.

Planet Python - Mon, 2017-02-20 10:01
Type Tracing - as a program runs you trace it and record the types of variables coming in and out of functions, and being assigned to variables.
Is Type Tracing useful for providing quality benefits, documentation benefits, porting benefits, and also speed benefits to real python programs?

Python is now a gradually typed language, meaning that you can gradually apply types and along with type inference statically check your code is correct. Once you have added types to everything, you can catch quite a lot of errors. For several years I've been using the new type checking tools that have been popping up in the python ecosystem. I've given talks to user groups about them, and also trained people to use them. I think a lot of people are using these tools without even realizing it. They see in their IDE warnings about type issues, and methods are automatically completed for them.

But I've always had some thoughts in the back of my head about recording types at runtime of a program in order to help the type inference out (and to avoid having to annotate them manually yourself).

Note, that this technique is a different, but related thing to what is done in a tracing jit compiler.
Some days ago I decided to try Type Tracing out... and I was quite surprised by the results. I asked myself these questions.
  • Can I store the types coming in and out of python functions, and the types assigned to variables in order to be useful for other things based on tracing the running of a program? (Yes)
  • Can I "Type Trace" a complex program? (Yes, a flask+sqlalchemy app test suite runs)
  • Is porting python 2 code quicker by Type Tracing combined with static type checking, documentation generation, and test generation? (Yes, refactoring is safer with a type checker and no manually written tests)
  • Can I generate better documentation automatically with Type Tracing? (Yes, return and parameter types and example values helps understanding greatly)
  • Can I use the types for automatic property testing? (Yes, hypothesis does useful testing just knowing some types and a few examples... which we recorded with the tracer)
  • Can I use example capture for tests and docs, as well as the types? (Yes)
  • Can I generate faster compiled code automatically just using the recorded types and Cython? (Yes)

Benefits from Type Tracing.
    Below I try to show that the following benefits can be obtained by combining Type Tracing with other existing python tools.
    • Automate documentation generation, by providing types to the documentation tool, and by collecting some example inputs and outputs.
    • Automate some type annotation.
    • Automatically find bugs static type checking can not. Without full type inference, existing python static type checkers can not find many issues until the types are fully annotated. Type Tracing can provide those types.
    • Speed up Python2 porting process, by finding issues other tools can't. It can also speed things up by showing people types and example inputs. This can greatly help people understand large programs when documentation is limited.
    • Use for Ahead Of Time (AOT) compilation with Cython.
    • Help property testing tools to find simple bugs without manually setting properties.
        Tools used to hack something together.
        • coverage (extended the coverage checker to record types as it goes) 
        • mypy (static type checker for python)
        • Hypothesis (property testing... automated test generator)
        • Cython (a compiler for python code, and code with type annotations)
        • jedi (another python static type checker)
        • Sphinx (automatic documentation generator).
        • Cpython (the original C implementation of python)
        More details below on the experiments.
        Type Tracing using 'coverage'.
        Originally I hacked up a set_trace script... and started going. But there really are so many corner cases. Also, I already run the "coverage" tool over the code base I'm working on.

        I started with coverage.pytracer.PyTracer, since it's python. Coverage also comes with a faster tracer written in C. So far I'm just using the python one.
        The plan later would be to perhaps use CoverageData, which uses JSON, which means storing the type will sometimes be hard (eg, when types are dynamically generated). However, I think I'm happy to start with easy types. To start simple, I'll just record object types as strings with something like `repr(type(o)) if type(o) is not type else repr(o)`. Well, I'm not sure. So far, I'm happy with hacking everything into my fork of coverage, but to move it into production there is more work to be done. Things like multiprocessing and multithreading all need to be handled.
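To give a feel for the idea without the coverage fork, here is a minimal stand-alone sketch built on `sys.settrace` from the standard library. It is not the code from my fork: the `TypeTracer` class and its attributes are invented for this illustration, it records only positional argument types and return types, and it ignores threads, generators and other corner cases.

```python
import sys
from collections import defaultdict


def type_name(o):
    """Record a type as a string, as described above."""
    return repr(type(o)) if type(o) is not type else repr(o)


class TypeTracer(object):
    """Toy tracer: records argument and return types per function name."""

    def __init__(self):
        self.args = defaultdict(set)     # name -> {tuple of arg type names}
        self.returns = defaultdict(set)  # name -> {return type names}

    def _trace(self, frame, event, arg):
        code = frame.f_code
        if event == "call":
            # On a call event the arguments are already in f_locals.
            names = code.co_varnames[:code.co_argcount]
            signature = tuple(type_name(frame.f_locals[n]) for n in names)
            self.args[code.co_name].add(signature)
        elif event == "return":
            # For return events, arg is the value being returned.
            self.returns[code.co_name].add(type_name(arg))
        return self._trace

    def run(self, func, *args, **kwargs):
        """Call func under tracing, restoring any previous tracer after."""
        previous = sys.gettrace()
        sys.settrace(self._trace)
        try:
            return func(*args, **kwargs)
        finally:
            sys.settrace(previous)


def int_problem(x):
    return x / 4


if __name__ == "__main__":
    tracer = TypeTracer()
    tracer.run(int_problem, 3)
    print(dict(tracer.args), dict(tracer.returns))
```

The recorded strings are what a later step could turn into type comments or documentation.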
        Porting python 2 code with type tracing.
        I first started porting code to python 3 in the betas... around 2007. Including some C API modules. I think I worked on one of the first single code base packages. Since then the tooling has gotten a lot better. Compatibility libraries exist (six), lots of people have figured out the dangerous points and documented them. Forward compatibility features were added into the python2.6 and 2.7, and 3.5 releases to make porting easier. However, it can still be hard.
        Especially when Python 2 code bases often don't have many tests. Often zero tests. Also, there may be very little documentation, and the original developers have moved on.
        But the code works, and it's been in production for a long time, and gets updates occasionally. Maybe it's not updated as often as it's needed because people are afraid of breaking things.
        Steps to port to python 3 are usually these:
        1. Understand the code.
        2. Run the code in production (or on a copy of production data).
        3. With a debugger, look at what is coming in and out of functions.
        4. Write tests for everything.
        5. Write documentation.
        6. Run 2to3.
        7. Do lots of manual QA.
        8. Start refactoring.
        9. Repeat. Repeat manually writing tests, docs, and testing manually. Many times.
        Remember that writing tests is usually harder than writing the code in the first place.

        With type tracing helping to generate docs, types for the type checker, examples for human reading plus for the hypothesis property checker we get a lot more tools to help ensure quality.

        A new way to port python2 code could be something like...
        1. Run program under Type Tracing, line/branch coverage, and example capture.
        2. Look at generated types, example inputs and outputs.
        3. Look at generated documentation.
        4. Gradually add type checking info with help of Type Tracing recorded types.
        5. Generate tests automatically with Type Tracing types, examples, and hypothesis automated property testing. Generate empty test stubs for things you still need to test.
        6. Once each module is fully typed, you can statically type check it.
        7. You can cross validate your type checked python code against your original code. Under the Type Tracer.
        8. Refactoring is easier with better docs, static type checks, tests, types for arguments and return values, and example inputs and outputs.
        9. Everything should be ported to work with the new forwards compatibility functionality in python2.7.
        10. Now with your various quality checks in place, you can start porting to python3. Note, you might not have needed to change any of the original code - only add types.
        I would suggest the effort is about 1/5th of the normal time it takes to port things. Especially if you want to make sure the chance of introducing errors is very low.
        Below are a couple of issues where Type Tracing can help over existing tools.
        Integer divide issue.
        Here I will show a bug that the 2to3 conversion tool misses. Also, mypy does not detect a problem with the code.

        # int_issue.py
        def int_problem(x):
            return x / 4
        print(int_problem(3))

        $ python2 int_issue.py
        0
        $ python3 int_issue.py
        0.75

        $ mypy --py2 int_issue.py
        $ mypy int_issue.py
        $ 2to3 int_issue.py
        RefactoringTool: Skipping optional fixer: buffer
        RefactoringTool: Skipping optional fixer: idioms
        RefactoringTool: Skipping optional fixer: set_literal
        RefactoringTool: Skipping optional fixer: ws_comma
        RefactoringTool: Refactored int_issue.py
        --- int_issue.py    (original)
        +++ int_issue.py    (refactored)
        @@ -3,4 +3,4 @@
         def int_problem(x):
             return x / 4

        RefactoringTool: Files that need to be modified:
        RefactoringTool: int_issue.py

        See how when run under python3 it gives a different result?
        Can we fix it when Type Tracing adds types?  (Yes)
        So, how about if we run the program under type tracing, and record the types coming in and out? See how it adds a python3 compatible comment about taking an int, and returning an int. This is so that mypy (and other type checkers) can see what it is supposed to take in.
        def int_problem(x):
            # type: (int) -> int
            return x / 4
        print(int_problem(3))

        $ mypy int_issue.py
        int_issue.py:5: error: Incompatible return value type (got "float", expected "int")
        I'm happy that yes, Type Tracing combined with mypy can detect this issue, whereas mypy cannot by itself.
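Once the checker has pointed at the line, one possible fix (assuming the python 2 behaviour of rounding down is what the code relied on) is to switch to floor division, which behaves the same for ints on python 2 and python 3:

```python
def int_problem(x):
    # type: (int) -> int
    # '//' is floor division on both python 2 and python 3,
    # so the int -> int annotation holds again.
    return x // 4


if __name__ == "__main__":
    print(int_problem(3))
```

If the float result was actually wanted, the annotation should change to return a float instead.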

        Binary or Text file issue?
        Another porting issue not caught by existing tools is trying to do the right thing when a python file is in binary mode or in text mode. If in binary, read() will return bytes, otherwise it might return text.

        In theory this could be made to work, however at the time of writing, there is an open issue with "dependent types" or "Factory Pattern" functions in mypy. For more information on this, and also a workaround I wrote, see this issue: https://github.com/python/mypy/issues/2337#issuecomment-280850128

        In there I show that you can create your own io.open replacement that always returns one type. eg, open_rw(fname) instead of open(fname, 'rw').

        Once you know that .read() will return bytes, then you also know that it can't call .format() in python 3. The solution is to use % string formatting on bytes, which is supported from python3.5 upwards.

        x = f.read() # type: bytes

        So the answer here is that mypy could likely solve this issue by itself in the future (once things are fully type annotated). But for now, it's good to see combining type tracing with mypy could help detect binary and text encoding issues much faster.
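As a small illustration of the two points above, here is a sketch with a wrapper that always opens in binary mode, plus % formatting on the bytes it reads. The names `open_binary` and `describe` are invented for this example (this is not the open_rw helper from the linked issue), and the bytes formatting needs python 3.5 or newer.

```python
import os
import tempfile
from typing import BinaryIO


def open_binary(fname):
    # type: (str) -> BinaryIO
    """Always open for binary reading, so read() has one known type."""
    return open(fname, "rb")


def describe(payload):
    # type: (bytes) -> bytes
    # bytes has no .format() on python 3; % formatting works from 3.5.
    return b"read %d bytes: %s" % (len(payload), payload)


if __name__ == "__main__":
    handle, path = tempfile.mkstemp()
    try:
        with os.fdopen(handle, "wb") as out:
            out.write(b"abc")
        with open_binary(path) as f:
            x = f.read()  # type: bytes
        print(describe(x))
    finally:
        os.remove(path)
```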

        Generating Cython code with recorded types.
        I wanted to see if this was possible. So I took the simple example from the cython documentation: http://cython.readthedocs.io/en/latest/src/quickstart/cythonize.html
        I used my type tracer to transform this python:

        def f(x):
            return x**2-x

        def integrate_f(a, b, N):
            s = 0
            dx = (b-a)/N
            for i in range(N):
                s += f(a+i*dx)
            return s * dx

        Before you look below... take a guess what parameters a, b, and N are? Note, how there are no comments. Note how the variable names are single letter. Note, how there are no tests. There are no examples.
        In [2]: %timeit integrate_f(10.4, 2.3, 17)
        100000 loops, best of 3: 5.12 µs per loop

        Into this Cython code with annotated types after running it through Type Tracing:

        In [1]: %load_ext Cython

        In [2]: %%cython
           ...: cdef double f(double x):
           ...:     return x**2-x
           ...: def integrate_f_c(double a, double b, int N):
           ...:     """
           ...:     :Example:
           ...:     >>> integrate_f_c(10.4, 2.3, 17)
           ...:     -342.34804152249137
           ...:     """
           ...:     cdef int i
           ...:     cdef double s, dx
           ...:     s = 0
           ...:     dx = (b-a)/N
           ...:     for i in range(N):
           ...:         s += f(a+i*dx)
           ...:     return s * dx 

        In [3]: %timeit integrate_f_c(10.4, 2.3, 17)

        10000000 loops, best of 3: 117 ns per loop
        Normal python was 5120 nanoseconds. The cython compiled version is 117 nanoseconds.  The result is 44x faster code, and we have all the types annotated, with an example. This helps you understand it a little better than before too.

        This was a great result for me. It shows that yes combining Type Tracing with Cython can give improvements over Cython just by itself. Note, that Cython is not only for speeding up simple numeric code. It's also been used to speed up string based code, database access, network access, and game code.

        So far I've made a simple mapping of python types to cython types. To make the code more useful would require quite a bit more effort. However, if you use it as a tool to help you write cython code yourself, then it's very useful to speed up that process.

        The best cases so far are when it knows all of the types, all of the types have direct cython mappings, and it avoids calling python functions inside the function. In other words, 'pure' functions.
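The "simple mapping of python types to cython types" could look roughly like the sketch below. This is an illustration rather than the code I used: the `CYTHON_TYPES` table and the `cython_signature` helper are made up here, and mapping a python int to a C int silently assumes the values fit.

```python
# Hypothetical mapping from recorded python type names to cython types.
# Assumption: recorded ints are small enough for a C int.
CYTHON_TYPES = {"int": "int", "float": "double", "bool": "bint"}


def cython_signature(func_name, param_types):
    """Build a typed 'def' line from (parameter, python type name) pairs."""
    parts = []
    for name, py_type in param_types:
        c_type = CYTHON_TYPES.get(py_type)
        # Leave a parameter untyped when there is no direct mapping.
        parts.append("{} {}".format(c_type, name) if c_type else name)
    return "def {}({}):".format(func_name, ", ".join(parts))


if __name__ == "__main__":
    recorded = [("a", "float"), ("b", "float"), ("N", "int")]
    print(cython_signature("integrate_f_c", recorded))
```

Parameters without a direct mapping stay as plain python objects, which is where the speedup starts to shrink.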

        Cross validation for Cython and python versions?
        In a video processing project I worked on there were implementations in C, and other implementations of the same functions in assembly. A very simple way of testing is to run all the implementations and compare the results. If the C implementation gives the same results as the assembly implementations, then there's a pretty good chance they are correct.
        In [1]:  assert integrate_f_c(10.4, 2.3, 17) == integrate_f(10.4, 2.3, 17)

        If we have a test runner, we can check if the inputs and outputs are the same between the compiled code and the non compiled code. That is, cross validate implementations against each other for correctness.
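A cross validation helper along those lines might look like this sketch. Everything here is illustrative: `cross_validate` and `integrate_f_alt` are invented names, a second pure python implementation stands in for the compiled one so that the block runs without Cython, and floats are compared with `math.isclose` rather than `==` to allow for tiny rounding differences between implementations.

```python
import math


def f(x):
    return x**2 - x


def integrate_f(a, b, N):
    """Reference implementation (the pure python version from above)."""
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


def integrate_f_alt(a, b, N):
    """Stand-in for the compiled version; same algorithm, different code."""
    dx = (b - a) / N
    return sum(f(a + i * dx) for i in range(N)) * dx


def cross_validate(reference, candidate, recorded_inputs, rel_tol=1e-9):
    """Return (args, expected, actual) for every input where they differ."""
    failures = []
    for args in recorded_inputs:
        expected = reference(*args)
        actual = candidate(*args)
        if not math.isclose(expected, actual, rel_tol=rel_tol):
            failures.append((args, expected, actual))
    return failures


if __name__ == "__main__":
    inputs = [(10.4, 2.3, 17), (0.0, 1.0, 100)]
    print(cross_validate(integrate_f, integrate_f_alt, inputs))
```

The recorded inputs from the tracer are a natural source for the `recorded_inputs` list.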
        Property testing.
        The most popular property testing framework is Quickcheck from the Haskell world. However, python also has an implementation - Hypothesis. Rather than supply examples, as is usual with unit testing, you tell it about properties which hold true.

        Can we generate a hypothesis test automatically using just types collected with Type Tracing?
        Below we can see some unit tests (example based testing), as well as some Hypothesis tests (property testing). They are for a function "always_add_something(x)", which always adds something to the number passed in. As a property, we would say that "always_add_something(x) > x".  That property should hold for every value of x, given that x is an int.

        Note, that the program is fully typed, and passes type checking with mypy. Also note that there is 100% test coverage if I remove the divide by zero error I inserted.

        from hypothesis import given
        import hypothesis.strategies

        from bad_logic_issue import always_add_something, always_add_something_good

        def test_always_add_something():
            #type: () -> None
            assert always_add_something(5) >= 5
            assert always_add_something(200) >= 200

        def test_always_add_something_good():
            #type: () -> None
            assert always_add_something_good(5) >= 5
            assert always_add_something_good(200) >= 200

        @given(hypothesis.strategies.integers())
        def test_always_add_something_property(x):
            assert always_add_something(x) > x

        # Here we test the good one.
        @given(hypothesis.strategies.integers())
        def test_always_add_something_good_property(x):
            assert always_add_something_good(x) > x

        Here are two implementations of the function. The first one is a contrived example in order to show two types of logic errors that are quite common. Even 30-year-old code used by billions of people has been shown to have these errors. They're sort of hard to find with normal testing methods.

        def always_add_something(x):
            # type: (int) -> int
            '''Silly function that is supposed to always add something to x.

            But it doesn't always... even though we have
             - 'complete' test coverage.
             - fully typed
            '''
            r = x #type: int
            if x > 0 and x < 10:
                r += 20
            elif x > 15 and x < 30:
                r //= 0
            elif x > 100:
                r += 30

            return r

        def always_add_something_good(x):
            # type: (int) -> int
            '''This one always does add something.
            '''
            return x + 1

        Now, hypothesis can find the errors when you write the property that the return value needs to be greater than the input. What about if we just use the types we record with Type Tracing to give hypothesis a chance to test? Hypothesis comes with a number of test strategies which generate many variations of a type. Eg, there is an "integers" strategy.

        # Will it find an error just telling hypothesis that it takes an int as input?
        @given(hypothesis.strategies.integers())
        def test_always_add_something(x):
            always_add_something(x)

        It finds the divide by zero issue (when x is 16). However it does not find the other issue, because it still does not know that there is a problem. We haven't told it anything about the result always needing to be greater than the input.
        bad_logic_issue.py:13: ZeroDivisionError
        -------------------------------------------------------- Hypothesis --------------------------------------------------------
        Falsifying example: test_always_add_something(x=16)

        The result is that yes, it could find one issue automatically, without having to write any extra test code, just from Type Tracing.
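
To make the idea concrete without depending on Hypothesis, here is a standard library only sketch: a recorded argument type is mapped to a generator of candidate values, and the function is simply called with each of them. The names GENERATORS, int_candidates and smoke_test are invented for this sketch, and the fixed value range is an assumption. Hypothesis generates and shrinks inputs far more cleverly than this.

```python
def always_add_something(x):
    # Same contrived function as above, repeated so the sketch runs alone.
    r = x
    if x > 0 and x < 10:
        r += 20
    elif x > 15 and x < 30:
        r //= 0
    elif x > 100:
        r += 30
    return r


def int_candidates():
    """Yield a fixed set of ints: a small dense range plus a few large values."""
    yield from range(-100, 201)
    yield from (-2**31, 2**31 - 1, 2**63)


# Maps a type recorded by tracing to a source of test values (illustrative).
GENERATORS = {int: int_candidates}


def smoke_test(func, arg_type):
    """Call func with generated values and return the inputs that raised."""
    failures = []
    for value in GENERATORS[arg_type]():
        try:
            func(value)
        except Exception as exc:
            failures.append((value, type(exc).__name__))
    return failures


failures = smoke_test(always_add_something, int)
print(failures[0])
```

Like the type-only Hypothesis test, this only notices crashes. The inputs where nothing gets added still pass silently, because no property about the result was stated.
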

        For pure functions, it would be also useful to record some examples for unit test generation.

        In conclusion.

        I'm happy with the experiment overall. I think it shows it can be a fairly useful technique for making python programs more understandable, faster, and more correct. It can also help speed up porting old python2 code dramatically (especially when that code has limited documentation and tests).

        I think the experiment also shows that combining existing python tools (coverage, mypy, Cython, and hypothesis) can give some interesting extra abilities without too much extra effort. E.g. I didn't need to write a robust tracing module, a static type checker, or a python compiler. However, it would take some effort to turn these into robust general purpose tools. Currently what I have is a collection of fragile hacks, without support for many corner cases :)

        For now I don't plan to work on this any more in the short term. (Unless of course someone wants to hire me to port some python2 code. Then I'll work on these tools again since it speeds things up quite a lot).

        Any corrections or suggestions? Please leave a comment, or see you on twitter @renedudfield
        Categories: FLOSS Project Planets

        Weekly Python Chat: Django Forms

        Planet Python - Mon, 2017-02-20 10:00

        Special guest Kenneth Love is going to answer your questions about how to use Django's forms.


        Chromatic: A Taco-Friendly Guide to Cache Metadata in Drupal 8

        Planet Drupal - Mon, 2017-02-20 09:30

        Explaining Drupal 8's cache metadata with the help of tacos.


        Doug Hellmann: uuid — Universally Unique Identifiers — PyMOTW 3

        Planet Python - Mon, 2017-02-20 09:00
        RFC 4122 defines a system for creating universally unique identifiers for resources in a way that does not require a central registrar. UUID values are 128 bits long and, as the reference guide says, “can guarantee uniqueness across space and time.” They are useful for generating identifiers for documents, hosts, application clients, and other situations …
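
A small sketch with the standard library uuid module shows the two properties mentioned here: the values are 128 bits long, and no registrar is involved in creating them. The name "example.org" is only a placeholder chosen for this illustration.

```python
import uuid

# A random (version 4) UUID: no coordination with anyone is needed.
random_id = uuid.uuid4()

# A name-based (version 5) UUID: the same namespace and name always give
# the same value. "example.org" is a placeholder name.
name_id = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")

print(random_id)
print(name_id)
print(len(random_id.bytes) * 8)  # size of the identifier in bits
```

Version 4 is the usual choice when any unique value will do. Version 5 suits cases where the same input should always map to the same identifier.
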

        Mike Driscoll: PyDev of the Week: Petr Viktorin

        Planet Python - Mon, 2017-02-20 08:30

        This week our PyDev of the Week is Petr Viktorin (@EnCuKou). Petr is the author of PEP 489 — Multi-phase extension module initialization and teaches Python for the local PyLadies in Czech Republic. You can see some of what he’s up to via his Github page or on his website. Let’s take some time to get to know Petr better!

        Can you tell us a little about yourself (hobbies, education, etc):


        I’m a Python programmer from Brno, Czech Republic. I studied at the Brno University of Technology, and for my master’s I switched to the University of Eastern Finland.

        When I’m not programming, I enjoy playing board games with my friends, and sometimes go to an orienteering race (without much success).

        Why did you start using Python?

        At the university, I did coursework in languages like C, Java, and Lisp, but then I found Python and got hooked. It fit the way I think about programs, abstracted away most of the boring stuff, and makes it easy to keep the code understandable.

        After I returned home from the university, I found a community that was starting to form around the language, and that’s probably what keeps me around the language now.

        What other programming languages do you know and which is your favorite?

        Since I work with CPython a lot, I code in C – or at least I *read* C regularly. And I’d say C’s my favorite, after Python – they complement each other quite nicely. I can also throw something together in JavaScript. And C++, Java or PHP, though I don’t find much reason to code in those languages any more. Since I finished school, I sadly haven’t made much time to learn new languages. Someday, I’d like to explore Rust more seriously, but I haven’t found a good project for starting that yet.

        What projects are you working on now?

        I work at Red Hat, and the main job of our team is to package Python for Fedora and RHEL. The mission is to make sure everything works really great together, so when we succeed, the results of the work are somewhat invisible.

        My other project is teaching Python. A few years back, and without much teaching experience, I started a beginners’ Python course for the local PyLadies. I’ve spent a lot of time on making the content online and accessible to everyone, and over the years it got picked up in two more cities, and sometimes I find people going through the course from home. Now people are refining the course, and even building new workshops and other courses on top of it. Like any open-source project, it needs some maintenance, and I’m lucky to be able to spend some paid time both teaching and coordinating and improving Czech Python teaching materials.

        When I find some spare time, I hack on crazy side projects like a minimalistic 3D-printed MicroPython-powered game console.

        Which Python libraries are your favorite (core or 3rd party)?

        I’m sure Requests appeared on these interviews before: it’s a great example of how a library should be designed.

        I also like the pyglet library. It’s an easy way to draw graphics on the screen, and I also use it to introduce people to event-driven programming.

        Where do you see Python going as a programming language?

        Strictly as a language, I don’t think Python will evolve too much. It’s already a good way to structure code and express algorithms. There will of course be improvements – especially the async parts are quite new and still have some rough corners – but I’m skeptical about any revolutionary additions.

        I think most improvements will come to the CPython implementation, not the language itself. I’m hopeful for projects like Pyjion and Gilectomy. I’m involved in a similar effort, making CPython’s subinterpreter more useful. Sadly, it’s currently stalled, but maybe I’ll be able to mentor it as a student project.

        What is your take on the current market for Python programmers?

        When I finished school, I had no idea I could actually get a job using Python. But it turns out there’s always demand for Python programmers. And I see new projects started in Python all the time. It doesn’t look like the demand is going away.

        Is there anything else you’d like to say?

        If you visit Czech Republic, look at http://pyvo.cz/en and visit one of our meetups!

        Thanks so much for doing the interview!


        Web Omelette: How to render your images with image styles in Drupal 8

        Planet Drupal - Mon, 2017-02-20 03:19

        In this article we are going to look at how we can render images using image styles in Drupal 8.

        In Drupal 7, rendering images with a particular style (say the default "thumbnail") was done by calling the theme_image_style() theme and passing the image uri and image style you want to render (+ some other optional parameters):

        $image = theme('image_style', array('style_name' => 'thumbnail', 'path' => 'public://my-image.png'));

        You'll see this pattern all over the place in Drupal 7 codebases.

        The theme prepares the URL for the image, runs the image through the style processors and returns a themed image (via theme_image()). The function it uses internally for preparing the url of the image is image_style_url() which returns the URL of the location where the image is stored after being prepared. It may not yet exist, but on the first request, it would get generated.

        So how do we do it in Drupal 8?

        First of all, image styles in Drupal 8 are configuration entities. This means they are created and exported like many other things. Second of all, in Drupal 8 we should no longer call theme functions directly like above. What we should do is always return render arrays and expect them to be rendered somewhere down the line. This helps with things like caching etc.

        So to render an image with a particular image style, we need to do the following:

        $render = [
          '#theme' => 'image_style',
          '#style_name' => 'thumbnail',
          '#uri' => 'public://my-image.png',
          // optional parameters
        ];

        This would render the image tag with the image having been processed by the style.

        Finally, if we just want the URL of an image with the image style applied, we need to load the image style config entity and ask it for the URL:

        $style = \Drupal::entityTypeManager()->getStorage('image_style')->load('thumbnail');
        $url = $style->buildUrl('public://my-image.png');

        So that is it. You now have the image URL which will generate the image upon the first request.

        Remember though to inject the entity type manager if you are in such a context that you can.


        foss-gbg on Wednesday

        Planet KDE - Mon, 2017-02-20 01:08

        If you happen to be in Gothenburg on Wednesday you are most welcome to visit foss-gbg. It is a free event (you still have to register so that we can arrange some light food) starting at 17.00.

        The topics are Yocto Linux on FPGA-based hardware, risk and license management in open source projects and a product release by the local start-up Zifra (an encryptable SD-card).

        More information and free tickets are available at the foss-gbg site.



        Full Stack Python: Creating SSH Keys on macOS Sierra

        Planet Python - Mon, 2017-02-20 00:00

        Deploying Python applications typically requires SSH keys. An SSH key has both a public and a private key file. You can use the private key to authenticate when syncing remote Git repositories, connecting to remote servers and automating your application's deployments via configuration management tools like Ansible. Let's learn how to generate SSH key pairs on macOS Sierra.

        Generating New Keys

        Bring up a new terminal window on macOS by going into Applications/Utilities and opening "Terminal".

        The ssh-keygen command provides an interactive command line interface for generating both the public and private keys. Invoke ssh-keygen with the following -t and -b arguments to ensure we get a 4096 bit RSA key. Note that you must use a key with 2048 or more bits in macOS Sierra or the system will not allow you to connect to servers with it.

        Optionally, you can also specify your email address with -C (otherwise one will be generated off your current macOS account):

        ssh-keygen -t rsa -b 4096 -C my.email.address@company.com

        The first prompt you will see asks where to save the key. However, there are actually two files that will be generated: the public key and the private key.

        Generating public/private rsa key pair.
        Enter file in which to save the key (/Users/matt/.ssh/id_rsa):

        This prompt refers to the private key and whatever you enter will also generate a second file for the public key that has the same name and .pub appended.

        If you already have a key then specify a new filename. I use many SSH keys so I often name them "test-deploy", "prod-deploy", "ci-server" along with a unique project name. Naming is one of those hard computer science problems, so take some time to come up with a system that works for you!

        Next you will see a prompt for an optional passphrase:

        Enter passphrase (empty for no passphrase):

        Whether or not you want a passphrase depends on how you will use the key. The system will ask you for the passphrase whenever you use the SSH key, although macOS can store the passphrase in your system Keychain after the first time you enter it. However, if you are automating deployments with a continuous integration server like Jenkins then you will not want a passphrase.

        Note that it is impossible to recover a passphrase if it is lost. Keep that passphrase safe and secure because otherwise a completely new key would have to be generated.

        Enter the passphrase (or just press enter to not have a passphrase) twice. You'll see some output like the following:

        Enter passphrase (empty for no passphrase):
        Enter same passphrase again:
        Your identification has been saved in /Users/matt/.ssh/deploy_prod.
        Your public key has been saved in /Users/matt/.ssh/deploy_prod.pub.
        The key fingerprint is:
        SHA256:UnRGH/nzYzxUFS9jjd0wOl1ScFGKgW3pU60sSxGnyHo matthew.makai@gmail.com
        The key's randomart image is:
        +---[RSA 4096]----+
        | ..+o++**@|
        | . +.o*O.@=|
        | . oo*=B.*|
        | . . =o=+ |
        | . S E. +oo |
        | . . . =.|
        | . o|
        | |
        | |
        +----[SHA256]-----+

        Your SSH key is ready to use!
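
If you create keys often, the same command can be driven from a short Python script. This is only a sketch: build_keygen_command and generate_key are names made up here, and the -f (output file) and -N (passphrase) flags are used in place of the interactive prompts. Passing a real passphrase with -N can expose it to other users in the process list, so the default below is an empty passphrase, as you would use for a CI key.

```python
import subprocess
from pathlib import Path


def build_keygen_command(key_path, comment, bits=4096, passphrase=""):
    """Return the ssh-keygen argument list for an RSA key of the given size."""
    return [
        "ssh-keygen",
        "-t", "rsa",
        "-b", str(bits),
        "-C", comment,
        "-f", str(key_path),  # private key file; the public key gets .pub appended
        "-N", passphrase,     # empty string means no passphrase
    ]


def generate_key(key_path, comment, bits=4096, passphrase=""):
    """Run ssh-keygen, refusing to overwrite an existing private key."""
    key_path = Path(key_path).expanduser()
    if key_path.exists():
        raise FileExistsError("refusing to overwrite {}".format(key_path))
    subprocess.run(
        build_keygen_command(key_path, comment, bits, passphrase),
        check=True,
    )
    return key_path, key_path.with_name(key_path.name + ".pub")


print(build_keygen_command("~/.ssh/test-deploy", "my.email.address@company.com"))
```

Calling generate_key("~/.ssh/test-deploy", "my.email.address@company.com") would then write the key pair without prompting, assuming ssh-keygen is on your PATH.
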

        What now?

        Now that you have your public and private keys, I recommend building and deploying some Python web apps such as:

        Additional ssh-keygen command resources:

        Questions? Contact me via Twitter @fullstackpython or @mattmakai. I'm also on GitHub with the username mattmakai.


        Russ Allbery: Haul via parents

        Planet Debian - Sun, 2017-02-19 21:39

        My parents were cleaning out a bunch of books they didn't want, so I grabbed some of the ones that looked interesting. A rather wide variety of random stuff. Also, a few more snap purchases on the Kindle even though I've not been actually finishing books recently. (I do have two finished and waiting for me to write reviews, at least.) Who knows when, if ever, I'll read these.

        Mark Ames — Going Postal (nonfiction)
        Catherine Asaro — The Misted Cliffs (sff)
        Ambrose Bierce — The Complete Short Stories of Ambrose Bierce (collection)
        E. William Brown — Perilous Waif (sff)
        Joseph Campbell — The Hero with a Thousand Faces (nonfiction)
        Jacqueline Carey — Miranda and Caliban (sff)
        Noam Chomsky — 9-11 (nonfiction)
        Noam Chomsky — The Common Good (nonfiction)
        Robert X. Cringely — Accidental Empires (nonfiction)
        Neil Gaiman — American Gods (sff)
        Neil Gaiman — Norse Mythology (sff)
        Stephen Gillett — World Building (nonfiction)
        Donald Harstad — Eleven Days (mystery)
        Donald Harstad — Known Dead (mystery)
        Donald Harstad — The Big Thaw (mystery)
        James Hilton — Lost Horizon (mainstream)
        Spencer Johnson — The Precious Present (nonfiction)
        Michael Lerner — The Politics of Meaning (nonfiction)
        C.S. Lewis — The Joyful Christian (nonfiction)
        Grigori Medvedev — The Truth about Chernobyl (nonfiction)
        Tom Nadeu — Seven Lean Years (nonfiction)
        Barack Obama — The Audacity of Hope (nonfiction)
        Ed Regis — Great Mambo Chicken and the Transhuman Condition (nonfiction)
        Fred Saberhagen — Berserker: Blue Death (sff)
        Al Sarrantonio (ed.) — Redshift (sff anthology)
        John Scalzi — Fuzzy Nation (sff)
        John Scalzi — The End of All Things (sff)
        Kristine Smith — Rules of Conflict (sff)
        Henry David Thoreau — Civil Disobedience and Other Essays (nonfiction)
        Alan W. Watts — The Book (nonfiction)
        Peter Whybrow — A Mood Apart (nonfiction)

        I've already read (and reviewed) American Gods, but didn't own a copy of it, and that seemed like a good book to have a copy of.

        The Carey and Brown were snap purchases, and I picked up a couple more Scalzi books in a recent sale.


        Norbert Preining: Ryu Murakami – Tokyo Decadence

        Planet Debian - Sun, 2017-02-19 21:08

        The other Murakami, Ryu Murakami (村上 龍), is hard to compare to the more famous Haruki. His collection of stories reflects the dark sides of Tokyo, far removed from the happy world of AKB48 and the like. Criminals, prostitutes, depression, loss. A bleak view of a bleak society.

        This collection of short stories is a consistent deconstruction of happiness, love, everything we believe to make our lives worthwhile. The protagonists are idealistic students losing their faith, office ladies on aberrations, drunkards, movie directors, the usual mixture. But the topic remains constant – the unfulfilled search for happiness and love.

        I felt I was beginning to understand what happiness is about. It isn’t about guzzling ten or twenty energy drinks a day, barreling down the highway for hours at a time, turning over your paycheck to your wife without even opening the envelope, and trying to force your family to respect you. Happiness is based on secrets and lies.
        Ryu Murakami, It all started just about a year and a half ago

        A deeply pessimistic undertone echoes through these stories, and the atmosphere and writing remind one of Charles Bukowski. This pessimism resonates in the melancholy of the stories' running theme, Cuban music. Murakami was active in disseminating Cuban music in Japan, which included founding his own label. Javier Olmo’s pieces are often the connecting parts, as well as lending the short stories their titles: Historia de un amor, Se fué.

        The belief – that what’s missing now used to be available to us – is just an illusion, if you ask me. But the social pressure of “You’ve got everything you need, what’s your problem?” is more powerful than you might ever think, and it’s hard to defend yourself against it. In this country it’s taboo even to think about looking for something more in life.
        Ryu Murakami, Historia de un amor

        It is interesting to see that on the surface, the women in the stories are the broken characters, leading feminists to incredible rants about the book, see the rant^Wreview of Blake Fraina at Goodreads:

        I’ll start by saying that, as a feminist, I’m deeply suspicious of male writers who obsess over the sex lives of women and, further, have the audacity to write from a female viewpoint…
        …female characters are pretty much all pathetic victims of the male characters…
        I wish there was absolutely no market for stuff like this and I particularly discourage women readers from buying it…
        Blake Fraina, Goodreads review

        At first sight it might look as if the female characters are pretty much all pathetic victims of the male characters, but in fact it is the other way round: the desperate characters, the slaves of their own desperation, are the men, and not the women, in these stories. It is the dual of the situation in Hitomi Kanehara’s Snakes and Earrings, where at first sight the tattooist and the outlaw friends are the broken characters, but the really cracked one is the sweet Tokyo girly.

        Male-female relationships are always in transition. If there’s no forward progress, things tend to slip backwards.
        Ryu Murakami, Se fué

        Final verdict: Great reading, hard to put down, very much readable and enjoyable, if one is in the mood of dark and depressing stories. And last but not least, don’t trust feminist book reviews.


        Carl Trachte: Filling in Missing Grouping Columns of MSSQL SSRS Report Dumped to Excel

        Planet Python - Sun, 2017-02-19 19:34
        This is another simple but common problem in certain business environments:

        1) Data are presented via a Microsoft SQL Server Reporting Services report, BUT

        2) The user wants the data in Excel, and, further, wants to play with it (pivot, etc.) there.  The problem is that the grouping column labels are not in every record, only in the one row that begins the list of records for that group (sanitized screenshot below):

        But I don't WANT to copy and paste all those groupings for 30,000 records :*-(

        I had this assignment recently from a remote request.  It took about four rounds of an e-mail exchange to figure out that it really wasn't a data problem, but a formatting one that needed solving.

        It is possible to do the whole thing in Python.  I did the Excel part by hand in order to get a handle on the data:

        1) In Excel, delete the extra rows on top of the report leaving just the headers and the data.

        2) In Excel, select everything on the data page, format the cells correctly by unselecting the Merge Cells and Wraparound options.

        3) In Excel, at this point you should be able to see if there are extra empty columns as space fillers; delete them.  Save the worksheet as a csv file.

        4) In a text editor, open your csv file, identify any empty rows, and delete them.  Change column header names as desired.

        Now the Python part:


        """
        Doctor csv dump from unmerged cell
        dump of SSRS dump from MSSQL database.

        Fill in cell gaps where merged
        cells had only one grouping value
        so that all rows are complete records.
        """

        import pprint

        COMMA = ','
        EMPTY = ''

        INFILE = 'rawdata.csv'
        OUTFILE = 'canneddumpfixed.csv'

        ERRORFLAG = 'ERROR!'

        f = open(INFILE, 'r')
        headerline = next(f)
        numbercolumns = len(headerline.split(COMMA))

        f2 = open(OUTFILE, 'w')
        # Copy the header row through unchanged.
        f2.write(headerline)

        # Assume at least one data column on far right.
        missingvalues = (numbercolumns - 1) * [ERRORFLAG]

        for linex in f:
            print('Processing line {:s} . . .'.format(linex))
            splitrecord = linex.split(COMMA)
            for slotx in range(0, numbercolumns - 1):
                if splitrecord[slotx] != EMPTY:
                    # New grouping label: remember it for the rows below.
                    missingvalues[slotx] = splitrecord[slotx]
                else:
                    # Blank cell: fill in the last label seen in this column.
                    splitrecord[slotx] = missingvalues[slotx]
            f2.write(COMMA.join(splitrecord))

        f.close()
        f2.close()

        At this point you've got your data in csv format - you can open it in Excel and go to work.
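
If any of the cells contain commas inside quotes, splitting lines on ',' will miscount the columns. A variant using the standard library csv module is sketched below. The function name fill_group_columns and the sample column names (Region, Site, Value) are invented for illustration, and as above it assumes that every column except the last one is a grouping column.

```python
import csv
import io


def fill_group_columns(rows, group_columns):
    """Yield rows with blank grouping cells filled from the row above."""
    last_seen = [None] * group_columns
    for row in rows:
        filled = list(row)
        for i in range(group_columns):
            if filled[i] != "":
                # New label: remember it for the following rows.
                last_seen[i] = filled[i]
            elif last_seen[i] is not None:
                # Blank cell: reuse the most recent label in this column.
                filled[i] = last_seen[i]
        yield filled


# Made-up sample data standing in for the cleaned-up csv export.
sample = (
    'Region,Site,Value\n'
    'North,"Pit A, East",10\n'
    ',,12\n'
    ',Pit B,7\n'
    'South,Pit C,3\n'
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)
filled_rows = list(fill_group_columns(reader, group_columns=len(header) - 1))

for row in filled_rows:
    print(row)
```

To write the result back out, csv.writer(...).writerows(filled_rows) takes care of quoting again.
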

        There may be a free or COTS (commercial off the shelf) utility that does all this somewhere in the Microsoft "ecosystem" (I think that's their fancy enviro-friendly word for vendor-user community) but I don't know of one.

        Thanks for stopping by.


        Matthew Rocklin: Dask Development Log

        Planet Python - Sun, 2017-02-19 19:00

        This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

        To increase transparency I’m blogging weekly(ish) about the work done on Dask and related projects during the previous week. This log covers work done between 2017-02-01 and 2017-02-20. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

        Themes of the last couple of weeks:

        1. Profiling experiments with Dask-GLM
        2. Subsequent graph optimizations, both non-linear fusion and avoiding repeatedly creating new graphs
        3. Tensorflow and Keras experiments
        4. XGBoost experiments
        5. Dask tutorial refactor
        6. Google Cloud Storage support
        7. Cleanup of Dask + SKLearn project

        Dask-GLM and iterative algorithms

        Dask-GLM is currently just a bunch of solvers like Newton, Gradient Descent, BFGS, Proximal Gradient Descent, and ADMM. These are useful in solving problems like logistic regression, but also several others. The mathematical side of this work is mostly done by Chris White and Hussain Sultan at Capital One.

        We’ve been using this project also to see how Dask can scale out machine learning algorithms. To this end we ran a few benchmarks here: https://github.com/dask/dask-glm/issues/26 . This just generates and solves some random problems, but at larger scales.

        What we found is that some algorithms, like ADMM, perform beautifully, while for others, like gradient descent, scheduler overhead can become a substantial bottleneck at scale. This is mostly just because the actual in-memory NumPy operations are so fast; any sluggishness on Dask’s part becomes very apparent. Here is a profile of gradient descent:

        Notice all the white space. This is Dask figuring out what to do during different iterations. We’re now working to bring this down to make all of the colored parts of this graph squeeze together better. This will result in general overhead improvements throughout the project.

        Graph Optimizations - Aggressive Fusion

        We’re approaching this in two ways:

        1. More aggressively fuse tasks together so that there are fewer blocks for the scheduler to think about
        2. Avoid repeated work when generating very similar graphs

        In the first case, Dask already does standard task fusion. For example, if you have the following two tasks:

        x = f(w)
        y = g(x)
        z = h(y)

        Dask (along with every other compiler-like project since the 1980’s) already turns this into the following:

        z = h(g(f(w)))

        What’s tricky with a lot of these mathematical or optimization algorithms though is that they are mostly, but not entirely linear. Consider the following example:

        y = exp(x) - 1/x

        Visualized as a node-link diagram, this graph looks like a diamond like the following:

                 o  exp(x) - 1/x
                / \
        exp(x) o   o   1/x
                \ /
                 o  x

        Graphs like this generally don’t get fused together because we could compute both exp(x) and 1/x in parallel. However when we’re bound by scheduling overhead and when we have plenty of parallel work to do, we’d prefer to fuse these into a single task, even though we lose some potential parallelism. There is a tradeoff here and we’d like to be able to exchange some parallelism (of which we have a lot) for less overhead.
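
To make the tradeoff concrete, here is a toy sketch in plain Python. It does not use Dask's optimizer. It just writes the diamond by hand as a dictionary of tasks, in roughly the (function, arguments...) tuple style that Dask graphs use, next to a hand-fused version. The helper names evaluate and count_tasks are made up for this sketch.

```python
import math


def evaluate(graph, key):
    """Recursively compute one key of a tiny dict-of-tasks graph."""
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        resolved = [
            evaluate(graph, a) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        return func(*resolved)
    return task  # plain data, not a task


def count_tasks(graph):
    """Number of entries that a scheduler would have to run as tasks."""
    return sum(
        1 for t in graph.values()
        if isinstance(t, tuple) and t and callable(t[0])
    )


# Unfused diamond: exp(x) and 1/x could run in parallel, at the cost of
# three separate tasks for the scheduler to track.
diamond = {
    "x": 2.0,
    "exp": (math.exp, "x"),
    "inv": (lambda v: 1 / v, "x"),
    "y": (lambda a, b: a - b, "exp", "inv"),
}

# Fused by hand: one task, no parallelism inside it, less bookkeeping.
fused = {
    "x": 2.0,
    "y": (lambda v: math.exp(v) - 1 / v, "x"),
}

print(count_tasks(diamond), count_tasks(fused))
print(math.isclose(evaluate(diamond, "y"), evaluate(fused, "y")))
```

Both graphs compute the same value. The fused one simply gives the scheduler fewer things to think about, which is the exchange described above.
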

        PR here dask/dask #1979 by Erik Welch (Erik has written and maintained most of Dask’s graph optimizations).

        Graph Optimizations - Structural Sharing

        Additionally, we no longer make copies of graphs in dask.array. Every collection like a dask.array or dask.dataframe holds onto a Python dictionary holding all of the tasks that are needed to construct that array. When we perform an operation on a dask.array we get a new dask.array with a new dictionary pointing to a new graph. The new graph generally has all of the tasks of the old graph, plus a few more. As a result, we frequently make copies of the underlying task graph.

        y = (x + 1)
        assert set(y.dask).issuperset(x.dask)

        Normally this doesn’t matter (copying graphs is usually cheap) but it can become very expensive for large arrays when you’re doing many mathematical operations.

        Now we keep dask graphs in a custom mapping (dict-like object) that shares subgraphs with other arrays. As a result, we rarely make unnecessary copies and some algorithms incur far less overhead. Work done in dask/dask #1985.
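
The general idea of sharing instead of copying can be sketched with collections.ChainMap from the standard library. This is only an analogy and not the mapping Dask actually uses; add_layer and the task tuples below are invented for illustration.

```python
import operator
from collections import ChainMap

# Pretend this is the (large) graph behind an existing array x.
x_graph = {("x", i): ("load-block", i) for i in range(1000)}


def add_layer(graph, new_tasks):
    """Return a mapping that holds new_tasks and refers to graph without copying it."""
    return ChainMap(new_tasks, graph)


# y = x + 1: one new task per block, each pointing back at a task of x.
y_tasks = {("y", i): (operator.add, ("x", i), 1) for i in range(1000)}
y_graph = add_layer(x_graph, y_tasks)

print(set(y_graph).issuperset(x_graph))  # y still contains all of x's tasks
print(y_graph.maps[-1] is x_graph)       # and the old graph object is shared
```

Building y_graph this way costs time proportional to the new tasks only. With a plain dict copy it would be proportional to the whole graph on every operation.
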

        TensorFlow and Keras experiments

        Two weeks ago I gave a talk with Stan Seibert (Numba developer) on Deep Learning (Stan’s bit) and Dask (my bit). As part of that talk I decided to launch tensorflow from Dask and feed Tensorflow from a distributed Dask array. See this blogpost for more information.

        That experiment was nice in that it showed how easy it is to deploy and interact with other distributed services from Dask. However from a deep learning perspective it was immature. Fortunately, it succeeded in attracting the attention of other potential developers (the true goal of all blogposts) and now Brett Naul is using Dask to manage his GPU workloads with Keras. Brett contributed code to help Dask move around Keras models. He seems to particularly value Dask’s ability to manage resources to help him fully saturate the GPUs on his workstation.

        XGBoost experiments

        After deploying Tensorflow we asked what would it take to do the same for XGBoost, another very popular (though very different) machine learning library. The conversation for that is here: dmlc/xgboost #2032 with prototype code here mrocklin/dask-xgboost. As with TensorFlow, the integration is relatively straightforward (if perhaps a bit simpler in this case). The challenge for me is that I have little concrete experience with the applications that these libraries were designed to solve. Feedback and collaboration from open source developers who use these libraries in production is welcome.

        Dask tutorial refactor

        The dask/dask-tutorial project on Github was originally written for PyData Seattle in July 2015 (roughly 19 months ago). Dask has evolved substantially since then but this is still our only educational material. Fortunately Martin Durant is doing a pretty serious rewrite, both correcting parts that are no longer modern API, and also adding in new material around distributed computing and debugging.

        Google Cloud Storage

        Dask developers (mostly Martin) maintain libraries to help Python users connect to distributed file systems like HDFS (with hdfs3), S3 (with s3fs), and Azure Data Lake (with adlfs), which subsequently become usable from Dask. Martin has been working on support for Google Cloud Storage (with gcsfs) with another small project that uses the same API.

        Cleanup of Dask+SKLearn project

        Last year Jim Crist published three great blogposts about using Dask with SKLearn. The result was a small library dask-learn that had a variety of features, some incredibly useful, like a cluster-ready Pipeline and GridSearchCV, others less so. Because of the experimental nature of this work we had labeled the library “not ready for use”, which drew some curious responses from potential users.

        Jim is now busy dusting off the project, removing less-useful parts and generally reducing scope to strictly model-parallel algorithms.
