FLOSS Project Planets

Reinout van Rees: Fossgis: sewer cadastre with qgis - Jörg Höttges

Planet Python - Thu, 2017-03-23 06:24

(One of my summaries of a talk at the 2017 fossgis conference).

Together with engineering firms from the Aachen region they created qkan. Qkan is:

  • A data structure.
  • Plugins for Qgis.
  • Direct access. Not a specific application with restricted access, but unrestricted access from within Qgis. (He noticed lots of interest among the engineers to learn qgis during the project!)

It has been designed for the needs of the engineers that have to work with the data. You first import the data from the local sewer database. Qkan converts the data to what it needs. Then you can do simulations in a separate package. The results of the simulation will be visualized by Qkan in qgis. Afterwards you probably have to make some corrections to the data and feed those corrections back to the original database. Often you have to go look at the actual sewers to make sure the database is correct. The output is often a map of the sewer system.

Some functionality: import sewer data (in various formats). Simulate water levels. Draw graphs of the water levels in a sewer. Support database-level checks ("an end node cannot occur halfway along a sewer").

They took care to make the database schema simple. The source sewer database is always very complex because it has to hold lots of metadata. The engineer that has to work with it needs a much simpler schema in order to be productive. Qkan does this.

They used qgis, spatialite, postgis, python and qt (for forms). An important note: they used as much postgis functionality as possible instead of the geographical functions from qgis; the reason is that postgis (and even spatialite) is often much quicker.

With qgis, python and the "qt designer", you can make lots of handy forms. But you can always go back to the database that's underneath it.

The code is at https://github.com/hoettges

Categories: FLOSS Project Planets

CubicWeb: Introducing cubicweb-jsonschema

Planet Python - Thu, 2017-03-23 05:57

This is the first post of a series introducing the cubicweb-jsonschema project that is currently under development at Logilab. In this post, I'll first introduce the general goals of the project and then present in more detail two aspects concerning data models (the connection between Yams and JSON Schema in particular) and the basic features of the API. This post does not always present how things work in the current implementation but rather how they should work.

Goals of cubicweb-jsonschema

From a high level point of view, cubicweb-jsonschema addresses mainly two interconnected aspects. One relates to modelling for client-side development of user interfaces to CubicWeb applications, while the other concerns the HTTP API.

As far as modelling is concerned, cubicweb-jsonschema essentially aims at providing a transformation mechanism between a Yams schema and JSON Schema that is both automatic and extensible. This means that we can ultimately expect Yams definitions alone to be sufficient to generate JSON schema definitions that would be consistent enough to build a UI, pretty much as it is currently with the automatic web UI in CubicWeb. A corollary of this goal is that we want JSON schema definitions to match their context of usage, meaning that a JSON schema definition would not be the same in the context of viewing, editing or relationships manipulations.

In terms of API, cubicweb-jsonschema essentially aims at providing an HTTP API to manipulate entities based on their JSON Schema definitions.

Finally, the overall goal is to expose a hypermedia API for a CubicWeb application in order to ultimately be able to build an intelligent client. For this we'll build upon the JSON Hyper-Schema specification. This aspect will be discussed in a later post.

Basic usage as an HTTP API library

Consider a simple case where one wants to manipulate entities of type Author described by the following Yams schema definition:

class Author(EntityType):
    name = String(required=True)

With cubicweb-jsonschema one can get a JSON Schema for this entity type in different contexts, such as view, creation or edition. For instance:

  • in a view context, the JSON Schema will be:

    {
      "$ref": "#/definitions/Author",
      "definitions": {
        "Author": {
          "additionalProperties": false,
          "properties": {
            "name": {"title": "name", "type": "string"}
          },
          "title": "Author",
          "type": "object"
        }
      }
    }
  • whereas in creation context, it'll be:

    {
      "$ref": "#/definitions/Author",
      "definitions": {
        "Author": {
          "additionalProperties": false,
          "properties": {
            "name": {"title": "name", "type": "string"}
          },
          "required": ["name"],
          "title": "Author",
          "type": "object"
        }
      }
    }

    (notice the required keyword listing the name property; a quick validation check against this schema is sketched just below).
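
As a quick illustration of what that difference means for a client, the creation-context document can be checked with the standalone jsonschema Python library (this snippet is not part of cubicweb-jsonschema; the payloads are made up for the example):

import jsonschema

creation_schema = {
    "$ref": "#/definitions/Author",
    "definitions": {
        "Author": {
            "additionalProperties": False,
            "properties": {"name": {"title": "name", "type": "string"}},
            "required": ["name"],
            "title": "Author",
            "type": "object",
        }
    },
}

# A valid payload passes silently.
jsonschema.validate({"name": "Victor Hugo"}, creation_schema)

# An empty payload violates the "required" constraint.
try:
    jsonschema.validate({}, creation_schema)
except jsonschema.ValidationError as exc:
    print(exc.message)  # 'name' is a required property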

Such JSON Schema definitions are automatically generated from Yams definitions. In addition, cubicweb-jsonschema exposes some endpoints for basic CRUD operations on resources through an HTTP (JSON) API. From the client point of view, requests on these endpoints are of course expected to match JSON Schema definitions. Some examples:

Get an author resource:

GET /author/855
Accept: application/json

HTTP/1.1 200 OK
Content-Type: application/json

{"name": "Ernest Hemingway"}

Update an author:

PATCH /author/855
Accept: application/json
Content-Type: application/json

{"name": "Ernest Miller Hemingway"}

HTTP/1.1 200 OK
Location: /author/855/
Content-Type: application/json

{"name": "Ernest Miller Hemingway"}

Create an author:

POST /author
Accept: application/json
Content-Type: application/json

{"name": "Victor Hugo"}

HTTP/1.1 201 Created
Content-Type: application/json
Location: /Author/858

{"name": "Victor Hugo"}

Delete an author:

DELETE /author/858

HTTP/1.1 204 No Content
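
For completeness, here is how these exchanges might look from a Python client using the requests library (a minimal sketch; the base URL, port and reuse of the Location header are assumptions, not something specified by cubicweb-jsonschema):

import requests

BASE = "http://localhost:8080"  # assumed development server address
headers = {"Accept": "application/json"}

# Create an author and remember where it lives.
resp = requests.post(BASE + "/author", json={"name": "Victor Hugo"}, headers=headers)
resp.raise_for_status()
author_path = resp.headers["Location"]  # e.g. /Author/858

# Read it back, update it, then delete it.
print(requests.get(BASE + author_path, headers=headers).json())
requests.patch(BASE + author_path, json={"name": "Victor-Marie Hugo"}, headers=headers)
requests.delete(BASE + author_path, headers=headers)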

Now if the client sends invalid input with respect to the schema, they'll get an error:

(We provide a wrong born property in request body.)

PATCH /author/855
Accept: application/json
Content-Type: application/json

{"born": "1899-07-21"}

HTTP/1.1 400 Bad Request
Content-Type: application/json

{
  "errors": [
    {
      "details": "Additional properties are not allowed ('born' was unexpected)",
      "status": 422
    }
  ]
}

From Yams model to JSON Schema definitions

The example above illustrates the automatic generation of JSON Schema documents based on Yams schema definitions. These documents are expected to help developing views and forms for a web client. Clearly, we expect that cubicweb-jsonschema serves JSON Schema documents for viewing and editing entities just as cubicweb.web serves HTML documents for the same purposes. The underlying logic for JSON Schema generation is currently heavily inspired by the logic of the primary view and the automatic entity form as they exist in cubicweb.web.views. That is: the Yams schema is introspected to determine how properties should be generated, and any additional control over this can be performed through uicfg declarations [1].

To illustrate, let's consider the following schema definitions:

class Book(EntityType):
    title = String(required=True)
    publication_date = Datetime(required=True)

class Illustration(EntityType):
    data = Bytes(required=True)

class illustrates(RelationDefinition):
    subject = 'Illustration'
    object = 'Book'
    cardinality = '1*'
    composite = 'object'
    inlined = True

class Author(EntityType):
    name = String(required=True)

class author(RelationDefinition):
    subject = 'Book'
    object = 'Author'
    cardinality = '1*'

class Topic(EntityType):
    name = String(required=True)

class topics(RelationDefinition):
    subject = 'Book'
    object = 'Topic'
    cardinality = '**'

and consider, as before, JSON Schema documents in different contexts for the Book entity type:

  • in view context:

    {
      "$ref": "#/definitions/Book",
      "definitions": {
        "Book": {
          "additionalProperties": false,
          "properties": {
            "author": {
              "items": {"type": "string"},
              "title": "author",
              "type": "array"
            },
            "publication_date": {
              "format": "date-time",
              "title": "publication_date",
              "type": "string"
            },
            "title": {"title": "title", "type": "string"},
            "topics": {
              "items": {"type": "string"},
              "title": "topics",
              "type": "array"
            }
          },
          "title": "Book",
          "type": "object"
        }
      }
    }

    We have a single Book definition in this document, in which we find the attributes defined in the Yams schema (title and publication_date). We also find the two relations where Book is involved: topics and author, each appearing as an array of "string" items. The author relationship appears like that because it is mandatory but not composite. On the other hand, the topics relationship has the following uicfg rule:

    uicfg.primaryview_section.tag_subject_of(('Book', 'topics', '*'), 'attributes')

    so that its definition appears embedded in the Book definition document.

    A typical JSON representation of a Book entity would be:

    {
      "author": ["Ernest Miller Hemingway"],
      "title": "The Old Man and the Sea",
      "topics": ["sword fish", "cuba"]
    }
  • in creation context:

    {
      "$ref": "#/definitions/Book",
      "definitions": {
        "Book": {
          "additionalProperties": false,
          "properties": {
            "author": {
              "items": {
                "oneOf": [
                  {"enum": ["855"], "title": "Ernest Miller Hemingway"},
                  {"enum": ["857"], "title": "Victor Hugo"}
                ],
                "type": "string"
              },
              "maxItems": 1,
              "minItems": 1,
              "title": "author",
              "type": "array"
            },
            "publication_date": {
              "format": "date-time",
              "title": "publication_date",
              "type": "string"
            },
            "title": {"title": "title", "type": "string"}
          },
          "required": ["title", "publication_date"],
          "title": "Book",
          "type": "object"
        }
      }
    }

    Notice the differences: we now only have attributes and required relationships (author) in this schema, and the required keyword lists the mandatory attributes; the author property is represented as an array whose items consist of pre-existing targets of the author relationship (namely Author entities).

    Now assume we add the following uicfg declaration:

    uicfg.autoform_section.tag_object_of(('*', 'illustrates', 'Book'), 'main', 'inlined')

    the JSON Schema for creation context will be:

    {
      "$ref": "#/definitions/Book",
      "definitions": {
        "Book": {
          "additionalProperties": false,
          "properties": {
            "author": {
              "items": {
                "oneOf": [
                  {"enum": ["855"], "title": "Ernest Miller Hemingway"},
                  {"enum": ["857"], "title": "Victor Hugo"}
                ],
                "type": "string"
              },
              "maxItems": 1,
              "minItems": 1,
              "title": "author",
              "type": "array"
            },
            "illustrates": {
              "items": {"$ref": "#/definitions/Illustration"},
              "title": "illustrates_object",
              "type": "array"
            },
            "publication_date": {
              "format": "date-time",
              "title": "publication_date",
              "type": "string"
            },
            "title": {"title": "title", "type": "string"}
          },
          "required": ["title", "publication_date"],
          "title": "Book",
          "type": "object"
        },
        "Illustration": {
          "additionalProperties": false,
          "properties": {
            "data": {"format": "data-url", "title": "data", "type": "string"}
          },
          "required": ["data"],
          "title": "Illustration",
          "type": "object"
        }
      }
    }

    We now have an additional illustrates property modelled as an array of #/definitions/Illustration, the latter also being added to the document as an additional definition entry.

Conclusion

This post illustrated how a basic (CRUD) HTTP API based on JSON Schema could be built for a CubicWeb application using cubicweb-jsonschema. We have seen a couple of details on JSON Schema generation and how it can be controlled. Feel free to comment and provide feedback on this feature set, as well as to open the discussion with more use cases.

Next time, we'll discuss how hypermedia controls can be added to the HTTP API that cubicweb-jsonschema provides.

[1] This choice is essentially driven by simplicity and conformance with the existing behavior, to help migration of existing applications.
Categories: FLOSS Project Planets

Reinout van Rees: Fossgis: creating maps with open street map in QGis - Axel Heinemann

Planet Python - Thu, 2017-03-23 05:55

(One of my summaries of a talk at the 2017 fossgis conference).

He wanted to make a map for a local run: a nice map with the route and the infrastructure (start, end, parking, etc.), instead of the usual not-quite-readable city plan with a simple line on top. With qgis and openstreetmap he should be able to make something better!

A quick try with QGis, combined with the standard openstreetmap base map, already looked quite nice, but he wanted to customize the map colors further. So he needed to download the openstreetmap data. That turned into quite a challenge. He tried two plugins:

  • OSMDownloader: easy selection, quick download. Drawback: too many objects as you cannot filter. The attribute table is hard to read.
  • QuickOSM: key/value selection, quick. Drawback: you need a bit of experience with the tool, as it is easy to forget key/values.

He then landed on https://overpass-turbo.eu . The user interface is very friendly. There is a wizard to get common cases done. And you can browse the available tags.

With the data downloaded with overpass-turbo, he could easily adjust colors and get a much nicer map out of it.

You can get it to work, but it takes a lot of custom work.

Some useful links:

  • https://taginfo.openstreetmap.org
  • http://tagfinder.herokuapp.com
  • https://gis.stackexchange.com

Photo explanation: just a nice unrelated picture from the recent beautiful 'on traxs' model railway exhibition (see video).

Categories: FLOSS Project Planets

Reinout van Rees: Fossgis: introduction on some open source software packages

Planet Python - Thu, 2017-03-23 05:55

(One of my summaries of a talk at the 2017 fossgis conference).

The conference started with a quick introduction on several open source programs.

Openlayers 3 - Marc Jansen

Marc works on both openlayers and GeoExt. Openlayers is a javascript library with lots and lots of features.

To see what it can do, look at the 161 examples on the website :-) It works with both vector layers and raster layers.

Openlayers is a quite mature project: the first version is from 2006. It changed a lot to keep up with the state of the art, but they did take care to keep everything backwards compatible. Upgrading from 2.0 to 2.2 should have been relatively easy. The 4.0.0 version came out last month.

Openlayers...

  • Allows many different data sources and layer types.
  • Has built-in interaction and controls.
  • Is very actively developed.
  • Is well documented and has lots of examples.

The aim is to be easy to start with, but also to allow full control of your map and all sorts of customization.

Geoserver - Marc Jansen

(Again Marc: someone was sick...)

Geoserver is a java-based server for geographical data. It supports lots of OGC standards (WMS, WFS, WPS, etc). Flexible, extensible, well documented. "Geoserver is a glorious example that you can write very performant software in java".

Geoserver can connect to many different data sources and make those sources available as map data.

If you're a government agency, you're required to make INSPIRE metadata available for your maps: geoserver can help you with that.

A big advantage of geoserver: it has a browser-based interface for configuring it. You can do 99% of your configuration work in the browser. For maintenance, there is monitoring to keep an eye on it.

Something to look at: the importer plugin. With it you get a REST API to upload shapes, for instance.

The latest version also supports LDAP groups. LDAP was already supported, but group membership not yet.

Mapproxy - Dominik Helle

Dominik is one of the MapProxy developers. Mapproxy is a WMS cache and tile cache. The original goal was to make maps quicker by caching maps.

Some possible sources: WMS, WMTS, tiles (google/bing/etc), MapServer. The output can be WMS, WMS-C, WMTS, TMS, KML. So the input could be google maps and the output WMS. One of their customers combines the output of five different regional organisations into one WMS layer...

The maps that mapproxy returns can be stored on a local disk in order to improve performance. The way they store them allows mapproxy to support intermediary zoom levels instead of fixed ones.

The cache can be in various formats: MBTiles, sqlite, couchdb, riak, arcgis compact cache, redis, s3. The cache is efficient by combining layers and by omitting unneeded data (empty tiles).

You can pre-fill the cache ("seeding").

Some other possibilities, apart from caching:

  • A nice feature: clipping. You can limit a source map to a specific area.
  • Reprojecting from one coordinate system to another. Very handy if someone else doesn't want to support the coordinate system that you need.
  • WMS feature info: you can just pass it on to the backend system, but you can also intercept and change it.
  • Protection of your layers. Password protection. Protect specific layers. Only allow specific areas. Etcetera.
QGis - Thomas Schüttenberg

QGis is an open source GIS platform. Desktop, server, browser, mobile. And it is a library. It runs on osx, linux, windows, android. The base is the Qt UI library, hence the name.

Qgis contains almost everything you'd expect from a GIS package. You can extend it with plugins.

Qgis is a very, very active project. Almost 1 million lines of code. 30.000+ github commits. 332 developers have worked on it, 104 of them in the last 12 months.

Support via documentation, mailinglists and http://gis.stackexchange.com/ . In case you're wondering about the names of the releases: they come from the towns where the twice-a-year project meeting takes place :-)

Since december 2016, there's an official (legal) association.

QGis 3 will have lots of changes: QT 5 and python 3.

Mapbender 3 - Astrid Emde

Mapbender is a library to build webgis applications. Ideally, you don't need to write any code yourself; you configure it in your browser instead. It also supports mobile usage.

You can try it at http://demo.mapbender3.org/ . Examples are at http://mapbender3.org/?q=en/gallery .

You can choose a layout and fill in and configure the various parts. Layers you want to show: add sources. You can configure security/access with roles.

An example component: a search form for addresses that looks up addresses with sql or a web service. Such a search form can be a popup or you can put it in the sidebar, for instance. CSS can be customized.

PostNAS - Astrid Emde, Jelto Buurman

The postnas project is a solution for importing ALKIS data, a data exchange format for the German cadastre (in German: Kataster).

PostNAS is an extension of the GDAL library for the "NAS" vector data format (NAS = normalisierte Austausch Schnittstelle, "normalized exchange interface"). This way, you can use all of the gdal functionality with the cadastre data. But that's not the only thing: there's also a qgis plugin. There are configuration and conversion scripts for postgis, mapbender, mapserver, etc.

They needed postprocessing/conversion scripts to get useful database tables out of the original data, tables that are usable for showing in QGis, for instance.

So... basically a complete open source environment for working with the cadastre data!

Photo explanation: just a nice unrelated picture from the recent beautiful 'on traxs' model railway exhibition (see video).

Categories: FLOSS Project Planets

Tomasz Früboes: Unittesting print statements

Planet Python - Thu, 2017-03-23 05:10

Recently I was refactoring a small package that is supposed to allow execution of arbitrary python code on a remote machine. The first implementation was working nicely but with one serious drawback – the function handling the actual code execution was running in a synchronous (blocking) mode. As a result, all of the output (both stdout and stderr) was presented only at the end, i.e. when the code finished its execution. This was unacceptable since the package should work in a way that is as transparent to the user as possible, so a wall of text appearing only when the code completes its task wouldn't do.

The goal of the refactoring was simple – to have the output presented to the user immediately after it was printed on the remote host. As a TDD worshipper I wanted to start this in a kosher way, i.e. with a test. And I got stuck.

For a day or so I had no idea how to progress. How do you unittest the print statements? It’s funny when I think about this now. I have used a similar technique many times in the past for output redirection, yet somehow haven’t managed to make a connection with this problem.

The print statement

So how do you do it? First we should understand what happens when a print statement is executed. In python 2.x the print statement does two things – it converts the provided expressions into strings and writes the result to a file-like object handling stdout. Conveniently, that object is available as sys.stdout (i.e. as a part of the sys module). So all you have to do is to overwrite sys.stdout with your own object providing a 'write' method. Later you may discover that some other methods may also be needed (e.g. 'flush' is quite often used), but for starters, having only the 'write' method should be sufficient.

A first try – simple stdout interceptor

The code below does just that. The MyOutput class is designed to replace the original sys.stdout:

import unittest
import sys


def fn_print(nrepeat):
    print "ab" * nrepeat


class MyTest(unittest.TestCase):
    def test_stdout(self):
        class MyOutput(object):
            def __init__(self):
                self.data = []

            def write(self, s):
                self.data.append(s)

            def __str__(self):
                return "".join(self.data)

        stdout_org = sys.stdout
        my_stdout = MyOutput()
        try:
            sys.stdout = my_stdout
            fn_print(2)
        finally:
            sys.stdout = stdout_org

        self.assertEquals(str(my_stdout), "abab\n")


if __name__ == "__main__":
    unittest.main()

The fn_print function provides output to test against. After replacing sys.stdout we call this function and compare the obtained output with the expected one. It is worth noting that in the example above the original sys.stdout is first preserved and then carefully restored inside the 'finally' block. If you don't do this you are likely to lose any output coming from other tests.

Is my code async? Logging time of arrival

In the second example we will address the original problem – is the output presented as a wall of text at the end, or in real time as we want? For this we will add a time-of-arrival logging capability to the object replacing sys.stdout:

import unittest
import time
import sys


def fn_print_with_delay(nrepeat):
    for i in xrange(nrepeat):
        print  # prints a single newline
        time.sleep(0.5)


class TestServer(unittest.TestCase):
    def test_stdout_time(self):
        class TimeLoggingOutput(object):
            def __init__(self):
                self.data = []
                self.timestamps = []

            def write(self, s):
                self.timestamps.append(time.time())
                self.data.append(s)

        stdout_org = sys.stdout
        my_stdout = TimeLoggingOutput()
        nrep = 3  # make sure is >1
        try:
            sys.stdout = my_stdout
            fn_print_with_delay(nrep)
        finally:
            sys.stdout = stdout_org

        for i in xrange(nrep):
            if i > 0:
                dt = my_stdout.timestamps[i] - my_stdout.timestamps[i - 1]
                self.assertTrue(0.5 < dt < 0.52)


if __name__ == "__main__":
    unittest.main()

The code is pretty much self-explanatory – the fn_print_with_delay function prints newlines at half-second intervals. We override sys.stdout with an instance of a class capable of storing timestamps (obtained with time.time()) of all calls to the write method. At the end we assert that the timestamps are spaced approximately half a second apart. The code above works as expected:

.
----------------------------------------------------------------------
Ran 1 test in 1.502s

OK

If we change the interval inside the fn_print_with_delay function to one second, the test will (fortunately) fail.

Wrap-up

As we saw, testing for expected output is in fact trivial – all you have to do is to put an instance of a class with a 'write' method in the proper place (i.e. sys.stdout). The only 'gotcha' is the cleanup – you should remember to restore sys.stdout to its original state. You may apply the exact same technique if you need to test stderr (just target sys.stderr instead of sys.stdout). It is also worth noting that using a similar technique you could intercept (or completely silence) output coming from external libraries.
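
One way to avoid the cleanup 'gotcha' altogether is to wrap the technique in a context manager (a sketch in the same Python 2 style as the examples above; Python 3.4+ ships contextlib.redirect_stdout, which does essentially this):

import sys
from contextlib import contextmanager


@contextmanager
def captured_stdout():
    """Temporarily replace sys.stdout and always restore it afterwards."""
    class _Output(object):
        def __init__(self):
            self.data = []

        def write(self, s):
            self.data.append(s)

        def __str__(self):
            return "".join(self.data)

    stdout_org = sys.stdout
    my_stdout = _Output()
    sys.stdout = my_stdout
    try:
        yield my_stdout
    finally:
        sys.stdout = stdout_org

Inside a test this reduces the boilerplate to:

with captured_stdout() as out:
    fn_print(2)
self.assertEquals(str(out), "abab\n")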

Categories: FLOSS Project Planets

DataCamp: PySpark Cheat Sheet: Spark in Python

Planet Python - Thu, 2017-03-23 05:10

Apache Spark is generally known as a fast, general and open-source engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. It allows you to speed up analytic applications by up to 100 times compared to other technologies on the market today. You can interface Spark with Python through "PySpark", the Spark Python API that exposes the Spark programming model to Python.

Even though working with Spark will remind you in many ways of working with Pandas DataFrames, you'll also see that it can be tough getting familiar with all the functions that you can use to query, transform, inspect, ... your data. What's more, if you've never worked with any other programming language or if you're new to the field, it might be hard to distinguish between RDD operations.

Let's face it, map() and flatMap() are different enough, but it might still come as a challenge to decide which one you really need when you're faced with them in your analysis. Or what about other functions, like reduce() and reduceByKey()? 
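
To make the distinction concrete, here is a tiny, self-contained comparison (a sketch assuming a local SparkContext; it is not taken from the cheat sheet itself):

from pyspark import SparkContext

sc = SparkContext("local", "rdd-examples")

lines = sc.parallelize(["a b", "c"])
# map() keeps one output element per input element, flatMap() flattens them.
print(lines.map(lambda s: s.split()).collect())      # [['a', 'b'], ['c']]
print(lines.flatMap(lambda s: s.split()).collect())  # ['a', 'b', 'c']

pairs = sc.parallelize([("x", 1), ("y", 2), ("x", 3)])
# reduceByKey() combines values per key, reduce() collapses the whole RDD.
print(pairs.reduceByKey(lambda a, b: a + b).collect())  # [('x', 4), ('y', 2)] (order may vary)
print(pairs.values().reduce(lambda a, b: a + b))        # 6

sc.stop()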

Even though the documentation is very elaborate, it never hurts to have a cheat sheet by your side, especially when you're just getting into it.

This PySpark cheat sheet covers the basics, from initializing Spark and loading your data, to retrieving RDD information, sorting, filtering and sampling your data. But that's not all. You'll also see that topics such as repartitioning, iterating, merging, saving your data and stopping the SparkContext are included in the cheat sheet. 

Note that the examples in the document take small data sets to illustrate the effect of specific functions on your data. In real life data analysis, you'll be using Spark to analyze big data.

Are you hungry for more? Don't miss our other Python cheat sheets for data science that cover topics such as Python basics, NumPy, Pandas, Pandas Data Wrangling and much more!

Categories: FLOSS Project Planets

Rene Dudfield: pip is broken

Planet Python - Thu, 2017-03-23 05:00
Help?

Since asking people to use pip to install things, I get a lot of feedback on pip not working. Feedback like this.
"Our fun packaging Jargon"
What is a pip? What's it for? It's not built into python?  It's the almost-default and almost-standard tool for installing python code. Pip almost works a lot of the time. You install things from pypi. I should download pypy? No, pee why, pee eye. The cheeseshop. You're weird. Just call it pee why pee eye. But why is it called pip? I don't know.
"Feedback like this."pip is broken on the raspberian

pip3 doesn't exist on windows

People have an old pip. Old pip doesn't support wheels. What are wheels? It's a cute bit of jargon to mean a zip file with python code in it structured in a nice way. I heard about eggs... tell me about eggs? Well, eggs are another zip file with python code in it. Used mainly by easy_install. Easy install? Let's use that, this is all too much.

The pip executable or script is for python 2, and they are using python 3.

pip is for a system python, and they have another python installed. How did they install that python? Which of the several pythons did they install? Maybe if they install another python it will work this time.

It's not working one time and they think that sudo will fix things. And now certain files can't be updated without sudo. However, now they have forgotten that sudo exists.

"pip lets you run it with sudo, without warning."
pip doesn't tell them which python it is installing for. But I installed it! Yes you did. But which version of python, and into which virtualenv? Let's use these cryptic commands to try and find out...

pip doesn't install things atomically, so if there is a failed install, things break. If pip was a database (it is)...

Virtual environments work if you use python -m venv, but not virtualenv. Or sometimes it's the other way around. If you have the right packages installed on Debian, and Ubuntu... because they don't install virtualenv by default.

What do you mean I can't rename my virtualenv folder? I can't move it to another place on my Desktop?

pip installs things into global places by default.

"Globals by default."
Why are packages still installed globally by default?

"So what works currently most of the time?"
python3 -m venv anenv
. ./anenv/bin/activate
pip install pip --upgrade
pip install pygame


This is not ideal. It doesn't work on windows. It doesn't work on Ubuntu. It makes some text editors crash (because virtualenvs have so many files they get sick). It confuses test discovery (because for some reason they don't know about virtual environments still and try to test random packages you have installed). You have to know about virtualenv, about pip, about running things with modules, about environment variables, and system paths. You have to know that at the beginning. Before you know anything at all.

Is there even one set of instructions where people can have a new environment, and install something? Install something in a way that it might not break their other applications? In a way which won't cause them harm? Please let me know the magic words?

I just tell people `pip install pygame`. Even though I know it doesn't work. And can't work. By design. I tell them to do that, because it's probably the best we got. And pip keeps getting better. And one day it will be even better.

Help? Let's fix this.
Categories: FLOSS Project Planets

KStars 2.7.6 for Windows & OSX released

Planet KDE - Thu, 2017-03-23 04:25
I am glad to announce the KStars 2.7.6 release for Windows & OSX. Linux users using the official PPA can install the latest release as well.

In this release, we introduce the Ekos Mount Modelling tool developed by Robert Lancaster. It's currently in beta and we would appreciate any feedback. The tool enables you to build a comprehensive mount model if supported by your mount. Any mount that improves its internal pointing model after a SYNC command is applicable. Furthermore, INDI mounts that support the INDI Alignment Subsystem (EQMod, Nexstarevo, Synscan, etc.) are also applicable.


Along with the advanced mount modelling tool comes the new Solution Results plot in the Align Module. It displays the quality of your GOTO after each solve and can help you identify whether there are issues with your mount or with the quality of the image, etc.


You can zoom, pan, and drag to explore the plot in detail. Annotation for the quality of each GOTO is available on mouse over.

The Ekos Polar Alignment Assistant tool also received a few bug fixes based on community feedback. Most users were able to achieve impressive results using this easy-to-use Polar Alignment tool.

While Ekos is designed for ease of use, it can be intimidating for new users unfamiliar with the architecture of Ekos/INDI on several operating systems. Therefore, a new Ekos Profile Wizard is now available to guide users through setting up their equipment for the first time in Ekos across several operating systems and connection topologies.


With INDI v1.4.1+, figuring out which port to use for your mount & focuser is now trivial across Linux & OSX. INDI automatically scans ports on your system and can even connect to all potentially available ports until a successful connection is established.


Last, but not least, KStars' NEO (Near-Earth-Object) data query from NASA's JPL is now working properly again thanks to our newest KStars developer Valentin Boettcher. Valentin (aka Hiro) is only 18 years old but is quite brilliant and experienced with the KDE/Qt development environment. Welcome aboard!

Categories: FLOSS Project Planets

Agiledrop.com Blog: AGILEDROP: Drupal Logos Showing Emotions

Planet Drupal - Thu, 2017-03-23 04:21
It's not over yet. There are still Druplicons that need to be presented. After already exploring the fields of Humans and Superhumans, Fruits and Vegetables, Animals, Outdoor Activities and National Identities, it's now time to look in the field of emotions and see, which emotions are shown by Drupal Logos. After expecting to find many Druplicons in the area of national identities, we came up with an idea of exploring something more challenging. After some thought, we decided it's time to look in the area of emotions. After all, Druplicon was designed with a mischievous smile, so it looks…
Categories: FLOSS Project Planets

Talk Python to Me: #104 Game Theory in Python

Planet Python - Thu, 2017-03-23 04:00
Game theory is the study of competing interests, be it individual actors within an economy or healthy vs. cancer cells within a body.

Our guests this week, Vince Knight, Marc Harper, and Owen Campbell, are here to discuss their python project built to study and simulate one of the central problems in Game Theory: the prisoners' dilemma.

Links from the show:

  • Axelrod on GitHub: https://github.com/Axelrod-Python/Axelrod
  • The docs: http://axelrod.readthedocs.io/en/latest/
  • The tournament: http://axelrod-tournament.readthedocs.io/en/latest/
  • Chat: Gitter room: https://gitter.im/Axelrod-Python
  • Peer reviewed paper: http://openresearchsoftware.metajnl.com/articles/10.5334/jors.125/
  • Djaxelrod v2: https://github.com/Axelrod-Python/axelrod-api
  • Some examples with jupyter: https://github.com/Axelrod-Python/Axelrod-notebooks

Find them on Twitter:

  • The project: @AxelrodPython
  • Owen on Twitter: @opcampbell
  • Vince on Twitter: @drvinceknight

Sponsored items:

  • Our courses: https://training.talkpython.fm/
  • Podcast's Patreon: https://www.patreon.com/mkennedy
Categories: FLOSS Project Planets

Mike Hommey: Why is the git-cinnabar master branch slower to clone?

Planet Debian - Thu, 2017-03-23 03:38

Apart from the memory considerations, one thing that the data presented in the “When the memory allocator works against you” post showed, and that I haven’t touched on in the followup posts, is that there is a large difference in the time it takes to clone mozilla-central with git-cinnabar 0.4.0 vs. the master branch.

One thing that was mentioned in the first followup is that reducing the amount of realloc and substring copies made the cloning more than 15 minutes faster on master. But the same code exists in 0.4.0, so this isn’t part of the difference.

So what’s going on? Looking at the CPU usage during the clone is enlightening.

On 0.4.0:

On master:

(Note: the data gathering is flawed in some ways, which explains why the git-remote-hg process goes above 100%, which is not possible for this python process. The data is however good enough for the high level analysis that follows, so I didn’t bother to get something more accurate.)

On 0.4.0, the git-cinnabar-helper process was saturating one CPU core during the File import phase, and the git-remote-hg process was saturating one CPU core during the Manifest import phase. Overall, the sum of both processes usually used more than one and a half core.

On master, however, the total of both processes barely uses more than one CPU core.

What happened?

This and that happened.

Essentially, before those changes, git-remote-hg would send instructions to git-fast-import (technically, git-cinnabar-helper, but in this case it’s only used as a wrapper for git-fast-import), and use marks to track the git objects that git-fast-import created.

After those changes, git-remote-hg asks git-fast-import the git object SHA1 of objects it just asked to be created. In other words, those changes replaced something asynchronous with something synchronous: while it used to be possible for git-remote-hg to work on the next file/manifest/changeset while git-fast-import was working on the previous one, it now waits.
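
To make the contrast concrete, here is a schematic sketch of the two interaction styles (purely illustrative, not the actual git-cinnabar code; the fast_import handle, the rev objects and the query_last_sha1 helper are hypothetical):

# Asynchronous, mark-based style (0.4.0): keep streaming commands and refer
# to new objects by marks, never waiting for git-fast-import to answer.
def import_marks(fast_import, revisions):
    marks = {}
    for i, rev in enumerate(revisions, start=1):
        fast_import.write("blob\nmark :%d\ndata %d\n%s\n" % (i, len(rev.data), rev.data))
        marks[rev.hg_sha1] = ":%d" % i  # resolved to real git SHA1s later
    return marks


# Synchronous style (master at the time): ask for the git SHA1 of each object
# right after creating it, so the producer waits on the consumer every time.
def import_synchronous(fast_import, revisions):
    mapping = {}
    for rev in revisions:
        fast_import.write("blob\ndata %d\n%s\n" % (len(rev.data), rev.data))
        mapping[rev.hg_sha1] = fast_import.query_last_sha1()  # blocks
    return mapping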

The changes helped simplify the python code, but made the overall clone process much slower.

If I’m not mistaken, the only real use for that information is for the mapping of mercurial to git SHA1s, which is actually rarely used during the clone, except at the end, when storing it. So what I’m planning to do is to move that mapping to the git-cinnabar-helper process, which, incidentally, will kill not 2, but 3 birds with 1 stone:

  • It will restore the asynchronicity, obviously (at least, that’s the expected main outcome).
  • Storing the mapping in the git-cinnabar-helper process is very likely to take less memory than what it currently takes in the git-remote-hg process. Even if it doesn’t (which I doubt), that should still help stay under the 2GB limit of 32-bit processes.
  • The whole thing that spikes memory usage during the finalization phase, as seen in previous post, will just go away, because the git-cinnabar-helper process will just have prepared the git notes-like tree on its own.

So expect git-cinnabar 0.5 to get moar faster, and to use moar less memory.

Categories: FLOSS Project Planets

Mikkel Høgh: A vote of no confidence in the Drupal Association leadership

Planet Drupal - Thu, 2017-03-23 02:42

I have had many differences with the Drupal Association in the past, starting with the many clashes we had with their erstwhile leadership when we were organising DrupalCon Copenhagen 2010, so I’ll admit I wasn’t their biggest fan before the latest events.

Categories: FLOSS Project Planets

Kushal Das: Running MicroPython on 96Boards Carbon

Planet Python - Thu, 2017-03-23 02:42

I received my Carbon from Seedstudio a few months back. But, I never found time to sit down and work on it. During FOSSASIA, in my MicroPython workshop, Siddhesh was working on putting MicroPython on his Carbon using Zephyr. That gave me the motivation to have a look at the same after coming back home.

What is Carbon?

Carbon is a 96Boards IoT edition compatible board, with a Cortex-M4 chip, and 512KB flash. It currently runs Zephyr, which is a Linux Foundation hosted project to build a scalable real-time operating system (RTOS).

Setup MicroPython on Carbon

To install the dependencies in Fedora:

$ sudo dnf group install "Development Tools"
$ sudo dnf install git make gcc glibc-static \
    libstdc++-static python3-ply ncurses-devel \
    python-yaml python2 dfu-util

The next step is to setup the Zephyr SDK. You can download the latest binary from here. Then you can install it under your home directory (you don’t have to install it system-wide). I installed it under ~/opt/zephyr-sdk-0.9 location.

Next, I had to check out the zephyr source, I cloned from https://git.linaro.org/lite/zephyr.git repo. I also cloned MicroPython from the official GitHub repo. I will just copy paste the next steps below.

$ source zephyr-env.sh
$ cd ~/code/git/
$ git clone https://github.com/micropython/micropython.git
$ cd micropython/zephyr

Then I created a project file specifically for the carbon board, named prj_96b_carbon.conf; I am pasting the content below. I have submitted it as a patch to the upstream MicroPython project. It disables networking (otherwise you will get stuck while trying to get the REPL).

# No networking for carbon
CONFIG_NETWORKING=n
CONFIG_NET_IPV4=n
CONFIG_NET_IPV6=

Next, we have to build MicroPython as a Zephyr application.

$ make BOARD=96b_carbon
$ ls outdir/96b_carbon/
arch     ext          isr_tables.c  lib          Makefile         scripts  tests       zephyr.hex  zephyr.map           zephyr.strip
boards   include      isr_tables.o  libzephyr.a  Makefile.export  src      zephyr.bin  zephyr.lnk  zephyr_prebuilt.elf
drivers  isrList.bin  kernel        linker.cmd   misc             subsys   zephyr.elf  zephyr.lst  zephyr.stat

After the build is finished, you will be able to see a zephyr.bin file in the output directory.

Uploading the fresh build to the carbon

Before anything else, I connected my Carbon board to the laptop using a USB cable to the OTG port (remember to check the port name). Then I had to press the BOOT0 button and, while pressing that one, also press the Reset button. Then I released the Reset button first, and then the BOOT0 button. If you run the dfu-util command after this, you should be able to see some output like below.

$ sudo dfu-util -l
dfu-util 0.9

Copyright 2005-2009 Weston Schmidt, Harald Welte and OpenMoko Inc.
Copyright 2010-2016 Tormod Volden and Stefan Schmidt
This program is Free Software and has ABSOLUTELY NO WARRANTY
Please report bugs to http://sourceforge.net/p/dfu-util/tickets/

Found DFU: [0483:df11] ver=2200, devnum=14, cfg=1, intf=0, path="2-2", alt=3, name="@Device Feature/0xFFFF0000/01*004 e", serial="385B38683234"
Found DFU: [0483:df11] ver=2200, devnum=14, cfg=1, intf=0, path="2-2", alt=2, name="@OTP Memory /0x1FFF7800/01*512 e,01*016 e", serial="385B38683234"
Found DFU: [0483:df11] ver=2200, devnum=14, cfg=1, intf=0, path="2-2", alt=1, name="@Option Bytes /0x1FFFC000/01*016 e", serial="385B38683234"
Found DFU: [0483:df11] ver=2200, devnum=14, cfg=1, intf=0, path="2-2", alt=0, name="@Internal Flash /0x08000000/04*016Kg,01*064Kg,03*128Kg", serial="385B38683234"

This means the board is in DFU mode. Next we flash the new application to the board.

$ sudo dfu-util -d [0483:df11] -a 0 -D outdir/96b_carbon/zephyr.bin -s 0x08000000
dfu-util 0.9

Copyright 2005-2009 Weston Schmidt, Harald Welte and OpenMoko Inc.
Copyright 2010-2016 Tormod Volden and Stefan Schmidt
This program is Free Software and has ABSOLUTELY NO WARRANTY
Please report bugs to http://sourceforge.net/p/dfu-util/tickets/

dfu-util: Invalid DFU suffix signature
dfu-util: A valid DFU suffix will be required in a future dfu-util release!!!
Opening DFU capable USB device...
ID 0483:df11
Run-time device DFU version 011a
Claiming USB DFU Interface...
Setting Alternate Setting #0 ...
Determining device status: state = dfuERROR, status = 10
dfuERROR, clearing status
Determining device status: state = dfuIDLE, status = 0
dfuIDLE, continuing
DFU mode device DFU version 011a
Device returned transfer size 2048
DfuSe interface name: "Internal Flash "
Downloading to address = 0x08000000, size = 125712
Download [=========================] 100% 125712 bytes
Download done.
File downloaded successfully

Hello World on Carbon

The hello world of the hardware land is the LED blinking code. I used the on-board LEDs for this; the sample code is given below. I have now connected the board to the UART (instead of OTG).

$ screen /dev/ttyUSB0 115200
>>>
>>> import time
>>> from machine import Pin
>>> led1 = Pin(("GPIOD", 2), Pin.OUT)
>>> led2 = Pin(("GPIOB", 5), Pin.OUT)
>>> while True:
...     led2.low()
...     led1.high()
...     time.sleep(0.5)
...     led2.high()
...     led1.low()
...     time.sleep(0.5)
Categories: FLOSS Project Planets

Bryan Pendleton: It's not just a game, ...

Planet Apache - Thu, 2017-03-23 00:54

... close reading shows that it's an homage to many great works of art before it: 14 Greatest Witcher 3 Easter Eggs That Will Make You Wanna Replay It Immediately

Categories: FLOSS Project Planets

Mike Hommey: Analyzing git-cinnabar memory use

Planet Debian - Thu, 2017-03-23 00:30

In the previous post, I was looking at the allocations git-cinnabar makes. While I had the data, I figured I’d also look at how the memory use correlates with expectations based on repository data, to put things in perspective.

As a reminder, this is what the allocations look like (horizontal axis being the number of allocator function calls):

There are 7 different phases happening during a git clone using git-cinnabar, most of which can easily be identified on the graph above:

  • Negotiation.

    During this phase, git-cinnabar talks to the mercurial server to determine what needs to be pulled. Once that is done, a getbundle request is emitted, whose response is read in the next three phases. This phase is essentially invisible on the graph.

  • Reading changeset data.

    The first thing that a mercurial server sends in the response for a getbundle request is changesets. They are sent in the RevChunk format. Translated to git, they become commit objects. But to create commit objects, we need the entire corresponding trees and files (blobs), which we don’t have yet. So we keep this data in memory.

    In the git clone analyzed here, there are 345643 changesets loaded in memory. Their raw size in RawChunk format is 237MB. I think by the end of this phase, we made 20 million allocator calls, have about 300MB of live data in about 840k allocations. (No certainty because I don’t actually have definite data that would allow correlating the phases with allocator calls, and the memory usage change between this phase and the next is not as clear-cut as with other phases). This puts us at less than 3 live allocations per changeset, with “only” about 60MB overhead over the raw data.

  • Reading manifest data.

    In the stream we receive, manifests follow changesets. Each changeset points to one manifest; several changesets can point to the same manifest. Manifests describe the content of the entire source code tree in a similar manner to git trees, except they are flat (there’s one manifest for the entire tree, where git trees would reference other git trees for sub directories). And like git trees, they only map file paths to file SHA1s. The way they are currently stored by git-cinnabar (which is planned to change) requires knowing the corresponding git SHA1s for those files, and we haven’t got those yet, so again, we keep everything in memory.

    In the git clone analyzed here, there are 345398 manifests loaded in memory. Their raw size in RawChunk format is 1.18GB. By the end of this phase, we made 23 million more allocator calls, and have about 1.52GB of live data in about 1.86M allocations. We’re still at less than 3 live allocations for each object (changeset or manifest) we’re keeping in memory, and barely over 100MB of overhead over the raw data, which, on average puts the overhead at 150 bytes per object.

    The three phases so far are relatively fast and account for a small part of the overall process, so they don’t appear clear-cut to each other, and don’t take much space on the graph.

  • Reading and Importing files.

    After the manifests, we finally get files data, grouped by path, such that we get all the file revisions of e.g. .cargo/.gitignore, followed by all the file revisions of .cargo/config.in, .clang-format, and so on. The data here doesn’t depend on anything else, so we can finally directly import the data.

    This means that for each revision, we actually expand the RawChunk into the full file data (RawChunks contain patches against a previous revision), and don’t keep the RawChunk around. We also don’t keep the full data after it was sent to the git-cinnabar-helper process (as far as cloning is concerned, it’s essentially a wrapper for git-fast-import), except for the previous revision of the file, which is likely the patch base for the next revision.

    We however keep in memory one or two things for each file revision: a mapping of its mercurial SHA1 and the corresponding git SHA1 of the imported data, and, when there is one, the file metadata (containing information about file copy/renames) that lives as a header in the file data in mercurial, but can’t be stored in the corresponding git blobs, otherwise we’d have irrelevant data in checkouts.

    On the graph, this is where there is a steady and rather long increase of both live allocations and memory usage, in stairs for the latter.

    In the git clone analyzed here, there are 2.02M file revisions, 78k of which have copy/move metadata for a cumulated size of 8.5MB of metadata. The raw size of the file revisions in RawChunk format is 3.85GB. The expanded data size is 67GB. By the end of this phase, we made 622 million more allocator calls, and peaked at about 2.05GB of live data in about 6.9M allocations. Compared to the beginning of this phase, that added about 530MB in 5 million allocations.

    File metadata is stored in memory as python dicts, with 2 entries each, instead of raw form for convenience and future-proofing, so that would be at least 3 allocations each: one for each value, one for the dict, and maybe one for the dict storage; their keys are all the same and are probably interned by python, so wouldn’t count.

    As mentioned above, we store a mapping of mercurial to git SHA1s, so for each file that makes 2 allocations, 4.04M total. Plus the 230k or 310k from metadata. Let’s say 4.45M total. We’re short 550k allocations, but considering the numbers involved, it would take less than one allocation per file on average to go over this count.

    As for memory size, per this answer on stackoverflow, python strings have an overhead of 37 bytes, so each SHA1 (kept in hex form) will take 77 bytes (Note, that’s partly why I didn’t particularly care about storing them as binary form, that would only save 25%, not 50%). That’s 311MB just for the SHA1s, to which the size of the mapping dict needs to be added. If it were a plain array of pointers to keys and values, it would take 2 * 8 bytes per file, or about 32MB. But that would be a hash table with no room for more items (By the way, I suspect the stairs that can be seen on the requested and in-use bytes are the hash table being realloc()ed). Plus at least 290 bytes per dict for each of the 78k metadata, which is an additional 22M. All in all, 530MB doesn’t seem too much of a stretch.

  • Importing manifests.

    At this point, we’re done receiving data from the server, so we begin by dropping objects related to the bundle we got from the server. On the graph, I assume this is the big dip that can be observed after the initial increase in memory use, bringing us down to 5.6 million allocations and 1.92GB.

    Now begins the most time consuming process, as far as mozilla-central is concerned: transforming the manifests into git trees, while also storing enough data to be able to reconstruct manifests later (which is required to be able to pull from the mercurial server after the clone).

    So for each manifest, we expand the RawChunk into the full manifest data, and generate new git trees from that. The latter is mostly performed by the git-cinnabar-helper process. Once we’re done pushing data about a manifest to that process, we drop the corresponding data, except when we know it will be required later as the delta base for a subsequent RevChunk (which can happen in bundle2).

    As with file revisions, for each manifest, we keep track of the mapping of SHA1s between mercurial and git. We also keep a DAG of the manifests history (contrary to git trees, mercurial manifests track their ancestry; files do too, but git-cinnabar doesn’t actually keep track of that separately; it just relies on the manifests data to infer file ancestry).

    On the graph, this is where the number of live allocations increases while both requested and in-use bytes decrease, noisily.

    By the end of this phase, we made about 1 billion more allocator calls. Requested allocations went down to 1.02GB, for close to 7 million live allocations. Compared to the end of the dip at the beginning of this phase, that added 1.4 million allocations, and released 900MB. By now, we expect everything from the “Reading manifests” phase to have been released, which means we allocated around 620MB (1.52GB – 900MB), for a total of 3.26M additional allocations (1.4M + 1.86M).

    We have a dict for the SHA1s mapping (345k * 77 * 2 for strings, plus the hash table with 345k items, so at least 60MB), and the DAG, which, now that I’m looking at memory usage, I figure has possibly one of the worst structures, using 2 sets for each node (at least 232 bytes per set, that’s at least 160MB, plus 2 hash tables with 345k items). I think 250MB for those data structures would be largely underestimated. It’s not hard to imagine them taking 620MB, because really, that DAG implementation is awful. The number of allocations expected from them would be around 1.4M (4 * 345k), but I might be missing something. That’s way less than the actual number, so it would be interesting to take a closer look, but not before doing something about the DAG itself.

    Fun fact: the amount of data we’re dealing with in this phase (the expanded size of all the manifests) is close to 2.9TB (yes, terabytes). With about 4700 seconds spent on this phase on a real clone (less with the release branch), we’re still handling more than 615MB per second.

  • Importing changesets.

    This is where we finally create the git commits corresponding to the mercurial changesets. For each changeset, we expand its RawChunk, find the git tree we created in the previous phase that corresponds to the associated manifest, and create a git commit for that tree, with the right date, author, and commit message. For data that appears in the mercurial changeset that can’t be stored or doesn’t make sense to store in the git commit (e.g. the manifest SHA1, the list of changed files[*], or some extra metadata like the source of rebases), we keep some metadata we’ll store in git notes later on.

    [*] Fun fact: the list of changed files stored in mercurial changesets does not necessarily match the list of files in a `git diff` between the corresponding git commit and its parents, for essentially two reasons:

    • Old buggy versions of mercurial have generated erroneous lists that are now there forever (they are part of what makes the changeset SHA1).
    • Mercurial may create new revisions for files even when the file content is not modified, most notably during merges (but that also happened on non-merges due to, presumably, bugs).
    … so we keep it verbatim.

    On the graph, this is where both requested and in-use bytes are only slightly increasing.

    By the end of this phase, we made about half a billion more allocator calls. Requested allocations went up to 1.06GB, for close to 7.7 million live allocations. Compared to the end of the previous phase, that added 700k allocations, and 400MB. By now, we expect everything from the “Reading changesets” phase to have been released (at least the raw data we kept there), which means we may have allocated at most around 700MB (400MB + 300MB), for a total of 1.5M additional allocations (700k + 840k).

    All these are extra data we keep for the next and final phase. It’s hard to evaluate the exact size we’d expect here in memory, but if we divide by the number of changesets (345k), that’s less than 5 allocations per changeset and less than 2KB per changeset, which is low enough not to raise eyebrows, at least for now.

  • Finalizing the clone.

    The final phase is where we actually go ahead storing the mappings between mercurial and git SHA1s (all 2.7M of them), the git notes where we store the data necessary to recreate mercurial changesets from git commits, and a cache for mercurial tags.

    On the graph, this is where the requested and in-use bytes, as well as the number of live allocations peak like crazy (up to 21M allocations for 2.27GB requested).

    This is very much unwanted, but easily explained with the current state of the code. The way the mappings between mercurial and git SHA1s are stored is via a tree similar to how git notes are stored. So for each mercurial SHA1, we have a file that points to the corresponding git SHA1 through git links for commits or directly for blobs (look at the output of git ls-tree -r refs/cinnabar/metadata^3 if you’re curious about the details). If I remember correctly, it’s faster if the tree is created with an ordered list of paths, so the code created a list of paths, and then sorted it to send commands to create the tree. The former creates a new str of length 42 and a tuple of 3 elements for each and every one of the 2.7M mappings. With the 37-byte overhead per str instance and the 56 + 3 * 8 bytes per tuple, we have at least 429MB wasted. Creating the tree itself keeps the corresponding fast-import commands in a buffer, where each command is going to be a tuple of 2 elements: a pointer to a method, and a str of length between 90 and 93. That’s at least another 440MB wasted. (The per-object sizes used here can be double-checked with sys.getsizeof, as sketched right after this list.)

    I already fixed the first half, but the second half still needs addressing.

Overall, except for the stupid spike during the final phase, the manifest DAG and the glibc allocator runaway memory use described in previous posts, there is nothing terribly bad with the git-cinnabar memory usage, all things considered. Mozilla-central is just big.

The spike is already half addressed, and work is under way for the glibc allocator runaway memory use. The manifest DAG, interestingly, is actually mostly useless. It’s only used to track the heads of the DAG, and it’s very much possible to track heads of a DAG without actually storing the entire DAG. In fact, that’s what git-cinnabar already does for changeset heads… so we would only need to do the same for manifest heads.
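
For the curious, tracking the heads of a DAG without storing the DAG only requires keeping the current set of heads up to date as nodes arrive in topological order (parents before children), which is how changesets and manifests come in during a clone. A minimal sketch of the idea (not git-cinnabar's actual code):

class HeadTracker(object):
    """Track the heads of a DAG without storing the DAG itself.

    Assumes nodes are added in topological order (parents first)."""

    def __init__(self):
        self._heads = set()

    def add(self, node, parents):
        self._heads.add(node)                   # a new node starts out as a head
        self._heads.difference_update(parents)  # its parents are no longer heads

    @property
    def heads(self):
        return frozenset(self._heads)

tracker = HeadTracker()
tracker.add('a', [])
tracker.add('b', ['a'])
tracker.add('c', ['a'])
print(sorted(tracker.heads))  # ['b', 'c']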

One could argue that the 1.4GB of raw RevChunk data we’re keeping in memory for later use could be kept on disk instead. I haven’t done this so far because I didn’t want to have to handle temporary files (and answer questions like “where to put them?”, “what if there isn’t enough disk space there?”, “what if disk access is slow?”, etc.). But the majority of this data is from manifests. I’m already planning changes in how git-cinnabar stores manifests data that will actually allow importing them directly, instead of keeping them in memory until files are imported. This would instantly remove 1.18GB of memory usage. The downside, however, is that this would be more CPU intensive: importing changesets will require creating the corresponding git trees, and getting the stored manifest data. I think it’s worth it, though.

Finally, one thing that isn’t obvious here, but that was found while analyzing why RSS would be going up despite memory usage going down, is that git-cinnabar is doing way too many reallocations and substring allocations.

So let’s look at two metrics that hopefully will highlight the problem:

  • The cumulated amount of requested memory. That is, the sum of all sizes ever given to malloc, realloc, calloc, etc.
  • The compensated cumulated amount of requested memory (naming is hard). That is, the sum of all sizes ever given to malloc, calloc, etc. except realloc. For realloc, we only count the delta in size between what the size was before and after the realloc.

Assuming all the requested memory is filled at some point, the former gives us an upper bound on the amount of memory that is ever filled or copied (the amount that would be filled if no realloc was ever in-place), while the latter gives us a lower bound (the amount that would be filled or copied if all reallocs were in-place).
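
Expressed as code, the two metrics could be accumulated from an allocation trace along these lines (an illustrative sketch, not the instrumentation actually used to produce the graphs; how shrinking reallocs are counted is an assumption here):

def accumulate(trace):
    """trace yields (call, old_size, new_size) events, e.g.
    ('malloc', 0, 4096) or ('realloc', 4096, 8192)."""
    cumulated = 0
    compensated = 0
    for call, old_size, new_size in trace:
        cumulated += new_size                           # every requested size counts
        if call == 'realloc':
            compensated += max(0, new_size - old_size)  # only count the growth
        else:
            compensated += new_size
    return cumulated, compensated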

Ideally, we’d want the upper and lower bounds to be close to each other (indicating few realloc calls), and the total amount at the end of the process to be as close as possible to the amount of data we’re handling (which we’ve seen is around 3TB).

… and this is clearly bad. Like, really bad. But we already knew that from the previous post, although it’s nice to put numbers on it. The lower bound is about twice the amount of data we’re handling, and the upper bound is more than 10 times that amount. Clearly, we can do better.

We’ll see how things evolve after the necessary code changes happen. Stay tuned.

Categories: FLOSS Project Planets

Fabio Zadrozny: PyDev 5.6.0 released: faster debugger, improved type inference for super and pytest fixtures

Planet Python - Thu, 2017-03-23 00:29
PyDev 5.6.0 is now already available for download (and is already bundled in LiClipse 3.5.0).

There are many improvements on this version!

The major one is that the PyDev.Debugger got some attention and should now be 60%-100% faster overall -- in all supported Python versions (and that's on top of the improvements done previously).

This improvement was a nice example of trading memory for speed: the major change is that the debugger now has two new caches, one saving whether a frame should be skipped and another saving whether a given line in a traced frame should be skipped, which lets the debugger perform far fewer checks in those situations.
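
Conceptually, the two caches amount to something like the sketch below (an illustration of the idea only, not the actual PyDev.Debugger code; the names are made up):

_frame_skip_cache = {}  # code object -> should frames running this code be skipped?
_line_skip_cache = {}   # (code object, line number) -> should this line be skipped?

def should_skip_frame(frame, compute):
    # compute(frame) is the expensive decision we only want to make once per code object.
    code = frame.f_code
    try:
        return _frame_skip_cache[code]
    except KeyError:
        result = _frame_skip_cache[code] = compute(frame)
        return result

def should_skip_line(frame, compute):
    key = (frame.f_code, frame.f_lineno)
    try:
        return _line_skip_cache[key]
    except KeyError:
        result = _line_skip_cache[key] = compute(frame)
        return result

A couple of dictionaries’ worth of extra memory buys skipping the repeated per-frame and per-line checks, which is where the speedup comes from.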

Also, other fixes were done in the debugger. Namely:

  • the variables are now properly displayed when the interactive console is connected to a debug session;
  • it's possible to select the Qt version for which QThreads should be patched for the debugger to work with (in preferences > PyDev > Debug > Qt Threads);
  • fixed an issue where a "native Qt signal is not callable" message was raised when connecting a signal to QThread.started;
  • fixed an issue displaying a variable (Ctrl+Shift+D) when debugging.

Note: from this version onward, the debugger only supports Python 2.6+ (I believe there should be very few Python 2.5 users -- Python 2.6 itself stopped being supported in 2013, so I expect this change to affect almost no one -- if someone really needs to use an older version of Python, it's always possible to get an older version of the IDE/debugger too). Also, from now on, supported versions are actually properly tested on CI (2.6, 2.7 and 3.5 at https://travis-ci.org/fabioz/PyDev.Debugger and 2.7 and 3.5 at https://ci.appveyor.com/project/fabioz/pydev-debugger).

Code completion (Ctrl+Space) and find definition (F3) also had improvements and can now deal with Python's super construct (so it's possible to get completions for, and go to the definition of, a method declared in a superclass) and with pytest fixtures (so, if you have a pytest fixture, you can now get completions for it and go to its definition even if you don't add a docstring to the parameter stating its expected type).
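
For instance, in a snippet like the following (names are purely illustrative), both features kick in: navigating from the super call resolves to Base.greet, and completions on the child parameter come from the fixture's return type without any docstring hint:

import pytest

class Base(object):
    def greet(self):
        return 'hello'

class Child(Base):
    def greet(self):
        # F3 / Ctrl+Space on super(Child, self).greet now resolves to Base.greet.
        return super(Child, self).greet().upper()

@pytest.fixture
def child():
    return Child()

def test_greet(child):
    # Completions on 'child' are now inferred from the fixture's return type.
    assert child.greet() == 'HELLO'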

Also, this release improved support for third-party packages: coverage, pycodestyle (previously pep8.py) and autopep8 now use the latest versions available. PyLint integration was improved to use the same thread pool used for code analysis, and an issue in the Django shell with Django >= 1.10 was fixed.

And to finish, the preferences for running unit-tests can now be saved to the project or user settings (i.e.: preferences > PyDev > PyUnit > Save to ...) and an issue was fixed when coloring the matrix multiplication operator (which was wrongly recognized as a decorator).

Thank you very much to all the PyDev supporters and Patrons (http://www.patreon.com/fabioz), who help to keep PyDev moving forward and to JetBrains, which sponsored many of the improvements done in the PyDev.Debugger.

Categories: FLOSS Project Planets

Dries Buytaert: Living our values

Planet Drupal - Thu, 2017-03-23 00:26

The Drupal community is committed to welcome and accept all people. That includes a commitment to not discriminate against anyone based on their heritage or culture, their sexual orientation, their gender identity, and more. Being diverse has strength and as such we work hard to foster a culture of open-mindedness toward differences.

A few weeks ago, I privately asked Larry Garfield, a prominent Drupal contributor, to leave the Drupal project. I did this because it came to my attention that he holds views that are in opposition with the values of the Drupal project.

I had hoped to avoid discussing this decision publicly out of respect for Larry's private life, but now that Larry has written about it on his blog and it is being discussed publicly, I believe I have no choice but to respond on behalf of the Drupal project.

It is not for me to share any of the confidential information that I've received, so I won't point out the omissions in Larry's blog post. However, I can tell you that those who have reviewed Larry's writing, including me, suffered from varying degrees of shock and concern.

In the end, I fundamentally believe that all people are created equally. This belief has shaped the values that the Drupal project has held since its early days. I cannot in good faith support someone who actively promotes a philosophy that is contrary to this.

While the decision was unpleasant, the choice was clear. I remain steadfast in my obligation to protect the shared values of the Drupal project. This is unpleasant because I appreciate Larry's many contributions to Drupal, because this risks setting a complicated precedent, and because it involves a friend's personal life. The matter is further complicated by the fact that this information was shared by others in a manner I don't find acceptable either.

It's not for me to judge the choices anyone makes in their private life or what beliefs they subscribe to. I certainly don't take offense to the role-playing activities of Larry's alternative lifestyle. However, when a highly-visible community member's private views become public, controversial, and disruptive for the project, I must consider the impact that his words and actions have on others and the project itself. In this case, Larry has entwined his private and professional online identities in such a way that it blurs the lines with the Drupal project. Ultimately, I can't get past the fundamental misalignment of values.

First, collectively, we work hard to ensure that Drupal has a culture of diversity and inclusion. Our goal is not just to have a variety of different people within our community, but to foster an environment of connection, participation and respect. We have a lot of work to do on this and we can't afford to ignore discrepancies between the espoused views of those in leadership roles and the values of our culture. It's my opinion that any association with Larry's belief system is inconsistent with our project's goals.

Second, I believe someone's belief system inherently influences their actions, in both explicit and subtle ways, and I'm unwilling to take this risk going forward.

Third, Larry's continued representation of the Drupal project could harm the reputation of the project and cause harm to the Drupal ecosystem. Any further participation in a leadership role implies our community is complicit with and/or endorses these views, which we do not.

It is my responsibility and obligation to act in the best interest of the project at large and to uphold our values. Decisions like this are unpleasant and disruptive, but important. It is moments like this that test our commitment to our values. We must stand up and act in ways that demonstrate these values. For these reasons, I'm asking Larry to resign from the Drupal project.

(Comments on this post are allowed but for obvious reasons will be moderated.)

Categories: FLOSS Project Planets

ActiveLAMP: Shibboleth Authentication in Symfony 2.8+|3.0+

Planet Drupal - Wed, 2017-03-22 22:00

We recently had the opportunity to work on a Symfony app for one of our Higher Ed clients, for whom we recently built a Drupal distribution. Drupal 8 moving to Symfony has enabled us to expand our service offering. We have found more opportunities to build apps directly with Symfony when a CMS is not needed. This post is not about Drupal, but we are cross-posting it to Drupal Planet to demonstrate the value of getting off the island. Writing custom authentication schemes in Symfony used to be on the complicated side. But with the introduction of the Guard authentication component, it has gotten a lot easier.

Read more...
Categories: FLOSS Project Planets

Matthew Rocklin: Dask Release 0.14.1

Planet Python - Wed, 2017-03-22 20:00

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.14.1. This release contains a variety of performance and feature improvements. This blogpost includes some notable features and changes since the last release on February 27th.

As always you can conda install from conda-forge

conda install -c conda-forge dask distributed

or you can pip install from PyPI

pip install dask[complete] --upgrade

Arrays

Recent work in distributed computing and machine learning have motivated new performance-oriented and usability changes to how we handle arrays.

Automatic chunking and operation on NumPy arrays

Many interactions between Dask arrays and NumPy arrays work smoothly. NumPy arrays are made lazy and are appropriately chunked to match the operation and the Dask array.

>>> x = np.ones(10)                 # a numpy array
>>> y = da.arange(10, chunks=(5,))  # a dask array
>>> z = x + y                       # combined become a dask.array
>>> z
dask.array<add, shape=(10,), dtype=float64, chunksize=(5,)>
>>> z.compute()
array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])

Reshape

Reshaping distributed arrays is simple in simple cases, and can be quite complex in complex cases. Reshape now supports a much broader set of shape transformations where any dimension can be collapsed or merged into other dimensions.

>>> x = da.ones((2, 3, 4, 5, 6), chunks=(2, 2, 2, 2, 2))
>>> x.reshape((6, 2, 2, 30, 1))
dask.array<reshape, shape=(6, 2, 2, 30, 1), dtype=float64, chunksize=(3, 1, 2, 6, 1)>

This operation ends up being quite useful in a number of distributed array cases.

Optimize Slicing to Minimize Communication

Dask.array slicing optimizations are now careful to produce graphs that avoid situations that could cause excess inter-worker communication. The details of how they do this are a bit out of scope for a short blogpost, but the history here is interesting.

Historically dask.arrays were used almost exclusively by researchers with large on-disk arrays stored as HDF5 or NetCDF files. These users primarily used the single machine multi-threaded scheduler. We heavily tailored Dask array optimizations to this situation and made that community pretty happy. Now as some of that community switches to cluster computing on larger datasets the optimization goals shift a bit. We have tons of distributed disk bandwidth but really want to avoid communicating large results between workers. Supporting both use cases is possible and I think that we’ve achieved that in this release so far, but it’s starting to require increasing levels of care.

Micro-optimizations

With distributed computing also comes larger graphs and a growing importance of graph-creation overhead. This has been optimized somewhat in this release. We expect this to be a focus going forward.

DataFrames

Set_index

Set_index is smarter in two ways:

  1. If you set_index on a column that happens to be sorted then we’ll identify that and avoid a costly shuffle. This was always possible with the sorted= keyword but users rarely used this feature. Now this is automatic (see the sketch after this list).
  2. Similarly when setting the index we can look at the size of the data and determine if there are too many or too few partitions and rechunk the data while shuffling. This can significantly improve performance if there are too many partitions (a common case).
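
A minimal sketch of the first point (the file pattern and column name are hypothetical):

import dask.dataframe as dd

# Suppose the files are already globally sorted by their 'timestamp' column.
df = dd.read_csv('events-2017-*.csv')

# Previously you had to pass sorted=True yourself to skip the shuffle;
# now Dask detects that the column is sorted and avoids it automatically.
df = df.set_index('timestamp')
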
Shuffle performance

We’ve micro-optimized some parts of dataframe shuffles. Big thanks to the Pandas developers for the help here. This accelerates set_index, joins, groupby-applies, and so on.

Fastparquet

The fastparquet library has seen a lot of use lately and has undergone a number of community bugfixes.

Importantly, Fastparquet now supports Python 2.

We strongly recommend Parquet as the standard data storage format for Dask dataframes (and Pandas DataFrames).

dask/fastparquet #87

Distributed Scheduler

Replay remote exceptions

Debugging is hard in part because exceptions happen on remote machines where normal debugging tools like pdb can’t reach. Previously we were able to bring back the traceback and exception, but you couldn’t dive into the stack trace to investigate what went wrong:

def div(x, y):
    return x / y

>>> future = client.submit(div, 1, 0)
>>> future
<Future: status: error, key: div-4a34907f5384bcf9161498a635311aeb>
>>> future.result()  # getting result re-raises exception locally
<ipython-input-3-398a43a7781e> in div()
      1 def div(x, y):
----> 2     return x / y
ZeroDivisionError: division by zero

Now Dask can bring a failing task and all necessary data back to the local machine and rerun it so that users can leverage the normal Python debugging toolchain.

>>> client.recreate_error_locally(future)
<ipython-input-3-398a43a7781e> in div(x, y)
      1 def div(x, y):
----> 2     return x / y
ZeroDivisionError: division by zero

Now if you’re in IPython or a Jupyter notebook you can use the %debug magic to jump into the stacktrace, investigate local variables, and so on.

In [8]: %debug
> <ipython-input-3-398a43a7781e>(2)div()
      1 def div(x, y):
----> 2     return x / y

ipdb> pp x
1
ipdb> pp y
0

dask/distributed #894

Async/await syntax

Dask.distributed uses Tornado for network communication and Tornado coroutines for concurrency. Normal users rarely interact with Tornado coroutines; they aren’t familiar to most people so we opted instead to copy the concurrent.futures API. However some complex situations are much easier to solve if you know a little bit of async programming.

Fortunately, the Python ecosystem seems to be embracing this change towards native async code with the async/await syntax in Python 3. In an effort to motivate people to learn async programming and to gently nudge them towards Python 3, Dask.distributed now supports async/await in a few cases.

You can wait on a dask Future

async def f():
    future = client.submit(func, *args, **kwargs)
    result = await future

You can put the as_completed iterator into an async for loop

async for future in as_completed(futures):
    result = await future
    ... do stuff with result ...

And, because Tornado supports the await protocols you can also use the existing shadow concurrency API (everything prepended with an underscore) with await. (This was doable before.)

results = client.gather(futures)         # synchronous
...
results = await client._gather(futures)  # asynchronous

If you’re in Python 2 you can always do this with normal yield and the tornado.gen.coroutine decorator.

dask/distributed #952

Inproc transport

In the last release we enabled Dask to communicate over more things than just TCP. In practice this doesn’t come up (TCP is pretty useful). However in this release we now support single-machine “clusters” where the clients, scheduler, and workers are all in the same process and transfer data cost-free over in-memory queues.

This allows the in-memory user community to use some of the more advanced features (asynchronous computation, spill-to-disk support, web-diagnostics) that are only available in the distributed scheduler.

This is on by default if you create a cluster with LocalCluster without using Nanny processes.

>>> from dask.distributed import LocalCluster, Client
>>> cluster = LocalCluster(nanny=False)
>>> client = Client(cluster)
>>> client
<Client: scheduler='inproc://192.168.1.115/8437/1' processes=1 cores=4>

>>> from threading import Lock         # Not serializable
>>> lock = Lock()                      # Won't survive going over a socket
>>> [future] = client.scatter([lock])  # Yet we can send to a worker
>>> future.result()                    # ... and back
<unlocked _thread.lock object at 0x7fb7f12d08a0>

dask/distributed #919

Connection pooling for inter-worker communications

Workers now maintain a pool of sustained connections between each other. This pool is of a fixed size and removes connections with a least-recently-used policy. It avoids re-connection delays when transferring data between workers. In practice this shaves off a millisecond or two from every communication.
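
The idea is essentially a fixed-size, least-recently-used cache of open connections, along the lines of this sketch (illustrative only, and simplified compared to the real dask.distributed implementation; the limit of 512 is made up):

from collections import OrderedDict

class ConnectionPool(object):
    """Keep up to `limit` open connections, evicting the least recently used."""

    def __init__(self, connect, close, limit=512):
        self._connect = connect     # callable: address -> connection
        self._close = close         # callable: connection -> None
        self._limit = limit
        self._open = OrderedDict()  # address -> connection, in LRU order

    def get(self, address):
        if address in self._open:
            conn = self._open.pop(address)
            self._open[address] = conn  # re-insert to mark as most recently used
            return conn
        conn = self._connect(address)
        self._open[address] = conn
        if len(self._open) > self._limit:
            _, oldest = self._open.popitem(last=False)  # evict least recently used
            self._close(oldest)
        return conn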

This is actually a revival of an old feature that we had turned off last year when it became clear that the performance here wasn’t a problem.

Along with other enhancements, this takes our round-trip latency down to 11ms on my laptop.

In [10]: %%time
    ...: for i in range(1000):
    ...:     future = client.submit(inc, i)
    ...:     result = future.result()
    ...:
CPU times: user 4.96 s, sys: 348 ms, total: 5.31 s
Wall time: 11.1 s

There may be room for improvement here though. For comparison, here is the same test with concurrent.futures.ProcessPoolExecutor.

In [14]: e = ProcessPoolExecutor(8)

In [15]: %%time
    ...: for i in range(1000):
    ...:     future = e.submit(inc, i)
    ...:     result = future.result()
    ...:
CPU times: user 320 ms, sys: 56 ms, total: 376 ms
Wall time: 442 ms

Also, just to be clear, this measures total roundtrip latency, not overhead. Dask’s distributed scheduler overhead remains in the low hundreds of microseconds.

dask/distributed #935

Related Projects

There has been activity around Dask and machine learning:

  • dask-learn is undergoing some performance enhancements. It turns out that when you offer distributed grid search people quickly want to scale up their computations to hundreds of thousands of trials.
  • dask-glm now has a few decent algorithms for convex optimization. The authors of this wrote a blogpost very recently if you’re interested: Developing Convex Optimization Algorithms in Dask
  • dask-xgboost lets you hand off distributed data in Dask dataframes or arrays and hand it directly to a distributed XGBoost system (that Dask will nicely set up and tear down for you). This was a nice example of easy hand-off between two distributed services running in the same processes.
Acknowledgements

The following people contributed to the dask/dask repository since the 0.14.0 release on February 27th

  • Antoine Pitrou
  • Brian Martin
  • Elliott Sales de Andrade
  • Erik Welch
  • Francisco de la Peña
  • jakirkham
  • Jim Crist
  • Jitesh Kumar Jha
  • Julien Lhermitte
  • Martin Durant
  • Matthew Rocklin
  • Markus Gonser
  • Talmaj

The following people contributed to the dask/distributed repository since the 1.16.0 release on February 27th

  • Antoine Pitrou
  • Ben Schreck
  • Elliott Sales de Andrade
  • Martin Durant
  • Matthew Rocklin
  • Phil Elson
Categories: FLOSS Project Planets

Justin Mason: Links for 2017-03-22

Planet Apache - Wed, 2017-03-22 19:58
  • Why American Farmers Are Hacking Their Tractors With Ukrainian Firmware

    DRM working as expected:

    To avoid the draconian locks that John Deere puts on the tractors they buy, farmers throughout America’s heartland have started hacking their equipment with firmware that’s cracked in Eastern Europe and traded on invite-only, paid online forums. Tractor hacking is growing increasingly popular because John Deere and other manufacturers have made it impossible to perform “unauthorized” repair on farm equipment, which farmers see as an attack on their sovereignty and quite possibly an existential threat to their livelihood if their tractor breaks at an inopportune time. (via etienneshrdlu)

    (tags: hacking farming drm john-deere tractors firmware right-to-repair repair)

Categories: FLOSS Project Planets