FLOSS Project Planets

Colm O hEigeartaigh: Configuring Kerberos for Hive in Talend Open Studio for Big Data

Planet Apache - Thu, 2017-09-21 07:12
Earlier this year, I showed how to use Talend Open Studio for Big Data to access data stored in HDFS, where HDFS had been configured to authenticate users using Kerberos. A similar blog post showed how to read data from an Apache Kafka topic using Kerberos. In this tutorial I will show how to create a job in Talend Open Studio for Big Data to read data from an Apache Hive table using Kerberos. As a prerequisite, please follow a recent tutorial on setting up Apache Hadoop and Apache Hive using Kerberos.

1) Download Talend Open Studio for Big Data and create a job

Download Talend Open Studio for Big Data (6.4.1 was used for the purposes of this tutorial). Unzip the file when it is downloaded and then start the Studio using one of the platform-specific scripts. It will prompt you to download some additional dependencies and to accept the licenses. Click on "Create a new job" called "HiveKerberosRead". In the search bar under "Palette" on the right hand side enter "hive" and hit enter. Drag "tHiveConnection" and "tHiveInput" to the middle of the screen. Do the same for "tLogRow":

"tHiveConnection" will be used to configure the connection to Hive. "tHiveInput" will be used to perform a query on the "words" table we have created in Hive (as per the earlier tutorial linked above), and finally "tLogRow" will just log the data so that we can be sure that it was read correctly. The next step is to join the components up. Right click on "tHiveConnection" and select "Trigger/On Subjob Ok" and drag the resulting line to "tHiveInput". Right click on "tHiveInput" and select "Row/Main" and drag the resulting line to "tLogRow":



2) Configure the components

Now let's configure the individual components. Double click on "tHiveConnection". Select the following configuration options:
  • Distribution: Hortonworks
  • Version: HDP V2.5.0
  • Host: localhost
  • Database: default
  • Select "Use Kerberos Authentication"
  • Hive Principal: hiveserver2/localhost@hadoop.apache.org
  • Namenode Principal: hdfs/localhost@hadoop.apache.org
  • Resource Manager Principal: mapred/localhost@hadoop.apache.org
  • Select "Use a keytab to authenticate"
  • Principal: alice
  • Keytab: Path to "alice.keytab" in the Kerby test project.
  • Unselect "Set Resource Manager"
  • Set Namenode URI: "hdfs://localhost:9000"
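For readers who want to check the same settings outside the Studio, the host, database and Hive principal above map onto a HiveServer2 JDBC URL. The following is only a sketch: the helper name `hive_jdbc_url` is made up for illustration, and port 10000 is an assumption (the usual HiveServer2 default), since the tutorial does not state the port.

```python
def hive_jdbc_url(host, database, hive_principal, port=10000):
    """Build a HiveServer2 JDBC URL that asks for Kerberos authentication.

    The port default (10000) is an assumption; the tutorial does not give one.
    """
    return "jdbc:hive2://{}:{}/{};principal={}".format(
        host, port, database, hive_principal
    )


# Values taken from the tHiveConnection settings listed above.
url = hive_jdbc_url(
    host="localhost",
    database="default",
    hive_principal="hiveserver2/localhost@hadoop.apache.org",
)
print(url)
```

The `principal` part names the HiveServer2 service principal, not the user; the user ("alice") still has to obtain credentials separately, which is what the keytab setting does in the Studio.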

Now click on "tHiveInput" and select the following configuration options:
  • Select "Use an existing Connection"
  • Choose the tHiveConnection name from the resulting "Component List".
  • Click on "Edit schema". Create a new column called "word" of type String, and a column called "count" of type int. 
  • Table name: words
  • Query: "select * from words where word == 'Dare'"

Now the only thing that remains is to point to the krb5.conf file that is generated by the Kerby project. Click on "Window/Preferences" at the top of the screen. Select "Talend" and "Run/Debug". Add a new JVM argument: "-Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf":

Now we are ready to run the job. Click on the "Run" tab and then hit the "Run" button. You should see the following output in the Run Window in the Studio:

Categories: FLOSS Project Planets

Qt 5.6.3 Released

Planet KDE - Thu, 2017-09-21 06:08

I am pleased to announce that Qt 5.6.3 has been released today. As always with a patch release, Qt 5.6.3 does not bring any new features, just error corrections. For details of the bug fixes in Qt 5.6.3, please check the change logs for each module and the known issues of Qt 5.6.3 wiki page.

Qt 5.6 LTS is currently in the ‘Strict’ phase, and only receives fixes to security issues, crashes, regressions and similar. Since the end of 2016 we have already reduced the number of fixes going into the 5.6 branch, and after Qt 5.6 LTS enters the ‘Very Strict’ phase it will receive security fixes only. The reason for gradually reducing the number of changes going into an LTS version of Qt is to avoid stability problems. While each fix as such is beneficial, each one also carries a risk of behavior changes and regressions, which we want to avoid in LTS releases.

As part of our LTS commitment, we continue to support the commercial Qt 5.6 LTS users throughout the three-year standard support period, after which it is possible to purchase extended support. For a description of our support services, please check the recent blog post describing the Standard, Extended and Premium Support. In May 2017 we released Qt 5.9 LTS, which includes a wide range of new features, functionality and overall performance improvements. We expect to release the Qt 5.9.2 patch release before the end of September, including all the bug fixes of Qt 5.6.3 and many more. To learn more about the improvements that come with Qt 5.9 LTS you can find all relevant blogs and on-demand webinars here.

If you are using the online installer, Qt 5.6.3 can be updated using the maintenance tool. Offline packages are available for commercial users in the Qt Account portal and at the qt.io Download page for open-source users.

The post Qt 5.6.3 Released appeared first on Qt Blog.


Iain R. Learmonth: It Died

Planet Debian - Thu, 2017-09-21 05:10

On Sunday, in my weekly report on my free software activities, I wrote about how sustainable my current level of activities is. I had identified the risk that the computer that I use for almost all of my free software work was slowly dying. Last night it entered an endless reboot loop and subsequent efforts to save it have failed.

I cannot afford to replace this machine and my next best machine has half the cores, half the RAM and less than half of the screen real estate. As this is going to be a serious hit to my productivity, I need to seriously consider if I am able to continue to maintain the number of packages I currently do in Debian.

Update: Thank you for all the responses I’ve received on this post. While I have not yet resolved the situation, the level of response has me very confident that I will not have to orphan any packages and I should be back to work soon.

The Sun Ultra 24

SUSECON - librmb

Planet KDE - Thu, 2017-09-21 05:00

Next week I will be at SUSECON in Prague to present the new librmb/rbox project, sponsored by Deutsche Telekom. The presentation slot is Tuesday, Sep 26, 1:30 PM - 2:30 PM. If you attend and are interested in how to store emails directly in Ceph RADOS, add the session to your schedule.

eGenix.com: eGenix Talks & Videos: Python Idioms Talk

Planet Python - Thu, 2017-09-21 04:00
EuroPython 2016 in Bilbao, Basque Country, Spain

Marc-André Lemburg, Python Core Developer, chair of the EuroPython Society, the organization behind the EuroPython Conference, and Senior Software Architect, held a talk at EuroPython 2016 showcasing our experience with the valuation of a Python startup.

We have now published the talk as video and also released the presentation slides.

So you think your Python startup is worth $10 million...

Talk given at the EuroPython 2016 conference in Bilbao, Basque Country, Spain, presenting experience gained from valuation of a Python company and its code base.

Click to proceed to the talk video and slides ...

This talk is based on the speaker’s experience running a Python focused software company for more than 15 years and a recent consulting project to support the valuation of a Python startup company in the due diligence phase.

For the valuation we had to come up with metrics, a catalog of criteria analyzing risks, potential and benefits of the startup’s solution, as well as an estimate for how much effort it would take to reimplement the solution from scratch.

In the talk, I am going to show the metrics we used, how they can be applied to Python code, the importance of addressing risk factors, well designed code and data(base) structures.

By following some of the advice from this talk, you should be able to improve the valuation of your startup or consulting business in preparation for investment rounds or an acquisition.

-- Marc-André Lemburg

More interesting eGenix presentations are available in the presentations and talks community section of our website.

Related Python Coaching and Consulting

If you are interested in learning more about these advanced techniques, eGenix now offers Python project coaching and consulting services to give your project teams advice on how to design Python applications, successfully run projects, or find excellent Python programmers. Please contact our eGenix Sales Team for information.

Enjoy !

Charlie Clark, eGenix.com Sales & Marketing


Claus Ibsen: Getting Started with Apache Camel and Java by Bennet Schulz

Planet Apache - Thu, 2017-09-21 03:38
I just want to spread the word that Bennet Schulz yesterday posted a great short blog post on how to get started with Apache Camel using just Java. It shows you the basics of creating a new Camel project with your first Camel route, and how to run it with plain Java.

http://bennet-schulz.com/2017/09/19/getting-started-with-apache-camel-and-java
The blog post is very good and I recommend that new users of Apache Camel read it; it's short, about a 5-minute read.



Krita 3.3.0 – first release candidate

Planet KDE - Thu, 2017-09-21 03:14

Less than a month after Krita 3.2.1, we’re getting ready to release Krita 3.3.0. We’re bumping the version because there are some important changes for Windows users in this version!

Alvin Wong has implemented support for the Windows 8 event API, which means that Krita now supports the n-trig pen in the Surface line of laptops (and similar laptops from Dell, HP and Acer) natively. This is still very new, so you have to enable this in the tablet settings:

And he also refactored Krita’s hardware-accelerated display functionality to optionally use Angle on Windows instead of native OpenGL. That means that many problems with Intel display chips and broken driver versions are worked around because Krita now indirectly uses Direct3D.

There are more changes in this release, of course:

  • Some visual glitches when using hi-dpi screens are fixed (remember: on Windows and Linux, you need to enable this in the settings dialog).
  • If you create a new image from clipboard, the image will have a title
  • Favorite blending modes and favorite brush presets are now loaded correctly on startup
  • GMIC
    • the plugin has been updated to the latest version for Windows and Linux.
    • the configuration for setting the path to the plugin has been removed. Krita looks for the plugin in the folder where the krita executable is, and optionally inside a folder with a name that starts with ‘gmic’ next to the krita executable.
    • there are several fixes for handling layers and communication between Krita and the plugin
  • Some websites save jpeg images with a .png extension: that used to confuse Krita, but Krita now first looks inside the file to see what kind of file it really is.
  • PNG:
    • 16 and 32 bit floating point images are now converted to 16 bit integer when saving the images as PNG.
    • It’s now possible to save the alpha channel to PNG images even if there are no (semi-) transparent pixels in the image
  • When hardware accelerated display is disabled, the color picker mode of the brush tool showed a broken cursor; this has been fixed.
  • The Reference Images docker now only starts loading images when it is visible, instead of on Krita startup. Note: the reference images docker uses Qt’s imageio plugins to load images. If you are running on Linux, remove all Deepin desktop components. Deepin comes with severely broken qimageio plugins that will crash any Qt application that tries to display images.
  • File layers now correctly reload on change again
  • Add several new commandline options:
    • --nosplash to start Krita without showing the splash screen
    • --canvasonly to start Krita in canvas-only mode
    • --fullscreen to start Krita full-screen
    • --workspace Workspace to start Krita with the given workspace
  • Selections
    • The Select All action now first clears the selection before selecting the entire image
    • It is now possible to extend selections outside the canvas boundary
  • Performance improvements: in several places superfluous reads from the settings were eliminated, which makes generating a layer thumbnail faster and improves painting if display acceleration is turned off.
  • The smart number input boxes now use the current locale to follow desktop settings for numbers
  • The system information dialog for bug reports is improved
  • macOS/OSX specific changes:
    • Bernhard Liebl has improved the tablet/stylus accuracy. The problem with circles having straight line segments is much improved, though it’s not perfect yet.
    • On macOS/OSX systems with an AMD GPU, support for hardware accelerated display is disabled because saving to PNG and JPG hangs Krita otherwise.

Download

Windows

Note for Windows users: if you encounter crashes, please follow these instructions to use the debug symbols so we can figure out where Krita crashes. There are no 32-bit packages at this point, but there will be for the final release.

Linux

(If, for some reason, Firefox thinks it needs to load this as text: to download, right-click on the link.)

When it is updated, you can also use the Krita Lime PPA to install Krita 3.3.0-rc.1 on Ubuntu and derivatives.

OSX

Note: the gmic-qt and pdf plugins are not available on OSX.

Source code

md5sums

For all downloads:

Key

The Linux appimage and the source tarball are signed. You can retrieve the public key over https here: 0x58b9596c722ea3bd.asc. The signatures are here.

Support Krita

Krita is a free and open source project. Please consider supporting the project with donations or by buying training videos or the artbook! With your support, we can keep the core team working on Krita full-time.


Fabio Zadrozny: PyDev 6.0: pip, conda, isort and subword navigation

Planet Python - Thu, 2017-09-21 01:19
The new PyDev release is now out and offers some really nice features on a number of fronts!

The interpreter configuration now integrates with both pip and conda, showing the installed packages and allowing any package to be installed and uninstalled from inside the IDE.

Also, it goes a step further in the conda integration and allows users to load the proper environment variables from the env. This is off by default; when PyDev identifies an interpreter as being managed by conda, it can be turned on in the interpreter configuration page by checking the "Load conda env vars before run" option (so, if you have some library which relies on some configuration, you don't have to activate the env outside the IDE).



Another change which is pretty nice is that now when creating a project there's an option to specify that the project should always use the interpreter version for syntax validation.

Previously a default version for the grammar was set, but users could be confused when the version didn't match the interpreter... note that it's still possible to set a different version or even add additional syntax validators, for cases when you're actually dealing with supporting more than one Python version.

The editor now has support for subword navigation (so, navigating words such as MyReallyNiceClass with Ctrl+Left/Right will stop after each subword -- i.e.: 'My', 'Really', 'Nice', 'Class' -- remember that Shift+Alt+Up can be used to select the full word for the cases where Ctrl+Shift+Left/Right did it previously).

This mode is now also consistent among all platforms (previously each platform had its own style based on the underlying platform -- it's still possible to revert to that mode in the Preferences > PyDev > Editor > Word navigation option).
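To make the subword boundaries concrete, here is a small sketch that splits an identifier the way the description above suggests. This is not PyDev's implementation; the function name `split_subwords` and the exact rules (acronym runs kept together, underscores dropped) are assumptions chosen for illustration.

```python
import re

# An uppercase run not followed by a lowercase letter (an acronym),
# or an optional capital followed by lowercase letters, or a run of digits.
_SUBWORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_subwords(identifier):
    """Return the subwords a cursor would stop at inside an identifier."""
    return _SUBWORD.findall(identifier)


print(split_subwords("MyReallyNiceClass"))
print(split_subwords("HTTPServer"))
print(split_subwords("my_var2"))
```

With these rules `MyReallyNiceClass` yields the four stops mentioned in the post; how an editor treats acronyms, digits and underscores is a design choice that may differ from this sketch.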

Integration with PyLint and isort was also improved: the PyLint integration now provides an option to search for PyLint in the interpreter which a project is using and isort integration was improved to know about the available packages (i.e.: based on the project/interpreter configuration, PyDev knows a lot about which should be third party/library projects and passes that information along to isort).

On the unittest front, Robert Gomulka did some nice work: the name of the unittest being run is now properly shown in the run configuration, and it's possible to right-click a given selection in the dialog to run tests (Ctrl+F9) and edit the run configuration (to edit environment variables, etc.) before running it.

Aside from that there were also a number of other fixes and adjustments (see http://pydev.org for more details).

Enjoy!

p.s.: Thank you to all PyDev supporters -- https://www.brainwy.com/supporters/PyDev/ -- which enable PyDev to keep on being improved!

p.s.: LiClipse 4.2.0 already bundles PyDev 6.0, see: http://www.liclipse.com/download.html for download links.


Montreal Python User Group: Montréal-Python 66: Call For Speakers

Planet Python - Thu, 2017-09-21 00:00

It's back-to-everything and Montreal-Python is no exception! We are looking for speakers for our first meetup of fall.

We are looking for speakers that want to give a regular presentation (20 to 25 minutes) or a lightning talk (5 minutes).

Submit your proposal at team@montrealpython.org

When

October 2nd, 2017 at 6PM

Where

TBD

PyCon Canada Early Bird Tickets

Also, a little reminder that Early Bird tickets for PyCon Canada (which will be held in Montreal on November 18th to 21st) are now available at https://2017.pycon.ca/.

The early bird rates are only for a limited quantity of tickets, so get yours soon!

PyCon Canada Sponsorship

Would you like to become a sponsor for PyCon Canada? Send an email to sponsorship@pycon.ca


Lullabot: Lullabots Coming to DrupalCon Vienna

Planet Drupal - Wed, 2017-09-20 23:06

Several of our Lullabots and the team from our sister company, Drupalize.me, are about to descend upon the City of Music to present seven kick-ass sessions to the Drupal community in the EU. There will be a cornucopia of topics presented — from softer human-centric topics such as imposter syndrome to more technical topics such as Decoupled Drupal. So, if you're headed to DrupalCon Vienna next week, be sure to eat plenty of Sachertorte, drink lots of Ottakringer, and check out these sessions that will Rock You Like Amadeus:

Contenta - Drupal’s API Distribution Tuesday, September 26, 10:45-11:45

Sally Young, Cristina Chumillas, and Daniel Wehner

Contenta is a decoupled Drupal distribution that has many examples of various front-ends available as best practices guides. Lullabot Senior Technical Architect Sally Young, Cristina Chumillas, and Daniel Wehner will bring you up to speed on the latest Contenta developments, including its current features and roadmap. You will also get a tour of Contenta’s possibilities that come with reference applications that implement the out-of-the-box initiative’s cooking recipe.

Automated Testing 101 Tuesday, September 26th, 10:45 - 11:45

Ezequiel “Zequi” Vázquez

Lullabot Developer, Ezequiel “Zequi” Vázquez, will explore the current state of test automation and present the most useful tools that provide testing capabilities for security, accessibility, performance, scaling, and more. Zequi will also give you advice on the best strategies to implement automated testing for your application, and how to cover relevant aspects of your software.

Get Started with Voice User Interfaces Tuesday, September 26th, 15:45 - 16:45

Amber Himes Matz

Drupalize.me Production Manager & Trainer, Amber Himes Matz, will survey the current state of voice and conversational interface APIs with an eye toward global language support. She’ll cover services including Alexa, Google, and Cortana by examining their distinct features and the devices, platforms, interactions, and spoken languages they support. If you’re looking for a better understanding of the voice and conversational interface services landscape, ideas on how to approach the voice UI design process, an understanding of concepts and terminology related to voice interaction, and ways to get started, this is the right session for you - complete with a demo!

Breaking the Myths of the Rockstar Developer Wednesday, September 27th, 10:45 - 11:45

Juan Olalla Olmo & Salvador Molina

Lullabot Developer, Juan Olalla Olmo, and Salvador Molina will share their experiences and explore the areas and attitudes that can help everyone become better professionals by embracing who they are and ultimately empower others to do the same. This inspiring session aims to help you grow professionally and provide more value at work by focusing on fostering the human relationships and growing as people.

Juan gave this presentation internally at Lullabot’s recent Design and Development Retreat. It was a highlight that sparked a lively conversation.

Virtual Reality on the Web - Overview and "How-to" Demo Wednesday, September 27th, 13:35 - 14:00

Wes Ruvalcaba

Want to make your own virtual reality experiences? Lullabot Senior Front-end Developer Wes Ruvalcaba will show you how. Starting with an overview of VR (and AR) concepts, technologies, and their uses, Wes will also demo and share code examples of VR websites we’ve made at Lullabot. You’ll also get an intro to A-Frame and Wes will explain how you can get started.

Thursday Keynote - Everyone Has Something to Share Thursday, September 28th, 9:00 - 10:15

Joe Shindelar

We’re especially proud of Drupalize.me's Joe Shindelar for being selected to give the Community Keynote. If you’ve been around Drupal for a while, it’s likely you’ve either met or learned from Joe. In this session, Joe will reflect on 10 years of both successfully and unsuccessfully engaging with the community. By doing so he hopes to help others learn about what they have to share, and the benefits of doing so. This is important because sharing:

  • Creates diversity, both of thought and culture
  • Builds people up, helps them realize their potential, and enriches our community
  • Fosters connections, and makes you, as an individual, smarter
  • Creates opportunities for yourself and others
  • Feels all warm and fuzzy

Making Content Editors Happy in Drupal 8 with Entity Browser Thursday, September 28th, 14:15 - 15:15

Marcos Cano

Lullabot Developer Marcos Cano will be presenting on Entity Browser, which is a Drupal 8 contrib module created to upload multiple images/files at once, select and re-use an image/file already present on the server, and more. In this session Marcos will:

  • Explain the basic architecture of the module, and how to take advantage of its plugin-based approach to extend and customize it
  • See how to configure it from scratch to solve different use-cases, including some pitfalls that often occur in that process
  • Check what we can copy or re-use from other contrib modules
  • Explore some possible integrations with other parts of the media ecosystem

See you next week in Wien!


Matthew Rocklin: Fast GeoSpatial Analysis in Python

Planet Python - Wed, 2017-09-20 20:00

This work is supported by Anaconda Inc., the Data Driven Discovery Initiative from the Moore Foundation, and NASA SBIR NNX16CG43P

This work is a collaboration with Joris Van den Bossche. This blogpost builds on Joris’s EuroSciPy talk (slides) on the same topic. You can also see Joris’ blogpost on this same topic.

TL;DR:

Python’s Geospatial stack is slow. We accelerate the GeoPandas library with Cython and Dask. Cython provides 10-100x speedups. Dask gives an additional 3-4x on a multi-core laptop. Everything is still rough, please come help.

We start by reproducing a blogpost published last June, but with 30x speedups. Then we talk about how we achieved the speedup with Cython and Dask.

All code in this post is experimental. It should not be relied upon.

Experiment

In June Ravi Shekhar published a blogpost Geospatial Operations at Scale with Dask and GeoPandas in which he counted the number of rides originating from each of the official taxi zones of New York City. He read, processed, and plotted 120 million rides, performing an expensive point-in-polygon test for each ride, and produced a figure much like the following:

This took about three hours on his laptop. He used Dask and a bit of custom code to parallelize Geopandas across all of his cores. Using this combination he got close to the speed of PostGIS, but from Python.

Today, using an accelerated GeoPandas and a new dask-geopandas library, we can do the above computation in around eight minutes (half of which is reading CSV files) and so can produce a number of other interesting images with faster interaction times.

A full notebook producing these plots is available below:

The rest of this article talks about GeoPandas, Cython, and speeding up geospatial data analysis.
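For readers unfamiliar with the point-in-polygon test at the heart of the experiment above, here is a minimal pure-Python sketch using the classic ray-casting rule. It is not the code used in the experiment (that goes through GEOS); the function names, the square "zones" and the ride coordinates are all made up for illustration.

```python
from collections import Counter


def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from
    (x, y) crosses. An odd number of crossings means the point is inside."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def count_rides_per_zone(rides, zones):
    """Assign each ride to the first zone containing its pickup point."""
    counts = Counter()
    for x, y in rides:
        for name, polygon in zones.items():
            if point_in_polygon(x, y, polygon):
                counts[name] += 1
                break
    return counts


# Hypothetical zones (two adjacent squares) and pickup points.
zones = {
    "west": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "east": [(4, 0), (8, 0), (8, 4), (4, 4)],
}
rides = [(1, 1), (2, 3), (5, 2), (9, 9)]
counts = count_rides_per_zone(rides, zones)
print(counts)
```

Each ride costs one pass over the edges of every candidate zone, which is why running this kind of test 120 million times in interpreted Python is expensive.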

Background in Geospatial Data

The Shapely User Manual begins with the following passage on the utility of geospatial analysis to our society.

Deterministic spatial analysis is an important component of computational approaches to problems in agriculture, ecology, epidemiology, sociology, and many other fields. What is the surveyed perimeter/area ratio of these patches of animal habitat? Which properties in this town intersect with the 50-year flood contour from this new flooding model? What are the extents of findspots for ancient ceramic wares with maker’s marks “A” and “B”, and where do the extents overlap? What’s the path from home to office that best skirts identified zones of location based spam? These are just a few of the possible questions addressable using non-statistical spatial analysis, and more specifically, computational geometry.

Shapely is part of Python’s GeoSpatial stack which is currently composed of the following libraries:

  1. Shapely: Manages shapes like points, linestrings, and polygons. Wraps the GEOS C++ library
  2. Fiona: Handles data ingestion. Wraps the GDAL library
  3. Rasterio: Handles raster data like satellite imagery
  4. GeoPandas: Extends Pandas with a column of shapely geometries to intuitively query tables of geospatially annotated data.

These libraries provide intuitive Python wrappers around the OSGeo C/C++ libraries (GEOS, GDAL, …) which power virtually every open source geospatial library, like PostGIS, QGIS, etc. They provide the same functionality, but are typically much slower due to how they use Python. This is acceptable for small datasets, but becomes an issue as we transition to larger and larger datasets.

In this post we focus on GeoPandas, a geospatial extension of Pandas which manages tabular data that is annotated with geometry information like points, paths, and polygons.

GeoPandas Example

GeoPandas makes it easy to load, manipulate, and plot geospatial data. For example, we can download the NYC taxi zones, load and plot them in a single line of code.

import geopandas

(geopandas.read_file('taxi_zones.shp')
          .to_crs({'init': 'epsg:4326'})
          .plot(column='borough', categorical=True))

Cities are now doing a wonderful job publishing data into the open. This provides transparency and an opportunity for civic involvement to help analyze, understand, and improve our communities. Here are a few fun geospatially-aware datasets to make you interested:

  1. Chicago Crimes from 2001 to present (one week ago)
  2. Paris Velib (bikeshare) in real time
  3. Bike lanes in New Orleans
  4. New Orleans Police Department incidents involving the use of force

Performance

Unfortunately GeoPandas is slow. This limits interactive exploration on larger datasets. For example the Chicago crimes data (the first dataset above) has seven million entries and is several gigabytes in memory. Analyzing a dataset of this size interactively with GeoPandas is not feasible today.

This slowdown is because GeoPandas wraps each geometry (like a point, line, or polygon) with a Shapely object and stores all of those objects in an object-dtype column. When we compute a GeoPandas operation on all of our shapes we just iterate over these shapes in Python. As an example, here is how one might implement a distance method in GeoPandas today.

def distance(self, other):
    result = [geom.distance(other) for geom in self.geometry]
    return pd.Series(result)

Unfortunately this just iterates over elements in the series, each of which is an individual Shapely object. This is inefficient for two reasons:

  1. Iterating through Python objects is slow relative to iterating through those same objects in C.
  2. Shapely Python objects consume more memory than the GEOS Geometry objects that they wrap.

This results in slow performance.

Cythonizing GeoPandas

Fortunately, we’ve rewritten GeoPandas with Cython to directly loop over the underlying GEOS pointers. This provides a 10-100x speedup depending on the operation. So instead of using a Pandas object-dtype column that holds shapely objects we instead store a NumPy array of direct pointers to the GEOS objects.

Before

After

As an example, our function for distance now looks like the following Cython implementation (some liberties taken for brevity):

cpdef distance(self, other):
    cdef int n = self.size
    cdef GEOSGeometry *left_geom
    cdef GEOSGeometry *right_geom = other.__geom__  # a geometry pointer
    geometries = self._geometry_array

    with nogil:
        for idx in xrange(n):
            left_geom = <GEOSGeometry *> geometries[idx]
            if left_geom != NULL:
                distance = GEOSDistance_r(left_geom, right_geom)
            else:
                distance = NaN

For fast operations we see speedups of 100x. For slower operations we’re closer to 10x. Now these operations run at full C speed.

In his EuroSciPy talk Joris compares the performance of GeoPandas (both before and after Cython) with PostGIS, the standard geospatial plugin for the popular PostgreSQL database (original notebook with the comparison). I’m stealing some plots from his talk below:

Cythonized GeoPandas and PostGIS run at almost exactly the same speed. This is because they use the same underlying C library, GEOS. These algorithms are not particularly complex, so it is not surprising that everyone implements them in exactly the same way.

This is great. The Python GIS stack now has a full-speed library that operates as fast as any other open GIS system is likely to manage.

Problems

However, this is still a work in progress, and there is still plenty of work to do.

First, we need for Pandas to track our arrays of GEOS pointers differently from how it tracks a normal integer array. This is both for usability reasons, like we want to render them differently and don’t want users to be able to perform numeric operations like sum and mean on these arrays, and also for stability reasons, because we need to track these pointers and release their allocated GEOSGeometry objects from memory at the appropriate times. Currently, this goal is pursued by creating a new block type, the GeometryBlock (‘blocks’ are the internal building blocks of pandas that hold the data of the different columns). This will require some changes to Pandas itself to enable custom block types (see this issue on the pandas issue tracker).

Second, data ingestion is still quite slow. This relies not on GEOS, but on GDAL/OGR, which is handled in Python today by Fiona. Fiona is optimized more for consistency and usability than for raw speed. Previously, when GeoPandas was slow, this made sense because no one was operating on particularly large datasets. However, now we observe that data loading is often several times more expensive than all of our manipulations, so this will probably need some effort in the future.

Third, there are some algorithms within GeoPandas that we haven’t yet Cythonized. This includes both particular features like overlay and dissolve operations as well as small components like GeoJSON output.

Finally as with any rewrite on a codebase that is not exhaustively tested (we’re trying to improve testing as we do this) there are probably several bugs that we won’t detect until some patient and forgiving user runs into them first.

Still though, all linear geospatial operations work well and are thoroughly tested. Also spatial joins (a backbone of many geospatial operations) are up and running at full speed. If you work in a non-production environment then Cythonized GeoPandas may be worth your time to investigate.

You can track future progress on this effort at geopandas/geopandas #473 which includes installation instructions.

Parallelize with Dask

Cythonizing gives us speedups in the 10x-100x range. We use a single core as effectively as is possible with the GEOS library. Now we move on to using multiple cores in parallel. This gives us an extra 3-4x on a standard 4 core laptop. We can also scale to clusters, though I’ll leave that for a future blogpost.

To parallelize we need to split apart our dataset into multiple chunks. We can do this naively by placing the first million rows in one chunk, the second million rows in another chunk, etc. or we can partition our data spatially, for example by placing all of the data for one region of our dataset in one chunk and all of the data for another region in another chunk, and so on. Both approaches are implemented in a rudimentary dask-geopandas library available on GitHub.

So just as dask-array organizes many NumPy arrays along a grid and dask-dataframe organizes many Pandas dataframes along a linear index, the dask-geopandas library organizes many GeoPandas dataframes into spatial regions. For example, we might partition data in the city of New York into its different boroughs. Data for each borough would be handled separately by a different thread or, in a distributed situation, might live on a different machine.

This gives us two advantages:

  1. Even without geospatial partitioning, we can use many cores (or many machines) to accelerate simple operations.
  2. For spatially aware operations, like spatial joins or subselections we can engage only those parts of the parallel dataframe that we know are relevant for various parts of the computation.

However, spatial partitioning is also expensive and not always necessary. In our initial exercise with the NYC Taxi data we didn’t do this, and we still got significant speedups just from normal multicore operation.
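To make the difference between the two partitioning schemes concrete, here is a minimal sketch in plain Python. It is not the dask-geopandas implementation: the function names, the square grid, the cell size, and the sample points are all assumptions made up for illustration. It shows that with spatial partitions a bounding-box query only has to scan the chunks whose region overlaps the query.

```python
from collections import defaultdict


def partition_by_rows(points, chunk_size):
    """Naive partitioning: consecutive rows end up in the same chunk."""
    return [points[i:i + chunk_size] for i in range(0, len(points), chunk_size)]


def partition_by_grid(points, cell_size):
    """Spatial partitioning: points in the same grid cell share a chunk."""
    cells = defaultdict(list)
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append((x, y))
    return dict(cells)


def cells_touching(bbox, cell_size):
    """Grid cells that overlap a query box (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bbox
    return {
        (i, j)
        for i in range(int(xmin // cell_size), int(xmax // cell_size) + 1)
        for j in range(int(ymin // cell_size), int(ymax // cell_size) + 1)
    }


def query(partitions, bbox, cell_size):
    """Scan only the partitions whose cell overlaps the query box."""
    xmin, ymin, xmax, ymax = bbox
    hits = []
    for key in cells_touching(bbox, cell_size):
        for x, y in partitions.get(key, []):
            if xmin <= x <= xmax and ymin <= y <= ymax:
                hits.append((x, y))
    return hits


# Hypothetical sample data, not taken from the taxi dataset.
points = [(0.5, 0.5), (1.5, 0.2), (9.1, 9.7), (0.9, 0.1), (5.0, 5.0)]
row_chunks = partition_by_rows(points, chunk_size=2)
grid = partition_by_grid(points, cell_size=1.0)
nearby = query(grid, bbox=(0.0, 0.0, 0.99, 0.99), cell_size=1.0)
```

With row-based chunks every chunk would have to be scanned to answer the same query, which is still fine for simple elementwise operations but gives no pruning for spatial ones.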

Exercise

And so to produce the images we did at the top of this post we used a combination of dask.dataframe to load in CSV files, dask-geopandas to perform the spatial join, and then dask.dataframe and normal pandas to perform the actual computations. Our code looked something like the following:

    import dask.dataframe as dd
    import dask_geopandas as dg

    df = dd.read_csv('yellow_tripdata_2015-*.csv')
    gf = dg.set_geometry(df, geometry=df[['pickup_longitude', 'pickup_latitude']],
                         crs={'init': 'epsg:4326'})
    gf = dg.sjoin(gf, zones[['zone', 'borough', 'geometry']])
    # passenger_count is kept as well because the groupby further down counts it
    full = gf[['zone', 'payment_type', 'tip_amount', 'fare_amount', 'passenger_count']]

    full.to_parquet('nyc-zones.parquet')  # compute and cache result on disk
    full = dd.read_parquet('nyc-zones.parquet')

And then we can do typical groupbys and joins on the more typical pandas-like data now properly annotated with zones.

    import pandas as pd
    import geopandas

    result = full.passenger_count.groupby(full.zone).count().compute()
    result.name = 'count'
    joined = pd.merge(result.to_frame(), zones, left_index=True, right_on='zone')
    joined = geopandas.GeoDataFrame(joined)  # convert back for plotting

We’ve replaced most of Ravi’s custom analysis with a few lines of new standard code. This maxes out our CPU when doing spatial joins. Everything here releases the GIL well and the entire computation operates in under a couple gigabytes of RAM.

Problems

The dask-geopandas project is currently a prototype. It will easily break for non-trivial applications (and indeed many trivial ones). It was designed to see how hard it would be to implement some of the trickier operations like spatial joins, repartitioning, and overlays. This is why, for example, it supports a fully distributed spatial join, but lacks simple operations like indexing. There are other longer-term issues as well.

Serialization costs are manageable, but decently high. We currently use the standard “well known binary” WKB format common in other geospatial applications but have found it to be fairly slow, which bogs down inter-process parallelism.
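For readers unfamiliar with WKB, here is a minimal sketch of what the format looks like for a single 2D point, written with Python's standard `struct` module. The helper names are made up for illustration, and this is not the code path dask-geopandas uses; it only shows that every geometry has to be turned into a byte string and parsed back again when it crosses a process boundary.

```python
import struct


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB: byte order, geometry type, x, y."""
    # First field: 1 marks little-endian. Second field: 1 is the "Point" type.
    return struct.pack("<BIdd", 1, 1, x, y)


def point_from_wkb(data):
    """Decode the 21-byte payload produced by point_to_wkb."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", data)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("this sketch only handles little-endian 2D points")
    return x, y


# Hypothetical coordinates somewhere in New York.
wkb = point_to_wkb(-73.98, 40.75)
roundtrip = point_from_wkb(wkb)
```

A point costs 21 bytes (1 + 4 + 8 + 8); polygons and multi-part geometries carry additional counts and nested headers, so the encode and decode work grows with geometry complexity.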

Similarly distributed and spatially partitioned data stores don’t seem to be common (or at least I haven’t run across them yet).

It’s not clear how dask-geopandas dataframes and normal dask dataframes should interact. It would be very convenient to reuse all of the algorithms in dask.dataframe, but the index structures of the two libraries are very different. This may require some clever software engineering on the part of the Dask developers.

Still though, these seem surmountable and generally this process has been easy so far. I suspect that we can build an intuitive and performant parallel GIS analytics system with modest effort.

The notebook for the example at the start of the blogpost shows using dask-geopandas with good results.

Conclusion

With established technologies in the PyData space like Cython and Dask we’ve been able to accelerate and scale GeoPandas operations above and beyond industry standards. However this work is still experimental and not ready for production use. This work is a bit of a side project for both Joris and Matthew and they would welcome effort from other experienced open source developers. We believe that this project can have a large social impact and are enthusiastic about pursuing it in the future. We hope that you share our enthusiasm.

Categories: FLOSS Project Planets

Justin Mason: Links for 2017-09-20

Planet Apache - Wed, 2017-09-20 19:58
  • Locking, Little’s Law, and the USL

    Excellent explanatory mailing list post by Martin Thompson to the mechanical-sympathy group, discussing Little’s Law vs the USL:

    Little’s law can be used to describe a system in steady state from a queuing perspective, i.e. arrival and leaving rates are balanced. In this case it is a crude way of modelling a system with a contention percentage of 100% under Amdahl’s law, in that throughput is one over latency. However this is an inaccurate way to model a system with locks. Amdahl’s law does not account for coherence costs. For example, if you wrote a microbenchmark with a single thread to measure the lock cost then it is much lower than in a multi-threaded environment where cache coherence, other OS costs such as scheduling, and lock implementations need to be considered. Universal Scalability Law (USL) accounts for both the contention and the coherence costs. http://www.perfdynamics.com/Manifesto/USLscalability.html When modelling locks it is necessary to consider how contention and coherence costs vary given how they can be implemented. Consider in Java how we have biased locking, thin locks, fat locks, inflation, and revoking biases which can cause safe points that bring all threads in the JVM to a stop with a significant coherence component.

    (tags: usl scaling scalability performance locking locks java jvm amdahls-law littles-law system-dynamics modelling systems caching threads schedulers contention)

  • “HTML email, was that your fault?”

    jwz may indeed have invented this feature way back in Netscape Mail. FWIW I think he’s right — Netscape Mail was the first usage of HTML email I recall

    (tags: netscape history html email smtp mime mozilla jwz)
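As a rough illustration of the Little’s Law and USL item above, the sketch below evaluates the Universal Scalability Law, C(N) = N / (1 + σ(N − 1) + κN(N − 1)), where σ is the contention term and κ the coherence term. Setting κ to zero gives Amdahl's law. The parameter values are hypothetical and chosen only to show the shape of the curve; they are not measurements from the linked post.

```python
def usl_capacity(n, contention, coherence):
    """Relative capacity at n threads under the Universal Scalability Law."""
    return n / (1 + contention * (n - 1) + coherence * n * (n - 1))


def amdahl_speedup(n, contention):
    """Amdahl's law is the USL with the coherence term set to zero."""
    return usl_capacity(n, contention, coherence=0.0)


# Hypothetical parameters: 5% contention, 0.1% coherence cost.
SIGMA = 0.05
KAPPA = 0.001

thread_counts = [1, 2, 4, 8, 16, 32, 64]
amdahl_curve = [amdahl_speedup(n, SIGMA) for n in thread_counts]
usl_curve = [usl_capacity(n, SIGMA, KAPPA) for n in thread_counts]
```

With these numbers the Amdahl curve keeps rising toward its 1/σ ceiling, while the USL curve peaks near 31 threads and then declines, which is the retrograde behaviour that coherence costs introduce.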

Agiledrop.com Blog: AGILEDROP: DrupalCon Vienna sessions you don't want to miss

Planet Drupal - Wed, 2017-09-20 19:25
There will be a lot of sessions at DrupalCon Vienna. That's nothing new, to be fair. DrupalCons are the biggest Drupal events, so you can't catch all the sessions you want. Therefore, we have made a short list of the sessions you don't want to miss. We hope it will help you. But before looking at it, it's fair to say that the Business sessions were excluded because we have already presented them on another occasion. Moreover, our commercial director Iztok Smolic was directly involved in selecting them, so if we pointed out any session from the business track, you may have argued about the… READ MORE

Bay Area Drupal Camp: 10 Things to Make Your BADCamp ROCK!

Planet Drupal - Wed, 2017-09-20 19:24
Anne - Wed, 09/20/2017 - 4:24pm

Here’s a list of 10 important tips and tricks to help make sure you have a magical BADCamp experience.

BADCamp is sure to be a great event. Tickets are FREE. Register today!

1. Wear Good, Comfortable Shoes

If you want to have a great time the whole time you’re at BADCamp, we STRONGLY recommend wearing shoes that are comfortable but give you lots of support. You don’t want to miss out on all the fun stuff we have planned because you have to take a break to rest your poor tootsies. Don’t wear brand new shoes either and you might want to also get insoles.

2. Dress in Layers

October in Berkeley is beautiful, but let’s face it, room temperatures are unpredictable. For this reason, bring a hoodie (or donate to get a special edition 2017 BADCamp hoodie) that you can throw on and/or take off as the climate requires. The historical average for that time of year is in the mid 70’s (about 22 – 25 Celsius).

3. Plan Your Schedule

Are you coming to learn specific skills? Check out the training classes, summits and sessions available and create your own schedule.

Do you want to find a new employer? Check out the job board and sponsors expo to meet awesome Drupal shops.

Who do you want to meet with while you are at BADCamp? A famous podcaster or module maintainer? Find out who is coming on the attendee list and reach out to connect. Magical moments are frequent at BADCamp.

4. Bring a Laptop

If you want to get the most out of your BADCamp experience, be sure to bring a laptop. You will want to follow along and try some of the cool things the presenters show you. Bring a notepad too. Sometimes getting to an outlet to charge your laptop can be tricky. So that you don’t forget something important while your laptop charges, bring a notebook or notepad and a pen and take some notes.

5. Bring a Water Bottle/Travel Mug

There will be water fountains and FREE coffee/tea. Bringing a refillable water bottle means that you can stay focused on what you’re doing longer and get the most out of the sessions you're attending. Not to mention, using a water bottle instead of buying bottles of water is far better for the environment.

6. Bring Chargers for ALL your Devices and a Mobile Charger

There’s nothing worse than being out and about with a dead phone/tablet/laptop. Bring chargers for all of the devices you intend to use at BADCamp. If you use a battery-operated mouse (or wireless remote for presenting), bringing an extra set of batteries couldn’t hurt either. Even if you don’t end up needing them, you could find yourself with a new friend when you share those extra batteries with someone in need.

7. Bring Business Cards

Make it easy to connect! You will meet lots of great people and some of them you will want to keep in touch with. Get in the habit of giving out a card when you meet someone.

8. Condense your Stuff

You will walk around campus, so a lighter load is ideal. Plus you will want room for SWAG!  Condense your backpack down. Pro Tip: Get a small tote or even a Ziploc bag to stick all of your conference swag in. That way all of the stickers and little bits and pieces are in one bag that you can stick in your luggage at the end of the conference.

9. Bring a Pair of Headphones

As much as we all want to be able to unplug from our jobs and just focus on the sessions, it’s not always possible. Sometimes you have to put your nose to the grindstone and get some work done. If you’re the type that needs to listen to some music while you work, bring along a pair of earbuds so that you can focus and not disturb others around you.

10. Bring a Friend

While not required, having a friend tag along with you can make for a memorable BADCamp experience. If you’re like me and you’re road tripping to BADCamp, think of all of the awesome photos, sing-a-longs, and weird roadside attractions that you’ll see and get to enjoy together. If you’re flying, it’s always nice to have a travel buddy to keep you company while you’re waiting at the airport during the inevitable layover.

Pro Tip: Don’t use your buddy as a reason to shut out others. Go in with an open mind and you’re sure to find another new friend (or friends!) to share the experience with.

Drupal Planet

Sandipan Dey: Some Social Network Analysis with Python

Planet Python - Wed, 2017-09-20 19:04
The following problems appeared in the programming assignments in the Coursera course Applied Social Network Analysis in Python. The descriptions of the problems are taken from the assignments. The analysis is done using NetworkX. 1. Creating and Manipulating Graphs Eight employees at a small company were asked to choose 3 movies that they would most enjoy … Continue reading Some Social Network Analysis with Python

Lullabot: Lullabots Coming to DrupalCon Vienna

Planet Drupal - Wed, 2017-09-20 18:55

Several of our Lullabots and the team from our sister company, Drupalize.me, are about to descend upon the City of Music to present seven kick-ass sessions to the Drupal community in the EU. There will be a cornucopia of topics presented — from softer human-centric topics such as imposter syndrome to more technical topics such as Decoupled Drupal. So, if you're headed to DrupalCon Vienna next week, be sure to eat plenty of Sachertorte, drink lots of Ottakringer, and check out these sessions that will Rock You Like Amadeus:

Contenta - Drupal’s API Distribution Tuesday, September 26th, 10:45 - 11:45

Sally Young, Cristina Chumillas, and Daniel Wehner

Contenta is a decoupled Drupal distribution that has many examples of various front-ends available as best practices guides. Lullabot Senior Technical Architect Sally Young, Cristina Chumillas, and Daniel Wehner will bring you up to speed on the latest Contenta developments, including its current features and roadmap. You will also get a tour of Contenta’s possibilities that come with reference applications that implement the out-of-the-box initiative’s cooking recipe.

Automated Testing 101 Tuesday, September 26th, 10:45 - 11:45

Ezequiel “Zequi” Vázquez

Lullabot Developer, Ezequiel “Zequi” Vázquez, will explore the current state of test automation and present the most useful tools that provide testing capabilities for security, accessibility, performance, scaling, and more. Zequi will also give you advice on the best strategies to implement automated testing for your application, and how to cover relevant aspects of your software.

Get Started with Voice User Interfaces Tuesday, September 26th, 15:45 - 16:45

Amber Himes Matz

Drupalize.me Production Manager & Trainer, Amber Himes Matz, will survey the current state of voice and conversational interface APIs with an eye toward global language support. She’ll cover services including Alexa, Google, and Cortana by examining their distinct features and the devices, platforms, interactions, and spoken languages they support. If you’re looking for a better understanding of the voice and conversational interface services landscape, ideas on how to approach the voice UI design process, an understanding of concepts and terminology related to voice interaction, and ways to get started, this is the right session for you - complete with a demo!

Breaking the Myths of the Rockstar Developer Wednesday, September 27th, 10:45 - 11:45

Juan Olalla Olmo & Salvador Molina

Lullabot Developer, Juan Olalla Olmo, and Salvador Molina will share their experiences and explore the areas and attitudes that can help everyone become better professionals by embracing who they are and ultimately empower others to do the same. This inspiring session aims to help you grow professionally and provide more value at work by focusing on fostering the human relationships and growing as people.

Juan gave this presentation internally at Lullabot’s recent Design and Development Retreat. It was a highlight that sparked a lively conversation.

Virtual Reality on the Web - Overview and "How-to" Demo Wednesday, September 27th, 13:35 - 14:00

Wes Ruvalcaba

Want to make your own virtual reality experiences? Lullabot Senior Front-end Developer Wes Ruvalcaba will show you how. Starting with an overview of VR (and AR) concepts, technologies, and uses, Wes will also demo and share code examples of VR websites we’ve made at Lullabot. You’ll also get an intro to A-Frame and Wes will explain how you can get started.

Thursday Keynote - Everyone Has Something to Share Thursday, September 28th, 9:00 - 10:15

Joe Shindelar

We’re especially proud of Drupalize.me's Joe Shindelar for being selected to give the Community Keynote. If you’ve been around Drupal for a while, it’s likely you’ve either met or learned from Joe. In this session, Joe will reflect on 10 years of both successfully and unsuccessfully engaging with the community. By doing so he hopes to help others learn about what they have to share, and the benefits of doing so. This is important because sharing:

  • Creates diversity, both of thought and culture
  • Builds people up, helps them realize their potential, and enriches our community
  • Fosters connections, and makes you, as an individual, smarter
  • Creates opportunities for yourself and others
  • Feels all warm and fuzzy

Making Content Editors Happy in Drupal 8 with Entity Browser Thursday, September 28th, 14:15 - 15:15

Marcos Cano

Lullabot Developer Marcos Cano will be presenting on Entity Browser, which is a Drupal 8 contrib module created to upload multiple images/files at once, select and re-use an image/file already present on the server, and more. In this session Marcos will:

  • Explain the basic architecture of the module, and how to take advantage of its plugin-based approach to extend and customize it
  • See how to configure it from scratch to solve different use-cases, including some pitfalls that often occur in that process
  • Check what we can copy or re-use from other contrib modules
  • Explore some possible integrations with other parts of the media ecosystem

See you next week in Wien!

FSF Blogs: Free Software Directory meeting recap for September 15th, 2017

GNU Planet! - Wed, 2017-09-20 17:33

Every week free software activists from around the world come together in #fsf on irc.freenode.org to help improve the Free Software Directory. This recaps the work we accomplished at the Friday, September 15th, 2017 meeting.

Last week's theme was again adding new entries. This time we ended up filing a lot of bugs with packages, rather than getting to add a lot of packages. That's still a very useful part of the work that we do on the Directory. The Directory helps users to find free software, and making sure that there isn't a freedom issue with a particular package ensures that there's more free software out there for them to find. Often the issue is something simple, like a missing license file. But sometimes it can get a bit tricky to sort out, when there are multiple conflicting licenses. So there's work to be done that can be accomplished by volunteers of any skill level, from just starting out to license-hacking gurus. Hope to see you all there again at the next meeting.

If you would like to help update the directory, meet with us every Friday in #fsf on irc.freenode.org from 12 p.m. to 3 p.m. EDT (16:00 to 19:00 UTC).

Steve Kemp: Retiring the Debian-Administration.org site

Planet Debian - Wed, 2017-09-20 17:00

So previously I've documented the setup of the Debian-Administration website, and now that I'm going to retire it, I'm planning how that will work.

There are currently 12 servers powering the site:

  • web1
  • web2
  • web3
  • web4
    • These perform the obvious role, serving content over HTTPS.
  • public
    • This is a HAProxy host which routes traffic to one of the four back-ends.
  • database
    • This stores the site-content.
  • events
    • There was a simple UDP-based protocol which sent notices here, from various parts of the code.
    • e.g. "Failed login for bob from 1.2.3.4".
  • mailer
    • Sends out emails. ("You have a new reply", "You forgot your password..", etc)
  • redis
    • This stored session-data, and short-term cached content.
  • backup
    • This contains backups of each host, via Obnam.
  • beta
    • A test-install of the codebase
  • planet
    • The blog-aggregation site

I've made a bunch of commits recently to drop the event-sending, since no more dynamic actions will be possible. So events can be retired immediately. redis will go when I turn off logins, as there will be no need for sessions/cookies. beta is only used for development, so I'll kill that too. Once logins are gone, and anonymous content is disabled, there will be no need to send out emails, so mailer can be shut down.

That leaves a bunch of hosts:

  • database
    • I'll export the database and kill this host.
    • I will install mariadb on each web-node, and each host will be configured to talk to localhost only
    • I don't need to worry about four databases receiving diverging content, as updates will be disabled.
  • backup
  • planet
    • This will become orphaned, so I think I'll just move the content to the web-nodes.

All in all I think we'll just have five hosts left:

  • public to do the routing
  • web1-web4 to do the serving.

I think that's sane for the moment. I'm still pondering whether to export the code to static HTML; there's a lot of appeal as the load would drop a lot, but equally I have a hell of a lot of mod_rewrite redirections in place, and reworking all of them would be a pain. I suspect this is something that will be done in the future, maybe next year.

Last week development in Elisa

Planet KDE - Wed, 2017-09-20 16:55

This week has been focused on finishing the development of persistent notifications at the top of the music views. They are intended to tell the user what is happening and to offer actions the user can take to improve things.

The following items have been pushed:

  • Clean up of dependencies. Now all frameworks come with a description. Some have been downgraded to optional or recommended. Some have been upgraded to recommended;
  • A new version of persistent notifications.

The new persistent notifications

The player can be in four states:

  • No notifications are active.
    [Figure: No notifications]

  • One notification is active.
    [Figure: One notification]

  • More than one notification is active. Only one is shown so that vertical space is preserved.
    [Figure: Multiple notifications with the first visible]

  • The notifications area is expanded and several notifications are visible.
    [Figure: Show multiple notifications]

If the user chooses to act, the buttons are temporarily disabled to provide instant feedback, and the notification disappears when the root cause is fixed (in the example, when music tracks are discovered, the Baloo configuration is modified, or the Baloo music indexer is disabled in Elisa).

I have tried to provide smooth transitions between each of those states. Some may still be missing. Please do not hesitate to provide feedback on this feature.

I plan to add more notifications of this kind wherever the software wants to provide feedback to users and ask them to choose what they prefer.

Next week, I should continue to improve integration with Baloo. I would also like to improve (in fact allow) the keyboard interaction.

