FLOSS Project Planets

PyTennessee: PyTN Profiles: Calvin Hendryx-Parker and Juice Analytics

Planet Python - Tue, 2017-01-17 08:28



Speaker Profile: Calvin Hendryx-Parker (@calvinhp)

Six Feet Up, Inc. co-founder Calvin Hendryx-Parker has over 18 years of experience in the development and hosting of applications using Python and web frameworks including Django, Pyramid and Flask.

As Chief Technology Officer for Six Feet Up, Calvin is responsible for researching cutting-edge advances that could become part of the company’s technology road map. Calvin provides both the company and its clients with recommendations on tools and technologies, systems architecture and solutions that address specific information-sharing needs. Calvin is an advocate of open source and is a frequent speaker at Python conferences on multisite content management, integration, and web app development. Calvin is also a founder and organizer of the IndyPy meetup group and Pythology training series in Indianapolis.

Outside of work, Calvin spends time tinkering with new devices like the Fitbit, Pebble and Raspberry Pi. Calvin is an avid distance runner and ran the 2014 NYC Marathon to support the Innocence Project. Every year he and his family enjoy an extended trip to France, where his wife Gabrielle, the CEO of Six Feet Up, is from. Calvin holds a Bachelor of Science from Purdue University.

Calvin will be presenting “Open Source Deployment Automation and Orchestration with SaltStack” at 3:00PM Saturday (2/4) in Room 200. Salt is way more than a configuration management tool. It supports many other types of activities, such as remote execution and full-blown system orchestration. It can be used as a replacement for remote task tools such as Fabric or Paver.

Sponsor Profile: Juice Analytics (@juiceanalytics)

At Juice, we’re building Juicebox, a cloud platform to allow anyone to build and share stories with data. Juicebox is built on AWS, Python, Backbone.js and D3. We’re looking for a frontend dev with a love of teamwork, a passion for pixels, and a devotion to data. Love of Oxford commas also required.

Categories: FLOSS Project Planets

#3 Update on the extractors

Planet KDE - Tue, 2017-01-17 08:26

Update: Both the Flipkart and Amazon extractors (Python) work fine, except for the aforementioned issue.

The Python extractors give quite accurate results for the purpose I made them for. Parsing the emails to find the appropriate data was quite fun, but what worries me is the longevity of the semi-sketchy methods used to extract the data.

Scrapely worked beautifully, but there are still some kinks that need ironing out when parsing the information.

{
  "id": "OE1004125T3442...",
  "total": "Rs. 310",
  ...
  ...
  "name": "<div><span><b><a href="http:// .../../..">ProductName</a></b></span></div>"
}

We can see here that the value of “name” is messed up a bit.
The desired result was:

{"name" : "ProductName"}

Yeah well, for now the only way I could think of to parse this was yet another sketchy method. But rest assured, everything else works fine.
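For what it's worth, here is a minimal sketch of one less sketchy way to strip the markup out of such a value, using only the standard library. This is just an illustration against the sample output above (the raw string is a made-up stand-in), not the code actually used in the extractor:

from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only the text nodes, dropping every tag."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(value):
    parser = TextOnly()
    parser.feed(value)
    return ''.join(parser.chunks).strip()

# Hypothetical sample value in the shape shown above.
raw = '<div><span><b><a href="http://example.invalid/product">ProductName</a></b></span></div>'
print(strip_tags(raw))  # ProductName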

Categories: FLOSS Project Planets

PyTennessee: PyTN Profiles: Jared M. Smith and SimplyAgree

Planet Python - Tue, 2017-01-17 08:21


Speaker Profile: Jared M. Smith (@jaredthecoder)

I’m a Research Scientist at Oak Ridge National Laboratory, where I engage in computer security research with the Cyber Warfare Research Team. I am also pursuing my PhD in Computer Science at the University of Tennessee, Knoxville. I founded VolHacks, our university’s hackathon, and HackUTK, our university’s computer security club. I used to work at Cisco Systems as a Software Security Engineer working on pentesting engagements and security tooling.

Back at home, I helped start the Knoxville Python meetup. I also serve on the Knoxville Technology Council, volunteer at the Knoxville Entrepreneur Center, do consulting for VC-backed startups, compete in hackathons and pitch competitions, and hike in the Great Smoky Mountains.

Jared will be presenting “Big Data Analysis in Python with Apache Spark, Pandas, and Matplotlib” at 3:00PM Saturday (2/4) in Room 100. Big data processing is finally approachable for the modern Pythonista. Using Apache Spark and other data analysis tools, we can process, analyze, and visualize more data than ever before using Pythonic APIs and a language you already know, without having to learn Java, C++, or even Fortran. Come hang out and dive into the essentials of big data analysis with Python.

Sponsor Profile: SimplyAgree (@simplyagree)

SimplyAgree is an electronic signature and closing management tool for complex corporate transactions. The app is built on Python, Django and Django REST Framework.

Our growing team is based in East Nashville, TN.

Categories: FLOSS Project Planets

DataCamp: NumPy Cheat Sheet: Data Analysis in Python

Planet Python - Tue, 2017-01-17 08:17

As one of the fundamental packages for scientific computing, NumPy is a package you must know and be able to use if you want to do data science with Python. It offers a great alternative to Python lists: NumPy arrays are more compact, allow faster reading and writing of items, and are more convenient and more efficient overall.

In addition, it's (partly) the foundation of other important packages for data manipulation and machine learning that you might already know, namely Pandas, Scikit-Learn and SciPy:

  • The Pandas data manipulation library builds on NumPy, but instead of plain arrays it makes use of two other fundamental data structures: Series and DataFrames;
  • SciPy builds on NumPy to provide a large number of functions that operate on NumPy arrays; and
  • The machine learning library Scikit-Learn builds not only on NumPy, but also on SciPy and Matplotlib.

You see, this Python library is a must-know: if you know how to work with it, you'll also gain a better understanding of the other Python data science tools that you'll undoubtedly be using. 

It's a win-win situation, right?

Nevertheless, just like any other library, NumPy can come off as quite overwhelming at the start. What are the very basics that you need to know in order to get started with this data analysis library?

This cheat sheet is meant to give you a good overview of the possibilities that this library has to offer.

Go and check it out for yourself!

You'll see that this cheat sheet covers the basics of NumPy that you need to get started: it provides a brief explanation of what the Python library has to offer and what the array data structure looks like, and goes on to summarize topics such as array creation, I/O, array examination, array mathematics, copying and sorting arrays, selection of array elements and shape manipulation.

NumPy arrays are often preferred over Python lists, and you'll see that selecting elements from arrays is very similar to selecting elements from lists.
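As a quick illustration (a toy example of ours, not taken from the cheat sheet itself):

import numpy as np

numbers = [3, 1, 4, 1, 5, 9, 2, 6]
arr = np.array(numbers)            # build an array from a plain list

# Indexing and slicing mirror list behaviour...
print(arr[0], arr[-1])             # 3 6
print(arr[2:5])                    # [4 1 5]

# ...while boolean selection and vectorized math go beyond what lists offer.
print(arr[arr > 4])                # [5 9 6]
print(arr.mean(), arr * 2)         # 3.875 [ 6  2  8  2 10 18  4 12]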

Do you want to know more? Check out DataCamp's Python list tutorial. 

PS. Don't miss our other Python cheat sheets for data science, which cover Scikit-Learn, Bokeh, Pandas and the Python basics.

Categories: FLOSS Project Planets

ComputerMinds.co.uk: Show elements with form #states when values do not match

Planet Drupal - Tue, 2017-01-17 08:00

I've previously written about dynamic forms in Drupal, using #states to show or hide input boxes depending on other inputs. Since then, Drupal 7 and 8 have both gained the ability to combine conditions with OR and XOR operators. This makes it possible to apply changes when a form element's value does not equal something, which is not obvious at all.

Categories: FLOSS Project Planets

Web Omelette: Node access grants in Drupal 8 in an OOP way

Planet Drupal - Tue, 2017-01-17 04:04

The Drupal node access grants system has always been a powerful and flexible way to control access to your nodes. It's been there since Drupal 5 (if not earlier) and it continues to exist in Drupal 8 as we move forward. In this article, I want to quickly highlight this system from a D8 perspective and how I propose to use it in an OOP architecture.

What is it?

The node access grant system is a way to control access to all four operations on your Drupal nodes (view, create, edit, delete) programmatically and very granularly. It lets you define certain realms of functionality (related to your access requirements) and a set of grants that are required for any of the four mentioned operations within that realm. Users then need to possess the grants in the respective realms in order to be granted access.

The two main components of this system are therefore:

  • The implementation of hook_node_access_records(), which is called whenever a node is saved (or site-wide permissions are rebuilt). It is responsible for storing the access requirements for that given node.
  • The implementation of hook_node_grants(), which is called whenever a user is trying to access a node (or a query is being performed on behalf of that user). It is responsible for presenting the grants of the current user which, if they match the access requirements of the node, allow them access.

The great thing about node access grants is that they are system-wide in the sense of who checks for access. In contrast to hook_node_access(), which is only called when viewing a node at its canonical URL, the access grants are checked almost everywhere, such as in Views or even custom queries, with much ease.

Drupal 8

In Drupal 8 these two hooks remain the foundation of the node access grants system, albeit with type-hinted parameters. This means that we need to place their implementations inside our .module files.

Node access grants are not used on every site because they serve relatively complex access rules. Complex access rules usually also require a fair bit of calculating what grants a particular node must have for a given realm, as well as whether a given user possesses them. For this very reason I am not so fond of having to put all this logic in my .module file.

So I came up with a basic developer module that defines an interface with two methods: accessRecords() and grants(). Other modules that want to implement the access grants hooks can now instead create a service which implements this interface and tag it with node_access_grants. My module will do the rest and you won't have to touch any .module file. You can inject whatever dependencies you need from the container and perform whatever logic is needed for determining your grants and access records.

Let me know what you think down in the comments. I would love your feedback.

Categories: FLOSS Project Planets

Third & Grove: Quicken.com Drupal Case Study

Planet Drupal - Tue, 2017-01-17 03:00
Categories: FLOSS Project Planets

S. Lott: Irrelevant Feature Comparison

Planet Python - Tue, 2017-01-17 03:00
A Real Email.
So, please consider creating a blog post w/ a title something like "Solving the Fred Flintstone Problem using Monads in Python and Haskell"

First. There's this: https://pypi.python.org/pypi/PyMonad/ and this: http://www.valuedlessons.com/2008/01/monads-in-python-with-nice-syntax.html. Also, see https://en.wikipedia.org/wiki/Type_class. I think this has been covered nicely.

I can't improve on what's been presented.

Second. I don't see any problems that are solved well by monads in Python. In a lazy, optimized, functional language, monads can be used to bind operations into ordered sequences. This is why file parsing and file writing examples of monads abound. They can also be used to bind a number of types so that operator overloading in the presence of strict type checking can be implemented. None of this seems helpful in Python.
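Just to illustrate the pattern under discussion, here is a minimal, hand-rolled sketch of a Maybe-style bind in plain Python. The names Nothing and bind are made up for this sketch (not PyMonad's API), and it's an illustration of the idea, not an endorsement:

import json

Nothing = object()  # sentinel standing in for "no result"

def bind(value, func):
    """Apply func unless value is already Nothing; exceptions short-circuit to Nothing."""
    if value is Nothing:
        return Nothing
    try:
        return func(value)
    except Exception:
        return Nothing

# Bind three steps into an ordered sequence; any failure falls through silently.
ok = bind(bind(bind('{"total": "42"}', json.loads), lambda d: d["total"]), int)
bad = bind(bind(bind('not json', json.loads), lambda d: d["total"]), int)
print(ok)              # 42
print(bad is Nothing)  # True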

Perhaps monads will be helpful with Python type hints. I'll wait and see if a monad definition shows up in the typing module. There, it may be a useful tool for handling dynamic type bindings.

Third. This request is perilously close to a "head-to-head" comparison between languages. The question says "problem", but it is similar to asking to see the exact same algorithm implemented in two different languages. It makes as much sense as comparing Python's built-in complex type with Java's built-in complex type (which Java doesn't have.)

Here's the issue. I replace Fred Flintstone with "Parse JSON Notation".  This is a cool application of monads to recognize the various sub-classes of JSON syntax and emit the correctly-structured document.  See http://fssnip.net/bq/title/JSON-parsing-with-monads.  In Python, this is import json. This isn't informative about the language. If we look at the Python code, we see some operations that might be considered as eligible for a rewrite using monads. But Python isn't compiled and doesn't have the same type-checking issues. The point is that Python has alternatives to monads.
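To make that concrete, a tiny sketch of our own (not code from the linked snippet):

import json

# The standard library already recognizes every JSON sub-structure and hands back
# correctly-structured Python objects.
doc = json.loads('{"name": "Fred", "scores": [1, 2.5, null], "nested": {"ok": true}}')
print(doc["nested"]["ok"])  # True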

Fourth. It's just asking about a feature that isn't required in the language. In the spirit of showing not-required-in-Python features, I'll show the not-required-in-Python GOTO.
Here it is:
def goto(destination):
    global next
    next = destination

def min_none(sequence):
    try:
        return min(sequence)
    except ValueError:
        return None

def execute(program, debug=False, stmt=None):
    global next, context
    if stmt is None:
        stmt = min(program.keys())
    context = {'goto': goto}
    while stmt is not None:
        next = min_none(list(filter(lambda x: x > stmt, program.keys())))
        if debug:
            print(">>>", program[stmt])
        exec(program[stmt], globals(), context)
        stmt = next

example = {
    100: "a = 10",
    200: "if a == 0: goto(500)",
    250: "print(a)",
    300: "a = a - 1",
    400: "goto(200)",
    500: "print('done')",
}

execute(example)

This shows how we can concoct an additional feature that isn't really needed in Python.

Given this, we can now compare the GOTO between Python, BASIC, and Haskell. Or maybe we can look at Monads in BASIC vs. Haskell. 
Categories: FLOSS Project Planets

Python Insider: Python 3.5.3 and 3.4.6 are now available

Planet Python - Mon, 2017-01-16 22:40
Python 3.5.3 and Python 3.4.6 are now available for download.

You can download Python 3.5.3 here, and you can download Python 3.4.6 here.
Categories: FLOSS Project Planets

Wingware Blog: Remote Development with Wing Pro 6

Planet Python - Mon, 2017-01-16 20:00
Wing Pro 6 introduces remote development that is easy to configure and use: the IDE can edit, test, debug, search, and manage files as if they were stored on the same machine as the IDE.
Categories: FLOSS Project Planets

Drupal Modules: The One Percent: Drupal Modules: The One Percent — Views Flipped Table (video tutorial)

Planet Drupal - Mon, 2017-01-16 19:59
Episode 13

Here is where we bring awareness to Drupal modules running on less than 1% of reporting sites. Today we'll consider Views Flipped Table, a module which will rotate your Views tables 90° and ask if using HTML tables is ever appropriate.

Categories: FLOSS Project Planets

Ned Batchelder: Coverage.py 4.3.2 and 4.3.3, and 4.3.4

Planet Python - Mon, 2017-01-16 19:37

A handful of fixes for Coverage.py today: v4.3.2. Having active contributors certainly makes it easier to move code more quickly.

...and then it turns out, 4.3.2 wouldn't run on Python 2.6. So quick like a bunny, here comes Coverage.py version 4.3.3.

...and then that fix broke other situations on all sorts of Python versions, so Coverage.py version 4.3.4.

Categories: FLOSS Project Planets


Matthew Rocklin: Distributed NumPy on a Cluster with Dask Arrays

Planet Python - Mon, 2017-01-16 19:00

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

This page includes large embedded profiles. It may look better on the original site (rather than through syndicated pages like planet.python), and it may take a while to load on non-broadband connections (total size is around 20MB).

Summary

We analyze a stack of images in parallel with NumPy arrays distributed across a cluster of machines on Amazon’s EC2 with Dask array. This is a model application shared among many image analysis groups ranging from satellite imagery to bio-medical applications. We go through a series of common operations:

  1. Inspect a sample of images locally with Scikit Image
  2. Construct a distributed Dask.array around all of our images
  3. Process and re-center images with Numba
  4. Transpose data to get a time-series for every pixel, compute FFTs

This last step is quite fun. Even if you skim through the rest of this article I recommend checking out the last section.

Inspect Dataset

I asked a colleague at the US National Institutes of Health (NIH) for a biggish imaging dataset. He came back with the following message:

Electron microscopy may be generating the biggest ndarray datasets in the field - terabytes regularly. Neuroscience needs EM to see connections between neurons, because the critical features of neural synapses (connections) are below the diffraction limit of light microscopes. This type of research has been called “connectomics”. Many groups are looking at machine vision approaches to follow small neuron parts from one slice to the next.

This data is from drosophila: http://emdata.janelia.org/. Here is an example 2d slice of the data http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_5000.

import skimage.io
import matplotlib.pyplot as plt

sample = skimage.io.imread('http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_5000')
skimage.io.imshow(sample)

The last number in the URL is an index into a large stack of about 10000 images. We can change that number to get different slices through our 3D dataset.

samples = [skimage.io.imread('http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % i)
           for i in [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]]

fig, axarr = plt.subplots(1, 9, sharex=True, sharey=True, figsize=(24, 2.5))
for i, sample in enumerate(samples):
    axarr[i].imshow(sample, cmap='gray')

We see that our field of interest wanders across the frame over time and drops off in the beginning and at the end.

Create a Distributed Array

Even though our data is spread across many files, we still want to think of it as a single logical 3D array. We know how to get any particular 2D slice of that array using Scikit-image. Now we’re going to use Dask.array to stitch all of those Scikit-image calls into a single distributed array.

import dask.array as da
from dask import delayed

imread = delayed(skimage.io.imread, pure=True)  # Lazy version of imread

urls = ['http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % i
        for i in range(10000)]                  # A list of our URLs

lazy_values = [imread(url) for url in urls]     # Lazily evaluate imread on each url

arrays = [da.from_delayed(lazy_value,           # Construct a small Dask array
                          dtype=sample.dtype,   # for every lazy value
                          shape=sample.shape)
          for lazy_value in lazy_values]

stack = da.stack(arrays, axis=0)                # Stack all small Dask arrays into one

>>> stack
dask.array<shape=(10000, 2000, 2000), dtype=uint8, chunksize=(1, 2000, 2000)>

>>> stack = stack.rechunk((20, 2000, 2000))     # combine chunks to reduce overhead
>>> stack
dask.array<shape=(10000, 2000, 2000), dtype=uint8, chunksize=(20, 2000, 2000)>

So here we’ve constructed a lazy Dask.array from 10 000 delayed calls to skimage.io.imread. We haven’t done any actual work yet, we’ve just constructed a parallel array that knows how to get any particular slice of data by downloading the right image if necessary. This gives us a full NumPy-like abstraction on top of all of these remote images. For example we can now download a particular image just by slicing our Dask array.

>>> stack[5000, :, :].compute()
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

>>> stack[5000, :, :].mean().compute()
11.49902425

However we probably don’t want to operate too much further without connecting to a cluster. That way we can just download all of the images once into distributed RAM and start doing some real computations. I happen to have ten m4.2xlarges on Amazon’s EC2 (8 cores, 30GB RAM each) running Dask workers. So we’ll connect to those.

from dask.distributed import Client, progress

client = Client('scheduler-address:8786')

>>> client
<Client: scheduler="scheduler-address:8786" processes=10 cores=80>

I’ve replaced the actual address of my scheduler (something like 54.183.180.153) with scheduler-address. Let’s go ahead and bring in all of our images, persisting the array into concrete data in memory.

stack = client.persist(stack)

This starts downloads of our 10 000 images across our 10 workers. When this completes we have 10 000 NumPy arrays spread around on our cluster, coordinated by our single logical Dask array. This takes a while, about five minutes. We’re mostly network bound here (Janelia’s servers are not co-located with our compute nodes). Here is a parallel profile of the computation as an interactive Bokeh plot.

There will be a few of these profile plots throughout the blogpost, so you might want to familiarize yourself with them now. Every horizontal rectangle in this plot corresponds to a single Python function running somewhere in our cluster over time. Because we called skimage.io.imread 10 000 times there are 10 000 purple rectangles. Their position along the y-axis denotes which of the 80 cores in our cluster they ran on, and their position along the x-axis denotes their start and stop times. You can hover over each rectangle (function) for more information on what kind of task it was, how long it took, etc. In the image below, purple rectangles are skimage.io.imread calls and red rectangles are data transfer between workers in our cluster. Click the magnifying glass icons in the upper right of the image to enable zooming tools.

Now that we have persisted our Dask array in memory our data is based on hundreds of concrete in-memory NumPy arrays across the cluster, rather than based on hundreds of lazy scikit-image calls. Now we can do all sorts of fun distributed array computations more quickly.

For example we can easily see our field of interest move across the frame by averaging across time:

skimage.io.imshow(stack.mean(axis=0).compute())

Or we can see when the field of interest is actually present within the frame by averaging across x and y:

plt.plot(stack.mean(axis=[1, 2]).compute())

By looking at the profile plots for each case we can see that averaging over time involves much more inter-node communication, which can be quite expensive in this case.

Recenter Images with Numba

In order to remove the spatial offset across time we’re going to compute a centroid for each slice and then crop the image around that center. I looked up centroids in the Scikit-Image docs and came across a function that did way more than what I was looking for, so I just quickly coded up a solution in Pure Python and then JIT-ed it with Numba (which makes this run at C-speeds).

from numba import jit

@jit(nogil=True)
def centroid(im):
    n, m = im.shape
    total_x = 0
    total_y = 0
    total = 0
    for i in range(n):
        for j in range(m):
            total += im[i, j]
            total_x += i * im[i, j]
            total_y += j * im[i, j]
    if total > 0:
        total_x /= total
        total_y /= total
    return total_x, total_y

>>> centroid(sample)  # this takes around 9ms
(748.7325324581344, 802.4893005160851)

def recenter(im):
    x, y = centroid(im.squeeze())
    x, y = int(x), int(y)
    if x < 500:
        x = 500
    if y < 500:
        y = 500
    if x > 1500:
        x = 1500
    if y > 1500:
        y = 1500
    return im[..., x-500:x+500, y-500:y+500]

plt.figure(figsize=(8, 8))
skimage.io.imshow(recenter(sample))

Now we map this function across our distributed array.

import numpy as np

def recenter_block(block):
    """ Recenter a short stack of images """
    return np.stack([recenter(block[i]) for i in range(block.shape[0])])

recentered = stack.map_blocks(recenter_block,
                              chunks=(20, 1000, 1000),  # chunk size changes
                              dtype=stack.dtype)

recentered = client.persist(recentered)

This profile provides a good opportunity to talk about a scheduling failure; things went a bit wrong here. Towards the beginning we quickly recenter several images (Numba is fast), taking around 300-400ms for each block of twenty images. However as some workers finish all of their allotted tasks, the scheduler erroneously starts to load balance, moving images from busy workers to idle workers. Unfortunately the network at this time appeared to be much slower than expected and so the move + compute elsewhere strategy ended up being much slower than just letting the busy workers finish their work. The scheduler keeps track of expected compute times and transfer times precisely to avoid mistakes like this one. These sorts of issues are rare, but do occur on occasion.

We check our work by averaging our re-centered images across time and displaying that to the screen. We see that our images are better centered with each other as expected.

skimage.io.imshow(recentered.mean(axis=0))

This shows how easy it is to create fast in-memory code with Numba and then scale it out with Dask.array. The two projects complement each other nicely, giving us near-optimal performance with intuitive code across a cluster.

Rechunk to Time Series by Pixel

We’re now going to rearrange our data from being partitioned by time slice, to being partitioned by pixel. This will allow us to run computations like Fast Fourier Transforms (FFTs) on each time series efficiently. Switching the chunk pattern back and forth like this is generally a very difficult operation for distributed arrays because every slice of the array contributes to every time-series. We have N-squared communication.

This analysis may not be appropriate for this data (we won’t learn any useful science from doing this), but it represents a very frequently asked question, so I wanted to include it.

Currently our Dask array has a chunk shape of (20, 1000, 1000), meaning that our data is collected into 500 NumPy arrays across the cluster, each of size (20, 1000, 1000).

>>> recentered
dask.array<shape=(10000, 1000, 1000), dtype=uint8, chunksize=(20, 1000, 1000)>

But we want to change this shape so that the chunks cover the entire first axis. We want all data for any particular pixel to be in the same NumPy array, not spread across hundreds of different NumPy arrays. We could solve this by rechunking so that each pixel is its own block like the following:

>>> rechunked = recentered.rechunk((10000, 1, 1))

However this would result in one million chunks (there are one million pixels), which would create a bit of scheduling overhead. Instead we’ll collect our time-series into 10 x 10 groups of one hundred pixels. This will help us to reduce overhead.

>>> # rechunked = recentered.rechunk((10000, 1, 1))  # Too many chunks
>>> rechunked = recentered.rechunk((10000, 10, 10))   # Use larger chunks

Now we compute the FFT of each pixel, take the absolute value and square to get the power spectrum. Finally to conserve space we’ll down-grade the dtype to float32 (our original data is only 8-bit anyway).

x = da.fft.fft(rechunked, axis=0)
power = abs(x ** 2).astype('float32')

power = client.persist(power, optimize_graph=False)

This is a fun profile to inspect; it includes both the rechunking and the subsequent FFTs. We’ve included a real-time trace during execution, the full profile, as well as some diagnostics plots from a single worker. These plots total up to around 20MB. I sincerely apologize to those without broadband access.

Here is a real time plot of the computation finishing over time:

And here is a single interactive plot of the entire computation after it completes. Zoom with the tools in the upper right. Hover over rectangles to get more information. Remember that red is communication.

Screenshots of the diagnostic dashboard of a single worker during this computation.

This computation starts with a lot of communication while we rechunk and realign our data (recent optimizations here by Antoine Pitrou in dask #417). Then we transition into doing thousands of small FFTs and other arithmetic operations. All of the plots above show a nice transition from heavy communication to heavy processing with some overlap each way (once some complex blocks are available we get to start overlapping communication and computation). Inter-worker communication was around 100-300 MB/s (typical for Amazon’s EC2) and CPU load remained high. We’re using our hardware.

Finally we can inspect the results. We see that the power spectrum is very boring in the corner, and has typical activity towards the center of the image.

plt.semilogy(1 + power[:, 0, 0].compute())

plt.semilogy(1 + power[:, 500, 500].compute())

Final Thoughts

This blogpost showed a non-trivial image processing workflow, emphasizing the following points:

  1. Construct a Dask array from lazy SKImage calls.
  2. Use NumPy syntax with Dask.array to aggregate distributed data across a cluster.
  3. Build a centroid function with Numba. Use Numba and Dask together to clean up an image stack.
  4. Rechunk to facilitate time-series operations. Perform FFTs.

Hopefully this example has components that look similar to what you want to do with your data on your hardware. We would love to see more applications like this out there in the wild.

What we could have done better

As always with all computationally focused blogposts we’ll include a section on what went wrong and what we could have done better with more time.

  1. Communication is too expensive: Interworker communications that should be taking 200ms are taking up to 10 or 20 seconds. We need to take a closer look at our communications pipeline (which normally performs just fine on other computations) to see if something is acting up. Discussion here: dask/distributed #776, and early work here: dask/distributed #810.
  2. Faulty Load balancing: We discovered a case where our load-balancing heuristics misbehaved, incorrectly moving data between workers when it would have been better to let everything alone. This is likely due to the oddly low bandwidth issues observed above.
  3. Loading from disk blocks network I/O: While doing this we discovered an issue where loading large amounts of data from disk can block workers from responding to network requests (dask/distributed #774)
  4. Larger datasets: It would be fun to try this on a much larger dataset to see how the solutions here scale.
Categories: FLOSS Project Planets

Jeff Geerling's Blog: Re-save all nodes of a particular type in an update hook in Drupal 8

Planet Drupal - Mon, 2017-01-16 17:33

I recently needed to re-save all the nodes of a particular content type (after I had added some fields and default configuration) as part of a Drupal 8 site update and deployment. I could go in after deploying the new code and configuration, and manually re-save all content using the built-in bulk operation available on the /admin/content page, but that would not be ideal, because there would be a period of time where the content isn't updated on the live site—plus, manual processes are fragile and prone to failure, so I avoid them at all costs.

In my Drupal 8 module, called custom, I added the following update hook, inside custom.install:

Categories: FLOSS Project Planets

Joachim's blog: Changing the type of a node

Planet Drupal - Mon, 2017-01-16 17:22

There’s an old saying that no information architecture survives contact with the user. Or something like that. You’ll carefully design and build your content types and taxonomies, and then find that the users are actually not quite using what you’ve built in quite the way it was intended when you were building it.

And so there comes a point where you need to grit your teeth, change the structure of the site’s content, and convert existing content.

Back on Drupal 7, I wrote a plugin for Migrate which handled migrations within a single Drupal site, so for example from nodes to a custom entity type, or from one node type to another. (The patch works, though I never found the time to polish it sufficiently to be committed.)

On Drupal 8, without the time to learn the new version of Migrate, I recently had to cobble something together quickly.

Fortunately, this was just changing the type of some nodes, and where all the fields were identical on both source and destination node types. Anything more complex would definitely require Migrate.

First, I created the new node type, and cloned all its fields from the old type to the new type. Here I took the time to update some of the Field Tools module’s functionality to Drupal 8, as it pays off to have a single form to clone fields rather than have to add them to the new node type one by one.

Field Tools also copies display settings where form and view modes match (in other words, if the source bundle has a ‘teaser’ display mode configured, and the destination also has a ‘teaser’ display mode that’s enabled for custom settings, then all of the settings for the fields being cloned are copied over, with field groups too).

With all the new configuration in place, it was now time to get down to the content. This was plain and simple a hack, but one that worked fine for the case in question. Here’s how it went…

We basically want to change the bundle of a bunch of nodes. (Remember, the ‘bundle’ is the generic name for a node type. Node types are bundles, as taxonomy vocabularies are bundles.) The data for a single node is spread over lots of tables, and most of these have the bundle in them.

On Drupal 8 these tables are:

  • the entity base table
  • the entity data table
  • the entity revision data table
  • each field data table
  • each field data revision table

(It’s not entirely clear to me what the separation between base table and data table is for. It looks like it might be that base table is fields that don’t change for revisions, and data table is for fields that do. But then the language is on the base table, and that can be changed, and the created timestamp is on the data table, and while you can change that, I wouldn’t have thought that’s something that has past values kept. Answers on a postcard.)

So we’re basically going to hack the bundle column in a bunch of tables. We start by getting the names of these tables from the entity type storage:

$storage = \Drupal::service('entity_type.manager')->getStorage('node');

// Get the names of the base tables.
$base_table_names = [];
$base_table_names[] = $storage->getBaseTable();
$base_table_names[] = $storage->getDataTable();
// (Note that revision base tables don't have the bundle.)

For field tables, we need to ask the table mapping handler:

$table_mapping = \Drupal::service('entity_type.manager')->getStorage('node')
  ->getTableMapping();

// Get the names of the field tables for fields on the service node type.
$field_table_names = [];
foreach ($source_bundle_fields as $field) {
  $field_table = $table_mapping->getFieldTableName($field->getName());
  $field_table_names[] = $field_table;

  $field_storage_definition = $field->getFieldStorageDefinition();
  $field_revision_table = $table_mapping
    ->getDedicatedRevisionTableName($field_storage_definition);

  // Field revision tables DO have the bundle!
  $field_table_names[] = $field_revision_table;
}

(Note the inconsistency in which tables have a bundle field and which don’t! For that matter, surely it’s redundant in all field tables? Does it improve the indexing perhaps?)

Then, get the IDs of the nodes to update. Fortunately, in this case there were only a few, and it wasn’t necessary to write a batched hook_update_N().

// Get the node IDs to update.
$query = \Drupal::service('entity.query')->get('node');
// Your conditions here!
// In our case, page nodes with a certain field populated.
$query->condition('type', 'page');
$query->exists('field_in_question');
$service_nids = $query->execute();

And now, loop over the lists of tables names and hack away!

// Base tables have 'nid' and 'type' columns.
foreach ($base_table_names as $table_name) {
  $query = \Drupal\Core\Database\Database::getConnection('default')
    ->update($table_name)
    ->fields(['type' => 'service'])
    ->condition('nid', $service_nids, 'IN')
    ->execute();
}

// Field tables have 'entity_id' and 'bundle' columns.
foreach ($field_table_names as $table_name) {
  $query = \Drupal\Core\Database\Database::getConnection('default')
    ->update($table_name)
    ->fields(['bundle' => 'service'])
    ->condition('entity_id', $service_nids, 'IN')
    ->execute();
}

Node-specific tables use ‘nid’ and ‘type’ as their column names, because those are the base field names declared in the entity type class, whereas Field API tables use the generic ‘entity_id’ and ‘bundle’. The mapping between the two is declared in the entity type annotation’s entity_keys property.

This worked perfectly. The update system takes care of clearing caches, so entity caches will be fine. Other systems may need a nudge; for instance, Search API won’t notice the changed nodes and its indexes will literally need turning off and on again.

Though I do hope that the next time I have to do something like this, the amount of data justifies getting stuck into using Migrate!

Categories: FLOSS Project Planets

KBibTeX 0.6.1-rc2 released

Planet KDE - Mon, 2017-01-16 17:04

After quite some delay, I finally assembled a second release candidate for KBibTeX 0.6.1. Version 0.6.1 will be the last release in the 0.6.x series.

The following changes were applied since the release of 0.6:

Read more to learn which changes were applied.

Categories: FLOSS Project Planets

Dries Buytaert: Acquia retrospective 2016

Planet Drupal - Mon, 2017-01-16 13:30

As my loyal blog readers know, at the beginning of every year I publish a retrospective to look back and take stock of how far Acquia has come over the past 12 months. If you'd like to read my previous annual retrospectives, they can be found here: 2015, 2014, 2013, 2012, 2011, 2010, 2009. When read together, they provide a comprehensive overview of Acquia's trajectory from its inception in 2008 to where it is today, nine years later.

The process of pulling together this annual retrospective is very rewarding for me as it gives me a chance to reflect with some perspective; a rare opportunity among the hustle and bustle of the day-to-day. Trends and cycles only reveal themselves over time, and I continue to learn from this annual period of reflection.

Crossing the chasm

If I were to give Acquia a headline for 2016, it would be the year in which we crossed the proverbial "chasm" from startup to a true leader in our market. Acquia is now entering its ninth full year of operations (we began commercial operations in the fall of 2008). We've raised $186 million in venture capital, opened offices around the world, and now employ over 750 people. However, crossing the "chasm" is more than achieving a revenue target or other benchmarks of size.

The "chasm" describes the difficult transition conceived by Geoffrey Moore in his 1991 classic of technology strategy, Crossing the Chasm. This is the book that talks about making the transition from selling to the early adopters of a product (the technology enthusiasts and visionaries) to the early majority (the pragmatists). If the early majority accepts the technology solutions and products, they can make a company a de facto standard for its category.

I think future retrospectives will endorse my opinion that Acquia crossed the chasm in 2016. I believe that Acquia has crossed the "chasm" because the world has embraced open source and the cloud without any reservations. The FUD-era where proprietary software giants campaigned aggressively against open source and cloud computing by sowing fear, uncertainty and doubt is over. Ironically, those same critics are now scrambling to paint themselves as committed to open source and cloud architectures. Today, I believe that Acquia sets the standard for digital experiences built with open source and delivered in the cloud.

When Tom (my business partner and Acquia CEO) and I spoke together at Acquia's annual customer conference in November, we talked about the two founding pillars that have served Acquia well over its history: open source and cloud. In 2008, we made a commitment to build a company based on open source and the cloud, with its products and services offered through a subscription model rather than a perpetual license. At the time, our industry was skeptical of this forward-thinking combination. It was a bold move, but we have always believed that this combination offers significant advantages over proprietary software because of its faster rate of innovation, higher quality, freedom from vendor lock-in, greater security, and lower total cost of ownership.

Creating digital winners

Acquia has continued its evolution from a content management company to a company that offers a more complete digital experience platform. This transition inspired an internal project to update our vision and mission accordingly.

In 2016, we updated Acquia's vision to "make it possible for dreamers and doers to craft the digital world". To achieve this vision, we want to build "the universal platform for the world's greatest digital experiences".

We increasingly find ourselves at the center of our customer's technology and digital strategies, and they depend on us to provide the open platform to integrate, syndicate, govern and distribute all of their digital business.

The focus on any and every part of their digital business is important and sets us apart from our competitors. Nearly all of our competitors offer single-point solutions for marketers, customer service, online commerce or for portals. An open source model allows customers to integrate systems together through open APIs, which enables our technology to fit into any part of their existing environment. It gives them the freedom to pursue a best-of-breed strategy outside of the confines of a proprietary "marketing cloud".

Business momentum

We continued to grow rapidly in 2016, and it was another record year for revenue at Acquia. We focused on the growth of our recurring revenue, which includes new customers and the renewal and expansion of our work with existing customers. Ever since we started the company, our corporate emphasis on customer success has fueled both components. Successful customers mean renewals and references for new customers. Customer satisfaction remains extremely high at 96 percent, an achievement I'm confident we can maintain as we continue to grow.

In 2016, the top industry analysts published very positive reviews based on their dealings with our customers. I'm proud that Acquia made the biggest positive move of all vendors in this year's Gartner Magic Quadrant for Web Content Management. There are now three distinct leaders: Acquia, Adobe and Sitecore. Out of the leaders, Acquia is the only player that is open-source and has a cloud-first strategy.

Over the course of 2016 Acquia welcomed an impressive roster of new customers who included Nasdaq, Nestle, Vodafone, iHeartMedia, Advanced Auto Parts, Athenahealth, National Grid UK and more. Exiting 2016, Acquia can count 16 of the Fortune 100 among its customers.

Digital transformation is happening everywhere. Only a few years ago, the majority of our customers were in either government, media and entertainment or higher education. In the past two years, we've seen a lot of growth in other verticals and today, our customers span nearly every industry from pharmaceuticals to finance.

To support our growth, we opened a new sales office in Munich (Germany), and we expanded our global support facilities in Brisbane (Queensland, Australia), Portland (Oregon, USA) and Delhi (India). In total, we now have 14 offices around the world. Over the past year we have also seen our remote workforce expand; 33 percent of Acquia's employees are now remote. They can be found in 225 cities worldwide.

Acquia's offices around the world. The world got more flat for Acquia in 2016.

We've also seen an evolution in our partner ecosystem. In addition to working with traditional Drupal businesses, we started partnering with the world's most elite digital agencies and system integrators to deliver massive projects that span dozens of languages and countries. Our partners are taking Acquia and Drupal into some of the world's most impressive brands, new industries and into new parts of the world.

Growing pains and challenges

I enjoy writing these retrospectives because they allow me to chronicle Acquia's incredible journey. But I also write them for you, because you might be able to learn a thing or two from my experiences. To make these retrospectives useful for everyone, I try to document both milestones and difficulties. To grow an organization, you must learn how to overcome your challenges and growing pains.

Rapid growth does not come without cost. In 2016 we made several leadership changes that will help us continue to grow. We added new heads of revenue, European sales, security, IT, talent acquisition and engineering. I'm really proud of the team we built. We exited 2016 in the market for new heads of finance and marketing.

Part of the Acquia leadership team at The Lobster Pool restaurant in Rockport, MA.

We adjusted our business levers to adapt to changes in the financial markets, which in early 2016 shifted from valuing companies almost solely focused on growth to a combination of growth and free cash flow. This is easier said than done, and required a significant organizational mindshift. We changed our operating plan, took a closer look at expanding headcount, and postponed certain investments we had planned. All this was done in the name of "fiscal fitness" to make sure that we don't have to raise more money down the road. Our efforts to cut our burn rate are paying off, and we were able to beat our targets on margin (the difference between our revenue and operating expenses) while continuing to grow our top line.

We now manage 17,000+ AWS instances within Acquia Cloud. What we once were able to do efficiently for hundreds of clients is not necessarily the best way to do it for thousands. Going into 2016, we decided to improve the efficiency of our operations at this scale. While more work remains to be done, our efforts are already paying off. For example, we can now roll out new Acquia Cloud releases about 10 times faster than we could at the end of 2015.

Lastly, 2016 was the first full year of Drupal 8 availability (it was formally released in November 2015). As expected, it took time for developers and the Drupal community to become familiar with its vast array of changes and new capabilities. This wasn't a surprise; in my DrupalCon keynotes I shared that I expected Drupal 8 to really take off in Q4 of 2016. Through the MAP program we committed over $1M in funds and engineering hours to help module creators upgrade their modules to Drupal 8. All told, Acquia invested about $2.5 million in Drupal code contributions in 2016 alone (excluding our contributions in marketing, events, etc). This is the most we have ever invested in Drupal and something I'm personally very proud of.

Product milestones

The components and products that make up the Acquia Platform.

Acquia remains an amazing place for engineers who want to build great products. We achieved some big milestones over the course of the year.

One of the largest milestones was the significant enhancements to our multi-site platform: Acquia Cloud Site Factory. Site Factory allows a team to manage and operate thousands of sites around the world from a single console, ensuring all fixes, upgrades and improvements are delivered responsibly and efficiently. Last year we added support for multiple codebases in Site Factory – which we call Stacks – allowing an organization to manage multiple Site Factories from the same administrative console and distribute the operation around the world over multiple data centers. It's unique in its ability and is being deployed globally by many multinational, multi-brand consumer goods companies. We manage thousands of sites for our biggest customers. Site Factory has elevated Acquia into the realm of very large and ambitious digital experience delivery.

Another exciting product release was the third version of Acquia Lift, our personalization and contextualization tool. With the third version of Acquia Lift, we've taken everything we've learned about personalization over the past several years to build a tool that is more flexible and easier to use. The new Lift also provides content syndication services that allow both content and user profile data to be reused across sites. When taken together with Site Factory, Lift permits true content governance and reuse.

We also released Lightning, Acquia's Drupal 8 distribution aimed at developers who want to accelerate their projects based on the set of tested and vetted modules and configurations we use ourselves in our customer work. Acquia's commitment to improving the developer experience also led to the release of both Acquia BLT and Acquia Pipelines (private beta). Acquia BLT is a development tool for building new Drupal projects using a standard approach, while Pipelines is a continuous delivery and continuous deployment service that can be used to develop, test and deploy websites on Acquia Cloud.

Acquia has also set a precedent of contributing significantly to Drupal. We helped with the release management of Drupal 8.1 and Drupal 8.2, and with the community's adoption of a new innovation model that allows for faster innovation. We also invested a lot in Drupal 8's "API-first initiative," whose goal is to improve Drupal's web services capabilities. As part of those efforts, we introduced Waterwheel, a group of SDKs which make it easier to build JavaScript and native mobile applications on top of Drupal 8's REST-backend. We have also been driving usability improvements in Drupal 8 by prototyping a new UX paradigm called "outside in" and by contributing to the media and layout initiatives. In 2017, I believe we should maintain our focus on release management, API-first and usability.

Our core product, Acquia Cloud, received a major reworking of its user interface. That new UI is a more modern, faster and responsive user interface that simplifies interaction for developers and administrators.

The new Acquia Cloud user interface released in 2016.

Our focus on security reached new levels in 2016. In January we secured certification that we complied with ISO 27001: the international security and compliance standard for enterprise cloud frameworks. In April we were awarded our FedRAMP ATO from the U.S. Department of Treasury after we were judged compliant with the U.S. federal standards for cloud security and risk management practices. Today we have the most secure, reliable and agile cloud platform available.

We ended the year with an exciting partnership with commerce platform Magento that will help us advance our vision of content and commerce. Existing commerce platforms have focused primarily on the transactions (cart systems, payment processing, warehouse/supply chain integration, tax compliance, customer credentials, etc.) and neglected the customer's actual shopping experience. We've demonstrated with numerous customers that a better brand experience can be delivered with Drupal and Acquia Lift alongside these existing commerce platforms.

The wind in our sales (pun intended)

Entering 2017, I believe that Acquia is positioned for long-term success. Here are a few reasons why:

  • The current market for content, commerce, and community-focused digital experiences is growing rapidly at just under 20 percent per year.
  • We hold a leadership position in our market, despite our relative market share being small. The analysts gave Acquia top marks for our strategic roadmap, vision and execution.
  • Digitization is top-of-mind for all organizations and impacts all elements of their business and value chain. Digital first businesses are seeking platforms that not only work for marketing, but also for service, compliance, portals, commerce and more.
  • Open source combined with the cloud continue to grow at a furious pace. The continuing rise of the developer's influence on technology selection also works in our favor.
  • Drupal 8 is the most significant advance in the evolution of Drupal, and Drupal's new innovation model allows the Drupal community to innovate faster than ever before.
  • Recent advances in machine learning, the Internet of Things, augmented reality, speech technology, and conversational interfaces are all coming to fruition and will lead to new customer experiences and business models, reinforcing the need for API-first solutions and the levels of freedom that only open source and cloud computing offer.

As I explained at the beginning of this retrospective, trends and cycles reveal themselves over time. After reflecting on 2016, I believe that Acquia is in a unique position. As the world has embraced open source and cloud without reservation, our long-term commitment to this disruptive combination has put us at the right place at the right time. Our investments in expanding the breadth of our platform with products like Acquia Lift and Site Factory are also starting to pay off.

However, Acquia's success is not only determined by the technology we back. Our unique innovation model, which is impossible to cultivate with proprietary software, combined with our commitment to customer success has also contributed to our "crossing of the chasm."

Of course, none of these 2016 results and milestones would be possible without the hard work of the Acquia team, our customers, partners, the Drupal community, and our many friends. Thank you for your support in 2016 – I can't wait to see what the next year will bring!

Categories: FLOSS Project Planets

A. Jesse Jiryu Davis: Python Async Coroutines: A Video Walkthrough

Planet Python - Mon, 2017-01-16 13:19

If your New Year’s resolution is to become an expert in Python coroutines, I have good news.

Back in July, the book “500 Lines or Less: Experienced Programmers Solve Interesting Problems” was published, including the chapter I co-wrote with Guido van Rossum. Our chapter explains async networking. We show how non-blocking sockets work and how Python 3’s coroutines improve asynchronous network programs. I’m very proud of the chapter, but it’s a steeper climb than I’d like.
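As a taste of that territory, here is a minimal coroutine sketch (mine, written for this post rather than taken from the chapter), using today's high-level asyncio API instead of the from-scratch sockets the chapter walks through; the hostnames are placeholders:

import asyncio

async def fetch(host):
    # One coroutine per connection; each await yields control while the socket is busy.
    reader, writer = await asyncio.open_connection(host, 80)
    writer.write(b"GET / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
    body = await reader.read()
    writer.close()
    return host, len(body)

async def main():
    # Run several fetches concurrently on a single thread.
    results = await asyncio.gather(*(fetch(h) for h in ("example.com", "example.org")))
    for host, size in results:
        print(host, size, "bytes")

asyncio.run(main())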

Now, a top-notch teacher has made the climb much easier. Phillip Guo is an Assistant Professor of Cognitive Science at the University of California. He's broken the chapter into eight small sections, and he explains each one carefully in a video, either talking through our code examples or actually demonstrating them running and dissecting them in the code editor. It's a fine piece of work and a great way to approach the material. Go watch it:

Python Asynchronous I/O Walkthrough  

Image: Bethlehem Steel

Categories: FLOSS Project Planets

Mike Driscoll: Python 101 is now a Course on Educative

Planet Python - Mon, 2017-01-16 13:15

My first book, Python 101, has been made into an online course on the educative website. Educative is kind of like Code Academy in that you can run the code snippets from the book to see what kind of output they produce. You cannot edit the examples though. You can get 50% off of the course by using the following coupon code: au-pythonlibrary50 (note: This coupon is only good for one week)

Python 101 is primarily aimed at people who understand programming concepts or who have already programmed in another language. I do have a lot of readers who are completely new to programming and have enjoyed the book too, though. The book itself is split into 5 distinct parts:

Part one covers the basics of Python. Part two moves into learning a little of Python’s standard library. In this section, I cover the libraries that I find myself using the most on a day-to-day basis. Part three moves into intermediate level territory and covers various topics such as decorators, debugging, code profiling and testing your code. Part four introduces the reader to installing 3rd party libraries and briefly demonstrates some of the popular ones, such as lxml, requests, SQLAlchemy and virtualenv. The last section is all about distributing your code. Here you will learn how to add your code to Python Package Index as well as create Windows executables.

For a full table of contents, you can visit the book’s web page here. Educative also has a really good contents page for the online course too.

Categories: FLOSS Project Planets