FLOSS Project Planets

Kay Hayen: Nuitka Release 0.5.14

Planet Python - Fri, 2015-08-28 01:03

This is to inform you about the new stable release of Nuitka. It is the extremely compatible Python compiler. Please see the page "What is Nuitka?" for an overview.

This release is an intermediate step towards value propagation, which is not considered ready for a stable release yet. The major point is the elimination of the try/finally expressions, as they pose problems for SSA. The try/finally statement change is delayed.

There are also a lot of bug fixes and enhancements to code generation, as well as major cleanups of the code base.

Bug Fixes
  • Python3: Added support for unpacking assignments with a starred target.

    *a, b = 1, 2

    This raised ValueError before.

  • Python3: Properly detect illegal double star assignments.

    *a, *b = c
  • Python3: Properly detect the syntax error of using a starred target outside of a tuple or list of targets.

    *a = 1
  • Python3.4: Fixed a crash of the binary when copying dictionaries with split tables received as star arguments.

  • Python3: Fixed reference loss, when using raise a from b where b was an exception instance. Fixed in 0.5.13.8 already.

  • Windows: Fix, the flag --disable-windows-console was not properly handled for the MinGW32 runtime, resulting in a crash.

  • Python2.7.10: This was not recognized as a 2.7.x variant, and therefore minor version compatibility levels were not applied properly.

  • Fix, when choosing to have frozen source references, code objects did not use the same value as __file__ for their filename.

  • Fix, when re-executing itself to drop the site module, make sure we find the same file again, and not a different one picked up via the PYTHONPATH changes coming from it. Issue#223. Fixed in 0.5.13.4 already.

  • Enhanced code generation for del variable statements, where it's clear that the value must be assigned.

  • When pressing CTRL-C, the stack traces from both Nuitka and Scons were given; we now avoid the one from Scons.

  • Fix, the dump from --xml no longer contains functions that have become unused during analysis.

  • Standalone: Creating or running programs from inside unicode paths was not working on Windows. Issue#229 and Issue#231. Fixed in 0.5.13.7 already.

  • Namespace package support was not yet complete, importing the parent of a package was still failing. Issue#230. Fixed in 0.5.13.7 already.

  • Python2.6: Compatibility for exception check messages enhanced with newest minor releases.

  • Compatibility: The NameError raised in classes needs to say "global name" and not just "name" too.

  • Python3: Fixed creation of XML representation, now done without lxml as it doesn't support needed features on that version. Fixed in 0.5.13.5 already.

  • Python2: Fix, creating code for the largest negative constant that still fits into an int was only working in the main module. Issue#228. Fixed in 0.5.13.5 already.

New Features
  • Added support for Windows 10.
  • Followed changes for Python 3.5 beta 2. Still only usable as a Python 3.4 replacement, no new features.
  • Using a self compiled Python running from the source tree is now supported.
  • Added support for the Anaconda Python distribution. As it doesn't install the Python DLL, we copy it along for acceleration mode.
  • Added support for Visual Studio 2015. Issue#222. Fixed in 0.5.13.3 already.
  • Added support for self compiled Python versions running from build tree, this is intended to help debug things on Windows.
Optimization
  • Function inlining is now present in the code, but still disabled, because it needs more changes in other areas, before we can generally do it.

  • Trivial outlines, the result of re-formulations or function inlining, are now inlined, in case they just return an expression.

  • The re-formulation for or and and no longer uses a try/finally expression, at the cost of dedicated boolean nodes and code generation for these.

    This saves around 8% of compile time memory for Nuitka, and allows for faster and more complete optimization, and gets rid of a complicated structure for analysis.

  • When a frame is used in an exception, its locals are detached. This was done more often than necessary, and even for frames that are not necessarily our own ones. This will speed up some exception cases.

  • When the default arguments, the keyword default arguments (Python3), or the annotations (Python3) were raising an exception, the function definition is now replaced with the exception, saving that code generation. This happens frequently with Python2/Python3 compatible code guarded by version checks.

  • The SSA analysis for loops now properly traces "break" statement situations and merges the post-loop situation from all of them. This significantly improves optimization of code following the loop.

  • The SSA analysis of try/finally statements has been greatly enhanced. The handler for finally is now optimized for exception raise and no exception raise individually, as well as for break, continue and return in the tried code. The SSA analysis for after the statement is now the result of merging these different cases, should they not abort.

  • The code generation for del statements now takes advantage of definite knowledge of the previous value, should there be any. This speeds them up slightly.

  • The SSA analysis of del statements now properly decides whether the statement can raise or not, allowing for more optimization.

  • For list contractions, the re-formulation was enhanced using the new outline construct instead of a pseudo function, leading to better analysis and code generation.

  • Comparison chains are now re-formulated into outlines too, allowing for better analysis of them.

  • Exceptions raised in function creations, e.g. in default values, are now propagated, eliminating the function's code. This happens most often with Python2/Python3 compatible code in branches. On the other hand, function creations that cannot raise are also annotated now.

  • Closure variables that become unreferenced outside of the function become normal variables, leading to better tracing and code generation for them.

  • Function creations cannot raise unless their defaults, keyword defaults, or annotations do.
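
    A quick illustration of the default-value case (this is ordinary CPython behaviour that the optimization exploits; the function name is made up):

    # The default expression is evaluated at definition time, so the whole
    # function creation can be replaced by the raise it produces.
    def fallback(timeout=1 / 0):   # ZeroDivisionError; `fallback` is never created
        return timeout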

Organizational
  • Removed the gitorious mirror of the git repository, since the service shut down.
  • Make it more clear in the documentation that Python2 is needed at compile time to create Python3 executables.
Cleanups
  • Moved more parts of code generation to their own modules, and used registry for code generation for more expression kinds.

  • Unified try/except and try/finally into a single construct that handles both through try/except/break/continue/return semantics. Finally is now solved by duplicating the handler into the cases where necessary.

    No longer are nodes annotated with information if they need to publish the exception or not, this is now all done with the dedicated nodes.

  • The try/finally expressions have been replaced with outline function bodies, which, instead of side-effect statements, are more like functions with return values, allowing for easier analysis and dedicated code generation of much lower complexity.

  • No more "tolerant" flag for release nodes, we now decide this fully based on SSA information.

  • Added helper for assertions that code flow does not reach certain positions, e.g. a function must return or raise, aborting statements do not continue and so on.

  • To keep cloning of code parts as simple as possible, the limited use of makeCloneAt has been changed to a new makeClone which produces identical copies, which is what we always do. And a generic cloning based on "details" has been added, requiring constructor arguments and details to be complete and consistent.

  • The re-formulation code helpers have been improved to be more convenient at creating nodes.

  • The old nuitka.codegen module Generator was still used for many things. These now all got moved to appropriate code generation modules, and their users got updated, also moving some code generator functions in the process.

  • The module nuitka.codegen.CodeTemplates got replaced with direct uses of the proper topic module from nuitka.codegen.templates, with some more added, and their names harmonized to be more easily recognizable.

  • Added more assertions to the generated code, to aid bug finding.

  • The autoformat now sorts pylint markups for increased consistency.

  • Releases no longer have a tolerant flag, this was not needed anymore as we use SSA.

  • Handle CTRL-C in the Scons code, preventing per-job messages that are not helpful and avoiding tracebacks from Scons; also removed more unused tools like rpm from our inline copy.

Tests
  • Added the CPython3.4 test suite.

  • The CPython3.2, CPython3.3, and CPython3.4 test suites now run with Python2 giving the same errors. Previously there were a few specific errors, some with line numbers, some with a different SyntaxError being raised, due to a different order of checks.

    This increases the coverage of the exception raising tests somewhat.

  • Also the CPython3.x test suites now all pass with debug Python, as does the CPython 2.6 test suite with 2.6 now.

  • Added tests to cover all forms of unpacking assignments supported in Python3, to be sure there are no other errors unknown to us.

  • Started to document the reference count tests, and to make it more robust against SSA optimization. This will take some time and is work in progress.

  • Made the compile library test robust against modules that raise a syntax error, checking that Nuitka does the same.

  • Refined more tests to be directly executable with Python3; this is an ongoing effort.

Summary

This release is clearly major. It represents a huge step forward for Nuitka as it improves nearly every aspect of code generation and analysis. Removing the try/finally expression nodes proved to be necessary in order to even have the correct SSA in their cases. Very important optimization was blocked by it.

Going forward, the try/finally statements will be removed and dead variable elimination will happen, which then will give function inlining. This is expected to happen in one of the next releases.

This release is a consolidation of 8 hotfix releases, and many refactorings needed towards the next big step, which might also break things, and for that reason is going to get its own release cycle.

Categories: FLOSS Project Planets

Mike C. Fletcher: Raspberry Pi PyOpenGL

Planet Python - Fri, 2015-08-28 00:35

So since Soni got me to set up raspbian on the old raspberry pi, I got PyOpenGL + GLES2 working on it today. There is a bug in raspbian's EGL library (it depends on GLES2 without linking to it), but with a work-around for that, bzr head of PyOpenGL can run the bcmwindow example/raw.py demo. I *don't* have a spare HDMI cable, however, so I didn't actually get to *see* the demo running. Oh well, next time. bcmwindow is now up on pypi should people be interested.

Categories: FLOSS Project Planets

Montreal Python User Group: Call for Speakers - Montréal-Python 54: Virtualized Utopia

Planet Python - Fri, 2015-08-28 00:00

It's back to school so, at Montréal-Python, we are preparing for the first event of the season!

We are back every second Monday of the month, so our next meeting will take place on Monday, September the 14th at 6:30pm at UQÀM.

For the occasion, we are looking for speakers to give talks of 5, 10, 20, or even 45 minutes.

Come tell us about your latest discoveries, your latest module, or your latest professional or personal accomplishments. It is your chance to meet the local Python community.

Send us your proposals at mtlpyteam@googlegroups.com

When:

Monday, September 14th 2015

Schedule:
  • 6:00pm — Doors open
  • 6:30pm — Presentations start
  • 7:30pm — Break
  • 7:45pm — Second round of presentations
  • 9:00pm — One free beer offered at Bénélux just across the street
We’d like to thank our sponsors for their continued support:
  • UQÀM
  • Bénélux
  • w.illi.am/
  • Outbox
  • Savoir-faire Linux
  • Caravan
  • iWeb
Categories: FLOSS Project Planets

The Fiber Engine Poll, Updates, and Breeze

Planet KDE - Thu, 2015-08-27 23:36

Some weeks ago I ran a poll to see what would be the preferred rendering engine for Fiber, and so I figure now is the time to post results. There was a surprising amount of misinformation/confusion running around about what each option potentially meant which I hope to clear up, but overall the results were so compelling I doubt stripping the misinformation and re-running the poll would return a different result.

Third Place: Port to CEF Later

“Porting to CEF later” was the lowest voted option at ~18% of the ballot, and in retrospect it makes sense since it’s just a poor solution. The only upside is that it gets an obsolete implementation out the door (if that’s an upside), but it makes things complicated during an important phase of the project by putting an engine change in motion while trying to flesh out deeply tied APIs. Not good.

Oddly, some people wanted a WebEngine/CEF switch and took this option as Fiber having such a switch. Considering CEF proper is based on Chromium/Blink (which is what WebEngine uses), it’s a bit like asking to take two paths to the same destination; there are differences in the road, but in the end both ways lead to Blink. There will be no CEF/WebEngine switch, because adding one would bring the API potential down to the lowest common denominator while pushing complexity up to that of the most advanced method.

Runner up: Use WebEngine

“Use WebEngine” was the runner-up at 24% of the vote. The main prospect behind this is that it would get a browser shipping fastest, but it also rests on the assumption that it may increase code compatibility between Qt-based browsers – though I believe the architecture of Fiber will be very alien compared to contemporary solutions. If there are chances to collaborate I will, but I don’t know how much of that will be possible.

There was also a segment that voted for WebEngine thinking CEF was just a more complicated route to Chromium, being confused about the road to Servo.

Winner by a mile: Go Exclusively CEF

It’s no surprise that in the end “Use CEF” trounced the remainder of the poll, with 59% of respondents voting in favour of it – more than both other options combined, or any individual option doubled. From the comments around the internet, one of the biggest reasons for the vote is Servo as a major differentiating factor between other browsers, and also because it would help mitigate the WebKit/Blink monopoly forming among non-Mozilla browsers for Linux.

This excites me as a web developer, and I’m likely to try pushing Servo as the default engine as it will likely be plenty good by the time Fiber is released. Sadly, I believe there were a few votes placed thinking that Fiber would ultimately usher in a “QCef” or “KCef” framework; and I don’t think this will be the case.

On making a Frameworks 5 API I considered it as a super-interesting Frameworks addition, but after careful consideration I realised there just aren’t too many projects which would benefit from what would be a substantial amount of work. Another issue is that I think the QWebEngine is appropriate for most projects, and that anything more is needless complication. The Qt developers have done a good job picking the right APIs to expose which suits common needs, and I imagine the additional complexity would only hurt projects adopting such a library; it’s killing a mosquito with a cannon. Plus, QWebEngine will evolve in good time to fill any common needs that pop up.

What will Fiber do?

Fiber is going to go exclusively CEF. I’m in the process of fiddling CEF into the browser – but CEF is a bit of a beast, and about 3/4 of my time is simply reading CEF documentation, examples, and the source code of open projects using the utility. My main concern is properly including CEF without requiring X11; it’s possible, but the Linux example code isn’t using Aura, and the implementation examples are GTK-based as well. Qt and KF5 have solutions, but I’m researching the best route to take.

In terms of what engine Fiber is using (Servo vs Blink) I’m going the generic route; you can drop in simple config files pointing to CEF-compatible executables, and when configuring profiles you can pick which engine you would like to use based on those files. This engine switch is already present on the command line and in the “Tuning” section of the profiles manager. This means you can have different profiles running different engines if you choose. There’s a second command-line option which will launch a new instance of Fiber with the engine of your choice running one-time for testing purposes. For the purposes of the default, I’ll probably push Servo.

CEF will not drive UI

Indirectly using CEF means QML may become the exclusive language of UI extensions, popups, and config dialogs. Mainly this is because of the additional abstraction and effort required to offer CEF in several contexts, but it also puts a much cleaner separation between browser and content and will likely make securing the system easier. Extensions will be required to offer pages in HTML.

If you’re using QML, you’re writing chrome. If you’re using HTML, you’re writing a page.

This is also more in-line with the Plasma Mobile guidelines, and though I severely doubt you’ll see Fiber become a mobile browser any time soon this keeps the door open for the far future. In two years I’d rather not break a significant number of extensions for mobile inclusion; I’d rather just have things work, maybe with some minor layout tweaks.

There are real pros and cons to QML as the only way to extend the browser UI, and probably one of the largest I worry about is the fact that QML has a significantly smaller developer base than HTML. On the plus side QML is able to adapt to platforms, meaning we might not need to divide extensions between desktop and mobile – that would simply boil down to layout tweaks. All this means is instead of having many extensions of questionable quality, we will aim to offer fewer but higher-quality extensions.

On Progress

Progress is steady. Probably an hour or two of work a night goes into the project, and extra time on weekends as freedom allows. It drives people nuts that I’m taking my dear sweet time on this, but when the groundwork is done there will be a solid base for others to help quickly build on.

I’ve introduced threading into some parts of Fiber’s management tools, and made significant improvements to how Fiber manages internal data caching for profile data. This all got started when I noticed a split-second of lag on a slider and realised the long-term implications. Threading was introduced so when the database models are working they do not lag the main thread, and the layer which talks to the model now caches the data and only communicates with the model when one is out of sync. The next step will be to add some internal very coarse timers and event tools which will delay hard data saves until they can be batched efficiently or must be written, and possibly a check to prevent the saving of identical data.

While this may not matter as much for the management tools, I’ll be applying these techniques on an extension-wide basis; this will save power, keep Fiber highly responsive, make it CPU-wake friendly, and avoid hard drive wakeups – even when bad extensions exhibit “thrashing” behaviours. Ironically this first performance exercise has made me confident that even with many “slow” javascript-driven features, Fiber may become a highly performant browser by virtue of having extremely fine-tuned APIs which give blanket improvements.

One of the most annoying but necessary changes was porting Fiber from QMake to CMake. Originally I had the intention to prototype using QMake, switching to CMake later for the “real” work. As things would have it the prototype had simply evolved and I realised it would just be easier to port it. As I’m not terribly familiar with CMake this started off painfully, but once I realised what CMake was trying to encourage I fell in love and things just clicked.

During the CMake port I also took the opportunity to strip out vestigial or prototypical code and do some housekeeping, which certainly cleaned things up as I not only removed files but also disposed of bits of code too. I also removed all traces of WebEngine which I had used during the earliest prototype phase; the next time Google pops up, it’ll be with CEF.

I’ve also started incorporating the first KF5 libraries into the project. The libraries are very well organised, and also well documented. Finally, I need to compliment Qt and state how amazing the toolkit is. Really. Some of the most notable changes were trivial thanks to Qt making smart use of its internal structure, and even though I’m hardly a veteran developer, Qt and its extremely good documentation have allowed me to make smart, informed decisions. Really guys, good job.

On other projects

Moving away from Fiber, right now we’re doing a lot of work on refining the Breeze theme for Plasma 5.5 in this thread, where we’re running down paper-cuts on the design and building the next iteration of the style. Ideally, we’d like to see a much more consistent and well-defined visual structure. Later on we will start to address things like alignment issues, and start targeted papercut topics which will address specific visual issues. If you’re interested, please read the entire thread as there is lots of design discussion and contribute your thoughts.

Remember, constructive feedback is the easiest contribution anyone can make to an open-source project!


Categories: FLOSS Project Planets

PyTexas: 2015 Schedule Released and Tons of Ways to Stay in the Loop

Planet Python - Thu, 2015-08-27 22:10

The 2015 schedule has been released and it is jam-packed with awesome talks and tutorials. Check it out today.

2015 schedule »

We've also added a few extra ways to stay in the loop with PyTexas news.

First, we added push notifications to the site, so if you see a little pop-up asking for authorization and hit "Allow", you will receive up-to-date information on the conference. These notifications even come when you're disconnected from the site if you use Chrome (all versions) or Safari (desktop). All other browsers require you to be on the site to get the notifications.

Lastly, we also created a Gitter.im Chat Room for anyone that wants to talk about the conference or has questions. We'll be monitoring that chat room, so feel free to drop in anytime.

Gitter.im Chat Room »

Don't like any of those options? Then check us out on Twitter too: @pytexas.

Categories: FLOSS Project Planets

Ben Rousch: Kivy – Interactive Applications and Games in Python, 2nd Edition Review

Planet Python - Thu, 2015-08-27 21:16

I was recently asked by the author to review the second edition of “Kivy – Interactive Applications in Python” from Packt Publishing. I had difficulty recommending the first edition mostly due to the atrocious editing – or lack thereof – that it had suffered. It really reflected badly on Packt, and since it was the only Kivy book available, I did not want that same inattention to quality to reflect on Kivy. Packt gave me a free ebook copy of this book in exchange for agreeing to do this review.

At any rate, the second edition is much improved over the first. Although a couple of glaring issues remain, it looks like it has been visited by at least one native English speaking editor. The Kivy content is good, and I can now recommend it for folks who know Python and want to get started with Kivy. The following is the review I posted to Amazon:

This second edition of “Kivy – Interactive Applications and Games in Python” is much improved from the first edition. The atrocious grammar throughout the first edition book has mostly been fixed, although it’s still worse than what I expect from a professionally edited book. The new chapters showcase current Kivy features while reiterating how to build a basic Kivy app, and the book covers an impressive amount of material in its nearly 185 pages. I think this is due largely to the efficiency and power of coding in Python and Kivy, but also to the carefully-chosen projects the author selected for his readers to create. Despite several indentation issues in the example code and the many grammar issues typical of Packt’s books, I can now recommend this book for intermediate to experienced Python programmers who are looking to get started with Kivy.

Chapter one is a good, quick introduction to a minimal Kivy app, layouts, widgets, and their properties.

Chapter two is an excellent introduction and exploration of basic canvas features and usage. This is often a difficult concept for beginners to understand, and this chapter handles it well.

Chapter three covers events and binding of events, but is much denser and difficult to grok than chapter two. It will likely require multiple reads of the chapter to get a good understanding of the topic, but if you’re persistent, everything you need is there.

Chapter four contains a hodge-podge of Kivy user interface features. Screens and scatters are covered well, but gestures still feel like magic. I have yet to find a good in-depth explanation of gestures in Kivy, so this does not come as a surprise. Behaviors is a new feature in Kivy and a new section in this second edition of the book. Changing default styles is also covered in this chapter. The author does not talk about providing a custom atlas for styling, but presents an alternative method for theming involving Factories.

In chapter six the author does a good job of covering animations, and introduces sounds, the clock, and atlases. He brings these pieces together to build a version of Space Invaders, in about 500 lines of Python and KV. It ends up a bit code-dense, but the result is a fun game and a concise code base to play around with.

In chapter seven the author builds a TED video player including subtitles and an Android actionbar. There is perhaps too much attention paid to the VideoPlayer widget, but the resulting application is a useful base for creating other video applications.

Categories: FLOSS Project Planets

Ben Hutchings: Securing my own blog

Planet Debian - Thu, 2015-08-27 21:03

Yeah I know, a bit ironic that this isn't available over HTTP-S. I could reuse the mail server certificate to make https://decadent.org.uk/ work...

Categories: FLOSS Project Planets

Ben Hutchings: Securing debcheckout of git repositories

Planet Debian - Thu, 2015-08-27 21:01

Some source packages have Vcs-Git URLs using the git: scheme, which is plain-text and unauthenticated. It's probably harder to MITM than HTTP, but still we can do better than this even for anonymous checkouts. git is now nearly as efficient at cloning/pulling over HTTP-S, so why not make that the default?

Adding the following lines to ~/.gitconfig will make git consistently use HTTP-S to access Alioth. It's not quite HTTPS-Everywhere, but it's a step in that direction:

[url "https://anonscm.debian.org/git/"] insteadOf = git://anonscm.debian.org/ insteadOf = git://git.debian.org/

Additionally you can automatically fix up the push URL in case you have or are later given commit access to the repository on Alioth:

[url "git+ssh://git.debian.org/git/"] pushInsteadOf = git://anonscm.debian.org/ pushInsteadOf = git://git.debian.org/

Similar for git.kernel.org:

[url "https://git.kernel.org/pub/scm/"] insteadOf = git://git.kernel.org/pub/scm/ [url "git+ssh://ra.kernel.org/pub/scm/"] pushInsteadOf = git://git.kernel.org/pub/scm/

RTFM for more information on these configuration variables.

Categories: FLOSS Project Planets

Ben Hutchings: Securing git imap-send in Debian

Planet Debian - Thu, 2015-08-27 20:26

I usually send patches from git via git imap-send, which gives me a chance to edit and save them through my regular mail client. Obviously I want to make a secure connection to the IMAP server. The upstream code now supports doing this with OpenSSL, but git is under GPL and it seems that not all relevant contributors have given the extra permission to link with OpenSSL. So in Debian you still need to use an external program to provide a TLS tunnel.

The commonly used TLS tunnelling programs, openssl s_client and stunnel, do not validate server certificates in a useful way - at least by default.

Here's how I've configured git imap-send and stunnel to properly validate the server certificate. If you use the PLAIN or LOGIN authentication method with the server, you will still see the warning:

*** IMAP Warning *** Password is being sent in the clear

The server does see the clear-text password, but it is encrypted on the wire and git imap-send just doesn't know that.

~/.gitconfig

[imap]
    user = ben
    folder = "drafts"
    tunnel = "stunnel ~/.git-imap-send/stunnel.conf"

~/.git-imap-send/stunnel.conf

debug = 3
foreground = yes
client = yes
connect = mail.decadent.org.uk:993
sslVersion = TLSv1.2
renegotiation = no
verify = 2
; Current CA for the IMAP server.
; If you don't want to pin to a specific CA certificate, use
; /etc/ssl/certs/ca-certificates.crt instead.
CAfile = /etc/ssl/certs/StartCom_Certification_Authority.pem
checkHost = mail.decadent.org.uk

If stunnel chokes on the checkHost variable, it doesn't support certificate name validation. Unfortunately no Debian stable release has this feature - only testing/unstable. I'm wondering whether it would be worthwhile to backport it or even to make a stable update to add this important security feature.

Categories: FLOSS Project Planets

Savas Labs: Sassy Drupal theming: a lighter version of SMACSS

Planet Drupal - Thu, 2015-08-27 20:00

It takes some forethought, but a well-organized theme means code that is modular and easy to maintain or pass off to another developer. SMACSS principles are becoming more and more widespread and can be applied to a Drupal theme. At Savas we've picked out what we love from SMACSS and simplified the rest, creating a stylesheet organization method that works for us. In this post (part 2 of my three-part series on Drupal theming with Sass) I'll go through our version of SMACSS and link to real examples.

Continue reading…

Categories: FLOSS Project Planets

Matthew Rocklin: Efficient Tabular Storage

Planet Python - Thu, 2015-08-27 20:00

tl;dr: We discuss efficient techniques for on-disk storage of tabular data, notably the following:

  • Binary stores
  • Column stores
  • Categorical support
  • Compression
  • Indexed/Partitioned stores

We use the NYC Taxi dataset for examples, and introduce a small project, Castra.

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

Larger than Memory Data and Disk I/O

We analyze large datasets (10-100GB) on our laptop by extending memory with disk. Tools like dask.array and dask.dataframe make this easier for array and tabular data.

Interaction times can improve significantly (from minutes to seconds) if we choose to store our data on disk efficiently. This is particularly important for large data because we can no longer separately “load in our data” while we get a coffee and then iterate rapidly on our dataset once it’s comfortably in memory.

Larger-than-memory datasets force interactive workflows to include the hard drive.

CSV is convenient but slow

CSV is great. It’s human readable, accessible by every tool (even Excel!), and pretty simple.

CSV is also slow. The pandas.read_csv parser maxes out at 100MB/s on simple data. This doesn’t include any keyword arguments like datetime parsing that might slow it down further. Consider the time to parse a 24GB dataset:

24GB / (100MB/s) == 4 minutes

A four minute delay is too long for interactivity. We need to operate in seconds rather than minutes otherwise people leave to work on something else. This improvement from a few minutes to a few seconds is entirely possible if we choose better formats.

Example with CSVs

As an example lets play with the NYC Taxi dataset using dask.dataframe, a library that copies the Pandas API but operates in chunks off of disk.

>>> import dask.dataframe as dd
>>> df = dd.read_csv('csv/trip_data_*.csv',
...                  skipinitialspace=True,
...                  parse_dates=['pickup_datetime', 'dropoff_datetime'])
>>> df.head()
                          medallion                      hack_license vendor_id  rate_code store_and_fwd_flag     pickup_datetime    dropoff_datetime  passenger_count  trip_time_in_secs  trip_distance  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude
0  89D227B655E5C82AECF13C3F540D4CF4  BA96DE419E711691B9445D6A6307C170       CMT          1                  N 2013-01-01 15:11:48 2013-01-01 15:18:10                4                382            1.0        -73.978165        40.757977         -73.989838         40.751171
1  0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472       CMT          1                  N 2013-01-06 00:18:35 2013-01-06 00:22:54                1                259            1.5        -74.006683        40.731781         -73.994499         40.750660
2  0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472       CMT          1                  N 2013-01-05 18:49:41 2013-01-05 18:54:23                1                282            1.1        -74.004707        40.737770         -74.009834         40.726002
3  DFD2202EE08F7A8DC9A57B02ACB81FE2  51EE87E3205C985EF8431D850C786310       CMT          1                  N 2013-01-07 23:54:15 2013-01-07 23:58:20                2                244            0.7        -73.974602        40.759945         -73.984734         40.759388
4  DFD2202EE08F7A8DC9A57B02ACB81FE2  51EE87E3205C985EF8431D850C786310       CMT          1                  N 2013-01-07 23:25:03 2013-01-07 23:34:24                1                560            2.1        -73.976250        40.748528         -74.002586         40.747868

Time Costs

It takes a second to load the first few lines but 11 to 12 minutes to roll through the entire dataset. We make a zoomable picture below of a random sample of the taxi pickup locations in New York City. This example is taken from a full example notebook here.

df2 = df[(df.pickup_latitude > 40) &
         (df.pickup_latitude < 42) &
         (df.pickup_longitude > -75) &
         (df.pickup_longitude < -72)]

sample = df2.sample(frac=0.0001)
pickup = sample[['pickup_latitude', 'pickup_longitude']]

result = pickup.compute()

from bokeh.plotting import figure, show, output_notebook
p = figure(title="Pickup Locations")
p.scatter(result.pickup_longitude, result.pickup_latitude, size=3, alpha=0.2)

Eleven minutes is a long time

This result takes eleven minutes to compute, almost all of which is parsing CSV files. While this may be acceptable for a single computation we invariably make mistakes and start over or find new avenues in our data to explore. Each step in our thought process now takes eleven minutes, ouch.

Interactive exploration of larger-than-memory datasets requires us to evolve beyond CSV files.

Principles to store tabular data

What efficient techniques exist for tabular data?

A good solution may have the following attributes:

  1. Binary
  2. Columnar
  3. Categorical support
  4. Compressed
  5. Indexed/Partitioned

We discuss each of these below.

Binary

Consider the text ‘1.23’ as it is stored in a CSV file and how it is stored as a Python/C float in memory:

  • CSV: 1.23
  • C/Python float: 0x3f9d70a4

These look very different. When we load 1.23 from a CSV textfile we need to translate it to 0x3f9d70a4; this takes time.

A binary format stores our data on disk exactly how it will look in memory; we store the bytes 0x3f9d70a4 directly on disk so that when we load data from disk to memory no extra translation is necessary. Our file is no longer human readable but it’s much faster.
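
A quick way to see those bytes for yourself (a minimal sketch using only the Python standard library):

import struct

# The text "1.23" has to be parsed character by character; the same value as a
# 32-bit float is just these four bytes, which can be copied from disk to
# memory unchanged.
print(struct.pack('>f', 1.23).hex())   # -> 3f9d70a4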

This gets more intense when we consider datetimes:

  • CSV: 2015-08-25 12:13:14
  • NumPy datetime representation: 1440529994000000 (as an integer)

Every time we parse a datetime we need to compute how many microseconds it has been since the epoch. This calculation needs to take into account things like how many days in each month, and all of the intervening leap years. This is slow. A binary representation would record the integer directly on disk (as 0x51e278694a680) so that we can load our datetimes directly into memory without calculation.

Columnar

Many analytic computations only require a few columns at a time, often only one, e.g.

>>> df.passenger_count.value_counts().compute().sort_index()
0           3755
1      119605039
2       23097153
3        7187354
4        3519779
5        9852539
6        6628287
7             30
8             23
9             24
129            1
255            1
Name: passenger_count, dtype: int64

Of our 24 GB we may only need 2GB. Columnar storage means storing each column separately from the others so that we can read relevant columns without passing through irrelevant columns.

Our CSV example fails at this. While we only want two columns, pickup_datetime and pickup_longitude, we pass through all of our data to collect the relevant fields. The pickup location data is mixed with all the rest.

Categoricals

Categoricals encode repetitive text columns (normally very expensive) as integers (very very cheap) in a way that is invisible to the user.

Consider the following (mostly text) columns of our NYC taxi dataset:

>>> df[['medallion', 'vendor_id', 'rate_code', 'store_and_fwd_flag']].head()
                          medallion vendor_id  rate_code store_and_fwd_flag
0  89D227B655E5C82AECF13C3F540D4CF4       CMT          1                  N
1  0BD7C8F5BA12B88E0B67BED28BEA73D8       CMT          1                  N
2  0BD7C8F5BA12B88E0B67BED28BEA73D8       CMT          1                  N
3  DFD2202EE08F7A8DC9A57B02ACB81FE2       CMT          1                  N
4  DFD2202EE08F7A8DC9A57B02ACB81FE2       CMT          1                  N

Each of these columns represents elements of a small set:

  • There are two vendor ids
  • There are twenty one rate codes
  • There are three store-and-forward flags (Y, N, missing)
  • There are about 13000 taxi medallions. (still a small number)

And yet we store these elements in large and cumbersome dtypes:

In [4]: df[['medallion', 'vendor_id', 'rate_code', 'store_and_fwd_flag']].dtypes
Out[4]:
medallion             object
vendor_id             object
rate_code              int64
store_and_fwd_flag    object
dtype: object

We use int64 for rate code, which could easily have fit into an int8, an opportunity for an 8x improvement in memory use. The object dtype used for strings in Pandas and Python takes up a lot of memory and is quite slow:

In [1]: import sys

In [2]: sys.getsizeof('CMT')  # bytes
Out[2]: 40

Categoricals replace the original column with a column of integers (of the appropriate size, often int8) along with a small index mapping those integers to the original values. I’ve written about categoricals before so I won’t go into too much depth here. Categoricals increase both storage and computational efficiency by about 10x if you have text data that describes elements in a category.
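
A minimal, self-contained illustration of the effect with made-up data (not the taxi dataset itself):

import pandas as pd

s = pd.Series(['CMT', 'VTS'] * 500000)                 # repetitive text column
print(s.memory_usage(deep=True))                       # object dtype: tens of MB
print(s.astype('category').memory_usage(deep=True))    # int8 codes plus a tiny index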

Compression

After we’ve encoded everything well and separated our columns we find ourselves limited by disk I/O read speeds. Disk read bandwidths range from 100MB/s (laptop spinning disk hard drive) to 2GB/s (RAID of SSDs). This read speed strongly depends on how large our reads are. The bandwidths given above reflect large sequential reads such as you might find when reading all of a 100MB file in one go. Performance degrades for smaller reads. Fortunately, for analytic queries we’re often in the large sequential read case (hooray!)

We reduce disk read times through compression. Consider the datetimes of the NYC taxi dataset. These values are repetitive and slowly changing; a perfect match for modern compression techniques.

>>> ind = df.index.compute()  # this is on presorted index data (see castra section below)
>>> ind
DatetimeIndex(['2013-01-01 00:00:00', '2013-01-01 00:00:00',
               '2013-01-01 00:00:00', '2013-01-01 00:00:00',
               '2013-01-01 00:00:00', '2013-01-01 00:00:00',
               '2013-01-01 00:00:00', '2013-01-01 00:00:00',
               '2013-01-01 00:00:00', '2013-01-01 00:00:00',
               ...
               '2013-12-31 23:59:42', '2013-12-31 23:59:47',
               '2013-12-31 23:59:48', '2013-12-31 23:59:49',
               '2013-12-31 23:59:50', '2013-12-31 23:59:51',
               '2013-12-31 23:59:54', '2013-12-31 23:59:55',
               '2013-12-31 23:59:57', '2013-12-31 23:59:57'],
              dtype='datetime64[ns]', name=u'pickup_datetime',
              length=169893985, freq=None, tz=None)

Benchmark datetime compression

We can use a modern compression library, like fastlz or blosc to compress this data at high speeds.

In [36]: import blosc

In [37]: %time compressed = blosc.compress_ptr(address=ind.values.ctypes.data,
    ...:                                       items=len(ind),
    ...:                                       typesize=ind.values.dtype.alignment,
    ...:                                       clevel=5)
CPU times: user 3.22 s, sys: 332 ms, total: 3.55 s
Wall time: 512 ms

In [40]: len(compressed) / ind.nbytes  # compression ratio
Out[40]: 0.14296813539337488

In [41]: ind.nbytes / 0.512 / 1e9  # Compression bandwidth (GB/s)
Out[41]: 2.654593515625

In [42]: %time _ = blosc.decompress(compressed)
CPU times: user 1.3 s, sys: 438 ms, total: 1.74 s
Wall time: 406 ms

In [43]: ind.nbytes / 0.406 / 1e9  # Decompression bandwidth (GB/s)
Out[43]: 3.3476647290640393

We store 7x fewer bytes on disk (thus septupling our effective disk I/O) by adding an extra 3GB/s delay. If we’re on a really nice Macbook pro hard drive (~600MB/s) then this is a clear and substantial win. The worse the hard drive, the better this is.

But sometimes compression isn’t as nice

Some data is more or less compressible than others. The following column of floating point data does not compress as nicely.

In [44]: x = df.pickup_latitude.compute().values

In [45]: %time compressed = blosc.compress_ptr(x.ctypes.data, len(x),
    ...:                                       x.dtype.alignment, clevel=5)
CPU times: user 5.87 s, sys: 0 ns, total: 5.87 s
Wall time: 925 ms

In [46]: len(compressed) / x.nbytes
Out[46]: 0.7518617315969132

This compresses more slowly and only provides marginal benefit. Compression may still be worth it on slow disk but this isn’t a huge win.

The pickup_latitude column isn’t compressible because most of the information isn’t repetitive. The numbers to the far right of the decimal point are more or less random.

40.747868

Other floating point columns may compress well, particularly when they are rounded to small and meaningful decimal values.

Compression rules of thumb

Optimal compression requires thought. General rules of thumb include the following:

  • Compress integer dtypes
  • Compress datetimes
  • If your data is slowly varying (e.g. sorted time series) then use a shuffle filter (default in blosc)
  • Don’t bother much with floating point dtypes
  • Compress categoricals (which are just integer dtypes)
Avoid gzip and bz2

Finally, avoid gzip and bz2. These are both very common and very slow. If dealing with text data, consider snappy (also available via blosc.)
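
As a rough demonstration of the shuffle rule of thumb above (a toy sketch with synthetic data; exact ratios will vary by blosc version and codec):

import numpy as np
import blosc

x = np.arange(1000000, dtype='int64')      # slowly varying integers
raw = x.tobytes()

with_shuffle = blosc.compress(raw, typesize=8)                 # shuffle filter on (default)
no_shuffle   = blosc.compress(raw, typesize=8, shuffle=False)  # shuffle filter off
print(len(with_shuffle), len(no_shuffle))  # the shuffled version should be much smaller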

Indexing/Partitioning

One column usually dominates our queries. In time-series data this is time. For personal data this is the user ID.

Just as column stores let us avoid irrelevant columns, partitioning our data along a preferred index column lets us avoid irrelevant rows. We may need the data for the last month and don’t need several years’ worth. We may need the information for Alice and don’t need the information for Bob.

Traditional relational databases provide indexes on any number of columns or sets of columns. This is excellent if you are using a traditional relational database. Unfortunately the data structures to provide arbitrary indexes don’t mix well with some of the attributes discussed above and we’re limited to a single index that partitions our data into sorted blocks.
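
A toy sketch of why this helps, using plain Pandas and invented data (the real datasets are partitioned on disk, but the idea is the same):

import numpy as np
import pandas as pd

idx = pd.date_range('2013-01-01', '2013-12-31 23:00', freq='H')
df = pd.DataFrame({'trip_distance': np.random.rand(len(idx))}, index=idx)

# With rows sorted/partitioned by time, selecting one month only has to touch
# that month's rows (or, on disk, that month's blocks).
january = df.loc['2013-01']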

Some projects that implement these principles

Many modern distributed database storage systems designed for analytic queries implement these principles well. Notable players include Redshift and Parquet.

Additionally newer single-machine data stores like Dato’s SFrame and BColz follow many of these principles. Finally many people have been doing this for a long time with custom use of libraries like HDF5.

It turns out that these principles are actually quite easy to implement with the right tools (thank you #PyData). The rest of this post will talk about a tiny 500 line project, Castra, that implements these principles and gets good speedups on biggish Pandas data.

Castra

With these goals in mind we built Castra, a binary partitioned compressed columnstore with builtin support for categoricals and integration with both Pandas and dask.dataframe.

Load data from CSV files, sort on index, save to Castra

Here we load in our data from CSV files, sort on the pickup datetime column, and store to a castra file. This takes about an hour (as compared to eleven minutes for a single read.) Again, you can view the full notebook here

>>> import dask.dataframe as dd
>>> df = dd.read_csv('csv/trip_data_*.csv',
...                  skipinitialspace=True,
...                  parse_dates=['pickup_datetime', 'dropoff_datetime'])
>>> (df.set_index('pickup_datetime', compute=False)
...    .to_castra('trip.castra', categories=True))

Profit

Now we can take advantage of columnstores, compression, and binary representation to perform analytic queries quickly. Here is code to create a histogram of trip distance. The plot of the results follows below.
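
A rough sketch of that kind of computation (not the author's exact code; this reuses the CSV-backed frame from earlier, while the Castra-backed frame is what gives the timings quoted below):

import dask.dataframe as dd

df = dd.read_csv('csv/trip_data_*.csv', skipinitialspace=True)

# Histogram of trip distance at 1 km resolution: round each distance, then
# count how many trips fall into each bin. value_counts is a groupby-style
# reduction, which is where Pandas can release the GIL.
trip_hist = df.trip_distance.round().value_counts().compute()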

Note that this is especially fast because Pandas now releases the GIL on value_counts operations (all groupby operations really). This takes around 20 seconds on my machine on the last release of Pandas vs 5 seconds on the development branch. Moving from CSV files to Castra moved the bottleneck of our computation from disk I/O to processing speed, allowing improvements like multi-core processing to really shine.

We plot the result of the above computation with Bokeh below. Note the spike around 20km. This is around the distance from Midtown Manhattan to LaGuardia airport.

I’ve shown Castra used above with dask.dataframe but it works fine with straight Pandas too.

Credit

Castra was started by myself and Valentin Haenel (current maintainer of bloscpack and bcolz) during an evening sprint following PyData Berlin. Several bugfixes and refactors were followed up by Phil Cloud and Jim Crist.

Castra is roughly 500 lines long. It’s a tiny project which is both good and bad. It’s being used experimentally and there are some heavy disclaimers in the README. This post is not intended as a sales pitch for Castra, but rather to provide a vocabulary to talk about efficient tabular storage.

Response to twitter traffic: again, this blogpost is not saying “use Castra!” Rather it says “don’t use CSVs!” and consider more efficient storage for interactive use. Other, more mature solutions exist or could be built. Castra was an experiment of “how fast can we get without doing too much work.”

Categories: FLOSS Project Planets

Justin Mason: Links for 2015-08-27

Planet Apache - Thu, 2015-08-27 19:58
  • Mining High-Speed Data Streams: The Hoeffding Tree Algorithm

    This paper proposes a decision tree learner for data streams, the Hoeffding Tree algorithm, which comes with the guarantee that the learned decision tree is asymptotically nearly identical to that of a non-incremental learner using infinitely many examples. This work constitutes a significant step in developing methodology suitable for modern ‘big data’ challenges and has initiated a lot of follow-up research. The Hoeffding Tree algorithm has been covered in various textbooks and is available in several public domain tools, including the WEKA Data Mining platform.

    (tags: hoeffding-tree algorithms data-structures streaming streams cep decision-trees ml learning papers)

  • Chinese scammers are now using Stingray tech to SMS-phish

    A Stingray-style false GSM base station, hidden in a backpack; presumably they detect numbers in the vicinity, and SMS-spam those numbers with phishing messages. Reportedly the scammers used this trick in “Guangzhou, Zhuhai, Shenzhen, Changsha, Wuhan, Zhengzhou and other densely populated cities”. Dodgy machine translation:

    March 26, Zhengzhou police telecommunications fraud cases together, for the first time seized a small backpack can hide pseudo station equipment, and arrested two suspects. Yesterday, the police informed of this case, to remind the general public to pay attention to prevention. “I am the landlord, I changed number, please rent my wife hit the bank card, card number ×××, username ××.” Recently, Jiefang Road, Zhengzhou City Public Security Bureau police station received a number of cases for investigation brigade area of the masses police said, frequently received similar phone scam messages. Alarm, the police investigators to determine: the suspect may be in the vicinity of twenty-seven square, large-scale use of mobile pseudo-base release fraudulent information. [...] Yesterday afternoon, the Jiefang Road police station, the reporter saw the portable pseudo-base is made up of two batteries, a set-top box the size of the antenna box and a chassis, as well as a pocket computer composed together at most 5 kg. (via t byfield and Danny O’Brien)

    (tags: via:mala via:tbyfield privacy scams phishing sms gsm stingray base-stations mobile china)

Categories: FLOSS Project Planets

Norbert Preining: Kobo Japanese Dictionary Enhancer 1.1

Planet Debian - Thu, 2015-08-27 18:39

Lots of releases in quick succession – the new Kobo Japanese Dictionary Enhancer brings multi-dictionary support and merged translation support. Using the Wadoku project’s edict2 database we can now add also German translations.

Looking at the numbers, we have now 326064 translated entries when using the English edict2, and 368943 translated entries when using the German Wadoku edict version. And more than that, as an extra feature it is now also possible to have merged translations, so to have both German and English translations added.

Please head over to the main page of the project for details and download instructions. If you need my help in creating the updated dictionary, please feel free to contact me.

Enjoy.

Categories: FLOSS Project Planets

Ankit Wagadre (ankitw)

Planet KDE - Thu, 2015-08-27 18:23
DataPicker For LabPlot : GSoC Project 2015

My GSoC project was to develop Datapicker for LabPlot, a tool which converts an input graph, in the form of an image, into numbers. I couldn't post my last blog properly, so this is my first blog to the community.

Datapicker supports several graph types:
  • Cartesian (x, y)
  • Polar (r, Deg/Rad)
  • Logarithmic (x, ln(y))/ (ln(x), y)
Using the dock-widget, the user can define the local/logical coordinates of axis/reference points.
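
To give a feel for what the reference points do, here is a generic sketch of the idea for the simple axis-aligned Cartesian case (illustration only, not LabPlot's actual code):

    def pixel_to_data(px, py, ref):
        # ref holds two reference points as (pixel_x, pixel_y, data_x, data_y);
        # each axis is mapped by linear interpolation between them.
        (px1, py1, x1, y1), (px2, py2, x2, y2) = ref
        x = x1 + (px - px1) * (x2 - x1) / (px2 - px1)
        y = y1 + (py - py1) * (y2 - y1) / (py2 - py1)
        return x, y

    # A point picked at pixel (320, 240), with axes calibrated by reference
    # points at pixel (100, 400) -> data (0, 0) and pixel (500, 80) -> data (10, 5).
    print(pixel_to_data(320, 240, [(100, 400, 0.0, 0.0), (500, 80, 10.0, 5.0)]))  # (5.5, 2.5)
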
Datapicker provides all the zooming options that the worksheet provides, and a few new options have been added, like a zoom-window that creates a small magnified window below the cursor.

Datapicker supports multiple curves for the same graph. Each curve can have its own type of x & y errors (no error, symmetric, asymmetric), its own datasheet, and a symbol style. The appearance of the symbols used to mark points can be changed via the dock-widget. New options have been added to support moving points on the image with the arrow keys.



The segment selection mode allows the user to select automatically traced segments of a curve. Tracing is done by processing the image on the basis of a range of colour attributes, which can be modified in the dock-widget to get better results. The user can also use this mode just to remove the background and grid lines to get a clearer image view.




Datapicker supports all types of errors: no error, symmetric error, and asymmetric error. Based on the type of errors, each symbol generates an error bar around it. Error bars are movable objects that allow the user to change their position and appearance as needed.

Categories: FLOSS Project Planets

Bryan Pendleton: Writing a byte and then reading it back

Planet Apache - Thu, 2015-08-27 18:17

The always-wonderful Raymond Chen posted a completely-brilliant description of an old bug:

There was a bug that said, "The splash screen for this MS-DOS game is all corrupted if you run it from a compressed volume."

What was the real problem? Well, Chen explains:

  • The Windows 95 I/O system assumed that if it wrote a byte, then it could read it back: The optimization above relied on the property that writing a byte followed by reading the byte produces the byte originally written. But this doesn't work for video memory because of the weird way video memory works. The result was that when the decompression engine tried to read what it thought was the uncompressed data, it was actually asking the video controller to do some strange operations. The result was corrupted decompressed data, and corrupted video data.
  • What is the purpose of the bmPlanes member of the BITMAP structure?: If you have 16 colors, then you need four bits per pixel. You would think that the encoding would be to have each byte of video memory encode two pixels, one in the bottom four bits and one in the top four. But for technical reasons, the structure of video memory was not that simple.
  • Machine Organization I: In EGA and VGA, the Graphics Controller (GC) manages transfers of data among the video memory, CPU registers and the latches.
  • ID3D12Resource::Map method: your app must honor all restrictions that are associated with such memory.

Wow.

There are more things in heaven and earth, Horatio, Than are dreamt of in your philosophy

Categories: FLOSS Project Planets

Kubuntu Wily Beta 1

Planet KDE - Thu, 2015-08-27 17:21

The first Beta of Wily (to become 15.10) has now been released!

The Beta-1 images can be downloaded from: http://cdimage.ubuntu.com/kubuntu/releases/wily/beta-1/

More information on Kubuntu Beta-1 can be found here: https://wiki.kubuntu.org/WilyWerewolf/Beta1/Kubuntu

Categories: FLOSS Project Planets

Ben Hutchings: Truncating a string in C

Planet Debian - Thu, 2015-08-27 16:10

This version uses the proper APIs to work with the locale's multibyte encoding (with single-byte encodings being a trivial case of multibyte). It will fail if it encounters an invalid byte sequence (e.g. byte > 127 in the "C" locale), though it could be changed to treat each rejected byte as a single character.

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

int main(int argc, char **argv)
{
    size_t n = 12, totlen = 0, maxlen, chlen;

    setlocale(LC_ALL, "");
    if (argc != 2)
        return EXIT_FAILURE;
    maxlen = strlen(argv[1]);
    while (n--) {
        chlen = mbrlen(argv[1] + totlen, maxlen - totlen, NULL);
        if (chlen > MB_CUR_MAX)
            return EXIT_FAILURE;
        totlen += chlen;
    }
    printf("%.*s\n", (int)totlen, argv[1]);
    return 0;
}

Categories: FLOSS Project Planets