FLOSS Project Planets

Europython: Deadline for Financial Assistance applications

Planet Python - Mon, 2014-05-12 10:14

This is a gentle reminder that the deadline for submitting applications for Financial Assistance is 15.05.2014 (23:59:59, German time).

Categories: FLOSS Project Planets

Another hacknight in Gothenburg

Planet KDE - Mon, 2014-05-12 08:00

Välkomna på foss-gbg hackafton! (Welcome to the foss-gbg hack night!)

On 28 May at 17:00 we will meet to learn about GNOME.

Andreas Nilsson and Mattias Bengtsson will introduce the project and its tooling, and will help you get started.

Around eight o'clock we will move on and socialize over a beer.

Pelagicore is providing the venue and will offer light refreshments during the event.

There is no entrance fee, but seats are limited. Register on Eventbrite.

Categories: FLOSS Project Planets

liberty-eiffel @ Savannah: Liberty Eiffel first release: 2013.11

GNU Planet! - Mon, 2014-05-12 08:00

Liberty Eiffel is a free Eiffel compiler started from the SmartEiffel code base. Its goal is to retain SmartEiffel's rigour, but not its rigidity.

Eiffel is an advanced object-oriented programming language that emphasizes the design and construction of high-quality and reusable software.

Liberty Eiffel is maintained by a free and open community.

We are happy to announce the very first version, 2013.11, code-named "Adler" (after Charles Adler, Jr. — an American engineer).

This release is a signal to the FLOSS community at large that Eiffel is still alive and kicking. We need volunteers :-)

Feel free to visit our website: http://www.gnu.org/software/liberty-eiffel/

Categories: FLOSS Project Planets

Kubuntu Utopic Kickoff Meeting

Planet KDE - Mon, 2014-05-12 07:43
KDE Project:

A new cycle and lots of interesting possibilities! Will KF5 and Plasma 5 reign supreme? All are welcome at the Kubuntu kickoff meeting this European evening and American afternoon at 19:00 UTC.

Install Mumble, get a headset with headphones and microphone, adjust the volumes to sane levels, and join us on the Mumble server kyofel.dyndns.org.
Chat in #kde-devel.

Add items to discuss at the meeting notes and review the TODO items on Trello.

See you there!

Categories: FLOSS Project Planets

Michael McCandless: Choosing a fast unique identifier (UUID) for Lucene

Planet Apache - Mon, 2014-05-12 07:30
Most search applications using Apache Lucene assign a unique id, or primary key, to each indexed document. While Lucene itself does not require this (it couldn't care less!), the application usually needs it to later replace, delete or retrieve that one document by its external id. Most servers built on top of Lucene, such as Elasticsearch and Solr, require a unique id and can auto-generate one if you do not provide it.

Sometimes your id values are already pre-defined, for example if an external database or content management system assigned one, or if you must use a URI, but if you are free to assign your own ids then what works best for Lucene?

One obvious choice is Java's UUID class, which generates version 4 universally unique identifiers, but it turns out this is the worst choice for performance: it is 4X slower than the fastest. To understand why requires some understanding of how Lucene finds terms.

BlockTree terms dictionary

The purpose of the terms dictionary is to store all unique terms seen during indexing, and map each term to its metadata (docFreq, totalTermFreq, etc.), as well as the postings (documents, offsets, positions and payloads). When a term is requested, the terms dictionary must locate it in the on-disk index and return its metadata.

The default codec uses the BlockTree terms dictionary, which stores all terms for each field in sorted binary order, and assigns the terms into blocks sharing a common prefix. Each block contains between 25 and 48 terms by default. It uses an in-memory prefix-trie index structure (an FST) to quickly map each prefix to the corresponding on-disk block, and on lookup it first checks the index based on the requested term's prefix, and then seeks to the appropriate on-disk block and scans to find the term.

In certain cases, when the terms in a segment have a predictable pattern, the terms index can know that the requested term cannot exist on-disk. This fast-match test can be a sizable performance gain especially when the index is cold (the pages are not cached by the OS's IO cache) since it avoids a costly disk-seek. As Lucene is segment-based, a single id lookup must visit each segment until it finds a match, so quickly ruling out one or more segments can be a big win. It is also vital to keep your segment counts as low as possible!

Given this, fully random ids (like UUID V4) should perform worst, because they defeat the terms index fast-match test and require a disk seek for every segment. Ids with a predictable per-segment pattern, such as sequentially assigned values, or a timestamp, should perform best as they will maximize the gains from the terms index fast-match test.
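As a rough illustration (in Python rather than the article's Java, purely for intuition), sequentially assigned ids share a long common prefix while version 4 UUIDs share almost none, and that shared prefix is exactly what the fast-match test can key on:

```python
import uuid

def common_prefix_len(ids):
    """Length of the prefix shared by every id in the list."""
    first, last = min(ids), max(ids)
    n = 0
    while n < len(first) and n < len(last) and first[n] == last[n]:
        n += 1
    return n

# Zero-padded sequential ids: long shared prefix, cheap to rule out.
sequential = ['%016d' % i for i in range(100)]

# Version 4 UUIDs: fully random, essentially no shared prefix.
random_ids = [uuid.uuid4().hex for _ in range(100)]

print(common_prefix_len(sequential))  # 14 for ids 0..99
print(common_prefix_len(random_ids))  # almost certainly 0 or 1
```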

Testing Performance

I created a simple performance tester to verify this; the full source code is here. The test first indexes 100 million ids into an index with 7/7/8 segment structure (7 big segments, 7 medium segments, 8 small segments), and then searches for a random subset of 2 million of the IDs, recording the best time of 5 runs. I used Java 1.7.0_55, on Ubuntu 14.04, with a 3.5 GHz Ivy Bridge Core i7 3770K.

Since Lucene's terms are now fully binary as of 4.0, the most compact way to store any value is in binary form where all 256 values of every byte are used. A 128-bit id value then requires 16 bytes.
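In Python terms (the article is about Java and Lucene, but the encoding arithmetic is identical), the same 128-bit UUID costs 16 bytes in binary form versus 32 in its base-16 text form:

```python
import uuid

u = uuid.uuid4()

binary = u.bytes   # full 128 bits packed into raw bytes
hex_text = u.hex   # the same value spelled out as base-16 text

assert len(binary) == 16    # 128 bits / 8 bits per byte
assert len(hex_text) == 32  # each byte costs two hex characters
```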

I tested a number of identifier sources; for the UUIDs and Flake IDs I also tested a binary encoding in addition to their standard (base 16 or 36) encoding. Note that I only tested lookup speed using one thread, but the results should scale linearly (on sufficiently concurrent hardware) as you add threads. Zero-padded sequential ids, encoded in binary, are fastest, quite a bit faster than non-zero-padded sequential ids. UUID V4 (using Java's UUID.randomUUID()) is ~4X slower.

But for most applications, sequential ids are not practical. The 2nd fastest is UUID V1, encoded in binary. I was surprised this is so much faster than Flake IDs since Flake IDs use the same raw sources of information (time, node id, sequence) but shuffle the bits differently to preserve total ordering. I suspect the problem is the number of common leading digits that must be traversed in a Flake ID before you get to digits that differ across documents, since the high order bits of the 64-bit timestamp come first, whereas UUID V1 places the low order bits of the 64-bit timestamp first. Perhaps the terms index should optimize the case when all terms in one field share a common prefix.
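Python's uuid module follows the same RFC 4122 layout, so the bit-ordering point is easy to check: the first field of a version 1 UUID is time_low, the low 32 bits of the 60-bit timestamp, so ids generated close together already differ in their leading bytes:

```python
import uuid

a = uuid.uuid1()
b = uuid.uuid1()

# The leading field of a version 1 UUID is time_low: the *low* 32 bits
# of the 60-bit timestamp, so consecutive ids diverge immediately.
assert a.time_low == a.time & 0xFFFFFFFF

# Two ids generated back to back are still distinct.
assert a != b
```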

I also separately tested varying the base from 10, 16, 36, 64, 256 and in general for the non-random ids, higher bases are faster. I was pleasantly surprised by this because I expected a base matching the BlockTree block size (25 to 48) would be best.
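A small sketch of why higher bases help: the same numeric id simply needs fewer digits, so there are fewer bytes to compare and shorter shared prefixes to scan (the digit-counting helper here is my own illustration, not the test code):

```python
def num_digits(value, base):
    """How many digits a non-negative integer needs in the given base."""
    digits = 1
    while value >= base:
        value //= base
        digits += 1
    return digits

DOC_ID = 10 ** 9  # a billionth document id

for base in (10, 16, 36, 64, 256):
    print(base, num_digits(DOC_ID, base))
# base 10 -> 10 digits, base 16 -> 8, base 36 -> 6, base 64 -> 5, base 256 -> 4
```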

There are some important caveats to this test (patches welcome)! A real application would obviously be doing much more work than simply looking up ids, and the results may be different as hotspot must compile much more active code. The index is fully hot in my test (plenty of RAM to hold the entire index); for a cold index I would expect the results to be even more stark since avoiding a disk-seek becomes so much more important. In a real application, the ids using timestamps would be more spread apart in time; I could "simulate" this myself by faking the timestamps over a wider range. Perhaps this would close the gap between UUID V1 and Flake IDs? I used only one thread during indexing, but a real application with multiple indexing threads would spread out the ids across multiple segments at once.

I used Lucene's default TieredMergePolicy, but it is possible a smarter merge policy that favored merging segments whose ids were more "similar" might give better results. The test does not do any deletes/updates, which would require more work during lookup since a given id may be in more than one segment if it had been updated (just deleted in all but one of them).

Finally, I used Lucene's default Codec, but we have nice postings formats optimized for primary-key lookups when you are willing to trade RAM for faster lookups, such as this Google summer-of-code project from last year and MemoryPostingsFormat. Likely these would provide sizable performance gains!
Categories: FLOSS Project Planets

Ruslan Spivak: How to Pretty Print XML with lxml

Planet Python - Mon, 2014-05-12 06:26

If you use lxml library and need to pretty print XML, here is a snippet that works even if you have existing indentation in your XML.

#!/usr/bin/env python
import sys
import StringIO

from lxml import etree


def main():
    xml_text = sys.stdin.read()
    parser = etree.XMLParser(remove_blank_text=True)
    file_obj = StringIO.StringIO(xml_text)
    tree = etree.parse(file_obj, parser)
    print(etree.tostring(tree, pretty_print=True))


if __name__ == '__main__':
    main()

Save the above code to a file ppxml and make it executable.
After that you could use it on the command line like this, for example:

$ cat << EOF | ppxml
> <root>
> <child1/>
> <child2/>
> <child3/>
> </root>
> EOF
<root>
  <child1/>
  <child2/>
  <child3/>
</root>
Categories: FLOSS Project Planets

Russell Coker: BTRFS vs LVM

Planet Debian - Mon, 2014-05-12 03:37

For some years LVM (the Linux Logical Volume Manager) has been used in most Linux systems. LVM allows one or more storage devices (disks, partitions, or RAID sets) to be assigned to a Volume Group (VG), parts of which can then be allocated to Logical Volumes (LVs), which behave like any other block device; a VG can have many LVs.

One of the significant features of LVM is that you can create snapshots of an LV. One common use is to keep multiple snapshots of an LV for online backups; another is to make a snapshot of a filesystem before backing it up to external storage, since the snapshot is unchanging and there is no problem of inconsistencies due to backing up a changing data set. When you create a snapshot it will have the same filesystem label and UUID, so you should always mount an LVM device by its name (which will be /dev/$VGNAME/$LVNAME).

One of the problems with the ReiserFS filesystem was that there was no way to know whether a block of storage was a data block, a metadata block, or unused. A reiserfsck --rebuild-tree would find any blocks that appeared to be metadata and treat them as such, deleted files would reappear and file contents which matched metadata (such as a file containing an image of a ReiserFS filesystem) would be treated as metadata. One of the impacts of this was that a hostile user could create a file which would create a SUID root program if the sysadmin ran a --rebuild-tree operation.

BTRFS solves the problem of filesystem images by using a filesystem specific UUID in every metadata block. One impact of this is that if you want to duplicate a BTRFS filesystem image and use both copies on the same system you need to regenerate all the checksums of metadata blocks with the new UUID. The way BTRFS works is that filesystems are identified by UUID so having multiple block devices with the same UUID causes the kernel to get confused. Making an LVM snapshot really isn’t a good idea in this situation. It’s possible to change BTRFS kernel code to avoid some of the problems of duplicate block devices and it’s most likely that something will be done about it in future. But it still seems like a bad idea to use LVM with BTRFS.

The most common use of LVM is to divide the storage of a single disk or RAID array for the use of multiple filesystems. Each filesystem can be enlarged (through extending the LV and making the filesystem use the space) and snapshots can be taken. With BTRFS you can use subvolumes for the snapshots and the best use of BTRFS (IMHO) is to give it all the storage that’s available so there is no need to enlarge a filesystem in typical use. BTRFS supports quotas on subvolumes which aren’t really usable yet but in the future will remove the need to create multiple filesystems to control disk space use. An important but less common use of LVM is to migrate a live filesystem to a new disk or RAID array, but this can be done by BTRFS too by adding a new partition or disk to a filesystem and then removing the old one.

It doesn’t seem that LVM offers any benefits when you use BTRFS. When I first experimented with BTRFS I used LVM but I didn’t find any benefit in using LVM and it was only a matter of luck that I didn’t use a snapshot and break things.

Snapshots of BTRFS Filesystems

One reason for creating a snapshot of a filesystem (as opposed to a snapshot of a subvolume) is for making backups of virtual machines without support from inside the virtual machine (EG running an old RHEL5 virtual machine that doesn’t have the BTRFS utilities). Another is for running training on virtual servers where you want to create one copy of the filesystem for each student. To solve both these problems I am currently using files in a BTRFS subvolume. The BTRFS kernel code won’t touch those files unless I create a loop device so I can only create a loop device for one file at a time.

One tip for doing this: don't use names such as /xenstore/vm1 for the files containing filesystem images, use names such as /xenstore/vm1-root. If you try to create a virtual machine named "vm1" then Xen will look for a file named "vm1" in the current directory before looking in /etc/xen, and will try to use a filesystem image as a Xen configuration file. It would be nice if there was a path for Xen configuration files that either didn't include the current directory or included it at the end of the list. Including the current directory in the path is a DOS mistake that should have gone away a long time ago.

Psychology and Block Devices

ZFS has a similar design to BTRFS in many ways and has some similar issues. But one benefit for ZFS is that it manages block devices in a “zpool”, first you create a zpool with the block devices and after that you can create ZFS filesystems or “ZVOL” block devices. I think that most sysadmins would regard a zpool as something similar to LVM (which may or may not be correct depending on how you look at it) and immediately rule out the possibility of running a zpool on LVM.

BTRFS looks like a regular Unix filesystem in many ways: you can have a single block device that you mount with the usual mount command. The fact that BTRFS can support multiple block devices in a RAID configuration isn't so obvious, and the fact that it implements equivalents to most LVM functionality probably isn't known to most people when they start using it. The most obvious way to start using BTRFS is to use it just like an Ext3/4 filesystem on an LV and to use LVM snapshots to back up data; this is made even more likely by the fact that there is a program to convert an ext2/3/4 filesystem to BTRFS. This seems likely to cause data loss.

Related posts:

  1. Starting with BTRFS Based on my investigation of RAID reliability [1] I have...
  2. BTRFS Status March 2014 I’m currently using BTRFS on most systems that I can...
  3. BTRFS and ZFS as Layering Violations LWN has an interesting article comparing recent developments in the...
Categories: FLOSS Project Planets

Web Omelette: Sending HTML Emails with Webform in Drupal 7

Planet Drupal - Mon, 2014-05-12 03:12

Have you ever wondered how you can include HTML markup in the emails you send with Webform? Out of the box, you cannot. But I am going to show you a simple way to achieve this using the Mime Mail module and some simple theming. Additionally, I will show you how to control which webforms should send HTML emails and which should not.

First though, make sure you install and enable the Mime Mail and Mail System modules (the latter is a dependency of the former). With Drush, all you have to do is use this command:

drush en mimemail -y

It will take care of all you need to do. If you commit the module to your repo, don't forget that the Mail System module also gets downloaded, so make sure you include it as well.

Next, edit your theme's template.php file and paste this block of code (explained after):

function your_theme_webform_mail_headers($variables) {
  $headers = array(
    'Content-Type' => 'text/html; charset=UTF-8; format=flowed; delsp=yes',
    'X-Mailer' => 'Drupal Webform (PHP/' . phpversion() . ')',
  );
  return $headers;
}

Make sure you replace your_theme with the name of your theme. So what happens here? We override theme_webform_mail_headers() declared by the Webform module. We do this in order to add a content type to the mail headers and set it to HTML. And that's pretty much it.

If you now clear your caches and test a webform, you'll see that you can add anchor tags and other basic HTML tags.

One problem you might run into though is that all your webforms are now sending emails in HTML format - a result only partially desired. You'll notice that the default email that you send no longer provides any spacing and all the text gets put on one line - as HTML in fact.

So what you can do is make a selection of webforms for which you want HTML emails. A handy way of doing this is by adding a field to your webform content type that will be used to switch HTML emails on/off for a given node. To illustrate this, let's say we added a new field to the relevant content type called HTML Emails (with the machine name field_html_email). This field is a boolean type (a single checkbox, basically) with the value 1 for on and 0 for off.

It follows to adapt the theme override above and replace it with something like this:

function your_theme_webform_mail_headers($variables) {
  $headers = array(
    'X-Mailer' => 'Drupal Webform (PHP/' . phpversion() . ')',
  );

  // Get the HTML Email field.
  $html_email_field = field_get_items('node', $variables['node'], 'field_html_email');

  // Check if this webform node needs to send HTML emails.
  $html = FALSE;
  if (!empty($html_email_field)) {
    $html = $html_email_field[0]['value'] == 1;
  }

  if ($html === TRUE) {
    $headers['Content-Type'] = 'text/html; charset=UTF-8; format=flowed; delsp=yes';
  }

  return $headers;
}

If you consult the documentation for this theme function, you'll know that the $variables parameter contains also the node object which uses Webform to send the email. So we basically check for the value of our field and if it is 1, we add the HTML information to the mail headers. Otherwise, we return the $headers array containing the value it does by default (essentially making no changes).

You can now clear the caches and test it out. Edit a node of the respective content type and check the box. You'll see that it now sends HTML emails. However, if you uncheck the box, it will fallback to the default format that comes with the Webform module.

Hope this helps.

In Theming | Drupal
Categories: FLOSS Project Planets

Twisted Matrix Labs: Twisted 14.0.0 Released

Planet Python - Mon, 2014-05-12 01:05
On behalf of Twisted Matrix Laboratories, I am honoured to announce the release of Twisted 14.0! It has been a long road to get here, but we’ve done it!

The highlights of this release are:
  • Twisted Positioning (`twisted.positioning`) makes its entry into Twisted! It comes ready to talk with common GPS devices, and will supersede `twisted.protocols.gps`.
  • A wealth of SSL/TLS improvements, including ECDHE support, TLS Service Identity (with service_identity on PyPI), a stronger default set of ciphers, and strengthening against attacks such as CRIME. A Twisted Web server with pyOpenSSL 0.14 is capable of getting an A in Qualys SSL Labs tests out of the box, and A+ with small application modifications. Twisted Agent can also now do HTTPS hostname verification.
  • Python 3 improvements, including the ability for `pip install` to install all ported modules.
  • Twisted Pair’s TUN/TAP support has been overhauled, with documentation and full test coverage.
  • Significant documentation improvements, including more API documentation for Twisted Mail & Twisted Names, narrative documentation for Twisted Names, and a migration to Sphinx for building Twisted narrative docs.
  • Support is dropped for pyOpenSSL older than 0.10 and Windows XP.
For more information, check the NEWS file.

You can find the downloads at https://pypi.python.org/pypi/Twisted (or alternatively http://twistedmatrix.com/trac/wiki/Downloads).

Many thanks to everyone who had a part in this release - we’ve got some big things landed, and if it weren’t for the support of developers (both core and occasional), the Twisted Software Foundation, or people giving feedback and filing bugs, we’d have never got it done.

Twisted Regards,
Categories: FLOSS Project Planets

Ryan Szrama: Beyond Wombats

Planet Drupal - Sun, 2014-05-11 22:35

I accidentally started publishing open source software in 2006, the first integration of the QuickBooks Web Connector with anything. This was pre-Ubercart when I was just cutting my teeth on PHP / MySQL development at Prima Supply, and I thought it would be fun to claim the code was written by wombats while I just published it online. I decided to own the silliness and start blogging on bywombats.com using Drupal 4.7 at the time - and immediately picked up a freelance contract doing QuickBooks integration work.

Topics: DrupalWork
Categories: FLOSS Project Planets

Benjamin Mako Hill: Google Has Most of My Email Because It Has All of Yours

Planet Debian - Sun, 2014-05-11 22:11

Republished by Slate. Translations available in French (Français), Spanish (Español), Chinese (中文)

For almost 15 years, I have run my own email server which I use for all of my non-work correspondence. I do so to keep autonomy, control, and privacy over my email and so that no big company has copies of all of my personal email.

A few years ago, I was surprised to find out that my friend Peter Eckersley — a very privacy conscious person who is Technology Projects Director at the EFF — used Gmail. I asked him why he would willingly give Google copies of all his email. Peter pointed out that if all of your friends use Gmail, Google has your email anyway. Any time I email somebody who uses Gmail — and anytime they email me — Google has that email.

Since our conversation, I have often wondered just how much of my email Google really has. This weekend, I wrote a small program to go through all the email I have kept in my personal inbox since April 2004 (when Gmail was started) to find out.

One challenge with answering the question is that many people, like Peter, use Gmail to read, compose, and send email but they configure Gmail to send email from a non-gmail.com “From” address. To catch these, my program looks through each message’s headers that record which computers handled the message on its way to my server and to pick out messages that have traveled through google.com, gmail.com, or googlemail.com. Although I usually filter them, my personal mailbox contains emails sent through a number of mailing lists. Since these mailing lists often “hide” the true provenance of a message, I exclude all messages that are marked as coming from lists using the (usually invisible) “Precedence” header.

The following graph shows the numbers of emails in my personal inbox each week in red and the subset from Google in blue. Because the number of emails I receive week-to-week tends to vary quite a bit, I’ve included a LOESS “smoother” which shows a moving average over several weeks.

From eyeballing the graph, the answer seems to be that, although it varies, about a third of the email in my inbox comes from Google!

Keep in mind that this is all of my personal email and includes automatic and computer generated mail from banks and retailers, etc. Although it is true that Google doesn’t have these messages, it suggests that the proportion of my truly “personal” email that comes via Google is probably much higher.

I would also like to know how much of the email I send goes to Google. I can do this by looking at emails in my inbox that I have replied to. This works if I am willing to assume that if I reply to an email sent from Google, it ends up back at Google. In some ways, doing this addresses the problem with the emails from retailers and banks since I am very unlikely to reply to those emails. In this sense, it also reflects a measure of more truly personal email.

I’ve broken down the proportions of emails I received that come from Google in the graph below for all email (top) and for emails I have replied to (bottom). In the graphs, the size of the dots represents the total number of emails counted to make that proportion. Once again, I’ve included the LOESS moving average.

The answer is surprisingly large. Despite the fact that I spend hundreds of dollars a year and hours of work to host my own email server, Google has about half of my personal email! Last year, Google delivered 57% of the emails in my inbox that I replied to. They have delivered more than a third of all the email I’ve replied to every year since 2006 and more than half since 2010. On the upside, there is some indication that the proportion is going down. So far this year, only 51% of the emails I’ve replied to arrived from Google.

The numbers are higher than I imagined and reflect somewhat depressing news. They show how it’s complicated to think about privacy and autonomy for communication between parties. I’m not sure what to do except encourage others to consider, in the wake of the Snowden revelations and everything else, whether you really want Google to have all your email. And half of mine.

If you want to run the analysis on your own, you’re welcome to the Python and R code I used to produce the numbers and graphs.

Categories: FLOSS Project Planets

Eko S. Wibowo: Using Jetbrains PyCharm Community Edition for Your Python IDE

Planet Python - Sun, 2014-05-11 20:00

Chances are great that you are going to love this cross-platform Python IDE.

As with any open source software development platform, Python was blessed (or cursed?) with a myriad of ways to develop programs for it: with what IDE or text editor should you develop your next killer Python application? A free and open policy keeps energetic programmers awake in this awesome Python world. But for a newcomer, choosing the right IDE can be a daunting task. Add the phenomenon of epic wars between IDE/editor communities, and you may find choosing a Python IDE not as trivial as it is in .NET development: any Visual Studio contender, anyone?

With this post I am starting a new series on this blog that reviews the Python IDEs at my disposal. Although it is not easy to stay away from subjectivity about a particular IDE, I am trying to do so by settling on several important features that must exist in a Python IDE: an IDE's overall value increases if it possesses these features.

This article also serves as an important article in the Python for Beginners series: I believe coding Python in an IDE brings many advantages for beginners out there. Great. Let's start the review with a cross-platform IDE: PyCharm from JetBrains.

But, wait a minute, what is an IDE?

IDE stands for Integrated Development Environment: software designed specifically to make it easier for programmers to develop computer programs. The term INTEGRATED literally means that this is the only piece of software you need for your programming activities; no external application is required. Or, to be exact, even when an IDE does end up using external applications, this is handled automatically, with no manual configuration steps needed on your part. It just works!

An IDE will undoubtedly boost programmer productivity (although not necessarily code quality). There are voices on the internet that oppose the use of an IDE: what they want is a vanilla working environment, meaning a general-purpose text editor (such as Vim, Emacs, Sublime Text, TextMate, etc.) complemented with the programming language's built-in (command line) tools. As I hold the principle that there is really nothing truly right or wrong in this world, only preferences, I cannot say that this choice is wrong. There are people who love to work in an IDE and others who love to work with a text editor. Me? To be honest, I often switch sides. I'll explain in greater detail in a subsequent Python IDE article.

But my prejudice is that, as a newbie who has just come to Python programming, or to programming in general, you will definitely love an IDE more than a plain text editor. Let me know if my prejudice is wrong!

Features to Be Expected in an IDE

Actually, if you have only been a programmer for these recent years, you won't be surprised by these features. You may even consider them, well, standard features. The truth is, they are not: programmers have been crafting these "modern" IDE features for years. And now we have them!

What are those? Here they are:

  1. Project files navigation
    Present your application source code files and its related resource in an easily navigable view
  2. Code coloring (including syntax coloring and error hinting) and proper indentation.
    It will allows you to have a quick glimpse at the code and able to identify instantly whether an identifier is a reserved word, your own identifier or an unknown identifier.
  3. Go to definition.
    Ever have to read a new code coming from an open source project? You will greatly benefit yourself from the ease of use IDE feature that let you CTRL+Click (or other shortcut) an identifier to go to its definition (if available). Add that to the fact that an identifier may be defined from external file three level nested down the current active directory (or may even defined in external directory hierarchy). Phew, this feature alone will make me stick to choose an IDE when doing coding activities. 
  4. Smart code completion (or intellisense in Microsoft term).
    You start with a blank *.py file... so, what's next? Try to type CTRL+SPACE... voila! You will get all the available identifiers known to the current context. Or, by just typing two alphabet for a given 16 characters long identifier, you will presented with all matching identifiers starting with that two characters. Simply pick the one that you like and press ENTER. The IDE will automatically completes your long identifier for you.
    But I believe the most important feature is the ability in displaying all methods and properties for a given object when you hit the dot character. In most cases you will stay away from API documentation page and proactively inspect this dot character code completion list.
  5. Walk through debugging.
    In my early programming education, I always love to debug the application that being created one step at a time. It teaches me how the code get executed in a visually pleasing interactive experience. Now, in my professional career, this feature is extremely important when you have to find a needle in haystack that create strange result in your application.
  6. Refactoring.
    Refactoring is the term popularized by Martin Fowler, which is a continuous process taken to restructure current code into better quality code. This process maybe the most complex feature needed in a software development process. Given the nature of code that will be scattered into many files, the task of restructuring code can prove to be daunting task if done incorrectly. Imagine if you have to rename an identifier. Manually renaming them will be too tiresome to accomplished. With a good IDE, this process will be so much fun to do.
  7. Integrated console.
    Your applications will eventually be executed by the operating system. Being able to run them and inspect their console output from within the IDE is greatly important when you have to deal with multiple applications.

Those seven features must exist in an IDE (or a text editor) for a comfortable programming experience. Take one of them away, and I believe you will soon be looking for another IDE.

Great. Now let's move on to our PyCharm Python IDE Review!

PyCharm : Brief Introduction

If you are a Java programmer, chances are you have already used JetBrains' Java IDE, IntelliJ IDEA. IDEA users will feel right at home when they use PyCharm for Python development; PyCharm appears to share its code base with IDEA itself. PyCharm (like IDEA) is built with Java technology, thereby making it a cross-platform application.

Thanks to the folks at JetBrains, PyCharm comes in two editions: the Community Edition (free to use under any conditions whatsoever) and the Professional Edition (paid, with a free license for selected open source projects). Some of you may be thinking that the Community Edition is far inferior to the commercial edition. You will be amazed that it actually is not. Or, to be exact, your daily coding activities may not require the professional features that are only in the Professional Edition. Click here for a full comparison of both versions.

Even though PyCharm is a Java application, you don't have to download and install a separate JRE for it: it already ships with a JRE ready to use. Just download PyCharm from here (either the Community or Professional Edition) and install it. In a short moment, you will have yourself a cross-platform, versatile, commercial-quality Python IDE: at no cost. Isn't that great?

Case Study: The PythonIDE Class

For the subsequent subtopics, I am going to demonstrate PyCharm's features using the following class, stored in a module named pythonide.py:

__author__ = 'Eko Wibowo'


class PythonIDE(object):
    """
    A class to hold Python IDE information
    """

    def __init__(self):
        """
        Initialize an IDE with default data
        """
        self.name = 'Untitled'
        self.publisher = 'Unknown'
        self.stars = 0

    def __repr__(self):
        """
        Return a string representation of this object
        """
        return 'Python IDE named %s, created by %s, with rating at %s stars' % (self.name, self.publisher, self.stars)

    def the_best(self):
        """
        Mark this IDE as the best Python IDE there is!
        """
        self.stars = 5

The above is a simple Python class representing a Python IDE's properties and methods. We will also need a main module, main.py, to use the class:

import pythonide


if __name__ == '__main__':
    pycharm = pythonide.PythonIDE()
    pycharm.name = 'PyCharm'
    pycharm.publisher = 'Jetbrains'
    pycharm.the_best()
    print pycharm

These two modules will make it easier to walk through the expected IDE features.

Feature 01 : Project Navigation

To be considered a project, a source code directory is identified by PyCharm through the existence of an .idea/ directory within it (hence you may want to add this directory to your .gitignore file). Inside this special-purpose directory, PyCharm stores the current state of the project: project file contents, configuration, etc. To open a directory as a project, simply choose the File->Open Directory menu item and pick a directory. The two modules above, stored in a directory python-for-beginner\IDE_review, are displayed in a tree view as follows:

Project view in PyCharm


To open a file, simply double-click it in the navigation pane shown as a tree view above.

Feature 02 : Code Coloring

Have a look at the picture below, which shows an active editor with the content of pythonide.py:

Charming code coloring for a *.py file

Even if this is your first time seeing Python code (which is not that unlikely), I am sure the above screenshot lets you easily distinguish groups of Python identifiers by the different colors used. For example, you can somehow tell that the methods __init__() and __repr__() lie in a different group than the method the_best(): the first two are special-purpose methods (overrides), while the last is your own user-defined method.

Also, the way JetBrains chose the color scheme is, in my opinion, very clever. For example, you do realize that comments are drawn in a faded gray, right? It gives the effect of a not-so-important piece of code, while string literals are emphasized in green. Notice also that PyCharm's spell checker, great as it is, shares the annoying habit of every overactive spell checker: it keeps suggesting that my name is incorrectly spelled. "Thank you very much, PyCharm, but I am happy with my last name. Please don't tell me it is a mistake..."

Another great thing about PyCharm, and what I believe to be its most lovable feature, is the visual clue it displays when we violate the PEP 8 code convention guidelines. For example, if we delete a vertical space from the class above, PyCharm gives us a visual clue that we have violated a convention.

We violate a convention here!


I found this to be the best way to learn to follow the PEP 8 guidelines!

Feature 03 : Go to Definition

Simply hold the CTRL key and hover over any recognized identifier in your source code: PyCharm changes the mouse cursor into a hand icon and underlines the identifier, as a visual clue that you can click it.

Click it to jump to the above identifier definition

Upon clicking it, the cursor moves to the Python module containing the definition of the PythonIDE constructor. To go back to the previous edit position, hit CTRL+SHIFT+BACKSPACE. Also notice how PyCharm displays the method's docstring in the popup above; this is greatly useful when reviewing or reading code.

Feature 04 : Smart Code Completion

As said previously about the dot-character completion list, the picture below is what we get when we display all known methods and fields of the PythonIDE class.

All class contents, sorted by most likely used identifiers


You can hit ENTER to choose the highlighted identifier (stars in the case above). Although you can use the UP and DOWN keys to scroll to the right completion (if, for example, what you really want is the __repr__ method), in my opinion that is ineffective. At least type a few characters of the intended identifier (e.g. re to jump directly to __repr__); that gives you higher precision than hitting the UP and DOWN arrow keys several times.

In retrospect, I remember the old days when there was no code completion in programmers' IDEs, and all that was left to us was to keep a reference guide on standby nearby. That had an important purpose, actually: programmers were forced to read the reference guide before and while coding, and to memorize it...


Feature 05 : Walkthrough Debugging

When you doubt your code and want to walk through it step by step, or simply want to inspect some variable values, you can do so by setting a breakpoint and running your application in debugging mode. Execution then halts at the breakpoint, and you can inspect your code in that state. Below is PyCharm while debugging the main module; notice how easy it is to inspect all variable values, or to watch just certain fields.

Debugging Python application using integrated PyCharm debugger



Feature 06 : Refactoring

The usual way to refactor code in PyCharm is to put the cursor on an identifier, right-click it, go to the Refactor menu item and choose from the available refactorings. For example, if I want to rename the_best to mark_the_best, PyCharm will optionally display the following refactoring preview:

All the code changes that will be affected by refactor action

Just imagine if this method were in use by 15 other modules and you had to rename every occurrence manually. Phew..

Feature 07 : Integrated Console

When you run a Python module in PyCharm, it runs in an integrated console, where you can inspect its output or even terminate a long-running or hung process. This feature is inseparable from a complete Python IDE. Without it, you would have to open a dedicated terminal/console and manually run the Python interpreter to execute a given module. Not nice, eh?

Running a Python application in integrated terminal console

PyCharm also has an integrated, interactive Python console, so you can run the Python interpreter interactively just as you would in a dedicated terminal/console. Below is what I mean by an integrated, interactive Python console:

Interactive Python interpreter integrated in PyCharm



This first IDE review article acts as the opening article for the rest of this series; further articles will compare themselves against it. We will then have a rather complete picture of how to choose one particular IDE over another. One thing to note: the seven features discussed here mainly serve as a basic starting point for discussing IDE capabilities. There are, of course, unexplored features that may be what you like most in an IDE.

But in the end, the IDE you choose will depend greatly on your liking and personal preference. There is really nothing right or wrong with it; it is simply your preference.

Stay tuned for my next articles!




Armin Ronacher: Everything you did not want to know about Unicode in Python 3

Planet Python - Sun, 2014-05-11 20:00

Readers of this blog or my twitter feed know me as a person that likes to rant about Unicode in Python 3 a lot. This time will be no different. I'm going to tell you more about how painful "doing Unicode right" is and why. "Can you not just shut up Armin?" I spent two weeks fighting with Python 3 again and I need to vent my frustration somewhere. On top of that there is still useful information in these rants, because they teach you how to deal with Python 3. Just don't read this if you get annoyed by me easily.

There is one thing different about this rant this time. It won't be related to WSGI or HTTP or any of that other stuff at all. Usually I'm told that I should stop complaining about the Python 3 Unicode system because I write code nobody else writes (HTTP libraries and things of that sort), so I decided to write something else this time: a command line application. And not just the app: I wrote a handy little library called click to make this easier.

Note that I'm doing what just about every newbie Python programmer does: writing a command line application, the "Hello World" of Python programs. But unlike a newcomer to Python, I wanted to make sure the application is as stable and Unicode-supporting as possible for both Python 2 and Python 3, and that it is possible to unittest it. So this is my report on how that went.

What we want to do

In Python 3 we're doing Unicode right as developers. Apparently. I suppose what that means is that all text data is Unicode and all non-text data is bytes. In this wonderful world of everything being black and white, the "Hello World" example is pretty straightforward. So let's write some helpful shell utilities.

Let's say we want to implement a simple cat. In other terms, this is the application we want to write, in Python 2 terms:

import sys
import shutil

for filename in sys.argv[1:]:
    f = sys.stdin
    if filename != '-':
        try:
            f = open(filename, 'rb')
        except IOError as err:
            print >> sys.stderr, 'cat.py: %s: %s' % (filename, err)
            continue
    with f:
        shutil.copyfileobj(f, sys.stdout)

Obviously the command is not particularly great as it does not handle any command line options or anything, but at least it roughly works. So that's what we start out with.

Unicode in Unix

In Python 2 the above code is dead simple because you implicitly work with bytes everywhere. The command line arguments are bytes, the filenames are bytes (ignore Windows users for a moment) and the file contents are bytes too. Purists will point out that this is incorrect and really that's where the problem is coming from, but if you start thinking about it more, you will realize that this is an unfixable problem.

UNIX is bytes, has been defined that way and will always be that way. To understand why you need to see the different contexts in which data is being passed through:

  • the terminal
  • command line arguments
  • the operating system io layer
  • the filesystem driver

That, btw, is not the only thing this data might be going through, but let's go with this for the moment. In how many of these situations do we know an encoding? The answer is: in none of them. The closest we get to understanding an encoding is that the terminal exports locale information. This information can be used to show translations but also to understand what encoding text information has.

For instance an LC_CTYPE of en_US.utf-8 tells an application that the system is running US English and that most text data is utf-8. In practice there are more variables but let's assume that this is the only one we need to look at. Note that LC_CTYPE does not say that all data now is utf-8. It instead informs the application how text characters should be classified and what case conversion rules should be applied.

This is important because of the C locale. The C locale is the only locale that POSIX actually specifies and it says: encoding is ASCII and all responses from command line tools in regards to languages are like they are defined in the POSIX spec.

In the above case of our cat tool there is no other way than to treat this data as if it were bytes. The reason for this is that there is no indication on the shell of what the data is. For instance, if you invoke cat hello.txt, the terminal will pass hello.txt, encoded in the encoding of the terminal, to your application.

But now imagine the other case: echo *. The shell will now pass all the filenames of the current directory to your application. Which encoding are they in? In whatever encoding the filenames are in. There is no filename encoding!

Unicode Madness?

Now a Windows person will probably look at this and say: what the hell are the UNIX people doing? But it's not as dire as it seems. The reason this all works is that some clever people designed the system to be backwards compatible. Unlike Windows, where all APIs are defined twice, on POSIX the best way to deal with all of this is to assume it's a byte mess that, for display purposes, is decoded with an encoding hint.

For instance let's take the case of the cat command above. As you might have noticed there is an error message for files it cannot open, because they either don't exist or are protected or whatever else. In the simple case above let's assume the filename is latin1 garbage because it came from some external drive from 1995. The terminal will get our standard output and will try to decode it as utf-8 because that's what it thinks it's working with. Because that string is latin1 and not in the right encoding it will not decode properly. But fear not, nothing crashes, because your terminal just ignores the things it cannot deal with. It's clever like that.

How does it look for GUIs? They have two versions of each filename. When a GUI like Nautilus lists files it makes a symbol for each one. It associates the internal bytes of that filename with the icon for double clicking, and secondly it attempts to make a filename it can show for display purposes, which might be decoded from something. For instance it will attempt decoding from utf-8, replacing decoding errors with question marks. Your filename might not be entirely readable, but you can still open the file. Success!

Unicode on UNIX is only madness if you force it on everything. But that's not how Unicode on UNIX works. UNIX does not have a distinction between unicode and byte APIs. They are one and the same which makes them easy to deal with.

The C Locale

Nowhere does this show up as much as with the C locale. The C locale is the escape hatch of the POSIX specification to enforce everybody to behave the same. A POSIX compliant operating system needs to support setting LC_CTYPE to C and to force everything to be ASCII.

This locale is traditionally picked in a bunch of different situations. Primarily you will find this locale for any program launched from cron, your init system, subprocesses with an empty environment, etc. The C locale restores a sane ASCII land in environments where you otherwise could not trust anything.

But the word ASCII implies that this is a 7-bit encoding. This is not a problem, because your operating system is dealing in bytes! Any 8-bit byte-based content can pass through just fine, but you are following the contract with the operating system that any character processing will be limited to the first 7 bits. Also, any message your tool generates out of its own translations will be ASCII and the language will be English.

Note that the POSIX spec does not say your application should die in flames.

Python 3 Dies in Flames

Python 3 takes a very different stance on Unicode than UNIX does. Python 3 says: everything is Unicode (by default, except in certain situations, and except if we send you crazy reencoded data, and even then it's sometimes still unicode, albeit wrong unicode). Filenames are Unicode, terminals are Unicode, stdin and stdout are Unicode, there is so much Unicode! And because UNIX is not Unicode, Python 3 now takes the stance that it's right and UNIX is wrong, and people should really change the POSIX specification to add a C.UTF-8 locale which is Unicode. And then filenames are Unicode, and terminals are Unicode, and never ever will you see bytes again, although obviously everything still is bytes and will fail.

And it's not just me saying this. These are bugs in Python related to this braindead idea of doing Unicode:

But then if you Google around you will find so much more. Just check how many people failed to install their pip packages because the changelog had umlauts in it. Or because their home folder has an accent in it. Or because their SSH session negotiates ASCII, or because they are connecting from Putty. The list goes on and on.

Python 3 Cat

Now let's start fixing cat for Python 3. How do we do this? Well, first of all we have now established that we need to deal with bytes, because someone might echo something which is not in the encoding the shell reports. So at the very least the file contents need to be bytes. But then we also need to open the standard output in a way that supports bytes, which it does not do by default. We also need to deal separately with the case where the Unicode APIs crap out on us because the encoding is C. So here it is, a feature-compatible cat for Python 3:

import sys
import shutil


def _is_binary_reader(stream, default=False):
    try:
        return isinstance(stream.read(0), bytes)
    except Exception:
        return default


def _is_binary_writer(stream, default=False):
    try:
        stream.write(b'')
    except Exception:
        try:
            stream.write('')
            return False
        except Exception:
            pass
        return default
    return True


def get_binary_stdin():
    # sys.stdin might or might not be binary in some extra cases.  By
    # default it's obviously non binary which is the core of the
    # problem but the docs recommend changing it to binary for such
    # cases so we need to deal with it.  Also someone might put
    # StringIO there for testing.
    is_binary = _is_binary_reader(sys.stdin, False)
    if is_binary:
        return sys.stdin
    buf = getattr(sys.stdin, 'buffer', None)
    if buf is not None and _is_binary_reader(buf, True):
        return buf
    raise RuntimeError('Did not manage to get binary stdin')


def get_binary_stdout():
    if _is_binary_writer(sys.stdout, False):
        return sys.stdout
    buf = getattr(sys.stdout, 'buffer', None)
    if buf is not None and _is_binary_writer(buf, True):
        return buf
    raise RuntimeError('Did not manage to get binary stdout')


def filename_to_ui(value):
    # The bytes branch is unnecessary for *this* script but otherwise
    # necessary as python 3 still supports addressing files by bytes
    # through separate APIs.
    if isinstance(value, bytes):
        value = value.decode(sys.getfilesystemencoding(), 'replace')
    else:
        value = value.encode('utf-8', 'surrogateescape') \
            .decode('utf-8', 'replace')
    return value


binary_stdout = get_binary_stdout()

for filename in sys.argv[1:]:
    if filename != '-':
        try:
            f = open(filename, 'rb')
        except IOError as err:
            print('cat.py: %s: %s' % (
                filename_to_ui(filename),
                err
            ), file=sys.stderr)
            continue
    else:
        f = get_binary_stdin()

    with f:
        shutil.copyfileobj(f, binary_stdout)

And this is not the worst version. Not because I want to make things extra complicated, but because they are complicated now. For instance, what's not done in this example is to forcefully flush the text stdout before fetching the binary one. In this example it's not necessary because the print calls here go to stderr instead of stdout, but if you wanted to print to stdout instead, you would have to flush. Why? Because stdout is a buffer on top of another buffer, and if you don't flush it forcefully you might get output in the wrong order.

And it's not just me. See, for instance, twisted's compat module for the same mess in a slightly different color.

Dancing The Encoding Dance

To understand the life of a filename parameter passed to a script, this, btw, is what happens on Python 3 in the worst case:

  1. the shell passes the filename as bytes to the script
  2. the bytes are decoded from the expected encoding by Python before they ever hit your code. Because this is a lossy process, Python 3 applies a special error handler that encodes the undecodable bytes as surrogates into the string.
  3. the python code then encounters a file not existing error and needs to format an error message. Because we write to a text stream we cannot write surrogates out as they are not valid unicode. Instead we now
  4. encode the unicode string containing the surrogates to utf-8, telling it to pass the surrogate escapes through as-is.
  5. then we decode from utf-8 and tell it to replace the errors with a placeholder.
  6. the resulting string now goes back out to our text only stream (stderr)
  7. after which the terminal will decode our string for displaying purposes.

Here is what happens on Python 2:

  1. the shell passes the filename as bytes to the script.
  2. the terminal decodes our string for display purposes.

And because no string handling happens anywhere there, the Python 2 version is just as correct, if not more correct, because the terminal can then do a better job of showing the filename (for instance, it could highlight the encoding errors if it would want to; in the case of Python 3 we need to handle the encoding internally, so that is no longer possible for the terminal to detect).

Note that this does not make the script less correct. In case you need to do actual string handling on the input data you would switch to Unicode handling, in 2.x or 3.x. But in that case you also want to support a --charset parameter on your script explicitly, so the work is pretty much the same on 2.x and 3.x anyway. It's just worse on 3.x, because there you need to construct the binary stdout first, which is unnecessary on 2.x.

But You're Wrong Armin

Clearly I'm wrong. I have been told so far that:

  • I only feel it's painful because I don't think like a beginner and the new Unicode system is so much easier for beginners.
  • I don't consider Windows users and how much more correct this new text model is for Windows users.
  • The problem is not Python, the problem is the POSIX specification.
  • The linux distributions really need to start supporting C.UTF-8 because they are stuck in the past.
  • The problem is SSH because it passes incorrect encodings. This is a problem that needs to be fixed in SSH.
  • The real problem with lots of unicode errors in Python 3 is that people just don't pass explicit encodings and instead assume that Python 3 does the right thing and figures it out (which it really can't, so you should pass explicit encodings). Then there would be no problems.
  • I work with "boundary code" so obviously that's harder on Python 3 now (duh).
  • I should spend my time fixing Python 3 instead of complaining on Twitter and my blog.
  • You're making problems where there are none. Just let everybody fix their environment and encodings everywhere and everything is fine. It's a user problem.
  • Java had this problem for ages, it worked just fine for developers.

You know what? I did stop complaining while I was working with HTTP for a while, because I buy the idea that a lot of the problems with HTTP/WSGI are something normal people don't need to deal with. But you know what? The same problem appears in simple Hello World style scenarios. Maybe I should give up trying to achieve a high quality of Unicode support in my libraries and just live with broken stuff.

I can bring up counterarguments for each of the points above, but ultimately it does not matter. If Python 3 were the only Python language I used, I would eat up all the problems and roll with it. But it's not. There is a perfectly good other language available called Python 2; it has the larger user base, and that user base is barely migrating over at all. At the moment it's just very frustrating.

Python 3 might be large enough that it will start to force UNIX to go the Windows route and enforce Unicode in many places, but really, I doubt it.

The much more likely thing to happen is that people stick to Python 2 or build broken stuff on Python 3. Or they go with Go. Which uses an even simpler model than Python 2: everything is a byte string. The assumed encoding is UTF-8. End of the story.


Zato Blog: MySQL support added

Planet Python - Sun, 2014-05-11 18:00

Check out the command line snippet and screenshot below: as of recent git master versions on GitHub it is possible to use MySQL as Zato's SQL Operational Database. This is in addition to the previously supported databases, PostgreSQL and Oracle DB.

The command shown was taken straight from the tutorial - the only difference is that MySQL has been used instead of PostgreSQL.

$ zato quickstart create ~/env/qs-1 mysql localhost 3306 zato1 zato1 localhost 6379
ODB database password (will not be echoed):
Enter the odb_password again (will not be echoed):
Key/value database password (will not be echoed):
Enter the kvdb_password again (will not be echoed):
[1/8] Certificate authority created
[2/8] ODB schema created
[3/8] ODB initial data created
[4/8] server1 created
[5/8] server2 created
[6/8] Load-balancer created
Superuser created successfully.
[7/8] Web admin created
[8/8] Management scripts created
Quickstart cluster quickstart-887030 created
Web admin user:[admin], password:[ilat-edan-atey-uram]
Start the cluster by issuing the /home/dsuch/env/qs-1/zato-qs-start.sh command
Visit https://zato.io/support for more information and support options
$


administration @ Savannah: Savannah VCS Maintenance

GNU Planet! - Sun, 2014-05-11 16:05

The Savannah VCS version control server will be offline on Wednesday, May 14th after 14:30 EDT (2014-05-14 18:30:00 UTC) for file system maintenance. The hosted storage space is being moved to faster backend storage to improve performance. Version control for all projects will be unavailable for possibly as long as three hours until the move is completed.


Yokadi 0.14.0

Planet KDE - Sun, 2014-05-11 14:22

You may not have heard about Yokadi. It is a command-line based TODO list manager which I started some years ago and work on with a bunch of fellow contributors.

Yokadi is a side project for all of us, with occasional bursts of development activity when we find an itch to scratch or foolishly think we have finally figured out the missing feature that is going to save us from procrastination :), so development is a bit slow. We usually run the latest version from the master branch, but not everybody is comfortable working that way, so it is good to have releases. Version 0.13.0 was released 3 (three!) years ago; it was high time we got a new version out. Last week we finally released version 0.14.0.

If you are a command-line aficionado looking for a way to manage your tasks, Yokadi might be the tool you need. Head over to http://yokadi.github.io to learn more and get the latest version. We look forward to your feedback!


Matthew Garrett: Oracle continue to circumvent EXPORT_SYMBOL_GPL()

Planet Debian - Sun, 2014-05-11 11:14
Oracle won their appeal regarding whether APIs are copyrightable. There'll be ongoing argument about whether Google's use of those APIs is fair use or not, and perhaps an appeal to the Supreme Court, but that's the new status quo. This will doubtless result in arguments over whether Oracle's implementation of Linux APIs in Solaris 10 was a violation of copyright or not (and presumably Google are currently checking whether they own any code that Oracle reimplemented), but that's not what I'm going to talk about today.

Oracle own some code called DTrace (Wikipedia has a good overview here - what it actually does isn't especially relevant) that was originally written as part of Solaris. When Solaris was released under the CDDL, so was DTrace. The CDDL is a file-level copyleft license with some restrictions not present in the GPL - as a result, combining GPLed code with CDDLed code will (in the absence of additional permission grants) result in a work that is under an inconsistent license and cannot legally be distributed.

Oracle wanted to make DTrace available for Linux as part of their Unbreakable Linux product. Integrating it directly into the kernel would obviously cause legal issues, so instead they implemented it as a kernel module. The copyright status of kernel modules is somewhat unclear. The GPL covers derivative works, but the definition of derivative works is a function of copyright law and judges. Making use of explicitly exported API may not be sufficient to constitute a derivative work - on the other hand, it might. This is largely untested in court. Oracle appear to believe that they're legitimate, and so have added just enough in-kernel code (and GPLed) to support DTrace, while keeping the CDDLed core of DTrace separate.

The kernel actually has two levels of exposed (non-userspace) API - those exported via EXPORT_SYMBOL() and those exported via EXPORT_SYMBOL_GPL(). Symbols exported via EXPORT_SYMBOL_GPL() may only be used by modules that claim to be GPLed, with the kernel refusing to load them otherwise. There is no technical limitation on the use of symbols exported via EXPORT_SYMBOL().

(Aside: this should not be interpreted as meaning that modules that only use symbols exported via EXPORT_SYMBOL() will not be considered derivative works. Anything exported via EXPORT_SYMBOL_GPL() is considered by the author to be so fundamental to the kernel that using it would be impossible without creating a derivative work. Using something exported via EXPORT_SYMBOL() may result in the creation of a derivative work. Consult lawyers before attempting to release a non-GPLed Linux kernel module)

DTrace integrates very tightly with the host kernel, and one of the things it needs access to is a high-resolution timer that is guaranteed to monotonically increase. Linux provides one in the form of ktime_get(). Unfortunately for Oracle, ktime_get() is only exported via EXPORT_SYMBOL_GPL(). Attempting to call it directly from the DTrace module would fail.

Oracle work around this in their (GPLed) kernel abstraction code. A function called dtrace_gethrtimer() simply returns the value of ktime_get(). dtrace_gethrtimer() is exported via EXPORT_SYMBOL() and therefore can be called from the DTrace module.

So, in the face of a technical mechanism designed to enforce the author's beliefs about the copyright status of callers of this function, Oracle deliberately circumvent that technical mechanism by simply re-exporting the same function under a new name. It should be emphasised that calling an EXPORT_SYMBOL_GPL() function does not inherently cause the caller to become a derivative work of the kernel - it only represents the original author's opinion of whether it would. You'd still need a court case to find out for sure. But if it turns out that the use of ktime_get() does cause a work to become derivative, Oracle would find it fairly difficult to argue that their infringement was accidental.

Of course, as copyright holders of DTrace, Oracle could solve the problem by dual-licensing DTrace under the GPL as well as the CDDL. The fact that they haven't implies that they think there's enough value in keeping it under an incompatible license to risk losing a copyright infringement suit. This might be just the kind of recklessness that Oracle accused Google of back in their last case.


Daniel Pocock: Is Uber on your side?

Planet Debian - Sun, 2014-05-11 03:40

Crowdsourcing ventures with disruptive business models are a regular point of contention these days.

In London, taxi drivers are threatening to create gridlock as part of an anti-Uber protest. In Melbourne, Uber drivers have been issued with $1,700 fines for operating without a taxi license. In San Francisco, the birthplace of many of these ventures, city officials are debating whether AirBNB should be regulated.

An orderly society or an old-school protection racket?

Just what exactly is it that established players in these industries are trying to achieve through their protests and lobbying efforts?

In the case of apartment rentals, many people have sympathy for respecting the wishes of neighbourhoods over those of individual landlords. In the case of car pooling schemes, the arguments tend to come not from motorists at large but from those who are afraid of competition.

Without competition, could things be any worse?

Melbourne actually provides the perfect backdrop for this debate. Only a couple of years before Uber came on the scene, the government had made a detailed study into the taxi industry. One of Australia's most prominent economic experts, a former chairman of the Australian Competition and Consumer Commission, spent 18 months studying the industry.

One of the highlights of the incumbent system (and the reason I suggest Melbourne is the perfect backdrop for this debate) is the way licenses are issued to taxi drivers. There are a fixed number of licenses issued by the government. The licenses are traded on the open market, so prices can go up and down just like real estate. Under the rules of Australia's pension scheme, people have even been able to use money from their pension fund to purchase a taxi license as an investment. It goes without saying that this has encouraged rampant speculation, and the price of a license is now comparable to the price of a house.

The end result is that no real taxi driver can afford a license: most of them have to rent one from the speculators who bought the licenses. These fixed rental fees have to be paid every month whether the driver uses the car or not. Consequently, taxi drivers have cut back on other expenses: they are often criticised for failing to keep their cars clean, and the industry as a whole is criticised for the poor quality of drivers who don't even know their way around the city. The reason, of course, is simple: by the time newly arrived immigrants have learnt their way around Melbourne, they have also figured out that the economics of driving a taxi are not in their favor. Realizing there is no way to break even, they take other jobs instead.

It was originally speculated that the government review would dramatically reduce or abolish these speculative practices, but ultimately lower license charges have only been used for the issuance of 60 new licenses, barely 1% of the taxi fleet in the city today. Furthermore, the new licenses were only available to existing players in the industry.

Uber to the rescue?

Uber drove into the perfect storm as they launched their service in Melbourne in 2013.

Uber drivers get a significant benefit over their competitors in traditional taxis. In particular, as they don't have the fixed monthly payment to rent a taxi license, they don't have to work every day and can even take holidays or take time to clean the cars. These things may simultaneously benefit road safety and passenger comfort.

Meanwhile, those people who speculated on the old taxi licenses have tried hunger strikes and all kinds of other desperate tactics to defer the inevitable loss of their "investment".

The reality is that crowdsourcing is here to stay. Even if Uber is stopped by bullying and intimidation, the inefficiency of Melbourne's taxi system is plain for all to see and both customers and drivers will continue looking for alternatives. Other car-pooling apps based on barter or cost sharing will continue to find ways to operate even if the Uber model is prohibited.

It is interesting to note that the last great reform of Melbourne taxis, under Premier Jeff Kennett in the 1990s, simply resulted in a change of paint with the aim of making them look like those in New York City. Disruptive services like Uber (with their numerous technology-powered innovations to save time and money) appear to be doing far more to improve the lives of passengers and drivers.

The hidden cost

That said, large scale schemes like Uber do also have a down side for customer privacy. Hailing cabs in the street leaves no record of your movements. This new model, however, leaves a very detailed trail of breadcrumbs that can be used for marketing purposes or extracted (lawfully or otherwise) by some third party who wishes to monitor a particular customer's past or future movements. This is the trade-off that arises when we benefit from the efficiencies of any cloud-based service.


Federal Appeals Court Decision in Oracle v. Google

LinuxPlanet - Sat, 2014-05-10 10:33

[ Update on 2014-05-13: If you're more of a listening rather than reading type, you might enjoy the Free as in Freedom oggcast that Karen Sandler and I recorded about this topic. ]

I have a strange relationship with copyright law. Many copyright policies of various jurisdictions, the USA in particular, are draconian at best and downright vindictive at worst. For example, during the public comment period on ACTA, I commented that I think it's always wrong, as a policy matter, for copyright infringement to carry criminal penalties.

That said, much of what I do in my work in the software freedom movement is enforcement of copyleft: assuring that the primary legal tool, which defends the freedom of the Free Software, functions properly, and actually works — in the real world — the way it should.

As I've written about before at great length, copyleft functions primarily because it uses copyright law to stand up and defend the four freedoms. It's commonly called a hack on copyright: turning the copyright system which is canonically used to restrict users' rights, into a system of justice for the equality of users.

However, it's this very activity that leaves me with a weird relationship with copyright. Copyleft uses the restrictive force of copyright in the other direction, but that means the greater the negative force, the more powerful the positive force. So, as I read yesterday the Federal Circuit Appeals Court's decision in Oracle v. Google, I had that strange feeling of simultaneous annoyance and contentment. In this blog post, I attempt to state why I am both glad for and annoyed with the decision.

I stated clearly after Judge Alsup's NDCA decision in this case that I never thought APIs were copyrightable, nor does any developer really think so in practice. But, when considering the appeal, note carefully that the court of appeals wasn't assigned the general job of considering whether APIs are copyrightable. Their job is to figure out if the lower court made an error in judgment in this particular case, and to discern any issues that were missed previously. I think that's what the Federal Circuit Court attempted to do here, and while IMO they too erred regarding a factual issue, I don't think their decision is wholly useless nor categorically incorrect.

Their decision is worth reading in full. I'd also urge anyone who wants to opine on this decision to actually read the whole thing (which so rarely happens in these situations). I bet most pundits out there opining already didn't read the whole thing. I read the decision as soon as it was announced, and I didn't get this post up until early Saturday morning, because it took that long to read the opinion in detail, go back to other related texts and verify some details and then write down my analysis. So, please, go ahead, read it now before reading this blog post further. My post will still be here when you get back. (And, BTW, don't fall for that self-aggrandizing ballyhoo some lawyers will feed you that only they can understand things like court decisions. In fact, I think programmers are going to have an easier time reading decisions about this topic than lawyers, as the technical facts are highly pertinent.)

Ok, you've read the decision now? Good. Now, I'll tell you what I think in detail: (As always, my opinions on this are my own, IANAL and TINLA and these are my personal thoughts on the question.)

The most interesting thing, IMO, about this decision is that the Court focused on a fact from trial that clearly has more nuance than they realize. Specifically, the Court claims many times in this decision that Google conceded that it copied the declaring code used in the 37 packages verbatim (pg 12 of the Appeals decision).

I suspect the Court imagined the situation too simply: that there was a huge body of source code text, and that Google engineers sat there, simply cutting-and-pasting from Oracle's code right into their own code for each of the 7,000 lines or so of function declarations. However, I've chatted with some people (including Mark J. Wielaard) who are much more deeply embedded in the Free Software Java world than I am, and they pointed out it's highly unlikely anyone did a blatant cut-and-paste job to implement Java's core library API, for various reasons. I thus suspect that Google didn't do it that way either.

So, how did the Appeals Court come to this erroneous conclusion? On page 27 of their decision, they write: Google conceded that it copied it verbatim. Indeed, the district court specifically instructed the jury that ‘Google agrees that it uses the same names and declarations’ in Android. Charge to the Jury at 10. So, I reread page 10 of the final charge to the jury. It actually says something much more verbose and nuanced. I've pasted together below all the parts where the Alsup's jury charge mentions this issue (emphasis mine): Google denies infringing any such copyrighted material … Google agrees that the structure, sequence and organization of the 37 accused API packages in Android is substantially the same as the structure, sequence and organization of the corresponding 37 API packages in Java. … The copyrighted Java platform has more than 37 API packages and so does the accused Android platform. As for the 37 API packages that overlap, Google agrees that it uses the same names and declarations but contends that its line-by-line implementations are different … Google agrees that the structure, sequence and organization of the 37 accused API packages in Android is substantially the same as the structure, sequence and organization of the corresponding 37 API packages in Java. Google states, however, that the elements it has used are not infringing … With respect to the API documentation, Oracle contends Google copied the English-language comments in the registered copyrighted work and moved them over to the documentation for the 37 API packages in Android. Google agrees that there are similarities in the wording but, pointing to differences as well, denies that its documentation is a copy. Google further asserts that the similarities are largely the result of the fact that each API carries out the same functions in both systems.

Thus, in the original trial, Google did not admit to copying any of Oracle's text, documentation or code (other than the rangeCheck thing, which is moot on the API copyrightability issue). Rather, Google said two separate things: (a) they did not copy any material (other than rangeCheck), and (b) they admitted that the names and declarations are the same, not because Google copied those names and declarations from Oracle's own work, but because they perform the same functions. In other words, Google makes various arguments of why those names and declarations look the same, but for reasons other than “mundane cut-and-paste copying from Oracle's copyrighted works”.

For us programmers, this is of course a distinction without any difference. Frankly, when we programmers look at this situation, we'd make many obvious logical leaps at once. Specifically, we all think APIs in the abstract can't possibly be copyrightable (since that's absurd), and we work backwards from there with some quick thinking, that goes something like this: it doesn't make sense for APIs to be copyrightable because if you explain to me in enough detail what the API has to do, such that I have sufficient information to implement it, my declarations of the functions of that API are going to necessarily be quite similar to yours — so much so that it'll be nearly indistinguishable from what those function declarations might look like if I cut-and-pasted them. So, the fact is, if we both sit down separately to implement the same API, well, then we're likely going to have two works that look similar. However, it doesn't mean I copied your work. And, besides, it makes no sense for APIs, as a general concept, to be copyrightable so why are we discussing this again?[0]

But this is the sort of reasoning a programmer can love and the Courts hate. The Courts want to take a set of laws the legislature passed, some precedents that their system gave them, along with a specific set of facts, and then see what happens when the law is applied to those facts. Juries, in turn, have the job of finding which facts are accurate, which aren't, and then coming to a verdict, upon receiving instructions about the law from the Court.

And that's right where the confusion began in this case, IMO. The original jury, to start with, likely had trouble distinguishing three distinct things: the general concept of an API, the specification of the API, and the implementation of an API. Plus, they were told by the judge to assume APIs were copyrightable anyway. Then, it got more confusing when they looked at two implementations of an API, parts of which looked similar for purely mundane technical reasons, and assumed (incorrectly) that textual copying from one file to another was the only way to get to that same result. Meanwhile, the jury was likely further confused that Google argued various affirmative defenses against copyright infringement in the alternative.

So, what happens with the Appeals Court? The Appeals court, of course, has no reason to believe the finding of fact of the jury is wrong, and it's simply not the appeals court's job to replace the original jury's job, but to analyze the matters of law decided by the lower court. That's why I'm admittedly troubled and downright confused that the ruling from the Appeals court seems to conflate the issue of literal copying of text and similarities in independently developed text. That is a factual issue in any given case, but that question of fact is the central nuance to API copyrightability, and it seems the Appeals Court glossed over it. The Appeals Court simply fails to distinguish between literal cut-and-paste copying from a given API's implementation and serendipitous similarities that are likely to happen when two API implementations support the same API.

But that error isn't the interesting part. Of course, this error is a fundamentally incorrect assumption by the Appeals Court, and as such the primary rulings are effectively conclusions based on a hypothetical fact pattern and not the actual fact pattern in this case. However, after poring over the decision for hours, it's the only error that I found in the appeals ruling. Thus, setting the fundamental error aside, their ruling has some good parts. For example, I'm rather impressed and swayed by their argument that the lower court misapplied the merger doctrine because it analyzed the situation based on the decisions Google had with regard to functionality, rather than the decisions of Sun/Oracle. To quote: We further find that the district court erred in focusing its merger analysis on the options available to Google at the time of copying. It is well-established that copyrightability and the scope of protectable activity are to be evaluated at the time of creation, not at the time of infringement. … The focus is, therefore, on the options that were available to Sun/Oracle at the time it created the API packages.

Of course, cropping up again in that analysis is that same darned confusion the Court had with regard to copying this declaration code. The ruling goes on to say: But, as the court acknowledged, nothing prevented Google from writing its own declaring code, along with its own implementing code, to achieve the same result.

To go back to my earlier point, Google likely did write their own declaring code, and the code ended up looking the same as the other code, because there was no other way to implement the same API.

In the end, Mark J. Wielaard put it best when he read the decision, pointing out to me that the Appeals Court seemed almost angry that the jury hung on the fair use question. It reads to me, too, like the Appeals Court is slyly saying: the right affirmative defense for Google here is fair use, and a new jury really needs to sit and look at it.

My conclusion is that this just isn't a decision about the copyrightability of APIs in the general sense. The question the Court would need to consider to actually settle that question would be: “If we believe an API itself isn't copyrightable, but its implementation is, how do we figure out when copyright infringement has occurred when there are multiple implementations of the same API floating around, which of course have declarations that look similar?” But the court did not consider that fundamental question, because the Court assumed (incorrectly) there was textual cut-and-paste copying. The decision here, in my view, is about a narrower, hypothetical question that the Court decided to ask itself instead: “If someone textually copies parts of your API implementation, are merger doctrine, scènes à faire, and de minimis affirmative defenses likely to succeed?” In this hypothetical scenario, the Appeals Court claims “such defenses rarely help you, but a fair use defense might help you”.

However, on this point, in my copyleft-defender role, I don't mind this decision very much. The one thing this decision clearly seems to declare is: “if there is even a modicum of evidence that direct textual copying occurred, then the alleged infringer must pass an extremely high bar of affirmative defense to show infringement didn't occur”. In most GPL violation cases, the facts aren't nuanced: there is always clearly an intention to incorporate and distribute large textual parts of the GPL'd code (i.e., not just a few function declarations). As such, this decision is probably good for copyleft, since on its narrowest reading, this decision upholds the idea that if you go mixing in other copyrighted stuff, via copying and distribution, then it will be difficult to show no copyright infringement occurred.

OTOH, I suspect that most pundits are going to look at this in an overly contrasted way: NDCA said APIs aren't copyrightable, and the Appeals Court said they are. That's not what happened here, and if you look at the situation that way, you're making the same kinds of oversimplifications that the Appeals Court seems to have erroneously made.

The most positive outcome here is that a new jury can now narrowly consider the question of fair use as it relates to serendipitous similarity of multiple API function declaration code. I suspect a fresh jury focused on that narrow question will do a much better job. The previous jury had so many complex issues before them, I suspect that they were easily conflated. (Recall that the previous jury considered patent questions as well.) I've found that people who haven't spent their lives training (as programmers and lawyers have) to delineate complex matters and separate truly unrelated issues do a poor job of it. Thus, I suspect the jury won't hang the second time if they're just considering the fair use question.

Finally, with regard to this ruling, I suspect this won't become immediate, frequently cited precedent. The case is remanded, so a new jury will first sit down and consider the fair use question. If that jury finds fair use and thus no infringement, Oracle's next appeal will be quite weak, and the Appeals Court likely won't reexamine the question in any detail. In that outcome, very little has changed overall: we'll have certainty that APIs aren't copyrightable, as long as any textual copying that occurs during reimplementation is easily called fair use. By contrast, if the new jury rejects Google's fair use defense, I suspect Google will have to appeal all the way to SCOTUS. It's thus going to be at least two years before anything definitive is decided, and the big winners will be wealthy litigation attorneys — as usual.

[0] This is of course true for any sufficiently simple programming task. I used to be a high-school computer science teacher. Frankly, while I was successful twice in detecting student plagiarism, it was pretty easy to get false positives sometimes. And certainly I had plenty of student programmers who wrote their function declarations the same for the same job! And no, those weren't the students who plagiarized.
