Feeds

Vasudev Ram: Print selected text pages to PDF with Python, selpg and xtopdf on Linux

Planet Python - Fri, 2014-10-24 00:20
By Vasudev Ram



In a recent blog post, titled My IBM developerWorks article, I talked about a tutorial that I had written for IBM developerWorks a while ago. The tutorial showed some of the recommended techniques and practices to follow when writing a Linux command-line utility that is intended for production use, and how to write it in such a way that it can easily cooperate with existing UNIX command-line tools, when used in a UNIX command pipeline.

This ability of properly written command-line tools to cooperate with each other when used in a pipeline, is, as I said in that IBM article, one of the keys to the power of Linux (and UNIX) as a development environment. (See the classic book The UNIX Programming Environment, for much more on this topic.)

The utility I wrote and discussed (in that IBM article), called selpg (for SELect PaGes), allows the user to select a specified range of pages from a text file. At the end of the aforementioned blog post, I had said that I would show some practical uses of the selpg utility later. I describe one such use case below, involving a combination of selpg and my xtopdf toolkit, which is a Python library for PDF creation.

(The xtopdf toolkit contains a PDF creation library, and also includes some sample applications that show how to use the library to create PDF output in various ways, and from various input sources, which is why I tend to call xtopdf a toolkit instead of just a library.)

I had written one such application of xtopdf a while ago, called StdinToPDF(.py) (for standard input to PDF). I blogged about it at the time, here:

[xtopdf] PDFWriter can create PDF from standard input. (PDFWriter is a module of xtopdf, which provides the core PDF creation functionality.)

The selpg utility can be used with StdinToPDF, in a pipeline, to select a range of pages (by starting and ending page numbers) from a (possibly large) text file, and write only those selected pages to a PDF file. Here is an example of how to do that:

First, build the selpg utility from source, for your Linux OS. selpg is only meant to work on Linux, since it uses some Linux C standard library functions, such as from stdio.h, and popen(); but you can try to run it on Windows (at your own risk), since Windows does have (had?) a POSIX subsystem, from Windows NT onward. I have used it in the past. (Update: I checked - according to this section of the Wikipedia article about POSIX, Windows may have had POSIX support only from Windows NT up to Windows 2000.) Anyway, to build selpg on Linux, follow the steps below (the $ sign is the shell prompt and not to be typed):

1. Download the source code from the sources section of the selpg project repository on Bitbucket.

Download all of these files: makefile, mk, selpg.c and showsyserr.c.

2. Make the (shell script) file mk executable, with the command:
$ chmod u+x mk
3. Then run the file mk, with:
$ ./mk
That will run the makefile that builds the selpg executable using the C compiler on your Linux box. The C compiler (invoked as cc or gcc) is installed on most mainstream Linux distributions. If it is not, you will need to install it from the repository for your Linux distribution. Sometimes only a minimal version of a C compiler is installed, which is only enough to (re)compile the kernel after making kernel parameter changes, such as for performance tuning. Consult your local Linux expert for help if such is the case.

4. Now make the file selpg executable, with the command:
$ chmod u+x selpg
5. (Optional) You can check the usage of selpg by reading the IBM tutorial article and/or running selpg without any command-line arguments:
$ ./selpg
which will show a usage message.

6. (Optional) You can run selpg a few times with some text file(s) as input, and different values for the -s and -e command-line options, to get a feel for how it works.

Now download xtopdf (which includes StdinToPDF) from here:

xtopdf on Bitbucket.

To install it, follow the steps given in this post:

Guide to installing and using xtopdf, including creating simple PDF e-books

That post was written a while ago, when xtopdf was hosted on SourceForge. So you need to make one change to the instructions given in that guide: instead of downloading xtopdf from SourceForge, as stated in Step 5 of the guide, get it from the xtopdf Bitbucket link I gave above.

(To make xtopdf work, you also have to install ReportLab, which xtopdf uses internally; the steps for that are given in my xtopdf installation guide linked above, or you can also look at the instructions in the ReportLab distribution. It is easy, just a couple of steps - download, unzip, configure a setting or two.)

Once you have both selpg and xtopdf installed, you can use selpg and StdinToPDF together. Here is an example run, to select only pages 2 through 4 from an input text file:

I wrote a simple Python program, gen_selpg_test_file.py, to create a text file that can be used to test the selpg and StdinToPDF programs together.

Here is an excerpt of the core logic of gen_selpg_test_file.py, omitting argument and error handling for brevity (I have those in the actual code):
import sys

# Generate the test file with the given filename and number of lines of text.
# (out_filename and num_lines come from the command-line arguments, whose
# parsing is omitted here.)
try:
    out_fil = open(out_filename, "w")
except IOError as ioe:
    sys.stderr.write("Error: Could not open output file {}.\n".format(out_filename))
    sys.exit(1)
for line_num in range(1, num_lines + 1):
    line = "Line #" + str(line_num).zfill(10) + "\n"
    out_fil.write(line)
out_fil.close()
I ran it like this:
$ python gen_selpg_test_file.py selpg_test_file_1000.txt 1000
to generate a text file with 1000 lines, in the file selpg_test_file_1000.txt.

Then I could run the pipeline using selpg and StdinToPDF, as described above:
$ ./selpg -s2 -e4 selpg_test_file_1000.txt | python StdinToPDF.py p2-p4.pdf
This command extracts only the specified pages (2 to 4) from the input file, and pipes them to StdinToPDF, which converts just those pages to PDF, writing them to the filename given at the end of the command.

After doing the above, you can open the file p2-p4.pdf in your favorite PDF reader (Evince is one PDF reader for Linux), to confirm that it contains all (and only) the lines from pages 2 to 4 of the input file selpg_test_file_1000.txt (considering 72 lines per page, which is the default that selpg uses).

Read the IBM article to see how that default can be changed - to either another number of lines per page, e.g. 66 or 80 or whatever, or to specify form feeds (ASCII code 12) as the page delimiter. Form feeds are often used as a page delimiter in text file reports generated by programs, when the reports are destined for a printer, since the form feed character causes the printer to advance the print head to the top of the next page/form (that's how the character got its name).
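To illustrate the form-feed convention mentioned above, here is a small hedged sketch (the helper name is invented, not from the article) of how a report generator might emit form-feed-delimited pages that a pager like selpg could then split on:

```python
# Sketch: write a report where each page ends with a form feed (ASCII 12),
# the alternative page delimiter discussed above.
FORM_FEED = "\x0c"

def write_paged_report(lines, lines_per_page=66):
    """Return report text with a form feed after every page of lines."""
    pages = []
    for i in range(0, len(lines), lines_per_page):
        pages.append("\n".join(lines[i:i + lines_per_page]) + "\n" + FORM_FEED)
    return "".join(pages)

report = write_paged_report(["Line %d" % n for n in range(1, 133)], 66)
# 132 lines at 66 lines per page -> exactly 2 form feeds
print(report.count(FORM_FEED))
```

A printer receiving this stream would start a new sheet at each form feed, which is exactly why line-printer-era report programs emitted them.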

Though this post seemed long, note that a lot of it was either background information or instructions on how to build selpg and install xtopdf. Those are both one-time jobs. Once they are done, you can select the needed pages from any text file and print them to PDF with a single command line, as shown in the last command above.

This is useful when you printed the entire file earlier, and some pages didn't print properly because the printer jammed. Just use selpg with xtopdf to print only the needed pages again.



The image above is from the Wikipedia article on Printing, and titled:

Jikji, "Selected Teachings of Buddhist Sages and Son Masters" from Korea, the earliest known book printed with movable metal type, 1377. Bibliothèque Nationale de France, Paris

- Enjoy.

- Vasudev Ram - Dancing Bison Enterprises

Click here to get email about new products from Vasudev Ram.

Contact Page

Categories: FLOSS Project Planets

unifont @ Savannah: Unifont 7.0.06 Now Available

GNU Planet! - Thu, 2014-10-23 20:11

Unifont 7.0.06 is now available at ftp://ftp.gnu.org/gnu/unifont/unifont-7.0.06/.

This release adds coverage for the following Supplemental Multilingual Plane scripts: Old Permic, Ornamental Dingbats, Geometric Shapes Extended, and Supplemental Arrows-C. The SMP now contains over 5700 glyphs.

Some final adjustments were also made to the ASCII lower-case letters for better alignment. Those changes carried over to other Latin scripts where those letters occur.

For full details, view the ChangeLog file in the source tarball.

Categories: FLOSS Project Planets

Drupal Bits at Web-Dev: Drupal: Altering Page Title and or Title Tag

Planet Drupal - Thu, 2014-10-23 19:41

Sometimes you need to alter the title that appears on the page and/or the title tag in Drupal 7. If you need to make them both the same, a call to drupal_set_title() from within a hook_preprocess_page() will do it.

Categories: FLOSS Project Planets

Print selected text pages to PDF with Python, selpg and xtopdf on Linux

LinuxPlanet - Thu, 2014-10-23 17:09
By Vasudev Ram

Categories: FLOSS Project Planets

FSF Blogs: Friday Free Software Directory IRC meetup: October 24

GNU Planet! - Thu, 2014-10-23 16:54

Join the FSF and friends on Friday, October 24, from 2pm to 5pm EDT (18:00 to 21:00 UTC) to help improve the Free Software Directory by adding new entries and updating existing ones. We will be on IRC in the #fsf channel on freenode.


Tens of thousands of people visit directory.fsf.org each month to discover free software. Each entry in the Directory contains a wealth of useful information, from basic category and descriptions, to providing detailed info about version control, IRC channels, documentation, and licensing info that has been carefully checked by FSF staff and trained volunteers.


While the Free Software Directory has been and continues to be a great resource to the world over the past decade, it has the potential of being a resource of even greater value. But it needs your help!


If you are eager to help and you can't wait or are simply unable to make it onto IRC on Friday, our participation guide will provide you with all the information you need to get started on helping the Directory today!

Categories: FLOSS Project Planets

Emmanuel Lecharny: About Perfection and OSS

Planet Apache - Thu, 2014-10-23 16:48
From time to time I feel like we all have our own Moby Dick, and when it comes to OSS, its name is 'perfection'.

Those shiny moments of pure joy, that warm feeling that surrounds you when you can say 'mission accomplished', are rare and vanishing periods when you work on a never-ending project. It's more or less when you get a big bug fixed, or when you read some enthusiastic review of the project you are working on.
Fixing a bug is probably the best way to get this reward, as you know that you have made some progress.
It's forever shadowed by the constant pain of knowing that there are other bugs, and that in order to get a release done, you had to make some choices, leaving problems behind.
Is it a sad story about being a developer? No. It's not sad. It's tough, it's long, it's an endless job. Would I prefer doing something else? Certainly not! At least, I know what I'm chasing, and even if I rarely foresee this fading perfection, sometimes I can almost touch it. Not something you can experience when you work in a company, as you don't have the opportunity to polish the project as much as you want, due to time constraints.
Last, but not least, you are not alone. When you think that you are turning in circles, you know that the community you are part of will help you. Use it: they have the clues you don't have.
An Arabic proverb says "It's not that the way is painful, it's just that the pain is the way". So you had better deal with it.
Categories: FLOSS Project Planets

Python Sweetness: Guerrilla optimization for PyPy: lazy protocol buffer decoding

Planet Python - Thu, 2014-10-23 16:46

I’ve been hugely distracted this year with commercial and $other work, so for-fun projects have been taking a back seat. Still, that doesn’t mean my interests have changed, it’s just that my energy is a bit low, and stringing out cohesive blog posts is even harder than usual. Consequently I’m trying to keep this post concise, and somewhat lighter on narrative and heavier on useful information compared to my usual rambling.

It only took 8 months, but finally I’ve made some fresh commits to Acid, this time progressing its internal Protocol Buffers implementation toward usefulness. Per ticket #41, the selling point of this implementation will be its ability to operate without requiring any copies (at least in the CPython extension implementation), and its ability to avoid decoding values unless explicitly requested by the user.

The ultimate goal is to let Python scan collections at close to the speed of the storage engine (6m+ rows/sec) without having to write tricksy code, import insane 3rd party code (*cough* SQL) or specialized stores (Neo4j?).

I’ve only prototyped the pure-Python module that will eventually be used on PyPy, trying to get a feel for what the internals of the CPython extension will need to look like, and figuring out the kinds of primitive types (and their corresponding pitfalls) the module might want to provide/avoid.

The design is pretty straightforward, although special care must be paid, e.g., when handling repeating elements, which will be represented by lists, or list-like things that know how to share memory and lazily decode.

The road to a 55x speedup

In the course of the past few days' experimentation, quite a fun story has emerged around optimizing bit-twiddling code like this to run on PyPy.

My initial implementation, based on some old code from an abandoned project, was sufficient to implement immediate decoding of the Protocol Buffer into a dict, where the dict-like Struct class would then proxy __getitem__ calls and suchlike directly on to the dict.

The downside, though, per the design requirement, is that in order to implement a row scan where only one or two fields are selected from each row during the scan, a huge penalty is paid in decoding and then discarding every other unused field.

For testing decoding/encoding, I began with a “representative” Struct/StructType containing the fields:

    Field #1: varint bool, 1 byte, True
    Field #2: varint list, 5x 1 byte elements, [1, 2, 3, 4, 5]
    Field #3: inet4, 1x fixed size 32bit element ‘255.0.255.0’
    Field #4: string, 1979 byte /etc/passwd file
    Field #5: bool, 1 byte, True
    Field #6: string, 12 bytes, my full name

Tests are done using the timeit module to either measure encoding the entire struct, or instantiating the struct from its encoded form and accessing only field #6. We're interested in field #6 because, in a lazy implementation, it requires the most effort to locate: the decoder must first skip over all the previous elements.
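The shape of that measurement can be sketched with timeit. The class below is a stand-in, not Acid's actual Struct; only the benchmark pattern (instantiate from encoded bytes, then access one late field) reflects the text:

```python
import timeit

# Stand-in for the post's Struct: the real Acid classes aren't reproduced
# here, but the measurement shape matches the text: instantiate from the
# encoded form, then access only the last field.
class FakeStruct(object):
    def __init__(self, buf):
        self._buf = buf

    def __getitem__(self, key):
        return self._buf[-12:]   # pretend to skip fields #1..#5, decode #6

encoded = b"x" * 2048 + b"Vasudev Ram!"   # ~2kb filler plus a 12-byte field

def bench():
    return FakeStruct(encoded)["name"]

per_call = min(timeit.repeat(bench, number=10000, repeat=3)) / 10000
print("%.3f usec per instantiate + access" % (per_call * 1e6))
```

Taking the minimum of several repeats, as above, is the usual way to reduce noise from other processes when micro-benchmarking.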

On PyPy, the initial implementation was sufficient to net around 9.1usec to decode and 5.1usec to encode, corresponding to a throughput of around 100k rows/sec. Not bad for a starting point, but we can definitely do much better than that.

StringIO/BytesIO is slowww

From previous experience I knew the first place to start looking was the use of file-like objects for buffer management. Both on CPython and PyPy, use of StringIO to coordinate read-only buffer access is horrendously slow. I’m not sure I know why exactly, but I do know how to avoid it.

So first up came "replace StringIO with direct access". Instead of passing a file-like object between all the parsing functions simply to track the current read offset, we pass (buf, pos) in the parameter list, and all parsing functions return (pos2, value) as their return value. The caller resumes parsing at pos2. For free, we now get IndexError thrown any time a bounds check fails for a single-element access, where previously we had to check the length of the string returned by fp.read(1). The fastest code is nonexistent code.
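The convention just described might look like this (a minimal sketch with invented function names, not Acid's actual parsers):

```python
# Each parser takes the buffer plus a read offset, and returns
# (new_pos, value); the caller resumes at new_pos. A stray single-byte
# read past the end raises IndexError, which doubles as the bounds check.
def read_byte(buf, pos):
    return pos + 1, buf[pos]

def read_fixed32(buf, pos):
    end = pos + 4
    if end > len(buf):
        raise IndexError(pos)
    value = 0
    for shift, b in zip((0, 8, 16, 24), buf[pos:end]):  # little-endian
        value |= b << shift
    return end, value

buf = bytes([0x01, 0xFF, 0x00, 0xFF, 0x00])
pos, flag = read_byte(buf, 0)
pos, addr = read_fixed32(buf, pos)
print(pos, flag, hex(addr))   # 5 1 0xff00ff
```

Because parsing state is just an integer, there is no object to allocate or method dispatch to pay for per read, which is plausibly where the speedup comes from.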

I’m not even going to attempt guessing at why this is so effective, but clearly it is: parsing time on PyPy 2.4.0 amd64 already dropped from 9.1usec to 1.69usec, all for a very simple, systematic modification to each function signature. Now we’re up from 100k rows/sec to almost 600k/sec.

Not only that, but now the parser can operate on any sequence-like object that has single-character string elements and supports slicing, including e.g. mmap.mmap, which you could call the ultimate form of lazy decoding ;)

Lazy decoding take #1

Next up is "implement lazy decoding". This modifies the Struct type to simply stash the encoded buffer passed to it during initialization, and ask StructType during each __getitem__ to find and decode only the single requested element. Once the element is fetched, Struct stores it in its local dict to avoid having to decode it again.

With lazy decoding, work has shifted from heap-allocating lists of integers and duplicating large 2kb strings to simply scanning for field #6, never even having to touch a byte of that 2kb /etc/passwd file embedded in the record. Our parsing time drops from 1.69usec to 0.494usec. Now we’re getting warmer — 2m Struct instantiations + field accesses/sec.
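A toy sketch of the lazy Struct idea (the field layout table is invented here purely to show the shape; Acid derives it from the wire format instead):

```python
# Stash the encoded buffer, decode a field only when it is first
# requested, then cache the result in a local dict.
class LazyStruct(object):
    FIELDS = {"flag": (0, 1), "name": (1, 12)}   # hypothetical schema

    def __init__(self, buf):
        self._buf = buf
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            off, length = self.FIELDS[key]   # find and decode one field
            self._cache[key] = self._buf[off:off + length]
        return self._cache[key]

s = LazyStruct(b"\x01Vasudev Ram!")
print(s["name"])   # only this field is ever decoded
```

Fields that are never asked for are never decoded, which is the whole point of the row-scan use case described above.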

inline read_key() call for read_value()

At this point I thought it was already time to break out the micro-optimizations, and so I tried inlining the read_key() function, responsible for splitting the protocol buffer’s field tag varint into its component parts, moving its implementation to its only call site.

Not much juice here, but a small win — 0.44usec.

precalculate varint to avoid some shift/masking

Now we’re really in micro-optimization territory. For a savings of 5nsec, precalculate some trivial math. Barely worth the effort.

specialize encode for PyPy

By now the code is still using StringIO for the encode path, since the convenience is too hard to give up. PyPy provides a magical StringBuilder type, which knows how to incrementally build a string while avoiding (at least) the final copy where it is finalized.

As you can see from the commit message, this switch to more efficient buffering brought encode time on PyPy down considerably.

unroll write_varint()

You’ll probably notice by now that the benchmark script used in the commit messages was getting edited as I went along. The numbers in the messages are a fair indication of the level of speedup occurring, but due to horrendous bugs in the initial unrolled write_varint(), I can’t easily reproduce the reference runtime from this post using the current version of that script.

Usually loop unrolling is a technique reserved for extremely tricky C code, but that doesn’t mean it doesn’t have a place in Python land. This commit takes a while loop that can only execute up to 10 iterations and manually unfolds it, replacing all the state variables with immediate constants and very direct code.

Due to the ubiquitous use of variable-length integers in the protocol buffers scheme, in return we see a huge speed increase: encoding is now 2.5x faster than the clean “Pythonic” implementation. In some ways, this code is vastly easier to follow than the old loop, although I bet if I tried to run flake8 on the resulting file, a black hole would spontaneously form and swallow the entire universe.
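The shape of that unrolling can be sketched as follows. This is a toy version, not the actual commit (the real code unfolds all ten iterations); here only the common one- and two-byte cases are unfolded:

```python
# Generic varint write loop, and a version with the common short cases
# unfolded into straight-line code with immediate constants.
def write_varint_loop(out, value):
    while value > 0x7F:
        out.append(0x80 | (value & 0x7F))
        value >>= 7
    out.append(value)

def write_varint_unrolled(out, value):
    if value < 0x80:              # 1 byte: the overwhelmingly common case
        out.append(value)
    elif value < 0x4000:          # 2 bytes
        out.append(0x80 | (value & 0x7F))
        out.append(value >> 7)
    else:                         # rare: fall back to the loop
        write_varint_loop(out, value)

a, b = bytearray(), bytearray()
write_varint_loop(a, 300)
write_varint_unrolled(b, 300)
print(a == b, bytes(a))
```

The unrolled branches give the JIT straight-line code with constant shifts and masks, instead of a loop whose trip count it must guess.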

use bytearray() instead of StringIO.

While we’re targeting PyPy, that’s not to say we can’t also look after old CPython. Even though a C extension will exist for CPython, having the fallback implementation work well there is also beneficial.

Here we exploit the fact that bytearray.extend is generally faster on CPython than the equivalent StringIO dance, and so in this commit we bid farewell to our final use of StringIO.

Notice how the use of "if StringBuilder:" effectively avoids performing runtime checks in the hot path: instead of wiring the test into the _to_raw() function, we simply substitute the entire function with one written specifically for whatever string builder implementation is available.
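That import-time specialization pattern might look like this sketch (function name shortened from the post's _to_raw; on PyPy 3 a bytes-oriented builder would be needed, since StringBuilder there builds str):

```python
# Pick the encoder implementation once at import time, instead of
# branching in the hot path on every call.
try:
    from __pypy__.builders import StringBuilder   # PyPy only
except ImportError:
    StringBuilder = None

if StringBuilder:
    def to_raw(chunks):
        sb = StringBuilder()
        for c in chunks:          # assumes str chunks on PyPy 2
            sb.append(c)
        return sb.build()
else:
    def to_raw(chunks):           # CPython fallback: bytearray.extend
        ba = bytearray()
        for c in chunks:
            ba.extend(c)
        return bytes(ba)

print(to_raw([b"ab", b"cd"]))
```

Either way, callers see one to_raw() with no per-call feature test.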

For a small amount of effort, encoding on CPython is now almost 25% faster.

partially unroll read_varint()

We’re not really interested in encoding — writes are generally always going to be a slow path in the average web app. We care mainly about decoding, and so back to looking for quick wins in decoder land.

Anyone familiar with what this code does may be noticing some rather extraordinarily obvious bugs in these commits. My only excuse is that this is experimental code, and I’ve already done a full working day before sitting down to it. ;)

Loop unrolling need not go the whole hog. By noticing that most variable-length integers are actually less than 7 bits long, here we avoid the slow ugly loop by testing explicitly for a 7-bit varint and exiting quickly in that case.

In return, decoding time on PyPy drops by another 33%.
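A sketch of that fast path, simplified from the idea described above (the (buf, pos) calling convention matches the earlier refactor; the function body is illustrative, not Acid's exact code):

```python
# Test for the common 7-bit varint first and return immediately,
# falling back to the general accumulation loop only for longer values.
def read_varint(buf, pos):
    b = buf[pos]
    if b < 0x80:                 # most varints fit in 7 bits: exit early
        return pos + 1, b
    value = b & 0x7F
    shift = 7
    while True:
        pos += 1
        b = buf[pos]
        value |= (b & 0x7F) << shift
        if b < 0x80:
            return pos + 1, value
        shift += 7

print(read_varint(bytes([0x05]), 0))          # fast path: (1, 5)
print(read_varint(bytes([0xAC, 0x02]), 0))    # slow path: (2, 300)
```

Since field keys themselves are varints, this fast path runs at least once per field, which explains the outsized win.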

fully unroll read_varint and fix 64bit bugs.

Aware of all the bit-twiddling terrors of the past few days, tonight I rewrote write_varint/read_varint, to ensure that these functions did what they claim.

In the process, I unrolled the read_varint loop. I don’t have benchmarks here, since by this point my benchmarking script would crash due to the aforementioned bugs.

The remainder of the loop unrolling probably isn’t so helpful, since most varints are quite small, but at least it is very easy to understand the varint format from reading the code.

only cache mutable values in Struct dict.

We’re already down to 0.209usec/field access on PyPy, corresponding to somewhere in the region of 4.9m scanned rows/sec.

At this point I re-read ticket #41, realized that introducing the “avoid work” cache of decoded elements was not part of the original design, and probably also wasn’t helping performance.

For mutable elements we always track the value returned to the user, since if they modify it, and later re-serialize the Struct, they expect their edits to be reflected. So I removed the cache, preserving it for mutable elements only.

In return, PyPy awards us with a delicious special case, relating to its collection strategies feature. On PyPy, a dict is not really ever a dict. It is one of about 5 different implementations, depending on the contents of the dict.

In the case of a dict that has never been populated with a single element, only 3 machine words are allocated for it (type, strategy, storage), and it’s configured to use the initial EmptyDictStrategy.

By removing a performance misfeature, the runtime has a better chance to do its job, and single field decoding becomes 25% faster, yielding our final instantiate + field access time of 0.168usec.

constant time skip function selection [late addition]

This change splits up the _skip() function, responsible for skipping over unknown or unwanted fields into a set of functions stored in a map keyed by wire type.

This has no discernible effect on my PyPy microbenchmark, but it wins nearly 11% on CPython.
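The dispatch-table idea can be sketched like this (the helper read_varint and the skip functions are minimal stand-ins, keyed by the standard protobuf wire type numbers):

```python
# One skip function per protobuf wire type, selected by dict lookup
# instead of an if/elif chain.
def read_varint(buf, pos):
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b < 0x80:
            return pos, value
        shift += 7

def skip_varint(buf, pos):
    return read_varint(buf, pos)[0]

def skip_64(buf, pos):           # wire type 1: fixed 64-bit
    return pos + 8

def skip_delimited(buf, pos):    # wire type 2: length-delimited
    pos, length = read_varint(buf, pos)
    return pos + length

def skip_32(buf, pos):           # wire type 5: fixed 32-bit
    return pos + 4

SKIP = {0: skip_varint, 1: skip_64, 2: skip_delimited, 5: skip_32}

# Skip a length-delimited field: varint length 3, then 3 payload bytes.
buf = bytes([0x03, ord("a"), ord("b"), ord("c"), 0x08])
print(SKIP[2](buf, 0))   # -> 4
```

The dict lookup is constant time regardless of which wire type arrives, whereas a branch chain pays more for the types tested last.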

remove method call & reorder branch to avoid jump in the usual case [late addition]

Another micro-optimization with no benefit on PyPy. It swaps the order of the branches in an if: statement so the common case avoids a jump on CPython, and additionally replaces a method call with a primitive operation (thus avoiding, e.g., building the argument tuple).

Yields another 3% on CPython.
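Illustratively (the marker byte and surrounding logic here are made up; only the two transformations are the point):

```python
RARE_MARKER = 0xff  # hypothetical marker byte for the slow path

def decode_before(buf, pos, rare_handler):
    # Rare case tested first: the common path must jump over this body,
    # and the startswith() method call builds an argument tuple each time.
    if buf.startswith(b'\xff', pos):
        return rare_handler(buf, pos)
    return buf[pos], pos + 1

def decode_after(buf, pos, rare_handler):
    # Common case first: CPython falls straight through with no jump, and
    # a primitive index-and-compare replaces the method call entirely.
    b = buf[pos]
    if b != RARE_MARKER:
        return b, pos + 1
    return rare_handler(buf, pos)
```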

Futures

I notice that 15% of the time is spent in Struct.__getitem__, mostly trying to figure out whether the key is a valid field or not, and whether the result value is mutable.

We can get some of this dispatch ‘for free’ by folding the lookup into the type system by producing proper Struct subclasses for each StructType, and introducing a BoundField type that implements the descriptor protocol.

But that means abandoning the dict interface, and significantly complicating the CPython extension implementation when it comes time to write that, and I’m not sure just how much of the 15% can really be recovered by taking this approach.
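For what it's worth, the descriptor idea would look roughly like this (all names hypothetical):

```python
class BoundField:
    # Descriptor that bakes in the field's index, decoder and mutability
    # at class-creation time, so attribute access skips the per-lookup
    # checks that Struct.__getitem__ currently repeats.
    def __init__(self, index, decoder, mutable):
        self.index = index
        self.decoder = decoder
        self.mutable = mutable

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return self.decoder(obj._raw[self.index])

def make_struct_type(name, fields):
    """fields: iterable of (field_name, decoder, mutable) tuples."""
    def __init__(self, raw):
        self._raw = raw
    ns = {'__init__': __init__}
    for i, (fname, decoder, mutable) in enumerate(fields):
        ns[fname] = BoundField(i, decoder, mutable)
    return type(name, (), ns)
```

As noted, this abandons the dict interface: fields become attributes of a generated class rather than keys.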

So there we have it: in a few hours we’ve gone from 100k rows/sec to upwards of 6 million/sec, all through just a little mechanical sympathy. Now imagine this code scaled up across the average software project, and the hardware cost savings involved should the code see any heavy traffic. But of course, there is never any real business benefit to wasting time on optimization like this! And of course, let’s not forget how ugly and un-pythonic the resulting mess is.

Ok, I’m all done for now. If you know any more tricks relating to this code, please drop me a line. Email address is available in git logs, or press the Ask Me Anything link to the right of this text. Remember to include an email address!

Categories: FLOSS Project Planets

Drupal core announcements: All the sprints at and around DrupalCon Latin America Bogotá

Planet Drupal - Thu, 2014-10-23 16:10
Start: 2015-02-08 (All day) - 2015-02-13 (All day), America/Chicago
User group meeting

https://latinamerica2015.drupal.org/sprints

We have a great tradition of extended sprints around big Drupal events including DrupalCons and Drupal Dev Days. While there is a sprint day included in DrupalCons (usually) on the last day of the con, given that a lot of the Drupal core and contrib developers fly in for these events, it makes a lot of sense to use this opportunity to start sooner and/or extend our stay and work together in one space on the harder problems.

DrupalCon Latin America in Bogotá is the next DrupalCon! We are still looking for space and additional sponsors for the sprints before/after, to help with space, internet, coffee, tea and maybe food. There are already various sprints signed up, including Multilingual and “Sign me up for anything”. We are really friendly and need all kinds of expertise!

Now is the time to consider if you can be available and book your travel and hotel accordingly!

Join the sprinters -- sign up now!

Practical details
Dates
February 8 - 13 2015 (all days at DrupalCon and some days both before and after).
Times and locations
Feb 8: Extended sprint, location: TBD
Feb 9: Maybe at the venue. (There is also training this day.)
Feb 10 - 11: These are session days. Sprint lounge at venue.
Feb 12: Official sprint day, location: TBD
Feb 13: Extended sprint, location: TBD
Sponsors

??

Looking for sponsors

We are looking for more sponsors to be able to pay for extra expenses. If you are interested in sponsoring, or if you need sponsors to cover expenses, please contact me (YesCT).

Frequently asked questions

What is a sprint?

Drupal sprints are opportunities to join existing teams and further Drupal: the software, our processes, drupal.org and so on.

Do I need to be a pro developer?

No, not at all. First of all, sprints include groups working on user experience, designs, frontend guidelines, drupal.org software setup, testing improvements, figuring out policies, etc. However, you can be more productive at most sprints if you have a laptop.

Why are there 6 consecutive days of sprints?

DrupalCon is the time when most people in the Drupal community get together. We try to use this time to share our knowledge as well as further the platform in all possible ways. Therefore there is almost always an opportunity and a place to participate in moving Drupal forward.

What if I'm new to Drupal and/or sprinting, how can I join?

If you feel new and would love helping hands, the best day to start is the Thursday Feb 12 sprint day. This is the biggest sprint day with lots of people sprinting and different opportunities based on experience level. For a guided introduction to the tools and processes we use to collaborate, go to the First Time Sprinter workshop in the morning. If you know the tools but still could use help picking issues and going through the process, the Mentored Core Sprint is for you.

I worked on Drupal before, which sprints are for me?

If you have experience with Drupal issues and maybe already know a team/topic, any days of a DrupalCon may be your sprint days, and even the days before and after. These sprints do not have formal mentoring available, but of course if you have questions, there are always plenty of friendly people to help you. The community organizes off-site sprint opportunities for the days before/after DrupalCon and the event itself provides sprint locations from Feb 10 -12 throughout the session days in the event venue and in the official event hotel. These sprints are broken down to teams working on different topics. It is very important that you sign up for them, so we know what capacity to plan with.

Further questions?

Ask me (YesCT), I am happy to answer.

Categories: FLOSS Project Planets

Petter Reinholdtsen: I spent last weekend recording MakerCon Nordic

Planet Debian - Thu, 2014-10-23 16:00

I spent last weekend at MakerCon Nordic, a great conference and workshop for makers in Norway and the surrounding countries. I had volunteered on behalf of the Norwegian Unix Users Group (NUUG) to video record the talks, and we had a great and exhausting time recording the entire day, two days in a row. There were only two of us, Hans-Petter and me, and we used NUUG's regular video equipment: a dvswitch, a camera and a VGA to DV converter box, mixing video and slides live.

Hans-Petter did the post-processing, which consisted of uploading the around 180 GiB of raw video to YouTube, and the result is now becoming public on the MakerConNordic account. The videos carry the license NUUG always uses on our recordings, which is Creative Commons Navngivelse-Del på samme vilkår 3.0 Norge. Many great talks are available. Check it out! :)

Categories: FLOSS Project Planets

Blair Wadman: Improve Drupal email delivery rates by using Mandrill

Planet Drupal - Thu, 2014-10-23 15:40

Recently one of my clients had a problem with a large portion of transactional email never being seen. The emails were being directed to the recipients' spam folders and were generally being over-looked. These were important emails regarding things like membership confirmations, invoices and event information and were critical to the experience of the members.

Why was this happening? Mostly because the emails were being sent by the web server. I switched it to Mandrill, a service designed to take care of the headaches of sending transactional email, and this greatly improved the delivery rate.

It is notoriously difficult to ensure emails from your application (such as Drupal) actually get delivered without getting caught in spam filters. Email providers like Mandrill have the expertise to maximise delivery rate. You are unlikely to have the time or expertise to manage this process for your own web server.

Mandrill provides great stats so that you can gain a greater understanding of email delivery: whether mail is getting caught by spam filters, bounce rates, open rates, etc. You can also test different versions of the same email to see which one performs best in terms of open rates...

Tags: Drupal, Site building, Planet Drupal
Categories: FLOSS Project Planets

Mediacurrent: Drupal at Dreamforce

Planet Drupal - Thu, 2014-10-23 15:16

It’s been several days since the finale of Dreamforce 2014. With over 100,000 attendees, Dreamforce is one of the world’s largest cloud computing and business conferences.

Categories: FLOSS Project Planets

Enrico Zini: systemd-default-rescue

Planet Debian - Thu, 2014-10-23 15:06
Alternate rescue boot entry with systemd

Since systemd version 215, adding systemd.debug-shell to the kernel command line activates the debug shell on tty9 alongside the normal boot. I like the idea of that, and I'd like to have it in my standard 'rescue' entry in my grub menu.

Unfortunately, by default update-grub does not allow customizing the rescue menu entry options. I have just filed #766530 hoping for that to change.

After testing the patch I proposed for /etc/grub.d/10_linux, I now have this in my /etc/default/grub, with some satisfaction:

GRUB_CMDLINE_LINUX_RECOVERY="systemd.log_target=kmsg systemd.log_level=debug systemd.debug-shell"

Further information:

Thanks to sjoerd and uau on #debian-systemd for their help.

Categories: FLOSS Project Planets

FSF Blogs: I spoke at LibrePlanet and you can too

GNU Planet! - Thu, 2014-10-23 15:02

LibrePlanet 2015's call for sessions is open for ten more days, until Sunday, November 2nd. Submit your proposal now! Email campaigns@fsf.org with questions about the call for sessions.

CC BY SA Bryan Smith

When the call for session proposals for LibrePlanet rolled around last year, I wasn't sure whether to submit. I hadn't spoken at many conferences before, and I wasn't sure whether the topic I wanted to speak on -- open science -- would be a good fit. But when I looked through the conference Web sites from previous years, I saw a lot of diverse topics and enthusiasm for welcoming new speakers.

So I applied, and a few months later the panel I organized spoke to a full room. I encourage you to submit a session proposal for LibrePlanet 2015!

LibrePlanet is a small, casual conference with a friendly atmosphere. That makes it a great place to speak for the first time, or to propose a new topic. If you have questions or would like advice about submitting a proposal, you can ask the FSF Campaigns Team at campaigns@fsf.org. Hope to see you at LibrePlanet 2015!

Categories: FLOSS Project Planets

Drupal core announcements: Drupal Global Sprint Weekend January 17, 2015 and January 18, 2015

Planet Drupal - Thu, 2014-10-23 14:33

Small local sprints everywhere (well, not everywhere, but anywhere) will be held during the weekend of January 17 and 18 2015. Listed alphabetically by continent, country, locality.

This is a wiki page. Please edit.

Africa

  1. ?

Asia

  1. ?

Europe

  1. ?

North America (ordered by country, then state)

  1. ?

South America (ordered by country, then state)

  1. ?

To participate,

  • use "Drupal Sprint Weekend 2015" in the description of your sprint meetup, sprint camp session, mini-sprint, wind-sprint, or all-day sprint, like: "Drupal All-day Sprint in Anywhere Town, IL, USA is part of Drupal Sprint Weekend 2015."
  • add a link to your sprint on this page. The link can be to a website, meetup, event on groups.drupal.org, blog post or whatever is appropriate for your event.
  • link back to this listing of local sprints
  • add an "event" of type "sprint" on groups.drupal.org in a group for your area, to put your sprint on drupical.com and get exposure to people in your area
  • use the hash tag #SprintWeekend on twitter
  • use the tag "SprintWeekend2015" on d.o issues

For resources to help plan your sprint:

Resources for participating in a sprint (needs updating for 2015, but this is a start):

A blurb to add to your session/event description (edit to fit your event):

Everyone is welcome; if you have built a site in Drupal, you can contribute. We will split into groups and work on Drupal core issues. Bring your laptop. For new folks: you can also get a head start by making an account on Drupal.org and getting some contribution tools, and developers can install git before coming and git clone Drupal 8 core.

The curious might want to see the locations from 2014 and 2013.

Categories: FLOSS Project Planets

Lennart Regebro: 59% of maintained packages support Python 3

Planet Python - Thu, 2014-10-23 14:14

I ran some statistics on PyPI:

  • 50377 packages in total,
  • 35293 unmaintained packages,
  • 15084 maintained packages.

Of the maintained packages:

  • 5907 have no Python classifiers,
  • 3679 support only Python 2,
  • 1188 support only Python 3,
  • 4310 support Python 2 and Python 3.

This means:

  • A total of 5498 packages support Python 3,
  • 36% of all maintained packages declare that they support Python 3,
  • 24% of all maintained packages declare that they do NOT support Python 3,
  • and 39% do not declare any Python support at all.

So: of the maintained packages that declare which Python versions they support, 59% support Python 3.

And if you wonder: “Maintained” means at least one version released this year (with files uploaded to PyPI) *or* at least 3 versions released in the last three years.
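For the curious, the classifier bucketing can be sketched like this (a hypothetical helper; the actual script is not shown here):

```python
def python_support(classifiers):
    # Bucket a package by its trove classifiers, e.g.
    # 'Programming Language :: Python :: 3.4'.
    py2 = py3 = False
    for c in classifiers:
        if c.startswith('Programming Language :: Python :: 2'):
            py2 = True
        elif c.startswith('Programming Language :: Python :: 3'):
            py3 = True
    if py2 and py3:
        return 'both'
    if py3:
        return 'py3'
    if py2:
        return 'py2'
    return 'none'
```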


Filed under: python, python3 Tagged: python, python 3
Categories: FLOSS Project Planets