Planet Python


PyCharm: PyCharm 2017.1 Out Now: Faster debugger, new test runners, and more

Fri, 2017-03-24 12:46

PyCharm 2017.1 is out now! Get it now for a much faster debugger, improved Python and JavaScript unit testing, and support for the six library.

  • The Python debugger got forty times faster for Python 3.6 projects, and up to two times faster for older versions of Python
  • We’ve added support for the six compatibility library
  • Unit test runners for Python have been rebuilt from the ground up: you can now run any test configuration with PyCharm
  • Are you a full stack developer? We’ve improved our JavaScript unit testing: gutter icons indicating whether a test passed and support for Jest, Facebook’s JS testing framework (only available in PyCharm Professional edition)
  • Zero-latency typing is now on by default: typing latencies for PyCharm 2017.1 are lower than those for Sublime Text and Emacs
  • Support for native Docker for Mac – no more need to use SOCAT! (only available in PyCharm Professional edition)
  • And more!

Get PyCharm 2017.1 now from our website

Please let us know what you think about PyCharm! You can reach us on Twitter, Facebook, and by leaving a comment on the blog.

PyCharm Team
-The Drive to Develop

Categories: FLOSS Project Planets

Andrew Dalke: ChEMBL bioactivity data

Fri, 2017-03-24 08:00

I almost exclusively use ChEMBL structure files. I download the .sdf files and process them. ChEMBL also supplies bioactivity data, which I've never worked with. Iain Watson suggested I look to it as a source of compound set data, and provided some example SQL queries. This blog post is primarily a set of notes for myself as I experiment with the queries and learn more about what is in the data file.

There is one bit of general advice. If you're going to use the SQLite dump from ChEMBL, make sure that you run "ANALYZE" at least on the tables of interest. This may take a few hours. I'm downloading ChEMBL-22-1 to see if it comes pre-analyzed. If it doesn't, I'll ask them to do so as part of their releases.

For those playing along from home (or the office, or wherever fine SQL database engines may be found), I downloaded the SQLite dump for ChEMBL 21, which is a lovely 2542883 KB (or 2.4 GB) compressed, and 12 GB uncompressed. That link also includes dumps for MySQL, Oracle, and Postgres, as well as schema documentation.

Unpack it the usual way (it takes a while to unpack 12GB), cd into the directory, and open the database using sqlite console:

% tar xf chembl_21_sqlite.tar.gz
% cd chembl_21_sqlite
% sqlite3 chembl_21.db
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite>


The 'compound_structures' table looks interesting. How many structures are there?

sqlite> select count(*) from compound_structures;
1583897

Wow. Just .. wow. That took several minutes to execute. This is a problem I've had before with large databases. SQLite doesn't store the total table size, so the initial count(*) ends up doing a full table scan. This brings in every B-tree node from disk, which requires a lot of random seeks for my poor hard disk made of spinning rust. (Hmm, Crucible says I can get a replacement 500GB SSD for only EUR 168. Hmmm.)

The second time and onwards is just fine, thanks to the power of caching.
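A small sketch of why that first count(*) is expensive, using Python's sqlite3 module on a toy in-memory table. The table name `demo` and its rows are made up for illustration and are not part of ChEMBL. `EXPLAIN QUERY PLAN` reports that SQLite answers the count with a scan instead of reading a stored row total.

```python
import sqlite3

def count_plan(conn, table):
    """Return the EXPLAIN QUERY PLAN detail strings for count(*) on a table."""
    # Table names cannot be bound as parameters; `table` is trusted input here.
    rows = conn.execute(f"EXPLAIN QUERY PLAN SELECT count(*) FROM {table}").fetchall()
    # The last column of each plan row is the human-readable detail text.
    return [row[-1] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, smiles TEXT)")
conn.executemany("INSERT INTO demo (smiles) VALUES (?)",
                 [("CCO",), ("CCN",), ("CCC",)])

plan = count_plan(conn, "demo")
print(plan)  # exact wording varies between SQLite versions
row_count = conn.execute("SELECT count(*) FROM demo").fetchone()[0]
print(row_count)
```

On a three-row table the scan is instant; on a 12 GB file read from a spinning disk the same plan has to touch every page of the table's B-tree.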

What do the structures look like? I decided to show only a few of the smallest structures to keep the results from overflowing the screen:

sqlite> select molregno, standard_inchi, standard_inchi_key, canonical_smiles
   ...>   from compound_structures where length(canonical_smiles) < 10 limit 4;
1813|InChI=1S/C4H11NO/c1-2-3-4-6-5/h2-5H2,1H3|WCVVIGQKJZLJDB-UHFFFAOYSA-N|CCCCON
3838|InChI=1S/C2H4INO/c3-1-2(4)5/h1H2,(H2,4,5)|PGLTVOMIXTUURA-UHFFFAOYSA-N|NC(=O)CI
4092|InChI=1S/C4H6N2/c5-4-6-2-1-3-6/h1-3H2|VEYKJLZUWWNWAL-UHFFFAOYSA-N|N#CN1CCC1
4730|InChI=1S/CH4N2O2/c2-1(4)3-5/h5H,(H3,2,3,4)|VSNHCAURESNICA-UHFFFAOYSA-N|NC(=O)NO

For fun, are there canonical SMILES which are listed multiple times? There are a few, so I decided to narrow it down to those with more than 2 instances. (None occur more than 3 times.)

sqlite> select canonical_smiles, count(*) from compound_structures group by canonical_smiles having count(*) > 2;
CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+]([O-])C(C)C|3
CC(C)[C@@H](C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)[C@H](C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)[S+]([O-])c1nc(c2ccc(F)cc2)c([nH]1)c3ccnc(NC4CCCCC4)c3|3
...

Here are more details about the first output where the same SMILES is used multiple times:

sqlite> select molregno, standard_inchi from compound_structures
   ...>  where canonical_smiles = "CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]";
1144470|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23)
1144471|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23)/t25-/m1/s1
1144472|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23)/t25-/m0/s1

The differences are in the "/t(isotopic:stereo:sp3)", "/m(fixed_:stereo:sp3:inverted)", and "/s(fixed_H:stereo_type=abs)" layers. Got that?

I don't. I used the techniques of the next section to get the molfiles for each structure. The differences are in the bonds between atoms 23/24 (the sulfoxide, represented in charge-separated form) and atoms 23/25 (the methyl on the sulfur). The molfile for the first record has no assigned bond stereochemistry, the second has a down flag for the sulfoxide, and the third has a down flag for the methyl.
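To see which InChI layers differ without reading the strings by eye, here is a minimal sketch that splits an InChI on "/" and compares the pieces. The helper names `inchi_layers` and `differing_layers` are my own, and the parsing assumes each layer prefix occurs at most once, which holds for these three records but not for every InChI.

```python
def inchi_layers(inchi):
    """Split an InChI into a dict keyed by layer prefix (plus version and formula)."""
    body = inchi.split("=", 1)[1]          # drop the leading "InChI"
    parts = body.split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]         # e.g. "t25-" is stored as {"t": "25-"}
    return layers

def differing_layers(a, b):
    """Return the sorted layer keys whose contents differ between two InChIs."""
    la, lb = inchi_layers(a), inchi_layers(b)
    return sorted(k for k in set(la) | set(lb) if la.get(k) != lb.get(k))

base = ("InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)"
        "25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23)")
first = base                   # molregno 1144470
second = base + "/t25-/m1/s1"  # molregno 1144471
third = base + "/t25-/m0/s1"   # molregno 1144472

print(differing_layers(first, second))
print(differing_layers(second, third))
```

The first record simply lacks the stereo layers, while the second and third differ only in the "/m" layer.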

molfile column in compound_structures

There's a "molfile" entry. Does it really include the structure as a raw MDL molfile? Yes, yes it does:

sqlite> select molfile from compound_structures where molregno = 805;
11280714442D 1 1.00000 0.00000 0
8 8 0 0 0 999 V2000
6.0750 -2.5667 0.0000 C 0 0 0 0 0 0 0 0 0
5.3625 -2.9792 0.0000 N 0 0 3 0 0 0 0 0 0
6.7917 -2.9792 0.0000 N 0 0 0 0 0 0 0 0 0
5.3625 -3.8042 0.0000 C 0 0 0 0 0 0 0 0 0
4.6542 -2.5667 0.0000 C 0 0 0 0 0 0 0 0 0
6.0750 -1.7417 0.0000 C 0 0 0 0 0 0 0 0 0
4.6542 -1.7417 0.0000 C 0 0 0 0 0 0 0 0 0
5.3625 -1.3292 0.0000 C 0 0 0 0 0 0 0 0 0
2 1 1 0 0 0
3 1 2 0 0 0
4 2 1 0 0 0
5 2 1 0 0 0
6 1 1 0 0 0
7 8 1 0 0 0
8 6 1 0 0 0
7 5 1 0 0 0
M END

Why did I choose molregno = 805? I looked for a structure with 8 atoms and 8 bonds by searching for the substring "  8  8  0", which is in the counts line. (It's not a perfect solution, but rather a good-enough one.)

sqlite> select molregno from compound_structures where molfile LIKE "%  8  8  0%" limit 1;
805

I bet with a bit of effort I could count the number of rings by using the molfile to get the bond counts and use the number of "."s in the canonical_smiles to get the number of fragments.
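The ring-counting idea can be sketched in a few lines, using the standard relation rings = bonds - atoms + fragments. The function name `ring_count` is made up, the counts line below is a hand-written V2000 example with the usual three-character fields, and the SMILES is one I wrote by hand for an 8-atom, one-ring structure rather than a value copied from ChEMBL.

```python
def ring_count(counts_line, smiles):
    """Estimate the number of rings as bonds - atoms + fragments."""
    # V2000 counts line: columns 1-3 hold the atom count, columns 4-6 the bond count.
    n_atoms = int(counts_line[0:3])
    n_bonds = int(counts_line[3:6])
    # Each "." in a SMILES string separates two disconnected fragments.
    n_fragments = smiles.count(".") + 1
    return n_bonds - n_atoms + n_fragments

# Hypothetical inputs: 8 atoms, 8 bonds, a single fragment.
counts_line = "  8  8  0  0  0  0  0  0  0  0999 V2000"
smiles = "CN1CCCCC1=N"
print(ring_count(counts_line, smiles))
```

This counts the smallest set of ring closures, so a fused bicyclic system counts as two rings.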

compound_properties and molecule_dictionary tables

The compound_properties table stores some molecular properties. I'll get the number of heavy atoms, the number of aromatic rings, and the full molecular weight for structure 805.

sqlite> select heavy_atoms, aromatic_rings, full_mwt from compound_properties where molregno = 805;
8|0|112.17

I've been using "805", which is an internal identifier. What's its public ChEMBL id?

sqlite> select chembl_id from molecule_dictionary where molregno = 805;
CHEMBL266980

What are some of the records with only 1 or 2 atoms?

sqlite> select chembl_id, heavy_atoms from molecule_dictionary, compound_properties
   ...>  where molecule_dictionary.molregno = compound_properties.molregno
   ...>  and heavy_atoms < 3 limit 5;
CHEMBL1098659|1
CHEMBL115849|2
CHEMBL1160819|1
CHEMBL116336|2
CHEMBL116838|2

InChI and heavy atom count for large structures

I showed that some of the SMILES were used for two or three records. What about the InChI string? I started with:

sqlite> select molregno, standard_inchi, count(*) from compound_structures
   ...>   group by standard_inchi having count(*) > 1;
1378059||9

After 10 minutes with no other output, I gave up. Those 9 occurrences have a NULL value, that is:

sqlite> select count(*) from compound_structures where standard_inchi is NULL;
9

I was confused at first because there are SMILES strings (I'll show only the first 40 characters), so there is structure information. The heavy atom count is also NULL:

sqlite> select compound_structures.molregno, heavy_atoms, substr(canonical_smiles, 1, 40)
   ...>  from compound_structures, compound_properties
   ...>  where standard_inchi is NULL and compound_structures.molregno = compound_properties.molregno;
615447||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
615448||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
615449||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
615450||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
615451||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
1053861||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)n2cnc3
1053864||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)n2cnc3
1053865||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)N2C=CC
1378059||CC[C@H](C)[C@H](NC(=O)[C@H](CCCNC(=N)N)N

Then I realized it's because the schema specifies the "heavy_atoms" field as "NUMBER(3,0)". While SQLite ignores that limit, it looks like ChEMBL doesn't try to store a count above 999.
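The claim that SQLite ignores the NUMBER(3,0) limit is easy to check on a toy table. In this sketch only the declared column type is taken from the ChEMBL schema; the table name `props` and the values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same declared type as the heavy_atoms column; everything else is a toy.
conn.execute("CREATE TABLE props (molregno INTEGER, heavy_atoms NUMBER(3,0))")
conn.executemany("INSERT INTO props VALUES (?, ?)",
                 [(1, 8), (2, 1234), (3, None)])

stored = conn.execute(
    "SELECT molregno, heavy_atoms FROM props ORDER BY molregno").fetchall()
print(stored)
```

The four-digit value is stored unchanged, so the NULLs in the real table are most likely a choice made when the data was exported, not something SQLite enforced.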

What I'll do instead is get the molecular formula, which shows that there are over 600 heavy atoms in those structures:

sqlite> select chembl_id, full_molformula
   ...>   from compound_structures, compound_properties, molecule_dictionary
   ...>   where standard_inchi is NULL
   ...>   and compound_structures.molregno = compound_properties.molregno
   ...>   and compound_structures.molregno = molecule_dictionary.molregno;
CHEMBL1077162|C318H381N118O208P29
CHEMBL1077163|C319H383N118O209P29
CHEMBL1077164|C318H382N119O208P29
CHEMBL1077165|C325H387N118O209P29
CHEMBL1631334|C361H574N194O98P24S
CHEMBL1631337|C367H606N172O113P24
CHEMBL1631338|C362H600N180O106P24
CHEMBL2105789|C380H614N112O113S9

Those are some large structures! The reason there are no InChIs for them is that InChI didn't support large molecules until version 1.05, which came out in early 2017. Before then, InChI only supported 1024 atoms, which is normally fine as most compounds are small (hence "small molecule chemistry"). In fact, there aren't any records with a recorded heavy_atoms value above 79:

sqlite> select heavy_atoms, count(*) from compound_properties
   ...>   where heavy_atoms > 70 group by heavy_atoms;
71|364
72|207
73|46
74|29
75|3
76|7
78|2
79|2

How in the world do these large structures have 600+ atoms? Are they peptides? Mmm, no, not all. The first 8 contain a lot of phosphorus atoms. I'm guessing some sort of nucleic acid. The last might be a protein. Perhaps I can get a clue from the chemical name, which is in the compound_records table. Here's an example using the molregno 805 from earlier:

sqlite> select * from compound_records where molregno = 805;
1063|805|14385|14|1-Methyl-piperidin-(2Z)-ylideneamine|1|

Some of the names of the 600+ atom molecules are too long, so I'll limit the output to the first 50 characters of the name:

sqlite> select chembl_id, full_molformula, substr(compound_name, 1, 50)
   ...>   from compound_structures, molecule_dictionary, compound_properties, compound_records
   ...>  where standard_inchi is NULL
   ...>    and compound_structures.molregno = molecule_dictionary.molregno
   ...>    and compound_structures.molregno = compound_properties.molregno
   ...>    and compound_structures.molregno = compound_records.molregno;
CHEMBL1077161|C307H368N116O200P28|{[(2R,3S,4R,5R)-5-(4-amino-2-oxo-1,2-dihydropyrimi
CHEMBL1077162|C318H381N118O208P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo
CHEMBL1077163|C319H383N118O209P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo
CHEMBL1077164|C318H382N119O208P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo
CHEMBL1077165|C325H387N118O209P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo
CHEMBL1631334|C361H574N194O98P24S|HRV-EnteroX
CHEMBL1631337|C367H606N172O113P24|PV-5'term
CHEMBL1631338|C362H600N180O106P24|PV-L4
CHEMBL2105789|C380H614N112O113S9|Mirostipen

That didn't help much, but I could at least do a web search for some of the names. For example, HRV-EnteroX is a PPMO (peptide-conjugated phosphorodiamidate morpholino oligomers), which is where those phosphorus atoms come from.

The names weren't really helpful, and the images at ChEMBL were too small to make sense of the structures, so I looked at them over at PubChem. HRV-EnteroX looks like a 12-mer peptide conjugated to about 25 morpholino oligomers. Mirostipen looks like a peptide. CHEMBL1077161 looks like a nucleic acid strand.
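The "over 600 heavy atoms" figure can be recomputed from the full_molformula strings. This is a minimal sketch with a made-up helper name, `heavy_atom_count`, which assumes a plain formula of element symbols and counts with no charges, isotope labels, or parentheses.

```python
import re

def heavy_atom_count(formula):
    """Count the non-hydrogen atoms in a simple formula such as C318H381N118O208P29."""
    total = 0
    for element, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        count = int(digits) if digits else 1   # a missing number means one atom
        if element != "H":
            total += count
    return total

for formula in ("C318H381N118O208P29", "C361H574N194O98P24S", "C380H614N112O113S9"):
    print(formula, heavy_atom_count(formula))
```

All of these are well under the 999 that a NUMBER(3,0) field could hold, which suggests the heavy atom count was left NULL for some other reason than the field width.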

I don't think there's anything interesting to explore in this direction so I'll move on.

Assay data

I'll take a look at assay data, which I deal with a lot less often than I do structure data. How many assays are there?

sqlite> select count(*) from assays;
1212831

Okay, and how many of them are human assays? For that I need the NCBI taxonomy id. Iain's example code uses 9606, which the NCBI web site tells me is for Homo sapiens. I don't think there's a table in the SQLite data dump with all of the taxonomy ids. The organism_class table says only:

sqlite> select * from organism_class where tax_id = 9606;
7|9606|Eukaryotes|Mammalia|Primates

The assay table "assay_organism" column stores the "[n]ame of the organism for the assay system", with the caution "[m]ay differ from the target organism (e.g., for a human protein expressed in non-human cells, or pathogen-infected human cells)." I'll throw caution to the wind and check that field:

sqlite> select count(*) from assays where assay_organism = "Homo sapiens";
291143
sqlite> select assay_organism, count(*) from assays
   ...>  where assay_tax_id = 9606 group by assay_organism;
|17
Homo sapiens|291135
sqlite> select count(*) from assays where assay_tax_id = 9606 and assay_organism is NULL;
17

It looks like 9606 is indeed for humans.

Assay activities

What sort of assay activities are there?

sqlite> select distinct published_type from activities;
ED50
Transactivation
%
% Cell Death
...
AUC
AUC (0-24h)
AUC (0-4h)
AUC (0-infinity)
...
Change
Change HDL -C
Change MAP
Change TC
...

Okay, quite a few. There appear to be some typos as well:

sqlite> select published_type, count(*) from activities where published_type in ("Activity", "A ctivity",
   ...>   "Ac tivity", "Act ivity", "Acti vity", "Activ ity", "Activi ty", "Activit y", "Activty")
   ...>   group by published_type;
A ctivity|1
Activ ity|2
Activit y|1
Activity|700337
Activty|1

After another 20 minutes of data exploration, I realized that there are two different types. The "published_type" is what the assayer published, while there's also a "standard_type", which looks to be a normalized value by ChEMBL:

sqlite> select published_type, standard_type from activities
   ...>   where published_type in ("A ctivity", "Activ ity", "Activit y", "Activty");
A ctivity|Activity
Activ ity|Activity
Activ ity|Activity
Activit y|Activity
Activty|Activity

There are many ways to publish a report with IC50 data. I'll show only those that end with "IC50".

sqlite> select distinct published_type from activities where published_type like "%IC50";
-Log IC50
-Log IC50/IC50
-logIC50
Average IC50
CC50/IC50
CCIC50
CIC IC50
CIC50
Change in IC50
Cytotoxicity IC50
Decrease in IC50
EIC50
FIC50
Fold IC50
I/IC50
IC50
IC50/IC50
Increase in IC50
Log 1/IC50
Log IC50
MBIC50
MIC50
Mean IC50
RIC50
Ratio CC50/IC50
Ratio CIC95/IC50
Ratio ED50/MIC50
Ratio IC50
Ratio LC50/IC50
Ratio LD50/IC50
Ratio pIC50
Ratio plasma concentration/IC50
Relative ET-A IC50
Relative IC50
TBPS IC50
TC50/IC50
Time above IC50
fIC50
log1/IC50
logIC50
pIC50
pMIC50
rIC50

The "p" prefix, as in "pIC50", is shorthand for "-log", so "-Log IC50", "Log 1/IC50", and "pIC50" are almost certainly the same units. Let's see:

sqlite> select distinct published_type, standard_type from activities
   ...>   where published_type in ("-Log IC50", "Log 1/IC50", "pIC50");
-Log IC50|IC50
-Log IC50|pIC50
-Log IC50|-Log IC50
Log 1/IC50|IC50
Log 1/IC50|Log 1/IC50
pIC50|IC50
pIC50|pIC50
pIC50|Log IC50
pIC50|-Log IC50

Well color me confused. Oh! There's a "standard_flag", which "[s]hows whether the standardised columns have been curated/set (1) or just default to the published data (0)." Perhaps that will help enlighten me.

sqlite> select distinct published_type, standard_flag, standard_type from activities
   ...>  where published_type in ("-Log IC50", "Log 1/IC50", "pIC50");
-Log IC50|1|IC50
-Log IC50|1|pIC50
-Log IC50|0|-Log IC50
Log 1/IC50|1|IC50
Log 1/IC50|0|Log 1/IC50
pIC50|1|IC50
pIC50|1|pIC50
pIC50|0|Log IC50
pIC50|1|-Log IC50
pIC50|0|pIC50

Nope, I still don't understand what's going on. I'll assume it's all tied to the complexities of data curation. For now, I'll assume that the data set is nice and clean.
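For reference, the relation between an IC50 and its "p" form can be written out directly. This sketch assumes the usual convention that pIC50 is the negative base-10 logarithm of the IC50 expressed in molar. The function names are made up, and the code says nothing about how ChEMBL itself standardizes the values.

```python
import math

def pic50_from_nm(ic50_nm):
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in molar)."""
    return -math.log10(ic50_nm * 1e-9)

def nm_from_pic50(pic50):
    """Convert a pIC50 back to an IC50 in nanomolar."""
    return 10 ** (-pic50) * 1e9

print(pic50_from_nm(1000))   # 1 micromolar
print(nm_from_pic50(9))      # back to nanomolar
```

Under this convention "-Log IC50", "Log 1/IC50", and "pIC50" all describe the same number, and a larger value means a more potent compound.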

IC50 types

Let's look at the "IC50" values only. How do the "published_type" and "standard_type" columns compare to each other?

sqlite> select published_type, standard_type, count(*) from activities
   ...>   where published_type = "IC50" group by standard_type;
IC50|% Max Response|21
IC50|Change|2
IC50|Control|1
IC50|Electrophysiological activity|6
IC50|Fold Inc IC50|1
IC50|Fold change IC50|12
IC50|IC50|1526202
IC50|IC50 ratio|1
IC50|Inhibition|12
IC50|Log IC50|20
IC50|Ratio IC50|40
IC50|SI|4
IC50|T/C|1
sqlite> select published_type, standard_type, count(*) from activities
   ...>   where standard_type = "IC50" group by published_type;
-Log IC50|IC50|1736
-Log IC50(M)|IC50|28
-Log IC50(nM)|IC50|39
-logIC50|IC50|84
3.3|IC50|1
Absolute IC50 (CHOP)|IC50|940
Absolute IC50 (XBP1)|IC50|940
Average IC50|IC50|34
CIC50|IC50|6
I 50|IC50|202
I-50|IC50|25
I50|IC50|6059
IC50|IC50|1526202
IC50 |IC50|52
IC50 app|IC50|39
IC50 max|IC50|90
IC50 ratio|IC50|2
IC50(app)|IC50|457
IC50_Mean|IC50|12272
IC50_uM|IC50|20
ID50|IC50|3
Log 1/I50|IC50|280
Log 1/IC50|IC50|589
Log 1/IC50(nM)|IC50|88
Log IC50|IC50|7013
Log IC50(M)|IC50|3599
Log IC50(nM)|IC50|77
Log IC50(uM)|IC50|28
Mean IC50|IC50|1
NA|IC50|5
NT|IC50|20
log(1/IC50)|IC50|1016
pI50|IC50|2386
pIC50|IC50|43031
pIC50(mM)|IC50|71
pIC50(nM)|IC50|107
pIC50(uM)|IC50|419

Yeah, I'm going to throw my hands up here, declare "I'm a programmer, Jim, not a bioactivity specialist", and simply use the published_type of IC50.

IC50 activity values

How are the IC50 values measured? Here too I need to choose between "published_units" and "standard_units". A quick look at the two shows that the standard_units are less diverse.

sqlite> select standard_units, count(*) from activities where published_type = "IC50"
   ...>   group by standard_units;
|167556
%|148
% conc|70
10'-11uM|1
10'-4umol/L|1
M ml-1|15
equiv|64
fg ml-1|1
g/ha|40
kJ m-2|20
liposomes ml-1|5
mMequiv|38
mg kg-1|248
mg.min/m3|4
mg/kg/day|1
milliequivalent|22
min|9
ml|3
mmol/Kg|10
mol|6
molar ratio|198
nA|6
nM|1296169
nM g-1|1
nM kg-1|4
nM unit-1|7
nmol/Kg|1
nmol/mg|5
nmol/min|1
ppm|208
ppm g dm^-3|7
uL|7
uM hr|1
uM tube-1|9
uM well-1|52
uM-1|25
uM-1 s-1|1
ucm|6
ucm s-1|2
ug|168
ug cm-2|1
ug g-1|2
ug well-1|12
ug.mL-1|61139
ug/g|16
umol kg-1|3|8
umol/dm3|2

"Less diverse", but still diverse. By far the most common is "nM" for "nanomolar", which is the only unit I expected. How many IC50s have an activity better than 1 micromolar, which is 1000 nM?

sqlite> select count(*) from activities where published_type = "IC50"
   ...>    and standard_value < 1000 and standard_units = "nM";
483041

That's 483041/1212831 = 40% of the number of assays in the data dump, though the comparison is loose because a single assay can have many activity records.

How many of the IC50s are in humans? For that I need a join with the assays table using the assay_id:

sqlite> select count(*) from activities, assays
   ...>  where published_type = "IC50"
   ...>    and standard_value < 1000 and standard_units = "nM"
   ...>    and activities.assay_id = assays.assay_id
   ...>    and assay_tax_id = 9606;
240916

About 1/2 of them are in humans.
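The same filter-and-join can be run from Python with bound parameters. The sketch below uses a toy in-memory database containing only the columns the query touches. The rows are invented, the function name `count_potent_ic50` is hypothetical, and the explicit JOIN ... ON form is my choice; it is equivalent to the comma-style join used above.

```python
import sqlite3

def count_potent_ic50(conn, tax_id, max_nm=1000):
    """Count IC50 activities below max_nm (in nM) whose assay has the given tax id."""
    sql = """
        SELECT count(*)
          FROM activities
          JOIN assays ON activities.assay_id = assays.assay_id
         WHERE activities.published_type = 'IC50'
           AND activities.standard_units = 'nM'
           AND activities.standard_value < ?
           AND assays.assay_tax_id = ?
    """
    return conn.execute(sql, (max_nm, tax_id)).fetchone()[0]

# Toy data; 10116 simply stands in for "some non-human taxonomy id".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, assay_tax_id INTEGER);
    CREATE TABLE activities (assay_id INTEGER, published_type TEXT,
                             standard_value REAL, standard_units TEXT);
    INSERT INTO assays VALUES (1, 9606), (2, 10116);
    INSERT INTO activities VALUES
        (1, 'IC50', 50.0, 'nM'),
        (1, 'IC50', 5000.0, 'nM'),
        (1, 'Ki', 10.0, 'nM'),
        (2, 'IC50', 20.0, 'nM');
""")
print(count_potent_ic50(conn, 9606))
```

Against the real dump, the connection would be opened on chembl_21.db instead of ":memory:".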

Assay target type from target_dictionary


Remember earlier when I threw caution to the wind? How many of the assays are actually against human targets? I can join on the target id "tid" to compare the taxon id in the target vs. the taxon id in the assay:

sqlite> select count(*) from assays, target_dictionary
   ...>  where assays.tid = target_dictionary.tid
   ...>    and target_dictionary.tax_id = 9606;
301810
sqlite> select count(*) from assays, target_dictionary
   ...>  where assays.tid = target_dictionary.tid
   ...>    and assays.assay_tax_id = 9606;

Compare assay organisms with target organism

What are some of the non-human assay organisms where the target is humans?

sqlite> select distinct assay_organism from assays, target_dictionary
   ...>  where assays.tid = target_dictionary.tid
   ...>    and assays.assay_tax_id != 9606
   ...>    and target_dictionary.tax_id = 9606
   ...>  limit 10;
rice
Saccharomyces cerevisiae
Oryza sativa
Rattus norvegicus
Sus scrofa
Cavia porcellus
Oryctolagus cuniculus
Canis lupus familiaris
Proteus vulgaris
Salmonella enterica subsp. enterica serovar Typhi

Compounds tested against a target name

I'm interested in the "SINGLE PROTEIN" target names in humans. The target name is a manually curated field.

sqlite> select distinct pref_name from target_dictionary where tax_id = 9606 limit 5;
Maltase-glucoamylase
Sulfonylurea receptor 2
Phosphodiesterase 5A
Voltage-gated T-type calcium channel alpha-1H subunit
Dihydrofolate reductase

What are structures used in "Dihydrofolate reductase" assays? This requires three table joins, one on 'tid' to go from target_dictionary to assays, another on 'assay_id' to get to the activity, and another on 'molregno' to go from the activity to molecule_dictionary so I can get the compound's chembl_id. (To make it more interesting, three of the tables have a chembl_id column.)

sqlite> select distinct molecule_dictionary.chembl_id
   ...>   from target_dictionary, assays, activities, molecule_dictionary
   ...>  where target_dictionary.pref_name = "Dihydrofolate reductase"
   ...>    and target_dictionary.tid = assays.tid
   ...>    and assays.assay_id = activities.assay_id
   ...>    and activities.molregno = molecule_dictionary.molregno
   ...>  limit 10;
CHEMBL1679
CHEMBL429694
CHEMBL106699
CHEMBL422095
CHEMBL1161155
CHEMBL350033
CHEMBL34259
CHEMBL56282
CHEMBL173175
CHEMBL173901
sqlite> select count(distinct molecule_dictionary.chembl_id)
   ...>    from target_dictionary, assays, activities, molecule_dictionary
   ...>   where target_dictionary.pref_name = "Dihydrofolate reductase"
   ...>     and target_dictionary.tid = assays.tid
   ...>     and assays.assay_id = activities.assay_id
   ...>     and activities.molregno = molecule_dictionary.molregno;
3466

There are 3466 of these, including non-human assays. I'll limit it to human ones only:

sqlite> select count(distinct molecule_dictionary.chembl_id)
   ...>    from target_dictionary, assays, activities, molecule_dictionary
   ...>   where target_dictionary.pref_name = "Dihydrofolate reductase"
   ...>     and target_dictionary.tax_id = 9606
   ...>     and target_dictionary.tid = assays.tid
   ...>     and assays.assay_id = activities.assay_id
   ...>     and activities.molregno = molecule_dictionary.molregno;
1386

I'll further limit it to those with an IC50 of under 1 micromolar:

sqlite> .timer on
sqlite> select count(distinct molecule_dictionary.chembl_id)
   ...>    from target_dictionary, assays, activities, molecule_dictionary
   ...>   where target_dictionary.pref_name = "Dihydrofolate reductase"
   ...>     and target_dictionary.tax_id = 9606
   ...>     and target_dictionary.tid = assays.tid
   ...>     and assays.assay_id = activities.assay_id
   ...>     and activities.published_type = "IC50"
   ...>     and activities.standard_units = "nM"
   ...>     and activities.standard_value < 1000
   ...>     and activities.molregno = molecule_dictionary.molregno;
255
Run Time: real 174.561 user 18.073715 sys 23.285346

I turned on the timer to show that the query took about 3 minutes! I repeated it to ensure that it wasn't a simple cache issue. Still about 3 minutes.
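The ".timer on" command belongs to the sqlite3 shell. When running queries from Python, a rough stand-in is to wrap the call with time.perf_counter. The helper name `timed_query` and the toy table are my own, and this measures wall-clock time only, not the user and sys times that the shell reports.

```python
import sqlite3
import time

def timed_query(conn, sql, params=()):
    """Run a query and return (rows, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()   # fetchall forces full evaluation
    elapsed = time.perf_counter() - start
    return rows, elapsed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

rows, elapsed = timed_query(conn, "SELECT count(*) FROM t WHERE x < ?", (100,))
print(rows[0][0], f"{elapsed:.6f} s")
```

Running the same call twice in a row gives a quick cold-versus-warm comparison like the one described above.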

ANALYZE the tables

The earlier query, without the activity filter, took 5.7 seconds when the data wasn't cached, and 0.017 seconds when cached. It found 1386 matches. The new query takes almost 3 minutes more to filter those 1386 matches down to 255. That should not happen.

This is a strong indication that the query planner used the wrong plan. I've had this happen before. My solution then was to "ANALYZE" the tables, which "gathers statistics about tables and indices and stores the collected information in internal tables of the database where the query optimizer can access the information and use it to help make better query planning choices."

It can take a while, so I limited it to the tables of interest.

sqlite> analyze target_dictionary;
Run Time: real 0.212 user 0.024173 sys 0.016268
sqlite> analyze assays;
Run Time: real 248.184 user 5.890109 sys 4.793236
sqlite> analyze activities;
Run Time: real 6742.390 user 97.862790 sys 129.854073
sqlite> analyze molecule_dictionary;
Run Time: real 33.879 user 2.195662 sys 2.043848

Yes, it took almost 2 hours to analyze the activities table. But it was worth it from a pure performance view. I ran the above code twice, with this pattern:

% sudo purge  # clear the filesystem cache
% sqlite3 chembl_21.db  # start SQLite
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> .... previous query, with filter for IC50 < 1uM ...
255
Run Time: real 8.595 user 0.038847 sys 0.141945
sqlite> .... repeat query using a warm cache
255
Run Time: real 0.009 user 0.005255 sys 0.003653

Nice! Now I only need to do about 60 such queries to justify the overall analysis time.
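To see what ANALYZE actually leaves behind, here is a small sketch on a toy table. The table borrows the name `activities`, but its two columns, the index name `idx_act_assay`, and the rows are all invented. After ANALYZE, the statistics that the query planner reads are visible in the sqlite_stat1 table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activities (assay_id INTEGER, standard_value REAL)")
conn.execute("CREATE INDEX idx_act_assay ON activities (assay_id)")
conn.executemany("INSERT INTO activities VALUES (?, ?)",
                 [(i % 10, float(i)) for i in range(1000)])

# Gather statistics for this one table, as done above for the tables of interest.
conn.execute("ANALYZE activities")

stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)
```

The stat column summarizes how many rows the index covers and how selective it is, which is the information the planner was missing when it picked the 3 minute plan.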


EuroPython: EuroPython 2017: Get ready for EuroPython Call for Proposals

Fri, 2017-03-24 05:38

Thinking of contributing to EuroPython? Starting March 27th, you can submit a proposal on any aspect of Python: programming from novice to advanced levels, applications and frameworks, or how you have been involved in introducing Python into your organization.

We offer a variety of contribution formats that you can present at EuroPython: from regular talks to panel discussions, from trainings to posters. If you have ideas to promote real-time human-to-human interaction or want to run a helpdesk yourself to answer other people’s Python questions, this is your chance.

Read about the different opportunities on our website and start drafting your ideas. The Call for Proposals opens in just 3 days!


EuroPython 2017 Team

EuroPython Society


Gocept Weblog: Sprinting to push Zope to the Python 3 wonderland

Fri, 2017-03-24 05:37

Earlier this year there was a sprint in Innsbruck, Austria. We made progress in porting Zope to Python 3 by working on RestrictedPython. After this sprint RestrictedPython no longer seems to be a blocker to port the parts of Zope which rely on RestrictedPython to Python 3.

See the full sprint report on the website.

We will work further on pushing Zope towards the Python 3 wonderland at the Zope 2 Resurrection Sprint in Halle/Saale, Germany at gocept in the first week of May 2017. You are welcome to join us on site or remotely.

Photo copyright: Christine Baumgartner


Catalin George Festila: Take weather data with pyowm from openweathermap.

Fri, 2017-03-24 02:53
This tutorial shows you how to download and install the pyowm Python module.
One of the great things about this Python module is that it lets you get data from the openweathermap website (you need to have an account).
PyOWM runs on Python 2.7 and Python 3.2+, and integrates with Django 1.10+ models.
All documentation can be found here.

The install is simple with pip, Python 2.7 and Fedora 25.
 [root@localhost mythcat]# pip install pyowm
Collecting pyowm
Downloading pyowm-2.6.1.tar.gz (3.6MB)
100% |████████████████████████████████| 3.7MB 388kB/s
Building wheels for collected packages: pyowm
Running bdist_wheel for pyowm ... done
Stored in directory: /root/.cache/pip/wheels/9a/91/17/bb120c765f08df77645cf70a16aa372d5a297f4ae2be749e81
Successfully built pyowm
Installing collected packages: pyowm
Successfully installed pyowm-2.6.1
The source code is very simple: just connect with an API key and print the data.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pyowm

print " Have a account to and use with api key free or pro"
print " owm = pyowm.OWM(API_key='your-API-key', subscription_type='pro')"

owm = pyowm.OWM("327407589df060c7f825b63ec1d9a096")
forecast = owm.daily_forecast("Falticeni,ro")
tomorrow = pyowm.timeutils.tomorrow()

observation = owm.weather_at_place('Falticeni,ro')
w = observation.get_weather()
print (w)
print " Weather details"
print " =============== "

print " Get cloud coverage"
print w.get_clouds()
print " ----------------"
print " Get rain volume"
print w.get_rain()
print " ----------------"
print " Get snow volume"
print w.get_snow()

print " Get wind degree and speed"
print w.get_wind()
print " ----------------"
print " Get humidity percentage"
print w.get_humidity()
print " ----------------"
print " Get atmospheric pressure"
print w.get_pressure()
print " ----------------"
print " Get temperature in Kelvin degs"
print w.get_temperature()
print " ----------------"
print " Get temperature in Celsius degs"
print w.get_temperature(unit='celsius')
print " ----------------"
print " Get temperature in Fahrenheit degs"
print w.get_temperature('fahrenheit')
print " ----------------"
print " Get weather short status"
print w.get_status()
print " ----------------"
print " Get detailed weather status"
print w.get_detailed_status()
print " ----------------"
print " Get OWM weather condition code"
print w.get_weather_code()
print " ----------------"
print " Get weather-related icon name"
print w.get_weather_icon_name()
print " ----------------"
print " Sunrise time (ISO 8601)"
print w.get_sunrise_time('iso')
print " Sunrise time (GMT UNIXtime)"
print w.get_sunrise_time()
print " ----------------"
print " Sunset time (ISO 8601)"
print w.get_sunset_time('iso')
print " Sunset time (GMT UNIXtime)"
print w.get_sunset_time()
print " ----------------"
print " Search current weather observations in the surroundings of"
print " Latitude and longitude coordinates for Fălticeni, Romania:"
observation_list = owm.weather_around_coords(47.46, 26.30)
Let's see the result of running the Python script for one random location:
 [root@localhost mythcat]# python
Have a account to and use with api key free or pro
owm = pyowm.OWM(API_key='your-API-key', subscription_type='pro')

Weather details
Get cloud coverage
Get rain volume
Get snow volume
Get wind degree and speed
{u'speed': 5.7, u'deg': 340}
Get humidity percentage
Get atmospheric pressure
{'press': 1021, 'sea_level': None}
Get temperature in Kelvin degs
{'temp_max': 287.15, 'temp_kf': None, 'temp': 287.15, 'temp_min': 287.15}
Get temperature in Celsius degs
{'temp_max': 14.0, 'temp_kf': None, 'temp': 14.0, 'temp_min': 14.0}
Get temperature in Fahrenheit degs
{'temp_max': 57.2, 'temp_kf': None, 'temp': 57.2, 'temp_min': 57.2}
Get weather short status
Get detailed weather status
few clouds
Get OWM weather condition code
Get weather-related icon name
Sunrise time (ISO 8601)
2017-03-24 04:08:33+00
Sunrise time (GMT UNIXtime)
Sunset time (ISO 8601)
2017-03-24 16:33:59+00
Sunset time (GMT UNIXtime)
Search current weather observations in the surroundings of
Latitude and longitude coordinates for Fălticeni, Romania:
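The three temperature dictionaries in the output are the same reading in different units. As a quick sanity check, here is the plain arithmetic that relates them. This is not PyOWM's own code, and the function names are made up.

```python
def kelvin_to_celsius(kelvin):
    """Convert Kelvin to degrees Celsius, rounded to two decimals."""
    return round(kelvin - 273.15, 2)

def kelvin_to_fahrenheit(kelvin):
    """Convert Kelvin to degrees Fahrenheit, rounded to two decimals."""
    return round((kelvin - 273.15) * 9 / 5 + 32, 2)

temp_k = 287.15   # the 'temp' value printed by w.get_temperature()
print(kelvin_to_celsius(temp_k))
print(kelvin_to_fahrenheit(temp_k))
```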

Vasudev Ram: Analysing that Python code snippet

Thu, 2017-03-23 20:49
By Vasudev Ram

Hi readers,

Some days ago I had written this post:

Analyse this Python code snippet

in which I had shown a snippet of Python code (run in the Python shell), and said:

"Analyse the snippet of Python code below. See what you make of it. I will discuss it in my next post."

I am a few days late in discussing it; sorry about that.

Here is the analysis:

First, here's the snippet again, for reference:
>>> a = 1
>>> lis = [a, 2 ]
>>> lis
[1, 2]
>>> lis = [a, 2 ,
... "abc", False ]
>>> lis
[1, 2, 'abc', False]
>>> a
1
>>> b = 3
>>> lis
[1, 2, 'abc', False]
>>> a = b
>>> a
3
>>> lis
[1, 2, 'abc', False]
>>> lis = [a, 2 ]
>>> lis
[3, 2]

The potential for confusion (at least, as I said, for newbie Pythonistas) lies in these apparent points:

The variable a is set to 1.
Then it is put into the list lis, along with the constant 2.
Then lis is changed to be [a, 2, "abc", False].
One might now think that the variable a is stored in the list lis.
The next line prints its value, which shows it is 1.
All fine so far.
Then b is set to 3.
Then a is set to b, i.e. to the value of b.
So now a is 3.
But when we print lis again, it still shows 1 for the first item, not 3, as some might expect (since a is now set to 3).
Only when we run the next line:
lis = [a, 2]
and then print lis again, do we see that the first item in lis is now 3.

This has to do with the concept of naming and binding in Python.

When a Python statement like:
a = 1
is run, naming and binding happens. The expression on the right of the equals sign (the assignment operator) is evaluated first, which results in a value (a Python object [1]) of some kind. The name on the left is then created, if it does not exist yet, and bound to that object. In this case it is the int object with value 1.

[1] Almost everything in Python is an object, like almost everything in Unix is a file. [Conditions apply :)]

When that name, a, is used in an expression, Python looks up the value of the object that the name is bound to, and uses that value in the expression, in place of the name.

So when the name a was used inside any of the lists that were bound to the name lis, it was actually the value bound to the name a that was used instead. So, the first time it was 1, so the first item of the list became 1, and stayed as 1 until another binding of some other (list) object to the name lis was done.

But by this time, the name a had been rebound to another object, the int 3, the same one that name b had been earlier bound to just before. So the next time that the name lis was bound to a list, that list now included the value of the current object that name a was now bound to, which was 3.

This is the reason why the code snippet works as it does.
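The same behaviour can be checked in a script rather than in the shell. The following is a minimal sketch in Python 3 (the snippets in this post were run on Python 2); the name old_lis is only introduced here so that the first list stays reachable after lis is rebound.

```python
# A list stores references to objects, not to the names used to build it.
a = 1
lis = [a, 2]          # first slot refers to the int object 1
old_lis = lis         # hypothetical extra name, keeps the first list around

b = 3
a = b                 # rebinds the name a; no existing list is modified

lis = [a, 2]          # a is looked up again, so the new list starts with 3

print(old_lis)        # the list built while a was bound to 1
print(lis)            # the list built after a was rebound
print(a is b)         # both names refer to the same int object
```

Rebinding a never touches old_lis, because the list holds the object that a referred to at the time the list was built.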

On a related note (also about Python language features, syntax and semantics), I was playing around with the pprint module (Python's pretty-printer) and the Python is operator, and came up with this other snippet:

>>> import pprint
>>> lis = []
>>> for i in range(10):
...     lis.append(lis)
...
>>> print lis
[[...], [...], [...], [...], [...], [...], [...], [...], [...], [...]]

>>> pprint.pprint(lis)
[<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>]

>>> len(lis)
10

>>> lis is lis[0]
True

>>> lis is lis[0] is lis[0][0]
True

>>> lis is lis[0] is lis[0][0] is lis[0][0][0]
True

in which I created a list, appended it to itself, and then used pprint.pprint on it. Also used the Python is operator between the list and its 0th item, recursively, and was interested to see that the is operator can be used in a chain. I need to look that up (pun intended).
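For reference, Python treats is like the other comparison operators, so it can be chained: x is y is z is evaluated as (x is y) and (y is z). A small Python 3 sketch of that rule, using a self-containing list like the one above (the variable names chained, expanded and grouped are just for illustration):

```python
lis = []
lis.append(lis)       # the list now holds a reference to itself

# Chained form and its documented expansion give the same answer.
chained = lis is lis[0] is lis[0][0]
expanded = (lis is lis[0]) and (lis[0] is lis[0][0])

# Explicit grouping is a different expression: it compares the bool
# result of the first test with the list, which is never the same object.
grouped = (lis is lis[0]) is lis[0][0]

print(chained, expanded, grouped)
```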


- Vasudev Ram - Online Python training and consulting


Thomas Guest: From bytes to strings in Python and back again

Thu, 2017-03-23 20:00

Low level languages like C have little opinion about what goes in a string, which is simply a null-terminated sequence of bytes. Those bytes could be ASCII or UTF-8 encoded text, or they could be raw data — object code, for example. It’s quite possible and legal to have a C string with mixed content.

char const * mixed =
    "EURO SIGN "          // ASCII
    "UTF-8 \xE2\x82\xAC " // UTF-8 encoded EURO SIGN
    "Latin-9 \xA4";       // Latin-9 encoded EURO SIGN

This might seem indisciplined and risky but it can be useful. Environment variables are notionally text but actually C strings, for example, meaning they can hold whatever data you want. Similarly filenames and command line parameters are only loosely text.

A higher level language like Python makes a strict distinction between bytes and strings. Bytes objects contain raw data — a sequence of octets — whereas strings are Unicode sequences. Conversion between the two types is explicit: you encode a string to get bytes, specifying an encoding (which defaults to UTF-8); and you decode bytes to get a string. Clients of these functions should be aware that such conversions may fail, and should consider how failures are handled.
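As a minimal sketch of that explicit conversion and of one way to handle a failure: the helper decode_or_none below is a hypothetical name introduced for illustration, not something from the standard library or from the application described here.

```python
def decode_or_none(data, encoding="utf-8"):
    """Decode bytes to str, returning None instead of raising on bad input."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

text = "EURO SIGN \u20ac"

utf8 = text.encode()                  # str -> bytes, UTF-8 by default
latin9 = text.encode("iso-8859-15")   # same text, different bytes

print(utf8)                                   # euro sign takes three bytes
print(latin9)                                 # euro sign is the single byte 0xA4
print(decode_or_none(utf8))                   # round-trips to the original text
print(decode_or_none(latin9))                 # None: 0xA4 is not valid UTF-8 here
print(decode_or_none(latin9, "iso-8859-15"))  # correct codec recovers the text
```

Returning None is only one possible policy; a caller could equally log the error, try another codec, or fall back to an error handler such as 'replace'.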

Simply put, a string in Python is a valid Unicode sequence. Real world text data may not be. Programmers need to take charge of reconciling any discrepancies.

We faced such problems recently at work. We’re in the business of extracting meaning from clinical narratives — text data stored on medical records systems in hospitals, for example. These documents may well have passed through a variety of systems. They may be unclear about their text encoding. They may not be encoded as they claim. So what? They can and do contain abbreviations, misspellings, jargon and colloquialisms. Refining the signal from such noise is our core business: if we can correctly interpret positional and temporal aspects of a sentence such as:

Previous fracture of left neck of femur

then we can surely deal with text which claims to be UTF-8 encoded but isn’t really.

Our application stack is server-based: a REST API to a Python application handles document ingest; lower down, a C++ engine does the actual document processing. The problem we faced was supporting a modern API capable of handling real world data.

It’s both undesirable and unnecessary to require clients to clean their text before submitting it. We want to make the ingest direct and idiomatic. Also, we shouldn’t penalise clients whose data is clean. Thus document upload is an HTTP POST request, and the document content is a JSON string — rather than, say, base64 encoded binary data. Our server, however, will be permissive about the contents of this string.

So far so good. Postel’s prescription advises:

Be liberal in what you accept, and conservative in what you send.

This would suggest accepting messy text data but presenting it in a cleaned up form. In our case, we do normalise the input data — a process which includes detecting and standardising date/time information, expanding abbreviations, fixing typos and so on — but this normalised form links back to a faithful copy of the original data. What gets presented to the user is their own text annotated with our findings. That is, we subscribe to a more primitive prescription than Postel’s:

Garbage in, garbage out

with the caveat that the garbage shouldn’t be damaged in transit.

Happily, there is a simple way to pass dodgy strings through Python. It’s used in the standard library to handle text data which isn’t guaranteed to be clean — those environment variables, command line parameters, and filenames for example.

The surrogateescape error handler smuggles non-decodable bytes into the (Unicode) Python string in such a way that the original bytes can be recovered on encode, as described in PEP 383:

On POSIX systems, Python currently applies the locale’s encoding to convert the byte data to Unicode, failing for characters that cannot be decoded. With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF.

This workaround is possible because Unicode surrogates are intended for use in pairs. Quoting the Unicode specification, they “have no interpretation on their own”. The lone trailing surrogate code — the half-a-pair — can only be the result of a surrogateescape error handler being invoked, and the original bytes can be recovered by using the same error handler on encode.
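Because the escaped bytes land in the fixed range U+DC80..U+DCFF, a program can also find out which bytes were smuggled through. The sketch below assumes only what PEP 383 states (byte value b maps to code point 0xDC00 + b); the function name escaped_bytes is a hypothetical helper, not part of the standard library.

```python
def escaped_bytes(text):
    """Return the raw bytes that surrogateescape smuggled into text."""
    return bytes(
        ord(ch) - 0xDC00              # undo the 0xDC00 + byte mapping
        for ch in text
        if 0xDC80 <= ord(ch) <= 0xDCFF
    )

raw = b"UTF-8 \xe2\x82\xac Latin-9 \xa4"
text = raw.decode("utf-8", errors="surrogateescape")

print(escaped_bytes(text))            # only the undecodable byte is reported
print(text.encode("utf-8", errors="surrogateescape") == raw)
```

A check like this lets the Python layer flag documents that were not clean UTF-8 while still passing the original bytes on unchanged.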

In conclusion, text data is handled differently in C++ and Python, posing a problem for layered applications. The surrogateescape error handler provides a standard and robust way of closing the gap.

Unicode Surrogate Pairs

Code Listing

>>> mixed = b"EURO SIGN \xE2\x82\xAC \xA4"
>>> mixed
b'EURO SIGN \xe2\x82\xac \xa4'
>>> mixed.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 14: invalid start byte
>>> help(mixed.decode)
Help on built-in function decode:

decode(encoding='utf-8', errors='strict') method of builtins.bytes instance
    Decode the bytes using the codec registered for encoding.

    encoding
      The encoding with which to decode the bytes.
    errors
      The error handling scheme to use for the handling of decoding errors.
      The default is 'strict' meaning that decoding errors raise a
      UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
      as well as any other name registered with codecs.register_error that
      can handle UnicodeDecodeErrors.

>>> mixed.decode(errors='surrogateescape')
'EURO SIGN € \udca4'
>>> s = mixed.decode(errors='surrogateescape')
>>> s.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udca4' in position 12: surrogates not allowed
>>> s.encode(errors='surrogateescape')
b'EURO SIGN \xe2\x82\xac \xa4'

Carl Chenet: Feed2tweet 1.0, tool to post RSS feeds to Twitter, released

Thu, 2017-03-23 19:00

Feed2tweet 1.0, a self-hosted Python app to automatically post RSS feeds to the Twitter social network, was released on March 23rd, 2017.

The main new feature of this release lets you create filters for each RSS feed; previously you could only define global filters. Starting from this release, Feed2tweet is also able to use syslog, a feature contributed by Antoine Beaupré.

What’s the purpose of Feed2tweet?

Some online services offer to convert your RSS entries into Twitter posts. These services are usually unreliable and slow, and they don’t respect your privacy. Feed2tweet is a self-hosted Python app; the source code is easy to read, and you can enjoy the official documentation online with lots of examples.

Twitter Out Of The Browser

Have a look at my Github account for my other Twitter automation tools:

  • Retweet: retweets all tweets (or only those matching some filters) from one Twitter account to another to spread content.
  • db2twitter: gets data from a SQL database (several are supported), builds tweets and sends them to Twitter.
  • Twitterwatch: monitors the activity of your Twitter timeline and warns you if no new tweet appears.

What about you? Do you use tools to automate the management of your Twitter account? Feel free to give me feedback in the comments below.

… and finally

You can help Feed2tweet by donating anything through Liberapay (also possible with cryptocurrencies). That’s a big motivation factor.


NumFOCUS: PyData Atlanta Meetup Celebrates 1 Year and over 1,000 members

Thu, 2017-03-23 17:23
PyData Atlanta holds a meetup at MailChimp, where Jim Crozier spoke about analyzing NFL data with PySpark.

Atlanta tells a new story about data

by Rob Clewley

In late 2015, the three of us (Tony Fast, Neel Shivdasani, and myself) had been regularly nerding out about data over beers and becoming fast friends. We were eager to see Atlanta's data community become more welcoming and encouraging towards beginners, self-starters, and generalists. We were about to find out that we were not alone.

We had met at local data science-related events earlier in the year and had discovered that we had lots of opinions—and weren’t afraid to advocate for them. But we also found that we listened to reason (data-driven learning!), appreciated the art in doing good science, and cared about people and the community. Open science, open data, free-and-open-source software, and creative forms of technical communication and learning were all recurring themes in our conversations. We also all agreed that Python is a great language for working with data.

Invitations were extended to like-minded friends, and the informal hangout was soon known as “Data Beers”. The consistent good buzz that Data Beers generated helped us realize an opportunity to contribute more widely to the Atlanta community. At the time, Atlanta was beginning its emergence as a new hub in the tech world and startup culture.

Some of the existing data-oriented meetups around Atlanta have a more formal business atmosphere, or are highly focused on specific tools or tech opinions. Such environments seem to intimidate newcomers and those less formally educated in math or computer science. This inspired us to take a new perspective through an informal and eclectic approach. So, in January 2016, with the support of not-for-profit organization NumFOCUS, we set up the Atlanta chapter of PyData.

The mission of NumFOCUS is to promote sustainable high-level programming languages, open code development, and reproducible scientific research. NumFOCUS sponsors PyData conferences and local meetups internationally. The PyData community gathers to discuss how best to apply tools using Python, R, Stan, and Julia to meet evolving challenges in data management, processing, analytics, and visualization. In all, PyData has over 28,000 members across 52 international meetups. The Python language and the data-focused ecosystem that has grown around it have been remarkably successful in attracting an inclusive mindset centered around free and open-source software and science. Our Atlanta chapter aims to be even more neutral about specific technologies so long as the underlying spirit resonates with our mission.

The three of us, with the help of friend and colleague Lizzy Rolando, began sourcing great speakers who have a distinctive approach to using data that resonated with the local tech culture. We hosted our first meetup in early April. From the beginning, we encouraged a do-it-yourself, interactive vibe to our meetings, supporting shorter-format 30 minute presentations with 20 minute question and answer sessions.

Regardless of the technical focus, we try to bring in speakers who are applying their data-driven work to something of general interest. Our programming balances technical and more qualitative talks. Our meetings have covered a diverse range of applications, addressing computer literacy and education, human rights, neuroscience, journalism, and civics.

A crowd favorite is the inclusion of 3-4 audience-submitted lightning talks at the end of the main Q&A. The strictly five-minute talks add more energy to the mix and give a wider platform to the local community. They’re an opportunity for students to practice presentation skills, to generate conversations around projects needing collaborators, to discuss new tools, or just to have fun looking at interesting data sets.

Students, career changers, and professionals have come together as members of PyData to learn and share. Our network has generated new friends, collaborators, and even new jobs. Local organizations that share our community spirit provide generous sponsorship and refreshments for our meetings.

We believe we were in the right place at the right time to meet a need. It’s evident in the positive response and rapid growth we’ve seen, having acquired over 1,000 members in one year and hosted over 120 attendees at our last event. It has been a whirlwind experience, and we are delighted that our community has shared our spirit and become involved with us so strongly. Here’s to healthy, productive, data-driven outcomes for all of us in 2017!

Reinout van Rees: Fossgis: open source for emergencies - Marco Lechner

Thu, 2017-03-23 10:28

(One of my summaries of a talk at the 2017 fossgis conference).

He works for the Bundesamt für Strahlenschutz, basically the government agency that was started after Chernobyl to protect against and to measure radioactivity. The software system they use/build is called IMIS.

IMIS consists of three parts:

  • Measurements (automatic + mobile measurements + laboratory results).
  • Prediction system. Including documentation (managed in Plone, a Python CMS).
  • Decision support. Help support the government layers that have to make the decisions.

They have a simple map at

The current core of the system is proprietary. They are dependent on one single firm. The system is heavily customized for their usage.

They need a new system because geographical analysis keeps getting more important and because there are new requirements coming out of the government. The current program cannot handle that.

What they want is a new system that is as simple as possible; that uses standards for geographical exchange; they don't want to be dependent on a single firm anymore. So:

  • Use open standards, so OGC. But also a specific world-wide nuclear info protocol.
  • Use existing open source software. OSGEO.
  • If we need something special, can we change/extend existing open source software?
  • If not, then it is OK to create their own software. Under an open source license.

They use open source companies to help them, including training their employees. And helping to get these employees used to modern software development (jenkins, docker, etc.)

If you use an open source strategy, what do you need to do to make it fair?

  • Your own developments should also be open source!
  • You need your own test and build infrastructure. (For instance Jenkins)
  • You need to make it easy to start working with what you made: documentation, docker, buildout (!), etc.

(Personal note: I didn't expect to hear 'buildout' at this open source GIS conference. I've helped quite a bit with that particular piece of python software :-) )


PyBites: Module of the Week - ipaddress

Thu, 2017-03-23 06:30

While playing around with code for our post on generators we discovered the ipaddress module, part of the Standard Library. Such a handy little module!
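To give a flavour of the module, here is a minimal sketch; the address and network values are arbitrary examples chosen for illustration and do not come from the original post.

```python
import ipaddress

addr = ipaddress.ip_address("192.168.1.10")   # returns an IPv4Address
net = ipaddress.ip_network("192.168.1.0/24")  # returns an IPv4Network

print(addr in net)          # membership test between address and network
print(net.num_addresses)    # total addresses in the /24, network and broadcast included
print(addr.is_private)      # whether the address is in a private range

hosts = list(net.hosts())   # usable host addresses, as IPv4Address objects
print(hosts[0], hosts[-1])
```

Addresses and networks are real objects rather than strings, so comparisons, membership tests and iteration work without any manual parsing.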


Reinout van Rees: Fossgis: sewer cadastre with qgis - Jörg Höttges

Thu, 2017-03-23 06:24

(One of my summaries of a talk at the 2017 fossgis conference).

With engineer firms from the Aachen region they created qkan. Qkan is:

  • A data structure.
  • Plugins for Qgis.
  • Direct access. Not a specific application with restricted access, but unrestricted access from within Qgis. (He noticed lots of interest among the engineers to learn qgis during the project!)

It has been designed for the needs of the engineers that have to work with the data. You first import the data from the local sewer database. Qkan converts the data to what it needs. Then you can do simulations in a separate package. The results of the simulation will be visualized by Qkan in qgis. Afterwards you probably have to make some corrections to the data and give corrections back to the original database. Often you have to go look at the actual sewers to make sure the database is correct. Output is often a map with the sewer system.

Some functionality: import sewer data (in various formats). Simulate water levels. Draw graphs of the water levels in a sewer. Support database-level checks ("an end node cannot occur halfway along a sewer").

They took care to make the database schema simple. The source sewer database is always very complex because it has to hold lots of metadata. The engineer that has to work with it needs a much simpler schema in order to be productive. Qkan does this.

They used qgis, spatialite, postgis, python and qt (for forms). An important note: they used as much postgis functionality as possible instead of the geographical functions from qgis: the reason is that postgis (and even spatialite) is often much quicker.

With qgis, python and the "qt designer", you can make lots of handy forms. But you can always go back to the database that's underneath it.

The code is at


CubicWeb: Introducing cubicweb-jsonschema

Thu, 2017-03-23 05:57

This is the first post of a series introducing the cubicweb-jsonschema project that is currently under development at Logilab. In this post, I'll first introduce the general goals of the project and then present in more detail two aspects: data models (the connection between Yams and JSON schema in particular) and the basic features of the API. This post does not always present how things work in the current implementation but rather how they should.

Goals of cubicweb-jsonschema

From a high level point of view, cubicweb-jsonschema addresses mainly two interconnected aspects: one related to modelling for client-side development of user interfaces to CubicWeb applications, the other concerning the HTTP API.

As far as modelling is concerned, cubicweb-jsonschema essentially aims at providing a transformation mechanism between a Yams schema and JSON Schema that is both automatic and extensible. This means that we can ultimately expect that Yams definitions alone would be sufficient to generate JSON schema definitions that would be consistent enough to build a UI, pretty much as is currently the case with the automatic web UI in CubicWeb. A corollary of this goal is that we want JSON schema definitions to match their context of usage, meaning that a JSON schema definition would not be the same in the context of viewing, editing or relationships manipulations.

In terms of API, cubicweb-jsonschema essentially aims at providing an HTTP API to manipulate entities based on their JSON Schema definitions.

Finally, the ultimate goal is to expose a hypermedia API for a CubicWeb application in order to be able to build an intelligent client. For this we'll build upon the JSON Hyper-Schema specification. This aspect will be discussed in a later post.

Basic usage as an HTTP API library

Consider a simple case where one wants to manipulate entities of type Author described by the following Yams schema definition:

class Author(EntityType):
    name = String(required=True)

With cubicweb-jsonschema one can get the JSON Schema for this entity type in different contexts, such as view, creation or edition. For instance:

  • in a view context, the JSON Schema will be:

    {
      "$ref": "#/definitions/Author",
      "definitions": {
        "Author": {
          "additionalProperties": false,
          "properties": {
            "name": {
              "title": "name",
              "type": "string"
            }
          },
          "title": "Author",
          "type": "object"
        }
      }
    }
  • whereas in creation context, it'll be:

    {
      "$ref": "#/definitions/Author",
      "definitions": {
        "Author": {
          "additionalProperties": false,
          "properties": {
            "name": {
              "title": "name",
              "type": "string"
            }
          },
          "required": [
            "name"
          ],
          "title": "Author",
          "type": "object"
        }
      }
    }

    (Notice the required keyword listing the name property.)
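To make the difference between the two contexts concrete, here is a deliberately simplified sketch in plain Python. The check function is a hypothetical stand-in that understands only three keywords (required, additionalProperties and string-typed properties); it is not the cubicweb-jsonschema implementation and not a complete JSON Schema validator.

```python
# The "Author" definitions from the two documents above, without the $ref wrapper.
VIEW_SCHEMA = {
    "additionalProperties": False,
    "properties": {"name": {"title": "name", "type": "string"}},
    "title": "Author",
    "type": "object",
}
CREATION_SCHEMA = dict(VIEW_SCHEMA, required=["name"])


def check(instance, schema):
    """Return a list of error messages for a dict instance (simplified)."""
    errors = []
    properties = schema.get("properties", {})
    # "required": every listed property must be present.
    for name in schema.get("required", []):
        if name not in instance:
            errors.append("'%s' is a required property" % name)
    for name, value in instance.items():
        if name not in properties:
            # "additionalProperties": false rejects unknown properties.
            if schema.get("additionalProperties", True) is False:
                errors.append("'%s' was unexpected" % name)
        elif properties[name].get("type") == "string" and not isinstance(value, str):
            errors.append("'%s' is not a string" % name)
    return errors


print(check({}, VIEW_SCHEMA))                         # no required keyword in view context
print(check({}, CREATION_SCHEMA))                     # name is missing
print(check({"name": "Victor Hugo"}, CREATION_SCHEMA))
print(check({"born": "1899-07-21"}, VIEW_SCHEMA))     # unknown property
```

In a real client one would normally hand the served schema document to a full JSON Schema validator rather than hand-roll checks like these.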

Such JSON Schema definitions are automatically generated from Yams definitions. In addition, cubicweb-jsonschema exposes some endpoints for basic CRUD operations on resources through an HTTP (JSON) API. From the client point of view, requests on these endpoints are of course expected to match JSON Schema definitions. Some examples:

Get an author resource:

GET /author/855
Accept:application/json

HTTP/1.1 200 OK
Content-Type: application/json

{"name": "Ernest Hemingway"}

Update an author:

PATCH /author/855
Accept:application/json
Content-Type: application/json

{"name": "Ernest Miller Hemingway"}

HTTP/1.1 200 OK
Location: /author/855/
Content-Type: application/json

{"name": "Ernest Miller Hemingway"}

Create an author:

POST /author
Accept:application/json
Content-Type: application/json

{"name": "Victor Hugo"}

HTTP/1.1 201 Created
Content-Type: application/json
Location: /Author/858

{"name": "Victor Hugo"}

Delete an author:

DELETE /author/858

HTTP/1.1 204 No Content

Now if the client sends invalid input with respect to the schema, they'll get an error:

(We provide a wrong born property in request body.)

PATCH /author/855
Accept:application/json
Content-Type: application/json

{"born": "1899-07-21"}

HTTP/1.1 400 Bad Request
Content-Type: application/json

{
  "errors": [
    {
      "details": "Additional properties are not allowed ('born' was unexpected)",
      "status": 422
    }
  ]
}

From Yams model to JSON Schema definitions

The example above illustrates automatic generation of JSON Schema documents based on Yams schema definitions. These documents are expected to help developing views and forms for a web client. Clearly, we expect that cubicweb-jsonschema serves JSON Schema documents for viewing and editing entities as cubicweb.web serves HTML documents for the same purposes. The underlying logic for JSON Schema generation is currently heavily inspired by the logic of primary view and automatic entity form as they exist in cubicweb.web.views. That is: the Yams schema is introspected to determine how properties should be generated and any additional control over this can be performed through uicfg declarations [1].

To illustrate, let's consider the following schema definitions:

class Book(EntityType):
    title = String(required=True)
    publication_date = Datetime(required=True)

class Illustration(EntityType):
    data = Bytes(required=True)

class illustrates(RelationDefinition):
    subject = 'Illustration'
    object = 'Book'
    cardinality = '1*'
    composite = 'object'
    inlined = True

class Author(EntityType):
    name = String(required=True)

class author(RelationDefinition):
    subject = 'Book'
    object = 'Author'
    cardinality = '1*'

class Topic(EntityType):
    name = String(required=True)

class topics(RelationDefinition):
    subject = 'Book'
    object = 'Topic'
    cardinality = '**'

and consider, as before, JSON Schema documents in different contexts for the Book entity type:

  • in view context:

    {
      "$ref": "#/definitions/Book",
      "definitions": {
        "Book": {
          "additionalProperties": false,
          "properties": {
            "author": {
              "items": {
                "type": "string"
              },
              "title": "author",
              "type": "array"
            },
            "publication_date": {
              "format": "date-time",
              "title": "publication_date",
              "type": "string"
            },
            "title": {
              "title": "title",
              "type": "string"
            },
            "topics": {
              "items": {
                "type": "string"
              },
              "title": "topics",
              "type": "array"
            }
          },
          "title": "Book",
          "type": "object"
        }
      }
    }

    We have a single Book definition in this document, in which we find attributes defined in the Yams schema (title and publication_date). We also find the two relations where Book is involved: topics and author, both appearing as a single array of "string" items. The author relationship appears like that because it is mandatory but not composite. On the other hand, the topics relationship has the following uicfg rule:

    uicfg.primaryview_section.tag_subject_of(('Book', 'topics', '*'), 'attributes')

    so that its definition appears embedded in the document of the Book definition.

    A typical JSON representation of a Book entity would be:

    {
      "author": [
        "Ernest Miller Hemingway"
      ],
      "title": "The Old Man and the Sea",
      "topics": [
        "sword fish",
        "cuba"
      ]
    }
  • in creation context:

    {
      "$ref": "#/definitions/Book",
      "definitions": {
        "Book": {
          "additionalProperties": false,
          "properties": {
            "author": {
              "items": {
                "oneOf": [
                  {
                    "enum": [
                      "855"
                    ],
                    "title": "Ernest Miller Hemingway"
                  },
                  {
                    "enum": [
                      "857"
                    ],
                    "title": "Victor Hugo"
                  }
                ],
                "type": "string"
              },
              "maxItems": 1,
              "minItems": 1,
              "title": "author",
              "type": "array"
            },
            "publication_date": {
              "format": "date-time",
              "title": "publication_date",
              "type": "string"
            },
            "title": {
              "title": "title",
              "type": "string"
            }
          },
          "required": [
            "title",
            "publication_date"
          ],
          "title": "Book",
          "type": "object"
        }
      }
    }

    Notice the differences: we now only have attributes and required relationships (author) in this schema, and we have the required keyword listing mandatory attributes; the author property is represented as an array whose items consist of pre-existing objects of the author relationship (namely Author entities).

    Now assume we add the following uicfg declaration:

    uicfg.autoform_section.tag_object_of(('*', 'illustrates', 'Book'), 'main', 'inlined')

    the JSON Schema for creation context will be:

    {
      "$ref": "#/definitions/Book",
      "definitions": {
        "Book": {
          "additionalProperties": false,
          "properties": {
            "author": {
              "items": {
                "oneOf": [
                  {
                    "enum": [
                      "855"
                    ],
                    "title": "Ernest Miller Hemingway"
                  },
                  {
                    "enum": [
                      "857"
                    ],
                    "title": "Victor Hugo"
                  }
                ],
                "type": "string"
              },
              "maxItems": 1,
              "minItems": 1,
              "title": "author",
              "type": "array"
            },
            "illustrates": {
              "items": {
                "$ref": "#/definitions/Illustration"
              },
              "title": "illustrates_object",
              "type": "array"
            },
            "publication_date": {
              "format": "date-time",
              "title": "publication_date",
              "type": "string"
            },
            "title": {
              "title": "title",
              "type": "string"
            }
          },
          "required": [
            "title",
            "publication_date"
          ],
          "title": "Book",
          "type": "object"
        },
        "Illustration": {
          "additionalProperties": false,
          "properties": {
            "data": {
              "format": "data-url",
              "title": "data",
              "type": "string"
            }
          },
          "required": [
            "data"
          ],
          "title": "Illustration",
          "type": "object"
        }
      }
    }

    We now have an additional illustrates property modelled as an array of #/definitions/Illustration, the latter also added to the document as an additional definition entry.


This post illustrated how a basic (CRUD) HTTP API based on JSON Schema could be built for a CubicWeb application using cubicweb-jsonschema. We have seen a couple of details on JSON Schema generation and how it can be controlled. Feel free to comment and provide feedback on this feature set as well as open the discussion with more use cases.

Next time, we'll discuss how hypermedia controls can be added to the HTTP API that cubicweb-jsonschema provides.

[1] This choice is essentially driven by simplicity and conformance with the existing behavior, to help migration of existing applications.

Reinout van Rees: Fossgis: creating maps with open street map in QGis - Axel Heinemann

Thu, 2017-03-23 05:55

(One of my summaries of a talk at the 2017 fossgis conference).

He wanted to make a map for a local run. He wanted a nice map with the route and the infrastructure (start, end, parking, etc). Instead of the usual not-quite-readable city plan with a simple line on top. With qgis and openstreetmap he should be able to make something better!

A quick try with QGis, combined with the standard openstreetmap base map, already looked quite nice, but he wanted to do more customizations on the map colors. So he needed to download the openstreetmap data. That turned into quite a challenge. He tried two plugins:

  • OSMDownloader: easy selection, quick download. Drawback: too many objects as you cannot filter. The attribute table is hard to read.
  • QuickOSM: key/value selection, quick. Drawback: you need a bit of experience with the tool, as it is easy to forget key/values.

He then landed on . The user interface is very friendly. There is a wizard to get common cases done. And you can browse the available tags.

With the data downloaded with overpass-turbo, he could easily adjust colors and get a much nicer map out of it.

You can get it to work, but it takes a lot of custom work.

Some useful links:



Reinout van Rees: Fossgis: introduction on some open source software packages

Thu, 2017-03-23 05:55

(One of my summaries of a talk at the 2017 fossgis conference).

The conference started with a quick introduction on several open source programs.

Openlayers 3 - Marc Jansen

Marc works on both openlayers and GeoExt. Openlayers is a javascript library with lots and lots of features.

To see what it can do, look at the 161 examples on the website :-) It works with both vector layers and raster layers.

Openlayers is quite a mature project: the first version is from 2006. It changed a lot to keep up with the state of the art. But they did take care to keep everything backwards compatible. Upgrading from 2.0 to 2.2 should have been relatively easy. The 4.0.0 version came out last month.


  • Allows many different data sources and layer types.
  • Has built-in interaction and controls.
  • Is very actively developed.
  • Is well documented and has lots of examples.

The aim is to be easy to start with, but also to allow full control of your map and all sorts of customization.

Geoserver - Marc Jansen

(Again Marc: someone was sick...)

Geoserver is a java-based server for geographical data. It supports lots of OGC standards (WMS, WFS, WPS, etc). Flexible, extensible, well documented. "Geoserver is a glorious example that you can write very performant software in java".

Geoserver can connect to many different data sources and make those sources available as map data.

If you're a government agency, you're required to make INSPIRE metadata available for your maps: geoserver can help you with that.

A big advantage of geoserver: it has a browser-based interface for configuring it. You can do 99% of your configuration work in the browser. For maintaining: there is monitoring to keep an eye on it.

Something to look at: the importer plugin. With it you get a REST API to upload shapes, for instance.

The latest version also supports LDAP groups. LDAP was already supported, but group membership not yet.

Mapproxy - Dominik Helle

Dominik is one of the MapProxy developers. Mapproxy is a WMS cache and tile cache. The original goal was to make maps quicker by caching maps.

Some possible sources: WMS, WMTS, tiles (google/bing/etc), MapServer. The output can be WMS, WMS-C, WMTS, TMS, KML. So the input could be google maps and the output WMS. One of their customers combines the output of five different regional organisations into one WMS layer...

The maps that mapproxy returns can be stored on a local disk in order to improve performance. The way they store it allows mapproxy to support intermediary zoom levels instead of fixed ones.

The cache can be in various formats: MBTiles, sqlite, couchdb, riak, arcgis compact cache, redis, s3. The cache is efficient by combining layers and by omitting unneeded data (empty tiles).

You can pre-fill the cache ("seeding").

Some other possibilities, apart from caching:

  • A nice feature: clipping. You can limit a source map to a specific area.
  • Reprojecting from one coordinate system to another. Very handy if someone else doesn't want to support the coordinate system that you need.
  • WMS feature info: you can just pass it on to the backend system, but you can also intercept and change it.
  • Protection of your layers. Password protection. Protect specific layers. Only allow specific areas. Etcetera.

QGis - Thomas Schüttenberg

QGis is an open source gis platform. Desktop, server, browser, mobile. And it is a library. It runs on osx, linux, windows, android. The base is the QT ui library, hence the name.

Qgis contains almost everything you'd expect from a GIS package. You can extend it with plugins.

Qgis is a very, very active project. Almost 1 million lines of code. 30.000+ github commits. 332 developers have worked on it, in the last 12 months 104.

Support via documentation, mailinglists and . In case you're wondering about the names of the releases: they come from the towns where the twice-a-year project meeting takes place :-)

Since december 2016, there's an official (legal) association.

QGis 3 will have lots of changes: QT 5 and python 3.

Mapbender 3 - Astrid Emde

Mapbender is a library to build webgis applications. Ideally, you don't need to write any code yourself, but you configure it instead in your browser. It also supports mobile usage.

There is an online demo to try it out, and example applications are available.

You can choose a layout and fill in and configure the various parts. Layers you want to show: add sources. You can configure security/access with roles.

An example component: a search form for addresses that looks up addresses with sql or a web service. Such a search form can be a popup or you can put it in the sidebar, for instance. CSS can be customized.

PostNAS - Astrid Emde, Jelto Buurman

The postnas project is a solution for importing ALKIS data, a data exchange format for the german cadastre (Deutsch: Kataster).

PostNAS is an extension of the GDAL library for the "NAS" vector data format. (NAS = normalisierte Austausch Schnittstelle, "normalized exchange format"). This way, you can use all of the gdal functionality with the cadastre data. But that's not the only thing: there's also a qgis plugin. There are configuration and conversion scripts for postgis, mapbender, mapserver, etc.

They needed postprocessing/conversion scripts to get useful database tables out of the original data, tables that are usable for showing in QGis, for instance.

So... basically a complete open source environment for working with the cadastre data!

Photo explanation: just a nice unrelated picture from the recent beautiful 'on traxs' model railway exhibition.

Categories: FLOSS Project Planets

Tomasz Früboes: Unittesting print statements

Thu, 2017-03-23 05:10

Recently I was refactoring a small package that is supposed to allow execution of arbitrary python code on a remote machine. The first implementation was working nicely but with one serious drawback: the function handling the actual code execution was running in a synchronous (blocking) mode. As a result, all of the output (both stdout and stderr) was presented only at the end, i.e. when the code finished its execution. This was unacceptable, since the package should work in a way as transparent to the user as possible. So a wall of text when code completes its task wasn’t acceptable.

The goal of the refactoring was simple – to have the output presented to the user immediately after it was printed on the remote host. As a TDD worshipper I wanted to start this in a kosher way, i.e. with a test. And I got stuck.

For a day or so I had no idea how to progress. How do you unittest the print statements? It’s funny when I think about this now. I have used a similar technique many times in the past for output redirection, yet somehow haven’t managed to make a connection with this problem.

The print statement

So how do you do it? First we should understand what happens when a print statement is executed. In python 2.x the print statement does two things: it converts the provided expressions into strings and writes the result to a file-like object handling the stdout. Conveniently it is available as sys.stdout (i.e. as a part of the sys module). So all you have to do is to overwrite sys.stdout with your own object providing a ‘write’ method. Later you may discover that some other methods are also needed (e.g. ‘flush’ is quite often used), but for starters, having only the ‘write’ method should be sufficient.

A first try – simple stdout interceptor

The code below does just that. The MyOutput class is designed to replace the original sys.stdout:

import unittest
import sys

def fn_print(nrepeat):
    print "ab"*nrepeat

class MyTest(unittest.TestCase):
    def test_stdout(self):
        class MyOutput(object):
            def __init__(self):
                self.data = []  # chunks passed to write()

            def write(self, s):
                self.data.append(s)

            def __str__(self):
                return "".join(self.data)

        stdout_org = sys.stdout
        my_stdout = MyOutput()
        try:
            sys.stdout = my_stdout
            fn_print(2)
        finally:
            # always restore the real stdout, even if fn_print raises
            sys.stdout = stdout_org

        self.assertEqual(str(my_stdout), "abab\n")

if __name__ == "__main__":
    unittest.main()

The fn_print function provides output to test against. After replacing sys.stdout we call this function and compare the obtained output with the expected one. It is worth noting that in the example above the original sys.stdout is first preserved and then carefully restored inside the ‘finally’ block. If you don’t do this you are likely to lose any output coming from other tests.
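For readers on Python 3, where print is a function, the same idea can be sketched with contextlib.redirect_stdout from the standard library, which swaps sys.stdout and restores it on exit. This is only a minimal sketch and not the code from the package discussed here; capture_stdout is a hypothetical helper name, and fn_print mirrors the function above.

```python
import contextlib
import io
import unittest


def fn_print(nrepeat):
    # Python 3 version of the function under test.
    print("ab" * nrepeat)


def capture_stdout(func, *args):
    # Hypothetical helper: run func and return everything it printed.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):  # sys.stdout is restored on exit
        func(*args)
    return buffer.getvalue()


class MyTest(unittest.TestCase):
    def test_stdout(self):
        self.assertEqual(capture_stdout(fn_print, 2), "abab\n")
```

The context manager takes over the save-and-restore bookkeeping that the ‘finally’ block does by hand in the Python 2 version.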

Is my code async? Logging time of arrival

In the second example we will address the original problem: is the output presented as a wall of text at the end, or in real time as we want? For this we will add time-of-arrival logging to the object replacing sys.stdout:

import unittest
import time
import sys

def fn_print_with_delay(nrepeat):
    for i in xrange(nrepeat):
        print  # prints a single newline
        time.sleep(0.5)

class TestServer(unittest.TestCase):
    def test_stdout_time(self):
        class TimeLoggingOutput(object):
            def __init__(self):
                self.data = []        # chunks passed to write()
                self.timestamps = []  # arrival time of each chunk

            def write(self, s):
                self.data.append(s)
                self.timestamps.append(time.time())

        stdout_org = sys.stdout
        my_stdout = TimeLoggingOutput()
        nrep = 3  # make sure is >1
        try:
            sys.stdout = my_stdout
            fn_print_with_delay(nrep)
        finally:
            sys.stdout = stdout_org

        for i in xrange(nrep):
            if i > 0:
                dt = my_stdout.timestamps[i]-my_stdout.timestamps[i-1]
                self.assertTrue(0.5<dt<0.52)

if __name__ == "__main__":
    unittest.main()

The code is pretty much self-explanatory: the fn_print_with_delay function prints newlines at half-second intervals. We override sys.stdout with an instance of a class capable of storing timestamps (obtained with time.time()) of all calls to the write method. At the end we assert that the timestamps are spaced approximately half a second apart. The code above works as expected:

.
----------------------------------------------------------------------
Ran 1 test in 1.502s

OK

If we change the interval inside the fn_print_with_delay function to one second, the test will (fortunately) fail.
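The timing check can also be sketched for Python 3. The version below is an illustration under a few assumptions of mine: the delay is a parameter so the check runs quickly, time.monotonic() replaces time.time() so clock adjustments cannot disturb the measurement, and measure_gaps is a hypothetical helper name.

```python
import sys
import time


class TimeLoggingOutput:
    """File-like object that records when each completed line arrives."""

    def __init__(self):
        self.data = []
        self.timestamps = []

    def write(self, s):
        self.data.append(s)
        # Python 3's print() may call write() separately for the text and
        # for the line ending, so only count writes that finish a line.
        if s.endswith("\n"):
            self.timestamps.append(time.monotonic())
        return len(s)

    def flush(self):
        pass


def fn_print_with_delay(nrepeat, delay):
    for _ in range(nrepeat):
        print()  # prints a single newline
        time.sleep(delay)


def measure_gaps(nrepeat, delay):
    # Swap sys.stdout, run the function, and return the gaps between lines.
    stdout_org = sys.stdout
    my_stdout = TimeLoggingOutput()
    try:
        sys.stdout = my_stdout
        fn_print_with_delay(nrepeat, delay)
    finally:
        sys.stdout = stdout_org
    stamps = my_stdout.timestamps
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

A test built on measure_gaps would probably assert mainly a lower bound on each gap: a tight upper bound such as 0.52 seconds can fail on a heavily loaded machine even when the output really does arrive in real time.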


As we saw, testing for expected output is in fact trivial: all you have to do is to put an instance of a class with a ‘write’ method in the proper place (i.e. sys.stdout). The only ‘gotcha’ is the cleanup: you should remember to restore sys.stdout to its original state. You may apply the exact same technique if you need to test stderr (just target sys.stderr instead of sys.stdout). It is also worth noting that using a similar technique you could intercept (or completely silence) output coming from external libraries.
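The swap-and-restore pattern can be wrapped once and reused for both streams. The following Python 3 sketch is one possible way to do that; replaced_stream, call_silenced and noisy are hypothetical names, and noisy merely stands in for a chatty external library.

```python
import contextlib
import io
import sys


@contextlib.contextmanager
def replaced_stream(name, replacement=None):
    # name is "stdout" or "stderr"; the default replacement is an
    # in-memory buffer, which silences the output while still keeping it.
    if replacement is None:
        replacement = io.StringIO()
    original = getattr(sys, name)
    setattr(sys, name, replacement)
    try:
        yield replacement
    finally:
        setattr(sys, name, original)  # cleanup runs even if the body raises


def noisy():
    # Stand-in for an external library call that prints a lot.
    print("progress...")
    print("warning!", file=sys.stderr)


def call_silenced(func):
    # Hide stdout completely, but hand back whatever went to stderr.
    with replaced_stream("stdout"), replaced_stream("stderr") as err:
        func()
    return err.getvalue()
```

Note that this only intercepts writes that go through sys.stdout and sys.stderr at the Python level. Output that an extension module writes directly to the underlying file descriptors is not captured this way.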

Categories: FLOSS Project Planets

DataCamp: PySpark Cheat Sheet: Spark in Python

Thu, 2017-03-23 05:10

Apache Spark is generally known as a fast, general and open-source engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. It can speed up analytic applications by up to 100 times compared to technologies on the market today. You can interface Spark with Python through "PySpark". This is the Spark Python API, which exposes the Spark programming model to Python.

Even though working with Spark will remind you in many ways of working with Pandas DataFrames, you'll also see that it can be tough getting familiar with all the functions that you can use to query, transform, inspect, ... your data. What's more, if you've never worked with any other programming language or if you're new to the field, it might be hard to distinguish between RDD operations.

Let's face it, map() and flatMap() are different enough, but it might still come as a challenge to decide which one you really need when you're faced with them in your analysis. Or what about other functions, like reduce() and reduceByKey()? 
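To make the distinction concrete without needing a Spark installation, here is a plain-Python analogy of the four operations. It does not use the PySpark API: the *_like helpers are hypothetical stand-ins that only mimic the semantics of the RDD methods map(), flatMap(), reduce() and reduceByKey() on ordinary lists.

```python
from functools import reduce
from itertools import chain
from operator import add


def map_like(func, items):
    # One output element per input element.
    return [func(x) for x in items]


def flat_map_like(func, items):
    # Each input may produce several outputs; results are flattened one level.
    return list(chain.from_iterable(func(x) for x in items))


def reduce_by_key_like(func, pairs):
    # Combine the values of (key, value) pairs separately for every key.
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


lines = ["a b", "c"]
mapped = map_like(str.split, lines)          # [['a', 'b'], ['c']]
flattened = flat_map_like(str.split, lines)  # ['a', 'b', 'c']

pairs = [("a", 1), ("b", 2), ("a", 3)]
total = reduce(add, [value for _, value in pairs])  # 6
per_key = reduce_by_key_like(add, pairs)            # {'a': 4, 'b': 2}
```

In Spark itself there is one more difference to keep in mind: reduce() is an action that returns a single value to the driver, while reduceByKey() is a transformation that returns a new RDD of key/value pairs.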

Even though the documentation is very elaborate, it never hurts to have a cheat sheet by your side, especially when you're just getting into it.

This PySpark cheat sheet covers the basics, from initializing Spark and loading your data, to retrieving RDD information, sorting, filtering and sampling your data. But that's not all. You'll also see that topics such as repartitioning, iterating, merging, saving your data and stopping the SparkContext are included in the cheat sheet. 

Note that the examples in the document take small data sets to illustrate the effect of specific functions on your data. In real life data analysis, you'll be using Spark to analyze big data.

Are you hungry for more? Don't miss our other Python cheat sheets for data science that cover topics such as Python basics, Numpy, Pandas, Pandas Data Wrangling and much more!

Categories: FLOSS Project Planets

Rene Dudfield: pip is broken

Thu, 2017-03-23 05:00

Since asking people to use pip to install things, I get a lot of feedback on pip not working. Feedback like this.
"Our fun packaging Jargon"
What is a pip? What's it for? It's not built into python?  It's the almost-default and almost-standard tool for installing python code. Pip almost works a lot of the time. You install things from pypi. I should download pypy? No, pee why, pee eye. The cheeseshop. You're weird. Just call it pee why pee eye. But why is it called pip? I don't know.
"Feedback like this."
pip is broken on Raspbian

pip3 doesn't exist on windows

People have an old pip. Old pip doesn't support wheels. What are wheels? It's a cute bit of jargon to mean a zip file with python code in it structured in a nice way. I heard about eggs... tell me about eggs? Well, eggs are another zip file with python code in it. Used mainly by easy_install. Easy install? Let's use that, this is all too much.

The pip executable or script is for python 2, and they are using python 3.

pip is for a system python, and they have another python installed. How did they install that python? Which of the several pythons did they install? Maybe if they install another python it will work this time.

It's not working one time and they think that sudo will fix things. And now certain files can't be updated without sudo. However, now they have forgotten that sudo exists.

"pip lets you run it with sudo, without warning."
pip doesn't tell them which python it is installing for. But I installed it! Yes you did. But which version of python, and into which virtualenv? Let's use these cryptic commands to try and find out...
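One less cryptic way to find out is to ask the interpreter itself. The sketch below uses only the standard sys module; describe_python is a hypothetical helper name, and the virtualenv check rests on my assumption that older virtualenv releases set sys.real_prefix while environments made with python -m venv differ in sys.base_prefix.

```python
import sys


def describe_python():
    # Report which interpreter is running and whether it lives in a
    # virtual environment.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    in_venv = sys.prefix != base_prefix or hasattr(sys, "real_prefix")
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        "prefix": sys.prefix,
        "in_venv": in_venv,
    }


if __name__ == "__main__":
    for key, value in describe_python().items():
        print(key, "=", value)
```

Running this script with the same python command you plan to use, and then installing with python -m pip install ... rather than a bare pip, ties the install to that interpreter. python -m pip --version also prints where pip lives and which Python version it belongs to.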

pip doesn't install things atomically, so if there is a failed install, things break. If pip was a database (it is)...

Virtual environments work if you use python -m venv, but not virtualenv. Or some times it's the other way around. If you have the right packages installed on Debian, and Ubuntu... because they don't install virtualenv by default.

What do you mean I can't rename my virtualenv folder? I can't move it to another place on my Desktop?

pip installs things into global places by default.

"Globals by default."
Why are packages still installed globally by default?

"So what works currently most of the time?"
python3 -m venv anenv
. ./anenv/bin/activate
pip install pip --upgrade
pip install pygame

This is not ideal. It doesn't work on windows. It doesn't work on Ubuntu. It makes some text editors crash (because virtualenvs have so many files they get sick). It confuses test discovery (because for some reason they don't know about virtual environments still and try to test random packages you have installed). You have to know about virtualenv, about pip, about running things with modules, about environment variables, and system paths. You have to know that at the beginning. Before you know anything at all.

Is there even one set of instructions where people can have a new environment, and install something? Install something in a way that it might not break their other applications? In a way which won't cause them harm? Please let me know the magic words?

I just tell people `pip install pygame`. Even though I know it doesn't work. And can't work. By design. I tell them to do that, because it's probably the best we got. And pip keeps getting better. And one day it will be even better.

Help? Let's fix this.
Categories: FLOSS Project Planets

Talk Python to Me: #104 Game Theory in Python

Thu, 2017-03-23 04:00
Game theory is the study of competing interests, be it individual actors within an economy or healthy vs. cancer cells within a body.

Our guests this week, Vince Knight, Marc Harper, and Owen Campbell, are here to discuss their Python project built to study and simulate one of the central problems in game theory: the prisoners' dilemma.

Links from the show: Axelrod on GitHub, the docs, the tournament, the Gitter chat room, the peer reviewed paper, Djaxelrod v2, and some examples with Jupyter.

Find them on Twitter: the project (@AxelrodPython), Owen (@opcampbell), Vince (@drvinceknight).
Categories: FLOSS Project Planets

Kushal Das: Running MicroPython on 96Boards Carbon

Thu, 2017-03-23 02:42

I received my Carbon from Seedstudio a few months back. But, I never found time to sit down and work on it. During FOSSASIA, in my MicroPython workshop, Siddhesh was working to put MicroPython using Zephyr on his Carbon. That gave me the motivation to have a look at the same after coming back home.

What is Carbon?

Carbon is a 96Boards IoT edition compatible board, with a Cortex-M4 chip, and 512KB flash. It currently runs Zephyr, which is a Linux Foundation hosted project to build a scalable real-time operating system (RTOS).

Setup MicroPython on Carbon

To install the dependencies in Fedora:

$ sudo dnf group install "Development Tools"
$ sudo dnf install git make gcc glibc-static \
      libstdc++-static python3-ply ncurses-devel \
      python-yaml python2 dfu-util

The next step is to setup the Zephyr SDK. You can download the latest binary from here. Then you can install it under your home directory (you don’t have to install it system-wide). I installed it under ~/opt/zephyr-sdk-0.9 location.

Next, I had to check out the zephyr source, I cloned from repo. I also cloned MicroPython from the official GitHub repo. I will just copy paste the next steps below.

$ source
$ cd ~/code/git/
$ git clone
$ cd micropython/zephyr

Then I created a project file for the carbon board specially, this file is named as prj_96b_carbon.conf, and I am pasting the content below. I have submitted the same as a patch to the upstream Micropython project. It disables networking (otherwise you will get stuck while trying to get the REPL).


Next, we have to build MicroPython as a Zephyr application.

$ make BOARD=96b_carbon
$ ls outdir/96b_carbon/
arch     ext          isr_tables.c  lib          Makefile         scripts  tests       zephyr.hex  zephyr.strip
boards   include      isr_tables.o  libzephyr.a  Makefile.export  src      zephyr.bin  zephyr.lnk  zephyr_prebuilt.elf
drivers  isrList.bin  kernel        linker.cmd   misc             subsys   zephyr.elf  zephyr.lst  zephyr.stat

After the build is finished, you will be able to see a zephyr.bin file in the output directory.

Uploading the fresh build to the carbon

Before anything else, I connected my Carbon board to the laptop using a USB cable to the OTG port (remember to check the port name). Then, I had to press the BOOT0 button and, while holding that one, also press the Reset button. Then I released the Reset button first, and then the BOOT0 button. If you run the dfu-util command after this, you should be able to see some output like below.

$ sudo dfu-util -l
dfu-util 0.9

Copyright 2005-2009 Weston Schmidt, Harald Welte and OpenMoko Inc.
Copyright 2010-2016 Tormod Volden and Stefan Schmidt
This program is Free Software and has ABSOLUTELY NO WARRANTY
Please report bugs to

Found DFU: [0483:df11] ver=2200, devnum=14, cfg=1, intf=0, path="2-2", alt=3, name="@Device Feature/0xFFFF0000/01*004 e", serial="385B38683234"
Found DFU: [0483:df11] ver=2200, devnum=14, cfg=1, intf=0, path="2-2", alt=2, name="@OTP Memory /0x1FFF7800/01*512 e,01*016 e", serial="385B38683234"
Found DFU: [0483:df11] ver=2200, devnum=14, cfg=1, intf=0, path="2-2", alt=1, name="@Option Bytes /0x1FFFC000/01*016 e", serial="385B38683234"
Found DFU: [0483:df11] ver=2200, devnum=14, cfg=1, intf=0, path="2-2", alt=0, name="@Internal Flash /0x08000000/04*016Kg,01*064Kg,03*128Kg", serial="385B38683234"

This means the board is in DFU mode. Next we flash the new application to the board.

$ sudo dfu-util -d [0483:df11] -a 0 -D outdir/96b_carbon/zephyr.bin -s 0x08000000
dfu-util 0.9

Copyright 2005-2009 Weston Schmidt, Harald Welte and OpenMoko Inc.
Copyright 2010-2016 Tormod Volden and Stefan Schmidt
This program is Free Software and has ABSOLUTELY NO WARRANTY
Please report bugs to

dfu-util: Invalid DFU suffix signature
dfu-util: A valid DFU suffix will be required in a future dfu-util release!!!
Opening DFU capable USB device...
ID 0483:df11
Run-time device DFU version 011a
Claiming USB DFU Interface...
Setting Alternate Setting #0 ...
Determining device status: state = dfuERROR, status = 10
dfuERROR, clearing status
Determining device status: state = dfuIDLE, status = 0
dfuIDLE, continuing
DFU mode device DFU version 011a
Device returned transfer size 2048
DfuSe interface name: "Internal Flash "
Downloading to address = 0x08000000, size = 125712
Download [=========================] 100% 125712 bytes
Download done.
File downloaded successfully

Hello World on Carbon

The hello world of the hardware land is the LED blinking code. I used the on-board LEDs for this; the sample code is given below. I have now connected the board to the UART (instead of OTG).

$ screen /dev/ttyUSB0 115200
>>>
>>> import time
>>> from machine import Pin
>>> led1 = Pin(("GPIOD",2), Pin.OUT)
>>> led2 = Pin(("GPIOB",5), Pin.OUT)
>>> while True:
...     led2.low()
...     led1.high()
...     time.sleep(0.5)
...     led2.high()
...     led1.low()
...     time.sleep(0.5)
Categories: FLOSS Project Planets