The Python Show: Dashboards in Python with Streamlit

Thu, 2024-07-04 21:20

This week, I chatted with Chanin Nantasenamat about Python and the Streamlit web framework.

Specifically, we chatted about the following topics:

  • Python packages

  • Streamlit

  • Teaching bioinformatics

  • Differences in data science disciplines

  • Being a YouTuber

  • and much more!

Carl Trachte: DAG Hamilton Workflow for Toy Text Processing Script

Thu, 2024-07-04 18:04

Hello. It's been a minute.

I was fortunate to attend PyCon US in Pittsburgh earlier this year. DAGWorks had a booth on the expo floor where I discovered Hamilton. The project grabbed my attention as something that could help organize and present my code workflow better. My reaction could be compared to browsing Walmart while picking up a hardware item and seeing the perfect storage medium for your clothes or crafts at a bargain price, but even better, having someone there to explain the whole thing to you. The folks at the booth were really helpful.

Below I take on a contrived web scraping (it's crude) script in my domain (metals mining) and create a Hamilton workflow from it.

Pictured below is the Hamilton flow in the graphviz output format the project uses for flowcharts (graphviz has been around for decades - an oldie but goodie as it were).

I start with a csv file that has some really basic data on three big American metal mines (I did have to research the Wikipedia addresses - for instance, I originally looked for the Goldstrike Mine under the name "Post-Betze." It goes by several different names and encompasses several mines - more on that anon):

mine,state,commodity,wikipedia page,colloquial association
Red Dog,Alaska,zinc,,Teck
Goldstrike,Nevada,gold,,Nevada Gold Mines
Bingham Canyon,Utah,copper,,Kennecott
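
For comparison with splitting each line on commas by hand, here is a minimal sketch that reads the same column layout with the standard library's csv module. The inline sample text and the function name rows_by_mine are assumptions made for illustration; they are not part of the original script, and the Wikipedia address column is left empty here just as it is above.

```python
import csv
import io

# Sample rows in the same column layout as the input file.
SAMPLE = """mine,state,commodity,wikipedia page,colloquial association
Red Dog,Alaska,zinc,,Teck
Goldstrike,Nevada,gold,,Nevada Gold Mines
Bingham Canyon,Utah,copper,,Kennecott
"""

def rows_by_mine(text: str) -> dict:
    """Return a dictionary keyed on mine name, one dict per csv row."""
    reader = csv.DictReader(io.StringIO(text))
    return {row['mine']: dict(row) for row in reader}

mines = rows_by_mine(SAMPLE)
print(mines['Goldstrike']['colloquial association'])
```

The csv module also copes with quoted fields that contain commas, which a plain split does not.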

Basically, I am going to attempt to scrape Wikipedia for information on who owns the three mines. Then I will try to use heuristics to gather information on what I think I know about them and gauge how up to date the Wikipedia information is.

Hamilton uses a system whereby you name your functions in a noun-like fashion ("def stuff()" instead of "def getstuff()") and feed those names as variables to the other functions in the workflow as parameters. This is what allows the tool to check your workflow for inconsistencies (types, for instance) and build the graphviz chart shown above.
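
To make the naming idea concrete, here is a small pure-Python sketch of wiring functions together by parameter name. It is not Hamilton's implementation and it does not use the Hamilton API; the resolve helper, the NODES table, and the three toy functions are all hypothetical names invented for this illustration.

```python
import inspect

def parsed_data(datafile: str) -> dict:
    """Pretend to load a file; returns a tiny hard-coded dictionary."""
    return {'source': datafile, 'mines': ['Red Dog', 'Goldstrike']}

def mine_count(parsed_data: dict) -> int:
    """Depends on parsed_data because its parameter has that name."""
    return len(parsed_data['mines'])

def summary(parsed_data: dict, mine_count: int) -> str:
    """Depends on two upstream nodes, again purely by parameter name."""
    return '{0:d} mines from {1:s}'.format(mine_count, parsed_data['source'])

# Hypothetical registry of nodes, keyed on function name.
NODES = {f.__name__: f for f in (parsed_data, mine_count, summary)}

def resolve(name: str, inputs: dict, cache: dict) -> object:
    """Compute node `name`, first computing the nodes named by its parameters."""
    if name in inputs:
        return inputs[name]
    if name not in cache:
        func = NODES[name]
        args = {p: resolve(p, inputs, cache)
                for p in inspect.signature(func).parameters}
        cache[name] = func(**args)
    return cache[name]

result = resolve('summary', {'datafile': 'data.csv'}, {})
print(result)
```

Because the dependency graph is visible in the function signatures alone, a tool can draw it or check it before running anything.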

You can use separate modules with functions and import them. I've done some of this on the bigger workflows I work with. Your Hamilton functions then end up being little one liners that call the bigger functions in the modules. This is necessary if you have functions you use repeatedly in your workflow that take different values at different stages. For this toy project, I've kept the whole thing self contained in one module (yes, the iii in the filename represents my multiple failed attempts at web scraping and text processing - it's harder than it looks).
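
As a rough sketch of that one liner pattern, the example below keeps a generic helper and two thin noun-named functions in one file for brevity; in a larger workflow the helper would live in its own imported module. The names count_occurrences, gold_mentions, zinc_mentions, and article_text are invented for illustration.

```python
def count_occurrences(text: str, word: str) -> int:
    """Generic helper, reusable at several stages with different words."""
    return text.lower().split().count(word.lower())

# Thin node-style functions: each name is a noun, each body is a one liner.
def gold_mentions(article_text: str) -> int:
    return count_occurrences(article_text, 'gold')

def zinc_mentions(article_text: str) -> int:
    return count_occurrences(article_text, 'zinc')

text = 'Gold and zinc and more gold'
print(gold_mentions(text), zinc_mentions(text))
```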

Below is the Hamilton main file (I believe the "" name is convention.) I have done my best to preserve the dictionary return values as "faux immutable" through use of the copy module in each function. This helps me in debugging and examining output, much of which can be done from the file (all the return values are stored in a dictionary). I've worked with a dataset with about 600,000 rows that had about 10 nodes. My computer has 32GB of RAM (Windows 11); it handled memory fine (less than half). For really big data, keeping all these dictionaries in memory might be a problem.
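
The reason for reaching for the copy module rather than a plain dictionary copy can be shown with a short sketch. The sample dictionary is invented; the point is only that a shallow copy of a nested dictionary still shares its inner dictionaries with the original, while copy.deepcopy does not.

```python
import copy

original = {'Red Dog': {'state': 'Alaska', 'commodity': 'zinc'}}

# A shallow copy shares the inner dictionaries with the original.
shallow = dict(original)
shallow['Red Dog']['company'] = 'added via shallow copy'
print('company' in original['Red Dog'])   # the original was modified

# Reset, then use deepcopy: inner dictionaries are duplicated too.
original = {'Red Dog': {'state': 'Alaska', 'commodity': 'zinc'}}
deep = copy.deepcopy(original)
deep['Red Dog']['company'] = 'added via deep copy'
print('company' in original['Red Dog'])   # the original is untouched
```

The cost is memory: every node that deep copies its input holds a full duplicate, which matches the caveat about really big data.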

# python 3.12
"""Hamilton demo."""
import sys
import pprint
from hamilton import driver
import toyscriptiii as ts
dr = driver.Builder().with_modules(ts).build()
dr.display_all_functions("ts.png", deduplicate_inputs=True, keep_dot=True, orient='BR')
results = dr.execute(['parsed_data',
                      'data_with_wikipedia',
                      'data_with_company',
                      'info_output',
                      'commodity_word_counts',
                      'colloquial_company_word_counts',
                      'info_dict_merged',
                      'wikipedia_report'],
                      inputs={'datafile':'data.csv'})

The main toy module with functions configured for the Hamilton graph:

# python 3.12
"""Toy script.
Takes some input from a csv file on big American
mines and looks at Wikipedia text for some extra
context."""
import copy
import pprint
import sys
from urllib import request
import re
from bs4 import BeautifulSoup
def parsed_data(datafile:str) -> dict:
    """
    Get csv data into a dictionary keyed on mine name.
    """
    retval = {}
    with open(datafile, 'r') as f:
        headers = [x.strip() for x in next(f).split(',')]
        for linex in f:
            vals = [x.strip() for x in linex.split(',')]
            retval[vals[0]] = {key:val for key, val in zip(headers, vals)}
    pprint.pprint(retval)
    return retval

def data_with_wikipedia(parsed_data:dict) -> dict:
    """
    Connect to wikipedia sites and fill in
    raw html data.

    Return dictionary.
    """
    retval = copy.deepcopy(parsed_data)
    for minex in retval:
        obj = request.urlopen(retval[minex]['wikipedia page'])
        html = obj.read()
        soup = BeautifulSoup(html, 'html.parser')
        print(soup.title)
        # Text from html and strip out newlines.
        newstring = soup.get_text().replace('\n', '')
        retval[minex]['wikipediatext'] = newstring
    return retval

def data_with_company(data_with_wikipedia:dict) -> dict:
    """
    Fetches company ownership for mine out of
    Wikipedia text dump.

    Returns a new dictionary with the company name
    without the big wikipedia text dump.
    """
    # Wikipedia setup for mine company name.
    COMPANYPAT = r'[a-z]Company'
    # Lower case followed by upper case heuristic.
    ENDCOMPANYPAT = '[a-z][A-Z]'
    retval = copy.deepcopy(data_with_wikipedia)
    companypat = re.compile(COMPANYPAT)
    endcompanypat = re.compile(ENDCOMPANYPAT)
    for minex in retval:
        print(minex)
        match = re.search(companypat, retval[minex]['wikipediatext'])
        if match:
            print('Company match span = ', match.span())
            companyidx = match.span()[1]
            match2 = re.search(endcompanypat, retval[minex]['wikipediatext'][companyidx:])
            print('End Company match span = ', match2.span())
            retval[minex]['company'] = retval[minex]['wikipediatext'][companyidx:companyidx + match2.span()[0] + 1]
        # Get rid of big text dump in return value.
        retval[minex].pop('wikipediatext')
    return retval

def info_output(data_with_company:dict) -> str:
    """
    Prints some output text to a file for each
    mine in the data_with_company dictionary.

    Returns string filename of output.
    """
    INFOLINEFMT = 'The {mine:s} mine is a big {commodity:s} mine in the State of {state:s} in the US.'
    COMPANYLINEFMT = '\n    {company:s} owns the mine.\n\n'
    retval = 'mine_info.txt'
    with open(retval, 'w') as f:
        for minex in data_with_company:
            print(INFOLINEFMT.format(**data_with_company[minex]), file=f)
            print(COMPANYLINEFMT.format(**data_with_company[minex]), file=f)
    return retval

def commodity_word_counts(data_with_wikipedia:dict, data_with_company:dict) -> dict:
    """
    Return dictionary keyed on mine with counts of
    commodity (e.g., zinc etc.) mentions on Wikipedia
    page (excluding ones in the company name).
    """
    retval = {}
    # This will probably miss some occurrences at mashed together
    # word boundaries. It is a rough estimate.
    # '\b[Gg]old\b'
    commoditypatfmt = r'\b[{0:s}{1:s}]{2:s}\b'
    for minex in data_with_wikipedia:
        print(minex)
        commodityuc = data_with_wikipedia[minex]['commodity'][0].upper()
        commoditypat = commoditypatfmt.format(commodityuc,
                                              data_with_wikipedia[minex]['commodity'][0],
                                              data_with_wikipedia[minex]['commodity'][1:])
        print(commoditypat)
        commoditymatches = re.findall(commoditypat, data_with_wikipedia[minex]['wikipediatext'])
        # pprint.pprint(commoditymatches)
        nummatchesraw = len(commoditymatches)
        print('Initial length of commoditymatches is {0:d}.'.format(nummatchesraw))
        companymatches = re.findall(data_with_company[minex]['company'],
                                    data_with_wikipedia[minex]['wikipediatext'])
        numcompanymatches = len(companymatches)
        print('Length of companymatches is {0:d}.'.format(numcompanymatches))
        # Is the commodity name part of the company name?
        print('commoditypat = ', commoditypat)
        print(data_with_company[minex]['company'])
        commoditymatchcompany = re.search(commoditypat, data_with_company[minex]['company'])
        if commoditymatchcompany:
            print('commoditymatchcompany.span() = ', commoditymatchcompany.span())
            nummatchesfinal = nummatchesraw - numcompanymatches
            retval[minex] = nummatchesfinal
        else:
            retval[minex] = nummatchesraw
    return retval

def colloquial_company_word_counts(data_with_wikipedia:dict) -> dict:
    """
    Find the number of times the company you associate with
    the property/mine (very subjective) is within the
    text of the mine's wikipedia article.
    """
    retval = {}
    for minex in data_with_wikipedia:
        colloquial_pat = data_with_wikipedia[minex]['colloquial association']
        print(minex)
        nummatches = len(re.findall(colloquial_pat, data_with_wikipedia[minex]['wikipediatext']))
        print('{0:d} matches for colloquial association {1:s}.'.format(nummatches, colloquial_pat))
        retval[minex] = nummatches
    return retval

def info_dict_merged(data_with_company:dict,
                     commodity_word_counts:dict,
                     colloquial_company_word_counts:dict) -> dict:
    """
    Get a dictionary with all the collected information
    in it minus the big Wikipedia text dump.
    """
    retval = copy.deepcopy(data_with_company)
    for minex in retval:
        retval[minex]['colloquial association count'] = colloquial_company_word_counts[minex]
        retval[minex]['commodity word count'] = commodity_word_counts[minex]
    return retval

def wikipedia_report(info_dict_merged:dict) -> str:
    """
    Writes out Wikipedia information (word counts)
    to file in prose; returns string filename.
    """
    retval = 'wikipedia_info.txt'
    colloqfmt = 'The {0:s} mine has {1:d} occurrences of colloquial association {2:s} in its Wikipedia article text.\n'
    commodfmt = 'The {0:s} mine has {1:d} occurrences of commodity name {2:s} in its Wikipedia article text.\n\n'
    with open(retval, 'w') as f:
        for minex in info_dict_merged:
            print(colloqfmt.format(info_dict_merged[minex]['mine'],
                                   info_dict_merged[minex]['colloquial association count'],
                                   info_dict_merged[minex]['colloquial association']), file=f)
            print(commodfmt.format(info_dict_merged[minex]['mine'],
                                   info_dict_merged[minex]['commodity word count'],
                                   info_dict_merged[minex]['commodity']), file=f)
    return retval
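
The word-boundary counting used for commodity names can be tried in isolation. This is a small sketch on an invented sentence, not on real Wikipedia text; the helper name commodity_pattern is an assumption made for the example.

```python
import re

def commodity_pattern(commodity: str) -> str:
    """Build a word-bounded pattern that accepts either case of the first letter."""
    first, rest = commodity[0], commodity[1:]
    return r'\b[{0:s}{1:s}]{2:s}\b'.format(first.upper(), first, rest)

# Invented sample text; 'Goldstrike' and 'goldfield' are not whole-word hits.
sample = 'Gold was found at Goldstrike. The gold is fine; goldfield claims vary.'
pattern = commodity_pattern('gold')
matches = re.findall(pattern, sample)
print(pattern, len(matches))
```

Only the two standalone words are counted, which is also why text mashed together without spaces can hide real occurrences.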

My REGEX abilities are somewhere between "I've heard the term REGEX and know regular expressions exist" and bracketed characters in each slot brute force. It worked for this toy example. Each Wikipedia page features the word "Company" followed by the name of the owning corporate entity.
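
Here is a minimal sketch of that two-step heuristic on an invented string that imitates page text with the newlines stripped out. The function name company_after_label and the sample text are assumptions for illustration; the None checks are an addition that the toy script does not have.

```python
import re

def company_after_label(text: str):
    """Return the text between 'Company' and the next lower/upper case seam."""
    start = re.search(r'[a-z]Company', text)
    if start is None:
        return None
    idx = start.end()
    # A lower case letter directly followed by an upper case letter marks
    # the place where two page elements were mashed together.
    end = re.search(r'[a-z][A-Z]', text[idx:])
    if end is None:
        return text[idx:]
    return text[idx:idx + end.start() + 1]

# Invented stand-in for scraped text, not a real Wikipedia excerpt.
sample = 'LocationNevadaOwnerCompanyBarrick GoldWebsiteexample'
print(company_after_label(sample))
```

A company name that itself contains a lower case letter followed by an upper case letter would be cut short by this heuristic.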

Here are the two text outputs the script produces from the information provided (Wikipedia articles from July, 2024):

The Red Dog mine is a big zinc mine in the State of Alaska in the US.
    NANA Regional Corporation owns the mine.

The Goldstrike mine is a big gold mine in the State of Nevada in the US.
    Barrick Gold owns the mine.

The Bingham Canyon mine is a big copper mine in the State of Utah in the US.
    Rio Tinto Group owns the mine.

The Red Dog mine has 21 occurrences of colloquial association Teck in its Wikipedia article text.
The Red Dog mine has 29 occurrences of commodity name zinc in its Wikipedia article text.

The Goldstrike mine has 0 occurrences of colloquial association Nevada Gold Mines in its Wikipedia article text.
The Goldstrike mine has 16 occurrences of commodity name gold in its Wikipedia article text.

The Bingham Canyon mine has 49 occurrences of colloquial association Kennecott in its Wikipedia article text.
The Bingham Canyon mine has 84 occurrences of commodity name copper in its Wikipedia article text.

Company names are relatively straightforward, although mining company and properties acquisitions and mergers being what they are, it can get complicated. I unwittingly chose three properties that Wikipedia reports as having one owner. Other big mines like Morenci, Arizona (copper) and Cortez, Nevada (gold) show more than one owner; that case is for another programming day. The Goldstrike information might be out of date - no mention of Nevada Gold Mines or Newmont (one mention, but in a different context). The Cortez Wikipedia page is more current, although it still doesn't mention Nevada Gold Mines.

The inclusion of colloquial association in the input csv file was an afterthought based on a lot of the Wikipedia information not being completely in line with what I thought I knew. Teck is the operator of the Red Dog Mine in Alaska. That name does get mentioned frequently in the Wikipedia article.

Enough mining stuff - it is a programming blog after all. Next time (not written yet) I hope to cover dressing up and highlighting the graphviz output a bit.

Thank you for stopping by.

Eli Bendersky: You don't need virtualenv in Go

Thu, 2024-07-04 16:41

Programmers that come to Go from Python often wonder "do I need something like virtualenv here?"

The short answer is NO; this post will provide some additional details.

While virtualenv in Python is useful in many situations, I think it'd be fair to divide them into two broad scenarios: for execution and for development. Let's see what Go offers for each of these scenarios.

Execution

There are multiple, mutually-incompatible versions of Python out in the wild. There are even multiple versions of the packaging tools (like pip). On top of this, different programs need different packages, often themselves with mutually-incompatible versions.

Python code typically expects to be installed, and expects to find packages it depends on installed in a central location. This can be an issue for systems where we don't have the permission to install packages/code to a central location.

All of this makes distributing Python applications quite tricky. It's common to use bundling tools like PyInstaller, but virtualenv is also a popular option [1].

Go is a statically compiled language, so this is a non-problem! Binaries are easy to build and distribute; the binary is a native executable for a given platform (just like a native executable built from C or C++ source), and has no dependencies on compiler or package versions. While you can install Go programs into a central location, you by no means have to do this. In fact, you typically don't have to install Go programs at all. Just invoke the binary.

It's also worth mentioning that Go has great cross-compilation support, making it easy to create binaries for multiple OSes from a single development machine.

Development

Consider the following situation: you're developing a package, which depends on N other packages at specific versions; e.g. you need package foo at version 1.2 or above. Your system may have an older version of foo installed - 0.9; you try to upgrade it to 1.2 and some other program breaks. Now, this all sounds very manageable for package foo - how hard can it be to upgrade the uses of this simple package?

Reality is more difficult. foo could be Django; your code depends on a new version, while some other critical systems depend on an old version. Good luck fixing this conundrum. In Python, virtualenv is a critical tool to make such situations manageable; newer tools like pipenv wrap virtualenv with more usability patterns.
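
For readers who have not used these tools, here is a minimal sketch of what an isolated environment is on disk. It uses the standard library's venv module as a stand-in for virtualenv (an assumption made to keep the example self-contained), and the directory name demo-env is invented.

```python
import pathlib
import tempfile
import venv

# Create a throwaway environment in a temporary directory.
# with_pip=False keeps the sketch fast and offline.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = pathlib.Path(tmp) / 'demo-env'
    venv.create(env_dir, with_pip=False)
    # Each environment carries its own configuration file and its own
    # package directory, separate from the system-wide installation.
    has_config = (env_dir / 'pyvenv.cfg').is_file()
    print(has_config)
```

Packages installed into such an environment stay inside its directory, so two projects can depend on different versions of foo without interfering.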

How about Go?

If you're using Go modules, this situation is very easy to handle. In a way, a Go module serves as its own virtualenv. Your go.mod file specifies the exact versions of dependency packages needed for your development, and these versions don't mix up with packages you need to develop some other project (which has its own go.mod).

Moreover, Go module directives like replace make it easy to short-circuit dependencies to try local patches. While debugging your project you find that package foo has a bug that may be affecting you? Want to try a quick fix and see if you're right? No problem, just clone foo locally, apply a fix, and use a replace to use this locally patched foo. See this post for a few ways to automate this process.

What about different Go versions? Suppose you have to investigate a user report complaining that your code doesn't work with an older Go version. Or maybe you're curious to see how the upcoming beta release of a Go version will affect you. Go makes it easy to install different versions locally. These different versions have their own standard libraries that won't interfere with each other.

[1] Fun fact: this blog uses the Pelican static site generator. To regenerate the site I run Pelican in a virtualenv because I need a specific version of Pelican with some personal patches.

Glyph Lefkowitz: Against Innovation Tokens

Thu, 2024-07-04 15:54

Updated 2024-07-04: After some discussion, added an epilogue going into more detail about the value of the distinction between the two types of tokens.

In 2015, Dan McKinley laid out a model for software teams selecting technologies. He proposed that each team have a limited supply of “innovation tokens”, and, when selecting a technology, they can choose boring ones for free but “innovative” ones cost a token. This implies that we all know which technologies are innovative, and we assume that they are inherently costly, so we want to restrict their supply.

That model has become popular to the point that it is now part of the vernacular. In many discussions, it is accepted as received wisdom, or even common sense.

In this post I aim to show you that despite being superficially helpful, this model is wrong, and in fact, may be counterproductive. I believe it is an attractive nuisance in computer programming discourse.

In fairness to Mr. McKinley, the model he described in this post is:

  1. nearly a decade old at this point, and
  2. much more nuanced in its description of the problem with “innovation” than the subsequent memetic mutation of the concept.

While I will be referencing McKinley’s post, and I do take some issue with it, I am reacting more strongly to the life of its own that this idea has taken on once it escaped its original context. There are a zillion worse posts rehashing this concept, on blogs and LinkedIn, but I won’t be linking to them because the goal is not to call anybody out.

To some extent I am re-raising McKinley’s own caveats and reinforcing them. So I may be arguing with a strawman, but it’s a strawman I have seen deployed with some regularity over the years.

To reduce it to its core, this strawman is “don’t use new or interesting technology, and if you have to, only use a little bit”.

Within the broader culture of programmers, an “innovation token” has become a shorthand to smear any technology perceived — almost always based on vibes, not data — as risky, and the adoption of novel approaches as pretentious and unserious. Speaking of programmer culture though, I do have to acknowledge there is also a pervasive tendency for us to get distracted by novelty and waste time on puzzles rather than problem-solving, so I understand where the reactionary attitude represented by the concept of an innovation token comes from.

But it is reactionary.

At its worst, it borders on anti-intellectualism. I have heard it used on more than one occasion as a thought-terminating cliche to discard a potentially promising new tool. But before I get into that, let me try to give a sympathetic summary of the idea, because the model is not entirely bad.

It has been popular for a long time because it does work okay as a heuristic.

The real problem that McKinley is describing is operational overhead. When programmers make a technology selection, we are often considering how difficult it will make the programming. Innovative technology selections are, by definition, less mature.

That lack of maturity — particularly in the open source world — often means that the project is in a part of its lifecycle where it is concerned with development affordances more than operational ones. Therefore, the stereotypical innovative project, even one which might legitimately be a big improvement to development velocity, will create more operational overhead. That operational overhead creates a hidden cost for the operations team later on.

This is a point I emphatically agree with. When selecting a technology, you should consider its ease of operation more than its ease of development. If your team is successful, they will be operating and maintaining it far longer than they are initially integrating and deploying it.

Furthermore, some operational overhead is inevitable. You will need to hire people to mitigate it. More popular, more mature projects will have a bigger talent pool to hire from, so your training costs will be lower, and those training costs are part of your operational cost too.

Rationing innovation tokens therefore can work as a reasonable heuristic, or proxy metric, for avoiding a mess of complex operational problems associated with dependencies that are expensive to operate and hard to hire for.

There are some minor issues I want to point out before getting to the overarching one.

  1. “has a lot of operational overhead” is a stereotype of a new technology, not an inherent property. If you want to reject a technology on the basis of being too high-overhead, at least look into its actual overhead a little bit. Sometimes, especially in 2024 as opposed to 2015, the point of a new, shiny piece of tech is to address operational issues that the more boring, older one had.
  2. “hard to learn” is also a stereotype; if “newer” meant “harder” then we would all be using troff rather than Google Docs. Actually ask if the innovativeness is making things harder or easier; don’t assume.
  3. You are going to have to train people on your stack no matter what. If a technology is adding a lot of value, it’s absolutely worth hiring for general ability and making a plan to teach people about it. You are going to have to do this with the core technology of your product anyway.

As I said, though, these are minor issues. The big problem with modeling operational overhead as an “innovation token” is that an even bigger concern than selecting an innovative tool is selecting too many tools.

The impulse to select more tools and make your operational environment more complex can be made worse by trying to avoid innovative tools. The important thing is not “less innovation”, but more consistency. To illustrate this, let’s do a simple thought experiment.

Let’s say you’re going to make a web app. There’s a tool in Haskell that you really like for a critical part of your app’s problem domain. You don’t want to spend more than one innovation token though, and everything in Haskell is inherently innovative, so you write a little service that just does that one part and you write the rest of your app in Ruby, calling into that service whenever you need to use that thing. This will appropriately restrict your “innovation token” expenditure.

Does doing this actually reduce your operational overhead, though?

First, you will have to find a team that likes both Ruby and Haskell and sees no problem using both. If you are not familiar with the cultural proclivities of these languages, suffice it to say that this is unlikely. Hiring for Haskell programmers is hard because there are fewer of them than Ruby programmers, but hiring for polyglot Haskell/Ruby programmers who are happy to do either is going to be really hard.

Since you will need to find different people to write in the different languages, even in the best case scenario, you will have two teams: the Haskell team and the Ruby team. Even if you are incredibly disciplined about inter-service responsibilities, there will be some areas where duplication of code is necessary across those services. Disagreements will arise and every one of these disagreements will be a source of social friction and software defects.

Then, you need to set up separate CI pipelines for each language, separate deployment systems, and of course, separate databases. Right away you are effectively doubling your workload.

In the worse, and unfortunately more likely scenario, there will be enormous infighting between these two teams. Operational incidents will be more difficult to manage because rather than learning the Haskell tools for operational visibility and disseminating that institutional knowledge amongst your team, you will be half-learning the lessons from two separate ecosystems and attempting to integrate them. Every on-call engineer will be frantically trying to learn a language ecosystem they don’t use regularly, or you will double the size of your on-call rotation. The Ruby team may start to resent the Haskell team for getting to exclusively work on the fun parts of the problem while they are doing things that look more like rote grunt work.

A better way to think about the problem of managing operational overhead is, rather than “innovation tokens”, consider “boundary tokens”.

That is to say, rather than evaluating the general sense of weird vibes from your architecture, consider the consistency of that architecture. If you’re using Haskell, use Haskell. You should be all-in on Haskell web frameworks, Haskell ORMs, Haskell OAuth integrations, and so on.1 To cross the boundary out of Haskell, you need to spend a boundary token, and you shouldn’t have many of those.

I submit that the increased operational overhead that you might experience with an all-Haskell tool selection will be dwarfed by the savings that you get by having a team that is aligned with each other, that can communicate easily, and that can share programs with each other without needing to first strategize about a channel for the two pieces of work to establish bidirectional communication. The ability to simply call a function when you need to call it is very powerful, and extremely underrated.

Consistency ought to apply at each layer of the stack; it is perhaps most obvious with programming languages, but it is true of web frameworks, test frameworks, cryptographic libraries, you name it. Make a choice and stick with it, because every deviation from that choice carries a significant cost. Moreover this cost is a hidden cost, in the same way that the operational downsides of an “innovative” tool that hasn’t seen much production use might be hidden.

Discarding a more standard tool in favor of a tool more consistent with your architecture extends even to fairly uncontroversial, ubiquitous tools. For example, one of my favorite architectural patterns is to forego the use of the venerable — and very boring – Cron, the UNIX task-scheduler. Instead of Cron, it can make a lot of sense to have hand-written bespoke code for scheduling tasks within the application. Within the “innovation tokens” model, this is a very silly waste of a token!

Just use Cron! Everybody knows how to use Cron!

Except… does everybody know how to use Cron? Here are some questions to consider, if you’re about to roll out a big dependency on Cron:

  1. How do you write a unit test for a scheduling rule with Cron?
  2. Can you even remember how to write a cron rule that runs at the times you want?
  3. How do you inject secrets and configuration variables into the distinct and somewhat idiosyncratic runtime execution environment of Cron?
  4. How do you know that you did that variable-injection properly until the job actually runs, possibly in the middle of the night?
  5. How do you deploy your monitoring and error-logging frameworks to observe your scripts run under Cron?
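
As a minimal sketch of why the first question gets easier once scheduling lives in application code, a rule can be written as an ordinary function and checked with ordinary assertions. The function name next_daily_run and the chosen times are hypothetical; a real in-application scheduler would need more than this (time zones, missed runs, persistence).

```python
from datetime import datetime, timedelta

def next_daily_run(now: datetime, hour: int, minute: int) -> datetime:
    """Return the next time strictly after `now` that falls on hour:minute."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

# The rule "run daily at 02:15" checked from both sides of the target time.
before = datetime(2024, 7, 4, 1, 30)
after = datetime(2024, 7, 4, 3, 0)
print(next_daily_run(before, 2, 15))
print(next_daily_run(after, 2, 15))
```

Because the rule takes the current time as a parameter, a test never has to wait for the middle of the night.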

Granted, this architectural choice is less controversial than it once was. Cron used to be ambiently available on whatever servers you happened to be running. As container-based deployments have increased in popularity, this sense that Cron is just kinda around has gone away, and if you need to run a container that just runs Cron, much of the jankiness of its deployment is a lot more immediately visible.

There is friction at the boundary between things. That friction is a cost, but sometimes it’s a cost worth paying.

If there’s a really good library in Haskell and a really good library in Ruby and you really do want to use them both, maybe it makes sense to actually have multiple services. As your team gets larger and more mature, the need to bring in more tools, and the ability to handle the associated overhead, will only increase over time. But the place that the cost comes in the most is at the boundary between tools, not in the operational deficiencies of any one particular tool.

Even in a bog-standard web application with the most boring, least innovative tech stack imaginable (PHP, MySQL, HTML, CSS, JavaScript), many of the annoying points of friction are where different, inconsistent technologies make contact. If you are a programmer working on the web yourself, consider your own impression of the level of controversy of these technologies:

Consider that there are far more complex technical tools in terms of required skills to implement them, like computer vision or physics simulation, tools which are also pretty widely used, which consistently generate lower levels of controversy. People do have strong feelings about these things as well, of course, and it’s hard to find things to link to that show “this isn’t controversial”, but, like, search your feelings, you know it to be true.

You can see the benefits of the boundary token approach in programming language design. Many of the most influential and best-loved programming languages had an impact not by bundling together lots of tools, but by making everything into one thing:

  • LISP: everything is a list
  • Smalltalk: everything is an object
  • ML: everything is an algebraic data type
  • Forth: everything is a stack

There is a tremendous power in thinking about everything as a single kind of thing, because then you don’t have to juggle lots of different ideas about different kinds of things; you can just think about your problem.

When people complain about programming languages, they’re often complaining about how many different kinds of thing they have to remember in order to use it.

If you keep your boundary-token budget small, and allow your developers to accomplish as much as possible while staying within a solution space delineated by a single, clean cognitive boundary, I promise you can innovate as much as you want and your operational costs will remain manageable.

Epilogue

In subsequent Mastodon discussion of this post with Matt Campbell and Meejah, I realized that I may not have made it entirely clear why I feel the distinction between “boundary” and “innovation” tokens is important. I do say above that the “innovation token” model can be a useful heuristic, so why bother with a new, but slightly different heuristic? Especially since most experienced engineers - indeed, McKinley himself - would budget “innovation” quite similarly to “boundaries”, and might even consider the use of more “innovative” Haskell tools in my hypothetical scenario to not even be an expenditure of innovation tokens at all.

To answer that, I need to highlight the purpose of having heuristics like this in the first place. These are vague, nebulous guidelines, not hard and fast rules. I cannot give you a token calculator to plug your technical decisions into. The purpose of either token heuristic is to facilitate discussions among a team.

With a team of skilled and experienced engineers, the distinction is meaningless. Senior and staff engineers (at least, the ones who deserve their level) will intuit the goals behind “innovation tokens” and inherently consider things like operational overhead anyway. In practice, a high-performing, well-aligned team discussing innovation tokens and one discussing boundary tokens will look functionally indistinguishable.

The distinction starts to be important when you have management pressures, nervous executives, inexperienced engineers, a fresh team without existing consensus about core technology choices, and so on. That is to say, most teams that exist in the messy, perpetually in medias res world of the software industry.

If you are just getting started on a project and you have a bunch of competent but disagreeable engineers, the words “innovation” and “boundaries” function very differently.

If you ask, “is this an innovation” about a particular technical tool, you are asking your interlocutor to pull in a bunch of their skills and experience to subjectively evaluate the relative industry-wide, or maybe company-wide, or maybe team-wide[2] newness of the thing being discussed. The discussion of whether it counts as boring or innovative is immediately fraught with a ton of subjective, difficult-to-quantify information about costs of hiring, difficulty of learning, and your impression of the feelings of hundreds or thousands of people outside of your team. And, yes, ultimately you do need to have an estimate of all that stuff, but starting your “is it OK to use this” conversation by simultaneously arguing about all those subjective judgments is setting yourself up for failure.

Instead, if you ask “does this introduce a boundary between two different technologies with different conceptual models”, while that is not a perfectly objective question, it is much easier for your team to answer, with much crisper intermediary factual questions. What are the two technologies? What are the models? How much do they differ? You can just hash out the answers to each one within the team directly, rather than needing to sift through the last few years of Stack Overflow developer surveys to determine relative adoption or popularity of technologies in the world at large.

Restricting your supply of either boundary or innovation tokens is a good idea, but achieving unanimity within your team about what your boundaries are is always going to be easier than deciding what your innovations are.


Thank you to my patrons who are supporting my writing on this blog. If you like what you’ve read here and you’d like to read more of it, or you’d like to support my various open-source endeavors, you can support my work as a sponsor! I am also available for consulting work if you think your organization could benefit from expertise on topics like “how can we make our architecture more consistent”.

  1. I gave a talk about this once, a very long time ago, where Haskell was Python. 

  2. It’s not clear, that’s a big part of the problem. 

Python Morsels: Strings in Python

Thu, 2024-07-04 11:16

Strings are used to store text-based data.

Table of contents

  1. Strings store text
  2. How are strings used?
  3. String methods in Python
  4. String concatenation
  5. Double quotes vs single quotes
  6. Escape characters
  7. Strings are everywhere in Python

Strings store text

This is a string:

>>> message = "This is text"

Python strings store text:

>>> message
'This is text'

How are strings used?

Strings are often used for …

Read the full article:

PyCharm: Polars vs. pandas: What’s the Difference?

Thu, 2024-07-04 09:58

If you’ve been keeping up with the advances in Python dataframes in the past year, you couldn’t help hearing about Polars, the powerful dataframe library designed for working with large datasets.

Unlike other libraries for working with large datasets, such as Spark, Dask, and Ray, Polars is designed to be used on a single machine, prompting a lot of comparisons to pandas. However, Polars differs from pandas in a number of important ways, including how it works with data and what its optimal applications are. In the following article, we’ll explore the technical details that differentiate these two dataframe libraries and have a look at the strengths and limitations of each.

If you’d like to hear more about this from the creator of Polars, Ritchie Vink, you can also see our interview with him below!

Why use Polars over pandas?

In a word: performance. Polars was built from the ground up to be blazingly fast and can do common operations around 5–10 times faster than pandas. In addition, the memory requirement for Polars operations is significantly smaller than for pandas: pandas requires around 5 to 10 times as much RAM as the size of the dataset to carry out operations, compared to the 2 to 4 times needed for Polars.

You can get an idea of how Polars performs compared to other dataframe libraries here. As you can see, Polars is between 10 and 100 times as fast as pandas for common operations and is actually one of the fastest DataFrame libraries overall. Moreover, it can handle larger datasets than pandas can before running into out-of-memory errors.

Why is Polars so fast?

These results are extremely impressive, so you might be wondering: How can Polars get this sort of performance while still running on a single machine? The library was designed with performance in mind from the beginning, and this is achieved through a few different means.

Written in Rust

One of the most well-known facts about Polars is that it is written in Rust, a low-level language that is almost as fast as C and C++. In contrast, pandas is built on top of Python libraries, one of these being NumPy. While NumPy’s core is written in C, it is still hamstrung by inherent problems with the way Python handles certain types in memory, such as strings for categorical data, leading to poor performance when handling these types (see this fantastic blog post from Wes McKinney for more details).

One of the other advantages of using Rust is that it allows for safe concurrency; that is, it is designed to make parallelism as predictable as possible. This means that Polars can safely use all of your machine’s cores for even complex queries involving multiple columns, which led Ritchie Vink to describe Polars’ performance as “embarrassingly parallel”. This gives Polars a massive performance boost over pandas, which only uses one core to carry out operations. Check out this excellent talk by Nico Kreiling from PyCon DE this year, which goes into more detail about how Polars achieves this.

Based on Arrow

Another factor that contributes to Polars’ impressive performance is Apache Arrow, a language-independent memory format. Arrow was actually co-created by Wes McKinney in response to many of the issues he saw with pandas as the size of data exploded. It is also the backend for pandas 2.0, a more performant version of pandas released in March of this year. The Arrow backends of the libraries do differ slightly, however: while pandas 2.0 is built on PyArrow, the Polars team built their own Arrow implementation.

One of the main advantages of building a data library on Arrow is interoperability. Arrow has been designed to standardize the in-memory data format used across libraries, and it is already used by a number of important libraries and databases, as you can see below.

This interoperability speeds up performance as it bypasses the need to convert data into a different format to pass it between different steps of the data pipeline (in other words, it avoids the need to serialize and deserialize the data). It is also more memory-efficient, as two processes can share the same data without needing to make a copy. As serialization/deserialization is estimated to represent 80–90% of the computing costs in data workflows, Arrow’s common data format lends Polars significant performance gains.

Arrow also has built-in support for a wider range of data types than pandas. As pandas is based on NumPy, it is excellent at handling integer and float columns, but struggles with other data types. In contrast, Arrow has sophisticated support for datetime, boolean, binary, and even complex column types, such as those containing lists. In addition, Arrow is able to natively handle missing data, which requires a workaround in NumPy.

Finally, Arrow uses columnar data storage, which means that, regardless of the data type, all columns are stored in a continuous block of memory. This not only makes parallelism easier, but also makes data retrieval faster.

Query optimization

One of the other cores of Polars’ performance is how it evaluates code. Pandas, by default, uses eager execution, carrying out operations in the order you’ve written them. In contrast, Polars has the ability to do both eager and lazy execution, where a query optimizer will evaluate all of the required operations and map out the most efficient way of executing the code. This can include, among other things, rewriting the execution order of operations or dropping redundant calculations. Take, for example, the following expression to get the mean of column Number1 for each of the categories “A” and “B” in Category.

(
    df
    .groupby(by="Category").agg(pl.col("Number1").mean())
    .filter(pl.col("Category").is_in(["A", "B"]))
)

If this expression is eagerly executed, the groupby operation will be unnecessarily performed for the whole DataFrame, and then filtered by Category. With lazy execution, the DataFrame can be filtered and groupby performed on only the required data.
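To make the reordering concrete, here is a minimal pure-Python sketch of the idea. It is not Polars code and does not reflect how the Polars optimizer is implemented; the sample rows and the helper name mean_by_category are hypothetical, chosen only to mirror the Category and Number1 columns in the example above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data standing in for the DataFrame in the example above.
rows = [
    {"Category": "A", "Number1": 1.0},
    {"Category": "A", "Number1": 3.0},
    {"Category": "B", "Number1": 10.0},
    {"Category": "C", "Number1": 5.0},
    {"Category": "C", "Number1": 7.0},
]
wanted = {"A", "B"}


def mean_by_category(data):
    """Group rows by Category; return (means per category, rows processed)."""
    groups = defaultdict(list)
    rows_seen = 0
    for row in data:
        groups[row["Category"]].append(row["Number1"])
        rows_seen += 1
    return {cat: mean(vals) for cat, vals in groups.items()}, rows_seen


# Order as written: aggregate every row, then filter the aggregated result.
all_means, eager_rows = mean_by_category(rows)
eager_result = {cat: m for cat, m in all_means.items() if cat in wanted}

# Reordered: filter first, then aggregate only the rows that survive.
lazy_result, lazy_rows = mean_by_category(
    row for row in rows if row["Category"] in wanted
)

print(eager_result == lazy_result)  # both orders give the same answer
print(eager_rows, lazy_rows)        # but the reordered plan groups fewer rows
```

In Polars itself, a query becomes eligible for this kind of optimization when it is built on a lazy frame, for example by starting from df.lazy() and finishing with .collect().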

Expressive API

Finally, Polars has an extremely expressive API, meaning that basically any operation you want to perform can be expressed as a Polars method. In contrast, more complex operations in pandas often need to be passed to the apply method as a lambda expression. The problem with the apply method is that it loops over the rows of the DataFrame, sequentially executing the operation on each one. Being able to use built-in methods allows you to work on a columnar level and take advantage of another form of parallelism called SIMD.

When should you stick with pandas?

All of this sounds so amazing that you’re probably wondering why you would even bother with pandas anymore. Not so fast! While Polars is superb for doing extremely efficient data transformations, it is currently not the optimal choice for data exploration or for use as part of machine learning pipelines. These are areas where pandas continues to shine.

One of the reasons for this is that while Polars has great interoperability with other packages using Arrow, it is not yet compatible with most of the Python data visualization packages nor machine learning libraries such as scikit-learn and PyTorch. The only exception is Plotly, which allows you to create charts directly from Polars DataFrames.

A solution that is being discussed is using the Python dataframe interchange protocol in these packages to allow them to support a range of dataframe libraries, which would mean that data science and machine learning workflows would no longer be bottlenecked by pandas. However, this is a relatively new idea, and it will take time for these projects to implement.

Tooling for Polars and pandas

After all of this, I am sure you are eager to try Polars yourself! PyCharm Professional for Data Science offers excellent tooling for working with both pandas and Polars in Jupyter notebooks. In particular, pandas and Polars DataFrames are displayed with interactive functionality, which makes exploring your data much quicker and more comfortable.

Some of my favorite features include the ability to scroll through all rows and columns of the DataFrame without truncation, get aggregations of DataFrame values in one click, and export the DataFrame in a huge range of formats (including Markdown!).

If you’re not yet using PyCharm, you can try it with a 30-day trial by following the link below.

Start your PyCharm Pro free trial

Real Python: Working With JSON Data in Python

Wed, 2024-07-03 10:00

Since its introduction, JSON has rapidly emerged as the predominant standard for the exchange of information. Whether you want to transfer data with an API or store information in a document database, it’s likely you’ll encounter JSON. Fortunately, Python provides robust tools to facilitate this process and help you manage JSON data efficiently.

In this tutorial, you’ll learn how to:

  • Understand the JSON syntax
  • Convert Python data to JSON
  • Deserialize JSON to Python
  • Write and read JSON files
  • Validate JSON syntax
  • Prettify JSON in the terminal
  • Minify JSON with Python

While JSON is the most common format for data distribution, it’s not the only option for such tasks. Both XML and YAML serve similar purposes. If you’re interested in how the formats differ, then you can check out the tutorial on how to serialize your data with Python.

Free Bonus: Click here to download the free sample code that shows you how to work with JSON data in Python.

Take the Quiz: Test your knowledge with our interactive “Working With JSON Data in Python” quiz. You’ll receive a score upon completion to help you track your learning progress:

Interactive Quiz

Working With JSON Data in Python

In this quiz, you'll test your understanding of working with JSON in Python. JSON has become the de facto standard for information exchange, and Python provides easy-to-use tools to handle JSON data.

Introducing JSON

The acronym JSON stands for JavaScript Object Notation. As the name suggests, JSON originated from JavaScript. However, JSON has transcended its origins to become language-agnostic and is now recognized as the standard for data interchange.

The popularity of JSON can be attributed to native support by the JavaScript language, resulting in excellent parsing performance in web browsers. On top of that, JSON’s straightforward syntax allows both humans and computers to read and write JSON data effortlessly.

To get a first impression of JSON, have a look at this example code:

hello_world.json

{
  "greeting": "Hello, world!"
}

You’ll learn more about the JSON syntax later in this tutorial. For now, recognize that the JSON format is text-based. In other words, you can create JSON files using the code editor of your choice. Once you set the file extension to .json, most code editors display your JSON data with syntax highlighting out of the box:

The screenshot above shows how VS Code displays JSON data using the Bearded color theme. You’ll have a closer look at the syntax of the JSON format next!

Examining JSON Syntax

In the previous section, you got a first impression of how JSON data looks. And as a Python developer, the JSON structure probably reminds you of common Python data structures, like a dictionary that contains a string as a key and a value. If you understand the syntax of a dictionary in Python, you already know the general syntax of a JSON object.

Note: Later in this tutorial, you’ll learn that you’re free to use lists and other data types at the top level of a JSON document.

The similarity between Python dictionaries and JSON objects is no surprise. One idea behind establishing JSON as the go-to data interchange format was to make working with JSON as convenient as possible, independently of which programming language you use:

[A collection of key-value pairs and arrays] are universal data structures. Virtually all modern programming languages support them in one form or another. It makes sense that a data format that is interchangeable with programming languages is also based on these structures. (Source)

To explore the JSON syntax further, create a new file named hello_frieda.json and add a more complex JSON structure as the content of the file:

hello_frieda.json

 1 {
 2   "name": "Frieda",
 3   "isDog": true,
 4   "hobbies": ["eating", "sleeping", "barking"],
 5   "age": 8,
 6   "address": {
 7     "work": null,
 8     "home": ["Berlin", "Germany"]
 9   },
10   "friends": [
11     {
12       "name": "Philipp",
13       "hobbies": ["eating", "sleeping", "reading"]
14     },
15     {
16       "name": "Mitch",
17       "hobbies": ["running", "snacking"]
18     }
19   ]
20 }

In the code above, you see data about a dog named Frieda, which is formatted as JSON. The top-level value is a JSON object. Just like Python dictionaries, you wrap JSON objects inside curly braces ({}).

In line 1, you start the JSON object with an opening curly brace ({), and then you close the object at the end of line 20 with a closing curly brace (}).

Note: Although whitespace doesn’t matter in JSON, it’s customary for JSON documents to be formatted with two or four spaces to indicate indentation. If the file size of the JSON document is important, then you may consider minifying the JSON file by removing the whitespace. You’ll learn more about minifying JSON data later in the tutorial.

Inside the JSON object, you can define zero, one, or more key-value pairs. If you add multiple key-value pairs, then you must separate them with a comma (,).
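As a quick, hedged illustration of how these JSON structures map onto Python, the sketch below parses a shortened copy of the Frieda data with the standard library's json module. The variable names frieda_json, frieda, and minified are made up for this example, and the full tutorial may present this differently.

```python
import json

# A shortened copy of the hello_frieda.json content shown above.
frieda_json = """
{
  "name": "Frieda",
  "isDog": true,
  "hobbies": ["eating", "sleeping", "barking"],
  "age": 8,
  "address": {"work": null, "home": ["Berlin", "Germany"]}
}
"""

# json.loads() turns a JSON object into a Python dictionary.
frieda = json.loads(frieda_json)

print(type(frieda).__name__)      # the JSON object becomes a dict
print(frieda["isDog"])            # JSON true becomes Python True
print(frieda["address"]["work"])  # JSON null becomes Python None
print(frieda["hobbies"][0])       # JSON arrays become Python lists

# Compact separators drop the optional whitespace, one way to minify JSON.
minified = json.dumps(frieda, separators=(",", ":"))
print(minified)
```

Because whitespace between tokens carries no meaning in JSON, parsing the minified string gives back the same dictionary.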

Read the full article at »

[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

Real Python: Quiz: Python's Magic Methods: Leverage Their Power in Your Classes

Wed, 2024-07-03 08:00

In this quiz, you’ll test your understanding of Python’s Magic Methods.

By working through this quiz, you’ll revisit the concept of magic methods in Python, how they work, and how you can use them to customize the behavior of your classes.


Gaël Varoquaux: Skrub 0.2.0: tabular learning made easy

Tue, 2024-07-02 18:00

We just released skrub 0.2.0. This release markedly simplifies learning on complex dataframes.

model = tabular_learner('classifier')
Simple, yet solid default baseline

The highlight of the release is the tabular_learner function, which facilitates creating pipelines that readily perform machine learning on dataframes, adding preprocessing to a scikit-learn compatible learner. The function packs defaults and heuristics to transform all forms of dataframes to a representation that is well suited to a learner, and it can adapt these transformations: tabular_learner(HistGradientBoostingClassifier()) encodes categories differently than tabular_learner(LogisticRegression()).

The heuristics are tuned based on much benchmarking, and experience shows that they give good tradeoffs. The default tabular_learner('classifier') is often a strong baseline.

The benefits are visible in a really simple example:

>>> # First retrieve data
>>> from skrub.datasets import fetch_employee_salaries
>>> dataset = fetch_employee_salaries()
>>> df = dataset.X
>>> y = dataset.y
>>> # The dataframe is a quite rich and complex dataframe, with various columns
>>> df

We can then easily build a learner that applies readily to this dataframe, without any transformation:

>>> from skrub import tabular_learner
>>> learner = tabular_learner('regressor')
>>> # The resulting learner can apply all the machine-learning conveniences (eg cross-validation) directly on the dataframe
>>> from sklearn.model_selection import cross_val_score
>>> cross_val_score(learner, df, y)
array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])

transformer = TableVectorizer()
Making encoding complex dataframes easy

Under the hood, the work is done by the skrub.TableVectorizer(), a scikit-learn compatible transformer that facilitates combining multiple transformations on the different columns of a dataframe. The TableVectorizer is not new in the 0.2.0 release, but we have completely revamped its internals to cover edge cases really well. Indeed, one challenge is to make sure that nothing different or strange happens at test time. Actually, enforcing consistency between train-time and test-time transformations is the real value of skrub compared to using pandas or polars to do the transformations.

Increasing support of polars
Short-term goal of optimized support for pandas and polars

We have implemented a new mechanism for supporting both pandas and polars. It has not yet been applied to the whole codebase, hence the support is still imperfect. However, we are seeing increasing support for polars in skrub, and our goal in the short term is to provide rock-solid polars support.

Try skrub out! It’s still young, but in my opinion, it provides a lot of value to tabular learning.

PyCoder’s Weekly: Issue #636 (July 2, 2024)

Tue, 2024-07-02 15:30

#636 – JULY 2, 2024

Build a GUI Calculator With PyQt and Python

In this video course, you’ll learn how to create graphical user interface (GUI) applications with Python and PyQt. Once you’ve covered the basics, you’ll build a fully functional desktop calculator that can respond to user events with concrete actions.

Satellites Spotting Ships

Umbra Space has released a data set consisting of satellite based radar images of shipping. This article from Mark shows you how to grab the data, visualize, and annotate it.

Discover the Power of Observability With Pydantic Logfire

Logfire, by the makers of Pydantic, is an observability platform that will help you understand your app’s behavior with less code and time. Built on OpenTelemetry, it features user-friendly dashboards, SQL querying, and Python-specific integrations. Get started today →
PYDANTIC sponsor

Modern Good Practices for Python Development

This is a very detailed list of best practices for developing in Python. It includes tools, language features, application design, which libraries to use, and more.

PSF Board Candidates for 2024


Python 3.13.0 Beta 3 Released


Django 5.1 Beta 1 Released


PyBay 2024 Call for Proposals


Articles & Tutorials

Build a Guitar Synthesizer: Play Musical Tablature in Python

In this tutorial, you’ll build a guitar synthesizer using the Karplus-Strong algorithm in Python. You’ll model vibrating strings, simulate strumming techniques, read musical notation and tablature, and apply audio effects. By the end, you’ll have created a digital guitar that can play any song. This tutorial was also discussed on Real Python Podcast Episode #210.

Python’s Security Model After the xz-utils Backdoor

The backdoor introduced to the xz-utils compression project through social engineering was one of the topics at the Python Language Summit. Participants discussed what can be done to prevent similar social engineering attacks on the Python source.

Authentication Your Whole Team Will Love

“With PropelAuth, I think I’ve spent about a day – total – on auth over the past year.” PropelAuth is easy to integrate and provides all the tools your team needs to manage your users - dashboards, user insights, impersonation, SSO and more →

Running Prettier Against Django or Jinja Templates

“Prettier” is a JavaScript based linting tool for templates. For folks not familiar with the world of npm, it can be a bit daunting to get it going. Simon fiddled with it so you don’t have to and posted how he got it working on his system.

Write Less Code, You Must

An often overlooked aspect of software development is architecture at the module & function level. It is important to write code that is simple and easy to move from one place to another.

Quickstart for Playing With LLMs Locally

This is a simple, quick guide to getting started running LLMs on your local computer. It covers the basics of the powerful libraries Ollama and LangChain for controlling these AI models.
JOSHUA COOK • Shared by Joshua Cook

Under the Hood of Python’s set Data Structure

This tutorial covers hash tables, collision handling, performance optimization and how it relates to the implementation of the set data structure in Python.

Ways to Have an Atomic Counter in Django

Keeping a counter across objects in Django means having to be careful about race conditions. This article outlines several approaches to the problem.

Saying Thanks to Open Source Maintainers

Brett talks about the different ways you can support the many maintainers of open source projects, and often times just saying “thanks” means a lot.

A Guide to Python’s Weak References

Learn all about weak references in Python: reference counting, garbage collection, and practical uses of the weakref module
MARTIN HEINZ • Shared by Martin Heinz

Creating Great README Files for Your Python Projects

In this tutorial, you’ll learn how to create, organize, and format high-quality README files for your Python projects.

Get Terminal Size

This quick TIL post from Rodrigo shows you how to get information about the terminal size from the shutil module.

A Complete Guide to pytest Fixtures

Learn how to use pytest fixtures for writing maintainable and isolated tests.

Projects & Code

burr: Build Apps That Make Decisions (Chatbots, Agents, etc)


Lazy f-strings

GITHUB.COM/POMPONCHIK • Shared by pomponchik

oxo: Security Scanning Orchestrator


jax: Composable Transformations of Python+NumPy Programs


dbt-utils: Utility Functions for DBT Projects


Events

Weekly Real Python Office Hours Q&A (Virtual)

July 3, 2024

Canberra Python Meetup

July 4, 2024

Sydney Python User Group (SyPy)

July 4, 2024

EuroPython 2024

July 8 to July 15, 2024

SciPy US 2024

July 8 to July 14, 2024

PyCon Nigeria 2024

July 10 to July 14, 2024

Happy Pythoning!
This was PyCoder’s Weekly Issue #636.

[ Subscribe to 🐍 PyCoder’s Weekly 💌 – Get the best Python news, articles, and tutorials delivered to your inbox once a week >> Click here to learn more ]

Real Python: Defining Python Constants for Code Maintainability

Tue, 2024-07-02 10:00

In programming, the term constant refers to names representing values that don’t change during a program’s execution. Constants are a fundamental concept in programming, and Python developers use them in many cases. However, Python doesn’t have a dedicated syntax for defining constants. In practice, Python constants are just variables that never change.

To prevent programmers from reassigning a name that’s supposed to hold a constant, the Python community has adopted a naming convention: use uppercase letters. For every Pythonista, it’s essential to know what constants are, as well as why and when to use them.

In this video course, you’ll learn how to:

  • Properly define constants in Python
  • Identify some built-in constants
  • Use constants to improve your code’s readability, reusability, and maintainability
  • Apply different approaches to organize and manage constants in a project
  • Use several techniques to make constants strictly constant in Python
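As a rough sketch of the naming convention described above (the names MAX_CONNECTIONS, DEFAULT_TIMEOUT_SECONDS, and describe_limits are hypothetical, and the course itself may use different examples), a constant is just an uppercase module-level name. The optional typing.Final annotation is one way to let static type checkers flag reassignments; it is not enforced at runtime.

```python
from typing import Final

# Uppercase names signal "treat this as a constant" to other programmers.
MAX_CONNECTIONS = 10

# Final lets static type checkers report any later reassignment of this name.
DEFAULT_TIMEOUT_SECONDS: Final[float] = 2.5


def describe_limits() -> str:
    """Build a short summary from the module-level constants."""
    return f"up to {MAX_CONNECTIONS} connections, {DEFAULT_TIMEOUT_SECONDS}s timeout"


print(describe_limits())

# Python itself does not stop this reassignment; the convention relies on
# discipline and tooling rather than on the interpreter.
MAX_CONNECTIONS = 20
print(MAX_CONNECTIONS)
```

The final reassignment runs without error, which is why the uppercase convention matters: it is a signal to readers, not a guarantee from the language.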


Python Software Foundation: The 2024 PSF Board Election is Open!

Tue, 2024-07-02 06:05

It’s time to cast your vote! Voting is open starting today, Tuesday, July 2nd, through Tuesday, July 16th, 2024, 2:00 pm UTC. Check the Elections page to see how much time you have left to vote.

How to Vote

If you are a voting member of the PSF that affirmed your intention to participate in this year’s election, you will receive an email from “OpaVote Voting Link <>” with a link to your ballot. The subject line will read “Python Software Foundation Board of Directors Election 2024”. If you haven’t seen your ballot by Wednesday, please check your spam folder for a message from “”. If you don’t see anything get in touch by emailing so we can look into your account and make sure we have the most up-to-date email for you.

Three seats on the board are open, but you can approve as many of the 19 candidates as you like. We’re delighted by how many of you are willing to contribute to the Python community by serving on the PSF Board! Make sure you take some time to look at all the nominee statements and choose your candidates carefully. ATTN: Choose carefully before you press the big green vote button. Once your vote is cast, it cannot be changed.

Who can vote?

You need to be a Contributing, Managing, Supporting, or Fellow member and have affirmed your voting intention by June 25th, 2024, to vote in this election. If you’d like to learn more or sign up as a PSF Member, check out our membership types. You can check your membership status on your User Information page on (you will need to be logged in). If you have questions about your membership or the election please email

Python Bytes: #390 Coding in a Castle

Tue, 2024-07-02 04:00
<strong>Topics covered in this episode:</strong><br> <ul> <li><a href=""><strong>Joining Strings in Python: A</strong></a><a href=""> </a><a href=""><strong>"Huh"</strong></a><a href=""> <strong>Moment</strong></a></li> <li><a href="">10 hard-to-swallow truths they won't tell you about software engineer job</a></li> <li><a href=""><strong>My thoughts on Python in Excel</strong></a></li> <li><strong>Extra, extra, extra</strong></li> <li><strong>Extras</strong></li> <li><strong>Joke</strong></li> </ul><a href='' style='font-weight: bold;'data-umami-event="Livestream-Past" data-umami-event-episode="390">Watch on YouTube</a><br> <p><strong>About the show</strong></p> <p>Sponsored by ScoutAPM: <a href=""><strong></strong></a></p> <p><strong>Connect with the hosts</strong></p> <ul> <li>Michael: <a href=""><strong></strong></a></li> <li>Brian: <a href=""><strong></strong></a></li> <li>Show: <a href=""><strong></strong></a></li> </ul> <p>Join us on YouTube at <a href=""><strong></strong></a> to be part of the audience. Usually Tuesdays at 10am PT. Older video versions available there too.</p> <p>Finally, if you want an artisanal, hand-crafted digest of every week of the show notes in email form? Add your name and email to <a href="">our friends of the show list</a>, we'll never share it. 
</p> <p><strong>Brian #1:</strong> <a href=""><strong>Joining Strings in Python: A</strong></a><a href=""> </a><a href=""><strong>"Huh"</strong></a><a href=""> <strong>Moment</strong></a></p> <ul> <li>Veronica Berglyd Olsen</li> <li><p>Standard solution to “read lines from a file, do some filtering, create a multiline string”:</p> <pre><code>f = open("input_file.txt")
filtered_text = "\n".join(x for x in f if not x.startswith("#"))
</code></pre></li> <li><p>This reads the file through a generator expression and passes that generator to join.</p></li> <li><p>Another approach is to add brackets, turning the generator expression into a list comprehension:</p> <pre><code>f = open("input_file.txt")
filtered_text = "\n".join([x for x in f if not x.startswith("#")])
</code></pre></li> <li><p>At first glance, this seems to just be extra typing, but it’s actually faster by 16% on CPython due to the implementation of .join() doing 2 passes on input if passed a generator. </p> <ul> <li>From Trey Hunner: “I do know that it’s not possible to do 2 passes over a generator (since it’d be exhausted after the first pass) so from my understanding, the generator version requires an extra step of storing all the items in a list first.”</li> </ul></li> </ul> <p><strong>Michael #2:</strong> <a href="">10 hard-to-swallow truths they won't tell you about software engineer job</a></p> <ol> <li>College will not prepare you for the job</li> <li>You will rarely get greenfield projects</li> <li>Nobody gives a BLANK about your clean code</li> <li>You will sometimes work with incompetent people</li> <li>Get used to being in meetings for hours</li> <li>They will ask you for estimates a lot of times</li> <li>Bugs will be your arch-enemy for life</li> <li>Uncertainty will be your toxic friend</li> <li>It will be almost impossible to disconnect from your job</li> <li>You will profit more from good soft skills than from good technical skills</li> </ol> <p><strong>Brian #3:</strong> <a href=""><strong>My thoughts on Python in 
Excel</strong></a></p> <ul> <li>Felix Zumstein</li> <li>Interesting take on one person’s experience with trying Python in Excel.</li> <li>“We wanted an alternative to VBA, but got an alternative to the Excel formula language”</li> <li>“Python runs in the cloud on Azure Container Instances and not inside Excel.”</li> <li>“DataFrames are great, but so are NumPy arrays and lists.”</li> <li>… lots of other interesting takeaways.</li> </ul> <p><strong>Michael #4:</strong> <strong>Extra, extra, extra</strong></p> <ul> <li><a href="">Code in a castle</a> - Michael’s Python Zero to Hero course in Tuscany</li> <li><a href=""> JavaScript supply chain attack impacts over 100K sites</a> <ul> <li>Now required reading: <a href="">Reasons to avoid Javascript CDNs</a></li> </ul></li> <li><a href="">Mac users served info-stealer malware through Google ads</a></li> <li><a href="">HTMX for the win</a>!</li> <li>ssh to <a href="">run remote commands</a> <pre><code>&gt; ssh user@server "command_to_run --arg1 --arg2" </code></pre></li> </ul> <p><strong>Extras</strong> </p> <p>Brian:</p> <ul> <li><a href="">A fun </a><a href="">reaction</a><a href=""> to AI </a>- I will not be showing the link on our live stream, due to colorful language.</li> </ul> <p>Michael:</p> <ul> <li><a href="">Coding in a Castle</a> Developer Education Event</li> <li><a href=""> JavaScript supply chain attack impacts over 100K sites</a> <ul> <li>See <a href="">Reasons to avoid Javascript CDNs</a></li> </ul></li> </ul> <p><strong>Joke:</strong> <a href="">HTML Hacker</a></p>
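
The string-joining comparison from Brian's first item can be tried locally with a small timing sketch. The in-memory `lines` list below is an assumption standing in for the file used in the show notes, and the function names are made up for illustration. No timing figures are claimed here, since they depend on the machine and the Python version.

```python
import timeit

# Hypothetical input standing in for the lines of a file:
# every tenth line is a "#" comment that should be filtered out.
lines = [f"line {i}" if i % 10 else f"# comment {i}" for i in range(10_000)]


def join_generator():
    # Passes a generator expression straight to str.join()
    return "\n".join(x for x in lines if not x.startswith("#"))


def join_list():
    # Builds the list first, then passes it to str.join()
    return "\n".join([x for x in lines if not x.startswith("#")])


if __name__ == "__main__":
    for func in (join_generator, join_list):
        seconds = timeit.timeit(func, number=20)
        print(f"{func.__name__}: {seconds:.4f}s for 20 runs")
```

Both functions return the same string, so any difference in the printed timings comes only from how the items reach `.join()`.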
Tryton News: Newsletter June 2024

Tue, 2024-07-02 02:00

In the last month we focused on improving performance, fixing bugs and improving the behaviour of things - building on the changes from our last release. We also added some new features which we would like to introduce to you in this newsletter.

For an in-depth overview of the Tryton issues please take a look at our issue tracker or see the issues and merge requests filtered by label.

Changes for the User

Sales, Purchases and Projects

We now use a dedicated Web Shop page on the product form which contains the web shop related fields.

We’ve added relates from sale and purchase lines to their stock moves and invoice lines.

Purchase and sale amendments now allow updating the secondary unit of their lines.

Now Tryton deletes a purchase request when its related product is deleted. Previously such a purchase request was kept in the system, but we decided that it is better to remove it.

Accounting, Invoicing and Payments

Payments with a zero amount are allowed again in the system. This makes it possible to correctly handle full refunds for payment gateways that use zero amounts for them.

Stock, Production and Shipments

When counting inventories with lots we now also show the lot in addition to the product, as a product may have many lots.

User Interface

Sao now uses a grid to display trytond.model.fields.Dict items to add more flexibility.

To make Tryton more accessible we now make the contents of the message-dialog selectable and copiable.

Data and Configuration

We improved the user experience when importing CSV data. This eases the adoption of Tryton by lowering the barrier to load initial data into the system. Here is a list of the relevant changes:

The CSV export also got new features. It now supports different languages per column in one export. This is especially useful when working with translatable master data such as product names.

We now replace the “Accounting Party” user access group by the “Accounting” user access group. There is no need to limit accounting fields from party to a specific group by default.

New Documentation

The ldap_authentication module is now documented.

Did you know that a Model._rec_name must point to a trytond.model.fields.Char field?

New Releases

We released bug fixes for the currently maintained long term support series 7.0 and 6.0, and for the penultimate series 7.2.

Changes for the System Administrator

We added a new configuration section [report] with option convert_command to be able to use a different document converter.

Now the trytond-admin command validates the email value. The interactive email input loops until a valid email address is entered.
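
The "loop until valid" behaviour can be pictured with a small standard-library sketch. This is not the trytond-admin code: the function name `prompt_email`, the `read` parameter and the deliberately loose regular expression are all assumptions made for illustration.

```python
import re

# Deliberately simple pattern for illustration only; a real validator is stricter.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def prompt_email(read=input, prompt="Email: "):
    """Keep asking until the entered value looks like an email address."""
    while True:
        value = read(prompt).strip()
        if EMAIL_RE.match(value):
            return value
        print(f"{value!r} is not a valid email address, please try again.")


if __name__ == "__main__":
    # Simulate a user who needs three attempts
    answers = iter(["nope", "admin@localhost", "admin@example.com"])
    print(prompt_email(read=lambda _prompt: next(answers)))
```

Passing the reader as a parameter keeps the loop testable without a terminal.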

Changes for Implementers and Developers

We added the option --export-translations to the trytond-admin command. It exports the translation of any activated module to their respective locale folder.

Authors: @dave @pokoli @udono

Categories: FLOSS Project Planets

Zato Blog: Understanding API rate-limiting techniques

Tue, 2024-07-02 00:43
2024-07-02, by Dariusz Suchojad

Enabling rate-limiting in Zato means that access to Zato APIs can be throttled per endpoint, user or service - including options to make limits apply to specific IP addresses only - and if limits are exceeded within a selected period of time, the invocation will fail. Let's check how to use it all.

API rate limiting works on several levels and the configuration is always checked in the order below, which runs from the narrowest, most specific parts of the system (endpoints), through users, which may apply to multiple endpoints, up to services, which in turn may be used by both multiple endpoints and users.

  • First, per-endpoint limits
  • Then, per-user limits
  • Finally, per-service limits

When a request arrives through an endpoint, that endpoint's rate limiting configuration is checked. If the limit is already reached for the IP address or network of the calling application, the request is rejected.

Next, if there is any user associated with the endpoint, that account's rate limits are checked in the same manner and, similarly, if they are reached, the request is rejected.

Finally, if the endpoint's underlying service is configured to do so, it also checks if its invocation limits are not exceeded, rejecting the message accordingly if they are.

Note that the three levels are distinct yet they overlap in what they allow one to achieve.

For instance, it is possible to have the same user credentials be used in multiple endpoints and express ideas such as "Allow this and that user to make 1,000 requests/day to my APIs but limit each endpoint to at most 5 requests/minute no matter which user".
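
The order of checks described above (endpoint, then user, then service) can be sketched as follows. This is an illustration only and not Zato's implementation: `FixedWindowLimit` and `check_request` are hypothetical names, and a fixed-window counter is assumed because the article does not say how requests are counted.

```python
import time

# Seconds per unit of time, using the "m", "h" and "d" units from the article
UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


class FixedWindowLimit:
    """Hypothetical counter: at most `limit` requests per unit of time."""

    def __init__(self, limit, unit, clock=time.time):
        self.limit = limit
        self.window = UNIT_SECONDS[unit]
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def allow(self):
        now = self.clock()
        # Start a fresh window once the current one has elapsed
        if now - self.window_start >= self.window:
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


def check_request(endpoint_limit, user_limit=None, service_limit=None):
    """Return the name of the first level that rejects the request, or None."""
    levels = (
        ("endpoint", endpoint_limit),
        ("user", user_limit),
        ("service", service_limit),
    )
    for name, limit in levels:
        if limit is not None and not limit.allow():
            return name
    return None


# 5 requests/minute per endpoint, 1,000 requests/day per user
endpoint = FixedWindowLimit(5, "m")
user = FixedWindowLimit(1000, "d")
results = [check_request(endpoint, user) for _ in range(6)]
print(results)
```

With six requests sent in quick succession, the first five pass every level and the sixth is stopped at the endpoint level, long before the user's daily allowance matters.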

Moreover, because limits can be set on services, it is possible to make it even more flexible, e.g. "Let this service be invoked at most 10,000 requests/hour, no matter which user it is, with particular users being able to invoke at most 500 requests/minute, no matter which service, topping it off with separate limits for REST vs. SOAP vs. JSON-RPC endpoints, depending on which application invokes the endpoints". That lets one conveniently express advanced scenarios that often occur in practical situations.

Also, observe that API rate limiting applies to REST, SOAP and JSON-RPC endpoints only. It is not used with other API endpoints, such as AMQP, IBM MQ, SAP, task scheduler or any other technologies. However, per-service limits work no matter which endpoint the service is invoked with and they will work with endpoints such as WebSockets, ZeroMQ or any other.

Lastly, limits pertain to incoming requests only - any outgoing ones, from Zato to external resources, are not covered by them.

Per-IP restrictions

The architecture is made even more versatile thanks to the fact that for each object - endpoint, user or service - different limits can be configured depending on the caller's IP address.

This adds yet another dimension and makes it possible to express ideas commonly seen in API-based projects, such as:

  • External applications, depending on their IP addresses, can have their own limits
  • Internal users, e.g. employees of the company using VPN, may have higher limits if their addresses are in the 172.x.x.x range
  • For performance testing purposes, access to Zato from a few selected hosts may have no limits at all

IP-based limits are an integral part of the mechanism and work hand in hand with the other levels - they do not rule out per-endpoint, user or service limits. In fact, for each such object, multiple IP-based limits can be set independently, thus allowing for the highest degree of flexibility.
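
Matching a caller's address against per-IP entries can be sketched with Python's standard `ipaddress` module. The `IPLimit` class, the `limits_for` function, the limit values and the 203.0.113.0/24 network are hypothetical, and 172.0.0.0/8 is only one way of writing the "172.x.x.x range" mentioned above. How Zato combines entries that overlap is not covered by this sketch.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class IPLimit:
    network: str  # CIDR notation, or "*" for any address
    limit: int    # maximum number of requests per unit of time
    unit: str     # "m", "h" or "d"


def limits_for(caller_ip, entries):
    """Return the entries that apply to caller_ip, in order of definition."""
    addr = ipaddress.ip_address(caller_ip)
    matched = []
    for entry in entries:
        if entry.network == "*" or addr in ipaddress.ip_network(entry.network):
            matched.append(entry)
    return matched


entries = [
    IPLimit("172.0.0.0/8", 5000, "m"),    # internal users on the VPN
    IPLimit("203.0.113.0/24", 100, "m"),  # a hypothetical external application
]

print([e.limit for e in limits_for("172.20.1.5", entries)])
print([e.limit for e in limits_for("203.0.113.7", entries)])
```

An address that matches no entry gets an empty list back, which the caller is free to treat as "no per-IP limit configured".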

Exact or approximate

Rate limits come in two types:

  • Exact
  • Approximate

Exact rate limits are just that, exact - they ensure that a limit is not exceeded at all, not even by a single request.

Approximate limits may let a very small number of requests exceed the limit, with the benefit being that approximate limits are faster to check than exact ones.

When to use which type depends on a particular project:

  • In some projects, it does not really matter if callers have a limit of 1,000 requests/minute or 1,005 requests/minute because the difference is too tiny to make a business impact. Approximate limits work best in this case.

  • In other projects, there may be requirements that the limit never be exceeded no matter the circumstances. Use exact limits here.
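
One common way such a trade-off arises is locking, sketched below. This is an assumption made for illustration, since the article does not describe how Zato implements either type; `ExactLimit` and `ApproximateLimit` are hypothetical names. The exact counter checks and increments under a lock, while the approximate one skips the lock, so concurrent callers may occasionally slip past the limit.

```python
import threading


class ExactLimit:
    """Check and increment happen atomically, so the limit is never exceeded."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self._lock = threading.Lock()

    def allow(self):
        with self._lock:
            if self.count >= self.limit:
                return False
            self.count += 1
            return True


class ApproximateLimit:
    """No lock: cheaper per call, but concurrent callers may overshoot slightly."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def allow(self):
        # Two threads can both pass this check before either one increments
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


def count_allowed(limiter, threads=8, calls_per_thread=50):
    """Hammer the limiter from several threads and count accepted requests."""
    accepted = []

    def worker():
        for _ in range(calls_per_thread):
            if limiter.allow():
                accepted.append(1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return len(accepted)


print(count_allowed(ExactLimit(100)))
```

The exact variant always accepts precisely 100 of the 400 requests. The approximate variant accepts at least 100, and how far it overshoots, if at all, depends on thread scheduling.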

Python code and web-admin

Alright, let's check how to define the limits in the Zato Dashboard. We will use the sample service below:

# -*- coding: utf-8 -*-

# Zato
from zato.server.service import Service

class Sample(Service):
    name = 'api.sample'

    def handle(self):

        # Return a simple string on response
        self.response.payload = 'Hello there!\n'

Now, in web-admin, we will configure limits - separately for the service, a new user and a new REST API channel (endpoint).

Points of interest:

  • Configuration for each type of object is independent - within the same invocation some limits may be exact, some may be approximate
  • There can be multiple configuration entries for each object
  • A unit of time is "m", "h" or "d", depending on whether the limit is per minute, hour or day, respectively
  • All limits within the same configuration are checked in the order of their definition which is why the most generic ones should be listed first

Testing it out

Now, all that is left is to invoke the service from curl.

As long as limits are not reached, a business response is returned:

$ curl http://my.user:password@localhost:11223/api/sample
Hello there!
$

But if a limit is reached, the caller receives an error message with the 429 HTTP status.

$ curl -v http://my.user:password@localhost:11223/api/sample
* Trying ...
< HTTP/1.1 429 Too Many Requests
< Server: Zato
< X-Zato-CID: b8053d68612d626d338b02
...
{"zato_env":{"result":"ZATO_ERROR","cid":"b8053d68612d626d338b02eb",
"details":"Error 429 Too Many Requests"}}
$

Note that the caller never knows what the limit was - that information is saved in Zato server logs along with other details so that API authors can correlate what callers get with the very rate limiting definition that prevented them from accessing the service.

zato.common.rate_limiting.common.RateLimitReached: Max. rate limit of 100/m reached; from:``, network:`*`; last_from:`; last_request_time_utc:`2020-11-22T15:30:41.943794; last_cid:`5f4f1ef65490a23e5c37eda1`; (cid:b8053d68612d626d338b02)

And this is it - we have created a new API rate limiting definition in Zato and tested it out successfully!

Seth Michael Larson: Lockdown Mode for Apple devices

Mon, 2024-07-01 20:00

Published 2024-07-02 by Seth Larson

Back in September 2023 the libwebp vulnerability (also known as BLASTPASS) was being actively exploited to target a journalist's mobile device. After reading the report from Citizen Lab I learned about an iOS feature called "Lockdown Mode" for Apple devices.

I've been running Lockdown Mode for almost a year now, and at the time I promised a write-up of my experience with the feature, so here it is!

How does Lockdown Mode keep your phone more secure?

Lockdown Mode prevents some methods of sending or injecting data into your phone without your active engagement (such as preloading data, injecting data into unsecured connections, etc). Data that's processed by your phone automatically, such as images, can exploit flaws in an image format parser in order to escape and begin executing code.

BLASTPASS exploited memory safety issues in the libwebp library which processes WebP images. The malicious WebP image was delivered to the target's device via a PassKit attachment which can be sent in a text message.

What does Lockdown Mode disable?

Here's the full list of disabled or degraded features when Lockdown Mode is enabled, quoted from Apple's docs on the feature:

  • Most message attachment types are blocked. Some features such as links and link previews are unavailable.
  • Certain complex web technologies are blocked. (i.e., JavaScript JIT)
  • FaceTime calls from unknown contacts are blocked. SharePlay and Live Photos are unavailable.
  • Photo location information is excluded. Shared Albums are removed and disabled.
  • Wi-Fi must be secure for device to connect to a network. 2G cellular support is disabled.
  • Mobile Device Management and Configuration Profiles are disabled.

What are the impacts?

The biggest impact on day-to-day usage is two-fold: Message Links and Search.

With Lockdown Mode enabled, links will not highlight like they typically do, and they won't show the fancy preloaded image that gives you a preview of the content on the other side of a click.

Not having links and link previews in messages is a real inconvenience. The fastest work-around to extract a link in the middle of a text message is to either copy the whole message into your own message box and then copy the URL or to screenshot the message and use Live Text to copy-and-paste directly from your screenshot.

If you're able to persuade your partner to send links in a separate message, that also speeds up the copy-and-paste process by copying the whole message. Persuading your partner is left as an exercise to the reader :)

The other major impact is not being able to search through my messages. This feature is super helpful when you're trying to recall something from years ago, but it's usually not something you use every day. This feature being disabled has never been such a problem that I've had a memorable negative outcome, but it definitely is frustrating when you know the answer is somewhere in your messages.

The only other time Lockdown Mode has introduced friction was during Trina's and my wedding. The wedding party was sharing pictures and videos via a Shared Album, and Shared Albums aren't available when Lockdown Mode is enabled. Fortunately, I could work around this by disabling Lockdown Mode for a short time after the wedding was over, copying all the photos that I wanted, and then re-enabling Lockdown Mode.

Beyond this, some image formats don't load in any context (likely WebP?) and I haven't noticed any slowdown from not having a JavaScript JIT.

Would I recommend Lockdown Mode?

For most people: no. If you have a decent reason to expect you'd be the target of a cyberattack, then you should definitely consider it.

There is a non-zero amount of extra friction to using your phone, but as someone who's trying to actively reduce my phone usage anyway it wasn't a big issue over the year that I've had it enabled.

Bonus tip: Quick one-time disabling of biometric authentication

Privacy gated by biometrics (i.e., "Face ID" or fingerprint scanners) doesn't have the same legal protections as a password. Biometrics are quite convenient, especially if you've configured your phone to lock itself after a relatively short period of inactivity.

So how can one have the benefits of biometrics while maintaining the ability to disable biometrics if needed?

By holding down the volume up and side button on your iPhone you'll bring up the screen that offers to shut down your phone or enter "SOS mode". If you select cancel on this screen your phone will become locked again but will require non-biometric authentication for the next phone unlock.

Because this process is fast (takes less than a second of holding the two buttons) it's great to have in your back pocket in case you need it. It's also useful for one-time activities when you're separated from your device such as crossing a security checkpoint.

This work is licensed under CC BY-SA 4.0

Quansight Labs Blog: An overview of the Sparse Array Ecosystem for Python

Mon, 2024-07-01 20:00

An overview of the different options available for working with sparse arrays in Python

Mon, 2024-07-01 15:42

This tutorial looks at how to add server-side components to our client-side setup with Django.

RoseHosting Blog: How to Install Python on Ubuntu 24.04

Mon, 2024-07-01 13:30

In this tutorial, we are going to explain how to install Python on Ubuntu 24.04 OS. Python is a high-level ...

Real Python: Python's Built-in Functions: A Complete Exploration

Mon, 2024-07-01 10:00

Python has many built-in functions that you can use directly without importing anything. These functions cover a wide variety of common programming tasks that include performing math operations, working with built-in data types, processing iterables of data, handling input and output in your programs, working with scopes, and more.

In this tutorial, you’ll:

  • Get to know Python’s built-in functions
  • Learn about common use cases of Python’s built-in functions
  • Use these functions to solve practical problems

To get the most out of this tutorial, you’ll need to be familiar with Python programming, including topics like working with built-in data types, functions, classes, decorators, scopes, and the import system.

Built-in Functions in Python

Python has several functions available for you to use directly from anywhere in your code. These functions are known as built-in functions and they cover many common programming problems, from mathematical computations to Python-specific features.

Note: Many of Python’s built-in functions are classes with function-style names. Good examples are str, tuple, list, and dict, which are classes that define built-in data types. These classes are listed in the Python documentation as built-in functions and you’ll find them in this tutorial.

In this tutorial, you’ll learn the basics of Python’s built-in functions. By the end, you’ll know what their use cases are and how they work. To kick things off, you’ll start with those built-in functions related to math computations.

Using Math-Related Built-in Functions

In Python, you’ll find a few built-in functions that take care of common math operations, like computing the absolute value of a number, calculating powers, and more. Here’s a summary of the math-related built-in functions in Python:

Function    Description
abs()       Calculates the absolute value of a number
divmod()    Computes the quotient and remainder of integer division
max()       Finds the largest of the given arguments or items in an iterable
min()       Finds the smallest of the given arguments or items in an iterable
pow()       Raises a number to a power
round()     Rounds a floating-point value
sum()       Sums the values in an iterable

In the following sections, you’ll learn how these functions work and how to use them in your Python code.
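
As a quick orientation before the detailed sections, here is a minimal sketch that calls each function from the summary once. The sample `values` list is made up for illustration.

```python
values = [3, -7, 12, 5]

print(abs(-7))            # 7
print(divmod(17, 5))      # (3, 2)
print(max(values))        # 12
print(min(values))        # -7
print(pow(2, 10))         # 1024
print(round(3.14159, 2))  # 3.14
print(sum(values))        # 13
```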

Getting the Absolute Value of a Number: abs()

The absolute value or modulus of a real number is its non-negative value. In other words, the absolute value is the number without its sign. For example, the absolute value of -5 is 5, and the absolute value of 5 is also 5.

Note: To learn more about abs(), check out the How to Find an Absolute Value in Python tutorial.

Python’s built-in abs() function allows you to quickly compute the absolute value or modulus of a number:

>>> from decimal import Decimal
>>> from fractions import Fraction

>>> abs(-42)
42
>>> abs(42)
42

>>> abs(-42.42)
42.42
>>> abs(42.42)
42.42

>>> abs(complex("-2+3j"))
3.605551275463989
>>> abs(complex("2+3j"))
3.605551275463989

>>> abs(Fraction("-1/2"))
Fraction(1, 2)
>>> abs(Fraction("1/2"))
Fraction(1, 2)

>>> abs(Decimal("-0.5"))
Decimal('0.5')
>>> abs(Decimal("0.5"))
Decimal('0.5')

In these examples, you compute the absolute value of different numeric types using the abs() function. First, you use integer numbers, then floating-point and complex numbers, and finally, fractional and decimal numbers. In all cases, when you call the function with a negative value, the final result removes the sign.

For a practical example, say that you need to compute the total profits and losses of your company from a month’s transactions:

>>> transactions = [-200, 300, -100, 500]

>>> incomes = sum(income for income in transactions if income > 0)
>>> expenses = abs(
...     sum(expense for expense in transactions if expense < 0)
... )

>>> print(f"Total incomes: ${incomes}")
Total incomes: $800
>>> print(f"Total expenses: ${expenses}")
Total expenses: $300
>>> print(f"Total profit: ${incomes - expenses}")
Total profit: $500

In this example, to compute the expenses, you use the abs() function to get the absolute value of the expenses, which results in a positive value.

Finding the Quotient and Remainder in Division: divmod()

Python provides a built-in function called divmod() that takes two numbers as arguments and returns a tuple with the quotient and remainder that result from the integer division of the input numbers:
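
The excerpt stops before the tutorial's own example, so here is a short sketch of what the description above means in practice. The `to_hms()` helper is a made-up name used only to show a typical use of the returned tuple.

```python
def to_hms(total_seconds):
    """Split a number of seconds into hours, minutes and seconds."""
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return hours, minutes, seconds


print(divmod(17, 5))   # (3, 2)
print(divmod(-17, 5))  # (-4, 3)
print(divmod(7.5, 2))  # (3.0, 1.5)
print(to_hms(3725))    # (1, 2, 5)
```

Note that the quotient is floored, which is why dividing -17 by 5 gives -4 with a remainder of 3 rather than -3 with a remainder of -2.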

