Planet Python

Planet Python - http://planetpython.org/

Wingware: Wing Python IDE Version 10.0.5 - July 8, 2024

Sun, 2024-07-07 21:00

Wing 10.0.5 adds support for running the IDE on arm64 Linux, updates the German language UI localization, changes the default OpenAI model to the lower-cost and better-performing gpt-4o, and fixes several bugs.

See the change log for details.

Download Wing 10 Now: Wing Pro | Wing Personal | Wing 101 | Compare Products


What's New in Wing 10

AI Assisted Development

Wing Pro 10 takes advantage of recent advances in the capabilities of generative AI to provide powerful AI assisted development, including AI code suggestion, AI driven code refactoring, description-driven development, and AI chat. You can ask Wing to use AI to (1) implement missing code at the current input position, (2) refactor, enhance, or extend existing code by describing the changes that you want to make, (3) write new code from a description of its functionality and design, or (4) chat in order to work through understanding and making changes to code.

Examples of requests you can make include:

  • "Add a docstring to this method"
  • "Create unit tests for class SearchEngine"
  • "Add a phone number field to the Person class"
  • "Clean up this code"
  • "Convert this into a Python generator"
  • "Create an RPC server that exposes all the public methods in class BuildingManager"
  • "Change this method to wait asynchronously for data and return the result with a callback"
  • "Rewrite this threaded code to instead run asynchronously"

Yes, really!

Your role changes to one of directing an intelligent assistant capable of completing a wide range of programming tasks in relatively short periods of time. Instead of typing out code by hand every step of the way, you are essentially directing someone else to work through the details of manageable steps in the software development process.

Read More

Support for Python 3.12 and ARM64 Linux

Wing 10 adds support for Python 3.12, including (1) faster debugging with PEP 669 low impact monitoring API, (2) PEP 695 parameterized classes, functions and methods, (3) PEP 695 type statements, and (4) PEP 701 style f-strings.
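For readers who have not tried the new syntax yet, here is a minimal sketch (mine, not from the Wing announcement) of the Python 3.12 language features listed above; the names Point, Pair, first, and label are purely illustrative:

# python 3.12
type Point = tuple[float, float]           # PEP 695 type statement

class Pair[T]:                             # PEP 695 parameterized class
    def __init__(self, left: T, right: T) -> None:
        self.left = left
        self.right = right

def first[T](items: list[T]) -> T:         # PEP 695 parameterized function
    return items[0]

label = "3.12"
print(f"Running on {"Python " + label}")   # PEP 701: same quote style nested inside an f-string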

Wing 10 also adds support for running Wing on ARM64 Linux systems.

Poetry Package Management

Wing Pro 10 adds support for Poetry package management in the New Project dialog and the Packages tool in the Tools menu. Poetry is an easy-to-use cross-platform dependency and package manager for Python, similar to pipenv.

Ruff Code Warnings & Reformatting

Wing Pro 10 adds support for Ruff as an external code checker in the Code Warnings tool, accessed from the Tools menu. Ruff can also be used as a code reformatter in the Source > Reformatting menu group. Ruff is an incredibly fast Python code checker that can replace or supplement flake8, pylint, pep8, and mypy.
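As a quick illustration (not taken from the Wing announcement), this is the kind of issue Ruff reports; the file below is hypothetical, and F401 is the standard pyflakes-style code for an unused import that Ruff reuses:

# example.py - running "ruff check example.py" would flag the unused import
import os          # F401: `os` imported but unused
import sys

def main() -> None:
    print(sys.version)

if __name__ == "__main__":
    main()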


Try Wing 10 Now!

Wing 10 is a ground-breaking new release in Wingware's Python IDE product line. Find out how Wing 10 can turbocharge your Python development by trying it today.

Downloads: Wing Pro | Wing Personal | Wing 101 | Compare Products

See Upgrading for details on upgrading from Wing 9 and earlier, and Migrating from Older Versions for a list of compatibility notes.

Categories: FLOSS Project Planets

Robin Wilson: Who reads my blog? Send me an email or comment if you do!

Sun, 2024-07-07 13:23

I’m interested to find out who is reading my blog. Following the lead of Jamie Tanna who was in turn copying Terence Eden (both of whose blogs I read), I’d like to ask people who read this to drop me an email or leave a comment on this post if you read this blog and see this post. I have basic analytics on this blog, and I seem to get a reasonable number of views – but I don’t know how many of those are real people, and how many are bots etc.

Feel free to just say hi, but if you have a chance then I’d love to find out a bit more about you and how you read this. Specifically, feel free to answer any or all of the following questions:

  • Do you read this on the website or via RSS?
  • Do you check regularly/occasionally for new posts, or do you follow me on social media (if so, which one?) to see new posts?
  • How did you find my blog in the first place?
  • Are you interested in and/or working in one of my specialisms – like geospatial data, general data science/data processing, Python or similar?
  • Which posts do you find most interesting? Programming posts on how to do things? Geographic analyses? Book reviews? Rare unusual posts (disability, recipes etc)?
  • Have you met me in real life?
  • Is there anything particular you’d like me to write about?

The comments box should be just below here, and my email is robin@rtwilson.com

Thanks!

Categories: FLOSS Project Planets

Carl Trachte: Graphviz - Editing a DAG Hamilton Graph dot File

Fri, 2024-07-05 21:10

Last post featured the DAG Hamilton generated graphviz graph shown below. I'll be dressing this up a little and highlighting some functionality. For the toy example here, the script employed is a bit of overkill. For a bigger workflow, it may come in handy.





I'll start with the finished products:

1) A Hamilton logo and a would-be company logo get added (a manual step; the "Data Inputs Highlighted" subtitle is there for later processing, when we highlight functionality).
2) through 4) are done programmatically (the code is shown further down). I saw an example on the Hamilton web pages that used aquamarine as the highlight color; I liked that, so I stuck with it.

2) Data source and data source function highlighted.



3) Web scraping functions highlighted.



4) Output nodes highlighted.


A few observations and notes before we look at configuration and code: I've found the charts to be really helpful in presenting my workflow to users and leadership (full disclosure: my boss liked some initial charts I made; my dream of a PowerPoint that solves all scripter<->customer communication challenges is not yet reality, but for the first time in a long time, I have hope).

In the web scraping highlighted diagram, you can pretty clearly see that the data_with_company node feeds into the commodity_word_counts node. The domain-specific rationale from the last blog post is that I don't want to count every "Barrick Gold" company name occurrence as another mention of "Gold" or "gold."

Toy example notwithstanding, in real life, being able to show where something branches is a real help. Mismatches between what a script is assumed to be doing and what it is actually doing can be costly in terms of time and productivity for all parties. It is invaluable to be able to say, and show, things like, "What it's doing over here doesn't carry over to that other mission critical part you're really concerned with; it's only for purposes of the visualization, which lies over here on the diagram," or "This node up here, representing <the real life thing>, is your sole source of input for this script; it is not looking at <other real world thing> at all."

graphviz and diagrams like this have been around for decades - UML, database schema visualizations, etc. What makes this whole DAG Hamilton thing better for me is how easy and accessible it is. I've seen C++ UML diagrams over the years (all respect to the C++ people - it takes a lot of ability, discipline, and effort); my first thought is often, "Oh wow . . . I'm not sure I have what it takes to do that . . . and I'm not sure I'd want to . . ."

Enough rationalization and qualifying - on to the config and the code!

I added the title and logos manually. The assumption that the graphviz dot file output of DAG Hamilton will always be in the format shown would be premature and probably wrong. It's an implementation detail subject to change and not a feature. That said, I needed some features in my graph outputs and I achieved them this one time.

Towards the top of the dot file is where the title goes:

// Dependency Graph
digraph {
        labelloc="t"
        label=<<b>Toy Web Scraping Script Run Diagram<BR/>Data Inputs Highlighted</b>> fontsize="36" fontname=Helvetica

labelloc="t" puts the text at the top of the graph (t for top, I think).
// Dependency Graph
digraph {
        labelloc="t"
        label=<<b>Toy Web Scraping Script Run Diagram<BR/>Data Inputs Highlighted</b>> fontsize="36" fontname=Helvetica
        hamiltonlogo [label="" image="hamiltonlogolarge.png" shape="box", width=0.6, height=0.6, fixedsize=true]
        companylogo [label="" image="fauxcompanylogo.png" shape="box", width=5.10 height=0.6 fixedsize=true]

The DAG Hamilton logo listed first appears to end up in the upper left part of the diagram most of the time (this is an empirical observation on my part; I don't have a super great handle on the internals of graphviz yet).

Getting the company logo next to it requires a bit more effort. A StackOverflow exchange had a suggestion of connecting it invisibly to an initial node. In this case, that would be the data source. Inputs in DAG Hamilton don't get listed in the graphviz dot file by their names, but rather by the node or nodes they are connected to: _parsed_data_inputs instead of "datafile" like you might expect. I have a preference for listing my input nodes only once (deduplicate_inputs=True is the keyword argument to DAG Hamilton's driver object's display_all_functions method that makes the graph).

The change is about one third of the way down the dot file where the node connection edges start getting listed:

        parsed_data -> data_with_wikipedia
        _parsed_data_inputs [label=<<table border="0"><tr><td>datafile</td><td>str</td></tr></table>> fontname=Helvetica margin=0.15 shape=rectangle style="filled,dashed" fillcolor="#ffffff"]
        companylogo -> _parsed_data_inputs [style=invis]

DAG Hamilton uses a dashed box for script inputs; that's why there is all that extra description inside the square brackets for that node. I manually added the fillcolor="#ffffff" at the end. It's not necessary for the chart (I believe the default fill of white, #ffffff, is specified near the top of the file), but it is necessary for the code I wrote to replace the existing color with something else. Otherwise, it does not affect the output.

I think that's it for manual prep.

On to the code. Both DAG Hamilton and graphviz have APIs for customizing the graphviz dot file output. I've opted to approach this with brute-force text processing. For my needs, this is the best option; YMMV. In general, text processing any code or configuration tends to be brittle. It worked this time.

# python 3.12
"""Try to edit properties of graphviz output."""
import sys
import re
import itertools
import graphviz
INPUT = 'ts_with_logos_and_colors'
FILLCOLORSTRLEN = 12
AQUAMARINE = '7fffd4'
COLORLEN = len(AQUAMARINE)
BOLDED = ' penwidth=5'
BOLDEDEDGE = ' [penwidth=5]'
NODESTOCOLOR = {'data_source':['_parsed_data_inputs',
                               'parsed_data'],
                'webscraping':['data_with_wikipedia',
                               'colloquial_company_word_counts',
                               'data_with_company',
                               'commodity_word_counts'],
                'output':['info_output',
                          'info_dict_merged',
                          'wikipedia_report']}
EDGEPAT = r'\b{0:s}\b[ ][-][>][ ]\b{1:s}\b'
TITLEPAT = r'Toy Web Scraping Script Run Diagram[<]BR[/][>]'
ENDTITLEPAT = r'</b>>'
# Two tuples as values for edges.
EDGENODESTOBOLD = {'data_source':[('_parsed_data_inputs', 'parsed_data')],
                   'webscraping':[('data_with_wikipedia', 'colloquial_company_word_counts'),
                                  ('data_with_wikipedia', 'data_with_company'),
                                  ('data_with_wikipedia', 'commodity_word_counts'),
                                  ('data_with_company', 'commodity_word_counts')],
                   'output':[('data_with_company', 'info_output'),
                             ('colloquial_company_word_counts', 'info_dict_merged'),
                             ('commodity_word_counts', 'info_dict_merged'),
                             ('info_dict_merged', 'wikipedia_report'),
                             ('data_with_company', 'info_dict_merged')]}
OUTPUTFILES = {'data_source':'data_source_highlighted',
               'webscraping':'web_scraping_functions_highlighted',
               'output':'output_functions_highlighted'}
TITLES = {'data_source':'Data Sources and Data Source Functions Highlighted',
          'webscraping':'Web Scraping Functions Highlighted',
          'output':'Output Functions Highlighted'}

def get_new_source_nodecolor(src, nodex):
    """
    Return new source string for graphviz
    with selected node colored aquamarine.

    src is the original graphviz text source
    from file.

    nodex is the node to have its color edited.
    """
    # Full word, exact match.
    wordmatchpat = r'\b' + nodex + r'\b'
    pat = re.compile(wordmatchpat)
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(pat, src)
    # nodeidx = src.find(nodex)
    nodeidx = match.span()[0]
    print('nodeidx = ', nodeidx)
    src2 += src[:nodeidx]
    idxcolor = src[nodeidx:].find('fillcolor')
    print('idxcolor = ', idxcolor)
    # fillcolor="#b4d8e4"
    # 012345678901234567
    src2 += src[nodeidx:nodeidx + idxcolor + FILLCOLORSTRLEN]
    src2 += AQUAMARINE
    currentposit = nodeidx + idxcolor + FILLCOLORSTRLEN + COLORLEN
    src2 += src[currentposit:]
    return src2

def get_new_title(src, title):
    """
    Return new source string for graphviz
    with new title part of header.

    src is the original graphviz text source
    from file.

    title is a string.
    """
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(TITLEPAT, src)
    titleidx = match.span()[1]
    print('titleidx = ', titleidx)
    src2 += src[:titleidx]
    idxendtitle = src[titleidx:].find(ENDTITLEPAT)
    print('idxendtitle = ', idxendtitle)
    src2 += title
    currentposit = titleidx + idxendtitle
    print('currentposit = ', currentposit)
    src2 += src[currentposit:]
    return src2

def get_new_source_penwidth_nodes(src, nodex):
    """
    Return new source string for graphviz
    with selected node having bolded border.

    src is the original graphviz text source
    from file.

    nodex is the node to have its box bolded.
    """
    # Full word, exact match.
    wordmatchpat = r'\b' + nodex + r'\b'
    pat = re.compile(wordmatchpat)
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(pat, src)
    nodeidx = match.span()[0]
    print('nodeidx = ', nodeidx)
    src2 += src[:nodeidx]
    idxbracket = src[nodeidx:].find(']')
    src2 += src[nodeidx:nodeidx + idxbracket]
    print('idxbracket = ', idxbracket)
    src2 += BOLDED
    src2 += src[nodeidx + idxbracket:]
    return src2

def get_new_source_penwidth_edges(src, nodepair):
    """
    Return new source string for graphviz
    with selected node pair having bolded edge.

    src is the original graphviz text source
    from file.

    nodepair is the two node tuple to have
    its edge bolded.
    """
    # Full word, exact match.
    edgepat = EDGEPAT.format(*nodepair)
    print(edgepat)
    pat = re.compile(edgepat)
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(pat, src)
    edgeidx = match.span()[1]
    print('edgeidx = ', edgeidx)
    src2 += src[:edgeidx]
    src2 += BOLDEDEDGE
    src2 += src[edgeidx:]
    return src2

def makehighlightedfuncgraphs():
    """
    Cycle through functionalities to make specific
    highlighted functional parts of the workflow
    output graphs.

    Returns dictionary of new filenames.
    """
    with open(INPUT, 'r') as f:
        src = f.read()

    retval = {}
    for functionality in TITLES:
        print(functionality)
        src2 = src
        retval[functionality] = {'dot':None,
                                 'svg':None,
                                 'png':None}
        src2 = get_new_title(src, TITLES[functionality])
        # list of nodes.
        to_process = (nodex for nodex in NODESTOCOLOR[functionality])
        countergenerator = itertools.count()
        count = next(countergenerator)
        print('\nDoing node colors\n')
        for nodex in to_process:
            print(nodex)
            src2 = get_new_source_nodecolor(src2, nodex)
            count = next(countergenerator)
        to_process = (nodex for nodex in NODESTOCOLOR[functionality])
        countergenerator = itertools.count()
        count = next(countergenerator)
        print('\nDoing node bolding\n')
        for nodex in to_process:
            print(nodex)
            src2 = get_new_source_penwidth_nodes(src2, nodex)
            count = next(countergenerator)
        print('Bolding edges . . .')
        to_process = (nodex for nodex in EDGENODESTOBOLD[functionality])
        countergenerator = itertools.count()
        count = next(countergenerator)
        for nodepair in to_process:
            print(nodepair)
            src2 = get_new_source_penwidth_edges(src2, nodepair)
            count = next(countergenerator)
        print('Writing output files . . .')
        outputfile = OUTPUTFILES[functionality]
        with open(outputfile, 'w') as f:
            f.write(src2)
        graphviz.render('dot', 'png', outputfile)
        graphviz.render('dot', 'svg', outputfile)
makehighlightedfuncgraphs()

Thanks for stopping by.

Categories: FLOSS Project Planets

TestDriven.io: Developing GraphQL APIs in Django with Strawberry

Fri, 2024-07-05 17:42
This tutorial details how to integrate GraphQL with Django using Strawberry.
Categories: FLOSS Project Planets

Real Python: The Real Python Podcast – Episode #211: Python Doesn't Round Numbers the Way You Might Think

Fri, 2024-07-05 08:00

Does Python round numbers the same way you learned back in math class? You might be surprised by the default method Python uses and the variety of ways to round numbers in Python. Christopher Trudeau is back on the show this week, bringing another batch of PyCoder's Weekly articles and projects.
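As a quick illustration (mine, not from the episode itself), Python's built-in round() uses "banker's rounding," sending ties to the nearest even number, and binary floating point adds surprises of its own:

>>> round(0.5)
0
>>> round(1.5)
2
>>> round(2.5)
2
>>> round(2.675, 2)   # 2.675 can't be stored exactly in binary, so the tie never happens
2.67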

[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

Categories: FLOSS Project Planets

The Python Show: Dashboards in Python with Streamlit

Thu, 2024-07-04 21:20

This week, I chatted with Channin Nantasenamat about Python and the Streamlit web framework.

Specifically, we chatted about the following topics:

  • Python packages

  • Streamlit

  • Teaching bioinformatics

  • Differences in data science disciplines

  • Being a YouTuber

  • and much more!

Links
Categories: FLOSS Project Planets

Carl Trachte: DAG Hamilton Workflow for Toy Text Processing Script

Thu, 2024-07-04 18:04

Hello. It's been a minute.

I was fortunate to attend PYCON US in Pittsburgh earlier this year. DAGWorks had a booth on the expo floor where I discovered Hamilton. The project grabbed my attention as something that could help organize and present my code workflow better. My reaction could be compared to browsing Walmart while picking up a hardware item and seeing the perfect storage medium for your clothes or crafts at a bargain price, but even better, having someone there to explain the whole thing to you. The folks at the booth were really helpful.




Below I take on a contrived web scraping (it's crude) script in my domain (metals mining) and create a Hamilton workflow from it.

Pictured below is the Hamilton flow in the graphviz output format the project uses for flowcharts (graphviz has been around for decades - an oldie but goodie as it were).





I start with a csv file that has some really basic data on three big American metal mines (I did have to research the Wikipedia addresses - for instance, I originally looked for the Goldstrike Mine under the name "Post-Betze." It goes by several different names and encompasses several mines - more on that anon):

mine,state,commodity,wikipedia page,colloquial association
Red Dog,Alaska,zinc,https://en.wikipedia.org/wiki/Red_Dog_mine,Teck
Goldstrike,Nevada,gold,https://en.wikipedia.org/wiki/Goldstrike_mine,Nevada Gold Mines
Bingham Canyon,Utah,copper,https://en.wikipedia.org/wiki/Bingham_Canyon_Mine,Kennecott

Basically, I am going to attempt to scrape Wikipedia for information on who owns the three mines. Then I will try to use heuristics to gather information on what I think I know about them and gauge how up to date the Wikipedia information is.

Hamilton uses a system whereby you name your functions in a noun-like fashion ("def stuff()" instead of "def getstuff()") and feed those names as variables to the other functions in the workflow as parameters. This is what allows the tool to check your workflow for inconsistencies (types, for instance) and build the graphviz chart shown above.
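To make that concrete, here is a tiny sketch of the convention (my own illustration, not code from this post); the parameter name cleaned_text in the second function tells Hamilton to wire in the return value of the cleaned_text function above it:

def cleaned_text(raw_text: str) -> str:
    """Strip and lowercase the raw input."""
    return raw_text.strip().lower()

def word_count(cleaned_text: str) -> int:
    """Count words in the text produced by cleaned_text()."""
    return len(cleaned_text.split())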

You can use separate modules with functions and import them. I've done some of this on the bigger workflows I work with. Your Hamilton functions then end up being little one-liners that call the bigger functions in the modules. This is necessary if you have functions you use repeatedly in your workflow that take different values at different stages. For this toy project, I've kept the whole thing self-contained in one module, toyscriptiii.py (yes, the iii in the filename represents my multiple failed attempts at web scraping and text processing - it's harder than it looks).

Below is the Hamilton main file run.py (I believe the "run.py" name is a convention). I have done my best to preserve the dictionary return values as "faux immutable" through use of the copy module in each function. This helps me in debugging and examining output, much of which can be done from the run.py file (all the return values are stored in a dictionary). I've worked with a dataset of about 600,000 rows in a workflow that had about 10 nodes. My computer has 32GB of RAM (Windows 11); it handled the memory fine (using less than half). For really big data, keeping all these dictionaries in memory might be a problem.

# python 3.12
"""Hamilton demo."""
import sys
import pprint
from hamilton import driver
import toyscriptiii as ts
dr = driver.Builder().with_modules(ts).build()
dr.display_all_functions("ts.png", deduplicate_inputs=True, keep_dot=True, orient='BR')
results = dr.execute(['parsed_data',
                      'data_with_wikipedia',
                      'data_with_company',
                      'info_output',
                      'commodity_word_counts',
                      'colloquial_company_word_counts',
                      'info_dict_merged',
                      'wikipedia_report'],
                     inputs={'datafile':'data.csv'})
pprint.pprint(results['info_dict_merged'])
print(results['info_output'])
print(results['wikipedia_report'])

The main toy module with functions configured for the Hamilton graph:

# python 3.12
"""Toy script.
Takes some input from a csv file on big American mines and looks at Wikipedia text for some extra context."""
import copy
import pprint
import sys
from urllib import request
import re
from bs4 import BeautifulSoup
def parsed_data(datafile:str) -> dict:
    """
    Get csv data into a dictionary keyed on mine name.
    """
    retval = {}
    with open(datafile, 'r') as f:
        headers = [x.strip() for x in next(f).split(',')]
        for linex in f:
            vals = [x.strip() for x in linex.split(',')]
            retval[vals[0]] = {key:val for key, val in zip(headers, vals)}
    pprint.pprint(retval)
    return retval

def data_with_wikipedia(parsed_data:dict) -> dict:
    """
    Connect to wikipedia sites and fill in
    raw html data.

    Return dictionary.
    """
    retval = copy.deepcopy(parsed_data)
    for minex in retval:
        obj = request.urlopen(retval[minex]['wikipedia page'])
        html = obj.read()
        soup = BeautifulSoup(html, 'html.parser')
        print(soup.title)
        # Text from html and strip out newlines.
        newstring = soup.get_text().replace('\n', '')
        retval[minex]['wikipediatext'] = newstring
    return retval

def data_with_company(data_with_wikipedia:dict) -> dict:
    """
    Fetches company ownership for mine out of
    Wikipedia text dump.

    Returns a new dictionary with the company name
    without the big wikipedia text dump.
    """
    # Wikipedia setup for mine company name.
    COMPANYPAT = r'[a-z]Company'
    # Lower case followed by upper case heuristic.
    ENDCOMPANYPAT = '[a-z][A-Z]'
    retval = copy.deepcopy(data_with_wikipedia)
    companypat = re.compile(COMPANYPAT)
    endcompanypat = re.compile(ENDCOMPANYPAT)
    for minex in retval:
        print(minex)
        match = re.search(companypat, retval[minex]['wikipediatext'])
        if match:
            print('Company match span = ', match.span())
            companyidx = match.span()[1]
            match2 = re.search(endcompanypat, retval[minex]['wikipediatext'][companyidx:])
            print('End Company match span = ', match2.span())
            retval[minex]['company'] = retval[minex]['wikipediatext'][companyidx:companyidx + match2.span()[0] + 1]
        # Get rid of big text dump in return value.
        retval[minex].pop('wikipediatext')
    return retval

def info_output(data_with_company:dict) -> str:
    """
    Prints some output text to a file for each
    mine in the data_with_company dictionary.

    Returns string filename of output.
    """
    INFOLINEFMT = 'The {mine:s} mine is a big {commodity:s} mine in the State of {state:s} in the US.'
    COMPANYLINEFMT = '\n    {company:s} owns the mine.\n\n'
    retval = 'mine_info.txt'
    with open(retval, 'w') as f:
        for minex in data_with_company:
            print(INFOLINEFMT.format(**data_with_company[minex]), file=f)
            print(COMPANYLINEFMT.format(**data_with_company[minex]), file=f)
    return retval

def commodity_word_counts(data_with_wikipedia:dict, data_with_company:dict) -> dict:
    """
    Return dictionary keyed on mine with counts of
    commodity (e.g., zinc etc.) mentions on Wikipedia
    page (excluding ones in the company name).
    """
    retval = {}
    # This will probably miss some occurrences at mashed together
    # word boundaries. It is a rough estimate.
    # '\b[Gg]old\b'
    commoditypatfmt = r'\b[{0:s}{1:s}]{2:s}\b'
    for minex in data_with_wikipedia:
        print(minex)
        commodityuc = data_with_wikipedia[minex]['commodity'][0].upper()
        commoditypat = commoditypatfmt.format(commodityuc,
                                              data_with_wikipedia[minex]['commodity'][0],
                                              data_with_wikipedia[minex]['commodity'][1:])
        print(commoditypat)
        commoditymatches = re.findall(commoditypat, data_with_wikipedia[minex]['wikipediatext'])
        # pprint.pprint(commoditymatches)
        nummatchesraw = len(commoditymatches)
        print('Initial length of commoditymatches is {0:d}.'.format(nummatchesraw))
        companymatches = re.findall(data_with_company[minex]['company'],
                                    data_with_wikipedia[minex]['wikipediatext'])
        numcompanymatches = len(companymatches)
        print('Length of companymatches is {0:d}.'.format(numcompanymatches))
        # Is the commodity name part of the company name?
        print('commoditypat = ', commoditypat)
        print(data_with_company[minex]['company'])
        commoditymatchcompany = re.search(commoditypat, data_with_company[minex]['company'])
        if commoditymatchcompany:
            print('commoditymatchcompany.span() = ', commoditymatchcompany.span())
            nummatchesfinal = nummatchesraw - numcompanymatches
            retval[minex] = nummatchesfinal
        else:
            retval[minex] = nummatchesraw
    return retval

def colloquial_company_word_counts(data_with_wikipedia:dict) -> dict:
    """
    Find the number of times the company you associate with
    the property/mine (very subjective) is within the
    text of the mine's wikipedia article.
    """
    retval = {}
    for minex in data_with_wikipedia:
        colloquial_pat = data_with_wikipedia[minex]['colloquial association']
        print(minex)
        nummatches = len(re.findall(colloquial_pat, data_with_wikipedia[minex]['wikipediatext']))
        print('{0:d} matches for colloquial association {1:s}.'.format(nummatches, colloquial_pat))
        retval[minex] = nummatches
    return retval

def info_dict_merged(data_with_company:dict,
                     commodity_word_counts:dict,
                     colloquial_company_word_counts:dict) -> dict:
    """
    Get a dictionary with all the collected information
    in it minus the big Wikipedia text dump.
    """
    retval = copy.deepcopy(data_with_company)
    for minex in retval:
        retval[minex]['colloquial association count'] = colloquial_company_word_counts[minex]
        retval[minex]['commodity word count'] = commodity_word_counts[minex]
    return retval

def wikipedia_report(info_dict_merged:dict) -> str:
    """
    Writes out Wikipedia information (word counts)
    to file in prose; returns string filename.
    """
    retval = 'wikipedia_info.txt'
    colloqfmt = 'The {0:s} mine has {1:d} occurrences of colloquial association {2:s} in its Wikipedia article text.\n'
    commodfmt = 'The {0:s} mine has {1:d} occurrences of commodity name {2:s} in its Wikipedia article text.\n\n'
    with open(retval, 'w') as f:
        for minex in info_dict_merged:
            print(colloqfmt.format(info_dict_merged[minex]['mine'],
                                   info_dict_merged[minex]['colloquial association count'],
                                   info_dict_merged[minex]['colloquial association']), file=f)
            print(commodfmt.format(info_dict_merged[minex]['mine'],
                                   info_dict_merged[minex]['commodity word count'],
                                   info_dict_merged[minex]['commodity']), file=f)
    return retval

My REGEX abilities are somewhere between "I've heard the term REGEX and know regular expressions exist" and brute-forcing bracketed characters into each slot. It worked for this toy example: each Wikipedia page features the word "Company" followed by the name of the owning corporate entity.

Here are the two text outputs the script produces from the information provided (Wikipedia articles from July 2024):

The Red Dog mine is a big zinc mine in the State of Alaska in the US.
    NANA Regional Corporation owns the mine.

The Goldstrike mine is a big gold mine in the State of Nevada in the US.
    Barrick Gold owns the mine.

The Bingham Canyon mine is a big copper mine in the State of Utah in the US.
    Rio Tinto Group owns the mine.

The Red Dog mine has 21 occurrences of colloquial association Teck in its Wikipedia article text.
The Red Dog mine has 29 occurrences of commodity name zinc in its Wikipedia article text.

The Goldstrike mine has 0 occurrences of colloquial association Nevada Gold Mines in its Wikipedia article text.
The Goldstrike mine has 16 occurrences of commodity name gold in its Wikipedia article text.

The Bingham Canyon mine has 49 occurrences of colloquial association Kennecott in its Wikipedia article text.
The Bingham Canyon mine has 84 occurrences of commodity name copper in its Wikipedia article text.

Company names are relatively straightforward, although mining company and properties acquisitions and mergers being what they are, it can get complicated. I unwittingly chose three properties that Wikipedia reports as having one owner. Other big mines like Morenci, Arizona (copper) and Cortez, Nevada (gold) show more than one owner; that case is for another programming day. The Goldstrike information might be out of date - no mention of Nevada Gold Mines or Newmont (one mention, but in a different context). The Cortez Wikipedia page is more current, although it still doesn't mention Nevada Gold Mines.

The inclusion of colloquial association in the input csv file was an afterthought based on a lot of the Wikipedia information not being completely in line with what I thought I knew. Teck is the operator of the Red Dog Mine in Alaska. That name does get mentioned frequently in the Wikipedia article.

Enough mining stuff - it is a programming blog after all. Next time (not written yet) I hope to cover dressing up and highlighting the graphviz output a bit.

Thank you for stopping by.


Categories: FLOSS Project Planets

Eli Bendersky: You don't need virtualenv in Go

Thu, 2024-07-04 16:41

Programmers that come to Go from Python often wonder "do I need something like virtualenv here?"

The short answer is NO; this post will provide some additional details.

While virtualenv in Python is useful in many situations, I think it'd be fair to divide them into two broad scenarios: for execution and for development. Let's see what Go offers for each of these scenarios.

Execution

There are multiple, mutually-incompatible versions of Python out in the wild. There are even multiple versions of the packaging tools (like pip). On top of this, different programs need different packages, often themselves with mutually-incompatible versions.

Python code typically expects to be installed, and expects to find packages it depends on installed in a central location. This can be an issue for systems where we don't have the permission to install packages/code to a central location.

All of this makes distributing Python applications quite tricky. It's common to use bundling tools like PyInstaller, but virtualenv is also a popular option [1].

Go is a statically compiled language, so this is a non-problem! Binaries are easy to build and distribute; the binary is a native executable for a given platform (just like a native executable built from C or C++ source), and has no dependencies on compiler or package versions. While you can install Go programs into a central location, you by no means have to do this. In fact, you typically don't have to install Go programs at all. Just invoke the binary.

It's also worth mentioning that Go has great cross-compilation support, making it easy to create binaries for multiple OSes from a single development machine.

Development

Consider the following situation: you're developing a package, which depends on N other packages at specific versions; e.g. you need package foo at version 1.2 or above. Your system may have an older version of foo installed - 0.9; you try to upgrade it to 1.2 and some other program breaks. Now, this all sounds very manageable for package foo - how hard can it be to upgrade the uses of this simple package?

Reality is more difficult. foo could be Django; your code depends on a new version, while some other critical systems depend on an old version. Good luck fixing this conundrum. In Python, virtualenv is a critical tool to make such situations manageable; newer tools like pipenv wrap virtualenv with more usability patterns.

How about Go?

If you're using Go modules, this situation is very easy to handle. In a way, a Go module serves as its own virtualenv. Your go.mod file specifies the exact versions of dependency packages needed for your development, and these versions don't mix up with packages you need to develop some other project (which has its own go.mod).

Moreover, Go module directives like replace make it easy to short-circuit dependencies to try local patches. While debugging your project you find that package foo has a bug that may be affecting you? Want to try a quick fix and see if you're right? No problem: just clone foo locally, apply a fix, and use a replace directive to use this locally patched foo. See this post for a few ways to automate this process.

What about different Go versions? Suppose you have to investigate a user report complaining that your code doesn't work with an older Go version. Or maybe you're curious to see how the upcoming beta release of a Go version will affect you. Go makes it easy to install different versions locally. These different versions have their own standard libraries that won't interfere with each other.

[1] Fun fact: this blog uses the Pelican static site generator. To regenerate the site I run Pelican in a virtualenv because I need a specific version of Pelican with some personal patches.
Categories: FLOSS Project Planets

Glyph Lefkowitz: Against Innovation Tokens

Thu, 2024-07-04 15:54

Updated 2024-07-04: After some discussion, added an epilogue going into more detail about the value of the distinction between the two types of tokens.

In 2015, Dan McKinley laid out a model for software teams selecting technologies. He proposed that each team have a limited supply of “innovation tokens”, and, when selecting a technology, they can choose boring ones for free but “innovative” ones cost a token. This implies that we all know which technologies are innovative, and we assume that they are inherently costly, so we want to restrict their supply.

That model has become popular to the point that it is now part of the vernacular. In many discussions, it is accepted as received wisdom, or even common sense.

In this post I aim to show you that despite being superficially helpful, this model is wrong, and in fact, may be counterproductive. I believe it is an attractive nuisance in computer programming discourse.

In fairness to Mr. McKinley, the model he described in this post is:

  1. nearly a decade old at this point, and
  2. much more nuanced in its description of the problem with “innovation” than the subsequent memetic mutation of the concept.

While I will be referencing McKinley’s post, and I do take some issue with it, I am reacting more strongly to the life of its own that this idea has taken on once it escaped its original context. There are a zillion worse posts rehashing this concept, on blogs and LinkedIn, but I won’t be linking to them because the goal is not to call anybody out.

To some extent I am re-raising McKinley’s own caveats and reinforcing them. So I may be arguing with a strawman, but it’s a strawman I have seen deployed with some regularity over the years.

To reduce it to its core, this strawman is “don’t use new or interesting technology, and if you have to, only use a little bit”.

Within the broader culture of programmers, an “innovation token” has become a shorthand to smear any technology perceived — almost always based on vibes, not data — as risky, and the adoption of novel approaches as pretentious and unserious. Speaking of programmer culture though, I do have to acknowledge there is also a pervasive tendency for us to get distracted by novelty and waste time on puzzles rather than problem-solving, so I understand where the reactionary attitude represented by the concept of an innovation token comes from.

But it is reactionary.

At its worst, it borders on anti-intellectualism. I have heard it used on more than one occasion as a thought-terminating cliche to discard a potentially promising new tool. But before I get into that, let me try to give a sympathetic summary of the idea, because the model is not entirely bad.

It has been popular for a long time because it does work okay as an heuristic.

The real problem that McKinley is describing is operational overhead. When programmers make a technology selection, we are often considering how difficult it will make the programming. Innovative technology selections are, by definition, less mature.

That lack of maturity — particularly in the open source world — often means that the project is in a part of its lifecycle where it is concerned with development affordances more than operational ones. Therefore, the stereotypical innovative project, even one which might legitimately be a big improvement to development velocity, will create more operational overhead. That operational overhead creates a hidden cost for the operations team later on.

This is a point I emphatically agree with. When selecting a technology, you should consider its ease of operation more than its ease of development. If your team is successful, they will be operating and maintaining it far longer than they are initially integrating and deploying it.

Furthermore, some operational overhead is inevitable. You will need to hire people to mitigate it. More popular, more mature projects will have a bigger talent pool to hire from, so your training costs will be lower, and those training costs are part of your operational cost too.

Rationing innovation tokens therefore can work as a reasonable heuristic, or proxy metric, for avoiding a mess of complex operational problems associated with dependencies that are expensive to operate and hard to hire for.

There are some minor issues I want to point out before getting to the overarching one.

  1. “has a lot of operational overhead” is a stereotype of a new technology, not an inherent property. If you want to reject a technology on the basis of being too high-overhead, at least look into its actual overhead a little bit. Sometimes, especially in 2024 as opposed to 2015, the point of a new, shiny piece of tech is to address operational issues that the more boring, older one had.
  2. “hard to learn” is also a stereotype; if “newer” meant “harder” then we would all be using troff rather than Google Docs. Actually ask if the innovativeness is making things harder or easier; don’t assume.
  3. You are going to have to train people on your stack no matter what. If a technology is adding a lot of value, it’s absolutely worth hiring for general ability and making a plan to teach people about it. You are going to have to do this with the core technology of your product anyway.

As I said, though, these are minor issues. The big problem with modeling operational overhead as an “innovation token” is that an even bigger concern than selecting an innovative tool is selecting too many tools.

The impulse to select more tools and make your operational environment more complex can be made worse by trying to avoid innovative tools. The important thing is not “less innovation”, but more consistency. To illustrate this, let’s do a simple thought experiment.

Let’s say you’re going to make a web app. There’s a tool in Haskell that you really like for a critical part of your app’s problem domain. You don’t want to spend more than one innovation token though, and everything in Haskell is inherently innovative, so you write a little service that just does that one part and you write the rest of your app in Ruby, calling into that service whenever you need to use that thing. This will appropriately restrict your “innovation token” expenditure.

Does doing this actually reduce your operational overhead, though?

First, you will have to find a team that likes both Ruby and Haskell and sees no problem using both. If you are not familiar with the cultural proclivities of these languages, suffice it to say that this is unlikely. Hiring for Haskell programmers is hard because there are fewer of them than Ruby programmers, but hiring for polyglot Haskell/Ruby programmers who are happy to do either is going to be really hard.

Since you will need to find different people to write in the different languages, even in the best case scenario, you will have two teams: the Haskell team and the Ruby team. Even if you are incredibly disciplined about inter-service responsibilities, there will be some areas where duplication of code is necessary across those services. Disagreements will arise and every one of these disagreements will be a source of social friction and software defects.

Then, you need to set up separate CI pipelines for each language, separate deployment systems, and of course, separate databases. Right away you are effectively doubling your workload.

In the worse, and unfortunately more likely scenario, there will be enormous infighting between these two teams. Operational incidents will be more difficult to manage because rather than learning the Haskell tools for operational visibility and disseminating that institutional knowledge amongst your team, you will be half-learning the lessons from two separate ecosystems and attempting to integrate them. Every on-call engineer will be frantically trying to learn a language ecosystem they don’t use regularly, or you will double the size of your on-call rotation. The Ruby team may start to resent the Haskell team for getting to exclusively work on the fun parts of the problem while they are doing things that look more like rote grunt work.

A better way to think about the problem of managing operational overhead is, rather than “innovation tokens”, consider “boundary tokens”.

That is to say, rather than evaluating the general sense of weird vibes from your architecture, consider the consistency of that architecture. If you’re using Haskell, use Haskell. You should be all-in on Haskell web frameworks, Haskell ORMs, Haskell OAuth integrations, and so on.1 To cross the boundary out of Haskell, you need to spend a boundary token, and you shouldn’t have many of those.

I submit that the increased operational overhead that you might experience with an all-Haskell tool selection will be dwarfed by the savings that you get by having a team that is aligned with each other, that can communicate easily, and that can share programs with each other without needing to first strategize about a channel for the two pieces of work to establish bidirectional communication. The ability to simply call a function when you need to call it is very powerful, and extremely underrated.

Consistency ought to apply at each layer of the stack; it is perhaps most obvious with programming languages, but it is true of web frameworks, test frameworks, cryptographic libraries, you name it. Make a choice and stick with it, because every deviation from that choice carries a significant cost. Moreover this cost is a hidden cost, in the same way that the operational downsides of an “innovative” tool that hasn’t seen much production use might be hidden.

Discarding a more standard tool in favor of a tool more consistent with your architecture extends even to fairly uncontroversial, ubiquitous tools. For example, one of my favorite architectural patterns is to forego the use of the venerable — and very boring – Cron, the UNIX task-scheduler. Instead of Cron, it can make a lot of sense to have hand-written bespoke code for scheduling tasks within the application. Within the “innovation tokens” model, this is a very silly waste of a token!

Just use Cron! Everybody knows how to use Cron!

Except… does everybody know how to use Cron? Here are some questions to consider, if you’re about to roll out a big dependency on Cron:

  1. How do you write a unit test for a scheduling rule with Cron?
  2. Can you even remember how to write a cron rule that runs at the times you want?
  3. How do you inject secrets and configuration variables into the distinct and somewhat idiosyncratic runtime execution environment of Cron?
  4. How do you know that you did that variable-injection properly until the job actually runs, possibly in the middle of the night?
  5. How do you deploy your monitoring and error-logging frameworks to observe your scripts run under Cron?

Granted, this architectural choice is less controversial than it once was. Cron used to be ambiently available on whatever servers you happened to be running. As container-based deployments have increased in popularity, this sense that Cron is just kinda around has gone away, and if you need to run a container that just runs Cron, much of the jankiness of its deployment is a lot more immediately visible.

There is friction at the boundary between things. That friction is a cost, but sometimes it’s a cost worth paying.

If there’s a really good library in Haskell and a really good library in Ruby and you really do want to use them both, maybe it makes sense to actually have multiple services. As your team gets larger and more mature, the need to bring in more tools, and the ability to handle the associated overhead, will only increase over time. But the place that the cost comes in the most is at the boundary between tools, not in the operational deficiencies of any one particular tool.

Even in a bog-standard web application with the most boring, least innovative tech stack imaginable (PHP, MySQL, HTML, CSS, JavaScript), many of the annoying points of friction are where different, inconsistent technologies make contact. If you are a programmer working on the web yourself, consider your own impression of the level of controversy of these technologies:

Consider that there are far more complex technical tools in terms of required skills to implement them, like computer vision or physics simulation, tools which are also pretty widely used, which consistently generate lower levels of controversy. People do have strong feelings about these things as well, of course, and it’s hard to find things to link to that show “this isn’t controversial”, but, like, search your feelings, you know it to be true.

You can see the benefits of the boundary token approach in programming language design. Many of the most influential and best-loved programming languages had an impact not by bundling together lots of tools, but by making everything into one thing:

  • LISP: everything is a list
  • Smalltalk: everything is an object
  • ML: everything is an algebraic data type
  • Forth: everything is a stack

There is a tremendous power in thinking about everything as a single kind of thing, because then you don’t have to juggle lots of different ideas about different kinds of things; you can just think about your problem.

When people complain about programming languages, they’re often complaining about how many different kinds of thing they have to remember in order to use it.

If you keep your boundary-token budget small, and allow your developers to accomplish as much as possible while staying within a solution space delineated by a single, clean cognitive boundary, I promise you can innovate as much as you want and your operational costs will remain manageable.

Epilogue

In subsequent Mastodon discussion of this post with Matt Campbell and Meejah, I realized that I may not have made it entirely clear why I feel the distinction between “boundary” and “innovation” tokens is important. I do say above that the “innovation token” model can be a useful heuristic, so why bother with a new, but slightly different heuristic? Especially since most experienced engineers - indeed, McKinley himself - would budget “innovation” quite similarly to “boundaries”, and might even consider the use of more “innovative” Haskell tools in my hypothetical scenario to not even be an expenditure of innovation tokens at all.

To answer that, I need to highlight the purpose of having heuristics like this in the first place. These are vague, nebulous guidelines, not hard and fast rules. I cannot give you a token calculator to plug your technical decisions into. The purpose of either token heuristic is to facilitate discussions among a team.

With a team of skilled and experienced engineers, the distinction is meaningless. Senior and staff engineers (at least, the ones who deserve their level) will intuit the goals behind “innovation tokens” and inherently consider things like operational overhead anyway. In practice, a high-performing, well-aligned team discussing innovation tokens and one discussing boundary tokens will look functionally indistinguishable.

The distinction starts to be important when you have management pressures, nervous executives, inexperienced engineers, a fresh team without existing consensus about core technology choices, and so on. That is to say, most teams that exist in the messy, perpetually in medias res world of the software industry.

If you are just getting started on a project and you have a bunch of competent but disagreeable engineers, the words “innovation” and “boundaries” function very differently.

If you ask, “is this an innovation” about a particular technical tool, you are asking your interlocutor to pull in a bunch of their skills and experience to subjectively evaluate the relative industry-wide, or maybe company-wide, or maybe team-wide2 newness of the thing being discussed. The discussion of whether it counts as boring or innovative is immediately fraught with a ton of subjective, difficult-to-quantify information about costs of hiring, difficulty of learning, and your impression of the feelings of hundreds or thousands of people outside of your team. And, yes, ultimately you do need to have an estimate of all that stuff, but starting your “is it OK to use this” conversation by simultaneously arguing about all those subjective judgments is setting yourself up for failure.

Instead, if you ask “does this introduce a boundary between two different technologies with different conceptual models”, while that is not a perfectly objective question, it is much easier for your team to answer, with much crisper intermediary factual questions. What are the two technologies? What are the models? How much do they differ? You can just hash out the answers to each one within the team directly, rather than needing to sift through the last few years of Stack Overflow developer surveys to determine relative adoption or popularity of technologies in the world at large.

Restricting your supply of either boundary or innovation tokens is a good idea, but achieving unanimity within your team about what your boundaries are is always going to be easier than deciding what your innovations are.

Acknowledgments

Thank you to my patrons who are supporting my writing on this blog. If you like what you’ve read here and you’d like to read more of it, or you’d like to support my various open-source endeavors, you can support my work as a sponsor! I am also available for consulting work if you think your organization could benefit from expertise on topics like “how can we make our architecture more consistent”.

  1. I gave a talk about this once, a very long time ago, where Haskell was Python. 

  2. It’s not clear, that’s a big part of the problem. 

Categories: FLOSS Project Planets

Python Morsels: Strings in Python

Thu, 2024-07-04 11:16

Strings are used to store text-based data.

Table of contents

  1. Strings store text
  2. How are strings used?
  3. String methods in Python
  4. String concatenation
  5. Double quotes vs single quotes
  6. Escape characters
  7. Strings are everywhere in Python

Strings store text

This is a string:

>>> message = "This is text"

Python strings store text:

>>> message
'This is text'

How are strings used?

Strings are often used for …

Read the full article: https://www.pythonmorsels.com/strings-in-python/
Categories: FLOSS Project Planets

PyCharm: Polars vs. pandas: What’s the Difference?

Thu, 2024-07-04 09:58

If you’ve been keeping up with the advances in Python dataframes in the past year, you couldn’t help hearing about Polars, the powerful dataframe library designed for working with large datasets.

Unlike other libraries for working with large datasets, such as Spark, Dask, and Ray, Polars is designed to be used on a single machine, prompting a lot of comparisons to pandas. However, Polars differs from pandas in a number of important ways, including how it works with data and what its optimal applications are. In the following article, we’ll explore the technical details that differentiate these two dataframe libraries and have a look at the strengths and limitations of each.

If you’d like to hear more about this from the creator of Polars, Ritchie Vink, you can also see our interview with him below!

Why use Polars over pandas?

In a word: performance. Polars was built from the ground up to be blazingly fast and can do common operations around 5–10 times faster than pandas. In addition, the memory requirement for Polars operations is significantly smaller than for pandas: pandas requires around 5 to 10 times as much RAM as the size of the dataset to carry out operations, compared to the 2 to 4 times needed for Polars.

You can get an idea of how Polars performs compared to other dataframe libraries here. As you can see, Polars is between 10 and 100 times as fast as pandas for common operations and is actually one of the fastest DataFrame libraries overall. Moreover, it can handle larger datasets than pandas can before running into out-of-memory errors.

Why is Polars so fast?

These results are extremely impressive, so you might be wondering: How can Polars get this sort of performance while still running on a single machine? The library was designed with performance in mind from the beginning, and this is achieved through a few different means.

Written in Rust

One of the most well-known facts about Polars is that it is written in Rust, a low-level language that is almost as fast as C and C++. In contrast, pandas is built on top of Python libraries, one of these being NumPy. While NumPy’s core is written in C, it is still hamstrung by inherent problems with the way Python handles certain types in memory, such as strings for categorical data, leading to poor performance when handling these types (see this fantastic blog post from Wes McKinney for more details).

One of the other advantages of using Rust is that it allows for safe concurrency; that is, it is designed to make parallelism as predictable as possible. This means that Polars can safely use all of your machine’s cores for even complex queries involving multiple columns, which led Ritchie Vink to describe Polars’ performance as “embarrassingly parallel”. This gives Polars a massive performance boost over pandas, which only uses one core to carry out operations. Check out this excellent talk by Nico Kreiling from PyCon DE this year, which goes into more detail about how Polars achieves this.

Based on Arrow

Another factor that contributes to Polars’ impressive performance is Apache Arrow, a language-independent memory format. Arrow was actually co-created by Wes McKinney in response to many of the issues he saw with pandas as the size of data exploded. It is also the backend for pandas 2.0, a more performant version of pandas released in early 2023. The Arrow backends of the libraries do differ slightly, however: while pandas 2.0 is built on PyArrow, the Polars team built their own Arrow implementation.

One of the main advantages of building a data library on Arrow is interoperability. Arrow has been designed to standardize the in-memory data format used across libraries, and it is already used by a number of important libraries and databases, as you can see below.

This interoperability speeds up performance as it bypasses the need to convert data into a different format to pass it between different steps of the data pipeline (in other words, it avoids the need to serialize and deserialize the data). It is also more memory-efficient, as two processes can share the same data without needing to make a copy. As serialization/deserialization is estimated to represent 80–90% of the computing costs in data workflows, Arrow’s common data format lends Polars significant performance gains.
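As a rough illustration of that interoperability, here is a minimal sketch (not taken from the article, and assuming pandas, Polars, and pyarrow are installed) of handing data between pandas and Polars through Arrow rather than re-serializing it:

import pandas as pd
import polars as pl

pd_df = pd.DataFrame({"city": ["Berlin", "Paris"], "temp_c": [21.5, 24.0]})

# pandas -> Polars: the data is handed over via Arrow instead of being
# written out to some intermediate format and parsed again
pl_df = pl.from_pandas(pd_df)

# Polars -> Arrow table -> pandas again, staying in the same memory format
arrow_table = pl_df.to_arrow()
pd_again = arrow_table.to_pandas()

print(pl_df)
print(pd_again)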

Arrow also has built-in support for a wider range of data types than pandas. As pandas is based on NumPy, it is excellent at handling integer and float columns, but struggles with other data types. In contrast, Arrow has sophisticated support for datetime, boolean, binary, and even complex column types, such as those containing lists. In addition, Arrow is able to natively handle missing data, which requires a workaround in NumPy.

Finally, Arrow uses columnar data storage, which means that, regardless of the data type, all columns are stored in a contiguous block of memory. This not only makes parallelism easier, but also makes data retrieval faster.

Query optimization

One of the other cores of Polars’ performance is how it evaluates code. Pandas, by default, uses eager execution, carrying out operations in the order you’ve written them. In contrast, Polars has the ability to do both eager and lazy execution, where a query optimizer will evaluate all of the required operations and map out the most efficient way of executing the code. This can include, among other things, rewriting the execution order of operations or dropping redundant calculations. Take, for example, the following expression to get the mean of column Number1 for each of the categories “A” and “B” in Category.

(
    df
    .groupby(by="Category")
    .agg(pl.col("Number1").mean())
    .filter(pl.col("Category").is_in(["A", "B"]))
)

If this expression is eagerly executed, the groupby operation will be unnecessarily performed for the whole DataFrame, and then filtered by Category. With lazy execution, the DataFrame can be filtered and groupby performed on only the required data.
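As a hedged sketch of what the lazy version might look like (this is not from the article; it uses a small made-up DataFrame, and note that newer Polars versions spell the method group_by rather than groupby):

import polars as pl

df = pl.DataFrame({
    "Category": ["A", "B", "C", "A"],
    "Number1": [1.0, 2.0, 3.0, 4.0],
})

result = (
    df.lazy()                                   # build a query plan, nothing executes yet
    .group_by("Category")
    .agg(pl.col("Number1").mean())
    .filter(pl.col("Category").is_in(["A", "B"]))
    .collect()                                  # the optimizer rewrites and runs the plan here
)
print(result)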

Expressive API

Finally, Polars has an extremely expressive API, meaning that basically any operation you want to perform can be expressed as a Polars method. In contrast, more complex operations in pandas often need to be passed to the apply method as a lambda expression. The problem with the apply method is that it loops over the rows of the DataFrame, sequentially executing the operation on each one. Being able to use built-in methods allows you to work on a columnar level and take advantage of another form of parallelism called SIMD.
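To make that concrete, here is a small illustrative comparison (not from the article; the column names are invented for the example) of the row-wise apply pattern in pandas versus the same logic written as a Polars expression:

import pandas as pd
import polars as pl

pd_df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})
pl_df = pl.from_pandas(pd_df)

# pandas: apply with a lambda calls the Python function once per row
# (pandas can vectorize simple arithmetic too; apply tends to appear
# once the per-row logic gets more involved)
pd_df["total"] = pd_df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Polars: the same logic as a column-level expression, which the engine
# can evaluate in a vectorized, parallel fashion
pl_df = pl_df.with_columns((pl.col("price") * pl.col("qty")).alias("total"))

print(pd_df)
print(pl_df)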

When should you stick with pandas?

All of this sounds so amazing that you’re probably wondering why you would even bother with pandas anymore. Not so fast! While Polars is superb for doing extremely efficient data transformations, it is currently not the optimal choice for data exploration or for use as part of machine learning pipelines. These are areas where pandas continues to shine.

One of the reasons for this is that while Polars has great interoperability with other packages using Arrow, it is not yet compatible with most of the Python data visualization packages nor machine learning libraries such as scikit-learn and PyTorch. The only exception is Plotly, which allows you to create charts directly from Polars DataFrames.

A solution that is being discussed is using the Python dataframe interchange protocol in these packages to allow them to support a range of dataframe libraries, which would mean that data science and machine learning workflows would no longer be bottlenecked by pandas. However, this is a relatively new idea, and it will take time for these projects to implement.

Tooling for Polars and pandas

After all of this, I am sure you are eager to try Polars yourself! PyCharm Professional for Data Science offers excellent tooling for working with both pandas and Polars in Jupyter notebooks. In particular, pandas and Polars DataFrames are displayed with interactive functionality, which makes exploring your data much quicker and more comfortable.

Some of my favorite features include the ability to scroll through all rows and columns of the DataFrame without truncation, get aggregations of DataFrame values in one click, and export the DataFrame in a huge range of formats (including Markdown!).

If you’re not yet using PyCharm, you can try it with a 30-day trial by following the link below.

Start your PyCharm Pro free trial

Categories: FLOSS Project Planets

Real Python: Working With JSON Data in Python

Wed, 2024-07-03 10:00

Since its introduction, JSON has rapidly emerged as the predominant standard for the exchange of information. Whether you want to transfer data with an API or store information in a document database, it’s likely you’ll encounter JSON. Fortunately, Python provides robust tools to facilitate this process and help you manage JSON data efficiently.

In this tutorial, you’ll learn how to:

  • Understand the JSON syntax
  • Convert Python data to JSON
  • Deserialize JSON to Python
  • Write and read JSON files
  • Validate JSON syntax
  • Prettify JSON in the terminal
  • Minify JSON with Python

While JSON is the most common format for data distribution, it’s not the only option for such tasks. Both XML and YAML serve similar purposes. If you’re interested in how the formats differ, then you can check out the tutorial on how to serialize your data with Python.

Free Bonus: Click here to download the free sample code that shows you how to work with JSON data in Python.

Take the Quiz: Test your knowledge with our interactive “Working With JSON Data in Python” quiz. You’ll receive a score upon completion to help you track your learning progress:

Interactive Quiz

Working With JSON Data in Python

In this quiz, you'll test your understanding of working with JSON in Python. JSON has become the de facto standard for information exchange, and Python provides easy-to-use tools to handle JSON data.

Introducing JSON

The acronym JSON stands for JavaScript Object Notation. As the name suggests, JSON originated from JavaScript. However, JSON has transcended its origins to become language-agnostic and is now recognized as the standard for data interchange.

The popularity of JSON can be attributed to native support by the JavaScript language, resulting in excellent parsing performance in web browsers. On top of that, JSON’s straightforward syntax allows both humans and computers to read and write JSON data effortlessly.

To get a first impression of JSON, have a look at this example code:

JSON hello_world.json

{
  "greeting": "Hello, world!"
}

You’ll learn more about the JSON syntax later in this tutorial. For now, recognize that the JSON format is text-based. In other words, you can create JSON files using the code editor of your choice. Once you set the file extension to .json, most code editors display your JSON data with syntax highlighting out of the box:

The screenshot above shows how VS Code displays JSON data using the Bearded color theme. You’ll have a closer look at the syntax of the JSON format next!

Examining JSON Syntax

In the previous section, you got a first impression of how JSON data looks. And as a Python developer, the JSON structure probably reminds you of common Python data structures, like a dictionary that contains a string as a key and a value. If you understand the syntax of a dictionary in Python, you already know the general syntax of a JSON object.

Note: Later in this tutorial, you’ll learn that you’re free to use lists and other data types at the top level of a JSON document.

The similarity between Python dictionaries and JSON objects is no surprise. One idea behind establishing JSON as the go-to data interchange format was to make working with JSON as convenient as possible, independently of which programming language you use:

[A collection of key-value pairs and arrays] are universal data structures. Virtually all modern programming languages support them in one form or another. It makes sense that a data format that is interchangeable with programming languages is also based on these structures. (Source)
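As a quick preview of that correspondence (the standard-library json module is covered in more depth later in the tutorial), serializing a Python dictionary produces exactly this kind of JSON object:

import json

person = {"name": "Frieda", "isDog": True, "age": 8}

# dumps() turns the dictionary into JSON text; note that Python's True
# becomes JSON's lowercase true
text = json.dumps(person, indent=2)
print(text)

# loads() parses the text back into an equivalent dictionary
assert json.loads(text) == person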

To explore the JSON syntax further, create a new file named hello_frieda.json and add a more complex JSON structure as the content of the file:

JSON hello_frieda.json

 1 {
 2   "name": "Frieda",
 3   "isDog": true,
 4   "hobbies": ["eating", "sleeping", "barking"],
 5   "age": 8,
 6   "address": {
 7     "work": null,
 8     "home": ["Berlin", "Germany"]
 9   },
10   "friends": [
11     {
12       "name": "Philipp",
13       "hobbies": ["eating", "sleeping", "reading"]
14     },
15     {
16       "name": "Mitch",
17       "hobbies": ["running", "snacking"]
18     }
19   ]
20 }

In the code above, you see data about a dog named Frieda, which is formatted as JSON. The top-level value is a JSON object. Just like Python dictionaries, you wrap JSON objects inside curly braces ({}).

In line 1, you start the JSON object with an opening curly brace ({), and then you close the object at the end of line 20 with a closing curly brace (}).

Note: Although whitespace doesn’t matter in JSON, it’s customary for JSON documents to be formatted with two or four spaces to indicate indentation. If the file size of the JSON document is important, then you may consider minifying the JSON file by removing the whitespace. You’ll learn more about minifying JSON data later in the tutorial.

Inside the JSON object, you can define zero, one, or more key-value pairs. If you add multiple key-value pairs, then you must separate them with a comma (,).

Read the full article at https://realpython.com/python-json/ »

[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

Categories: FLOSS Project Planets

Real Python: Quiz: Python's Magic Methods: Leverage Their Power in Your Classes

Wed, 2024-07-03 08:00

In this quiz, you’ll test your understanding of Python’s Magic Methods.

By working through this quiz, you’ll revisit the concept of magic methods in Python, how they work, and how you can use them to customize the behavior of your classes.

[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

Categories: FLOSS Project Planets

Gaël Varoquaux: Skrub 0.2.0: tabular learning made easy

Tue, 2024-07-02 18:00

We just released skrub 0.2.0. This release markedly simplifies learning on complex dataframes.

model = tabular_learner('classifier')

Simple, yet solid default baseline

The highlight of the release is the tabular_learner function, which facilitates creating pipelines that readily perform machine learning on dataframes, adding preprocessing to a scikit-learn compatible learner. The function packs defaults and heuristics to transform all forms of dataframes to a representation that is well suited to a learner, and it can adapt these transformations: tabular_learner(HistGradientBoostingClassifier()) encodes categories differently than tabular_learner(LogisticRegression()).

The heuristics are tuned based on extensive benchmarking, and experience shows that they give good tradeoffs. The default tabular_learner('classifier') is often a strong baseline.

The benefits are visible in a really simple example:

>>> # First retrieve data
>>> from skrub.datasets import fetch_employee_salaries
>>> dataset = fetch_employee_salaries()
>>> df = dataset.X
>>> y = dataset.y
>>> # The dataframe is a quite rich and complex dataframe, with various columns
>>> df

We can then easily build a learner that applies readily to this dataframe, without any transformation:

>>> from skrub import tabular_learner
>>> learner = tabular_learner('regressor')
>>> # The resulting learner can apply all the machine-learning conveniences (eg cross-validation) directly on the dataframe
>>> from sklearn.model_selection import cross_val_score
>>> cross_val_score(learner, df, y)
array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])

transformer = TableVectorizer()

Making encoding complex dataframes easy

Under the hood, the work is done by skrub.TableVectorizer(), a scikit-learn compatible transformer that facilitates combining multiple transformations on the different columns of a dataframe. The TableVectorizer is not new in the 0.2.0 release, but we have completely revamped its internals to cover edge cases really well. Indeed, one challenge is to make sure that nothing different or strange happens at test time. Actually, enforcing consistency between train-time and test-time transformations is the real value of skrub compared to using pandas or polars to do the transformations.
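A minimal sketch of using it directly, reusing the dataset fetched above (the exact output type may vary between skrub versions):

from skrub import TableVectorizer
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
df = dataset.X

# Fit on the raw dataframe and obtain a numeric representation that any
# scikit-learn estimator can consume
vectorizer = TableVectorizer()
X = vectorizer.fit_transform(df)
print(X.shape)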

Increasing support of polars

Short-term goal of optimized support for pandas and polars

We have implemented a new mechanism for supporting both pandas and polars. It has not yet been applied across the whole codebase, hence the support is still imperfect. However, we are seeing increasing support for polars in skrub, and our goal in the short term is to provide rock-solid polars support.


Try skrub out! It’s still young, but in my opinion, it provides a lot of value to tabular learning.

Categories: FLOSS Project Planets

PyCoder’s Weekly: Issue #636 (July 2, 2024)

Tue, 2024-07-02 15:30

#636 – JULY 2, 2024
View in Browser »

Build a GUI Calculator With PyQt and Python

In this video course, you’ll learn how to create graphical user interface (GUI) applications with Python and PyQt. Once you’ve covered the basics, you’ll build a fully functional desktop calculator that can respond to user events with concrete actions.
REAL PYTHON course

Satellites Spotting Ships

Umbra Space has released a data set consisting of satellite based radar images of shipping. This article from Mark shows you how to grab the data, visualize, and annotate it.
MARK LITWINTSCHIK

Discover the Power of Observability With Pydantic Logfire

Logfire, by the makers of Pydantic, is an observability platform that will help you understand your app’s behavior with less code and time. Built on OpenTelemetry, it features user-friendly dashboards, SQL querying, and Python-specific integrations. Get started today →
PYDANTIC sponsor

Modern Good Practices for Python Development

This is a very detailed list of best practices for developing in Python. It includes tools, language features, application design, which libraries to use, and more.
STUART ELLIS

PSF Board Candidates for 2024

PYTHON SOFTWARE FOUNDATION

Python 3.13.0 Beta 3 Released

CPYTHON DEV BLOG

Django 5.1 Beta 1 Released

DJANGO SOFTWARE FOUNDATION

PyBay 2024 Call for Proposals

PYBAY

Articles & Tutorials Build a Guitar Synthesizer: Play Musical Tablature in Python

In this tutorial, you’ll build a guitar synthesizer using the Karplus-Strong algorithm in Python. You’ll model vibrating strings, simulate strumming techniques, read musical notation and tablature, and apply audio effects. By the end, you’ll have created a digital guitar that can play any song. This tutorial was also discussed on Real Python Podcast Episode #210.
REAL PYTHON

Python’s Security Model After the xz-utils Backdoor

The backdoor introduced to the xz-utils compression project through social engineering was one of the topics at the Python Language Summit. Participants discussed what can be done to prevent similar social engineering attacks on the Python source.
PYTHON SOFTWARE FOUNDATION

Authentication Your Whole Team Will Love

“With PropelAuth, I think I’ve spent about a day – total – on auth over the past year.” PropelAuth is easy to integrate and provides all the tools your team needs to manage your users - dashboards, user insights, impersonation, SSO and more →
PROPELAUTH sponsor

Running Prettier Against Django or Jinja Templates

“Prettier” is a JavaScript based linting tool for templates. For folks not familiar with the world of npm, it can be a bit daunting to get it going. Simon fiddled with it so you don’t have to and posted how he got it working on his system.
SIMON WILLISON

Write Less Code, You Must

An often overlooked aspect of software development is architecture at the module & function level. It is important to write code that is simple and easy to move from one place to another.
DAVIDVUJIC.BLOGSPOT.COM • Shared by David Vujic

Quickstart for Playing With LLMs Locally

This is a simple, quick guide to getting started running LLMs on your local computer. It covers the basics of the powerful libraries Ollama and LangChain for controlling these AI models.
JOSHUA COOK • Shared by Joshua Cook

Under the Hood of Python’s set Data Structure

This tutorial covers hash tables, collision handling, performance optimization and how it relates to the implementation of the set data structure in Python.
ABHINAV UPADHYAY

Ways to Have an Atomic Counter in Django

Keeping a counter across objects in Django means having to be careful about race conditions. This article outlines several approaches to the problem.
GONÇALO VALÉRIO

Saying Thanks to Open Source Maintainers

Brett talks about the different ways you can support the many maintainers of open source projects, and often times just saying “thanks” means a lot.
BRETT CANNON

A Guide to Python’s Weak References

Learn all about weak references in Python: reference counting, garbage collection, and practical uses of the weakref module
MARTIN HEINZ • Shared by Martin Heinz

Creating Great README Files for Your Python Projects

In this tutorial, you’ll learn how to create, organize, and format high-quality README files for your Python projects.
REAL PYTHON

Get Terminal Size

This quick TIL post from Rodrigo shows you how to get information about the terminal size from the shutil module.
RODRIGO GIRÃO SERRÃO

A Complete Guide to pytest Fixtures

Learn how to use pytest fixtures for writing maintainable and isolated tests.
STANLEY ULILI

Projects & Code burr: Build Apps That Make Decisions (Chatbots, Agents, etc)

GITHUB.COM/DAGWORKS-INC

Lazy f-strings

GITHUB.COM/POMPONCHIK • Shared by pomponchik

oxo: Security Scanning Orchestrator

GITHUB.COM/OSTORLAB

jax: Composable Transformations of Python+NumPy Programs

GITHUB.COM/GOOGLE

dbt-utils: Utility Functions for DBT Projects

GITHUB.COM/DBT-LABS

Events Weekly Real Python Office Hours Q&A (Virtual)

July 3, 2024
REALPYTHON.COM

Canberra Python Meetup

July 4, 2024
MEETUP.COM

Sydney Python User Group (SyPy)

July 4, 2024
SYPY.ORG

EuroPython 2024

July 8 to July 15, 2024
EUROPYTHON.EU

SciPy US 2024

July 8 to July 14, 2024
SCIPY.ORG

PyCon Nigeria 2024

July 10 to July 14, 2024
PYCON.ORG

Happy Pythoning!
This was PyCoder’s Weekly Issue #636.
View in Browser »

[ Subscribe to 🐍 PyCoder’s Weekly 💌 – Get the best Python news, articles, and tutorials delivered to your inbox once a week >> Click here to learn more ]

Categories: FLOSS Project Planets

Real Python: Defining Python Constants for Code Maintainability

Tue, 2024-07-02 10:00

In programming, the term constant refers to names representing values that don’t change during a program’s execution. Constants are a fundamental concept in programming, and Python developers use them in many cases. However, Python doesn’t have a dedicated syntax for defining constants. In practice, Python constants are just variables that never change.

To prevent programmers from reassigning a name that’s supposed to hold a constant, the Python community has adopted a naming convention: use uppercase letters. For every Pythonista, it’s essential to know what constants are, as well as why and when to use them.
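A minimal illustration of the convention (not taken from the course):

# Constants are ordinary module-level names written in uppercase; nothing
# prevents reassignment, the convention is a contract between developers.
MAX_RETRIES = 3
TIMEOUT_SECONDS = 1.5
BASE_URL = "https://api.example.com"

def build_request(path):
    return f"GET {BASE_URL}/{path} (timeout={TIMEOUT_SECONDS}s, retries={MAX_RETRIES})"

print(build_request("users"))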

In this video course, you’ll learn how to:

  • Properly define constants in Python
  • Identify some built-in constants
  • Use constants to improve your code’s readability, reusability, and maintainability
  • Apply different approaches to organize and manage constants in a project
  • Use several techniques to make constants strictly constant in Python

[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

Categories: FLOSS Project Planets

Python Software Foundation: The 2024 PSF Board Election is Open!

Tue, 2024-07-02 06:05

It’s time to cast your vote! Voting is open starting today Tuesday, July 2nd, through Friday, July 16th, 2024 2:00 pm UTC. Check the Elections page to see how much time you have left to vote.

How to Vote

If you are a voting member of the PSF who affirmed your intention to participate in this year’s election, you will receive an email from “OpaVote Voting Link <noreply@opavote.com>” with a link to your ballot. The subject line will read “Python Software Foundation Board of Directors Election 2024”. If you haven’t seen your ballot by Wednesday, please check your spam folder for a message from “noreply@opavote.com”. If you don’t see anything, get in touch by emailing psf-elections@python.org so we can look into your account and make sure we have the most up-to-date email for you.


Three seats on the board are open, but you can approve as many of the 19 candidates as you like. We’re delighted by how many of you are willing to contribute to the Python community by serving on the PSF Board! Make sure you take some time to look at all the nominee statements and choose your candidates carefully. ATTN: Choose carefully before you press the big green vote button. Once your vote is cast, it cannot be changed.

Who can vote?

You need to be a Contributing, Managing, Supporting, or Fellow member and have affirmed your voting intention by June 25th, 2024, to vote in this election. If you’d like to learn more or sign up as a PSF Member, check out our membership types. You can check your membership status on your User Information page on psfmember.org (you will need to be logged in). If you have questions about your membership or the election, please email psf-elections@python.org.

Categories: FLOSS Project Planets

Python Bytes: #390 Coding in a Castle

Tue, 2024-07-02 04:00
Topics covered in this episode:

  • Joining Strings in Python: A "Huh" Moment
  • 10 hard-to-swallow truths they won't tell you about software engineer job
  • My thoughts on Python in Excel
  • Extra, extra, extra
  • Extras
  • Joke

Watch on YouTube

About the show

Sponsored by ScoutAPM: pythonbytes.fm/scout

Connect with the hosts

  • Michael: @mkennedy@fosstodon.org
  • Brian: @brianokken@fosstodon.org
  • Show: @pythonbytes@fosstodon.org

Join us on YouTube at pythonbytes.fm/live to be part of the audience. Usually Tuesdays at 10am PT. Older video versions available there too.

Finally, if you want an artisanal, hand-crafted digest of every week of the show notes in email form, add your name and email to our friends of the show list - we'll never share it.

Brian #1: Joining Strings in Python: A "Huh" Moment

  • Veronica Berglyd Olsen
  • Standard solution to "read lines from a file, do some filtering, create a multiline string":

f = open("input_file.txt")
filtered_text = "\n".join(x for x in f if not x.startswith("#"))

  • This uses a generator, file reading, and passes the generator to join.
  • Another approach is to add brackets and pass that generator to a list comprehension:

f = open("input_file.txt")
filtered_text = "\n".join([x for x in f if not x.startswith("#")])

  • At first glance, this seems to just be extra typing, but it's actually faster by 16% on CPython due to the implementation of .join() doing 2 passes on input if passed a generator.
    • From Trey Hunner: "I do know that it's not possible to do 2 passes over a generator (since it'd be exhausted after the first pass) so from my understanding, the generator version requires an extra step of storing all the items in a list first."

Michael #2: 10 hard-to-swallow truths they won't tell you about software engineer job

  1. College will not prepare you for the job
  2. You will rarely get greenfield projects
  3. Nobody gives a BLANK about your clean code
  4. You will sometimes work with incompetent people
  5. Get used to being in meetings for hours
  6. They will ask you for estimates a lot of times
  7. Bugs will be your arch-enemy for life
  8. Uncertainty will be your toxic friend
  9. It will be almost impossible to disconnect from your job
  10. You will profit more from good soft skills than from good technical skills

Brian #3: My thoughts on Python in Excel

  • Felix Zumstein
  • Interesting take on one person's experience with trying Python in Excel.
  • "We wanted an alternative to VBA, but got an alternative to the Excel formula language"
  • "Python runs in the cloud on Azure Container Instances and not inside Excel."
  • "DataFrames are great, but so are NumPy arrays and lists."
  • … lots of other interesting takeaways.

Michael #4: Extra, extra, extra

  • Code in a castle - Michael's Python Zero to Hero course in Tuscany
  • Polyfill.io JavaScript supply chain attack impacts over 100K sites
    • Now required reading: Reasons to avoid Javascript CDNs
  • Mac users served info-stealer malware through Google ads
  • HTMX for the win!
  • ssh to run remote commands

> ssh user@server "command_to_run --arg1 --arg2"

Extras

Brian:

  • A fun reaction to AI - I will not be showing the link on our live stream, due to colorful language.

Michael:

  • Coding in a Castle Developer Education Event
  • Polyfill.io JavaScript supply chain attack impacts over 100K sites
    • See Reasons to avoid Javascript CDNs

Joke: HTML Hacker
Categories: FLOSS Project Planets

Tryton News: Newsletter June 2024

Tue, 2024-07-02 02:00

In the last month we focused on fixing performance issues and bugs and on improving existing behaviour - building on the changes from our last release. We also added some new features which we would like to introduce to you in this newsletter.

For an in depth overview of the Tryton issues please take a look at our issue tracker or see the issues and merge requests filtered by label.

Changes for the User

Sales, Purchases and Projects

We now use a dedicated Web Shop page on the product form, which contains the web shop related fields.

We’ve added relates from sale and purchase lines to their stock moves and invoice lines.

Purchase and sale amendments now allow updating the secondary unit of their lines.

Now Tryton deletes a purchase request when its related product is deleted. Previously such a purchase request was kept in the system, but we decided that it is better to remove it.

Accounting, Invoicing and Payments

Payments with zero amount are allowed again in the system. This makes it possible to correctly handle full refunds for payment gateways that use zero-amount payments.

Stock, Production and Shipments

When counting inventories with lots we now also show the lot in addition to the product, as a product may have many lots.

User Interface

Sao now uses a grid to display trytond.model.fields.Dict items to add more flexibility.

To make Tryton more accessible we now make the contents of the message-dialog selectable and copiable.

Data and Configuration

We improved the user experience when importing CSV data in several ways. This eases the adoption of Tryton by lowering the barrier to loading initial data into the system.

The CSV export also got new features. It now supports different languages per column in one export. This is especially useful when working with translatable master data, such as product names.

We replaced the “Accounting Party” user access group with the “Accounting” user access group. There is no need to limit the accounting fields on the party to a separate group by default.

New Documentation

The ldap_authentication module is now documented.

Did you know that a Model._rec_name must point to a trytond.model.fields.Char field?
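For example, a hypothetical model whose record name is not the default name field might look roughly like this (an illustrative sketch, not code from the documentation):

from trytond.model import ModelSQL, ModelView, fields


class Book(ModelSQL, ModelView):
    "Book"
    __name__ = 'library.book'

    # _rec_name defaults to 'name'; if the descriptive field is called
    # something else, point _rec_name at it - and it must be a fields.Char
    _rec_name = 'title'
    title = fields.Char("Title", required=True)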

New Releases

We released bug fixes for the currently maintained long term support series 7.0 and 6.0, and for the penultimate series 7.2.

Changes for the System Administrator

We added a new configuration section [report] with option convert_command to be able to use a different document converter.

Now the trytond-admin command validates the email value: the interactive email input loops until a valid email address is entered.

Changes for Implementers and Developers

We added the option --export-translations to the trytond-admin command. It exports the translations of any activated module to its respective locale folder.

Authors: @dave @pokoli @udono

1 post - 1 participant

Read full topic

Categories: FLOSS Project Planets

Zato Blog: Understanding API rate-limiting techniques

Tue, 2024-07-02 00:43
Understanding API rate-limiting techniques

2024-07-02, by Dariusz Suchojad

Enabling rate-limiting in Zato means that access to Zato APIs can be throttled per endpoint, user or service - including options to make limits apply to specific IP addresses only - and if limits are exceeded within a selected period of time, the invocation will fail. Let's check how to use it all.

API rate limiting works on several levels and the configuration is always checked in the order below, which follows from the narrowest, most specific parts of the system (endpoints), through users which may apply to multiple endpoints, up to services which in turn may be used by both multiple endpoints and users.

  • First, per-endpoint limits
  • Then, per-user limits
  • Finally, per-service limits

When a request arrives through an endpoint, that endpoint's rate limiting configuration is checked. If the limit is already reached for the IP address or network of the calling application, the request is rejected.

Next, if there is any user associated with the endpoint, that account's rate limits are checked in the same manner and, similarly, if they are reached, the request is rejected.

Finally, if the endpoint's underlying service is configured to do so, it also checks if its invocation limits are not exceeded, rejecting the message accordingly if they are.

Note that the three levels are distinct yet they overlap in what they allow one to achieve.

For instance, it is possible to have the same user credentials be used in multiple endpoints and express ideas such as "Allow this and that user to invoke my APIs 1,000 requests/day but limit each endpoint to at most 5 requests/minute no matter which user".

Moreover, because limits can be set on services, it is possible to make it even more flexible, e.g. "Let this service be invoked at most 10,000 requests/hour, no matter which user it is, with particular users being able to invoke at most 500 requests/minute, no matter which service, topping it off with separate limits for REST vs. SOAP vs. JSON-RPC endpoints, depending on which application invokes them". That lets one conveniently express advanced scenarios that often occur in practical situations.

Also, observe that API rate limiting applies to REST, SOAP and JSON-RPC endpoints only; it is not used with other API endpoints, such as AMQP, IBM MQ, SAP, the task scheduler or any other technologies. However, per-service limits work no matter which endpoint the service is invoked with, and they will work with endpoints such as WebSockets, ZeroMQ or any other.

Lastly, limits pertain to incoming requests only - any outgoing ones, from Zato to external resources, are not covered.

Per-IP restrictions

The architecture is made even more versatile thanks to the fact that for each object - endpoint, user or service - different limits can be configured depending on the caller's IP address.

This adds yet another dimension and makes it possible to express ideas commonly witnessed in API-based projects, such as:

  • External applications, depending on their IP addresses, can have their own limits
  • Internal users, e.g. employees of the company using VPN, may have higher limits if their addresses are in the 172.x.x.x range
  • For performance testing purposes, access to Zato from a few selected hosts may have no limits at all

IP-based limits work hand in hand with, and are an integral part of, the overall mechanism - they do not rule out per-endpoint, per-user or per-service limits. In fact, for each such object, multiple IP-based limits can be set independently, thus allowing for the highest degree of flexibility.

Exact or approximate

Rate limits come in two types:

  • Exact
  • Approximate

Exact rate limits are just that, exact - they ensure that a limit is not exceeded at all, not even by a single request.

Approximate limits may let a very small number of requests exceed the limit, with the benefit being that approximate limits are faster to check than exact ones.

When to use which type depends on a particular project:

  • In some projects, it does not really matter if callers have a limit of 1,000 requests/minute or 1,005 requests/minute because the difference is too tiny to make a business impact. Approximate limits work best in this case.

  • In other projects, there may be requirements that the limit never be exceeded no matter the circumstances. Use exact limits here.
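For readers who want an intuition for the trade-off, here is a generic, minimal sketch of an exact fixed-window limiter in plain Python. It is purely illustrative and is not how Zato implements its rate limiting, which is configured declaratively as shown in the next section.

import time
from collections import defaultdict
from threading import Lock

class ExactFixedWindowLimiter:
    """Never lets a caller exceed `limit` requests per window, not even by one."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (caller, window index) -> request count
        self.lock = Lock()              # serializing the check is what makes it exact

    def allow(self, caller):
        window_index = int(time.time() // self.window)
        key = (caller, window_index)
        with self.lock:
            if self.counts[key] >= self.limit:
                return False            # would exceed the limit -> reject
            self.counts[key] += 1
            return True                 # note: old windows could be pruned periodically

limiter = ExactFixedWindowLimiter(limit=100, window_seconds=60)
print(limiter.allow("10.74.199.53"))    # True until the 100-per-minute budget is used up

An approximate variant could, for instance, keep per-thread counters and only merge them periodically, trading a small overshoot for less locking.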

Python code and web-admin

Alright, let's check how to define the limits in the Zato Dashboard. We will use the sample service below:

# -*- coding: utf-8 -*-

# Zato
from zato.server.service import Service

class Sample(Service):
    name = 'api.sample'

    def handle(self):

        # Return a simple string on response
        self.response.payload = 'Hello there!\n'

Now, in web-admin, we will configure limits - separately for the service, a new user and a new REST API channel (endpoint).

Points of interest:

  • Configuration for each type of object is independent - within the same invocation some limits may be exact, some may be approximate
  • There can be multiple configuration entries for each object
  • A unit of time is "m", "h" or "d", depending on whether the limit is per minute, hour or day, respectively
  • All limits within the same configuration are checked in the order of their definition which is why the most generic ones should be listed first
Testing it out

Now, all that is left is to invoke the service from curl.

As long as limits are not reached, a business response is returned:

$ curl http://my.user:password@localhost:11223/api/sample
Hello there!
$

But if a limit is reached, the caller receives an error message with the 429 HTTP status.

$ curl -v http://my.user:password@localhost:11223/api/sample
*   Trying 127.0.0.1...
...
< HTTP/1.1 429 Too Many Requests
< Server: Zato
< X-Zato-CID: b8053d68612d626d338b02
...
{"zato_env":{"result":"ZATO_ERROR","cid":"b8053d68612d626d338b02eb",
 "details":"Error 429 Too Many Requests"}}
$

Note that the caller never knows what the limit was - that information is saved in Zato server logs along with other details so that API authors can correlate what callers get with the very rate limiting definition that prevented them from accessing the service.

zato.common.rate_limiting.common.RateLimitReached: Max. rate limit of 100/m reached; from:`10.74.199.53`, network:`*`; last_from:`127.0.0.1; last_request_time_utc:`2020-11-22T15:30:41.943794; last_cid:`5f4f1ef65490a23e5c37eda1`; (cid:b8053d68612d626d338b02)

And this is it - we have created a new API rate limiting definition in Zato and tested it out successfully!

More resources

➤ Python API integration tutorial
What is an integration platform?
Python Integration platform as a Service (iPaaS)
What is an Enterprise Service Bus (ESB)? What is SOA?

More blog posts
Categories: FLOSS Project Planets
