Planet Python

Planet Python - http://planetpython.org/

Glyph Lefkowitz: Okay, I’m A Centrist I Guess

Mon, 2024-01-22 12:41

Today I saw a short YouTube video about “cozy games” and started writing a comment, then realized that this was somehow prompting me to write the most succinct summary of my own personal views on politics and economics that I have ever managed. So, here goes.

Apparently all I needed to trim down 50,000 words on my annoyance at how the term “capitalism” is frustratingly both a nexus for useful critique and also reductive thought-terminating clichés was to realize that Animal Crossing: New Horizons is closer to my views on political economy than anything Adam Smith or Karl Marx ever wrote.

Cozy games illustrate that the core mechanics of capitalism are fun and motivating, in a laboratory environment. It’s fun to gather resources, to improve one’s skills, to engage in mutually beneficial exchanges, to collect things, to decorate. It’s tremendously motivating. Even merely pretending to do those things can captivate huge amounts of our time and attention.

In real life, people need to be motivated to do stuff. Not because of some moral deficiency, but because in a large complex civilization it’s hard to tell what needs doing. By the time it’s widely visible to a population-level democratic consensus of non-experts that there is an unmet need — for example, trash piling up on the street everywhere indicating a need for garbage collection — that doesn’t mean “time to pick up some trash”, it means “the sanitation system has collapsed, you’re probably going to get cholera”. We need a system that can identify utility signals more granularly and quickly, towards the edges of the social graph. To allow person A to earn “value credits” of some kind for doing work that others find valuable, then trade those in to person B for labor which they find valuable, even if it is not clearly obvious to anyone else why person A wants that thing. Hence: money.

So, a market can provide an incentive structure that productively steers people towards needs, by aggregating small price signals in a distributed way, via the communication technology of “money”. Authoritarian communist states are famously bad at this, overproducing “necessary” goods in ways that can hold their own with the worst excesses of capitalists, while under-producing “luxury” goods that are politically seen as frivolous.

This is the kernel of truth around which the hardcore capitalist bootstrap grindset ideologues build their fabulist cinematic universe of cruelty. Markets are motivating, they reason, therefore we must worship the market as a god and obey its every whim. Markets can optimize some targets, therefore we must allow markets to optimize every target. Markets efficiently allocate resources, and people need resources to live, therefore anyone unable to secure resources in a market is undeserving of life. Thus we begin at “market economies provide some beneficial efficiencies” and after just a bit of hand-waving over some inconvenient details, we get to “thus, we must make the poor into a blood-sacrifice to Moloch, otherwise nobody will ever work, and we will all die, drowning in our own laziness”. “The cruelty is the point” is a convenient phrase, but among those with this worldview, the prosperity is the point; they just think the cruelty is the only engine that can possibly drive it.

Cozy games are therefore a centrist1 critique of capitalism. They present a world with the prosperity, but without the cruelty. More importantly though, by virtue of the fact that people actually play them in large numbers, they demonstrate that the cruelty is actually unnecessary.

You don’t need to play a cozy game. Tom Nook is not going to evict you from your real-life house if you don’t give him enough bells when it’s time to make rent. In fact, quite the opposite: you have to take time away from your real-life responsibilities and work, in order to make time for such a game. That is how motivating it is to engage with a market system in the abstract, with almost exclusively positive reinforcement.

What cozy games are showing us is that a world with tons of “free stuff” — universal basic income, universal health care, free education, free housing — will not result in a breakdown of our society because “no one wants to work”. People love to work.

If we can turn the market into a cozy game, with low stakes and a generous safety net, more people will engage with it, not fewer. People are not lazy; laziness does not exist. The motivation that people need from a market economy is not a constant looming threat of homelessness, starvation and death for themselves and their children, but a fun opportunity to get a five-star island rating.

Acknowledgments

Thank you to my patrons who are supporting my writing on this blog. If you like what you’ve read here and you’d like to read more of it, or you’d like to support my various open-source endeavors, you can support me on Patreon as well!

  1. Okay, I guess “far left” on the current US political compass, but in a just world socdems would be centrists. 

Categories: FLOSS Project Planets

PyCon: Applications For Booth Space on Startup Row Are Now Open!

Mon, 2024-01-22 10:15


To all the startup founders out there, PyCon US organizers have some awesome news for you! The application window for Startup Row at PyCon US is now open.

You’ve got until March 15th to apply, but don’t delay. (And if you want to skip all this reading and go straight to the application, here’s a link for ya.)

That’s right! Your startup could get the best of what PyCon US has to offer:

  • Coveted Expo Hall booth space
  • Exclusive placement on the PyCon US website
  • Access to the PyCon Jobs Fair (since, after all, there’s no better place to meet and recruit Python professionals)
  • A unique in-person platform to access a fantastically diverse crowd of thousands of engineers, data wranglers, academic researchers, students, and enthusiasts who come to PyCon US.

Corporate sponsors pay thousands of dollars for this level of access, but to support the entrepreneurial community PyCon US organizers are excited to give the PyCon experience to up-and-coming startup companies for free. (Submitting a Startup Row application is completely free. To discourage no-shows at the conference itself, we do require a fully-refundable $400 deposit from companies who are selected for and accept a spot on Startup Row. If you show up, you’ll get your deposit back after the conference.)

Does My Startup Qualify?

The goal of Startup Row is to give seed and early-stage companies access to the Python community. Here are the qualification criteria:

  • Somewhat obviously: Python is used somewhere in your tech or business stack, the more of it the better!
  • Your startup is roughly 2.5 years old or less at the time of applying. (If you had a major pivot or took a while to get a product to market, measure from there.)
  • You have 25 or fewer folks on the team, including founders, employees, and contractors.
  • You or your company will fund travel and accommodation to PyCon US 2024 in Pittsburgh, Pennsylvania. (There’s a helpful page on the PyCon US website with venue and hotel information.)
  • You haven’t already presented on Startup Row or sponsored a previous PyCon US. (If you applied before but weren’t accepted, please do apply again!)

There is a little bit of wiggle room. If your startup is a fuzzy rather than an exact match for these criteria, consider applying anyway.

How Do I Apply?

Assuming you’ve already created a user account on the PyCon US website, applying for Startup Row is easy. 

  1. Make sure you’re logged in.
  2. Go to the Startup Row application page and submit your application by March 15th. (Note: It might be helpful to draft your answers in a separate document.)
  3. Wait to hear back! Our goal is to notify folks about their application decision toward the end of March.

Again, the application deadline is March 15, 2024 at 11:59 PM Eastern. Applications submitted after that deadline may not be considered.

Can I learn more about Startup Row?

You bet! Check out the Startup Row page for more details and testimonials from prior Startup Row participants. (There’s a link to the application there, too!)

Who do I contact with questions about Startup Row?

First off, if you have questions about PyCon US in general, you can send an email to the PyCon US organizing team at pycon-reg@python.org. We’re always happy to help.

For specific Startup Row-related questions, reach out to co-chair Jason D. Rowley via email at jdr [at] omg [dot] lol, or find some time in his calendar at calendly [dot] com [slash] jdr.

Wait, What’s The Deadline Again?

Again, the application deadline is March 15, 2024 at 11:59PM Eastern.

Good luck! We look forward to reviewing your application!

Categories: FLOSS Project Planets

Real Python: When to Use a List Comprehension in Python

Mon, 2024-01-22 09:00

One of Python’s most distinctive features is the list comprehension, which you can use to create powerful functionality within a single line of code. However, many developers struggle to fully leverage the more advanced features of list comprehensions in Python. Some programmers even use them too much, which can lead to code that’s less efficient and harder to read.

By the end of this tutorial, you’ll understand the full power of Python list comprehensions and know how to use their features comfortably. You’ll also gain an understanding of the trade-offs that come with using them so that you can determine when other approaches are preferable.

In this tutorial, you’ll learn how to:

  • Rewrite loops and map() calls as list comprehensions in Python
  • Choose between comprehensions, loops, and map() calls
  • Supercharge your comprehensions with conditional logic
  • Use comprehensions to replace filter()
  • Profile your code to resolve performance questions

Get Your Code: Click here to download the free code that shows you how and when to use list comprehensions in Python.

Transforming Lists in Python

There are a few different ways to create a list and add items to it in Python. In this section, you’ll explore for loops and the map() function to perform these tasks. Then, you’ll move on to learn about how to use list comprehensions and when list comprehensions can benefit your Python program.

Use for Loops

The most common type of loop is the for loop. You can use a for loop to create a list of elements in three steps:

  1. Instantiate an empty list.
  2. Loop over an iterable or range of elements.
  3. Append each element to the end of the list.

If you want to create a list containing the first ten perfect squares, then you can complete these steps in three lines of code:

>>> squares = []
>>> for number in range(10):
...     squares.append(number * number)
...
>>> squares
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Here, you instantiate an empty list, squares. Then, you use a for loop to iterate over range(10). Finally, you multiply each number by itself and append the result to the end of the list.

Work With map Objects

For an alternative approach that’s based in functional programming, you can use map(). You pass in a function and an iterable, and map() will create an object. This object contains the result that you’d get from running each iterable element through the supplied function.

As an example, consider a situation in which you need to calculate the price after tax for a list of transactions:

>>> prices = [1.09, 23.56, 57.84, 4.56, 6.78]
>>> TAX_RATE = .08
>>> def get_price_with_tax(price):
...     return price * (1 + TAX_RATE)
...
>>> final_prices = map(get_price_with_tax, prices)
>>> final_prices
<map object at 0x7f34da341f90>
>>> list(final_prices)
[1.1772000000000002, 25.4448, 62.467200000000005, 4.9248, 7.322400000000001]

Here, you have an iterable, prices, and a function, get_price_with_tax(). You pass both of these arguments to map() and store the resulting map object in final_prices. Finally, you convert final_prices into a list using list().

Leverage List Comprehensions

List comprehensions are a third way of making or transforming lists. With this elegant approach, you could rewrite the for loop from the first example in just a single line of code:

>>> squares = [number * number for number in range(10)]
>>> squares
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Rather than creating an empty list and adding each element to the end, you simply define the list and its contents at the same time by following this format:

new_list = [expression for member in iterable]

Every list comprehension in Python includes three elements:

  1. expression is the member itself, a call to a method, or any other valid expression that returns a value. In the example above, the expression number * number is the square of the member value.
  2. member is the object or value in the list or iterable. In the example above, the member value is number.
  3. iterable is a list, set, sequence, generator, or any other object that can return its elements one at a time. In the example above, the iterable is range(10).
Read the full article at https://realpython.com/list-comprehension-python/ »


Categories: FLOSS Project Planets

Zato Blog: How to correctly integrate APIs in Python

Sun, 2024-01-21 23:43
2024-01-22, by Dariusz Suchojad

Understanding how to effectively integrate various systems and APIs is crucial. Yet, without a dedicated integration platform, the result will be brittle point-to-point integrations that never lead to good outcomes.

Read this article about Zato, an open-source integration platform in Python, for an overview of what to avoid and how to do it correctly instead.

More blog posts
Categories: FLOSS Project Planets

Luke Plant: Python packaging must be getting better - a datapoint

Sun, 2024-01-21 15:47

I’m developing some Python software for a client, which in its current early state is desktop software that will need to run on Windows.

So far, however, I have done all development on my normal comfortable Linux machine. I haven’t really used Windows in earnest for more than 15 years – to the point where my wife happily installs Linux on her own machine, knowing that I’ll be hopeless at helping her fix issues if the OS is Windows – and certainly not for development work in that time. So I was expecting a fair amount of pain.

There was certainly a lot of friction getting a development environment set up. RealPython.com has a great guide which got me a long way, but even that had some holes and a lot of inconvenience, mostly due to the fact that, on the machine I needed to use, my main login and my admin login are separate. (I’m very lucky to be granted an admin login at all, so I’m not complaining). And there are lots of ways that Windows just seems to be broken, but that’s another blog post.

When it came to getting my app running, however, I was very pleasantly surprised.

At this stage in development, I just have a rough requirements.txt that I add Python deps to manually. This might be a good thing, as I avoid the pain of some of the additional layers people have added.

So after installing Python and creating a virtual environment on Windows, I ran pip install -r requirements.txt, expecting a world of pain, especially as I already had complex non-Python dependencies, including Qt5 and VTK. I had specified both of these as simple Python deps via the wrappers pyqt5 and vtk in my requirements.txt, and nothing else, with the attitude of “well I may as well dream this is going to work”.

And in fact, it did! Everything just downloaded as binary wheels – rather large ones, but that’s fine. I didn’t need compilers or QMake or header files or anything.

And when I ran my app, apart from a dependency that I’d forgotten to add to requirements.txt, everything worked perfectly first time. This was even more surprising as I had put zero conscious effort into Windows compatibility. In retrospect I realise that use of pathlib, which is automatic for me these days, had helped me because it smooths over some Windows/Unix differences with path handling.
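
To illustrate the pathlib point, a tiny example (illustrative only, not code from the actual project):

from pathlib import Path

# The "/" operator joins path segments and renders the platform's separator,
# so the same line works on both Windows and Linux.
config_path = Path.home() / "myapp" / "settings.toml"
print(config_path)  # C:\Users\<user>\myapp\settings.toml on Windows, /home/<user>/myapp/settings.toml on Linux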

Of course, this is a single datapoint. From other people’s reports there are many, many ways that this experience may not be typical. But that it is possible at all suggests that a lot of progress has been made and we are very much going in the right direction. A lot of people have put a lot of work in to achieve that, for which I’m very grateful!

Categories: FLOSS Project Planets

TechBeamers Python: LangChain Agent Basics with Sample Agent Code

Sun, 2024-01-21 01:29

LangChain agents are fascinating creatures! They live in the world of text and code, interacting with humans through conversations and completing tasks based on instructions. Think of them as your digital assistants, but powered by artificial intelligence and fueled by language models. Getting Started with Agents in LangChain Imagine a chatty robot friend that gets […]

The post LangChain Agent Basics with Sample Agent Code appeared first on TechBeamers.

Categories: FLOSS Project Planets

TechBeamers Python: LangChain Memory Basics

Sun, 2024-01-21 00:34

Langchain Memory is like a brain for your conversational agents. It remembers past chats, making conversations flow smoothly and feel more personal. Think of it like chatting with a real friend who recalls what you talked about before. This makes the agent seem smarter and more helpful. Getting Started with Memory in LangChain Imagine you’re […]

The post LangChain Memory Basics appeared first on TechBeamers.

Categories: FLOSS Project Planets

TypeThePipe: Data Engineering Bootcamp 2024 (Week 1) Docker &amp; Terraform

Sat, 2024-01-20 19:00


Free Data Engineering Bootcamp 2024 to become a skilled Data Analytics Engineer. Week 1

I’ve just enrolled in the DataTalks free Data Engineering bootcamp. It’s a fantastic initiative that has been running for several years, with each cohort occurring annually.

The course is organized weekly, featuring one online session per week. There are optional weekly homework assignments which are reviewed, and the course concludes with a mandatory Data Eng final project, which is required to earn the certification.

In this series of posts, I will be sharing my course notes and comments with you, and also how I’m solving the homework.


1. Dockerized data pipeline (intro, Dockerfile and Docker Compose)

Let’s delve into the essentials of Docker, Dockerfile, and Docker Compose. These three components are crucial in the world of software development, especially when dealing with application deployment and management.

Docker: The Cornerstone of Containerization

Docker stands at the forefront of containerization technology. It allows developers to package applications and their dependencies into containers. A container is an isolated environment, akin to a lightweight, standalone, and secure package of software that includes everything needed to run it: code, runtime, system tools, system libraries, and settings. This technology ensures consistency across multiple development and release cycles, standardizing your environment across different stages.

Dockerfile: Blueprint for Docker images

A Dockerfile is a text document containing all the commands a user could call on the command line to assemble a Docker image. It automates the process of creating Docker images. A Dockerfile defines what goes on in the environment inside your container. It allows you to create a container that meets your specific needs, which can then be run on any Docker-enabled machine.

Docker Compose: Simplifying multi-container applications

Docker Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services, networks, and volumes. Then, with a single command, you create and start all the services from your configuration. Docker Compose works in all environments: production, staging, development, testing, as well as CI workflows.

Why these tools matter?

The combination of Docker, Dockerfile, and Docker Compose streamlines the process of developing, shipping, and running applications. Docker encapsulates your application and its environment, Dockerfile builds the image for this environment, and Docker Compose manages the orchestration of multi-container setups. Together, they provide a robust and efficient way to handle the lifecycle of applications. This ecosystem is integral for developers looking to leverage the benefits of containerization for more reliable, scalable, and collaborative software development.

Get Docker and read the documentation there.


What is Docker and why is it useful for Data Engineering?

Alright, data engineers, gather around! Why should you care about containerization and Docker? Well, it’s like having a Swiss Army knife in your tech toolkit. Here’s why:

  • Local Experimentation: Setting up things locally for experiments becomes a breeze. No more wrestling with conflicting dependencies or environments. Docker keeps it clean and easy.

  • Testing and CI Made Simple: Integration tests and CI/CD pipelines? Docker smoothens these processes. It’s like having a rehearsal space for your code before it hits the big stage.

  • Batch Jobs and Beyond: While Docker plays nice with AWS Batch, Kubernetes jobs, and more (though that’s a story for another day), it’s a glimpse into the world of possibilities with containerization.

  • Spark Joy: If you’re working with Spark, Docker can be a game-changer. It’s like having a consistent and controlled playground for your data processing.

  • Serverless, Stress-less: With the rise of serverless architectures like AWS Lambda, Docker ensures that you’re developing in an environment that mirrors your production setup. No more surprises!

So, there you have it. Containers are everywhere, and Docker is leading the parade. It’s not just a tool; it’s an essential part of the modern software development and deployment process.


Run Postgres and PGAdmin containers

You may need to create a network so that the containers can communicate. Then, run the Postgres database container. Notice that, since Docker containers are stateless and reinitialized after each run, you should mount Docker volumes to persist the PG internal and ingested data.

docker network create pg-network

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v /Users/jobandtalent/data-eng-bootcamp/ny_taxi_pg_data:/var/lib/postgresql/data \
  -p 5437:5432 \
  --name pg-database \
  --network=pg-network \
  postgres:13

Let's create and start the PG Admin container:

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  --name pgadmin \
  --network=pg-network \
  dpage/pgadmin4


Manage multiple containers with the Docker-compose file

Instead of managing our containers individually, it is far better to manage them from one single source. The docker-compose file lets you specify the services/containers you want to build and run, from the image to use down to the environment variables and volumes.

version: "3.11" services: pg-database: image: postgres:13 environment: - POSTGRES_USER=root - POSTGRES_PASSWORD=root - POSTGRES_DB=ny_taxi volumes: - ./ny_taxi_pg_data:/var/lib/postgresql/data ports: - 5432:5432 pg-admin: image: dpage/pgadmin4 environment: - PGADMIN_DEFAULT_EMAIL=admin@admin.com - PGADMIN_DEFAULT_PASSWORD=root ports: - 8080:80 volumes: - "pgadmin_conn_data:/var/lib/pgadmin:rw" volumes: pgadmin_conn_data:


Create your pipeline script and a Dockerfile

The pipeline script’s objective is to download the data from the US taxi rides dataset and insert it into the Postgres database. The script could be as simple as:

import polars as pl
from pydantic_settings import BaseSettings, SettingsConfigDict
from typing import ClassVar

TRIPS_TABLE_NAME = "green_taxi_trips"
ZONES_TABLE_NAME = "zones"

class PgConn(BaseSettings):
    model_config = SettingsConfigDict(env_prefix='PG_', env_file='.env', env_file_encoding='utf-8')

    user: str
    pwd: str
    host: str
    port: int
    db: str
    connector: ClassVar[str] = "postgresql"

    @property
    def uri(self):
        return f"{self.connector}://{self.user}:{self.pwd}@{self.host}:{self.port}/{self.db}"

df_ny_taxi = pl.read_csv("https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-09.csv.gz")
df_zones = pl.read_csv("https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv")

conn = PgConn()
df_zones.write_database(ZONES_TABLE_NAME, conn.uri)
df_ny_taxi.write_database(TRIPS_TABLE_NAME, conn.uri)

We have used the Polars and Pydantic (v2) libraries. With Polars, we load the data from CSVs and also write it to the database with the write_database DataFrame method. We use Pydantic to create a Postgres connection object that loads its configuration from the environment. For convenience we are using an .env config file, but it is not mandatory. As we will explain with the following chunk of code, we set the env variables in the Dockerfile and they remain accessible from inside the container, making it trivial to load them using BaseSettings and SettingsConfigDict.

Now, in order to Dockerize our data pipeline script, which as we just saw downloads the data and ingests it into Postgres, we need to create a Dockerfile with the specification of the container. We are using Poetry as a dependency manager, so we need to include pyproject.toml (a multi-purpose file, used here to give Poetry the desired module version constraints) and poetry.lock (Poetry-specific version pins respecting the constraints from pyproject.toml). We also copy the actual ingest_data.py Python pipeline file into the container.

FROM python:3.11

ARG POETRY_VERSION=1.7.1

WORKDIR /app

COPY ingest_data.py ingest_data.py
COPY .env .env
COPY pyproject.toml pyproject.toml
COPY poetry.lock poetry.lock

RUN pip3 install --no-cache-dir poetry==${POETRY_VERSION} \
    && poetry env use 3.11 \
    && poetry install --no-cache

ENTRYPOINT [ "poetry", "run", "python", "ingest_data.py" ]

Just by building the image from our root folder and running the container, the ingest_data.py script will be executed, and therefore the data downloaded and persisted to the Postgres database.
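
For reference, building and running it could look roughly like this (taxi_ingest is just an illustrative tag, and --network assumes the pg-network created earlier):

docker build -t taxi_ingest .
docker run -it --network=pg-network taxi_ingest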


2. Terraform: Manage your GCP infra. Google Storage and Google BigQuery


Terraform intro and basic commands

Terraform has become a key tool in modern infrastructure management. Terraform, named with a nod to the concept of terraforming planets, applies a similar idea to cloud and local platform infrastructure. It’s about creating and managing the necessary environment for software to run efficiently on platforms like AWS, GCP…

Terraform, developed by HashiCorp, is described as an “infrastructure as code” tool. It allows users to define and manage both cloud-based and on-premises resources in human-readable configuration files. These files can be versioned, reused, and shared, offering a consistent workflow to manage infrastructure throughout its lifecycle. The advantages of using Terraform include simplicity in tracking and modifying infrastructure, ease of collaboration (since configurations are file-based and can be shared on platforms like GitHub), and reproducibility. For instance, an infrastructure set up in a development environment can be replicated in production with minor adjustments. Additionally, Terraform helps in ensuring that resources are properly removed when no longer needed, avoiding unnecessary costs.

So this tool not only simplifies the process of infrastructure management but also ensures consistency and compliance with your infrastructure setup.

However, it’s important to note what Terraform is not. It doesn’t handle the deployment or updating of software on the infrastructure; it’s focused solely on the infrastructure itself. It doesn’t allow modifications to immutable resources without destroying and recreating them. For example, changing the type of a virtual machine would require its recreation. Terraform also only manages resources defined within its configuration files.


Set up Terraform for GCP deploys. From GCP account permissions to the main.tf file

Diving into the world of cloud infrastructure can be a daunting task, but with tools like Terraform, the process becomes more manageable and streamlined. Terraform, an open-source infrastructure as code software tool, allows users to define and provision a datacenter infrastructure using a high-level configuration language. Here’s a guide to setting up Terraform for Google Cloud Platform (GCP).

Creating a Service Account in GCP

Before we start coding with Terraform, it’s essential to establish a method for Terraform on our local machine to communicate with GCP. This involves setting up a service account in GCP – a special type of account used by applications, as opposed to individuals, to interact with the GCP services.

Creating a service account is straightforward. Log into the GCP console, navigate to the “IAM & Admin” section, and create a new service account. This account should be given specific permissions relevant to the resources you plan to manage with Terraform, such as Cloud Storage or BigQuery.

Once the service account is created, the next step is to manage its keys. These keys are crucial as they authenticate and authorize the Terraform script to perform actions in GCP. It’s vital to handle these keys with care, as they can be used to access your GCP resources. You should never expose these credentials publicly.

Setting Up Your Local Environment

After downloading the key as a JSON file, store it securely in your local environment. It’s recommended to create a dedicated directory for these keys to avoid any accidental uploads, especially if you’re using version control like Git.

Remember, you can configure Terraform to use these credentials in several ways. One common method is to set an environment variable pointing to the JSON file, but you can also specify the path directly in your Terraform configuration.
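
For example, the environment-variable route usually means pointing GOOGLE_APPLICATION_CREDENTIALS at the key file (the path below is illustrative):

export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/my_creds.json"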

Writing Terraform Configuration

With the service account set up, you can begin writing your Terraform configuration. This is done in a file typically named main.tf. In this file, you define your provider (in this case, GCP) and the resources you wish to create, update, or delete.

For instance, if you’re setting up a GCP storage bucket, you would define it in your main.tf file. Terraform configurations are declarative, meaning you describe your desired state, and Terraform figures out how to achieve it. You are ready for terraform init to start with your project.

Planning and Applying Changes

Before applying any changes, it’s good practice to run terraform plan. This command shows what Terraform will do without actually making any changes. It’s a great way to catch errors or unintended actions.

Once you’re satisfied with the plan, run terraform apply to make the changes. Terraform will then reach out to GCP and make the necessary adjustments to match your configuration.

Cleaning Up: Terraform Destroy

When you no longer need the resources, Terraform makes it easy to clean up. Running terraform destroy will remove the resources defined in your Terraform configuration from your GCP account.
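
Putting the whole lifecycle together, the typical command sequence (run from the directory containing main.tf) looks like:

terraform init      # download providers and initialize the working directory
terraform plan      # preview the changes without applying them
terraform apply     # create or update the resources
terraform destroy   # tear everything down when no longer needed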

Lastly, a word on security: If you’re storing your Terraform configuration in a version control system like Git, be mindful of what you commit. Ensure that your service account keys and other sensitive data are not pushed to public repositories. Using a .gitignore file to exclude these sensitive files is a best practice.
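
A minimal .gitignore for a setup like this one might look as follows (the keys/ folder matches the convention used above):

# service account keys downloaded from GCP
keys/

# local Terraform state, plans and provider binaries
.terraform/
*.tfstate
*.tfstate.backup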

For instance, our main.tf file for creating a GCP Storage Bucket and a Big Query dataset looks like:

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "5.12.0"
    }
  }
}

provider "google" {
  credentials = var.credentials
  project     = "concise-quarter-411516"
  region      = "us-central1"
}

resource "google_storage_bucket" "demo_bucket" {
  name          = var.gsc_bucket_name
  location      = var.location
  force_destroy = true

  lifecycle_rule {
    condition {
      age = 1
    }
    action {
      type = "AbortIncompleteMultipartUpload"
    }
  }
}

resource "google_bigquery_dataset" "demo_dataset" {
  dataset_id = var.bq_dataset_name
}

As you may have noticed, some of the values are plain strings/ints/floats, but others are var.* references. In the next section we talk about keeping the Terraform files tidy with the use of variables.


Parametrize files with variables.tf

Terraform variables offer a centralized and reusable way to manage values in infrastructure automation, separate from deployment plans. They are categorized into two main types: input variables for configuring infrastructure and output variables for retrieving information post-deployment. Input variables define values like server configurations and can be strings, lists, maps, or booleans. String variables simplify complex values, lists represent indexed values, maps store key-value pairs, and booleans handle true/false conditions.

Output variables are used to extract details like IP addresses after the infrastructure is deployed. Variables can be predefined in a file or via command-line, enhancing flexibility and readability. They also support overriding at deployment, allowing for customized infrastructure management. Sensitive information can be set as environmental variables, prefixed with TF_VAR_, for enhanced security. Terraform variables are essential for clear, manageable, and secure infrastructure plans.

In our case, our variables.tf looks like this:

variable "credentials" { default = "./keys/my_creds.json" } variable "location" { default = "US" } variable "bq_dataset_name" { description = "BigQuery dataset name" default = "demo_dataset" } variable "gcs_storage_class" { description = "Bucket Storage class" default = "STANDARD" } variable "gsc_bucket_name" { description = "Storage bucket name" default = "terraform-demo-20240115-demo-bucket" }

We are parametrizing the credentials file, bucket location, storage class, bucket name…

As we’ve discussed, mastering Terraform variables is a key step towards advanced infrastructure automation and efficient code management.

For more information about Terraform variables, you can visit this post.



Stay updated on the Data Engineering and Data Analytics Engineer career paths

This was the content I gathered for the very first week of the DataTalks Data Engineering bootcamp. I’ve definitely enjoyed it and I’m excited to continue with Week 2.

If you want to stay updated and follow the homework along with explanations…

Categories: FLOSS Project Planets

TechBeamers Python: Understanding Python Timestamps: A Simple Guide

Sat, 2024-01-20 14:19

In Python, a timestamp is a number that tells you when something happened. This guide will help you get comfortable with timestamps in Python. We’ll talk about how to make them, change them, and why they’re useful. Getting Started with Timestamp in Python A timestamp is just a way of saying, […]

The post Understanding Python Timestamps: A Simple Guide appeared first on TechBeamers.

Categories: FLOSS Project Planets

TechBeamers Python: Python Pip Usage for Beginners

Sat, 2024-01-20 13:09

Python Pip, short for “Pip Installs Packages,” is a powerful package management system that simplifies the process of installing, upgrading, and managing Python libraries and dependencies. In this tutorial, we’ll delve into the various aspects of Python Pip usage, covering essential commands, best practices, and the latest updates in the Python package ecosystem. Before we […]

The post Python Pip Usage for Beginners appeared first on TechBeamers.

Categories: FLOSS Project Planets

death and gravity: This is not interview advice: building a priority-expiry LRU cache without heaps or trees in Python

Sat, 2024-01-20 07:00

It's not your fault I got nerdsniped, but that doesn't matter.

Hi, I'm Adrian, and today we're implementing a least recently used cache with priorities and expiry, using only the Python standard library.

This is a bIG TEch CoDINg InTerVIEW problem, so we'll work hard to stay away from the correct™ data structures – no heaps, no binary search trees – but we'll end up with a decent solution anyway!

Requirements #

So you're at an interview and have to implement a priority-expiry LRU cache.

Maybe you get more details, but maybe the problem is deliberately vague; either way, we can reconstruct the requirements from the name alone.

A cache is something that stores data for future reuse, usually the result of a slow computation or retrieval. Each piece of data has an associated key used to retrieve it. Most caches are bounded in some way, for example by limiting the number of items.

The other words have to do with eviction – how and when items are removed.

  • Each item has a priority – when the cache fills up, we evict items with the lowest priority first.
  • Each item has a maximum age – if it's older than that, it is expired, so we don't return it to the user. It stands to reason expired items are evicted regardless of their priority or how full the cache is.
  • All other things being equal, we evict items least recently used relative to others.

The problem may specify an API; we can reconstruct that from first principles too. Since the cache is basically a key-value store, we can get away with two methods:

  • set(key, value, maxage, priority)
  • get(key) -> value or None

The problem may also suggest:

  • delete(key) – allows users to invalidate items for reasons external to the cache; not strictly necessary, but we'll end up with it as part of refactoring
  • evict(now) – not strictly necessary either, but hints eviction is a separate bit of logic, and may come in handy for testing

Types deserve their own discussion:

  • key – usually, the key is a string, but we can relax this to any hashable value
  • value – for an in-memory cache, any kind of object is fine
  • maxage and priority – a number should do for these; a float is more general, but an integer may allow a simpler solution; limits on these are important too, as we'll see soon enough
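
Putting the API and types together, usage might look something like this (a sketch against the interface above; keys and values are arbitrary):

cache = Cache(maxsize=100)
cache.set('user:42', {'name': 'Adrian'}, maxage=60, priority=1)
cache.get('user:42')   # the stored value while fresh, None once expired or evicted
cache.get('missing')   # None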

Tip

Your interviewer may be expecting you to uncover some of these details through clarifying questions. Be sure to think out loud and state your assumptions.

A minimal plausible solution #

I'm sure there are a lot smart people out there that can Think Really Hard™ and just come up with a solution, but I'm not one of them, so we'll take an iterative, problem-solution approach to this.

Since right now we don't have any solution, we start with the simplest thing that could possibly work1 – a basic cache with no fancy eviction and no priorities; we can then write some tests against that, to know if we break anything going forward.

Tip

If during an interview you don't know what to do and choose to work up from the naive solution, make it very clear that's what you're doing. Your interviewer may give you hints to help you skip that.

A class holds all the things we need and gives us something to stick the API on:

class Cache:
    def __init__(self, maxsize, time=time.monotonic):
        self.maxsize = maxsize
        self.time = time
        self.cache = {}

And this is our first algorithmic choice: a dict (backed by a hash table) provides average O(1) search / insert / delete.

Tip

Given the problem we're solving, and the context we're solving it in, we have to talk about time complexity. Ned Batchelder's Big-O: How Code Slows as Data Grows provides an excellent introduction (text and video available).

set() leads to more choices:

def set(self, key, value, *, maxage=10, priority=0):
    now = self.time()

    if key in self.cache:
        self.cache.pop(key)
    elif len(self.cache) >= self.maxsize:
        self.evict(now)

    expires = now + maxage
    item = Item(key, value, expires, priority)
    self.cache[key] = item

First, we evict items only if there's no more room left. (There are other ways of doing this; for example, evicting expired items periodically minimizes memory usage.)

Second, if the key is already in the cache, we remove and insert it again, instead of updating things in place. This way, there's only one code path for setting items, which will make it a lot easier to keep multiple data structures in sync later on.

We use a named tuple to store the parameters associated with a key:

class Item(NamedTuple):
    key: object
    value: object
    expires: int
    priority: int

For now, we just evict an arbitrary item; in a happy little coincidence, dicts preserve insertion order, so when iterating over the cache, the oldest key is first.

def evict(self, now):
    if not self.cache:
        return
    key = next(iter(self.cache))
    del self.cache[key]

Finally, get() is trivial:

def get(self, key):
    item = self.cache.get(key)
    if not item:
        return None
    if self.time() >= item.expires:
        return None
    return item.value

With everything in place, here's the first test:

def test_basic():
    cache = Cache(2, FakeTime())
    assert cache.get('a') == None
    cache.set('a', 'A')
    assert cache.get('a') == 'A'
    cache.set('b', 'B')
    assert cache.get('a') == 'A'
    assert cache.get('b') == 'B'
    cache.set('c', 'C')
    assert cache.get('a') == None
    assert cache.get('b') == 'B'
    assert cache.get('c') == 'C'

To make things predictable, we inject a fake time implementation:

class FakeTime:
    def __init__(self, now=0):
        self.now = now

    def __call__(self):
        return self.now

Problem: expired items should go first... #

Following from the requirements, there's an order in which items get kicked out: first expired (lowest expiry time), then lowest priority, and only then least recently used. So, we need a data structure that can efficiently remove the smallest element.

Turns out, there's an abstract data type for that called a priority queue;2 for now, we'll honor its abstract nature and not bother with an implementation.

self.cache = {}
self.expires = PriorityQueue()

Since we need the item with the lowest expiry time, we need a way to get back to the item; an (expires, key) tuple should do fine – since tuples compare lexicographically, it'll be like comparing by expires alone, but with key along for the ride; in set(), we add:

self.cache[key] = item
self.expires.push((expires, key))

You may be tempted (like I was) to say "hey, the item's already a tuple, if we make expires the first field, we can use the item itself", but let's delay optimizations until we have and understand a full solution – make it work, make it right, make it fast.

Still in set(), if the key is already in the cache, we also remove and insert it from the expires queue, so it's added back with the new expiry time.

if key in self.cache:
    item = self.cache.pop(key)
    self.expires.remove((item.expires, key))
    del item
elif len(self.cache) >= self.maxsize:
    self.evict(now)

Moving on to evicting things; for this, we need two operations: first peek at the item that expires next to see if it's expired, then, if it is, pop it from the queue. (Another choice: we only have to evict one item, but evict all expired ones.)

def evict(self, now):
    if not self.cache:
        return

    initial_size = len(self.cache)

    while self.cache:
        expires, key = self.expires.peek()
        if expires > now:
            break
        self.expires.pop()
        del self.cache[key]

    if len(self.cache) == initial_size:
        _, key = self.expires.pop()
        del self.cache[key]

If there are no expired items, we still have to make room for the one item; since we're not handling priorities yet, we'll evict the item that expires next a little early.

Problem: name PriorityQueue is not defined #

OK, to get the code working again, we need a PriorityQueue class. It doesn't need to be fast, we can deal with that after we finish everything else; for now, let's just keep our elements in a plain list.

class PriorityQueue:
    def __init__(self):
        self.data = []

The easiest way to get the smallest value is to keep the list sorted; the downside is that push() is now O(n log n) – although, because the list is always sorted, it can be as good as O(n) depending on the implementation.

def push(self, item):
    self.data.append(item)
    self.data.sort()

This makes peek() and pop() trivial; still, pop() is O(n), because it shifts all the items left by one position.

def peek(self):
    return self.data[0]

def pop(self):
    rv = self.data[0]
    self.data[:1] = []
    return rv

remove() is just as simple, and just as O(n), because it first needs to find the item, and then shift the ones after it to cover the gap.

def remove(self, item):
    self.data.remove(item)

We didn't use the is empty operation, but it should be O(1) regardless of implementation, so let's throw it in anyway:

def __bool__(self):
    return bool(self.data)

OK, let's wrap up with a quick test:

def test_priority_queue():
    pq = PriorityQueue()
    pq.push(1)
    pq.push(3)
    pq.push(2)
    assert pq
    assert pq.peek() == 1
    assert pq.pop() == 1
    assert pq.peek() == 2
    assert pq.remove(3) is None
    assert pq
    assert pq.peek() == 2
    with pytest.raises(ValueError):
        pq.remove(3)
    assert pq.pop() == 2
    assert not pq
    with pytest.raises(IndexError):
        pq.peek()
    with pytest.raises(IndexError):
        pq.pop()

Now the existing tests pass, and we can add more – first, that expired items are evicted (note how we're moving the time forward):

def test_expires():
    cache = Cache(2, FakeTime())
    cache.set('a', 'A', maxage=10)
    cache.set('b', 'B', maxage=20)
    assert cache.get('a') == 'A'
    assert cache.get('b') == 'B'
    cache.time.now = 15
    assert cache.get('a') == None
    assert cache.get('b') == 'B'
    cache.set('c', 'C')
    assert cache.get('a') == None
    assert cache.get('b') == 'B'
    assert cache.get('c') == 'C'

Second, that setting an existing item changes its expire time:

def test_update_expires():
    cache = Cache(2, FakeTime())
    cache.set('a', 'A', maxage=10)
    cache.set('b', 'B', maxage=10)
    cache.time.now = 5
    cache.set('a', 'X', maxage=4)
    cache.set('b', 'Y', maxage=6)
    assert cache.get('a') == 'X'
    assert cache.get('b') == 'Y'
    cache.time.now = 10
    assert cache.get('a') == None
    assert cache.get('b') == 'Y'

Problem: ...low priority items second #

Next up, kick out items by priority – shouldn't be too hard, right?

In __init__(), add another priority queue for priorities:

self.cache = {}
self.expires = PriorityQueue()
self.priorities = PriorityQueue()

In set(), add new items to the priorities queue:

self.cache[key] = item
self.expires.push((expires, key))
self.priorities.push((priority, key))

...and remove already-cached items:

if key in self.cache:
    item = self.cache.pop(key)
    self.expires.remove((item.expires, key))
    self.priorities.remove((item.priority, key))
    del item

In evict(), remove expired items from the priorities queue:

while self.cache:
    expires, key = self.expires.peek()
    if expires > now:
        break
    self.expires.pop()
    item = self.cache.pop(key)
    self.priorities.remove((item.priority, key))

...and finally, if none are expired, remove the one with the lowest priority:

if len(self.cache) == initial_size:
    _, key = self.priorities.pop()
    item = self.cache.pop(key)
    self.expires.remove((item.expires, key))

Add one test for eviction by priority:

def test_priorities():
    cache = Cache(2, FakeTime())
    cache.set('a', 'A', priority=1)
    cache.set('b', 'B', priority=0)
    assert cache.get('a') == 'A'
    assert cache.get('b') == 'B'
    cache.set('c', 'C')
    assert cache.get('a') == 'A'
    assert cache.get('b') == None
    assert cache.get('c') == 'C'

...and one for updating the priority of existing items:

def test_update_priorities():
    cache = Cache(2, FakeTime())
    cache.set('a', 'A', priority=1)
    cache.set('b', 'B', priority=0)
    cache.set('b', 'Y', priority=2)
    cache.set('c', 'C')
    assert cache.get('a') == None
    assert cache.get('b') == 'Y'
    assert cache.get('c') == 'C'

Problem: we're deleting items in three places #

I said we'll postpone performance optimizations until we have a complete solution, but I have a different kind of optimization in mind – for readability.

We're deleting items in three slightly different ways, careful to keep three data structures in sync each time; it would be nice to do it only once. While a bit premature, through the magic of having written the article already, I'm sure it will pay off.

def delete(self, key):
    *_, expires, priority = self.cache.pop(key)
    self.expires.remove((expires, key))
    self.priorities.remove((priority, key))

Sure, eviction is twice as slow, but the complexity stays the same – the constant factor in O(2n) gets removed, leaving us with O(n). If needed, we can go back to the unpacked version once we have a reasonably efficient implementation (that's what tests are for).

Deleting already-cached items is shortened to:

if key in self.cache:
    self.delete(key)

Idem for the core eviction logic:

while self.cache:
    expires, key = self.expires.peek()
    if expires > now:
        break
    self.delete(key)

if len(self.cache) == initial_size:
    _, key = self.priorities.peek()
    self.delete(key)

Neat!

Problem: ...least recently used items last #

So, how does one implement a least recently used cache?

We could google it... or, we could look at an existing implementation.

functools.lru_cache() #

Standard library functools.lru_cache() comes to mind first; let's have a look.

Tip

You can read the code of stdlib modules by following the Source code: link at the top of each documentation page.

lru_cache() delegates to _lru_cache_wrapper(), which sets up a bunch of variables to be used by nested functions.3 Among the variables is a cache dict and a doubly linked list where nodes are [prev, next, key, value] lists.4

And that's the answer – a doubly linked list allows tracking item use in O(1): each time a node is used, remove it from its current position and plop it at the "recently used" end; whatever's at the other end will be the least recently used item.
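
As a rough sketch of that relinking (using [prev, next, key, value] node lists like lru_cache does, with a circular sentinel as the list root; my simplification, not the stdlib code):

PREV, NEXT = 0, 1

def move_to_end(node, root):
    # unlink the node from its current position (O(1))
    node[PREV][NEXT] = node[NEXT]
    node[NEXT][PREV] = node[PREV]
    # splice it back in just before the root sentinel (the "most recently used" end)
    last = root[PREV]
    last[NEXT] = root[PREV] = node
    node[PREV], node[NEXT] = last, root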

Note that, unlike lru_cache(), we need one doubly linked list for each priority.

But, before making Item mutable and giving it prev/next links, let's dive deeper.

OrderedDict #

If you search the docs for "LRU", the next result after lru_cache() is OrderedDict.

Some differences from dict still remain: [...] The OrderedDict algorithm can handle frequent reordering operations better than dict. [...] this makes it suitable for implementing various kinds of LRU caches.

Specifically:

OrderedDict has a move_to_end() method to efficiently reposition an element to an endpoint.

Since dicts preserve insertion order, you can use d[k] = d.pop(k) to move items to the end... What makes move_to_end() better, then? This comment may shed some light:

# The internal self.__map dict maps keys to links in a doubly linked list.

Indeed, move_to_end() does exactly what we described above – this is good news, it means we don't have to do it ourselves!

So, we need one OrderedDict (read: doubly linked list) for each priority, but still need to keep track the lowest priority:

self.cache = {}
self.expires = PriorityQueue()
self.priority_buckets = {}
self.priority_order = PriorityQueue()

Handling priorities in set() gets a bit more complicated:

priority_bucket = self.priority_buckets.get(priority)
if not priority_bucket:
    priority_bucket = self.priority_buckets[priority] = OrderedDict()
    self.priority_order.push(priority)
priority_bucket[key] = None

But now we can finally evict the least recently used item:

if len(self.cache) == initial_size:
    priority = self.priority_order.peek()
    priority_bucket = self.priority_buckets.get(priority)
    key = next(iter(priority_bucket))
    self.delete(key)

In delete(), we're careful to get rid of empty buckets:5

priority_bucket = self.priority_buckets[priority]
del priority_bucket[key]
if not priority_bucket:
    del self.priority_buckets[priority]
    self.priority_order.remove(priority)

Existing tests pass again, and we can add a new (still failing) one:

def test_lru():
    cache = Cache(2, FakeTime())
    cache.set('a', 'A')
    cache.set('b', 'B')
    cache.get('a') == 'A'
    cache.set('c', 'C')
    assert cache.get('a') == 'A'
    assert cache.get('b') == None
    assert cache.get('c') == 'C'

All that's needed to make it pass is to call move_to_end() in get():

self.priority_buckets[item.priority].move_to_end(key)
return item.value


Problem: our priority queue is slow #

OK, we have a complete solution, it's time to deal with the priority queue implementation. Let's do a quick recap of the methods we need and why:

  • push() – to add items
  • peek() – to get items / buckets with the lowest expiry time / priority
  • remove() – to delete items
  • pop() – not used, but would be without the delete() refactoring

We make two related observations: first, there's no remove operation on the priority queue Wikipedia page; second, even if we unpack delete(), we only get to pop() an item/bucket from one of the queues, and still have to remove() it from the other.

And this is what makes the problem tricky – we need to maintain not one, but two independent priority queues.

heapq #

If you search the docs for "priority queue", you'll find heapq, which:

[...] provides an implementation of the heap queue algorithm, also known as the priority queue algorithm.

While priority queues are often implemented using heaps, they are conceptually distinct from heaps. [...]

Reading on, we find extensive notes on implementing priority queues; particularly interesting are using (priority, item) tuples (already doing this!) and removing entries:

Removing the entry or changing its priority is more difficult because it would break the heap structure invariants. So, a possible solution is to mark the entry as removed and add a new entry with the revised priority.

This workaround is needed because, while removing the i-th element can be done in O(log n), finding its index is O(n). (There's a sketch of the workaround right after the summary table below.) To summarize, we have:

           sort   heapq
push()     O(n)   O(log n)
peek()     O(1)   O(1)
pop()      O(n)   O(log n)
remove()   O(n)   O(n)
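Here's a minimal sketch of that mark-as-removed (lazy deletion) approach, adapted from the recipe in the heapq docs; the class and method names are mine, chosen to roughly match the push()/pop()/remove() interface above:

import heapq
import itertools

class HeapPriorityQueue:
    """Sketch of the "mark as removed" recipe from the heapq docs."""

    def __init__(self):
        self.heap = []                    # entries arranged as a heap
        self.entry_finder = {}            # item -> entry, for O(1) lookup
        self.counter = itertools.count()  # tie-breaker, so equal items never compare further

    def push(self, item):
        # assumes each item is pushed at most once
        entry = [item, next(self.counter), True]   # last slot: still alive?
        self.entry_finder[item] = entry
        heapq.heappush(self.heap, entry)           # O(log n)

    def remove(self, item):
        entry = self.entry_finder.pop(item)        # O(1)...
        entry[-1] = False                          # ...but the dead entry stays in the heap

    def peek(self):
        while self.heap and not self.heap[0][-1]:  # discard dead entries lazily
            heapq.heappop(self.heap)
        return self.heap[0][0]                     # raises IndexError if empty

    def pop(self):
        while self.heap:
            item, _, alive = heapq.heappop(self.heap)
            if alive:
                del self.entry_finder[item]
                return item
        raise KeyError('pop from an empty priority queue')

remove() becomes O(1) here, but dead entries pile up in the heap until peek()/pop() skip over them – which is exactly the accumulation issue mentioned in the assumptions below.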

Still, with a few mitigating assumptions, it could work:

  • we can assume priorities are static, so buckets never get removed
  • to-be-removed expiry times will get popped sooner or later anyway; we can assume that most evictions are due to expired items, and that evictions due to low priority (i.e. when the cache is full) and item updates are rare (both cause to-be-removed entries to accumulate in the expiry queue)

bisect #

One way of finding an element in better than O(n) time is bisect, which:

[...] provides support for maintaining a list in sorted order without having to sort the list after each insertion.

This may provide an improvement over our naive implementation; sadly, reading on to the Performance Notes, we find that:

The insort() functions are O(n) because the logarithmic search step is dominated by the linear time insertion step.

While in general that's better than just about any sort, we happen to be hitting the best case of our sort implementation, which has the same complexity. (Nevertheless, shifting elements is likely cheaper than the same number of comparisons.)
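For reference, this is all it takes to use (a tiny sketch):

import bisect

data = [10, 20, 40]
bisect.insort(data, 30)    # O(log n) search, then an O(n) shift to make room
assert data == [10, 20, 30, 40]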

           sort   heapq      bisect
push()     O(n)   O(log n)   O(n)
peek()     O(1)   O(1)       O(1)
pop()      O(n)   O(log n)   O(n)
remove()   O(n)   O(n)       O(n)

Further down in the docs, there's a see also box:

Sorted Collections is a high performance module that uses bisect to manage sorted collections of data.

Not in stdlib, moving along... ¯\_(ツ)_/¯

pop() optimization #

There's an unrelated improvement that applies to both the naive solution and bisect. With a sorted list, pop() is O(n) because it shifts all elements after the first; if the order was reversed, we'd pop() from the end, which is O(1). So:

           sort    heapq      bisect
push()     O(n)    O(log n)   O(n)
peek()     O(1)    O(1)       O(1)
pop()      O(1)*   O(log n)   O(1)*
remove()   O(n)    O(n)       O(n)

Binary search trees #

OK, I'm out of ideas, and there's nothing else in stdlib that can help.

We can restate the problem as follows: we need a sorted data structure that can do better than O(n) for push() / remove().

We've already peeked at the Wikipedia priority queue page, so let's keep reading – skipping past the naive implementations, to the usual implementation, we find that:

To improve performance, priority queues are typically based on a heap, [...]

Looked into that, didn't work; next:

Alternatively, when a self-balancing binary search tree is used, insertion and removal also take O(log n) time [...]

There, that's what we're looking for! (And likely what your interviewer is, too.)

           sort    heapq      bisect   BST
push()     O(n)    O(log n)   O(n)     O(log n)
peek()     O(1)    O(1)       O(1)     O(log n)
pop()      O(1)*   O(log n)   O(1)*    O(log n)
remove()   O(n)    O(n)       O(n)     O(log n)

But there's no self-balancing BST in the standard library, and I sure as hell am not implementing one right now – I still have flashbacks from when I tried to do a red-black tree and two hours later it still had bugs (I mean, look at the length of this explanation!).

After a bit of googling we find, among others, bintrees, a mature library that provides all sorts of binary search trees... except:

Bintrees Development Stopped

Use sortedcontainers instead: https://pypi.python.org/pypi/sortedcontainers

Sounds familiar, doesn't it?

Sorted Containers #

Let's go back to that Sorted Collections library bisect was recommending:

Depends on the Sorted Containers module.

(¬‿¬ )

I remember, I remember now... I'm not salty because the red-black tree took two hours to implement. I'm salty because after all that time, I found Sorted Containers, a pure Python library that is faster in practice than fancy self-balancing binary search trees implemented in C!

It has extensive benchmarks to prove it, and simulated workload benchmarks for our own use case, priority queues – so yeah, while the interview answer is "self-balancing BSTs", the actual answer is Sorted Containers.

How does it work? There's an extensive explanation too:6

The Sorted Containers internal implementation is based on a couple observations. The first is that Python’s list is fast, really fast. [...] The second is that bisect.insort7 is fast. This is somewhat counter-intuitive since it involves shifting a series of items in a list. But modern processors do this really well. A lot of time has been spent optimizing mem-copy/mem-move-like operations both in hardware and software.

But using only one list and bisect.insort would produce sluggish behavior for lengths exceeding ten thousand. So the implementation of Sorted List uses a list of lists to store elements. [...]

There's also a comparison with trees, which I'll summarize: fewer memory allocations, better cache locality, lower memory overhead, and faster iteration.

I think that gives you a decent idea of how and why it works, enough that with a bit of tinkering you might be able to implement it yourself.8
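For completeness, here's roughly how our queue interface maps onto it (a sketch using the third-party sortedcontainers package – which is exactly what the next section rules out):

from sortedcontainers import SortedList  # pip install sortedcontainers

class PriorityQueue:
    """Sketch of the same push/peek/pop/remove interface on a SortedList."""

    def __init__(self):
        self.data = SortedList()

    def push(self, item):
        self.data.add(item)        # roughly O(log n), thanks to the list-of-lists layout

    def peek(self):
        return self.data[0]        # smallest item

    def pop(self):
        return self.data.pop(0)    # remove and return the smallest item

    def remove(self, item):
        self.data.remove(item)     # raises ValueError if missing, like list.remove()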

Problem: Sorted Containers is not in stdlib #

But Sorted Containers is not in the standard library either, and we don't want to implement it ourselves. We did learn something from it, though:

           sort    heapq      bisect   bisect <10k   BST
push()     O(n)    O(log n)   O(n)     O(log n)      O(log n)
peek()     O(1)    O(1)       O(1)     O(1)          O(log n)
pop()      O(1)*   O(log n)   O(1)*    O(1)*         O(log n)
remove()   O(n)    O(n)       O(n)     O(log n)      O(log n)

We still need to make some assumptions, though:

  1. Do we really need more than 10k priorities? Likely no, let's just cap them at 10k.
  2. Do we really need more than 10k expiry times? Maybe? – with 1 second granularity we can represent only up to 2.7 hours; 10 seconds takes us up to 27 hours, which may just work.

OK, one more and we're done. The other issue, aside from the maximum time, is that the granularity is too low, especially for short times – rounding 1 to 10 seconds is much worse than rounding 2001 to 2010 seconds. Which raises the question –

  3. Does it really matter if items expire in 2010 seconds instead of 2001? Likely no, but we need a way to round small values with higher granularity than big ones.

Logarithmic time #

How about 1, 2, 4, 8, ...? Rounding up to powers of 2 gets us decreasing granularity, but time doesn't actually start at zero. We fix this by rounding up to multiples of powers of 2 instead; let's get an intuition of how it works:

ceil((2000 +  1) /  1) *  1 = 2001
ceil((2000 +  2) /  2) *  2 = 2002
ceil((2000 +  3) /  4) *  4 = 2004
ceil((2000 +  4) /  4) *  4 = 2004
ceil((2000 + 15) / 16) * 16 = 2016
ceil((2000 + 16) / 16) * 16 = 2016
ceil((2000 + 17) / 32) * 32 = 2048

So far so good, how about after some time has passed?

ceil((2013 +  1) /  1) *  1 = 2014
ceil((2013 +  2) /  2) *  2 = 2016
ceil((2013 +  3) /  4) *  4 = 2016
ceil((2013 +  4) /  4) *  4 = 2020
ceil((2013 + 15) / 16) * 16 = 2032
ceil((2013 + 16) / 16) * 16 = 2032
ceil((2013 + 17) / 32) * 32 = 2048

The beauty of aligned powers is that for a relatively constant number of expiry times, the number of buckets remains roughly the same over time – as closely packed buckets are removed from the beginning, new ones fill the gaps between the sparser ones towards the end.

OK, let's put it into code:

def log_bucket(now, maxage):
    next_power = 2 ** math.ceil(math.log2(maxage))
    expires = now + maxage
    bucket = math.ceil(expires / next_power) * next_power
    return bucket

>>> [log_bucket(0, i) for i in [1, 2, 3, 4, 15, 16, 17]]
[1, 2, 4, 4, 16, 16, 32]
>>> [log_bucket(2000, i) for i in [1, 2, 3, 4, 15, 16, 17]]
[2001, 2002, 2004, 2004, 2016, 2016, 2048]
>>> [log_bucket(2013, i) for i in [1, 2, 3, 4, 15, 16, 17]]
[2014, 2016, 2016, 2020, 2032, 2032, 2048]

Looking good!

There are two sources of error – first from rounding maxage, worst when it's one more than a power of 2, and second from rounding the expiry time, also worst when it's one more than a power of two. Together, they approach 200% of maxage:

>>> log_bucket(0, 17)   # (32 - 17) / 17 ~= 88%
32
>>> log_bucket(0, 33)   # (64 - 33) / 33 ~= 94%
64
>>> log_bucket(16, 17)  # (64 - 33) / 17 ~= 182%
64
>>> log_bucket(32, 33)  # (128 - 65) / 33 ~= 191%
128

200% error is quite a lot; before we set out to fix it, let's confirm our reasoning.

def error(now, maxage, *args):
    """log_bucket() error."""
    bucket = log_bucket(now, maxage, *args)
    return (bucket - now) / maxage - 1

def max_error(now, max_maxage, *args):
    """Worst log_bucket() error for all maxages up to max_maxage."""
    return max(
        error(now, maxage, *args)
        for maxage in range(1, max_maxage)
    )

def max_error_random(n, *args):
    """Worst log_bucket() error for random inputs, out of n tries."""
    max_now = int(time.time()) * 2
    max_maxage = 3600 * 24 * 31
    rand = functools.partial(random.randint, 1)
    return max(
        error(rand(max_now), rand(max_maxage), *args)
        for _ in range(n)
    )

>>> max_error(0, 10_000)
0.9997558891736849
>>> max_error(2000, 10_000)
1.9527896995708156
>>> max_error_random(10_000_000)
1.9995498725910554

Looks confirmed enough to me.

So, how do we make the error smaller? Instead of rounding to the next power of 2, we round to the next half of a power of 2, or next quarter, or next eighth...

def log_bucket(now, maxage, shift=0):
    next_power = 2 ** max(0, math.ceil(math.log2(maxage) - shift))
    expires = now + maxage
    bucket = math.ceil(expires / next_power) * next_power
    return bucket

It seems to be working:

>>> for s in range(5):
...     print([log_bucket(0, i, s) for i in [1, 2, 3, 4, 15, 16, 17]])
...
[1, 2, 4, 4, 16, 16, 32]
[1, 2, 4, 4, 16, 16, 32]
[1, 2, 3, 4, 16, 16, 24]
[1, 2, 3, 4, 16, 16, 20]
[1, 2, 3, 4, 15, 16, 18]
>>> for s in range(10):
...     e = max_error_random(1_000_000, s)
...     print(f'{s} {e:6.1%}')
...
0 199.8%
1  99.9%
2  50.0%
3  25.0%
4  12.5%
5   6.2%
6   3.1%
7   1.6%
8   0.8%
9   0.4%

With shift=7, the error is less than two percent; I wonder how many buckets that is...

def max_buckets(max_maxage, *args):
    """Number of buckets to cover all maxages up to max_maxage."""
    now = time.time()
    buckets = {
        log_bucket(now, maxage, *args)
        for maxage in range(1, max_maxage)
    }
    return len(buckets)

>>> max_buckets(3600 * 24, 7)
729
>>> max_buckets(3600 * 24 * 31, 7)
1047
>>> max_buckets(3600 * 24 * 365, 7)
1279

A bit over a thousand buckets for the whole year, not bad!

Before we can use any of that, we need to convert expiry times to buckets; that looks a lot like the priority buckets code, the only notable part being eviction.

__init__():

self.cache = {}
self.expires_buckets = {}
self.expires_order = PriorityQueue()
self.priority_buckets = {}
self.priority_order = PriorityQueue()

set():

expires_bucket = self.expires_buckets.get(expires)
if not expires_bucket:
    expires_bucket = self.expires_buckets[expires] = set()
    self.expires_order.push(expires)
expires_bucket.add(key)

delete():

expires_bucket = self.expires_buckets[expires]
expires_bucket.remove(key)
if not expires_bucket:
    del self.expires_buckets[expires]
    self.expires_order.remove(expires)

evict():

while self.cache:
    expires = self.expires_order.peek()
    if expires > now:
        break
    expires_bucket = self.expires_buckets[expires]
    for key in list(expires_bucket):
        self.delete(key)

And now we use log_bucket(). While we're at it, why not have unlimited priorities too? A hammer is a hammer and everything is a nail, after all.

expires = log_bucket(now, maxage, shift=7)
priority = log_bucket(0, priority + 1, shift=7)
item = Item(key, value, expires, priority)

bisect, redux #

Time to fix that priority queue.

We use insort() to add priorities and operator.neg() to keep the list reversed:9

def push(self, item):
    bisect.insort(self.data, item, key=operator.neg)

We update peek() and pop() to handle the reverse order:

def peek(self):
    return self.data[-1]

def pop(self):
    return self.data.pop()

Finally, for remove() we adapt the index() recipe from Searching Sorted Lists:

def remove(self, item):
    i = bisect.bisect_left(self.data, -item, key=operator.neg)
    if i != len(self.data) and self.data[i] == item:
        del self.data[i]
        return
    raise ValueError
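Putting the pieces together, a quick sanity check (a sketch: the class scaffolding around the methods above is assumed, and the key= argument of insort() / bisect_left() requires Python 3.10+):

import bisect
import operator

class PriorityQueue:
    def __init__(self):
        self.data = []          # kept sorted in descending order

    def push(self, item):
        bisect.insort(self.data, item, key=operator.neg)

    def peek(self):
        return self.data[-1]

    def pop(self):
        return self.data.pop()

    def remove(self, item):
        i = bisect.bisect_left(self.data, -item, key=operator.neg)
        if i != len(self.data) and self.data[i] == item:
            del self.data[i]
            return
        raise ValueError

pq = PriorityQueue()
for value in [30, 10, 20]:
    pq.push(value)
assert pq.data == [30, 20, 10]  # reversed order, smallest last
assert pq.peek() == 10
assert pq.pop() == 10
pq.remove(30)
assert pq.data == [20]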

And that's it, we're done!

Here's the final version of the code.

Conclusion #

Anyone expecting you to implement this in under an hour is delusional. Explaining what you would use and why should be enough for reasonable interviewers, although that may prove difficult if you haven't solved this kind of problem before.

Bullshit interviews aside, it is useful to have basic knowledge of time complexity. Again, can't recommend Big-O: How Code Slows as Data Grows enough.

But, what big O notation says and what happens in practice can differ quite a lot. Be sure to measure, and be sure to think of limits – sometimes, the n in O(n) is or can be made small enough you don't have to do the theoretically correct thing.

You don't need to know how to implement all the data structures, that's what (software) libraries and Wikipedia are for (and for that matter, book libraries too). However, it is useful to have an idea of what's available and when to use it.

Good libraries educate – the Python standard library docs already cover a lot of the practical knowledge we needed, and so did Sorted Containers. But that won't show up in the API reference you see in your IDE; you have to read the actual documentation.


  1. Note the subtitle: if you're not sure what to do yet. [return]

  2. This early on, the name doesn't really matter, but we'll go with the correct, descriptive one; in the first draft of the code, it was called MagicDS. ✨ [return]

  3. You have to admit this is at least a bit weird; what you're looking at is an object in a trench coat, at least if you think closures and objects are equivalent. [return]

  4. Another way of getting an "object" on the cheap. [return]

  5. If we assume a relatively small number of buckets that will be reused soon enough, this isn't strictly necessary. I'm partly doing it to release the memory held by the dict, since dicts are resized only when items are added. [return]

  6. There's even a PyCon talk with the same explanation, if you prefer that. [return]

  7. bisect itself has a fast C implementation, so I guess technically it's not pure Python. But given that stdlib is already there, does that count? [return]

  8. If the implementation is easy to explain, it may be a good idea. [return]

  9. This limits priorities to values that can be negated, so tuples won't work anymore. We could use a "reversed view" wrapper if we really cared about that. [return]

Categories: FLOSS Project Planets

TechBeamers Python: How to Remove Whitespace from a String in Python

Sat, 2024-01-20 05:57

This tutorial explains multiple ways to remove whitespace from a string in Python. Often, as programmers, we need to put in a mechanism to remove whitespace, especially trailing spaces. With the solutions given here, you will be able to remove most types of whitespace from a string. Multiple Ways to Remove Whitespace from […]

The post How to Remove Whitespace from a String in Python appeared first on TechBeamers.

Categories: FLOSS Project Planets

TechBeamers Python: Install Python packages with pip and requirements.txt

Sat, 2024-01-20 02:26

Check this tutorial if you want to learn how to use pip to install Python packages from a requirements.txt file. It is a simple configuration file where we can list the exact packages with their versions, ensuring that our Python project uses the correct packages. No matter whether you build it on a new […]

The post Install Python packages with pip and requirements.txt appeared first on TechBeamers.

Categories: FLOSS Project Planets

Test and Code: 213: Repeating Tests

Fri, 2024-01-19 17:09

If a test fails in a test suite, I'm going to want to re-run the test. I may even want to re-run a test, or a subset of the suite, a bunch of times.  
There are a few pytest plugins that help with this:

  • pytest-repeat
  • pytest-rerunfailures
  • pytest-flakefinder
  • pytest-instafail

We talk about each of these in this episode.


Sponsored by PyCharm Pro

The Complete pytest Course

  • For the fastest way to learn pytest, go to courses.pythontest.com
  • Whether you're new to testing or pytest, or just want to maximize your efficiency and effectiveness when testing.
Categories: FLOSS Project Planets

Marcos Dione: Sending AWS CloudWatch alarms through SNS to MSTeams

Fri, 2024-01-19 15:36

I'm new to AWS, so please take the following statements with a grain of salt. Also, I'm tired, but I want to get this off my chest before the weekend begins (although, technically, it has already begun), so it might not be so coherent.

AWS provides some minimum monitoring of your resources with a tool called CloudWatch. Think of prometheus + grafana, but more limited. Still, it's good enough that it makes sense to set up some Alerts on it. Many of AWS's resources are not processes running on a computer you have access to, so you can't always install some exporters and do the monitoring yourself.

If you're like me, you want CloudWatch Alerts sent to the outside world so you can receive them and react. One way to do this1 is to channel them through SNS. SNS supports many protocols, most of them internal to AWS, but also HTTP/S. SNS is a pub-sub system, and requires a little bit of protocol before it works.

On the other end we2 have MSTeams3. MSTeams has many ways of communicating. One is Chat, which is a crappy chat67, and another is some kind of mix between a blog and twitter, confusingly called Teams. The idea in a Team is that you can post... Posts? Articles? And from them you can have an unthreaded conversation. Only Teams have webhooks; Chats do not, so you can't point SNS there.

If you have read other articles about integrating CloudWatch Alerts or SNS with MSTeams, they will always tell you that you not only need SNS, but also a Lambda program. Since we already handle a gazillion servers, not all of them in AWS (one in particular is dedicated hardware we pay for quite cheaply), and since we're also trying to slim down our AWS bill (who isn't?), I decided to see if I could build my own bridge between SNS and Teams.

I already said that SNS has a little protocol. The idea is that when you create an HTTP/S Subscription in SNS, it will POST a first message to the URL you define. This message will have a JSON payload. We're interested in two fields:

{ "Type": "SubscriptionConfirmation", "SubscribeURL": "..." }

What you have to do is get this URL and call it. That way SNS will know the endpoint exists and will associate an ARN to the Subscription. Otherwise, the Subscription will stay unconfirmed and no messages will be sent to it. Interestingly, you can neither edit nor remove Subscriptions (at least not with the web interface), and I read that unconfirmed Subscriptions will disappear after 3 days or so4.
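In other words, the confirmation handshake boils down to something like this (a sketch using requests, with all error handling omitted):

import json
import requests

def handle_sns_message(raw_body: bytes) -> None:
    data = json.loads(raw_body)
    if data['Type'] == 'SubscriptionConfirmation':
        # Visiting the URL confirms the Subscription, which then gets an ARN.
        requests.get(data['SubscribeURL'])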

SNS messages are also a JSON payload POST'ed to the URL. They look like this:

{ "Type" : "Notification", "MessageId" : "<uuid1>", "TopicArn" : "<arn>", "Subject" : "...", "Message" : "...", "Timestamp" : "2024-01-19T14:29:54.147Z", "SignatureVersion" : "1", "Signature" : "cTQUWntlQW5evk/bZ5lkhSdWj2+4oa/4eApdgkcdebegX3Dvwpq786Zi6lZbxGsjof2C+XMt4rV9xM1DBlsVq6tsBQvkfzGBzOvwerZZ7j4Sfy/GTJvtS4L2x/OVUCLleY3ULSCRYX2H1TTTanK44tOU5f8W+8AUz1DKRT+qL+T2fWqmUrPYSK452j/rPZcZaVwZnNaYkroPmJmI4gxjr/37Q6gA8sK+WyC0U91/MDKHpuAmCAXrhgrJIpEX/1t2mNlnlbJpcsR9h05tHJNkQEkPwFY0HFTnyGvTM2DP6Ep7C2z83/OHeVJ6pa7Sn3txVWR5AQC1PF8UbT7zdGJL9Q==", "SigningCertURL" : "https://sns.eu-west-1.amazonaws.com/SimpleNotificationService-01d088a6f77103d0fe307c0069e40ed6.pem", "UnsubscribeURL" : "https://sns.eu-west-1.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=<arn>:<uuid2>" }

Now, CloudWatch Alerts sent via SNS are sent in the Message field. As Message's value is a string and the Alert is encoded as JSON, yes, you guessed it, it's double encoded:

{ "Message" : "{\"AlarmName\":\"foo\",...}" }

Sigh. After unwrapping it, it looks like this:

{ "AlarmName": "...", "AlarmDescription": "...", "AWSAccountId": "...", "AlarmConfigurationUpdatedTimestamp": "2024-01-18T14:32:17.244+0000", "NewStateValue": "ALARM", "NewStateReason": "Threshold Crossed: 1 out of the last 1 datapoints [10.337853107344637 (18/01/24 14:28:00)] was greater than the threshold (10.0) (minimum 1 datapoint for OK -> ALARM transition).", "StateChangeTime": "2024-01-18T14:34:54.103+0000", "Region": "EU (Ireland)", "AlarmArn": "<alarm_arn>", "OldStateValue": "INSUFFICIENT_DATA", "OKActions": [], "AlarmActions": [ "<sns_arn>" ], "InsufficientDataActions": [], "Trigger": { "MetricName": "CPUUtilization", "Namespace": "AWS/EC2", "StatisticType": "Statistic", "Statistic": "AVERAGE", "Unit": null, "Dimensions": [ { "value": "<aws_id>", "name": "InstanceId" } ], "Period": 60, "EvaluationPeriods": 1, "DatapointsToAlarm": 1, "ComparisonOperator": "GreaterThanThreshold", "Threshold": 10.0, "TreatMissingData": "missing", "EvaluateLowSampleCountPercentile": "" } }

The name and description are arbitrary texts you wrote when setting up the Alarm and the Subscription. Notice that the region is not the codename as in eu-west-1 but a supposedly more human readable text. The rest is mostly info about the Alarm itself. Also notice the Dimensions field. I don't know what other data comes here (probably the arbitrary fields and values you can set up in the Alarm); all I can say is that that format (a list of dicts with only two fields, one called name and the other value) is possibly the most annoying implementation of a simple dict. I hope they have a reason for that, besides over-engineering.
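If you ever need more than the first dimension, one way to make that shape palatable is to flatten it into a plain dict (a sketch; alarm is the unwrapped payload from above):

# Turn the list of {"name": ..., "value": ...} pairs into a normal mapping.
dimensions = {d['name']: d['value'] for d in alarm['Trigger']['Dimensions']}
instance_id = dimensions['InstanceId']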

Also, notice that the only info we get here about the source of the alarm is the InstanceId. As those are random strings, to me they don't mean anything. Maybe I can set up the Alarm so it also includes the instance's name5, and even maybe the URL pointing to the metric's graph.

Finally, Teams' webhook also expects a JSON payload. I didn't delve much into what you can give it; I just used the title, text and themeColor fields. At least text can be written in MarkDown. You get such a webhook by going to the Team, clicking the ⋮ ("vertical ellipsis") icon, then "Connectors", adding a webhook, and grabbing the URL from there. @type and @context I copied from an SNS-to-Lambda-to-Teams post.
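Stripped to the minimum, posting to the webhook looks something like this (a sketch; the URL is the one you got from the Connectors dialog):

import requests

message = {
    '@type': 'MessageCard',
    '@context': 'http://schema.org/extensions',
    'themeColor': 'FF0000',
    'title': 'test alarm',
    'text': 'Hello from *SNS*.',   # MarkDown works here
}

# the webhook URL obtained from the Team's Connectors dialog
response = requests.post('https://<company>.webhook.office.com/webhookb2/...', json=message)
print(response.status_code, response.text)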

So to build a bridge between CloudWatch Alerts through SNS to MSTeams's Team we just need a quite straightforward script. I decided to write it in Flask, but I'm pretty sure writing it in plain http.server and urllib.request to avoid dependencies is not much more work; I just didn't want to do it. Maybe I should have tried FastAPI instead; I simply forgot about it.

Without further ado, here's the script. I'm running Python 3.8, so I don't have case/match yet.

#! /usr/bin/env python3

from flask import Flask, request
import json
import requests

app = Flask(__name__)

@app.route('/', methods=[ 'POST' ])
def root():
    print(f"{request.data=}")

    request_data = json.loads(request.data)

    # python3.8, not case/match yet
    message_type = request_data['Type']

    if message_type == 'SubscriptionConfirmation':
        response = requests.get(request_data['SubscribeURL'])
        print(response.text)

        return f"hello {request_data['TopicArn']}!"

    message = {
        '@type': 'MessageCard',
        '@context': 'http://schema.org/extensions',
        'themeColor': '4200c5',
    }

    if message_type == 'Notification':
        try:
            alarm = json.loads(request_data['Message'])
        except json.JSONDecodeError:
            message['title'] = request_data['Subject']
            message['text'] = request_data['Message']
        else:
            instance_id = alarm['Trigger']['Dimensions'][0]['value']
            state = alarm['NewStateValue']

            if state == 'ALARM':
                color = 'FF0000'
            else:
                color = '00FF00'

            message['title'] = f"{instance_id}: {alarm['Trigger']['MetricName']} {state}"
            message['text'] = f"""{alarm['AlarmName']}

{alarm['Trigger']['MetricName']} {alarm['Trigger']['ComparisonOperator']} {alarm['Trigger']['Threshold']} for {int(alarm['Trigger']['Period']) // 60} minutes.

{alarm['AlarmDescription']}

{alarm['NewStateReason']}

for {instance_id} passed to {state} at {alarm['StateChangeTime']}."""
            message['themeColor'] = color

    response = requests.post('https://<company>.webhook.office.com/webhookb2/<uuid1>@<uuid2>/IncomingWebhook/<id>/<uuid3>', json=message)
    print(response.text)

    return f"OK"
  1. Again, I'm new to AWS. This is how it's setup at $NEW_JOB, but there might be better ways. If there are, I'm happy to hear them. 

  2. 'we' as in me and my colleagues. 

  3. Don't get me started... 

  4. I know all this because right now I have like 5-8 unconfirmed Subscriptions because I had to figure all this out, mostly because I couldn't find sample data or, preferably, a tool that already does this. They're 5-8 because you can't create a second Subscription to the same URL, so I changed the port for every failed attempt to confirm the Subscription. 

  5. We don't have pets, but don't quite have cattle either. We have cows we name, and we get a little bit sad when we sell them, but we're happy when they invite us to the barbecue. 

  6. OK, I already started... 

  7. I added this footnote (I mean, the previous one... but this one too) while reviewing the post before publishing. Putting the correct number means editing the whole post, changing each number twice, which is error prone. In theory nikola and/or MarkDown support auto-numbered footnotes, but I never managed to make it work. I used to have the same issue with the previous static blog/site compiler, ikiwiki, so this is not the first time I have out-of-order footnotes. In any case, I feel like they're a quirk that I find cute and somehow defining. 

Categories: FLOSS Project Planets

Django Weblog: DSF calls for applicants for a Django Fellow

Fri, 2024-01-19 14:18

After five years as part of the Django Fellowship program, Mariusz Felisiak has let us know that he will be stepping down as a Django Fellow in March 2024 to explore other things. Mariusz has made an extraordinary impact as a Django Fellow and has been a critical part of the Django community.

The Django Software Foundation and the wider Django community are grateful for his service and assistance.

The Fellowship program was started in 2014 as a way to dedicate high-quality and consistent resources to the maintenance of Django. As Django has matured, the DSF has been able to fundraise and earmark funds for this vital role. As a result, the DSF currently supports two Fellows - Mariusz Felisiak and Natalia Bidart. With the departure of Mariusz, the Django Software Foundation is announcing a call for Django Fellow applications. The new Fellow will work alongside Natalia.

The position of Fellow is focused on maintenance and community support - the work that benefits most from constant, guaranteed attention rather than volunteer-only efforts. In particular, the duties include:

  • Answering contributor questions on the Forum and the django-developers mailing list
  • Helping new Django contributors land patches and learn our philosophy
  • Monitoring the security@djangoproject.com email alias and ensuring security issues are acknowledged and responded to promptly
  • Fixing release blockers and helping to ensure timely releases
  • Fixing severe bugs and helping to backport fixes to these and security issues
  • Reviewing and merging pull requests
  • Triaging tickets on Trac

Being a Django contributor isn't a prerequisite for this position — we can help get you up to speed. We'll consider applications from anyone with a proven history of working with either the Django community or another similar open-source community. Geographical location isn't important either - we have several methods of remote communication and coordination that we can use depending on the timezone difference to the supervising members of Django.

If you're interested in applying for the position, please email us at fellowship-committee@djangoproject.com describing why you would be a good fit along with details of your relevant experience and community involvement. Also, please include your preferred hourly rate and when you'd like to start working. Lastly, please include at least one recommendation.

Applicants will be evaluated based on the following criteria:

  • Details of Django and/or other open-source contributions
  • Details of community support in general
  • Understanding of the position
  • Clarity, formality, and precision of communications
  • Strength of recommendation(s)

Applications will be open until 1200 AoE, February 16, 2024, with the expectation that the successful candidate will be notified no later than March 1, 2024.

Categories: FLOSS Project Planets

Real Python: The Real Python Podcast – Episode #188: Measuring Bias, Toxicity, and Truthfulness in LLMs With Python

Fri, 2024-01-19 07:00

How can you measure the quality of a large language model? What tools can measure bias, toxicity, and truthfulness levels in a model using Python? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, returns to discuss techniques and tools for evaluating LLMs with Python.


Categories: FLOSS Project Planets

Daniel Roy Greenfeld: TIL: Writing decorators for classes

Fri, 2024-01-19 07:00

To my surprise, writing decorators for classes is easier than writing them for functions. Here's how to do it, in annotated fashion, with an unnecessary decorator that doesn't accept any additional arguments.

# Write a callable that accepts a cls as an argument
def tools(cls):
    # Write functions that accept "self: object" as an argument.
    def simplistic_attribute_count(self: object) -> int:
        """Returns the number of attributes."""
        return len(self.__dict__)

    def docs(self: object) -> str:
        """Returns the docstring for the class."""
        return self.__doc__

    # Attach the functions as methods
    cls.simplistic_attribute_count = simplistic_attribute_count
    cls.docs = docs

    # Return the modified class
    return cls

Let's test it out:

@tools
class A:
    """Docstring for testing the tools decorator"""

a = A()
a.one = 1
assert a.simplistic_attribute_count() == 1
assert a.docs() == 'Docstring for testing the tools decorator'

Next up, how to do this while passing in arguments!
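As a teaser, the trick is one more level of nesting – a factory that takes the arguments and returns the decorator. A rough sketch (the prefix argument and the names here are just an illustration):

# A callable that accepts the arguments and returns the actual decorator
def tools_with(prefix: str):
    def decorator(cls):
        def docs(self: object) -> str:
            """Returns the docstring for the class, with a prefix."""
            return f"{prefix}{self.__doc__}"

        cls.docs = docs
        return cls

    return decorator

@tools_with(prefix="DOCS: ")
class B:
    """Docstring for testing the parameterized decorator"""

b = B()
assert b.docs() == 'DOCS: Docstring for testing the parameterized decorator'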

Categories: FLOSS Project Planets

PyCharm: PyCharm 2024.1 EAP Is Open!

Fri, 2024-01-19 04:27

This blog post marks the start of the year’s first Early Access Program. The PyCharm 2024.1 EAP 1 build is now accessible for download, providing an early glimpse into the exciting updates on the horizon.

Download PyCharm 2024.1 EAP

If you’re new to the EAP process, we encourage you to read our introductory blog post. It offers valuable insights into the program and explains why your participation is integral.

Join us in the coming weeks to explore the new features in PyCharm, test them out, and provide feedback on the new additions. Your engagement is what helps us shape the evolution of PyCharm.

User experience

Option to scale down the entire IDE

PyCharm 2023.1 introduced the ability to zoom in and out of the entire IDE, adjusting the size of all UI elements simultaneously. However, the initial scaling range was limited to between 100 and 200%. In PyCharm 2024.1 EAP 1, we have incorporated a new option allowing users to scale down the IDE to 90%, 80%, or 70%, offering an extended range of customization options. 

Django Structure tool window

This first build also brings multiple improvements to the recently introduced Django Structure tool window. Among other enhancements, there is now an action to register model admin classes.

That’s it for the first week! For the full list of changes in this EAP build, please read the release notes.

Stay tuned for more updates that will be covered in the blog every week until the major release date. We highly value your input, so be sure to provide your feedback on the new features. You can drop a comment in the comments section under this blog post or reach out to our team on X (formerly Twitter). If you come across any bugs while using this build, please report them via our issue tracker.

Categories: FLOSS Project Planets

Talk Python to Me: #445: Inside Azure Data Centers with Mark Russinovich

Fri, 2024-01-19 03:00
When you run your code in the cloud, how much do you know about where it runs? I mean, the hardware it runs on and the data center it runs in? There are just a couple of hyper-scale cloud providers in the world. This episode is a very unique chance to get a deep look inside one of them: Microsoft Azure. Azure is comprised of over 200 physical data centers, each with 100,000s of servers. A look into how code runs on them is fascinating. Our guide for this journey will be Mark Russinovich. Mark is the CTO of Microsoft Azure and a Technical Fellow, Microsoft's senior-most technical position. He's also a bit of a programming hero of mine. Even if you don't host your code in the cloud, I think you'll enjoy this conversation. Let's dive in.

Episode sponsors: Posit, Pybites PDM, Talk Python Courses

Links from the show:

  • Mark Russinovich: @markrussinovich, linkedin.com
  • SysInternals: learn.microsoft.com
  • Zero Day: A Jeff Aiken Novel: amazon.com
  • Inside Azure Datacenters: youtube.com
  • What runs ChatGPT?: youtube.com
  • Azure Cobalt ARM chip: servethehome.com
  • Closing talk by Mark at Ignite 2023: youtube.com
  • Episode transcripts: talkpython.fm
Categories: FLOSS Project Planets
