Feeds

Dirk Eddelbuettel: RProtoBuf 0.4.22 on CRAN: Updated Windows Support!

Planet Debian - Sun, 2024-01-21 22:41

A new maintenance release 0.4.22 of RProtoBuf arrived on CRAN earlier today. RProtoBuf provides R with bindings for the Google Protocol Buffers (“ProtoBuf”) data encoding and serialization library used and released by Google, and deployed very widely in numerous projects as a language and operating-system agnostic protocol.

This release matches the recent 0.4.21 release which enabled use of the package with newer ProtoBuf releases. Tomas has been updating the Windows / rtools side of things, and supplied us with a simple PR that will enable building with those updated versions once finalised.

The following section from the NEWS.Rd file has full details.

Changes in RProtoBuf version 0.4.22 (2024-01-21)
  • Apply patch by Tomas Kalibera to support updated rtools to build with newer ProtoBuf releases on Windows

Thanks to my CRANberries, there is a diff to the previous release. The RProtoBuf page has copies of the (older) package vignette, the ‘quick’ overview vignette, and the pre-print of our JSS paper. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

If you like this or other open-source work I do, you can sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Categories: FLOSS Project Planets

Luke Plant: Python packaging must be getting better - a datapoint

Planet Python - Sun, 2024-01-21 15:47

I’m developing some Python software for a client, which in its current early state is desktop software that will need to run on Windows.

So far, however, I have done all development on my normal comfortable Linux machine. I haven’t really used Windows in earnest for more than 15 years – to the point where my wife happily installs Linux on her own machine, knowing that I’ll be hopeless at helping her fix issues if the OS is Windows – and certainly not for development work in that time. So I was expecting a fair amount of pain.

There was certainly a lot of friction getting a development environment set up. RealPython.com have a great guide which got me a long way, but even that had some holes and a lot of inconvenience, mostly due to the fact that, on the machine I needed to use, my main login and my admin login are separate. (I’m very lucky to be granted an admin login at all, so I’m not complaining). And there are lots of ways that Windows just seems to be broken, but that’s another blog post.

When it came to getting my app running, however, I was very pleasantly surprised.

At this stage in development, I just have a rough requirements.txt that I add Python deps to manually. This might be a good thing, as I avoid the pain of some of the additional layers people have added.

So after installing Python and creating a virtual environment on Windows, I ran pip install -r requirements.txt, expecting a world of pain, especially as I already had complex non-Python dependencies, including Qt5 and VTK. I had specified both of these as simple Python deps via the wrappers pyqt5 and vtk in my requirements.txt, and nothing else, with the attitude of “well I may as well dream this is going to work”.
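For anyone retracing those steps, the Windows side boils down to something like the following (a rough sketch; the exact launcher and activation path depend on how Python was installed):

py -m venv venv
venv\Scripts\activate
pip install -r requirements.txt

The first line creates the virtual environment with the py launcher, the second activates it, and the third pulls in everything from requirements.txt.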

And in fact, it did! Everything just downloaded as binary wheels – rather large ones, but that’s fine. I didn’t need compilers or QMake or header files or anything.

And when I ran my app, apart from a dependency that I’d forgotten to add to requirements.txt, everything worked perfectly first time. This was even more surprising as I had put zero conscious effort into Windows compatibility. In retrospect I realise that use of pathlib, which is automatic for me these days, had helped me because it smooths over some Windows/Unix differences with path handling.
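As a tiny illustration of the pathlib point (the file name here is made up), the same join works unchanged on both platforms:

from pathlib import Path

# "/" builds the path with the separator of the current OS, so this yields
# data\settings.toml on Windows and data/settings.toml on Linux.
settings = Path("data") / "settings.toml"
print(settings)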

Of course, this is a single datapoint. From other people’s reports there are many, many ways that this experience may not be typical. But that it is possible at all suggests that a lot of progress has been made and we are very much going in the right direction. A lot of people have put a lot of work in to achieve that, for which I’m very grateful!

Categories: FLOSS Project Planets

Debian Brasil: MiniDebConf BH 2024 - sponsorship and crowdfunding

Planet Debian - Sun, 2024-01-21 06:00

Registration for participants and the call for activities are already open for MiniDebConf Belo Horizonte 2024, which will take place from April 27 to 30 at the Pampulha Campus of UFMG.

This year we are offering food, lodging, and travel grants for active contributors to the Debian Project.

Sponsorship:

To make the MiniDebConf happen, we are looking for financial sponsorship from companies and organizations. So if you work at a company or organization (or know someone who does), point them to our sponsorship plan, where you will find the amounts for each tier and their benefits.

Crowdfunding:

But you can also help make the MiniDebConf happen through our crowdfunding campaign!

Donate any amount and have your name published on the event website as a supporter of MiniDebConf Belo Horizonte 2024.

Even if you do not plan to come to Belo Horizonte for the event, you can still donate and thus contribute to the most important Debian Project event in Brazil.

Contact

If you have any questions, send an email to contato@debianbrasil.org.br

Organization

Categories: FLOSS Project Planets

TechBeamers Python: LangChain Agent Basics with Sample Agent Code

Planet Python - Sun, 2024-01-21 01:29

LangChain agents are fascinating creatures! They live in the world of text and code, interacting with humans through conversations and completing tasks based on instructions. Think of them as your digital assistants, but powered by artificial intelligence and fueled by language models. Getting Started with Agents in LangChain Imagine a chatty robot friend that gets […]

The post LangChain Agent Basics with Sample Agent Code appeared first on TechBeamers.

Categories: FLOSS Project Planets

TechBeamers Python: LangChain Memory Basics

Planet Python - Sun, 2024-01-21 00:34

Langchain Memory is like a brain for your conversational agents. It remembers past chats, making conversations flow smoothly and feel more personal. Think of it like chatting with a real friend who recalls what you talked about before. This makes the agent seem smarter and more helpful. Getting Started with Memory in LangChain Imagine you’re […]

The post LangChain Memory Basics appeared first on TechBeamers.

Categories: FLOSS Project Planets

TypeThePipe: Data Engineering Bootcamp 2024 (Week 1) Docker & Terraform

Planet Python - Sat, 2024-01-20 19:00


Free Data Engineering Bootcamp 2024 to become a skilled Data Analytics Engineer. Week 1

I’ve just enrolled in the DataTalks free Data Engineering bootcamp. It’s a fantastic initiative that has been running for several years, with a new cohort starting annually.

The course is organized by week, featuring one online session per week. There are optional weekly homework assignments which are reviewed, and the course concludes with a mandatory Data Engineering final project, which is required to earn the certification.

In this series of posts, I will be sharing my course notes and comments, and also how I’m working through the homework.


1. Dockerized data pipeline (Intro, Dockerfile and Docker Compose)

Let’s delve into the essentials of Docker, Dockerfile, and Docker Compose. These three components are crucial in the world of software development, especially when dealing with application deployment and management.

Docker: The Cornerstone of Containerization

Docker stands at the forefront of containerization technology. It allows developers to package applications and their dependencies into containers. A container is an isolated environment, akin to a lightweight, standalone, and secure package of software that includes everything needed to run it: code, runtime, system tools, system libraries, and settings. This technology ensures consistency across multiple development and release cycles, standardizing your environment across different stages.

Dockerfile: Blueprint for Docker images

A Dockerfile is a text document containing all the commands a user could call on the command line to assemble a Docker image. It automates the process of creating Docker images. A Dockerfile defines what goes on in the environment inside your container. It allows you to create a container that meets your specific needs, which can then be run on any Docker-enabled machine.

Docker Compose: Simplifying multi-container applications

Docker Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services, networks, and volumes. Then, with a single command, you create and start all the services from your configuration. Docker Compose works in all environments: production, staging, development, testing, as well as CI workflows.

Why do these tools matter?

The combination of Docker, Dockerfile, and Docker Compose streamlines the process of developing, shipping, and running applications. Docker encapsulates your application and its environment, Dockerfile builds the image for this environment, and Docker Compose manages the orchestration of multi-container setups. Together, they provide a robust and efficient way to handle the lifecycle of applications. This ecosystem is integral for developers looking to leverage the benefits of containerization for more reliable, scalable, and collaborative software development.

Get Docker and read the documentation there.


What is Docker and why is it useful for Data Engineering?

Alright, data engineers, gather around! Why should you care about containerization and Docker? Well, it’s like having a Swiss Army knife in your tech toolkit. Here’s why:

  • Local Experimentation: Setting up things locally for experiments becomes a breeze. No more wrestling with conflicting dependencies or environments. Docker keeps it clean and easy.

  • Testing and CI Made Simple: Integration tests and CI/CD pipelines? Docker smoothens these processes. It’s like having a rehearsal space for your code before it hits the big stage.

  • Batch Jobs and Beyond: While Docker plays nice with AWS Batch, Kubernetes jobs, and more (though that’s a story for another day), it’s a glimpse into the world of possibilities with containerization.

  • Spark Joy: If you’re working with Spark, Docker can be a game-changer. It’s like having a consistent and controlled playground for your data processing.

  • Serverless, Stress-less: With the rise of serverless architectures like AWS Lambda, Docker ensures that you’re developing in an environment that mirrors your production setup. No more surprises!

So, there you have it. Containers are everywhere, and Docker is leading the parade. It’s not just a tool; it’s an essential part of the modern software development and deployment process.


Run Postgres and PGAdmin containers

You may need to create a network so that the containers can communicate. Then, run the Postgres database container. Notice that, since Docker containers are stateless and reinitialized on each run, you should mount Docker volumes to persist the Postgres internal and ingested data.

docker network create pg-network

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v /Users/jobandtalent/data-eng-bootcamp/ny_taxi_pg_data:/var/lib/postgresql/data \
  -p 5437:5432 \
  --name pg-database \
  --network=pg-network \
  postgres:13

Let's create and start the PG Admin container:

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  --name pgadmin \
  --network=pg-network \
  dpage/pgadmin4


Manage multiple containers with the Docker-compose file

Instead of managing our containers individually, it is way better to manage them from one single source. The docker-compose file allows you to specify the services/containers you want to build and run, from the image to run to the environment variables and volumes.

version: "3.11"
services:
  pg-database:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - ./ny_taxi_pg_data:/var/lib/postgresql/data
    ports:
      - 5432:5432
  pg-admin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - 8080:80
    volumes:
      - "pgadmin_conn_data:/var/lib/pgadmin:rw"

volumes:
  pgadmin_conn_data:
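With the compose file in place, both services can be brought up and torn down together; roughly like this (older installations use the docker-compose binary instead of the compose plugin):

docker compose up -d    # start pg-database and pg-admin in the background
docker compose down     # stop and remove the containers (named volumes are kept)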


Create your pipeline script and a Dockerfile

The pipeline script’s objective is to download the data from the US taxi rides dataset and insert it into the Postgres database. The script could be as simple as:

import polars as pl
from pydantic_settings import BaseSettings, SettingsConfigDict
from typing import ClassVar

TRIPS_TABLE_NAME = "green_taxi_trips"
ZONES_TABLE_NAME = "zones"

class PgConn(BaseSettings):
    model_config = SettingsConfigDict(env_prefix='PG_', env_file='.env', env_file_encoding='utf-8')
    user: str
    pwd: str
    host: str
    port: int
    db: str
    connector: ClassVar[str] = "postgresql"

    @property
    def uri(self):
        return f"{self.connector}://{self.user}:{self.pwd}@{self.host}:{self.port}/{self.db}"

df_ny_taxi = pl.read_csv("https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-09.csv.gz")
df_zones = pl.read_csv("https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv")

conn = PgConn()
df_zones.write_database(ZONES_TABLE_NAME, conn.uri)
df_ny_taxi.write_database(TRIPS_TABLE_NAME, conn.uri)

We have used the Polars and Pydantic (v2) libraries. With Polars, we load the data from CSVs and also manage how to write it to the database with the write_database DataFrame method. We use Pydantic to create a Postgres connection object that loads its configuration from the environment. For convenience we are using an .env config file, but it is not mandatory. As we will explain with the following chunk of code, we set the env variables in the Dockerfile and they remain accessible from inside the container, making it trivial to load them using BaseSettings and SettingsConfigDict.

Now, in order to Dockerize our data pipeline script, which as we just saw downloads the data and ingests it into Postgres, we need to create a Dockerfile with the specifications of the container. We are using Poetry as a dependency manager, so we need to include the pyproject.toml (a multi-purpose file, used here to tell Poetry the desired module version constraints) and the poetry.lock (Poetry-specific version pins respecting the package version constraints from pyproject.toml). We also copy the actual ingest_data.py Python pipeline file into the container.

FROM python:3.11
ARG POETRY_VERSION=1.7.1

WORKDIR /app

COPY ingest_data.py ingest_data.py
COPY .env .env
COPY pyproject.toml pyproject.toml
COPY poetry.lock poetry.lock

RUN pip3 install --no-cache-dir poetry==${POETRY_VERSION} \
    && poetry env use 3.11 \
    && poetry install --no-cache

ENTRYPOINT [ "poetry", "run", "python", "ingest_data.py" ]

Just by building the image in our root folder and running the container, the ingest_data.py script will be executed, and therefore the data downloaded and persisted in the Postgres database.
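For completeness, building and running it could look roughly like the following; the image tag is an arbitrary name chosen for this example, and it assumes the PG_HOST value in .env points at the pg-database container on the network created earlier:

# Build the image from the root folder containing the Dockerfile
docker build -t taxi_ingest:v001 .

# Run it on the same network as the Postgres container
docker run -it --network=pg-network taxi_ingest:v001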


2. Terraform: Manage your GCP infra. Google Storage and Google BigQuery


Terraform intro and basic commands

Terraform has become a key tool in modern infrastructure management. Terraform, named with a nod to the concept of terraforming planets, applies a similar idea to cloud and local platform infrastructure. It’s about creating and managing the necessary environment for software to run efficiently on platforms like AWS, GCP…

Terraform, developed by HashiCorp, is described as an “infrastructure as code” tool. It allows users to define and manage both cloud-based and on-premises resources in human-readable configuration files. These files can be versioned, reused, and shared, offering a consistent workflow to manage infrastructure throughout its lifecycle. The advantages of using Terraform include simplicity in tracking and modifying infrastructure, ease of collaboration (since configurations are file-based and can be shared on platforms like GitHub), and reproducibility. For instance, an infrastructure set up in a development environment can be replicated in production with minor adjustments. Additionally, Terraform helps in ensuring that resources are properly removed when no longer needed, avoiding unnecessary costs.

So this tool not only simplifies the process of infrastructure management but also ensures consistency and compliance with your infrastructure setup.

However, it’s important to note what Terraform is not. It doesn’t handle the deployment or updating of software on the infrastructure; it’s focused solely on the infrastructure itself. It doesn’t allow modifications to immutable resources without destroying and recreating them. For example, changing the type of a virtual machine would require its recreation. Terraform also only manages resources defined within its configuration files.


Set up Terraform for GCP deploys. From GCP account permissions to the main.tf file

Diving into the world of cloud infrastructure can be a daunting task, but with tools like Terraform, the process becomes more manageable and streamlined. Terraform, an open-source infrastructure as code software tool, allows users to define and provision a datacenter infrastructure using a high-level configuration language. Here’s a guide to setting up Terraform for Google Cloud Platform (GCP).

Creating a Service Account in GCP

Before we start coding with Terraform, it’s essential to establish a method for Terraform on our local machine to communicate with GCP. This involves setting up a service account in GCP – a special type of account used by applications, as opposed to individuals, to interact with the GCP services.

Creating a service account is straightforward. Log into the GCP console, navigate to the “IAM & Admin” section, and create a new service account. This account should be given specific permissions relevant to the resources you plan to manage with Terraform, such as Cloud Storage or BigQuery.

Once the service account is created, the next step is to manage its keys. These keys are crucial as they authenticate and authorize the Terraform script to perform actions in GCP. It’s vital to handle these keys with care, as they can be used to access your GCP resources. You should never expose these credentials publicly.

Setting Up Your Local Environment

After downloading the key as a JSON file, store it securely in your local environment. It’s recommended to create a dedicated directory for these keys to avoid any accidental uploads, especially if you’re using version control like Git.

Remember, you can configure Terraform to use these credentials in several ways. One common method is to set an environment variable pointing to the JSON file, but you can also specify the path directly in your Terraform configuration.
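For instance, one widely used option (shown here only as an illustration) is to export the well-known GOOGLE_APPLICATION_CREDENTIALS variable so the Google provider picks up the key automatically:

# Point the Google provider (and gcloud tooling) at the service account key
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/my_creds.json"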

Writing Terraform Configuration

With the service account set up, you can begin writing your Terraform configuration. This is done in a file typically named main.tf. In this file, you define your provider (in this case, GCP) and the resources you wish to create, update, or delete.

For instance, if you’re setting up a GCP storage bucket, you would define it in your main.tf file. Terraform configurations are declarative, meaning you describe your desired state, and Terraform figures out how to achieve it. You are then ready to run terraform init to start your project.

Planning and Applying Changes

Before applying any changes, it’s good practice to run terraform plan. This command shows what Terraform will do without actually making any changes. It’s a great way to catch errors or unintended actions.

Once you’re satisfied with the plan, run terraform apply to make the changes. Terraform will then reach out to GCP and make the necessary adjustments to match your configuration.

Cleaning Up: Terraform Destroy

When you no longer need the resources, Terraform makes it easy to clean up. Running terraform destroy will remove the resources defined in your Terraform configuration from your GCP account.
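Putting those commands together, a typical session for this week’s setup looks roughly like this sketch:

terraform init      # download the Google provider and prepare the working directory
terraform plan      # preview the changes without applying anything
terraform apply     # create the bucket and the BigQuery dataset
terraform destroy   # remove the resources when you no longer need them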

Lastly, a word on security: If you’re storing your Terraform configuration in a version control system like Git, be mindful of what you commit. Ensure that your service account keys and other sensitive data are not pushed to public repositories. Using a .gitignore file to exclude these sensitive files is a best practice.
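As a sketch of that practice, a .gitignore for a project like this would typically exclude at least the key directory and Terraform’s local state (the exact entries depend on your layout):

keys/
.terraform/
*.tfstate
*.tfstate.backup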

For instance, our main.tf file for creating a GCP Storage Bucket and a Big Query dataset looks like:

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "5.12.0"
    }
  }
}

provider "google" {
  credentials = var.credentials
  project     = "concise-quarter-411516"
  region      = "us-central1"
}

resource "google_storage_bucket" "demo_bucket" {
  name          = var.gsc_bucket_name
  location      = var.location
  force_destroy = true

  lifecycle_rule {
    condition {
      age = 1
    }
    action {
      type = "AbortIncompleteMultipartUpload"
    }
  }
}

resource "google_bigquery_dataset" "demo_dataset" {
  dataset_id = var.bq_dataset_name
}

As you may have noticed, some of the values are strings/ints/floats while others are var.* references. In the next section we talk about keeping the Terraform files tidy through the use of variables.


Parametrize files with variables.tf

Terraform variables offer a centralized and reusable way to manage values in infrastructure automation, separate from deployment plans. They are categorized into two main types: input variables for configuring infrastructure and output variables for retrieving information post-deployment. Input variables define values like server configurations and can be strings, lists, maps, or booleans. String variables simplify complex values, lists represent indexed values, maps store key-value pairs, and booleans handle true/false conditions.

Output variables are used to extract details like IP addresses after the infrastructure is deployed. Variables can be predefined in a file or via command-line, enhancing flexibility and readability. They also support overriding at deployment, allowing for customized infrastructure management. Sensitive information can be set as environmental variables, prefixed with TF_VAR_, for enhanced security. Terraform variables are essential for clear, manageable, and secure infrastructure plans.
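For instance, an output variable for this week’s bucket could be declared as follows; this snippet is only an illustration and not part of the course files:

output "bucket_name" {
  # expose the generated bucket name after terraform apply
  value = google_storage_bucket.demo_bucket.name
}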

In our case, the variables.tf we are using looks like this:

variable "credentials" {
  default = "./keys/my_creds.json"
}

variable "location" {
  default = "US"
}

variable "bq_dataset_name" {
  description = "BigQuery dataset name"
  default     = "demo_dataset"
}

variable "gcs_storage_class" {
  description = "Bucket Storage class"
  default     = "STANDARD"
}

variable "gsc_bucket_name" {
  description = "Storage bucket name"
  default     = "terraform-demo-20240115-demo-bucket"
}

Here we are parametrizing the credentials file, bucket location, storage class, bucket name…

As we’ve discussed, mastering Terraform variables is a key step towards advanced infrastructure automation and efficient code management.

For more information about Terraform variables, you can visit this post.



Stay updated on the Data Engineering and Data Analytics Engineer career paths

This was the content I gathered for the very first week of the DataTalks Data Engineering bootcamp. I’ve definitely enjoyed it and I’m excited to continue with Week 2.

If you want to stay updated and get the homework along with explanations…

Subscribe for Data Eng content and explained homework!
Categories: FLOSS Project Planets

Debug symbols for all!

Planet KDE - Sat, 2024-01-20 19:00

When you're running Linux software, encounter a crash, and make a bug report about it (thank you!), you may be asked for backtraces and debug symbols.

And if you're not developer you may wonder what in the heck are those?

I wanted to open up this topic a bit, but if you want a more technical, in-depth look into these things, the internet is full of info. :)

This is more of a guide for any common user who encounters this situation, covering what they can do to get these mystical backtraces, symbols, and magic to the devs.

Backtrace

When developers ask for a backtrace, they're basically asking "what are the steps that caused this crash to happen?" Debugger software can show this really nicely, line by line. However, without correct debug symbols, the backtrace can be meaningless.

But first, how do you get a backtrace of something?

On systems with systemd installed, you often have a terminal tool called coredumpctl. This tool can list many crashes you have had with software. When you see something say "segmentation fault, core dumped", this is the tool that can show you those core dumps.

So, here's a few ways to use it!

How to see all my crashes (coredumps)

Just type coredumpctl in a terminal and a list opens. It shows you a lot of information and, last, the app name.

How to open a specific coredump in a debugger

First, find the coredump you want to check out in the plain coredumpctl list. The easiest way to identify it is to check the date and time. On that row there's also a number called the PID, for example 12345. You can close the list by pressing q and then type coredumpctl debug 12345.
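Put together, the two commands look like this (12345 stands in for whatever PID you picked from the list):

coredumpctl                # list all recorded crashes
coredumpctl debug 12345    # open the chosen coredump in the debugger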

This will often open GDB, where you can type bt for it to start printing the backtrace. You can then copy that backtrace. But there's, IMO, an easier way.

Can I just get the backtrace automatically in a file..?

If you only want the latest coredump of the app that crashed on you, and want the backtrace printed to a text file that you can just send to devs, here's a one-liner to run in the terminal:

coredumpctl debug APP_NAME_HERE -A "-ex bt -ex quit" |& tee backtrace.txt

You can also use the PID shown earlier in place of the app name, if you want some specific coredump.

The above command will open the coredump in a debugger, run the bt command, then quit, and it will write it all down in a file called backtrace.txt that you can share with developers.

As always when using debugging and logging features, check the file for possible personal data! It's very unlikely to contain any personal data, BUT it's still good practice to check!

Here's a small snippet from a backtrace I have for Kate text editor:

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
...
#18 0x00007f5653fbcdb9 in parse_file (table=table@entry=0x19d5a60, file=file@entry=0x19c8590, file_name=file_name@entry=0x7f5618001590 "/usr/share/X11/locale/en_US.UTF-8/Compose") at ../src/compose/parser.c:749
#19 0x00007f5653fc5ce0 in xkb_compose_table_new_from_locale (ctx=0x1b0cc80, locale=0x18773d0 "en_IE.UTF-8", flags=<optimized out>) at ../src/compose/table.c:217
#20 0x00007f565138a506 in QtWaylandClient::QWaylandInputContext::ensureInitialized (this=0x36e63c0) at /usr/src/debug/qt6-qtwayland-6.6.0-1.fc39.x86_64/src/client/qwaylandinputcontext.cpp:228
#21 QtWaylandClient::QWaylandInputContext::ensureInitialized (this=0x36e63c0) at /usr/src/debug/qt6-qtwayland-6.6.0-1.fc39.x86_64/src/client/qwaylandinputcontext.cpp:214
#22 QtWaylandClient::QWaylandInputContext::filterEvent (this=0x36e63c0, event=0x7ffd27940c50) at /usr/src/debug/qt6-qtwayland-6.6.0-1.fc39.x86_64/src/client/qwaylandinputcontext.cpp:252
...

The first number is the step where we are. Step #0 is where the app crashes. The last step is where the application starts running. Keep in mind though that even if the app crashes at #0, that frame may just be the computer handling the crash rather than the actual culprit. The culprit for the crash can be anywhere in the backtrace. So you have to do some detective work if you want to figure it out. Often crashes happen when some code execution path goes down an unexpected route and the program is not prepared for that.

Remember that you will, however, need proper debug symbols for this to be useful! We'll check that out in the next chapter.

Debug symbols

Debug symbols are something that tells the developer using debugger software, like GDB, what is going on and where. Without debugging symbols the debugger can only show the developer more obfuscated data.

I find this easier to show with an example:

Without debug symbols, this is what the developer sees when reading the backtrace:

0x00007f7e9e29d4e8 in QCoreApplication::notifyInternal2(QObject*, QEvent*) () from /lib64/libQt5Core.so.5

Or, in an even worse scenario, where the debugger can't read what's going on and can only see the "mangled" names, it can look like this:

_ZN20QEventDispatcherGlib13processEventsE6QFlagsIN10QEventLoop17ProcessEventsFlagEE

Now, those are not very helpful. At least the first example tells what file the error is happening in, but it doesn't really tell where. And in the second example it is just very difficult to understand what's going on. You don't even see what file it is in.

With correct debug symbols installed however, this is what the developer sees:

QCoreApplication::notifyInternal2(QObject*, QEvent*) (receiver=0x7fe88c001620, event=0x7fe888002c20) at kernel/qcoreapplication.cpp:1064

As you can see, it shows the file and line. This is super helpful since developers can just open the file at this location and start mulling it over. No need to guess what line it may have happened on; it's right there!

So, where to get the debug symbols?

Every distro has its own way, but the KDE wiki has an excellent list of the most common operating systems and how to get debug symbols for them: https://community.kde.org/Guidelines_and_HOWTOs/Debugging/How_to_create_useful_crash_reports

As always, double check with your distros official documentation how to proceed. But the above link is a good starting point!

But basically, your package manager should have them. If not, you will have to build the app yourself with debug symbols enabled, which is definitely not ideal. If the above list does not have your distro/OS, you may have to ask the maintainers of your distro/OS for help with getting the debug symbols installed.
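Just to give a flavour of what "your package manager should have them" means in practice, here are some illustrative examples; the exact package names are assumptions, so check your own distro's documentation (or the wiki page above) for the real ones:

# Fedora and friends
sudo dnf debuginfo-install kate

# Debian/Ubuntu, with the debug symbol archive enabled
sudo apt install kate-dbgsym

# Or let gdb fetch symbols on demand via debuginfod
export DEBUGINFOD_URLS="https://debuginfod.elfutils.org/"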

Wait, which ones do I download?!

Usually the ones for the app that is crashing. Sometimes you may also need to include the libraries the app is using.

There is no real direct answer to this, but at the very least, get the debug symbols for the app. If developers need more, they will ask you to install the other ones too.

You can uninstall the debug symbols after you're done, but that's up to you.

Thanks for reading!

I hope this has been useful! I especially hope the terminal "oneliner" command mentioned above for printing backtraces quickly into a file is useful for you!

Happy backtracing! :)

Categories: FLOSS Project Planets

TechBeamers Python: Understanding Python Timestamps: A Simple Guide

Planet Python - Sat, 2024-01-20 14:19

In Python, a timestamp is like a timestamp! It’s a number that tells you when something happened. This guide will help you get comfortable with timestamps in Python. We’ll talk about how to make them, change them, and why they’re useful. Getting Started with Timestamp in Python A timestamp is just a way of saying, […]

The post Understanding Python Timestamps: A Simple Guide appeared first on TechBeamers.

Categories: FLOSS Project Planets

Improving the qcolor-from-literal Clazy check

Planet KDE - Sat, 2024-01-20 13:26

For all of you who don't know, Clazy is a clang compiler plugin that adds checks for Qt semantics. I have it as my default compiler, because it gives me useful hints when writing or working with preexisting code. Recently, I decided to give working on the project a try! One bigger contribution of mine was to the qcolor-from-literal check, which is a performance optimization. A QColor object has different constructors; this check is about the string constructor. It may accept standardized colors like “lightblue”, but also color patterns. Those can have different formats, but all provide an RGB value and optionally transparency. Having Qt parse this as a string causes performance overhead compared to alternatives.

Fixits for RGB/RGBA patterns

When using a color pattern like “#123” or “#112233”, you may simply replace the string parameter with an integer providing the same value. Rather than getting a generic warning about using this other constructor, a more specific warning with a replacement text (called fixit) is emitted.

testfile.cpp:92:16: warning: The QColor ctor taking RGB int value is cheaper than one taking string literals [-Wclazy-qcolor-from-literal]
    QColor("#123");
           ^~~~~~
           0x112233
testfile.cpp:93:16: warning: The QColor ctor taking RGB int value is cheaper than one taking string literals [-Wclazy-qcolor-from-literal]
    QColor("#112233");
           ^~~~~~~~~
           0x112233
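In code, applying the fixit is just a matter of swapping the string literal for the equivalent integer constructor; a minimal before/after sketch:

#include <QColor>

// Before: the color string has to be parsed at runtime
QColor before("#112233");

// After: the cheaper QColor(QRgb) constructor, as suggested by the fixit
QColor after(0x112233);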

In case a transparency parameter is specified, the fixit and message are adjusted:

testfile.cpp:92:16: warning: The QColor ctor taking ints is cheaper than one taking string literals [-Wclazy-qcolor-from-literal]
    QColor("#9931363b");
           ^~~~~~~~~~~
           0x31, 0x36, 0x3b, 0x99

Warnings for invalid color patterns

In addition to providing fixits for more optimized code, the check now verifies that the provided pattern is valid with regard to its length and the characters it contains. Without this addition, an invalid pattern would be silently ignored or cause an improper fixit to be suggested.

.../qcolor-from-literal/main.cpp:21:28: warning: Pattern length does not match any supported one by QColor, check the documentation [-Wclazy-qcolor-from-literal]
    QColor invalidPattern1("#0000011112222");
                           ^
.../qcolor-from-literal/main.cpp:22:28: warning: QColor pattern may only contain hexadecimal digits [-Wclazy-qcolor-from-literal]
    QColor invalidPattern2("#G00011112222");

Fixing a misleading warning for more precise patterns

In case a “#RRRGGGBBB” or “#RRRRGGGGBBBB” pattern is used, the message would previously suggest using the constructor taking ints. This would however result in an invalid QColor, because the range from 0-255 is exceeded. QRgba64 should be used instead, which provides higher precision.

I hope you find this new or rather improved feature of Clazy useful! I utilized the fixits in Kirigami, see https://invent.kde.org/frameworks/kirigami/-/commit/8e4a5fb30cc014cfc7abd9c58bf3b5f27f468168. Doing the change manually in Kirigami would have been way faster, but less fun. Also, we wouldn't have ended up with better tooling :)

Categories: FLOSS Project Planets

Gunnar Wolf: Ruffle helps bring back my family history

Planet Debian - Sat, 2024-01-20 13:17

Probably a trait of my family’s origins as migrants from Eastern Europe, probably part of the collective trauma of Jews throughout the world… or probably because that’s just who I turned out to be, I hold in high regard the preservation of the memory of my family’s photos, movies and such items. And it’s a trait shared by many people in my family group.

Shortly after my grandmother died 24 years ago, my mother did a large, loving work of digitalization and restoration of my grandparents’ photos. Sadly, the higher resolution copies of said photos are lost… but she took on the work of not just scanning the photos, but assembling them in presentations, telling a story, introducing my older relatives, many of them gone 40 or more years before my birth.

But said presentations were built using Flash. Right, not my choice of tool, and I told her so back in the day — but given I wasn’t around to do the work in what I’d have chosen (a standards-abiding format, naturally), and given my graphic design skills are nonexistent… Several years ago, when Adobe pulled the plug on the Flash format, we realized they would no longer be accessible. I managed to get the photos out of the presentations, but lost the narration, which is a great part of the work.

Three days ago, however, I read a post on https://www.osnews.com that made me jump to action: https://www.osnews.com/story/138350/ruffle-an-open-source-flash-player-emulator/.

Ruffle is an open source Flash Player emulator, written in Rust and compiled to WASM. Even though several OSnews readers report it to be buggy with some Flash games they long for, it worked just fine for a simple slideshow presentation.

So… I managed to bring it back to life! Yes, I’d like to make a better index page, but that will come later 😉

I am now happy and proud to share with you:

https://ausencia.iszaevich.net/

(which would be roughly translated as Caressing the absence: Iszaevich Fajerstein family, 1900-2000).

Categories: FLOSS Project Planets

TechBeamers Python: Python Pip Usage for Beginners

Planet Python - Sat, 2024-01-20 13:09

Python Pip, short for “Pip Installs Packages,” is a powerful package management system that simplifies the process of installing, upgrading, and managing Python libraries and dependencies. In this tutorial, we’ll delve into the various aspects of Python Pip usage, covering essential commands, best practices, and the latest updates in the Python package ecosystem. Before we […]

The post Python Pip Usage for Beginners appeared first on TechBeamers.

Categories: FLOSS Project Planets

Gunnar Wolf: A deep learning technique for intrusion detection system using a recurrent neural networks based framework

Planet Debian - Sat, 2024-01-20 12:42

So let’s assume you already know and understand that artificial intelligence’s main building blocks are perceptrons, that is, mathematical models of neurons. And you know that, while a single perceptron is too limited to get “interesting” information from, very interesting structures–neural networks–can be built with them. You also understand that neural networks can be “trained” with large datasets, and you can get them to become quite efficient and accurate classifiers for data comparable to your dataset. Finally, you are interested in applying this knowledge to defensive network security, particularly in choosing the right recurrent neural network (RNN) framework to create an intrusion detection system (IDS). Are you still with me? Good! This paper might be right for you!

The paper builds on a robust and well-written introduction and related work sections to arrive at explaining in detail what characterizes a RNN, the focus of this work, among other configurations also known as neural networks, and why they are particularly suited for machine learning (ML) tasks. RNNs must be trained for each problem domain, and publicly available datasets are commonly used for such tasks. The authors present two labeled datasets representing normal and hostile network data, identified according to different criteria: NSL-KDD and UNSW-NB15. They proceed to show a framework to analyze and compare different RNNs and run them against said datasets, segmented for separate training and validation phases, compare results, and finally select the best available model for the task–measuring both training speed as well as classification accuracy.

The paper is quite heavy due to both its domain-specific terminology–many acronyms are used throughout the text–and its use of mathematical notation, both to explain specific properties of each of the RNN types and for explaining the preprocessing carried out for feature normalization and selection. This is partly what led me to start the first paragraph by assuming that we, as readers, already understand a large body of material if we are to fully follow the text. The paper does begin by explaining its core technologies, but quickly ramps up and might get too technical for nonexpert readers.

It is undeniably an interesting and valuable read, showing the state of the art in IDS and ML-assisted technologies. It does not detail any specific technology applying its findings, but we will probably find the information conveyed here soon enough in industry publications.

Categories: FLOSS Project Planets

The Drop Times: Drupal Droid - The Custom Drupal GPT by Michael Miles

Planet Drupal - Sat, 2024-01-20 12:24
Meet Drupal Droid, a tailored AI model purpose-built for the Drupal Community, offering assistance with Drupal 9+ site building, development, and coding standards. Michael Miles, the brain behind this custom GPT, shares his input on Drupal Droid with The DropTimes.
Categories: FLOSS Project Planets

Niels Thykier: Making debputy: Writing declarative parsing logic

Planet Debian - Sat, 2024-01-20 12:10

In this blog post, I will cover how debputy parses its manifest and the conceptual improvements I made to make parsing of the manifest easier.

All instructions to debputy are provided via the debian/debputy.manifest file and said manifest is written in the YAML format. After the YAML parser has read the basic file structure, debputy does another pass over the data to extract the information from the basic structure. As an example, the following YAML file:

manifest-version: "0.1"
installations:
  - install:
      source: foo
      dest-dir: usr/bin

would be transformed by the YAML parser into a structure resembling:

{
  "manifest-version": "0.1",
  "installations": [
    {
      "install": {
        "source": "foo",
        "dest-dir": "usr/bin",
      }
    }
  ]
}

debputy then does a pass on this structure to translate it into an even higher level format, where the "install" part is turned into an InstallRule.

In the original prototype of debputy, I would hand-write functions to extract the data that should be transformed into the internal in-memory high level format. However, it was quite tedious. Especially because I wanted to catch every possible error condition and report "You are missing the required field X at Y" rather than the opaque KeyError: X message that would have been the default.

Beyond being tedious, it was also quite error prone. As an example, in debputy/0.1.4 I added support for the install rule and you should allegedly have been able to add a dest-dir: or an as: inside it. Except I screwed up the code and debputy was attempting to look up these keywords from a dict that could never have them.

Hand-writing these parsers was so annoying that it demotivated me from making manifest related changes to debputy simply because I did not want to code the parsing logic. When I had this realization, I figured I had to solve this problem better.

While reflecting on this, I also considered that I eventually wanted plugins to be able to add vocabulary to the manifest. If the API was "provide a callback to extract the details of whatever the user provided here", then the result would be bad.

  1. Most plugins would probably throw KeyError: X or ValueError style errors for quite a while. Worst case, they would end up on my table because the user would have a hard time telling where debputy ends and where the plugin starts. "Best" case, I would teach debputy to say "This poor error message was brought to you by plugin foo. Go complain to them". Either way, it would be a bad user experience.
  2. This even assumes plugin providers would actually bother writing manifest parsing code. If it is that difficult, then just providing a custom file in debian/ might tempt plugin providers, and that would undermine the idea of having the manifest be the sole input for debputy.

So beyond me being unsatisfied with the current situation, it was also clear to me that I needed to come up with a better solution if I wanted externally provided plugins for debputy. To put a bit more perspective on what I expected from the end result:

  1. It had to cover as many parsing errors as possible. Any error case this code handled for you would be one where I could ensure a sufficient degree of detail and context for the user.
  2. It should be type-safe / provide typing support such that IDEs/mypy could help you when you work on the parsed result.
  3. It had to support "normalization" of the input, such as:

     # User provides
     - install: "foo"
     # Which is normalized into:
     - install:
         source: "foo"

  4. It must be simple to tell debputy what input you expected.

At this point, I remembered that I had seen a Python (PYPI) package where you could give it a TypedDict and an arbitrary input (Sadly, I do not remember the name). The package would then validate the said input against the TypedDict. If the match was successful, you would get the result back casted as the TypedDict. If the match was unsuccessful, the code would raise an error for you. Conceptually, this seemed to be a good starting point for where I wanted to be.

Then I looked a bit at the normalization requirement (point 3). What is really going on here is that you have two "schemas" for the input. One is what the programmer will see (the normalized form) and the other is what the user can input (the manifest form). The problem is providing an automatic normalization from the user input to the simplified programmer structure. To expand a bit, consider the following example:

# User provides
- install: "foo"
# Which is normalized into:
- install:
    source: "foo"

Given that install has the attributes source, sources, dest-dir, as, into, and when, how exactly would you automatically normalize "foo" (str) into source: "foo"? Even if the code filtered by "type" for these attributes, you would end up with at least source, dest-dir, and as as candidates. Turns out that TypedDict actually got this covered. But the Python package was not going in this direction, so I parked it here and started looking into doing my own.

At this point, I had a general idea of what I wanted. When defining an extension to the manifest, the plugin would provide debputy with one or two definitions of TypedDict. The first one would be the "parsed" or "target" format, which would be the normalized form that the plugin provider wanted to work on. For this example, let's look at an earlier version of the install-examples rule:

# Example input matching this typed dict.
# {
#    "source": ["foo"]
#    "into": ["pkg"]
# }
class InstallExamplesTargetFormat(TypedDict):
    # Which source files to install (dest-dir is fixed)
    sources: List[str]
    # Which package(s) that should have these files installed.
    into: NotRequired[List[str]]

In this form, install-examples has two attributes - both are lists of strings. On the flip side, what the user can input would look something like this:

# Example input matching this typed dict.
# {
#    "source": "foo"
#    "into": "pkg"
# }
#
class InstallExamplesManifestFormat(TypedDict):
    # Note that sources here is split into source (str) vs. sources (List[str])
    sources: NotRequired[List[str]]
    source: NotRequired[str]
    # We allow the user to write ``into: foo`` in addition to ``into: [foo]``
    into: Union[str, List[str]]

FullInstallExamplesManifestFormat = Union[
    InstallExamplesManifestFormat,
    List[str],
    str,
]

The idea was that the plugin provider would use these two definitions to tell debputy how to parse install-examples. Pseudo-registration code could look something like:

def _handler(
    normalized_form: InstallExamplesTargetFormat,
) -> InstallRule:
    ...  # Do something with the normalized form and return an InstallRule.

concept_debputy_api.add_install_rule(
    keyword="install-examples",
    target_form=InstallExamplesTargetFormat,
    manifest_form=FullInstallExamplesManifestFormat,
    handler=_handler,
)

This was my conceptual target and while the current actual API ended up being slightly different, the core concept remains the same.

From concept to basic implementation

Building this code is kind of like swallowing an elephant. There was no way I would just sit down and write it from one end to the other. So the first prototype of this did not have all the features it has now.

Spoiler warning, these next couple of sections will contain some Python typing details. When reading this, it might be helpful to know things such as Union[str, List[str]] being the Python type for either a str (string) or a List[str] (list of strings). If typing makes your head spin, these sections might be less interesting for you.

To build this required a lot of playing around with Python's introspection and typing APIs. My very first draft only had one "schema" (the normalized form) and had the following features:

  • Read TypedDict.__required_keys__ and TypedDict.__optional_keys__ to determine which attributes were present and which were required. This was used for reporting errors when the input did not match.
  • Read the types of the provided TypedDict, strip the Required / NotRequired markers and use basic isinstance checks based on the resulting type for str and List[str]. Again, used for reporting errors when the input did not match.

This prototype did not take long (I remember it being within a day) and worked surprisingly well, though with some poor error messages here and there. Now came the first challenge, adding the manifest format schema plus relevant normalization rules. The very first normalization I did was transforming into: Union[str, List[str]] into into: List[str]. At that time, source was not a separate attribute. Instead, sources was a Union[str, List[str]], so it was the only normalization I needed for all my use-cases at the time.

There are two problems when writing a normalization. The first is determining what the "source" type is, what the target type is and how they relate. The second is providing a runtime rule for normalizing from the manifest format into the target format. Keeping it simple, the runtime normalizer for Union[str, List[str]] -> List[str] was written as:

def normalize_into_list(x: Union[str, List[str]]) -> List[str]:
    return x if isinstance(x, list) else [x]

This basic form basically works for all types (assuming none of the types will have List[List[...]]). The logic for determining when this rule is applicable is slightly more involved. My current code is about 100 lines of Python code that would probably lose most of the casual readers. For the interested, you are looking for _union_narrowing in declarative_parser.py

With this, when the manifest format had Union[str, List[str]] and the target format had List[str] the generated parser would silently map a string into a list of strings for the plugin provider.

But with that in place, I had covered the basics of what I needed to get started. I was quite excited about this milestone of having my first keyword parsed without handwriting the parser logic (at the expense of writing a more generic parse-generator framework).

Adding the first parse hint

With the basic implementation done, I looked at what to do next. As mentioned, at the time sources in the manifest format was Union[str, List[str]] and I considered splitting it into a source: str and a sources: List[str] on the manifest side while keeping the normalized form as sources: List[str]. I ended up committing to this change, which meant I had to solve the problem of getting my parser generator to understand the situation:

# Map from
class InstallExamplesManifestFormat(TypedDict):
    # Note that sources here is split into source (str) vs. sources (List[str])
    sources: NotRequired[List[str]]
    source: NotRequired[str]
    # We allow the user to write ``into: foo`` in addition to ``into: [foo]``
    into: Union[str, List[str]]

# ... into
class InstallExamplesTargetFormat(TypedDict):
    # Which source files to install (dest-dir is fixed)
    sources: List[str]
    # Which package(s) that should have these files installed.
    into: NotRequired[List[str]]

There are two related problems to solve here:

  1. How will the parser generator understand that source should be normalized and then mapped into sources?
  2. Once that is solved, the parser generator has to understand that while source and sources are declared as NotRequired, they are part of an "exactly one of" rule (since sources in the target form is Required). This mainly came down to extra bookkeeping and an extra layer of validation once the previous step was solved.

While working on all of this type introspection for Python, I had noted the Annotated[X, ...] type. It is basically a fake type that enables you to attach metadata into the type system. A very random example:

# For all intents and purposes, ``foo`` is a string despite all the ``Annotated`` stuff.
foo: Annotated[str, "hello world"] = "my string here"

The exciting thing is that you can put arbitrary details into the type field and read it out again in your introspection code. Which meant, I could add "parse hints" into the type. Some "quick" prototyping later (a day or so), I got the following to work:

# Map from
# {
#    "source": "foo" # (or "sources": ["foo"])
#    "into": "pkg"
# }
class InstallExamplesManifestFormat(TypedDict):
    # Note that sources here is split into source (str) vs. sources (List[str])
    sources: NotRequired[List[str]]
    source: NotRequired[
        Annotated[
            str,
            DebputyParseHint.target_attribute("sources")
        ]
    ]
    # We allow the user to write ``into: foo`` in addition to ``into: [foo]``
    into: Union[str, List[str]]

# ... into
# {
#    "source": ["foo"]
#    "into": ["pkg"]
# }
class InstallExamplesTargetFormat(TypedDict):
    # Which source files to install (dest-dir is fixed)
    sources: List[str]
    # Which package(s) that should have these files installed.
    into: NotRequired[List[str]]

Without me (as a plugin provider) writing a line of code, I can have debputy rename or "merge" attributes from the manifest form into the normalized form. Obviously, this required me (as the debputy maintainer) to write a lot of code so that future me and other plugin providers did not have to write it.

High level typing

At this point, basic normalization from one mapping form to another worked. But one thing irked me with these install rules. The into was a list of strings when the parser handed it over to me. However, I needed to map it to the actual BinaryPackage (for technical reasons). While I felt I was careful with my manual mapping, I knew this was exactly the kind of case where a busy programmer would skip the "is this a known package name" check, and some user would typo their package name, resulting in an opaque KeyError: foo.

Side note: "Some user" was me today, and I was super glad to see debputy tell me that I had typoed a package name (I would have been happier if I had remembered to use debputy check-manifest, so I did not have to wait through the upstream part of the build that happened before debhelper passed control to debputy...)

I thought adding this feature would be simple enough. It basically needs two things:

  1. A conversion table where the parser generator can tell that BinaryPackage requires an input of str and a callback to map from str to BinaryPackage. (That is probably a lie. I think the conversion table came later, but honestly I do not remember and I am not digging into the git history for this one.)
  2. At runtime, said callback needed access to the list of known packages, so it could resolve the provided string (a rough sketch of the idea follows after this list).
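To make the idea concrete, here is a rough sketch of what such a conversion table could look like. Everything in it is hypothetical (the names and layout are mine, not debputy's actual API); the point is only that a target type maps to the accepted input type plus a callback that validates and converts at parse time:

from typing import Callable, Dict, Tuple, Type


class ParseError(ValueError):
    pass


class BinaryPackage:
    def __init__(self, name):
        self.name = name


def resolve_binary_package(value, known_packages):
    # "known_packages" is the runtime context from point 2 above.
    try:
        return known_packages[value]
    except KeyError:
        raise ParseError(f"Unknown binary package: {value!r}") from None


# target type -> (accepted input type, converter callback)
CONVERSION_TABLE: Dict[Type, Tuple[Type, Callable]] = {
    BinaryPackage: (str, resolve_binary_package),
}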

It was not super difficult given the existing infrastructure, but it did take some hours of coding and debugging. Additionally, I added a parse hint to support making into conditionally required based on whether there was a single binary package. With this done, you could now write something like:

# Map from
class InstallExamplesManifestFormat(TypedDict):
    # Note that sources here is split into source (str) vs. sources (List[str])
    sources: NotRequired[List[str]]
    source: NotRequired[
        Annotated[
            str,
            DebputyParseHint.target_attribute("sources")
        ]
    ]
    # We allow the user to write ``into: foo`` in addition to ``into: [foo]``
    into: Union[BinaryPackage, List[BinaryPackage]]


# ... into
class InstallExamplesTargetFormat(TypedDict):
    # Which source files to install (dest-dir is fixed)
    sources: List[str]
    # Which package(s) that should have these files installed.
    into: NotRequired[
        Annotated[
            List[BinaryPackage],
            DebputyParseHint.required_when_multi_binary()
        ]
    ]

Code-wise, I still had to check for into being absent and provide a default for that case (that is still true in the current codebase - I will hopefully fix that eventually). But I now had less room for mistakes and a standardized error message when you misspell the package name, which was a plus.

The added side-effect - Introspection

A lovely side-effect of all the parsing logic being provided to debputy in a declarative form was that the generated parser snippets had fields containing all expected attributes with their types, which attributes were required, etc. This meant that adding an introspection feature where you can ask debputy "What does an install rule look like?" was quite easy. The code base already knew all of this, so the "hard" part was resolving the input to the concrete rule and then rendering it to the user.

I added this feature recently along with the ability to provide online documentation for parser rules. I covered that in more detail in my blog post Providing online reference documentation for debputy in case you are interested. :)

Wrapping it up

This was a short insight into how debputy parses your input. With this declarative technique:

  • The parser engine handles most of the error reporting, meaning users get most of the errors in a standard format without the plugin provider having to spend any effort on it. There will be some effort in more complex cases, but the common cases are done for you.
  • It is easy to provide flexibility to users while avoiding having to write code to normalize the user input into a simplified programmer oriented format.
  • The parser handles mapping from basic types into higher forms for you. These days, we have high-level types like FileSystemMode (either an octal or a symbolic mode), different kinds of file system matches depending on whether globs should be performed, etc. These types include their own validation and parsing rules that debputy handles for you.
  • Introspection and support for providing online reference documentation. Also, debputy checks that the provided attribute documentation covers all the attributes in the manifest form. If you add a new attribute, debputy will remind you if you forget to document it as well. :)

In this way everybody wins. Yes, writing this parser generator code was more enjoyable than writing the ad-hoc manual parsers it replaced. :)

Categories: FLOSS Project Planets

death and gravity: This is not interview advice: building a priority-expiry LRU cache without heaps or trees in Python

Planet Python - Sat, 2024-01-20 07:00

It's not your fault I got nerdsniped, but that doesn't matter.

Hi, I'm Adrian, and today we're implementing a least recently used cache with priorities and expiry, using only the Python standard library.

This is a bIG TEch CoDINg InTerVIEW problem, so we'll work hard to stay away from the correct™ data structures – no heaps, no binary search trees – but we'll end up with a decent solution anyway!

Requirements #

So you're at an interview and have to implement a priority-expiry LRU cache.

Maybe you get more details, but maybe the problem is deliberately vague; either way, we can reconstruct the requirements from the name alone.

A cache is something that stores data for future reuse, usually the result of a slow computation or retrieval. Each piece of data has an associated key used to retrieve it. Most caches are bounded in some way, for example by limiting the number of items.

The other words have to do with eviction – how and when items are removed.

  • Each item has a priority – when the cache fills up, we evict items with the lowest priority first.
  • Each item has a maximum age – if it's older than that, it is expired, so we don't return it to the user. It stands to reason expired items are evicted regardless of their priority or how full the cache is.
  • All other things being equal, we evict items least recently used relative to others.

The problem may specify an API; we can reconstruct that from first principles too. Since the cache is basically a key-value store, we can get away with two methods:

  • set(key, value, maxage, priority)
  • get(key) -> value or None

The problem may also suggest:

  • delete(key) – allows users to invalidate items for reasons external to the cache; not strictly necessary, but we'll end up with it as part of refactoring
  • evict(now) – not strictly necessary either, but hints eviction is a separate bit of logic, and may come in handy for testing

Types deserve their own discussion:

  • key – usually, the key is a string, but we can relax this to any hashable value
  • value – for an in-memory cache, any kind of object is fine
  • maxage and priority – a number should do for these; a float is more general, but an integer may allow a simpler solution; limits on these are important too, as we'll see soon enough

Tip

Your interviewer may be expecting you to uncover some of these details through clarifying questions. Be sure to think out loud and state your assumptions.

A minimal plausible solution #

I'm sure there are a lot of smart people out there that can Think Really Hard™ and just come up with a solution, but I'm not one of them, so we'll take an iterative, problem-solution approach to this.

Since right now we don't have any solution, we start with the simplest thing that could possibly work1 – a basic cache with no fancy eviction and no priorities; we can then write some tests against that, to know if we break anything going forward.

Tip

If during an interview you don't know what to do and choose to work up from the naive solution, make it very clear that's what you're doing. Your interviewer may give you hints to help you skip that.

A class holds all the things we need and gives us something to stick the API on:

class Cache:

    def __init__(self, maxsize, time=time.monotonic):
        self.maxsize = maxsize
        self.time = time
        self.cache = {}

And this is our first algorithmic choice: a dict (backed by a hash table) provides average O(1) search / insert / delete.

Tip

Given the problem we're solving, and the context we're solving it in, we have to talk about time complexity. Ned Batchelder's Big-O: How Code Slows as Data Grows provides an excellent introduction (text and video available).

set() leads to more choices:

def set(self, key, value, *, maxage=10, priority=0):
    now = self.time()

    if key in self.cache:
        self.cache.pop(key)
    elif len(self.cache) >= self.maxsize:
        self.evict(now)

    expires = now + maxage
    item = Item(key, value, expires, priority)
    self.cache[key] = item

First, we evict items only if there's no more room left. (There are other ways of doing this; for example, evicting expired items periodically minimizes memory usage.)

Second, if the key is already in the cache, we remove and insert it again, instead of updating things in place. This way, there's only one code path for setting items, which will make it a lot easier to keep multiple data structures in sync later on.

We use a named tuple to store the parameters associated with a key:

class Item(NamedTuple):
    key: object
    value: object
    expires: int
    priority: int

For now, we just evict an arbitrary item; in a happy little coincidence, dicts preserve insertion order, so when iterating over the cache, the oldest key is first.

def evict(self, now):
    if not self.cache:
        return
    key = next(iter(self.cache))
    del self.cache[key]

Finally, get() is trivial:

def get(self, key):
    item = self.cache.get(key)
    if not item:
        return None
    if self.time() >= item.expires:
        return None
    return item.value

With everything in place, here's the first test:

def test_basic():
    cache = Cache(2, FakeTime())
    assert cache.get('a') == None

    cache.set('a', 'A')
    assert cache.get('a') == 'A'

    cache.set('b', 'B')
    assert cache.get('a') == 'A'
    assert cache.get('b') == 'B'

    cache.set('c', 'C')
    assert cache.get('a') == None
    assert cache.get('b') == 'B'
    assert cache.get('c') == 'C'

To make things predictable, we inject a fake time implementation:

class FakeTime:

    def __init__(self, now=0):
        self.now = now

    def __call__(self):
        return self.now

Problem: expired items should go first... #

Following from the requirements, there's an order in which items get kicked out: first expired (lowest expiry time), then lowest priority, and only then least recently used. So, we need a data structure that can efficiently remove the smallest element.

Turns out, there's an abstract data type for that called a priority queue;2 for now, we'll honor its abstract nature and not bother with an implementation.

self.cache = {}
self.expires = PriorityQueue()

Since we need the item with the lowest expiry time, we need a way to get back to the item; an (expires, key) tuple should do fine – since tuples compare lexicographically, it'll be like comparing by expires alone, but with key along for the ride; in set(), we add:

self.cache[key] = item
self.expires.push((expires, key))
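As a quick aside (a REPL check, not part of the cache code), this is the lexicographic behavior we're relying on:

>>> (5, 'b') < (10, 'a')  # compared by expires first
True
>>> (5, 'a') < (5, 'b')   # ties broken by key
True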

You may be tempted (like I was) to say "hey, the item's already a tuple, if we make expires the first field, we can use the item itself", but let's delay optimizations until we have and understand a full solution – make it work, make it right, make it fast.

Still in set(), if the key is already in the cache, we also remove and insert it from the expires queue, so it's added back with the new expiry time.

if key in self.cache:
    item = self.cache.pop(key)
    self.expires.remove((item.expires, key))
    del item
elif len(self.cache) >= self.maxsize:
    self.evict(now)

Moving on to evicting things; for this, we need two operations: first peek at the item that expires next to see if it's expired, then, if it is, pop it from the queue. (Another choice: we only have to evict one item, but evict all expired ones.)

def evict(self, now):
    if not self.cache:
        return

    initial_size = len(self.cache)

    while self.cache:
        expires, key = self.expires.peek()
        if expires > now:
            break
        self.expires.pop()
        del self.cache[key]

    if len(self.cache) == initial_size:
        _, key = self.expires.pop()
        del self.cache[key]

If there are no expired items, we still have to make room for the one item; since we're not handling priorities yet, we'll evict the item that expires next a little early.

Problem: name PriorityQueue is not defined #

OK, to get the code working again, we need a PriorityQueue class. It doesn't need to be fast, we can deal with that after we finish everything else; for now, let's just keep our elements in a plain list.

class PriorityQueue:

    def __init__(self):
        self.data = []

The easiest way to get the smallest value is to keep the list sorted; the downside is that push() is now O(n log n) – although, because the list is always sorted, it can be as good as O(n) depending on the implementation.

def push(self, item):
    self.data.append(item)
    self.data.sort()

This makes peek() and pop() trivial; still, pop() is O(n), because it shifts all the items left by one position.

def peek(self):
    return self.data[0]

def pop(self):
    rv = self.data[0]
    self.data[:1] = []
    return rv

remove() is just as simple, and just as O(n), because it first needs to find the item, and then shift the ones after it to cover the gap.

def remove(self, item):
    self.data.remove(item)

We didn't use the is empty operation, but it should be O(1) regardless of implementation, so let's throw it in anyway:

def __bool__(self):
    return bool(self.data)

OK, let's wrap up with a quick test:

def test_priority_queue():
    pq = PriorityQueue()

    pq.push(1)
    pq.push(3)
    pq.push(2)

    assert pq
    assert pq.peek() == 1
    assert pq.pop() == 1
    assert pq.peek() == 2

    assert pq.remove(3) is None
    assert pq
    assert pq.peek() == 2
    with pytest.raises(ValueError):
        pq.remove(3)

    assert pq.pop() == 2
    assert not pq
    with pytest.raises(IndexError):
        pq.peek()
    with pytest.raises(IndexError):
        pq.pop()

Now the existing tests pass, and we can add more – first, that expired items are evicted (note how we're moving the time forward):

def test_expires():
    cache = Cache(2, FakeTime())

    cache.set('a', 'A', maxage=10)
    cache.set('b', 'B', maxage=20)
    assert cache.get('a') == 'A'
    assert cache.get('b') == 'B'

    cache.time.now = 15
    assert cache.get('a') == None
    assert cache.get('b') == 'B'

    cache.set('c', 'C')
    assert cache.get('a') == None
    assert cache.get('b') == 'B'
    assert cache.get('c') == 'C'

Second, that setting an existing item changes its expire time:

def test_update_expires():
    cache = Cache(2, FakeTime())

    cache.set('a', 'A', maxage=10)
    cache.set('b', 'B', maxage=10)

    cache.time.now = 5
    cache.set('a', 'X', maxage=4)
    cache.set('b', 'Y', maxage=6)
    assert cache.get('a') == 'X'
    assert cache.get('b') == 'Y'

    cache.time.now = 10
    assert cache.get('a') == None
    assert cache.get('b') == 'Y'

Problem: ...low priority items second #

Next up, kick out items by priority – shouldn't be too hard, right?

In __init__(), add another priority queue for priorities:

self.cache = {}
self.expires = PriorityQueue()
self.priorities = PriorityQueue()

In set(), add new items to the priorities queue:

self.cache[key] = item
self.expires.push((expires, key))
self.priorities.push((priority, key))

...and remove already-cached items:

if key in self.cache:
    item = self.cache.pop(key)
    self.expires.remove((item.expires, key))
    self.priorities.remove((item.priority, key))
    del item

In evict(), remove expired items from the priorities queue:

while self.cache:
    expires, key = self.expires.peek()
    if expires > now:
        break
    self.expires.pop()
    item = self.cache.pop(key)
    self.priorities.remove((item.priority, key))

...and finally, if none are expired, remove the one with the lowest priority:

if len(self.cache) == initial_size:
    _, key = self.priorities.pop()
    item = self.cache.pop(key)
    self.expires.remove((item.expires, key))

Add one test for eviction by priority:

def test_priorities():
    cache = Cache(2, FakeTime())

    cache.set('a', 'A', priority=1)
    cache.set('b', 'B', priority=0)
    assert cache.get('a') == 'A'
    assert cache.get('b') == 'B'

    cache.set('c', 'C')
    assert cache.get('a') == 'A'
    assert cache.get('b') == None
    assert cache.get('c') == 'C'

...and one for updating the priority of existing items:

def test_update_priorities():
    cache = Cache(2, FakeTime())

    cache.set('a', 'A', priority=1)
    cache.set('b', 'B', priority=0)
    cache.set('b', 'Y', priority=2)

    cache.set('c', 'C')
    assert cache.get('a') == None
    assert cache.get('b') == 'Y'
    assert cache.get('c') == 'C'

Problem: we're deleting items in three places #

I said we'll postpone performance optimizations until we have a complete solution, but I have a different kind of optimization in mind – for readability.

We're deleting items in three slightly different ways, careful to keep three data structures in sync each time; it would be nice to do it only once. While a bit premature, through the magic of having written the article already, I'm sure it will pay off.

def delete(self, key):
    *_, expires, priority = self.cache.pop(key)
    self.expires.remove((expires, key))
    self.priorities.remove((priority, key))

Sure, eviction is twice as slow, but the complexity stays the same – the constant factor in O(2n) gets removed, leaving us with O(n). If needed, we can go back to the unpacked version once we have a reasonably efficient implementation (that's what tests are for).

Deleting already-cached items is shortened to:

if key in self.cache:
    self.delete(key)

Idem for the core eviction logic:

while self.cache:
    expires, key = self.expires.peek()
    if expires > now:
        break
    self.delete(key)

if len(self.cache) == initial_size:
    _, key = self.priorities.peek()
    self.delete(key)

Neat!

Problem: ...least recently used items last #

So, how does one implement a least recently used cache?

We could google it... or, we could look at an existing implementation.

functools.lru_cache() #

Standard library functools.lru_cache() comes to mind first; let's have a look.

Tip

You can read the code of stdlib modules by following the Source code: link at the top of each documentation page.

lru_cache() delegates to _lru_cache_wrapper(), which sets up a bunch of variables to be used by nested functions.3 Among the variables are a cache dict and a doubly linked list where nodes are [prev, next, key, value] lists.4

And that's the answer – a doubly linked list allows tracking item use in O(1): each time a node is used, remove it from its current position and plop it at the "recently used" end; whatever's at the other end will be the least recently used item.
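As a rough sketch of that splice (my own illustration; lru_cache() itself uses a circular list with a root sentinel and plain lists as nodes), the two O(1) steps look something like this:

class Node:
    __slots__ = ('prev', 'next', 'key', 'value')


def mark_used(node, tail):
    # unlink the node from wherever it currently is: O(1)
    node.prev.next = node.next
    node.next.prev = node.prev
    # splice it back in just before the "recently used" sentinel: O(1)
    node.prev = tail.prev
    node.next = tail
    tail.prev.next = node
    tail.prev = node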

Note that, unlike lru_cache(), we need one doubly linked list for each priority.

But, before making Item mutable and giving it prev/next links, let's dive deeper.

OrderedDict #

If you search the docs for "LRU", the next result after lru_cache() is OrderedDict.

Some differences from dict still remain: [...] The OrderedDict algorithm can handle frequent reordering operations better than dict. [...] this makes it suitable for implementing various kinds of LRU caches.

Specifically:

OrderedDict has a move_to_end() method to efficiently reposition an element to an endpoint.

Since dicts preserve insertion order, you can use d[k] = d.pop(k) to move items to the end... What makes move_to_end() better, then? This comment may shed some light:

# The internal self.__map dict maps keys to links in a doubly linked list.

Indeed, move_to_end() does exactly what we described above – this is good news, it means we don't have to do it ourselves!
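A quick REPL demonstration of that behavior (bare keys standing in for cache entries):

>>> from collections import OrderedDict
>>> d = OrderedDict.fromkeys('abc')
>>> d.move_to_end('a')
>>> list(d)
['b', 'c', 'a']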

So, we need one OrderedDict (read: doubly linked list) for each priority, but we still need to keep track of the lowest priority:

self.cache = {}
self.expires = PriorityQueue()
self.priority_buckets = {}
self.priority_order = PriorityQueue()

Handling priorities in set() gets a bit more complicated:

priority_bucket = self.priority_buckets.get(priority)
if not priority_bucket:
    priority_bucket = self.priority_buckets[priority] = OrderedDict()
    self.priority_order.push(priority)
priority_bucket[key] = None

But now we can finally evict the least recently used item:

if len(self.cache) == initial_size:
    priority = self.priority_order.peek()
    priority_bucket = self.priority_buckets.get(priority)
    key = next(iter(priority_bucket))
    self.delete(key)

In delete(), we're careful to get rid of empty buckets:5

priority_bucket = self.priority_buckets[priority]
del priority_bucket[key]
if not priority_bucket:
    del self.priority_buckets[priority]
    self.priority_order.remove(priority)

Existing tests pass again, and we can add a new (still failing) one:

def test_lru():
    cache = Cache(2, FakeTime())

    cache.set('a', 'A')
    cache.set('b', 'B')
    cache.get('a') == 'A'

    cache.set('c', 'C')
    assert cache.get('a') == 'A'
    assert cache.get('b') == None
    assert cache.get('c') == 'C'

All that's needed to make it pass is to call move_to_end() in get():

self.priority_buckets[item.priority].move_to_end(key)
return item.value


Problem: our priority queue is slow #

OK, we have a complete solution, it's time to deal with the priority queue implementation. Let's do a quick recap of the methods we need and why:

  • push() – to add items
  • peek() – to get items / buckets with the lowest expiry time / priority
  • remove() – to delete items
  • pop() – not used, but would be without the delete() refactoring

We make two related observations: first, there's no remove operation on the priority queue Wikipedia page; second, even if we unpack delete(), we only get to pop() an item/bucket from one of the queues, and still have to remove() it from the other.

And this is what makes the problem tricky – we need to maintain not one, but two independent priority queues.

heapq #

If you search the docs for "priority queue", you'll find heapq, which:

[...] provides an implementation of the heap queue algorithm, also known as the priority queue algorithm.


Reading on, we find extensive notes on implementing priority queues; particularly interesting are using (priority, item) tuples (already doing this!) and removing entries:

Removing the entry or changing its priority is more difficult because it would break the heap structure invariants. So, a possible solution is to mark the entry as removed and add a new entry with the revised priority.

This workaround is needed because while removing the i-th element can be done in O(log n), finding its index is O(n). To summarize, we have:

          sort   heapq
push()    O(n)   O(log n)
peek()    O(1)   O(1)
pop()     O(n)   O(log n)
remove()  O(n)   O(n)
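For reference, the workaround the docs describe could look roughly like this sketch (not the approach we end up taking):

import heapq


class LazyRemovalQueue:
    """Heap-backed queue with lazy removal, per the heapq docs recipe."""

    def __init__(self):
        self.heap = []
        self.removed = set()

    def push(self, entry):
        heapq.heappush(self.heap, entry)      # O(log n)

    def remove(self, entry):
        self.removed.add(entry)               # O(1); real cleanup happens in pop()

    def pop(self):
        while self.heap:
            entry = heapq.heappop(self.heap)  # O(log n)
            if entry in self.removed:
                self.removed.discard(entry)
                continue
            return entry
        raise IndexError('pop from empty queue')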

Still, with a few mitigating assumptions, it could work:

  • we can assume priorities are static, so buckets never get removed
  • to-be-removed expiry times will get popped sooner or later anyway; we can assume that most evictions are due to expired items, and that items being evicted due to low priority (i.e. when the cache is full) and item updates are rare (both cause to-be-removed entries to accumulate in the expiry queue)
bisect #

One way of finding an element in better than O(n) is bisect, which:

[...] provides support for maintaining a list in sorted order without having to sort the list after each insertion.

This may provide an improvement to our naive implementation; sadly, reading further to Performance Notes we find that:

The insort() functions are O(n) because the logarithmic search step is dominated by the linear time insertion step.

While in general that's better than just about any sort, we happen to be hitting the best case of our sort implementation, which has the same complexity. (Nevertheless, shifting elements is likely cheaper than the same number of comparisons.)

          sort   heapq     bisect
push()    O(n)   O(log n)  O(n)
peek()    O(1)   O(1)      O(1)
pop()     O(n)   O(log n)  O(n)
remove()  O(n)   O(n)      O(n)

Further down in the docs, there's a see also box:

Sorted Collections is a high performance module that uses bisect to managed sorted collections of data.

Not in stdlib, moving along... ¯\_(ツ)_/¯

pop() optimization #

There's an unrelated improvement that applies to both the naive solution and bisect. With a sorted list, pop() is O(n) because it shifts all elements after the first; if the order was reversed, we'd pop() from the end, which is O(1). So:

          sort   heapq     bisect
push()    O(n)   O(log n)  O(n)
peek()    O(1)   O(1)      O(1)
pop()     O(1)*  O(log n)  O(1)*
remove()  O(n)   O(n)      O(n)

Binary search trees #

OK, I'm out of ideas, and there's nothing else in stdlib that can help.

We can restate the problem as follows: we need a sorted data structure that can do better than O(n) for push() / remove().

We've already peeked at the Wikipedia priority queue page, so let's keep reading – skipping past the naive implementations, to the usual implementation, we find that:

To improve performance, priority queues are typically based on a heap, [...]

Looked into that, didn't work; next:

Alternatively, when a self-balancing binary search tree is used, insertion and removal also take O(log n) time [...]

There, that's what we're looking for! (And likely what your interviewer is, too.)

          sort   heapq     bisect  BST
push()    O(n)   O(log n)  O(n)    O(log n)
peek()    O(1)   O(1)      O(1)    O(log n)
pop()     O(1)*  O(log n)  O(1)*   O(log n)
remove()  O(n)   O(log n)  O(n)    O(log n)

But there's no self-balancing BST in the standard library, and I sure as hell am not implementing one right now – I still have flashbacks from when I tried to do a red-black tree and two hours later it still had bugs (I mean, look at the length of this explanation!).

After a bit of googling we find, among others, bintrees, a mature library that provides all sorts of binary search trees... except:

Bintrees Development Stopped

Use sortedcontainers instead: https://pypi.python.org/pypi/sortedcontainers

Sounds familiar, doesn't it?

Sorted Containers #

Let's go back to that Sorted Collections library bisect was recommending:

Depends on the Sorted Containers module.

(¬‿¬ )

I remember, I remember now... I'm not salty because the red-black tree took two hours to implement. I'm salty because after all that time, I found Sorted Containers, a pure Python library that is faster in practice than fancy self-balancing binary search trees implemented in C!

It has extensive benchmarks to prove it, and simulated workload benchmarks for our own use case, priority queues – so yeah, while the interview answer is "self-balancing BSTs", the actual answer is Sorted Containers.

How does it work? There's an extensive explanation too:6

The Sorted Containers internal implementation is based on a couple observations. The first is that Python’s list is fast, really fast. [...] The second is that bisect.insort7 is fast. This is somewhat counter-intuitive since it involves shifting a series of items in a list. But modern processors do this really well. A lot of time has been spent optimizing mem-copy/mem-move-like operations both in hardware and software.

But using only one list and bisect.insort would produce sluggish behavior for lengths exceeding ten thousand. So the implementation of Sorted List uses a list of lists to store elements. [...]

There's also a comparison with trees, which I'll summarize: fewer memory allocations, better cache locality, lower memory overhead, and faster iteration.

I think that gives you a decent idea of how and why it works, enough that with a bit of tinkering you might be able to implement it yourself.8

Problem: Sorted Containers is not in stdlib #

But Sorted Containers is not in the standard library either, and we don't want to implement it ourselves. We did learn something from it, though:

          sort   heapq     bisect  bisect <10k  BST
push()    O(n)   O(log n)  O(n)    O(log n)     O(log n)
peek()    O(1)   O(1)      O(1)    O(1)         O(log n)
pop()     O(1)*  O(log n)  O(1)*   O(1)*        O(log n)
remove()  O(n)   O(log n)  O(n)    O(log n)     O(log n)

We still need to make some assumptions, though:

  1. Do we really need more than 10k priorities? Likely no, let's just cap them at 10k.
  2. Do we really need more than 10k expiry times? Maybe? – with 1 second granularity we can represent only up to 2.7 hours; 10 seconds takes us up to 27 hours, which may just work.

OK, one more and we're done. The other issue, aside from the maximum time, is that the granularity is too low, especially for short times – rounding 1 to 10 seconds is much worse than rounding 2001 to 2010 seconds. Which begs the question –

  1. Does it really matter if items expire in 2010 seconds instead of 2001? Likely no, but we need a way to round small values with higher granularity than big ones.
Logarithmic time #

How about 1, 2, 4, 8, ...? Rounding up to powers of 2 gets us decreasing granularity, but time doesn't actually start at zero. We fix this by rounding up to multiples of powers of 2 instead; let's get an intuition of how it works:

ceil((2000 +  1) /  1) *  1 = 2001
ceil((2000 +  2) /  2) *  2 = 2002
ceil((2000 +  3) /  4) *  4 = 2004
ceil((2000 +  4) /  4) *  4 = 2004
ceil((2000 + 15) / 16) * 16 = 2016
ceil((2000 + 16) / 16) * 16 = 2016
ceil((2000 + 17) / 32) * 32 = 2048

So far so good, how about after some time has passed?

ceil((2013 +  1) /  1) *  1 = 2014
ceil((2013 +  2) /  2) *  2 = 2016
ceil((2013 +  3) /  4) *  4 = 2016
ceil((2013 +  4) /  4) *  4 = 2020
ceil((2013 + 15) / 16) * 16 = 2032
ceil((2013 + 16) / 16) * 16 = 2032
ceil((2013 + 17) / 32) * 32 = 2048

The beauty of aligned powers is that for a relatively constant number of expiry times, the number of buckets remains roughly the same over time – as closely packed buckets are removed from the beginning, new ones fill the gaps between the sparser ones towards the end.

OK, let's put it into code:

def log_bucket(now, maxage):
    next_power = 2 ** math.ceil(math.log2(maxage))
    expires = now + maxage
    bucket = math.ceil(expires / next_power) * next_power
    return bucket

>>> [log_bucket(0, i) for i in [1, 2, 3, 4, 15, 16, 17]]
[1, 2, 4, 4, 16, 16, 32]
>>> [log_bucket(2000, i) for i in [1, 2, 3, 4, 15, 16, 17]]
[2001, 2002, 2004, 2004, 2016, 2016, 2048]
>>> [log_bucket(2013, i) for i in [1, 2, 3, 4, 15, 16, 17]]
[2014, 2016, 2016, 2020, 2032, 2032, 2048]

Looking good!

There are two sources of error – first from rounding maxage, worst when it's one more than a power of 2, and second from rounding the expiry time, also worst when it's one more than a power of two. Together, they approach 200% of maxage:

>>> log_bucket(0, 17)   # (32 - 17) / 17 ~= 88%
32
>>> log_bucket(0, 33)   # (64 - 33) / 33 ~= 94%
64
>>> log_bucket(16, 17)  # (64 - 33) / 17 ~= 182%
64
>>> log_bucket(32, 33)  # (128 - 65) / 33 ~= 191%
128

200% error is quite a lot; before we set out to fix it, let's confirm our reasoning.

def error(now, maxage, *args):
    """log_bucket() error."""
    bucket = log_bucket(now, maxage, *args)
    return (bucket - now) / maxage - 1

def max_error(now, max_maxage, *args):
    """Worst log_bucket() error for all maxages up to max_maxage."""
    return max(
        error(now, maxage, *args)
        for maxage in range(1, max_maxage)
    )

def max_error_random(n, *args):
    """Worst log_bucket() error for random inputs, out of n tries."""
    max_now = int(time.time()) * 2
    max_maxage = 3600 * 24 * 31
    rand = functools.partial(random.randint, 1)
    return max(
        error(rand(max_now), rand(max_maxage), *args)
        for _ in range(n)
    )

>>> max_error(0, 10_000)
0.9997558891736849
>>> max_error(2000, 10_000)
1.9527896995708156
>>> max_error_random(10_000_000)
1.9995498725910554

Looks confirmed enough to me.

So, how do we make the error smaller? Instead of rounding to the next power of 2, we round to the next half of a power of 2, or next quarter, or next eighth...

def log_bucket(now, maxage, shift=0):
    next_power = 2 ** max(0, math.ceil(math.log2(maxage) - shift))
    expires = now + maxage
    bucket = math.ceil(expires / next_power) * next_power
    return bucket

It seems to be working:

>>> for s in range(5):
...     print([log_bucket(0, i, s) for i in [1, 2, 3, 4, 15, 16, 17]])
...
[1, 2, 4, 4, 16, 16, 32]
[1, 2, 4, 4, 16, 16, 32]
[1, 2, 3, 4, 16, 16, 24]
[1, 2, 3, 4, 16, 16, 20]
[1, 2, 3, 4, 15, 16, 18]
>>> for s in range(10):
...     e = max_error_random(1_000_000, s)
...     print(f'{s} {e:6.1%}')
...
0 199.8%
1  99.9%
2  50.0%
3  25.0%
4  12.5%
5   6.2%
6   3.1%
7   1.6%
8   0.8%
9   0.4%

With shift=7, the error is less than two percent; I wonder how many buckets that is...

def max_buckets(max_maxage, *args):
    """Number of buckets to cover all maxages up to max_maxage."""
    now = time.time()
    buckets = {
        log_bucket(now, maxage, *args)
        for maxage in range(1, max_maxage)
    }
    return len(buckets)

>>> max_buckets(3600 * 24, 7)
729
>>> max_buckets(3600 * 24 * 31, 7)
1047
>>> max_buckets(3600 * 24 * 365, 7)
1279

A bit over a thousand buckets for the whole year, not bad!

Before we can use any of that, we need to convert expiry times to buckets; that looks a lot like the priority buckets code, the only notable part being eviction.

__init__():

self.cache = {}
self.expires_buckets = {}
self.expires_order = PriorityQueue()
self.priority_buckets = {}
self.priority_order = PriorityQueue()

set():

expires_bucket = self.expires_buckets.get(expires)
if not expires_bucket:
    expires_bucket = self.expires_buckets[expires] = set()
    self.expires_order.push(expires)
expires_bucket.add(key)

delete():

expires_bucket = self.expires_buckets[expires]
expires_bucket.remove(key)
if not expires_bucket:
    del self.expires_buckets[expires]
    self.expires_order.remove(expires)

evict():

while self.cache:
    expires = self.expires_order.peek()
    if expires > now:
        break
    expires_bucket = self.expires_buckets[expires]
    for key in list(self.expires_buckets[expires]):
        self.delete(key)

And now we use log_bucket(). Since we're at it, why not have unlimited priorities too? A hammer is a hammer and everything is a nail, after all.

expires = log_bucket(now, maxage, shift=7)
priority = log_bucket(0, priority+1, shift=7)
item = Item(key, value, expires, priority)


bisect, redux #

Time to fix that priority queue.

We use insort() to add priorities and operator.neg() to keep the list reversed:9

def push(self, item):
    bisect.insort(self.data, item, key=operator.neg)

We update peek() and pop() to handle the reverse order:

def peek(self):
    return self.data[-1]

def pop(self):
    return self.data.pop()

Finally, for remove() we adapt the index() recipe from Searching Sorted Lists:

def remove(self, item):
    i = bisect.bisect_left(self.data, -item, key=operator.neg)
    if i != len(self.data) and self.data[i] == item:
        del self.data[i]
        return
    raise ValueError

And that's it, we're done!

Here's the final version of the code.

Conclusion #

Anyone expecting you to implement this in under an hour is delusional. Explaining what you would use and why should be enough for reasonable interviewers, although that may prove difficult if you haven't solved this kind of problem before.

Bullshit interviews aside, it is useful to have basic knowledge of time complexity. Again, can't recommend Big-O: How Code Slows as Data Grows enough.

But, what big O notation says and what happens in practice can differ quite a lot. Be sure to measure, and be sure to think of limits – sometimes, the n in O(n) is or can be made small enough you don't have to do the theoretically correct thing.

You don't need to know how to implement all the data structures, that's what (software) libraries and Wikipedia are for (and for that matter, book libraries too). However, it is useful to have an idea of what's available and when to use it.

Good libraries educate – the Python standard library docs already cover a lot of the practical knowledge we needed, and so did Sorted Containers. But, that won't show up in the API reference you see in your IDE; you have to read the actual documentation.


  1. Note the subtitle: if you're not sure what to do yet. [return]

  2. This early on, the name doesn't really matter, but we'll go with the correct, descriptive one; in the first draft of the code, it was called MagicDS. ✨ [return]

  3. You have to admit this is at least a bit weird; what you're looking at is an object in a trench coat, at least if you think closures and objects are equivalent. [return]

  4. Another way of getting an "object" on the cheap. [return]

  5. If we assume a relatively small number of buckets that will be reused soon enough, this isn't strictly necessary. I'm partly doing it to release the memory held by the dict, since dicts are resized only when items are added. [return]

  6. There's even a PyCon talk with the same explanation, if you prefer that. [return]

  7. bisect itself has a fast C implementation, so I guess technically it's not pure Python. But given that stdlib is already there, does that count? [return]

  8. If the implementation is easy to explain, it may be a good idea. [return]

  9. This limits priorities to values that can be negated, so tuples won't work anymore. We could use a "reversed view" wrapper if we really cared about that. [return]

Categories: FLOSS Project Planets

Thomas Koch: Rebuild search with trust

Planet Debian - Sat, 2024-01-20 06:10
Posted on January 20, 2024 Tags: debian, free software, life, search, decentralization

Finally there is a thing people can agree on:

Apparently, Google Search is not good anymore. And I’m not the only one thinking about decentralization to fix it:

Honey I federated the search engine - finding stuff online post-big tech - a lightning talk at the recent chaos communication congress

The speaker however did not mention that there have already been many attempts at building distributed search engines. So why do I think that such an attempt could finally succeed?

  • More people are searching for alternatives to Google.
  • Mainstream hard discs are incredibly big.
  • Mainstream internet connection is incredibly fast.
  • Google is bleeding talent.
  • Most of the building blocks are available as free software.
  • “Success” depends on your definition…

My definition of success is:

A mildly technical computer user (able to install software) has access to a search engine that provides them with superior search results compared to Google for at least a few predefined areas of interest.

The exact algorithm used by Google Search to rank websites is a secret even to most Googlers. However I assume that it relies heavily on big data.

A distributed search engine however can instead rely on user input. Every admin of one node seeds the node ranking with their personal selection of trusted sites. They connect their node with nodes of people they trust. This results in a web of (transitive) trust much like pgp.
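As a toy illustration of that idea (entirely my own sketch, not something the post specifies), trust could decay with every hop away from a node's own seed list:

def transitive_trust(graph, start, decay=0.5, max_hops=3):
    """graph maps a node to the set of nodes it directly trusts."""
    trust = {start: 1.0}
    frontier = {start}
    for hop in range(1, max_hops + 1):
        next_frontier = set()
        for node in frontier:
            for peer in graph.get(node, ()):
                score = decay ** hop
                if score > trust.get(peer, 0.0):
                    trust[peer] = score
                    next_frontier.add(peer)
        frontier = next_frontier
    return trust

# alice trusts bob, bob trusts carol
print(transitive_trust({'alice': {'bob'}, 'bob': {'carol'}}, 'alice'))
# {'alice': 1.0, 'bob': 0.5, 'carol': 0.25}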

Imagine you are searching for something in a world without computers: You ask the people around you and probably they forward your question to their peers.

I already had a look at YaCy. It is active, somewhat usable and has a friendly maintainer. Unfortunately I consider the codebase to not be worth the effort. Nevertheless, YaCy is a good example that a decentralized search software can be done even by a small team or just one person.

I myself started working on software in Haskell and keep my notes here: Populus:DezInV. Since I'm learning Haskell along the way, there is nothing there to see yet. Additionally I took a yak shaving break to learn nix.

By the way: DuckDuckGo is not the alternative. And while I would encourage you to also try Yandex for a second opinion, I don’t consider this a solution.

Categories: FLOSS Project Planets

TechBeamers Python: How to Remove Whitespace from a String in Python

Planet Python - Sat, 2024-01-20 05:57

This tutorial explains multiple ways to remove whitespace from a string in Python. Often, as a programmer, we need to put in a mechanism to remove whitespace, especially trailing spaces. So, with the solutions given here, you would be able to remove most types of whitespace from a string. Multiple Ways to Remove Whitespace from […]
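For a flavor of what such solutions look like (a minimal sketch, not the tutorial's own code):

import re

s = '  Hello   World  '
print(s.strip())                       # 'Hello   World' - trims leading/trailing whitespace
print(s.replace(' ', ''))              # 'HelloWorld' - removes every space
print(re.sub(r'\s+', ' ', s).strip())  # 'Hello World' - collapses internal runs too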

The post How to Remove Whitespace from a String in Python appeared first on TechBeamers.

Categories: FLOSS Project Planets

Events in 2024

Planet KDE - Sat, 2024-01-20 05:00

Having ended 2023 by attending 37C3, the new year starts with planning and booking for conferences and events I want to attend in 2024.

My Events in 2024 FOSDEM

February 3-4 in Brussels, Belgium. I'll be speaking about semantic data extraction in KMail and Nextcloud Mail in the Modern Email devroom.

There’s also going to be a Railways and Open Transport devroom again which I’m particularly looking forward to, and of course the KDE stand.

OSM Hack Weekend

February 24-25 in Karlsruhe, Germany (wiki).

Based on historical patterns another such weekend can probably be expected for autumn, and a similar pair of events might also happen in Berlin again. As time permits I’ll try to join those as well.

FOSSGIS Konferenz

March 20-23 in Hamburg, Germany. Together with Tobias Knerr I’ll be hosting a session about OSM indoor mapping.

KDE Akademy

September 07-12 in Würzburg, Germany. It would need a pandemic for me to miss that one.

38C3

And for concluding 2024 there’s then going to be the 38th Chaos Communication Congress (38C3), December 27-30, presumably in Hamburg, Germany again. Ticket supply permitting I’d love to go there and hope we’ll have a KDE presence there again.

What’s (still) missing

Some events aren’t on the list (yet), for various reasons:

  • Linux App Summit (LAS) 2024: No official date and location yet. Obviously something I’d attend again as well, however there’s indications it might not be in Europe this time, and intercontinental travel for a three day event is hard to justify for me.
  • State of the Map (SotM) 2024: OSM’s global yearly conference I so far have failed to attend in person. Overlaps with Akademy this year though, and also would require intercontinental travel.
  • KDE sprints: None scheduled yet, although there have been first discussion for another KDE PIM sprint at least. Time permitting I’ll of course try to attend anything I’m involved with.

Even with all of that it’s still likely incomplete, typically more things come up during the year or I discover new events to attend along the way. I also haven’t seen anything scheduled from Wikidata or the Open Transport community yet.

Events in KDE Itinerary

For planning travel to events it's helpful to have the event time and location data in Itinerary, without having to enter all that manually. Besides supporting some commonly encountered registration/ticketing systems (e.g. Indico, Pretix), it has also been possible to import events by opening, pasting or dropping URLs to a corresponding Mobilizon or other ActivityPub-compatible source, as well as the URL of the event website directly.

For the latter to work, the website needs schema.org semantic annotations describing the event, as e.g. found on the Akademy website. Since a week ago the FOSS event aggregation site // foss.events also has such annotations, so you can easily import any event from there now as well.

FOSDEM 2024 event data imported from foss.events in Itinerary.

Looking forward to an exciting year and meeting many of you at events :)

Categories: FLOSS Project Planets

TechBeamers Python: Install Python packages with pip and requirements.txt

Planet Python - Sat, 2024-01-20 02:26

Check this tutorial if you want to learn how to use pip to install Python packages using a requirements.txt file. It is a simple configuration file where we can list the exact packages with their versions. It will ensure that our Python project uses the correct packages. No matter whether you build it on a new […]

The post Install Python packages with pip and requirements.txt appeared first on TechBeamers.

Categories: FLOSS Project Planets

This week in KDE: auto-save in Dolphin and better fractional scaling

Planet KDE - Sat, 2024-01-20 00:42

We’re in the home stretch now!

Plasma and Gear apps have branched, which means anything committed to master and not backported is going into the next release after the mega-release next month. For Plasma, the next one is 6.1, and for Gear apps, it’s 24.05. Quite a few new features and UI improvements are starting to accumulate there! Here are a few:

Dolphin now periodically auto-saves its open windows and tabs, so you don’t lose state if the app crashes or the system is restarted unexpectedly (Amol Godbole, Dolphin 24.05. Link)

In Dolphin, you can now configure whether backup and trash files are shown when hidden files are made visible (Méven Car, Dolphin 24.05. Link)

In Dolphin, you can now pop out a split view pane into its own new window (Loren Burkholder, Dolphin 24.05. Link)

Fixed an issue in Dolphin that could cause it to freeze when you use it to duplicate the same file multiple times (Eugene Popov, Dolphin 24.05. Link)

Okular now supports displaying popup menus in certain kinds of PDF documents that include them (Alexis Murzeau, Okular 24.05. Link)

Spectacle now lets you use more placeholders for screenshot and screen recording filenames (Noah Davis, Spectacle 24.05. Link 1 and link 2)

The Networks system tray popup can now tell you a network’s channel in addition to its frequency (Kai Uwe Broulik, link)

KDE 6 Mega-Release

(Includes all software to be released on the February 28th mega-release: Plasma 6, Frameworks 6, and apps from Gear 24.02)

General info

Open issues: 237

UI improvements

Plasma’s global Edit Mode toolbar now has an “Add Panel” button that lets you add panels. With this located there, the desktop context menu has now lost its “Add Widgets” and “Add Panels” menu items since the functionality is fully available in the global Edit Mode. This makes the menu smaller and less overwhelming by default. Of course, if you want those menu items back, you can just re-add them. (Akseli Lahtinen and me: Nate Graham, link 1, link 2, and link 3):

In the portal-based “Choose a screen/window to record” dialog, items are now chosen with a single-click, unless the dialog is in multi-select mode, in which case a double-click will choose one (because a single-click only selects it). Also, in multi-select mode, the items have little checkboxes in the corner so you know that you can select more than one (Yifan Zhu and me: Nate Graham, link 1, and link 2):

This dialog could still use more UI polish, which is being scoped out.

Bug fixes

Important note: I don’t mention fixes for bugs that were never released to users; it’s just too much for me (it would probably be too much for you to read as well), and most people never encountered them in the first place. Because we’re in the middle of a big Plasma dev cycle, there are a lot of these bugs! So big thanks to everyone who’s made it a priority to fix them!

Powerdevil no longer fails to start at login when using the ddcutil-2.0.0 library and certain DDC-compatible monitors (David Edmundson, link). Note that we also have reports of new issues for people using ddcutil-2.1.0, but those are different and need separate investigation, which is ongoing.

Did some more work to ensure that visual glitches in QtQuick apps are minimized when using a fractional scale factor. There’s still more work to do for text and window outlines/shadows, but you should no longer see weird tearing-related glitches in buttons and icons (Arjen Hiemstra and Marco Martin, link)

Made KWin more robust when restoring settings for multi-screen arrangements when any of the screens are missing their EDIDs (Stefan Hoffmeister, link)

When using a weather provider that gives forecasts longer than 7 days (like EnvCan), the right edge of the Weather widget’s forecast never gets cut off when viewed in the System Tray (Ismael Asensio, link)

Your Plasma panels will no longer flicker oddly when certain full-screen games do something rather odd by repeatedly switching their windows between full-screen and maximized states (Xaver Hugl, link)

The “Window Type” window rule–which did not work on Wayland–has been replaced with a new “Window Layer” rule which works better for the purposes people typically use it for (Vlad Zahorodnii, link)

Other bug information of note:

Performance & Technical

Improved the speed of various config file lookups used very commonly throughout KDE software by 13-16% (Friedrich Kossebau, link)

Fixes for KF5

We’ve also got a few nice fixes for KF5 software this week!

Fixed an issue when moving or copying a large number of files that could cause some of them to get skipped (and potentially lost) after skipping duplicated folders (Eugene Popov, Frameworks 5.115. Link)

Fixed an issue that caused folders inside network shares/mounts to be non-expandable in Details view (Alessandro Astone, Frameworks 5.115. Link)

…And Everything Else

This blog only covers the tip of the iceberg! If you’re hungry for more, check out https://planet.kde.org, where you can find more news from other KDE contributors.

How You Can Help

Thanks to you, our Plasma 6 fundraiser has been a crazy success! I originally thought the goal of 500 new KDE e.V. supporting members was over-optimistic, but you’ve all proven me happily wrong. We’re now up to an incredible 665 members and unlocked both stretch goals! It’s pretty humbling. Thank you everyone for the confidence you’ve shown in us; we’ll try not to screw it up! For those who haven’t donated or become members yet, spreading the wealth via this fundraiser is a great way to share the love.

If you’re a developer, work on Qt6/KF6/Plasma 6 issues! Which issues? These issues. Plasma 6 is very usable for daily driving now, but still in need of some final bug-fixing and polishing to get it into a solid state by February.

Otherwise, visit https://community.kde.org/Get_Involved to discover other ways to be part of a project that really matters. Each contributor makes a huge difference in KDE; you are not a number or a cog in a machine! You don’t have to already be a programmer, either. I wasn’t when I got started. Try it, you’ll like it! We don’t bite!

Categories: FLOSS Project Planets
