Acquia Developer Portal Blog: DrupalCon Portland Day 1 Recap

Planet Drupal - Tue, 2024-05-07 17:09

DrupalCon Portland (2024 edition) kicked off with a bang at the Oregon Convention Center yesterday. This is the third time the conference has been at this venue, and I’ve been fortunate enough to attend all three. This year’s iteration is shaping up to be a really significant entry in the DrupalCon codex.

Categories: FLOSS Project Planets

Python Engineering at Microsoft: Announcing Data Wrangler: Code-centric viewing and cleaning of tabular data in Visual Studio Code

Planet Python - Tue, 2024-05-07 16:20

Today, we are excited to announce the general availability of the Data Wrangler extension for Visual Studio Code! Data Wrangler is a free extension that offers data viewing and cleaning that is directly integrated into VS Code and the Jupyter extension. It provides a rich user interface to view and analyze your data, show insightful column statistics and visualizations, and automatically generate Pandas code as you clean and transform the data. We want to thank all the early adopters who tried out the extension preview over the past year, as your valuable feedback has been crucial to this release.

With this general availability, we are also announcing that the data viewer feature in the Jupyter extension will be going away. In its place, you will be able to use the new and improved data viewing experience offered by Data Wrangler, which is also built by Microsoft. We understand that the data viewer was a feature beloved by our customers. We see Data Wrangler as the next evolution in working with data in VS Code in an extensible manner, and we hope that you will love it even more than the data viewer. Several of the improvements and features of Data Wrangler are highlighted below.


Previewing data

Once the Data Wrangler extension is installed, you can get to Data Wrangler in one of three ways from the Jupyter Notebook.

  1. In the Jupyter > Variables panel, beside any supported data object, you can see a button to open it in Data Wrangler.
  2. If you have a supported data object in your notebook (such as a Pandas DataFrame), you can now see an Open ‘df’ in Data Wrangler button (where ‘df’ is the variable name of your data frame) appear at the bottom of the cell after running code that outputs the data frame. This includes df.head(), df.tail(), display(df), print(df), and df.
  3. In the notebook toolbar, selecting View data brings up a list of every supported data object in your notebook. You can then choose which variable in that list you want to open in Data Wrangler.

Alternatively, Data Wrangler can also be opened directly from a local file (such as a CSV, Excel, or Parquet file) by right-clicking the file and selecting “Open in Data Wrangler”.


Filtering and sorting

Data Wrangler can be used to quickly filter and sort through your rows of data.

Transforming data

Switch from Viewing to Editing mode to unlock additional functionality and built-in data cleaning operations in Data Wrangler. For a full list of supported operations, see the documentation here.


Code generation

As you make changes to the data using the built-in operations, Data Wrangler automatically generates code using open-source Python libraries for the data transformation operations you perform.

When you are done wrangling your data, all the automatically generated code from your data cleaning session can then be exported either back into your Notebook, or into a new Python file.


Trying Data Wrangler today

To start using Data Wrangler today in Visual Studio Code, just download the Data Wrangler extension from the VS Code marketplace to try it out! You can then launch Data Wrangler from any supported data object in a Jupyter Notebook or directly from a data file.

This article only covered some of the high-level features of what Data Wrangler can do. To learn more about Data Wrangler in detail, please check out the Data Wrangler documentation.

The post Announcing Data Wrangler: Code-centric viewing and cleaning of tabular data in Visual Studio Code appeared first on Python.

Categories: FLOSS Project Planets

PyCoder’s Weekly: Issue #628 (May 7, 2024)

Planet Python - Tue, 2024-05-07 15:30

#628 – MAY 7, 2024

TypeIs Does What I Thought TypeGuard Would Do in Python

In this post, Redowan discusses the fact that TypeGuard has always confused him, and that the newer TypeIs feature does what he thought TypeGuard should do. Read on to learn about them both.

Python’s unittest: Writing Unit Tests for Your Code

In this tutorial, you’ll learn how to use the unittest framework to create unit tests for your Python code. Along the way, you’ll also learn how to create test cases, fixtures, test suites, and more.
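
As a rough illustration of the pieces named above (a test case, a fixture, and a test suite), here is a minimal sketch using only the standard library. The function is_even and the helper build_suite are hypothetical names made up for this example; they are not taken from the tutorial.

```python
import unittest


def is_even(number: int) -> bool:
    """Hypothetical function under test."""
    return number % 2 == 0


class IsEvenTests(unittest.TestCase):
    """A test case groups related test methods."""

    def setUp(self) -> None:
        # Fixture: unittest calls setUp() before every test method.
        self.evens = [0, 2, 10]
        self.odds = [1, 3, 11]

    def test_even_numbers(self) -> None:
        for number in self.evens:
            with self.subTest(number=number):
                self.assertTrue(is_even(number))

    def test_odd_numbers(self) -> None:
        for number in self.odds:
            with self.subTest(number=number):
                self.assertFalse(is_even(number))


def build_suite() -> unittest.TestSuite:
    """Collect the test methods of IsEvenTests into a suite."""
    return unittest.defaultTestLoader.loadTestsFromTestCase(IsEvenTests)


if __name__ == "__main__":
    unittest.TextTestRunner(verbosity=2).run(build_suite())
```

Running the file directly executes both test methods through the text runner.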

Webinar - Make Open Source Suck Less

Tired of dependency conflicts, corrupted environments and “works on my machine” issues? Learn the shortfalls of standard package and environment tools (i.e. pip and venv), and how you can achieve reproducibility, dependency management and security at scale - Watch the Webinar On-Demand →

Avoid Conflicts and Let Your OS Select a Python Web App Port

Hard-coded port numbers can be problematic during development because they prevent you from running multiple instances of the same server process in parallel. This article explains how to work around this issue by letting your operating system automatically select a random port number.
CHRISTOPH SCHIESSL • Shared by Christoph Schiessl
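
The underlying mechanism can be sketched with plain standard-library sockets, which is an assumption on my part since the article may use a web framework instead. Binding to port 0 asks the operating system to pick a free port, and getsockname() reports which one it chose. The helper name bind_to_free_port is hypothetical.

```python
import socket


def bind_to_free_port(host: str = "127.0.0.1") -> socket.socket:
    """Return a listening TCP socket bound to a port chosen by the OS."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 tells the operating system to pick any currently free port.
    sock.bind((host, 0))
    sock.listen()
    return sock


if __name__ == "__main__":
    first = bind_to_free_port()
    second = bind_to_free_port()
    # Each socket reports the port it was actually given.
    print("first instance listening on port", first.getsockname()[1])
    print("second instance listening on port", second.getsockname()[1])
    first.close()
    second.close()
```

Because both sockets stay open at the same time, the two instances end up on different ports without either number being hard-coded.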

Quiz: The Python Calendar Module

In this quiz, you’ll test your understanding of the calendar module in Python. It’ll evaluate your proficiency in manipulating, customizing, and displaying calendars directly within your terminal. By working through this quiz, you’ll revisit the fundamental functions and methods provided by the calendar module.

Quiz: What Is the __pycache__ Folder in Python?

In this quiz, you’ll have the opportunity to test your knowledge of the __pycache__ folder, including when, where, and why Python creates these folders.

Pydantic v2.7.0 Released


Discussions

Everything Google’s Python Team Were Responsible For

Google recently laid off the majority of their internal Python team. This post to HN, from one of the former team members, covers just what that team was responsible for. The ensuing discussion also includes comments from others on that team as well.

Python Jobs

Senior Python Engineer: Generative AI, Social Media x Web3 (Anywhere)

Articles & Tutorials

Working With Global Variables in Python Functions

In this video course, you’ll learn how to use global variables in Python functions using the global keyword or the built-in globals() function. You’ll also learn a few strategies to avoid relying on global variables because they can lead to code that’s difficult to understand, debug, and maintain.
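
A minimal sketch of the two mechanisms mentioned above, plus one way to avoid global state altogether. The names counter, increment_with_global, increment_with_globals, and incremented are invented for this example and do not come from the course.

```python
counter = 0  # module-level (global) variable


def increment_with_global() -> None:
    """Rebind the module-level name using the global keyword."""
    global counter
    counter += 1


def increment_with_globals() -> None:
    """Do the same through the dictionary returned by globals()."""
    globals()["counter"] += 1


def incremented(value: int) -> int:
    """Alternative without global state: take a value, return a new one."""
    return value + 1


if __name__ == "__main__":
    increment_with_global()
    increment_with_globals()
    print(counter)               # 2
    print(incremented(counter))  # 3
```

The last function is usually easier to test and reason about, because its result depends only on its argument.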

Embarking on a Relaxed and Friendly Python Coding Journey

Do you get stressed while trying to learn Python? Do you prefer to build small programs or projects as you continue your coding journey? This week on the show, Real Python author Stephen Gruppetta is here to talk about his new book, “The Python Coding Book.”

How to Watermark a Graph With Matplotlib

“Matplotlib is one of the most popular data visualization packages for the Python programming language. It allows you to create many different charts and graphs.” With it you can even put a watermark on your charts, and this tutorial shows you how.

Software Friction

Friction is everywhere in software development. Two setbacks are more than twice as bad as one setback. This article discusses the sources of software friction and what you can do about it.

The Magician’s Sleight of Hand

Even functions in Python are objects and can be re-assigned and manipulated. This article shows a problem that at first looks impossible, but can be handled with a few key re-assignments.
STEPHEN GRUPPETTA • Shared by Stephen Gruppetta
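
The article's puzzle is not reproduced here, but the general idea it relies on can be sketched as follows. The functions greet and shout are hypothetical examples, not the ones used in the article.

```python
def greet(name: str) -> str:
    return f"Hello, {name}!"


def shout(name: str) -> str:
    return f"HELLO, {name.upper()}!"


# A function is an ordinary object: it can be bound to another name...
original_greet = greet

# ...and the name "greet" can be re-assigned to a different function object.
greet = shout

if __name__ == "__main__":
    print(greet("Ada"))             # HELLO, ADA!
    print(original_greet("Ada"))    # Hello, Ada!
    print(original_greet.__name__)  # greet
```

Note that the function object keeps its original __name__ even after the name greet points elsewhere.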

4 Software Design Principles I Learned the Hard Way

Leonardo talks about four principles of software engineering he’s learned through his career. Some go against common practice: DRY may not be your friend.

Isolating Risk in the CPython Release Process

This is a quick summary of the changes to the CPython build process to help reduce the risks caused by extra dependencies.

Building Reusable Components in Django

This tutorial looks at how to build server-side, reusable UI components in Django using the django-viewcomponent library.
MICHAEL YIN • Shared by Michael Herman

Projects & Code

vet: A Poetry Plugin for Establishing Chain of Trust


octarine: WGPU-based 3D Viewer


Python Turtle Bingo


pacemaker: For Controlling Time Per Iteration Loop in Python


vulture: Find Dead Python Code


Events

Weekly Real Python Office Hours Q&A (Virtual)

May 8, 2024

Python Atlanta

May 9 to May 10, 2024

DFW Pythoneers 2nd Saturday Teaching Meeting

May 11, 2024

PiterPy Meetup

May 14, 2024

Leipzig Python User Group Meeting

May 14, 2024

IndyPy Monthly Meetup

May 14 to May 15, 2024

PyCon US 2024

May 15 to May 24, 2024

Flask Con 2024

May 17 to May 18, 2024

Happy Pythoning!
This was PyCoder’s Weekly Issue #628.

Categories: FLOSS Project Planets

Python Software Foundation: PSF Board Election Dates for 2024

Planet Python - Tue, 2024-05-07 13:28

PSF Board elections are a chance for the community to choose representatives to help the PSF create a vision for and build the future of the Python community. This year there are 3 seats open on the PSF board. Check out who is currently on the PSF Board. (Débora Azevedo, Kwon-Han Bae, and Tania Allard are at the end of their current terms.)

Board Election Timeline
  • Nominations open: Tuesday, June 11th, 2:00 pm UTC
  • Nomination cut-off: Tuesday, June 25th, 2:00 pm UTC
  • Voter application/affirmation cut-off date: Tuesday, June 25th, 2:00 pm UTC
  • Announce candidates: Thursday, June 27th
  • Voting start date: Tuesday, July 2nd, 2:00 pm UTC
  • Voting end date: Tuesday, July 16th, 2:00 pm UTC

You must be a contributing, managing, supporting, or fellow member by June 25th to vote in this election. Check out the PSF membership page to learn more about membership classes and benefits. If you have questions about membership or nominations, please email psf-elections@pyfound.org.

Run for the Board

Who runs for the board? People who care about the Python community, who want to see it flourish and grow, and also have a few hours a month to attend regular meetings, serve on committees, participate in conversations, and promote the Python community. Check out our Life as Python Software Foundation Director video to learn more about what being a part of the PSF Board entails. We also invite you to review our Annual Impact Report for 2023 to learn more about the PSF mission and what we do.

You can nominate yourself or someone else. We would encourage you to reach out to folks before you nominate them to make sure they are enthusiastic about the potential of joining the Board. Nominations open on Tuesday, June 11th, 2:00 pm UTC, so you have a few weeks to research the role and craft a nomination statement. 

Learn more and join the discussion

You are welcome to join the discussion about the PSF Board election on our forum. This year we’ll also be running Office Hours on the PSF Discord to answer questions about running for the board and serving on the board. Details for the Office Hours will be announced soon! Subscribe to the PSF blog or join psf-member-announce to receive updates leading up to the election.

Categories: FLOSS Project Planets

Python Anywhere: CPU resetting issues report: 3 - 5 May 2024

Planet Python - Tue, 2024-05-07 11:00

We have a number of background processes that execute periodically on our systems; one of these is the one that resets the amount of CPU used, so that you get a fresh allowance every day. Early in the morning of 2024-05-03, on our US-hosted system, that service failed silently.

Unfortunately, we only realized it was not working on the morning of 2024-05-04. Putting a fix in place required another day.

At the same time, our load balancing system was experiencing a DDoS attack by malicious bots, which led to an overall decline of performance.

For some of our users who noticed the CPU issue, these two separate events appeared to be correlated, which led to confusion.

These issues appeared only on our US-based system – users on our EU system were not affected.

Categories: FLOSS Project Planets

March and April in KDE PIM

Planet KDE - Tue, 2024-05-07 11:00

Here's our bi-monthly update from KDE's personal information management applications team. This report covers progress made in the months of March and April 2024.

Since the last report, 36 people contributed more than 1300 code changes. Most of the changes will be available in the coming KDE Gear 24.05 release.


When Akonadi stores the timestamp of when a database entry has been last modified, the conversion from user's local time zone to UTC and back now works correctly regardless of the database engine used (BKO#483060).


KOrganizer has received a number of bug fixes:

  • Fixed parsing of events with all-day recurrence rules (BKO#483707)
  • KOrganizer now correctly tracks active (selected) tasks in the ToDo View (BKO#485185)
  • Creating a new event from the date navigator in top left corner uses the correct date now (BKO#483823)
  • Custom filters for event views in KOrganizer work again (BKO#484040)
  • Improved handling of calendar colors
  • The iCal Resource now correctly handles iCal calendars generated by Google Calendar, which previously caused an endless loop and high CPU usage by the Akonadi iCal Resource (BKO#384309)
  • The ToDo view in the KOrganizer side-bar now works even when the Todo View isn't open
  • The "Custom Pages" settings page, which hadn't worked for years, has been removed
  • Fixed a crash on exit after the Settings dialog was opened (BKO#483336)

g10 Code has kindly sponsored Dan's work on those bug fixes and improvements.

  • Fixed the name of a UI element being too long (Hide/Show Sidebar) (BKO#484599)
  • Fixed a regression in the message composer that caused attachments to not get automatically encrypted when encrypting a message (T7059)
  • Fixed an untranslated shortcut (BKO#484281)
  • Fixed monochromatic icons not always being used in the system tray (BKO#484420)
  • Fixed some unextracted i18n strings (BKO#484186)
  • Allow changing the print layout when exporting as PDF (BKO#480733)
  • Fixed KMail unexpectedly trying to connect to safebrowsing.googleapis.com (BKO#483283)
  • Fixed KMail's config dialog taking a long time to show up (BKO#484328)

Identity Management

A new feature will arrive in 24.08: Plasma-Activities support (Linux only). These classes were adapted to support it. A check was added in KMail, Akregator, KNotes, and KAddressBook; all of this work is still in progress.


The certificate details (user IDs, subkeys, certifications, etc.) are now shown in a single window. Additionally, information about the smart cards a certificate is stored on is now shown.

Further improvements are:

  • The creation of OpenPGP certificates was simplified by replacing the complicated advanced settings with a simple selection of the algorithm and the validity period.
  • If the search for certificates on a server takes longer, a progress dialog shows that the search is still ongoing. If no certificates are found, a corresponding message is shown instead of just showing an empty list of results. (T6493)
  • Certificates stored on TCOS smart cards (e.g. the German Signature Card V2.0) are now imported automatically. Previously, the import had to be triggered manually. (T6846)

KNotes Akregator

Martín González Gómez implemented a new article theme for Akregator which is not only more readable for long-form content but also adapts correctly to dark color themes.

Akregator has received a number of bug fixes:


Our travel assistance app Itinerary now shows more details about vehicle and train coach amenities, informs about daylight saving time changes at travel destinations, and has received many more fixes and improvements for extracting travel documents. See Itinerary's own bi-monthly status update for more details.


Merkuro now makes use of the new Date and Time picker from Kirigami Addons instead of bringing its own. The date picker instance is also now shared in multiple places to reduce memory and CPU usage and speed up opening the event editor.

Various dialogs were also modernized.

Get Involved

Join us in the #kontact:kde.org Matrix channel or the kde-pim mailing list!

Categories: FLOSS Project Planets

Django Weblog: Django bugfix releases issued: 5.0.6 and 4.2.13

Planet Python - Tue, 2024-05-07 10:55

Today we've issued 5.0.6 and 4.2.13 as reissues of the 5.0.5 and 4.2.12 bugfix releases.

The release package and checksums are available from our downloads page, as well as from the Python Package Index. The PGP key ID used for this release is Natalia Bidart: 2EE82A8D9470983E.

Categories: FLOSS Project Planets

Melissa Wen: Get Ready to 2024 Linux Display Next Hackfest in A Coruña!

Planet Debian - Tue, 2024-05-07 10:33

We’re excited to announce the details of our upcoming 2024 Linux Display Next Hackfest in the beautiful city of A Coruña, Spain!

This year’s hackfest will be hosted by Igalia and will take place from May 14th to 16th. It will be a gathering of minds from a diverse range of companies and open source projects, all coming together to share, learn, and collaborate outside the traditional conference format.

Who’s Joining the Fun?

We’re excited to welcome participants from various backgrounds, including:

  • GPU hardware vendors;
  • Linux distributions;
  • Linux desktop environments and compositors;
  • Color experts, researchers and enthusiasts;

This diverse mix of backgrounds is represented by developers from several companies working on the Linux display stack: AMD, Arm, BlueSystems, Bootlin, Collabora, Google, GravityXR, Igalia, Intel, LittleCMS, Qualcomm, Raspberry Pi, Red Hat, SUSE, and System76. It’ll ensure a dynamic exchange of perspectives and foster collaboration across the Linux Display community.

Please take a look at the list of participants for more info.

What’s on the Agenda?

The beauty of the hackfest is that the agenda is driven by participants! As this is a hybrid event, we decided to improve the experience for remote participants by creating a dedicated space for them to propose topics and some introductory talks in advance. From those inputs, we defined a schedule that reflects the collective interests of the group, but is still open for amendments and new proposals. Find the schedule details on the official event webpage.

Expect discussions on:

  • Proposal with new DRM object type:
    • Brief presentation of GPU-vendor features;
    • Status update of plane color management pipeline per vendor on Linux;
  • HDR/Color Use-cases:
    • HDR gainmap images and how should we think about HDR;
    • Google/ChromeOS GFX view about HDR/per-plane color management, VKMS and lessons learned;
  • Post-blending Color Pipeline.
  • Color/HDR testing/CI
    • VKMS status update;
    • Chamelium boards, video capture.
  • Wayland protocols
    • color-management protocol status update;
    • color-representation and video playback.

Display control

  • HDR signalling status update;
  • backlight status update;
  • EDID and DDC/CI.

Strategy for video and gaming use-cases

  • Multi-plane support in compositors
    • Underlay, overlay, or mixed strategy for video and gaming use-cases;
    • KMS Plane UAPI to simplify the plane arrangement problem;
    • Shared plane arrangement algorithm desired.
  • HDR video and hardware overlay

Frame timing and VRR

  • Frame timing:
    • Limitations of uAPI;
    • Current user space solutions;
    • Brainstorm better uAPI;
  • Cursor/overlay plane updates with VRR;
  • KMS commit and buffer-readiness deadlines;

Power Saving vs Color/Latency

  • ABM (adaptive backlight management);
  • PSR1 latencies;
  • Power optimization vs color accuracy/latency requirements.

Content-Adaptive Scaling & Sharpening

  • Content-Adaptive Scalers on display hardware;
  • New drm_colorop for content adaptive scaling;
  • Proprietary algorithms.

Display Mux

  • Laptop muxes for switching of the embedded panel between the integrated GPU and the discrete GPU;
  • Seamless/atomic hand-off between drivers on Linux desktops.

Real time scheduling & async KMS API

  • Potential benefits: lower latency input feedback, better VRR handling, buffer synchronization, etc.
  • Issues around “async” uAPI usage and async-call handling.

In-person, but also geographically-distributed event

This year’s Linux Display Next hackfest is a hybrid event, hosted onsite at the Igalia offices and available for remote attendance. In-person participants will find an environment for networking and brainstorming in our inspiring and collaborative office space. Additionally, A Coruña itself is a gem waiting to be explored, with stunning beaches, good food, and historical sites.

Semi-structured structure: how the 2024 Linux Display Next Hackfest will work

  • Agenda: Participants proposed the topics and talks to discuss in sessions.
  • Interactive Sessions: Discussions, workshops, introductory talks and brainstorming sessions lasting around 1h30. There is always a starting point for discussions and new ideas will emerge in real time.
  • Immersive experience: We will have coffee-breaks between sessions and lunch time at the office for all in-person participants. Lunches and coffee-breaks are sponsored by Igalia. This will keep us sharing knowledge and in continuous interaction.
  • Spaces for all group sizes: In-person participants will find different room sizes that match various group sizes at Igalia HQ. Besides that, there will be some devices for showcasing and real-time demonstrations.

Social Activities: building connections beyond the sessions

To make the most of your time in A Coruña, we’ll be organizing some social activities:

  • First-day Dinner: In-person participants will enjoy a Galician dinner on Tuesday, after a first day of intensive discussions in the hackfest.
  • Getting to know a little of A Coruña: Finding out a little about A Coruña and current local habits.

  • On Thursday afternoon, we will close the 2024 Linux Display Next hackfest with a guided tour of the Museum of Galicia’s favorite beer brand, Estrella Galicia. The guided tour covers the eight sectors of the museum and ends with beer pouring and tasting. After this experience, a transfer bus will take us to the Maria Pita square.
  • At Maria Pita square we will see the charm of some historical landmarks of A Coruña, explore the casual and vibrant style of the city center and taste local foods while chatting with friends.

Igalia sponsors lunches and coffee-breaks on hackfest days, Tuesday’s dinner, and the social event on Thursday afternoon for in-person participants.

We can’t wait to welcome hackfest attendees to A Coruña! Stay tuned for further details and outcomes of this unconventional and unique experience.

Categories: FLOSS Project Planets

Real Python: Flattening a List of Lists in Python

Planet Python - Tue, 2024-05-07 10:00

Sometimes, when you’re working with data, you may have the data as a list of nested lists. A common operation is to flatten this data into a one-dimensional list in Python. Flattening a list involves converting a multidimensional list, such as a matrix, into a one-dimensional list.

In this video course, you’ll learn how to do that in Python.
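
As a quick sketch of what flattening looks like in practice, here are three common approaches: an explicit loop, a list comprehension, and itertools.chain.from_iterable(). The function names and the sample matrix are made up for this example, and the course may present other techniques. All three flatten exactly one level of nesting.

```python
from itertools import chain

matrix = [[9, 3, 8, 3], [4, 5, 2, 8], [6, 4, 3, 1]]


def flatten_with_loop(nested):
    """Append every item of every row to a new list."""
    flat = []
    for row in nested:
        flat.extend(row)
    return flat


def flatten_with_comprehension(nested):
    """Same result with a nested list comprehension."""
    return [item for row in nested for item in row]


def flatten_with_chain(nested):
    """Same result by chaining the rows lazily and collecting them."""
    return list(chain.from_iterable(nested))


if __name__ == "__main__":
    print(flatten_with_loop(matrix))
    print(flatten_with_comprehension(matrix))
    print(flatten_with_chain(matrix))
```

Deeper nesting (lists inside lists inside lists) would need a recursive approach instead.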


Categories: FLOSS Project Planets

The Drop Times: Detailed Overview of the 2024 Drupal Developer Survey Results

Planet Drupal - Tue, 2024-05-07 09:08

The 2024 Drupal Developer Survey, led by Jeff Geerling, Chris Urban, and Michael Richardson, provided a comprehensive overview of the global Drupal community. With 648 developers from 65 countries, including significant contributions from the United States, France, and India, the survey showcased a mature developer base, with 76% aged between 30 and 49. Despite regional variations in sentiment, community engagement remained strong, with 65% participating in Drupal events, and 80% expressing optimism for Drupal's future. Notable trends included the rise of decoupled architectures and the endorsement of DDEV as the preferred choice for integrated development environments (IDEs) and local environment managers. The survey also highlighted opportunities for the Drupal Association to enhance its visibility and communication efforts. These insights will inform strategies for fostering growth and innovation within the ecosystem.

Categories: FLOSS Project Planets

GNU Guix: Authenticate your Git checkouts!

GNU Planet! - Tue, 2024-05-07 08:08

You clone a Git repository, then pull from it. How can you tell its contents are “authentic”—i.e., coming from the “genuine” project you think you’re pulling from, written by the fine human beings you’ve been working with? With commit signatures and “verified” badges ✅ flourishing, you’d think this has long been solved—but nope!

Four years after Guix deployed its own tool to allow users to authenticate updates fetched with guix pull (which uses Git under the hood), the situation hasn’t changed all that much: the vast majority of developers using Git simply do not authenticate the code they pull. That’s pretty bad. It’s the modern-day equivalent of sharing unsigned tarballs and packages like we’d blissfully do in the past century.

The authentication mechanism Guix uses for channels is available to any Git user through the guix git authenticate command. This post is a guide for Git users who are not necessarily Guix users but are interested in using this command for their own repositories. Before looking into the command-line interface and how we improved it to make it more convenient, let’s dispel any misunderstandings or misconceptions.

Why you should care

When you run git pull, you’re fetching a bunch of commits from a server. If it’s over HTTPS, you’re authenticating the server itself, which is nice, but that does not tell you who the code actually comes from—the server might be compromised and an attacker pushed code to the repository. Not helpful. At all.

But hey, maybe you think you’re good because everyone on your project is signing commits and tags, and because you’re disciplined, you routinely run git log --show-signature and check those “Good signature” GPG messages. Maybe you even have those fancy “✅ verified” badges as found on GitLab and on GitHub.

Signing commits is part of the solution, but it’s not enough to authenticate a set of commits that you pull; all it shows is that, well, those commits are signed. Badges aren’t much better: the presence of a “verified” badge only shows that the commit is signed by the OpenPGP key currently registered for the corresponding GitLab/GitHub account. It’s another source of lock-in and makes the hosting platform a trusted third-party. Worse, there’s no notion of authorization (which keys are authorized), let alone tracking of the history of authorization changes (which keys were authorized at the time a given commit was made). Not helpful either.

Being able to ensure that when you run git pull, you’re getting code that genuinely comes from authorized developers of the project is basic security hygiene. Obviously it cannot protect against efforts to infiltrate a project to eventually get commit access and insert malicious code—the kind of multi-year plot that led to the xz backdoor—but if you don’t even protect against unauthorized commits, then all bets are off.

Authentication is something we naturally expect from apt update, pip, guix pull, and similar tools; why not treat git pull to the same standard?

Initial setup

The guix git authenticate command authenticates Git checkouts, unsurprisingly. It’s currently part of Guix because that’s where it was brought to life, but it can be used on any Git repository. This section focuses on how to use it; you can learn about the motivation, its design, and its implementation in the 2020 blog post, in the 2022 peer-reviewed academic paper entitled Building a Secure Software Supply Chain with GNU Guix, or in this 20-minute presentation.

To support authentication of your repository with guix git authenticate, you need to follow these steps:

  1. Enable commit signing on your repo: git config commit.gpgSign true. (Git now supports other signing methods but here we need OpenPGP signatures.)

  2. Create a keyring branch containing all the OpenPGP keys of all the committers, along these lines:

    git checkout --orphan keyring
    git reset --hard
    gpg --export alice@example.org > alice.key
    gpg --export bob@example.org > bob.key
    …
    git add *.key
    git commit -m "Add committer keys."

    All the files must end in .key. You must never remove keys from that branch: keys of users who left the project are necessary to authenticate past commits.

  3. Back to the main branch, add a .guix-authorizations file, listing the OpenPGP keys of authorized committers—we’ll get back to its format below.

  4. Commit! This becomes the introductory commit from which authentication can proceed. The introduction of your repository is the ID of this commit and the OpenPGP fingerprint of the key used to sign it.

That’s it. From now on, anyone who clones the repository can authenticate it. The first time, run:

guix git authenticate COMMIT SIGNER

… where COMMIT is the commit ID of the introductory commit, and SIGNER is the OpenPGP fingerprint of the key used to sign that commit (make sure to enclose it in double quotes if there are spaces!). As a repo maintainer, you must advertise this introductory commit ID and fingerprint on a web page or in a README file so others know what to pass to guix git authenticate.

The commit and signer are now recorded on the first run in .git/config; next time, you can run it without any arguments:

guix git authenticate

The other new feature is that the first time you run it, the command installs pre-push and post-merge hooks (unless preexisting hooks are found) such that your repository is automatically authenticated from there on every time you run git pull or git push.

guix git authenticate exits with a non-zero code and an error message when it stumbles upon a commit that lacks a signature, that is signed by a key not in the keyring branch, or that is signed by a key not listed in .guix-authorizations.

Maintaining the list of authorized committers

The .guix-authorizations file in the repository is central: it lists the OpenPGP fingerprints of authorized committers. Any commit that is not signed by a key listed in the .guix-authorizations file of its parent commit(s) is considered inauthentic—and an error is reported. The format of .guix-authorizations is based on S-expressions and looks like this:

;; Example ‘.guix-authorizations’ file.

(authorizations
 (version 0)               ;current file format version

 (("AD17 A21E F8AE D8F1 CC02 DBD9 F8AE D8F1 765C 61E3"
   (name "alice"))
  ("2A39 3FFF 68F4 EF7A 3D29 12AF 68F4 EF7A 22FB B2D5"
   (name "bob"))
  ("CABB A931 C0FF EEC6 900D 0CFB 090B 1199 3D9A EBB5"
   (name "charlie"))))

The name bits are hints and do not have any effect; what matters is the fingerprints that are listed. You can obtain them with GnuPG by running commands like:

gpg --fingerprint charlie@example.org

At any time you can add or remove keys from .guix-authorizations and commit the changes; those changes take effect for child commits. For example, if we add Billie’s fingerprint to the file in commit A, then Billie becomes an authorized committer in descendants of commit A (we must make sure to add Billie’s key as a file in the keyring branch, too, as we saw above); Billie is still unauthorized in branches that lack A. If we remove Charlie’s key from the file in commit B, then Charlie is no longer an authorized committer, except in branches that start before B. This should feel rather natural.
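
To make the rule concrete, here is a toy Python model of it. This is not how Guix implements authentication: the Commit class, the is_authorized function, and the fingerprints are all invented for illustration, and the model assumes that “parent commit(s)” means the signer must be listed in every parent. It checks a single commit against its parents and treats the introductory commit as the trusted starting point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Toy stand-in for a Git commit; not Guix's real data structure."""
    commit_id: str
    signer: str            # fingerprint of the key that signed the commit
    authorized: frozenset  # fingerprints listed in .guix-authorizations
    parents: tuple = ()    # parent Commit objects


def is_authorized(commit: Commit, introduction: Commit) -> bool:
    """Apply the rule above to one commit (ancestors are not re-checked)."""
    if commit.commit_id == introduction.commit_id:
        return True   # the advertised starting point is trusted as-is
    if not commit.parents:
        return False  # an unrelated root commit has nothing to vouch for it
    # Assumption: the signer must be listed in every parent's file.
    return all(commit.signer in parent.authorized for parent in commit.parents)


# Hypothetical fingerprints, shortened to single words for readability.
ALICE, BILLIE, CHARLIE = "alice-key", "billie-key", "charlie-key"

intro = Commit("intro", ALICE, frozenset({ALICE, CHARLIE}))
# Commit A, signed by Alice, adds Billie to .guix-authorizations.
commit_a = Commit("A", ALICE, frozenset({ALICE, BILLIE, CHARLIE}), (intro,))
# Billie may sign descendants of A...
by_billie = Commit("child-of-A", BILLIE, commit_a.authorized, (commit_a,))
# ...but not commits on a branch that lacks A.
billie_side = Commit("side", BILLIE, intro.authorized, (intro,))
# Commit B, signed by Alice, removes Charlie from the file.
commit_b = Commit("B", ALICE, frozenset({ALICE, BILLIE}), (by_billie,))
by_charlie = Commit("child-of-B", CHARLIE, commit_b.authorized, (commit_b,))

if __name__ == "__main__":
    for c in (commit_a, by_billie, billie_side, commit_b, by_charlie):
        print(c.commit_id, is_authorized(c, intro))
```

In this model, Billie's commit is accepted only as a descendant of A, and Charlie's commit is rejected once it descends from B, which mirrors the two scenarios described above.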

That’s pretty much all you need to know to get started! Check the manual for more info.

All the information needed to authenticate the repository is contained in the repository itself—it does not depend on a forge or key server. That’s a good property to allow anyone to authenticate it, to ensure determinism and transparency, and to avoid lock-in.

Interested? You can help!

guix git authenticate is a great tool that you can start using today so you and fellow co-workers can be sure you’re getting the right code! It solves an important problem that, to my knowledge, hasn’t really been addressed by any other tool.

Maybe you’re interested but don’t feel like installing Guix “just” for this tool. Maybe you’re not into Scheme and Lisp and would rather use a tool written in your favorite language. Or maybe you think—and rightfully so—that such a tool ought to be part of Git proper.

That’s OK, we can talk! We’re open to discussing with folks who’d like to come up with alternative implementations—check out the articles mentioned above if you’d like to take that route. And we’re open to contributing to a standardization effort. Let’s get in touch!


Thanks to Florian Pelz and Simon Tournier for their insightful comments on an earlier draft of this post.

Categories: FLOSS Project Planets

Shannon -jj Behrens: Python: My Favorite Python Tricks for LeetCode Questions

Planet Python - Tue, 2024-05-07 07:44

I've been spending a lot of time practicing on LeetCode recently, so I thought I'd share some of my favorite intermediate-level Python tricks. I'll also cover some newer features of Python you may not have started using yet. I'll start with basic tips and then move to more advanced ones.

Get help()

Python's documentation is pretty great, and some of these examples are taken from there.

For instance, if you just google "heapq", you'll see the official docs for heapq, which are often enough.

However, it's also helpful to sometimes just quickly use help() in the shell. Here, I can't remember that push() is actually called append().

    >>> help([])
    >>> dir([])
    >>> help([].append)

enumerate()

If you need to loop over a list, you can use enumerate() to get both the item as well as the index. As a mnemonic, I like to think for (i, x) in enumerate(...):

    for (i, x) in enumerate(some_list):
        ...

items()

Similarly, you can get both the key and the value at the same time when looping over a dict using items():

    for (k, v) in some_dict.items():
        ...

[] vs. get()

Remember, when you use [] with a dict, if the value doesn't exist, you'll get a KeyError. Rather than see if an item is in the dict and then look up its value, you can use get():

    val = some_dict.get(key)  # It defaults to None.
    if val is None:
        ...

Similarly, .setdefault() is sometimes helpful.
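As a small illustration of the difference (the word list is made up for the example), setdefault() inserts a default when the key is missing, whereas get() only returns one:

```python
groups = {}
for word in ["apple", "avocado", "banana"]:
    # setdefault() returns the list already stored under this key, or
    # stores the new empty list and returns it if the key was missing.
    groups.setdefault(word[0], []).append(word)

print(groups)               # {'a': ['apple', 'avocado'], 'b': ['banana']}
print(groups.get("c", []))  # [] -- get() does not insert the key
print("c" in groups)        # False
```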

Some people prefer to just use [] and handle the KeyError since exceptions aren't as expensive in Python as they are in other languages.

range() is smarter than you think

    for item in range(items):
        ...

    for index in range(len(items)):
        ...

    # Count by 2s.
    for i in range(0, 100, 2):
        ...

    # Count backward from 100 to 0 inclusive.
    for i in range(100, -1, -1):
        ...

    # Okay, Mr. Smarty Pants, I'm sure you knew all that, but did you know
    # that you can pass a range object around, and it knows how to reverse
    # itself via slice notation? :-P
    r = range(100)
    r = r[::-1]  # range(99, -1, -1)

print(f'') debugging

Have you switched to Python's new format strings yet? They're more convenient and safer (from injection vulnerabilities) than % and .format(). They even have a syntax for outputting the expression as well as its value:

    # Got 2+2=4
    print(f'Got {2+2=}')

for else

Python has a feature that I haven't seen in other programming languages. Both for and while can be followed by an else clause, which is useful when you're searching for something.

    for item in some_list:
        if is_what_im_looking_for(item):
            print(f"Yay! It's {item}.")
            break
    else:
        print("I couldn't find what I was looking for.")

Use a list as a stack

The cost of using a list as a stack is (amortized) O(1):

    elements = []
    elements.append(element)  # Not push
    element = elements.pop()

Note that inserting something at the beginning of the list or in the middle is more expensive because it has to shift everything to the right--see deque below.

sort() vs. sorted()

    # sort() sorts a list in place.
    my_list.sort()

    # Whereas sorted() returns a sorted *copy* of an iterable:
    my_sorted_list = sorted(some_iterable)

And, both of these can take a key function if you need to sort objects.
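For instance, here is a short sketch with made-up data showing a key function with both sorted() and sort():

```python
words = ["banana", "kiwi", "apple", "fig"]

# Sort by length. Python's sort is stable, so items with equal keys keep
# their original relative order.
by_length = sorted(words, key=len)
print(by_length)  # ['fig', 'kiwi', 'apple', 'banana']

# Sort in place with a tuple key: longest first, then alphabetically.
words.sort(key=lambda w: (-len(w), w))
print(words)  # ['banana', 'apple', 'kiwi', 'fig']
```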

set and frozenset

Sets are so useful for so many problems! Just in case you didn't know some of these tricks:

    # There is now syntax for creating sets.
    s = {'Von'}

    # There are set "comprehensions" which are like list comprehensions, but for sets.
    s2 = {f'{name} the III' for name in s}
    # s2 is now {'Von the III'}

    # If you can't remember how to use union, intersection, difference, etc.
    help(set())

    # If you need an immutable set, for instance, to use as a dict key, use frozenset.
    frozenset((1, 2, 3))

deque

If you find yourself needing a queue or a list that you can push and pop from either side, use a deque:

    >>> from collections import deque
    >>>
    >>> d = deque()
    >>> d.append(3)
    >>> d.append(4)
    >>> d.appendleft(2)
    >>> d.appendleft(1)
    >>> d
    deque([1, 2, 3, 4])
    >>> d.popleft()
    1
    >>> d.pop()
    4

Using a stack instead of recursion

Instead of using recursion (which by default is limited to a depth of about 1000 frames), you can use a while loop and manually manage a stack yourself. Here's a slightly contrived example:

    work = [create_initial_work()]
    while work:
        work_item = work.pop()
        result = process(work_item)
        if is_done(result):
            return result
        work.append(result.pieces[0])
        work.append(result.pieces[1])

Using yield from

If you don't know about yield, you can go spend some time learning about that. It's awesome.

Sometimes, when you're in one generator, you need to call another generator. Python now has yield from for that:

    def my_generator():
        yield 1
        yield from some_other_generator()
        yield 6

So, here's an example of backtracking:

    class Solution:
        def problem(self, digits: str) -> List[str]:
            def generate_possibilities(work_so_far, remaining_work):
                if not remaining_work:
                    if work_so_far:
                        yield work_so_far
                    return
                first_part, remaining_part = remaining_work[0], remaining_work[1:]
                for i in things_to_try:
                    yield from generate_possibilities(work_so_far + i, remaining_part)

            output = list(generate_possibilities(no_work_so_far, its_all_remaining_work))
            return output

This is appropriate if you have less than 1000 "levels" but a ton of possibilities for each of those levels. This won't work if you're going to need more than 1000 layers of recursion. In that case, switch to "Using a stack instead of recursion".

Updated: On the other hand, if you can have the recursive function append to some list of answers instead of yielding it all the way back to the caller, that's faster.
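Here is a minimal sketch of that append-to-a-list variant. The problem (letter combinations for phone digits) and the LETTERS mapping are assumptions chosen only to make the example concrete:

```python
from typing import List

# Hypothetical digit-to-letters mapping, trimmed down for illustration.
LETTERS = {"2": "abc", "3": "def"}


def letter_combinations(digits: str) -> List[str]:
    answers: List[str] = []

    def backtrack(work_so_far: str, remaining: str) -> None:
        if not remaining:
            if work_so_far:
                # Record the answer directly instead of yielding it back
                # up through every level of the recursion.
                answers.append(work_so_far)
            return
        for letter in LETTERS[remaining[0]]:
            backtrack(work_so_far + letter, remaining[1:])

    backtrack("", digits)
    return answers


print(letter_combinations("23"))
# ['ad', 'ae', 'af', 'bd', 'be', 'bf', 'cd', 'ce', 'cf']
```

The recursion depth limit still applies here; only the cost of passing each result back through nested generators goes away.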

Pre-initialize your list

If you know how long your list is going to be ahead of time, you can avoid needing to resize it multiple times by just pre-initializing it:

    dp = [None] * len(items)

collections.Counter()

How many times have you used a dict to count up something? It's built-in in Python:

    >>> from collections import Counter
    >>> c = Counter('abcabcabcaaa')
    >>> c
    Counter({'a': 6, 'b': 3, 'c': 3})

defaultdict

Similarly, there's defaultdict:

    >>> from collections import defaultdict
    >>> d = defaultdict(list)
    >>> d['girls'].append('Jocylenn')
    >>> d['boys'].append('Greggory')
    >>> d
    defaultdict(<class 'list'>, {'girls': ['Jocylenn'], 'boys': ['Greggory']})

Notice that I didn't need to set d['girls'] to an empty list before I started appending to it.


I had heard of heaps in school, but I didn't really know what they were. Well, it turns out they're pretty helpful for several of the problems, and Python has a list-based heap implementation built-in.

If you don't know what a heap is, I recommend this video and this video. They'll explain what a heap is and how to implement one using a list.

The heapq module is a built-in module for managing a heap. It builds on top of an existing list:

    import heapq

    some_list = ...
    heapq.heapify(some_list)

    # The head of the heap is some_list[0].
    # The len of the heap is still len(some_list).

    heapq.heappush(some_list, item)
    head_item = heapq.heappop(some_list)

The heapq module also has nlargest and nsmallest built-in so you don't have to implement those things yourself.
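A quick sketch of both helpers, using made-up scores:

```python
import heapq

scores = [42, 7, 19, 88, 3, 61]
print(heapq.nlargest(2, scores))   # [88, 61]
print(heapq.nsmallest(2, scores))  # [3, 7]

# Both accept a key function, which helps when the items are records.
players = [("ann", 42), ("bo", 88), ("cy", 61)]
best = heapq.nlargest(1, players, key=lambda p: p[1])
print(best)  # [('bo', 88)]
```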

Keep in mind that heapq is a minheap. Let's say that what you really want is a maxheap, and you're not working with ints, you're working with objects. Here's how to tweak your data to get it to fit heapq's way of thinking:

    heap = []
    heapq.heappush(heap, (-obj.value, obj))
    (ignored, first_obj) = heapq.heappop(heap)

Here, I'm using - to make it a maxheap. I'm wrapping things in a tuple so that it's sorted by the obj.value, and I'm including the obj as the second value so that I can get it.
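One caveat with the tuple approach: if two entries have the same value, Python falls back to comparing the objects themselves, which raises a TypeError when they don't define an ordering. A common workaround, sketched below with a hypothetical Task class, is to put a unique counter between the value and the object:

```python
import heapq
from itertools import count


class Task:
    """Hypothetical object with a value but no ordering of its own."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


heap = []
tie_breaker = count()  # Yields 0, 1, 2, ... so every entry is unique.

for task in [Task("low", 1), Task("high-a", 9), Task("high-b", 9)]:
    # Negate the value for max-heap behaviour. The counter is compared
    # before the object, so Task never needs to support "<".
    heapq.heappush(heap, (-task.value, next(tie_breaker), task))

order = []
while heap:
    _, _, task = heapq.heappop(heap)
    order.append(task.name)

print(order)  # ['high-a', 'high-b', 'low']
```

Among entries with equal values, this pops them in insertion order.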

Use bisect for binary search

I'm sure you've implemented binary search before. Python has it built-in. It even has keyword arguments that you can use to search in only part of the list:

    import bisect

    insertion_point = bisect.bisect_left(sorted_list, some_item, lo=lo, hi=hi)

Pay attention to the key argument which is sometimes useful, but may take a little work for it to work the way you want.

namedtuple and dataclasses

Tuples are great, but it can be a pain to deal with remembering the order of the elements or unpacking just a single element in the tuple. That's where namedtuple comes in.

    >>> from collections import namedtuple
    >>> Point = namedtuple('Point', ['x', 'y'])
    >>> p = Point(5, 7)
    >>> p
    Point(x=5, y=7)
    >>> p.x
    5
    >>> q = p._replace(x=92)
    >>> p
    Point(x=5, y=7)
    >>> q
    Point(x=92, y=7)

Keep in mind that tuples are immutable. I particularly like using namedtuples for backtracking problems. In that case, the immutability is actually a huge asset. I use a namedtuple to represent the state of the problem at each step. I have this much stuff done, this much stuff left to do, this is where I am, etc. At each step, you take the old namedtuple and create a new one in an immutable way.

Updated: Python 3.7 introduced dataclasses. These have multiple advantages:

  • They can be mutable or immutable (although, there's a small performance penalty).
  • You can use type annotations.
  • You can add methods.

    from dataclasses import dataclass

    @dataclass  # Or: @dataclass(frozen=True)
    class InventoryItem:
        """Class for keeping track of an item in inventory."""
        name: str
        unit_price: float
        quantity_on_hand: int = 0

        def total_cost(self) -> float:
            return self.unit_price * self.quantity_on_hand

    item = InventoryItem(name='Box', unit_price=19, quantity_on_hand=2)

dataclasses are great when you want a little class to hold some data, but you don't want to waste much time writing one from scratch.

Updated: Here's a comparison between namedtuples and dataclasses. It leads me to favor dataclasses since they have faster property access and use 30% less memory :-/ Per the Python docs, using frozen=True is slightly slower than not using it. In my (extremely unscientific) testing, using a normal class with __slots__ is faster and uses less memory than a dataclass.

int, decimal, math.inf, etc.

Thankfully, Python's int type supports arbitrarily large values by default:

    >>> 1 << 128
    340282366920938463463374607431768211456

There's also the decimal module if you need to work with things like money where a float isn't accurate enough or when you need a lot of decimal places of precision.

Sometimes, they'll say the range is -2 ^ 32 to 2 ^ 32 - 1. You can get those values via bitshifting:

    >>> -(2 ** 32) == -(1 << 32)
    True
    >>> (2 ** 32) - 1 == (1 << 32) - 1
    True

Sometimes, it's useful to initialize a variable with math.inf (i.e. infinity) and then try to find new values less than that.

Updated: If you want to save memory by not importing the math module, just use float("inf").


I'm not sure every interviewer is going to like this, but I tend to skip the OOP stuff and use a bunch of local helper functions so that I can access things via closure:

    class Solution():  # This is what LeetCode gave me.
        def solveProblem(self, arg1, arg2):  # Why they used camelCase, I have no idea.

            def helper_function():
                # I have access to arg1 and arg2 via closure.
                # I don't have to store them on self or pass them around
                # explicitly.
                return arg1 + arg2

            counter = 0

            def can_mutate_counter():
                # By using nonlocal, I can even mutate counter.
                # I rarely use this approach in practice. I usually pass it in
                # as an argument and return a value.
                nonlocal counter
                counter += 1

            can_mutate_counter()
            return helper_function() + counter

match statement

Did you know Python now has a match statement?

    # Taken from: https://learnpython.com/blog/python-match-case-statement/
    >>> command = 'Hello, World!'
    >>> match command:
    ...     case 'Hello, World!':
    ...         print('Hello to you too!')
    ...     case 'Goodbye, World!':
    ...         print('See you later')
    ...     case other:
    ...         print('No match found')

It's actually much more sophisticated than a switch statement, so take a look, especially if you've never used match in a functional language like Haskell.


If you ever need to implement an LRU cache, it'll be quite helpful to have an OrderedDict.

Python's dicts are now ordered by default. However, the docs for OrderedDict say that there are still some cases where you might need to use OrderedDict. I can't remember which. If you ever need your dicts to be ordered, just read the docs and figure out if you need an OrderedDict or if you can use just a normal dict.
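One of those cases is the LRU cache itself: OrderedDict has move_to_end() and popitem(last=False), which a plain dict lacks. A minimal sketch follows; the LRUCache class and the -1 return value for a miss are assumptions modelled on the usual interview formulation:

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return -1
        self.data.move_to_end(key)  # Mark as most recently used.
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # Evict the least recently used.


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now the most recently used entry.
cache.put("c", 3)  # Over capacity, so "b" is evicted.
print(list(cache.data))  # ['a', 'c']
```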


If you need a cache, sometimes you can just wrap your code in a function and use functools.cache:

    from functools import cache

    @cache
    def factorial(n):
        return n * factorial(n - 1) if n else 1

    print(factorial(5))
    ...
    factorial.cache_info()
    # Returns something like CacheInfo(hits=..., misses=..., maxsize=None, currsize=...)

Debugging ListNodes

A lot of the problems involve a ListNode class that's provided by LeetCode. It's not very "debuggable". Add this code temporarily to improve that:

    def list_node_str(head):
        seen_before = set()
        pieces = []
        p = head
        while p is not None:
            if p in seen_before:
                pieces.append(f'loop at {p.val}')
                break
            pieces.append(str(p.val))
            seen_before.add(p)
            p = p.next
        joined_pieces = ', '.join(pieces)
        return f'[{joined_pieces}]'

    ListNode.__str__ = list_node_str

Saving memory with the array module

Sometimes you need a really long list of simple numeric (or boolean) values. The array module can help with this, and it's an easy way to decrease your memory usage after you've already gotten your algorithm working.

    >>> import array
    >>> array_of_bytes = array.array('b')
    >>> array_of_bytes.frombytes(b'\0' * (array_of_bytes.itemsize * 10_000_000))

Pay close attention to the type of values you configure the array to accept. Read the docs.

I'm sure there's a way to use individual bits for an array of booleans to save even more space, but it'd probably cost more CPU, and I generally care about CPU more than memory.
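For the curious, one way to do that with only the standard library is to pack eight booleans into each byte of a bytearray. The BitArray class below is a hypothetical sketch, not something from the array module, and it does no bounds checking:

```python
class BitArray:
    """Fixed-size array of booleans packed eight to a byte."""

    def __init__(self, size: int):
        self.size = size
        self.bits = bytearray((size + 7) // 8)  # All bits start as 0.

    def __getitem__(self, i: int) -> bool:
        # i >> 3 selects the byte, i & 7 selects the bit within it.
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

    def __setitem__(self, i: int, value: bool) -> None:
        if value:
            self.bits[i >> 3] |= 1 << (i & 7)
        else:
            self.bits[i >> 3] &= ~(1 << (i & 7)) & 0xFF


flags = BitArray(10_000_000)
flags[12345] = True
print(flags[12345], flags[12346])  # True False
print(len(flags.bits))             # 1250000
```

Every access now costs a shift and a mask in Python code, which is the CPU trade-off mentioned above.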

Using an exception for the success case rather than the error case

A lot of Python programmers don't like this trick because it's equivalent to goto, but I still occasionally find it convenient:

    class Eureka(StopIteration):
        """Eureka means "I found it!" """
        pass


    def do_something_else():
        some_value = 5
        raise Eureka(some_value)


    def do_something():
        do_something_else()


    try:
        do_something()
    except Eureka as exc:
        print(f'I found it: {exc.args[0]}')

Updated: Enums

Python now has built-in enums:

    from enum import Enum

    # Either:
    class Color(Enum):
        RED = 1
        GREEN = 2
        BLUE = 3

    # Or:
    Color = Enum('Color', ['RED', 'GREEN', 'BLUE'])

However, in my experience, when coding for LeetCode, just having some local constants (even if the values are strings) is a tad faster and requires a tad less memory:

    RED = "RED"
    GREEN = "GREEN"
    BLUE = "BLUE"

Using strings isn't actually slow if all you're doing is pointer comparisons.

Updated: Using a profiler

You'll need some sample data. Make your code crash when it sees a test case with a lot of data. Grab the data in order to get your code to run on its own. Run something like the following. It'll print out enough information to figure out how to improve your code.

    import cProfile

    cProfile.run("Solution().someMethod(sampleData)")

Using VS Code, etc.

VS Code has a pretty nice Python extension. If you highlight the code and hit shift-enter, it'll run it in a shell. That's more convenient than just typing everything directly in the shell. Other editors have something similar, or perhaps you use a Jupyter notebook for this.

Another thing that helps me is that I'll often have separate files open with separate attempts at a solution. I guess you can call this the "fast" approach to branching.

Write English before Python

One thing that helps me a lot is to write English before writing Python. Just write all your thoughts. Keep adding to your list of thoughts. Sometimes you have to start over with a new list of thoughts. Get all the thoughts out, and then pick which thoughts you want to start coding first.


Well, those are my favorite tricks off the top of my head. I'll add more if I think of any.

This is just a single blog post, but if you want more, check out Python 3 Module of the Week.

Categories: FLOSS Project Planets

Marcos Dione: Collating, processing, managing, backing up and serving a gallery of a 350GiB, 60k picture collection

Planet Python - Tue, 2024-05-07 07:06

In the last two days I have commented a little on how I process and manage my photos. I'm not a very avid photographer; I have about 350 gigabytes of photos, around 60,000 of them, and most are not yet processed. So I will comment a little more on how I manage all that.

I start with the camera, a 24Mpx camera, just a couple of lenses, nothing fancy. Go out, take some pictures, come back home.

I put the SD card in my computer and use my own software to import it. The import process is not fancy: it just empties the SD card, checks every file for the EXIF information, uses the date and time to create the filename (plus a sequence number if needed), and puts them all in a single incoming directory where all the current unprocessed images are [1].

Then I use this software I developed in PyQt5. It's very, very basic, but it's really quick and mostly keyboard based. It reads the EXIF information and presents some of the tags at the left of the screen: things like date, time, size and orientation, and then focal length, aperture, ISO and various other data I can get from the images. It's mostly focused on my current camera and the previous one, both Nikons [2]. The previous one was a D90; right now it's a D7200. The image occupies most of the window, and the program is always in full screen. At the bottom there's the filename and a couple of toggles.

I can do several things with this:

  • Go forwards, backwards, by one, by ten, by a hundred and by a thousand, because that incoming directory right now has almost seven years of history, probably ten thousand pictures.

  • Move randomly, which allows me to pick up a new thing to collate when I get bored with the current one but I want to keep doing it to reduce the backlog.

  • Mark the images in different ways. The main ones are about selecting for storing, with two modes: One is to keep the image in the original size. I usually use this for my best landscape or astro photos. The other one will resize it down to about 13.5 megapixels [3], from 6000x4000 pixels to 4500x3000 pixels, 75% on each dimension.

  • Rotate the images, just in case the camera did not guess the orientation correctly, usually when I'm taking pictures right upward or right downwards.
  • Select several pictures for stitching, which will use hugin to do so. It's not 100% automatic, but at least it puts the pictures in a stitch directory and points hugin there.

  • Select a picture for cropping or editing; I'm not going to develop a whole image editor, so I just delegate to an existing program, gwenview.

  • Select images for deleting and delete them permanently.

  • Select several images for comparison and enter/exit comparison mode, which means that going backwards and forwards applies only to this set. This is useful when the pictures I want to compare are not necessarily consecutive in the original sequence, and for me it makes culling images faster.

  • It has two zoom levels, fit to screen and full size. I don't have much need for other options.
  • 99% of the pictures I take are freehand, so in a sequence there's always some movement between images. In full size I can put every image on its own position, aligning the whole sequence and allow culling based on blurriness or other factors.

  • Also in full size, I can lock the view, so when I pan one of the images and I switch to another one, it will also pan that second image to that position. It also helps when I'm checking for details between two different images of the same thing.

  • Move all the selected images, resize them if needed, and put them in a folder. It also hardlinks each image from my categorized folders into a folder tree that collects all the images by date; there's one folder for each month and year with all the pictures of that month inside. It uses hardlinks so it doesn't duplicate the image file, saving space.

  • It also has a readonly mode, so I can hand the computer to my kids to watch the photos.
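The by-date hardlinking step from the list above could look roughly like the following sketch. It is not the author's code: the link_by_date function, its parameters and the YYYY/MM folder layout are all assumptions, and hard links only work when both paths are on the same filesystem.

```python
import os
from datetime import datetime
from pathlib import Path


def link_by_date(image: Path, taken: datetime, by_date_root: Path) -> Path:
    """Hard-link image into by_date_root/YYYY/MM/ and return the new path."""
    target_dir = by_date_root / f"{taken:%Y}" / f"{taken:%m}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / image.name
    if not target.exists():
        # Both names now point at the same data on disk, so the image is
        # not stored twice.
        os.link(image, target)
    return target
```

Calling link_by_date(Path("albums/trip/img.jpg"), datetime(2024, 5, 7), Path("by_date")) would create by_date/2024/05/img.jpg as a second name for the same file.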

When culling, I use the comparison mode and individual position and lock view features a lot, going back and forth between images, discarding until only one is left.

That's the first part, the one I must spend my time on, just basic culling, selection and storage. My main tree is just a tree based on my way of categorizing the images.

My program doesn't have a directory view; instead, I just use gwenview again.

Notice there's no photo editing in this workflow. I rarely shoot in RAW for two reasons: a) I'm really bad at postprocessing; and b) even if I were good, I don't have the time to do it; my free time is shared among several hobbies. I only do it for astro photography and on very few, rare occasions.

The third tool I use is digikam. I use it for two things, which are related: semi-automatic and manual tagging. The semi-automatic is face detection; digikam can find and guess faces, but requires manual confirmation [4]. The fully manual part is plain tagging, mostly with location [5] and sometimes some other info. I sometimes also rate my pictures; I mostly use four and five, sometimes three, only for my best pictures.

Then there's another script that reads the digikam database and uses the tags to create another directory for the tags, which also uses hardlinks. It still doesn't do anything about the rating, but I could easily add that.

That's all on my personal computer. I use rsync to make a copy on my home server that has two purposes. One, it's a backup, which includes all the original 24Mpx images that I haven't culled yet, which I think is the biggest part of my collection.

The second one, it feeds a gallery program that is developed in PHP by a guy named Karl. It's probably the only paid software I use. It's a single PHP file that you put at the root of your gallery, you enable PHP processing by your web server (in my case, Apache), and it generates the gallery on the fly, just reading the directories and creating all the necessary thumbnails and all that. I did a small change to this program. The original algorithm creates thumbnails based on each file's path (and other attributes, 4 or 5 I think), but because I have all these hard links, it creates duplicated thumbnail files. So I changed it to use the filename instead of the filepath [6].

I don't have any kind of synchronization with my phone. Most of the pictures I take with it are not the kind of pictures I usually will put in my own gallery, except the times I go out without my camera and I end up taking pictures anyway. I still don't have a workflow for that, it's mostly manual. So if I ever lose my phone, I'm fscked because I have definitely no backups of it.

That lack of synchronization also means that the only way to see the pictures on my phone is by opening the gallery in the browser. It's not the best, but I don't do that that often. I have tried to use alternatives like NextCloud, which I also have installed on my home server. I have some issues with permissions because, again, this is a backup directory, so it has all the owner information that belongs to me, instead of the web server. That means it doesn't have the proper permissions to let NextCloud manage those files. Luckily files.gallery just needs a subdirectory.

Another reason is that before I was using static gallery generators: sigal, gallerpy or even nikola, which drives this glob. All those can generate the gallery statically, so serving them is so much easier. My old home server died at some point and I had to come up with something. I had a spare old laptop laying around and I used that. Now it's enough to generate the gallery on the fly. I have plans to make something bigger, but that's for another time.

  1. In fact I have another directory for all the unprocessed photos from another era, and I'm thinking of starting a new era. 

  2. Even if EXIF is a standard for storing tags, there's no standard for the tag names, so every manufacturer has its own sets, which even change between camera lines. For a better idea of what I'm talking about, just peruse Image::ExifTool's source code.

  3. I currently own no screen that is 4500 pixels wide, let alone 6000. Maybe my kids will, but by then Mpx counts will be so different that it won't make any sense to accommodate that. Right now storage for me is expensive, so I'll keep it this way.

  4. Or rejection: the false positive rate is bigger than I would like, and it doesn't have a way to say 'yes, this is that person, but don't train on this image'. This is the case for pictures where the face is either semi occluded, sometimes painted, sometimes in bad lighting, and mostly just blurry.

  5. Most of my pictures don't have GPS info, not even the ones in the phone. The latter I only enable when I really need the info later, mostly for mapping. Later I either discard the photo or remove the info. 

  6. For a while now I'm even making this distinction in my own code, filename vs filepath. 

Categories: FLOSS Project Planets

Qt Creator 13.0.1 released

Planet KDE - Tue, 2024-05-07 06:59

We are happy to announce the release of Qt Creator 13.0.1!

Categories: FLOSS Project Planets

Why datasets built on public domain might not be enough for AI

Open Source Initiative - Tue, 2024-05-07 06:00

There is tension between copyright laws and large datasets suitable to train large language models. Common Corpus is a dataset that only uses text from copyright-expired sources to bypass the legal issues. It’s a useful achievement, paving the path to research without immediate risk of lawsuits. I also fear that this approach may lead to bad policies, reinforcing the power of copyright holders; not the small creators but large corporations. 

A dataset built on public domain sources

In March 2024 Common Corpus was released as an open access dataset for training large language models (LLMs). Announcing the release, the lead developer Pierre-Carl Langlais says “Common Corpus shows it is possible to train fully open LLMs on sources without copyright concerns.” The dataset contains 500 billion words in multiple European languages and different cultural heritages. It is a project coordinated by the French startup Pleias and supported by organizations committed to open science such as Occiglot, Eleuther AI and Nomic AI as well as being partly funded by the French government. The stated intention of Common Corpus is to democratize access to large quality datasets. It has many other positive characteristics, highlighted also by Open Future’s summary of a talk given by Langlais.

The commons needs more data

The debates sparked by the Deep Dive: AI process on the role of training data highlighted that AI practitioners encounter many obstacles assembling datasets. At the same time, we discovered that tech giants have an incredible advantage over researchers and startups. They’ve been slurping data for decades, have the financial means to go to court and can enter into bilateral agreements to license data. These strategies are inaccessible to small competitors and academics. Accepting that the only path to creating open large datasets suitable to train Open Source AI systems is to use sources in the public domain risks cementing the dominant positions of existing large corporations.

The open landscape already faces issues with big tech and their ability to influence legislation. The big corporations have lobbied to extend the duration of copyright, introduced the DMCA, are opposing the right to repair, and have the resources to continue lobbying and sue any new entrant who they deem to get too close. There are plenty of examples showing an unequal advantage in protecting what they think is theirs. The non-profit Fairly Trained certifies companies “willing to prove that they’ve trained their AI models on data that they own, have licensed, or that is in the public domain,” respecting copyright law: who’s going to benefit from this approach?

Unsuitable for public policies

Initiatives like Common Corpus and The Stack (used to train Starcoder2) are important achievements as they allow researchers to develop new AI systems while mitigating the risk of being sued. They also push the technical boundaries of what can be achieved with smaller datasets that don’t require a nuclear power plant to train new models. But I think they mask the underlying issue: AI needs data and limiting open datasets to only public domain sources will never give them a chance to match the size of the proprietary ones. The lobby for copyright maximalists is always looking for ways to expand scope and extend terms for copyright laws, and when they succeed it is a one-way ratchet. It would be a tragedy for society if legislators listened to their sophistry and made new laws doing this based on the apparent consensus that creators need protection from AI.
The role of data for training machine learning systems is a divisive topic and a complex one. Having datasets like Common Corpus is a very useful way for the science of AI to progress with better sources. For policies, we’d be better off pushing for something like the proposal advanced by Open Future and Creative Commons in their paper Towards a Books Data Commons for AI Training.

Categories: FLOSS Research

MBition becomes a KDE patron

Planet KDE - Tue, 2024-05-07 05:30

MBition supports the work of the KDE community with its generous sponsorship.

MBition designs and implements the infotainment system for future generations of Mercedes-Benz cars and relies on KDE's technology and know-how for its products.

"After multiple years of collaboration across domains, we feel that becoming a patron of KDE e.V is the next step in deepening our partnership and furthering our open-source strategy" says Marcus Mennemeier, Chief of Technology at MBition.

"We are delighted to welcome MBition as a Patron," says Lydia Pintscher, Vice President of KDE e.V. "MBition has been contributing to KDE software and the stack we build on it for some time now. This is a great step to bring us even closer together and support the KDE community, and further demonstrates the robustness and hardware readiness of KDE's software products."

MBition joins KDE e.V.'s other patrons: Blue Systems, Canonical, g10 Code, Google, Kubuntu Focus, Slimbook, SUSE, The Qt Company and TUXEDO Computers, who support free open source software and KDE development through KDE e.V.

Categories: FLOSS Project Planets

Robin Wilson: Simple segmentation of geospatial images

Planet Python - Tue, 2024-05-07 05:30

I had a need to do some segmentation of some satellite imagery the other day, for a client. Years ago I was quite experienced at doing segmentation and classification using eCognition but that was using the university’s license, and I don’t have a license myself (and they’re very expensive). So, I wanted a free solution.

There are various segmentation tools in the scikit-image library, but I’ve often struggled using them on satellite or aerial imagery – the algorithms seem better suited to images with a clear foreground and background.

Luckily, I remembered RSGISLib – a very comprehensive library of remote sensing and GIS functions. I last used it many years ago, when most of the documentation was for using it from C++, and installation was a pain. I’m very pleased to say that installation is nice and easy now, and all of the examples are in Python.

So, doing segmentation – using an algorithm specifically designed for segmenting satellite/aerial images – is actually really easy now. Here’s how:

First, install RSGISLib. By far the easiest way is to use conda, but there is further documentation on other installation methods, and there are Docker containers available.

conda install -c conda-forge rsgislib

Then it’s a simple matter of calling the relevant function from Python. The documentation shows the segmentation functions available, and the one you’re most likely to want to use is the Shepherd segmentation algorithm, which is described in this paper. So, to call it, run something like this:

from rsgislib.segmentation.shepherdseg import run_shepherd_segmentation

run_shepherd_segmentation(input_image, output_seg_image, gdalformat='GTiff', calc_stats=False, num_clusters=20, min_n_pxls=300)

The parameters are fairly self-explanatory – it takes the input_image filename (any GDAL-supported format will work) and writes its output to the output_seg_image filename in the gdalformat given. The calc_stats parameter is important if you’re using a format like GeoTIFF, or any other format that doesn’t support a Raster Attribute Table (RATs are mostly supported by somewhat more unusual formats like KEA). You’ll need to set it to False if your format doesn’t support RATs – I found that if I forgot to do so, the script crashed when trying to write stats.

The final two parameters control how the segmentation algorithm itself works. I’ll leave you to read the paper to find out the details, but the names are fairly self-explanatory.
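To avoid forgetting the calc_stats setting, the call can be wrapped in a small helper. This is only a sketch: the names RAT_FORMATS, shepherd_options and segment are made up for illustration, and the set of RAT-capable formats lists only KEA because that is the one format named above, so adjust it for your own setup.

```python
# Formats assumed here to support a Raster Attribute Table (RAT).
# Only KEA is named in the post; extend this set if you know of others.
RAT_FORMATS = {"KEA"}


def shepherd_options(gdalformat, num_clusters=20, min_n_pxls=300):
    """Build the keyword arguments for run_shepherd_segmentation."""
    return {
        "gdalformat": gdalformat,
        # Writing stats needs a RAT, so switch it off for formats like GTiff.
        "calc_stats": gdalformat in RAT_FORMATS,
        "num_clusters": num_clusters,
        "min_n_pxls": min_n_pxls,
    }


def segment(input_image, output_seg_image, gdalformat="GTiff"):
    """Run the Shepherd segmentation with format-appropriate options."""
    # Imported lazily so shepherd_options() works without RSGISLib installed.
    from rsgislib.segmentation.shepherdseg import run_shepherd_segmentation

    run_shepherd_segmentation(
        input_image, output_seg_image, **shepherd_options(gdalformat)
    )
```

Calling segment("input.tif", "segments.tif") would then run with calc_stats=False, while passing gdalformat="KEA" would turn the stats back on. The two filenames are placeholders.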

The output of the algorithm will look something like this:

It’s a raster where every pixel in the first segment has the value 1, every pixel in the second segment has the value 2, and so on. The image above uses a greyscale ‘black to white’ colormap, so as the segment values increase towards the bottom of the image, they appear whiter.
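To make that layout concrete, here is a toy illustration that doesn’t touch RSGISLib at all: a hand-written grid stands in for the output raster, and the pixels in each segment are counted. The grid values are invented for the example; with a real output you would read the band into an array first.

```python
from collections import Counter

# A tiny stand-in for the segmentation output: each cell holds its segment ID.
seg = [
    [1, 1, 2, 2],
    [1, 1, 2, 3],
    [4, 4, 3, 3],
]

# Count how many pixels belong to each segment ID.
sizes = Counter(value for row in seg for value in row)

for segment_id in sorted(sizes):
    print(f"segment {segment_id}: {sizes[segment_id]} pixels")
```

A count like this is a quick way to check whether the min_n_pxls setting has had the effect you expected.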

You can convert this raster output to a set of vector polygons, one for each segment, by using any standard raster to vector ‘polygonize’ algorithm. The easiest is probably using GDAL, by running a command like:

gdal_polygonize.py SegRaster.tif SegVector.gpkg

This will give you a result that looks like the red lines on this image:

So, there’s a simple way of doing satellite image segmentation in Python. I hope it was useful.

Categories: FLOSS Project Planets

Python Bytes: #382 A Simple Game

Planet Python - Tue, 2024-05-07 04:00
<strong>Topics covered in this episode:</strong><br> <ul> <li><a href="https://github.com/nektos/act"><strong>act: Run your GitHub Actions locally!</strong> </a></li> <li><a href="https://portr.dev">portr</a></li> <li><a href="https://rednafi.com/python/annotate_args_and_kwargs/"><strong>Annotating args and kwargs in Python</strong></a></li> <li><a href="https://github.com/Envoy-VC/awesome-badges">github badges</a></li> <li><strong>Extras</strong></li> <li><strong>Joke</strong></li> </ul><a href='https://www.youtube.com/watch?v=v3x4WqEwamg' style='font-weight: bold;'data-umami-event="Livestream-Past" data-umami-event-episode="382">Watch on YouTube</a><br> <p><strong>About the show</strong></p> <p>Sponsored by ScoutAPM: <a href="https://pythonbytes.fm/scout">pythonbytes.fm/scout</a></p> <p><strong>Connect with the hosts</strong></p> <ul> <li>Michael: <a href="https://fosstodon.org/@mkennedy"><strong>@mkennedy@fosstodon.org</strong></a></li> <li>Brian: <a href="https://fosstodon.org/@brianokken"><strong>@brianokken@fosstodon.org</strong></a></li> <li>Show: <a href="https://fosstodon.org/@pythonbytes"><strong>@pythonbytes@fosstodon.org</strong></a></li> </ul> <p>Join us on YouTube at <a href="https://pythonbytes.fm/stream/live"><strong>pythonbytes.fm/live</strong></a> to be part of the audience. Usually Tuesdays at 11am PT. Older video versions available there too.</p> <p>Finally, if you want an artisanal, hand-crafted digest of every week of </p> <p>the show notes in email form? Add your name and email to <a href="https://pythonbytes.fm/friends-of-the-show">our friends of the show list</a>, we'll never share it.</p> <p><strong>Brian #1:</strong> <a href="https://github.com/nektos/act"><strong>act: Run your GitHub Actions locally!</strong> </a></p> <ul> <li>Why? 
<ul> <li>“Fast Feedback - Rather than having to commit/push every time you want to test out the changes you are making to your .github/workflows/ files (or for any changes to embedded GitHub actions), you can use act to run the actions locally. The environment variables and filesystem are all configured to match what GitHub provides.”</li> <li>“Local Task Runner - I love make. However, I also hate repeating myself. With act, you can use the GitHub Actions defined in your .github/workflows/ to replace your Makefile!”</li> </ul></li> <li>Docs: <a href="https://nektosact.com/introduction.html">nektosact.com</a></li> <li>Uses Docker to run containers for each action.</li> </ul> <p><strong>Michael #2:</strong> <a href="https://portr.dev">portr</a></p> <ul> <li>Open source ngrok alternative designed for teams</li> <li>Expose local http, tcp or websocket connections to the public internet</li> <li>Warning: Portr is currently in beta. Expect bugs and anticipate breaking changes.</li> <li><a href="https://portr.dev/server/">Server setup</a> (docker basically).</li> </ul> <p><strong>Brian #3:</strong> <a href="https://rednafi.com/python/annotate_args_and_kwargs/"><strong>Annotating args and kwargs in Python</strong></a></p> <ul> <li>Redowan Delowar</li> <li>I don’t think I’ve ever tried, but this is a fun rabbit hole.</li> <li>Leveraging bits of PEP-589, PEP-646, PEP-655, and PEP-692.</li> <li><p>Punchline:</p> <pre><code>from typing import TypedDict, Unpack  # Python 3.12+
# from typing_extensions import TypedDict, Unpack  # &lt; Python 3.12

class Kw(TypedDict):
    key1: int
    key2: bool

def foo(*args: Unpack[tuple[int, str]], **kwargs: Unpack[Kw]) -&gt; None: ...
</code></pre></li> <li><p>A recent pic from Redowan’s blog: </p> <ul> <li><a href="https://rednafi.com/python/typeguard_vs_typeis/">TypeIs does what I thought TypeGuard would do in Python</a></li> </ul></li> </ul> <p><strong>Michael #4:</strong> <a href="https://github.com/Envoy-VC/awesome-badges">github badges</a></p> <ul> <li><img src="https://paper.dropboxstatic.com/static/img/ace/emoji/1f60e.png?version=8.0.0" alt="smiling face with sunglasses" /> A curated list of GitHub badges for your next project</li> </ul> <p><strong>Extras</strong> </p> <p>Brian:</p> <ul> <li><a href="https://www.bleepingcomputer.com/news/security/fake-job-interviews-target-developers-with-new-python-backdoor/">Fake job interviews target developers with new Python backdoor</a></li> <li>Later this week, <a href="https://courses.pythontest.com">course.pythontest.com</a> will shift from Teachable to Podia <ul> <li>Same great content. Just a different backend.</li> <li>To celebrate, get 25% off at <a href="https://pythontest.podia.com">pythontest.podia.com</a> now through this Sunday using coupon code PYTEST</li> </ul></li> <li><a href="https://podcast.pythontest.com/episodes/220-juggling-pycon">Getting the most out of PyCon, including juggling - Rob Ludwick</a> <ul> <li>Latest PythonTest episode, also cross posted to <a href="https://pythonpeople.fm">pythonpeople.fm</a></li> </ul></li> <li><a href="https://hci.social/@orion/112167137880858495">3D visualization of dom</a></li> </ul> <p>Michael:</p> <ul> <li><a href="https://djangonaut.space/comms/2024/04/25/2024-opening-session-2/">Djangonauts Space Session 2 Applications</a> Open! 
More background at <a href="https://talkpython.fm/episodes/show/451/djangonauts-ready-for-blast-off">Djangonauts, Ready for Blast-Off</a> on Talk Python.</li> <li><a href="https://djangochat.com/episodes/michael-kennedy">Self-Hosted Open Source - Michael Kennedy</a> on Django Chat</li> </ul> <p><strong>Joke:</strong> <a href="https://www.reddit.com/r/programminghumor/comments/1ceo0ds/just_a_silly_little_game/">silly games</a></p> <p>Closing song: <a href="https://www.youtube.com/watch?v=pGbodliLFVE">Permission Granted</a></p>
Categories: FLOSS Project Planets

The Drop Times: Tim Hestenes Lehnen Delves into 'Fog & Fireflies': A Journey of Magic and Metaphor

Planet Drupal - Tue, 2024-05-07 03:45
Discover the enchanting world of 'Fog & Fireflies' as Tim Hestenes Lehnen shares the story behind his latest fantasy novel in an exclusive interview with Kazima Abbas. Uncover the inspiration, themes, and secrets that await within the pages of this captivating tale.
Categories: FLOSS Project Planets

The Drop Times: Introducing Drupal Starshot and Charting a New Course for the Future

Planet Drupal - Tue, 2024-05-07 03:30
Discover the highlights from DrupalCon Portland 2024, where Dries Buytaert presents the latest innovations including Drupal Starshot. Explore the significant strides toward enhancing usability, inclusivity, and the global impact of Drupal on maintaining an open, accessible web. Join the movement shaping the future of digital experiences.
Categories: FLOSS Project Planets