Feeds

Kubuntu 24.10 Oracular Oriole Released

Planet KDE - Thu, 2024-10-10 11:05

The Kubuntu Team is happy to announce that Kubuntu 24.10 has been released, featuring the new and beautiful KDE Plasma 6.1: simple by default, powerful when needed.

Codenamed “Oracular Oriole”, Kubuntu 24.10 continues our tradition of giving you Friendly Computing by integrating the latest and greatest open source technologies into a high-quality, easy-to-use Linux distribution.

Under the hood, there have been updates to many core packages, including a new 6.11-based kernel, KDE Frameworks 5.116 and 6.6.0, KDE Plasma 6.1 and many updated KDE Gear applications.

Kubuntu 24.10 with Plasma 6.1

Kubuntu has seen many updates for other applications, both in our default install, and installable from the Ubuntu archive.

Applications for core day-to-day usage are included and updated, such as Firefox and LibreOffice.

For a list of other application updates and known bugs, be sure to read our release notes.

Wayland as default Plasma session.

The Plasma Wayland session is now the default option in SDDM (the display manager login screen). An X11 session can be selected instead if desired. The last used session type will be remembered, so you do not have to switch type on each login.

Download Kubuntu 24.10, or learn how to upgrade from 24.04 LTS.

Note: For upgrades from 24.04, there may be a delay of a few hours to days between the official release announcements and the Ubuntu Release Team enabling upgrades.

Categories: FLOSS Project Planets

Qt for Python release: 6.8 is out now!

Planet KDE - Thu, 2024-10-10 08:38

We’re very happy to announce the latest release of Qt for Python 6.8. With every new release, we try to bring great things with Qt’s new features and new trending ideas. For your convenience, you can check out what’s new in Qt for Python 6.8 and what’s improved, along with the entire change log.

Categories: FLOSS Project Planets

Real Python: Quiz: Structural Pattern Matching

Planet Python - Thu, 2024-10-10 08:00

In this quiz, you’ll test your understanding of Structural Pattern Matching in Python.

You’ll revisit the syntax of the match statement and case clauses, explore various types of patterns supported by Python, and learn about guards, unions, aliases, and name binding.

Categories: FLOSS Project Planets

Rahmat Akintola: Voices of the Open Source AI Definition

Open Source Initiative - Thu, 2024-10-10 07:30

The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.

This series features the voices of the volunteers who have helped shape and are shaping the Definition.

Meet Rahmat Akintola

What’s your background related to Open Source and AI?

Sure. I’ll start with Open Source. My journey began at PyCon Africa in 2019, where I participated in a hackathon on Cookiecutter. At the time, I had just transitioned into web development, and I was looking for ways to improve my skills beyond personal projects. So, I joined the Cookiecutter Academy at Python Africa in 2019. That’s how I got introduced to Open Source.

Since then, I’ve been contributing regularly, starting with one-off contributions to different projects. These days, I primarily focus on code and documentation contributions, mainly in web development.

As for AI, my journey started with data science. I had been working as a program manager and was part of the Women in Machine Learning and Data Science community in Accra, which was looking for volunteers. Coincidentally, I had lost my job at the time, so I applied for the program manager role and got it. That experience sparked my interest in AI. I started learning more about machine learning and AI, and I needed to build my domain knowledge to help with my role in the community.

I’ve worked on traditional models like linear and logistic regression through various courses. Recently, as part of our community, we organized a “Mathematics for Machine Learning” boot camp, where we worked on projects related to reinforcement learning and logistic regression. One dataset I worked with involved predicting BP (blood pressure) levels in the US. The task was to assess the risk of developing hypertension based on various factors.

What motivated you to join this co-design process to define Open Source AI?

The Open Source AI journey started when I was informed about a virtual co-design process that was reaching out to different communities, including mine. As the program lead, I saw it as an opportunity to merge my two passions—Open Source and AI.

I volunteered and worked on testing the OpenCV workbook, as I was using OpenCV at the time. I participated in the first phase, which focused on determining whether certain datasets needed to be open. Unfortunately, I couldn’t participate in the validation phase because I was involved in the mathematics boot camp, but I followed the discussions closely.

When the opportunity came up to participate in the co-design process, I saw it as a chance to bridge my work in Open Source web development and my growing interest in AI. It felt like the perfect moment. I was already using OpenCV, which happened to be part of the AI systems under review, so I jumped right in.

Through the process, I realized that defining Open Source AI goes beyond just using tools or making code contributions—it involves a deep understanding of data, legality, and the broader system.

How did you get invited to speak at the Deep Learning Indaba conference in Dakar? How was the conference experience? Did you make any meaningful connections?

As for speaking at Deep Learning Indaba, the opportunity came unexpectedly. One day, Mer Joyce (the OSAID co-design organizer) sent an email offering a chance to speak on Open Source AI at the conference. I had previously applied to attend but didn’t get in, so I jumped on this opportunity. We used a presentation similar to one Mer had given at Open Source Community Africa.

I made excellent connections. The conference itself was amazing—though the food and the Senegal experience also played a part! There were many AI and machine learning researchers, and I learned new concepts, like using JAX, which was introduced as an alternative to some common frameworks. The tutorials were well-targeted at beginners, which was perfect for me.

On a personal level, it was great to connect with academics. I’m considering applying for a master’s or Ph.D., and the conference provided an opportunity to ask questions and receive guidance.

Why do you think AI should be Open Source?

AI is becoming a significant part of our lives. I work with the Meltwater Entrepreneurial School of Technology (MEST) as a technical lead, and we use AI for various training purposes. Opening up parts of AI systems allows others to adapt and refine them to suit their needs, especially in localized contexts. For example, I saw someone on Twitter excited about building a GPT for dating, customizing it to ask specific questions.

This ability for people to tweak and refine AI models, even without building them from scratch, is important. Open-sourcing AI enables more innovation and helps tailor models for specific needs, which is why I believe it should be open to an extent.

Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?

One new perspective I gained was on the legal and data availability aspects of AI. Before this, I had never really considered the legal side of things, but during the co-design process, it became clear that these elements are crucial in defining Open Source AI systems. It’s more than just contributing code—it’s about ensuring compliance with legal frameworks and making sure data is available and usable.

What do you think the primary benefit will be once there is a clear definition of Open Source AI?

A clear definition would help people understand that Open Source AI involves more than just attaching an MIT or Apache license to a project on GitHub. There’s more complexity around sharing models, data and parameters.

For instance, I was once asked whether using an “Open Source” large language model like LLaMA meant the data had to be open too. A well-defined standard would provide guidance for questions like these, ensuring people understand the legal and technical aspects of making their AI systems Open Source.

What do you think are the next steps for the community involved in Open Source AI?

In Africa, I think the next step is spreading awareness about the Open Source AI Definition. Many people are still unaware of the complexities, and there’s still a tendency to assume that adding an Open Source license to a project automatically makes it open. Building collaborations with local communities to share this information is important.

For women, especially in Africa, visibility is key. When women see others doing similar work, they feel encouraged to join. Representation and community engagement play significant roles in driving diversity in Open Source AI.

How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:

  • Join the forum: share your comment on the drafts.
  • Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
  • Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
  • Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
  • Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.
Categories: FLOSS Research

Kirigami Addons 1.5

Planet KDE - Thu, 2024-10-10 07:30

Kirigami Addons 1.5 is out. This release contains mostly code cleanup and minor improvements. There are nevertheless a few relevant changes. Thanks to everyone who contributed some code.

New KAppTemplate’s template

A new KAppTemplate is available as a good starting point for applications that manage multimedia libraries. It is based on the shared design of Peruse, Arianna and the WIP Calligra Launcher.

Hopefully it helps people who want to develop game launchers and other types of specialized multimedia applications.

More templates are planned (e.g. for chat applications), so stay tuned!

FormCard

FormCard is the part of Kirigami Addons that received the most changes in this release. First of all, FormCard now uses more consistent spacing and padding, with slightly less horizontal padding. Descriptions for radio and checkbox delegates are also put underneath the delegate’s main text and checkbox, in an effort to make FormCard a bit more compact.

Before After

Additionally, FormComboBoxDelegate now lets you display an inline status similar to what is available in other FormCard delegates.

Finally, FormCard.AboutKDE was renamed to FormCard.AboutKDEPage. This improves the naming consistency with other page components. A compatibility wrapper on top of AboutKDEPage named AboutKDE is still available to avoid breaking existing applications.

Deprecations

The Banner component is now deprecated. Kirigami.InlineMessage now has a position parameter which can be set to Header or Footer. Additionally, with KDE Frameworks 6.8, Kirigami.InlineMessage will look exactly the same as Banner! So there is no longer any reason for this component to exist in Kirigami Addons.

Other

Kirigami Addons supports static builds with a recent enough version of extra-cmake-modules.

Packager Section

You can find the package on download.kde.org and it has been signed with my GPG key.

Categories: FLOSS Project Planets

PyCharm: How I do Django APIs in PyCharm

Planet Python - Thu, 2024-10-10 07:21

I learn so much from watching conference talks, especially live, when I’m vibing with the crowd. But sometimes I watch and think: “Wow, I wish I could show you how awesome that would be in PyCharm.”

That just happened. Here’s the explainer, with a little special something at the end.

Hello, DjangoConf

I recently attended DjangoConf 2024 which kicked off this year’s DSF-PyCharm fundraiser. I attended Felipe’s DRF tutorial where he showed off using PyCharm and even a little bit about endpoints.

Afterwards, I ran into a PyCharm fan who told me what he really likes when using PyCharm for Django. It matched what I really like. Hence, a blog post.

The end is the point

My superfan friend was an early adopter of endpoints, our feature for rethinking the API developer experience (DX) in Django, FastAPI, and Flask. Me too. It’s cool to have a listing of endpoints, jumping to the definition, and most of all – issuing an HTTP request right there in the IDE. No going out to Postman. 

I covered endpoints and the HTTP client in my previous blog post. One extra point: he said Postman pricing is going up. I guess I should talk more about the HTTP Client.

Always be debugging

Most folks know that I’m a debugger stan, probably because I just won’t shut up about it. It turns out that he also uses the debugger first, meaning he runs the Django server, under the debugger, all the time, even when he isn’t debugging.

Why? First, it’s so fast, you don’t notice the speed hit. As he also knew, Python 3.12 lowers the impact of debugging and PyCharm uses this automatically. The bigger point though: when you want to poke around, you don’t need to stop the regular “run”, launch under “debug”, then return to “run.” That’s disruptive, so people just do print. Which makes me a sad panda.

If you’re always debugging, then poking around is already RIGHT THERE. Even if you don’t have a bug and just want to investigate. Even if you are in a template.

This is great with endpoints, as you can click a breakpoint in your code and issue a request without leaving the tool.

He made one last point – PyCharm’s Django support and debugger are more mature and polished. We’ve been doing this for a while!

I didn’t know there would be a test

There’s one more step to the higher-zen of using PyCharm to the fullest with Django. Why use the browser or an HTTP client at all? Why not just sit in a test module and let PyCharm + pytest bring joy to your world? In fact – don’t even run Django. No server process, less hassle.

Django makes it really easy to issue fake requests in a test, get the results back, and make sure things are cool. I like having my code on the left, my test on the right, and the test output on the bottom. In fact, I also like combining Always Be Testing with Always Be Debugging, which makes it crazy-easy to stop in the middle of a view and see what’s going on.

I like it so much, here’s a little video:

This works great for checking how the code works. You can skip going to the browser, reloading, and poking around. You stay in the IDE, the flow. But there’s a catch.

Seeing is believing

Sometimes you need to see how the page looks. In the browser. With your eyeballs. Any chance PyCharm can improve the DX for this?

As it turns out, in 2023.3 we shipped Django Preview, a feature-rich browser in the IDE that keeps up as you type.

A love letter to Django

This concludes my speaking from the heart about my way of doing Django API development in PyCharm: endpoints, debugger, testing, and preview.

But I’d like to close by speaking from the heart about Django, leading with an odd little twist of fate about Django killing my project.

Categories: FLOSS Project Planets

Annertech: DrupalCon Barcelona: Our highlights

Planet Drupal - Thu, 2024-10-10 05:00

DrupalCon Barcelona 2024 was one of our busiest yet. We were a platinum sponsor, sponsored the contribution room, had numerous social activities (including Trivia Night), and Annertechies took to the mic at least seven times.

Categories: FLOSS Project Planets

Gunnar Wolf: Started a guide to writing FUSE filesystems in Python

Planet Debian - Wed, 2024-10-09 21:07

As DebConf22 was coming to an end in Kosovo, I was talking with Eeveelweezel, and they invited me to prepare a talk to give for the Chicago Python User Group. I replied that I’m not really that much of a Python guy… But would think about a topic. Two years passed. I met Eeveelweezel again at DebConf24 in Busan, South Korea. And the topic came up again. I had thought of some ideas, but none really pleased me. Again, I do write some Python when needed, and I teach using Python, as it’s the language I find my students can best cope with. But delivering a talk to ChiPy?

On the other hand, I have long used a very simplistic and limited filesystem I’ve designed as an implementation project at class: FIUnamFS (for “Facultad de Ingeniería, Universidad Nacional Autónoma de México”: the Engineering Faculty for Mexico’s National University, where I teach. Sorry, the link is in Spanish, but you will find several implementations of it from the students 😉). It is a toy filesystem, with as many bad characteristics as you can think of, but easy to specify and implement. It is based on contiguous file allocation, has no support for sub-directories, and is often limited to the size of a 1.44MB floppy disk.

As I give this filesystem as a project to my students (and not as a mere homework), I always ask them to try and provide a good, polished, professional interface, not just the simplistic menu I often get. And I tell them the best possible interface would be if they provide support for FIUnamFS transparently, usable by the user without thinking too much about it. With high probability, that would mean: Use FUSE.

But, in the six semesters I’ve used this project (with 30-40 students per semester group), only one student has bitten the bullet and presented a FUSE implementation.

Maybe this is because it’s not easy to understand how to build a FUSE-based filesystem from a high-level language such as Python? Yes, I’ve seen several implementation examples and even nice web pages (e.g. the examples shipped with the python-fuse module, Stavros’ passthrough filesystem, Dave Filesystem, which is based upon and further explains Stavros’, and several others) explaining how to provide basic functionality. I found a particularly useful presentation by Matteo Bertozzi presented ~15 years ago at PyCon4… But none of those is IMO followable enough by itself. Also, most of them are very old (maybe the world is telling me something that I refuse to understand?).

And of course, there isn’t a single interface to work from. In Python only, we can find python-fuse, Pyfuse, Fusepy… Where to start from?

…So I set out to try and help.

Over the past couple of weeks, I have been slowly working on my own version, and presenting it as a progressive set of tasks, adding filesystem calls, and being careful to thoroughly document what I write (but… maybe my documentation ends up obfuscating the intent? I hope not — and, read on, I’ve provided some remediation).

I registered a GitLab project for a hand-holding guide to writing FUSE-based filesystems in Python. This is a project where I present several working FUSE filesystem implementations, some of them RAM-based, some passthrough-based, and I intend to add to this also filesystems backed on pseudo-block-devices (for implementations such as my FIUnamFS).

So far, I have added five stepwise pieces, starting from the barest possible empty filesystem, and adding system calls (and functionality) until (so far) either a read-write filesystem in RAM with basic stat() support or a read-only passthrough filesystem.
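
To give a feel for what a read-write filesystem in RAM with basic stat() support involves, here is a minimal sketch. It is not code from the guide and is not wired to python-fuse, Pyfuse or Fusepy: the `RamFS` class, its method names and its signatures are assumptions that only mirror the general shape of FUSE operations (return data on success, a negative errno value on failure). A real binding would call methods like these from its own callbacks.

```python
import errno
import stat


class RamFS:
    """Toy flat in-memory filesystem (hypothetical API, no FUSE binding)."""

    def __init__(self):
        # Maps an absolute path such as "/notes.txt" to its contents (bytes).
        # Only the root directory exists; there are no sub-directories.
        self.files = {}

    def getattr(self, path):
        """Return a stat-like dict, or a negative errno if the path is unknown."""
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2, "st_size": 0}
        if path in self.files:
            return {
                "st_mode": stat.S_IFREG | 0o644,
                "st_nlink": 1,
                "st_size": len(self.files[path]),
            }
        return -errno.ENOENT

    def readdir(self, path):
        """List the root directory; anything else is not a directory here."""
        if path != "/":
            return -errno.ENOTDIR if path in self.files else -errno.ENOENT
        return [".", ".."] + sorted(name[1:] for name in self.files)

    def create(self, path):
        """Create (or truncate) an empty file."""
        self.files[path] = b""
        return 0

    def write(self, path, data, offset):
        """Write data at offset, zero-filling any gap; return bytes written."""
        if path not in self.files:
            return -errno.ENOENT
        old = self.files[path]
        if offset > len(old):
            old += b"\0" * (offset - len(old))
        self.files[path] = old[:offset] + data + old[offset + len(data):]
        return len(data)

    def read(self, path, size, offset):
        """Return up to size bytes starting at offset."""
        if path not in self.files:
            return -errno.ENOENT
        return self.files[path][offset:offset + size]
```

Keeping the storage in a plain dictionary makes every operation a few lines long, which is the point of a stepwise teaching example; permissions, timestamps and sub-directories are left out on purpose.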

I think providing fun or useful examples is also a good way to get students to use what I’m teaching, so I’ve added some ideas I’ve had: DNS Filesystem, on-the-fly markdown compiling filesystem, unzip filesystem and uncomment filesystem.

They all provide something that could be seen as useful, in a way that’s easy to teach, in just some tens of lines. And, in case my comments/documentation are too long to read, uncommentfs will happily strip all comments and whitespace automatically! 😉

So… I will be delivering my talk tomorrow (2024.10.10, 18:30 GMT-6) at ChiPy (virtually). I am also presenting this talk, virtually as well, at Jornadas Regionales de Software Libre in Santa Fe, Argentina, next week. And also in November, in person, at nerdear.la, which will be held in Mexico City for the first time.

Of course, I will also share this project with my students in the next couple of weeks… And hope it manages to lure them into implementing FUSE in Python. At some point, I shall report!

Categories: FLOSS Project Planets

Freexian Collaborators: Debian Contributions: Packaging Pydantic v2, Reworking of glib2.0 for cross bootstrap, Python archive rebuilds and more! (by Anupa Ann Joseph)

Planet Debian - Wed, 2024-10-09 20:00
Debian Contributions: 2024-09

Contributing to Debian is part of Freexian’s mission. This article covers the latest achievements of Freexian and their collaborators. All of this is made possible by organizations subscribing to our Long Term Support contracts and consulting services.

Pydantic v2, by Colin Watson

Pydantic is a useful library for validating data in Python using type hints: Freexian uses it in a number of projects, including Debusine. Its Debian packaging had been stalled at 1.10.17 in testing for some time, partly due to needing to make sure everything else could cope with the breaking changes introduced in 2.x, but mostly due to needing to sort out packaging of its new Rust dependencies. Several other people (notably Alexandre Detiste, Andreas Tille, Drew Parsons, and Timo Röhling) had made some good progress on this, but nobody had quite got it over the line and it seemed a bit stuck.

Colin upgraded a few Rust libraries to new upstream versions, packaged rust-jiter, and chased various failures in other packages. This eventually allowed getting current versions of both pydantic-core and pydantic into testing. It should now be much easier for us to stay up to date routinely.

Reworking of glib2.0 for cross bootstrap, by Helmut Grohne

Simon McVittie (not affiliated with Freexian) earlier restructured libglib2.0-dev such that it would absorb more functionality and in particular provide tools for working with .gir files. Those tools practically require being run for their host architecture (in practice, this means running under qemu-user), which is at odds with the requirements of architecture cross bootstrap. The qemu requirement was expressed in package dependencies and also made people unhappy when attempting to use libglib2.0-dev for i386 on amd64 without resorting to qemu. The use of qemu in architecture bootstrap is particularly problematic as it tends to not be ready at the time bootstrapping is needed.

As a result, Simon proposed and implemented the introduction of a libgio-2.0-dev package providing a subset of libglib2.0-dev that does not require qemu. Packages should continue to use libglib2.0-dev in their Build-Depends unless involved in architecture bootstrap. Helmut reviewed and tested the implementation and integrated the necessary changes into rebootstrap. He also prepared a patch for libverto to use the new package and proposed adding forward compatibility to glib2.0.

Helmut continued working on adding cross-exe-wrapper to architecture-properties and implemented autopkgtests, later improved by Simon. The cross-exe-wrapper package now provides a generic mechanism to run a program on a different architecture, using qemu only when needed. For instance, a dependency on cross-exe-wrapper:i386 provides an i686-linux-gnu-cross-exe-wrapper program that can be used to wrap an ELF executable for the i386 architecture. When installed on amd64 or i386 it will skip installing or running qemu, but for other architectures qemu will be used automatically. This facility can be used to support cross building with targeted use of qemu in cases where running host code is unavoidable, as is the case for GObject introspection.

This concludes the joint work with Simon and Niels Thykier on glib2.0 and architecture-properties resolving known architecture bootstrap regressions arising from the glib2.0 refactoring earlier this year.

Analyzing binary package metadata, by Helmut Grohne

As Guillem Jover (not affiliated with Freexian) continues to work on adding metadata tracking to dpkg, the question arises how this affects existing packages. The dedup.debian.net infrastructure provides an easy playground to answer such questions, so Helmut gathered file metadata from all binary packages in unstable and performed an explorative analysis. Some results include:

Guillem also performed a cursory analysis and reported other problem categories such as mismatching directory permissions for directories installed by multiple packages and thus gained a better understanding of what consistency checks dpkg can enforce.

Python archive rebuilds, by Stefano Rivera

Last month Stefano started to write some tooling to do large-scale rebuilds in debusine, starting with finding packages that had already started to fail to build from source (FTBFS) due to the removal of setup.py test. This month, Stefano did some more rebuilds, starting with experimental versions of dh-python.

During the Python 3.12 transition, we had added a dependency on python3-setuptools to dh-python, to ease the transition. Python 3.12 removed distutils from the stdlib, but many packages were expecting it to still be available. Setuptools contains a version of distutils, and dh-python was a convenient place to depend on setuptools for most package builds. This dependency was never meant to be permanent. A rebuild without it resulted in mass-filing about 340 bugs (and around 80 more by mistake).

A new feature in Python 3.12 was to have unittest’s test runner exit with a non-zero return code if no tests were run. We added this feature to be able to detect tests that are not being discovered by mistake. We are ignoring this failure, as we wouldn’t want to suddenly cause hundreds of packages to fail to build if they have no tests. Stefano did a rebuild to see how many packages were affected, and found that around 1000 were. The Debian Python community has not come to a conclusion on how to move forward with this.
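
The condition being detected can be illustrated with a small sketch. This is not the check used by dh-python or by the rebuild tooling; `ran_no_tests`, `EmptyTests` and `OneTest` are hypothetical names, and the sketch inspects the `testsRun` counter of the result object instead of the process exit code, so it behaves the same on Python versions older than 3.12.

```python
import io
import unittest


class EmptyTests(unittest.TestCase):
    """A test case class in which no test methods are discovered."""

    def helper_not_a_test(self):
        # Not collected: the name does not start with "test".
        return 42


class OneTest(unittest.TestCase):
    """A test case class with a single discoverable test."""

    def test_addition(self):
        self.assertEqual(1 + 1, 2)


def ran_no_tests(test_case_class):
    """Run the tests of one class and report whether nothing was executed."""
    suite = unittest.TestLoader().loadTestsFromTestCase(test_case_class)
    # Send the runner's report to a buffer so the sketch stays quiet.
    runner = unittest.TextTestRunner(stream=io.StringIO())
    result = runner.run(suite)
    return result.testsRun == 0
```

Here `ran_no_tests(EmptyTests)` reports an empty run even though the run itself is "successful", which is exactly the situation that goes unnoticed when only test failures are treated as errors.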

As soon as Python 3.13 release candidate 2 was available, Stefano did a rebuild of the Python packages in the archive against it. This was a more complex rebuild than the others, as it had to be done in stages. Many packages need other Python packages at build time, typically to run tests. So transitions like this involve some manual bootstrapping, followed by several rounds of builds. Not all packages could be tested, as not all their dependencies support 3.13 yet. The result was around 100 bugs in packages that need work to support Python 3.13. Many other packages will need additional work to properly support Python 3.13, but being able to build (and run tests) is an important first step.

Miscellaneous contributions
  • Carles prepared the update of python-pyaarlo package to a new upstream release.

  • Carles worked on updating python-ring-doorbell to a new upstream release. Unfinished, pending to package a new dependency python3-firebase-messaging RFP #1082958 and its dependency python3-http-ece RFP #1083020.

  • Carles improved po-debconf-manager. Main new feature is that it can open Salsa merge requests. Aiming for a lightning talk in MiniDebConf Toulouse (November) to be functional end to end and get feedback from the wider public for this proof of concept.

  • Carles helped one translator to use po-debconf-manager (added compatibility for bullseye, fixed other issues) and reviewed 17 package templates.

  • Colin upgraded the OpenSSH packaging to 9.9p1.

  • Colin upgraded the various YubiHSM packages to new upstream versions, enabled more tests, fixed yubihsm-shell build failures on some 32-bit architectures, made yubihsm-shell build reproducibly, and fixed yubihsm-connector to apply udev rules to existing devices when the package is installed. As usual, bookworm-backports is up to date with all these changes.

  • Colin fixed quite a bit of fallout from setuptools 72.0.0 removing setup.py test, backported a large upstream patch set to make buildbot work with SQLAlchemy 2.0, and upgraded 25 other Python packages to new upstream versions.

  • Enrico worked with Jakob Haufe to get him up to speed for managing sso.debian.org

  • Raphaël removed spam entries in the list of teams on tracker.debian.org (see #1080446), and he applied a few external contributions, fixing a rendering issue and replacing the DDPO link with a more useful alternative. He also gave feedback on a couple of merge requests that required more work. As part of the analysis of the underlying problem, he suggested to the ftpmasters (via #1083068) to auto-reject packages having the “too-many-contacts” lintian error, and he raised the severity of #1076048 to serious to actually have that 4 year old bug fixed.

  • Raphaël uploaded zim and hamster-time-tracker to fix issues with Python 3.12 getting rid of setuptools. He also uploaded a new gnome-shell-extension-hamster to cope with the upcoming transition to GNOME 47.

  • Helmut sent seven patches and sponsored one upload for cross build failures.

  • Helmut uploaded a Nagios/Icinga plugin check-smart-attributes for monitoring the health of physical disks.

  • Helmut collaborated on sbuild reviewing and improving a MR for refactoring the unshare backend.

  • Helmut sent a patch fixing coinstallability of gcc-defaults.

  • Helmut continued to monitor the evolution of the /usr-move. With more and more key packages such as libvirt or fuse3 fixed, we’re moving into the boring long-tail of the transition.

  • Helmut proposed updating the meson buildsystem in debhelper to use env2mfile.

  • Helmut continued to update patches maintained in rebootstrap. Due to the work on glib2.0 above, rebootstrap moves a lot further, but still fails for any architecture.

  • Santiago reviewed some Merge Requests in Salsa CI, such as: !478, proposed by Otto to extend the information about how to use additional runners in the pipeline, and !518, proposed by Ahmed to add support for Ubuntu images, which will help to test how some Debian packages, including the complex MariaDB, are built on Ubuntu.

    Santiago also prepared !545, which will make the reprotest job more consistent with the result seen on reproducible-builds.

  • Santiago worked on different tasks related to DebConf 25. Especially he drafted the fundraising brochure (which is almost ready).

  • Thorsten Alteholz uploaded package libcupsfilter to fix the autopkgtest and a dependency problem of this package. After package splix was abandoned by upstream and OpenPrinting.org adopted its maintenance, Thorsten uploaded their first release.

  • Anupa published posts on the Debian Administrators group in LinkedIn and moderated the group, one of the tasks of the Debian Publicity Team.

  • Anupa helped organize DebUtsav 2024. It had over 100 attendees, with hands-on sessions on making initial contributions to the Linux kernel, Debian packaging, submitting documentation to the Debian wiki and assisting with Debian installations.

Categories: FLOSS Project Planets

KDE Gear 24.08.2

Planet KDE - Wed, 2024-10-09 20:00

Over 180 individual programs plus dozens of programmer libraries and feature plugins are released simultaneously as part of KDE Gear.

Today they all get new bugfix source releases with updated translations, including:

  • dolphin: Ignore trailing slashes when comparing place URLs (Commit)
  • kate: Fix session restore of tabs/views of untitled documents (Commit, fixes bug #464703, bug #462112 and bug #462523)
  • konsole: Fix a crash when sending OSC 4 (RGB) color outside the 256 range (Commit, fixes bug #494205)

Distro and app store packagers should update their application packages.

Categories: FLOSS Project Planets

Ben Hutchings: FOSS activity in September 2024

Planet Debian - Wed, 2024-10-09 18:57
Categories: FLOSS Project Planets

GNUnet News: GNUnet 0.22.1

GNU Planet! - Wed, 2024-10-09 18:00
GNUnet 0.22.1

This is a bugfix release for gnunet 0.22.0. It addresses some issues in HELLO URI handling and formatting as well as regressions in the DHT subsystem along with other bug fixes.

Links

The GPG key used to sign is: 3D11063C10F98D14BD24D1470B0998EF86F59B6A

Note that due to mirror synchronization, not all links may be functional early after the release. For direct access try https://ftp.gnu.org/gnu/gnunet/

Categories: FLOSS Project Planets

How we passed the AI conundrums

Open Source Initiative - Wed, 2024-10-09 13:01

Some people believe that full unfettered access to all training data is paramount. This group argues that anything less than all the data would compromise the Open Source principles, forever removing full reproducibility of AI systems, transparency, security and other outcomes. We’ve heard them and we’ve provided a solution rooted in decades of Open Source practice.

To have the chance for powerful Open Source AI systems to exist in any domain, the OSI community has incorporated in the Definition this principle: 

An Open Source AI needs to make available three kinds of components: the software used to create the dataset and run the training, the model parameters and the code to run inference, and finally all the data that can be made available legally.

Recognizing that there are four kinds of “data”, each with its own legal frameworks allowing different freedoms of distribution, we bypass what Stephen O’Grady called the “AI conundrums” and give Open Source AI builders a chance to build freedom-respecting alternatives to pretty much any proprietary AI.

Limiting Open Source AI only to systems trainable on freely distributable data would relegate Open Source AI to a niche. There are abundant motives to reject this limitation. One is that the amount of freely and legally shareable data is a tiny fraction of what is necessary to train powerful systems. Additionally, it would exclude Open Source AI from areas where data cannot be shared, like medicine or anything dealing with personal or private data. What remains for “Open Source AI” would be tiny.

The fact is, mixing openly distributable and non-distributable data is very similar to a reality we are very familiar with: Open Source software built with proprietary compilers and system libraries.

Is GNU Emacs Open Source software?

I’m sure you’d answer yes (and some of you will say “well, actually it’s free software”) and we’ll all agree. Below is a rough diagram of Emacs built for the GNOME desktop on a modern Linux distribution. Emacs depends on a few system libraries that GNOME provides with OSI-Approved Licenses. The whole stack is Open Source these days and one can distribute Emacs on a disk with all its dependencies without too much legal trouble. Imagine scientists who want to freeze the whole environment of an experiment they made; they could package all the pieces of a system like this without trouble and distribute it all with their paper. No problem here.

Now let’s go back to an age when Linux systems weren’t ready. When Stallman started writing Emacs, there was no GNOME and no Linux, no gcc and no glibc. He thought very early on that in order to have more freedom, he had to create a wedge to allow Emacs to run on proprietary software.

Emacs on the latest Solaris versions would look something like this: some pieces like X11 and Gstreamer are Open Source. Others, like libc, aren’t. The hypothetical scientists from before couldn’t really freeze their full scientific environment. All they could say in their paper was: “We used Emacs from this CVS version, built with gcc version X with this makefile; tar.gz attached” and make a list of the operating system’s version and library versions they used. That’s because they have the right only to distribute Emacs, X11, some libraries and not the rest of Solaris.

Is Emacs on Solaris Open Source? Of course it is, even though the source code for the system libraries is not available.

One more question, Emacs on Mac OS: it can only be built with a proprietary compiler on proprietary GUI and other proprietary libraries.

Is Emacs on Mac Open Source? Of course it is. Can you fully study Emacs on Mac OS? For Emacs, yes. For the MacOS components, no. There are many programs that run only on MacOS or Windows: for OSI, those are Open Source. Would someone argue that they’re not “really Open Source” because you can’t see “everything?” Some people might but we’ve learned to live with that, adding governance rules in addition to those of the Open Source Definition. Debian for example requires that programs are Open Source and support multiple hardware platforms; the ASF graduates only projects that are Open Source and have a diverse community of contributors. If you only want to use Open Source applications running on Open Source stacks, you can decide that! Just as you can decide that your company will only acquire Open Source software whose copyright is owned by multiple entities. 

These are all additional requirements built on top of the base floor set by the Open Source Definition.

For AI, you can do the same: You can say “I will only use Open Source AI built with open data, because I don’t want to trust anything less than that.” A large organization could say “I will buy only Open Source AI that allows me to audit their full dataset, including unshareable data.” You can do all that. Open Source AI is the floor that you can build on, like the OSD.

Bypassing the conundrums

We’ve looked for a solution for almost three years and this is it: Require all the data that is legally shareable, and for the other data provide all the details. It’s exactly what we’ve been doing for Open Source software: 

You developed a text editor for Mac OS but you can’t share the system libraries? Fine, we’ll fork it: give us all the code you can legally share with an OSI-Approved License and we’ll rip the dependencies and “liberate” it to run on GNU. The editor will be slightly different, like code that runs on some ARM+Linux systems behaves differently on Intel+Windows for the different capabilities of the underlying hardware and OS, but it’s still Open Source.

For Open Source AI it’s a similar dance: You can’t legally give us all the data? Fine, we’ll fork it. For example, you made an AI that recognizes bone cancer in humans but the data can’t be shared. We’ll fork it! Tell us exactly how you built the system, how you trained it, share the code you used, and an anonymized sample of the data you used so we can train on our X-ray images. The system will be slightly different but it’s still Open Source AI.

If we want to have broad availability of powerful alternatives to proprietary AI systems that respect the freedoms of users and deployers, we must recognize conditions that make sense for the domain of AI. These examples of proprietary compilers and system libraries used to build Open Source software prove that there is room for similar conditions when talking about Code, Data and Parameters within the definition of Open Source AI.

Categories: FLOSS Research

FSF News: Free Software Foundation to serve on "artificial intelligence" safety consortium

GNU Planet! - Wed, 2024-10-09 10:05
BOSTON (October 8, 2024) -- The Free Software Foundation (FSF) has announced that it is taking part in the US National Institute of Standards and Technology (NIST)'s consortium on the safety of (so-called) artificial intelligence, particularly with reference to "generative" AI systems. The FSF will ensure the free software perspective is adequately represented in these discussions.
Categories: FLOSS Project Planets

Real Python: Build a Contact Book App With Python, Textual, and SQLite

Planet Python - Wed, 2024-10-09 10:00

Building projects is a great way to learn programming and have fun at the same time. When you work on a project, you apply different coding skills simultaneously, which is good practice for what you’ll do in a real-life project. In this tutorial, you’ll create a contact book application with a text-based interface (TUI) based on Python and Textual. To store the contact data, your app will use an SQLite database.

In this tutorial, you’ll learn how to:

  • Create the contact book app’s TUI using Textual
  • Handle the database operations using SQLite
  • Connect the app’s TUI with the database code and make it functional

At the end of this project, you’ll have a functional contact book application that will allow you to store and manage your contact information.

To get the complete source code for the application and the code for every step in this tutorial, click the link below:

Get Your Code: Click here to download the free sample code you’ll use to build a contact book app with Python, Textual, and SQLite.

Demo: A Contact Book Built With Python and Textual

Contact or address books are a widely used type of application. They can be found on phones and computers, allowing users to store and manage contact information for family, friends, coworkers, and so on.

In this tutorial, you’ll code a contact book TUI app with Python, Textual, and SQLite. Here’s a demo of how your contact book will look once you’ve followed all the steps:

Your contact book will provide a basic set of features for this type of application, and you’ll be able to display, add, and remove the information in your contacts list.

Project Overview

To build your contact book app, you’ll organize the code in a few modules under a package. In this tutorial, you’ll use the following directory structure:

rpcontacts_project/
│
├── rpcontacts/
│   ├── __init__.py
│   ├── __main__.py
│   ├── database.py
│   ├── rpcontacts.tcss
│   └── tui.py
│
├── README.md
└── requirements.txt

The root directory of your project is rpcontacts_project/. Inside, there’s an rpcontacts/ subdirectory that holds the application’s main package.

You’ll cover the content of each file in this tutorial. The name of each file will give you an idea of its role in the application.

For example, __main__.py will host the application, and database.py will provide database-related code. Similarly, rpcontacts.tcss is a CSS file that will allow you to tweak the visual style of your Textual app. Finally, tui.py will contain the code to generate the app’s TUI, including the main screen and a couple of auxiliary screens or dialogs.

Prerequisites

To get the most out of this project, you should have some previous knowledge of how to lay out a Python project and work with SQLite databases. You should also know the basics of working with Python classes. Some knowledge about writing CSS code would also be a plus.

To satisfy these knowledge requirements, you can take a look at the following resources:

Don’t worry if you don’t have all of the prerequisite knowledge before starting this tutorial—that’s completely okay! You’ll learn through the process of getting your hands dirty as you build the project. If you get stuck, then take some time to review the resources linked above. Then, get back to the code.

The contact book application you’ll build in this tutorial has a single external dependency, which is Textual. This library provides a rapid application development framework that allows you to create apps you can run in your terminal and browser.

To follow best practices in your development process, you can start by creating a virtual environment and then install Textual using pip:

Read the full article at https://realpython.com/contact-book-python-textual/ »

[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

Categories: FLOSS Project Planets

Stefanie Molin: Mind Your Image Metadata

Planet Python - Wed, 2024-10-09 09:00
Most devices record a variety of metadata when generating images. While some of that information *may* be innocuous, you could end up exposing the GPS coordinates to your home if you aren't careful. In this article, I provide a brief introduction to image metadata, and then show you how to remove it with `exif-stripper`.
Categories: FLOSS Project Planets

CTI Digital: Drupal CMS: A New Era for Non-Technical Users

Planet Drupal - Wed, 2024-10-09 08:55

Drupal CMS (formerly known as Drupal Starshot) is set to revolutionise how non-technical users engage with Drupal, enabling them to get started with just a few clicks. From installing Drupal directly in the browser to swiftly selecting Recipes that tailor the site to their specific needs, along with utilising AI-driven site-building tools, Drupal CMS is transforming the introduction of non-developers to the platform.

Unsurprisingly, Drupal CMS was a hot topic at DrupalCon, where an entire track of talks was dedicated to the various initiatives that constitute one of the most ambitious changes in Drupal's history. Dries Buytaert, the co-founder of Drupal and an influential leader in the open-source community, provided invaluable insights in his keynote presentation, showcasing the latest enhancements in Drupal CMS.

Categories: FLOSS Project Planets

1xINTERNET blog: 1xINTERNET at DrupalCon Barcelona 2024

Planet Drupal - Wed, 2024-10-09 08:00

This year, DrupalCon took place in Barcelona, and we were proud to be Platinum Sponsors. DrupalCon is a vibrant celebration of the Drupal community, bringing together developers, designers, marketers, and business leaders to share ideas, collaborate and innovate. This year we thoroughly enjoyed connecting with other Drupal users and makers.

Categories: FLOSS Project Planets

Cursor Size Problems In Wayland, Explained

Planet KDE - Wed, 2024-10-09 05:36

I've been fixing cursor problems on and off in the last few months. Here's a recap of what I've done, an explanation of some cursor size problems you might encounter, and how new developments like the Wayland cursor shape protocol and SVG cursors might improve the situation.

(I'm by no means an expert on cursors, X11 or Wayland, so please correct me if I'm wrong.)

Why don't we have cursors in the same size anymore?

My involvement with cursors started at the end of 2023, when KDE Plasma 6.0 was about to be released. A major change in 6.0 was enabling Wayland by default. And if you enabled global scaling in Wayland, especially with a fractional scale like 2.5x, cursor sizes would be a mess across various apps (the upper row: Breeze cursors in Plasma 6.0 Beta 1, Wayland, 2.5x global scale; the lower row: the same cursors in Plasma 6.0):

So I dug into the code of my favorite terminal emulator, Kitty, which at the time drew the cursor in a slightly smaller size than it should be (similar to vscode in the above image). I gained some understanding of the problem, and eventually fixed it. Let me explain.

How to draw cursors in the same size in different apps?

In X11, there used to be a standard set of cursors, but nowadays most apps use the XCursor lib to load a (user-specified) cursor theme and draw the cursor themselves. So in order to have cursors in the same theme and size across apps, we need to make sure that:

  1. Apps get the same cursor theme and size from the system.
  2. Apps draw the cursor in the same way.

The transition to Wayland created difficulties in both points:

1. Get the same cursor theme and size from the system

It used to be simple in X11: we have Xcursor.size and Xcursor.theme in xrdb, also XCURSOR_SIZE and XCURSOR_THEME in environment variables. Setting them to the same value would make sure that all apps get the same cursor theme and size.

But Wayland apps don't use xrdb, and they interpret XCURSOR_SIZE differently: in X11, the size is in physical pixels, but in Wayland it's in logical pixels. E.g., if you have a cursor size 24 and global scale 2x, then in X11, XCURSOR_SIZE should be 48, but in Wayland it should be 24.

The Wayland way is necessary. Imagine you have two monitors with different DPI, e.g. they are both 24" but monitor A is 1920x1080, while monitor B is 3840x2160. You set scale=1 for A and scale=2 for B, so UI elements would be the same size on both monitors. Then you would also want the cursor to be of the same size on both monitors, which requires it to have 2x more physical pixels on B than on A, but it would be the same logical pixels.
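To make the difference concrete, here is a minimal Python sketch of the arithmetic described above. The function names are made up for illustration and do not belong to any toolkit; the sketch only encodes the rule that X11 reads XCURSOR_SIZE as physical pixels while Wayland reads it as logical pixels.

```python
def xcursor_size_for(logical_size: int, scale: float, session: str) -> int:
    """Value XCURSOR_SIZE would need so the cursor looks right (hypothetical helper).

    X11 apps treat the value as physical pixels, so the scale must be baked in.
    Wayland apps treat it as logical pixels and apply the scale themselves.
    """
    if session == "x11":
        return round(logical_size * scale)
    if session == "wayland":
        return logical_size
    raise ValueError(f"unknown session type: {session!r}")


def physical_size(logical_size: int, scale: float) -> int:
    """Physical pixels the cursor covers on an output with the given scale."""
    return round(logical_size * scale)


# Cursor size 24 with a 2x global scale, as in the example in the text.
print(xcursor_size_for(24, 2, "x11"))      # 48
print(xcursor_size_for(24, 2, "wayland"))  # 24

# Monitor A (scale 1) and monitor B (scale 2): one logical size, two physical sizes.
print([physical_size(24, s) for s in (1, 2)])  # [24, 48]
```

A single environment variable cannot hold both 48 and 24 at once, which is the conflict the next paragraph describes.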

So Plasma 6.0 no longer sets the two environment variables, because XCURSOR_SIZE can't be simultaneously correct for both X11 and Wayland apps. But without them and xrdb, Wayland apps no longer have a standard way to get the cursor theme and size. Instead, different frameworks / toolkits have their own ways. In Plasma, KDE / Qt apps get them from the Qt platform integration plugin provided by Plasma, GTK4 apps from ~/.config/gtk-4.0/settings.ini (also set by Plasma), Flatpak GTK apps from the GTK-specific configs in XDG Settings Portal.

The last one is particularly weird, as you need to install xdg-desktop-portal-gtk in order to fix Flatpak apps in Plasma, which surprised many. It might seem like a hack, but it's not. Plasma officially recommends installing xdg-desktop-portal-gtk, and this was suggested by GNOME developers.

But what about 3rd-party Wayland apps besides GTK and Qt? The best hope is to read settings in either the GTK or the Qt way, piggy-backing on the two major toolkits, assuming that the DE would at least take care of the two.

(IMHO either Wayland or the XDG Settings Portal should provide a standard way for apps to get the cursor theme and size.)

That was part of the problem in Kitty. It used to read settings from the GTK portal, but only under a GNOME session. I changed it to always try to read from the portal, even if under Plasma. But that's not the end of the story...

2. Draw the cursor in the same way

It's practically a non-issue in X11, as the user usually sets a size that the cursor theme provides, and the app just draws the cursor images as-is. But if you do set a cursor size not available in the theme (you can't do that in the cursor theme settings UI, but you can manually set XCURSOR_SIZE), you'll open a can of worms: various toolkits / apps deal with it differently:

  1. Some just use the closest size available (Electron and Kitty at the time), so it can be a bit smaller.
  2. Some use the XCursor default size 24, so it's a lot smaller.
  3. Some scale the cursor to the desired size, and the scaling algorithm might be different, resulting in pixelated or blurry cursors; Also they might scale from either the default size or the closest size available, resulting in very blurry (GTK) or slightly blurry (Qt) cursors.

The situation becomes worse with Wayland, as the user now specifies the size in logical pixels, and apps need to multiply it by the global scale to get the size in physical pixels, then try to load a cursor in that size. (If the app loads the cursor in the logical size, then either the app or the compositor needs to scale it, resulting in a blurry / pixelated cursor.) With fractional scaling, it's even more likely that the required physical size is not available in the theme (which typically has only 2 to 5 sizes), and you see the result in the picture above.
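The three behaviors listed above can be modeled in a few lines of Python. This is a simplified sketch, not code from any of the toolkits mentioned: the strategy names are invented here, and the theme is assumed to ship sizes 24, 36 and 48 (the sizes Breeze had at the time, according to this post).

```python
DEFAULT_SIZE = 24  # the XCursor default size mentioned in the list above


def closest_size(available, wanted):
    """Pick the available size nearest to the wanted one (ties go to the smaller)."""
    return min(available, key=lambda s: (abs(s - wanted), s))


def resolve(available, logical, scale, strategy):
    """Return (drawn_size, resample_factor) for a hypothetical app.

    resample_factor == 1.0 means the theme image is drawn as-is;
    anything else means the image was rescaled, which is where blur comes from.
    """
    wanted = round(logical * scale)
    if wanted in available:
        return wanted, 1.0  # the easy case: every strategy agrees
    if strategy == "closest":  # behavior 1: draw the nearest size unscaled
        return closest_size(available, wanted), 1.0
    if strategy == "default":  # behavior 2: draw the default size unscaled
        return DEFAULT_SIZE, 1.0
    if strategy == "scale_from_closest":  # behavior 3, milder resampling
        return wanted, wanted / closest_size(available, wanted)
    if strategy == "scale_from_default":  # behavior 3, stronger resampling
        return wanted, wanted / DEFAULT_SIZE
    raise ValueError(f"unknown strategy: {strategy!r}")


theme = [24, 36, 48]
for strategy in ("closest", "default", "scale_from_closest", "scale_from_default"):
    print(strategy, resolve(theme, 24, 2.5, strategy))
```

With a logical size of 24 and a 2.5x scale, the wanted physical size is 60, which the theme lacks, so the four strategies give four different results. At a 2x scale the wanted size is 48, which exists, and they all agree.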

One way to fix it (and why I didn't do it)

It can be fixed by moving the "when we can't load cursors in the size we need, load a different size and scale it" logic from apps / toolkits to the XCursor lib. When the app requests cursors in a size, instead of returning the closest size available, the lib could scale the image to the requested size. So apps would always get the cursor in the size they ask for, and their own different scaling algorithms won't get a chance to run.

Either the default behavior can be changed, or it can be hidden behind a new option. But I didn't do that, because I felt at the time that it would be difficult to either convince XCursor lib maintainers to make a (potentially breaking) change to the default behavior, or to go around convincing all apps / toolkits to use a new option.

My fix (or shall we say workaround)

Then it came to me that although I can't fix all these toolkits / apps, they all seem to work the same way if the required physical size is available in the theme: they just draw the cursor as-is. So I added a lot of sizes to the Breeze theme. It only had sizes 24, 36 and 48 at the time, but I added physical sizes corresponding to a logical size of 24 and all global scales that Plasma allows, from 0.5x to 3x. So it's 12, 18, 24 ... all the way to 72.
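The list of sizes can be derived with a short calculation. The sketch below assumes a scale step of 0.25, which is inferred from the sequence 12, 18, 24 given above and is not stated explicitly in this post; the function name is hypothetical.

```python
from fractions import Fraction


def theme_sizes(logical=24, min_scale=Fraction(1, 2), max_scale=Fraction(3),
                step=Fraction(1, 4)):
    """Physical sizes needed for one logical size across a range of scales.

    Fractions avoid floating-point drift while stepping through the scales.
    The 0.25 step is an assumption inferred from the sizes listed in the text.
    """
    sizes = []
    scale = min_scale
    while scale <= max_scale:
        sizes.append(int(logical * scale))
        scale += step
    return sizes


print(theme_sizes())
# [12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72]
```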

It was easy. The source code of the Breeze theme is SVG (so are most other themes). Then a build script renders it into images using Inkscape, and packages them to XCursor format. The script has a list of the sizes it renders in, so I added a lot more.

And it worked! If you choose Breeze and size 24, then (as in the bottom row in the picture above) various apps draw the cursor in the same size at any global scale available in Plasma.

But this method has its limitations:

  1. We can't do that to 3rd-party themes, as we don't have their source SVG.
  2. It only works if you choose the default size 24. If you choose a different size, e.g. 36, and a global scale 3x, then the physical size 36x3=108 is not available in the theme, and you see the mess again. But we can't add sizes infinitely, as explained in Vlad's blog, the XCursor format stores cursor images uncompressed, so the binary size grows very fast when adding larger sizes.

Both limitations can be lifted with SVG cursors. But before getting to that, let's talk about the "right" way to fix the cursor size problem:

The "right" fix: Wayland cursor shape protocol

The simple and reliable way to get consistent cursors across apps is to not let apps draw the cursor at all. Instead, they only specify the name of the cursor shape, and the compositor draws the cursor for them. This is how Wayland cursor shape protocol works. Apps no longer need to care about the cursor theme and size (well, they might still need the size, if they want to draw custom cursors in the same size as standard shapes), and since the compositor is the only program drawing the cursor, it's guaranteed to be consistent for all apps using the protocol.

(It's quite interesting that we seem to have gone full circle, back to the original server-defined cursor font way in X11.)

Support for this protocol leaves a lot to improve, though. Not all compositors support it. On the client side, both Qt and Electron have the support, but GTK doesn't.

There are merge requests for GTK and Mutter, but GNOME devs requested some modifications in the Wayland protocol before merging them, and the request seems to have been stuck for some months. I hope the recent Wayland "things" could move it out of this seeming deadlock.

Anyway, with this protocol, only the compositor has to be modified to support a new way to draw cursors. This makes it much easier to change how cursors work. So we come to:

SVG cursors

Immediately after the fix in Breeze, I proposed this idea of shipping the source SVG files of the Breeze cursor theme to the end user, and re-generate the XCursor files whenever the user changes the cursor size or global scale. This way, the theme will always be able to provide the exact size requested by apps. (Similar to the "modify XCursor lib" idea, but in a different way.) It would remove the limitation 2 above (and also limitation 1 if 3rd-party themes ship their source SVGs too).

With SVG cursor support in KWin and Breeze, I plan to implement this idea. It would also allow the user to set an arbitrary cursor size, instead of being limited to a predefined list.

Problems you might still encounter today

Huge cursors in GTK4 apps

It's a new problem in GTK 4.16. If you use the Breeze cursor theme and a large global scale like 2x or 3x, you get huge cursors:

It is not limited to Plasma. Using Breeze in GNOME would result in the same problem. To explain it, let me first introduce the concepts of "nominal size" and "image size" in XCursor.

Here is GNOME's default cursor theme, Adwaita:

"Nominal size" is the "cursor size" we are talking about above. It makes the list of sizes you choose from in the cursor theme settings UI. It's also the size you set in XCURSOR_SIZE. "Image size" is the actual size of the cursor image. "Hot spot" is the point in the image where the cursor is pointing at.

Things are a bit different in the Plasma default cursor theme, Breeze:

Unlike Adwaita, the image size is larger than the nominal size. That, combined with a global scale, triggers the bug in GTK4. Explanation of the bug.

XCursor allows the image size to be different from the nominal size. I don't know why it was designed this way, but my guess is that it lets you crop the empty part of the image. This both reduces file size and reduces flickering when the cursor changes (with software cursors under X11). But the image size can also be larger than the nominal size, and Breeze (and a lot of other themes) uses this feature.

You can see in the above images that the "arrow" of nominal size 24 in Breeze is actually similar in size to the same nominal size in Adwaita. But the "badge" in Breeze is further apart, so it can't fit into a 24x24 image. That's why Breeze is built this way. In a sense, "nominal size" is similar to how "font size" works, where it resembles the "main part" of a character in the font, but some characters can have "extra parts" that go through the ceiling or floor.

This problem is already fixed in the main branch of GTK 4, but it's not backported to 4.16 yet, probably because the fix uses a Wayland feature that Mutter doesn't support yet. So at the moment, your only option is to use a different cursor theme whose "nominal size" and "image size" are equal.

Smaller cursors in GTK3 apps (most notably, Firefox)

The cursor code in GTK3 is different from GTK4, with its own limitations. You might find the cursor to be smaller than in other apps, and if you run the app in a terminal, you might see warnings like:

cursor image size (64x64) not an integer multiple of scale (3)

GTK3 doesn't support fractional scales in cursors. So if you have cursor size 24 and global scale 2.5x or 3x, it will use a scale of 3x and try to load a cursor with a nominal size of 24x3=72. It also requires the image size to be an integer multiple of the scale. So if your theme doesn't have a size 72, or it does but the image size is not a multiple of 3, GTK3 falls back to a smaller unscaled cursor.
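The rule in the previous paragraph can be written down as a small decision function. This is a simplified model based only on the description above, not GTK3's actual code: the function name, the dictionary layout of the theme and the returned tuples are all invented for illustration.

```python
import math


def gtk3_cursor_plan(logical_size, global_scale, theme_images):
    """Model the GTK3 behavior described above (simplified, hypothetical).

    theme_images maps a nominal size to its (width, height) image size.
    Returns ("scaled", nominal_size) or ("fallback", reason).
    """
    scale = math.ceil(global_scale)   # no fractional scales: 2.5x becomes 3x
    nominal = logical_size * scale    # e.g. 24 * 3 = 72
    image = theme_images.get(nominal)
    if image is None:
        return ("fallback", f"theme has no nominal size {nominal}")
    width, height = image
    if width % scale or height % scale:
        return ("fallback",
                f"cursor image size ({width}x{height}) "
                f"not an integer multiple of scale ({scale})")
    return ("scaled", nominal)


# A theme whose size-72 image is exactly 72x72 works at 2.5x.
print(gtk3_cursor_plan(24, 2.5, {24: (24, 24), 72: (72, 72)}))
# A theme without size 72 falls back.
print(gtk3_cursor_plan(24, 2.5, {24: (24, 24), 48: (48, 48)}))
# A theme with size 72 but a 100x100 image also falls back.
print(gtk3_cursor_plan(24, 3, {72: (100, 100)}))
```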

End words

OK, this is a long post. Hope I can bring you more cursor goodies in Plasma 6.3 and beyond.

Categories: FLOSS Project Planets

PyCharm: Where To Get Data for Your Data Science Projects

Planet Python - Wed, 2024-10-09 05:00

Whether you’re starting a new project or expanding an existing one, as a data scientist, you’re always on the lookout for new material to explore. Knowing where to get data for data science projects can be challenging, and finding “good data” can be even more difficult. In this article, we’ll look at what makes “good data”, what format that data might be in, where to find it, and what the next steps are. 

What is “good data” for data science projects?

Firstly, we should consider how relevant the dataset is to our work. You can stumble upon lots of datasets that overlap with your work in some way, but it can be difficult to decide which is the best one for you to put your effort into. In this scenario, we’ll briefly explore some of the attributes of the data. 

To start with, how consistent is the dataset? Specifically, are there any missing values? Data might be missing for a variety of acceptable reasons, but it can also be a sign of selection bias or other factors that might skew your results. Often, we can choose to either accept missing data or delete the records that contain it before we do our analysis, but knowing about missing data early in the process can help you make an informed decision to use that dataset or not. 

Along with missing data, it’s worth checking to see if any of the data is duplicated. Duplicated data might be fine, but it might also signify a lack of consistency that could skew your results. Duplicated data might also reduce your confidence in the dataset as a whole, so it’s important to consider when choosing your dataset. 
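As a rough illustration of these two checks, the sketch below counts missing values per column and fully duplicated rows using only the Python standard library. The sample data and the function name are invented for this example; in practice you would likely do the same with a data analysis library.

```python
import csv
import io
from collections import Counter

# Tiny made-up dataset: one missing temperature, one duplicated row.
SAMPLE_CSV = """id,city,temperature
1,Oslo,4.5
2,Lima,
3,Cairo,31.0
3,Cairo,31.0
"""


def quality_report(csv_text):
    """Count rows, missing values per column, and fully duplicated rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if not (value or "").strip():  # empty or absent cell
                missing[column] += 1

    # Rows are duplicates when every column matches another row exactly.
    seen = Counter(tuple(row.items()) for row in rows)
    duplicates = sum(count - 1 for count in seen.values() if count > 1)

    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}


print(quality_report(SAMPLE_CSV))
# {'rows': 4, 'missing': {'temperature': 1}, 'duplicates': 1}
```

Whether a missing value or a duplicate is actually a problem still depends on the dataset; the report only tells you where to look.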

Another aspect to consider for good data is timeliness. The time over which the data was gathered is usually pertinent to the questions you want to answer when you start analyzing it. Checking if the data was collected in the timespan that you’re interested in and considering the continuity of that timespan is helpful. 

When you’re starting your journey into data science and picking your first few datasets to play with, you don’t need to worry about picking the perfect dataset – focus on the process and exploring instead. When you’re ready to learn more about datasets and how to avoid common pitfalls, I recommend you watch this talk from Dr. Jodie Burchell – Garbage data in, garbage models out.

Do you want structured or unstructured data?

Structured data is what you’ll find in a table where each row is an observation, and each column is a variable or field. By contrast, unstructured data usually needs to be pre-processed before you can work with it in a data science project, or it can be used by specialist models that can process it internally. Examples of unstructured data include text, images, and sound. 

As you might have guessed, unstructured data is used more in advanced and specialized subfields in data science, like natural language processing and computer vision. Most data scientists start with, and continue working with, structured data for many of their projects. I recommend that this is where you start, too.

I recommend you keep the notion of structured and unstructured data in mind as we explore standard data formats.

What are standard data formats?

In addition to the quality of the data, we also have to choose between available data formats. You’ll come across two broad types of data formats as a data scientist: downloadable data (often CSV) and databases. 

Downloadable data is nearly always structured data and often takes the form of comma-separated value (CSV) files. These downloads are available from various online repositories. They are among some of the most prolific and most accessible sources of data. If you’re new to data exploration, this is the best place to get started, as they’re easy to find, human-readable, and easy to work with without any extra steps. 

If you’re ready to enter the world of databases, it’s worth understanding that they are further subdivided into relational (SQL) and non-relational (non-SQL) databases. As a broad rule, relational databases contain structured data and non-relational databases contain non-structured data, but determining whether data is structured is not an exact science. Instead, think of non-relational databases as being adaptable to the shape of the data they are storing. 

Databases are commonly used in the following cases: when you have large datasets, when multiple people need to access and modify the data simultaneously, when datasets need to be able to scale, and when data is unstructured (non-SQL only). In addition, if you’re commissioned to do data analysis for your company, you may find that you’re given a database to work with as it’s already in-house. 

PyCharm Professional has excellent support for SQL and NoSQL databases. If your work involves using various databases and writing SQL queries, you can check out our webinar on Visual SQL Development with PyCharm to get more information about the functionality. Alternatively, you can learn how to explore tables in PyCharm without writing a single line of SQL, or how to import your dataset into PyCharm and explore it.

Try PyCharm Professional for free

Where can I find datasets for my data science projects?

Once you’re ready to get some data, there are plenty of resources where you can download datasets to use for your data science project. This is not an exhaustive list, but it’s a good place to start and a natural progression for your data science journey.

UCI Machine Learning Repository

The UCI Machine Learning Repository has over 600 datasets covering a host of exciting topics for you to explore, such as biology, health, physics, and climate. UCI datasets also cover a diverse set of data types, including image, sequential, and time-series data. I recommend looking at a few different datasets and types of data if you’re new to data science, as it will help you expand your understanding of what data often looks like.

Kaggle

Another well-known website for datasets is Kaggle. Not only can you sign up to Kaggle to download datasets for data science projects, but it also has a large community of like-minded people who run company-sponsored competitions designed to help you develop your data science skills. If you’re looking for a famous dataset that you’ve seen used in numerous examples, you’ll almost certainly find it hosted on Kaggle.

Hugging Face

Hugging Face is another resource that is rich in datasets. You can filter the results by modalities, including audio, geospatial, and video, and provide a range for the size of your dataset, which can be particularly helpful when you want to start small. Hugging Face has many natural language and computer vision datasets, so you might want to head over there once you’re past the basics and interested in more specialized fields.

Many more

There are many more places that you can go on your data science journey to find fun datasets to explore. You can check out GitHub for curated open source datasets, FiveThirtyEight for datasets relating to American politics and sports, and lastly, one of my favorites, the UK government, to get datasets relating to public services and the economy in the UK. 

What are the next steps?

Congratulations! You’ve gained a better understanding of what “good data” is, and you know where to look to find datasets for data science projects. Once you’ve chosen a dataset, you’re ready to start preparing and analyzing your data.

Remember, you can use Jupyter notebooks inside PyCharm to explore both file format and database datasets.

You can read or watch a video showing just some of the ways you can use Jupyter notebooks inside PyCharm to boost your productivity on your data science journey with your chosen dataset. 


Categories: FLOSS Project Planets
