Feeds
PyCharm: Where To Get Data for Your Data Science Projects
Whether you’re starting a new project or expanding an existing one, as a data scientist, you’re always on the lookout for new material to explore. Knowing where to get data for data science projects can be challenging, and finding “good data” can be even more difficult. In this article, we’ll look at what makes “good data”, what format that data might be in, where to find it, and what the next steps are.
What is “good data” for data science projects?
Firstly, we should consider how relevant the dataset is to our work. You can stumble upon lots of datasets that overlap with your work in some way, but it can be difficult to decide which is the best one for you to put your effort into. In this scenario, we’ll briefly explore some of the attributes of the data.
To start with, how consistent is the dataset? Specifically, are there any missing values? Data might be missing for a variety of acceptable reasons, but it can also be a sign of selection bias or other factors that might skew your results. Often, we can choose to either accept missing data or delete the records that contain it before we do our analysis, but knowing about missing data early in the process can help you make an informed decision to use that dataset or not.
Along with missing data, it’s worth checking to see if any of the data is duplicated. Duplicated data might be fine, but it might also signify a lack of consistency that could skew your results. Duplicated data might also reduce your confidence in the dataset as a whole, so it’s important to consider when choosing your dataset.
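The checks for missing and duplicated values described above can be scripted before you commit to a dataset. The following is a minimal sketch using only Python’s standard library. The sample data, the column names, and the `summarize_quality` helper are hypothetical, and it assumes that an empty cell means a missing value.

```python
import csv
import io

# Hypothetical sample data; in practice you would open a downloaded CSV file.
SAMPLE_CSV = """city,year,temperature
Oslo,2021,5.9
Oslo,2022,
Lima,2021,19.4
Lima,2021,19.4
"""


def summarize_quality(csv_text):
    """Count missing values per column and fully duplicated rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    # Assumption: an empty or whitespace-only cell counts as missing.
    missing = {
        col: sum(1 for row in rows if not (row[col] or "").strip())
        for col in columns
    }
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row.values())
        if key in seen:
            duplicates += 1  # this exact row has appeared before
        seen.add(key)
    return {"rows": len(rows), "missing": missing, "duplicates": duplicates}


report = summarize_quality(SAMPLE_CSV)
print(report)
```

A report like this does not tell you why values are missing or duplicated, but it gives you numbers to weigh when deciding whether to use the dataset.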
Another aspect to consider for good data is timeliness. The time over which the data was gathered is usually pertinent to the questions you want to answer when you start analyzing it. Checking if the data was collected in the timespan that you’re interested in and considering the continuity of that timespan is helpful.
When you’re starting your journey into data science and picking your first few datasets to play with, you don’t need to worry about picking the perfect dataset – focus on the process and exploring instead. When you’re ready to learn more about datasets and how to avoid common pitfalls, I recommend you watch this talk from Dr. Jodie Burchell – Garbage data in, garbage models out.
Do you want structured or unstructured data?
Structured data is what you’ll find in a table where each row is an observation, and each column is a variable or field. By contrast, unstructured data usually needs to be pre-processed before you can work with it in a data science project, or it can be used by specialist models that can process it internally. Examples of unstructured data include text, images, and sound.
As you might have guessed, unstructured data is used more in advanced and specialized subfields in data science, like natural language processing and computer vision. Most data scientists start with, and continue working with, structured data for many of their projects. I recommend that this is where you start, too.
I recommend you keep the notion of structured and unstructured data in mind as we explore standard data formats.
What are standard data formats?
In addition to the quality of the data, we also have to choose between available data formats. You’ll come across two broad types of data formats as a data scientist: downloadable data (often CSV) and databases.
Downloadable data is nearly always structured data and often takes the form of comma-separated values (CSV) files. These downloads are available from various online repositories and are among the most plentiful and accessible sources of data. If you’re new to data exploration, this is the best place to get started, as they’re easy to find, human-readable, and easy to work with without any extra steps.
If you’re ready to enter the world of databases, it’s worth understanding that they are further subdivided into relational (SQL) and non-relational (non-SQL) databases. As a broad rule, relational databases contain structured data and non-relational databases contain unstructured data, but determining whether data is structured is not an exact science. Instead, think of non-relational databases as being adaptable to the shape of the data they are storing.
Databases are commonly used in the following cases: when you have large datasets, when multiple people need to access and modify the data simultaneously, when datasets need to be able to scale, and when data is unstructured (non-SQL only). In addition, if you’re commissioned to do data analysis for your company, you may find that you’re given a database to work with as it’s already in-house.
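To make the relational versus non-relational distinction concrete, here is a small sketch. It uses Python’s built-in `sqlite3` module for the relational side and plain dictionaries as a stand-in for a document store. The table name, columns, and values are made up for illustration, and a real non-relational database would have its own client library.

```python
import sqlite3

# Relational: the schema is fixed up front, so every row has the same columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (city TEXT, year INTEGER, temperature REAL)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [("Oslo", 2021, 5.9), ("Lima", 2021, 19.4), ("Lima", 2022, 19.8)],
)
lima_avg = conn.execute(
    "SELECT AVG(temperature) FROM measurements WHERE city = ?", ("Lima",)
).fetchone()[0]
conn.close()

# Document-style: each record is a free-form dictionary, so fields can vary
# from one record to the next (plain dicts stand in for a document store).
documents = [
    {"city": "Oslo", "readings": [5.9]},
    {"city": "Lima", "readings": [19.4, 19.8], "tags": ["coastal"]},
]
fields_seen = sorted({key for doc in documents for key in doc})

print(round(lima_avg, 2))
print(fields_seen)
```

The second record carries a `tags` field that the first lacks, which is the kind of flexibility meant by a store that adapts to the shape of the data.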
PyCharm Professional has excellent support for SQL and non-SQL databases. If your work involves using various databases and writing SQL queries, you can check out our webinar on Visual SQL Development with PyCharm to get more information about the functionality. Alternatively, you can learn how to explore tables without writing a single line of SQL with PyCharm and import your dataset into PyCharm and explore it.
Where can I find datasets for my data science projects?
Once you’re ready to find out how to get data, there are plenty of resources you can download to use for your data science project. This is not an exhaustive list, but it’s a good place to start and a natural progression for your data science journey.
UCI Machine Learning Repository
The UCI Machine Learning Repository has over 600 datasets covering a host of exciting topics for you to explore, such as biology, health, physics, and climate. UCI datasets also have a diverse set of data types, including images, sequential, and time series. I recommend looking at a few different datasets and types of data if you’re new to data science, as it will help you expand your understanding of what data often looks like.
Kaggle
Another well-known website for datasets is Kaggle. Not only can you sign up to Kaggle to download datasets for data science projects, but it also has a large community of like-minded people who run company-sponsored competitions designed to help you develop your data science skills. If you’re looking for a famous dataset that you’ve seen used in numerous examples, you’ll almost certainly find it hosted on Kaggle.
Hugging Face
Hugging Face is another resource that is rich in datasets. You can filter the results by modalities, including audio, geospatial, and video, and provide a range for the size of your dataset, which can be particularly helpful when you want to start small. Hugging Face has many natural language and computer vision datasets, so you might want to head over there once you’re past the basics and interested in more specialized fields.
Many more
There are many more places that you can go on your data science journey to find fun datasets to explore. You can check out GitHub for curated open source datasets, FiveThirtyEight for datasets relating to American politics and sports, and lastly, one of my favorites, the UK government, to get datasets relating to public services and the economy in the UK.
What are the next steps?
Congratulations! You’ve gained a better understanding of what “good data” is, and you know where to look to find datasets for data science projects. Once you’ve chosen a dataset, you’re ready to start preparing and analyzing your data.
Remember, you can use Jupyter notebooks inside PyCharm to explore both file format and database datasets.
You can read or watch a video showing just some of the ways you can use Jupyter notebooks inside PyCharm to boost your productivity on your data science journey with your chosen dataset.
Talk Python to Me: #480: Ahoy, Narwhals are bridging the data science APIs
Droptica: Drupal 7 to the latest version migration – Droptica’s top recommended enhancements
Upgrading from Drupal 7 to the latest version opens up a range of benefits, allowing you to leverage a modern CMS. By enhancing areas like content structure, SEO, and security during migration, you can maximize the impact of your investment. But, without Drupal expertise, deciding what to change and improve can be overwhelming during the migration process. Our free checklist, built on Droptica’s experience with clients, helps you explore common migration improvements and decide which ones fit your needs.
Capellic: Migrating critical keys from Lockr to Pantheon Secrets
Django Weblog: Why Django supports the Open Source Pledge
We at the Django Software Foundation are pleased to share that Sentry, alongside other partners, has launched the Open Source Pledge — an initiative designed to address sustainability challenges in open source.
The Open Source Pledge is a commitment for member companies to pay OSS maintainers meaningfully for their work. When maintainers are adequately supported, they can better sustain their projects, ensuring the growth, stability, and security of the broader ecosystem.
The sustainability challenge in the Django community
In our community and OSS at large, the challenge is real and significant. Django packages are often maintained by small teams or even individuals, often unpaid. As the demands on these projects grow, so too does the pressure on the maintainers. And without financial support, maintainers often move on without a clear succession plan. The potential failure of these projects not only impacts the developers involved but also the thousands of companies and millions of users who rely on these critical pieces of infrastructure.
Here are a few assorted examples from Django packages in the top 10 by download counts:
- Is DRF still considered alive?, Moving REST framework forward
- Lots of open PRs with no feedback or action
- Recruiting maintainers
- We need more roadies in jazzband
The Open Source Pledge is simple but impactful: member companies commit a minimum of $2,000 per year, per developer on staff, to support open source maintainers. Additionally, companies are encouraged to publish an annual report detailing their payments, creating transparency and accountability within the community.
We encourage companies of all sizes to join the Pledge and contribute to the sustainability of the software we all depend on. By making a financial commitment, you are not just supporting maintainers—you are investing in the stability, security, and growth of the entire tech ecosystem.
If you're interested in joining the Open Source Pledge or learning more about the sustainability issues facing OSS, please visit the initiative’s page. Together, we can build a stronger, more sustainable open source future. And if you believe in this cause, we encourage you to share this post to help broaden awareness and inspire further commitments from peers and partners.
KPhotoAlbum 5.13.0 released
After almost a year, we’re very pleased to announce a new release of KPhotoAlbum, the Linux/KDE photo management software!
There are two new features/changes:
- The “time ago”/birthday/age calculation has been reworked. Timespans should now be displayed in a nicer (more natural) way. Also, the age of people born on February 29 is now calculated correctly.
- The ‘--db’ command line argument now rejects any file name that is not either an existing directory or an index.xml file within an existing directory (cf. Bug #418647).
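The February 29 case mentioned in the first item is a classic edge case in age calculation. The sketch below shows one common way to handle it in Python. It is not KPhotoAlbum’s implementation; the `age_on` helper is hypothetical, and it assumes that someone born on February 29 reaches their next age on March 1 in non-leap years (other conventions use February 28).

```python
from datetime import date


def age_on(birth, today):
    """Return the number of completed years between birth and today.

    Assumption: a person born on February 29 is treated as having their
    birthday on March 1 in non-leap years.
    """
    years = today.year - birth.year
    # Comparing (month, day) tuples avoids building an invalid date such as
    # February 29 in a non-leap year.
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years


leap_birthday = date(2000, 2, 29)
print(age_on(leap_birthday, date(2023, 2, 28)))  # day before the assumed birthday
print(age_on(leap_birthday, date(2023, 3, 1)))   # assumed birthday in a non-leap year
print(age_on(leap_birthday, date(2024, 2, 29)))  # actual birthday in a leap year
```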
Apart from that, quite a number of bugs have been fixed (cf. the ChangeLog for more info): #477529, #477530, #477531, #477532, #478944, #479483, #481181, #483266, #444744 and #493849. And on top some bugs that weren’t reported as a bug in the first place :-)
One additional change that should be mostly interesting for the distributors is: The key used for signing the release has been updated. All PGP keys used to sign KDE software releases can be found in the sysadmin/release-keyring repo. My currently used key that I used to sign the tarball can also be found there, cf. tleupold@key2.asc.
… and what about Qt 6?!
Fear not! Of course, there will be a Qt6/KF6 release of KPhotoAlbum. We currently have a working Qt6/KF6 branch, so most of the porting is already done. The last thing that’s missing is a Qt6/KF6 release of Marble, which we use to display maps for geographic coordinates in photos (preferably stored there using KGeoTag ;-). It seems like there will be such a release towards the end of the year. We will get KPhotoAlbum ready for Qt6/KF6 shortly afterwards. Stay tuned!
According to git log, the following individuals contributed commits since the last release:
- Boudhayan Bhattacharya
- Oliver Kellogg
- Tobias Leupold
- Randall Rude
- Johannes Zarl-Zierl
Have a lot of fun with KPhotoAlbum 5.13.0 :-)
Drupal Core News: Coding standards proposals for final discussion on 23 October 2024
The Technical Working Group (TWG) is announcing one coding standards change for final discussion. Feedback will be reviewed at the meeting scheduled for 23 October 2024 UTC.
Issues for discussion
The Coding Standards project page outlines the process for changing Drupal coding standards.
Join the team working on Coding Standards
Join #coding-standards in Drupal Slack to meet and work with others on improving the Drupal coding standards. We work on improving our standards as well as implementing them in the core software.
Thorsten Alteholz: My Debian Activities in September 2024
This month I accepted 441 and rejected 29 packages. The overall number of packages that got accepted was 448.
I couldn’t believe my eyes, but this month I really accepted the same number of packages as last month.
Debian LTS
This was the hundred-twenty-third month in which I did some work for the Debian LTS initiative, started by Raphael Hertzog at Freexian. During my allocated time I uploaded or worked on:
- [unstable] libcupsfilters security update to fix one CVE related to validation of IPP attributes obtained from remote printers
- [unstable] cups-filters security update to fix two CVEs related to validation of IPP attributes obtained from remote printers
- [unstable] cups security update to fix one CVE related to validation of IPP attributes obtained from remote printers
- [DSA 5778-1] prepared package for cups-filters security update to fix two CVEs related to validation of IPP attributes obtained from remote printers
- [DSA 5779-1] prepared package for cups security update to fix one CVE related to validation of IPP attributes obtained from remote printers
- [DLA 3905-1] cups-filters security update to fix two CVEs related to validation of IPP attributes obtained from remote printers
- [DLA 3904-1] cups security update to fix one CVE related to validation of IPP attributes obtained from remote printers
Despite the announcement, the package libppd in Debian is not affected by the CVEs related to CUPS. By pure chance, there is an unrelated package with the same name in Debian. I also answered some questions about the CUPS-related uploads. Due to the CUPS issues, I postponed my work on other packages to October.
Last but not least I did a week of FD this month and attended the monthly LTS/ELTS meeting.
Debian ELTS
This month was the seventy-fourth ELTS month. During my allocated time I uploaded or worked on:
- [ELA-1186-1] cups-filters security update for two CVEs in Stretch and Buster to fix the IPP attribute related CVEs.
- [ELA-1187-1] cups-filters security update for one CVE in Jessie to fix the IPP attribute related CVEs (the version in Jessie was not affected by the other CVE).
I also started to work on updates for cups in Buster, Stretch and Jessie, but their uploads will happen only in October.
I also did a week of FD and attended the monthly LTS/ELTS meeting.
Debian Printing
This month I uploaded …
- … libcupsfilters to also fix a dependency and autopkgtest issue besides the security fix mentioned above.
- … splix for a new upstream version. This package is managed now by OpenPrinting.
Last but not least I tried to prepare an update for hplip. Unfortunately this is a nerve-stretching task and I need some more time.
This work is generously funded by Freexian!
Debian Matomo
This month I even found some time to upload packages that are dependencies of Matomo …
This work is generously funded by Freexian!
Debian Astro
This month I uploaded a new upstream or bugfix version of:
- … openvlbi
- … indi-playerone
- … libsbig
- … indi-pentax
- … indi-sbig
- … indi-fishcamp
- … indi-inovaplx
- … libfishcamp
- … libsbig
- … libplayeronecamera
- … libplayerone
- … libahp-gt
- … libahp-xc
Most of the uploads were related to package migration to testing. As some of them are in non-free or contrib, one has to build all binary versions. From my point of view handling packages in non-free or contrib could be very much improved, but well, they are not part of Debian …
Anyway, starting in December there is an Outreachy project that takes care of automatic updates of these packages. So hopefully it will be much easier to keep those packages up to date. I will keep you informed.
Debian IoT
This month I uploaded new upstream or bugfix versions of:
- … pywws
This month I did source uploads of all the packages that were prepared last month by Nathan and started the transition. It went rather smoothly, except for a few packages where the new version did not propagate to the tracker and they got stuck with old failing autopkgtests. Anyway, in the end all packages migrated to testing.
I also uploaded new upstream releases or fixed bugs in:
misc
This month I uploaded new upstream or bugfix versions of:
Most of those uploads were needed to help packages to migrate to testing.
PyCoder’s Weekly: Issue #650 (Oct. 8, 2024)
#650 – OCTOBER 8, 2024
In this video course, you’ll learn how Python mutable and immutable data types work internally and how you can take advantage of mutability or immutability to power your code.
REAL PYTHON course
Learn how to run DuckDB in an in-browser Python environment to enable simple querying on remote files, interactive documentation, and easy to use training materials.
ALEX MONAHAN
Looking to add new functionality to your Django app? Learn how to integrate Speech-to-Text and build a working app that transcribes audio files—with 100+ free hours to get started →
ASSEMBLY AI sponsor
This post talks about combining the new experimental free threading feature of Python 3.13 with Asyncio.
CHANGS.CO.UK • Shared by Jamie Chang
Earlier this week Trey considered whether to switch from virtualenvwrapper to using local .venv managed by direnv. He then also started experimenting with uv and Starship. This post explains why and his new configuration.
TREY HUNNER
Some template blocks are meant to be overloaded and forgetting to do so results in rendering bugs. This post talks about creating a new tag that throws an exception which alerts your tests if you forget to overload.
TOM CARRICK
Simplify workloads and elevate customer service. Build customized AI assistants that respond to voice prompts with powerful language and comprehension capabilities - all based on your unique needs with Intel’s OpenVINO toolkit.
INTEL CORPORATION sponsor
Learn what the Arrange, Act, and Assert (AAA) pattern is, how it works, the benefits it offers, and its role in unit test automation. Note: sample code is not in Python, but the concepts apply to all unit testing.
ANTONELLO ZANINI
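Since the linked article’s sample code is not in Python, here is a minimal sketch of the Arrange, Act, Assert layout using Python’s built-in `unittest` module. The `apply_discount` function and the test name are hypothetical and exist only to give the test something to check.

```python
import unittest


def apply_discount(price, percent):
    """Hypothetical function under test: reduce price by a percentage."""
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_ten_percent_off(self):
        # Arrange: set up the inputs the test needs.
        price, percent = 50.0, 10

        # Act: call the code under test exactly once.
        result = apply_discount(price, percent)

        # Assert: check the observable outcome.
        self.assertEqual(result, 45.0)
```

Saved in a test module, this can be run with `python -m unittest`. Keeping the three phases visually separate makes it easier to see what each test sets up, does, and expects.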
This article outlines the system that Rodrigo uses to prepare his Python talks. Steal his ideas and suggestions so that you, too, can start giving talks at your local meetups and at PyCons all over the world.
MATHSPP.COM • Shared by Rodrigo
A singleton pattern is one where only one instance of an object type is allowed at a time. One way to implement this concept is through the use of a decorator. This post teaches you how.
PIETER CLAERHOUT
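As a rough illustration of the idea, and not the linked post’s code, a singleton decorator can be sketched as follows. The `singleton` decorator and `Settings` class are hypothetical names. This version is not thread-safe, and it silently ignores constructor arguments after the first call.

```python
def singleton(cls):
    """Class decorator that returns the same instance on every call."""
    instances = {}  # captured by the closure; maps class -> its only instance

    def get_instance(*args, **kwargs):
        if cls not in instances:
            instances[cls] = cls(*args, **kwargs)
        return instances[cls]

    return get_instance


@singleton
class Settings:
    def __init__(self, env="dev"):
        self.env = env


first = Settings("prod")
second = Settings("test")  # arguments are ignored; the first instance is reused
print(first is second)
print(second.env)
```

One trade-off of this approach is that the name `Settings` now refers to a function rather than a class, so `isinstance(obj, Settings)` no longer works.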
This Python Enhancement Proposal specifies a mechanism by which projects hosted on pypi.org can safely host wheel artifacts on external sites other than PyPI.
PYTHON.ORG
The use of containers can mean a lot of calls to PyPI. This post talks about caching properly to reduce the load on our shared community servers.
MICHAEL KENNEDY
If you need to cycle through values, one way to do that is with deque. This post shows you how through an example service for a game engine.
JUHA-MATTI SANTALA
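A minimal sketch of cycling with `collections.deque` might look like the following. The player names and the `next_turn` helper are made up for illustration and are not taken from the linked post.

```python
from collections import deque

# Hypothetical turn order for a game.
players = deque(["ana", "ben", "chloe"])


def next_turn(queue):
    """Return whoever is at the front, then move them to the back."""
    current = queue[0]
    queue.rotate(-1)  # a negative rotation shifts items to the left
    return current


turns = [next_turn(players) for _ in range(5)]
print(turns)
```

When the sequence never changes, `itertools.cycle` is a simpler choice. A deque is useful when entries need to be added or removed while cycling.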
All you need to know about the latest Python release’s changes to the Global Interpreter Lock and Just-in-Time compilation.
DREW SILCOCK
“Not a real dinosaur and not real poetry.” This post is about Paul changing which tools he uses for his Python setup.
PAUL COCHRANE
GITHUB.COM/TESORIO • Shared by Caio Ariede
foc: A Collection of Python Functions for Somebody’s Sanity
spiderweb: A Small Web Framework
pipefunc: DAGs for Scientific Workflows
GITHUB.COM/PIPEFUNC • Shared by Bas Nijholt
django-unique-user-email: Email Logins With User Model
Events
Weekly Real Python Office Hours Q&A (Virtual)
October 9, 2024
REALPYTHON.COM
October 9 to October 14, 2024
PYCON.ORG
October 10 to October 11, 2024
PYCON.ORG
October 10 to October 11, 2024
MEETUP.COM
October 12, 2024
MEETUP.COM
October 14 to October 16, 2024
GLOBALDEVSLAM.COM
October 16 to October 21, 2024
PYTHONBRASIL.ORG.BR
October 16 to October 19, 2024
PYCON.PA
October 17 to October 19, 2024
PYTHON-SUMMIT.CH
Happy Pythoning!
This was PyCoder’s Weekly Issue #650.
Smartbees: How to Add Google Maps in Drupal (Best Modules)
Maps on websites are a useful addition, allowing users to reach, for example, your company headquarters in a simple way. Depending on the requirements, you can implement them in many different ways. This article will show you how to embed Google Maps in a Drupal website.
Steinar H. Gunderson: Pimp my SV08
The Sovol SV08 is a 3D printer which is a semi-assembled clone of Voron 2.4, an open-source design. It's not the cheapest of printers, but for what you get, it's extremely good value for money—as long as you can deal with certain, err, quality issues.
Anyway, I have one, and one of the fun things about an open design is that you can switch out things to your liking. (If you just want a tool, buy something else. Bambu P1S, for instance, if you can live with a rather closed ecosystem. It's a bit like an iPhone in that aspect, really.) So I've put together a spreadsheet with some of the more common choices:
It doesn't contain any of the really difficult mods, and it also doesn't cover pure printables. And none of the dreaded macro stuff that people seem to be obsessing over (it's really like being in the 90s with people's mIRC scripts all over again sometimes :-/), except where needed to make hardware work.
Linux App Summit – A Review!
I had the privilege of attending LAS this year. True to my role as a designer, I brought my camera and volunteered during the event to be a photographer. The venue and university of Monterrey were beautiful.
The main hall is a wall-to-wall glass building placed in the middle of campus. The pictures we got from there were so nice!
The day began with a review of OnlyOffice features and capabilities. We then reviewed the progress that Mexico has seen in advancing Open Source initiatives.
The sessions showcased a myriad of topics. They focused on how open source applications can make a difference in many areas. Other sessions focused on design guidelines, application-building logic, publication and efforts to promote Linux in education.
The work done by the organization was great. Internet access at the venue was strong, and allowed the team onsite to broadcast the sessions online. We were in a university setting. A team managed the broadcasting and sound for the venue and online audiences.
The city was beautiful and filled with great food.
During the conference I contributed with images that I will make available to the organizers soon.
Would love to come back!
FSF Events: Free Software Directory meeting on IRC: Friday, October 11, starting at 12:00 EDT (16:00 UTC)
Real Python: What's New in Python 3.13
Python 3.13 was published on October 7, 2024. This new version is a major step forward for the language, although several of the biggest changes are happening under the hood and won’t be immediately visible to you.
In a sense, Python 3.13 is laying the groundwork for some future improvements, especially to the language’s performance. As you watch the course, you’ll learn more about the background for this and dive into some new features that are fully available now.
In this video course, you’ll learn about some of the improvements in the new version, including:
- Improvements made to the interactive interpreter (REPL)
- Clearer error messages that can help you fix common mistakes
- Progress made toward removing the global interpreter lock (GIL) and making Python free-threaded
- The implementation of an experimental Just-In-Time (JIT) compiler
- A host of minor upgrades to Python’s static type system
In this video course, you’ll explore these changes and see how this new version of Python can work for you.
If you want to try any of the examples in this video course, then you’ll need to use Python 3.13. The Python 3 Installation & Setup Guide and How Can You Install a Pre-Release Version of Python? walk you through several options for adding a new version of Python to your system.
The Open Source Initiative Supports the Open Source Pledge
As businesses rely more heavily on Open Source software (OSS), the strain on maintainers to provide timely updates and security patches continues to grow – often without fair compensation for their crucial work. Recent high-profile security incidents like XZ and Log4Shell have put a spotlight on the security challenges developers face against a backdrop of burnout that has reached an all-time high.
To help address this imbalance, the Open Source Initiative (OSI) supports the Open Source Pledge, launched today by Sentry and partners to support maintainers and inspire a shift toward a healthier work-life balance, and more robust software security practices. The Pledge is a commitment from member companies to pay Open Source maintainers and organizations meaningfully in support of a more sustainable maintainer ecosystem and a reduction of flare-ups of high-profile security incidents.
This Pledge is an attempt to address a problem that has long existed within the Open Source ecosystem. Many companies have built their businesses on top of Open Source software, benefiting from the contributions of maintainers while taking them for granted. While they’ve reaped the rewards, the burden has been placed on unpaid or underpaid developers.
It is essential that companies recognize their role in sustaining the ecosystem that powers their innovations. By taking the Pledge, companies have one more instrument to commit to supporting an ecosystem of maintainers and organizations, ensuring the long-term health of the Open Source projects they rely on.
In order to qualify, the projects that companies pledge to should meet the Open Source Definition. You can join the Open Source Pledge by donating to the Open Source Initiative or contacting us to become a sponsor.
Real Python: Quiz: Python Closures: Common Use Cases and Examples
In this quiz, you’ll test your understanding of Python closures. Closures are a common feature in functional programming languages and are particularly popular in Python because they allow you to create function-based decorators.
Take this quiz after reading our Python Closures: Common Use Cases and Examples tutorial.
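As a quick refresher before taking the quiz, here is a small sketch of a closure and of a function-based decorator built on one. The names `make_counter` and `count_calls` are illustrative and are not taken from the tutorial.

```python
import functools


def make_counter():
    """Return a function that remembers how many times it was called."""
    count = 0  # lives on in the closure after make_counter returns

    def increment():
        nonlocal count
        count += 1
        return count

    return increment


def count_calls(func):
    """Decorator: the wrapper closes over func and tracks its call count."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def greet(name):
    return f"Hello, {name}!"


counter = make_counter()
counter()
counter()
greet("Ada")
greet("Guido")
print(counter(), greet.calls)
```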
Django Weblog: Django bugfix release issued: 5.1.2
Today we've issued the 5.1.2 bugfix release.
The release package and checksums are available from our downloads page, as well as from the Python Package Index. The PGP key ID used for this release is Natalia Bidart: 2EE82A8D9470983E.
Python Software Foundation: Join the Python Developers Survey 2024: Share your experience!
This year we are conducting the eighth iteration of the official Python Developers Survey. The goal is to capture the current state of the language and the ecosystem around it. By comparing the results with last year’s, we can identify and share with everyone the hottest trends in the Python community and the key insights into it.
We encourage you to contribute to our community’s knowledge by sharing your experience and perspective. Your participation is valued! The survey should only take you about 10-15 minutes to complete.
Contribute to the Python Developers Survey 2024!
This year we aim to reach even more of our community and ensure accurate global representation by highlighting our localization efforts:
- The survey is translated into Spanish, Portuguese, Chinese, Korean, Japanese, German, French and Russian. It has been translated in years past, as well, but we plan to be louder about the translations available this year!
- To assist individuals in promoting the survey and encouraging their local communities and professional networks we have created a Promotion Kit with images and social media posts translated into a variety of languages. We hope this promotion kit empowers folks to spread the invitation to respond to the survey within their local communities.
- We’d love it if you’d share one or more of the posts below to your social media or any community accounts you manage, as well as share the information in discords, mailing lists, or chats you participate in.
- If you would like to help out with translations you see are missing, please request edit access to the doc and share what language you will be translating to. Translation into languages the survey may not be translated to is also welcome.
- If you have ideas about what else we can do to get the word out and encourage a diversity of responses, please comment on the corresponding Discuss thread.
The survey is organized in partnership between the Python Software Foundation and JetBrains. After the survey is over, we will publish the aggregated results and randomly choose 20 winners (among those who complete the survey in its entirety), who will each receive a $100 Amazon Gift Card or a local equivalent.
Debian Brasil: Testing feed in English
Testing the feed in English and checking if it's going to Debian Planet.
Sorry for the noise :-)
Mariatta: Perks of Being a Python Core Developer
I’ve been a Python core developer since January 27, 2017.
Being a Python core developer comes with perks, privileges, and also responsibilities.
Sometimes I can’t tell whether something is a perk, or a privilege, or a responsibility. I think it depends on who you’re talking to: some might see it as an optional nice thing they could get or do, while others might see the same thing as a burdensome responsibility.