Planet Python

Subscribe to Planet Python feed
Planet Python - http://planetpython.org/
Updated: 13 hours 41 min ago

Mirek Długosz: The problems with test levels

14 hours 18 min ago
Test levels in common knowledge

A test pyramid usually distinguishes three levels: unit tests, integration tests and end-to-end tests; the last level is sometimes called “UI tests” instead. The main idea is that as you move down the pyramid, tests tend to run faster and be more stable, but at the cost of working in isolation - only tests on higher levels are able to detect problems in how building blocks work together.

The ISTQB syllabus presents a similar idea. It distinguishes four test levels: component, integration, system and acceptance. These test levels drive a lot of thought around testing - each level has its own distinct definition and properties, guides responsibility assignment within a team, is aligned with specific test techniques and may be mapped to a phase in the software development lifecycle. That’s a lot of work!

Both of these categorizations share the idea that a higher level encompasses the level below it and builds upon it. There’s also a certain synergy effect at play here - tests at a higher level cover something more than all the tests at the levels below combined. That’s why teams with “100% unit test coverage” still get bug reports from actual customers. As far as I can tell, these two properties - hierarchy and synergy - are shared by all test level categorizations.

The problems

I have some problems with this common understanding. In my experience, while test levels look easy and simple, it’s unclear how to apply them in practice. If you give the same set of tests to two testers, they are likely to group them into test levels in very different ways. Inconsistencies like that beg the question: are test levels actually a useful categorization tool?

I know, because I have faced these issues when we tried to standardize test metadata in Red Hat Satellite.

One of the things provided by Satellite is host management. You can create, start, stop, restart or destroy a host. If you have tests exercising these capabilities, you could file them under the component level, because host management is one of the components of the Satellite system.

Satellite also provides content management. You can synchronize packages from the Red Hat CDN to your Satellite server and tell your hosts to use that exclusively. This gives you the ability to specify what content is available, e.g. you can offer a specific version of PostgreSQL until all the apps are tested against a newer version. It also allows for faster updates, because all the data is already in your data center and you can use a fast local connection to fetch it. Tests exercising various content management features can be filed under the component level, because content management is one of the components of the Satellite system.

You can set up a host to consume content from a specific content view. Your test might create a host, create a content view, attach the host to the content view and verify that some packages are or are not available to this host. You could file such a test under the integration level, because you integrate two distinct components.

But you could also file that test under the system level, because serving a specific, filtered view of all available content to specific hosts based on various criteria is one of the primary use cases of Satellite, and possibly the main reason people are willing to pay money for it.

For the sake of argument, let’s assume that the test above is an integration level test, and the system level is reserved for tests that exercise larger, end-to-end flows. Something like: create a host, create a content view, sync content to the host, install a specific package update that requires a restart, and wait for the host to be back online.

Satellite may be set up to periodically send data about hosts to cloud.redhat.com. When you test this feature, you might consider Satellite as a whole to be one component and cloud.redhat.com to be another component. This leads to the conclusion that such a test should be filed under the integration level.

While this conclusion is logical (it follows directly from the premises), it doesn’t feel right. If test levels form a kind of hierarchy, then why is a test that exercises the system as a whole filed at the integration level?

You can try to eliminate the problem by lifting this test to the system level. But there are still two visibly distinct kinds of tests filed under a single label - some system level tests exercise Satellite as a whole, and some exercise the integration between Satellite and an external system.

Either way, your levels become internally inconsistent.

Let’s leave integration and system level for now. How about acceptance level?

Satellite is a product that is developed and sold to anyone who wants to buy it. There is no “acceptance” phase in the Satellite lifecycle. Each potential customer would run their own acceptance testing, and while the team obviously appreciated the feedback from these sessions, it was rarely considered a “release blocker”.

Given these circumstances, we decided to create a simple heuristic - if a test covers an issue reported by a customer, then this test should be on the acceptance level.

Soon we realized that a large number of customer issues are caused by the specific data they used, or the specific environment in which the product operates. Our heuristic elevated tests from the component or integration level way up to the acceptance level.

This shows the biggest problem with the acceptance level - it belongs to a completely different categorization scheme. The acceptance level is not defined by what is being tested, but by who performs the testing.

Perhaps there was a time when that distinction had only theoretical meaning. As a software vendor, you built units, integrated them, verified that the system as a whole performed as expected and sent it to the customer, who would verify that it was fit for purpose. Acceptance level tests were truly something greater than system level tests.

But we don’t live in such a world anymore. These days, most software is in perpetual development. There’s no separate “acceptance” phase, because what one customer subjects to acceptance testing is the actual production version for another customer. If the product is changed based on acceptance testing results, all customers receive that change.

Perhaps placing acceptance testing at the level above system testing only ever made sense in a very specific context - when developing business software tailored to a specific customer that does not subscribe to the “all companies are software companies” world view.

While I do not have this kind of experience myself, I have heard about a military contractor that had to submit each function for independent verification by US Army staff, because the army needed to be really sure there was nothing dicey going on in the system. I find it believable. I can think of a bunch of reasons why a customer would want to run acceptance tests on units smaller than the whole system. One of them is really high stakes - when a bug in a system could mean the difference between life and death. Another is when the system is expected to last decades and it’s really important for the customer to acquire certain knowledge and prepare for future maintenance. Military, government (especially intelligence), medicine and automotive all sound like places where a customer might want to verify parts of the system.

Finally, what about the unit (component) level? Is it simple?

Most testers learn to understand unit tests as a developer problem - they are created, maintained and run by developers. Of course you might question this understanding in the world of shifting left, DevTestOps and the “quality is everyone’s responsibility” mantra, but let’s set that discussion aside for now. If unit tests are a developer problem, we should see what developers think about them.

Apparently, they discuss at length what a unit even is. There’s also an anecdote floating around about a person who covered 24 different definitions of a unit test in the first morning of their training course.

Could we do better?

I think it’s clear that there are problems with the common understanding of test levels. But the question remains: are these problems with that specific implementation of the idea, or is the idea of test levels itself completely busted? Could there be another way of defining test levels? Would it be free of the problems discussed above?

My thinking about test levels is guided by two principles. First, levels are hierarchical - a higher level should build upon things from the level below. Obviously, the higher level should be, in some way, more than the simple sum of the things below it. Second, it should be relatively obvious which level a given test belongs to. “Relatively”, because borderline cases are always going to exist in one form or another, and we are humans, so we are going to see things a little differently sometimes. But these should be exceptions, not the norm.

Function level. For the large majority of us, a function is the smallest building block of our programs. That’s why the lowest level is named after it. On the function level, your tests focus on individual functions in isolation. Most of the time, you would try various inputs and verify outputs or side-effects. Of course it helps when your functions are pure and idempotent. This is the level mainly targeted by techniques like fuzzing and property-based testing, as in the sketch below.
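For illustration, here is a minimal sketch of a function level, property-based test - the slugify function and the use of the hypothesis library are my own example, not something taken from the original post:

# A minimal function level, property-based test (hypothetical example).
from hypothesis import given, strategies as st

def slugify(text: str) -> str:
    # Lowercase the text and join the words with dashes.
    return "-".join(text.lower().split())

@given(st.text())
def test_slugify_contains_no_spaces(text):
    # Property: whatever the input, the slug never contains a space.
    assert " " not in slugify(text)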

Class level. The name comes from the object-oriented paradigm, where we tend to group functions that work together into classes. The main goal of tests at this level is to verify the integration between functions. These functions may, but don’t have to, be grouped in a single class. Since classes group behavior and state, setup code is much more common on this level - you will find yourself ensuring that a class is in a specific state before you can test what you actually care about. Test cleanup code will also appear more often than on the function level, for the same reason. Property-based testing is harder to apply at this level. A short sketch follows below.
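As a rough sketch - the ShoppingCart class below is a hypothetical example of mine - notice how a class level test spends most of its lines on setup before a short assertion:

# A minimal class level test (hypothetical example): several methods are
# exercised together, and setup dominates the test body.
class ShoppingCart:
    def __init__(self):
        self.items = {}

    def add(self, name, price, quantity=1):
        self.items[name] = self.items.get(name, 0) + quantity * price

    def total(self):
        return sum(self.items.values())

def test_total_reflects_quantities():
    # Setup: bring the object into the state we actually care about.
    cart = ShoppingCart()
    cart.add("apple", price=2, quantity=3)
    cart.add("pear", price=5)
    # The assertion itself is a single line.
    assert cart.total() == 11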

Package level. This name is inspired by the Python naming convention, where a package is a collection of modules (i.e. functions and classes) that work together to achieve a single goal. This is also what package level tests are all about - they test interactions between classes, and between classes and functions. These are the tests that pose the first challenge to the common understanding of test levels. Some people might consider them integration tests (because there are a few classes working together, and you want to test how well they integrate with each other), while others would consider them unit tests (because a package is designed to solve a single “unit” of the domain problem). For me, a package is something that is coherent enough to have a somewhat clear boundary with the rest of the system, but not abstract enough to be considered for extraction from the system into a 3rd-party library. This level might be easier to understand in relation to the next level.

Service level. The name comes from microservice architecture. We can discuss at length whether microservices are right for you, and whether they are anything more than a buzzword, but that is a discussion for another time. What’s important is that your project consists of multiple packages (unless you are in the business of creating libraries). Some of these packages, or some sets of packages, have a very clearly defined responsibility within the system, and boundaries that set them apart from the rest of the system. At least theoretically, these packages could be extracted into a separate library (or a separate API service) that your project would pull in as a dependency. Service level tests focus on these special packages, or collections of packages.

The service level is where things start to become really interesting. All the levels below are focused on code organization. At the service level, you have to face the question of why you are developing the software at all. The service level is primarily driven by business needs, and the relationship between them and specific system components. Some services encapsulate “business logic” - external constraints that the system has to adhere to. Other services exist only to support these core services or to enable integration with other systems. Some services are relatively abstract and are likely to be implemented by an open source library (think of a database access service or a user authentication service).

The service level is also where testers have traditionally gotten involved, because some services exist only to facilitate the interaction of the system with the outside world. Think about generating HTML, sending e-mails, REST API endpoints, desktop UIs, etc.

System level. For most intents and purposes, “system” is a synonym of “software”. These days, when everything is interconnected and integrated, it might sometimes be hard to clearly define a system’s boundaries. I would use a handful of heuristics: your customers buy a copy of a system, or a license to use a system, or create an account within a system. A system is what users interact with. A system has a name, and this name is known to customers. A system is the subject of your company’s marketing and sales efforts. Most of the things we know and use every day are systems: Spotify, Netflix, Microsoft Windows, Microsoft Word, …

A lot of systems truly are collections of services (subsystems). Most discussions around software architecture focus on how to arrange services so that responsibilities and boundaries are clear. For many architects, the end goal is to design a system in a way that makes it possible to swap one service implementation for another without impacting the whole thing.

While this separation is important from a development perspective, it’s also crucial that it is not visible to the customer. If a user feels, or worse - knows - that she is moving from one subsystem to another, more often than not it means that UX attention is required.

System level tests focus on exercising the integration between subsystems and exercising the system as a whole. Often they will interact with the system through an interface that is known to users - a desktop UI, a web page or a public API. For that reason, system level tests tend to be relatively slow and brittle. To offset that, you will usually focus only on happy paths and the most important end-to-end journeys.

Offering level. Many companies are built around a single product and never reach this level. But when a company is big enough and offers multiple products, it is usually important that these products work well together.

Today, one of the best examples is Amazon and AWS. AWS provides access to many services, including EC2 virtual machines, S3 storage and RDS managed databases. Most of these services are maintained by dedicated teams, and customers may decide to pay for one and not another. But customers might also decide to embrace AWS completely. When they do, it’s really important that setting up an EC2 machine to store data on S3 is easy, ideally easier than with any other cloud storage. Amazon understands that and offers products that group and connect existing services into ready-to-use solutions for common business problems.

Testing on this level poses unique technical and organizational challenges. Company engineering structure tends to be organized around specific products. Each product will be built by a different team using a different technology stack and tools, and might have a different goal and target audience. To test effectively at this level, you need people working across the organization and you need to fill the gaps that nobody feels responsible for. Often you need endorsement from the very top of company leadership, because most teams already have more work than they can handle - and if they are to help with offering testing, that must be done at the expense of something else.

But this proposal is bad

I am not claiming that the above proposal is perfect. In fact, I can find a few problems with it myself, which I discuss briefly below. But I think it is a step in the right direction and provides a good foundation that you can adjust to your specific situation.

If we follow the pattern that a higher level is a collection of elements from the level below, we might notice that a function is not the smallest unit - most functions execute multiple system calls, and some system calls might encapsulate multiple processor instructions. I’ve decided to skip these levels, because I don’t have any experience working with systems that low in the stack. But I imagine people working on programming languages, compilers and processors might have a case for level(s) below the function level.

You might find “class level” to be a misleading name if you work in a language that does not have classes. In functional languages, like Lisp or Haskell, it might be more fitting to use “higher-order functions level”. I don’t think the label is the most important part here - the point is, tests at that level verify the integration between functions.

Python naming conventions differentiate between modules and packages. Without going into much detail, a module is approximated by a single file, and a package is approximated by a single directory. In Python, a package is a collection of modules. Java also differentiates between modules and packages, but the relationship is inverted - a package is a collection of classes and functions, and a module is a collection of related packages. Depending on your goals and language, it might make sense to maintain both a “module level” and a “package level”.

Unless you are working on microservices, you might prefer to call the “service level” a “subsystem level”. My answer is the same as for “class level” in purely functional languages - it doesn’t matter that much what you call it, as long as you are being consistent. Feel free to use a name that better suits your team and your technology stack’s naming conventions. The point of the service / subsystem level is that these tests cover a part of the system that has a clearly defined responsibility.

Users these days expect integrations between the various services that they use. Take Notion as an example - it can integrate with applications such as Trello, Google Drive, Slack, Jira and GitHub. These integrations need to be tested, but it’s unclear which level these tests belong to. They aren’t system level tests, because they cover the system as a whole plus something else. They aren’t offering level tests either, because Trello, Slack and GitHub are not part of your company’s offering. I think that sometimes there might be a need for a new level, which we might call the “3rd party integrations level”. I would place it between the system level and the offering level, or between the service level and the system level.

Why bother discussing test levels, anyway?

You tell me!

This article focuses more on the “what” of test levels than on the “why”, but that’s a fair question. To wrap up the topic, let’s quickly go over some of the reasons why you might want to categorize tests by their levels.

Perhaps you want to track trends over time. Is most of your test development time spent at the function level or the service level? Can you correlate that with specific problems reported by customers? Does it look like gaps in coverage are emerging from the data?

Perhaps you want to gate your tests on the results of tests at the level below. So first you run function level tests, and once they all pass, you run class level tests, and once those pass, you run package level tests… You get the idea. A rough sketch of such gating follows below.
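A minimal sketch of such gating, assuming pytest and hypothetical level markers (e.g. @pytest.mark.function_level) registered in your pytest configuration:

# Run the levels one by one; stop as soon as a level fails.
# The marker names are hypothetical, not part of the original article.
import subprocess
import sys

LEVELS = ["function_level", "class_level", "package_level",
          "service_level", "system_level"]

for level in LEVELS:
    # Run only the tests carrying the current level's marker.
    result = subprocess.run([sys.executable, "-m", "pytest", "-m", level])
    if result.returncode != 0:
        # Gate: do not run the higher levels until this one passes.
        sys.exit(result.returncode)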

Perhaps you have different targets for each level. Tests on lower levels tend to run faster, while tests on higher levels tend to be more brittle. So maybe you are OK with system level tests completing in 2 hours, but for function level tests, even 15 minutes is unacceptable. And maybe you target a 100% pass rate at the function level, but you understand it’s unreasonable to expect more than a 95% pass rate at the system level.

Perhaps you need a tool to guide your thinking on where testing efforts should concentrate. As a rule of thumb, you want to test things at the lowest level possible. As you move up the test level hierarchy, you want to focus on things that are specific and unique to that level. It’s also generally fine to assume that the building blocks at each level work as advertised, since they were thoroughly tested at the level below.

Whatever you do with test levels, I think it makes sense to use a classification that can be applied consistently by all team members. Hopefully the one proposed above will give you some ideas on how to construct such a classification.

Categories: FLOSS Project Planets

John Ludhi/nbshare.io: PySpark concat_ws

Sun, 2022-08-14 17:38
PySpark concat_ws()

The split(str) function converts a string column into an array of strings, using a delimiter to perform the split. concat_ws() is the opposite of split(): it creates a string column from an array of strings, concatenating the array elements with the provided delimiter.

The PySpark functions and methods used in this notebook are
split, concat_ws, DataFrame.createOrReplaceTempView, DataFrame.drop and spark.sql

First we load the important libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, concat_ws, split)

In [3]:
# initializing spark session instance
spark = SparkSession.builder.appName('pyspark concat snippets').getOrCreate()

Then load our initial records

In [4]:
columns = ["Full_Name","Salary"]
data = [("Sam A Smith", 1000), ("Alex Wesley Jones", 120000), ("Steve Paul Jobs", 5000)]

In [5]:
# converting data to rdds
rdd = spark.sparkContext.parallelize(data)

In [6]:
# Then creating a dataframe from our rdd variable
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)

In [7]:
# visualizing current data before manipulation
dfFromRDD2.show()

+-----------------+------+
|        Full_Name|Salary|
+-----------------+------+
|      Sam A Smith|  1000|
|Alex Wesley Jones|120000|
|  Steve Paul Jobs|  5000|
+-----------------+------+

1) Here we are splitting the Full_Name Column containing first name, middle name and last name and adding a new column called Name_Parts

In [8]:
# here we add a new column called 'Name_Parts' and use space ' ' as the delimiter string
modified_dfFromRDD2 = dfFromRDD2.withColumn("Name_Parts", split(col('Full_Name'), ' '))

In [9]:
# visualizing the modified dataframe
modified_dfFromRDD2.show()

+-----------------+------+--------------------+
|        Full_Name|Salary|          Name_Parts|
+-----------------+------+--------------------+
|      Sam A Smith|  1000|     [Sam, A, Smith]|
|Alex Wesley Jones|120000|[Alex, Wesley, Jo...|
|  Steve Paul Jobs|  5000| [Steve, Paul, Jobs]|
+-----------------+------+--------------------+

2) We can also use a SQL query to split the Full_Name column. For this, we need to use createOrReplaceTempView() to create a temporary view from the DataFrame. This view can be accessed as long as the SparkSession is active.

In [10]:
# Below we use the SQL query to select the required columns. This includes the new column we create
# by splitting the Full_Name column.
dfFromRDD2.createOrReplaceTempView("SalaryData")
modified_dfFromRDD3 = spark.sql("select Full_Name, Salary, SPLIT(Full_Name,' ') as Name_Parts from SalaryData")

In [11]:
# visualizing the modified dataframe after executing the SQL query.
# As you can see, it is exactly the same as the previous output.
modified_dfFromRDD3.show(truncate=False)

+-----------------+------+---------------------+
|Full_Name        |Salary|Name_Parts           |
+-----------------+------+---------------------+
|Sam A Smith      |1000  |[Sam, A, Smith]      |
|Alex Wesley Jones|120000|[Alex, Wesley, Jones]|
|Steve Paul Jobs  |5000  |[Steve, Paul, Jobs]  |
+-----------------+------+---------------------+

Now we will use the above data frame for the concat_ws() function, but we will drop the Full_Name column first. We will then recreate it using the concatenation operation.

In [12]:
# Removing the Full_Name column using the drop function
modified_dfFromRDD4 = modified_dfFromRDD3.drop('Full_Name')

In [13]:
# visualizing the modified data frame
modified_dfFromRDD4.show()

+------+--------------------+
|Salary|          Name_Parts|
+------+--------------------+
|  1000|     [Sam, A, Smith]|
|120000|[Alex, Wesley, Jo...|
|  5000| [Steve, Paul, Jobs]|
+------+--------------------+

1) Here we are concatenating the Name_Parts Column containing first name, middle name and last name string elements and adding a new column called Full_Name

In [13]:
# here we add a new column called 'Full_Name' and use space ' ' as the delimiter string to concatenate the Name_Parts
modified_dfFromRDD5 = modified_dfFromRDD4.withColumn("Full_Name", concat_ws(' ', col('Name_Parts')))

In [14]:
# visualizing the modified dataframe.
# The Full_Name column is same as the one in the original data frame we started with above.
modified_dfFromRDD5.show()

+------+--------------------+-----------------+
|Salary|          Name_Parts|        Full_Name|
+------+--------------------+-----------------+
|  1000|     [Sam, A, Smith]|      Sam A Smith|
|120000|[Alex, Wesley, Jo...|Alex Wesley Jones|
|  5000| [Steve, Paul, Jobs]|  Steve Paul Jobs|
+------+--------------------+-----------------+

2) We can also use a SQL query to concatenate the Name_Parts column, like we did for split() above. For this, we need to use createOrReplaceTempView() to create a temporary view from the DataFrame, like we did before. We will then execute the concatenation query on that view.

In [14]:
# Below we use the SQL query to select the required columns. This includes the new column we create
# by concatenating the Name_Parts column.
modified_dfFromRDD4.createOrReplaceTempView("SalaryData2")
modified_dfFromRDD6 = spark.sql("select Salary, Name_Parts, CONCAT_WS(' ', Name_Parts) as Full_Name from SalaryData2")

In [15]:
# visualizing the modified dataframe after executing the SQL query.
# As you can see, it is exactly the same as the previous output.
modified_dfFromRDD6.show(truncate=False)

+------+---------------------+-----------------+
|Salary|Name_Parts           |Full_Name        |
+------+---------------------+-----------------+
|1000  |[Sam, A, Smith]      |Sam A Smith      |
|120000|[Alex, Wesley, Jones]|Alex Wesley Jones|
|5000  |[Steve, Paul, Jobs]  |Steve Paul Jobs  |
+------+---------------------+-----------------+

In [16]:
spark.stop()
Categories: FLOSS Project Planets

Brett Cannon: MVPy: Minimum Viable Python

Sun, 2022-08-14 17:19

Over 29 posts spanning 2 years, this is the final post in my blog series on Python's syntactic sugar. I had set out to find all of the Python 3.8 syntax that could be rewritten if you were to run a tool over a single Python source file in isolation and still end up with reasonably similar semantics (i.e. no whole-program analysis, globals() having different keys was okay). Surprisingly, it turns out to be easier to list what syntax you can't rewrite than re-iterate all the syntax that you can rewrite!

  1. Integers (as the base for other literals like bytes)
  2. Floats (because I didn't want to mess with getting the accuracy wrong)
  3. Function calls
  4. =
  5. :=
  6. Function definitions
  7. global
  8. nonlocal
  9. return
  10. yield
  11. lambda
  12. del
  13. try/except
  14. if
  15. while

All other syntax can devolve to this core set of syntax. I call this subset of syntax the Minimum Viable Python (MVPy) you need to make Python function as a whole. If you can implement this subset of the language, then you can do a syntactic translation to support the rest of Python's syntax (although admittedly it might be a bit faster if you directly implemented all the syntax 😉).
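To give a flavour of what that devolution looks like - this is my own rough sketch, not a literal rewrite taken from the series - a plain for loop can be expressed with only while, if, try/except, assignment and function calls:

# Sugared version:
for item in [1, 2, 3]:
    print(item)

# Roughly equivalent rewrite using only the MVPy subset (a sketch; the
# series handles the edge cases more carefully):
_iter = iter([1, 2, 3])
_running = True
while _running:
    try:
        item = next(_iter)
    except StopIteration:
        _running = False
    if _running:
        print(item)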

If you look at what syntax is left, it pretty much aligns with what is required to implement a Turing machine:

  1. Read/write data (=,  :=, integers and floats)
  2. Make decisions about data (if,  while, and try)
  3. Do things to that data (everything involving defining and using functions)

You might not be as productive in this subset of the language as you would be with all the syntax available in Python 3.8 (and later), but you should still be able to accomplish the same things given enough time and patience.

Categories: FLOSS Project Planets

Moshe Zadka: On The Go

Sun, 2022-08-14 17:00

Now that travel is more realistic, I have started to optimize how well I can work on the go. I want to be able to carry as few things as possible, and have the best set-up possible.

Charging

The "center" of the mobile set-up is my Anker Power Bank. It serves two purposes:

  • It is my wall-plug charger.
  • It is my "mobile power": I can carry around 10,000 mAh of energy.

The charger has two USB-C slots and one USB-A slot.

Compute

For "compute", I have three devices:

  • M1 MacBook Air
  • Galaxy Samsung S9+ (I know it's a bit old)
  • FitBit Charge 4

The S9 is old enough that there is no case with a MagSafe compatible back. Instead, I got a MagSafe sticker that goes on the back of the case.

This allowed me to get a MagSafe Pop-Socket base. Sticking a Pop-Socket on top of it lets me hold the phone securely, and avoids it falling on my face at night.

Ear buds

For earbuds, I have the TOZO T10. They come in multiple colors!

The colors are not just an aesthetic choice. They also serve a purpose: I have a black one and a khaki one.

The black one is paired to my phone. The khaki one is paired to my laptop.

I can charge the TOZO cases with either the USB-C cable or the PowerWave charger, whichever is free.

Charging

In order to charge the M1 I have a USB-C "outtie"/USB-C "outtie" 3 foot wire. It's a bit short, but this also means it takes less space. The FitBit Charge comes with its own USB-A custom cable.

For wireless charging, I have the Anker PowerWave. It's MagSafe compatible, and can connect to any USB-C-compatible outlet.

The phone is only charged by the wireless charging. The USB-C input is wonky, and can be incompatible with humid climates.

I connected a Pop Socket to the back of the PowerWave charger. This means that while the phone is charging, I can still hold it securely.

Together, they give me a "wireless charging" battery. The PowerWave connects to the phone, and the Power Bank has plenty of energy to last for a while while not connecting to anything.

I cannot charge all devices at once. But I can charge all devices, and (almost) any three at once.

Hub

The last device I have is an older version of the Anker 5-in-1 hub. This allows connecting USB Drives and HDMI connectors.

Case

All of these things are carried in a Targus TSS912 case. The laptop goes inside the sleeve, while the other things all go in the side pocket.

The side pocket is small, but can fit all of the things above. Because of its size, it does get crowded. In order to find things easily, I keep all of these things in separate sub-pockets.

I keep the Power Bank, the MagSafe charger, and the USB-C/USB-C cable in the little pouch that comes with the Power Bank.

The hub and FitBit charging cable go into a ziplock bag. Those things see less use.

The earbud cases go into the pocket as-is. They are easy enough to dig out by rooting around.

I wanted a messenger-style case so that I can carry it while I have a backpack on. Whether I am carrying my work laptop (in the work backpack) or a travel backpack, this is a distinct advantage.

The case is small enough to be slipped inside another backpack. If I am carrying a backpack, and there's enough room, I can consolidate.

Conclusion

I chose this set up for options.

For example, if my phone is low on battery, I can connect the PowerWave to the bank, leave the bank in the side-bag's pocket, and keep using the phone while it is charging, holding it with the PowerWave's pop-sockets.

If I am listening to a podcast while walking around, and notice that the ear bud's case is low on battery, I can connect the case to the bank while they are both in the side-bag's pocket.

When sitting down at a coffee shop or an office, I can connect the bank to the wall socket and charge any of my devices while sitting there. As a perk the bank is charging while I'm sitting down.

Categories: FLOSS Project Planets

Podcast.__init__: Remove Roadblocks And Let Your Developers Ship Faster With Self-Serve Infrastructure

Sun, 2022-08-14 06:58
Summary

The goal of every software team is to get their code into production without breaking anything. This requires establishing a repeatable process that doesn’t introduce unnecessary roadblocks and friction. In this episode Ronak Rahman discusses the challenges that development teams encounter when trying to build and maintain velocity in their work, the role that access to infrastructure plays in that process, and how to build automation and guardrails for everyone to take part in the delivery process.

Announcements
  • Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Your host as usual is Tobias Macey and today I’m interviewing Ronak Rahman about how automating the path to production helps to build and maintain development velocity
Interview
  • Introductions
  • How did you get introduced to Python?
  • Can you describe what Quali is and the story behind it?
  • What are the problems that you are trying to solve for software teams?
    • How does Quali help to address those challenges?
  • What are the bad habits that engineers fall into when they experience friction with getting their code into test and production environments?
    • How do those habits contribute to negative feedback loops?
  • What are signs that developers and managers need to watch for that signal the need for investment in developer experience improvements on the path to production?
  • Can you describe what you have built at Quali and how it is implemented?
    • How have the design and goals shifted/evolved from when you first started working on it?
  • What are the positive and negative impacts that you have seen from the evolving set of options for application deployments? (e.g. K8s, containers, VMs, PaaS, FaaS, etc.)
  • Can you describe how Quali fits into the workflow of software teams?
  • Once a team has established patterns for deploying their software, what are some of the disruptions to their flow that they should guard against?
  • What are the most interesting, innovative, or unexpected ways that you have seen Quali used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Quali?
  • When is Quali the wrong choice?
  • What do you have planned for the future of Quali?
Keep In Touch

Picks

Closing Announcements
  • Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

Categories: FLOSS Project Planets

"Paolo Amoroso's Journal": Next Suite8080 features: trim uninitialized data, macro assembler

Sat, 2022-08-13 08:54

I decided what to work on next on Suite8080, the suite of Intel 8080 Assembly cross-development tools I'm writing in Python. I'll add two features, the ability for the assembler to trim trailing uninitialized data and a macro assembler script.

Trimming uninitialized data

Consider this 8080 Assembly code, which declares a 1024-byte uninitialized data area at the end of the program:

# . . .
data:   ds 1024
        end

For this ds directive, the Suite8080 assembler asm80 emits a sequence of 1024 null bytes at the end of the binary program. Similarly, dw emits 16-bit words. The executable file is thus longer and may be slower to load on the host system, typically CP/M.

The Digital Research CP/M assemblers, ASM.COM and MAC.COM, strip such trailing uninitialized data from binaries. After asking for feedback on r/asm, I decided to do the same with asm80. I should be able to implement this optimization by adding just one line of Python, so the feature is a low-hanging fruit.
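The post doesn't show that line, but conceptually the optimization could be as small as stripping trailing null bytes from the assembled image before it is written out - a hypothetical sketch, not the actual asm80 code:

# Hypothetical sketch of the optimization, not the actual asm80 source:
# drop trailing uninitialized (null) bytes before writing the binary image.
def trim_uninitialized(program: bytes) -> bytes:
    return program.rstrip(b'\x00')

# Trailing nulls produced by a final `ds` directive disappear,
# while null bytes in the middle of the program are preserved.
assert trim_uninitialized(b'\x3e\x00\x01\x00\x00') == b'\x3e\x00\x01'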

Macro assembler

asm80 can accept source files from standard input, which makes it possible to combine the assembler with an external macro preprocessor to get a macro assembler. Thanks to its ubiquity, M4 is the clear choice for a preprocessor.

Assuming prog.asm is an 8080 Assembly source file containing M4 macros, this shell pipe can assemble it with asm80:

$ cat prog.asm | m4 | asm80 - -o prog.com

The - option accepts input from standard input and -o sets the file name of the output binary program.

The other Suite8080 feature I'm going to implement is a mac80 helper script in Python to wrap such a shell pipe and make assembling macro files more convenient. In other words, syntactic sugar wrapping asm80 and M4.

The script will use the Python subprocess module to set up the pipe, feed the preprocessed source to the assembler, and not much else.
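As a rough idea of what such a helper could look like - a hypothetical sketch of mine, not the actual mac80 implementation:

# Hypothetical sketch of a mac80-style wrapper: preprocess the source with M4,
# then feed the result to asm80 via standard input (the "-" argument).
import subprocess
import sys

def mac80(source_path: str, output_path: str) -> int:
    with open(source_path, "rb") as source:
        m4 = subprocess.run(["m4"], stdin=source, capture_output=True)
    if m4.returncode != 0:
        sys.stderr.buffer.write(m4.stderr)
        return m4.returncode
    asm = subprocess.run(["asm80", "-", "-o", output_path], input=m4.stdout)
    return asm.returncode

if __name__ == "__main__":
    sys.exit(mac80(sys.argv[1], sys.argv[2]))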

#Suite8080 #Python

Discuss... | Reply by email...

Categories: FLOSS Project Planets

Talk Python to Me: #377: Python Packaging and PyPI in 2022

Sat, 2022-08-13 04:00
PyPI has been in the news for a bunch of reasons lately. Many of them good. But also, some with a bit of drama or mixed reactions. On this episode, we have Dustin Ingram, one of the PyPI maintainers and one of the directors of the PSF, here to discuss the whole 2FA story, securing the supply chain, and plenty more related topics. This is another important episode that people deeply committed to the Python space will want to hear.

Links from the show:

  • Dustin on Twitter: https://twitter.com/di_codes
  • Hardware key giveaway: https://pypi.org/security-key-giveaway/
  • OpenSSF funds PyPI: https://openssf.org/blog/2022/06/20/openssf-funds-python-and-eclipse-foundations-and-acquires-sos-dev-through-alpha-omega-project/
  • James Bennet's take: https://www.b-list.org/weblog/2022/jul/11/pypi/
  • Atomicwrites (left-pad on PyPI): https://old.reddit.com/r/Python/comments/vuh41q/pypi_moves_to_require_2fa_for_critical_projects/
  • 2FA PyPI Dashboard: https://p.datadoghq.com/sb/7dc8b3250-389f47d638b967dbb8f7edfd4c46acb1
  • GitHub 2FA - all users that contribute code by end of 2023: https://github.blog/2022-05-04-software-security-starts-with-the-developer-securing-developer-accounts-with-2fa/
  • GPG - not the holy grail: https://caremad.io/posts/2013/07/packaging-signing-not-holy-grail/
  • Sigstore for Python: https://pypi.org/project/sigstore/
  • pip-audit: https://pypi.org/project/pip-audit/
  • PEP 691: https://peps.python.org/pep-0691/
  • PEP 694 (in draft): https://peps.python.org/pep-0694/
  • Watch this episode on YouTube: https://www.youtube.com/watch?v=-7zOg1FjTg4

Stay in touch with us:

  • Subscribe to us on YouTube: https://talkpython.fm/youtube
  • Follow Talk Python on Twitter: https://twitter.com/talkpython
  • Follow Michael on Twitter: https://twitter.com/mkennedy

Sponsors:

  • RedHat: https://talkpython.fm/compiler
  • IRL Podcast: https://talkpython.fm/irl
  • AssemblyAI: https://talkpython.fm/assemblyai
  • Talk Python Training: https://talkpython.fm/training
Categories: FLOSS Project Planets

John Ludhi/nbshare.io: Pyspark Expr Example

Fri, 2022-08-12 15:38
PySpark expr()

The expr(str) function takes in and executes a SQL-like expression. It returns a PySpark Column data type. This is useful for executing statements that are not available via the Column type and functional APIs. With expr(), we can use PySpark column names in the expressions, as shown in the examples below.

First we load the important libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, expr)

In [3]:
# initializing spark session instance
spark = SparkSession.builder.appName('snippets').getOrCreate()

Then load our initial records

In [4]:
columns = ["Name","Salary","Age","Classify"]
data = [("Sam", 1000,20,0), ("Alex", 120000,40,0), ("Peter", 5000,30,0)]

Let us convert our data to RDDs. To learn more about PySpark RDDs, check out the following link:
How To Analyze Data Using Pyspark RDD

In [5]:
# converting data to rdds
rdd = spark.sparkContext.parallelize(data)

In [6]:
# Then creating a dataframe from our rdd variable
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)

In [7]:
# visualizing current data before manipulation
dfFromRDD2.show()

+-----+------+---+--------+
| Name|Salary|Age|Classify|
+-----+------+---+--------+
|  Sam|  1000| 20|       0|
| Alex|120000| 40|       0|
|Peter|  5000| 30|       0|
+-----+------+---+--------+

1) Here we are changing the "Classify" column upon some condition using the case expression (rather than the built-in pyspark.sql.functions 'when' API, which can also be used to achieve the same result):
  • If Salary is less than 5000, it will change the column value to 1
  • If Salary is less than 10000, it will change the column value to 2
  • Else, it will change it to 3

In [8]:
# here we update the column "Classify" using the CASE expression.
# The conditions are based on the values in the Salary column
modified_dfFromRDD2 = dfFromRDD2.withColumn("Classify", expr("CASE WHEN Salary < 5000 THEN 1 "+
                                                             "WHEN Salary < 10000 THEN 2 " +
                                                             "ELSE 3 END"))

In [9]:
# visualizing the modified dataframe
modified_dfFromRDD2.show()

+-----+------+---+--------+
| Name|Salary|Age|Classify|
+-----+------+---+--------+
|  Sam|  1000| 20|       1|
| Alex|120000| 40|       3|
|Peter|  5000| 30|       2|
+-----+------+---+--------+

2) We can also give a column alias to the SQL expression

In [45]:
# here we updated the column "Classify", CASE expression conditions based on the values in the Salary column
modified_dfFromRDD2 = dfFromRDD2.select("Name", "Salary", "Age", expr("CASE WHEN Salary < 5000 THEN 1 "+
                                                                      "WHEN Salary < 10000 THEN 2 " +
                                                                      "ELSE 3 END as Classify"))

In [46]:
# visualizing the modified dataframe by using the 'as' for aliasing the resulting column.
# As you can see, it is exactly the same as the previous output. You can also see the column name by removing the 'as Classify'
modified_dfFromRDD2.show()

+-----+------+---+--------+
| Name|Salary|Age|Classify|
+-----+------+---+--------+
|  Sam|  1000| 20|       1|
| Alex|120000| 40|       3|
|Peter|  5000| 30|       2|
+-----+------+---+--------+

3) We can also use arithmetic operators to perform operations on columns. Below we add 500 to the Salary column and add a new column called New_Salary

In [10]:
modified_dfFromRDD3 = dfFromRDD2.withColumn("New_Salary", expr("Salary + 500"))

In [11]:
modified_dfFromRDD3.show()

+-----+------+---+--------+----------+
| Name|Salary|Age|Classify|New_Salary|
+-----+------+---+--------+----------+
|  Sam|  1000| 20|       0|      1500|
| Alex|120000| 40|       0|    120500|
|Peter|  5000| 30|       0|      5500|
+-----+------+---+--------+----------+

We can also use SQL functions with existing column values in expr()

In [12]:
# Here we use the SQL function 'concat' to concatenate the values in two columns, i.e. Name and Salary, and also a constant string '_'
modified_dfFromRDD4 = dfFromRDD2.withColumn("Name_Salary", expr("concat(Name, '_', Salary)"))

In [13]:
# visualizing the resulting dataframe
modified_dfFromRDD4.show()

+-----+------+---+--------+-----------+
| Name|Salary|Age|Classify|Name_Salary|
+-----+------+---+--------+-----------+
|  Sam|  1000| 20|       0|   Sam_1000|
| Alex|120000| 40|       0|Alex_120000|
|Peter|  5000| 30|       0| Peter_5000|
+-----+------+---+--------+-----------+

In [14]:
spark.stop()
Categories: FLOSS Project Planets

Python for Beginners: Read File Line by Line in Python

Fri, 2022-08-12 09:00

File operations are crucial during various tasks. In this article, we will discuss how we can read a file line by line in python.

Read File Using the readline() Method

Python provides us with the readline() method to read a file. To read the file, we will first open it using the open() function in read mode. The open() function takes the file name as the first input argument and the literal “r” as the second input argument to denote that the file is opened in read mode. After execution, it returns a file object representing the file.

After getting the file object, we can use the readline() method to read the file. The readline() method, when invoked on a file object, returns the current unread line in the file and moves the iterator to the next line in the file.

To read the file line by line, we will read each line in the file using the readline() method and print it in a while loop. Once the readline() method reaches the end of the file, it returns an empty string. Hence, in the while loop, we will also check whether the content read from the file is an empty string; if it is, we will break out of the loop.

The python program to read the file using the readline() method is as follows.

myFile = open('sample.txt', 'r')
print("The content of the file is:")
while True:
    text = myFile.readline()
    if text == "":
        break
    print(text, end="")
myFile.close()

Output:

The content of the file is:
I am a sample text file.
I was created by Aditya.
You are reading me at Pythonforbeginners.com.


Read File Line by Line in Python Using the readlines() Method

Instead of the readline() method, we can use the readlines() method to read a file in python. The readlines() method, when invoked on a file object, returns a list of strings, where each element in the list is a line from the file. 

After opening the file, we can use the readlines() method to get a list of all the lines in the file. After that, we can use a for loop to print all the lines in the file one by one as follows.

myFile = open('sample.txt', 'r')
print("The content of the file is:")
lines = myFile.readlines()
for text in lines:
    print(text, end="")
myFile.close()

Output:

The content of the file is:
I am a sample text file.
I was created by Aditya.
You are reading me at Pythonforbeginners.com.

Conclusion

In this article, we have discussed two ways to read a file line by line in python. To learn more about programming in python, you can read this article on list comprehension in Python. You might also like this article on dictionary comprehension in python.

The post Read File Line by Line in Python appeared first on PythonForBeginners.com.

Categories: FLOSS Project Planets

PyCharm: Webinar: 10 Pro Git Tips in PyCharm

Fri, 2022-08-12 08:45

Join us Tuesday, August 23, 2022, 6:00 – 7:00 pm CEST (check other time zones) for our free live webinar, 10 Pro Git Tips in PyCharm.

Save your spot

Have you ever worked on a Git repo in PyCharm and wondered, “Am I doing it right?” JetBrains Developer Advocate Marco Behler has a few pointers for what Git workflows you can use and how to manage everything from PyCharm.

Join him as he guides Paul Everitt through development workflows without screwing up his repository. It will be a joy for all, and the last two tips will come from the PyCharm community – so send your suggestions to our Twitter.

Join us for this live interactive webinar on August 23, 2022, which will feature a Q&A session after the live demo.

Categories: FLOSS Project Planets

Real Python: The Real Python Podcast – Episode #121: Moving NLP Forward With Transformer Models and Attention

Fri, 2022-08-12 08:00

What's the big breakthrough for Natural Language Processing (NLP) that has dramatically advanced machine learning into deep learning? What makes these transformer models unique, and what defines "attention?" This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, continues our talk about how machine learning (ML) models understand and generate text.

Categories: FLOSS Project Planets

Hynek Schlawack: pip-tools Supports pyproject.toml

Fri, 2022-08-12 04:00

pip-tools is ready for modern packaging.

Categories: FLOSS Project Planets

PyCharm: The Second Release Candidate for PyCharm 2022.2.1 Is Available!

Thu, 2022-08-11 09:36

This is a new update for the upcoming minor bug-fix release for 2022.2. Last week, in the first release candidate for 2022.2.1, we delivered some critical fixes to use the new functionality of PyCharm 2022.2 without having issues with remote interpreters.

If you encounter an issue in PyCharm 2022.2, please reach out to our support team. This will help us quickly investigate the major issues that are affecting your daily work and solve them.

You can get the new build from our page, via the free Toolbox App, or by using snaps for Ubuntu. 

This week we’re delivering a second release candidate for PyCharm 2022.2.1 with the following bug fixes:

  • Docker: Docker container settings for the Docker-based interpreter are now applied to the run. [PY-53116], [PY-53638]
  • Docker Compose: running Django with a Docker Compose interpreter no longer leads to an HTTP error. [PY-55394]
  • The new UI is enabled for setting up an interpreter via the Show all popup menu in the Python Interpreter popup window. [PY-53057]

We’re working on fixes for the following recent regressions with local and remote interpreters – stay tuned:

  • Custom interpreter paths aren’t supported in the remote interpreters. [PY-52925]
  • Django: Using the Docker-compose interpreter leads to an error when trying to open the manage.py console. [PY-52610]
  • Docker: An exposed port doesn’t work while debugging Docker. [PY-55294]
  • Docker Compose: PyCharm continues the interpreter setup process even if Docker introspection fails during the process. [PY-55392]
  • SSH: Setting up an SSH interpreter leads to infinite reload of the popup window for Jupyter server settings. [PY-55451]
  • Django: The “Run browser” feature that enables running the application in the default browser doesn’t work. [PY-55462]

If you encounter any bugs or have feedback to share, please submit it to our issue tracker, via Twitter, or in the comments section of this blog post.

Categories: FLOSS Project Planets

Zato Blog: Understanding API rate-limiting techniques

Thu, 2022-08-11 08:51

Enabling rate-limiting in Zato means that access to Zato-based APIs can be throttled per endpoint, user or service - including options to make limits apply to specific IP addresses only - and if limits are exceeded within a selected period of time, the invocation will fail. Let’s check how to use it all.

Where and when limits apply

API rate limiting works on several levels and the configuration is always checked in the order below, which goes from the narrowest, most specific parts of the system (endpoints), through users, which may apply to multiple endpoints, up to services, which in turn may be used by both multiple endpoints and users.

  • First, per-endpoint limits
  • Then, per-user limits
  • Finally, per-service limits

When a request arrives through an endpoint, that endpoint’s rate limiting configuration is checked. If the limit is already reached for the IP address or network of the calling application, the request is rejected.

Next, if there is any user associated with the endpoint, that account’s rate limits are checked in the same manner and, similarly, if they are reached, the request is rejected.

Finally, if the endpoint’s underlying service is configured to do so, it also checks if its invocation limits are not exceeded, rejecting the message accordingly if they are.

Note that the three levels are distinct yet they overlap in what they allow one to achieve.

For instance, it is possible to have the same user credentials be used in multiple endpoints and express ideas such as “Allow this and that user to invoke my APIs 1,000 requests/day but limit each endpoint to at most 5 requests/minute no matter which user”.

Moreover, because limits can be set on services, it is possible to make it even more flexible, e.g. “Let this service be invoked at most 10,000 requests/hour, no matter which user it is, with particular users being able to invoke at most 500 requests/minute, no matter which service, topping it off with separate limits for REST vs. SOAP vs. JSON-RPC endpoints, depending on which application invokes them”. That lets one conveniently express advanced scenarios that often occur in practical situations.

Also, observe that API rate limiting applies to REST, SOAP and JSON-RPC endpoints only; it is not used with other API endpoints, such as AMQP, IBM MQ, SAP, the task scheduler or any other technologies. However, per-service limits work no matter which endpoint the service is invoked with, and they will work with endpoints such as WebSockets, ZeroMQ or any other.

Lastly, limits pertain to incoming requests only - outgoing ones, from Zato to external resources, are not covered.

Per-IP restrictions

The architecture is made even more versatile thanks to the fact that for each object - endpoint, user or service - different limits can be configured depending on the caller’s IP address.

This adds yet another dimension and allows to express ideas commonly witnessed in API-based projects, such as:

  • External applications, depending on their IP addresses, can have their own limits
  • Internal users, e.g. employees of the company using VPN, may have higher limits if their addresses are in the 172.x.x.x range
  • For performance testing purposes, access to Zato from a few selected hosts may have no limits at all

IP-based limits work hand in hand with, and are an integral part of, the mechanism - they do not rule out per-endpoint, user or service limits. In fact, for each such object, multiple IP-based limits can be set independently, thus allowing for the highest degree of flexibility.

Exact or approximate

Rate limits come in two types:

  • Exact
  • Approximate

Exact rate limits are just that, exact - they ensure that a limit is not exceeded at all, not even by a single request.

Approximate limits may let a very small number of requests exceed the limit, with the benefit that approximate limits are faster to check than exact ones.

When to use which type depends on a particular project:

  • In some projects, it does not really matter if callers have a limit of 1,000 requests/minute or 1,005 requests/minute because the difference is too tiny to make a business impact. Approximate limits work best in this case.

  • In other projects, there may be requirements that the limit never be exceeded no matter the circumstances. Use exact limits here.
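The difference between the two types can be pictured with a small sketch - this only illustrates the general trade-off (a fixed-window counter vs. a sliding-window log), not how Zato implements its limits:

import time
from collections import deque

class ApproximateLimiter:
    """Fixed-window counter: very cheap to check, but bursts around a
    window boundary may briefly exceed the nominal limit."""
    def __init__(self, limit, window=60.0):
        self.limit, self.window = limit, window
        self.window_start, self.count = time.monotonic(), 0

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= self.window:
            self.window_start, self.count = now, 0  # start a new window
        self.count += 1
        return self.count <= self.limit

class ExactLimiter:
    """Sliding-window log: keeps a timestamp per request, so the limit is
    never exceeded, at the cost of more work and memory per check."""
    def __init__(self, limit, window=60.0):
        self.limit, self.window = limit, window
        self.timestamps = deque()

    def allow(self):
        now = time.monotonic()
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()  # drop requests outside the window
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False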

Python code and web-admin

Alright, let’s check how to define the limits in Zato web-admin. We will use the sample service below:

# -*- coding: utf-8 -*-

# Zato
from zato.server.service import Service

class Sample(Service):
    name = 'api.sample'

    def handle(self):

        # Return a simple string on response
        self.response.payload = 'Hello there!\n'

Now, in web-admin, we will configure limits - separately for the service, a new user and a new REST API channel (endpoint).

Points of interest:

  • Configuration for each type of object is independent - within the same invocation some limits may be exact, some may be approximate
  • There can be multiple configuration entries for each object
  • A unit of time is “m”, “h” or “d”, depending on whether the limit is per minute, hour or day, respectively
  • All limits within the same configuration are checked in the order of their definition, which is why the most generic ones should be listed first

Testing it out

Now, all that is left is to invoke the service from curl.

As long as limits are not reached, a business response is returned:

$ curl http://my.user:password@localhost:11223/api/sample
Hello there!
$

But if a limit is reached, the caller receives an error message with the 429 HTTP status.

$ curl -v http://my.user:password@localhost:11223/api/sample
* Trying 127.0.0.1...
...
HTTP/1.1 429 Too Many Requests
Server: Zato
X-Zato-CID: b8053d68612d626d338b02
...
{"zato_env":{"result":"ZATO_ERROR","cid":"b8053d68612d626d338b02eb",
 "details":"Error 429 Too Many Requests"}}
$

Note that the caller is never told what the limit was - that information is saved in Zato server logs, along with other details, so that API authors can correlate what callers receive with the exact rate limiting definition that prevented them from accessing the service.

zato.common.rate_limiting.common.RateLimitReached: Max. rate limit of 100/m reached; from:`10.74.199.53`, network:`*`; last_from:`127.0.0.1; last_request_time_utc:`2020-11-22T15:30:41.943794; last_cid:`5f4f1ef65490a23e5c37eda1`; (cid:b8053d68612d626d338b02)

And this is it - we have created a new API rate limiting definition in Zato and tested it out successfully!

Categories: FLOSS Project Planets

Codementor: #01 | Machine Learning with the Linear Regression

Thu, 2022-08-11 07:24
Dive into the essence of Machine Learning by developing several Regression models with a practical use case in Python to predict accidents in the USA.
Categories: FLOSS Project Planets

Fabio Zadrozny: PyDev debugger: Going from async to sync to async... oh, wait.

Thu, 2022-08-11 02:46

In Python asyncio land it's always a bit of a hassle when existing code which runs in sync mode needs to be retrofitted to run async, but it's usually doable -- in many cases, slapping async on top of a bunch of definitions and adding await statements where needed does the trick -- even though it's not always that easy.

Now, unfortunately a debugger has no such option. You see, a debugger needs to work on the boundaries of callbacks which are called from Python (i.e. it will usually do a busy wait inside a line event callback registered via sys.settrace, which is always invoked as a sync call).

Still, users want to do some evaluation in the breakpoint context which would await... What now? Classic answers to the question of how to go from async to sync say this is not possible.

This happens because to run something in asynchronous fashion an asyncio loop must be used to run it, but alas, the current loop is paused in the breakpoint, and due to how asyncio is implemented in Python the loop is not reentrant, so we can't just ask it to keep on processing at a certain point -- note that not all loops are equal, so this is mostly a detail of how CPython has implemented it, but unless we want to monkey-patch many things to make it reentrant, this would be a no-no. Also, even if it were possible, asyncio does not let us force a given coroutine to execute; rather, we schedule it and asyncio decides when it'll run afterwards.

My initial naive attempt was just creating a new event loop, but again, CPython gets in the way because 2 event loops can't even coexist in the same thread. Then I thought about recreating the asyncio loop and got a bit further (up to being able to evaluate an asyncio.sleep coroutine), but after checking the asyncio AbstractEventLoop it became clear that the API is just too big to reimplement safely (it's not just about implementing the loop, it's also about implementing network I/O such as getnameinfo, create_connection, etc).

In the end the solution implemented for the debugger is that to support await constructs for evaluation, a new thread is created with a new event loop and that event loop in that new thread will execute the coroutine (with the context of the paused frame passed to that thread for the evaluation).
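A minimal sketch of that idea (not the debugger's actual code, and the helper name is made up): spin up a helper thread with its own brand-new event loop and have the paused thread simply wait for the coroutine's result.

import asyncio
import threading

def evaluate_coroutine(coro):
    # Runs the coroutine in a fresh event loop on a separate thread and
    # blocks the calling (paused) thread until a result is available.
    result = {}

    def worker():
        loop = asyncio.new_event_loop()
        try:
            result['value'] = loop.run_until_complete(coro)
        finally:
            loop.close()

    t = threading.Thread(target=worker, name='debugger-eval')
    t.start()
    t.join()
    return result['value']

# e.g. evaluate_coroutine(asyncio.sleep(1)) works even while another,
# paused event loop already exists in the main thread.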

This is not perfect, as there are some cons. For instance, evaluating the code in a separate thread can mean that some evaluations may not work: some frameworks such as Qt treat the UI thread as special and won't work properly, checks for the current thread won't match the paused thread, and probably a bunch of other things. But I guess it's a reasonable tradeoff vs. not having it at all, as it should work in the majority of cases.

Keep an eye open for the next release as it'll be possible to await coroutines in the debugger evaluation and watches ;)

p.s.: For VSCode users this will also be available in debugpy.

Categories: FLOSS Project Planets

ABlog for Sphinx: ABlog v0.10.30 released

Wed, 2022-08-10 20:00
ABlog v0.10.30 released
Categories: FLOSS Project Planets

TestDriven.io: Integrating Mailchimp with Django

Wed, 2022-08-10 18:28
This article looks at how to integrate Mailchimp with Django for newsletters and transactional emails.
Categories: FLOSS Project Planets

The Python Coding Blog: Shallow and Deep Copy in Python and How to Use __copy__()

Wed, 2022-08-10 12:20

You need to make a copy of an object in a Python program. How difficult can it be? Not very. But you also need to know the difference between shallow and deep copy in Python and decide which one you need.

In this article, you’ll read about the difference between shallow and deep copy when used on simple data structures. Then, you’ll look at more complex structures, including when copying an object created from a class you define yourself. In this example in which I’ll be cloning myself (!), you’ll see some of the pitfalls of copying objects and how to look out for them and avoid them.

In this article, you’ll learn more about:

  • Creating copies of simple lists and other data structures
  • Creating copies of more complex lists
  • Using the copy built-in module
  • Understanding the difference between shallow and deep copy in Python
  • Using __copy__() to define how to shallow copy an object of a user-defined class

Yes, there’s also __deepcopy__(), but I’ll stop at __copy__() in this article.

What’s The Problem With Copying Objects?

Here’s a preview of the example you’ll write towards the end of this article. You’ll create a couple of simple classes to define a Person and a Car. Yes, I’m afraid it’s “person” and “car” again. You’ve seen these examples used very often in object-oriented programming tutorials. But it’s a bit different in this case, so bear with me, please.

If you want a tutorial about classes that doesn’t use the “same old classes” that every other tutorial does, you can read the chapter about Object-Oriented Programming in Python in The Python Coding Book:

# household.py

class Car:
    def __init__(self, make: str, model: str):
        self.make = make
        self.model = model
        self.mileage = 0

    def add_mileage(self, miles: float):
        self.mileage += miles


class Person:
    def __init__(self, firstname: str):
        self.firstname = firstname
        self.car = None

    def buy_car(self, car: Car):
        self.car = car

    def drive(self, miles: float):
        self.car.add_mileage(miles)

You’ll walk through this example in a bit more detail later. For now, I’ll highlight that the Car class has make, model, and mileage attributes. The latter can be updated using the add_mileage() method.

Person has attributes firstname and car. You can assign an object of type Car to the Person using buy_car(), and you can get the person to go for a drive using drive(), which adds mileage to the car.

You can use these classes in a new script:

# cloning_stephen.py

from household import Car, Person

# Create a person who buys a car
stephen = Person("Stephen")
stephen.buy_car(
    Car("BMW", "Series 1")
)

# Log how many miles driven
stephen.drive(100)
print(f"Stephen's mileage is {stephen.car.mileage} miles")

The output from the print() line is:

Stephen's mileage is 100 miles

Next, you’ll clone Stephen (as if one of me is not enough already!)

# cloning_stephen.py

import copy

from household import Car, Person

# Create a person who buys a car
stephen = Person("Stephen")
stephen.buy_car(
    Car("BMW", "Series 1")
)

# Log how many miles driven
stephen.drive(100)
print(f"Stephen's mileage is {stephen.car.mileage} miles")

# Let's copy the Person instance
clone = copy.copy(stephen)
print(
    f"The clone's car is a {clone.car.make} {clone.car.model}"
)
print(f"The clone's mileage is {clone.car.mileage} miles")

# Let's check whether the two cars are exactly the same car
print(
    f"Stephen's car is clone's car: {stephen.car is clone.car}"
)

And here’s where the problem lies. Look at the output from this code:

Stephen's mileage is 100 miles
The clone's car is a BMW Series 1
The clone's mileage is 100 miles
Stephen's car is clone's car: True

The clone’s car is also a BMW Series 1, which makes sense. The clone has the same tastes and needs as Stephen! But, the clone’s car starts at 100 miles. Even though you’ve just created the clone and he’s not been on a drive yet.

The final line explains what’s happening. Stephen and the clone have the same car. Not just the same make and model, but the exact same car.

If the clone goes for a drive now, Stephen’s mileage will also change. Here’s what will happen if you add the following lines to the end of cloning_stephen.py:

# cloning_stephen.py

# ...

# Clone goes for a drive:
clone.drive(68)
print(f"Stephen's mileage is {stephen.car.mileage} miles")

The output is:

Stephen's mileage is 168 miles

Stephen’s mileage increased by 68 miles even though it’s the clone who went for a drive. That’s because they are using the same car! It’s unlikely this is the behaviour you want when you create a copy of a Person.

You’ll return to this example a bit later.

Making a Copy of Simple Data Structures

I’ll go through this section quickly as the fun starts in the next section. Let’s copy a list and a dictionary:

>>> trip_mileages = [10, 12, 3, 59]
>>> copied_list = trip_mileages.copy()
>>> copied_list
[10, 12, 3, 59]
>>> copied_list is trip_mileages
False

>>> trips = {
...     "Supermarket": 2,
...     "Holiday": 129,
... }
>>> copied_dict = trips.copy()
>>> copied_dict
{'Supermarket': 2, 'Holiday': 129}
>>> copied_dict is trips
False

Both lists and dictionaries have a .copy() method. This makes it easy to copy them and create a new object containing the same information.

What if you have a tuple?

>>> trip_mileages_tuple = 10, 12, 3, 59
>>> trip_mileages_tuple.copy()
Traceback (most recent call last):
  ...
AttributeError: 'tuple' object has no attribute 'copy'

Tuples don’t have a .copy() method. In this case, you can try to use the copy built-in module:

>>> trip_mileages_tuple = 10, 12, 3, 59
>>> import copy
>>> copied_tuple = copy.copy(trip_mileages_tuple)
>>> copied_tuple
(10, 12, 3, 59)
>>> copied_tuple is trip_mileages_tuple
True

You’ve been able to create a “copy” of a tuple, except it’s not a copy at all! As tuples are immutable, when you try to copy the tuple, you get a new reference to the same tuple.

You may be wondering whether this is also the case if you use copy.copy() with mutable types such as lists and dictionaries:

>>> trip_mileages = [10, 12, 3, 59]
>>> import copy
>>> copied_list = copy.copy(trip_mileages)
>>> copied_list
[10, 12, 3, 59]
>>> copied_list is trip_mileages
False

No, in this case, copy.copy(trip_mileages) gives the same output as trip_mileages.copy(). You’ll see later on what determines how copy.copy() behaves on any object. But first, let’s look at more complex data structures and find out about shallow and deep copies.

Making a Copy of Complex Data Structures

Consider a list of teams, where each team is a list of names. You create a copy of the list of teams:

>>> teams = [["Stephen", "Mary"], ["Kate", "Trevor"]]
>>> copied_teams = teams.copy()
>>> copied_teams
[['Stephen', 'Mary'], ['Kate', 'Trevor']]
>>> copied_teams is teams
False

So far, this is the same result as the one in the previous section. But, Martin joins Stephen and Mary’s team. You choose to add this to the copied list as you’d like to keep the original teams list unchanged:

>>> copied_teams[0].append("Martin")
>>> copied_teams
[['Stephen', 'Mary', 'Martin'], ['Kate', 'Trevor']]
>>> teams
[['Stephen', 'Mary', 'Martin'], ['Kate', 'Trevor']]
>>> copied_teams[0] is teams[0]
True

You add Martin to the first team in copied_teams. However, he was also added to the first team in teams, the original list, even though you didn’t append anything explicitly to it.

You can see why this happens in the last statement in which you’re checking whether the first list in copied_teams is the same object as the first list in teams. Yes, they are both the same object.

Creating Shallow and Deep Copies in Python

When you copied the list using teams.copy(), you created a shallow copy of the list. Let’s see what this means.

When you create a list, you’re creating a new object of type list which contains several items. However, the list actually contains references to other objects that are stored elsewhere. Therefore, teams[0] is a reference to another object, the list: ['Stephen', 'Mary']. Look again at the line you used to create the teams list initially:

>>> teams = [["Stephen", "Mary"], ["Kate", "Trevor"]]

This line creates three lists:

  • The list ['Stephen', 'Mary']
  • The list ['Kate', 'Trevor']
  • The list named teams which has references to the other two lists

You can visualise this using the diagram below:

When you use teams.copy() or copy.copy(teams), you’re creating a new outer list. However, you’re not copying the inner lists. Instead, you use the same lists ['Stephen', 'Mary'] and ['Kate', 'Trevor'] you already have. Here’s a representation of what this looks like:

teams[0] and copied_teams[0] are two references pointing to the same list. You have two ways of referring to the same object.

So, when you add Martin to copied_teams[0], you are adding Martin’s name to the only existing list, the one that holds Stephen’s team members’ names.

Sometimes, this is not what you want. Instead, you want to create a copy of all the items inside objects.

Deep Copy

In this section, you’ll read about creating a deep copy of an object. But first, let’s recreate the example above using the functions in the built-in module copy.

copy.copy() creates a shallow copy, so you’ll get the same output as the one in the section above:

>>> import copy
>>> teams = [["Stephen", "Mary"], ["Kate", "Trevor"]]
>>> copied_teams = copy.copy(teams)
>>> copied_teams[0].append("Martin")
>>> copied_teams
[['Stephen', 'Mary', 'Martin'], ['Kate', 'Trevor']]
>>> teams
[['Stephen', 'Mary', 'Martin'], ['Kate', 'Trevor']]
>>> copied_teams[0] is teams[0]
True

Therefore, for lists, copy.copy(teams) is the same as teams.copy().

Next, you can try using copy.deepcopy() instead:

>>> import copy
>>> teams = [["Stephen", "Mary"], ["Kate", "Trevor"]]
>>> deepcopied_teams = copy.deepcopy(teams)
>>> deepcopied_teams
[['Stephen', 'Mary'], ['Kate', 'Trevor']]
>>> deepcopied_teams[0].append("Martin")
>>> deepcopied_teams
[['Stephen', 'Mary', 'Martin'], ['Kate', 'Trevor']]
>>> teams
[['Stephen', 'Mary'], ['Kate', 'Trevor']]
>>> deepcopied_teams[0] is teams[0]
False

When you append "Martin" to deepcopied_teams, which is the deep copy you created from the original list, the new item does not appear when you display teams. And unlike the case with the shallow copy earlier, deepcopied_teams[0] is no longer the same object as teams[0].

When you create a deep copy, you’re copying the outer list, but you’re also creating copies of the inner lists. Therefore, the references in teams and those in deepcopied_teams point to different objects. The two copies created by deepcopy() are entirely separate from each other. Here’s how this representation looks now:

You can read more about shallow and deep copy in Python in the official documentation.

Copying Objects of Classes You’ve Defined Yourself

It’s time to create your own classes and explore what happens when you make copies of them. You’ve already come across the class definitions Car and Person at the beginning of this article. Let’s introduce these classes properly. You can define them in a script called household.py:

# household.py

class Car:
    def __init__(self, make: str, model: str):
        self.make = make
        self.model = model
        self.mileage = 0

    def add_mileage(self, miles: float):
        self.mileage += miles


class Person:
    def __init__(self, firstname: str):
        self.firstname = firstname
        self.car = None

    def buy_car(self, car: Car):
        self.car = car

    def drive(self, miles: float):
        self.car.add_mileage(miles)

You can initialise Car with a make and a model, both of which are strings. I’m using type hinting in this example to keep track of what the argument types are. A new car starts with a mileage of 0 miles (or kilometres, if you prefer).

And as the name implies, the method add_mileage() is used to add miles whenever the person drives the car.

A Person is initialised with a first name which is a string. The method buy_car() allows you to link an instance of the class Car to an instance of Person. The Car object is referenced using the attribute Person.car.

Whenever the person goes on a trip, you can call the drive() method which logs the additional miles onto the person’s car.

In a new script called cloning_stephen.py, you can test these classes:

# cloning_stephen.py

from household import Car, Person

# Create a person who buys a car
stephen = Person("Stephen")
stephen.buy_car(
    Car("BMW", "Series 1")
)

# Log how many miles driven
stephen.drive(100)
print(f"Stephen's mileage is {stephen.car.mileage} miles")

This is the same code you saw earlier. You create an instance of Person and call the buy_car() method for that instance. Stephen (I’m still talking about myself in the third person!) goes for a 100-mile drive. You log this by calling the drive() method. This updates the mileage attribute of the Car instance referenced in stephen.car. This code gives the following output:

Stephen's mileage is 100 miles

Copying An Object: The Default Case

Stephen is very busy these days! He decides to clone himself so he can get more things done. Let’s try this. You can copy the instance stephen in cloning_stephen.py using the built-in copy.copy():

# cloning_stephen.py

import copy

from household import Car, Person

# Create a person who buys a car
stephen = Person("Stephen")
stephen.buy_car(
    Car("BMW", "Series 1")
)

# Log how many miles driven
stephen.drive(100)
print(f"Stephen's mileage is {stephen.car.mileage} miles")

# Let's copy the Person instance
clone = copy.copy(stephen)
print(
    f"The clone's car is a {clone.car.make} {clone.car.model}"
)
print(f"The clone's mileage is {clone.car.mileage} miles")

# Let's check whether the two cars are exactly the same car
print(
    f"Stephen's car is clone's car: {stephen.car is clone.car}"
)

The outputs from this script, which you’ve already seen earlier, show the problem with this type of copy:

Stephen's mileage is 100 miles
The clone's car is a BMW Series 1
The clone's mileage is 100 miles
Stephen's car is clone's car: True

This is a shallow copy. Therefore, although stephen and clone are different instances of the class Person, they both share the same instance of Car. Stephen has managed to clone himself, but he has to share the same car with his clone. That’s not good, as Stephen and the clone can’t be efficient if they can’t go to different places.

If the clone goes for a drive, he’s using the same car Stephen uses. Therefore the extra mileage will also show up for Stephen:

# cloning_stephen.py

# ...

# Clone goes for a drive:
clone.drive(68)
print(f"Stephen's mileage is {stephen.car.mileage} miles")

This shows Stephen’s mileage has increased to 168 miles:

Stephen's mileage is 168 miles

Using copy.deepcopy()

What if you try to create a deep copy instead of a shallow one? After all, this trick worked with the example of the list of team members earlier. You can update cloning_stephen.py to use copy.deepcopy() instead of copy.copy():

# cloning_stephen.py

import copy

from household import Car, Person

# Create a person who buys a car
stephen = Person("Stephen")
stephen.buy_car(
    Car("BMW", "Series 1")
)

# Log how many miles driven
stephen.drive(100)
print(f"Stephen's mileage is {stephen.car.mileage} miles")

# Let's copy the Person instance
clone = copy.deepcopy(stephen)
print(
    f"The clone's car is a {clone.car.make} {clone.car.model}"
)
print(f"The clone's mileage is {clone.car.mileage} miles")

# Let's check whether the two cars are exactly the same car
print(
    f"Stephen's car is clone's car: {stephen.car is clone.car}"
)

When you run this script, you’ll now get the following output:

Stephen's mileage is 100 miles
The clone's car is a BMW Series 1
The clone's mileage is 100 miles
Stephen's car is clone's car: False

Stephen’s mileage is still 100 miles. There’s no reason why this should be different as Stephen drove 100 miles.

The clone’s car is a BMW Series 1, the same as Stephen’s car make and model. This is what you want since Stephen’s clone has the same car preferences as Stephen!

Let’s skip to the last line of the output. Stephen’s car is no longer the exact same car as the clone’s car. This is different from the result you got with the shallow copy above. The clone’s car is a different instance of Car. So there are two cars now; one belongs to Stephen and the other to the clone.

However, the clone’s car already has 100 miles on the odometer even though the clone hasn’t driven yet. When you create a deep copy of stephen, the program creates a new instance of Car. However, all of the original car attributes are also copied. This means the clone’s car starts with whatever mileage Stephen’s car has when you create the deep copy.

From now on, the two cars are separate, so when the clone drives the car, the additional mileage won’t show up in Stephen’s car:

# cloning_stephen.py

# ...

# Clone goes for a drive:
clone.drive(68)
print(f"Stephen's mileage is {stephen.car.mileage} miles")
print(f"The clone's mileage is {clone.car.mileage} miles")

The output shows that Stephen’s mileage is still 100 miles, but the clone’s mileage is now 168 miles even though his one and only trip is 68 miles long:

...
Stephen's mileage is 100 miles
The clone's mileage is 168 miles

In the last section of this article, you’ll fix this to customise how an instance of Person should be copied.

Defining The __copy__ Dunder Method

You can override the default behaviour for copy.copy() and copy.deepcopy() for any class you define. In this article, I’ll only focus on defining the dunder method __copy__(), which determines what happens when you call copy.copy() for your object. There’s also a __deepcopy__() dunder method, aimed at creating deep copies, which is similar but provides a bit more functionality to deal with complex objects.

You can return to household.py where you define the class Person and add __copy__() to the class:

# household.py

class Car:
    def __init__(self, make: str, model: str):
        self.make = make
        self.model = model
        self.mileage = 0

    def add_mileage(self, miles: float):
        self.mileage += miles


class Person:
    def __init__(self, firstname: str):
        self.firstname = firstname
        self.car = None

    def buy_car(self, car: Car):
        self.car = car

    def drive(self, miles: float):
        self.car.add_mileage(miles)

    def __copy__(self):
        copy_instance = Person(self.firstname)
        copy_instance.buy_car(
            Car(
                make=self.car.make,
                model=self.car.model,
            )
        )
        return copy_instance

The __copy__() dunder method creates a new Person instance using the same first name of the instance you’re copying. It also creates a new Car instance using the make and model of the car you’re copying. You pass this new Car object as an argument in copy_instance.buy_car() and then return the new Person instance.

You can return to cloning_stephen.py, making sure you use copy.copy() to make a copy of stephen. This means that Person.__copy__() is used when creating the copy.

# cloning_stephen.py

import copy

from household import Car, Person

# Create a person who buys a car
stephen = Person("Stephen")
stephen.buy_car(
    Car("BMW", "Series 1")
)

# Log how many miles driven
stephen.drive(100)
print(f"Stephen's mileage is {stephen.car.mileage} miles")

# Let's copy the Person instance
clone = copy.copy(stephen)
print(
    f"The clone's car is a {clone.car.make} {clone.car.model}"
)
print(f"The clone's mileage is {clone.car.mileage} miles")

# Let's check whether the two cars are exactly the same car
print(
    f"Stephen's car is clone's car: {stephen.car is clone.car}"
)

Now, the output is:

Stephen's mileage is 100 miles
The clone's car is a BMW Series 1
The clone's mileage is 0 miles
Stephen's car is clone's car: False

The clone still has a different instance of Car but now, the car’s mileage starts at 0, as you’d expect! You’ve created a custom version of shallow copy by defining __copy__() for the class. In this case, you decided that when you copy a Person, the new instance has its own car which starts with 0 miles.

In more complex classes, you may want to define both __copy__() and __deepcopy__() if you want to distinguish between shallow and deep copy in your Python program.
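As a sketch of what that could look like for Person - one possible approach rather than the only correct one - you could add a __deepcopy__() method next to __copy__() in household.py (this assumes an import copy line at the top of the file):

    def __deepcopy__(self, memo):
        # memo maps id(original) -> already-made copy, which is how
        # copy.deepcopy() avoids copying the same object twice and
        # copes with cyclic references
        copy_instance = Person(self.firstname)
        memo[id(self)] = copy_instance
        if self.car is not None:
            copy_instance.car = copy.deepcopy(self.car, memo)
        return copy_instance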

Final Words

Here’s a summary of the key points you covered in this article:

  • You created copies of simple lists and other data structures
  • You created copies of more complex lists
  • You used the copy built-in module
  • You learnt about the difference between shallow and deep copy in Python
  • You used __copy__() to define how to shallow copy an object of a user-defined class

You’re now ready to safely copy any object, knowing what to look out for if the object references other objects.

Appendix: You Cannot Copy An Immutable Object

Do you recall when you used copy.copy() on a tuple earlier in the article? Unlike when you copied lists and dictionaries, where you got a new instance containing the same values as the original, you got the same instance back when you tried to copy a tuple.

Whenever you pass an immutable object to copy.copy(), it returns the object itself.
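The same holds for other immutable built-ins, such as strings and frozensets:

>>> import copy
>>> greeting = "hello"
>>> copy.copy(greeting) is greeting
True
>>> frozen = frozenset({1, 2, 3})
>>> copy.copy(frozen) is frozen
True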


The post Shallow and Deep Copy in Python and How to Use __copy__() appeared first on The Python Coding Book.

Categories: FLOSS Project Planets

"Paolo Amoroso's Journal": The Python feed of my old blog Moonshots Beyond the Cloud has long been...

Wed, 2022-08-10 10:14

The Python feed of my old blog Moonshots Beyond the Cloud has long been aggregated by Planet Python. But I'm no longer going to update that blog, so I removed the old feed from Planet Python and submitted the Python feed of my new blog, Paolo Amoroso's Journal.

#Python #blogging

Discuss... | Reply by email...

Categories: FLOSS Project Planets
