FLOSS Project Planets
Zato Blog: Web scraping as an API service
In systems-to-systems integrations, there comes an inevitable time when we have to employ some kind of a web scraping tool to integrate with a particular application. Despite its not being our first choice, it is good to know what to use at such a time - in this article, I provide a gentle introduction to my favorite tool of this kind, called Playwright, followed by sample Python code that integrates it with an API service.
Naturally, in the context of backend integrations, web scraping should be avoided and, generally, it should be considered the last resort. The basic issue here is that while the UI term contains the "interface" part, it is not really the "Application Programming" Interface that we would like to have.
It is not that the UI cannot be programmed against. After all, a web browser does just that, it takes a web page and renders it as expected. Same goes for desktop or mobile applications. Also, anyone integrating with mainframe computers will recognize that this is basically what 3270 can be used for too.
Rather, the fundamental issue is that web scraping goes against the principles of separation of layers and roles across frontend, middleware and backend, which in turn means that authors of resources (e.g. HTML pages) do not really expect for many people to access them in automated ways.
Perhaps they actually should expect it, and web pages should finally start to resemble genuine knowledge graphs, easy to access by humans, be it manually or through automation tools, but the reality today is that it is not the case and, in comparison with backend systems, the whole of the web scraping space is relatively brittle, which is why we shun this approach in integrations.
Yet, another part of reality, particularly in enterprise integrations, is that people may be sometimes given access to a frontend application on an internal network and that is it. No API, no REST, no JSON, no POST data, no real data formats, and one is simply supposed to fill out forms as part of a business process.
Typically, such a situation will result in an integration gap. There will be fully automated parts in the business process preceding this gap, with multiple systems coordinated towards a specific goal and there will be subsequent steps in the process, also fully automated.
Or you may be given access only to a specific frontend and only through VPN via a single remote Windows desktop. Getting access to a REST API may take months or may be never realized because of some high level licensing issues. This is not uncommon in the real life.
Such a gap can be a jarring and sore point, truly ruining the whole, otherwise fluid, integration process. This creates a tension and to resolve the tension, we can, should all the attempts to find a real API fail, finally resort to web scraping.
It is mostly in this context that I am looking at Playwright below - the tool is good and it has many other uses that go beyond the scope of this text, and it is well worth knowing it, for instance for frontend testing of your backend systems, but, when we deal with API integrations, we should not overdo with web scraping.
Needless to say, if web scraping is what you do primarily, your perspective will be somewhat different - you will not need any explanation of why it is needed or when, and you may be only looking for a way to enclose up your web scraping code in API services. This article will explain that too.
Introducing PlaywrightThe nice part of Playwright is that we can use it to visually prepare a draft of Python code that will scrape a given resource. That is, instead of programming it in Python, we go to an address, fill out a form, click buttons and otherwise use everything as usually and Playwright generates for us code that will be later used in integrations.
That code will require a bit of clean-up work, which I will talk about below, but overall it works very nicely and is certainly useful. The result is not one of these do-not-touch auto-generated pieces of code that are better left to their own.
While there are better ways to integrate with Jira, I chose that application as an example of Playwright's usage simply because I cannot show you any internal application in a public blog post.
Below, there are two windows. One is Playwright's emulating a Blackberry device to open a resource. I was clicking around, I provided an email address and then I clicked the same email field once more. To the right, based on my actions, we can find the generated Python code, which I consider quite good and readable.
The Playwright Inspector, the tool that gave us the code, will keep recording all of our actions until we click the "Record" button which then allows us to click the button next to "Record" which is "Copy code to clipboard". We can then save the code to a separate file and run it on demand, automatically.
But first, we will need to install Playwright.
Installing and starting PlaywrightThe tools is written in TypeScript and can be installed using npx, which in turn is part of NodeJS.
Afterwards, the "playwright install" call is needed as well because that will potentially install runtime dependencies, such as Chrome libraries.
Finally, we install Playwright using pip as well because we want to access with Python. Note that if you are installing Playwright under Zato, the "/path/to/pip" will be typically "/opt/zato/code/bin/pip".
npx -g --yes playwright install playwright install /path/to/pip install playwrightWe can now start it as below. I am using BlackBerry as an example of what Playwright is capable of. Also, it is usually more convenient to use a mobile version of a site when the main window and Inspector are opened side by side, but you may prefer to use Chrome, Firefox or anything else.
playwright codegen https://example.atlassian.net/jira --device "BlackBerry Z30"That is practically everything as using Playwright to generate code in our context goes. Open the tool, fill out forms, copy code to a Python module, done.
What is still needed, though, is cleaning up the resulting code and embedding it in an API integration process.
Code clean-upAfter you keep using Playwright for a while with longer forms and pages, you will note that the generated code tends to accumulate parts that repeat.
For instance, in the module below, which I already cleaned up, the same "[placeholder=\"Enter email\"]" reference to the email field is used twice, even if a programmer developing this could would prefer to introduce a variable for that.
There is not a good answer to the question of what to do about it. On the one hand, obviously, being programmers we would prefer not to repeat that kind of details. On the other hand, if we clean up the code too much, this may result in too much of a maintenance burden because we need to keep it mind that we do not really want to invest to much in web scraping and, should there be a need to repeat the whole process, we do not want to end up with Playwright's code auto-generated from scratch once more, without any of our clean-up.
A good compromise position is to at least extract any kind of credentials from the code to environment variables or a similar place and to remove some of the code comments that Playwright generates. The result as below is what it should like at the end. Not too much effort without leaving the whole code as it was originally either.
Save the code below as "play1.py" as this is what the API service below will use.
# -*- coding: utf-8 -*- # stdlib import os # Playwright from playwright.sync_api import Playwright, sync_playwright class Config: Email = os.environ.get('APP_EMAIL', 'zato@example.com') Password = os.environ.get('APP_PASSWORD', '') Headless = bool(os.environ.get('APP_HEADLESS', False)) def run(playwright: Playwright) -> None: browser = playwright.chromium.launch(headless=Config.Headless) # type: ignore context = browser.new_context() # Open new page page = context.new_page() # Open project boards page.goto("https://example.atlassian.net/jira/software/projects/ABC/boards/1") page.goto("https://id.atlassian.com/login?continue=https%3A%2F%2Fexample.atlassian.net%2Flogin%3FredirectCount%3D1%26dest-url%3D%252Fjira%252Fsoftware%252Fprojects%252FABC%252Fboards%252F1%26application%3Djira&application=jira") # Fill out the email page.locator("[placeholder=\"Enter email\"]").click() page.locator("[placeholder=\"Enter email\"]").fill(Config.Email) # Click #login-submit page.locator("#login-submit").click() with sync_playwright() as playwright: run(playwright) Web scraping as a standalone activityWe have the generated code so the first thing to do with it is to run it from command line. This will result in a new Chrome window's accessing Jira - it is Chrome, not Blackberry, because that is the default for Playwright.
The window will close soon enough but this is fine, that code only demonstrates a principle, it is not a full integration task.
python /path/to/play1.pyIt is also useful that we can run the same Python module from our IDE, giving us the ability to step through the code line by line, observing what changes when and why.
Web scraping as an API service
Finally, we are ready to invoke the standalone module from an API service, as in the following code that we are also going to make available as a REST channel.
A couple of notes about the Python service below:
- We invoke Playwright in a subprocess, as a shell command
- We accept input through data models although we do not provide any output definition because it is not needed here
- When we invoke Playwright, we set the APP_HEADLESS to True which will ensure that it does not attempt to actually display a Chrome window. After all, we intend for this service to run on Linux servers, in backend, and such a thing will be unlikely to work in this kind of an environment.
Other than that, this is a straightforward Zato service - it receives input, carries out its work and a reply is returned to the caller (here, empty).
# -*- coding: utf-8 -*- # stdlib from dataclasses import dataclass # Zato from zato.server.service import Model, Service # ########################################################################### @dataclass(init=False) class WebScrapingDemoRequest(Model): email: str password: str # ########################################################################### class WebScrapingDemo(Service): name = 'demo.web-scraping' class SimpleIO: input = WebScrapingDemoRequest def handle(self): # Path to a Python installation that Playwright was installed under py_path = '/path/to/python' # Path to a Playwright module with code to invoke playwright_path = '/path/to/the-playwright-module.py' # This is a template script that we will invoke in a subprocess command_template = """ APP_EMAIL={app_email} APP_PASSWORD={app_password} APP_HEADLESS=True {py_path} {playwright_path} """ # This is our input data input = self.request.input # type: WebScrapingDemoRequest # Extract credentials from the input .. email = input.email password = input.password # .. build the full command, taking all the config into account .. command = command_template.format( app_email = email, app_password = password, py_path = py_path, playwright_path = playwright_path, ) # .. invoke the command in a subprocess .. result = self.commands.invoke(command) # .. if it was not a success, log the details received .. if not result.is_ok: self.logger.info('Exit code -> %s', result.exit_code) self.logger.info('Stderr -> %s', result.stderr) self.logger.info('Stdout -> %s', result.stdout) # ###########################################################################Now, the REST channel:
The last thing to do is to invoke the service - I am using curl from the command line below but it could very well be Postman or a similar option.
curl localhost:17010/demo/web-scraping -d '{"email":"hello@example.com", "password":"abc"}' ; echoThere will be no Chrome window this time around because we run Playwright in the headless mode. There will be no output from curl either because we do not return anything from the service but in server logs we will find details such as below.
We can learn from the log that the command took close to 4 seconds to complete, that the exit code was 0 (indicating success) and that is no stdout or stderr at all.
INFO - Command ` APP_EMAIL=hello@example.com APP_PASSWORD=abc APP_HEADLESS=True /path/to/python /path/to/the-playwright-module.py ` completed in 0:00:03.844157, exit_code -> 0; len-out=0 (0 Bytes); len-err=0 (0 Bytes); cid -> zcmdc5422816b2c6ff9f10742134We are now ready to continue to work on it - for instance, you will notice that the password is visible in logs and this should not be allowed.
But, all such works are extra in comparison with the main theme - we have Playwright, which is a a tool that allows us to quickly integrate with frontend applications and we can automate it through API services. Just as expected.
More resources➤ Python API integration tutorial
➤ What is an integration platform?
➤ Python Integration platform as a Service (iPaaS)
➤ What is an Enterprise Service Bus (ESB)? What is SOA?
Tag1 Consulting: Migrating Your Data from D7 to D10: Avoiding entity ID conflicts with AUTO_INCREMENT
Previously, we wrapped up migrating configuration to match the content model we specified in our upgrade plan. Ready to start migrating content? Hang in there — it’s coming up next. But first, let's address one of the hardest issues to resolve when they arise - entity ID conflicts in content migrations.
mauricio Tue, 11/12/2024 - 23:18Bojan Mihelac: Extending different base template for ajax requests in Django
Bojan Mihelac: Django-simpleadmindoc
Bojan Mihelac: Django set language for admin
Bojan Mihelac: Django import export
Bojan Mihelac: Django cookie consent application
Oliver Davies' daily list: Speaking at the Drupal London meetup
Next Wednesday evening, I'm going to be speaking remotely at the Drupal London meetup.
I was originally going to attend in person, but after injuring my foot last week, I can't travel so will have to join remotely.
I'm going to be giving a talk and demo of automated testing and test-driven development with Drupal as well as some Q&A and pair programming, if time allows and we're able to do so remotely.
RSVPs are still open for the event and hopefully I'll get to attend Drupal London in person in the future.
API documentation porting sprint
It was once said over the grapevine that: "Our C++ API documentation has some issues, our QML API documentation has a lot of issues."
And it was true, but that is to change soon! As you might know, there is an ongoing effort to port our documentation from Doxygen to QDoc, and you can help with that.
This is a task that has been adopted by the Streamlined Application Development Experience championed by Nicolas Fella and Nate Graham as part of the KDE Goals initiative.
We would like to invite you to join our porting sprint effort to finish this task. On November 14th at 1PM UTC, we'll be hanging out in the Matrix room working on this. Hope to see you there.
Some prerequisites:
- Ability to use a terminal
- Extra disk space (30GB minimum)
- Some familiarity with APIs
Check out the instructions prepared by Thiago Sueto on how to get started porting a project to QDoc.
Louis-Philippe Véronneau: Montreal's Debian & Stuff - November 2024
Our Debian User Group met on November 2nd after a somewhat longer summer hiatus than normal. It was lovely to see a bunch of people again and to be able to dedicate a whole day to hacking :)
Here is what we did:
lavamind:
- reproduced puppetdb FTBFS #1084038 and reported the issue upstream
- uploaded a new upstream version for pgpainless (1.6.8-1)
- uploaded a new revision for ruby-moneta (1.6.0-3)
- sent an inquiry to the backports team about #1081696
pollo:
- reviewed & merged many lintian merge requests, clearing out most of the queue
- uploaded a new lintian release (1.120.0)
- worked on unblocking the revival of lintian.debian.org (many thanks to anarcat and pkern)
- apparently (kindly) told people to rtfm at least 4 times :)
anarcat:
- figured out he still can't figure out how to do 3D printing and should
probably just delegate these prints, currently in his wishlist:
- k4vp VHF HT radio case (fits on your phone)
- (tr)uSDX HF radio (fits in your hand!)
- framework expansion card holder (fits in my bag!)
- Ortlieb QUICK-LOCK1 compatible inserts (fits on my bike?)
- fixed an RC bug in python-ulid
- sponsored fuzzel 1.11.1+ds-2 upload, resolving a FTBFS on non-x86 arches
- tried to move the sigal packaging forward a little, failing in obscure pybuild land
LeLutin:
- opened an RFS on the ruby team mailing list for the new upstream version of ruby-necromancer
- worked on packaging the new upstream version of ruby-pathspec
tvaz:
- did AM (Application Manager) work
tassia:
- explored the Debian Jr. project (website, wiki, mailing list, salsa repositories)
- played a few games for Nico's entertainment :-)
- built and tested a Debian Jr. live image
This time around, we went back to Foulab. Thanks for hosting us!
As always, the hacklab was full of interesting stuff and I took a few (bad) pictures for this blog post:
Mirek Długosz: Understanding Linux virtualization stack
I find Linux virtualization stack confusing. KVM? ibvirt? QEMU? Xen? What does that even mean?
This post is my attempt at making sense of that all. I don’t claim it to be correct, but that’s the way I understand it. If I badly missed a mark somewhere, please reach out and tell me how wrong I am.
Virtualization primerVirtual machine is a box inside which programs think they are running on different hardware and operating system. That box is like a full computer, running inside a computer. Virtualization is process of creating these boxes. Virtual machine is called guest, while operating system that runs virtual machine is called host.
Two main reasons to virtualize are security and resource management. Security - because if virtual machine is done correctly, then program inside it won’t even know it’s inside a virtual machine, and that means there’s a good chance it won’t escape to interfere with other programs. Resource management - so you can buy huge server and assign slice of resources to each box as you see fit, and dynamically change it as your needs shift.
Virtual machine pretends to be the entire computer, but most discussions revolve around CPU. Each CPU architecture has a specific set of instructions, and programs generally target only one of them. It is technically possible to translate instructions from one architecture to another, but that’s usually slow. Most modern CPUs provide extensions that allow virtual machines to run at speed comparable to non-virtualized installations. Using these extensions is only possible when the guest and the host target the same CPU architecture.
Conceptually, there is no significant difference between virtualization and emulation. Practically, emulation usually refers to guest thinking it has different CPU architecture than host, and to virtual machine pretending to be some very specific physical hardware - like a video game console. Virtualization in turn usually refers to specific case where guest thinks it has the same CPU architecture as the host.
KVMKVM is part of the kernel, responsible for talking with hardware. As I mentioned before, most modern CPUs have special extensions that allow guests to achieve near-native performance. KVM makes it possible for Linux host to use these extensions.
QEMUQEMU is weird, because it’s multiple different things.
First, it can run virtual machines and pass instructions from guest to KVM, where they will be handled by virtualization CPU extension. So you can think about QEMU as user space component for KVM.
Second, it can run virtual machines while emulating different CPU architecture. So you can think about QEMU as user space virtualization solution, which can run without KVM at all.
Third, it can run programs targeting one CPU architecture on a computer with another CPU architecture. It’s like lightweight virtualization, where only single program is virtualized.
Finally, it is set of standalone utilities related to virtualization, but not directly tied to anything in particular. Most of these programs work with disk image files in some way.
QEMU is probably the hardest part to understand, because it can do things that other parts of the stack are responsible for. This results in some overlap between different components and encountering program names in contexts where you would not expect them.
I think this overlap is mostly a historical artifact. The oldest commit in QEMU git repository is dated 2003, while KVM was included in Linux kernel in 2007. So it seems to me that QEMU was originally intended as software virtualization solution that did not depend on any hardware or kernel driver. Later Linux gained drivers for virtualization CPU extensions, but they were useless without something in user space that could work with them. So instead of creating completely new thing, QEMU was extended to work with KVM.
Technically, the journey towards understanding Linux virtualization stack could end here - you can use QEMU to create, start, stop and modify virtual machines. But QEMU commands tend to be verbose and long, so people rarely use it directly.
libvirtI have ignored that so far, but KVM is not the only virtualization driver in kernel. There are many, some of them closed-source.
libvirt is intended as unified layer that abstracts virtual machine management for application developers. So if you would like your application to create, start, stop or delete virtual machine, you don’t have to support each of these operations on KVM, QEMU, Xen and VirtualBox - you can just support libvirt and trust it will do the right thing.
libvirt doesn’t do any work directly, and instead asks other programs to do it. System administrator is responsible for setting up the host and ensuring virtualization may work. This makes libvirt great for application developers, who can forget about details of different solutions; but most of libvirt users are in less fortunate position, because they have to take a role of system administrator.
libvirt will talk with QEMU, which in turn may pass CPU instructions to KVM. Some sources online claim that libvirt can talk with KVM directly. libvirt itself perpetuates this confusion, as KVM is listed alongside QEMU on the main page, giving the impression they are two equivalent components. But detailed documentation page makes it clear - libvirt only talks with QEMU. It can detect KVM and allow user to create “hardware accelerated guests”, but that case is still handled through QEMU.
It’s also worth noting that libvirt supports multiple virtualization drivers across multiple operating systems, including FreeBSD and macOS.
virshvirsh is a command line interface for libvirt. That’s it, there’s nothing more to it. If it was created today, it might have been called libvirtctl.
VagrantVagrant abstracts virtual machine management across operating systems. It’s similar to libvirt, in the way that it allows application developers to target Vagrant and leave all the details to someone else. And just like libvirt, Vagrant doesn’t do anything on its own, but pushes all the work to tools lower in the stack.
Vagrant primary audience is software developers. It allows to succinctly define multiple virtual machines, and then start and provision them all with a single command. These machines are usually considered disposable - they might be deleted and re-created multiple times a day. If you want to manually modify the virtual machine and keep these changes for longer period of time, Vagrant is probably not a tool for you.
On Linux, Vagrant usually works with libvirt, but may also work with VirtualBox or Xen.
VirtualBoxVirtualBox is full virtualization solution. It consists of kernel driver and user space programs, including one with graphical interface. You can think of VirtualBox as something parallel to all that we’ve discussed so far.
The main reason to use VirtualBox is ease of use. It offers simpler mental model for virtualization - you just install VirtualBox, start VirtualBox UI and create and run virtual machines. While it does require a kernel module, and kernel is notoriously bad at breaking modules that are not in tree, VirtualBox has a lot of tooling and integrations that will ensure that module is built every time you install or update kernel image. On distributions like Ubuntu you don’t have to think about this at all.
Technically speaking, libvirt supports VirtualBox, so you can use libvirt and virsh to manage VirtualBox machines. In practice that seems to be rare.
XenXen is another virtualization stack. You can think about it as something parallel to KVM/QEMU/libvirt and VirtualBox. What sets Xen apart is that instead of running inside your operating system, it runs below your operating system.
Xen manages scheduling, interrupts and timers - that is, it manages who gets access to computer resources, when, and for how long. Xen is the very first thing that starts when you boot your computer.
But Xen is not an operating system. So right after starting, it starts the guest with one. That guest is special, because it provides drivers to hardware and is the only one with a privilege to talk with Xen. Only through that special guest virtual machine you can create other virtual machines, assign them resources etc.
Xen is one of the oldest virtualization solutions, with first release back in 2003. However, it was included in kernel only in 2010. So while many people prefer Xen and think favorably of its architecture, it seems to lost the popularity contest to KVM and QEMU.
Xen provides the tools required to manage guest virtual machines, but you can also manage them through libvirt.
SummaryI thought to put a visual aid in place of summary. Each column represents a possible stack - you generally want to pick one of them. Items with dashed line style are optional - you can use them, but don’t have to. Click to see the bigger image.
Paul Tagliamonte: Complex for Whom?
In basically every engineering organization I’ve ever regarded as particularly high functioning, I’ve sat through one specific recurring conversation which is not – a conversation about “complexity”. Things are good or bad because they are or aren’t complex, architectures needs to be redone because it’s too complex – some refactor of whatever it is won’t work because it’s too complex. You may have even been a part of some of these conversations – or even been the one advocating for simple light-weight solutions. I’ve done it. Many times.
When I was writing this, I had a flash-back to a over-10 year old post by mjg59 about LightDM. It would be a mistake not to link it here.Rarely, if ever, do we talk about complexity within its rightful context – complexity for whom. Is a solution complex because it’s complex for the end user? Is it complex if it’s complex for an API consumer? Is it complex if it’s complex for the person maintaining the API service? Is it complex if it’s complex for someone outside the team maintaining it to understand? Complexity within a problem domain I’ve come to believe, is fairly zero-sum – there’s a fixed amount of complexity in the problem to be solved, and you can choose to either solve it, or leave it for those downstream of you to solve that problem on their own.
Although I believe there is a fixed amount of complexity in the lower bound of a problem, you always have the option to change the problem you're solving!That being said, while I believe there is a lower bound in complexity to contend with for a problem, I do not believe there is an upper bound to the complexity of solutions possible. It is always possible, and in fact, very likely that teams create problems for themselves while trying to solve a problem. The rest of this post is talking to the lower bound. When getting feedback on an early draft of this blog post, I’ve been informed that Fred Brooks coined a term for what I call “lower bound complexity” – “Essential Complexity”, in the paper “No Silver Bullet—Essence and Accident in Software Engineering”, which is a better term and can be used interchangeably.
Complexity CultureIn a large enough organization, where the team is high functioning enough to have and maintain trust amongst peers, members of the team will specalize. People will begin to engage with subsets of the work to be done, and begin to have their efficacy measured against that part of the organization’s problems. Incentives shift, and over time it becomes increasingly likely that two engineers may have two very different priorities when working on the same system together. Someone accountable for uptime and tasked with responding to outages will begin to resist changes. Someone accountable for rapidly delivering features will resist gates between them and their users. Companies (either wittingly or unwittingly) will deal with this by tasking engineers with both production (feature development) and operational tasks (maintenance), so the difference in incentives isn’t usually as bad as it could be.
The events depicted in this movie are fictitious. Any similarity to any person living or dead is merely coincidental.When we get a bunch of folks from far-flung corners of an organization in a room, fire up a slide deck and throw up some asperiational to-be architecture diagram in order to get a sign-off to solve some problem (be it someone needs a credible promotion packet, new feature needs to get delivered, or the system has begun to fail and needs fixing), the initial reaction will, more often than I’d like, start to devolve into a discussion of how this is going to introduce a bunch of complexity, going to be hard to maintain, why can’t you make it less complex?
In a high functioning environment, this is a mostly healthy impulse, coming from a good place, and genuinely intended to prevent problems for the whole organization by reducing non-essental complexity. That is good. I'm talking about a conversation discussing removing lower-limit complexity.Right around here is when I start to try and contextualize the conversation happening around me – understand what complexity is that being discussed, and understand who is taking on that burden. Think about who should be owning that problem, and work through the tradeoffs involved. Is it best solved here, or left to consumers (be them other systems, developers, or users). Should something become an API call’s optional param, taking on all the edge-cases and on, or should users have to implement the logic using the data you return (leaving everyone else to take on all the edge-cases and maintenance)? Should you process the data, or require the user to preprocess it for you?
Carla Geisser described this as being reminicent of the technique outlined in "end to end arguments in system design", which she uses to think about where complexity winds up in a system. It's an extremely good parallel.Frequently it’s right to make an active and explicit decision to simplify and leave problems to be solved downstream, since they may not actually need to be solved – or perhaps you expect consumers will want to own the specifics of how the problem is solved, in which case you leave lots of documentation and examples. Many other times, especially when it’s something downstream consumers are likely to hit, it’s best solved internal to the system, since the only thing that can come of leaving it unsolved are bugs, frustration and half-correct solutions. This is a grey-space of tradeoffs, not a clear decision tree. No one wants the software manifestation of a katamari ball or a junk drawer, nor does anyone want a half-baked service unable to handle the simplest use-case.
Head-in-sand as a ServicePopoffs about how complex something is, are, to a first approximation, best understood as meaning “complicated for the person making comments”. A lot of the #thoughtleadership believe that an AWS hosted EKS k8s cluster running images built by CI talking to an AWS hosted PostgreSQL RDS is not complex. They’re right. Mostly right. This is less complex – less complex for them. It’s not, however, without complexity and its own tradeoffs – it’s just complexity that they do not have to deal with. Now they don’t have to maintain machines that have pesky operating systems or hard drive failures. They don’t have to deal with updating the version of k8s, nor ensuring the backups work. No one has to push some artifact to prod manually. Deployments happen unattended. You click a button and get a cluster.
On the other hand, developers outside the ops function need to deal with troubleshooting CI, debugging access control rules encoded in turing complete YAML, permissions issues inside the cluster due to whatever the fuck a service mesh is, everyone needs to learn how to use some k8s tools they only actually use during a bad day, likely while doing some x.509 troubleshooting to connect to the cluster (an internal only endpoint; just port forward it) – not to mention all sorts of rules to route packets to their project (a single repo’s binary being run in 3 containers on a single vm host).
Truly i'm not picking on k8s here; I do genuinely believe it when I say EKS is less complex for me to operate well; that's kinda the whole point.Beyond that, there’s the invisible complexity – complexity on the interior of a service you depend on. I think about the dozens of teams maintaining the EKS service (which is either run on EC2 instances, or alternately, EC2 instances in a trench coat, moustache and even more shell scripts), the RDS service (also EC2 and shell scripts, but this time accounting for redundancy, backups, availability zones), scores of hypervisors pulled off the shelf (xen, kvm) smashed together with the ones built in-house (firecracker, nitro, etc) running on hardware that has to be refreshed and maintained continuously. Every request processed by network ACL rules, AWS IAM rules, security group rules, using IP space announced to the internet wired through IXPs directly into ISPs. I don’t even want to begin to think about the complexity inherent in how those switches are designed. Shitloads of complexity to solve problems you may or may not have, or even know you had.
Do I care about invisible complexity? Generally, no. I don't. It's not my problem and they don't show up to my meetings.What’s more complex? An app running in an in-house 4u server racked in the office’s telco closet in the back running off the office Verizon line, or an app running four hypervisors deep in an AWS datacenter? Which is more complex to you? What about to your organization? In total? Which is more prone to failure? Which is more secure? Is the complexity good or bad? What type of Complexity can you manage effectively? Which threaten the system? Which threaten your users?
COMPLEXIVIBESThis extends beyond Engineering. Decisions regarding “what tools are we able to use” – be them existing contracts with cloud providers, CIO mandated SaaS products, a list of the only permissible open source projects – will incur costs in terms of expressed “complexity”. Pinning open source projects to a fixed set makes SBOM production “less complex”. Using only one SaaS provider’s product suite (even if its terrible, because it has all the types of tools you need) makes accreditation “less complex”. If all you have is a contract with Pauly T’s lowest price technically acceptable artisinal cloudary and haberdashery, the way you pay for your compute is “less complex” for the CIO shop, though you will find yourself building your own hosted database template, mechanism to spin up a k8s cluster, and all the operational and technical burden that comes with it. Or you won’t and make it everyone else’s problem in the organization. Nothing you can do will solve for the fact that you must now deal with this problem somewhere because it was less complicated for the business to put the workloads on the existing contract with a cut-rate vendor.
Suddenly, the decision to “reduce complexity” because of an existing contract vehicle has resulted in a huge amount of technical risk and maintenance burden being onboarded. Complexity you would otherwise externalize has now been taken on internally. With a large enough organizations (specifically, in this case, i’m talking about you, bureaucracies), this is largely ignored or accepted as normal since the personnel cost is understood to be free to everyone involved. Doing it this way is more expensive, more work, less reliable and less maintainable, and yet, somehow, is, in a lot of ways, “less complex” to the organization. It’s particularly bad with bureaucracies, since screwing up a contract will get you into much more trouble than delivering a broken product, leaving basically no reason for anyone to care to fix this.
I can’t shake the feeling that for every story of technical mandates gone awry, somewhere just out of sight there’s a decisionmaker optimizing for what they believe to be the least amount of complexity – least hassle, fewest unique cases, most consistency – as they can. They freely offload complexity from their accreditation and risk acceptance functions through mandates. They will never have to deal with it. That does not change the fact that someone does.
TC;DR (TOO COMPLEX; DIDN’T REVIEW)We wish to rid ourselves of systemic Complexity – after all, complexity is bad, simplicity is good. Removing upper-bound own-goal complexity (“accidental complexity” in Brooks’s terms) is important, but once you hit the lower bound complexity, the tradeoffs become zero-sum. Removing complexity from one part of the system means that somewhere else – maybe outside your organization or in a non-engineering function must grow it back. Sometimes, the opposite is the case, such as when a previously manual business processes is automated. Maybe that’s a good idea. Maybe it’s not. All I know is that what doesn’t help the situation is conflating complexity with everything we don’t like – legacy code, maintenance burden or toil, cost, delivery velocity.
- Complexity is not the same as proclivity to failure. The most reliable systems I’ve interacted with are unimaginably complex, with layers of internal protection to prevent complete failure. This has its own set of costs which other people have written about extensively.
- Complexity is not cost. Sometimes the cost of taking all the complexity in-house is less, for whatever value of cost you choose to use.
- Complexity is not absolute. Something simple from one perspective may be wildly complex from another. The impulse to burn down complex sections of code is helpful to have generally, but sometimes things are complicated for a reason, even if that reason exists outside your codebase or organization.
- Complexity is not something you can remove without introducing complexity elsewhere. Just as not making a decision is a decision itself; choosing to require someone else to deal with a problem rather than dealing with it internally is a choice that needs to be considered in its full context.
Next time you’re sitting through a discussion and someone starts to talk about all the complexity about to be introduced, I want to pop up in the back of your head, politely asking what does complex mean in this context? Is it lower bound complexity? Is this complexity desirable? Is what they’re saying mean something along the lines of I don’t understand the problems being solved, or does it mean something along the lines of this problem should be solved elsewhere? Do they believe this will result in more work for them in a way that you don’t see? Should this not solved at all by changing the bounds of what we should accept or redefine the understood limits of this system? Is the perceived complexity a result of a decision elsewhere? Who’s taking this complexity on, or more to the point, is failing to address complexity required by the problem leaving it to others? Does it impact others? How specifically? What are you not seeing?
What can change?
What should change?
PyCoder’s Weekly: Issue #655 (Nov. 12, 2024)
#655 – NOVEMBER 12, 2024
View in Browser »
In this video course, you’ll learn all about web scraping in Python. You’ll see how to parse data from websites and interact with HTML forms using tools such as Beautiful Soup and MechanicalSoup.
REAL PYTHON course
This article does a comparison between code in single threaded, threaded, and multi-process versions under Python 3.12, 3.13, and 3.13 free-threaded with the GIL on and off.
ARTHUR PASTEL
The only web scraper API that extracts data at scale with a 98.7% average success rate while automatically handling all anti-bot systems. ZenRows is a complete web scraping toolkit with premium proxies, anti-CAPTCHA, headless browsers, and more. Try for free now →
ZENROWS sponsor
This is the first post in a multi-part series that uses Python to build tiny interpreters for other languages.
SERGE ZAITSEV
Michael (from Talk Python fame) introduces the concept of “stack-native” as the opposite of “cloud-native”, and how it applies to Python web apps. Building applications with just enough full-stack building blocks to run reliably with minimal complexity, rather than relying on a multitude of cloud services.
MICHAEL KENNEDY
In this tutorial, you’ll learn how to reset a pandas DataFrame index using various techniques. You’ll also learn why you might want to do this and understand the problems you can avoid by optimizing the index structure.
REAL PYTHON
Posit Connect lets data teams publish, host, & manage Python work. It provides a secure, scalable platform for sharing data insights with those who need them, including models, Jupyter notebooks, Streamlit & Shiny apps, Plotly dashboards, and more.
POSIT sponsor
In this tutorial, you’ll learn about Python closures. A closure is a function-like object with an extended scope. You can use closures to create decorators, factory functions, stateful functions, and more.
REAL PYTHON
You can go surprisingly far with a simple software architecture, in fact simplicity can make it easier to scale. This post talks about some real world cases of just that.
DAN LUU
This project shows you how to implement the Vehicle Routing Problem (VRP) using OpenStreetMap data and Google’s OR-Tools library to find efficient routes.
MEDIUM.COM/P • Shared by Albert Ferré
This is a deep dive article on Python project management and packaging. It covers the pyproject.toml file, modules, dependencies, locking, and more.
REINFORCED KNOWLEDGE
This opinion piece discusses the fact that our faster hardware still can’t keep up with our bloated software and why that is the case.
PRANAV NUTALAPTI
This post by a Django core developer talks about the last twenty years of the library and why it has had such staying power.
CARLTON GIBSON
This quick-hit post shows you how to use the map() and filter() functions in conjunction with list comprehensions.
JUHA-MATTI SANTALA
This post talks about when and when not to use a named tuple in your API and why you might make that choice.
BRETT CANNON
November 12, 2024
MEETUP.COM • Shared by Laura Stephens
November 13, 2024
REALPYTHON.COM
November 14 to November 16, 2024
PYCON.SE
November 15, 2024
MEETUP.COM
November 16 to November 17, 2024
PYCON.HK
November 16 to November 17, 2024
PYCON.JP
November 16 to November 18, 2024
PYTHON.IE
November 22 to November 27, 2024
PYCON.ORG.AU
Happy Pythoning!
This was PyCoder’s Weekly Issue #655.
View in Browser »
[ Subscribe to 🐍 PyCoder’s Weekly 💌 – Get the best Python news, articles, and tutorials delivered to your inbox once a week >> Click here to learn more ]
FSF Events: Free Software Directory meeting on IRC: Friday, November 11, starting at 12:00 EST (17:00 UTC)
Sven Hoexter: fluxcd: Validate flux-system Root Kustomization
Not entirely sure how people use fluxcd, but I guess most people have something like a flux-system flux kustomization as the root to add more flux kustomizations to their kubernetes cluster. Here all of that is living in a monorepo, and as we're all humans people figure out different ways to break it, which brings the reconciliation of the flux controllers down. Thus we set out to do some pre-flight validations.
Note1: We do not use flux variable substitutions for those root kustomizations, so if you use those, you've to put additional work into the validation and pipe things through flux envsubst.
First Iteration: Just Run kustomize Like Flux Would Do ItWith a folder structure where we've a cluster folder with subfolders per cluster, we just run a for loop over all of them:
for CLUSTER in ${CLUSTERS}; do pushd clusters/${CLUSTER} # validate if we can create and build a flux-system like kustomization file kustomize create --autodetect --recursive if ! kustomize build . -o /dev/null 2> error.log; then echo "Error building flux-system kustomization for cluster ${CLUSTER}" cat error.log fi popd done Second Iteration: Make Sure Our Workload Subfolder Have a kustomization.yamlNext someone figured out that you can delete some yaml files from a workload subfolder, including the kustomization.yaml, but not all of them. That left around a resource definition which lacks some other referenced objects, but is still happily included into the root kustomization by kustomize create and flux, which of course did not work.
Thus we started to catch that as well in our growing for loop:
for CLUSTER in ${CLUSTERS}; do pushd clusters/${CLUSTER} # validate if we can create and build a flux-system like kustomization file kustomize create --autodetect --recursive if ! kustomize build . -o /dev/null 2> error.log; then echo "Error building flux-system kustomization for cluster ${CLUSTER}" cat error.log fi # validate if we always have a kustomization file in folders with yaml files for CLFOLDER in $(find . -type d); do test -f ${CLFOLDER}/kustomization.yaml && continue test -f ${CLFOLDER}/kustomization.yml && continue if [[ $(find ${CLFOLDER} -maxdepth 1 \( -name '*.yaml' -o -name '*.yml' \) -type f|wc -l) != 0 ]]; then echo "Error Cluster ${CLUSTER} folder ${CLFOLDER} lacks a kustomization.yaml" fi done popd doneNote2: I shortened those snippets to the core parts. In our case some things are a bit specific to how we implemented the execution of those checks in GitHub action workflows. Hope that's enough to transport the idea of what to check for.
Real Python: Formatting Floats Inside Python F-Strings
You’ll often need to format and round a Python float to display the results of your calculations neatly within strings. In earlier versions of Python, this was a messy thing to do because you needed to round your numbers first and then use either string concatenation or the old string formatting technique to do this for you.
Since Python 3.6, the literal string interpolation, more commonly known as a formatted string literal or f-string, allows you to customize the content of your strings in a more readable way.
An f-string is a literal string prefixed with a lowercase or uppercase letter f and contains zero or more replacement fields enclosed within a pair of curly braces {...}. Each field contains an expression that produces a value. You can calculate the field’s content, but you can also use function calls or even variables.
[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]