Feeds

Opensource.com: Why Drupal is the future of content strategy

Planet Drupal - Thu, 2022-12-15 03:00
Why Drupal is the future of content strategy Suzanne Dergacheva Thu, 12/15/2022 - 03:00

Drupal is already a robust content management system and digital experience platform. It's also playing a critical role in content strategy.

As a long-time advocate for open source and a contributor to Drupal, I spend a lot of time thinking about how organizations can leverage the platform. I've been thinking about…

Categories: FLOSS Project Planets

Dirk Eddelbuettel: spdl 0.0.3 on CRAN: Adding File Logger

Planet Debian - Wed, 2022-12-14 19:00

A second update to the still-new package spdl is now on CRAN, and in Debian. The key focus of spdl is offering the same interface from both R and C++ for logging by relying on spdlog via my RcppSpdlog package.

This release adds support for a simple filesetup() initialiser to direct logging output to a file. For now the console logger and the file logger are exclusive; if there is interest we could borrow a page from upstream and combine them.

The short NEWS entry follows.

Changes in spdl version 0.0.3 (2022-12-14)
  • Add filesetup method

Courtesy of my CRANberries, there is also a diffstat report. More detailed information is on the spdl page.

If you like this or other open-source work I do, you can sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Categories: FLOSS Project Planets

PreviousNext: Bug Smash: Taking a community initiative from idea to success

Planet Drupal - Wed, 2022-12-14 17:59

Bug Smash is a community-run initiative tackling the growing backlog of bugs in Drupal Core. So, what has made it so successful? 

by griffyn.heels / 15 December 2022

This post is based on my talk at DrupalSouth 2022 in Brisbane. You can also watch the recording featured at the end of this post.

Drupal community members from around the globe are working together to smash (triage, update and close) Drupal bugs. 

With 450+ members, this is the largest active Drupal community initiative making a substantial contribution to Drupal core and helping to train new members. Community members work asynchronously, with many living in Australian-friendly time zones. 

Let’s look at the Bug Smash initiative, discovering how and why it started, how it is run and how it continues to grow. 

Hang on, isn’t Drupal perfect?

Drupal is great; there’s no doubt about that! But over time, a few bugs have appeared in core.

In fact, as of December 2022, there are approximately 6,000 active bug reports for Drupal 9 and 10. That number of reports is bound to include duplicates or items that are no longer relevant, making it even harder to find and address actual bugs in the platform.

So, what’s the solution? Get in there and smash them!

How does Bug Smash work?

There are close to 500 members using the #BugSmash channel on Drupal’s Slack. This channel includes a few separate work streams, including the daily triage target, the community triage meeting, and the fortnightly initiative meeting.

The daily triage target runs twice a day, at around 10 am AEST and again later in the day, giving our US/EU friends something to smash when they wake up.

Here’s an example daily triage target.

The fantastic Quietone runs the community triage meeting. This meeting picks a theme and dives into a few bugs. It’s an excellent opportunity for members to learn about triage and get bugs ready for smashing by other team members. In 2022, a triage meeting targeted one of the oldest active bugs in Drupal core, resulting in the 16-year-old bug being fixed and closed.

Here’s a recent triage meeting.

The fortnightly initiative meeting runs asynchronously so that all global members can contribute. This meeting is an excellent opportunity to catch up on everything Bug Smash, update everyone on progress and set targets for the coming weeks. It’s also a great place to start if you’re unsure about the initiative or want guidance on what to tackle.

Here’s a recent fortnightly meeting.

However, you can Bug Smash how and what you like. Though, please remember to use the ‘Bug Smash Initiative’ tag on Drupal.org.

Who can Bug Smash?

There are plenty of opportunities for everyone to get involved.

Seasoned Drupalers can tackle critical, older or more complex bugs or mentor others on the daily triage threads.

Newcomers to the community or Drupal can be matched with suitable beginner tasks, allowing them to cut their teeth on Core contributions.

For those without developer skills (like me!), there are tasks such as handling comms, session notes and getting the word out about Bug Smash.

Does all this bug-smashing work?

The short answer? Yes!

Below is a graph that shows the number of active bugs in Drupal core in October 2020, which was early in the initiative.

And here is the number of active bugs in October 2022, showing a reduction of over 6,000 bug reports compared to October 2020.

Why does Bug Smash work?

It may not be the first initiative of its kind in Drupal, but it's still growing after two and a half years. So, what makes it effective? I asked the following questions of a number of Bug Smash community members:

What works in Bug Smash?

  • It’s simple. This was the most consistent response. We smash bugs to reduce the overall bloat and bad stuff in core. What is there not to love? 
  • It’s not just about development
  • It provides education and mentorship. Bug Smash is a collaborative community of senior, mid and junior-level Drupal developers tackling similar issues. Members can quickly request support or advice to get things across the line.
  • It is informal and low commitment. Depending on the topic, you can contribute as much or as little as you’d like. Members can provide support and then drop out without disrupting the overall initiative.

What keeps you coming back?

  • Education. Members said they learnt ‘heaps’ from the back-and-forth discussions about bugs, benefitting from the interaction within a great group of super-smart people.
  • Progress is possible without fixing bugs. Goals are still achieved without a bug being fixed each time.
  • Reviewing bugs is fun. Some even say it’s their favourite thing to do in Drupal.
  • Bugs have a start and finish point. On the other hand, tasks and features can be open-ended.
  • Everyone likes seeing the number of bugs go down. Seeing actively reported metrics and stats goals each fortnight is a big motivator for the community. 

By bringing everyone’s answers together, we can tell that the initiative works because of its community, and the community is effective because of the structure of the initiative.

In summary, Bug Smash gives you options for getting involved. This means taking part doesn’t have to overtake your day. If you get stuck, there’s always someone around to point you in the right direction. Watching the overall number of bugs drop keeps everyone focused and driven, whilst the experience everyone gains helps to skill up community members. And the more people who join Bug Smash, the more successful it will be for everyone.

If you want to start your own Drupal or other open source contribution community, I’ve hopefully given you some food for thought.

How to get involved

Start off by joining the #BugSmash channel on Drupal Slack and say hi!

You can also visit the Bug Smash page on Drupal.org. It has all the information you need about what Bug Smashers do, how the initiative works and when the next meeting is.

Just remember, it doesn’t matter if you’re not in this time zone. The fortnightly meeting is open for 24 hours, meaning everyone has a chance to contribute to threads.

If you have any more questions, feel free to drop them in the comments below! Or reach out to me in Slack, @griffyn.heels.

Categories: FLOSS Project Planets

qutebrowser development blog: Happy 9th birthday, qutebrowser!

Planet Python - Wed, 2022-12-14 15:19

qutebrowser is turning 9 today! I’ll use the opportunity for a – perhaps slightly tl;dr – overview of how it all came to be. As you might notice by the length of this post, stopping to write once I started writing something like this… isn’t exactly my forte! Hopefully …

Categories: FLOSS Project Planets

Ben Cook: Speed Trap

Planet Python - Wed, 2022-12-14 13:26
Overview

This post is going to showcase the development of a vehicle speed detector using Sparrow Computing’s open-source libraries and PyTorch Lightning.

The exciting news here is that we could build this speed detector for any traffic feed without prior knowledge about the site (no calibration required) or specialized imaging equipment (no depth sensors required). Better yet, we only needed ~300 annotated images to reach a decent speed estimate. To estimate speed, we will detect all vehicles coming toward the camera using an object detection model. We will also predict the locations of the back tire and the front tire of every vehicle using a keypoint model. Finally, we will perform object tracking to follow the same tire from frame to frame so that we can estimate vehicle speed.

Without further ado, let’s dive in…

What we will need
  • A video stream of automobile traffic
  • About 300 randomly sampled video frames from the video stream
  • Bounding box annotations around all the vehicles moving toward the camera
  • Keypoint annotations for the points where the front tire and the back tire touch the road (on the side where the vehicle is facing the camera)
An example of an annotated frame: to reduce the annotation labor, we defined a smaller region of the video.

End-to-end Pipeline

The computer vision system we are building takes in a traffic video as an input, makes inferences through multiple deep learning models, and generates 1) an annotated video and 2) a traffic log that prints vehicle counts and speeds. The figure below is a high-level overview of how the whole system is glued together (numbered in chronological order). We will talk about each of the nine process units as well as the I/O highlighted in green.

End-to-end inference pipeline of our computer vision system

Docker Container Setup

First, we need a template that comes with the dependencies and configuration for Sparrow’s development environment. This template can be generated with a Python package called sparrow-patterns. After creating the project source directory, the project will run in a Docker container so that our development will be platform-independent.

To set this up, all you have to do is:

pip install sparrow-patterns
sparrow-patterns project speed-trapv3

Before you run the template on a container, add an empty file called .env inside the folder named .devcontainer as shown below.

Input Annotation Files

For this project, we used an annotation platform called V7 Darwin, but the same approach would work with other annotation platforms.

The Darwin annotation file (shown below) is a JSON object for each image that we tag. Note that “id” is a hash value that uniquely identifies an object on the frame that is only relevant to the Darwin platform. The data we need is contained in the “bounding_box” and “name” objects. Once you have annotations, you need to convert them to a form that PyTorch’s Dataset class is looking for:

{ "annotations": [ { "bounding_box": { "h": 91.61, "w": 151.53, "x": 503.02, "y": 344.16 }, "id": "ce9d224b-0963-4a16-ba1c-34a2fc062b1b", "name": "vehicle" }, { "id": "8120d748-5670-4ece-bc01-6226b380e293", "keypoint": { "x": 508.47, "y": 427.1 }, "name": "back_tire" }, { "id": "3f71b63d-d293-4862-be7b-fa35b76b802f", "keypoint": { "x": 537.3, "y": 434.54 }, "name": "front_tire" } ], } Object Detection Model Define the dataset

In the Dataset class, we create a dictionary with keys such as the frame of the video, the bounding boxes of the frame, and the labels for each of the bounding boxes. The values of that dictionary are then stored as PyTorch tensors. The following is the generic setup for the Dataset class:

class RetinaNetDataset(torch.utils.data.Dataset):  # type: ignore
    """Dataset class for RetinaNet model."""

    def __init__(self, holdout: Optional[Holdout] = None) -> None:
        self.samples = get_sample_dicts(holdout)
        self.transform = T.ToTensor()

    def __len__(self) -> int:
        """Return number of samples."""
        return len(self.samples)

    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
        """Get the tensor dict for a sample."""
        sample = self.samples[idx]
        img = Image.open(sample["image_path"])
        x = self.transform(img)
        boxes = sample["boxes"].astype("float32")
        return {
            "image": x,
            "boxes": torch.from_numpy(boxes),
            "labels": torch.from_numpy(sample["labels"]),
        }

Define the model

We use a pre-trained RetinaNet model from TorchVision as our detection model defined in the Module class:

from torchvision import models


class RetinaNet(torch.nn.Module):
    """Retinanet model based on Torchvision"""

    def __init__(
        self,
        n_classes: int = Config.n_classes,
        min_size: int = Config.min_size,
        trainable_backbone_layers: int = Config.trainable_backbone_layers,
        pretrained: bool = False,
        pretrained_backbone: bool = True,
    ) -> None:
        super().__init__()
        self.n_classes = n_classes
        self.model = models.detection.retinanet_resnet50_fpn(
            progress=True,
            pretrained=pretrained,
            num_classes=n_classes,
            min_size=min_size,
            trainable_backbone_layers=trainable_backbone_layers,
            pretrained_backbone=pretrained_backbone,
        )

Notice that the forward method (below) of the Module class returns the bounding boxes, confidence scores for the boxes, and the assigned labels for each box, in the form of tensors stored in a dictionary.

    def forward(
        self,
        x: Union[npt.NDArray[np.float64], list[torch.Tensor]],
        y: Optional[list[dict[str, torch.Tensor]]] = None,
    ) -> Union[
        dict[str, torch.Tensor], list[dict[str, torch.Tensor]], FrameAugmentedBoxes
    ]:
        """
        Forward pass for training and inference.

        Parameters
        ----------
        x
            A list of image tensors with shape (3, n_rows, n_cols) with
            unnormalized values in [0, 1].
        y
            An optional list of targets with an x1, x2, y1, y2 "boxes" tensor
            and a class index "labels" tensor.

        Returns
        -------
        result(s)
            If inference, this will be a list of dicts with predicted tensors
            for "boxes", "scores" and "labels" in each one. If training, this
            will be a dict with loss tensors for "classification" and
            "bbox_regression".
        """
        if isinstance(x, np.ndarray):
            return self.forward_numpy(x)
        elif self.training:
            return self.model.forward(x, y)

        results = self.model.forward(x, y)
        padded_results = []
        for result in results:
            padded_result: dict[str, torch.Tensor] = {}
            padded_result["boxes"] = torch.nn.functional.pad(
                result["boxes"], (0, 0, 0, Config.max_boxes), value=-1.0
            )[: Config.max_boxes]
            padded_result["scores"] = torch.nn.functional.pad(
                result["scores"], (0, Config.max_boxes), value=-1.0
            )[: Config.max_boxes]
            padded_result["labels"] = torch.nn.functional.pad(
                result["labels"], (0, Config.max_boxes), value=-1
            )[: Config.max_boxes].float()
            padded_results.append(padded_result)
        return padded_results

With that, we should be able to train and save the detector as a .pth file, which stores trained PyTorch weights.
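The post does not show the training loop itself, but since the project uses PyTorch Lightning, a minimal sketch of training and saving the weights could look like the following. The LightningModule, optimizer choice, hyperparameters, and collate function here are our own assumptions, not the project's actual code:

import pytorch_lightning as pl
import torch


def collate(samples):
    # Keep images and targets as lists, which is what torchvision detection models expect.
    return {key: [sample[key] for sample in samples] for key in samples[0]}


class DetectionTask(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = RetinaNet()

    def training_step(self, batch, batch_idx):
        targets = [
            {"boxes": boxes, "labels": labels.long()}
            for boxes, labels in zip(batch["boxes"], batch["labels"])
        ]
        # In training mode the forward pass returns a dict of loss tensors.
        losses = self.model(batch["image"], targets)
        return losses["classification"] + losses["bbox_regression"]

    def configure_optimizers(self):
        return torch.optim.Adam(self.model.parameters(), lr=1e-4)


task = DetectionTask()
loader = torch.utils.data.DataLoader(RetinaNetDataset(), batch_size=2, collate_fn=collate)
pl.Trainer(max_epochs=20).fit(task, loader)
torch.save(task.model.state_dict(), "detector.pth")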

One important detail to mention here is that the dictionary created during inference is converted into a Sparrow data structure called FrameAugmentedBox from the sparrow-datums package. In the following code snippet, the result_to_boxes() function converts the inference result into a FrameAugmentedBox object. We convert the inference results to Sparrow format so they can be directly used with other Sparrow packages such as sparrow-tracky, which is used to perform object tracking later on.

def result_to_boxes(
    result: dict[str, torch.Tensor],
    image_width: Optional[int] = None,
    image_height: Optional[int] = None,
) -> FrameAugmentedBoxes:
    """Convert RetinaNet result dict to FrameAugmentedBoxes."""
    box_data = to_numpy(result["boxes"]).astype("float64")
    labels = to_numpy(result["labels"]).astype("float64")
    if "scores" in result:
        scores = to_numpy(result["scores"]).astype("float64")
    else:
        scores = np.ones(len(labels))
    mask = scores >= 0
    box_data = box_data[mask]
    labels = labels[mask]
    scores = scores[mask]
    return FrameAugmentedBoxes(
        np.concatenate([box_data, scores[:, None], labels[:, None]], -1),
        ptype=PType.absolute_tlbr,
        image_width=image_width,
        image_height=image_height,
    )

Now, we can use the saved detection model to make inferences and obtain the predictions as FrameAugmentedBoxes.

Object Detection Evaluation

To quantify the performance of the detection model, we use MOTA (Multiple Object Tracking Accuracy) as the primary metric, which has a range of [-inf, 1.0], where perfect object tracking is indicated by 1.0. Since we have not performed tracking yet, we will estimate MOTA without the identity switches. Just for the sake of clarity, we will call it MODA (Multiple Object Detection Accuracy). To calculate MODA, we use a method called compute_moda() from the sparrow-tracky package which employs the pairwise IoU between bounding boxes to find the similarity.
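Under the hood, matching predicted boxes to ground-truth boxes comes down to pairwise IoU. For intuition, here is a minimal IoU helper for two (x1, y1, x2, y2) boxes; it is our own illustration, not the sparrow-tracky implementation:

def iou(box_a, box_b):
    """Intersection over union for two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0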

from sparrow_tracky import compute_moda

moda = 0
count = 0
for batch in iter(test_dataloader):
    x = batch['image']
    x = x.cuda()
    sample = {'boxes': batch['boxes'][0], 'labels': batch['labels'][0]}
    result = detection_model(x)[0]
    predicted_boxes = result_to_boxes(result)
    predicted_boxes = predicted_boxes[predicted_boxes.scores > DetConfig.score_threshold]
    ground_truth_boxes = result_to_boxes(sample)
    frame_moda = compute_moda(predicted_boxes, ground_truth_boxes)
    moda += frame_moda.value
    count += 1
print(f"Based on {count} test examples, the Multiple Object Detection Accuracy is {moda/count}.")

The MODA for our detection model turned out to be 0.98 based on 43 test examples, indicating that our model has high variance, so the model would not be as effective if we used a different video. To improve the high variance, we will need more training examples.

Object Tracking

Now that we have the trained detection model saved as a .pth file, we run inference with the detection model and perform object tracking at the same time. The source video is split into frames, and the detection predictions for each frame are converted into a FrameAugmentedBox as explained before, which is then fed into a Tracker object instantiated from Sparrow's sparrow-tracky package. While looping through the frames, the Tracker object tracks the vehicles using an algorithm similar to SORT (you can read more about our approach here). Finally, the data stored in the Tracker object is written to a file using a Sparrow data structure. The file will have a .json.gz extension when it is saved.

from sparrow_tracky import Tracker


def track_objects(
    video_path: Union[str, Path],
    model_path: Union[str, Path] = Config.pth_model_path,
) -> None:
    """
    Track ball and the players in a video.

    Parameters
    ----------
    video_path
        The path to the source video
    output_path
        The path to write the chunk to
    model_path
        The path to the ONNX model
    """
    video_path = Path(video_path)
    slug = video_path.name.removesuffix(".mp4")
    vehicle_tracker = Tracker(Config.vehicle_iou_threshold)
    detection_model = RetinaNet().eval().cuda()
    detection_model.load(model_path)
    fps, n_frames = get_video_properties(video_path)
    reader = imageio.get_reader(video_path)
    for i in tqdm(range(n_frames)):
        data = reader.get_data(i)
        data = cv2.rectangle(
            data, (450, 200), (1280, 720), thickness=5, color=(0, 255, 0)
        )
        # input_height, input_width = data.shape[:2]
        aug_boxes = detection_model(data)
        aug_boxes = aug_boxes[aug_boxes.scores > TrackConfig.vehicle_score_threshold]
        boxes = aug_boxes.array[:, :4]
        vehicle_boxes = FrameBoxes(
            boxes,
            PType.absolute_tlbr,  # (x1, y1, x2, y2) in absolute pixel coordinates [with respect to the original image size]
            image_width=data.shape[1],
            image_height=data.shape[0],
        ).to_relative()
        vehicle_tracker.track(vehicle_boxes)
    make_path(str(Config.prediction_directory / slug))
    vehicle_chunk = vehicle_tracker.make_chunk(fps, Config.vehicle_tracklet_length)
    vehicle_chunk.to_file(
        Config.prediction_directory / slug / f"{slug}_vehicle.json.gz"
    )

Tracking Evaluation

We quantify our tracking performance using MOTA (Multiple Object Tracking Accuracy) from the sparrow-tracky package. Similar to MODA, MOTA has a range of [-inf, 1.0], where 1.0 indicates perfect tracking performance. Note that we need a sample video with tracking ground truth to perform the evaluation. Further, both the predictions and the ground truth should be transformed into an AugmentedBoxTracking object to be compatible with the compute_mota() function from the sparrow-tracky package, as shown below.

from sparrow_datums import AugmentedBoxTracking, BoxTracking
from sparrow_tracky import compute_mota

test_mota = compute_mota(pred_aug_box_tracking, gt_aug_box_tracking)
print("MOTA for the test video is ", test_mota.value)

The MOTA value for our tracking algorithm turns out to be -0.94 when we evaluated it against a small sample of “ground-truth” video clips, indicating that there is plenty of room for improvement.

Keypoint Model

In order to locate each tire of a vehicle, we crop the vehicles detected by the detection model, resize them, and feed them into the keypoint model to predict the tire locations.

During cropping and resizing, the x and y coordinates of the keypoints need to be handled properly. When x and y coordinates are divided by the image width and the image height, the range of the keypoints becomes [0, 1], and we use the term “relative coordinates”. Relative coordinates tell us the location of a pixel with respect to the dimensions of the image it is located at. Similarly, when keypoints are not divided by the dimensions of the frame, we use the term “absolute coordinates”, which solely relies on the frame’s Cartesian plane to establish pixel location. Assuming the keypoints are in relative coordinates when we read them from the annotation file, we have to multiply by the original frame dimensions to get the keypoints in absolute pixel space. Since the keypoint model takes in cropped regions, we have to subtract the top-left coordinate from the keypoints to change the origin of the cropped region from (0, 0) to (x1, y1). Since we resize the cropped region, we divide the shifted keypoints by the dimensions of the cropped region and then multiply by the keypoint model input size. You can visualize this process using the flow chart below.

Pre-processing steps for the keypoints
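As a concrete sketch of those pre-processing steps, the transformation can be written as a few lines of arithmetic. The helper name and argument layout below are ours, not the project's actual functions:

def preprocess_keypoint(kp_rel, frame_size, box, crop_size):
    """Map one keypoint from relative frame coordinates into the resized crop's pixel space."""
    frame_w, frame_h = frame_size
    x1, y1, x2, y2 = box          # vehicle bounding box in absolute frame pixels
    crop_w, crop_h = crop_size    # keypoint model input size
    # relative -> absolute frame pixels
    x, y = kp_rel[0] * frame_w, kp_rel[1] * frame_h
    # shift the origin to the crop's top-left corner (x1, y1)
    x, y = x - x1, y - y1
    # scale from the cropped region's size to the model input size
    x = x / (x2 - x1) * crop_w
    y = y / (y2 - y1) * crop_h
    return x, y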

It is a known fact that neural networks are better at learning surfaces than single points. Because of that, we create two Gaussian surfaces where the highest energy is at the location of each keypoint. These two images are called heatmaps, and they are stacked on top of each other before being fed into the model. To visualize the heatmaps, we can combine both into a single heatmap and superimpose it on the corresponding vehicle as shown below.

The heatmap (annotation) representing both keypoints (blue: back tire, red: front tire) is overlaid on the corresponding vehicle (training example)

def keypoints_to_heatmap(
    x0: int, y0: int, w: int, h: int, covariance: float = Config.covariance_2d
) -> np.ndarray:
    """Create a 2D heatmap from an x, y pixel location."""
    if x0 < 0 and y0 < 0:
        x0 = 0
        y0 = 0
    xx, yy = np.meshgrid(np.arange(w), np.arange(h))
    zz = (
        1
        / (2 * np.pi * covariance**2)
        * np.exp(
            -(
                (xx - x0) ** 2 / (2 * covariance**2)
                + (yy - y0) ** 2 / (2 * covariance**2)
            )
        )
    )
    # Normalize zz to be in [0, 1]
    zz_min = zz.min()
    zz_max = zz.max()
    zz_range = zz_max - zz_min
    if zz_range == 0:
        zz_range += 1e-8
    return (zz - zz_min) / zz_range

The important thing to notice here is that if the keypoint coordinates are negative, (0, 0) is assigned. When a tire is not visible (e.g. because of occlusion), we assign (-1, -1) for the missing tire in the Dataset class, since the PyTorch collate_fn() requires fixed input shapes. In the keypoints_to_heatmap() function, the negative value is zeroed out, meaning the tire is placed at the top-left corner of the vehicle's bounding box as shown below.

The heatmap behaves normally for the present tire (blue: back tire) but assigns (0, 0) for the absent tire (red: front tire)

In real life, this is impossible, since tires are located in the bottom half of the bounding box. The model learns these patterns during the training and continues to predict the missing tire in the top left corner which makes it easier for us to filter.

The Dataset class for the keypoint model could look like this:

class SegmentationDataset(torch.utils.data.Dataset):
    """Dataset class for Segmentations model."""

    def __init__(self, holdout: Optional[Holdout] = None) -> None:
        self.samples = get_sample_dicts(holdout)

    def __len__(self):
        """Length of the sample."""
        return len(self.samples)

    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
        """Get the tensor dict for a sample."""
        sample = self.samples[idx]
        crop_width, crop_height = Config.image_crop_size
        keypoints = process_keypoints(sample["keypoints"], sample["bounding_box"])
        heatmaps = []
        for x, y in keypoints:
            heatmaps.append(keypoints_to_heatmap(x, y, crop_width, crop_height))
        heatmaps = np.stack(heatmaps, 0)
        img = Image.open(sample["image_path"])
        img = crop_and_resize(sample["bounding_box"], img, crop_width, crop_height)
        x = image_transform(img)
        return {
            "holdout": sample["holdout"],
            "image_path": sample["image_path"],
            "annotation_path": sample["annotation_path"],
            "heatmaps": heatmaps.astype("float32"),
            "keypoints": keypoints.astype("float32"),
            "labels": sample["labels"],
            "image": x,
        }

Note that the Dataset class creates a dictionary with the following keys:

  • holdout: Whether the example belongs to train, dev (validation), or test set
  • image_path: Stored location of the video frame
  • annotation_path: Stored location of the annotation file corresponding to the frame
  • heatmaps: Transformed keypoints in the form of surfaces
  • labels: Labels of the keypoints
  • image: Numerical values of the frame

For the Module class of the keypoint model we use a pre-trained ResNet50 architecture with a sigmoid classification top.

A high-level implementation of the Module class would be:

from torchvision.models.segmentation import fcn_resnet50


class SegmentationModel(torch.nn.Module):
    """Model for prediction court/net keypoints."""

    def __init__(self) -> None:
        super().__init__()
        self.fcn = fcn_resnet50(
            num_classes=Config.num_classes, pretrained_backbone=True, aux_loss=False
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Run a forward pass with the model."""
        heatmaps = torch.sigmoid(self.fcn(x)["out"])
        ncols = heatmaps.shape[-1]
        flattened_keypoint_indices = torch.flatten(heatmaps, 2).argmax(-1)
        xs = (flattened_keypoint_indices % ncols).float()
        ys = torch.floor(flattened_keypoint_indices / ncols)
        keypoints = torch.stack([xs, ys], -1)
        return {"heatmaps": heatmaps, "keypoints": keypoints}

    def load(self, model_path: Union[Path, str]) -> str:
        """Load model weights."""
        weights = torch.load(model_path)
        result: str = self.load_state_dict(weights)
        return result

Now, we have everything we need to train and save the model in .pth format.

Keypoints post-processing

Recall that since we transformed the coordinates of the keypoints before feeding them into the model, the keypoint predictions are going to be in absolute space with respect to the dimensions of the resized cropped region. To project them back to the original frame dimensions we have to follow a few steps mentioned in the flowchart below.

Post-processing points for the keypoints

First, we have to divide the keypoints by the dimensions of the model input size, which takes the keypoints into relative space. Then, we have to multiply them by the dimensions of the cropped region to take them back to absolute space with respect to the cropped region's dimensions. Although the keypoints are now back in absolute space, the origin of their coordinate system is still at (x1, y1). So, we have to add (x1, y1) to bring the origin back to (0, 0) of the original frame's coordinate system.
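In code, the inverse mapping is just the pre-processing steps run backwards; again, the names below are our own illustration rather than the project's helpers:

def postprocess_keypoint(kp_pred, model_input_size, box):
    """Map a predicted keypoint from the resized crop back into original frame pixels."""
    input_w, input_h = model_input_size
    x1, y1, x2, y2 = box
    # model-input pixels -> relative coordinates of the crop
    rel_x, rel_y = kp_pred[0] / input_w, kp_pred[1] / input_h
    # relative -> absolute pixels within the cropped region
    x, y = rel_x * (x2 - x1), rel_y * (y2 - y1)
    # shift the origin back to the original frame
    return x + x1, y + y1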

Keypoint model evaluation

We quantify the model performance using the relative error metric, which is the ratio of the magnitude of the difference between ground truth and prediction compared to the magnitude of the ground truth.
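In other words, the metric is the norm of the prediction error divided by the norm of the ground truth; the evaluation loop below computes exactly that, batch by batch. A tiny helper (our own, for illustration):

import torch

def relative_error(ground_truth: torch.Tensor, prediction: torch.Tensor) -> torch.Tensor:
    # ||ground_truth - prediction|| / ||ground_truth||
    return torch.norm(ground_truth - prediction) / torch.norm(ground_truth)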

overall_rel_error = 0
count = 0
for batch in iter(test_dataloader):
    x = batch['image']
    x = x.cuda()
    result = model(x)
    relative_error = torch.norm(
        batch["keypoints"].cuda() - result["keypoints"].cuda()
    ) / torch.norm(batch["keypoints"].cuda())
    overall_rel_error += relative_error
    count += 1
print(f"The relative error of the test set is {overall_rel_error * 100}%.")

The relative error of our keypoint model turns out to be 20%, which indicates that for every ground truth with a magnitude of 100 units, there is an error of 20 units in the corresponding prediction. This model is also over-fitting to some extent, so it would probably perform poorly on a different video. However, this could be mitigated by adding more training examples.

Aggregating Multi-Model Predictions

Recall that we saved the tracking results in a .json.gz file. Now, we open that file using the sparrow-datums package, merge the keypoint predictions, and write the result into two JSON files called objectwise_aggregation.json and framewise_aggregation.json. The motivation behind having these two files is that they let us access all the predictions in one place in constant time (O(1)). More specifically, the objectwise_aggregation.json dictionary uses the order in which each object appeared in the video as the key and all the predictions regarding that object as the value.

Here’s a list of things objectwise_aggregation.json keeps for every object:

  • Frame range the object appeared
  • The bounding box of each appearance
  • Object tracklet ID (Unique ID assigned for each object by the SORT algorithm)

In contrast, the framewise_aggregation.json dictionary uses the frame number as the key. All the predictions related to that frame are used as the value.

The following is the list of things we can grab from each video frame:

  • All the objects that appeared in that frame
  • The bounding box of the objects
  • Keypoints (Transformed x, y coordinates of the tires)
  • Sigmoid scores of the keypoints (we used a sigmoid function to classify between the back and front tire)
  • Object ID (the order that the object appeared in the input video)
  • Object tracklet ID (Unique ID assigned for each object by the SORT algorithm)
Speed Algorithm

Once we have all the primitives detected from frames and videos, we will use both frame-wise and object-wise data to estimate the speed of the vehicles based on the model predictions. The simplest form of the speed algorithm is to measure the distance between the front tire and the back tire at frame n (this distance is known as the wheelbase), and then determine how many frames it takes for the back tire to travel that wheelbase distance starting from frame n. For simplicity, we assume that every vehicle has the same wheelbase, 2.43 meters. Further, since we do not know anything about the site or the equipment, we assume that the pixel-wise wheelbase of a vehicle remains the same. Therefore, our approach works best when the camera is located at the side of the road and pointed orthogonally to the vehicles' direction of travel (which is not the case in the demo video).
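Numerically, the estimate boils down to a single division. A small worked example, assuming a 30 fps video (the frame rate here is our assumption, matching the constant used later in the code):

WHEEL_BASE_M = 2.43   # assumed constant wheelbase from the post
MPS_TO_MPH = 2.23694

def instantaneous_speed_mph(frames_elapsed, fps=30):
    """Speed if the back tire covers one wheelbase in `frames_elapsed` frames."""
    seconds = frames_elapsed / fps
    return WHEEL_BASE_M / seconds * MPS_TO_MPH

print(instantaneous_speed_mph(8))  # 2.43 m in ~0.27 s -> roughly 20 mph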

Noise Filtering

Since our keypoint model has a 20% error rate, the keypoint predictions are not perfect, so we have to filter out some of the keypoints. Here are some of the observations we made to identify common scenarios where the model under-performs.

Scenario 1

Model predictions around the green boundary do not perform well since only a portion of the vehicle is visible to the camera. So, it is better to wait for those vehicles to reach a better detection area. Therefore, we filter out any vehicle predictions whose top-left bounding box coordinate x1 is less than some threshold.

Scenario 2

For the missing tires, we taught the model to make predictions at the origin of the bounding box. Additionally, there are instances when the model mis-predicts the keypoint and it ends up being on the upper half of the bounding box of the vehicle. Both of these cases can be resolved by removing any keypoints that are located on the upper half of the bounding box.

Scenario 3

For missing tires, the model tends to predict both tires at the same spot, so we have to remove one of them. In this case, we can draw a circle centered on the back tire, and if the other tire falls inside that circle, we get rid of the tire that had the lower probability in the sigmoid classification top.

Summary of the rules formed

So, we form three rules to filter out keypoints that are not relevant.

  1. Filter out any keypoints whose top-left bounding box coordinate satisfies x1 < some threshold.
  2. Ignore any keypoints that are predicted in the upper half of the bounding box.
  3. Neglect the keypoint with the lower sigmoid score when tires overlap with each other.
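Put together, the three rules could be expressed as a small filter like the following. The threshold values here are placeholders; the project keeps its real values in its config:

def filter_keypoint_pair(box, back_tire, front_tire, back_score, front_score,
                         x_threshold=450, overlap_radius=10):
    """Apply the three noise rules to one vehicle; returns (back, front) with None for dropped tires."""
    x1, y1, x2, y2 = box
    # Rule 1: skip vehicles that are still near the edge of the detection region.
    if x1 < x_threshold:
        return None, None
    mid_y = (y1 + y2) / 2
    # Rule 2: a tire cannot sit in the upper half of the bounding box.
    back = back_tire if back_tire[1] > mid_y else None
    front = front_tire if front_tire[1] > mid_y else None
    # Rule 3: if both tires collapse onto the same spot, keep the higher-scoring one.
    if back is not None and front is not None:
        dist = ((back[0] - front[0]) ** 2 + (back[1] - front[1]) ** 2) ** 0.5
        if dist < overlap_radius:
            if back_score < front_score:
                back = None
            else:
                front = None
    return back, front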

When we plot all the keypoints predicted throughout the input video, notice that most of the tires overlap and the general trend is a straight line.

When the x vs y coordinates of all the vehicles in the video are plotted on the same graph before applying noise rules.

When the x vs y coordinates of individual vehicles are plotted on the same graph before applying noise rules.

After the rules have been applied, notice that most of the outliers get filtered out.

When the x vs y coordinates of all the vehicles in the video are plotted on the same graph after applying noise rules.

Also, note that some vehicles will be completely ignored leaving only four vehicles.

When the x vs y coordinates of individual vehicles are plotted on the same graph after applying noise rules.

Filling the Missing Values

When we perform filtering, there are instances where only a single tire gets filtered out and the other remains. To fix that, we fit the keypoint data of each vehicle to a straight line using linear regression, where the x and y coordinates of the back tire and the x coordinate of the front tire are the independent variables and the y coordinate of the front tire is the dependent variable. Using the coefficients of the fitted line, we can then predict and fill in the missing values, as sketched below.
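A least-squares fit is enough for this. The following is a rough stand-in for the project's straight_line_fit() and fill_missing_keypoints() helpers, written by us for illustration:

import numpy as np

def fit_front_tire_y(back_x, back_y, front_x, front_y):
    """Fit front-tire y as a linear function of (back_x, back_y, front_x)."""
    X = np.column_stack([back_x, back_y, front_x, np.ones(len(back_x))])
    coef, *_ = np.linalg.lstsq(X, np.asarray(front_y), rcond=None)
    return coef  # three weights plus a bias term

def predict_front_tire_y(coef, back_x, back_y, front_x):
    return coef[0] * back_x + coef[1] * back_y + coef[2] * front_x + coef[3]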

For example, here’s what the straight-line trend looks like with missing values.

Straight line trend for a particular vehicle before filling in the missing values.

After predicting the missing values with linear regression, we can gain 50 more points to estimate the speed more precisely.

Straight line trend for a particular vehicle after filling the missing values.

Speed Estimation

Now that we have complete pairs of keypoints, it’s time to jump into the geometry of the problem we are solving…

If we draw a circle around the back tire with a radius representing the pixel-wise wheelbase (d), we form the gray area defined on the diagram. Our goal is to find out which back tire that shows up in a future frame has reached the distance of d from the current back tire. Since the keypoints are still not perfect, we could land on a future back tire that is located anywhere on the circle. To remedy that, we can find the theta angle by finding alpha and beta with simple trigonometry. Then, we threshold the theta value and define that theta cannot exceed the threshold. Ideally, theta should be zero since the vehicle is traveling on a linear path. Although the future back tire and the current front tire ideally should be on the same circular boundary, there can be some errors. So, we define an upper bound (green dotted line) and a lower bound (purple dotted line) on the diagram. Let’s put this together to form an algorithm to estimate the speed.

Speed Algorithm
  • Pick a pair of tires (red and yellow).
  • Find the distance (d) between the chosen pair.
  • From the back tire, draw a circle with a radius d.
  • From that back tire, iterate through the future back tires until the distance between the current back tire and the future back tire match d.
  • If d has been exceeded by more than m pixels (green boundary), ignore the current tire pair and move on to the next pair.
  • If d has fallen short by more than m pixels (purple boundary), ignore the current tire pair and move on to the next pair.
  • If not, find theta using alpha and beta; if theta is less than the threshold value of 𝛄, and if the number of frames from the current back tire to the future back tire is larger than 1, consider that point as eligible for speed estimation.
  • Otherwise, ignore the current pair and move on to the next pair.
  • Stop when all the tire pairs have been explored.
  • Find out how many frames there are between the current back tire and the future back tire to find the elapsed time.
  • Wheelbase (d) / elapsed time renders the instantaneous vehicle speed.
def estimate_speed(video_path_in):
    """Estimate the speed of the vehicles in the video.

    Parameters
    ----------
    video_path_in : str
        Source video path
    """
    video_path = video_path_in
    slug = Path(video_path).name.removesuffix(".mp4")
    objects_to_predictions_map = open_objects_to_predictions_map(slug)
    object_names, vehicle_indices, objectwise_keypoints = filter_bad_tire_pairs(
        video_path
    )
    speed_collection = {}
    for vehicle_index in vehicle_indices:  # Looping through all objects in the video
        approximate_speed = -1
        object_name = object_names[vehicle_index]
        coef, bias, data = straight_line_fit(objectwise_keypoints, object_name)
        (
            back_tire_x_list,
            back_tire_y_list,
            front_tire_x_list,
            front_tire_y_list,
        ) = fill_missing_keypoints(coef, bias, data)
        vehicle_speed = []
        skipped = 0
        back_tire_keypoints = [back_tire_x_list, back_tire_y_list]
        back_tire_keypoints = [list(x) for x in zip(*back_tire_keypoints[::-1])]
        front_tire_keypoints = [front_tire_x_list, front_tire_y_list]
        front_tire_keypoints = [list(x) for x in zip(*front_tire_keypoints[::-1])]
        back_tire_x_list = []
        back_tire_y_list = []
        front_tire_x_list = []
        front_tire_y_list = []
        # Speed calculation algorithm starts here...
        vehicle_speed = {}
        total_num_points = len(objectwise_keypoints[object_name])
        object_start = objects_to_predictions_map[vehicle_index]["segments"][0][0]
        for i in range(total_num_points):
            back_tire = back_tire_keypoints[i]
            front_tire = front_tire_keypoints[i]
            if back_tire[0] < 0 or front_tire[0] < 0:
                vehicle_speed[i + object_start] = approximate_speed
                skipped += 1
                continue
            for j in range(i, total_num_points):
                future_back_tire = back_tire_keypoints[j]
                if future_back_tire[0] < 0:
                    continue
                back_tire_x = back_tire[0]
                back_tire_y = back_tire[1]
                front_tire_x = front_tire[0]
                front_tire_y = front_tire[1]
                future_back_tire_x = future_back_tire[0]
                future_back_tire_y = future_back_tire[1]
                current_keypoints_distance = get_distance(
                    back_tire_x, back_tire_y, front_tire_x, front_tire_y
                )
                future_keypoints_distance = get_distance(
                    back_tire_x, back_tire_y, future_back_tire_x, future_back_tire_y
                )
                if (
                    current_keypoints_distance - future_keypoints_distance
                ) >= -SpeedConfig.distance_error_threshold and (
                    current_keypoints_distance - future_keypoints_distance
                ) <= SpeedConfig.distance_error_threshold:
                    alpha = get_angle(
                        back_tire_x, back_tire_y, future_back_tire_x, future_back_tire_y
                    )
                    beta = get_angle(
                        back_tire_x, back_tire_y, front_tire_x, front_tire_y
                    )
                    if (
                        SpeedConfig.in_between_angle >= alpha + beta
                        and (j - i) > 1
                    ):
                        approximate_speed = round(
                            SpeedConfig.MPERSTOMPH
                            * SpeedConfig.WHEEL_BASE
                            / frames_to_seconds(30, j - i)
                        )
                        vehicle_speed[i + object_start] = approximate_speed
                        back_tire_x_list.append(back_tire_x)
                        back_tire_y_list.append(back_tire_y)
                        front_tire_x_list.append(front_tire_x)
                        front_tire_y_list.append(front_tire_y)
                        break
        speed_collection[vehicle_index] = vehicle_speed
    f = open(SpeedConfig.json_directory / slug / "speed_log.json", "w")
    json.dump(speed_collection, f)
    f.close()

The instantaneous speed calculated by the algorithm is saved into a JSON file called speed_log.json which keeps track of the instantaneous speed of the detected vehicles with their corresponding frames. Also, the detected speed is printed on the video frame and all the video frames are put back together to produce the following annotated video.

After iterating through all the frames, we can use the speed log JSON file to calculate general statistics such as the maximum speed, the fastest vehicle, and the average speed of every vehicle to create a report of the traffic in the video feed.
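For example, a report could be assembled from speed_log.json along these lines. This is only a sketch; negative entries mark frames where no estimate was possible:

import json

with open("speed_log.json") as f:
    speed_log = json.load(f)  # {vehicle_index: {frame: mph, ...}, ...}

average_speeds = {}
for vehicle, frames in speed_log.items():
    speeds = [mph for mph in frames.values() if mph > 0]
    if speeds:
        average_speeds[vehicle] = sum(speeds) / len(speeds)

fastest_vehicle = max(average_speeds, key=average_speeds.get)
print(f"{len(average_speeds)} vehicles, fastest: vehicle {fastest_vehicle} "
      f"at {max(average_speeds.values()):.0f} mph on average")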

Conclusion

Let’s summarize what we did today. We built a computer vision system that can estimate the speed of vehicles in a given video. To estimate the speed of a vehicle, we needed to know the locations of its tires in every frame that it appeared. For that, we needed to perform three main tasks.

  1. Detecting vehicles
  2. Tracking the vehicles and assigning unique identities to them
  3. Predicting the keypoints for every vehicle in every frame

After acquiring keypoint locations for every vehicle, we developed a geometric rule-based algorithm to estimate the number of frames it takes for the back tire of a vehicle to reach its corresponding front tire's position. With that information in hand, we can approximate the speed of that vehicle.

Before we wind up the project, you can check out the complete implementation of the code here. This code would have been more complex if it weren't for the Sparrow packages, so make sure to check them out as well. You can find me on LinkedIn if you have any questions.

The post Speed Trap appeared first on Sparrow Computing.

Categories: FLOSS Project Planets

PyBites: 5 ways I use GitHub Actions

Planet Python - Wed, 2022-12-14 12:33

I am increasingly using GitHub Actions these days.

If you’re new to this you might want to check out our article or video.

In this article I will show you 5 cool ways I use it day to day.

Run tests and linters

The first and most obvious reason is to automate tooling. Although you probably want to first and foremost do this step locally, for example by using pre-commit, it’s nice to always have that second automated check in the cloud.

Example workflow.

On: push dictates that I want to run this job upon every code push.

You can read the steps yourself, but basically we set up a container installing python and the required project dependencies. Then we run pytest with coverage. There is one environment variable required (EMOJI_PREFERENCES) which I set under env.

Here is another example of a test runner further restricting the on clause.

Automate publishing a package to PyPI

In the same repo you can see another workflow.

This creates a new package and pushes it to PyPI. It always tries the test instance, but only publishes to the "live" PyPI instance when I push a tag (using the condition if: startsWith(github.ref, 'refs/tags')).

So now I don’t have to use flit / poetry / twine locally, I can just do rely on using git tags:

git tag x.y.z
git push --tags

(I have to check but I think I also need to manually update the version in __init__.py)

How does it know about my PyPI auth token?

You can use GitHub’s encrypted secrets which I load in like this:

password: ${{ secrets.PYPI_API_TOKEN }}

This is pretty slick.

Update my GitHub Readme

A few weeks ago I made a self updating Readme on GitHub.

The build-readme.py script runs every day to pull in new content (articles, tips, toots) and updates my profile repo’s Readme with it.

Here is the workflow.

As you can see, this one is slightly different from the previous ones: it uses the cron feature, so this script runs once a day:

on:
  push:
  schedule:
    - cron: "45 8 * * *"

Back up Mastodon toots

While on the topic of syncing content, I used the same cronjob style GitHub Action to sync my Mastodon (Fosstodon) activity to a local database (repo).

Here is the job.

This runs a few times a day (for crontab syntax crontab.guru is really useful):

on:
  push:
  schedule:
    # At minute 4 past every 4th hour
    # https://crontab.guru/#4_*/4_*_*_*
    - cron: '4 */4 * * *'

Notice how the job also commits the updated sqlite db to version control using stefanzweifel/git-auto-commit-action@v4

Post to an API

Lastly, I made a job to sync new tips from my code tips repo to our CodeImag.es API

Here is the job.

I run it once a day: cron: "45 16 * * *"

It follows the usual order of setting up a container, installing Python and the requirements, and then running the script:

python sync_tips.py

Similarly to the publishing a package workflow (2nd tip), I use GitHub secrets to load in my API user credentials.

CODEIMAGES_USER: ${{ secrets.CODEIMAGES_USER }}
CODEIMAGES_PASSWORD: ${{ secrets.CODEIMAGES_PASSWORD }}

And that’s it. I hope this gave you some inspiration to leverage GitHub Actions to save time and make your life easier

– Bob

Categories: FLOSS Project Planets

health @ Savannah: GNU Health Hospital Management client 4.0.2 available

GNU Planet! - Wed, 2022-12-14 12:17

Dear GNUHealth community:

I am happy to announce the maintenance release 4.0.2 of the Hospital Management client (GTK).

Release 4.0.2 of the GNUHealth HMIS client includes bug fixes (see the Changelog[1]) and is REUSE compliant[2].

As usual, the source code can be downloaded from the official GNU ftp site[3]. You can also install it directly via pip[4].

You can join us at Mastodon for the latest news and events around GNUHealth! (https://mastodon.social/@gnuhealth)

Happy and healthy hacking!
Luis

1.- https://hg.savannah.gnu.org/hgweb/health-hmis-client/file/5d21a06fa998/Changelog
2.- https://reuse.software/
3.- https://ftp.gnu.org/gnu/health/
4.- https://pypi.org/project/gnuhealth-client/

Categories: FLOSS Project Planets

Selenium + AT-SPI = GUI Testing

Planet KDE - Wed, 2022-12-14 10:59

At KDE we have multiple levels of quality assurance, ranging from various degrees of humans testing features to fully automated testing. Automated testing is incredibly important for the continued quality of our software. A big cornerstone of our testing strategy are so-called unit tests: they test a specific piece of our software for its behavior in isolation. But for many aspects of our software we need a much higher-level view. Testing pieces of Plasma's application launcher in isolation is all good and well, but that won't tell us whether the entire UI can be easily navigated using the keyboard. For this type of test we require a different testing approach altogether. A couple of months ago I set out to create a testing framework for this use case, and I'm glad to say that it has matured enough to be used for writing tests. I'd like to walk you through the technical building blocks and a simple example.

Let us start off by looking at the architecture at large. So… there's Selenium, which is an incredibly popular, albeit web-oriented, testing framework. Its main advantages for us are its popularity and that it sports a server-client split. This means we can leverage the existing client tooling available for Selenium without having to write anything ourselves; we only need to grow a server. The server component, called a WebDriver, implements the actual interaction with UI elements and is generic enough to also apply to desktop applications. Indeed, others thought so as well: there already exists Appium, which extends Selenium with more app-specific features and behaviors. Something for us to build upon. The clients, meanwhile, are completely separate and talk to the WebDriver over a well-defined JSON REST protocol, meaning we can reuse the existing clients without having to write anything ourselves. They are available in a multitude of programming languages, and who knows, maybe we'll eventually get one for writing Selenium tests in QML.

That of course doesn’t explain how GUI testing can work with this on Linux. Enter: AT-SPI. AT-SPI is an accessibility API and pretty much the standard accessibility system for use on Linux. Obviously its primary use is assistive technologies, like the screen reader Orca, but to do its job it essentially offers a toolkit-independent way of introspecting and interacting with GUI applications. This then gives us a way to implement a WebDriver without caring about the toolkit or app specifics. As long as the app supports AT-SPI, which all Qt apps do implicitly, we can test it.

Since all the client tooling is independent of the server all we needed to get GUI testing going was a WebDriver that talks to AT-SPI.

That is what I set out to write and I’m happy to announce that we now have an AT-SPI based WebDriver, and the first tests are popping into existence already. There is also lovely documentation to hold onto.

So, without further ado. Let us write a simple test. Since the documentation already writes one in Python I’ll use Ruby this time around so we have some examples of different languages. A simple candidate is KInfoCenter. We can test its search functionality with a couple of lines of code.

First we need to install selenium-webdriver-at-spi: clone it, build it with CMake, and install it with CMake. You'll also need to install the relevant client libraries. For Ruby, that's simply running gem install appium_lib.

Then we can start with writing our test. We will need some boilerplate setup logic. This is more or less the same for every test. For more details on the driver setup you may also check the wiki page.

def setup
  @appium_driver = Appium::Driver.new(
    {
      'caps' => { app: 'org.kde.kinfocenter.desktop' },
      'appium_lib' => {
        server_url: 'http://127.0.0.1:4723',
        wait_timeout: 10,
        wait_interval: 0.5
      }
    }, true
  )
  @driver = @appium_driver.start_driver
end

The driver will take care of starting the correct application and make sure that it is actually running correctly. Next we'll write the actual test: let's test the search. The first order of business is using a tool called Accerciser to inspect the AT-SPI presentation of the application. For more information on how to use this tool, please refer to the wiki. Using Accerciser I've located the search field and learned that it is called 'Search'. So, let's locate it, activate it, and search for the CPU module:

def test_search
  search = driver.find_element(:name, 'Search')
  search.click
  search.send_keys('cpu')

Next let us find the CPU list item and activate it:

  cpu = driver.find_element(:class_name, '[list item | CPU]')
  assert(cpu.displayed?)
  cpu.click

And finally let’s assert that the page was actually activated:

  cpu_tab = driver.find_element(:class_name, '[page tab | CPU]')
  assert(cpu_tab.displayed?)

To run the complete test we can use the run wrapper: selenium-webdriver-at-spi-run ./kinfocentertest.rb (mind that it needs to be +x). If all has gone well we should get a successful test.

Finished in 1.345276s, 0.7433 runs/s, 1.4867 assertions/s.
1 runs, 2 assertions, 0 failures, 0 errors, 0 skips
I, [2022-12-14T13:13:53.508516 #154338]  INFO -- : tests done
I, [2022-12-14T13:13:53.508583 #154338]  INFO -- : run.rb exiting true

This should get you started with writing a test for your application! I’ll gladly help and review your forthcoming tests.
For more detailed documentation check out the writing-tests wiki page as well as the appium command reference.

Of course the work is not done. selenium-webdriver-at-spi is very much still a work in progress and I’d be glad for others to help add features as they become needed. The gitlab project is the place for that.

The complete code of the example above:

#!/usr/bin/env ruby
# frozen_string_literal: true

# SPDX-License-Identifier: GPL-2.0-only OR GPL-3.0-only OR LicenseRef-KDE-Accepted-GPL
# SPDX-FileCopyrightText: 2022 Harald Sitter <sitter@kde.org>

require 'appium_lib'
require 'minitest/autorun'

class TestKInfoCenter < Minitest::Test
  attr_reader :driver

  def setup
    @appium_driver = Appium::Driver.new(
      {
        'caps' => { app: 'org.kde.kinfocenter.desktop' },
        'appium_lib' => {
          server_url: 'http://127.0.0.1:4723',
          wait_timeout: 10,
          wait_interval: 0.5
        }
      }, true
    )
    @driver = @appium_driver.start_driver
  end

  def teardown
    driver.quit
  end

  def test_search
    search = driver.find_element(:name, 'Search')
    search.click
    search.send_keys('cpu')

    cpu = driver.find_element(:class_name, '[list item | CPU]')
    assert(cpu.displayed?)
    cpu.click

    cpu_tab = driver.find_element(:class_name, '[page tab | CPU]')
    assert(cpu_tab.displayed?)
  end
end
Categories: FLOSS Project Planets

CodersLegacy: Ctypes vs Python – Performance Comparison

Planet Python - Wed, 2022-12-14 10:49

Python Ctypes is a foreign function library that allows Python code to call C or C++ functions directly. This can be useful for improving the performance of Python code, particularly when working with large data sets or computationally intensive tasks. In this article, we will do a performance comparison between native Python code and Python + Ctypes code.

Testing Environment

In order to make this comparison as fair and accurate as possible, we will be using 3 different test cases.

  1. For-loops (finding sum)
  2. Recursion-heavy Code
  3. Quicksort Algorithm

Each test case will be benchmarked appropriately using the timeit library. This library produces accurate results by running the same code many times and averaging the time taken. We will be running each test case about 100 times.

These tests were performed on a fairly powerful and modern laptop (Ryzen 7 5700U, 8 cores, 16 threads). Execution times will vary based on the system being used. All driver code is provided so you can try out the code yourselves too.

We use three types of files: "Driver" files with the benchmarking code, "Python" files with the Python functions, and "Ctypes" files with the C library code.

We use the following notation in our code: "Python_1" indicates the Python file for the first test case, "Ctypes_1" the C library for the first test case, "Ctypes_2" the C code for the second test case, and so on.
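For reference, here is roughly how one of these C libraries gets compiled and loaded. The gcc invocation and the argtypes/restype declarations below are our own additions; the article's driver code relies on ctypes' default int handling:

# Build the shared library first, e.g.:  gcc -shared -fPIC -o ctypes_1.so ctypes_1.c
import ctypes
import os

clibrary = ctypes.CDLL(os.path.join(os.getcwd(), "ctypes_1.so"))
clibrary.calc_sum.argtypes = [ctypes.c_int]  # declare the C signature explicitly
clibrary.calc_sum.restype = ctypes.c_int
print(clibrary.calc_sum(100))  # sum of 0..99 -> 4950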

Now let’s begin this Ctypes and Python Comparison!

Ctypes and Python Comparison #1

Here we have our Python code for calculating the sum of numbers from 1 to “n”.

def calc_sum(n):
    sum = 0
    for i in range(n):
        sum += i
    return sum

Here is the same in the C programming language.

int calc_sum(int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += i;
    return sum;
}

And here is our driver code (you can skip this part if you aren’t interested and just go to results)

import timeit
from python_1 import calc_sum

code = '''\
calc_sum(100)
'''

setup = '''\
from __main__ import calc_sum'''

times = timeit.repeat(setup = setup, stmt = code, number = 1, repeat = 100)
print('Python: calculate sum: {}'.format(sum(times) / 100))

code = '''\
path = os.getcwd()
clibrary = ctypes.CDLL(os.path.join(path, 'ctypes_1.so'))
clibrary.calc_sum(100)
'''

setup = '''\
import os
import ctypes'''

times = timeit.repeat(setup = setup, stmt = code, number = 1, repeat = 100)
print('Ctypes: calculate sum: {}'.format(sum(times) / 100))

Results

First we have the results for n = 100. Don’t let the format of the numbers fool you. Python is actually faster here (look at the exponent which has a higher negative power). Remember, less time == faster.

# Calculate sum of all numbers till 100
Python: calculate sum: 6.319014355540275e-06
Ctypes: calculate sum: 2.1877996623516082e-05

So why is Python faster here? Well that’s because of some basic overhead incurred by Ctypes when calling the C-function.

We can verify that some basic overhead exists by trying out another Performance Comparison between Python and Ctypes on higher numbers.

# Calculate sum of all numbers till 1000
Python: calculate sum: 7.438200525939465e-05
Ctypes: calculate sum: 5.2979998290538785e-05

At n = 1000 we can see that the performance gains by using Ctypes are now greater than the overhead involved. Ctypes is about 30% faster than Python here.

Let’s try some higher numbers to make this difference more obvious. At higher number of operations, the overhead caused by Ctypes becomes more and more diluted and more negligible.

Here are the results for 100000. Here we can see a much larger performance increase. Ctypes is almost 15x faster than Python here.

# Calculate sum of all numbers till 100000
Python: calculate sum: 0.007767976995091885
Ctypes: calculate sum: 0.00047917700838297607

Python is famously quite slow when it comes to loops, and this clearly shows from the above results. The difference will only grow larger as you increase the number of iterations.

Ctypes and Python Comparison #2

Next up we have a slightly odd scenario. We will be comparing Ctypes and Python on the recursive implementation of the Fibonacci series. (Yes, I know it's inefficient; this is just for testing purposes.)

def fib(n):
    if n <= 1:
        return n
    return fib(n-1) + fib(n-2)

Here is the same in the C language.

int fib(int n) {
    if (n <= 1)
        return n;
    return fib(n - 1) + fib(n - 2);
}

Here is our driver code.

import timeit
from python_2 import fib

code = '''\
fib(40)
'''
setup='''\
from __main__ import fib'''

times = timeit.repeat(setup = setup, stmt = code, number = 1, repeat = 100)
print('Python: calculate sum: {}'.format(sum(times) / 100))

code = '''\
path = os.getcwd()
clibrary = ctypes.CDLL(os.path.join(path, 'ctypes_2.so'))
clibrary.fib(40)
'''
setup='''\
import os
import ctypes'''

times = timeit.repeat(setup = setup, stmt = code, number = 1, repeat = 100)
print('Ctypes: calculate sum: {}'.format(sum(times) / 100))

Results

# Calculate 10th Fibonacci number
Python: calculate sum: 1.4245007187128066e-05
Ctypes: calculate sum: 2.6811982970684766e-05

From our first result we can see that Python is faster than Ctypes. As we observed earlier, we should try a higher number.

Calculating the 30th Fibonacci number shows that Ctypes performs much better than Python in recursive situations. Ctypes here is a whopping 20x faster than Python.

# Calculate 30th Fibonacci number
Python: calculate sum: 0.20831336600938813
Ctypes: calculate sum: 0.010219879988580942

Calculating the 40th Fibonacci number widens the gap even further, to roughly 85x!

# Calculate 40th Fibonacci number
Python: calculate sum: 44.40553419990465
Ctypes: calculate sum: 0.5199361999984831

From this we can safely conclude that Python is not very efficient when it comes to recursion.

Note: I also tried this for the 50th Fibonacci number, but gave up after 30 minutes of waiting. Also, if you try this with memoized versions of the Fibonacci code, the results will look very similar to the first test case we discussed (see the sketch below).
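
For reference, the memoized Python version can be as short as the sketch below (our own example, using functools.lru_cache); once every value is computed only once, the recursion degenerates into a linear pass, which is why the timings end up resembling the first test case:

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(40))  # 102334155, computed almost instantly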

Ctypes and Python Comparison #3

Finally, we have the Quick Sort algorithm for our last test. It also makes use of recursion, though not as extensively as Fibonacci (which was O(2^n)). We also have loops here along with quite a bit of “regular” code, so this will be a good test.

Here is our Python code as usual.

def partition(array, low, high):
    pivot = array[high]
    i = low - 1
    for j in range(low, high):
        if array[j] <= pivot:
            i = i + 1
            (array[i], array[j]) = (array[j], array[i])
    (array[i + 1], array[high]) = (array[high], array[i + 1])
    return i + 1

def quick_sort(array, low, high):
    if low < high:
        pi = partition(array, low, high)
        quick_sort(array, low, pi - 1)
        quick_sort(array, pi + 1, high)

And here is the same in the C programming language.

void swap(int *a, int *b) {
    int t = *a;
    *a = *b;
    *b = t;
}

int partition(int array[], int low, int high) {
    int pivot = array[high];
    int i = (low - 1);
    for (int j = low; j < high; j++) {
        if (array[j] <= pivot) {
            i++;
            swap(&array[i], &array[j]);
        }
    }
    swap(&array[i + 1], &array[high]);
    return (i + 1);
}

void quick_sort(int array[], int low, int high) {
    if (low < high) {
        int pi = partition(array, low, high);
        quick_sort(array, low, pi - 1);
        quick_sort(array, pi + 1, high);
    }
}

Finally our driver code.

import timeit
from python_3 import quick_sort

code = '''\
new_array = []
for i in range(100000):
    new_array.append(randint(0, 1000))
quick_sort(new_array, 0, 99999)
'''
setup='''\
from __main__ import quick_sort
from random import randint'''

times = timeit.repeat(setup = setup, stmt = code, number = 1, repeat = 100)
print('Python: calculate sum: {}'.format(sum(times) / 100))

code = '''\
path = os.getcwd()
clibrary = ctypes.CDLL(os.path.join(path, 'ctypes_3.so'))
array = (ctypes.c_int * 100000)()
for i in range(100000):
    array[i] = randint(0, 1000)
clibrary.quick_sort(array, 0, 99999)
'''
setup='''\
import os
import ctypes
from random import randint'''

times = timeit.repeat(setup = setup, stmt = code, number = 1, repeat = 100)
print('Ctypes: calculate sum: {}'.format(sum(times) / 100))

Results

Here are the results for an array of size 100. For once, Ctypes actually outperformed Python in an initial test. It’s not by a very big margin though.

# Quick sort on array of size 100
Python: calculate sum: 0.0001310600037686527
Ctypes: calculate sum: 0.00010179100325331093

Let’s try and observe how much the margin between Python and Ctypes grows as we increase the size of the array.

# Quick sort on array of size 1000
Python: calculate sum: 0.0016809910046868025
Ctypes: calculate sum: 0.0007084359927102923

At size = 1000, we can observe a 2.5x speedup of Ctypes over Python.

# Quick sort on array of size 10000
Python: calculate sum: 0.02391906899632886
Ctypes: calculate sum: 0.007066867975518107

At size = 10000, we can observe a 3.5x speedup over Python.

# Quick sort on array of size 100000
Python: calculate sum: 1.3064013520046138
Ctypes: calculate sum: 0.21337365100858732

Finally, for an array size of 100000 we begin seeing some big gains for Ctypes over Python. We now have a speedup of more than 6x!

The difference is not as great as we observed in the Fibonacci code, or even the Array sum code, but it is still quite substantial. And it will only grow further as we increase the size of the Array.

You can suggest some additional test-cases in the comments section below and we might add them! Let us know what you think.

This marks the end of the Ctypes vs Python – Performance Comparison Article. Any suggestions or contributions for CodersLegacy are more than welcome. Questions about the tutorial content can be asked in the comments section below.

The post Ctypes vs Python – Performance Comparison appeared first on CodersLegacy.

Categories: FLOSS Project Planets

Krita 5.1.4 Released

Planet KDE - Wed, 2022-12-14 09:48

We’re releasing today a new bugfix release. This probably will be the last 5.1 bugfix release, since we’re updating our dependencies and builds after this. Next will be 5.2 with a ton of changes!

Fixes
  • Vector shapes not swapping the current fg/bg color (BUG:461692)
  • Fix a crash when using “Paste into Active Layer” (BUG:462223)
  • Layer Styles: label the Outer Glow page as Outer, not Inner Glow. (BUG:462091)
  • Parse transfer characteristics from ICC profiles. Patch by Rasyuga, thanks! (BUG:45911)
  • Fix handling ICC color primaries and white point detections. Patch by Rasyuga, thanks!
  • Remove two obsolete actions from the action list in settings->configure Krita->shortcuts
  • Fix some display artifacts when using fractional display scaling. (BUG:441216, 460577, 461912)
  • Fix wraparound mode for non-pixel brush engines (BUG:460299)
  • Fix visibility of the measure and gradient tools on a dark background
  • Fix data loss when a transform tool is applied too quickly. (BUG:460557, 461109)
  • Android: Disable changing the resource location
  • Android: Disable the touch docker (some buttons are completely broken, and we’re rewriting Krita’s touch functionality). BUG:461634
  • Android: disable New Window (Android does not do windows).
  • Android: disable workspaces that create multiple windows (Android does not do windows).
  • Android: make TIFF import and export work
  • Android: remove the detach canvas action (Android does not do windows).
  • TIFF: Fix inconsistent alpha and Photoshop-style layered tiff export checkboxes (BUG:462925)
  • TIFF: Fix handling multipage files (BUG:461975)
  • TIFF: Implement detection of the resolution’s unit. (BUG:420932)
  • EXR: Implement consistent GRAY and XYZ exporting (BUG:462799)
  • AVIF: add the image/avif mimetype to the desktop file so external applications can know Krita can open these files. (BUG:462224)
  • PSD: allow zero-sized resource blocks (BUG:461493, 450983)
  • Python: fix creating a new image from Python. (BUG:462665)
  • Python: fix updating the filename when using Document::saveAs.  (BUG:462667)
  • Python: make it possible to use Python 3.11 (BUG:461598)
  • Animation: Improve the autokey functionality. (BUG:459723)
Download Windows

If you’re using the portable zip files, just open the zip file in Explorer and drag the folder somewhere convenient, then double-click on the krita icon in the folder. This will not impact an installed version of Krita, though it will share your settings and custom resources with your regular installed version of Krita. For reporting crashes, also get the debug symbols folder.

Note that we are not making 32 bits Windows builds anymore.

Linux

The separate gmic-qt appimage is no longer needed.

(If, for some reason, Firefox thinks it needs to load this as text: to download, right-click on the link.)

macOS

Note: if you use macOS Sierra or High Sierra, please check this video to learn how to enable starting developer-signed binaries, instead of just Apple Store binaries.

Android

We consider Krita on ChromeOS as ready for production. Krita on Android is still beta. Krita is not available for Android phones, only for tablets, because the user interface needs a large screen.

Source code md5sum

For all downloads, visit https://download.kde.org/stable/krita/5.1.4/ and click on Details to get the hashes.

Key

The Linux appimage and the source .tar.gz and .tar.xz tarballs are signed. You can retrieve the public key here. The signatures are here (filenames ending in .sig).

The post Krita 5.1.4 Released appeared first on Krita.

Categories: FLOSS Project Planets

Python for Beginners: Drop Duplicate Rows From a Pandas Dataframe

Planet Python - Wed, 2022-12-14 09:00

Pandas dataframes are used to handle tabular data in Python. The data sometimes contains duplicate values which might be undesired. In this article, we will discuss different ways to drop duplicate rows from a pandas dataframe using the drop_duplicates() method.

Table of Contents
  1. The drop_duplicates() Method
  2. Drop Duplicate Rows From a Pandas Dataframe
  3. Drop All Duplicate Rows From a Pandas Dataframe
  4. Drop Duplicate Rows Inplace From a Pandas Dataframe
  5. Drop Rows Having Duplicate Values in Specific Columns
  6. Conclusion
The drop_duplicates() Method

The drop_duplicates() method is used to drop duplicate rows from a pandas dataframe. It has the following syntax.

DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)

Here,

  • The subset parameter determines which columns are compared to decide whether two rows are duplicates. By default, the subset parameter is set to None, so values from all the columns are used for comparison. If you want to compare two rows by only a single column, you can pass the column name to the subset parameter as the input argument. If you want to compare rows by two or more columns, you can pass the list of column names to the subset parameter.
  • The keep parameter is used to decide whether we want to keep one of the duplicate rows in the output dataframe. If we want to drop all the duplicate rows except the first occurrence, we can set the keep parameter to “first” which is its default value. If we want to drop all the duplicate rows except the last occurrence, we can set the keep parameter to “last”. If we need to drop all the rows having duplicates, we can set the keep parameter to False.
  • The inplace parameter is used to decide if we get a new dataframe after the drop operation or if we want to modify the original dataframe. When inplace is set to False, which is its default value, the original dataframe isn’t changed and the drop_duplicates() method returns the modified dataframe after execution. To alter the original dataframe, you can set inplace to True. 
  • When rows are dropped from a dataframe, the order of the indices becomes irregular. If you want to refresh the index and assign the ordered index from 0 to (length of dataframe)-1, you can set ignore_index to True. 

After execution, the drop_duplicates() method returns a dataframe if the inplace parameter is set to False. Otherwise, it returns None.

Drop Duplicate Rows From a Pandas Dataframe

To drop duplicate rows from a pandas dataframe, you can invoke the drop_duplicates() method on the dataframe. After execution, it returns a dataframe containing all the unique rows. You can observe this in the following example.

import pandas as pd

df=pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
df=df.drop_duplicates()
print("After dropping duplicates:")
print(df)

Output:

The dataframe is:
    Class  Roll        Name  Marks Grade
0       2    27       Harsh     55     C
1       2    23       Clara     78     B
2       3    33        Tina     82     A
3       3    34         Amy     88     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
6       3    34         Amy     88     A
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D
9       2    27       Harsh     55     C
10      3    15      Lokesh     88     A
After dropping duplicates:
    Class  Roll        Name  Marks Grade
0       2    27       Harsh     55     C
1       2    23       Clara     78     B
2       3    33        Tina     82     A
3       3    34         Amy     88     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D
10      3    15      Lokesh     88     A

In the above example, we have an input dataframe containing the Class, Roll, Name, Marks, and Grades of some students. As you can observe, the input dataframe contains some duplicate rows. The rows at indexes 0 and 9 are the same. Similarly, the rows at indexes 3 and 6 are the same. After execution of the drop_duplicates() method, we get a pandas dataframe in which all the rows are unique. Hence, the rows at indexes 6 and 9 are dropped from the dataframe so that the rows at indexes 0 and 3 become unique.

Drop All Duplicate Rows From a Pandas Dataframe

In the above example, one entry from each set of duplicate rows is preserved. If you want to delete all the duplicate rows from the dataframe, you can set the keep parameter to False in the drop_duplicates() method. After this, all the rows having duplicate values will be deleted. You can observe this in the following example.

import pandas as pd

df=pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
df=df.drop_duplicates(keep=False)
print("After dropping duplicates:")
print(df)

Output:

The dataframe is:
    Class  Roll        Name  Marks Grade
0       2    27       Harsh     55     C
1       2    23       Clara     78     B
2       3    33        Tina     82     A
3       3    34         Amy     88     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
6       3    34         Amy     88     A
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D
9       2    27       Harsh     55     C
10      3    15      Lokesh     88     A
After dropping duplicates:
    Class  Roll        Name  Marks Grade
1       2    23       Clara     78     B
2       3    33        Tina     82     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D
10      3    15      Lokesh     88     A

In this example, you can observe that the rows at indexes 0 and 9 are the same. Similarly, the rows at indexes 3 and 6 are the same. When we set the keep parameter to False in the drop_duplicates() method, all the rows that have duplicate values, i.e. the rows at indexes 0, 3, 6, and 9, are dropped from the input dataframe.

Suggested Reading: If you are into machine learning, you can read this MLFlow tutorial with code examples. You might also like this article on 15 Free Data Visualization Tools for 2023.

Drop Duplicate Rows Inplace From a Pandas Dataframe

By default, the drop_duplicates() method returns a new dataframe. If you want to alter the original dataframe instead of creating a new one, you can set the inplace parameter to True in the drop_duplicates() method as shown below.

import pandas as pd

df=pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
df.drop_duplicates(keep=False,inplace=True)
print("After dropping duplicates:")
print(df)

Output:

The dataframe is:
    Class  Roll        Name  Marks Grade
0       2    27       Harsh     55     C
1       2    23       Clara     78     B
2       3    33        Tina     82     A
3       3    34         Amy     88     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
6       3    34         Amy     88     A
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D
9       2    27       Harsh     55     C
10      3    15      Lokesh     88     A
After dropping duplicates:
    Class  Roll        Name  Marks Grade
1       2    23       Clara     78     B
2       3    33        Tina     82     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D
10      3    15      Lokesh     88     A

In this example, we have set the inplace parameter to True in the drop_duplicates() method. Hence, the  drop_duplicates() method modifies the input dataframe instead of creating a new one. Here, the drop_duplicates() method returns None.

Drop Rows Having Duplicate Values in Specific Columns

By default, the drop_duplicates() method compares all the columns for similarity to check for duplicate rows. If you want to compare the rows for duplicate values on the basis of specific columns, you can use the subset parameter in the drop_duplicates() method. 

The subset parameter takes a list of columns as its input argument. After this, the drop_duplicates() method compares the rows only based on the specified columns. You can observe this in the following example.

import pandas as pd

df=pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
df.drop_duplicates(subset=["Class","Roll"],inplace=True)
print("After dropping duplicates:")
print(df)

Output:

The dataframe is:
    Class  Roll        Name  Marks Grade
0       2    27       Harsh     55     C
1       2    23       Clara     78     B
2       3    33        Tina     82     A
3       3    34         Amy     88     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
6       3    34         Amy     88     A
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D
9       2    27       Harsh     55     C
10      3    15      Lokesh     88     A
After dropping duplicates:
    Class  Roll        Name  Marks Grade
0       2    27       Harsh     55     C
1       2    23       Clara     78     B
2       3    33        Tina     82     A
3       3    34         Amy     88     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D

In this example, we have passed the python list [“Class”, “Roll”] to the subset parameter in the drop_duplicates() method. Hence, the duplicate rows are decided on the basis of these two columns only. As a result, the rows having the same value in the Class and Roll columns are considered duplicates and are dropped from the dataframe.
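
The ignore_index parameter mentioned earlier can be combined with any of these calls. Here is a minimal sketch (using a small, hypothetical dataframe rather than the grade2.csv file from the examples above) showing how it renumbers the index after the duplicates are dropped:

import pandas as pd

df = pd.DataFrame({"Class": [2, 2, 3], "Roll": [27, 27, 33]})

# Without ignore_index, the surviving rows keep their old labels (0 and 2).
print(df.drop_duplicates())

# With ignore_index=True, the result is reindexed from 0 to (length of dataframe)-1.
print(df.drop_duplicates(ignore_index=True))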

Conclusion

In this article, we have discussed different ways to drop duplicate rows from a dataframe using the drop_duplicates() method.

To know more about the pandas module, you can read this article on how to sort a pandas dataframe. You might also like this article on how to drop columns from a pandas dataframe.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

The post Drop Duplicate Rows From a Pandas Dataframe appeared first on PythonForBeginners.com.

Categories: FLOSS Project Planets

Four Kitchens: Drupal 10 is here: Is your website ready?

Planet Drupal - Wed, 2022-12-14 06:34

Back in 2020, Drupal delivered a surprise by estimating a June 2022 release for Drupal 10. While the release was ultimately pushed back to December 14, 2022, you need to know where your website stands for the upcoming upgrade.

For any IT team, changes to a site platform are cause for concern. With less than a year before Drupal 9 hits end-of-life, you need to start planning your preparations for the coming year.

Thankfully, Drupal has remained true to its word about its latest updates avoiding the complex migrations that were required moving from Drupal 7 (but I’ll touch more on that later). Still, the overall impact of Drupal 10 ultimately depends on the condition of your current site.

Platform updates are always cause for uncertainty, and your preparations will vary to navigate a move to Drupal 10. If you start by taking into account where your current site stands, you can best ensure it’s on steady ground for the benefits that lie ahead.

Advantages of upgrading to Drupal 10

The benefits of moving your site to Drupal 10 follow a familiar path. Drupal’s development team doesn’t pack major updates with flashy new features, unlike traditional hardware and software development. Instead, the community continues to refresh the latest version of Drupal with brand new tools.

The arrival of Drupal 10 will clear the system of old, backward-compatible code so the platform runs more efficiently. That way, as work begins to create new tools for version 10, Drupal developers are starting from a clean slate.

The promise of a clean codebase may sound a bit anticlimactic from the perspective of your users. But for developers, it’s an addition by subtraction. Drupal 10 will run much faster than your current platform by losing the clutter required to support out-of-date features.

What can you expect from the next version of Drupal?

Many of the features included with Drupal 10 have already been in use at various points in Drupal 9’s development. Here are a few benefits planned for Drupal’s new release:

  • CKEditor 5: Drupal 9 features version 4 of the open-source JavaScript text editor, which will be deprecated in 2023. The new version is already in use and offers an interface similar enough to feel familiar, along with performance and security enhancements.
  • Updated frontend and admin themes: These features have been available in Drupal 9 but will become the default themes. In addition to offering improved capabilities for migrating a site into Drupal, the new administration theme is more intuitive with better spacing and readability.
  • New package manager: Though potentially unavailable until version 10.1, this feature enables admin users to install modules through the UI. Instead of requiring a developer to FTP modules to a server, you can install them directly from a menu in a way that resembles WordPress extensions.
More good news: Drupal 10 will last longer than 9

One of the third-party technical dependencies of Drupal is its PHP framework, Symfony. Symfony runs on two-year release cycles, which introduces the potential for Drupal to do the same. Drupal 9 uses Symfony 4, which was at the tail end of its development when Drupal 9 was launched. Consequently, as Symfony fell out-of-date in less than two years, so did Drupal 9.

These dependencies were a big part of why Drupal 9 had such a short lifespan as compared with the platform’s history. At one time, versions of Drupal required five to seven years of development. 

Drupal’s development team is releasing Drupal 10 on Symfony 6, which was released earlier in 2022. Drupal 10 will last at least four years before the next major version is released. By working to get ahead of schedule with Symfony, Drupal aims to deliver a platform that’s faster and more stable — with staying power.

Will upgrading to Drupal 10 be easy? It depends.

Drupal 9 will reach its end-of-life sooner than may be ideal, but you face an easier upgrade path to Drupal 10 if your site is currently running version 9.4 or 9.5. Just as with the upgrade from version 8 to 9, updates to Drupal 10 will run “in place.” Rather than needing to migrate to a new platform to upgrade, Drupal 10 is being built inside Drupal 9.

You won’t have to rebuild your site to upgrade to Drupal 10 if you’re up-to-date with its latest version. However, not every organization can keep its website current with every platform release. As with any journey, the road to Drupal 10 entirely depends on where you are now.

If your site is running Drupal 9:

Much like the shift from Drupal 8 to Drupal 9, moving to Drupal 10 can be seamless with the right planning. You need to monitor custom code in any platform update, and Drupal Rector streamlines the process. The module identifies your areas of need, and in many cases will update your code automatically. 

You still need an engineer to oversee the upgrade, but Drupal Rector eliminates the tedium of manually updating a bunch of APIs beforehand. As changes are made to Drupal 10, developers are required to add an automated rule to Rector. Consequently, your future upgrades will be even easier.

Once Drupal 10 is released, you have until November 2023 to complete the upgrade before Drupal 9 reaches its end-of-life. At that point, your site will no longer receive security updates from the Drupal community.

If your site is running Drupal 8:

Drupal 8 reached its end-of-life in November 2021, which means your site may be at risk without the community’s support with security patches and bug fixes. To offset that danger, you should use Drupal Rector to identify deprecated code in your Drupal 8 site to automate a portion of your upgrade journey to Drupal 9.

Fortunately, the move from 8 to 9 is an easier transition than you may think. Once your site is up-to-date to version 9.4, then the jump to Drupal 10 should be fairly straightforward upon its release.

If your site is running Drupal 7:

If you’re still on Drupal 7 (or older), your platform is currently scheduled to reach its end-of-life in November 2023. While this date has been extended several times over the past few years, there is no guarantee it will be extended again. However, you’re not alone. According to estimates, more sites are on Drupal 7 than there are on 8 and 9 combined.  

Migrating your site from Drupal 7 is a complicated, labor-intensive undertaking, which is why the community extended the platform’s support during the pandemic. However, once Drupal 7 reaches its end-of-life next year, you’ll only be able to receive security updates through Vendor Extended Support. Those organizations remain an option to provide service for your site until 2025 — for a price.

To reduce support expenses, you should start working toward loading your site into Drupal 9.4 or 9.5 as soon as possible rather than waiting for the latest version. Drupal 10 will include migration tools from Drupal 7, but Drupal 9 already includes many of the modules you use. That may no longer be the case after Drupal 10 comes out.

Future-proof your site with an upgrade to Drupal 10

Whether you’re facing a migration from Drupal 7 or the end-of-life for Drupal 9, platform updates require planning to succeed. There is no sense in waiting to get started. If anything, upgrading to Drupal 10 from a much older version may grow more complex the longer you delay.

The days of launching a website and ignoring it for five or 10 years are over. The industry just moves too fast. Fortunately, with the right plan, your organization can get the platform you need to take on whatever lies ahead.

The post Drupal 10 is here: Is your website ready? appeared first on Four Kitchens.

Categories: FLOSS Project Planets

death and gravity: Has your password been pwned? Or, how I almost failed to search a 37 GB text file in under 1 millisecond (in Python)

Planet Python - Wed, 2022-12-14 05:41

So, there's this website, Have I Been Pwned, where you can check if your email address has appeared in a data breach.

There's also a Pwned Passwords section for checking passwords ... but, typing your password on a random website probably isn't such a great idea, right?

Of course, you could read about how HIBP protects the privacy of searched passwords1, and understand how k-Anonymity works (it does), and check that the website uses the k-Anonymity API (it does), but that's waaay too much work.

Of course, we could use the API ourselves, but where's the fun in that?

Instead, we'll do it the hard way – we'll check for the password offline.

And we're not stopping until it's fast. (This is a threat.)

The Pwned Passwords list #

OK, first we need to get password list.

Go to Pwned Passwords, scroll to Downloading the Pwned Passwords list, and download the SHA-1 ordered by hash file – be nice and use the torrent, if you can.

Note

You can also use the downloader to get an up-to-date version of the file, but that wasn't available a year ago when I started writing this.

The 16 GB archive extracts to a 37 GB text file:

$ pushd ~/Downloads
$ stat -f %z pwned-passwords-sha1-ordered-by-hash-v8.7z
16257755606
$ 7zz e pwned-passwords-sha1-ordered-by-hash-v8.7z
$ stat -f %z pwned-passwords-sha1-ordered-by-hash-v8.txt
37342268646
$ popd

... that looks like this:

$ head -n5 ~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt
000000005AD76BD555C1D6D771DE417A4B87E4B4:10
00000000A8DAE4228F821FB418F59826079BF368:4
00000000DD7F2A1C68A35673713783CA390C9E93:873
00000001E225B908BAC31C56DB04D892E47536E0:6
00000006BAB7FC3113AA73DE3589630FC08218E7:3

Each line has the format:

<SHA-1 of the password>:<times it appeared in various breaches>

Note

See the article mentioned in the introduction for why use a hash instead of the actual password, and what the counts are good for.

To make commands using it shorter, we link it in the current directory:

$ ln -s ~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt pwned.txt

A minimal plausible solution #

We'll take an iterative, problem-solution approach to this. But since right now we don't have any solution, we start with the simplest thing that could possibly work.

First, we get the imports out of the way:2

import os
import sys
import time
import getpass
import hashlib

Then, we open the file:

path = sys.argv[1]
file = open(path, 'rb')

By opening the file early, we get free error checking – no point in reading the password if the file isn't there.

Opening it in binary mode makes searching a bit faster, as it skips needless decoding. It's a minimal plausible solution; that doesn't mean we can't do some obvious things.

Next, the password:

try:
    password = sys.argv[2]
except IndexError:
    password = getpass.getpass()

hexdigest = hashlib.sha1(password.encode()).hexdigest()
del password
print("looking for", hexdigest)

We either take it as the second argument to the script, or read it with getpass(), so your actual password doesn't remain in the shell history.

After computing the password's hash, we delete it. We won't go as far as to zero it,3 but at least we prevent accidental printing (e.g. in a traceback).

Let's see if it works:

$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
$ python pwned.py pwned.txt
Password:
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8

The shell sha1sum seems to agree:

$ echo -n password | sha1sum
5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8  -

To find the hash, we just go through the file line by line (remember, simplest thing that could possibly work). We put this in a function; it might be useful later on.

def find_line(lines, prefix):
    for line in lines:
        if line.startswith(prefix):
            return line
        if line > prefix:
            break
    return None

If a line was found, we print the count:

line = find_line(file, hexdigest.upper().encode())

if line:
    times = int(line.decode().rstrip().partition(':')[2])
    print(f"pwned! seen {times:,} times before")
else:
    print("not found")

Before giving it a whirl, let's add some timing code:

start = time.monotonic()
line = find_line(file, hexdigest.upper().encode())
end = time.monotonic()

print(f"in {end-start:.6f} seconds")

And, it works (the code so far):

$ python pwned.py pwned.txt blocking
looking for 000085013a02852372159cb94101b99ccaec59e1
pwned! seen 587 times before
in 0.002070 seconds

Problem: it's slow #

...kinda.

You may have noticed I switched from password to blocking. That's because I was cheating – I deliberately chose a password whose hash is at the beginning of the file.

On my 2013 laptop, searching for password actually takes 86 seconds!

Let's put a lower bound on the time it takes to go through the file:

$ time wc -l pwned.txt
 847223402 pwned.txt

real    1m7.234s
user    0m31.133s
sys     0m16.379s

There are faster implementations of wc out there, but Python can't even beat this one:

$ time python3 -c 'for line in open("pwned.txt", "rb"): pass'

real    1m28.325s
user    1m10.049s
sys     0m15.577s

There must be a better way.

Skipping #

There's a hint about that in the file name: the hashes are ordered.

That means we don't have to check all the lines; we can skip ahead until we're past the hash, go back one step, and only check each line from there.

Lines are different lengths, so we can't skip exactly X lines without reading them. But we don't need to, skipping to any line that's a reasonable amount ahead will do.

def skip_to_before_line(file, prefix, offset):
    old_position = file.tell()

    while True:
        file.seek(offset, os.SEEK_CUR)
        file.readline()

        line = file.readline()
        # print("jumped to", (line or b'<eof>').decode().rstrip())

        if not line or line >= prefix:
            file.seek(old_position)
            break

        old_position = file.tell()

So we just seek() ahead a set number of bytes. Since that might not leave us at the start of a line, we discard the incomplete line, and use the next one.

Finally, we wrap the original find_line() with one that does the skipping:

def find_line(file, prefix):
    skip_to_before_line(file, prefix, 16 * 2**20)
    return find_line_linear(file, prefix)

def find_line_linear(lines, prefix):
    for line in lines:

It works:

$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.027203 seconds

I found the magic 16 MiB offset by trying a bunch of different values:

offset (MiB)    time (s)
1               0.05
4               0.035
8               0.030
16              0.027    <-- sweet spot
32              0.14

The code so far.

Problem: it needs tuning, it's still slow #

While three orders of magnitude faster, we still have a bunch of issues:

  1. The ideal skip size depends on computer you're running this on.
  2. The run time still increases linearly with file size; we haven't really solved the problem, as much as made it smaller by a (large, but) constant factor.
  3. The run time still increases linearly with where the hash is in the file.
  4. To be honest, it's still kinda slow. ¯\_(ツ)_/¯

There must be a better way.

Binary skipping #

To make the "linear" part painfully obvious, uncomment the jumped to line.

$ python pwned.py pwned.txt password | grep -o 'jumped to .' | uniq -c
 139 jumped to 0
 139 jumped to 1
 139 jumped to 2
 139 jumped to 3
 139 jumped to 4
 103 jumped to 5

Surely, after we've jumped to 0 once, we don't need to do it 138 more times, right?

We could jump directly to a line in the middle of the file; if the hash is there at all, it will be in either of the halves. We then jump to the middle of that half, and to the middle of that half, and so on, until we either find the hash or there's nowhere left to jump.

Note

If that sounds a lot like binary search, that's because it is – it's just not wearing its usual array clothes.

And most of the work is already done: we can jump to a line at most X bytes from where the hash should be, we only need to do it repeatedly, in smaller and smaller fractions of the file size:

def skip_to_before_line(file, prefix, offset):
    while offset > 2**8:
        offset //= 2
        skip_to_before_line_linear(file, prefix, offset)

def skip_to_before_line_linear(file, prefix, offset):
    old_position = file.tell()

The only thing left is to get the file size:

def find_line(file, prefix):
    file.seek(0, os.SEEK_END)
    size = file.tell()
    file.seek(0)
    skip_to_before_line(file, prefix, size)
    return find_line_linear(file, prefix)

It works:

$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.009559 seconds
$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.000268 seconds

The huge time difference is due to operating system and/or disk caches – on the second run, the (same) parts of the file are likely already in memory.

Anyway, look again at the jumped to output: instead of jumping blindly through the whole file, now we're jumping around the hash, getting closer and closer to it.

$ python pwned.py pwned.txt password | grep -o 'jumped to .' | uniq -c
   1 jumped to 7
   1 jumped to 3
   1 jumped to 7
   1 jumped to 5
   1 jumped to 4
  39 jumped to 5
$ python pwned.py pwned.txt password | grep -o 'jumped to ..' | uniq -c
   1 jumped to 7F
   1 jumped to 3F
   1 jumped to 7F
   1 jumped to 5F
   1 jumped to 4F
   1 jumped to 5F
   1 jumped to 57
   1 jumped to 5F
   1 jumped to 5B
   1 jumped to 59
   1 jumped to 5B
   1 jumped to 5A
  32 jumped to 5B

You may notice we end up at the same 7F... prefix twice; this makes sense – we skip ahead by half the file size, then back, then ahead by a quarter two times. It shouldn't really change anything, the second time the data is likely already cached.

The code so far.

Better timing #

Given the way caching muddies the waters, how fast is it really?

This shell function averages a hundred runs, each with a different password:

function average-many {
    for _ in {1..100}; do
        python $@ $( python -c 'import time; print(time.time())' )
    done \
    | grep seconds \
    | cut -d' ' -f2 \
    | paste -sd+ - \
    | sed 's#^#scale=6;(#' | sed 's#$#)/100#' \
    | bc
}

After a few repeats, the average time settles around 3 ms.

$ for _ in {1..20}; do average-many pwned.py pwned.txt; done
.004802
.003904
.004088
.003486
.003451
.003476
.003414
.003442
.003169
.003297
.002931
.003077
.003092
.003011
.002980
.003147
.003112
.002942
.002984
.002934

Again, this is due to caching; the more we run it, the more likely it is the pages at half, quarters, eighths, sixteenths, and so on of the file size are already in memory, and the road to any line starts through a subset of those.

I wave my hands, get a 2020 laptop, and a miracle happens. It's far enough into the totally unpredictable future, now, that you can search any password in under 1 millisecond. You can do anything you want.

So, there we go. Wasn't that an interesting story? That's the end of the article. Don't look at the scroll bar. Don't worry about it.

If you came here to find a somewhat inconvenient way of checking if your password has been compromised, you can go.

Subscribe on the way out if you'd like, but take care :)

Go solve Advent of Code or something.

I'm just chilling out.

See ya.

Want to know when new articles come out? Subscribe here to get new stuff straight to your inbox!

Failing to get to under 1 millisecond #

OK, I think they're gone.

I swear this was supposed to be the end; this really was supposed to be a short one.

Here's what a friend of mine said, that chronologically should be way later into the article, but makes a great summary for what follows:

And now that you have arrived at this point, spend a moment to ponder the arbitrary nature of 1 millisecond given its dependency on the current year and the choice of your particular hardware.

After that moment, continue celebrating.

Nah, fuck it, it has to take less than 1 millisecond on the old laptop.

... so yeah, here's a bunch of stuff that didn't work.

Profile before optimizing #

Now, with the obvious improvements out of the way, it's probably a good time to stop and find out where time is being spent.

$ python -m cProfile -s cumulative pwned.py pwned.txt "$( date )"
looking for 3960626a8c59fe927d3cf2e991d67f4c505ae198
not found
in 0.004902 seconds
         1631 function calls (1614 primitive calls) in 0.010 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   ...
        1    0.000    0.000    0.005    0.005 02-binary-search.py:8(find_line)
        1    0.000    0.000    0.005    0.005 02-binary-search.py:22(skip_to_before_line)
       28    0.000    0.000    0.005    0.000 02-binary-search.py:28(skip_to_before_line_linear)
       86    0.004    0.000    0.004    0.000 {method 'readline' of '_io.BufferedReader' objects}
   ...
       71    0.000    0.000    0.000    0.000 {method 'seek' of '_io.BufferedReader' objects}
   ...
       44    0.000    0.000    0.000    0.000 {method 'tell' of '_io.BufferedReader' objects}
   ...

From the output above, we learn that almost all of the search time goes into the 86 readline() calls.

readline() is implemented in C, so there's not much we can change there.

What we can change, however, is how often we call it.

Another thing of interest is how long individual readline() calls take.

In skip_to_before_line_linear():

    start = time.monotonic()
    file.readline()
    line = file.readline()
    end = time.monotonic()
    print("jumped to", (line[:5] or b'<eof>').decode().rstrip(),
          f"in {(end-start)*1000000:4.0f} us",
          f"at offset {file.tell():16,}")

The output is pretty enlightening:

$ python pwned.py pwned.txt asdf
looking for 3da541559918a808c2402bba5012f6c60b27661c
jumped to 7FF9E in   10 us at offset   18,671,134,394    <-- 1/2 file size
jumped to 3FF89 in    4 us at offset    9,335,567,234    <-- 1/4 file size
jumped to 1FFBA in    3 us at offset    4,667,783,663    <-- 1/8 file size
jumped to 3FF89 in    3 us at offset    9,335,567,322    <-- 1/4 file size
jumped to 2FFA4 in    5 us at offset    7,001,675,508
jumped to 3FF89 in    4 us at offset    9,335,567,366    <-- 1/4 file size
jumped to 37F98 in    4 us at offset    8,168,621,453
jumped to 3FF89 in    3 us at offset    9,335,567,410    <-- 1/4 file size
jumped to 3BF94 in    3 us at offset    8,752,094,477
jumped to 3FF89 in    2 us at offset    9,335,567,498    <-- 1/4 file size
jumped to 3DF8E in    3 us at offset    9,043,831,007
jumped to 3CF90 in    3 us at offset    8,897,962,782
jumped to 3DF8E in    2 us at offset    9,043,831,095
jumped to 3D790 in    3 us at offset    8,970,896,964
jumped to 3DF8E in    2 us at offset    9,043,831,139
jumped to 3DB90 in  253 us at offset    9,007,364,072
jumped to 3D990 in  206 us at offset    8,989,130,552
jumped to 3DB90 in    6 us at offset    9,007,364,160
jumped to 3DA8F in  270 us at offset    8,998,247,402    <-- page 2,196,837
jumped to 3DA0F in  189 us at offset    8,993,689,007
jumped to 3DA8F in    5 us at offset    8,998,247,446    <-- page 2,196,837
jumped to 3DA4F in  212 us at offset    8,995,968,274
jumped to 3DA8F in    5 us at offset    8,998,247,534    <-- page 2,196,837
jumped to 3DA6F in  266 us at offset    8,997,107,921
jumped to 3DA5F in  203 us at offset    8,996,538,139
jumped to 3DA57 in  195 us at offset    8,996,253,241
jumped to 3DA53 in  197 us at offset    8,996,110,772
jumped to 3DA57 in    6 us at offset    8,996,253,285
jumped to 3DA55 in  193 us at offset    8,996,182,045
jumped to 3DA54 in  178 us at offset    8,996,146,471
jumped to 3DA54 in  189 us at offset    8,996,128,666
jumped to 3DA54 in  191 us at offset    8,996,119,760    <-- page 2,196,318
jumped to 3DA54 in   32 us at offset    8,996,128,710
jumped to 3DA54 in    5 us at offset    8,996,124,259
jumped to 3DA54 in   10 us at offset    8,996,122,057    <-- page 2,196,318
jumped to 3DA54 in    4 us at offset    8,996,120,955    <-- page 2,196,318
jumped to 3DA54 in    4 us at offset    8,996,120,382    <-- page 2,196,318
jumped to 3DA54 in    9 us at offset    8,996,120,112    <-- page 2,196,318
jumped to 3DA54 in    1 us at offset    8,996,120,470    <-- page 2,196,318
jumped to 3DA54 in    1 us at offset    8,996,120,338    <-- page 2,196,318
pwned! seen 324,774 times before
in 0.003654 seconds

At least half the reads are pretty fast:

  • In the beginning, because searches start with the same few pages.
  • At the end, because searches end on the same page (the macOS page size is 4K).
  • Reads on the same page, after the first one.

So, it's the reads in the middle that we need to get rid of.

Position heuristic #

In theory, the output of a good hash function should be uniformly distributed.

This means that with a bit of math, we can estimate where a hash would be – a hash that's ~1/5 in the range of all possible hashes should be at ~1/5 of the file.

Here's a tiny example:

>>> digest = '5b'                   # 1-byte hash (2 hex digits)
>>> size = 1000                     # 1000-byte long file
>>> int_digest = int(digest, 16)    # == 91
>>> int_end = 16 ** len(digest)     # == 0xff + 1 == 256
>>> int(size * int_digest / int_end)
355

We can do this once, and then binary search a safety interval around that position. Alas, this only gets rid of the fast jumps at the beginning of the binary search (and for some reason, it ends up being slightly slower than binary search alone). (code)

We can also narrow down around the estimated position iteratively, making the interval smaller by a constant factor each time. This seems to work: a factor of 1000 yields 1.7 ms, and a factor of 8000 yields 1.2 ms (both in 2 steps). (code)

However, it has other issues:

  • Having arbitrary start/end offsets complicates the code quite a bit.
  • I don't know how to reliably determine the factor.4
  • I don't know how to prove it's correct (especially for smaller intervals, where the hashes are less uniform). To be honest, I don't think it can be 100% correct.5

Anyway, if the implementation is hard to explain, it's a bad idea.

Index file #

An arbitrary self-imposed restriction I had was that any solution should mostly use the original passwords list, with little to no preparation.

By relaxing this a bit, and going through the file once, we can build an index like:

<SHA-1 of the password>:<offset in file>

... that we can search with skip_to_before_line().

Of course, we can't include all the hashes, since the index will have just as many lines as the original file. But we don't need to – by including only lines a few kilobytes apart from each other, we can seek directly to within a few kilobytes in the big file.

The only thing left to figure out is how much "a few kilobytes" is.

After my endless harping about pages and caching, the answer should be obvious: one page size (4K). And this actually gets us 0.8 ms (with a 452M index file).

Back when I wrote the code, that thought hadn't really sunk in, so after getting 1.2 ms with a 32K "block size", I deemed this not good enough, and moved on.

(Code: search, index.)

Binary file #

At this point, I was grasping for straws.

Since I was already on the external file slippery slope, I thought of converting the file to binary, mostly to make the file smaller – smaller file, fewer reads.

I packed each line into 24 bytes:

| binary hash (20 bytes) | count (4 bytes) |
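
The conversion code itself is linked below, but the packing is straightforward; here is a minimal sketch of how one text line could be turned into a fixed 24-byte record (the big-endian, 4-byte unsigned count is my assumption about the layout, not something specified above):

import struct

def pack_line(line: bytes) -> bytes:
    # b'<40 hex chars>:<count>\n' -> 20 hash bytes + 4 count bytes.
    digest, _, count = line.rstrip().partition(b':')
    return struct.pack('>20sI', bytes.fromhex(digest.decode()), int(count))

record = pack_line(b'5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8:9545824\n')
assert len(record) == 24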

This halved the file, but only lowered the runtime to 2.6 ms :(

But more importantly, it made the code much, much simpler: because items are fixed size, you can know where the Nth item is, so I was able to use bisect for the binary search.

(Code: search, convert.)

Getting to under 1 millisecond #

OK, what now? We've tried some things, we've learned some stuff:

  • The position heuristic kinda (maybe?) works, but is hard to reason about.
  • The index file gets us there in the end, but barely, and the index is pretty big.
  • The binary file isn't much faster, and it creates a huge file. But, less code!

I don't know what to do with the first one, but we can combine the last two.

Generating the index #

Let's start with the script I made for the text index:

import os, sys

file = sys.stdin.buffer
outf = sys.stdout.buffer

while True:
    pos = file.tell()
    line = file.readline()
    if not line:
        break

    outf.write(line.partition(b':')[0] + b':' + str(pos).encode() + b'\n')

    file.seek(2**12, os.SEEK_CUR)
    file.readline()

The output looks like this:

$ python generate-index.py < pwned.txt 2>/dev/null | head -n5
000000005AD76BD555C1D6D771DE417A4B87E4B4:0
00000099A4D3034E14DF60EF50799F695C27C0EC:4157
00000172E8E1D1BD54AC23B3F9AB4383F291CA17:8312
000002C8F808A7DB504BBC3C711BE8A8D508C0F9:12453
0000047139578F13D70DD96BADD425C372DB64A9:16637

We need to pack that into bytes.

A hash takes 20 bytes. But, looking at the values above, we only need slightly more than 3 bytes (6 hex digits) to distinguish between the index lines:

$ python generate-index.py < pwned.txt 2>/dev/null | cut -c-6 | uniq -c | head
   2 000000
   1 000001
   1 000002
   1 000004
   1 000005
   1 000007
   1 000008
   1 00000A
   1 00000B
   1 00000C

To represent all the offsets in the file, we need log2(35G) / 8 = 4.39... bytes, which results in a total of 9 bytes (maybe even 8, if we mess with individual bits).

But, let's make it future-proof: 6 bytes for the hash buys at least 2.4 petabyte files, and 6 for the offset buys 0.28 petabyte files.

>>> f"{ 37 * 10**9 * 256**2 :,}"
'2,424,832,000,000,000'
>>> f"{ 256**6 :,}"
'281,474,976,710,656'

The write in the index script then becomes:

    digest, _, _ = line.partition(b':')
    outf.write(bytes.fromhex(digest.decode())[:6])
    outf.write(pos.to_bytes(6, 'big'))

If you look at the text index, you'll notice we skip 4K + ~50 bytes; this results in sometimes having to read 2 pages from the big file, because not all pages have an index entry. Let's fix that by reading the first whole line after a 4K boundary instead:

    file.seek((pos // 4096 + 1) * 4096)
    file.readline()

OK, we're done:

$ time python generate-index.py < pwned.txt > index.bin

real    1m2.729s
user    0m34.292s
sys     0m21.392s

Using the index #

We start off with a skeleton that's functionally identical with the naive "go through every line" script; the only difference is I've added stubs for passing and using the index:

def find_line(file, prefix, index):
    skip_to_before_line_index(file, prefix, index)
    return find_line_linear(file, prefix)

def skip_to_before_line_index(file, prefix, index):
    ...

index_path = sys.argv[2]
index = ...

line = find_line(file, hexdigest.upper().encode(), index)

(In case you want to see the whole script.)

import os
import sys
import time
import bisect
import getpass
import hashlib


def find_line(file, prefix, index):
    skip_to_before_line_index(file, prefix, index)
    return find_line_linear(file, prefix)


def find_line_linear(lines, prefix):
    for line in lines:
        if line.startswith(prefix):
            return line
        if line > prefix:
            break
    return None


def skip_to_before_line_index(file, prefix, index):
    ...


path = sys.argv[1]
file = open(path, 'rb')

index_path = sys.argv[2]
index = ...

try:
    password = sys.argv[3]
except IndexError:
    password = getpass.getpass()

hexdigest = hashlib.sha1(password.encode()).hexdigest()
del password
print("looking for", hexdigest)

start = time.monotonic()
line = find_line(file, hexdigest.upper().encode(), index)
end = time.monotonic()

if line:
    times = int(line.decode().rstrip().partition(':')[2])
    print(f"pwned! seen {times:,} times before")
else:
    print("not found")

print(f"in {end-start:.6f} seconds")

As mentioned before, we'll use the standard library bisect module to search the index.

We could read the entire index in memory, as a list of 12-byte bytes. But that has at least two issues:

  • It'd still be slow, even if outside the current timing code.
  • Memory usage would increase with the size of the file.

Although the module documentation says they work with lists, the bisect* functions work with any arbitrary sequence; a sequence is an object that implements a couple special methods that allow it to behave like a list.

We'll need an object, which needs the file and its size:

class BytesArray:

    item_size = 12

    def __init__(self, file):
        self.file = file

We can go ahead and plug it in:

index_path = sys.argv[2]
index = BytesArray(open(index_path, 'rb'))

The first special method is __getitem__(), to implement a[i]:6

    def __getitem__(self, index):
        self.file.seek(index * self.item_size)
        buffer = self.file.read(self.item_size)
        if len(buffer) != self.item_size:
            raise IndexError(index)  # out of bounds
        return buffer

The second special method is __len__(), to implement len(a):

    def __len__(self):
        self.file.seek(0, os.SEEK_END)
        size = self.file.tell()
        return size // self.item_size

Using the index becomes straightforward:

def skip_to_before_line_index(file, prefix, index):
    item_prefix = bytes.fromhex(prefix.decode())[:6]
    item = find_lt(index, item_prefix)
    offset = int.from_bytes(item[6:], 'big') if item else 0
    file.seek(offset)

def find_lt(a, x):
    i = bisect.bisect_left(a, x)
    if i:
        return a[i-1]
    return None

We get the first 6 bytes of the hash, find the rightmost value less than that, extract the offset from it, and seek to there. find_lt() comes from bisect's recipes for searching sorted lists.

And we're done:

$ average-many pwned.py pwned.txt index.bin
.002546

Huh? ... that's unexpected...

Oh.

I said we won't read the index in memory. But we can force it into the OS cache by getting the OS to read it a bunch of times:

$ for _ in {1..10}; do cat index.bin > /dev/null; done

Finally:

$ average-many pwned.py pwned.txt index.bin
.000421

I heard you like indexes (the end) #

Hmmm... isn't that cold start bugging you? It is bugging me.

I tried it out, and if we make an index (313K) for the big index, we get 1.2 ms from a cold start. (Forcing the big index in memory doesn't get us more than the ~0.5 ms we hit before, because most of that is spent reading from the big file.)

Maybe another smaller index can take us to below 1 ms from a cold start?

...

Just kidding, this is it, this really is the end.

Let's take a look at what we've accomplished:

method          statements    time (ms, order of magnitude)
linear          29            100,000
linear+skip     42            100
binary search   49            10
binary index    59 (72)       1

For twice the code, it's 5 orders of magnitude faster! (I'm not counting bisect or the OS cache, but that's kinda the point, those already exist, they're basically free.)

I mean, we could do indexception, but to do it even remotely elegantly, I need at least five more hours, for something that's definitely not going to be twice the code. And for what? Just slightly faster cold start? I'm good.

Turns out, you can get pretty far with just a few tricks.

That's it for now.

Learned something new today? Share this with others, it really helps!

Want to know when new articles come out? Subscribe here to get new stuff straight to your inbox!

Bonus: better data structures #

Sometimes, a specialized data structure will solve a problem better.

In Sketchy Pwned Passwords, Scott Helme manages to "pack" the entire passwords list into a 1.5G in-memory count-min sketch, which can then be queried in 1 microsecond. And, if you don't care about the counts, a plain Bloom filter can even take you to 0.1 µs! (There is a trade-off, and it's that it takes 11 hours to create either.)
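
To make the idea concrete, here is a toy Bloom filter sketch (my own illustration, not Scott Helme's implementation, and far smaller than you'd want for the full list); membership tests never touch the original file at all, at the cost of a tunable false-positive rate:

import hashlib

class BloomFilter:

    def __init__(self, size_bits=1_000_000, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-1 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(i.to_bytes(2, 'big') + item).digest()
            yield int.from_bytes(digest, 'big') % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter()
bf.add(b'5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8')
print(b'5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8' in bf)  # True
print(b'0000000000000000000000000000000000000000' in bf)  # almost certainly False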

  1. It's linked right above the textbox, after all. [return]

  2. I remember reading that Guido van Rossum thinks sorting imports by length looks cool (I know I do). In production code, though, maybe stick to alphabetical :) [return]

  3. Which is kinda difficult to do in Python anyway. [return]

  4. Something like (size / 4096) ** (1 / int(log(size, 4096))) should work, but I didn't have enough patience to debug the infinite loop it caused. [return]

  5. I mean, I did cross-check it with the binary search solution for a few thousand values, and it seems correct, but that's not proof. [return]

  6. We're only implementing the parts of the sequence protocol that bisect uses.

    For the full protocol, __getitem__() would need to also implement negative indexes and slicing. To get even more sequence methods like count() and index() for free, we can inherit collections.abc.Sequence.

    Interestingly, our class will work in a for loop without needing an __iter__() method. That's because there are actually two iteration protocols: an older one, using __getitem__(), and a newer one, added in Python 2.1, using __iter__() / __next__(); for backwards compatibility, Python still supports the old one (more details on the logic in Unravelling for statements by Brett Cannon). [return]

Categories: FLOSS Project Planets

Join Season of KDE 2023

Planet KDE - Wed, 2022-12-14 03:50

By Benson Muite

Season of KDE is an opportunity to contribute to KDE, while at the same time improving your skills with guidance from experienced mentors.

Apart from code, software projects require artwork, translations, documentation, community management, and funds acquired through fundraising campaigns. With Season of KDE, you too can get the chance to hone your skills in one or more of these areas over a 12-week period. Participation is open to people of all ages with an interest in learning about open source.

Finding a Project

The first place to look for projects is on the KDE SoK page. If a project listed there interests you, contact the mentor. If you are looking for something different, other open source projects are hosted on KDE's development platform. To find interesting projects, you can list them by:

You can also try out applications listed on the KDE applications website and see if you can find something in the application that needs improving. If you are a programmer, have a look at the bugs listed for that application. Once you have found a project you find interesting, and have ideas for improvements, contact the project leads to see if they might be supportive of your ideas and, if you find at least one mentor, post your idea on the KDE SoK wiki before the 15th of January 2023.

Timeline

Date           Event
15 Dec 2022    Start of Season of KDE
15 Jan 2023    Deadline for Participant and Mentor Applications
22 Jan 2023    Projects Announced
24 Jan 2023    Start of Work
15 Apr 2023    End of Work
20 Apr 2023    Results Announced
20 May 2023    Certificates Issued
1 Jun 2023     Merchandise and SWAG Sent Out by Courier

Past SoKs

If you would like to know more about SoK projects and learn what to expect, check out SoK 2022 and read some of the outcomes from last year. While much of the work is focused on programming, projects in all areas related to the KDE ecosystem are welcome.

You too make KDE possible! Check out our End of Year Fundraiser!
Categories: FLOSS Project Planets

Opensource.com: Simplify the installation of Drupal modules with Project Browser

Planet Drupal - Wed, 2022-12-14 03:00
Simplify the installation of Drupal modules with Project Browser Nadiia Nykolaichuk Wed, 12/14/2022 - 03:00

The Project Browser initiative is aimed at providing an easy process for discovering and installing contributed modules directly from the Drupal admin dashboard with the click…

Drupal's modular structure lets you extend your website with an endless array of features. Then again, discovering the right module and installing it on your website can be a…

Categories: FLOSS Project Planets

Five Jars: Getting Started with Your First Website: Step-by-Step Web Development Roadmap

Planet Drupal - Wed, 2022-12-14 00:42
Everyone knows what a website is, but only a few know what web development is. Website Development is a blanket term for everything related to website building.
Categories: FLOSS Project Planets

The Drop Times: Project Browser and Automatic Update Make the Open Web Approachable

Planet Drupal - Wed, 2022-12-14 00:18
Drupal's core capabilities, such as content reuse, multi-channel publishing, and in-depth customization for website users, are widely praised by programmers and marketers alike. More great enhancements are on the way with Drupal 10, which is being released today.
Categories: FLOSS Project Planets

Russ Allbery: Review: Contact

Planet Debian - Tue, 2022-12-13 23:21

Review: Contact, by Carl Sagan

Publisher: Pocket Books
Copyright: 1985
Printing: October 1986
ISBN: 0-671-43422-5
Format: Mass market
Pages: 434

Contact is a standalone first-contact science fiction novel. Carl Sagan (1934–1996) was best known as a non-fiction writer, astronomer, and narrator of the PBS non-fiction program Cosmos. This is his first and only novel.

Ellie Arroway is the director of Project Argus, a radio telescope array in the New Mexico desert whose primary mission is SETI: the search for extra-terrestrial intelligence by scanning the skies for unexpected radio signals. Its assignment to SETI is controversial; there are radio astronomy projects waiting, and although 25% of the telescope time is assigned to non-SETI projects, some astronomers think the SETI mission should be scrapped and Argus fully diverted to more useful research. That changes overnight when Argus picks up a signal from Vega, binary pulses representing the sequence of prime numbers.

The signal of course doesn't stop when the Earth rotates, so Ellie and her team quickly notify every radio observatory they can get hold of to follow the signal as it passes out of their line of sight. Before long, nearly every country with a radio telescope is involved, and Russian help is particularly vital since they have ship-mounted equipment. The US military and intelligence establishment isn't happy about this and makes a few attempts to shove the genie back into the bottle and keep any further discoveries secret, without a lot of success. (Sagan did not anticipate the end of the Cold War, and yet ironically relations with the Russians in his version of the 1990s are warmer by far than they are today. Not that this makes the military types any happier.) For better or worse, making sense of the alien signal becomes a global project.

You may be familiar with this book through its 1997 movie adaptation starring Jodie Foster. What I didn't know before reading this book is that it started life as a movie treatment, co-written with Ann Druyan, in 1979. When the movie stalled, Sagan expanded it into a novel. (Given the thanks to Druyan in the author's note, it may not be far wrong to name her as a co-author.) If you've seen the movie, you will have a good idea of what will happen, but the book gives the project a more realistic international scope. Ellie has colleagues carefully selected from all over the world, including for the climactic moment of the story.

The biggest problem with Contact as a novel is that Sagan is a non-fiction writer who didn't really know how to write a novel. The long, detailed descriptions of the science and the astronomical equipment fit a certain type of SF story, but the descriptions of the characters, even Ellie, are equally detailed and yet use the same style. The book starts with an account of Ellie's childhood and path into science written like a biography or a magazine profile, not like a novel in which she's the protagonist. The same is true of the other characters: we get characterization of a sort, but the tone ranges from Wikipedia article to long-form essay and never quite feels like a story.

Ellie is the most interesting character in the book, partly because the way Sagan writes her is both distant and oddly compelling. Sagan (or perhaps Druyan?) tries hard to show what life is like for a woman born in the middle of the 20th century who is interested in science, and I think mostly succeeds, although Ellie's reactions to sexism felt weirdly emotionless. The descriptions of her relationships are even odder, and they are the parts where this book felt the least like a novel, but Sagan does sell some of that tone as reflective of Ellie's personality. She's a scientist, the work is the center of her life, and everything else, even when important, is secondary. It doesn't entirely make the writing style work, but it helps.

Sagan does attempt to give Ellie a personal development arc related to her childhood and her relationships with her father and step-father. I thought the conclusion to that was neither believable nor anywhere near as important as Sagan thought it was, which was off-putting. Better were her ongoing arguments with evangelical Christians, one of whom is a close-minded ass and the other of whom is a far more interesting character. They felt wedged into this book, and I'm dubious a realistic version of Ellie would have been the person to have those debates, but it's a subject Sagan clearly has deep personal interest in, and that shows in how they're written.

The other problem with Contact as a novel is that Sagan does not take science fiction seriously as a genre, instead treating it as a way to explore a thought experiment. To a science fiction reader, or at least to this science fiction reader, the interesting bits of this story involve the aliens. Those are not the bits Sagan is interested in. His attention is on how this sort of contact, and project, would affect humanity and human politics. We do get some more traditional science fiction near the end of the book, but Sagan then immediately backs away from it, minimizes that part of the story, and focuses exclusively on the emotional and philosophical implications for humans of his thought experiment. Since I found his philosophical musings about agnosticism and wonder and discovery less interesting than the actual science fiction bits, I found this somewhat annoying. The ending felt a bit more like a cheap trick than a satisfying conclusion.

Interestingly, this entire novel is set in an alternate universe, for reasons entirely unexplained (at least that I noticed) in the book. It's set in the late 1990s but was written in 1985, so of course this is an alternate future, but the 1985 of this world still isn't ours. Yuri Gagarin was the first man to set foot on the moon, and the space program and the Cold War developed in subtly different ways. I'm not sure why Sagan made that choice, but it felt to me like he was separating his thought experiment farther from our world to give the ending more plausible deniability.

There are, at the time of the novel, permanent orbital colonies for (mostly) rich people, because living in space turns out to greatly extend human lifespans. That gives Sagan an opportunity to wax poetic about the life-altering effects of seeing Earth from space, which in his alternate timeline rapidly sped up nuclear disarmament and made the rich more protective of the planet. This is an old bit of space boosterism that isn't as common these days, mostly because it's become abundantly clear that human psychology doesn't work this way. Sadly, rich sociopaths remain sociopaths even when you send them into space. I was a bit torn between finding Sagan's predictions charmingly hopeful and annoyingly daft.

I don't think this novel is very successful as a novel. It's much longer than it needs to be and long parts of it drag. But it's still oddly readable; even knowing the rough shape of the ending in advance, I found it hard to put down once the plot properly kicks into gear about two-thirds of the way through. There's a lot in here that I'd argue with Sagan about, but he's thoughtful and makes a serious attempt to work out the political and scientific ramifications of such a discovery in detail. Despite the dry tone, he does a surprisingly good job capturing the anticipation and excitement of a very expensive and risky scientific experiment.

I'm not sure I would recommend this book to anyone, but I'm also the person who found Gregory Benford's Timescape to be boring and tedious, despite its rave reviews as a science fiction novel about the practice of science. If that sort of book is more your jam, you may like Contact better than I did.

Rating: 6 out of 10

Categories: FLOSS Project Planets

Malthe Borch: Why pre-cloud tools are quietly dying

Planet Python - Tue, 2022-12-13 20:39

In Why is everyone trying to kill Airflow, dataengineeringdude argues that pre-cloud era tools such as the popular open-source orchestrator Apache Airflow are here to stay, even if there are numerous things to be unhappy about (read the article for a long list of those).

Trouble is, Airflow is cut from the same cloth as other pre-cloud era tools: it's complex to operate, and the codebase shows its age, being built on a language that wasn't designed to support large programs. Ironically, that's still how Airflow gets most of its power: it's programmable in its implementation language, Python.
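
For readers who haven't used it, that power looks roughly like the following sketch: a DAG is an ordinary Python module, with tasks and dependencies wired up in code (simplified, Airflow 2.x-style, not a production pipeline).

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling data")


def load():
    print("writing data")


# Task wiring is plain Python: operators, a context manager, and >> for edges.
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 16 * * 0",  # cron syntax: every Sunday at 4 PM
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task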

It's the end-of-year season, which is when you make predictions. My prediction is that Airflow is not here to stay, and the reason – in addition to its pre-cloud era design – is that DAG execution is in the process of being unbundled.

I say this with reverence for Python, a '90s programming language that I have written thousands upon thousands of lines of code in, and for the original idea of Airflow, a gloriously pragmatic tool that does pretty much what you want – or wanted.

Airflow is a collection of components that are not best-in-class on their own. That's often the case with integrated systems; the upside is that the components are typically well integrated. For example, GitHub includes a code editor that is "good enough" for small edits – but it's no Visual Studio Code.

dataengineeringdude already mentions that Airflow's UI is not the best, although it has seen some minor improvements over the past few years. The task execution log is plain at best, and integration with better log tools amounts to a hyperlink that takes you out of the experience. The scheduler is not the best either: it provides the same old scheduling capabilities as cron and has no concept of a calendar. Christmas, anyone?

But the reason I wanted to write this post is that I have been following the progress of Dagger, a new DevOps platform co-founded by Solomon Hykes, who brought us Docker (about which you'll probably see an "Is WASM trying to kill Docker?" article real soon).

It's in the name! The ironic thing is that, as of this writing, Dagger is being marketed as "CI/CD Pipelines as Code" – but if it looks like a duck and quacks like a duck, it's probably more than just a CI/CD tool, because the requirements are the same as those of data engineering pipelines.

Where Dagger is fundamentally different from Airflow DAGs is that the code is declarative. You can code up a DAG in any of the supported languages, including "non-languages" such as Cue. The resulting logic is then executed by an engine which ... could be distributed. You get the drift. It's the same programming model that has been popularized for a decade by tools such as Apache Spark.
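
Dagger's actual SDKs look different (consult its docs for the real API); the following is only a hypothetical sketch of that declarative model, with invented names: the pipeline is plain data, and a separate engine – local or distributed – plans and executes it.

# Hypothetical sketch of a declarative pipeline (not Dagger's API): the DAG is
# data, and an engine decides how and where to run it.
pipeline = {
    "name": "nightly-report",
    "tasks": {
        "extract": {"image": "python:3.11", "run": ["python", "extract.py"]},
        "transform": {"image": "python:3.11", "run": ["python", "transform.py"], "needs": ["extract"]},
        "publish": {"image": "alpine", "run": ["sh", "publish.sh"], "needs": ["transform"]},
    },
}


def execution_order(tasks):
    """Toy topological sort -- the kind of planning a real engine would do."""
    done, order = set(), []
    while len(done) < len(tasks):
        for name, spec in tasks.items():
            if name not in done and set(spec.get("needs", [])) <= done:
                order.append(name)
                done.add(name)
    return order


print(execution_order(pipeline["tasks"]))  # ['extract', 'transform', 'publish']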

But Airflow can run my DAG every Sunday at 4PM! Lots of tools can kick off a process (such as Dagger) at a cron-defined time. Even Azure Data Factory can do that – for free!

We'll see if Dagger reaches critical awareness, but the future is cloud-native, distributed, declarative, local-first, and speaks your language.

Categories: FLOSS Project Planets

Declassed Art: Which compression method is best for API?

Planet Python - Tue, 2022-12-13 19:00
Back when I worked a regular job, I did not care about such niceties. The API returns too much data? Let's turn gzip on in NGINX! Given that we have no other options for HTTP so far, nobody cares. But what if they did?
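
For anyone who does care, the standard library makes a rough comparison easy. A quick sketch (the payload and compression levels here are arbitrary; real API responses will behave differently, and third-party codecs such as brotli or zstd are worth adding):

import bz2
import gzip
import json
import lzma
import time
import zlib

# A JSON-ish stand-in for an API response.
payload = json.dumps(
    [{"id": i, "name": f"item-{i}", "tags": ["a", "b", "c"]} for i in range(10_000)]
).encode()

codecs = {
    "gzip (level 6)": lambda data: gzip.compress(data, compresslevel=6),
    "zlib (level 9)": lambda data: zlib.compress(data, 9),
    "bz2": bz2.compress,
    "lzma/xz": lzma.compress,
}

print(f"original: {len(payload):,} bytes")
for name, compress in codecs.items():
    start = time.perf_counter()
    compressed = compress(payload)
    elapsed = time.perf_counter() - start
    print(f"{name:>15}: {len(compressed):>9,} bytes  ({elapsed * 1000:6.1f} ms)")
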
Categories: FLOSS Project Planets
