Planet Python

Planet Python - http://planetpython.org/
Updated: 9 hours 5 min ago

Python for Beginners: What is a good comment/code ratio?

Mon, 2021-03-08 11:09

A comment is a piece of text in the source code that isn’t executed by the compiler or interpreter when the program runs. Comments can only be read when we have access to the source code. To answer the question about a good comment/code ratio, we first need to understand the reasons for including comments in our source code.

Why are comments needed?

It is very well known that comments don’t improve the quality of the program itself; they improve the readability and understanding of the source code. The following are some specific reasons for using comments in our source code.

1. We use comments to specify the license or copyright of the source code.

When we write source code for a commercial purpose, it is customary to specify the name of the author and the date and time at which the source code was created. We also specify whether the source code is copyrighted or licensed and whether it can be used and modified without the permission of the authors.

2. We use comments to specify contracts if any.

If the source code is written for a different firm for commercial purposes, the conditions for usage and copyright are generally specified in the source code. When anyone tries to modify the code, they know in advance whether they are allowed to use or modify it, so that no legal trouble arises in the future.

The above two uses are accomplished by using header comments in the source code. Header comments are written at the start of the program file; they specify the author, date of creation and licenses, and describe what the source code in the file does. If anyone modifies the source code, the modification should also be listed in the header comment so that others can get information about it.

For example, following code snippet shows a python comment in the starting of a python program file.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 6 14:13:03 2021
@author: aditya1117
This code is for pythonforbeginners.com
"""

3. We use comments for documentation of the source code.

When a new framework or library is written for a programming language, it includes a specification of each of the methods in the source code and their use cases. This is done to make users of the framework aware of how to use its functions and methods. It also specifies the parameters, desired input values and expected output of the functions and methods so that programmers can use the code with very little difficulty.

For documentation of source code, we use comments above every function. These comments are also called function header comments. For example, the comment in the following example describes the function defined after it.

# This function is an example of a nested function which defines another function and executes it when called.
def outer_function():
    x = 10
    print("It is outer function which encloses a nested function")

    def nested_function():
        print("I am in nested function and I can access x from my enclosing function's scope. Printing x")
        print(x)

    nested_function()

4. We use comments to clarify why a specific statement is written in the source code.

When the usage of a particular function or expression doesn’t seem obvious in the source code, we use comments to specify why the function or the statement has been written there. It increases the readability and understanding of the code.
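For example, here is a small, hypothetical snippet of the kind of “why” comment this refers to (the rounding rule is invented purely for illustration):

subtotal, tax_rate = 100.0, 0.23
# Round to 2 decimals because the (hypothetical) payment provider rejects longer fractions.
total = round(subtotal * (1 + tax_rate), 2)
print(total)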

5. We use comments to specify instructions which help in debugging the code.

There may be a need to modify a part of the code in the future if an error occurs. In this case, the programmer who corrects the source code will need to know the properties of the functions. They will need a clear understanding of each statement included in the code so that they can improve it. For the programmer to understand the code in such instances, comments should be added to the source code whenever it seems necessary.

The above two use cases are accomplished with the help of inline comments. 

What should comments in the source code do?

1. Explain why a statement has been written.

A comment should explain why a statement has been used in the source code, and only when that is not obvious to the reader. Redundant comments should not be added: instead of increasing the understandability of the code, comments used without proper need may decrease its readability.

For example, the comment in the following code explains that a Python dictionary is being created, which anyone can see from the code itself. Hence the comment is redundant.

# define a dictionary
myDict = {"Name": "PythonForBeginners", "code": "PFB"}

2. Should not explain how the statement performs any operation.

Someone who has been writing code for a significant amount of time and is an experienced programmer can easily understand how the statements in the source code execute. The only thing they may fail to understand is why a statement is used. So, comments should not try to explain how a statement works.

3. Specify the metadata about the program file.

The comment at file header should specify the author, date of creation, licenses and describe what the source code in the file does. If any modification is done in the source code by anyone, it should also be specified by the header comment so that others can get information about it. The comments at function headers should specify the signature, input details, functionality and output of a function.

Conclusion

Having read the details, we can now address our initial question about the comment/code ratio. There is no standard comment/code ratio: we can use any number of comments as long as they serve a purpose. But keep in mind that we should not include redundant comments or comment on things that are obvious from the source code. Comments should enhance the understandability and maintainability of the source code and should not explain every statement in the code.

The post What is a good comment/code ratio? appeared first on PythonForBeginners.com.

Categories: FLOSS Project Planets

Real Python: The Real Python Podcast: It's Been a Year!

Mon, 2021-03-08 09:00

This week, the Real Python Podcast is reaching its fiftieth episode!

It’s been quite a year, full of sharing and learning and connecting in the Python community. We’re looking forward to bringing you more interesting guests, interviews with expert Pythonistas, and lots of behind-the-scenes with the Real Python team.

Here’s a quick look at some of what’s been going on with the podcast in the past year as well as a sneak peek at what’s to come!


PyCon Speakers Shared Their Expertise

PyCon is an important hub for the Python community, so we absolutely had to bring you the experts who are sharing their expertise at this conference. Here are some of the speakers you heard from this past year:

Read the full article at https://realpython.com/real-python-podcast-first-year/ »


Categories: FLOSS Project Planets

Stack Abuse: How to Sort a Pandas DataFrame by Date

Mon, 2021-03-08 08:30
Introduction

Pandas is an extremely popular data manipulation and analysis library. For many, it's the go-to tool for loading and analyzing datasets.

Correctly sorting data is a crucial element of many tasks regarding data analysis. In this tutorial, we'll take a look at how to sort a Pandas DataFrame by date.

Let's start off with making a simple DataFrame with a few dates:

import pandas as pd

data = {'Name': ["John", "Paul", "Dhilan", "Bob", "Henry"],
        'Date of Birth': ["01/06/86", "05/10/77", "11/12/88", "25/12/82", "01/06/86"]}

df = pd.DataFrame(data)
print(df)

By default our output is sorted by the DataFrame's index:

     Name Date of Birth
0    John      01/06/86
1    Paul      05/10/77
2  Dhilan      11/12/88
3     Bob      25/12/82
4   Henry      01/06/86

The eagle-eyed may notice that John and Henry have the same date of birth - this is on purpose, as we'll see in a moment.

Convert Strings to Datetime in Pandas DataFrame

We have input Date of Birth in date format and it appears to be formatted as such. However, the first thing we need to do is ensure Pandas recognises and understands that this date is in fact a date.

The way Pandas stores and manipulates data in a DataFrame is determined by its data type.

The data type of each value is assigned automatically, based on what it looks like. 60 will be assigned an integer type, while John will be assigned a string type. Let's check the current data type of each column:

print(df.dtypes)

This gives us our list of data types:

Name             object
Date of Birth    object
dtype: object

We can see our Date of Birth column has been assigned a basic string object type by default. However, in order to sort, analyse or manipulate our dates correctly, we need Pandas to recognise that this column contains dates.

Let's explicitly change the data type in our Date of Birth column from an object type to a datetime type.

The easiest way to do this is to use the to_datetime() function:

df["Date of Birth"] = pd.to_datetime(df["Date of Birth"]) print(df.dtypes)

Now, if we check our output:

Name                     object
Date of Birth    datetime64[ns]
dtype: object

So, we can see we have successfully changed our data type to datetime.

Alternatively, we can manually specify the data type of our column, provided of course we know what data type we want it to be:

df["Date of Birth"] = df["Date of Birth"].astype('datetime64[ns]')

Output:

Name                     object
Date of Birth    datetime64[ns]
dtype: object

Whilst both of these methods produce the same result, the to_datetime() method is preferred as it was explicitly designed for this purpose.
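As a side note (not part of the original example), strings like "01/06/86" are ambiguous between day-first and month-first layouts. If you know the layout of the raw strings, you can pass it to to_datetime() explicitly; the sketch below assumes a day/month/two-digit-year layout, which may or may not match your data:

# Assumption: the raw strings are day/month/two-digit-year; adjust the format string if they are not.
df["Date of Birth"] = pd.to_datetime(df["Date of Birth"], format="%d/%m/%y")
# A looser alternative is the dayfirst flag:
# df["Date of Birth"] = pd.to_datetime(df["Date of Birth"], dayfirst=True)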

Sorting a DataFrame by Date in Pandas

Now that Pandas correctly recognizes our data types, let's sort the DataFrame.

Note: All of the methods we'll use don't sort in-place, so you'll have to either reassign the changed DataFrame to a new (or the same) reference variable to persist the change - or you can use the inplace argument to change the default behavior.

Sort by Single Date Column in Ascending Order

The sort_values() method will, by default, sort data in ascending order. For dates this would mean the first or earliest in order will appear at the top of the list:

df.sort_values(by='Date of Birth', inplace=True)
print(df)

Running this code results in:

     Name Date of Birth
1    Paul    1977-05-10
3     Bob    1982-12-25
0    John    1986-01-06
4   Henry    1986-01-06
2  Dhilan    1988-11-12

Alternatively, if you don't want to use the inplace argument, you can simply re-assign the returned DataFrame from the sort_values() method to df (or any other reference variable):

df = df.sort_values(by='Date of Birth')

As John and Henry have the same birthday, the order is based on their corresponding index number.

Sort by Single Date Column in Descending Order

Changing our order of sort to descending can be done by setting the ascending argument to False when calling the sort_values() function:

df.sort_values(by='Date of Birth', ascending = False, inplace=True)

This time we get our data sorted in descending order, meaning the last or most recent will appear at the top of our list. Again as John and Henry have the same birthday their order is based on their index number:

     Name Date of Birth
2  Dhilan    1988-11-12
0    John    1986-01-06
4   Henry    1986-01-06
3     Bob    1982-12-25
1    Paul    1977-05-10

Sort by Multiple Date Columns

So, what happens if we have multiple date columns that we want to sort by?

Let’s add another date-related column to our DataFrame and make sure both our data types are correctly assigned:

# Values for the new column
employment_start = ["22/05/16", "17/08/10", "22/05/16", "11/06/19", "16/06/05"]

# Adding columns to DataFrame
df['Employment Start'] = employment_start

# Applying to_datetime() function to multiple columns at once
df[['Date of Birth', 'Employment Start']] = df[['Date of Birth', 'Employment Start']].apply(pd.to_datetime)

print(df.dtypes)
print(df)

Now, let's check if things look good:

Name                        object
Date of Birth       datetime64[ns]
Employment Start    datetime64[ns]
dtype: object

     Name Date of Birth Employment Start
0    John    1986-01-06       2016-05-22
1    Paul    1977-05-10       2010-08-17
2  Dhilan    1988-11-12       2016-05-22
3     Bob    1982-12-25       2019-11-06
4   Henry    1986-01-06       2005-06-16

Sort by Multiple Date Columns in Ascending Order

To sort the DataFrame by both Date of Birth and Employment Start in ascending order, we simply need to add both column names to our sort_values() method. Just bear in mind the priority of the sort is determined by which column is entered first:

df.sort_values(by=['Date of Birth', 'Employment Start'], inplace=True)

As this method defaults to ascending order, our output will be:

     Name Date of Birth Employment Start
1    Paul    1977-05-10       2010-08-17
3     Bob    1982-12-25       2019-11-06
4   Henry    1986-01-06       2005-06-16
0    John    1986-01-06       2016-05-22
2  Dhilan    1988-11-12       2016-05-22

As Date of Birth is the first column entered in our method, Pandas is prioritizing it. Since John and Henry have the same Date of Birth, they're sorted by the Employment Start column instead.

Sort by Multiple Date Columns in Descending Order

As with the single column sort, we can change the order to descending order by changing the ascending parameter to False:

df.sort_values(by=['Date of Birth', 'Employment Start'], ascending = False, inplace=True)

Now, our output in descending order is:

     Name Date of Birth Employment Start
2  Dhilan    1988-11-12       2016-05-22
0    John    1986-01-06       2016-05-22
4   Henry    1986-01-06       2005-06-16
3     Bob    1982-12-25       2019-11-06
1    Paul    1977-05-10       2010-08-17

As we can see John and Henry both appear higher in the list as the birthdays are displayed in descending order. This time though, John takes priority over Henry due to his more recent Employment Start date.

Sort by Multiple Date Columns and Variable Order Sorts

Now, what if we not only want to sort using multiple columns but also have these columns sorted using different ascending criteria? With Pandas, this can be implemented within the same sort_values() method we've used so far. We just have to pass the correct and corresponding list of values in the ascending parameter.

In this example let’s assume we want to sort our Employment Start in ascending order, i.e. longest serving first, but then their Date of Birth in descending order i.e. youngest first:

df.sort_values(by=['Employment Start', 'Date of Birth'], ascending = [True, False], inplace=True)

The data is first sorted by Employment Start in ascending order, this takes priority as this was the first column passed in our method. We then sort Date of Birth in descending order. As Dhilan and John share the same Employment Start date, Dhilan now takes priority as he is younger than John:

     Name Date of Birth Employment Start
4   Henry    1986-01-06       2005-06-16
1    Paul    1977-05-10       2010-08-17
2  Dhilan    1988-11-12       2016-05-22
0    John    1986-01-06       2016-05-22
3     Bob    1982-12-25       2019-11-06

Conclusion

Given the popularity of the Pandas library, it is hardly surprising that sorting data based on columns is a straightforward process. We've taken a look at the flexibility of using the sort_values() method across single and multiple columns, in ascending, descending and even variable order. Whilst we have focused on sorting by date, this method can be used across multiple data types.

When looking to sort by date in particular, the first, and arguably most important step, is making sure we have correctly assigned the datetime type to our data. Without correctly defining our data type we risk Pandas not recognising our dates at all.

Categories: FLOSS Project Planets

Codementor: Descriptive Statistics for World GDP per Capita with Python

Mon, 2021-03-08 08:02
Learn how to produce descriptive statistics for GDP data using Python data science techniques.
Categories: FLOSS Project Planets

Python Pool: How to Solve TypeError: ‘int’ object is not Subscriptable

Mon, 2021-03-08 01:08
Introduction

Some objects in Python are subscriptable. This means that they can hold other objects; an integer, however, is not a subscriptable object. Integers are used to store whole number values in Python. If we treat an integer as a subscriptable object, it will raise an error. So, we will be discussing a particular type of error that we get while writing code in Python, i.e., TypeError: ‘int’ object is not subscriptable. We will also discuss the various methods to overcome this error.

What is TypeError: ‘int’ object is not subscriptable?

What is TypeError?

The TypeError occurs when you try to operate on a value that does not support that operation. Let us understand with the help of an example:

Suppose we try to concatenate a string and an integer using the ‘+’ operator. Here, we will see a TypeError because the + operation is not allowed between the two objects that are of different types.

# example of TypeError
S = "Latracal Solutions"
number = 4
print(S + number)

Output:

Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    print(S + number)
TypeError: can only concatenate str (not "int") to str

Explanation:

Here, we have taken a string ‘Latracal Solutions’ and a number. After that, in the print statement, we try to add them. As a result, a TypeError occurred.

What is ‘int’ object is not subscriptable?

This message tells us that we are treating an integer as a subscriptable object, for example by indexing or slicing it. An integer is not a subscriptable object. The objects that contain other objects or data types, like strings, lists, tuples, and dictionaries, are subscriptable. Let us take an example:

1. Number: typeerror: ‘int’ object is not subscriptable

# example of an integer which shows a TypeError
number = 1500
print(number[0])

output:

Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    print(number[0])
TypeError: 'int' object is not subscriptable

Explanation:

Here, we have taken a number and tried to print it through indexing, but it shows a TypeError as integers are not subscriptable.

2. List: typeerror: ‘int’ object is not subscriptable

This TypeError doesn’t occur with a list, as a list is a subscriptable object. We can easily perform operations like slicing and indexing.

# list example which will run correctly
Names = ["Latracal", " Solutions", "Python"]
print(Names[1])

Output:

Solutions

Explanation:

Here firstly, we have taken the list of names and accessed it with the help of indexing. So it shows the output as Solutions.

Daily Life Example of How typeerror: ‘int’ object is not subscriptable can Occur

Let us take an easy and daily life example of your date of birth, written in date, month, and year. We will write a program to take the user’s input and print out the date, month, and year separately.

# Our program begins from here
Date_of_birth = int(input("what is your birth date?"))
birth_date = Date_of_birth[0:2]
birth_month = Date_of_birth[2:4]
birth_year = Date_of_birth[4:8]
print(" birth_date:", birth_date)
print("birth_month:", birth_month)
print("birth_year:", birth_year)

Output:

what is your birth date?31082000
Traceback (most recent call last):
  File "C:/Users/lenovo/Desktop/fsgedg.py", line 3, in <module>
    birth_date = Date_of_birth[0:2]
TypeError: 'int' object is not subscriptable

Explanation:

Here, we have written a program that prints the parts of a date of birth separately with the help of indexing. Firstly, we have taken the integer input of the date of birth in the form of date, month, and year. Then, we have tried to separate the date, month, and year through indexing and print them separately, but we get TypeError: ‘int’ object is not subscriptable. As we studied above, an integer object is not subscriptable.

Solution of TypeError: ‘int’ object is not subscriptable

We will make the same program for printing the date of birth by taking input from the user. In the previous program, we converted the date of birth to an integer, so we could not perform operations like indexing and slicing.

To solve this problem now, we will remove the int() statement from our code and run the same code.

# remove int() from the input()
Date_of_birth = input("what is your birth date?")
birth_date = Date_of_birth[0:2]
birth_month = Date_of_birth[2:4]
birth_year = Date_of_birth[4:8]
print(" birth_date:", birth_date)
print("birth_month:", birth_month)
print("birth_year:", birth_year)

Output:

what is your birth date?31082000
 birth_date: 31
birth_month: 08
birth_year: 2000

Explanation:

Here, we have kept the input as a string by simply removing the int(), and now we can easily do indexing and slicing on it, as a string is subscriptable, so no error arises.

Conclusion: TypeError: ‘int’ object is not subscriptable

We have learned all key points about the TypeError: ‘int’ object is not subscriptable. There are objects like list, tuple, strings, and dictionaries which are subscriptable. This error occurs when you try to do indexing or slicing in an integer.

Suppose we need to perform operations like indexing and slicing on integers. Firstly, we have to convert the integer into a string, list, tuple, or dictionary.
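A minimal sketch of that idea, using str() to make an integer’s digits sliceable (the variable names are only for illustration):

number = 31082000
digits = str(number)           # convert the int to a string so it can be sliced
print(digits[0:2])             # '31'
print(int(digits[0:2]) + 1)    # convert back to int when arithmetic is needed: 32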

Now you can easily solve this Python TypeError like a smart coder.

However, if you have any doubts or questions, do let me know in the comment section below. I will try to help you as soon as possible.

Happy Pythoning!

The post How to Solve TypeError: ‘int’ object is not Subscriptable appeared first on Python Pool.

Categories: FLOSS Project Planets

Mike Driscoll: PyDev of the Week: Jens Winkelmann

Mon, 2021-03-08 01:05

This week we welcome Jens Winkelmann (@WinmanJ) as our PyDev of the Week! Jens is a former PhD researcher in the Foams and Complex System Group at Trinity College Dublin (TCD) but is now working as a Data Scientist at talpasolutions. You can find out more about what Jens does on his web page. Jens is also a conference speaker.

Let’s spend a few moments getting to know Jens better!

Can you tell us a little about yourself (hobbies, education, etc):

I was born and raised in the beautiful city of Essen, Germany, where I also currently live and work again after a couple of years abroad.

I obtained a B.Sc. and an M.Sc., both in Physics, from TU Dortmund (Germany), in 2013 and 2015, respectively. At the end of 2015, I moved to Dublin, Ireland, to pursue a PhD in Physics in the Foams and Complex Systems research group of Trinity College Dublin, from which I graduated last year.

In December 2019 I returned to Essen and am working here now as a Data Scientist at talpasolutions GmbH. Talpasolutions is the leading driver of the Industrial Internet of Things in the heavy industry. We build digital products that offer actionable insights for machine manufacturers as well as operators based on collected machine sensor data.

In my free time I enjoy climbing, both rope climbing as well as bouldering. It is a great sport because it combines mental focus with physical workout and can be individual or communal as much as you like.

Why did you start using Python?

I started using Python for the data analysis and plotting parts of the Physics labs during my undergrad at TU Dortmund. Some friends of my study group who were more familiar with programming languages introduced me to it. They quickly convinced me that it reduces my stress level for the Physics labs tremendously in the long run compared to Excel.

First, I used it for typical tasks in Physics labs where you analyse and then plot experimental data using NumPy and Matplotlib. Over time the data analysis became more and more complex. I also used it for my Bachelor, Master and later on PhD thesis, where I analysed and visualised large amounts of data created by computer simulations. It was only then that I fully appreciated what a powerful tool Python can be.

What other programming languages do you know and which is your favorite?

I also learned C/C++ in an introductory coding lecture as well as part of a Computational Physics lecture. I implemented a hydrodynamic simulation in C/C++ for my Bachelor as well as Master thesis. Computational speed is quite essential here and everything needed to be programmed from scratch. So Python was unfortunately not an option for this.

I also got a bit into functional programming through a lecture about Haskell during my Master studies. But the only learning that remained is the functools package in Python which provides some functional programming tools.

Python is by far my favourite programming language at the moment. Since it is so straightforward, it allows me to fully focus on the problem that I’d like to solve rather than getting distracted by unnecessary boiler-plate code. This, and Python’s large ecosystem ranging from NumPy to tensorflow and keras, makes it a powerful tool in the repertoire of a Data Scientist.

What projects are you working on now?

Most of my current projects are related to my work as a Data Scientist at talpasolutions where I analyse data from the world’s largest machines that are being used in the mining industry. Our data science solutions increase overall equipment efficiency, operational productivity, predict possible maintenance downtimes, and also have an ecological impact: For example, we help our customers to reduce their diesel consumption and thus save CO2 emissions.

There are two particular projects or use cases that I’ve been currently involved in:

  • Activity detection &
  • predictive maintenance.

Our activity detection algorithms are comparable to object detection in image recognition. The sensors of heavy machinery such as a truck or excavator can be used to classify its current activity state. A truck, for instance, may be loading, dumping, idle, driving loaded, or driving unloaded. Based on sensor signals such as payload, speed, and dump angle, our algorithms infer its activity state. Activity detection algorithms are crucial because they build the basis for digital surveillance of the mine’s productivity and further analytical tools of our software. Based on these algorithms, we provide actionable insights to our users that optimise their mine operations, e.g.: What is the average loading time of a truck? What are the largest efficiency losses in the mine operation?

The goal behind predictive maintenance is to reduce the mine operator’s maintenance costs, which occur either due to unplanned downtimes or component failures. Our algorithms achieve this goal by predicting unplanned downtimes based on the machine’s historical data. The analytical results are then displayed in our software solution to inform the right person at the right time. With unplanned downtime quickly costing more than $1000 per truck per hour, the importance of this issue is indisputable. One exemplary strategy involves live-casting sensor data and using anomaly detection. For this strategy, we employ a neural network to detect possible anomalous behaviour in sensor signals such as the suspension pressure.

If this got you excited about my Data Science work feel free to watch my talk at the pyjamas conference (an online conference dedicated to Python) on YouTube.

Another project unrelated to my job as a Data Scientist includes writing an academic book by the title Columnar Structures of Spheres: Fundamentals and Applications together with Professor Ho-Hei Chan from the Harbin Institute of Technology in China. The book covers the topic of my PhD thesis about so-called ordered columnar structures that we investigated using computer simulations in Python. Such structures occur when identical spheres are being packed densely inside a cylindrical confinement (for more details check out this wikipedia article). We simulated such structures by employing optimisation algorithms in Python, which helped us to discover a novel experimental foam structure, a so-called line-slip structure.

The full range of their applications is still under discovery, but so far they have been found in foam structures (like beer foam), botany, and nano science. My personal favourite application is that of a photonic metamaterial. Such materials are characterised by having a negative refractive index which allows them to be used for super lenses or cloaking. Some of our structures are potential candidates for such a material.

Because of Covid-19, we actually made good progress on the writing lately. The book is now planned to be published in the summer 2021 by Jenny Stanford Publishing.

Which Python libraries are your favorite (core or 3rd party)?

The Python ecosystem provides an amazing variety of well-developed Python libraries for Data Scientists. They all serve different purposes. Some that I most often use are:

  • Pandas (for data wrangling and manipulation)
  • NumPy (for numerical data structures and methods)
  • SciPy (for everything scientific, e.g. linear algebra, optimisation algorithms, or statistics)
  • Scikit-learn (for standard machine learning models)
  • Matplotlib (for data visualisation)
  • Plotly (for interactive data visualisation)

I especially like Matplotlib because of how versatile it is in creating graphs and data visualisation. But of course, Plotly shouldn’t go unmentioned here either. Matplotlib lacks a bit in plotting large amount of data in an interactive graph. This is where Plotly actually shines.

What drew you to data science?

In retrospect, it seems like Data Science is the natural path after studying Physics. But winding the clocks back to when I was starting my Physics undergrad degree, I didn’t even know what Data Science was.

During my time of my PhD in Dublin, I came across the Python Ireland community and participated in a few of the monthly meet-ups as well as the Python Conference in 2016. The talks and discussion with people at these meet-ups made me curious about Data Science. What I really liked about Data Science was the fact that it provided a way to do Science outside of Academia. On top of this, my Python skills turned out to be quite useful for Data Science as well.

So after I finished my PhD in Dublin, I decided to apply for a couple of positions in Germany and Ireland, including my current position at talpasolutions in my hometown Essen.

Talpasolutions stood out to me from all the other companies that I applied to because talpasolutions’ mission has meaning to me. By developing digital products for the mining industry, we improve the working conditions of heavy industry workers and we make the industry more environment-friendly by reducing its carbon footprint.

Additionally, the mining industry has a long and famous history in Essen. Even though the last mines have been closed for years, it feels like, we at talpasolutions are carrying on the spirit of this era. Since Essen is my hometown, I really enjoy working here. For many other Data Science positions, I would be starving for meaning because what lots of those companies do is make people click ads or make rich people richer.

Can people without math backgrounds get into data science? Why or why not?

I think a solid foundation of math skills, especially statistics, is essential for Data Science. It is important to understand the math behind the models that you employ as a Data Scientist. The math background helps you to optimise your model and how to avoid over- or underfitting.

But you don’t need to be a math genius, because the Data Science work in most companies consists only of applying and optimising already developed (machine learning) models to their data. Data Scientists at FAANG companies or research facilities are mainly the ones developing completely new algorithms. In that case, of course, your math skills better be in good shape.

Similar to Computer Science, Data Science ranges over a broad spectrum, and it will continue to broaden in the future. I’d say there are some Data Science fields that require more mathematics skills and some that require less. We at talpasolutions deal entirely with numerical data from the engineering world, which requires a certain degree of mathematical understanding from all our developers.

Is there anything else you’d like to say?

As final words, I’d like to say thank you for giving me the opportunity to answer your questions here. I hope my answers got your blog audience intrigued and more eager than ever to learn more about Data Science. I also would like to thank my friend Sanyo for proofreading my answers and making sure that they are making crispy-clear sense.

Thanks for doing the interview, Jens!

The post PyDev of the Week: Jens Winkelmann appeared first on Mouse Vs Python.

Categories: FLOSS Project Planets

John Ludhi/nbshare.io: Strftime and Strptime In Python

Sun, 2021-03-07 20:40
Strftime and Strptime In Python

In this post, we will learn about strftime() and strptime() methods from Python datetime package.

Python Strftime Format

The strftime() method converts a date object to a date string.

The syntax of strftime() method is...

dateobject.strftime(format)

Where format is the desired format of the date string. The format is built using the codes shown in the table below...

Code  Meaning
%a    Weekday as Sun, Mon
%A    Weekday as full name as Sunday, Monday
%w    Weekday as decimal no as 0,1,2...
%d    Day of month as 01,02
%b    Months as Jan, Feb
%B    Months as January, February
%m    Months as 01,02
%y    Year without century as 11,12,13
%Y    Year with century 2011,2012
%H    24 Hours clock from 00 to 23
%I    12 Hours clock from 01 to 12
%p    AM, PM
%M    Minutes from 00 to 59
%S    Seconds from 00 to 59
%f    Microseconds 6 decimal numbers

Datetime To String Python using strftime()

Example: Convert current time to date string...

In [8]:
import datetime
from datetime import datetime

now = datetime.now()
print(now)

2021-03-07 23:24:11.192196

Let us convert the above datetime object to datetime string now.

In [2]: now.strftime("%Y-%m-%d %H:%M:%S")
Out[2]: '2021-03-07 23:16:41'

If you want to print month as locale’s abbreviated name, replace %m with %b as shown below...

In [3]: now.strftime("%Y-%b-%d %H:%M:%S")
Out[3]: '2021-Mar-07 23:16:41'

Another example...

In [4]: now.strftime("%Y/%b/%A %H:%M:%S")
Out[4]: '2021/Mar/Sunday 23:16:41'

Date To String Python using strftime()

Date to string is quite similar to datetime to string Python conversion.

Example: Convert current date object to Python date string.

In [5]:
today = datetime.today()
print(today)

2021-03-07 23:22:15.341074

Let us convert the above date object to Python date string using strftime().

In [6]: today.strftime("%Y-%m-%d %H:%M:%S")
Out[6]: '2021-03-07 23:22:15'

Python Strftime Milliseconds

To get a date string with milliseconds, use %f format code at the end as shown below...

In [7]:
today = datetime.today()
today.strftime("%Y-%m-%d %H:%M:%S.%f")
Out[7]: '2021-03-07 23:23:50.851344'

Python Strptime Format

Python's strptime() is used to convert a string to a datetime object.

strptime(date_string, format)

example:

strptime("9/23/20", "%d/%m/%y")

Note - format "%d/%m/%y" represents the the corresponding "9/23/20" format. The output of the above command will be a Python datetime object.

The format is constructed using pre-defined codes. There are many codes to choose from. The most important ones are listed below.

Code  Meaning
%a    Weekday as Sun, Mon
%A    Weekday as full name as Sunday, Monday
%w    Weekday as decimal no as 0,1,2...
%d    Day of month as 01,02
%b    Months as Jan, Feb
%B    Months as January, February
%m    Months as 01,02
%y    Year without century as 11,12,13
%Y    Year with century 2011,2012
%H    24 Hours clock from 00 to 23
%I    12 Hours clock from 01 to 12
%p    AM, PM
%M    Minutes from 00 to 59
%S    Seconds from 00 to 59
%f    Microseconds 6 decimal numbers

Python Datetime Strptime

Example: Convert date string to Python datetime object.

In [9]:
import datetime
datetime.datetime.strptime("09/23/2030 8:28", "%m/%d/%Y %H:%M")
Out[9]: datetime.datetime(2030, 9, 23, 8, 28)
Categories: FLOSS Project Planets

Codementor: How I learned Python

Sun, 2021-03-07 20:18
About me: I am a senior software developer with over 8 years of experience. I currently work at an IT company as a developer. I like picking up new technologies and challenges. I love programming and...
Categories: FLOSS Project Planets

Matthew Wright: How to remove a column from a DataFrame, with some extra detail

Sun, 2021-03-07 19:14

Removing one or more columns from a pandas DataFrame is a pretty common task, but it turns out there are a number of possible ways to perform this task. I found that this StackOverflow question, along with solutions and discussion in it raised a number of interesting topics. It is worth digging in a little bit to the … Continue reading How to remove a column from a DataFrame, with some extra detail

The post How to remove a column from a DataFrame, with some extra detail appeared first on wrighters.io.

Categories: FLOSS Project Planets

Python Pool: 5 Best Ways to Find Python String Length

Sun, 2021-03-07 08:34
What is Python String Length?

Python string length is found with the len() function. len() is an inbuilt function in Python that returns the length of a given string, array, list, tuple, dictionary, etc.

The len() function is also efficient: the number of elements stored in an object is kept with the object rather than recounted on every call, so len() simply returns it.

Syntax

len(string)

Parameters

String : This will calculate the length of the value passed in the string variable.

Return Value

It will return an integer value, i.e. the length of the given string.

Various Type Of Return Value

  1. String
  2. Empty
  3. Collection
  4. Type Error
  5. Dictionary
1. String:

It is used to return the number of characters present in the string, including punctuation, spaces, and all types of special characters. However, be careful when using len() on a variable that may be None.

2. Empty:

An empty string returns a length of zero characters; note that an empty string is not the same as None.

3. Collections:

The len() built-in function returns the number of elements in the collection.

4. Type Error:

The behaviour of len() always depends on the type of the variable passed to it. A NoneType object has no built-in support for len(), so passing None raises a TypeError.

5. Dictionary:

In a dictionary, each key-value pair is counted as one unit; keys and values are not counted independently.
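A minimal sketch of the last two cases (the dictionary contents are made up for illustration):

prices = {"apple": 3, "pear": 5}
print(len(prices))    # 2 - each key-value pair counts as one unit

try:
    len(None)         # NoneType has no len() support
except TypeError as e:
    print(e)          # object of type 'NoneType' has no len()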

Ways to find the length of string in Python

1. Using the built-in function len()

# Python code to demonstrate string length
# using len
str = 'Latracal'
print(len(str))

output:

8

Explanation:

In this code, we have taken str as the variable in which we have stored the string ‘Latracal’ and then applied len() to it. The output is 8, as the word ‘Latracal’ contains 8 characters.

2. Using for loop to Find the length of the string in python

A string can be iterated over easily and directly in a for loop. Maintaining a count of the number of iterations gives the length of the string.

# Python code to demonstrate string length
# using for loop

# Returns length of string
def findLength(str):
    counter = 0
    for i in str:
        counter += 1
    return counter

str = "Latracal"
print(findLength(str))

output:

8

Explanation:

In this code, we have used a for loop to find the length of the string. Firstly, we have taken an str variable to which we have assigned the string ‘Latracal’. Secondly, we have called the findLength function, in which a counter starts at 0; the for loop then iterates over the string, increasing the counter by 1 each time. At last, we have returned and printed the counter value.

3. Using while loop and Slicing

We slice the string, making it shorter by 1 with each iteration, until the string is empty. This is when the while loop stops. Maintaining a count of the number of iterations gives the length of the string.

# Python code to demonstrate string length
# using while loop.

# Returns length of string
def findLength(str):
    count = 0
    while str[count:]:
        count = count + 1
    return count

str = "LatracalSolutions"
print(findLength(str))

output:

17

Explanation:

In this code, we have used a while loop to find the length of the string. Firstly, we have taken an str variable to which we have assigned the string ‘LatracalSolutions’. Secondly, we have called the findLength function, in which we set the value of count to 0. Thirdly, we applied the while loop, slicing the value of str by one at each iteration until the string becomes empty. At last, we returned the count value.

4. Using string methods join and count

The join method of strings takes in an iterable and returns a string which is the concatenation of the iterable's strings. The separator placed between the elements is the original string on which the method is called. Joining with a separator and counting its occurrences in the result will also give the string's length.

# Python code to demonstrate string length
# using join and count

# Returns length of string
def findLength(str):
    if not str:
        return 0
    else:
        some_random_str = 'py'
        return ((some_random_str).join(str)).count(some_random_str) + 1

str = "LatracalSolutions"
print(findLength(str))

output:

17

Explanation:

In this code, we have used the join and count methods to find the length of the string. Firstly, we have taken an str variable to which we have assigned the string ‘LatracalSolutions’. Secondly, we have called the findLength function, which uses an if/else: if the string is empty, it returns 0; otherwise, the else part runs. We have taken a random separator string ‘py’, joined the main string’s characters with it, counted the occurrences of the separator, and added 1 to get the length. After that, the output gets printed.

5. Using getsizeof() method to Find Length Of String In Python

This method is used to find the object’s storage size that occupies some space in the memory.

import sys

s = "pythonpool"
print(sys.getsizeof(s) - sys.getsizeof(""))

Output:

10

Explanation:

Here, we have used the sys module, which is built into Python. We then take a string s and, by subtracting the size of an empty string from the size of s using the getsizeof() method, print the length of the string.

Example to Find Length of String in Python

# Python code to demonstrate string length
# testing len()
str1 = "Welcome to Latracal Solutions Python Tutorials"
print("The length of the string is :", len(str1))

Output:

The length of the string is : 46

Summary: Python String Length
  • Python len() is a built-in function. You can use the len() to find the length of the given string, array, list, tuple, dictionary, etc.
  • String: This will calculate the length of the value passed in the string variable.
  • Return value: It will return an integer value i.e. the length of the given string.

However, if you have any doubts or questions, do let me know in the comment section below. I will try to help you as soon as possible.

Happy Pythoning!

The post 5 Best Ways to Find Python String Length appeared first on Python Pool.

Categories: FLOSS Project Planets

John Ludhi/nbshare.io: Python Generators

Sat, 2021-03-06 20:40
Python Generators

Python generators are very powerful for handling operations which require large amounts of memory.

Let us start with a simple example. The function below yields an infinite sequence of numbers.

In [1]:
def generator_example1():
    count = 0
    while True:
        yield count
        count += 1

In [2]: g = generator_example1()

In [3]: next(g)
Out[3]: 0

In [4]: next(g)
Out[4]: 1

In [5]: next(g)
Out[5]: 2

and so on...

Python Yield

Ok let us revisit our function 'generator_example1()'. What is happening in the below code?

Inside the while loop, we have a 'yield' statement. Yield pauses the loop and gives control back to whoever called generator_example1(). In the statement 'g = generator_example1()', g is now a generator, as shown below.

In [6]:
def generator_example1():
    count = 0
    while True:
        yield count
        count += 1

In [7]: g = generator_example1()

In [8]: g
Out[8]: <generator object generator_example1 at 0x7f3334416e08>

Once you have a generator, you can iterate through it using the next() function. Since we have an infinite 'while' loop in the generator_example1() function, we can call the iterator as many times as we want. Each time we use next(), the generator resumes execution from its previous position and yields a new value.
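As a side note (not taken from the snippets above), a generator can also be consumed with a plain for loop, which calls next() for us and stops automatically at the end; a minimal sketch with a finite variant so the loop terminates:

def count_up_to(limit):
    count = 0
    while count < limit:
        yield count
        count += 1

for value in count_up_to(3):
    print(value)    # prints 0, 1, 2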

Python Generator Expression

Python generators can also be created outside a function, without the 'yield' keyword, by using a generator expression. Check out the example below.

In [9]: g = (x for x in range(10))

In [10]: g
Out[10]: <generator object <genexpr> at 0x7f3334416f68>

(x for x in range(10)) is a Python generator object. The syntax is quite similar to Python list comprehension except that instead of square brackets, generators are defined using round brackets. As usual, once we have generator object, we can call iterator next() on it to print the values as shown below.

In [11]: next(g)
Out[11]: 0

In [12]: next(g)
Out[12]: 1

Python Generator StopIteration

Python generators will throw 'StopIteration' exception, if there is no value to return for the iterator.

Let us look at following example.

In [13]:
def range_one():
    for x in range(0, 1):
        yield x

In [14]: g = range_one()

In [15]: next(g)
Out[15]: 0

In [16]: next(g)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-16-e734f8aca5ac> in <module>
----> 1 next(g)

StopIteration:

To avoid the above error, we can catch the exception like this and stop the iteration.

In [17]: g = range_one()

In [18]:
try:
    print(next(g))
except StopIteration:
    print('Iteration Stopped')

0

In [19]:
try:
    print(next(g))
except StopIteration:
    print('Iteration Stopped')

Iteration Stopped

Python Generator send()

We can pass a value to Python generators using the send() function.

In [20]:
def incrment_no():
    while True:
        x = yield
        yield x + 1

In [21]: g = incrment_no()  # Create our generator

In [22]: next(g)  # It will go to the first yield

In [23]: print(g.send(7))  # value 7 is sent to the generator, gets assigned to x, and the 2nd yield statement executes
8

Python Recursive Generator

Python generators can be used recursively. Check out the code below. In the function below, "yield from generator_factorial(n - 1)" is a recursive call to the function generator_factorial().

In [24]:
def generator_factorial(n):
    if n == 1:
        f = 1
    else:
        a = yield from generator_factorial(n - 1)
        f = n * a
    yield f
    return f

In [25]: g = generator_factorial(3)

In [26]: next(g)
Out[26]: 1

In [27]: next(g)
Out[27]: 2

In [28]: next(g)
Out[28]: 6

Python Generator throw() Error

Continuing with the above example, let us say we want the generator to throw an error for the factorial of a number greater than 100. We can add a generator.throw() exception as shown below.

In [29]:
n = 100
if n >= 100:
    g.throw(ValueError, 'Only numbers less than 100 are allowed')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-bf449f9fafac> in <module>
      1 n = 100
      2 if n >= 100:
----> 3     g.throw(ValueError, 'Only numbers less than 100 are allowed')

<ipython-input-24-e76bd978ab03> in generator_factorial(n)
      5         a = yield from generator_factorial(n - 1)
      6         f = n * a
----> 7     yield f
      8     return f

ValueError: Only numbers less than 100 are allowed

Python Generators Memory Efficient

Python generators use very little memory. Let us look at the following two examples. In the examples below, note the difference in the byte size of memory used by the 'Python list' vs the 'Python generator'.

In [30]: import sys

In [31]:
# Python List comprehension
sequence = [x for x in range(1, 1000000)]
sys.getsizeof(sequence)
Out[31]: 8697464

In [32]:
# Python Generators
sequence = (x for x in range(1, 1000000))
sys.getsizeof(sequence)
Out[32]: 88

Python Generator Performance

One thing to notice here is that Python generators are slower than Python list comprehensions when there is enough memory to hold the data. Let us look at the two examples below from a performance perspective.

In [33]:
# Python List comprehension
import cProfile
cProfile.run('sum([x for x in range(1,10000000)])')

         5 function calls in 0.455 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.327    0.327    0.327    0.327 <string>:1(<listcomp>)
        1    0.073    0.073    0.455    0.455 <string>:1(<module>)
        1    0.000    0.000    0.455    0.455 {built-in method builtins.exec}
        1    0.054    0.054    0.054    0.054 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

In [34]:
# generators
import cProfile
cProfile.run('sum((x for x in range(1,10000000)))')

         10000004 function calls in 1.277 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 10000000    0.655    0.000    0.655    0.000 <string>:1(<genexpr>)
        1    0.000    0.000    1.277    1.277 <string>:1(<module>)
        1    0.000    0.000    1.277    1.277 {built-in method builtins.exec}
        1    0.622    0.622    1.277    1.277 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Check the number of function calls and the time the 'Python generator' took to compute the sum compared to the Python 'list comprehension'.

Data Pipeline with Python Generator

Let us wrap up this tutorial with data pipelines. Python generators are great for building pipelines.

Let us open a CSV file and iterate through it using Python generator.

In [41]:
def generator_read_csv_file():
    for entry in open('stock.csv'):
        yield entry

In [42]: g = generator_read_csv_file()

In [43]: next(g)
Out[43]: 'Date,Open,High,Low,Close,Adj Close,Volume\n'

In [44]: next(g)
Out[44]: '1996-08-09,14.250000,16.750000,14.250000,16.500000,15.324463,1601500\n'

Let us say we want to replace the commas in each line of the CSV with spaces; we can build a pipeline for this.

In [45]: g1 = (entry for entry in open('stock.csv'))

In [46]: g2 = (row.replace(",", " ") for row in g1)

In [47]: next(g2)
Out[47]: 'Date Open High Low Close Adj Close Volume\n'

In [48]: next(g2)
Out[48]: '1996-08-09 14.250000 16.750000 14.250000 16.500000 15.324463 1601500\n'

In [50]: next(g2)
Out[50]: '1996-08-12 16.500000 16.750000 16.375000 16.500000 15.324463 260900\n'

Wrap Up:

It takes a little practice to get the hang of Python generators, but once mastered, they are very useful not only for building data pipelines but also for handling large data operations such as reading a large file.
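For instance, here is a minimal sketch of lazily reading a large file line by line (the file name and the filter condition are made up for illustration):

def read_lines(path):
    # Yield one line at a time instead of loading the whole file into memory.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

# Hypothetical usage: count error lines in a large log without reading it all at once.
errors = sum(1 for line in read_lines("big_app.log") if "ERROR" in line)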

Categories: FLOSS Project Planets

Zero-with-Dot (Oleg Żero): Why using SQL before using Pandas?

Sat, 2021-03-06 18:00
Introduction

Data analysis is one of the most essential steps in any data-related project. Regardless of the context (e.g. business, machine learning, physics, etc.), there are many ways to get it right… or wrong. After all, decisions often depend on actual findings, and at the same time, nobody can tell you what to find before you have found it.

For these reasons, it is important to try to keep the process as smooth as possible. On one hand, we want to get into the essence quickly. On the other, we do not want to complicate the code. If cleaning the code takes longer than cleaning data, you know something is not right.

In this article, we focus on fetching data. More precisely, we show how to get the same results with both Python’s key analytics library, namely Pandas, and SQL. Using an example dataset (see later), we describe some common patterns related to preparation and data analysis. Then we explain how to get the same results with either of them and discuss which one may be preferred. So, irrespective of whether you know one way but not the other, or you feel familiar with both, we invite you to read this article.

The example dataset

“NIPS Papers” from Kaggle will serve as our example dataset. We have purposely chosen this dataset for the following reasons:

  • It is provided as an SQLite file, thus we can simulate a common scenario, where data is obtained from a relational database, e.g. a data warehouse.
  • It contains more than one table, so we can also show how to combine the data, which is a frequently encountered problem.
  • It is imperfect: there will be nulls, NaNs, and some other common issues… life ;)
  • It is simple to understand, contains only two data tables (“authors” and “papers”), plus the third to establish a many-to-many relationship between them. Thanks to that, we can focus on methodology rather than the actual content.
  • Finally, the dataset comes also as a CSV file.

You are welcome to explore it yourself. However, as we focus on the methodology, we limit to discuss the content to a bare minimum. Let’s dive in.

Fetching data

The way you pick the data depends on its format and the way it is stored. If the format is fixed, we get very little choice on how we pick it up. However, if the data sits in some database, we have more options.

The simplest and perhaps the most naive way is to fetch it table after table and store them locally as CSV files. This is not the best approach for two main reasons:

The total data volume is unnecessarily large. You have to deal with excess data that is not information (indices, auxiliary columns, etc.).
  • Depending on the data, notebooks will contain substantial cleaning and processing code overhead, obscuring the actual analytics.

You don’t want any of these. However, if you have no clue about the data, fetching everything is the safest option.

Let’s take a look at what tables and columns the “NIPS” has.

table authors

import pandas as pd

df_authors = pd.read_csv("data/authors.csv")
df_authors.info()

# returns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9784 entries, 0 to 9783
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      9784 non-null   int64
 1   name    9784 non-null   object
dtypes: int64(1), object(1)
memory usage: 153.0+ KB

table papers

df_papers = pd.read_csv("data/papers.csv")
df_papers.info()

# returns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7241 entries, 0 to 7240
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   id          7241 non-null   int64
 1   year        7241 non-null   int64
 2   title       7241 non-null   object
 3   event_type  2422 non-null   object
 4   pdf_name    7241 non-null   object
 5   abstract    7241 non-null   object
 6   paper_text  7241 non-null   object
dtypes: int64(2), object(5)
memory usage: 396.1+ KB

table paper_authors

As mentioned earlier, this table links the former two, using author_id and paper_id foreign keys. In addition, it has its own primary key id.

Figure 1. Displaying the top 5 rows from the `papers` table.

As we can see from the image (and also when digging into the analytics deeper), the pdf_name column is more or less redundant, given the title column. Furthermore, by calling df_papers["event_type"].unique(), we know there are four distinct values for this column: 'Oral', 'Spotlight', 'Poster' or NaN (which signifies a publication was indeed a paper).

Let’s say, we would like to filter away pdf_name together with any entry that represents any publication that is other than a usual paper. The code to do it in Pandas looks like this:

df = df_papers[~df_papers["event_type"] \
        .isin(["Oral", "Spotlight", "Poster"])] \
        [["year", "title", "abstract", "paper_text"]]

The line is composed of three parts. First, we pass df_papers["event_type"].isin(...), which is a condition giving us a binary mask, then we pass it on to df_papers[...] essentially filtering the rows. Finally, we attach a list of columns ["year", "title", "abstract", "paper_text"] to what is left (again using [...]) thus indicating the columns we want to preserve. Alternatively, we may also use .drop(columns=[...]) to indicate the unwanted columns.
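A minimal sketch of that drop-based alternative (assuming id, event_type and pdf_name are the only unwanted columns here):

# Keep rows whose event_type is not Oral/Spotlight/Poster, then drop the unwanted columns.
mask = ~df_papers["event_type"].isin(["Oral", "Spotlight", "Poster"])
df = df_papers[mask].drop(columns=["id", "event_type", "pdf_name"])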

A more elegant way to achieve the same result is to use Pandas’ .query method instead of using a binary mask.

df = df_papers \
    .query("event_type not in ('Oral', 'Spotlight', 'Poster')") \
    .drop(columns=["id", "event_type", "pdf_name"])

The code looks a bit cleaner, and a nice thing about .query is the fact that we can use the @ sign to refer to another Python object, for example .query("column_a > @ass and column_b not in @bees"). On the flip side, this method is a bit slower, so you may want to stick to the binary mask when you have to repeat the operation many times.
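To illustrate the @ syntax, here is a small made-up sketch; the threshold and the excluded list are illustrative values, not part of the analysis:

year_threshold = 2000
excluded_types = ["Oral", "Spotlight"]

recent = df_papers.query(
    "year > @year_threshold and event_type not in @excluded_types"
)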

Using SQL for getting data

Pandas gets the job done. However, we do have databases for a reason. They are optimized to search through tables efficiently and deliver data as needed.

Coming back to our problem, all we have achieved here is simple filtering of columns and rows. Let’s delegate the task to the database itself, and use Pandas to fetch the prepared table.

Pandas provides three functions that can help us: pd.read_sql_table, pd.read_sql_query, and pd.read_sql, which accepts either a query or a table name. For SQLite, pd.read_sql_table is not supported. This is not a problem, as we are interested in querying the data at the database level anyway.

import sqlite3  # or sqlalchemy.create_engine for e.g. Postgres

con = sqlite3.connect(DB_FILEPATH)

query = """
    select
        year,
        title,
        abstract,
        paper_text
    from papers
    where trim(event_type) != ''
"""

df_papers_sql = pd.read_sql(query, con=con)

Let’s break it down.

First, we need to connect to the database. For SQLite this is easy, as we only provide a path to the database file. For other databases there are dedicated libraries (e.g. psycopg2 for Postgres, or the more generic sqlalchemy). The point is to create a database connection object that points Pandas in the right direction and sorts out the authentication.

Once that is settled, the only thing left is constructing the right SQL query. SQL filters columns through the select statement, and rows with the where clause. Here we use the trim function to strip the entries of surrounding spaces, leaving us everything but an empty string to pick up. The reason we use trim is specific to the content of this dataset, but in general where is the place to put a condition.

With read_sql, the data is automatically DataFrame‘ed with all the rows and columns prefiltered as described.

Nice, isn’t it?

Let’s move further…

Joining, merging, collecting, combining…

Oftentimes data is stored across several tables. In these cases, stitching a dataset becomes an additional step that precedes the analytics.

Here, the relationship is rather simple: there is a many-to-many relationship between authors and papers, and the two tables are linked through the third, namely paper_authors. Let’s take a look at how Pandas handles the case. For the sake of argument, let’s assume we want to find the most “productive” authors in terms of papers published.

df_authors \
    .merge(
        df_paper_authors.drop(columns=["id"]),
        left_on="id",
        right_on="author_id",
        how="inner") \
    .drop(columns=["id"]) \
    .merge(
        df_papers
            .query("event_type not in ('Oral', 'Spotlight', 'Poster')")
            [["id", "year", "title", "abstract", "paper_text"]],
        left_on="paper_id",
        right_on="id",
        how="left") \
    .drop(columns=["id", "paper_id", "author_id"]) \
    .sort_values(by=["name", "year"], ascending=True)

… and for Python, this is just a single statement, split here across several lines for clarity.

We start with the table authors and want to assign papers. Pandas offers three functions for “combining” data.

  • pd.concat - concatenates tables by rows or columns, similar to SQL's union.
  • .join (a DataFrame method) - joins two tables just like SQL's join, but requires the join key to be the actual DataFrame index.
  • .merge (also available as pd.merge) - the same, but more flexible: it can join on both indices and columns (see the short sketch after this list).
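A toy comparison of the three, using two made-up frames rather than the NIPS tables:

left = pd.DataFrame({"key": [1, 2], "a": ["x", "y"]})
right = pd.DataFrame({"key": [2, 3], "b": ["u", "v"]})

stacked = pd.concat([left, right])                           # stacks rows, union-style
joined = left.set_index("key").join(right.set_index("key"))  # joins on the index
merged = left.merge(right, on="key", how="inner")            # joins on an arbitrary column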

To “get” to papers, we first need to inner-join the paper_authors table. However, both tables have an id column, so to avoid a conflict (or automatic prefixing) we remove paper_authors.id before joining. Then, we join on authors.id == paper_authors.author_id, after which we also drop id from the resulting table. Having access to paper_id, we perform a join again. This time it is a left-join, as we don’t want to eliminate “paperless” authors. We also take the opportunity to filter df_papers as described earlier. However, it is essential to keep papers.id, or else Pandas will refuse to join the tables. Finally, we drop all the key columns: id, paper_id, and author_id, as they don’t carry any information, and sort the records for convenience.

Using SQL for combining

Now, the same effect using SQL.

query = """
    select
        a.name,
        p.year,
        p.title,
        p.abstract,
        p.paper_text
    from authors a
    inner join paper_authors pa on pa.author_id = a.id
    left join papers p on p.id = pa.paper_id
        and p.event_type not in ('Oral', 'Spotlight', 'Poster')
    order by name, year asc
"""
pd.read_sql(query, con=con)

Here, we build the query “outwards” from line 8, subsequently joining the other tables, with the second join being trimmed by the condition in line 11. The rest is just ordering, with a, pa, and p used as table aliases.

The effect is the same, but with SQL, we avoid having to manage indices, which has nothing to do with analytics.

Data cleaning

Let’s take a look at the resulting dataset.

Figure 2. The top five rows of the combined table.

The newly created table contains missing values and encoding problems. Here, we skip fixing the encoding as this problem is specific to the data content. However, missing values are a very common issue. Pandas offers, among others, .fillna(...) and .dropna(...), and depending on the conventions, we may fill NaNs with different values.
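As a hedged sketch of the two options (the per-column defaults below are illustrative and simply mirror the ones used in the SQL query later):

# Option 1: fill missing values with per-column defaults
df_filled = df.fillna({
    "year": 0,
    "title": "untitled",
    "abstract": "Abstract Missing",
    "paper_text": "",
})

# Option 2: drop the rows that miss the essential fields
df_complete = df.dropna(subset=["title", "paper_text"])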

Using SQL for data cleaning

Databases also have their own way to deal with the issue. Here, the equivalents of fillna and dropna are coalesce and is not null, respectively.
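For the dropna-style variant, a possible sketch (the variable names are mine) simply adds a where clause to the join we built earlier:

query_drop_missing = """
    select a.name, p.year, p.title, p.abstract, p.paper_text
    from authors a
    join paper_authors pa on pa.author_id = a.id
    left join papers p on p.id = pa.paper_id
    where p.paper_text is not null
"""

df_no_missing = pd.read_sql(query_drop_missing, con=con)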

Using coalesce, our query cures the dataset, injecting a default value wherever one is missing.

"""
select
    a.name,
    coalesce(p.year, 0) as year,
    coalesce(p.title, 'untitled') as title,
    coalesce(p.abstract, 'Abstract Missing') as abstract,
    coalesce(p.paper_text, '') as text
from authors a
join paper_authors pa on pa.author_id = a.id
left join papers p
    on p.id = pa.paper_id
    and p.event_type not in ('Oral', 'Spotlight', 'Poster')
order by name, year asc
"""

Aggregations

Our dataset is prepared, “healed”, and fetched using SQL. Now, let's say we would like to rank the authors by the number of papers they publish each year. In addition, we would like to calculate the total word count that every author “produced” each year.

Again, this is another standard data transformation problem. Let's examine how Pandas handles it. The starting point is the joined and cleaned table.

df["paper_length"] = df["paper_text"].str.count(" ")

df[["name", "year", "title", "paper_length"]] \
    .groupby(by=["name", "year"]) \
    .aggregate({"title": "count", "paper_length": "sum"}) \
    .reset_index() \
    .rename(columns={"title": "n_papers", "paper_length": "n_words"}) \
    .query("n_words > 0") \
    .sort_values(by=["n_papers"], ascending=False)

We estimate each article's length by counting spaces. Although it is naive to assume that every word in a paper is separated by exactly one space, it does give us a reasonable estimate. Line 1 does that via the .str attribute, introducing a new column at the same time.
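If we wanted a slightly more robust estimate, we could count whitespace-separated tokens instead; this is an alternative sketch, not what the code above does:

# counts runs of whitespace rather than single spaces
df["paper_length"] = df["paper_text"].str.split().str.len()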

Later on, we formulate a new table by applying a sequence of operations:

  1. We narrow down the table only to the columns of interest.
  2. We aggregate the table using both name and year columns.
  3. As we apply different aggregation functions to the two remaining columns, we use the .aggregate method that accepts a dictionary with instructions.
  4. The aggregation results in a two-level index. Line 6 restores name and year to ordinary columns.
  5. The remaining columns' names stay the same, but they no longer reflect the meaning of the numbers. We change that in line 7.
  6. For formerly missing values, it is safe to assume a word count equal to zero. To build the ranking, we eliminate them using .query.
  7. Finally, we sort the table for convenience.

The aggregated table is presented in figure 3.

Figure 3. The ranking is an aggregated table.

Using SQL to aggregate

Now, once again, let’s achieve the same result using SQL.

"""
select
    d.name,
    d.year,
    count(title) as n_papers,
    sum(paper_length) as n_words
from (
    select
        a.name,
        coalesce(p.year, 0) as year,
        coalesce(p.title, 'untitled') as title,
        length(coalesce(p.paper_text, ''))
            - length(replace(coalesce(p.paper_text, ''), ' ', '')) as paper_length
    from authors a
    join paper_authors pa on pa.author_id = a.id
    left join papers p
        on p.id = pa.paper_id
        and p.event_type not in ('Oral', 'Spotlight', 'Poster')
) as d
group by name, year
having n_words > 0
order by n_papers desc
"""

This query may appear to be more hassle than the Pandas code, but it isn't. We combine all of our earlier work plus the aggregation in a single step: thanks to the subquery and the SQL functions, the whole dataset preparation happens inside the database, so in our particular example we get the final table before we even start the analysis.

The query contains a subquery (lines 8-18), where apart from removing abstract and introducing the paper_length column, almost everything stays the same. SQLite does not have an equivalent of str.count(), so we work around it by comparing the length of the text with the length of the text with all spaces removed, using length and replace. Later, in line 19, we assign d as an alias for the subquery table, so we can reference it.

Next, the group by clause in combination with count and sum does what we did using Pandas' .aggregate method. Here, we also apply the condition from line 21 using having. The having clause works like where, except that it filters on aggregated values, whereas where filters the rows of a table before the aggregation takes place.

Again, the resulting tables are exactly the same.

Conclusion

Pandas and SQL may look similar, but their nature is very different. Python is an object-oriented language and Pandas stores data as table-like objects. In addition, it offers a wide variety of methods to transform them in any way possible, which makes it an excellent tool for data analysis.

On the opposite side, Pandas' methods for formulating a dataset are just a different “incarnation” of what SQL is all about. SQL is a declarative language that is naturally tailored to fetch, transform, and prepare a dataset. If data resides in a relational database, letting the database engine perform these steps is a better choice. Not only are these engines optimized to do exactly that, but letting the database prepare a clean and convenient dataset also facilitates the analysis process.

The disadvantage of SQL is that it may be harder to read, and harder to figure out what data to throw away and what to keep before the dataset is created. Pandas, running in Python, lets us assign fractions of the dataset to variables, inspect them, and then make further decisions.

Still, these temporary variables often end up cluttering the workspace. Therefore, unless you are still in doubt about the data, there are strong reasons to use SQL.

And how do you analyze your data? ;)

Categories: FLOSS Project Planets

The Digital Cat: Delegation: composition and inheritance in object-oriented programming

Sat, 2021-03-06 13:00
Introduction

Object-oriented programming (OOP) is a methodology that was introduced in the 60s, though as for many other concepts related to programming languages it is difficult to give a proper date. While recent years have witnessed a second youth of functional languages, object-oriented is still a widespread paradigm among successful programming languages, and for good reasons. OOP is not the panacea for all the architectural problems in software development, but if used correctly can give a solid foundation to any system.

It might sound obvious, but if you use an object-oriented language or a language with strong OOP traits, you have to learn this paradigm well. Being very active in the Python community, I see how many times young programmers are introduced to the language, the main features, and the most important libraries and frameworks, without a proper and detailed description of OOP and how OOP is implemented in the language.

The implementation part is particularly important, as OOP is a set of concepts and features that are expressed theoretically and then implemented in the language, with specific traits or choices. It is very important, then, to keep in mind that the concepts behind OOP are generally shared among OOP languages, but are not tenets, and are subject to interpretation.

What is the core of OOP? Many books and tutorials mention the three pillars encapsulation, delegation, and polymorphism, but I believe these are traits of a more central concept, which is the collaboration of entities. In a well-designed OO system, we can observe a set of actors that send messages to each other to keep the system alive, responsive, and consistent.

These actors have a state, the data, and give access to it through an interface: this is encapsulation. Each actor can use functionalities implemented by another actor sending a message (calling a method) and when the relationship between the two is stable we have delegation. As communication happens through messages, actors are not concerned with the nature of the recipients, only with their interface, and this is polymorphism.

Alan Kay, in his "The Early History of Smalltalk", says

In computer terms, Smalltalk is a recursion on the notion of computer itself. Instead of dividing "computer stuff" into things each less strong than the whole — like data structures, procedures, and functions which are the usual paraphernalia of programming languages — each Smalltalk object is a recursion on the entire possibilities of the computer. Thus its semantics are a bit like having thousands and thousands of computers all hooked together by a very fast network.

I find this extremely enlightening, as it reveals the idea behind the three pillars, and the reason why we do or don't do certain things in OOP, why we consider good to provide some automatic behaviours or to forbid specific solutions.

By the way, if you replace the word "object" with "microservice" in the quote above, you might be surprised by the description of a very modern architecture for cloud-based systems. Once again, concepts in computer science are like fractals, they are self-similar and pop up in unexpected places.

In this post, I want to focus on the second of the pillars of object-oriented programming: delegation. I will discuss its nature and the main two strategies we can follow to implement it: composition and inheritance. I will provide examples in Python and show how the powerful OOP implementation of this language opens the door to interesting atypical solutions.

For the rest of this post, I will consider objects as mini computers and the system in which they live a "very fast network", using the words of Alan Kay. Data contained in an object is the state of the computer, its methods are the input/output devices, and calling methods is the same thing as sending a message to another computer through the network.

Delegation in OOP

Delegation is the mechanism through which an actor assigns a task or part of a task to another actor. This is not new in computer science, as any program can be split into blocks and each block generally depends on the previous ones. Furthermore, code can be isolated in libraries and reused in different parts of a program, implementing this "task assignment". In an OO system the assignee is not just the code of a function, but a full-fledged object, another actor.

The main concept to retain here is that the reason behind delegation is code reuse. We want to avoid code repetition, as it is often the source of regressions; fixing a bug in one of the repetitions doesn't automatically fix it in all of them, so keeping one single version of each algorithm is paramount to ensure the consistency of a system. Delegation helps us to keep our actors small and specialised, which makes the whole architecture more flexible and easier to maintain (if properly implemented). Changing a very big subsystem to satisfy a new requirement might affect other parts of the system in bad ways, so the smaller the subsystems the better (up to a certain point, where we incur the opposite problem, but this shall be discussed in another post).

There is a dichotomy in delegation, as it can be implemented following two different strategies, which are orthogonal from many points of view, and I believe that one of the main problems that object-oriented systems have lies in the use of the wrong strategy, in particular the overuse of inheritance. When we create a system using an object-oriented language we need to keep in mind this dichotomy at every step of the design.

There are four areas or points of views that I want to introduce to help you to visualise delegation between actors: visibility, control, relationship, and entities. As I said previously, while these concepts apply to systems at every scale, and in particular to every object-oriented language, I will provide examples in Python.

Visibility: state sharing

The first way to look at delegation is through the lenses of state sharing. As I said before the data contained in an object can be seen as its state, and if hearing this you think about components in a frontend framework or state machines you are on the right path. The state of a computer, its memory or the data on the mass storage, can usually be freely accessed by internal systems, while the access is mediated for external ones. Indeed, the level of access to the state is probably one of the best ways to define internal and external systems in a software or hardware architecture.

When using inheritance, the child class shares its whole state with the parent class. Let's have a look at a simple example

class Parent:
    def __init__(self, value):
        self._value = value  # 3

    def describe(self):  # 1
        print(f"Parent: value is {self._value}")


class Child(Parent):
    pass


>>> cld = Child(5)
>>> print(cld._value)
5
>>> cld.describe()  # 2
Parent: value is 5

As you can see, describe is defined in Parent 1, so when the instance cld calls it 2, its class Child delegates the call to the class Parent. This, in turn, uses _value as if it was defined locally 3, while it is defined in cld. This works because, from the point of view of the state, Parent has complete access to the state of Child. Please note that the state is not even enclosed in a name space, as the state of the child class becomes the state of the parent class.

Composition, on the other side, keeps the state completely private and makes the delegated object see only what is explicitly shared through message passing. A simple example of this is

class Logger:
    def log(self, value):
        print(f"Logger: value is {value}")


class Process:
    def __init__(self, value):
        self._value = value  # 1
        self.logger = Logger()

    def info(self):
        self.logger.log(self._value)  # 2


>>> prc = Process(5)
>>> print(prc._value)
5
>>> prc.info()
Logger: value is 5

Here, instances of Process have an attribute _value 1 that is shared with the class Logger only when it comes to calling Logger.log 2 inside their info method. Logger objects have no visibility of the state of Process objects unless it is explicitly shared.

Note for advanced readers: I'm clearly mixing the concepts of instance and class here, and blatantly ignoring the resulting inconsistencies. The state of an instance is not the same thing as the state of a class, and it should also be mentioned that classes are themselves instances of metaclasses, at least in Python. What I want to point out here is that access to attributes is granted automatically to inherited classes because of the way __getattribute__ and bound methods work, while in composition such mechanisms are not present and the effect is that the state is not shared.

Control: implicit and explicit delegation

Another way to look at the dichotomy between inheritance and composition is that of the control we have over the process. Inheritance is usually provided by the language itself and is implemented according to some rules that are part of the definition of the language itself. This makes inheritance an implicit mechanism: when you make a class inherit from another one, there is an automatic and implicit process that rules the delegation between the two, which makes it run outside our control.

Let's see an example of this in action using inheritance

class Window:
    def __init__(self, title, size_x, size_y):
        self._title = title
        self._size_x = size_x
        self._size_y = size_y

    def resize(self, new_size_x, new_size_y):
        self._size_x = new_size_x
        self._size_y = new_size_y
        self.info()

    def info(self):  # 2
        print(f"Window '{self._title}' is {self._size_x}x{self._size_y}")


class TransparentWindow(Window):
    def __init__(self, title, size_x, size_y, transparency=50):
        self._title = title
        self._size_x = size_x
        self._size_y = size_y
        self._transparency = transparency

    def change_transparency(self, new_transparency):
        self._transparency = new_transparency

    def info(self):  # 1
        super().info()  # 3
        print(f"Transparency is set to {self._transparency}")

At this point we can instantiate and use TransparentWindow

>>> twin = TransparentWindow("Terminal", 640, 480, 80)
>>> twin.info()
Window 'Terminal' is 640x480
Transparency is set to 80
>>> twin.change_transparency(70)
>>> twin.resize(800, 600)
Window 'Terminal' is 800x600
Transparency is set to 70

When we call twin.info, Python is running TransparentWindow's implementation of that method 1 and is not automatically delegating anything to Window even though the latter has a method with that name 2. Indeed, we have to explicitly call it through super when we want to reuse it 3. When we use resize, though, the implicit delegation kicks in and we end up with the execution of Window.resize. Please note that this delegation doesn't propagate to the next calls. When Window.resize calls self.info this runs TransparentWindow.info, as the original call was made from that class.

Composition is on the other end of the spectrum, as any delegation performed through composed objects has to be explicit. Let's see an example

class Body:
    def __init__(self, text):
        self._text = text

    def info(self):
        return {
            "length": len(self._text)
        }


class Page:
    def __init__(self, title, text):
        self._title = title
        self._body = Body(text)

    def info(self):
        return {
            "title": self._title,
            "body": self._body.info()  # 1
        }

When we instantiate a Page and call info everything works

>>> page = Page("New post", "Some text for an exciting new post")
>>> page.info()
{'title': 'New post', 'body': {'length': 34}}

but as you can see, Page.info has to explicitly mention Body.info through self._body 1, as we had to do when using inheritance with super. Composition is not different from inheritance when methods are overridden, at least in Python.

Relationship: to be vs to have

The third point of view from which you can look at delegation is that of the nature of the relationship between actors. Inheritance gives the child class the same nature as the parent class, with specialised behaviour. We can say that a child class implements new features or changes the behaviour of existing ones, but generally speaking, we agree that it is like the parent class. Think about a gaming laptop: it is a laptop, only with specialised features that enable it to perform well in certain situations. On the other end, composition deals with actors that are usually made of other actors of a different nature. A simple example is that of the computer itself, which has a CPU, has a mass storage, has memory. We can't say that the computer is the CPU, because that is reductive.

This difference in the nature of the relationship between actors in a delegation is directly mapped into inheritance and composition. When using inheritance, we implement the verb to be

class Car:
    def __init__(self, colour, max_speed):
        self._colour = colour
        self._speed = 0
        self._max_speed = max_speed

    def accelerate(self, speed):
        self._speed = min(speed, self._max_speed)


class SportsCar(Car):
    def accelerate(self, speed):
        self._speed = speed

Here, SportsCar is a Car, it can be initialised in the same way and has the same methods, though it can accelerate much more (wow, that might be a fun ride). Since the relationship between the two actors is best described by to be it is natural to use inheritance.

Composition, on the other hand, implements the verb to have and describes an object that is "physically" made of other objects

class Employee:
    def __init__(self, name):
        self._name = name


class Company:
    def __init__(self, ceo_name, cto_name):
        self._ceo = Employee(ceo_name)
        self._cto = Employee(cto_name)

We can say that a company is the sum of its employees (plus other things), and we easily recognise that the two classes Employee and Company have a very different nature. They don't have the same interface, and if they have methods with the same name it is just by chance and not because they serve the same purpose.

Entities: classes or instances

The last point of view that I want to explore is that of the entities involved in the delegation. When we discuss a theoretical delegation, for example saying "This Boeing 747 is a plane, thus it flies" we are describing a delegation between abstract, immaterial objects, namely generic "planes" and generic "flying objects".

class FlyingObject:
    pass


class Plane(FlyingObject):
    pass


>>> boeing747 = Plane()

Since Plane and FlyingObject share the same underlying nature, their relationship is valid for all objects of that type and it is thus established between classes, which are ideas that become concrete when instantiated.

When we use composition, instead, we are putting into play a delegation that is not valid for all objects of that type, but only for those that we connected. For example, we can separate gears from the rest of a bicycle, and it is only when we put together that specific set of gears and that bicycle that the delegation happens. So, while we can think theoretically at bicycles and gears, the actual delegation happens only when dealing with concrete objects.

class Gears:
    def __init__(self):
        self.current = 1

    def up(self):
        self.current = min(self.current + 1, 8)

    def down(self):
        self.current = max(self.current - 1, 0)


class Bicycle:
    def __init__(self):
        self.gears = Gears()  # 1

    def gear_up(self):
        self.gears.up()  # 2

    def gear_down(self):
        self.gears.down()  # 3


>>> bicycle = Bicycle()

As you can see here, an instance of Bicycle contains an instance of Gears 1 and this allows us to create a delegation in the methods gear_up 2 and gear_down 3. The delegation, however, happens between bicycle and bicycle.gears which are instances.

It is also possible, at least in Python, to have composition using pure classes, which is useful when the class is a pure helper or a simple container of methods (I'm not going to discuss here the benefits or the disadvantages of such a solution)

class Gears:
    @classmethod
    def up(cls, current):
        return min(current + 1, 8)

    @classmethod
    def down(cls, current):
        return max(current - 1, 0)


class Bicycle:
    def __init__(self):
        self.gears = Gears
        self.current_gear = 1

    def gear_up(self):
        self.current_gear = self.gears.up(self.current_gear)

    def gear_down(self):
        self.current_gear = self.gears.down(self.current_gear)


>>> bicycle = Bicycle()

Now, when we run bicycle.gear_up the delegation happens between bicycle, an instance, and Gears, a class. We might extend this further and have a class whose class methods call class methods of another class, but I won't give an example of this because it sounds a bit convoluted and probably not very reasonable to do. But it can be done.

So, we might devise a pattern here and say that in composition there is no rule that states the nature of the entities involved in the delegation, but that most of the time this happens between instances.

Note for advanced readers: in Python, classes are instances of a metaclass, usually type, and type is an instance of itself, so it is correct to say that composition happens always between instances.

Bad signs

Now that we looked at the two delegations strategies from different points of view, it's time to discuss what happens when you use the wrong one. You might have heard of the "composition over inheritance" mantra, which comes from the fact that inheritance is often overused. This wasn't and is not helped by the fact that OOP is presented as encapsulation, inheritance, and polymorphism; open a random OOP post or book and you will see this with your own eyes.

Please, bloggers, authors, mentors, teachers, and overall programmers: stop considering inheritance the only delegation system in OOP.

That said, I think we should avoid going from one extreme to the opposite, and in general learn to use the tools languages give us. So, let's learn how to recognise the "smell" of bad code!

You are incorrectly using inheritance when:

  • There is a clash between attributes with the same name and different meanings. In this case, you are incorrectly sharing the state of a parent class with the child one (visibility). With composition the state of another object is namespaced and it's always clear which attribute you are dealing with.
  • You feel the need to remove methods from the child class. This is typically a sign that you are polluting the class interface (relationship) with the content of the parent class. Using composition makes it easy to expose only the methods that you want to delegate.

You are incorrectly using composition when:

  • You have to map too many methods from the container class to the contained one, to expose them. The two objects might benefit from the automatic delegation mechanism (control) provided by inheritance, with the child class overriding the methods that should behave differently.
  • You are composing instances, but creating many class methods so that the container can access them. This means that the nature of the delegation is related more to the classes than to the specific instances, and the objects might benefit from inheritance, where the classes delegate the method calls, instead of relying on the relationship between instances.

Overall, code smells for inheritance are the need to override or delete attributes and methods, changes in one class affecting too many other classes in the inheritance tree, big classes that contain heavily unrelated methods. For composition: too many methods that just wrap methods of the contained instances, the need to pass too many arguments to methods, classes that are too empty and that just contain one instance of another class.

Domain modelling

We all know that there are few cases (in computer science as well as in life) where we can draw a clear line between two options and that most of the time the separation is blurry. There are many grey shades between black and white.

The same applies to composition and inheritance. While the nature of the relationship often can guide us to the best solution, we are not always dealing with the representation of real objects, and even when we do we always have to keep in mind that we are modelling them, not implementing them perfectly.

As a colleague of mine told me once, we have to represent reality with our code, but we have to avoid representing it too faithfully, to avoid bringing reality's limitations into our programs.

I believe this is very true, so I think that when it comes to choosing between composition and inheritance we need to be guided by the nature of the relationship in our system. In this, object-oriented programming and database design are very similar. When you design a database you have to think about the domain and the way you extract information, not (only) about the real-world objects that you are modelling.

Let's consider a quick example, bearing in mind that I'm only scratching the surface of something about which people write entire books. Let's pretend we are designing a web application that manages companies and their owners, and we started with the consideration that an Owner, well, owns the Company. This is a clear composition relationship.

class Company:
    def __init__(self, name):
        self.name = name


class Owner:
    def __init__(self, first_name, last_name, company_name):
        self.first_name = first_name
        self.last_name = last_name
        self.company = Company(company_name)


>>> owner1 = Owner("John", "Doe", "Pear")

Unfortunately, this automatically limits the number of companies owned by an Owner to one. If we want to relax that requirement, the best way to do it is to reverse the composition, and make the Company contain the Owner.

class Owner:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name


class Company:
    def __init__(self, name, owner_first_name, owner_last_name):
        self.name = name
        self.owner = Owner(owner_first_name, owner_last_name)


>>> company1 = Company("Pear", "John", "Doe")
>>> company2 = Company("Pulses", "John", "Doe")

As you can see this is in direct contrast with the initial modelling that comes from our perception of the relationship between the two in the real world, which in turn comes from the specific word "owner" that I used. If I used a different word like "president" or "CEO", you would immediately accept the second solution as more natural, as the "president" is one of many employees.

The code above is not satisfactory, though, as it initialises Owner every time we create a company, while we might want to use the same instance. Again, this is not mandatory; it depends on the data contained in the Owner objects and the level of consistency that we need. For example, if we add to the owner an attribute online to mark that they are currently using the website and can be reached on the internal chat, we don't want to have to cycle through all companies and set the owner's online status for each of them when the owner is the same. So, we might want to change the way we compose them, passing an instance of Owner instead of the data used to initialise it.

class Owner:
    def __init__(self, first_name, last_name, online=False):
        self.first_name = first_name
        self.last_name = last_name
        self.online = online


class Company:
    def __init__(self, name, owner):
        self.name = name
        self.owner = owner


>>> owner1 = Owner("John", "Doe")
>>> company1 = Company("Pear", owner1)
>>> company2 = Company("Pulses", owner1)

Clearly, if the class Company has no other purpose than having a name, using a class is overkill, so this design might be further reduced to an Owner with a list of company names.

class Owner:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name
        self.companies = []


>>> owner1 = Owner("John", "Doe")
>>> owner1.companies.extend(["Pear", "Pulses"])

Can we use inheritance? Now I am stretching the example to its limit, but I can accept there might be a use case for something like this.

class Owner:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name


class Company(Owner):
    def __init__(self, name, owner_first_name, owner_last_name):
        self.name = name
        super().__init__(owner_first_name, owner_last_name)


>>> company1 = Company("Pear", "John", "Doe")
>>> company2 = Company("Pulses", "John", "Doe")

As I showed in the previous sections, though, this code smells as soon as we start adding something like the email address.

class Owner:
    def __init__(self, first_name, last_name, email):
        self.first_name = first_name
        self.last_name = last_name
        self.email = email


class Company(Owner):
    def __init__(self, name, owner_first_name, owner_last_name, email):
        self.name = name
        super().__init__(owner_first_name, owner_last_name, email)


>>> company1 = Company("Pear", "John", "Doe", "john@example.com")
>>> company2 = Company("Pulses", "John", "Doe", "john@example.com")

Is email that of the company or the personal one of its owner? There is a clash, and this is a good example of "state pollution": both attributes have the same name, but they represent different things and might need to coexist.

In conclusion, as you can see we have to be very careful to discuss relationships between objects in the context of our domain and avoid losing connection with the business logic.

Mixing the two: composed inheritance

Speaking of blurry separations, Python offers an interesting hook to its internal attribute resolution mechanism which allows us to create a hybrid between composition and inheritance that I call "composed inheritance".

Let's have a look at what happens internally when we deal with classes that are linked through inheritance.

class Parent:
    def __init__(self, value):
        self.value = value

    def info(self):
        print(f"Value: {self.value}")


class Child(Parent):
    def is_even(self):
        return self.value % 2 == 0


>>> c = Child(5)
>>> c.info()
Value: 5
>>> c.is_even()
False

This is a trivial example of an inheritance relationship between Child and Parent, where Parent provides the methods __init__ and info and Child augments the interface with the method is_even.

Let's have a look at the internals of the two classes. Parent.__dict__ is

mappingproxy({'__module__': '__main__',
              '__init__': <function __main__.Parent.__init__(self, value)>,
              'info': <function __main__.Parent.info(self)>,
              '__dict__': <attribute '__dict__' of 'Parent' objects>,
              '__weakref__': <attribute '__weakref__' of 'Parent' objects>,
              '__doc__': None})

and Child.__dict__ is

mappingproxy({'__module__': '__main__',
              'is_even': <function __main__.Child.is_even(self)>,
              '__doc__': None})

Finally, the bond between the two is established through Child.__bases__, which has the value (__main__.Parent,).

So, when we call c.is_even the instance has a bound method that comes from the class Child, as its __dict__ contains the function is_even. Conversely, when we call c.info Python has to fetch it from Parent, as Child can't provide it. This mechanism is implemented by the method __getattribute__ that is the core of the Python inheritance system.

As I mentioned before, however, there is a hook into this system that the language provides us, namely the method __getattr__, which is not present by default. What happens is that when a class can't provide an attribute, Python first tries to get the attribute with the standard inheritance mechanism but if it can't be found, as a last resort it tries to run __getattr__ passing the attribute name.

An example can definitely clarify the matter.

class Parent:
    def __init__(self, value):
        self.value = value

    def info(self):
        print(f"Value: {self.value}")


class Child(Parent):
    def is_even(self):
        return self.value % 2 == 0

    def __getattr__(self, attr):
        if attr == "secret":
            return "a_secret_string"
        raise AttributeError


>>> c = Child(5)

Now, if we try to access c.secret, Python would raise an AttributeError, as neither Child nor Parent can provide that attribute. As a last resort, though, Python runs c.__getattr__("secret"), and the code of that method that we implemented in the class Child returns the string "a_secret_string". Please note that the value of the argument attr is the name of the attribute as a string.

Because of the catch-all nature of __getattr__, we eventually have to raise an AttributeError to keep the inheritance mechanism working, unless we actually need or want to implement something very special.

This opens the door to an interesting hybrid solution where we can compose objects retaining an automatic delegation mechanism.

class Parent:
    def __init__(self, value):
        self.value = value

    def info(self):
        print(f"Value: {self.value}")


class Child:
    def __init__(self, value):
        self.parent = Parent(value)

    def is_even(self):
        return self.value % 2 == 0

    def __getattr__(self, attr):
        return getattr(self.parent, attr)


>>> c = Child(5)
>>> c.value
5
>>> c.info()
Value: 5
>>> c.is_even()
False

As you can see, here Child is composing Parent and there is no inheritance between the two. We can nevertheless access c.value and call c.info, thanks to the fact that Child.__getattr__ delegates everything that can't be found in Child to the instance of Parent stored in self.parent.

Note: don't confuse getattr with __getattr__. The former is a builtin function that gets an attribute provided its name, a replacement for the dotted notation when the name of the attribute is known as a string. The latter is the hook into the inheritance mechanism that I described in this section.
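A two-line illustration of the builtin, using the composed Child instance c from the example above:

>>> getattr(c, "value")   # same as c.value, but the attribute name is a string
5
>>> getattr(c, "info")()  # fetch the bound method, then call it
Value: 5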

Now, this is very powerful, but is it also useful?

I think this is not one of the techniques that will drastically change the way you write code in Python, but it can definitely help you to use composition instead of inheritance even when the number of methods that you have to wrap is high. One of the limits of composition is that you are at the opposite extreme of the automation spectrum: while inheritance is completely automatic, composition does nothing for you. This means that when you compose objects you need to decide which methods or attributes of the contained objects you want to wrap, in order to expose them in the container object. In the previous example, the class Child might want to expose the attribute value and the method info, which would result in something like

class Parent:
    def __init__(self, value):
        self.value = value

    def info(self):
        print(f"Value: {self.value}")


class Child:
    def __init__(self, value):
        self.parent = Parent(value)

    def is_even(self):
        return self.value % 2 == 0

    def info(self):
        return self.parent.info()

    @property
    def value(self):
        return self.parent.value

As you can easily see, the more Child wants to expose of the Parent interface, the more wrapper methods and properties you need. To be perfectly clear, in this example the code above smells, as there are too many one-liner wrappers, which tells me it would be better to use inheritance. But if the class Child had a dozen of its own methods, suddenly it would make sense to do something like this, and in that case, __getattr__ might come in handy.

Final words

Both composition and inheritance are tools, and both exist to serve the bigger purpose of code reuse, so learn their strengths and weaknesses, so that you can pick the correct one and avoid future issues in your code.

I hope this rather long discussion helped you to get a better picture of the options you have when you design an object-oriented system, and also maybe introduced some new ideas or points of view if you are already comfortable with the concepts I wrote about.

Updates

2021-03-06 Following the suggestion of Tim Morris I added the console output to the source code to make the code easier to understand. Thanks Tim for the feedback!

Feedback

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.

Categories: FLOSS Project Planets

The Digital Cat: TDD in Python with pytest - Part 5

Sat, 2021-03-06 13:00

This is the fifth and last post in the series "TDD in Python with pytest" where I develop a simple project following a strict TDD methodology. The posts come from my book Clean Architectures in Python and have been reviewed to get rid of some bad naming choices of the version published in the book.

You can find the first post here.

In this post I will conclude the discussion about mocks introducing patching.

Patching

Mocks are very simple to introduce in your tests whenever your objects accept classes or instances from outside. In that case, as shown in the previous sections, you just have to instantiate the class Mock and pass the resulting object to your system. However, when the external classes instantiated by your library are hardcoded this simple trick does not work. In this case you have no chance to pass a fake object instead of the real one.

This is exactly the case addressed by patching. Patching, in a testing framework, means to replace a globally reachable object with a mock, thus achieving the goal of having the code run unmodified, while part of it has been hot swapped, that is, replaced at run time.

A warm-up example

Clone the repository fileinfo that you can find here and move to the branch develop. As I did for the project simple_calculator, the branch master contains the full solution, and I use it to maintain the repository, but if you want to code along you need to start from scratch. If you prefer, you can of course fork the repository on GitHub and work on your own copy.

git clone https://github.com/lgiordani/fileinfo cd fileinfo git checkout --track origin/develop

Create a virtual environment following your preferred process and install the requirements

pip install -r requirements/dev.txt

You should at this point be able to run

pytest -svv

and get an output like

=============================== test session starts ===============================
platform linux -- Python XXXX, pytest-XXXX, py-XXXX, pluggy-XXXX -- fileinfo/venv3/bin/python3
cachedir: .cache
rootdir: fileinfo, inifile: pytest.ini
plugins: cov-XXXX
collected 0 items

============================== no tests ran in 0.02s ==============================

Let us start with a very simple example. Patching can be complex to grasp at the beginning so it is better to start learning it with trivial use cases. The purpose of this library is to develop a simple class that returns information about a given file. The class shall be instantiated with the file path, which can be relative.

The starting point is the class with the method __init__. If you want you can develop the class using TDD, but for the sake of brevity I will not show here all the steps that I followed. This is the set of tests I have in tests/test_fileinfo.py

tests/test_fileinfo.py

from fileinfo.fileinfo import FileInfo


def test_init():
    filename = 'somefile.ext'
    fi = FileInfo(filename)

    assert fi.filename == filename


def test_init_relative():
    filename = 'somefile.ext'
    relative_path = '../{}'.format(filename)
    fi = FileInfo(relative_path)

    assert fi.filename == filename

and this is the code of the class FileInfo in the file fileinfo/fileinfo.py

fileinfo/fileinfo.py

import os


class FileInfo:
    def __init__(self, path):
        self.original_path = path
        self.filename = os.path.basename(path)

Git tag: first-version

As you can see the class is extremely simple, and the tests are straightforward. So far I didn't add anything new to what we discussed in the previous posts.

Now I want the method get_info to return a tuple with the file name, the original path the class was instantiated with, and the absolute path of the file. Pretending we are in the directory /some/absolute/path, the class should work as shown here

>>> fi = FileInfo('../book_list.txt')
>>> fi.get_info()
('book_list.txt', '../book_list.txt', '/some/absolute/book_list.txt')

You can quickly realise that you have a problem writing the test. There is no way to easily test something like "the absolute path", since the outcome of the function called in the test is supposed to vary with the location of the test itself. Let us try to write part of the test

def test_get_info():
    filename = 'somefile.ext'
    original_path = '../{}'.format(filename)
    fi = FileInfo(original_path)

    assert fi.get_info() == (filename, original_path, '???')

where the '???' string highlights that I cannot put something sensible to test the absolute path of the file.

Patching is the way to solve this problem. You know that the function will use some code to get the absolute path of the file. So, within the scope of this test only, you can replace that code with something different and perform the test. Since the replacement code has a known outcome writing the test is now possible.

Patching, thus, means to inform Python that during the execution of a specific portion of the code you want a globally accessible module/object replaced by a mock. Let's see how we can use it in our example

tests/test_fileinfo.py

from unittest.mock import patch

[...]


def test_get_info():
    filename = 'somefile.ext'
    original_path = '../{}'.format(filename)

    with patch('os.path.abspath') as abspath_mock:
        test_abspath = 'some/abs/path'
        abspath_mock.return_value = test_abspath
        fi = FileInfo(original_path)

        assert fi.get_info() == (filename, original_path, test_abspath)

You can clearly see the context in which the patching happens, as it is enclosed in a with statement. Inside this block the function os.path.abspath is replaced by a mock created by the function patch and called abspath_mock. So, while Python executes the lines of code enclosed by the with statement, any use of os.path.abspath will actually go through the mock abspath_mock.

The first thing we can do, then, is to give the mock a known return_value. This way we solve the issue that we had with the initial code, that is using an external component that returns an unpredictable result. The line

tests/test_fileinfo.py

abspath_mock.return_value = test_abspath

instructs the patching mock to return the given string as a result, regardless of the real values of the file under consideration.

The code that make the test pass is

fileinfo/fileinfo.py

class FileInfo:
    [...]

    def get_info(self):
        return (
            self.filename,
            self.original_path,
            os.path.abspath(self.original_path)
        )

When this code is executed by the test, the function os.path.abspath is replaced at run time by the mock that we prepared there, which ignores the input value self.original_path and returns the fixed value it was instructed to use.

Git tag: patch-with-context-manager

It is worth at this point discussing outgoing messages again. The code that we are considering here is a clear example of an outgoing query, as the method get_info is not interested in changing the status of the external component. In the previous post we reached the conclusion that testing the return value of outgoing queries is pointless and should be avoided. With patch we are replacing the external component with something that we know, using it to test that our object correctly handles the value returned by the outgoing query. We are thus not testing the external component, as it has been replaced, and we are definitely not testing the mock, as its return value is already known.

Obviously to write the test you have to know that you are going to use the function os.path.abspath, so patching is somehow a "less pure" practice in TDD. In pure OOP/TDD you are only concerned with the external behaviour of the object, and not with its internal structure. This example, however, shows that this pure approach has some limitations that you have to cope with, and patching is a clean way to do it.

The patching decorator

The function patch we imported from the module unittest.mock is very powerful, as it can temporarily replace an external object. If the replacement has to or can be active for the whole test, there is a cleaner way to inject your mocks, which is to use patch as a function decorator.

This means that you can decorate the test function, passing as argument the same argument you would pass if patch was used in a with statement. This requires however a small change in the test function prototype, as it has to receive an additional argument, which will become the mock.

Let's change test_get_info, removing the statement with and decorating the function with patch

tests/test_fileinfo.py

@patch('os.path.abspath')
def test_get_info(abspath_mock):
    test_abspath = 'some/abs/path'
    abspath_mock.return_value = test_abspath

    filename = 'somefile.ext'
    original_path = '../{}'.format(filename)
    fi = FileInfo(original_path)

    assert fi.get_info() == (filename, original_path, test_abspath)

Git tag: patch-with-function-decorator

As you can see the decorator patch works like a big with statement for the whole function. The argument abspath_mock passed to the test becomes internally the mock that replaces os.path.abspath. Obviously this way you replace os.path.abspath for the whole function, so you have to decide case by case which form of the function patch you need to use.

Multiple patches

You can patch more than one object in the same test. For example, consider the case where the method get_info calls os.path.getsize in addition to os.path.abspath, in order to return the size of the file. You have at this point two different outgoing queries, and you have to replace both with mocks to make your class work during the test.

This can be easily done with an additional patch decorator

tests/test_fileinfo.py

@patch('os.path.getsize')
@patch('os.path.abspath')
def test_get_info(abspath_mock, getsize_mock):
    filename = 'somefile.ext'
    original_path = '../{}'.format(filename)

    test_abspath = 'some/abs/path'
    abspath_mock.return_value = test_abspath

    test_size = 1234
    getsize_mock.return_value = test_size

    fi = FileInfo(original_path)

    assert fi.get_info() == (filename, original_path, test_abspath, test_size)

Please note that the decorator which is nearest to the function is applied first. Always remember that the decorator syntax with @ is a shortcut to replace the function with the output of the decorator, so two decorators result in

@decorator1
@decorator2
def myfunction():
    pass

which is a shortcut for

def myfunction():
    pass

myfunction = decorator1(decorator2(myfunction))

This explains why, in the test code, the function receives first abspath_mock and then getsize_mock. The first decorator applied to the function is the patch of os.path.abspath, which appends the mock that we call abspath_mock. Then the patch of os.path.getsize is applied and this appends its own mock.

The code that makes the test pass is

fileinfo/fileinfo.py

class FileInfo:
    [...]

    def get_info(self):
        return (
            self.filename,
            self.original_path,
            os.path.abspath(self.original_path),
            os.path.getsize(self.original_path)
        )

Git tag: multiple-patches

We can write the above test using two with statements as well

tests/test_fileinfo.py

def test_get_info():
    filename = 'somefile.ext'
    original_path = '../{}'.format(filename)

    with patch('os.path.abspath') as abspath_mock:
        test_abspath = 'some/abs/path'
        abspath_mock.return_value = test_abspath

        with patch('os.path.getsize') as getsize_mock:
            test_size = 1234
            getsize_mock.return_value = test_size

            fi = FileInfo(original_path)

            assert fi.get_info() == (
                filename,
                original_path,
                test_abspath,
                test_size
            )

Using more than one with statement, however, makes the code difficult to read, in my opinion, so in general I prefer to avoid complex with trees if I do not really need to use a limited scope of the patching.

Checking call parameters

When you patch, your internal algorithm is not executed, as the patched methods just return the values they have been instructed to return. This is connected to what we said about testing external systems, so everything is good, but while we don't want to test the internals of the module os.path, we do want to be sure that we are passing the correct values to the external functions.

This is why mocks provide methods like assert_called_with (and other similar methods), through which we can check the values passed to a patched method when it is called. Let's add the checks to the test

tests/test_fileinfo.py

@patch('os.path.getsize')
@patch('os.path.abspath')
def test_get_info(abspath_mock, getsize_mock):
    test_abspath = 'some/abs/path'
    abspath_mock.return_value = test_abspath

    filename = 'somefile.ext'
    original_path = '../{}'.format(filename)

    test_size = 1234
    getsize_mock.return_value = test_size

    fi = FileInfo(original_path)

    info = fi.get_info()

    abspath_mock.assert_called_with(original_path)
    getsize_mock.assert_called_with(original_path)
    assert info == (filename, original_path, test_abspath, test_size)

As you can see, I first invoke fi.get_info storing the result in the variable info, then check that the patched methods have been called with the correct parameters, and finally assert the format of the output.

The test passes, confirming that we are passing the correct values.

Git tag: addding-checks-for-input-values

Patching immutable objects

The most widespread version of Python is CPython, which is written, as the name suggests, in C. Part of the standard library is also written in C, while the rest is written in Python itself.

The objects (classes, modules, functions, etc.) that are implemented in C are shared between interpreters, and this requires those objects to be immutable, so that you cannot alter them at runtime from a single interpreter.

An example of this immutability can be given easily using a Python console

>>> a = 1
>>> a.conjugate = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object attribute 'conjugate' is read-only

Here I'm trying to replace a method with an integer, which is pointless per se, but clearly shows the issue we are facing.

What does this immutability have to do with patching? What patch actually does is to temporarily replace an attribute of an object (a method of a class, a class of a module, etc.), which also means that if we try to replace an attribute of an immutable object the patching action will fail.

A typical example of this problem is the module datetime, which is also one of the best candidates for patching, since the output of time functions is by definition time-varying.

Let me show the problem with a simple class that logs operations. I will temporarily break the TDD methodology by writing first the class and then the tests, so that you can appreciate the problem.

Create a file called logger.py and put there the following code

fileinfo/logger.py

import datetime


class Logger:
    def __init__(self):
        self.messages = []

    def log(self, message):
        self.messages.append((datetime.datetime.now(), message))

This is pretty simple, but testing this code is problematic, because the method log produces results that depend on the actual execution time. The call to datetime.datetime.now is however an outgoing query, and as such it can be replaced by a mock with patch.

If we try to do it, however, we will have a bitter surprise. This is the test code, which you can put in tests/test_logger.py

tests/test_logger.py

from unittest.mock import patch

from fileinfo.logger import Logger


@patch('datetime.datetime.now')
def test_log(mock_now):
    test_now = 123
    test_message = "A test message"
    mock_now.return_value = test_now

    test_logger = Logger()
    test_logger.log(test_message)

    assert test_logger.messages == [(test_now, test_message)]

When you try to execute this test you will get the following error

TypeError: can't set attributes of built-in/extension type 'datetime.datetime'

which is raised because patching tries to replace the function now in datetime.datetime with a mock, and since that class is implemented in C and thus immutable, the operation fails.

Git tag: initial-logger-not-working

There are several ways to address this problem. All of them, however, start from the fact that importing or subclassing an immutable object gives you a mutable "copy" of that object.
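
As a quick illustration (a sketch of mine, not part of the original article), a subclass defined in Python is an ordinary mutable class, even though its parent is implemented in C:

>>> import datetime
>>> class MyDatetime(datetime.datetime):
...     pass
...
>>> MyDatetime.now = lambda: "fake now"
>>> MyDatetime.now()
'fake now'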

The easiest example in this case is the module datetime itself. In the function test_log we tried to patch the object datetime.datetime.now directly, affecting the builtin module datetime. The file logger.py, however, runs import datetime, so the module becomes a local symbol in the module logger. This is exactly the key to our patching. Let us change the code to

tests/test_logger.py

@patch('fileinfo.logger.datetime.datetime')
def test_log(mock_datetime):
    test_now = 123
    test_message = "A test message"
    mock_datetime.now.return_value = test_now

    test_logger = Logger()
    test_logger.log(test_message)

    assert test_logger.messages == [(test_now, test_message)]

Git tag: correct-patching

If you run the test now, you can see that the patching works. What we did was to inject our mock in fileinfo.logger.datetime.datetime instead of datetime.datetime.now. Two things thus changed in our test. First, we are patching the module imported in the file logger.py and not the module provided globally by the Python interpreter. Second, we have to patch the whole class datetime.datetime, and not just its method now, because the class itself is still the immutable C object: if you try to patch fileinfo.logger.datetime.datetime.now you will find that it is still immutable.

Another possible solution to this problem is to create a function that invokes the immutable object and returns its value. This wrapper function can be easily patched, because it lives in our own code and thus is not immutable. This solution, however, requires changing the source code to allow testing, which is far from optimal. Obviously it is better to introduce a small change in the code and have it tested than to leave it untested, but whenever possible I try to avoid solutions that introduce code which wouldn't be required without tests.
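
A minimal sketch of this wrapper approach (my own illustration, with a hypothetical helper called now, not code from the original repository) could be

fileinfo/logger.py

import datetime


def now():
    # Thin, patchable wrapper around the immutable C-level function
    return datetime.datetime.now()


class Logger:
    def __init__(self):
        self.messages = []

    def log(self, message):
        self.messages.append((now(), message))

The test could then patch fileinfo.logger.now directly, without ever touching the immutable datetime.datetime class.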

Mocks and proper TDD

Following a strict TDD methodology means writing a test before writing the code that passes that test. This can be done because we use the object under test as a black box, interacting with it through its API, and thus not knowing anything of its internal structure.

When we mock systems we break this assumption. In particular we need to open the black box every time we need to patch a hardcoded external system. Let's say, for example, that the object under test creates a temporary directory to perform some data processing. This is a detail of the implementation and we are not supposed to know it while testing the object, but since we need to mock the file creation to avoid interaction with the external system (storage), we need to become aware of what happens internally.

This also means that writing a test for the object before writing the implementation of the object itself is difficult. Pretty often, thus, such objects are built with TDD but iteratively, with mocks introduced after the code has been written.

While this is a violation of the strict TDD methodology, I don't consider it a bad practice. TDD helps us to write better code consistently, but good code can be written even without tests. The real outcome of TDD is a test suite that is capable of detecting regressions or the removal of important features in the future. This means that breaking strict TDD for a small part of the code (patching objects) will not affect the real result of the process, only change the way we achieve it.

A warning

Mocks are a good way to approach parts of the system that are not under test but that are still part of the code that we are running. This is particularly true for parts of the code that we wrote, whose internal structure is ultimately known. When the external system is complex and completely detached from our code, mocking starts to become complicated, and the risk is that we spend more time faking parts of the system than actually writing code.

In these cases we have definitely crossed the barrier between unit testing and integration testing. You may see mocks as the bridge between the two, as they allow you to keep unit-testing parts that are naturally connected ("integrated") with external systems, but there is a point where you have to recognise that you need to change approach.

This threshold is not fixed, and I can't give you a rule to recognise it, but I can give you some advice. First of all keep an eye on how many things you need to mock to make a test run, as an increasing number of mocks in a single test is definitely a sign of something wrong in the testing approach. My rule of thumb is that when I have to create more than 3 mocks, an alarm goes off in my mind and I start questioning what I am doing.

The second piece of advice is to always consider the complexity of the mocks. You may find yourself patching a class but then having to create monsters like cls_mock().func1().func2().func3.assert_called_with(x=42), which is a sign that the part of the system you are mocking is buried deep in code that you cannot really access, because you don't know its internal mechanisms.

The third piece of advice is to consider mocks as "hooks" that you throw at the external system, hooks that break through its hull to reach its internal structure. These hooks obviously work against the assumption that we can interact with a system knowing only its external behaviour, or its API. As such, keep in mind that each mock you create is a step back from this ideal, "breaking the spell" of the decoupled interaction. Creating mocks becomes increasingly complex as you do this, and that should help keep you aware of what you are doing (or overdoing).

Final words

Mocks are a very powerful tool that allows us to test code that contains outgoing messages. In particular they allow us to test the arguments of outgoing commands. Patching is a good way to overcome the fact that some external components are hardcoded in our code and are thus unreachable through the arguments passed to the classes or the methods under analysis.

Updates

2021-03-06 GitHub user 4myhw spotted an inconsistency between the code on GitHub and the code in the post. Thanks!

Feedback

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.

Categories: FLOSS Project Planets

Andre Roberge: Going back in history

Sat, 2021-03-06 07:58

Imagine that you wish to run a program that takes a long time to complete. Just in case something goes wrong, you decide to use friendly-traceback (soon to be renamed...) in interactive mode to run it. This turns out to be a good decision:

Time to explore what might be the problem, and where exactly things might have gone wrong.
Ooops ... a silly typo. Easy enough to correct:
Unfortunately, that did not work: Friendly-traceback, like all Python programs designed to handle exceptions, only captures the last one.
This has happened to me so many times; granted, it was always with short programs so that I could easily recreate the original exception. However, I can only imagine how frustrating it might be for beginners encountering this situation.
A solution
Fortunately, you are using the latest version of Friendly-traceback, the one that records exceptions that were captured, and allows you to discard the last one recorded (rinse, and repeat as often as needed), thus going back in history.


Now that we are set, we can explore further to determine what might have gone wrong.



Categories: FLOSS Project Planets

Doug Hellmann: imapautofiler 1.11.0

Sat, 2021-03-06 06:14
New Features

A configuration option has been added to disable the use of SSL/TLS for servers that do not support it. The default is to always use SSL/TLS. Contributed by Samuele Zanon.

Upgrade Notes

This release drops support for Python 3.6 and 3.7.
Categories: FLOSS Project Planets

PyBites: Are You Working on Your Mindset as Much as Your Technical Skills?

Sat, 2021-03-06 01:21

Do you want to read an amazing coaching / mindset book? Check out Wooden: A Lifetime of Observations and Reflections On and Off the Court:

In this post I wanted to share some of our favorite lessons:

  1. "Too often we neglect our journey in our eagerness or anxiety about reaching our goal ... The preparation is where success is truly found."

    Sometimes we become too obsessed with results and big, flashy goals: land a developer job, build a profitable side gig, etc.

    However if you want to be successful in the long term you have to fall in love with the game, the process, the daily reps.

    Only then can you become really great at anything and sustain the challenges you inevitably face.

  2. "When you improve a little each day, eventually big things occur."

    Some people post on social media for 3 days, code for a month, do 2 coding interviews, don't see significant results, and throw in the towel.

    You will only see consistent results gradually though. What you see today is a reflection of the past 2 years of actions - the tip of the iceberg.

    Take our platform for example, the people that rise to the top and leave amazing stories have been coding for many days, weeks and even years. They made many mistakes along the way and experimented every day.

    Adopt the 1% rule: consistent little improvements always beat a few big improvements (which are mostly an illusion).

  3. "Your reaction to victory or defeat is an important part of how you play the game."

    The Detroit Pistons were notorious for their game and how they behaved when they lost a championship.

    Winning or losing is one thing, how you react to it is way more important. Are you defeated or do you hit the gym again the next day?

    When you hit roadblocks ("I don't grasp OOP", "my code crashes", "my app slows down with 10x the load") - do you complain about it or do you fully embrace these obstacles and see them as fuel or opportunities for growth?

  4. "Never believe you're better than anybody else, but remember that you're just as good as everybody else."

    This one should be obvious. Once you think you're better than somebody, expect to go downhill quickly.

    Stay humble: everybody you encounter can teach you a valuable lesson. Newbies can open expert Pythonistas' eyes just by reframing a problem with a beginner mindset.

    One of my favourite stories in this context is the kid that found a simple (clever) solution to get a truck unstuck.

  5. "You cannot function physically or mentally unless your emotions are under control."

    Emotions can cloud your judgement. It's often good to cool down before making any rash decisions.

    In this context we like the "hot letter" or "unsent angry letter" hack Abraham Lincoln (and other public figures) used.

These are the kinds of important mindset lessons we include in our PyBites Developer Mindset coaching program.

If you want to be coached on mindset in addition to Python and software development, book a Strategy Session with us.

Don't ignore the mindset side of things; it's as important as (if not more important than) the technical skills!

-- Bob

Categories: FLOSS Project Planets

Codementor: pytest quick tip: Adding CLI options

Sat, 2021-03-06 00:48
Quick intro to adding CLI arguments to a pytest test suite in the context of pytest-selenium.
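
As a rough sketch of the underlying mechanism (my own example, not taken from the linked article), pytest lets a conftest.py register extra command-line options through the pytest_addoption hook and read them back via request.config.getoption:

conftest.py

import pytest


def pytest_addoption(parser):
    # Hypothetical custom flag, e.g. `pytest --app-url=https://example.com`
    parser.addoption(
        "--app-url",
        action="store",
        default="http://localhost:8000",
        help="Base URL the tests run against",
    )


@pytest.fixture
def app_url(request):
    # Expose the option value to tests as a fixture
    return request.config.getoption("--app-url")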
Categories: FLOSS Project Planets

Test and Code: 147: Testing Single File Python Applications/Scripts with pytest and coverage

Fri, 2021-03-05 21:00

Have you ever written a single file Python application or script?
Have you written tests for it?
Do you check code coverage?

This is the topic of this week's episode, spurred on by a listener question.

The questions:

  • For single file scripts, I'd like to have the test code included right there in the file. Can I do that with pytest?
  • If I can, can I use code coverage on it?

The example code discussed in the episode: script.py

def foo():
    return 5


def main():
    x = foo()
    print(x)


if __name__ == '__main__':  # pragma: no cover
    main()


## test code

# To test:
#   pip install pytest
#   pytest script.py

# To test with coverage:
#   put this file (script.py) in a directory by itself, say foo
#   then from the parent directory of foo:
#   pip install pytest-cov
#   pytest --cov=foo foo/script.py

# To show missing lines:
#   pytest --cov=foo --cov-report=term-missing foo/script.py


def test_foo():
    assert foo() == 5


def test_main(capsys):
    main()
    captured = capsys.readouterr()
    assert captured.out == "5\n"

Suggestion by @cfbolz if you need to import pytest:

if __name__ == '__main__':  # pragma: no cover
    main()
else:
    import pytest

Sponsored By:

Support Test & Code : Python Testing

Categories: FLOSS Project Planets
