Planet Python


Kushal Das: PyConf Hyderabad 2017

Wed, 2017-11-15 23:25

In the beginning of October, I attended a new PyCon in India, PyConf Hyderabad (no worries, they are working on the name for the next year). I was super excited about this conference, mainly because it meant being able to meet more Python developers from India. We are a large country, and we certainly need more local conferences :)

We reached the conference hotel a day before the event started, along with Py. The first day of the conference was workshop day; we reached the venue on time to say hi to everyone, and met the conference team and many old friends. It was good to see that folks had traveled from all across the country to volunteer for the conference. Of course, we had a certain number of dgplug folks there :)

On the conference day, Anwesha and I set up the PSF booth and talked to the attendees. During the lightning talk session, Sayan and Anwesha introduced PyCon Pune, and they also opened up the registration during the lightning talks :). I attended Chandan Kumar’s talk about his journey into upstream projects. I have to admit that I feel proud to see all the work he has done.

Btw, I forgot to mention that lunch at PyConf Hyderabad was the best conference food ever. They had some amazing biryani :).

The last talk of the day was my keynote, titled Free Software movement & current days. Anwesha and I wrote an article on the history of Free Software a few months back, and the talk was based on that. This was also the first time I spoke about the Freedom of the Press Foundation (I attended my first conference as an FPF staff member).

The team behind the conference did some amazing groundwork to make this conference happen. It was a good opportunity to meet the community and make new friends.

Categories: FLOSS Project Planets

Django Weblog: Django 2.0 release candidate 1 released

Wed, 2017-11-15 18:54

Django 2.0 release candidate 1 is the final opportunity for you to try out the assortment of new features before Django 2.0 is released.

The release candidate stage marks the string freeze and the call for translators to submit translations. Provided no major bugs are discovered that can't be solved in the next two weeks, Django 2.0 will be released on or around December 1. Any delays will be communicated on the django-developers mailing list thread.

Please use this opportunity to help find and fix bugs (which should be reported to the issue tracker). You can grab a copy of the package from our downloads page or on PyPI.

The PGP key ID used for this release is Tim Graham: 1E8ABDC773EDE252.


PyCharm: PyCharm 2017.3 EAP 10

Wed, 2017-11-15 12:47

This week’s early access program (EAP) version of PyCharm is now available from our website:

Get PyCharm 2017.3 EAP 10

The release is getting close, and we’re just ironing out the last small issues until it’s ready.

Improvements in This Version
  • kwargs autocompletion for Model.objects.create(). Django support is only available in PyCharm Professional Edition
  • An issue that would cause PyCharm to fill multiple log files per minute has been fixed
  • Docker Run configurations have been improving steadily throughout the EAP phase. In this version, ports that are used in a binding but haven’t been exposed yet will be auto-exposed (Docker support is only available in PyCharm Professional Edition)
  • And more, have a look at the release notes for details

If these features sound interesting to you, try them yourself:

Get PyCharm 2017.3 EAP 10

If you are using a recent version of Ubuntu (16.04 and later) you can also install PyCharm EAP versions using snap:

sudo snap install [pycharm-professional | pycharm-community] --classic --edge

If you already used snap for the previous version, you can update using:

sudo snap refresh [pycharm-professional | pycharm-community] --classic --edge

As a reminder, PyCharm EAP versions:

  • Are free, including PyCharm Professional Edition EAP
  • Will work for 30 days from being built; you’ll need to update when the build expires

If you run into any issues with this version, or another version of PyCharm, please let us know on our YouTrack. If you have other suggestions or remarks, you can reach us on Twitter, or by commenting on the blog.


Codementor: Onicescu correlation coefficient-Python - Alexandru Daia

Wed, 2017-11-15 12:41
Implementing a new correlation method based on kinetic energies that I am researching.

Ian Ozsvald: PyDataBudapest and “Machine Learning Libraries You’d Wish You’d Known About”

Wed, 2017-11-15 10:50

I’m back at BudapestBI and this year it has its first PyDataBudapest track. Budapest is fun! I’ve given a second iteration of my slightly updated “Machine Learning Libraries You’d Wish You’d Known About” talk (updated from PyDataCardiff two weeks back). When I was here to give an opening keynote two years back, the conference was a bit smaller; it has grown by 100+ folk since then. There’s also a stronger emphasis on open source R and Python tools. As before, the quality of the attendees here is high – the conversations are great!

During my talk I used my Explaining Regression Predictions Notebook to cover:

  • Dask to speed up Pandas
  • TPOT to automate sklearn model building
  • Yellowbrick for sklearn model visualisation
  • ELI5 with Permutation Importance and model explanations
  • LIME for model explanations

Nick’s photo of me on stage

Some audience members asked about co-linearity detection and explanation. Whilst I don’t have a good answer for identifying these relationships, I’ve added a seaborn pairplot, a correlation plot and the Pandas Profiling tool to the Notebook which help to show these effects.
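The pairplot and correlation additions are standard seaborn/Pandas calls; as a minimal sketch of the correlation check (the toy DataFrame and column names here are invented for illustration, not taken from the Notebook):

```python
import pandas as pd

# Toy data standing in for the Boston housing features used in the Notebook
df = pd.DataFrame({
    "rooms": [4, 5, 6, 7],
    "area":  [40, 52, 61, 75],
    "price": [100, 130, 155, 190],
})

# Pairwise Pearson correlations; values near +/-1 flag co-linear features.
# seaborn's pairplot(df) gives the scatter-matrix view of the same idea.
corr = df.corr()
print(corr.round(2))
```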

Although it is complicated, I’m still pretty happy with this ELI5 plot that’s explaining feature contributions to a set of cheap-to-expensive houses from the Boston dataset:

Boston ELI5

I’m planning to do some training on these sorts of topics next year; join my training list if that might be of use.

Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

EuroPython Society: EuroPython 2018: Location and Dates

Wed, 2017-11-15 10:43

After a two month RFP bidding process with 19 venues from all over Europe, we are pleased to announce our selection of the location and venue for EuroPython 2018:

… yes, this is just one week before the famous Edinburgh Festival Fringe, so you can extend your stay a little longer if you like.

Based on the feedback we collected in the last few years, we have switched to a more compact conference layout for 2018:

  • Monday, Tuesday: Workshops and Trainings
  • Wednesday, Thursday, Friday: Main conference with talks, keynotes, exhibition
  • Saturday, Sunday: Sprints

More information will be available as we progress with the organization.

PS: We are now entering contract negotiations, so the above dates are highly likely, but we cannot confirm 100% yet.


EuroPython Society


Tryton News: Foundation board renewal 2017

Wed, 2017-11-15 04:00

The 2017 foundation board renewal process has finished. We are happy to announce that the new board is composed of:

  • Axel Braun from Germany
  • Jonathan Levy from the United States of America
  • Korbinian Preisler from Germany
  • Nicolas Évrard from Belgium
  • Pedro Luciano Rossi from Argentina
  • Sebastián Marró from Argentina
  • Sergi Almacellas Abellana from Spain

Congratulations to Nicolas Évrard as he became the second president of the Tryton foundation board.

We nearly reached the website redesign goal of our budget for 2017. You can help to make it happen by making a donation.


Talk Python to Me: #138 Anvil: All web, all Python

Wed, 2017-11-15 03:00
Have you noticed that web development is kind of hard? If you've been doing it for a long time, this is easy to forget. It probably sounds easy enough

NumFOCUS: Quantopian commits to fund pandas as a new NumFOCUS Corporate Partner

Tue, 2017-11-14 19:03
NumFOCUS welcomes Quantopian as our first Emerging Leader Corporate Partner, a partnership for small but growing companies who are leading by providing fiscal support to our open source projects. — Quantopian Supports Open Source by John Fawcett, CEO and founder of Quantopian       It all started with a single tweet. While scrolling through […]

Mike Driscoll: Book Review: Python Testing with pytest

Tue, 2017-11-14 13:15

A couple of months ago, Brian Okken asked me if I would be interested in reading his book, Python Testing with pytest. I have been interested in learning more about the pytest package for a while, so I agreed to take a look. I also liked that the publisher was The Pragmatic Programmers, which I’ve had good experience with in the past. We will start with a quick review and then dive into the play-by-play.

Quick Review
  • Why I picked it up: The author of the book asked me to read his book
  • Why I finished it: I mostly skimmed the book to see how it was written and to check out the examples
  • I’d give it to: Anyone who is interested in testing in Python and especially in the pytest package
Book Formats

    You can get this as a physical soft cover, as a Kindle book on Amazon, or in various other eBook formats via The Pragmatic Programmers’ website.

    Book Contents

    This book has 7 chapters, 5 appendices and is 222 pages long.

    Full Review

    This book jumps right in by starting off with an example in chapter 1. I actually found this a bit jarring as usually chapters have an introduction at the beginning that goes over what the chapter will be about. But chapter one just jumps right in with an example of a test in Python. It’s not bad, just different. This chapter explains how to get started using pytest and covers some of the common command line options you can pass to pytest.

    Chapter two goes into writing test functions with pytest. It also talks about how pytest uses Python’s plain assert keyword rather than the assert methods that Python’s unittest library uses. I found that to be an appealing feature of pytest. You will also learn how to skip tests and how to mark tests that we expect to fail, as well as a few other things.
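For readers new to pytest, the plain assert style (plus the skip and expected-failure markers mentioned above) looks roughly like this; the add() function is invented purely for illustration:

```python
import pytest

def add(a, b):
    # Trivial function under test, made up for this example
    return a + b

def test_add():
    # Plain Python assert - no self.assertEqual() boilerplate as in unittest
    assert add(2, 3) == 5

@pytest.mark.skip(reason="demonstrating how to skip a test")
def test_not_ready():
    assert add(1, 1) == 3

@pytest.mark.xfail(reason="demonstrating a test we expect to fail")
def test_known_bug():
    assert add(2, 2) == 5
```

Running pytest on a file containing these would report one pass, one skip, and one expected failure.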

    If you’ve wondered about fixtures in pytest, then you will be excited to know that this book has two chapters on the topic; specifically chapters three and four. These chapters cover a lot of material, so I will just mention the highlights. You will learn about creating fixtures for setup and teardown, how to trace fixture execution, fixture scope, parameterized fixtures and builtin fixtures.
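As a taste of what those fixture chapters cover, a minimal setup/teardown fixture might look like this (the names are invented for illustration):

```python
import pytest

@pytest.fixture
def numbers():
    data = [1, 2, 3]   # setup: runs before each test that requests the fixture
    yield data
    data.clear()       # teardown: runs after the test finishes

def test_sum(numbers):
    # pytest injects the fixture's yielded value as the argument
    assert sum(numbers) == 6
```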

    Chapter five is about how to add plugins to pytest. You will learn how to write your own plugins, how to install them, and how to test them. This chapter also gives you a good foundation in how to use plugins.

    In chapter six, we learn all about configuring pytest. The big topics covered in this chapter deal with pytest.ini, as well as what you might use setup.cfg for. There are a lot of other interesting topics in this chapter too, such as registering markers or changing test discovery locations. I encourage you to take a look at the book’s table of contents to learn more!
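For a flavor of what that chapter covers, a pytest.ini might register a custom marker and narrow test discovery; the values below are illustrative, not from the book:

```ini
[pytest]
# register a custom marker so --strict-markers will accept it
markers =
    slow: marks a test as slow to run
# only discover tests under the tests/ directory
testpaths = tests
```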

    Finally, in chapter 7 (the last chapter), we learn about using pytest with other testing tools. In this case, the book covers pdb, mock, tox, Jenkins and even unittest.

    The rest of the book is a series of five appendices and an index. The appendices cover virtual environments, pip, a plugin sampler, packaging / distributing Python projects and xUnit fixtures.

    I thought this book was well written and stayed on topic well. The examples are short and to the point. I am looking forward to diving more deeply into the book when I want to use the pytest package for my own code. I would recommend this book to anyone who is interested in the pytest package.

    Python Testing with pytest

    by Brian Okken

    Amazon, Book Website

    Other Book Reviews


    Stack Abuse: Using Machine Learning to Predict the Weather: Part 1

    Tue, 2017-11-14 12:21
    Part 1: Collecting Data From Weather Underground

    This is the first article of a multi-part series on using Python and Machine Learning to build models to predict weather temperatures based on data collected from Weather Underground. The series will consist of three different articles describing the major aspects of a Machine Learning project. The topics to be covered are:

    1. Data collection and processing
    2. Linear regression models
    3. Neural network models

    The data used in this series will be collected from Weather Underground's free tier API web service. I will be using the requests library to interact with the API to pull in weather data since 2015 for the city of Lincoln, Nebraska. Once collected, the data will need to be processed and aggregated into a format that is suitable for data analysis, and then cleaned.

    The second article will focus on analyzing the trends in the data with the goal of selecting appropriate features for building a Linear Regression model using the statsmodels and scikit-learn Python libraries. I will discuss the importance of understanding the assumptions necessary for using a Linear Regression model and demonstrate how to evaluate the features to build a robust model. This article will conclude with a discussion of Linear Regression model testing and validation.

    The final article will focus on using Neural Networks. I will compare the process of building a Neural Network model and interpreting the results, as well as the overall accuracy, between the Linear Regression model built in the prior article and the Neural Network model.

    Getting Familiar with Weather Underground

    Weather Underground is a company that collects and distributes data on various weather measurements around the globe. The company provides a swath of APIs that are available for both commercial and non-commercial use. In this article, I will describe how to programmatically pull daily weather data from Weather Underground using their free tier of service available for non-commercial purposes.

    If you would like to follow along with the tutorial you will want to sign up for their free developer account here. This account provides an API key to access the web service at a rate of 10 requests per minute and up to a total of 500 requests in a day.

    Weather Underground provides many different web service APIs to access data from, but the one we will be concerned with is their history API. The history API provides a summary of various weather measurements for a city and state on a specific day.

    The format of the request for the history API resource is as follows:
    • API_KEY: The API_KEY that Weather Underground provides with your account
    • YYYYMMDD: A string representing the target date of your request
    • STATE: The two letter state abbreviation in the United States
    • CITY: The name of the city associated with the state you requested
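Purely to illustrate how those placeholders line up, here is a template in the same shape as the article's BASE_URL (host omitted, as in the article) filled in with a fake key and the example state and city; all values are hypothetical:

```python
# All values here are hypothetical; a real request needs your own API_KEY
template = "{}/history_{}/q/{}/{}.json"
url = template.format("MY_API_KEY", "20150101", "NE", "Lincoln")
print(url)  # MY_API_KEY/history_20150101/q/NE/Lincoln.json
```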
    Making Requests to the API

    To make requests to the Weather Underground history API and process the returned data I will make use of a few standard libraries as well as some popular third party libraries. Below is a table of the libraries I will be using and their description. For installation instructions please refer to the listed documentation.

    Library      Description of Usage                                  Source
    -----------  ----------------------------------------------------  -------------------
    datetime     Used to increment our requests by day                 Standard Library
    time         Used to delay requests to stay under 10 per minute    Standard Library
    collections  Use namedtuples for structured collection of data     Standard Library
    pandas       Used to process, organize and clean the data          Third Party Library
    requests     Used to make networked requests to the API            Third Party Library
    matplotlib   Used for graphical analysis                           Third Party Library

    Let us get started by importing these libraries:

    from datetime import datetime, timedelta
    import time
    from collections import namedtuple
    import pandas as pd
    import requests
    import matplotlib.pyplot as plt

    Now I will define a couple of constants representing my API_KEY and the BASE_URL of the API endpoint I will be requesting. Note that you will need to sign up for an account with Weather Underground and receive your own API_KEY. By the time this article is published I will have deactivated this one.

    BASE_URL is a string with two placeholders represented by curly brackets. The first {} will be filled by the API_KEY and the second {} will be replaced by a string-formatted date. Both values will be interpolated into the BASE_URL string using the str.format(...) function.

    API_KEY = '7052ad35e3c73564'
    BASE_URL = "{}/history_{}/q/NE/Lincoln.json"

    Next I will initialize the target date to the first day of the year in 2015. Then I will specify the features that I would like to parse from the responses returned from the API. The features are simply the keys present in the history -> dailysummary portion of the JSON response. Those features are used to define a namedtuple called DailySummary which I'll use to organize the individual request's data in a list of DailySummary tuples.

    target_date = datetime(2015, 1, 1)
    features = ["date", "meantempm", "meandewptm", "meanpressurem",
                "maxhumidity", "minhumidity", "maxtempm", "mintempm",
                "maxdewptm", "mindewptm", "maxpressurem", "minpressurem",
                "precipm"]
    DailySummary = namedtuple("DailySummary", features)

    In this section I will be making the actual requests to the API and collecting the successful responses using the function defined below. This function takes the parameters url, api_key, target_date and days.

    def extract_weather_data(url, api_key, target_date, days):
        records = []
        for _ in range(days):
            request = url.format(api_key, target_date.strftime('%Y%m%d'))
            response = requests.get(request)
            if response.status_code == 200:
                data = response.json()['history']['dailysummary'][0]
                records.append(DailySummary(
                    date=target_date,
                    meantempm=data['meantempm'],
                    meandewptm=data['meandewptm'],
                    meanpressurem=data['meanpressurem'],
                    maxhumidity=data['maxhumidity'],
                    minhumidity=data['minhumidity'],
                    maxtempm=data['maxtempm'],
                    mintempm=data['mintempm'],
                    maxdewptm=data['maxdewptm'],
                    mindewptm=data['mindewptm'],
                    maxpressurem=data['maxpressurem'],
                    minpressurem=data['minpressurem'],
                    precipm=data['precipm']))
            time.sleep(6)
            target_date += timedelta(days=1)
        return records

    I start by defining a list called records which will hold the parsed data as DailySummary namedtuples. The for loop is defined so that it iterates once for each day in the days parameter passed to the function.

    Then the request is formatted using the str.format() function to interpolate the API_KEY and string formatted target_date object. Once formatted, the request variable is passed to the get() method of the requests object and the response is assigned to a variable called response.

    With the response returned I want to make sure the request was successful by evaluating that the HTTP status code is equal to 200. If it is successful then I parse the response's body into JSON using the json() method of the returned response object. Chained to the same json() method call, I select the history and dailysummary keys, then grab the first item in the dailysummary list and assign it to a variable named data.

    Now that I have the dict-like data structure referenced by the data variable I can select the desired fields and instantiate a new instance of the DailySummary namedtuple which is appended to the records list.

    Finally, each iteration of the loop concludes by calling the sleep method of the time module to pause the loop's execution for six seconds, guaranteeing that no more than 10 requests are made per minute, keeping us within Weather Underground's limits.

    Then the target_date is incremented by 1 day using the timedelta object of the datetime module so the next iteration of the loop retrieves the daily summary for the following day.
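The date bookkeeping is plain standard-library arithmetic; a quick illustration:

```python
from datetime import datetime, timedelta

# Advance a date by one day, exactly as the loop's last step does
target_date = datetime(2015, 1, 1)
target_date += timedelta(days=1)
print(target_date.strftime('%Y%m%d'))  # 20150102
```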

    The First Batch of Requests

    Without further delay I will kick off the first set of requests, up to the maximum daily allotment of 500 under the free developer account. Then I suggest you grab a refill of your coffee (or other preferred beverage) and get caught up on your favorite TV show, because the function will take at least an hour depending on network latency. With this we have maxed out our requests for the day, and this is only about half the data we will be working with.

    So, come back tomorrow where we will finish out the last batch of requests then we can start working on processing and formatting the data in a manner suitable for our Machine Learning project.

    records = extract_weather_data(BASE_URL, API_KEY, target_date, 500)

    Finishing up the Data Retrieval

    Ok, now that it is a new day, we have a clean slate and up to 500 requests that can be made to the Weather Underground history API. Our batch of 500 requests issued yesterday began on January 1st, 2015 and ended on May 15th, 2016 (assuming you didn't have any failed requests). Once again let us kick off another batch of 500 requests, but don't go leaving me for the day this time, because once this last chunk of data is collected we are going to begin formatting it into a Pandas DataFrame and deriving potentially useful features.

    # if you closed your terminal or Jupyter Notebook, reinitialize your imports and
    # variables first and remember to set your target_date to datetime(2016, 5, 16)
    records += extract_weather_data(BASE_URL, API_KEY, target_date, 500)

    Setting up our Pandas DataFrame

    Now that I have a nice and sizable records list of DailySummary named tuples I will use it to build out a Pandas DataFrame. The Pandas DataFrame is a very useful data structure for many programming tasks, and is probably best known for cleaning and processing data to be used in machine learning projects (or experiments).

    I will utilize the Pandas.DataFrame(...) class constructor to instantiate a DataFrame object. The parameters passed to the constructor are records, which represents the data for the DataFrame, and the features list I also used to define the DailySummary namedtuples, which specifies the columns of the DataFrame. The set_index() method is chained to the DataFrame instantiation to specify date as the index.

    df = pd.DataFrame(records, columns=features).set_index('date')

    Deriving the Features

    Machine learning projects, also referred to as experiments, often have a few characteristics that are a bit oxymoronic. By this I mean that it is quite helpful to have subject matter knowledge in the area under investigation to aid in selecting meaningful features to investigate paired with a thoughtful assumption of likely patterns in data.

    However, I have also seen highly influential explanatory variables and patterns arise out of having an almost naive, or at least very open and minimal, set of presuppositions about the data. Having the knowledge-based intuition to know where to look for potentially useful features and patterns, as well as the ability to look for unforeseen idiosyncrasies in an unbiased manner, is an extremely important part of a successful analytics project.

    In this regard, we have selected quite a few features while parsing the returned daily summary data to be used in our study. However, I fully expect that many of these will prove to be either uninformative in predicting weather temperatures or inappropriate candidates depending on the type of model being used, but the crux is that you simply do not know until you rigorously investigate the data.

    Now I can't say that I have significant knowledge of meteorology or weather prediction models, but I did do a minimal search of prior work on using Machine Learning to predict weather temperatures. As it turns out there are quite a few research articles on the topic, and in 2016 Holmstrom, Liu, and Vo described using Linear Regression to do just that. In their article, Machine Learning Applied to Weather Forecasting, they used weather data on the prior two days for the following measurements.

    • max temperature
    • min temperature
    • mean humidity
    • mean atmospheric pressure

    I will be expanding upon their list of features using the ones listed below, and instead of only using the prior two days I will be going back three days.

    • mean temperature
    • mean dewpoint
    • mean pressure
    • max humidity
    • min humidity
    • max dewpoint
    • min dewpoint
    • max pressure
    • min pressure
    • precipitation

    So next up is to figure out a way to include these new features as columns in our DataFrame. To do so I will make a smaller subset of the current DataFrame to make it easier to work with while developing an algorithm to create these features. I will make a tmp DataFrame consisting of just 10 records and the features meantempm and meandewptm.

    tmp = df[['meantempm', 'meandewptm']].head(10)
    tmp

    date        meantempm  meandewptm
    2015-01-01         -6         -12
    2015-01-02         -6          -9
    2015-01-03         -4         -11
    2015-01-04        -14         -19
    2015-01-05         -9         -14
    2015-01-06        -10         -15
    2015-01-07        -16         -22
    2015-01-08         -7         -12
    2015-01-09        -11         -19
    2015-01-10         -6         -12

    Let us break down what we hope to accomplish, and then translate that into code. For each day (row) and for a given feature (column) I would like to find the value for that feature N days prior. For each value of N (1-3 in our case) I want to make a new column for that feature representing the Nth prior day's measurement.

    # 1 day prior
    N = 1

    # target measurement of mean temperature
    feature = 'meantempm'

    # total number of rows
    rows = tmp.shape[0]

    # a list representing Nth prior measurements of feature
    # notice that the front of the list needs to be padded with N
    # None values to maintain a consistent length for each N
    nth_prior_measurements = [None]*N + [tmp[feature][i-N] for i in range(N, rows)]

    # make a new column name of feature_N and add to DataFrame
    col_name = "{}_{}".format(feature, N)
    tmp[col_name] = nth_prior_measurements
    tmp

    date        meantempm  meandewptm  meantempm_1
    2015-01-01         -6         -12         None
    2015-01-02         -6          -9           -6
    2015-01-03         -4         -11           -6
    2015-01-04        -14         -19           -4
    2015-01-05         -9         -14          -14
    2015-01-06        -10         -15           -9
    2015-01-07        -16         -22          -10
    2015-01-08         -7         -12          -16
    2015-01-09        -11         -19           -7
    2015-01-10         -6         -12          -11

    Ok so it appears we have the basic steps required to make our new features. Now I will wrap these steps up into a reusable function and put it to work building out all the desired features.

    def derive_nth_day_feature(df, feature, N):
        rows = df.shape[0]
        nth_prior_measurements = [None]*N + [df[feature][i-N] for i in range(N, rows)]
        col_name = "{}_{}".format(feature, N)
        df[col_name] = nth_prior_measurements

    Now I will write a loop over the features in the feature list defined earlier; for each feature that is not "date", and for N from 1 through 3, we'll call our function to add the derived features we want to evaluate for predicting temperatures.

    for feature in features:
        if feature != 'date':
            for N in range(1, 4):
                derive_nth_day_feature(df, feature, N)

    And for good measure I will take a look at the columns to make sure that they look as expected.

    df.columns

    Index(['meantempm', 'meandewptm', 'meanpressurem', 'maxhumidity',
           'minhumidity', 'maxtempm', 'mintempm', 'maxdewptm', 'mindewptm',
           'maxpressurem', 'minpressurem', 'precipm', 'meantempm_1',
           'meantempm_2', 'meantempm_3', 'meandewptm_1', 'meandewptm_2',
           'meandewptm_3', 'meanpressurem_1', 'meanpressurem_2',
           'meanpressurem_3', 'maxhumidity_1', 'maxhumidity_2', 'maxhumidity_3',
           'minhumidity_1', 'minhumidity_2', 'minhumidity_3', 'maxtempm_1',
           'maxtempm_2', 'maxtempm_3', 'mintempm_1', 'mintempm_2', 'mintempm_3',
           'maxdewptm_1', 'maxdewptm_2', 'maxdewptm_3', 'mindewptm_1',
           'mindewptm_2', 'mindewptm_3', 'maxpressurem_1', 'maxpressurem_2',
           'maxpressurem_3', 'minpressurem_1', 'minpressurem_2',
           'minpressurem_3', 'precipm_1', 'precipm_2', 'precipm_3'],
          dtype='object')

    Excellent! Looks like we have what we need. The next thing I want to do is assess the quality of the data and clean it up where necessary.

    Data Cleaning - The Most Important Part

    As the section title says, the most important part of an analytics project is to make sure you are using quality data. The proverbial saying, "garbage in, garbage out", is as appropriate as ever when it comes to machine learning. However, the data cleaning part of an analytics project is not just one of the most important parts; it is also the most time consuming and laborious. To ensure the quality of the data for this project, in this section I will be looking to identify unnecessary data, missing values, consistency of data types, and outliers, and then making some decisions about how to handle them if they arise.

    The first thing I want to do is drop any columns of the DataFrame that I am not interested in, to reduce the amount of data I am working with. The goal of the project is to predict the future temperature based on the past three days of weather measurements. With this in mind we only want to keep the min, max, and mean temperatures for each day plus all the new derived variables we added in the last sections.

    # make list of original features without meantempm, mintempm, and maxtempm
    to_remove = [feature
                 for feature in features
                 if feature not in ['meantempm', 'mintempm', 'maxtempm']]

    # make a list of columns to keep
    to_keep = [col for col in df.columns if col not in to_remove]

    # select only the columns in to_keep and assign to df
    df = df[to_keep]
    df.columns

    Index(['meantempm', 'maxtempm', 'mintempm', 'meantempm_1', 'meantempm_2',
           'meantempm_3', 'meandewptm_1', 'meandewptm_2', 'meandewptm_3',
           'meanpressurem_1', 'meanpressurem_2', 'meanpressurem_3',
           'maxhumidity_1', 'maxhumidity_2', 'maxhumidity_3', 'minhumidity_1',
           'minhumidity_2', 'minhumidity_3', 'maxtempm_1', 'maxtempm_2',
           'maxtempm_3', 'mintempm_1', 'mintempm_2', 'mintempm_3',
           'maxdewptm_1', 'maxdewptm_2', 'maxdewptm_3', 'mindewptm_1',
           'mindewptm_2', 'mindewptm_3', 'maxpressurem_1', 'maxpressurem_2',
           'maxpressurem_3', 'minpressurem_1', 'minpressurem_2',
           'minpressurem_3', 'precipm_1', 'precipm_2', 'precipm_3'],
          dtype='object')

    The next thing I want to do is to make use of some built in Pandas functions to get a better understanding of the data and potentially identify some areas to focus my energy on. The first function is a DataFrame method called info() which, big surprise... provides information on the DataFrame. Of interest is the "data type" column of the output.

    df.info()

    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 1000 entries, 2015-01-01 to 2017-09-27
    Data columns (total 39 columns):
    meantempm          1000 non-null object
    maxtempm           1000 non-null object
    mintempm           1000 non-null object
    meantempm_1        999 non-null object
    meantempm_2        998 non-null object
    meantempm_3        997 non-null object
    meandewptm_1       999 non-null object
    meandewptm_2       998 non-null object
    meandewptm_3       997 non-null object
    meanpressurem_1    999 non-null object
    meanpressurem_2    998 non-null object
    meanpressurem_3    997 non-null object
    maxhumidity_1      999 non-null object
    maxhumidity_2      998 non-null object
    maxhumidity_3      997 non-null object
    minhumidity_1      999 non-null object
    minhumidity_2      998 non-null object
    minhumidity_3      997 non-null object
    maxtempm_1         999 non-null object
    maxtempm_2         998 non-null object
    maxtempm_3         997 non-null object
    mintempm_1         999 non-null object
    mintempm_2         998 non-null object
    mintempm_3         997 non-null object
    maxdewptm_1        999 non-null object
    maxdewptm_2        998 non-null object
    maxdewptm_3        997 non-null object
    mindewptm_1        999 non-null object
    mindewptm_2        998 non-null object
    mindewptm_3        997 non-null object
    maxpressurem_1     999 non-null object
    maxpressurem_2     998 non-null object
    maxpressurem_3     997 non-null object
    minpressurem_1     999 non-null object
    minpressurem_2     998 non-null object
    minpressurem_3     997 non-null object
    precipm_1          999 non-null object
    precipm_2          998 non-null object
    precipm_3          997 non-null object
    dtypes: object(39)
    memory usage: 312.5+ KB

    Notice that the data type of every column is "object". We need to convert all of these feature columns to floats for the type of numerical analysis that we hope to perform. To do this I will use the apply() DataFrame method to apply the Pandas to_numeric function to all values of the DataFrame. The errors='coerce' parameter will convert any textual values to NaN. It is common to find textual values in data from the wild, which usually originate from the data collector where data is missing or invalid.

    df = df.apply(pd.to_numeric, errors='coerce')
    df.info()

    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 1000 entries, 2015-01-01 to 2017-09-27
    Data columns (total 39 columns):
    meantempm          1000 non-null int64
    maxtempm           1000 non-null int64
    mintempm           1000 non-null int64
    meantempm_1        999 non-null float64
    meantempm_2        998 non-null float64
    meantempm_3        997 non-null float64
    meandewptm_1       999 non-null float64
    meandewptm_2       998 non-null float64
    meandewptm_3       997 non-null float64
    meanpressurem_1    999 non-null float64
    meanpressurem_2    998 non-null float64
    meanpressurem_3    997 non-null float64
    maxhumidity_1      999 non-null float64
    maxhumidity_2      998 non-null float64
    maxhumidity_3      997 non-null float64
    minhumidity_1      999 non-null float64
    minhumidity_2      998 non-null float64
    minhumidity_3      997 non-null float64
    maxtempm_1         999 non-null float64
    maxtempm_2         998 non-null float64
    maxtempm_3         997 non-null float64
    mintempm_1         999 non-null float64
    mintempm_2         998 non-null float64
    mintempm_3         997 non-null float64
    maxdewptm_1        999 non-null float64
    maxdewptm_2        998 non-null float64
    maxdewptm_3        997 non-null float64
    mindewptm_1        999 non-null float64
    mindewptm_2        998 non-null float64
    mindewptm_3        997 non-null float64
    maxpressurem_1     999 non-null float64
    maxpressurem_2     998 non-null float64
    maxpressurem_3     997 non-null float64
    minpressurem_1     999 non-null float64
    minpressurem_2     998 non-null float64
    minpressurem_3     997 non-null float64
    precipm_1          889 non-null float64
    precipm_2          889 non-null float64
    precipm_3          888 non-null float64
    dtypes: float64(36), int64(3)
    memory usage: 312.5 KB

    Now that all of our data has the data type I want, I would like to take a look at some summary stats of the features and use the statistical rule of thumb to check for the existence of extreme outliers. The DataFrame method describe() will produce a DataFrame containing the count, mean, standard deviation, min, 25th percentile, 50th percentile (or median), 75th percentile, and max value. This can be very useful information for evaluating the distribution of the feature data.

    I would like to add to this information by calculating another output column, indicating the existence of outliers. The rule of thumb for identifying an extreme outlier is a value that is more than 3 interquartile ranges below the 25th percentile, or more than 3 interquartile ranges above the 75th percentile. The interquartile range is simply the difference between the 75th percentile and the 25th percentile.

    # Call describe on df and transpose it due to the large number of columns
    spread = df.describe().T

    # precalculate interquartile range for ease of use in next calculation
    IQR = spread['75%'] - spread['25%']

    # create an outliers column which is either 3 IQRs below the first quartile or
    # 3 IQRs above the third quartile
    spread['outliers'] = (spread['min'] < (spread['25%'] - (3 * IQR))) | (spread['max'] > (spread['75%'] + 3 * IQR))

    # just display the features containing extreme outliers
    spread.loc[spread.outliers]

                     count         mean       std    min     25%     50%      75%      max  outliers
    maxhumidity_1    999.0    88.107107  9.273053   47.0    83.0    90.0    93.00   100.00      True
    maxhumidity_2    998.0    88.102204  9.276407   47.0    83.0    90.0    93.00   100.00      True
    maxhumidity_3    997.0    88.093280  9.276775   47.0    83.0    90.0    93.00   100.00      True
    maxpressurem_1   999.0  1019.924925  7.751874  993.0  1015.0  1019.0  1024.00  1055.00      True
    maxpressurem_2   998.0  1019.922846  7.755482  993.0  1015.0  1019.0  1024.00  1055.00      True
    maxpressurem_3   997.0  1019.927783  7.757805  993.0  1015.0  1019.0  1024.00  1055.00      True
    minpressurem_1   999.0  1012.329329  7.882062  956.0  1008.0  1012.0  1017.00  1035.00      True
    minpressurem_2   998.0  1012.326653  7.885560  956.0  1008.0  1012.0  1017.00  1035.00      True
    minpressurem_3   997.0  1012.326981  7.889511  956.0  1008.0  1012.0  1017.00  1035.00      True
    precipm_1        889.0     2.908211  8.874345    0.0     0.0     0.0     0.51    95.76      True
    precipm_2        889.0     2.908211  8.874345    0.0     0.0     0.0     0.51    95.76      True
    precipm_3        888.0     2.888885  8.860608    0.0     0.0     0.0     0.51    95.76      True

    Assessing the potential impact of outliers is a difficult part of any analytics project. On the one hand, you need to be concerned about the potential for introducing spurious data artifacts that will significantly impact or bias your models. On the other hand, outliers can be extremely meaningful in predicting outcomes that arise under special circumstances. We will discuss each of these outlier-containing features and see if we can come to a reasonable conclusion as to how to treat them.

    The first set of features all appear to be related to max humidity. Looking at the data I can tell that the outlier for this feature category is due to the apparently very low min value. This indeed looks to be a pretty low value and I think I would like to take a closer look at it, preferably in a graphical way. To do this I will use a histogram.

    %matplotlib inline
    plt.rcParams['figure.figsize'] = [14, 8]
    df.maxhumidity_1.hist()
    plt.title('Distribution of maxhumidity_1')
    plt.xlabel('maxhumidity_1')

    Looking at the histogram of the values for maxhumidity the data exhibits quite a bit of negative skew. I will want to keep this in mind when selecting prediction models and evaluating the strength of impact of max humidities. Many of the underlying statistical methods assume that the data is normally distributed. For now I think I will leave them alone but it will be good to keep this in mind and have a certain amount of skepticism of it.

    Next I will look at the minimum pressure feature distribution.

    df.minpressurem_1.hist()
    plt.title('Distribution of minpressurem_1')
    plt.xlabel('minpressurem_1')

    This plot exhibits another interesting feature. From this plot, the data is multimodal, which leads me to believe that there are two very different sets of environmental circumstances apparent in this data. I am hesitant to remove these values since I know that the temperature swings in this area of the country can be quite extreme, especially between seasons of the year. These low values might have some explanatory usefulness, so I am worried about removing them, but once again I will be skeptical about them at the same time.

    The final category of features containing outliers, precipitation, is quite a bit easier to understand. Since dry days (i.e., no precipitation) are much more frequent, it is sensible to see outliers here. To me this is no reason to remove these features.

    The last data quality issue to address is that of missing values. Due to the way in which I have built out the DataFrame, the missing values are represented by NaNs. You will probably remember that I have intentionally introduced missing values for the first three days of the data collected by deriving features representing the prior three days of measurements. It is not until the third day in that we can start deriving those features, so clearly I will want to exclude those first three days from the data set.

    Look again at the output from the last time I issued the info method. There is a column of output that listed the non-null values for each feature column. Looking at this information you can see that for the most part the features contain relatively few missing (null / NaN) values, mostly just the ones I introduced. However, the precipitation columns appear to be missing a significant part of their data.

    Missing data poses a problem because most machine learning methods require complete data sets devoid of any missing data. Aside from the issue that many of the machine learning methods require complete data, if I were to remove all the rows just because the precipitation feature contains missing data then I would be throwing out many other useful feature measurements.

    As I see it I have a couple of options to deal with this issue of missing data:

    1. I can simply remove the rows that contain the missing values, but as I mentioned earlier, throwing out that much data removes a lot of value from the data.
    2. I can fill the missing values with an interpolated value that is a reasonable estimation of the true values.

    Since I would rather preserve as much of the data as I can, where there is minimal risk of introducing erroneous values, I am going to fill the missing precipitation values with the most common value of zero. I feel this is a reasonable decision because the great majority of values in the precipitation measurements are zero.

    # iterate over the precip columns
    for precip_col in ['precipm_1', 'precipm_2', 'precipm_3']:
        # create a boolean array of values representing nans
        missing_vals = pd.isnull(df[precip_col])
        # fill via .loc to avoid assigning through a chained-indexing copy
        df.loc[missing_vals, precip_col] = 0

    Now that I have filled all the missing values that I can, while being cautious not to negatively impact the quality, I would be comfortable simply removing the remaining records containing missing values from the data set. It is quite easy to drop rows from the DataFrame containing NaNs. All I have to do is call the method dropna() and Pandas will do all the work for me.

    df = df.dropna()

    Resources

    Want to learn the tools, machine learning, and data analysis used in this tutorial? Here are a few great resources to get you started:


    In this article I have described the process of collecting, cleaning, and processing a reasonably good-sized data set to be used for upcoming articles on a machine learning project in which we predict future weather temperatures.

    While this is probably going to be the driest of the articles detailing this machine learning project, I have tried to emphasize the importance of collecting quality data suitable for a valuable machine learning experiment.

    Thanks for reading and I hope you look forward to the upcoming articles on this project.

    Categories: FLOSS Project Planets

    Wallaroo Labs: Identifying Trending Twitter Hashtags in Real-time with Wallaroo

    Tue, 2017-11-14 11:00
    This week we have a guest post written by Hanee’ Medhat. Hanee’ is a Big Data Engineer, with experience working with massive data in many industries, such as Telecommunications and Banking. Overview One of the primary places where the world is seeing an explosion of data growth is in social media. Wallaroo is a powerful and simple-to-use open-source data engine that is ideally suited for handling massive amounts of streaming data in real-time.

    Codementor: 30-minute Python Web Scraper

    Tue, 2017-11-14 04:31
    I’ve been meaning to create a web scraper using Python and Selenium for a while now, but never gotten around to it. A few nights ago, I decided to give it a spin....

    Montreal Python User Group: Montréal-Python 68: Wysiwyg Xylophone

    Tue, 2017-11-14 00:00

    Please RSVP on our meetup event


    November 20th at 6:00PM


    Google Montréal 1253 McGill College #150 Montréal, QC

    We thank Google Montreal for sponsoring MP68

    • 6:00PM - Doors open
    • 6:30PM - Talks
    • 7:30PM - Break
    • 7:45PM - Talk
    • 8:30-9:00PM - End of event
    Presentations

    Va debugger ton Python! - Stéphane Wirtel

    This presentation explains the basics of Pdb as well as GDB, so that you can debug your Python scripts more easily.

    Writing a Python interpreter in Python from scratch - Zhentao Li

    I will show a prototype of a Python interpreter written entirely in Python itself (that isn't PyPy).

    The goal is to have simpler internals to allow experimenting with changes to the language more easily. This interpreter has a small core with much of the library modifiable at run time for quickly testing changes. This differs from PyPy, which aimed for full Python compatibility and speed (from JIT compilation). I will show some of the interesting things that you can do with this interpreter.

    This interpreter has two parts: a parser to transform source code to an Abstract Syntax Tree, and a runner for traversing this tree. I will give an overview of how both parts work and discuss some challenges encountered and their solutions.

    This interpreter makes use of very few libraries, and only those included with CPython.

    This project is looking for members to discuss ways of simplifying parts of the interpreter (among other things).

    The talk would be about Rasa, an open-source chatbot platform - Nathan Zylbersztejn

    Most chatbots rely on external APIs for the cool stuff such as natural language understanding (NLU) and disappoint because if and else conditionals fail at delivering good representations of our non-linear human way to converse. Wouldn’t it be great if we could 1) take control of NLU and tweak it to better fit our needs and 2) really apply machine learning, extract patterns from real conversations, and handle dialogue in a decent manner? Well, we can, thanks to Rasa. It’s open-source, it’s in Python, and it works.

    About Nathan Zylbersztejn:

    Nathan is the founder of Mr. Bot, a dialogue consulting agency in Montreal with clients in the media, banking and aerospace industries. He holds a master's in economics, a graduate diploma in computer science, and a machine learning nanodegree.


    Doug Hellmann: readline — The GNU readline Library — PyMOTW 3

    Mon, 2017-11-13 09:00
    The readline module can be used to enhance interactive command line programs to make them easier to use. It is primarily used to provide command line text completion, or “tab completion”. Read more… This post is part of the Python Module of the Week series for Python 3. See for more articles from the …

    Mike Driscoll: PyDev of the Week: Bert JW Regeer

    Mon, 2017-11-13 08:30

    This week we welcome Bert JW Regeer as our PyDev of the Week! Bert is a core developer of the Pyramid web framework. You can check out his experience over on his website or go check out his Github profile to see what projects he’s been working on lately. Let’s take a few moments to get to know Bert better!

    Can you tell us a little about yourself? (Hobbies, education, etc?):

    Oh, I have no idea where to start, but let’s give it a shot. I am first and foremost a geek, I love electronics, which in and of itself is an expensive hobby. Always new toys to play with and buy. I studied Computer Science at the University of Advancing Technology, and have been known to spend a lot of time building cool hardware based projects too. Spent a lot of time on FIRST Robotics, first in HS and then mentoring HS students while in college. Lately the only hardware I get to play with though is home automation, installing new switches and sensors to make my laziness even more lazy!

    My other major hobbies all have something in common, they are expensive ones: photography and cars. I am a bit of an amateur photographer and am always looking to get new lenses or find new ideas on how to shoot something new and exciting. My next goal is to do some astrophotography and I am looking to get a nice wide lens with a nice large aperture. I live in Colorado so there are plenty of gorgeous places to go photograph. I drive a Subaru WRX, and I absolutely love going for rides. Been eyeing some upgrades to my car, but so far she is still stock. I enjoy going out and driving around the mountains though, which goes hand in hand with the photography!

    Last but not least, lately I have gotten into making my own sourdough bread. There is nothing better than a freshly baked sourdough bread with a little butter. It’s my newest and most recent hobby, and it is also the one that costs almost nothing. I get to make healthier bread, share it with friends and family, and it costs pennies! I work at a small company named Crunch and a few of my colleagues are bread bakers too, which allows us to share tips and ideas on how to improve our breads.

    I really should update my website to include this, but on a different note, you can find some of my older projects there!

    Why did you start using Python?

    I was introduced to Python while I was at university, at the time (2008 time frame) I didn’t really think much of it other than just a quick way to prototype some ideas before translating them into C/C++. At the time I was a pretty big proponent of C++, and working mostly in the backend it was a pretty natural fit. Write your applications in C++ and run on FreeBSD.

    It wasn’t until I was at my first programming job where we had to quickly write a new server component that I started reaching for Python first and foremost to deliver the project. After quickly prototyping I was able to prove that Python would provide us with the speed and throughput required while alleviating some of the development concerns we were worried about. As time went on there were still components I wrote in C++, but a large part of our service ended up being written in Python due to the speed of development.

    For personal use, I had always written my websites in PHP since it was always available and easy to use, but I never really did enjoy using any of the frameworks built for PHP. All of them were brand new at the time and felt incredibly heavy weight, and because I was using more Python at work it was getting confusing, when do I need to use tabs and when do I use curly braces? It always took me a minute or two to context switch from one programming language to another, so I started looking at Django, Flask, Pylons, Pyramid, Google App Engine and others. I ended up settling on Pyramid due to its simplicity and because it allowed me to pick the components I wanted to use. I ended up becoming the maintainer for WebOb and recently have become a full core contributor to the Pylons Project.

    What other programming languages do you know and which is your favorite?

    This is going to be an incredibly long list, so let’s go with the ones I have used extensively and not just for a toy project here and there. As already mentioned I started out with C/C++, the first language I learned from a C++ for Dummies book when I was 12; since then I’ve run the gamut, but PHP was fairly big for me for a while. In high school and university Java of course was used, although it still is my least favorite language, I have used it extensively on some Android projects. I’ve worked on a project that was Objective C, with a little bit of Swift, mainly doing a security audit so I am not sure it really counts as extensively… currently the two languages I use most are Python and JavaScript (or Typescript, transpiled to JavaScript). ES6/ES7 (yay for Babel) are heavily used in various projects.

    Python however has definitely become my favorite programming language. It is incredibly versatile and even though it is not the fastest language by far, it is one of the most flexible and I can appreciate how easy it makes my life. Are there things I’d like to see change in Python? Absolutely. Are there pieces missing? Sure. At the same time I am not sure what other language I would enjoy working in as much as I currently do with Python. I’ve tried Golang, it’s just not for me. Rust comes pretty close, but it feels too much like C/C++ and requires a lot more thinking than I think is necessary for the things I am working on.

    What projects are you working on now?

    Outside of work, just a bunch of open source currently. As I am writing this I am preparing a talk for PloneConf where I am going to talk about taking over maintenance of well loved projects, specifically WebOb which is a Python WSGI/HTTP request/response library that was written by Ian Bicking and is now maintained by the Pylons Project with me as lead.

    The Pylons Project is a collective of people that maintain a bunch of different pieces of software, mostly for the HTTP web. Pyramid the web framework, WebOb the WSGI/HTTP request/response library that underlies it, webtest for testing your WSGI applications, form generation, pure HTTP web server and more. We don’t have a lot of people, so there is a lot of work to be done. Releasing new versions of existing software, accepting/reviewing patches and reducing the issue count faster than the issues continue to be generated!

    There are also many unfinished projects; my Github is a veritable graveyard of projects that I’ve started and never finished. Great aspirations, just find that if I am doing things for myself once I have figured out the hard part, once I’ve solved “the problem”, completing a project is not nearly as much fun, so off I go to the next project. I always learn something new, just feel bad that it is mostly half-finished code that no-one else can really benefit from.

    Which Python libraries are your favorite (core or 3rd party)

    I believe it started as a third-party, but is now baked into every Python installation (unless you are on one of those special Linux distributions that rips it out and makes it a separate installable), and that is the virtual environment tool venv or virtualenv. It makes it much simpler to have different environments with different libraries installed. Being able to separate out all of my different projects and not have to install globally is amazing! C/C++ make this much more difficult especially if you need to include linker flags and all kinds of fun stuff, and pkg-config and friends only get you so far. Similar systems exist for other languages, but it is by far my favorite part about working with Python.

    Is there anything else you’d like to say?

    We are always looking for new developers to join us in the Pylons Project, if you are looking for someone to mentor you, please reach out and we will do our best. This year we had an absolutely fantastic Google Summer of Code with Ira and I’d be happy to help introduce more new people to not just the Pylons Project but to open source in general.

    Thanks for doing the interview!


    Tryton News: Tryton on Docker

    Mon, 2017-11-13 04:00

    In order to ease the adoption of Tryton, we are publishing Docker images on the Docker hub for all series starting from 4.4.

    They contain the server, the web client and all modules of the series. They are periodically updated and tagged per series. They work with the postgres image as the default storage back-end.

    The usage is pretty simple:

    $ docker run --name postgres -e POSTGRES_DB=tryton -d postgres
    $ docker run --link postgres:postgres -it tryton/tryton trytond-admin -d tryton --all
    $ docker run --name tryton -p 8000:8000 --link postgres:postgres -d tryton/tryton

    Then you can connect using: http://localhost:8000/


    Full Stack Python: DevOps, Thank You Maintainers and Contributing to Open Source

    Mon, 2017-11-13 00:00

    DevOps, Continuous Delivery... and You is a blog post with the slides and notes based on a class I taught at the University of Virginia this past week. The talk is relevant as a brief introduction to DevOps and Continuous Delivery, especially for junior developers and less-technical managers of software teams. I'm experimenting with the "talk as blog post" style so let me know via email or a tweet if you enjoy it and would want to see future technical talks in that format.

    Speaking of feedback on projects, this GitHub issue thread named "thank you" is incredible to read. The issue ticket blew up on the front page of Hacker News as an example of how powerful genuine positive comments can be for project maintainers. Every time I get a thank you tweet (like this one), email or GitHub issue it certainly helps to motivate me to continue working hard on Full Stack Python.

    Contributing to open source is a recent Talk Python to Me podcast episode in the same vein as thanking your maintainer. Working on open source projects with your own contributions to documentation or simple bug fixes can be a great way to become a better programmer. I particularly enjoyed the recommendations of the panel to cut your teeth on smaller open source projects rather than trying to jump into a massive codebase like Django or the CPython implementation. Take a listen to that podcast episode if you are new to open source or have been wondering how to get involved.

    As always, send me an email or submit an issue ticket on GitHub to let me know how to improve Full Stack Python as I continue to fill in the table of contents with new pages and new tutorials.


    Yasoob Khalid: Introduction to Machine Learning and its Usage in Remote Sensing

    Sat, 2017-11-11 16:34

    Hey guys! I recently wrote a review paper regarding the use of Machine Learning in Remote Sensing. I thought that some of you might find it interesting and insightful. It is not strictly a Python focused research paper but is interesting nonetheless.

    Introduction to Machine Learning and its Usage in Remote Sensing

    1. Introduction

    Machines have allowed us to do complex computations in short amounts of time. This has given rise to an entirely different area of research which was not being explored: teaching machines to predict a likely outcome by looking at patterns. Machine Learning is being used to solve almost all kinds of problems ranging from Stock Market predictions to medical formulae synthesis.

    There are multiple famous machine learning algorithms in use today and new algorithms are popping up every other day. Some of the widely known algorithms are:

    1. Support Vector Machines
    2. Neural Networks
    3. Random Forests
    4. K-Nearest Neighbors
    5. Decision Trees
    6. K-Means
    7. Principal Component Analysis

    Several important steps are involved in getting machines to produce dependable and reliable predictions.

    2. Machine Learning in Remote Sensing

    The roots of machine learning in remote sensing date back to the 1990s. It was initially introduced as a way to automate knowledge-base building for remote sensing. In their paper, Huang and Jensen (1997) talk about how a knowledge-base was built using minimal input from human experts and then decision trees were created to infer the rules from the human input for the expert system. The generated rules were used at a study site on the Savannah River. The conclusion details how the proposed machine-learning assisted expert system approach yielded the highest accuracy compared to conventional methods at that time. After several similar developments, machine learning was soon adopted as an important tool by the remote sensing community. Now it is being used in all sorts of projects, from unsupervised satellite image scene classification (Li, et al. 2016) to the classification of Australian native forests (Shang & Chisholm, 2014). Now we will take a look at the typical machine learning workflow.

    3. Machine Learning Workflow

    It is often important to acquaint yourself with the workflow involved. Machine Learning, too, has a workflow which is common to most machine learning based projects.

    • Gathering Data
    • Cleaning Data
    • Model building & selecting the right algorithm
    • Gaining Insights from the results
    • Visualizing the data

    In remote sensing one gathers data mostly using satellites or aerial drones. Data cleaning comes in when our dataset has incomplete or missing values, and algorithm selection involves getting acquainted with the problem which one is trying to solve (more on this later). If one is making a model just for predictions and not specifically for gaining insights, then the workflow ends here and one gets started with implementing the trained model in production. However, if one is writing a research paper or wants to gain insights, then one can chart the results using a graphing library and draw insights from the charted data. We will cover the data cleaning and the model building parts in this paper.

    3.1 Data Cleanup

    This process involves cleaning up textual and/or image-based data and making the data manageable (which sometimes might involve reducing the number of variables associated with a record).

    3.1.1 Textual Data

    Oftentimes, one might encounter missing values in one’s dataset. One has to decide whether to try and fill in the missing data by “guessing” the missing values using the neighbouring data or to drop that particular record altogether. Dropping records seems like a viable option but it might not be feasible if the dataset is already quite small. So one has to resort to filling in the incomplete data cells. There are multiple ways to do this but the easiest one is to take the neighbouring values and calculate an average.
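    The neighbour-averaging idea above can be sketched in a few lines of pandas; the series and its values here are made up purely for illustration. For a single gap in an evenly spaced series, linear interpolation is exactly the average of the two neighbouring readings:

```python
import numpy as np
import pandas as pd

# Hypothetical series of daily readings with one missing value
temps = pd.Series([21.0, 22.5, np.nan, 24.0, 23.5])

# Linear interpolation fills the gap with the average of its neighbours
filled = temps.interpolate(method='linear')
print(filled[2])  # (22.5 + 24.0) / 2 = 23.25
```

    For longer runs of missing values, or data that is not evenly spaced, one would need to weigh whether interpolation is still a reasonable estimate or whether dropping the records is safer.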

    3.1.2 Image Data

    Data cleanup also involves manipulating images which might contain some artifacts that can interfere with one’s classification algorithms. Nath et al. (2010) in their paper about water-body area extraction tackle this exact problem. The images that they had contained building shadows which can very easily be confused with water-bodies. They partially solved this problem by calculating the entropy of the image and then using it to segment the image. Entropy refers to randomness. A water-body has less randomness when compared with its surroundings, so it is possible to extract the water-body area by segmenting the image based on the difference in the pixel colors. In other instances the image dataset might contain some blurry images which can gravely affect the accuracy of our algorithm in the training stage. One needs to get rid of such images in the data cleanup step.
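    The exact pipeline of Nath et al. is not reproduced here, but the entropy intuition can be sketched with plain NumPy on synthetic stand-in patches: a uniform "water-like" patch has near-zero grey-level histogram entropy, while a noisy "land-like" patch has high entropy, which is what makes entropy a usable segmentation cue:

```python
import numpy as np

def patch_entropy(patch, bins=16):
    """Shannon entropy of the grey-level histogram of an image patch."""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]  # ignore empty bins
    return -np.sum(p * np.log2(p))

# Synthetic patches: a smooth "water" patch vs a noisy "land" patch
rng = np.random.default_rng(0)
water = np.full((32, 32), 120.0)            # uniform -> low entropy
land = rng.uniform(0, 255, size=(32, 32))   # noisy -> high entropy

print(patch_entropy(water), patch_entropy(land))
```

    A real segmentation would slide such a window over the whole image and threshold the resulting entropy map.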

    3.1.3 Multiple Features

    Oftentimes when one records data in the field of remote sensing, one is essentially recording multispectral or hyperspectral data (Shang, et al. 2014). This means that each record will have a lot of variables. If one tries to plot the dataset, one might not be able to make any sense of it because one will have a lot of pairwise correlations to think about if one plots a plethora of variables. To interpret the data more meaningfully, one needs some way to reduce the number of variables. This is where Principal Component Analysis (PCA) comes in: it will reduce the number of variables to a few interpretable linear combinations of the data. Each linear combination will correspond to a principal component. There are numerous tools available to help one with PCA. If one is utilizing the famous scikit-learn library, there is a PCA implementation which one can use.
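    As a minimal sketch of scikit-learn's PCA (using random synthetic data standing in for hyperspectral records; real band values would come from the sensor):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical hyperspectral records: 100 pixels x 50 spectral bands
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 50))

# Reduce the 50 bands to 3 principal components
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (100, 3)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```

    The explained variance ratio tells one how much information the retained components preserve, which guides how many components to keep.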

    3.2 Types of Machine Learning Algorithms

    There are three broad classes of machine learning algorithms. One class is supervised machine learning, the second is unsupervised machine learning, and the third is reinforcement learning. The difference between supervised and unsupervised is that while using supervised algorithms, one has a dataset containing the output column, whereas while using unsupervised algorithms, one only has a huge dataset and it is the duty of the algorithm to cluster the dataset into different classes based on the relations it has identified between different records. Reinforcement learning is slightly different: one provides the algorithm with an environment and the algorithm takes decisions in that environment. It keeps on improving itself with each decision based on the feedback it gets for its last decision. We will now discuss three famous algorithms being used in remote sensing.

    3.2.1 Random Forest

    Random forest algorithms are increasing in popularity in the Remote Sensing community (Belgiu, et al. 2016) because of the accuracy of their classifications. These are ensemble classifiers, which basically means that they make use of multiple decision trees underneath. A major reason for the popularity of RF classifiers is that they help in alleviating the high dimensional problem. They provide a variable importance (VI) measurement which allows one to reduce the number of dimensions of hyperspectral data. Variable Importance is essentially the measure of how much change in a specific input affects the output.
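    The variable importance measurement can be sketched with scikit-learn's RandomForestClassifier; the data below is synthetic (in practice the columns would be spectral bands), constructed so that only the first column carries the class signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for band data: only "band" 0 determines the class
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Variable importance: band 0 should dominate the ranking
print(rf.feature_importances_.argmax())
```

    In a hyperspectral setting one would use this ranking to discard uninformative bands and shrink the dimensionality of the problem.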

    3.2.2 Support Vector Machines

    SVMs are supervised learning models which can be used for regression as well as classification problems, though they are mostly used for classification. They work by plotting the records as points in an n-dimensional space (one dimension per feature) and then finding the hyperplane which best divides those points. SVMs are being used in almost all types of classification problems in remote sensing, from forest categorization (Shang & Chisholm, 2014) to segmentation of multispectral remote sensing images (Mitra, et al. 2004). Just like other algorithms, their success depends on the nature of the problem, and one has to test each algorithm separately and then take a decision based on the performance of each.
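
    A minimal SVM classification sketch in scikit-learn, again on synthetic data (dataset shape and kernel choice are assumptions for illustration):

    ```python
    # Train an SVM classifier and evaluate it on a held-out test set.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=400, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    clf = SVC(kernel="rbf")   # finds the separating hyperplane in kernel space
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))   # fraction of test points classified correctly
    ```
    
    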

    3.2.3 Artificial Neural Networks

    Neural networks are a class of machine learning algorithms which try to mimic the way our brains work. The first application of neural networks (NN) in remote sensing was reported in 1988 (Kanellopoulos and Wilkinson 1997). Artificial Neural Networks (ANNs) are biologically inspired simulations run on a computer to perform specific tasks like pattern recognition, clustering, and classification. Their popularity has increased a lot recently due to technical advancements made possible by ANNs; one example is AlphaGo defeating the world champion of the game Go, which had never been done before and was considered a great feat. Accurate land cover classification used to be done mostly by statistical classifiers, but ANNs have now taken their place because they provide an accurate way to classify land cover and geophysical features without having to rely on statistical assumptions or procedures. ANNs “learn” different patterns in images on their own (by using artificial neurons) with a minimal set of inputs. They are also referred to as black-box algorithms because it is often hard to figure out how an ANN arrives at its outputs.
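
    A small ANN can be sketched with scikit-learn's multi-layer perceptron; the layer sizes below are arbitrary illustrative choices, not a recommendation for real imagery:

    ```python
    # A tiny feed-forward ANN: two hidden layers of artificial neurons.
    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=500, n_features=12, random_state=0)

    ann = MLPClassifier(hidden_layer_sizes=(32, 16),  # two hidden layers
                        max_iter=1000, random_state=0)
    ann.fit(X, y)
    print(ann.score(X, y))   # training accuracy
    ```

    For real remote-sensing imagery, deeper architectures (e.g. the CNNs of Maggiori et al. 2017) are typically built with dedicated deep-learning libraries instead.
    
    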

    4. Overfitting and Bias

    Most of the time when you are developing a model for predicting/classifying images, you have a big dataset for training and testing your algorithm. We split the dataset into roughly a 75:25 ratio, where 75% of the data is used for training and 25% is used for evaluating the performance of the model after it has been trained. 75:25 is not a hard ratio; you can use any other split that suits your data. The only things you have to take care of are that the training segment should be an unbiased representation of the whole dataset and that it should not be too small compared to the testing segment. Unbiased means that it should not contain only one type of record; it should include almost every type of record in the dataset so that the model is trained on every different kind of input. If the training dataset is too small, you might not get reliable predictions because the model has not been trained on every different type of input.
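
    The 75:25 split above can be sketched with scikit-learn's `train_test_split`; the `stratify=y` argument addresses the "unbiased representation" point by keeping class proportions the same in both halves:

    ```python
    # Split a dataset 75:25, preserving class proportions in both parts.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=400, n_features=10, random_state=0)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)

    print(len(X_train), len(X_test))   # 300 training records, 100 test records
    ```
    
    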

    Overfitting is another problem which you need to take care of. Overfitting generally means making an overly complex model to explain idiosyncrasies and outliers in the data under study. The result is that if you evaluate the model on the same type of data on which it has been trained, you will get a very high prediction/classification accuracy; however, if you modify the input just a little (to something which the model has not seen before), the accuracy takes a dip. You can reduce overfitting by using a bigger dataset and segmenting the dataset properly. Additionally, it is beneficial to reduce the complexity of the model so that not every extreme edge case is being fitted.
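
    One way to see this dip is to compare training and test accuracy for models of different complexity. A quick illustrative sketch, using decision trees on noisy synthetic data (the depth limit and noise level are arbitrary choices):

    ```python
    # An unconstrained tree memorizes the noisy training data (train accuracy 1.0)
    # but drops on unseen data; a depth-limited tree generalizes better.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                               flip_y=0.2, random_state=0)  # 20% label noise
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

    print("deep tree:    train %.2f, test %.2f"
          % (deep.score(X_train, y_train), deep.score(X_test, y_test)))
    print("shallow tree: train %.2f, test %.2f"
          % (shallow.score(X_train, y_train), shallow.score(X_test, y_test)))
    ```

    The gap between training and test accuracy for the deep tree is the signature of overfitting.
    
    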

    5. Which algorithm is the best one?

    The answer to this question depends on the problem which one is trying to solve. In some cases, when you have many dimensions but limited records, SVM might work better. If you have a lot of records but fewer dimensions (features), neural networks might yield a better prediction/classification accuracy. One often has to test multiple algorithms on the dataset and choose the one which works best. Oftentimes it is also necessary to tune various parameters for the different algorithms (e.g., variable importance thresholds for RF, the number of hidden layers and neurons for neural networks, and the decision function shape for SVMs). A better accuracy may often be achieved by combining multiple algorithms together; this is called an ensemble. It is possible, for example, to combine SVM and neural networks, or SVM and RF (the possibilities are endless) to improve the prediction accuracy. Again, one will have to test multiple ensembles in order to choose the best one.
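
    One simple way to combine algorithms, sketched with scikit-learn's `VotingClassifier` (which classifiers to combine, and soft vs. hard voting, are choices you would tune per problem):

    ```python
    # Combine SVM, Random Forest and an ANN into one majority-voting ensemble.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=400, n_features=10, random_state=0)

    ensemble = VotingClassifier([
        ("svm", SVC(probability=True, random_state=0)),
        ("rf",  RandomForestClassifier(n_estimators=50, random_state=0)),
        ("ann", MLPClassifier(max_iter=1000, random_state=0)),
    ], voting="soft")   # average the predicted class probabilities

    ensemble.fit(X, y)
    print(ensemble.score(X, y))
    ```
    
    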

    It is also important to note that the prediction accuracy might change based on which particular feature one is trying to use for classification/prediction purposes. For instance, Shang and Chisholm (2014) discuss how, when classifying Australian native forest species with state-of-the-art remote sensing algorithms, they classified trees at the leaf, canopy, and community levels. They tested various algorithms (SVM, AdaBoost and Random Forest) and found that a different algorithm performed best at each level: at the leaf level, Random Forest achieved the best classification accuracy (94.7%), while Support Vector Machine outperformed the other algorithms at both the canopy (84.5%) and community (75.5%) levels.

    Another factor which can affect one’s algorithm choice is whether the data is linearly separable or not. Linear classification algorithms (SVM with a linear kernel, logistic regression, etc.) expect that the data can be divided by a straight line (or hyperplane) in feature space. The assumption of linear separability may hold in some scenarios but not in others, and when it does not, it will bring down the prediction/classification accuracy. Hence, we need to make sure that the algorithm used is able to handle the kind of available data.
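
    A quick way to see this is the classic "two moons" dataset, which no straight line can separate; an RBF-kernel SVM handles it, while a linear SVM cannot (the noise level is an arbitrary illustrative choice):

    ```python
    # Linear vs. RBF-kernel SVM on non-linearly-separable data.
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=400, noise=0.1, random_state=0)

    linear = SVC(kernel="linear").fit(X, y)
    rbf = SVC(kernel="rbf").fit(X, y)

    print("linear kernel:", linear.score(X, y))  # limited by the straight-line assumption
    print("rbf kernel:   ", rbf.score(X, y))     # bends the decision boundary
    ```
    
    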

    It is not possible to look at an algorithm and decide theoretically whether it will yield the best results for your dataset, because many machine-learning algorithms are black boxes: it is hard to see how the algorithm arrives at a specific result. Therefore, it is useful to first narrow down your algorithm choices based on the type of problem, then apply the shortlisted algorithms to a part of your dataset and see which one performs best.

    6. Conclusion

    In this paper we looked at what machine learning is, how it was first introduced into the world of remote sensing, what a typical workflow is like, and what kinds of problems are being solved using machine learning. Machine learning has a bright future because more and more people are learning the basics and applying them in their regular jobs and research. New algorithms are cropping up every other day, and classification accuracy is improving along with them. Problems in remote sensing (such as mapping land cover) which once seemed difficult or even impossible are being solved by new algorithms every single day. It is not far-fetched to say that in the near future most analysis work done in the world today will be handled by machine learning algorithms.


    References

    • Belgiu, M., & Drăguţ, L. (2016). Random forest in remote sensing: A review of applications and future directions. ISPRS Journal of Photogrammetry and Remote Sensing, 114, 24-31, doi:10.1016/j.isprsjprs.2016.01.011
    • Rodriguez-Galiano, V. F., & Chica-Rivas, M. (2014). Evaluation of different machine learning methods for land cover mapping of a Mediterranean area using multi-seasonal Landsat images and Digital Terrain Models. International Journal of Digital Earth, 7(6), 492-509, doi:10.1080/17538947.2012.748848
    • Shang, X., & Chisholm, L. A. (2014). Classification of Australian native forest species using hyperspectral remote sensing and machine-learning classification algorithms. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(6), 2481-2489, doi:10.1109/JSTARS.2013.2282166
    • Nath, R. K., & Deb, S. K. (2010). Water-body area extraction from high resolution satellite images-an introduction, review, and comparison. International Journal of Image Processing (IJIP), 3(6), 353-372.
    • Li, Y., Tao, C., Tan, Y., Shang, K., & Tian, J. (2016). Unsupervised multilayer feature learning for satellite image scene classification. IEEE Geoscience and Remote Sensing Letters, 13(2), 157-161, doi:10.1109/LGRS.2015.2503142
    • Huang, X., & Jensen, J. R. (1997). A machine-learning approach to automated knowledge-base building for remote sensing image analysis with GIS data. Photogrammetric engineering and remote sensing, 63(10), 1185-1193.
    • Mitra, P., Shankar, B. U., & Pal, S. K. (2004). Segmentation of multispectral remote sensing images using active support vector machines. Pattern recognition letters, 25(9), 1067-1074, doi:10.1016/j.patrec.2004.03.004
    • Melgani, F., & Bruzzone, L. (2004). Classification of hyperspectral remote sensing images with support vector machines. IEEE Transactions on geoscience and remote sensing, 42(8), 1778-1790, doi:10.1109/TGRS.2004.831865
    • Kubat, M., Holte, R. C., & Matwin, S. (1998). Machine learning for the detection of oil spills in satellite radar images. Machine learning, 30(2-3), 195-215, doi:10.1023/A:1007452223027
    • Maggiori, E., Tarabalka, Y., Charpiat, G., & Alliez, P. (2017). Convolutional neural networks for large-scale remote-sensing image classification. IEEE Transactions on Geoscience and Remote Sensing, 55(2), 645-657, doi:10.1109/TGRS.2016.2612821
    • Pedregosa, F., Varoquaux, G., Gramfort, A. & Michel, V. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
    • Jensen, R. R., Hardin, P. J. and Yu, G. (2009), Artificial Neural Networks and Remote Sensing. Geography Compass, 3: 630–646. doi:10.1111/j.1749-8198.2008.00215.x

    I hope you guys enjoyed the paper. I am open to any comments and suggestions. Till next time!

