(PyCon 2014 Video) How To Get Started with Machine Learning – Melanie Warrick’s PyCon 2014 Talk

Save $500 on the machine learning course for developers (May 10-11, 2014 at Hackbright Academy). Use promo code “HB-BLOG” to save $500 when you register for the Hackbright machine learning course here.

Melanie Warrick, data scientist and software engineer for hire, Zipfian and Hackbright graduate, gives a talk at the annual PyCon 2014 conference.

Watch her full PyCon 2014 talk on machine learning:

Arthur Samuel defined machine learning as a “field of study that gives computers the ability to learn without being explicitly programmed”. Machine learning is about applying algorithm(s) in a program to solve the problem you are faced with and address the type of data that you have. You create a model that will help conduct pattern matching and/or predict results. Then you evaluate the model and iterate on it as needed to create the right type of solution for the problem.

Examples of machine learning (ML) in the real world include handwriting analysis, where neural nets regularly read millions of pieces of mail to sort and classify the many variations in written addresses. Weather prediction, fraud detection, search, and facial recognition are further examples of machine learning in the wild.

Algorithms

There are several types of ML algorithms to choose from and apply to a problem – some are listed below and are broken into categories to give an approach on how to think about applying them. When choosing an algorithm, it’s important to think about the goal/problem, the type of data available and the time and effort that you have to work on the solution.

Machine Learning Algorithms

A good starting point is whether the problem is supervised or unsupervised. Supervised means you have actual data that represent the results you are targeting, which you can use to train the model. Spam filters, for example, are built on actual data that have been labeled as spam. Unsupervised data doesn’t have a clear picture of the result. For unsupervised learning, you will have questions about the data, and you can run algorithms on it to see if patterns emerge that help tell a story. Unsupervised learning is a challenging approach, and typically there isn’t a “right” answer for the solution.
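To make the distinction concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made up for illustration and not taken from the talk: the toy scores, the `train_centroids` and `nearest_centroid` helpers (a very simple supervised classifier), and the `two_means` helper (a bare-bones one-dimensional k-means).

```python
# Toy illustration only: the data and helper names are invented for this sketch.

def mean(values):
    return sum(values) / len(values)

# Supervised: every example comes with the label we want to predict.
labeled = [(1.0, "ham"), (1.5, "ham"), (2.0, "ham"),
           (8.0, "spam"), (8.5, "spam"), (9.0, "spam")]

def train_centroids(examples):
    """Average the feature value for each label."""
    by_label = {}
    for x, label in examples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}

def nearest_centroid(centroids, x):
    """Predict the label whose average is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

centroids = train_centroids(labeled)
prediction = nearest_centroid(centroids, 7.2)

# Unsupervised: the same numbers without labels; we only look for groups.
unlabeled = [x for x, _ in labeled]

def two_means(values, iterations=10):
    """Split 1-D values into two clusters (assumes at least two distinct values)."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(group_low), mean(group_high)
    return sorted(group_low), sorted(group_high)

clusters = two_means(unlabeled)
```

In the supervised half the labels tell the model what "spam" looks like. In the unsupervised half the algorithm only reports that two groups exist, and it is up to you to decide what they mean.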

In addition, whether the data is continuous (e.g. height, weight) or categorical/discrete (e.g. male/female, Canadian/American) helps determine the type of algorithm to apply. Basically, it’s about whether the data falls into a set number of definable units or whether the variations in the data are nearly infinite. These are some ways to evaluate what you have to help identify an approach to solve the problem.

Note that the algorithm categorization has been simplified a bit to help provide context, and some of the algorithms do cross the above boundaries (e.g. linear regression).

Models

Once you have the data and an algorithmic approach, you can work on building a model. A model can be something as simple as an equation for a line (y=mx+b) or as complex as a neural net with many layers and nodes.

Linear Regression is a machine learning algorithm and a simple one to start with, where you find the best fit line to represent observed data. In the talk, I showed two different examples of having observed data that exhibited some type of linear trend. There was a lot of noise (data was scattered around the graph), but there was enough of a trend to demo linear regression.

When building a model with linear regression, you want to find the optimal slope (m) and intercept (b) based on the actual data. See, algebra is actually applicable in the real world. The algorithm is simple enough that you can calculate the model yourself, but it’s better to leverage tools like scikit-learn to calculate the best fit line more efficiently. What you are calculating is the line that minimizes the distance between itself and all the observed data points.
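As a rough sketch of calculating the model yourself, the snippet below uses the standard closed-form least-squares solution, which minimizes the sum of squared vertical distances between the line and the points. The `fit_line` helper and the data are assumptions for illustration; they are not the code or the data from the talk.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the best-fit line passes through the point of means.
    b = y_mean - m * x_mean
    return m, b

# Synthetic points scattered around y = 2x + 1 (not data from the talk).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
m, b = fit_line(xs, ys)
```

With these points the fitted slope and intercept land close to the 2 and 1 used to generate them, which is what you would expect from a small amount of noise.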

After generating a model, evaluate the performance and iterate to improve the model as needed if it is not performing as expected. I have explained linear regression here.
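One way to evaluate a fitted line is to compare its predictions with the observed values. The sketch below computes two common measures, R-squared and root mean squared error (RMSE). The choice of these two metrics, the helper names, and the numbers are my assumptions for illustration; the talk may have evaluated the model differently.

```python
import math

def r_squared(ys, predictions):
    """Share of the variance in ys that the predictions explain."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def rmse(ys, predictions):
    """Typical size of a prediction error, in the units of y."""
    squared_errors = [(y - p) ** 2 for y, p in zip(ys, predictions)]
    return math.sqrt(sum(squared_errors) / len(ys))

# Synthetic observations and a hypothetical fitted line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.5, 6.5, 9.0]
predictions = [2 * x + 1 for x in xs]

score = r_squared(ys, predictions)
error = rmse(ys, predictions)
```

An R-squared near 1 and a small RMSE suggest the line tracks the data well. If either looks poor, that is the signal to iterate on the model or the data.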

Prediction

When you have a good model, you can take in new data and output predictions. Those predictions can feed into some type of data product or generate results for a report or visualization.

In my presentation, I used actual head size and brain weight data to build a model that predicts brain weight based on head size. Since the data set was fairly small, the model has less predictive power and more potential for error. I went with this data since it was a demo, and I wanted to keep it simple. When graphed, the observed data was spread out, which also indicated error and a lot of variance in the data. So the model predicts brain weight with a good amount of variance.

With the linear model I built, I was able to feed it a head size (x) and have it calculate the predicted brain weight (y). Other models are more complex in their underlying math and application. To see the full code solution, check out the GitHub repository as noted above. The script is written a little differently from the slides because I created functions for each of the major steps. Also, there is an IPython notebook that shows some of the drafts I worked through to build out the code for the presentation.
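The prediction step itself is just the line equation applied to a new x. Here is a minimal sketch; the `make_predictor` helper is hypothetical, and the slope and intercept are placeholder numbers, not the coefficients fitted in the talk.

```python
def make_predictor(m, b):
    """Return a function that maps head size (x) to predicted brain weight (y)."""
    def predict(head_size):
        return m * head_size + b
    return predict

# Placeholder coefficients for illustration, NOT the values from the talk.
predict_brain_weight = make_predictor(m=0.25, b=300.0)
predicted = predict_brain_weight(4000.0)
```

Wrapping the coefficients in a function keeps the fitted model separate from the code that consumes predictions, such as a report or a visualization.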

Tools

The Python stack is becoming popular for scientific computing because of the well-supported toolsets. Below is a list of key tools to start learning if you want to work with ML. There are many other Python libraries out there for more nuanced needs in the space as well as other stack packages to explore (R, Java, Julia).

If you are trying to figure out where to start, here are my recommendations:

• Scikit-Learn = machine learning algorithms
• Pandas = dataframe tool
• NumPy = matrix manipulation tool
• SciPy = stats models
• Matplotlib = visualization
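To give a feel for how two of these tools fit together, here is a small sketch that fits a line with NumPy and scikit-learn. The data is synthetic and chosen to lie exactly on y = 2x + 1; it is an assumption for illustration, not the data set from the talk.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 2x + 1 (not data from the talk).
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # scikit-learn expects a 2-D feature array
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]          # fitted m
intercept = model.intercept_    # fitted b
new_prediction = model.predict(np.array([[5.0]]))[0]
```

From here, Pandas would typically handle loading and cleaning the data, and Matplotlib would plot the observed points against the fitted line.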

Skills

In order to work with ML algorithms and problems, it’s important to build out your skill set regarding the following:

• Algorithms
• Statistics (probability, inferential, descriptive)
• Linear Algebra (vectors & matrices)
• Data Analysis (intuition)
• SQL, Python, R, Java, Scala (programming)
• Databases & APIs (get data)

Resources

Below is a beginning list of resources to get you started.

I highly recommend Andrew Ng’s class and a couple of links are to sites with more recommendations on what to check out next:

• Andrew Ng’s Machine Learning on Coursera
• Khan Academy courses in linear algebra and stats
• “Think Stats” by Allen Downey
• Zipfian’s “A Practical Intro to Data Science”
• Metacademy
• Open Source Data Science Masters
• Stack Overflow, Data Tau, Kaggle
• Mentors

One point to note from this list and I stressed this in the talk – seek out mentors! They are out there and willing to help. You have to put it out there what you want to learn and then be aware when someone offers to help. Also, follow up. Don’t stalk the person but reach out to see if they will make a plan to meet you. They may only have an hour or they may give you more time than you expect. Just ask and if you don’t get a good response or have a hard time understanding what they share, don’t stop there. Keep seeking out mentors. They are an invaluable resource to get you much farther faster.

Last Point to Note

ML is not the solution for everything and many times can be overkill. You have to look at the problem you are working on and at how much data you have available to determine what makes the most sense for your solution.

I highly recommend looking for the simple solution first before reaching for something more complex and time-consuming. Sometimes regex is the right answer and there is nothing wrong with that. As mentioned, to figure out an approach, it’s good to understand the problem, the data, the amount of data you have, and the time you have to turn the solution around.

Good luck in your ML pursuit.

References

These are the main references I used in putting together my talk and post:

• Zipfian
• Framed.io
• “Analyzing the Analyzers” – Harlan Harris, Sean Murphy, Marck Vaisman
• “Doing Data Science” – Rachel Schutt & Cathy O’Neil
• “Collective Intelligence” – Toby Segaran
• “Some Useful Machine Learning Libraries” (blog)
• University GPA Linear Regression Example
• Scikit-Learn (esp. linear regression)
• Mozy Blog
• StackOverflow
• Wiki

This post was originally posted at Melanie Warrick’s blog.
