What is Data Science and Machine Learning?

Join Hackbright Academy’s new Data Science course focusing on Machine Learning this April 5-6, 2014 in San Francisco.

We talk to Hackbright Academy’s Data Science instructor Dan Wiesenthal:

What is Data Science?

Oh man. There’s so much. At the highest level, it’s trying to look for useful patterns or meaning in data. Which means you have to figure out what data exists, and if it doesn’t exist how you can create or capture it. And then once it exists, there’s almost always a lot of work around cleaning and massaging it (we call this “data munging”) into a format that’s easier to work with. And then once it does exist in a format you can easily work with, you can get to the high level questions of what does this data tell us about the world? Framing those questions can be really hard, because what questions should you really ask? What is useful? What is not useful?

There are a lot of seemingly useful but not actually useful questions. And the way you frame the questions matters a lot. To poach someone else’s example (can’t remember who), you could try to ask what interesting trends can be found in baby formula sales. And after all your data gathering and munging and analysis, you might see there’s a statistically significant increase in baby formula sales. Sweet! Pattern found! Clearly, babies are getting hungrier! …right?

Well, maybe, but what’s been happening to the population? Growing? Shrinking? Staying the same in terms of total, but becoming younger overall (like the reverse of Japan’s aging population)? So framing the question, understanding the data, understanding the domain and knowing where to look for confounding factors is all pretty important.
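
To make the confounder concrete, here is a minimal sketch of the baby formula example. The sales and population figures are invented purely for illustration, and the helper names `growth` and `per_capita` are hypothetical. It compares raw sales growth with growth per infant:

```python
# Hypothetical yearly figures, invented purely for illustration.
formula_sales = {2011: 1000.0, 2012: 1100.0, 2013: 1210.0}   # units sold
infant_population = {2011: 500.0, 2012: 550.0, 2013: 605.0}  # number of infants


def growth(series):
    """Return year-over-year fractional growth for a {year: value} dict."""
    years = sorted(series)
    return {
        later: series[later] / series[earlier] - 1.0
        for earlier, later in zip(years, years[1:])
    }


def per_capita(sales, population):
    """Divide sales by population, year by year."""
    return {year: sales[year] / population[year] for year in sales}


raw_growth = growth(formula_sales)
per_infant_growth = growth(per_capita(formula_sales, infant_population))

for year in sorted(raw_growth):
    print(year, round(raw_growth[year], 3), round(per_infant_growth[year], 3))
```

With these made-up numbers, raw sales grow about 10% a year while sales per infant stay flat, so the "hungrier babies" reading would not hold up once population is accounted for.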

When you abstract out from all the nuances, Data Science basically means trying to look for meaning in data to help us understand the world. Sometimes that understanding can then lead to solutions to the problems we are facing.

What is Machine Learning?

Machine Learning is a branch of Computer Science that tries to push computers to go beyond their programmers. Rather than require that the programmer anticipate all the directions that a problem might go, tweak all the levers, and optimize everything by hand, this field looks at systems that are aware of their own performance. If they’re aware of their own performance, they can tweak some levers themselves, automatically, and see if performance improves.

By tweaking these levers — and sometimes there are a lot of levers, way more than human programmers could handle by hand — the program can self-optimize. So they are designed to learn over time to be as good as they can. Sometimes it’s a short amount of time, sometimes it’s a very long optimization process.
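
As a rough sketch of the "tweak a lever, check performance" idea, the snippet below uses simple random hill climbing on a single lever. The `performance` function is a hypothetical stand-in for a real measurement, and hill climbing is only one of many possible optimization strategies:

```python
import random


def performance(lever):
    """Hypothetical score: highest (zero) when the lever sits at 3.0."""
    return -(lever - 3.0) ** 2


def self_optimize(start, step=0.1, rounds=2000, seed=0):
    """Randomly nudge the lever and keep only nudges that improve the score."""
    rng = random.Random(seed)
    lever = start
    best = performance(lever)
    for _ in range(rounds):
        candidate = lever + rng.uniform(-step, step)
        score = performance(candidate)
        if score > best:  # the program checks its own performance
            lever, best = candidate, score
    return lever


learned = self_optimize(start=0.0)
print(round(learned, 2))
```

Nobody tells the program that 3.0 is the best setting. It only sees its own score and keeps the changes that help.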

The really interesting thing for me isn’t even the tweaking of the levers (I mean, that’s cool and all, because there can be these big meta-levers that determine what the littler levers even look like and which directions they go), but creating the levers themselves. The way you frame some of these problems can lead to really interesting ways of trying to represent the world in a computer-interpretable form (google “feature engineering”), which, at least for me, seems to brush up against the very meaning of reality. It’s pretty cool to think about.
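
For a flavor of what feature engineering can look like, here is a small sketch that turns a raw text message into numbers a learning algorithm could consume. The particular features and the `extract_features` name are assumptions chosen for illustration, not a recommended set:

```python
def extract_features(message):
    """Turn a raw text message into a small dict of numeric features."""
    words = message.split()
    uppercase = sum(ch.isupper() for ch in message)
    return {
        "num_words": len(words),
        "num_exclamations": message.count("!"),
        "fraction_uppercase": uppercase / len(message) if message else 0.0,
        "mentions_free": int("free" in message.lower()),
    }


features = extract_features("FREE tickets!!! Click now")
print(features)
```

Each key is one "lever" the learner gets to weigh. Choosing which ones to create is the framing decision described above.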

How does Machine Learning fit into Data Science?

Machine learning is one tool in a data scientist’s toolbelt. Some people use that tool a lot more, and some people tend to use it a lot less; some prefer it in production, some only in pre-production or research.

In a production context, it’s a good way to get state-of-the-art performance once you know you need it, and a great way to prototype new features that might make people happy before you’re sure you want to sink a lot of resources into them. It’s pretty easy to prototype a recommender system like Amazon’s or Netflix’s, or something like a quick “Hey, this isn’t what you normally do, is this really you/did you mean to do something else?” check, which gets into fraud detection and error recovery.
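
A quick prototype of the "this isn’t what you normally do" check might look like the sketch below. It assumes a simple z-score rule over a user’s past purchase amounts. The threshold, the sample amounts, and the `is_unusual` name are all hypothetical, and real fraud detection systems are far more involved:

```python
from statistics import mean, stdev


def is_unusual(history, new_value, threshold=3.0):
    """Flag new_value if it lies more than `threshold` standard deviations
    from the mean of the user's past values."""
    if len(history) < 2:
        return False  # not enough history to judge
    spread = stdev(history)
    if spread == 0:
        return new_value != history[0]
    return abs(new_value - mean(history)) / spread > threshold


past_purchases = [12.0, 15.5, 9.99, 14.25, 11.0]  # hypothetical amounts
print(is_unusual(past_purchases, 13.0))
print(is_unusual(past_purchases, 950.0))
```

A typical amount passes quietly, while one far outside the user’s usual range gets flagged for a "was this really you?" prompt.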

Outside of a production context, it can be really interesting to try and look at what the machine has learned and the way it views the world, or to compare the machine’s performance to other systems—e.g., humans. Sometimes things that we as humans think are really important (or not important) turn out to be pretty different from what the machines “learn” as being important.

There’s this great paper by some Stanford folks — including Dan Jurafsky, who’s one of my favorite profs there — about speed dating that basically shows that machines can learn to be better at detecting flirting than we humans are! Most people, I think, would have intuited that humans would be better at that, since, y’know, flirting is a pretty human thing. Or at least a very squishy-animal thing.

Turns out our intuition fails and what the machines have learned is a way more accurate way to detect flirting.

How is Machine Learning applied in the world?

Fraud detection is probably the most canonical example. Computer vision is another big one. Natural Language Processing uses ML all over the place, from language translation to writing analysis (like how they found out who Fake Steve Jobs was) to handwriting and voice recognition to sentiment analysis.

Then there’s recommender systems, like Netflix and Amazon and Pandora. Social network connection prediction. Stock market analysis and prediction (if you’re lucky!).

Medical diagnosis and treatment recommendations. The list goes on and on.

What’s your advice for an aspiring data scientist?

Read a lot of papers.

Take Andrew Ng’s course on Machine Learning. If you like language, take anything you can find from Dan Jurafsky and Chris Manning; I think they’re doing a Coursera course too. Play around even more than you read: actually do things. Pick toy problems and ask yourself how you can solve them, come up with a simple solution, and build it. Then try to find out why it’s not performing perfectly: what’s it getting wrong?

Think about how you can improve it. Sometimes the way you’d need to improve it would require a lot more data or computational resources. That’s okay, you don’t always have to actually build it. But thinking through those next steps is a great learning process. (And then going to find out what people did who actually did take those next steps is always helpful too!)

What are resources for someone interested in getting started in data science?

Andrew Ng’s course on Machine Learning, which is now on Coursera, is kind of amazing. Like, it’s really, really good.

Dan Jurafsky’s NLP course got me into NLP, which is actually how I circuitously got into ML, so if you like playing with language these two courses will be the cat’s meow for you.

I usually use Python when I can, as it’s a great tool and pretty popular. R is also good. I like MongoDB, which mentally maps pretty well to how I think about data, and its flexibility regarding schema is nice for playing around with data you’re not yet sure how to structure. PiCloud is a cool tool for distributed processing, though I’m not sure about the future of the platform. The team, great folks, just joined Dropbox, so we’ll see; I think someone else is taking it over.

Heroku is great for I-don’t-want-to-deal-with-server-shit-but-need-to-deploy-this-to-the-cloud, I use it all the time.

And within Python there are the usual suspects: NumPy, SciPy, scikit-learn, Pandas, Matplotlib…

Actually, a lot of these resources are listed in the Open Source Data Science Masters, which is a great collection.

What data science application inspires you?

Pandora single-handedly brought me the biggest joy of any tech company. They have just brought me so much delight thanks to their music recommendations.

What is a common misconception about data science?

That you need a PhD in math. I mean, it’ll certainly be helpful if you do have one… But seriously, I think the biggest misconception is about the scope of what it means to be a data scientist.

It’s not just math, it’s not just CS, it’s not just engineering, it’s not just domain expertise… it’s everything.

Usually you’ll have a team of data scientists, some of whom are stronger at the mathematics, some of whom are stronger at the software engineering, some of whom know more about the domain, etc. And sometimes you try to embody all of those people in one, because you’re at an early-stage startup or something like that. But that doesn’t mean you have to be all of those things to attempt to become — or succeed at becoming — a data scientist.

