I just built a machine learning model (and why that matters)

May 11, 2018 | Jake DiMare

This week I completed a free machine learning course on Kaggle.com. After a couple of hours on the fundamentals, I figured out how to use their development environment and built a model that predicts the sale price of homes based on lot size, square footage, number of bedrooms, and the year the home was built.

Now, before I go any further with this humblebrag, I should share a few mitigating factors.

First, once upon a time, I was a developer. My professional coding career ended in 2005 when I went over to the business/consulting side. However, to this day I’ve maintained my capabilities with HTML, JS, PHP, CSS, and SQL. Second, when I say I ‘built’ a machine learning model, it’s more like I copied and pasted example code snippets and modified them for my little application.

The model is built with Python, pandas, and scikit-learn. Here’s the code:


# Need this for data frames
import pandas as pd

# Import Iowa real estate data and put it in a variable
main_file_path = '../input/train.csv'
jake_data = pd.read_csv(main_file_path)

# Set the prediction target
y = jake_data.SalePrice

# Set lot size, sq footage, # bedrooms, and year built predictors
predictors = ['LotArea', '1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'YearBuilt']
x = jake_data[predictors]

# Machine learning engine
from sklearn.tree import DecisionTreeRegressor

# Define model
jake_model = DecisionTreeRegressor()

# Fit model
jake_model.fit(x, y)

# Show the first five houses, then the model's price prediction for each
print("Providing predictions for the following houses:")
print(x.head())
print("The predictions are")
print(jake_model.predict(x.head()))


Providing predictions for the following houses:
   LotArea  1stFlrSF  2ndFlrSF  BedroomAbvGr  YearBuilt
0     8450       856       854             3       2003
1     9600      1262         0             3       1976
2    11250       920       866             3       2001
3     9550       961       756             3       1915
4    14260      1145      1053             4       2000
The predictions are
[208500. 181500. 223500. 140000. 250000.]
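
One caveat: the model above predicts prices for the same houses it was trained on, and a decision tree can simply memorize those. A fairer check is to hold some rows back and score the model on houses it has never seen. Here is a minimal sketch of that idea. It runs on made-up data rather than the Kaggle file, so the price formula and the names `demo_data` and `demo_model` are my own assumptions for illustration; `train_test_split` and `mean_absolute_error` are standard scikit-learn functions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (made-up numbers, not the Kaggle file)
rng = np.random.default_rng(0)
n_rows = 500
demo_data = pd.DataFrame({
    'LotArea': rng.integers(5000, 15000, n_rows),
    '1stFlrSF': rng.integers(600, 1500, n_rows),
    'BedroomAbvGr': rng.integers(1, 6, n_rows),
    'YearBuilt': rng.integers(1900, 2010, n_rows),
})

# Invented price formula plus random noise, purely for illustration
demo_data['SalePrice'] = (
    10 * demo_data['LotArea']
    + 100 * demo_data['1stFlrSF']
    + 500 * (demo_data['YearBuilt'] - 1900)
    + rng.normal(0, 20000, n_rows)
)

predictors = ['LotArea', '1stFlrSF', 'BedroomAbvGr', 'YearBuilt']
x = demo_data[predictors]
y = demo_data['SalePrice']

# Hold out a quarter of the rows; the model never sees them while fitting
train_x, val_x, train_y, val_y = train_test_split(
    x, y, test_size=0.25, random_state=0
)

demo_model = DecisionTreeRegressor(random_state=0)
demo_model.fit(train_x, train_y)

# Compare the average error on rows the model saw vs. rows it did not
train_mae = mean_absolute_error(train_y, demo_model.predict(train_x))
val_mae = mean_absolute_error(val_y, demo_model.predict(val_x))
print("Mean absolute error on training rows:", train_mae)
print("Mean absolute error on held-out rows:", val_mae)
```

The error on the training rows should come out at essentially zero, while the error on the held-out rows is much larger. That gap is the reason to judge a model on data it was not fit to.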


Pretty cool, huh? Yeah, don’t feel bad if you don’t understand it. I barely do. But I promised I would talk about why it matters that I, a marketer, was even able to do this.

Google I/O and the big push on AI

This week was Google’s annual I/O event, and they launched ai.google. Their vision is to share the benefits of AI with everyone. In support of this goal, they’re publishing their research in a wide variety of venues. They’ve also open sourced tools and systems, like TensorFlow.

“…While we fundamentally believe in the promise of AI, we also think that it’s critical that this technology is used to help people — that it is socially beneficial, fair, accountable, and works for everyone.” -Google

A lovely sentiment to be sure, but something fascinating is going on right now in the world of emerging digital innovations. It reminds me a lot of the nineties when I learned how to code and got my first work as a web developer.

Back then, the push was to get developers to adopt LAMP, .NET, or Java, to develop websites, but at the end of the day, it was all about selling servers.

These days, the big story is the competition between Amazon, Microsoft, IBM, and others, and the modern battleground is the cloud. Last year Google was in the #7 spot for cloud vendors, with Microsoft, Amazon, and IBM in the top three slots (in that order). Thus, I’m not surprised to see Google come out swinging this year, with a bold new strategy to get more developers on board.

No surprise, Microsoft, Amazon, and IBM are all aggressively pushing their platforms as well. Rumour has it that Microsoft is willing to donate virtually any amount of professional services to support anyone who has a valid idea and is ready to launch it on Azure.

Amazon is giving away massive tools like Lumberyard, along with 12 months of free AWS. IBM is running promotions around Watson and has published a ton of very impressive, free education on machine learning, AI, NLP, and more.

Welcome to the cloud wars.

By the way, more than just a learning environment, Kaggle provides a ton of resources like public data sets, community forums, and data science competitions with significant cash prizes, in some cases. The community over there covers machine learning, deep learning, data visualization, R, SQL, Pandas, and TensorFlow.
