Box Relatives

Thoughts about puzzles, math, coding, and miscellaneous

Basic Machine Learning With R

You’ve heard of machine learning. How could you not have? It’s absolutely everywhere, and baseball is no exception. It’s how Gameday knows how to tell a fastball from a cutter and how the advanced pitch-framing metrics are computed. The math behind these algorithms can go from the fairly mundane (linear regression) to seriously complicated (neural networks), but good news! Someone else has wrapped up all the complex stuff for you. All you need is a basic understanding of how to approach these problems and some rudimentary programming knowledge. That’s where this article comes in. So if you like the idea of predicting whether a batted ball will become a home run or predicting time spent on the DL, this post is for you.

We’re going to use R and RStudio to do the heavy lifting for us, so you’ll have to download them. It’s fairly painless and well-documented here — in fact, you know what? Go read that article in its entirety before you start here. It not only has an intro to getting started with R, but information on getting baseball-related data, as well as some other indispensable links. Once you’ve finished downloading RStudio and reading that article head back here and we’ll get started! (If you don’t want to download anything for now you can run the code from this first part on R-Fiddle — though you’ll want to download R in the long run if you get serious.)

Let’s start with some basic machine learning concepts. We’ll stick to supervised learning, of which there are two main varieties: regression and classification. To know what type of learning you want, you need to know what problem you’re trying to solve. If you’re trying to predict a number — say, how many home runs a batter will hit or how many games a team will win — you’ll want to run a regression. If you’re trying to predict an outcome — maybe if a player will make the hall of fame or if a team will make the playoffs — you’d run a classification. These classification algorithms can also give you probabilities for each outcome, instead of just a binary yes/no answer (so you can give a probability that a player will make the hall of fame, say).

Okay, so the first thing to do is figuring out what problem you want to solve. The second part is figuring out what goes into the prediction. The variables that go into the prediction are called “features” and feature selection is one of the most important parts of creating a machine learning algorithm. To predict how many home runs a batter will hit, do you want to look at how many triples he’s hit? Maybe you look at plate appearances, or K%, or handedness … you can go on and on, so choose wisely.

Enough theory — let’s look at a specific example using some real-life R code and the famous “iris” data set.

inTrain <- createDataPartition(iris$Species,p=0.7,list=FALSE)
training <- iris[inTrain,]
model <- train(Species~.,data=training,method='rf')

Believe it or not, in those five lines of code we have run a very sophisticated machine learning model on a subset of the iris data set! Let’s take a more in-depth look at what happened here.


This first line loads the iris data set into a data frame — a variable type in R that looks a lot like an Excel spreadsheet or CSV file. The data is organized into columns and each column has a name. That first command loaded our data into a variable called “iris.” Let’s actually take a look at it; the “head” function in R shows the first five rows of the dataset — type


into the console.

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

As you hopefully read in the Wikipedia page, this data set consists of various measurements of three related species of flowers. The problem we’re trying to solve here is to figure out, given the measurements of a flower, which species it belongs to. Loading the data is a good first step.


If you’ve been running this code while reading this post, you may have gotten the following error when you got here:

Error in library(caret) : there is no package called 'caret'

This is because, unlike the iris data set, the “caret” library doesn’t ship with R. That’s too bad, because the caret library is the reason we’re using R in the first place, but fear not! Installing missing packages is dead easy, with just the following command


or, if you have a little time and want to ensure that you don’t run into any issues down the road:

install.packages("caret", dependencies = c("Depends", "Suggests")

The latter command installs a bunch more stuff than just the bare minimum, and it takes a while, but it might be worth it if you’re planning on doing a lot with this package. Note: you should be planning to do a lot with it — this library is a catch-all for a bunch of machine learning tools and makes complicated processes look really easy (again, see above: five lines of code!)

inTrain <- createDataPartition(iris$Species,p=0.7,list=FALSE)

We never want to train our model on the whole data set, a concept I’ll get into more next time. For now, just know that this line of code randomly selects 70% of our data set to use to train the model. Note also R’s “<-” notation for assigning a value to a variable.

training <- iris[inTrain,]

Whereas the previous line chose which rows we’d use to train our model, this line actually creates the training data set. The “training” variable now has 105 randomly selected rows from the original iris data set (you can again use the “head” function to look at the top 5).

model <- train(Species~.,data=training,method='rf')

This line of code runs the actual model! The “train” function is the model-building one. “Species~.” means we want to predict the “Species” column from all the others. “data=training” means the data set we want to use is the one we assigned to the “training” variable earlier. And “method=’rf'” means we will use the very powerful and very popular Random forest model to do our classification. If, while running this command, R tells you it needs to install something, go ahead and do it. R will run its magic and create a model for you!

You may be surprised that the data we fed into the model told us the answer ahead of time. Why, you may be asking, are we building a model to tell us the species when we already knew it? Well, this is the essence of supervised learning — we use data for which we know the answer to help tell us the answer when we don’t know it. Now that we have the model, if someone presents us with measurements for a new flower for which we don’t know the species, we can give a best guess for which species it is, along with probability values for each possibility. This, for instance, is how spam detection algorithms are created; large amounts of e-mails are manually marked as spam or not-spam, that labeled data is fed into a model, which in turn can predict whether a future e-mail is spam. We could use something similar to predict if a baseball player will make the hall of fame, using current hall of famers to predict future ones.