Fitting linear regression models in R

What’s in this tutorial?

The main goal of this tutorial is to present and explain the code for fitting a simple regression model in R and getting the parameter estimates from a fitted model.

Loading data and packages

The main function we’ll be using is the lm function from base R, so we don’t need to load any additional packages. We’ll use the subset of data from Pearson & Lee (1903) as described in the Unit 3 reading, so we’ll load that data in the code below, the same as we did for the tutorial on scatterplots.

pl <- read.csv("FatherSonHeights.csv")
head(pl)
  observation father_height son_height
1           1          62.5       59.5
2           2          62.5       59.5
3           3          63.5       59.5
4           4          63.5       59.5
5           5          64.5       59.5
6           6          64.5       59.5

The workhorse lm function

One of the core functions of R is the lm function, which stands for “linear model”. This function is used to estimate linear regression parameters using ordinary least squares. It has several more advanced arguments for more complex uses, but we will focus on just the two main ones: the formula and the data.

The data argument is straightforward, and is just the data frame with the variables that we will use in the regression.

The formula argument warrants a little more discussion.

Formulas in R

R actually has a specific type of data object called “formulas”, and once you understand what they look like, you’ll notice them across many different functions. The main thing that tells you that you are working with a formula in R is the use of the tilde character: ~

This is the character on the key just to the left of the “1” key on most English keyboards, but you need to use Shift with it; otherwise you get a “backtick” (i.e., the character used at the beginning and end of R code chunks in R Markdown).

The tilde symbol in a formula can usually be read as “is distributed as a function of”. So for example, a typical regression formula might look like:

y ~ 1 + x

This could be read as something like “the variable \(y\) is distributed as a function of a constant (1) and a variable \(x\)”. So whenever you have a formula, the key is to think about what’s on the left side versus the right side of the tilde.
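In fact, a formula is its own kind of R object, and you can check this with the class() function. This works even if y and x aren’t defined in your workspace, because a formula just records the symbols rather than evaluating them:

class(y ~ 1 + x)
[1] "formula"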

Building a regression formula

In terms of regression, this formula syntax corresponds directly to the linear equation:

\(y = \alpha + \beta \times x\)

On the left side of the tilde is the response variable, which we usually call the “y” variable. This is the quantity we are trying to predict.

On the right side of the tilde is the rest of the linear equation. First we have the intercept, which is a constant: a single fixed value that stays the same no matter where on the line we are. In the R formula, we use the number 1 to signify that constant.

The “x” on the right side is the predictor variable. In terms of our linear equation, it’s the data that get multiplied by the slope.

This is one of the nice things about R’s use of formulas in the lm() function: there is a straightforward mapping between our statistical model and the syntax we use to specify it.

Example: Pearson & Lee’s height data

Let’s make these abstract concepts a little more concrete with our example data. In this data, let’s use the father_height variable as a predictor to predict the son_height variable. This means our formula should look like this:

son_height ~ 1 + father_height

This essentially translates to: “we are predicting son_height as a function of a constant (intercept) and the father_height variable.”

What we don’t know yet are the actual parameters of this model: the exact value of the intercept constant, and the exact value of the slope that father_height gets multiplied by in order to produce the predicted son_height value. But that’s exactly what the lm function will do for us: estimate those parameters!

So here’s the code that puts this into action:

pl_regression <- lm(son_height ~ 1 + father_height, data = pl)

So in the lm() function, we start by giving the formula, using the names of the columns that represent our variables. Then in the next argument we just need to tell R what data frame to look at to find these columns.
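As a side note, lm() includes the intercept by default, so you get the same fitted model whether or not you write the 1 explicitly (you’ll see the shorter form later in this tutorial). A quick sketch, with the object names here just for illustration:

# These two calls fit the same model; lm() includes the intercept automatically
pl_regression_explicit <- lm(son_height ~ 1 + father_height, data = pl)
pl_regression_implicit <- lm(son_height ~ father_height, data = pl)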

Note that because we want to “save” the results of the regression, we assign the results to a variable like normal. The results are a special type of object with the class lm, which we can verify using the class() function like so:

class(pl_regression)
[1] "lm"

Once we fit the model, basically all the information we need is in this new object, so let’s turn to how to get values from it.

Inspecting lm objects and getting parameter values

There’s actually a lot of information in an “lm” object, so we won’t go into exhaustive detail right now, but we’ll return to it throughout the course. It’s useful to think of the lm object as representing our fitted model: we can use that model to do all sorts of things, and we can inspect it in various ways.

As we discussed in the reading, the most fundamental aspect of the model is the parameter values. In our simple linear model, that’s the intercept (aka “alpha” or \(\alpha\)) and the slope (aka “beta” or \(\beta\)). In R, these parameter values are also called coefficients.

One way to inspect the internal structure of objects in R is with the str() function (standing for “structure”), but since lm objects are so complex, it’s a little simpler to use the names() function instead:

names(pl_regression)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        

And as you can see, there’s a structure inside the object called coefficients.

To access this structure, you can use the same “dollar sign” notation that we learned for accessing columns in a data frame. In other words, we can get the coefficients with the following:

pl_regression$coefficients
  (Intercept) father_height 
   33.2680781     0.5192646 

This is where R’s consistent conventions for pulling individual pieces out of a larger structure come in handy: everything we already know about getting values out of vectors or data frames also applies here.

To walk through the process, let’s back up and suppose we know our regression results are in the pl_regression object, but we’re not sure how to get information out of it. The first step is to use names() (or str(), if you want more detail) and look at what the sub-parts of this object are.

Seeing that there’s something called coefficients, we can access that using the code above: pl_regression$coefficients

Now what? Well, now we might want to check on what kind of thing this sub-structure is, again using our friend, the class() function:

class(pl_regression$coefficients)
[1] "numeric"

The class function tells us that the coefficients are just a vector of numbers, though this vector also has names. This is good, because we already know how to work with vectors. For example, since pl_regression$coefficients is a vector, if we want to just get the value for the intercept parameter, we can do either of the following:

pl_regression$coefficients[1]
(Intercept) 
   33.26808 
pl_regression$coefficients["(Intercept)"]
(Intercept) 
   33.26808 

And similarly, we can get the slope parameter (which always has the name of the predictor it’s associated with) with either of the following:

pl_regression$coefficients[2]
father_height 
    0.5192646 
pl_regression$coefficients["father_height"]
father_height 
    0.5192646 

Putting this together, we can easily extract the intercept and slope parameters and assign them to variables with the following (or you could index by name instead, if you prefer):

regression_intercept <- pl_regression$coefficients[1]
regression_slope <- pl_regression$coefficients[2]
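As a side note, base R also provides the coef() accessor function, which returns the same named vector as pl_regression$coefficients; some people prefer it because it reads a little more cleanly:

# coef() returns the same named vector of parameter estimates
regression_intercept <- coef(pl_regression)[1]
regression_slope <- coef(pl_regression)["father_height"]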

Example: fitting a model with a centered predictor

As we discussed in the reading, for this particular example, the intercept is not especially meaningful, since it means “the expected height of the son if the father is 0 inches tall.” So let’s walk through the code you could use to:

  1. Center the father_height predictor variable
  2. Re-fit the model with the new (centered) predictor
  3. Get the updated coefficients

First, we are creating a different “version” of a variable, namely the centered version of the father_height variable. Since transformations like log, centering, scaling, and standardizing are common, I like to have a personal notation to remind myself that a variable has been transformed. For example, I add a “.c” at the end of a variable name to stand for “this has been centered”. But note that this is just a personal naming convention; you can pick your own. I just highly recommend having some kind of consistent system – your future self will thank you when they try to read your code later!

So using this naming convention, the code below creates a new centered variable from the father_height variable, by simply subtracting the mean of father_height from the father_height column (see the note below), and saving that as a new column in the data frame. Then we can fit a new model using that new variable in the formula. Finally, we can print the coefficients from that new model to see how the intercept has changed, but the slope hasn’t.

Note: recall from the tutorial on vectors that in R, if you subtract a single number (like a mean) from a vector of numbers (like a column in your data), R subtracts that single value from each value in the vector. No programming loops required!
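As a tiny illustration of that vectorized behavior (with made-up numbers):

c(62.5, 63.5, 64.5) - 63.5
[1] -1  0  1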

pl$father_height.c <- pl$father_height - mean(pl$father_height)
pl_regression_father.c <- lm(son_height ~ father_height.c, data = pl)
pl_regression_father.c$coefficients
    (Intercept) father_height.c 
     68.1614100       0.5192646 

Recall from the reading that this intercept value can be interpreted to mean “the expected height of a son of an average-height father is 68.16 inches.” The slope doesn’t change, because the relationship is still the same.
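If you want to use the fitted model to generate a prediction directly, base R’s predict() function works on lm objects. For example, here’s a sketch of predicting the height of a son whose father is 5 inches taller than average (note that the new data frame has to use the same column name as the predictor in the formula, father_height.c). Based on the coefficients above, this should come out to roughly 68.16 + 5 × 0.52 ≈ 70.8 inches:

# Predicted son height for a father 5 inches above the average father height
predict(pl_regression_father.c, newdata = data.frame(father_height.c = 5))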

We can also center the son_height variable, refit the model again, and print the updated parameter values:

pl$son_height.c <- pl$son_height - mean(pl$son_height)
pl_regression_fatherson.c <- lm(son_height.c ~ father_height.c, data = pl)
pl_regression_fatherson.c$coefficients
    (Intercept) father_height.c 
  -1.832789e-14    5.192646e-01 

By centering both the response variable and the predictor variable, the intercept should be zero. Recall that this is because the best-fitting regression line will always pass through the [mean of \(x\), mean of \(y\)] coordinate. So by centering both variables (making the mean of both variables 0), both the \(y\)-intercept and the \(x\)-intercept are at 0.

But if that is true, why does the output above not say 0?

This is a good opportunity to point out that when R runs into a lot of digits, it will often print things out in scientific notation. So here, the intercept value is a number times 10 to the negative 14, which is a very very very small number. To illustrate, you could re-write -1.83e-14 as: -0.0000000000000183

Computers actually have a hard time computing things to be exactly zero in many cases, due to the limits of floating point numeric precision, so if you ever see a number this small in R, it’s basically equivalent to zero.

Somewhat annoyingly, when you ask R to print something, the default is to print things out in a consistent way, so we also get scientific notation for the slope parameter when we print it along with this practically-zero intercept. Converting into regular notation, the value of the slope is 0.519, which is the same as in the other models we’ve fitted so far. Just keep an eye out for scientific notation when you’re reading results!
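If the scientific notation bothers you, one simple option is to round the coefficients before printing them; the practically-zero intercept should then show up as plain zero and the slope as 0.519:

# Rounding to 3 decimal places avoids the scientific notation in the printout
round(pl_regression_fatherson.c$coefficients, 3)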

Finally, let’s look at the code for the example from the reading where we created yet another version of the son_height variable by subtracting the mean father’s height. This gets us a variable that we can interpret as “relative height compared to an average father.”

I gave this variable a suffix of .adj to stand for “adjusted”, because I couldn’t think of anything I liked better. Again, you should come up with a naming convention that makes sense to you. We can then once again fit a new model with these variables and print the parameter values.

pl$son_height.adj <- pl$son_height - mean(pl$father_height)
pl_regression_fatherson.adj <- lm(son_height.adj ~ father_height.c, data = pl)
pl_regression_fatherson.adj$coefficients
    (Intercept) father_height.c 
      0.9638219       0.5192646 

As discussed in the reading, this changes the interpretation of the intercept to “how much taller (on average) the sons in this data are, compared to the average father height in this data.” And again, notice that the slope has not changed.

Transformations and parameters

As a bit of a side-note, the reason that the slope did not change in any of the above models is that the only transformation we performed was simple subtraction. Other transformations, such as multiplication/division or exponential/logarithmic transformations, would affect the slope as well as the intercept.

The easiest way to wrap your head around this in an intuitive way is to remember that the slope parameter means “for every unit change in the \(x\) variable, there is a change of \(\beta\) units in the \(y\) variable.” If you are doing something like division or log-transforming, you are affecting what the units are of one or both variables, so the interpretation of what a “unit change” refers to is also affected.

We’ll see examples of this going forward; I just wanted to point it out now in case you were getting a false impression that the slope of a model is unaffected by changes to the variables. To be precise, the slope is unaffected when we are just doing simple addition or subtraction. Other transformations will affect both the slope and intercept.
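To make that concrete with a quick (hypothetical) sketch: if we converted father_height from inches to centimeters by multiplying by 2.54, then a “unit change” in the predictor would be a much smaller change in actual height, so the slope would shrink by that same factor (roughly 0.519 / 2.54 ≈ 0.204). The father_height.cm name below is just made up for this illustration:

# Same relationship, different units for the predictor, so a different slope
pl$father_height.cm <- pl$father_height * 2.54
pl_regression_cm <- lm(son_height ~ 1 + father_height.cm, data = pl)
pl_regression_cm$coefficients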

Categorical predictors

Now let’s look briefly at how to handle categorical predictors. Luckily, R does some of the work for us. We will be using the same sleep-study data discussed in the reading. This data set is pre-loaded in R as sleep, so it’s very convenient to work with.

head(sleep)
  extra group ID
1   0.7     1  1
2  -1.6     1  2
3  -0.2     1  3
4  -1.2     1  4
5  -0.1     1  5
6   3.4     1  6
class(sleep$group)
[1] "factor"
summary(sleep)
     extra        group        ID   
 Min.   :-1.600   1:10   1      :2  
 1st Qu.:-0.025   2:10   2      :2  
 Median : 0.950          3      :2  
 Mean   : 1.540          4      :2  
 3rd Qu.: 3.400          5      :2  
 Max.   : 5.500          6      :2  
                         (Other):8  

When we look at the data, there is a group variable that’s our categorical “treatment” variable. But while it looks like it contains the numbers 1 and 2, it’s actually encoded in R as a factor data type. You can also see this if you use the summary function on the data frame: instead of giving us the mean, median, min/max, and quartiles of the variable (which it would do if R were treating it as a number), it gives a tabulation of how many times each value shows up. This is the hint that the variable is a categorical factor.
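If your own data stored a grouping variable as plain numbers instead, you would want to convert it to a factor before fitting the model, so R doesn’t treat it as a continuous predictor. Here’s a minimal sketch with a made-up data frame (my_data and its columns are hypothetical):

# A hypothetical data frame where the grouping variable is stored as plain numbers
my_data <- data.frame(score = c(3, 5, 4, 6), condition = c(1, 1, 2, 2))
my_data$condition <- factor(my_data$condition)  # now lm() will dummy-code it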

Dummy coding for free

When a variable is a factor (or a character string), lm() will do the dummy-coding for you, invisibly converting the factor to a variable of 0s and 1s when fitting the model.

So let’s fit a model to this data, with the categorical group variable predicting the extra column, which is the number of extra hours of sleep each person got under the given treatment.

sleep_regression <- lm(extra ~ 1 + group, data = sleep)
sleep_regression$coefficients
(Intercept)      group2 
       0.75        1.58 

Now notice that the name of the slope parameter is “group2”, not just “group”. This is R telling us that the category “2” is the one assigned to the number 1 behind the scenes, while the other category, “1”, is assigned to the number 0. So as described in the Unit 3 reading, this means that the intercept value of 0.75 represents the average amount of extra sleep for the “group 1” category, and “group 2” is estimated to get 1.58 more hours of sleep than group 1.
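If you ever want to see exactly what dummy-coded values R used, you can look at the design matrix of the fitted model with the model.matrix() function; the group2 column there contains the 0s and 1s that lm() created behind the scenes (0 for the group-1 rows and 1 for the group-2 rows):

# The first few rows of the design (model) matrix that R actually used
head(model.matrix(sleep_regression))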

More to come

This is just the very beginning of fitting models with lm() and extracting parameter values. Tutorials in later units will dig deeper into lm objects and how to interpret the results of regression.