A lot of data is in the form of counts (whole numbers, i.e., non-negative integers). For example, the number of species in plots, the number of sunny days per year, the number of restaurants in towns, …
Further, with count data, the number 0 often appears as a value of the response variable.
Straightforward linear regression methods (assuming constant variance, normal errors) are not appropriate for count data for four main reasons:
1. The linear model might predict negative counts.
2. The variance of the response variable is likely to increase with the mean.
3. The errors will not be normally distributed.
4. Zeros are difficult to handle in transformations (e.g., you cannot take the log of zero).
Therefore, Poisson regression is used to model count variables.
The Poisson distribution was made famous by the Russian statistician Ladislaus Bortkiewicz (1868-1931), who counted the number of soldiers killed by being kicked by a horse each year in each of 14 cavalry corps in the Prussian army over a 20-year period. The data can be found here, if you are interested.
To fit a Poisson regression, we run a generalized linear model with family = poisson, which sets errors = Poisson and link = log. The log link ensures that all the fitted values are positive, while the Poisson errors take account of the fact that the data are integers and have variances that are equal to their means.
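As a quick illustration of that mean-variance property, here is a minimal sketch (the seed and rate are arbitrary choices, not part of the lesson data):
# Minimal sketch: for a Poisson distribution, the variance equals the mean.
set.seed(42)                   # arbitrary seed, for reproducibility
x <- rpois(100000, lambda = 4)
mean(x)                        # close to 4
var(x)                         # also close to 4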
We will illustrate a Poisson regression with data on the number of awards earned by students at one high school. Predictors of the number of awards earned include the type of program in which the student was enrolled (e.g., vocational, general, or academic) and the score on their final exam in maths.
Look at the head of the data, awards.
head(awards)
## student num_awards program math_score
## 1 1 0 vocational 40
## 2 2 0 vocational 33
## 3 3 0 academic 48
## 4 4 1 academic 41
## 5 5 1 academic 43
## 6 6 0 academic 46
The response variable is the number of awards earned by each high-school student during the year ($num_awards). The predictor variables are each student's score on their maths final exam ($math_score, continuous) and the type of program in which the students were enrolled ($program, categorical with three levels: general, academic, vocational).
Check that the data read in correctly, with str().
str(awards)
## 'data.frame': 200 obs. of 4 variables:
## $ student : int 1 2 3 4 5 6 7 8 9 10 ...
## $ num_awards: int 0 0 0 1 1 0 1 0 0 1 ...
## $ program : Factor w/ 3 levels "academic","general",..: 3 3 1 1 1 1 1 1 3 2 ...
## $ math_score: int 40 33 48 41 43 46 59 52 52 49 ...
The Poisson distribution assumes that the mean and variance of the response variable, conditioned on the predictors, are equal. We can check that this is the case.
First, the unconditional mean and variance (i.e., the mean and variance of the whole vector). Calculate the mean number of awards given, over all students (i.e. the entire column in the data).
mean(awards$num_awards)
## [1] 0.63
Now, calculate the variance of the number of awards.
var(awards$num_awards)
## [1] 1.108643
Second, check the conditional mean and variance (i.e., the mean and variance of the response variable conditional on the levels of categorical predictors).
Calculate the mean number of awards for each program. Call this object a_mean.
a_mean <- tapply(awards$num_awards, awards$program, mean)
Calculate the variance in the number of awards for each program. Call this object a_var.
a_var <- tapply(awards$num_awards, awards$program, var)
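If you would like to see the conditional means and variances side by side (an optional extra beyond the lesson's steps):
# Optional: display the conditional means and variances together.
cbind(mean = a_mean, variance = a_var)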
The conditional mean and variance look reasonably similar here. Let's visualise the counts with a grouped bar plot. The displayed figure is created from the following code:
barplot(t(table(awards$num_awards, awards$program)), beside = TRUE, xlab = 'Number of Awards', ylab = 'Count')
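If you want a key showing which bars belong to which program, barplot() can draw one itself (an optional variation on the command above):
# Optional: the same bar plot with a legend taken from the matrix row names.
barplot(t(table(awards$num_awards, awards$program)), beside = TRUE,
        xlab = 'Number of Awards', ylab = 'Count', legend.text = TRUE)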
Let's fit a glm! We want to model the number of awards as a function of the program and the maths grade. Remember to set the family = argument to poisson. Use a formula and a data = argument, and call this m1.
m1 <- glm(num_awards ~ program + math_score, data = awards, family = 'poisson')
Now display the summary of the model result.
summary(m1)
##
## Call:
## glm(formula = num_awards ~ program + math_score, family = "poisson",
## data = awards)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2043 -0.8436 -0.5106 0.2558 2.6796
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.16327 0.66288 -6.281 3.37e-10 ***
## programgeneral -1.08386 0.35825 -3.025 0.00248 **
## programvocational -0.71405 0.32001 -2.231 0.02566 *
## math_score 0.07015 0.01060 6.619 3.63e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 287.67 on 199 degrees of freedom
## Residual deviance: 189.45 on 196 degrees of freedom
## AIC: 373.5
##
## Number of Fisher Scoring iterations: 6
Let’s work through this output, which is in the usual format.
First we have the model Call, reiterating the model we just ran.
Second, we have the Deviance Residuals, which are approximately normally distributed if the model is specified correctly. Here, they show a little skewness, since the median is not quite zero.
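If you want to eyeball that distribution, a quick optional check is a histogram of the residuals (deviance residuals are the default residual type for a glm):
# Optional: inspect the deviance residuals directly.
hist(residuals(m1), main = 'Deviance residuals of m1', xlab = 'Residual')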
Then, we have the Coefficients: The Poisson regression coefficients, standard errors, z-scores, and p-values for each variable.
For continuous predictor variables, the coefficient estimate is the expected change in the log count of the response variable for a one-unit increase in the continuous predictor, holding the other predictors constant.
For example, the coefficient for $math_score is 0.07. This means that the expected increase in log award count for a one-unit increase in maths score is 0.07. To see what this means on the scale of actual counts, we need to calculate the exponential of 0.07.
Extract the coefficient estimate for $math_score from this model summary, and exponentiate it.
exp(coef(m1)['math_score'])
## math_score
## 1.072672
So each one-point increase in maths score multiplies the expected number of awards by about 1.07 (a 7% increase).
For categorical predictor variables, the coefficient estimate is the expected difference in log count between each level and the baseline level (in this case, academic). Exponentiating this difference returns the ratio of expected counts.
In this example, the expected difference in log count for general compared to academic is -1.08, and the expected difference in log count for vocational compared to academic is -0.71: students in general and vocational programs earned fewer awards than students in academic programs.
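To put those differences on the count scale, we can exponentiate the program coefficients (an optional check; the coefficient names come from the summary above):
# Optional: count ratios for each program relative to the academic baseline.
exp(coef(m1)[c('programgeneral', 'programvocational')])
# roughly 0.34 and 0.49: general and vocational students earn about a third
# and a half as many awards, respectively, as academic students.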
Finally, we have information on the Deviance. The residual deviance can be used to test for the goodness of fit of the overall model. The residual deviance is the difference between the deviance of the current model and the maximum deviance of the ideal model where the predicted values are identical to the observed. The smaller the residual deviance, the better the model fit.
We can eyeball this fit by checking if the residual deviance is similar to the residual degrees of freedom (it should be, because the variance and the mean should be the same). If the residual deviance is larger than the residual degrees of freedom, this difference suggests that there is extra, unexplained variation in the response (i.e., we have a case of overdispersion). Overdispersion can be compensated for by refitting the model using family = quasipoisson rather than family = poisson.
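We don't need it for these data, but as a sketch, such a refit would look like this:
# Sketch only: quasi-Poisson errors estimate a dispersion parameter rather
# than fixing it at 1; the coefficients stay the same, but the standard
# errors are scaled by the estimated dispersion.
m1q <- glm(num_awards ~ program + math_score, data = awards,
           family = 'quasipoisson')
summary(m1q)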
A better approach is to explicitly test for a difference. We can do this using a chi-squared goodness of fit test, pchisq(). This function returns a p-value for a specific value (q) and degrees of freedom (df) for the chi-squared distribution. By default, pchisq() gives the proportion of the distribution to the left of the value. To get the proportion more extreme than the difference, you can specify lower.tail = FALSE (or subtract the result from 1).
If the residual deviance is small enough, the goodness of fit test will not be significant, indicating that the model fits the data.
Run a chi-squared goodness of fit test on the residual deviance of m1, using pchisq() with lower.tail = FALSE. Make sure that you extract the deviance and degrees of freedom from the model (m1$deviance and m1$df.residual), rather than entering the values manually.
pchisq(m1$deviance, m1$df.residual, lower.tail = FALSE)
## [1] 0.6182274
We can also use a goodness of fit test to compare this full model with a simpler model, such as one that only includes $math_score. Run a new model, m2, that models the number of awards as a function of math_score only.
m2 <- glm(num_awards ~ math_score, data = awards, family = 'poisson')
Now we can test the overall effect of $program by comparing the deviance of the full model with the deviance of this new model. Use the anova() function, with test = 'Chisq', to compare models m1 and m2.
anova(m2, m1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: num_awards ~ math_score
## Model 2: num_awards ~ program + math_score
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 198 204.02
## 2 196 189.45 2 14.572 0.0006852 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The two degree-of-freedom chi-square test indicates that $program is a statistically significant predictor of num_awards.
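As an optional aside, drop1() runs the same likelihood-ratio test for every term in the full model, without fitting the reduced models by hand:
# Optional: likelihood-ratio tests for dropping each term from m1.
drop1(m1, test = 'Chisq')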
As with the binomial glm, we cannot just use the coefficient estimates to plot the fitted line with abline().
We first need to calculate and store predicted values for each row in the original dataset (we have enough data here that, at least for our purposes, we do not need to generate a new vector of math_scores).
Use the predict() function to add a new column ($pred) to the awards data object for the predicted values from the original model, m1. In this case, we do not need to provide any newdata. Further, to get predicted counts (rather than log counts), we need to set type = 'response'.
awards$pred <- predict(m1, type = "response")
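To see what type = 'response' is doing, this optional check compares the two prediction scales (the default is the linear predictor, i.e., the log scale):
# Optional: predictions on the link (log) scale vs the response (count) scale.
head(cbind(log_scale = predict(m1),
           count_scale = predict(m1, type = 'response')))
# Taking exp() of the first column gives the second.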
Now we can make a plot of the data and fitted lines. Think about how we would plot both a continuous and a categorical predictor variable. A good way would be to have the continuous variable on the x-axis, and different points/lines for each level of the categorical variable.
To plot the results, we need to first order the data by $program and $math_score, so that the lines we plot will be curves and not go all over the plot. We can do that using the order() function within the square brackets, as we did before. Run: awards2 <- awards[order(awards$program, awards$math_score), ].
awards2 <- awards[order(awards$program, awards$math_score), ]
Now, we can plot the data points (adding some jitter to the counts, because there will be some overlapping points), and then add the predicted lines. First, plot jittered $num_awards as a function of math_score. Use pch = 20, and set the col = argument to your factor, program.
plot(jitter(num_awards) ~ math_score, data = awards2, col = program, pch = 20)
Now you can use lines() to add a line for each program. The way to do this is similar to making a plot: set up the formula (i.e., the predicted values as a function of the continuous variable), and use the data = argument to subset out only the rows of data you want for each line. First, plot the academic group with the default line parameters.
plot(jitter(num_awards) ~ math_score, data = awards2, col = program, pch = 20)
lines(pred ~ math_score, data = awards2[awards2$program == 'academic', ])
Now, plot the line for general, using col = 2. Keeping the colour numbers in the same order as the levels of the factor will ensure that the lines and points are the same colour.
plot(jitter(num_awards) ~ math_score, data = awards2, col = program, pch = 20)
lines(pred ~ math_score, data = awards2[awards2$program == 'academic', ])
lines(pred ~ math_score, data = awards2[awards2$program == 'general', ], col = 2)
Now, plot the line for vocational, using col = 3.
plot(jitter(num_awards) ~ math_score, data = awards2, col = program, pch = 20)
lines(pred ~ math_score, data = awards2[awards2$program == 'academic', ])
lines(pred ~ math_score, data = awards2[awards2$program == 'general', ], col = 2)
lines(pred ~ math_score, data = awards2[awards2$program == 'vocational', ], col = 3)
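To finish the figure, you could add a key (optional; the colour numbers 1:3 match the order of the factor levels, as discussed above):
# Optional: legend matching the point and line colours.
legend('topleft', legend = levels(awards2$program), col = 1:3, pch = 20, lty = 1)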
Great! Now you can run a Poisson glm, and plot the data. Have a muffin.
This swirl lesson is assessed by the honesty of your heart.