So far we have looked at solely categorical data (Testing Ratios) and continuous data taken from members of different groups or categories (Testing Populations).
Here, we will look at methods to test whether two or more samples of continuous data are positively or negatively associated. We will also consider having multiple predictor variables.
Response/Dependent: Integer or continuous
Predictor/Independent: Continuous (or Categorical)
(0. Response and predictor data are categorical: see Testing Ratios, chisq.test().)
1. Do the two samples of continuous data come from the same distribution? Kolmogorov-Smirnov test, ks.test().
2. We assume no causal relationship between the continuous ‘predictor’ and the continuous ‘response’: correlation, cor(), cor.test().
3. We are testing whether one or more continuous/categorical predictor variables cause some of the variation in the continuous response, with data from more than two groups and/or with two or more predictors: linear regression, lm().
If we have continuous data on both sides, we will most likely make some kind of scatter plot.
# Generate 2 vectors of 100 random values from a Normal distribution
x <- rnorm(100)
y <- rnorm(100)
plot(y ~ x)
We can use a Kolmogorov-Smirnov Test to test whether two samples come from the same distribution.
ks.test(x = , y = ) performs a two-sample test of the null hypothesis that x and y were drawn from the same continuous distribution.
x = and y = are vectors of numeric data.
Generate 50 values drawn randomly from two different distributions: x from a normal distribution and y from a uniform.
Thus, we know they are different …
x <- rnorm(n = 50)
# Here, y comes from a uniform distribution, where each value has the same probability of being sampled.
y <- runif(n = 50)
We can check this with histograms.
par(mfrow = c(1,2))
hist(x, breaks = 6)
hist(y, breaks = 6)
The ks.test() function takes x = and y = .
ks.test(x, y)
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.6, p-value = 1.062e-08
## alternative hypothesis: two-sided
The p-value of less than 0.001 suggests that x and y are indeed drawn from different distributions (as we suspected …).
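For contrast, here is a quick sketch (with new, made-up vectors a and b, not part of the example above) of what happens when both samples do come from the same distribution: the p-value will usually be large, so we would not reject the null hypothesis.
# Both samples come from the same (standard normal) distribution,
# so the p-value should usually be well above 0.05 (it will vary
# from run to run because the data are random).
a <- rnorm(n = 50)
b <- rnorm(n = 50)
ks.test(a, b)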
Whether we run a correlation or a regression depends on how we think the two variables are related. If we think that there is no causal relationship between the two, then we would test for a correlation. If we think that there is a causal relationship between the two, then we would run a regression.
As the saying goes “correlation does not imply causation”.
cor() returns the correlation coefficient of two variables. Note: you can also run cor() on a matrix or dataframe to test all possible correlations at once.
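For example, a minimal sketch (using a small made-up data frame, dat, rather than any data from this unit) of running cor() on a whole data frame of numeric columns:
# cor() on a data frame returns a matrix of all pairwise correlation coefficients
dat <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))
cor(dat)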
cor.test() tests for association (correlation) between paired samples. It returns the correlation coefficient and the p-value of the correlation.
x = a numeric vector the same length as y,
y = a numeric vector the same length as x,
method = the correlation method. Pearson’s product moment correlation coefficient (method = 'pearson') is the parametric version used for normal data (and the default). Kendall’s tau (method = 'kendall') or Spearman’s rho (method = 'spearman') are used for non-parametric data. The three methods each estimate the association between paired samples and compute a test of the value being zero (indicating no association).
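As a quick sketch (with made-up vectors x1 and y1, not the data used elsewhere in this unit), the three methods can be compared on the same pair of variables:
x1 <- rnorm(30)
# y1 is built from x1, so the two are positively associated
y1 <- x1 + rnorm(30)
cor(x1, y1, method = 'pearson')
cor(x1, y1, method = 'kendall')
cor(x1, y1, method = 'spearman')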
The correlation coefficient can fall between -1 and 1 (illustrated in the short sketch after this list).
-1: (negative 1) means a perfect negative correlation (x goes up and y goes down),
0: means no association between x and y,
1: means a perfect positive correlation (x goes up and y goes up).
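Here is that sketch, using made-up vectors; the exact values vary between runs because the data are random.
n <- 100
x2 <- rnorm(n)
cor(x2, -x2 + rnorm(n, sd = 0.1)) # close to -1: strong negative association
cor(x2, rnorm(n))                 # close to 0: no association
cor(x2, x2 + rnorm(n, sd = 0.1))  # close to 1: strong positive association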
Let’s look at the small sparrow dataset and the relationships between the various measurements.
BirdData <- data.frame(
Tarsus = c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5),
Head = c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, 31.6),
Weight = c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6, 15.7),
Wingcrd = c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55),
Species = c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B')
)
We can use the pairs() function to plot all the variables against one another.
pairs(BirdData)
We have no a priori reason to believe that any of these measurements of sparrows drives variation in the others. However, we can test whether they are correlated. For example, do sparrows with larger heads also have larger wings?
The function cor() returns the correlation coefficient of two variables. It requires an x = and a y = . Pearson’s product moment correlation coefficient is the parametric version used for normal data; Kendall’s tau or Spearman’s rho for non-parametric data.
cor(x = BirdData$Tarsus, y = BirdData$Wingcrd, method = 'pearson')
## [1] 0.6409484
The function cor.test() tests whether the correlation coefficient is significantly different from 0.
cor.test(x = BirdData$Tarsus, y = BirdData$Wingcrd, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: BirdData$Tarsus and BirdData$Wingcrd
## t = 2.0454, df = 6, p-value = 0.0868
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1162134 0.9269541
## sample estimates:
## cor
## 0.6409484
t = 2.0454 is the t-test statistic value,
df = 6 is the degrees of freedom,
p-value = 0.0868 is the significance level of the t-test,
95 percent confidence interval: -0.1162134 0.9269541 is the confidence interval of the correlation coefficient at 95%,
sample estimates: cor 0.6409484 is the correlation coefficient.
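These values can also be pulled out of the cor.test() output directly, since it returns a list; a short sketch:
ct <- cor.test(x = BirdData$Tarsus, y = BirdData$Wingcrd, method = 'pearson')
ct$estimate  # the correlation coefficient
ct$p.value   # the p-value of the test
ct$conf.int  # the 95% confidence interval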
The p-value of the test is 0.0868, which is not less than the usual alpha of 0.05, so we cannot reject the null hypothesis that Tarsus and Wingcrd are uncorrelated.
If we think that one variable is driving variation in the other (i.e., we have predictor and response variables), then we should use regression rather than correlation.
lm() takes:
formula = a symbolic (pseudocode) formula,
data = (optional) a data object, usually a data.frame.
Models are specified symbolically. A typical model has the form response ~ terms, where:
response is the (numeric) response/dependent vector and
terms is a series of terms which specifies a linear predictor (or independent variable) for the response.
The response variable is fitted as a function of the predictor variable. The tilde symbol (~) signifies "as a function of", and separates the response and predictor variables. Adding further predictor variables to the right-hand side is possible (see Formulae).
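For example, a second predictor can be added to the right-hand side with +. A minimal sketch using the sparrow data defined above (just to show the formula syntax, not because this is a sensible model):
lm(Tarsus ~ Wingcrd + Head, data = BirdData)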
For now, we will carry out a simple regression of a single predictor and response. This model estimates values for the three elements of the equation:
y = beta0 + beta1 * x + error, where the error has standard deviation sigma
In other words:
y = intercept + slope * x + error (with some sd)
lm() provides estimates of the intercept, slope and sd.
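To see what these estimates mean, here is a small simulation sketch (made-up data, not the sparrow data): we generate y from x with a known intercept (2), slope (0.5) and error sd (1), and check that lm() recovers values close to them.
set.seed(1)
xsim <- rnorm(100)
ysim <- 2 + 0.5 * xsim + rnorm(100, sd = 1)
msim <- lm(ysim ~ xsim)
coef(msim)           # estimates of the intercept and slope
summary(msim)$sigma  # estimate of the residual (error) sd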
We will illustrate the use of lm() using the sparrow data (even though we know there is no causality here).
First, we plot the data.
plot(BirdData$Tarsus ~ BirdData$Wingcrd)
Fig. Sparrow Tarsus length as a function of Wingcrd
Second, we run the model. We can specify the formula with or without the data = argument.
m <- lm(BirdData$Tarsus ~ BirdData$Wingcrd)
# or
m <- lm(Tarsus ~ Wingcrd, data = BirdData)
The output of lm() is a list.
We can use str() to examine the elements of the output, as well as pull out various parts directly, or use other functions to do so.
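A quick sketch (m is the model fitted just above):
str(m)    # the full structure of the lm() list (long output)
names(m)  # just the names of the list elements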
For example, summary() pulls out various elements and formats them nicely to display the whole model table.
summary(m)
##
## Call:
## lm(formula = Tarsus ~ Wingcrd, data = BirdData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2229 -0.1594 0.1844 0.4492 0.5770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.1210 6.2706 1.295 0.2429
## Wingcrd 0.2328 0.1138 2.045 0.0868 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6705 on 6 degrees of freedom
## Multiple R-squared: 0.4108, Adjusted R-squared: 0.3126
## F-statistic: 4.184 on 1 and 6 DF, p-value: 0.0868
Call: the model formula,
Residuals: the quartiles of the residuals from the model,
Coefficients: for continuous predictors, the Estimate and Standard Error for the intercept and slope, plus the t-values and associated p-values testing whether the estimates are significantly different from 0,
R-squared: the proportion of the variation in the response variable that is explained by the predictor,
F-statistic: from the ANOVA table of the model.
Intercept: where the line of the linear model crosses the y-axis,
Slope: for continuous predictors, the estimate is the slope, i.e., the change in the response for a 1-unit increase in the predictor,
p-value: the probability of obtaining a result equal to or more extreme than what was actually observed, if the null hypothesis is true. The null hypothesis is rejected if the p-value is less than the alpha (usually 0.05).
We can access the model coefficients with the function coef()
coef(m)
## (Intercept)     Wingcrd
##   8.1209721   0.2327633
or directly with m$coefficients
m$coefficients
## (Intercept)     Wingcrd
##   8.1209721   0.2327633
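One use of these coefficients is to predict the response for a new value of the predictor. A sketch (Wingcrd = 55 is just an arbitrary value within the range of the data):
# Predicted Tarsus for a sparrow with Wingcrd = 55: intercept + slope * 55
coef(m)[1] + coef(m)[2] * 55
# The same prediction using predict()
predict(m, newdata = data.frame(Wingcrd = 55))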
We can access the residuals with resid()
resid(m)
## 1 2 3 4 5
## 0.445994599 -1.222952295 0.226192619 -0.622952295 0.458955896
## 6 7 8
## -0.004860486 0.142574257 0.577047705
or m$residuals
m$residuals
##            1            2            3            4            5            6
##  0.445994599 -1.222952295  0.226192619 -0.622952295  0.458955896 -0.004860486
##            7            8
##  0.142574257  0.577047705
We can use abline() to add the fitted line of the model to a plot of the raw data.
plot(BirdData$Tarsus ~ BirdData$Wingcrd)
abline(lm(BirdData$Tarsus ~ BirdData$Wingcrd))
# or
m <- lm(BirdData$Tarsus ~ BirdData$Wingcrd)
abline(m)
Fig. Sparrow Tarsus length as a function of Wingcrd, with fitted line
You can also include categorical predictors in linear models.
If we include only one predictor, and that predictor is categorical, we could think of this as an ANOVA (remember that ANOVA and regression are essentially the same kind of thing in many cases).
Let’s model Tarsus as a function of Species.
A good plot of this model would be a boxplot.
plot(Tarsus ~ Species, data = BirdData)
Now we can run the model. Notice that we can use lm() to run this model (whereas in Unit 3 we used aov()).
m <- lm(Tarsus ~ Species, data = BirdData)
And look at the summary.
summary(m)
##
## Call:
## lm(formula = Tarsus ~ Species, data = BirdData)
##
## Residuals:
##   Min    1Q Median    3Q   Max
## -1.08 -0.51   0.02  0.30  1.52
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  20.7800     0.3763  55.222 2.37e-09 ***
## SpeciesB      0.4200     0.6145   0.683     0.52
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8414 on 6 degrees of freedom
## Multiple R-squared: 0.07224, Adjusted R-squared: -0.08239
## F-statistic: 0.4672 on 1 and 6 DF, p-value: 0.5198
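To see the connection with ANOVA mentioned above, a short sketch: the ANOVA table of this lm() fit has the same F statistic and p-value as the F-statistic line of the summary, and should match what summary(aov(Tarsus ~ Species, data = BirdData)) would give.
# ANOVA table for the model fitted with lm()
anova(m)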
We have the same parts to the model summary as above, but the interpretation of the coefficient estimates is slightly different.
Call: the model formula,
Residuals: the quartiles of the residuals from the model,
Coefficients: for categorical predictors, the default is to display the mean of the base level (in this case Species A), and then the difference between the mean of each other level and the base level. So, here we have the difference between the mean Tarsus size of Species B and that of Species A (checked in the short sketch after this list). It also includes the t-values and associated p-values testing whether these differences are significantly different from 0,
R-squared: the proportion of the variation in the response variable that is explained by the predictor,
F-statistic: from the ANOVA table of the model.
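Here is that check, comparing the group means with the model coefficients (m is the Species model fitted above):
# The intercept is the mean Tarsus of the base level (Species A), and the
# SpeciesB estimate is the Species B mean minus the Species A mean
tapply(BirdData$Tarsus, BirdData$Species, mean)
coef(m)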