In this lesson, we’ll cover ordination methods: methods that summarize multivariate data and detect potential associations among many variables.
We often have large data sets of different observations or measurements from the same sampling units or sites.
For example, we may have a network of quadrats, plots, or sites with incidence or abundance data on taxa present at each site, along with environmental variables (soil nutrients, elevation, aspect, …). Or, we may have measurements of many different variables from the same individuals or sampling units (e.g., individual sparrows).
There are several reasons to use ordination:
Reducing the number of potentially correlated dependent variables (e.g., soil nutrients, multiple variables on the same individuals).
Modelling the occurrence of a large number of species that would be hard to model individually.
There are a large number of ordination methods, each subtly different and best used in specific situations. These methods include principal components analysis, principal coordinates analysis, non-metric multidimensional scaling, detrended correspondence analysis, canonical correspondence analysis, …
Basic ordination methods are available in base R, and a number of specialist packages extend these capabilities, including the vegan and labdsv packages.
Now, on to an actual example!
First, look at the head of the sparrow dataframe we’ve loaded with the lesson for you.
head(sparrow)
## Species Sex Wingcrd Tarsus Head Culmen Nalospi Wt Observer Age
## 1 SSTS Male 58.0 21.7 32.7 13.9 10.2 20.3 2 0
## 2 SSTS Female 56.5 21.1 31.4 12.2 10.1 17.4 2 0
## 3 SSTS Male 59.0 21.0 33.3 13.8 10.0 21.0 2 0
## 4 SSTS Male 59.0 21.3 32.5 13.2 9.9 21.0 2 0
## 5 SSTS Male 57.0 21.0 32.5 13.8 9.9 19.8 2 0
## 6 SSTS Female 57.0 20.7 32.5 13.3 9.9 17.5 2 0
Base R (in the stats package) has two functions to calculate a Principal Components Analysis (PCA): prcomp() and princomp().
Most multivariate analyses also require you to decide whether or not to scale the variables. As with multiple regression, this makes it easier to make direct comparisons between variables. In the case of prcomp(), the scale = argument takes a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place. The default is FALSE for consistency with S, but in general scaling is advisable.
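A minimal sketch of why this matters, using the built-in iris data as a stand-in for the lesson's sparrow data: setting scale = TRUE is equivalent to standardizing the columns by hand before calling prcomp().

```r
# Scaling inside prcomp() vs. standardizing the data first (iris stand-in):
pca_scaled <- prcomp(iris[, 1:4], scale = TRUE)
pca_manual <- prcomp(scale(iris[, 1:4]))

# Both routes divide each centered column by its standard deviation,
# so the component standard deviations are identical.
all.equal(pca_scaled$sdev, pca_manual$sdev)
```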
Now run prcomp() on the 3rd through the 8th columns of the sparrow data, and assign it to the object pca1. Be sure to set the scale argument to TRUE so that each variable is scaled to unit variance.
pca1 <- prcomp(sparrow[, 3:8], scale = TRUE)
Let’s visualize your PCA. We can do so in multiple ways. First, we can look at how much variance is explained by each axis, and which variables are driving that.
Plot pca1 now.
plot(pca1)
Each bar represents a principal component axis, ordered from the component explaining the most variation in our data to the least.
Here we can see that Axis 1 explains the majority of variation in our data, indicating that our variables are correlated and largely summarized by a single axis.
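As a sketch of where those bar heights come from (again using iris as a stand-in): each axis's variance is the square of its standard deviation, and dividing by the total gives the proportion of variance explained.

```r
# Proportion of variance explained per principal component (iris stand-in):
p <- prcomp(iris[, 1:4], scale = TRUE)
prop_var <- p$sdev^2 / sum(p$sdev^2)
round(prop_var, 3)
# summary(p) reports these same values in its "Proportion of Variance" row.
```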
Enter pca1 to see the weights of each of your variables on each principal component axis.
pca1
## Standard deviations (1, .., p=6):
## [1] 1.9952947 0.8924830 0.6077801 0.5767803 0.5350063 0.4837036
##
## Rotation (n x k) = (6 x 6):
## PC1 PC2 PC3 PC4 PC5
## Wingcrd -0.3617745 0.63271912 0.50662096 0.3258822675 -0.31354286
## Tarsus -0.4214011 0.07322892 -0.75093073 0.4624788082 -0.04496309
## Head -0.4459458 -0.19032228 0.01334900 -0.0002966949 -0.09030289
## Culmen -0.4099956 -0.39641642 0.38030979 0.2694261583 0.63356601
## Nalospi -0.4049720 -0.46052289 0.09015465 -0.3618441591 -0.58920702
## Wt -0.4007168 0.43457354 -0.16277793 -0.6902118236 0.37807906
## PC6
## Wingcrd -0.08725000
## Tarsus -0.19301118
## Head 0.86981428
## Culmen -0.23690880
## Nalospi -0.37106919
## Wt -0.06884131
It appears that all of the variables have roughly equal weight in Axis 1. This makes sense given that they’re all anatomical variables, and thus highly correlated.
We could use the columns PC1 (and maybe PC2) as predictors in a regular lm() or glm() instead of a single anatomical variable, to account for correlation between all the variables (this is often done for soil nutrient data).
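A hypothetical sketch of this idea, using mtcars rather than the lesson's data: three correlated size-related variables are collapsed into a single PC1 score, which then serves as the lone predictor in an lm().

```r
# Collapse three correlated predictors into one PC1 score (mtcars stand-in):
p <- prcomp(mtcars[, c("disp", "hp", "wt")], scale = TRUE)

# Fit a regression on the first principal component instead of any one
# of the original, mutually correlated variables.
fit <- lm(mtcars$mpg ~ p$x[, 1])
summary(fit)$r.squared
```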
Often, a plot of the first two principal component axes is an effective way to represent the multi-dimensional correlation of our variables. The function biplot() plots the PC1 and PC2 points for each sparrow and adds arrows corresponding to the direction and strength of each variable.
Create a biplot of our pca1 object now.
biplot(pca1)
How do we interpret this plot? First, the length of the arrow indicates the weight of each variable (how much variance in the data is being explained by this variable). Second, arrows pointing in the same direction indicate a positive correlation between the two, those pointing in the opposite direction indicate a negative correlation, and those orthogonal indicate little correlation.
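A quick way to check the correlation structure the arrows represent (mtcars standing in for the sparrow data): variables whose arrows point the same way correlate positively, while those pointing opposite ways correlate negatively.

```r
# Pairwise correlations behind a biplot's arrow directions (mtcars stand-in):
m <- round(cor(mtcars[, c("disp", "hp", "wt", "mpg")]), 2)
m
# disp and wt correlate positively (arrows would point the same way);
# mpg correlates negatively with the others (arrow would point away).
```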
In this case, one species is generally smaller than the other, so the arrows all point one way. In other cases, the arrows will reflect much more variation in the data.
We can also visualize these data using just points, color-coding the species. In this instance, we need to use scores() to extract the PC axes from the object.
Load the vegan library. Note that you may need to install the vegan package first.
library(vegan)
Now enter ‘plot(scores(pca1), col = sparrow$Species)’. We’re using the scores() function to extract the points for each axis, while plot() plots only the first two axes.
plot(scores(pca1), col = sparrow$Species)
The ‘col = sparrow$Species’ argument is telling R to color each point by its factor level for species. We can see that each species is clearly distinguished in our PCA, primarily along PC1 (i.e., strong horizontal but weaker vertical separation).
This is telling us roughly the same information as our biplot, but does a better job of distinguishing key groups in variable space.
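For readers working outside the lesson, a hypothetical sketch of the same kind of score plot using only base R and iris (standing in for the sparrow data), with a legend so the group colors are labeled:

```r
# Score plot of the first two PCs, colored by group, base R only (iris stand-in):
p <- prcomp(iris[, 1:4], scale = TRUE)
plot(p$x[, 1:2], col = iris$Species, pch = 19)
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)
```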
Now let’s practice with a slightly more advanced example using the labdsv package. Load the library (be sure it is installed).
library(labdsv)
Now look at the head of the ‘bryce’ object we’ve loaded into the lesson for you.
head(bryce)
## junost ameuta arcpat arttri atrcan berfre ceamar cerled cermon
## bcnp__1 0 0.0 1.0 0 0 0 0.5 0 0
## bcnp__2 0 0.5 0.5 0 0 0 0.0 0 0
## bcnp__3 0 0.0 1.0 0 0 0 0.5 0 0
## bcnp__4 0 0.5 1.0 0 0 0 0.5 0 0
## bcnp__5 0 0.0 4.0 0 0 0 0.5 0 0
## chrdep chrnau chrpar chrvis eurlan juncom pacmyr pruvir purtri
## bcnp__1 0 0 0 0 0 0 0.0 0 0.0
## bcnp__2 0 0 0 0 0 2 0.5 0 0.0
## bcnp__3 0 0 0 0 0 1 0.5 0 0.0
## bcnp__4 0 0 0 0 0 0 0.0 0 0.0
## bcnp__5 0 0 0 0 0 1 0.5 0 0.5
## quegam rhutri ribcer roswoo samcoe shearg sherot symore arcuva
## bcnp__1 0 0 0.0 0.0 0 0 0 1.0 0
## bcnp__2 0 0 0.0 0.5 0 0 0 0.5 0
## bcnp__3 0 0 0.5 0.5 0 0 0 0.5 0
## bcnp__4 0 0 0.0 0.0 0 0 0 0.5 0
## bcnp__5 0 0 0.0 0.0 0 0 0 0.5 0
## artarb artfri artpyg atrcon berrep ericor gutsar tetcan agrcri
## bcnp__1 0 0 0 0 1.0 0 0 0 0
## bcnp__2 0 0 0 0 0.0 0 0 0 0
## bcnp__3 0 0 0 0 0.5 0 0 0 0
## bcnp__4 0 0 0 0 1.0 0 0 0 0
## bcnp__5 0 0 0 0 0.5 0 0 0 0
## agrdas agrscr agrsmi bougra broano brocil broine carrss elysal
## bcnp__1 0 0 0 0 0 0 0 0.5 0
## bcnp__2 0 0 0 0 0 0 0 0.5 0
## bcnp__3 0 0 0 0 0 0 0 0.5 0
## bcnp__4 0 0 0 0 0 0 0 0.5 0
## bcnp__5 0 0 0 0 0 0 0 1.0 0
## fesovi hiljam junbal koenit muhmon muhric oryhym orymic phlpra
## bcnp__1 0 0 0 0 0 0 0 0 0
## bcnp__2 0 0 0 0 0 0 0 0 0
## bcnp__3 0 0 0 0 0 0 0 0 0
## bcnp__4 0 0 0 0 0 0 0 0 0
## bcnp__5 0 0 0 0 0 0 0 0 0
## poacom poafen poanev poapra sithys sticom stilet stipin achmil
## bcnp__1 0 0.5 0 0 0 0 0 0 0.0
## bcnp__2 0 0.0 0 0 0 0 0 0 0.5
## bcnp__3 0 0.0 0 0 0 0 0 0 0.5
## bcnp__4 0 0.0 0 0 0 0 0 0 0.0
## bcnp__5 0 0.0 0 0 0 0 0 0 0.0
## agogla anemul antros antros.1 apoand arahol arapen arefen artcar
## bcnp__1 0 0 0 0 0 0 0 0 0
## bcnp__2 0 0 0 0 0 0 0 0 0
## bcnp__3 0 0 0 0 0 0 0 0 0
## bcnp__4 0 0 0 0 0 0 0 0 0
## bcnp__5 0 0 0 0 0 0 0 0 0
## artlud astagr astchi astcon asthum astken astmeg astmis astten
## bcnp__1 0 0 0 0 0 0 0 0.5 0
## bcnp__2 0 0 0 0 0 0 0 0.0 0
## bcnp__3 0 0 0 0 0 0 0 0.0 0
## bcnp__4 0 0 0 0 0 0 0 0.0 0
## bcnp__5 0 0 0 0 0 0 0 0.5 0
## balsag calnut caschr caslin chadou cirneo compal corkin creint
## bcnp__1 0 0 0 0.0 0 0 0 0 0
## bcnp__2 0 0 0 0.0 0 0 0 0 0
## bcnp__3 0 0 0 0.0 0 0 0 0 0
## bcnp__4 0 0 0 0.0 0 0 0 0 0
## bcnp__5 0 0 0 0.5 0 0 0 0 0
## crycon cympur dessop drasub echtri eriala erican erieat erifla
## bcnp__1 0 0 0 0 0 0 0 0 0
## bcnp__2 0 0 0 0 0 0 0 0 0
## bcnp__3 0 0 0 0 0 0 0 0 0
## bcnp__4 0 0 0 0 0 0 0 0 0
## bcnp__5 0 0 0 0 0 0 0 0 0
## eripan eripum erirac erisub eriumb eupfen euplur euprob fraves
## bcnp__1 0 0 0 0 0 0 0.0 0 0
## bcnp__2 0 0 0 0 0 0 0.5 0 0
## bcnp__3 0 0 0 0 0 0 0.0 0 0
## bcnp__4 0 0 0 0 0 0 0.0 0 0
## bcnp__5 0 0 0 0 0 0 0.5 0 0
## genaff gerfre gerric gilcon haparm heddru hymaca hymfil hymric
## bcnp__1 0 0.5 0 0.0 0 0 0.0 0 0
## bcnp__2 0 0.5 0 0.5 0 0 0.0 0 0
## bcnp__3 0 0.5 0 0.0 0 0 0.0 0 0
## bcnp__4 0 0.0 0 0.5 0 0 0.0 0 0
## bcnp__5 0 0.0 0 0.0 0 0 0.5 0 0
## ipoagg irimis ivesab leppun lesint leueri ligpor linkin linlew
## bcnp__1 0 0 0 0 0 0 0 0 0
## bcnp__2 0 0 0 0 0 0 0 0 0
## bcnp__3 0 0 0 0 0 0 0 0 0
## bcnp__4 0 0 0 0 0 0 0 0 0
## bcnp__5 0 0 0 0 0 0 0 0 0
## litinc litmul lotuta lupkin lupser lyggra lygspi macgri molpar
## bcnp__1 0.0 0 0 0 0 0 0 0 0
## bcnp__2 0.0 0 0 0 0 0 0 0 0
## bcnp__3 0.0 0 0 0 0 0 0 0 0
## bcnp__4 0.0 0 0 0 0 0 0 0 0
## bcnp__5 0.5 0 0 0 0 0 0 0 0
## oenbra oencae oencor oenfla oenlav opueri orofas pedcan pedsim
## bcnp__1 0 0 0 0 0 0 0 0.0 0
## bcnp__2 0 0 0 0 0 0 0 0.0 0
## bcnp__3 0 0 0 0 0 0 0 0.0 0
## bcnp__4 0 0 0 0 0 0 0 0.0 0
## bcnp__5 0 0 0 0 0 0 0 0.5 0
## pencae pencom penlei penuta phllon phycha potcon potcri potgra
## bcnp__1 0 0 0 0 0 0 0 0 0
## bcnp__2 0 0 0 0 0 0 0 0 0
## bcnp__3 0 0 0 0 0 0 0 0 0
## bcnp__4 0 0 0 0 0 0 0 0 0
## bcnp__5 0 0 0 0 0 0 0 0 0
## pteand pyrvir salibe sclwhi senmul sphcoc stapin steten strcor
## bcnp__1 0 0 0 0 0.0 0 0 0 0
## bcnp__2 0 0 0 0 0.5 0 0 0 0
## bcnp__3 0 0 0 0 0.5 0 0 0 0
## bcnp__4 0 0 0 0 0.5 0 0 0 0
## bcnp__5 0 0 0 0 0.5 0 0 0 0
## swerad taroff thafen towmin tradub valacu vicame
## bcnp__1 0.0 0 0 0 0 0 0.0
## bcnp__2 0.5 0 0 0 0 0 0.5
## bcnp__3 0.0 0 0 0 0 0 0.0
## bcnp__4 0.0 0 0 0 0 0 0.0
## bcnp__5 0.5 0 0 0 0 0 0.0
## [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
It is a large site-by-species matrix of vegetation abundance from Bryce Canyon, a famous US National Park.
labdsv provides its own function, pca(), as a wrapper for base R’s prcomp().
Create a new object, pca2, using pca() on bryce, setting the cor = argument to TRUE, and the dim = argument to 10 (to return 10 PCA axes).
pca2 <- pca(bryce, cor = TRUE, dim = 10)
You can access the scores and loadings of this PCA by name from the pca2 object (e.g., pca2$scores).
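For comparison, base R's prcomp() keeps the same pieces under different names, scores in $x and loadings in $rotation, so the two objects can be inspected the same way (iris as a self-contained stand-in for the bryce PCA):

```r
# Where prcomp() stores what labdsv's pca() calls $scores and $loadings:
p <- prcomp(iris[, 1:4], scale = TRUE)
head(p$x[, 1:2])      # per-observation scores (labdsv: pca2$scores)
p$rotation[, 1:2]     # per-variable loadings  (labdsv: pca2$loading)
```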
Similar to vegan, we can use plot() on the object pca2 to plot the first two axes. Do so now.
plot(pca2)
We’ll now use our PCA of species at each site to model elevation by site at Bryce Canyon. Look at the head of the ‘site’ variable, loaded with this lesson.
head(site)
## labels annrad asp av depth east elev grorad north pos quad
## 1 50001 241 30 1.00 deep 390 8720 162 4100 ridge pc
## 2 50002 222 50 0.96 shallow 390 8360 156 4100 mid_slope pc
## 3 50003 231 50 0.96 shallow 390 8560 159 4100 mid_slope pc
## 4 50004 254 360 0.93 shallow 390 8660 166 4100 ridge pc
## 5 50005 232 300 0.48 shallow 390 8480 159 4100 up_slope pc
## 6 50006 216 330 0.76 shallow 390 8560 155 4100 mid_slope pc
## slope
## 1 9
## 2 2
## 3 2
## 4 0
## 5 2
## 6 2
The model we will now make lets us determine whether species assemblages predict elevation.
Create a GLM of the elev variable (in site) as a function of the first two PCA axes. You can subset them by position in pca2$scores. Assign this model to elev.glm.
elev.glm <- glm(site$elev ~ pca2$scores[,1] + pca2$scores[,2])
Now look at your model summary.
summary(elev.glm)
##
## Call:
## glm(formula = site$elev ~ pca2$scores[, 1] + pca2$scores[, 2])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1123.26 -361.72 19.48 396.99 964.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7858.56 39.89 196.982 < 2e-16 ***
## pca2$scores[, 1] -56.33 10.96 -5.142 8.01e-07 ***
## pca2$scores[, 2] -105.72 14.44 -7.323 1.19e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 254655.1)
##
## Null deviance: 60370369 on 159 degrees of freedom
## Residual deviance: 39980847 on 157 degrees of freedom
## AIC: 2450.7
##
## Number of Fisher Scoring iterations: 2
We can see that both PC1 and PC2 are significant predictors of elevation. Let’s plot the model to see what this looks like.
Now plot pca2 again.
plot(pca2)
We are now going to plot elevation contours. To do so, we need the interp() function from the akima package to interpolate values where we have none. Load the akima package now (install it first if needed).
library(akima)
Add contour lines with the following code: contour(interp(pca2$scores[,1], pca2$scores[,2], fitted(elev.glm)), add = TRUE, col = 2, labcex = 0.8).
plot(pca2); contour(interp(pca2$scores[,1], pca2$scores[,2], fitted(elev.glm)), add = TRUE, col = 2, labcex = 0.8)
We have just added the predicted elevation contours from our model. Now add the real contours with this code: contour(interp(pca2$scores[,1], pca2$scores[,2], site$elev), add = TRUE, col = 3).
plot(pca2); contour(interp(pca2$scores[,1], pca2$scores[,2], site$elev), add = TRUE, col = 3)
You can see from the fitted values compared to the real values that our model should be quadratic, not linear. Let’s make a second model with squared predictors in addition to our normal predictors.
Create the model elev.glm2 with the following code: glm(site$elev ~ pca2$scores[,1] + I(pca2$scores[,1]^2) + pca2$scores[,2] + I(pca2$scores[,2]^2)).
elev.glm2 <- glm(site$elev ~ pca2$scores[,1] + I(pca2$scores[,1]^2) + pca2$scores[,2] + I(pca2$scores[,2]^2))
Let’s see which model fits better: the linear model or the quadratic model.
Using anova(), compare elev.glm to elev.glm2, setting the test = argument to ‘Chi’.
anova(elev.glm, elev.glm2, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: site$elev ~ pca2$scores[, 1] + pca2$scores[, 2]
## Model 2: site$elev ~ pca2$scores[, 1] + I(pca2$scores[, 1]^2) + pca2$scores[,
## 2] + I(pca2$scores[, 2]^2)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 157 39980847
## 2 155 31469682 2 8511165 7.889e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The residual deviance is much smaller in the quadratic model, indicating a better fit.
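The same linear-vs-quadratic comparison can be sketched on self-contained simulated data with a known curved relationship, so you can see what a clear win for the quadratic model looks like:

```r
# Simulate a quadratic relationship and compare linear vs. quadratic GLMs:
set.seed(1)
x <- runif(100, -2, 2)
y <- 3 + x^2 + rnorm(100, sd = 0.3)

m1 <- glm(y ~ x)             # misspecified linear model
m2 <- glm(y ~ x + I(x^2))    # correctly specified quadratic model

anova(m1, m2, test = "Chi")  # deviance drop strongly supports the x^2 term
c(AIC(m1), AIC(m2))          # the lower AIC also favors the quadratic model
```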
Plot pca2 again.
plot(pca2)
Now add the contours, this time fitted from the quadratic version of our model. You can adjust the code from the linear model to do this.
plot(pca2); contour(interp(pca2$scores[,1], pca2$scores[,2], fitted(elev.glm2)), add = TRUE, col = 2, labcex = 0.8)
Better, but not perfect. Such is life.
This is just the tip of the iceberg for Principal Components Analysis, but you now have the gist of its application. Hopefully you now appreciate how these methods can find structure in large amounts of multivariate data that would otherwise be difficult to describe.
Please submit the log of this lesson to Google Forms so that Simon may evaluate your progress.