Here, we illustrate the variety of different graphs possible in R. These are all done with the base R install. Additional packages such as lattice and ggplot2 provide more options and could be looked at once you are familiar with these.
Functions: hist(), barplot(), pie(), stripchart(), boxplot(), dotchart(), coplot()
Packages: vioplot, plotrix
A. Foundational Knowledge
B. Application
C. Integration & Human Dimension
Bird data.
BirdData <- data.frame(
Tarsus = c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5),
Head = c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, 31.6),
Weight = c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6, 15.7),
Wingcrd = c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55),
Species = c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B')
)
Choosing the most appropriate graph for the data is one of the most important decisions you can make in presenting your results.
Here, we will look at some basic graphs, based on whether the x (predictor/independent variable) and y (response/dependent variable) are categorical or continuous:
Single continuous
Predictor and response are continuous
Single categorical
Categorical predictor and continous response
Stephen Few has a great Graph Selection Matrix
For each graph type, we will look at the default appearance and some of the more useful arguments we can alter.
**data:** single continuous variable
use: illustrating counts or distributions of data
example: bird data
Histograms present the frequency (count) or probability density distribution of a vector of data.
breaks
describes the number of bins or breakpoints between bins.freq = TRUE
defaults to counts; set to FALSE
to plot probability distribution.plot = TRUE
default will plot the histogram. Set to FALSE
to not plot, and store the counts in an object for future use.# set layout and margins
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1), lwd = 2)
hist(BirdData$Head, main = "default histogram")
hist(BirdData$Head, breaks = 10, main = "breaks = 10")
hist(BirdData$Head, breaks = seq(from = 30, to = 33, by = 0.25), main = "breaks = seq()" )
# plot probability density distribution
hist(BirdData$Head, breaks = seq(from = 30, to = 33, by = 0.25), freq = FALSE, main = "probability density" )
**data:** continuous response; continuous predictor.
use: illustrating relationships and correlations between continuous variables, …
example: BirdData
We have already used scatter plots (see Advanced Graphics.
Here, however, we add a legend to two of the panels, using legend()
.
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1), lwd = 2)
plot(Head ~ Tarsus, data = BirdData,
xlab = 'Tarsus (mm)',
ylab = 'Head Size (mm)',
main = 'Head vs Tarsus',
pch = 20,
col = Species,
cex = 3)
plot(Head ~ Wingcrd, data = BirdData,
xlab = 'Wingcrd (m)',
ylab = 'Head Size (mm)',
main = 'Head vs Wing',
pch = 20,
col = Species,
cex = 3)
plot(Head ~ Weight, data = BirdData,
xlab = 'Weight (kg)',
ylab = 'Head Size (mm)',
main = 'Head vs Weight',
pch = 20,
col = Species,
cex = 3)
# Here, we add a legend to this plot
legend('topleft',
pch = c(20, 20), col = c(1, 2),
legend = c('Species A', 'Species B')
)
plot(Head ~ Weight, data = BirdData,
xlab = 'Species',
ylab = 'Head Size (mm)',
main = 'Head vs Species',
pch = c(20, 23),
col = 1:2,
cex = Tarsus/5)
# Another legend
legend('topleft',
pch = c(20, 23), col = c(1, 2),
legend = c('Species A', 'Species B')
)
A pie chart is a circular representation of counts or proportion, with the angle corresponding to the value of each category.
data: usually counts or proportion response; categorical predictor
use: never! (unless it involves real pie)
example: BirdFlu
We have annual data from the years 2003-2008. Has the total number of bird flu cases has increased over time?
Load the data and select only cases columns.
BFdata <- read.table(file = 'http://www.simonqueenborough.info/R/basic/data/birdflu_corrected.txt', header = TRUE, sep = '\t')
# select cases columns only
BFcases <- BFdata[, c('cases03', 'cases04', 'cases05', 'cases06', 'cases07', 'cases08')]
names(BFdata)
## [1] "Country" "cases03" "deaths03" "cases04" "deaths04" "cases05"
## [7] "deaths05" "cases06" "deaths06" "cases07" "deaths07" "cases08"
## [13] "deaths08"
str(BFdata)
## 'data.frame': 15 obs. of 13 variables:
## $ Country : Factor w/ 15 levels "Azerbaijan","Bangladesh",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ cases03 : int 0 0 0 1 0 0 0 0 0 0 ...
## $ deaths03: int 0 0 0 1 0 0 0 0 0 0 ...
## $ cases04 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ deaths04: int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases05 : int 0 0 4 8 0 0 20 0 0 0 ...
## $ deaths05: int 0 0 4 5 0 0 13 0 0 0 ...
## $ cases06 : int 8 0 2 13 1 18 55 3 0 0 ...
## $ deaths06: int 5 0 2 8 0 10 45 2 0 0 ...
## $ cases07 : int 0 0 1 5 0 25 42 0 2 1 ...
## $ deaths07: int 0 0 1 3 0 9 37 0 2 0 ...
## $ cases08 : int 0 1 0 3 0 7 18 0 0 0 ...
## $ deaths08: int 0 0 0 3 0 3 15 0 0 0 ...
Now we make a vector of the sum of each year’s cases using colSums()
, and add names to each element using names()
.
Cases <- colSums(BFcases)
names(Cases) <- 2003:2008
Cases
## 2003 2004 2005 2006 2007 2008
## 4 46 98 115 88 34
The pie()
function requires a vector of non-negative integers. The names associated with each element are added as labels.
The default direction of pie()
is anti-clockwise. This can be reversed setting clockwise = TRUE
(the second pie chart).
For those who will sink so low, there is also a 3-D pie chart function in the ‘plotrix’ package…
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))
pie(Cases, main = "Ordinary pie chart")
pie(Cases, col = gray(seq(0.4, 1.0, length = 6)), clockwise = TRUE, main = "Grey colours")
pie(Cases, col = rainbow(6), clockwise = TRUE, main = "Rainbow colours")
#install.packages('plotrix')
library(plotrix)
pie3D(Cases, labels = names(Cases), explode = 0.1, main = "3D pie chart", labelcex = 0.6)
Most professional statistical graphicicians recommend against using pie charts for the following reasons:
See here for a greater discussion of these issues.
An improvement on the pie chart is the stacked bar plot. This allows the viewer to more readily compare the sizes of each section because we also have an axis. But.. there are no labels (although we could add them with text()
or legend()
).
# To get the 'stacked' bars, we need to convert our vector to a matrix (barplot() will only stack a matrix).
barplot(as.matrix(Cases), horiz = TRUE, xlim = c(0, 400))
The barplot()
function is discussed in more detail below.
There are a variety of ways to plot continuous data as a function of categorical predictors.
**data:** continuous or count response; categorical or binned continuous predictor
use: comparing categories, showing distributions, …
example: BirdFluDeath
We use the bird flu data again, but this time select the deaths columns.
BFdata <- read.table(file = 'http://www.simonqueenborough.info/R/basic/data/birdflu_corrected.txt', header = TRUE, sep = '\t')
# select cases columns only
BFdeaths <- BFdata[, c('deaths03', 'deaths04', 'deaths05', 'deaths06', 'deaths07', 'deaths08')]
names(BFdeaths)
## [1] "deaths03" "deaths04" "deaths05" "deaths06" "deaths07" "deaths08"
str(BFdeaths)
## 'data.frame': 15 obs. of 6 variables:
## $ deaths03: int 0 0 0 1 0 0 0 0 0 0 ...
## $ deaths04: int 0 0 0 0 0 0 0 0 0 0 ...
## $ deaths05: int 0 0 4 5 0 0 13 0 0 0 ...
## $ deaths06: int 5 0 2 8 0 10 45 2 0 0 ...
## $ deaths07: int 0 0 1 3 0 9 37 0 2 0 ...
## $ deaths08: int 0 0 0 3 0 3 15 0 0 0 ...
# Make a vector as before
Deaths <- colSums(BFdeaths)
names(Deaths) <- 2003:2008
Deaths
## 2003 2004 2005 2006 2007 2008
## 4 32 43 79 59 26
How have the number of cases and deaths changed over time?
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1), lwd = 2)
barplot(Cases, main = "default barplot of Cases")
# Create a new variable
Counts <- cbind(Cases, Deaths) # makes a matrix
barplot(Counts, main = "default with 2-column matrix" )
barplot(t(Counts), col = gray(c(0.5, 1)), main = "stacked")
barplot(t(Counts), beside = TRUE, main = "beside")
The bird flu data are just counts, and so we have to use a continuous data set (deer) to obtain means and variances.
First, we need to calculate the mean and standard deviation values for bird heads of each species.
head.mean <- tapply(BirdData$Head, INDEX = BirdData$Species, FUN = mean)
head.sd <- tapply(BirdData$Head, INDEX = BirdData$Species, FUN = sd)
barplot(head.mean,
xlab = "Species", ylab = "Head Length (mm)",
ylim = c(0, 35),
col = 1:2)
Then we can plot the data and add error bars.
There are two options for creating error bars: arrows()
and segments()
. Both work in a similar way, drawing lines between two coordinates. For vertical error bars, you keep the x coords the same for each line, and for horizontal error bars keep the y coords the same. arrows()
adds an arrow head to one end of the line, and you can set the angle
to 90 degrees so that it is a flat arrow. As usual, the coordinate arguments x0
, y0
, x1
and y1
can take single values or vectors.
We first need to determine the x values for the lines. This is done by assigning the barplot()
output to an object and then calling that. This saves the values of the midpoints of each bar.
xvals <- barplot(head.mean, plot = FALSE)
xvals
## [,1]
## [1,] 0.7
## [2,] 1.9
We can then add the arrows to the bar plot.
barplot(head.mean,
xlab = "Species", ylab = "Head Length (mm)",
ylim = c(0, 35),
col = 1:2)
arrows(x0 = xvals, y0 = head.mean, # arrow from ...
x1 = xvals, y1 = head.mean + head.sd, # ... to
lwd = 2, angle = 90, length = 0.1)
The function segments()
works in an identical fashion, but has no arrow head.
Bar plots are fine, but are limited in that they do not display much of the raw data, and use a lot of ink to convey that small amount of information. There are other plots that show more of the data.
To provide the viewer with a much better idea of the distribution of your data, a boxplot can be used.
**data:** continuous response; categorical predictor
use:
example: Bird data
We already used plot()
to make a box plot. However, the function boxplot()
has more arguments and we can modify the appearance much more.
par(mfrow = c(2, 2), lwd = 2)
plot(Head ~ Species,
data = BirdData,
main = "default boxplot with plot()")
boxplot(Head ~ Species,
data = BirdData,
main = "default boxplot with boxplot()")
boxplot(Head ~ Species,
data = BirdData,
notch = TRUE,
main = "A notched boxplot")
## Warning in bxp(structure(list(stats = structure(c(30.3, 30.3, 30.4, 30.6, :
## some notches went outside hinges ('box'): maybe set notch=FALSE
# install.packages('vioplot')
library(vioplot)
bird.A <- BirdData$Head[BirdData$Species == 'A']
bird.B <- BirdData$Head[BirdData$Species == 'B']
#vioplot(bird.A, bird.B)
#mtext(side = 3, text = "default violin plot", line = 1.8, font = 2)
upper - lower
= the spread of the dataGraphical best practice is to show both raw data and summary data if possible. The stripchart()
function can be used to do just that.
**data:** continuous response; categorical predictor
use: differences between categories, plotting raw data and summary data on top
example: BirdData
The default plots all data on one line, therefore we need to set method = 'jitter'
to add some random y-axis variation to each point to show the raw data more clearly.
Using the mean and standard deviations we calculated before, we can then use points()
and arrows()
to add the summary data on top.
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1), lwd = 2, las = 1, cex = 1.25)
stripchart(BirdData$Head ~ BirdData$Species,
xlim = c(29, 34), ylim = c(0.5, 2.5),
main = "default strip chart",
col = 'grey80')
stripchart(BirdData$Head ~ BirdData$Species,
xlim = c(29, 34), ylim = c(0.5, 2.5),
main = "add jitter to y",
col = 'grey80',
method = 'jitter')
stripchart(BirdData$Head ~ BirdData$Species,
xlim = c(29, 34), ylim = c(0.5, 2.5),
main = "add summary data",
col = 'grey80',
method = 'jitter')
points(x = head.mean, y = 1:2, pch = 20, cex = 2)
arrows(x0 = head.mean, y0 = 1:2,
x1 = head.mean + head.sd, y1 = 1:2,
angle = 90, length = 0.1, lwd = 2)
arrows(x0 = head.mean, y0 = 1:2,
x1 = head.mean - head.sd, y1 = 1:2,
angle = 90, length = 0.1, lwd = 2)
There are many other possibilities for graphics, depending on the nature of your data and what you want to show of it.