Plotting Data

In this lesson, we will learn how to make simple plots of numeric and categorical (factor) data, and begin to modify the defaults of these plots. As with most statistics software, the default options are not great. Usually, you will want to modify your plots, either to fit the publication requirements of a journal or to make the plots ‘better’, in line with recommended best practice.

We will use the base plotting system and look at how you can exploit its many parameters to get the plot you want.

Moreover, we’ll focus on using the base plotting system to create graphics on the screen device rather than another graphics device.

The base system is very intuitive and easy to use. It is akin to drawing a plot using a pencil and ruler, building up the elements piece by piece.

You can’t go backwards, though, say, if you need to readjust margins or have misspelled a caption. To go ‘backwards’, you would need to start drawing the whole plot again from scratch (i.e., re-run the code chunk, which is not normally a big deal). A finished plot will therefore be a series of R commands, so it will be difficult to translate a finished plot into a different system.

Calling a basic routine such as plot(x, y) or hist(x) launches a graphics device (if one is not already open) and draws a new plot on the device. A graphics ‘device’ is just a general word for either the screen or a graphics file such as .pdf or .png.
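As a rough sketch of what ‘device’ means in practice, the snippet below sends a plot to a PDF file instead of the screen. The file name comes from tempfile() and head_mm is a stand-in copy of the Head values used later in this lesson; neither is part of the lesson's own code.

```r
# Stand-in copy of the sparrow head sizes used later in this lesson
head_mm <- c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, 31.6)

# Open a PDF file device; plots now go to this file, not the screen
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
plot(head_mm)
dev.off()  # close the file device so the file is written to disk

file.exists(out_file)
```

The rest of this lesson skips the pdf() and dev.off() calls, so everything is drawn on the screen device.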

Let’s try our hand at some simple graphs using a small dataset on sparrow morphology. It is loaded with this lesson. Have a look at it. Type ‘BirdData’.

BirdData <- data.frame(
  Tarsus  = c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5),
  Head    = c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, 31.6),
  Weight  = c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6, 15.7),
  Wingcrd = c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55),
  # Stored as a factor so that plot() treats Species as categorical
  Species = factor(c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'))
)
BirdData
##   Tarsus Head Weight Wingcrd Species
## 1   22.3 31.2    9.5    59.0       A
## 2   19.7 30.4   13.8    55.0       A
## 3   20.8 30.6   14.8    53.5       A
## 4   20.3 30.3   15.2    55.0       A
## 5   20.8 30.3   15.5    52.5       A
## 6   21.5 30.8   15.6    57.5       B
## 7   20.6 32.5   15.6    53.0       B
## 8   21.5 31.6   15.7    55.0       B

We will start with the generic plot() function, which does different things depending on what data you give it. plot() is a generic function: it checks the class of its input and hands over to a more specific method (for example, plot.factor() or plot.formula()) when one exists for that class.
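To make the idea of ‘handing over’ a little more concrete, here is a minimal sketch that checks the class of two stand-in vectors (hypothetical, not the lesson data) and looks up the methods that plot() would use for a factor and for a formula.

```r
# Stand-in vectors: one numeric, one factor
head_mm <- c(31.2, 30.4, 30.6)
species <- factor(c("A", "A", "B"))

class(head_mm)   # "numeric"
class(species)   # "factor"

# Look up the methods that plot() can dispatch to
factor_method  <- getS3method("plot", "factor")
formula_method <- getS3method("plot", "formula")
is.function(factor_method)
is.function(formula_method)
```

You never need to call these methods yourself; calling plot() is enough, and R picks the method from the class of the first argument.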

You can plot a single vector/column, and it will appear in the same order as the data. Use plot() to plot the $Head column.

plot(BirdData$Head)

You can see that the value of each circle is given along the y-axis. The axis has the label ‘BirdData$Head’. The points are also in the order they appear in the dataset, i.e., according to their ‘index’.

This figure illustrates many of the defaults in the base plotting system in R. Axes have labels based on the data you feed in; everything is in the same font; labels and numbers run parallel to their axis; tick marks lie outside the lines; data points are empty circles.

We will come back to these defaults shortly. For now, let’s see what happens when we input different data.

How about if we wanted to arrange our data in order of their $Head size, rather than the order they appear in the data? Use the function sort() to display these data.

sort(BirdData$Head)
## [1] 30.3 30.3 30.4 30.6 30.8 31.2 31.6 32.5

Now, plot a figure as before, but with the sorted data nested within the call to plot().

plot(sort(BirdData$Head))

Plotting a single numeric variable plots each data point. What does plotting a single categorical variable with plot() do? Try it out… use plot() on the $Species column.

plot(BirdData$Species)

Plotting a factor results in counts of each level, much like the function table(). It is a kind of barplot.
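The link to table() can be seen in a short sketch. The species vector below is a stand-in copy of the Species column, and barplot() is used only to show that drawing the counts directly gives the same kind of figure.

```r
# Stand-in copy of the Species column, stored as a factor
species <- factor(c("A", "A", "A", "A", "A", "B", "B", "B"))

# Count the number of birds in each level
counts <- table(species)
counts

# Drawing the counts directly gives the same kind of bar chart
barplot(counts)
```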

Lovely. You can plot one variable! However, most graphics are used to look at the relationships between two or more variables.

Let’s try plotting two numeric columns first. Plot $Head as a function of $Tarsus. Use a formula to do so: BirdData$Head ~ BirdData$Tarsus.

plot(BirdData$Head ~ BirdData$Tarsus)

Plotting two numeric columns makes a scatterplot. What about a numeric variable as a function of a factor? Plot $Head as a function of $Species.

plot(BirdData$Head ~ BirdData$Species)

A numeric and a factor makes a boxplot.

Now that you can use plot() to make different graphs depending on the input, let’s explore the arguments to plot() in more detail.

We have already used a formula structure to tell plot() what data we want to plot. The tilde (~) in R formula notation means ‘as a function of’. The y variable goes to the left side, and the x variable on the right.

If you look at the help page for plot(), you will see that its two main arguments are x and y: the data for the x- and y-axes. Only x is required; the y argument is optional. When you supply a formula, it is passed as the x argument, so we do not need y.

However, instead of writing a formula, these two arguments can be specified explicitly. Set ‘x’ to BirdData$Species and ‘y’ to BirdData$Head.

plot(x = BirdData$Species, y = BirdData$Head)

Note that in this case, R does not include axis labels.

In many cases, using the formula approach will be slightly easier to understand at a glance, because it is more obvious which variable is being plotted as a function of the other.

We can make things even easier to read and understand if we use the data = argument. Use the arrow keys to scroll up to the previous command that created the boxplot, using a formula. Delete ‘BirdData$’ from before both ‘Head’ and ‘Species’. Add the name of the data object to the argument data =, nested within the plot() function.

plot(Head ~ Species, data = BirdData)

Wow! This code almost reads like plain English! Here it should be clear that we are plotting Head size as a function of Species, both of which come from the BirdData data object.

This data = argument is only available if you use the formula approach to input data to plot(). When given a formula, plot() hands over to the plot.formula() function, which is where data = is defined. Check the help pages if you want.

Another advantage of the data = argument, apart from making the code a bit clearer, is that it then is easier to manipulate the data object from within the call to plot().

For example, if we wanted to subset out the first four sparrows, we could either first create a new data object which is passed to plot: BirdData1 <- BirdData[1:4, ].

Or, we could subset both $Head and $Species individually: plot(BirdData$Head[1:4] ~ BirdData$Species[1:4]).

Both these approaches create extra steps or require repetition, where errors could creep in. It would be much better to use a single location where we subset that applies to both variables that we want to plot.

Plot $Head as a function of $Tarsus, using the data = argument of plot() to subset out the first four sparrows.

plot(Head ~ Tarsus, data = BirdData[1:4, ])
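Another option, not used elsewhere in this lesson, is the subset = argument of plot.formula(), which takes a logical condition evaluated on the columns of data. A minimal sketch, which re-creates BirdData only if it is not already loaded:

```r
# Re-create the lesson data only if it is not already in the workspace
if (!exists("BirdData")) {
  BirdData <- data.frame(
    Tarsus  = c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5),
    Head    = c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, 31.6),
    Weight  = c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6, 15.7),
    Wingcrd = c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55),
    Species = factor(c("A", "A", "A", "A", "A", "B", "B", "B"))
  )
}

# Selecting rows by position: the first four sparrows
plot(Head ~ Tarsus, data = BirdData[1:4, ])

# Selecting rows by condition: only the sparrows of species A
plot(Head ~ Tarsus, data = BirdData, subset = Species == "A")
```

Indexing by position is handy for ‘the first n rows’, while a condition is usually easier to read when the rows are chosen by their values.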

Alternatively, we could sort the whole dataframe by one of the columns, to plot the points in a different order. We used the function sort() to change the arrangement of a single variable when we began this lesson. To sort an entire dataframe is a bit more complex.

Instead of sort(), we would use the function order(). This function returns the indices (positions) of the elements of a vector, arranged so that the corresponding values are in ascending order. Look at what happens when you run order() on BirdData$Tarsus.

order(BirdData$Tarsus)
## [1] 2 4 7 3 5 6 8 1

Now, try using this order expression inside [ ] square brackets to subset the original data.

BirdData[order(BirdData$Tarsus), ]
##   Tarsus Head Weight Wingcrd Species
## 2   19.7 30.4   13.8    55.0       A
## 4   20.3 30.3   15.2    55.0       A
## 7   20.6 32.5   15.6    53.0       B
## 3   20.8 30.6   14.8    53.5       A
## 5   20.8 30.3   15.5    52.5       A
## 6   21.5 30.8   15.6    57.5       B
## 8   21.5 31.6   15.7    55.0       B
## 1   22.3 31.2    9.5    59.0       A

Placing this ordered vector inside the brackets is essentially subsetting all the data. It is akin to one of the uses of sample(), where we sampled the entire vector without replacement, essentially shuffling everything into a random order. However, with order(), we have a specific order in mind. The indices of the vector (the column $Tarsus) are then used as the row numbers across the entire dataset.
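The relationship between sort() and order() can be checked directly. The sketch below uses a stand-in copy of the Tarsus column; the decreasing = TRUE call at the end is an extra not otherwise covered in this lesson.

```r
# Stand-in copy of the Tarsus column
tarsus <- c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5)

# Positions of the values, smallest value first
idx <- order(tarsus)
idx

# Indexing by those positions reproduces the result of sort()
identical(tarsus[idx], sort(tarsus))

# Positions with the largest value first instead
order(tarsus, decreasing = TRUE)
```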

You can see that the whole data object ‘BirdData’ is now sorted by the Tarsus column, which goes from 19.7 to 22.3, compared to the unsorted BirdData.

Now plot this ordered dataset in a call to plot(), for Head size as a function of Tarsus.

plot(Head ~ Tarsus, data = BirdData[order(BirdData$Tarsus), ])

Now that you can create different types of plots with different parts of a data object, let’s look at modifying the default plot.

The default plot in R is not too bad (better than Excel, at least!), but does require some modification for publication. Because x and y are the only arguments that plot() itself names, all further arguments must be specified explicitly by name.

With two numeric variables the default plot is a scatterplot. What if we wanted a line plot instead (i.e., a joined up line that goes from each x-y point)? The argument type = can be used to set the kind of plot, including type = 'l' (a line plot), type = 'b' (both points and lines), as well as type = 'n' (for no plotting).

Make a line plot of Head on Tarsus.

plot(Head ~ Tarsus, data = BirdData, type = 'l')

The line goes from point to point, in the order that the x and y coordinates appear in the dataframe. Now try a graph of points joined by lines in between.

plot(Head ~ Tarsus, data = BirdData, type = 'b')
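Because the line follows row order, unsorted data produce a zigzag. One possible remedy, combining type = with order() from earlier, is to sort by the x variable before plotting. A sketch using a stand-in data frame (called birds here, holding copies of the two columns):

```r
# Stand-in copies of the Tarsus and Head columns
birds <- data.frame(
  Tarsus = c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5),
  Head   = c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, 31.6)
)

# Sort the rows by the x variable so the line runs left to right
sorted_birds <- birds[order(birds$Tarsus), ]

plot(Head ~ Tarsus, data = sorted_birds, type = "l")
```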

We can add text to this figure to describe the x and y-axes, as well as give the plot a title. The arguments xlab =, ylab =, and main = correspond to these parts of the graph, and all take text strings in quotes.

For a graph of sparrow head size as a function of species, set xlab to ‘Species of Sparrow’, ylab to ‘Head Size (mm)’, and the graph title to ‘A Boxplot of Sparrows’.

plot(Head ~ Species, data = BirdData, xlab = 'Species of Sparrow', ylab = 'Head Size (mm)', main = 'A Boxplot of Sparrows')

A further thing you might wish to modify is the plotting characters themselves. Maybe empty circles are just not doing it for you.

This figure illustrates some of the plotting characters available to you. They are called by number, using the argument pch =. Characters 21 to 25 take two colours: one for the outline and a separate fill colour, set with bg =. We will look at colour in the next lesson.

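# Grid positions for the 25 symbols: 5 columns, each filled top to bottom
x_pos <- rep(1:5, each = 5)
y_pos <- rep(5:1, times = 5)

plot(x = x_pos, y = y_pos,
     axes = FALSE,
     xlab = "", ylab = "",
     xlim = c(1, 6), ylim = c(1, 5),
     pch = 1:25,    # one symbol per point
     bg = "red",    # fill colour, used only by symbols 21 to 25
     cex = 3)
# Label each symbol with its pch number, just to its right
text(x = x_pos + 0.5, y = y_pos, labels = as.character(1:25))
# Light grid lines to separate the cells
abline(h = (1:5) + 0.5, col = "grey")
abline(v = (1:5) + 0.75, col = "grey")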
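As a small sketch of the two-colour symbols, the call below draws filled circles with a black outline and a grey fill. The vectors are stand-in copies of the Tarsus and Head columns, and the colour names are arbitrary choices.

```r
# Stand-in copies of the Tarsus and Head columns
tarsus  <- c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5)
head_mm <- c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, 31.6)

# pch = 21 is a circle with a separate outline (col) and fill (bg)
plot(x = tarsus, y = head_mm, pch = 21, col = "black", bg = "grey")
```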

Finally, the size of almost anything in a plot can be altered using cex =, a number giving the magnification relative to the default. The default magnification changes depending on the layout of the plotting area, but starts as 1.

On its own, cex = scales the plotting symbols and text. The axis labels and the main title are scaled with cex.lab = and cex.main =, and the axis tick annotations with cex.axis =.
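A brief sketch of the size arguments used together. The vectors are stand-in copies of the Tarsus and Head columns; the title and the magnification values are arbitrary, chosen only to make each effect visible.

```r
# Stand-in copies of the Tarsus and Head columns
tarsus  <- c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5)
head_mm <- c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, 31.6)

plot(x = tarsus, y = head_mm,
     xlab = "Tarsus", ylab = "Head Size (mm)",
     main = "Sparrow Head and Tarsus",  # hypothetical title
     cex = 1.5,       # plotting symbols 1.5 times the default size
     cex.lab = 1.2,   # axis labels slightly larger
     cex.main = 2,    # main title twice the default size
     cex.axis = 0.8)  # tick annotations slightly smaller
```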

Repeat the figure of Head on Tarsus as before, but make the plotting symbols three times as large and use upside-down triangles. There is no need to set different labels or a title. Remember that in RStudio you can use the arrow buttons to move back and forth through all the plots that were made in this session.

plot(Head ~ Tarsus, data = BirdData, pch = 6, cex = 3)