First, remember that a vector is just a single sequence of data elements (e.g., the numbers 1 to 10). These data elements are all the same type (e.g., all numbers, or all integers, or all text). The sequence can be of any length, even of length 1.
In this lesson, we’ll see how to extract elements from a vector based on some conditions that we specify. In other words, we want to select some of the numbers in a vector based either on their position in the vector or the value that each number has.
For example, we may only be interested in the first 20 elements of a vector, or only the elements that are not NA, or only those that are positive or correspond to a specific variable of interest. By the end of this lesson, you’ll know how to handle each of these scenarios.
NA is a special value in R that marks a missing value or entry. For example, empty cells in a spreadsheet are typically read into R as NA. NAs don’t behave like ordinary values, so we often need to work around them. We’ll also teach you how to subset around NAs in this lesson.
I’ve created for you a vector called x that contains 40 elements in random order: numbers drawn from a standard normal distribution (i.e., a bell curve) mixed with NAs. Type x now to see what it looks like.
x
## [1] 0.21771094 NA 0.15449270 -2.61517750 NA
## [6] 1.80052287 NA -0.36517295 -0.19652421 NA
## [11] -3.52363368 -0.15117132 1.12422272 NA NA
## [16] NA -0.88643968 NA 1.74777493 -0.40056175
## [21] -0.24152672 NA 0.92819946 NA NA
## [26] NA NA NA NA -1.32037266
## [31] NA -0.19223905 NA NA -0.02350852
## [36] 2.09120911 -2.09498133 0.09690617 NA -0.05169677
Look carefully at x in your R console above. It has 40 ‘elements’, or pieces of data, in the vector. You can see the NAs and the numbers.
You can also see some numbers in square brackets to the left of the screen. The first one should be a 1: [1]. These numbers in square brackets tell us where we are in the vector. [1] tells us that the value next to it is the first element, position 1 in the vector.
The number in square brackets vertically below the [1] tells us the position of the first element printed on the new line. This may be [6], [7], or [8], depending on the width of your console.
Count along the vector. What is the value of the fourth element? Type it out in full.
-2.6151775
## [1] -2.615178
We can use this numbered or ‘indexing’ system in reverse to tell R what we want to look at. Type x[4] into your console.
x[4]
## [1] -2.615178
We just told R to show us the fourth element of x.
So, the way to tell R that you want to select some particular elements (i.e., a ‘subset’) from a vector is by placing an ‘index’ in square brackets immediately following the name of the vector. This index tells R which locations in the vector to pull out.
For a simple example, try x[1:10] to view the first ten elements of x. (Remember that : is a neat way to generate a sequence of numbers!)
x[1:10]
## [1] 0.2177109 NA 0.1544927 -2.6151775 NA 1.8005229
## [7] NA -0.3651730 -0.1965242 NA
An index (pl. indices) comes in four different flavors—logical, positive integers, negative integers, and character strings—each of which we’ll cover in this lesson.
Usually you will want to subset/index more than one value, and so you will use a vector of indices to index the (original) vector … so to speak.
Let’s start by indexing with logical vectors. One common scenario when working with real-world data is that we want to extract all elements of a vector that are not NA (i.e., missing data). Recall that is.object_type yields TRUE or FALSE depending on the type of object you’re working with. In a similar vein, is.data_type does the same thing with data. For example, is.na(x) yields a vector of logical values the same length as x, with TRUEs corresponding to NA values in x and FALSEs corresponding to non-NA values in x.
Before we subset, let’s see what this logical vector looks like. Enter is.na(x) now.
is.na(x)
## [1] FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
## [12] FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
## [23] FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
## [34] TRUE FALSE FALSE FALSE FALSE TRUE FALSE
As you can see, this outputs a vector of logical values (trues and falses). Each element of this vector corresponds to the same element of x. For example, the first TRUE/FALSE is telling you whether the first element of x, in R notation x[1], is NA or not.
When we pass a logical vector of equal length to our index brackets [ ], R will return only those values corresponding to TRUE. Let’s try this now.
What do you think x[is.na(x)] will give you?
A vector of all NAs
Prove it to yourself by typing x[is.na(x)].
x[is.na(x)]
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Another useful operator is the exclamation mark !, which gives us the negation of a logical expression, so !is.na(x) can be read as ‘is not NA’. Therefore, if we want to create a vector called y that contains all of the non-NA values from x, we can use y <- x[!is.na(x)]. Give it a try.
y <- x[!is.na(x)]
Print y to the console.
y
## [1] 0.21771094 0.15449270 -2.61517750 1.80052287 -0.36517295
## [6] -0.19652421 -3.52363368 -0.15117132 1.12422272 -0.88643968
## [11] 1.74777493 -0.40056175 -0.24152672 0.92819946 -1.32037266
## [16] -0.19223905 -0.02350852 2.09120911 -2.09498133 0.09690617
## [21] -0.05169677
Now that we’ve isolated the non-missing values of x and put them in y, we can subset y as we please.
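To make the logical-index idea concrete, here is a minimal sketch on a tiny made-up vector. The name scores and its values are assumptions for illustration only; they are not part of the lesson’s x.

```r
# Hypothetical example vector (not the lesson's x): three numbers and two NAs.
scores <- c(4.2, NA, -1.5, NA, 0.7)

# is.na() returns one logical value per element: TRUE where the value is missing.
is_missing <- is.na(scores)

# Indexing with a logical vector keeps the elements where the index is TRUE.
only_na <- scores[is_missing]    # the two NAs
non_na  <- scores[!is_missing]   # 4.2, -1.5, 0.7

# sum() on a logical vector counts the TRUEs, which gives the number of NAs.
n_missing <- sum(is_missing)     # 2
```

Storing the logical vector in its own variable first makes it easy to inspect before you use it as an index.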
Functions aren’t the only way to create logical vectors. We can also use operators that compare one value to another and respond with a TRUE or FALSE. These operators include greater than ‘>’, less than ‘<’, is equal to ‘==’, and is not equal to ‘!=’. Note that two equal signs are needed in ‘equal to’ because a single = instead denotes an assignment operator, much like ‘<-’.
For example, the expression y > 0 will give us a vector of logical values the same length as y, with TRUEs corresponding to values of y that are greater than zero and FALSEs corresponding to values of y that are less than or equal to zero. Enter y > 0 now to see.
y > 0
## [1] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [12] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
Let’s subset. What do you think y[y > 0] will give you?
A vector of all the positive elements of y
Type y[y > 0] to see that we get all of the positive elements of y, which are also the positive elements of our original vector x.
y[y > 0]
## [1] 0.21771094 0.15449270 1.80052287 1.12422272 1.74777493 0.92819946
## [7] 2.09120911 0.09690617
You might wonder why we didn’t just start with x[x > 0] to isolate the positive elements of x. Try that now to see why.
x[x > 0]
## [1] 0.21771094 NA 0.15449270 NA 1.80052287 NA
## [7] NA 1.12422272 NA NA NA NA
## [13] 1.74777493 NA 0.92819946 NA NA NA
## [19] NA NA NA NA NA NA
## [25] 2.09120911 0.09690617 NA
Since NA is not a value, but rather a placeholder for an unknown quantity, the expression NA > 0 evaluates to NA. Hence we get a bunch of NAs mixed in with our positive numbers when we do this.
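The sketch below shows the same effect on a small made-up vector (scores is a hypothetical name, not the lesson’s x). It also shows which(), a base R function not otherwise used in this lesson, as one possible way to drop the NA entries from a logical index.

```r
# Hypothetical example vector with two missing values.
scores <- c(4.2, NA, -1.5, NA, 0.7)

# Comparing NA with anything gives NA, so the logical index contains NAs too.
na_compare <- NA > 0             # NA
is_positive <- scores > 0        # TRUE, NA, FALSE, NA, TRUE

# An NA in a logical index produces an NA in the result.
with_na <- scores[is_positive]   # 4.2, NA, NA, 0.7

# which() returns only the positions that are TRUE, silently skipping NAs.
positive_only <- scores[which(is_positive)]   # 4.2, 0.7
```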
We can also string together logical statements with the ‘and’ operator &, meaning that both statements must be true to evaluate as true, and the ‘or’ operator |, meaning that at least one statement must be true to evaluate as true.
Combining our knowledge of logical operators with our knowledge of subsetting, we could do this: x[!is.na(x) & x > 0]. Try it out.
x[!is.na(x) & x > 0]
## [1] 0.21771094 0.15449270 1.80052287 1.12422272 1.74777493 0.92819946
## [7] 2.09120911 0.09690617
In this case, we obtain only values of x that are both non-missing AND greater than zero.
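Here is a small sketch of why the & version avoids the stray NAs, again using a made-up vector called scores (an assumption for illustration). The thresholds -1 and 4 are arbitrary.

```r
# Hypothetical example vector with two missing values.
scores <- c(4.2, NA, -1.5, NA, 0.7)

# FALSE & NA is FALSE, because the result is FALSE whatever the unknown value is.
# TRUE & NA stays NA, because the result depends on the unknown value.
false_and_na <- FALSE & NA       # FALSE
true_and_na  <- TRUE & NA        # NA

# For missing elements, !is.na(scores) is FALSE, so the combined index is FALSE there.
positive <- scores[!is.na(scores) & scores > 0]                   # 4.2, 0.7

# | keeps an element when at least one condition holds.
extreme <- scores[!is.na(scores) & (scores < -1 | scores > 4)]    # 4.2, -1.5
```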
I’ve already shown you how to subset just the first ten values of x using x[1:10]. In this case, we’re providing a vector of positive integers inside of the square brackets, which tells R to return only the elements of x numbered 1 through 10.
Many programming languages use what’s called ‘zero-based indexing’, which means that the first element of a vector is considered element 0. R uses ‘one-based indexing’, which (you guessed it!) means the first element of a vector is considered element 1.
Can you figure out how we’d subset the 3rd, 5th, and 7th elements of x? Hint: use the c() function to specify the element numbers as a numeric vector.
x[c(3, 5, 7)]
## [1] 0.1544927 NA NA
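A short sketch of positive integer indexing on a made-up character vector (letters_vec is a hypothetical name). It illustrates that the result follows the order of the index vector, and that an index may be repeated.

```r
# Hypothetical example vector of five letters.
letters_vec <- c("a", "b", "c", "d", "e")

# Pick specific positions with c().
picked <- letters_vec[c(3, 5)]                 # "c", "e"

# The result follows the order of the index, and positions can repeat.
reordered <- letters_vec[c(5, 1, 1)]           # "e", "a", "a"

# Any expression that yields positive whole numbers works, e.g. seq().
every_other <- letters_vec[seq(1, 5, by = 2)]  # "a", "c", "e"
```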
It’s important that when using integer vectors to subset our vector x, we stick with the set of indexes {1, 2, …, 40} since x only has 40 elements. What happens if we ask for the zeroth element of x (i.e., x[0])? Give it a try.
x[0]
## numeric(0)
As you might expect, we get nothing useful. Unfortunately, R doesn’t prevent us from doing this. What if we ask for the 3000th element of x? Try it out.
x[3000]
## [1] NA
Again, nothing useful, but R doesn’t prevent us from asking for it. This should be a cautionary tale. You should always make sure that what you are asking for is within the bounds of the vector you’re working with.
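One way to guard against out-of-bounds requests is to check the index against length() before subsetting. The sketch below is only an illustration: get_element() is a hypothetical helper written for this example, not a base R function, and scores is a made-up vector.

```r
# Hypothetical example vector with five elements.
scores <- c(4.2, NA, -1.5, NA, 0.7)

# R quietly returns an empty vector for index 0 and NA for an index past the end.
zeroth_length <- length(scores[0])    # 0
beyond_end    <- scores[3000]         # NA

# Hypothetical helper: stop with an error unless i is a single valid position.
get_element <- function(v, i) {
  stopifnot(length(i) == 1, i >= 1, i <= length(v))
  v[i]
}

third <- get_element(scores, 3)       # -1.5

# tryCatch() turns the error into a message so the script keeps running.
out_of_range <- tryCatch(
  get_element(scores, 3000),
  error = function(e) "index out of bounds"
)
```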
What if we’re interested in all elements of x except the 2nd and 10th elements? It would be pretty tedious to construct a vector containing all numbers 1 through 40 except 2 and 10.
Luckily, R accepts negative integer indexes. Whereas x[c(2, 10)] gives us only the 2nd and 10th elements of x, x[c(-2, -10)] gives us all elements of x except for the 2nd and 10th elements. Try x[c(-2, -10)] now to see this.
x[c(-2, -10)]
## [1] 0.21771094 0.15449270 -2.61517750 NA 1.80052287
## [6] NA -0.36517295 -0.19652421 -3.52363368 -0.15117132
## [11] 1.12422272 NA NA NA -0.88643968
## [16] NA 1.74777493 -0.40056175 -0.24152672 NA
## [21] 0.92819946 NA NA NA NA
## [26] NA NA -1.32037266 NA -0.19223905
## [31] NA NA -0.02350852 2.09120911 -2.09498133
## [36] 0.09690617 NA -0.05169677
A shorthand way of specifying multiple negative numbers is to put the negative sign out in front of the vector of positive numbers. Type x[-c(2, 10)] to get the exact same result.
x[-c(2, 10)]
## [1] 0.21771094 0.15449270 -2.61517750 NA 1.80052287
## [6] NA -0.36517295 -0.19652421 -3.52363368 -0.15117132
## [11] 1.12422272 NA NA NA -0.88643968
## [16] NA 1.74777493 -0.40056175 -0.24152672 NA
## [21] 0.92819946 NA NA NA NA
## [26] NA NA -1.32037266 NA -0.19223905
## [31] NA NA -0.02350852 2.09120911 -2.09498133
## [36] 0.09690617 NA -0.05169677
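The same behaviour is easier to see on a short made-up vector (letters_vec is a hypothetical name). The sketch also shows two details worth knowing: a range needs parentheses before negation, and positive and negative indexes cannot be mixed in one index vector.

```r
# Hypothetical example vector of five letters.
letters_vec <- c("a", "b", "c", "d", "e")

# Both forms drop the 2nd and 4th elements.
dropped_a <- letters_vec[c(-2, -4)]   # "a", "c", "e"
dropped_b <- letters_vec[-c(2, 4)]    # "a", "c", "e"

# Parentheses matter with ranges: -(1:2) is -1, -2, whereas -1:2 is -1, 0, 1, 2.
without_first_two <- letters_vec[-(1:2)]   # "c", "d", "e"

# Mixing positive and negative indexes is an error, caught here with tryCatch().
mixed <- tryCatch(letters_vec[c(-1, 2)], error = function(e) "error")
```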
So far, we’ve covered three types of index vectors—logical, positive integer, and negative integer. The only remaining type requires us to introduce the concept of ‘named’ elements.
Create a numeric vector with three named elements using vect1 <- c(foo = 11, bar = 2, norf = NA).
vect1 <- c(foo = 11, bar = 2, norf = NA)
When we type vect1 in the console, you’ll see that each element has its name printed above it. Try it out.
vect1
## foo bar norf
## 11 2 NA
We can also get the names of vect1 by passing vect1 as an argument to the function names(). Give that a try.
names(vect1)
## [1] "foo" "bar" "norf"
Alternatively, we can create an unnamed vector, vect2, with c(11, 2, NA). Do that now.
vect2 <- c(11, 2, NA)
Then, we can add the names attribute to vect2 after the fact with names(vect2) <- c("foo", "bar", "norf"). Go ahead. (Notice that now we use quotation marks to specify the names.)
names(vect2) <- c("foo", "bar", "norf")
Now, let’s check that vect1 and vect2 are the same by passing them as arguments to the identical() function.
identical(vect1, vect2)
## [1] TRUE
Indeed, vect1 and vect2 are identical vectors.
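As a side note, base R also offers setNames() (from the stats package, which is attached by default) to create and name a vector in one step, and unname() to strip names again. The sketch below uses the hypothetical names vect3, vect4, and bare so that it does not overwrite the lesson’s vect1 and vect2.

```r
# Two-step approach from the lesson, on a hypothetical vector vect3.
vect3 <- c(11, 2, NA)
names(vect3) <- c("foo", "bar", "norf")

# One-step alternative: setNames() returns the vector with names attached.
vect4 <- stats::setNames(c(11, 2, NA), c("foo", "bar", "norf"))

# Both approaches give the same named vector.
same <- identical(vect3, vect4)   # TRUE

# unname() returns a copy without the names attribute.
bare <- unname(vect3)
```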
Now, back to the matter of subsetting a vector by named elements. Which of the following commands do you think would give us the second element of vect1?
vect1["bar"]
Now, try it out.
vect1["bar"]
## bar
## 2
Likewise, we can specify a vector of names with vect1[c("foo", "bar")]. Try it out.
vect1[c("foo", "bar")]
## foo bar
## 11 2
Actually, you can also use integer indexes on named vectors. First, notice that one of the options in the question above was vect1["2"]. Does this work? Try it out.
vect1["2"]
## <NA>
## NA
As above, integer indexes need to be integers. Putting the number 2 in quotes tells R to treat it as a character string, so R looks for an element named "2", finds none, and returns NA.
So, how would you index the second element of vect1 using an integer?
vect1[2]
## bar
## 2
BEST PRACTICE: If you have names, it is best to use them. This will become more important when we start working with data frames and lists. The problem is that if the columns in a data frame get out of order, column 2 may no longer be what it was. Using a name ensures that R always retrieves the element you are actually interested in.
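The point about ordering can be sketched with a small named vector. The name prices and its values are made up for illustration.

```r
# Hypothetical named vector.
prices <- c(apple = 1.2, pear = 0.8, plum = 2.5)

# The same data in a different order, as might happen after sorting or merging.
shuffled <- prices[c("plum", "apple", "pear")]

# Position 2 refers to different elements in the two vectors...
second_original <- names(prices[2])     # "pear"
second_shuffled <- names(shuffled[2])   # "apple"

# ...but the name "pear" finds the same value in both.
pear_original <- unname(prices["pear"])     # 0.8
pear_shuffled <- unname(shuffled["pear"])   # 0.8

# A name that does not exist returns NA rather than an error, so check spelling.
no_such_name <- unname(prices["kiwi"])      # NA
```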
Yay!! Now you know all four methods of subsetting data from vectors. Different approaches are best in different scenarios and when in doubt, try it out!