Subsetting Dataframes

Now that we’re finally thinking in 2 dimensions, we’re going to learn one of the most important skills in the course, subsetting data frames.

Lets start with a quick refresher on subsets. Make a vector called my_seq and give it the values from 0 to 20 in increments of 2.

my_seq <- seq(0,20, 2)

Now display the first 5 elements of the vector. As a reminder, you’ll need square brackets [] along with a vector of the values 1 through 5.

my_seq[1:5]

## [1] 0 2 4 6 8

Display only those values greater than 10. You’ll need to logically subset here. Remember that we can make a logical vector with the ‘<’ or ‘>’ operator, and use this logical vector as our subset condition in the [].

my_seq[my_seq > 10]

## [1] 12 14 16 18 20

Now that you’re refreshed on subsets, we’re going to apply this logic in 2 dimensions.

We’ve loaded a data frame for you, called forest_area. It has the area of every country, the percentage of that area covered in forest, and their respective ranks in forest cover. Look at the first 6 rows now using head().

head(forest_area)

##   rank       country land.area.km per.forest
## 1    1        Russia      8149300      49.40
## 2    2        Canada      4916438      49.24
## 3    3        Brazil      4776980      56.10
## 4    4 United States      3100950      33.84
## 5    5         China      2083210      21.83
## 6    6     Australia      1470832      19.90

You can see we have 4 columns, most of which are self-explanatory. Per.forest stands for percentage forest. Lets now look at the structure of our data frame. Use the str() function.

str(forest_area)

## 'data.frame':    192 obs. of  4 variables:
##  $ rank        : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ country     : Factor w/ 191 levels "Afghanistan",..: 139 31 24 182 36 9 44 7 75 74 ...
##  $ land.area.km: num  8149300 4916438 4776980 3100950 2083210 ...
##  $ per.forest  : num  49.4 49.2 56.1 33.8 21.8 ...

Everything looks good - country should be a factor, and is, and everything else is either numeric or an integer. Lets use this data frame to practice subsetting!

With vectors, we only needed a single value or vector in our square brackets [] because vectors are 1-dimensional. With 2-dimensional objects, we’ll need 2 values, the first telling R which row or rows we want, and the second telling R which column or columns. We use a comma to separate them, so it’ll take the form [row,col].

What do you think forest_area[1,2] will give you?

The entire data frame
The first column of data
The value in the first row and second column
The value in the first column and second row

The value in the first row and second column

Subset now by entering forest_area[1,2].

forest_area[1,2]

## [1] Russia
## 191 Levels: Afghanistan Albania Algeria Andorra ...  Switzerland

Its basically like telling R which coordinate you want in a x,y grid. As in 1-dimension, we can also use vectors as our subset rather than a single value. Use vectors to subset forest_area for the 1st, 3rd, and 5th row, and the 2nd and 4th column.

forest_area[c(1,3,5),c(2,4)]

##   country per.forest
## 1  Russia      49.40
## 3  Brazil      56.10
## 5   China      21.83

Often, we just want the entire row or the entire column. If you want all of either rows or columns, simply leave that subset blank. Lets get the entire first row by entering forest_area[1,] now. This tells R we want the first row, and all columns.

forest_area[1,]

##   rank country land.area.km per.forest
## 1    1  Russia      8149300       49.4

Lets do something similar, but this time subset all rows of just the 2nd column.

forest_area[,2]

##   [1] Russia                           Canada                          
##   [3] Brazil                           United States                   
##   [5] China                            Australia                       
##   [7] Democratic Republic of the Congo Argentina                       
##   [9] Indonesia                        India                           
##  [11] Mexico                           Peru                            
##  [13] Sudan                            Mozambique                      
##  [15] Colombia                         Angola                          
##  [17] Bolivia                          Venezuela                       
##  [19] Myanmar                          Zambia                          
##  [21] Tanzania                         Sudan                           
##  [23] Papua New Guinea                 Sweden                          
##  [25] Zimbabwe                         Japan                           
##  [27] France                           Finland                         
##  [29] Central African Republic         Gabon                           
##  [31] Republic of the Congo            Turkey                          
##  [33] Cameroon                         Malaysia                        
##  [35] Paraguay                         Spain                           
##  [37] Laos                             Chile                           
##  [39] Guyana                           South Sudan                     
##  [41] Suriname                         Thailand                        
##  [43] Madagascar                       Mali                            
##  [45] Vietnam                          Norway                          
##  [47] Botswana                         Chad                            
##  [49] Ecuador                          Germany                         
##  [51] Niger                            Morocco                         
##  [53] Nigeria                          Iran                            
##  [55] Guatemala                        Italy                           
##  [57] Ukraine                          Mongolia                        
##  [59] Cambodia                         Poland                          
##  [61] South Africa                     Senegal                         
##  [63] New Zealand                      Belarus                         
##  [65] Namibia                          North Korea                     
##  [67] Philippines                      Somalia                         
##  [69] Romania                          Burkina Faso                    
##  [71] Honduras                         Guinea                          
##  [73] South Korea                      Benin                           
##  [75] Kazakhstan                       Nicaragua                       
##  [77] Uganda                           Panama                          
##  [79] Pakistan                         Algeria                         
##  [81] Turkmenistan                     Austria                         
##  [83] Ethiopia                         Greece                          
##  [85] Nepal                            Bulgaria                        
##  [87] Malawi                           Portugal                        
##  [89] Liberia                          Ivory Coast                     
##  [91] Latvia                           United Kingdom                  
##  [93] Georgia                          Sierra Leone                    
##  [95] Serbia                           Cuba                            
##  [97] Saudi Arabia                     Czech Republic                  
##  [99] Bosnia and Herzegovina           Croatia                         
## [101] Bhutan                           Ghana                           
## [103] Estonia                          Taiwan                          
## [105] Lithuania                        Hungary                         
## [107] Guinea-Bissau                    Tunisia                         
## [109] Costa Rica                       Slovakia                        
## [111] Uzbekistan                       Sri Lanka                       
## [113] Belize                           Equatorial Guinea               
## [115] Eritrea                          Dominican Republic              
## [117] Kenya                            Oman                            
## [119] Slovenia                          Switzerland                    
## [121] Azerbaijan                       Macedonia                       
## [123] Fiji                             Bangladesh                      
## [125] Kyrgyzstan                       Iraq                            
## [127] Togo                             Timor-Leste                     
## [129] Albania                          Ireland                         
## [131] Uruguay                          Belgium                         
## [133] Montenegro                       Denmark                         
## [135] Bahamas                          Gambia                          
## [137] Syria                            Yemen                           
## [139] Vanuatu                          Tajikistan                      
## [141] Netherlands                      Jamaica                         
## [143] Moldova                          Rwanda                          
## [145] United Arab Emirates             El Salvador                     
## [147] Armenia                          Brunei                          
## [149] Mauritania                       Trinidad and Tobago             
## [151] Libya                            Iceland                         
## [153] Cyprus                           Afghanistan                     
## [155] Israel                           Lebanon                         
## [157] Burundi                          Samoa                           
## [159] Jordan                           Luxembourg                      
## [161] Cape Verde                       Egypt                           
## [163] Micronesia                       Mauritius                       
## [165] Dominica                         Haiti                           
## [167] Seychelles                       Palau                           
## [169] Sao Tome and Principe            Swaziland                       
## [171] Saint Lucia                      Andorra                         
## [173] Saint Vincent and the Grenadines Antigua and Barbuda             
## [175] Palestine                        Lesotho                         
## [177] Liechtenstein                    Djibouti                        
## [179] Kuwait                           Comoros                         
## [181] Saint Kitts and Nevis            Grenada                         
## [183] Tonga                            Tuvalu                          
## [185] Barbados                         Kiribati                        
## [187] Singapore                        Maldives                        
## [189] Bahrain                          Malta                           
## [191] San Marino                       Qatar                           
## 191 Levels: Afghanistan Albania Algeria Andorra ...  Switzerland

Well done. You now know how to subset by position in 2-dimensional objects. Subsetting an entire column is particularly useful, because each column represents a variable of our data. You can use positional subsets to do this, but there’s an easier more flexible way.

If we want to target a single column of our data (for example, percent forest area in our data frame), the easiest way is the $ operator. If we use it following the name of our data frame, we can tell R exactly which column of data we’re interested in working with, and it will return that column as if it were its own vector.

What do you think forest_area$per.forest will give you?

The entire data frame
Country names
Percent forest cover
The first row of data

Percent forest cover

We could print that column of data, but it would just be a long vector that’s hard to interpret. So lets visualize it instead! Make a histogram of the per.forest column in your forest_area data frame.

hist(forest_area$per.forest)

Now find the median land area in our data frame. Just like mean(), the function you want is median().

median(forest_area$land.area.km)

## [1] 27065

The $ operator is a much better way to subset a column of data than positional subsets, because we don’t have to memorize what position our column of interest is in (is it the 2nd column? The 4th?). This might seem trivial with 4 columns, but when you have 10, 20, or maybe 100 variables, the ability to refer to the column by name is a life saver.

Now we’re really going to spice things up - logical subsets! Earlier in the lesson we subset a vector based on a logical condition. We can also subset our entire data frame using the values of a single column.

What if we were interested in rows of data (observations) for only those countries with more than 80% forest cover?. We’ll need a logical subset of course! To show you how this works, first enter forest_area$per.forest > 80 now.

forest_area$per.forest > 80

##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [166] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE FALSE FALSE

This logical vector can be passed as our row subset, such that R will only keep those rows which are TRUE in the vector. Lets subset our entire data frame with this vector by passing the logical statement we just entered as our row subset, and leaving the column subset blank (because we want to see all variables). Try it now.

forest_area[forest_area$per.forest > 80,]

##     rank    country land.area.km per.forest
## 30    30      Gabon       227517      85.00
## 41    41   Suriname       147760      90.20
## 163  164 Micronesia          630      89.74
## 167  168 Seychelles          407      89.45
## 168  169      Palau          400      87.15
## 184  185     Tuvalu           23      89.12

Great work. Now subset the data frame for ranks above or equal to 20. To add an equals to to an inequality, just enter it as >= or <=.

forest_area[forest_area$rank >= 20,]

##     rank                          country land.area.km per.forest
## 20    20                           Zambia       376309      50.00
## 21    21                         Tanzania       352570      37.22
## 22    22                            Sudan       327909      17.62
## 23    23                 Papua New Guinea       294370      63.60
## 24    24                           Sweden       280730      68.95
## 25    25                         Zimbabwe       259267      66.35
## 26    26                            Japan       253203      67.00
## 27    27                           France       246640      36.76
## 28    28                          Finland       233320      72.00
## 29    29         Central African Republic       227550      36.53
## 30    30                            Gabon       227517      85.00
## 31    31            Republic of the Congo       224710      65.70
## 32    32                           Turkey       216781      27.60
## 33    33                         Cameroon       212450      44.68
## 34    34                         Malaysia       195200      59.50
## 35    35                         Paraguay       184750      45.42
## 36    36                            Spain       184180      36.70
## 37    37                             Laos       170000      71.60
## 38    38                            Chile       158781      21.00
## 39    39                           Guyana       151040      70.26
## 40    40                      South Sudan       148196      23.00
## 41    41                         Suriname       147760      90.20
## 42    42                         Thailand       147620      29.00
## 43    43                       Madagascar       128380      21.87
## 44    44                             Mali       125720      10.14
## 45    45                          Vietnam       123000      37.14
## 46    46                           Norway       121120      37.42
## 47    47                         Botswana       119430      20.53
## 48    48                             Chad       119210       9.28
## 49    49                          Ecuador       115510      42.00
## 50    50                          Germany       114190      32.00
## 51    51                            Niger       112000       8.00
## 52    52                          Morocco       111325      15.66
## 53    53                          Nigeria       110890      12.00
## 54    54                             Iran       110750       6.72
## 55    55                        Guatemala       108894      51.00
## 56    56                            Italy       106736      35.00
## 57    57                          Ukraine       105000      17.00
## 58    58                         Mongolia       102520       6.55
## 59    59                         Cambodia        93350      51.56
## 60    60                           Poland        90000      28.80
## 61    61                     South Africa        89170       7.31
## 62    62                          Senegal        86730      44.09
## 63    63                      New Zealand        85424      31.87
## 64    64                          Belarus        80334      38.60
## 65    65                          Namibia        76610       9.29
## 66    66                      North Korea        76240      73.00
## 67    67                      Philippines        71620      23.87
## 68    68                          Somalia        71310      11.18
## 69    69                          Romania        69610      29.02
## 70    70                     Burkina Faso        67940      24.78
## 71    71                         Honduras        67254      60.00
## 72    72                           Guinea        67240      27.35
## 73    73                      South Korea        63346      63.20
## 74    74                            Benin        61860      54.93
## 75    75                       Kazakhstan        54498       2.00
## 76    76                        Nicaragua        51890      39.80
## 77    77                           Uganda        50000      20.74
## 78    78                           Panama        42940      56.93
## 79    79                         Pakistan        42240       5.31
## 80    80                          Algeria        42000       1.70
## 81    81                     Turkmenistan        41270       8.46
## 82    82                          Austria        39600      47.20
## 83    83                         Ethiopia        39313       3.56
## 84    84                           Greece        37520      28.43
## 85    85                            Nepal        36360      24.70
## 86    86                         Bulgaria        36250      32.69
## 87    87                           Malawi        33176      28.00
## 88    88                         Portugal        31820      34.80
## 89    89                          Liberia        31540      28.32
## 90    90                      Ivory Coast        30000       9.30
## 91    91                           Latvia        28807      60.00
## 92    92                   United Kingdom        28650      11.76
## 93    93                          Georgia        28577      41.00
## 94    94                     Sierra Leone        27540      38.39
## 95    95                           Serbia        27200      31.13
## 96    96                             Cuba        27130      24.47
## 97    97                     Saudi Arabia        27000       1.26
## 98    98                   Czech Republic        26000      34.00
## 99    99           Bosnia and Herzegovina        25599      53.00
## 100  100                          Croatia        24901      44.00
## 101  101                           Bhutan        24764      64.50
## 102  102                            Ghana        24600      10.20
## 103  103                          Estonia        23066      61.00
## 104  105                           Taiwan        21588      60.00
## 105  106                        Lithuania        21223      33.00
## 106  107                          Hungary        20990      23.00
## 107  108                    Guinea-Bissau        20720      57.36
## 108  109                          Tunisia        20615      12.60
## 109  110                       Costa Rica        20440      40.00
## 110  111                         Slovakia        20006      40.80
## 111  112                       Uzbekistan        19690       4.40
## 112  113                        Sri Lanka        19330      29.46
## 113  114                           Belize        16530      71.98
## 114  115                Equatorial Guinea        16320      58.18
## 115  116                          Eritrea        15540      13.21
## 116  117               Dominican Republic        13384      27.50
## 117  118                            Kenya        13200       2.27
## 118  119                             Oman        13050       4.22
## 119  120                         Slovenia        12574      60.00
## 120  121                      Switzerland        12540      31.80
## 121  122                       Azerbaijan        11400      13.17
## 122  123                        Macedonia        10285      40.00
## 123  124                             Fiji        10000      54.72
## 124  125                       Bangladesh         8710       6.05
## 125  126                       Kyrgyzstan         8690       4.35
## 126  127                             Iraq         8220       1.88
## 127  128                             Togo         8000      14.00
## 128  129                      Timor-Leste         7980      53.65
## 129  130                          Albania         7716      26.84
## 130  131                          Ireland         7540      11.07
## 131  132                          Uruguay         7000       4.00
## 132  133                          Belgium         6834      22.38
## 133  134                       Montenegro         6252      45.26
## 134  135                          Denmark         6120      14.20
## 135  136                          Bahamas         5150      51.45
## 136  137                           Gambia         4857      43.00
## 137  138                            Syria         4610       2.49
## 138  139                            Yemen         4490       0.85
## 139  140                          Vanuatu         4470      36.67
## 140  141                       Tajikistan         4100       2.87
## 141  142                      Netherlands         3650       8.79
## 142  143                          Jamaica         3308      30.10
## 143  144                          Moldova         3290       9.72
## 144  145                           Rwanda         3161      12.00
## 145  146             United Arab Emirates         3120       3.73
## 146  147                      El Salvador         2980      14.16
## 147  148                          Armenia         2974      10.00
## 148  149                           Brunei         2780      48.22
## 149  150                       Mauritania         2670       0.26
## 150  151              Trinidad and Tobago         2260      44.07
## 151  152                            Libya         2170       0.12
## 152  153                          Iceland         1930       2.00
## 153  154                           Cyprus         1740      18.81
## 154  155                      Afghanistan         1631       0.25
## 155  156                           Israel         1600       7.00
## 156  157                          Lebanon         1383      13.30
## 157  158                          Burundi         1270       4.50
## 158  159                            Samoa         1050      37.00
## 159  160                           Jordan          893       1.00
## 160  161                       Luxembourg          870      33.64
## 161  162                       Cape Verde          840      20.83
## 162  163                            Egypt          670       0.07
## 163  164                       Micronesia          630      89.74
## 164  165                        Mauritius          612      30.00
## 165  166                         Dominica          488      65.00
## 166  167                            Haiti         8300      30.00
## 167  168                       Seychelles          407      89.45
## 168  169                            Palau          400      87.15
## 169  170            Sao Tome and Principe          270      28.01
## 170  171                        Swaziland          174       1.00
## 171  172                      Saint Lucia          170      27.60
## 172  173                          Andorra          160      34.19
## 173  174 Saint Vincent and the Grenadines          127      32.56
## 174  175              Antigua and Barbuda           90      21.00
## 175  176                        Palestine           90       1.45
## 176  177                          Lesotho           80       0.26
## 177  178                    Liechtenstein           70      43.75
## 178  179                         Djibouti           60       0.26
## 179  180                           Kuwait           60       0.34
## 180  181                          Comoros           50       2.24
## 181  182            Saint Kitts and Nevis           50      19.16
## 182  183                          Grenada           40      11.63
## 183  184                            Tonga           40       5.35
## 184  185                           Tuvalu           23      89.12
## 185  186                         Barbados           20       4.65
## 186  187                         Kiribati           20       2.47
## 187  188                        Singapore           20       3.00
## 188  189                         Maldives           10       3.56
## 189  190                          Bahrain            5       0.67
## 190  191                            Malta            3       0.95
## 191  192                       San Marino            1       1.64
## 192  193                            Qatar            0       0.00

We can even use logical subsets to zero in on specific observations using other functions. How might we find the country with the maximum percentage forest cover?

Use the max() function to find the maximum value in the per.forest column in our data frame.

max(forest_area$per.forest)

## [1] 90.2

We can use this function to create a logical statement to check where along our column of values the percentage forest cover is equal to the maximum. When finding an equivalence (as opposed to an inequality), the operator is a double equals, ==. Enter forest_area$per.forest == max(forest_area$per.forest) now.

forest_area$per.forest == max(forest_area$per.forest)

##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [166] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE FALSE FALSE

This gives us a logical vector in which the single TRUE corresponds to the maximum percentage forest cover. Now use this logical statement to subset forest_area for just the maximum, remembering to select all columns.

forest_area[forest_area$per.forest == max(forest_area$per.forest),]

##    rank  country land.area.km per.forest
## 41   41 Suriname       147760       90.2

This is a very powerful form of subsetting, letting us select observations in our data frame for desired ranges of values or specific conditions like maximum or minimum. Essentially any logical condition you can imagine.

Lets do one last subset for this lesson. This time, we want only the country name that has the minimum percentage forest cover.

Since we only want the country name, we’re only subsetting the country column of our data frame. You can access this with the $ operator as usual. Because we’re only subsetting a single column, we only need a single subset in our brackets [] (i.e. we dont need a row and column). Try selecting the country name that has the minimum percentage forest cover now.

forest_area$country[forest_area$per.forest == min(forest_area$per.forest)]

## [1] Qatar
## 191 Levels: Afghanistan Albania Algeria Andorra ...  Switzerland

Excellent work. You’ve now been introduced to subsets in 2 dimensions. It may seem tricky or confusing now, but with practice (and you will practice plenty), you’ll be subseting your data frames like a pro in no time.

Please submit the log of this lesson to Google Forms so that Simon may evaluate your progress.

As you wish

As you wish