11  Tidyverse

11.1 The Tidyverse Approach

This is the first introduction to tidyverse and is the key skill necessary to become proficient at data analysis.

library( tidyverse )
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library( lubridate )

11.1.1 The Datat

For this topic we will use some example data from the Rice Rivers Center. These data represent both atmospheric and water data collected from instrumentation on-site. I have stored these data in a spreadsheet that is shared on Google Drive as a CSV file.

You can look at it here.

11.1.2 The Data in R

So let’s load it into memory and take a look at it.

url <- "https://docs.google.com/spreadsheets/d/1Mk1YGH9LqjF7drJE-td1G_JkdADOU0eMlrP01WFBT8s/pub?gid=0&single=true&output=csv"
rice <- read_csv( url )
Rows: 8199 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): DateTime
dbl (22): RecordID, PAR, WindSpeed_mph, WindDir, AirTempF, RelHumidity, BP_H...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary( rice )
   DateTime            RecordID          PAR           WindSpeed_mph   
 Length:8199        Min.   :43816   Min.   :   0.000   Min.   : 0.000  
 Class :character   1st Qu.:45866   1st Qu.:   0.000   1st Qu.: 2.467  
 Mode  :character   Median :47915   Median :   0.046   Median : 4.090  
                    Mean   :47915   Mean   : 241.984   Mean   : 5.446  
                    3rd Qu.:49964   3rd Qu.: 337.900   3rd Qu.: 7.292  
                    Max.   :52014   Max.   :1957.000   Max.   :30.650  
                                                                       
    WindDir          AirTempF       RelHumidity        BP_HG      
 Min.   :  0.00   Min.   : 3.749   Min.   :15.37   Min.   :29.11  
 1st Qu.: 37.31   1st Qu.:31.545   1st Qu.:42.25   1st Qu.:29.87  
 Median :137.30   Median :37.440   Median :56.40   Median :30.01  
 Mean   :146.20   Mean   :38.795   Mean   :58.37   Mean   :30.02  
 3rd Qu.:249.95   3rd Qu.:46.410   3rd Qu.:76.59   3rd Qu.:30.21  
 Max.   :360.00   Max.   :74.870   Max.   :93.00   Max.   :30.58  
                                                                  
    Rain_in            H2O_TempC       SpCond_mScm      Salinity_ppt   
 Min.   :0.0000000   Min.   :-0.140   Min.   :0.0110   Min.   :0.0000  
 1st Qu.:0.0000000   1st Qu.: 3.930   1st Qu.:0.1430   1st Qu.:0.0700  
 Median :0.0000000   Median : 5.450   Median :0.1650   Median :0.0800  
 Mean   :0.0008412   Mean   : 5.529   Mean   :0.1611   Mean   :0.0759  
 3rd Qu.:0.0000000   3rd Qu.: 7.410   3rd Qu.:0.1760   3rd Qu.:0.0800  
 Max.   :0.3470000   Max.   :13.300   Max.   :0.2110   Max.   :0.1000  
                     NA's   :1        NA's   :1        NA's   :1       
       PH           PH_mv        Turbidity_ntu       Chla_ugl    
 Min.   :6.43   Min.   :-113.8   Min.   :  6.20   Min.   :  1.3  
 1st Qu.:7.50   1st Qu.: -47.8   1st Qu.: 15.50   1st Qu.:  3.7  
 Median :7.58   Median : -43.8   Median : 21.80   Median :  6.7  
 Mean   :7.60   Mean   : -44.5   Mean   : 24.54   Mean   :137.3  
 3rd Qu.:7.69   3rd Qu.: -38.9   3rd Qu.: 30.30   3rd Qu.:302.6  
 Max.   :9.00   Max.   :  28.5   Max.   :187.70   Max.   :330.1  
 NA's   :1      NA's   :1        NA's   :1        NA's   :1      
   BGAPC_CML        BGAPC_rfu         ODO_sat         ODO_mgl     
 Min.   :   188   Min.   :  0.10   Min.   : 87.5   Min.   :10.34  
 1st Qu.:   971   1st Qu.:  0.50   1st Qu.: 99.2   1st Qu.:12.34  
 Median :  1369   Median :  0.70   Median :101.8   Median :12.88  
 Mean   :153571   Mean   : 72.91   Mean   :102.0   Mean   :12.88  
 3rd Qu.:345211   3rd Qu.:163.60   3rd Qu.:104.1   3rd Qu.:13.34  
 Max.   :345471   Max.   :163.70   Max.   :120.8   Max.   :14.99  
 NA's   :1        NA's   :1        NA's   :1       NA's   :1      
    Depth_ft        Depth_m      SurfaceWaterElev_m_levelNad83m
 Min.   :12.15   Min.   :3.705   Min.   :-32.53                
 1st Qu.:14.60   1st Qu.:4.451   1st Qu.:-31.78                
 Median :15.37   Median :4.684   Median :-31.55                
 Mean   :15.34   Mean   :4.677   Mean   :-31.55                
 3rd Qu.:16.12   3rd Qu.:4.913   3rd Qu.:-31.32                
 Max.   :17.89   Max.   :5.454   Max.   :-30.78                
                                                               

These data represent measurements taken every 15 minutes, 24 hours a day, 7 days a week, 365 days a year. For brevity, this file contains measurements starting at 1/1/2014 12:00:00 AM and ending at 3/27/2014 9:30:00 AM (only 8199 records here…).

If you look at the summary of the data above, you will see several things, including:

  • Date and time objects are character
  • Some measurements are in Standard and some in Imperial with units in the same file include both °F and °C, as well as measurements in meters, feet, and inches. In fact, there are duplication of data columns in different units (guess what kind of correlation they might have…)

11.2 Verbs of Analysis

When we perform any type of data manipulation, we use specific verbs. There is a limited lexicon for us to use, but the key here is how we perform these actions, and in which order they are deployed for a huge diversity in outcomes. For now, these basic verbs include:

  • Select: Used to grab or reorder columns of data.
  • Filter: Used to grab subsets of records (rows) based upon some criteria.
  • Mutate: Create new columns of data based upon manipulations of existing columns.
  • Arrange: Order the records (rows) based upon some criteria.
  • Group: Gather records together to perform operations on chunks of them similar to by().
  • Summarize: Extract summaries of data (or grouped data) based upon some defined criteria.

In the following examples, we’ll be using the rice data above. For each verb, I’m going to use the pipe operator (%>%) to send the data into the example functions and then assign the result to a dummy data.frame named df. The arguments passed to each of the verbs are where the magic happens.

11.2.1 The Output

The key to these activities is that every one of these functions takes a data.frame as input, does its operations on it, then return a data.frame object as output. The data.frame is the core data container for all of these actions.

11.2.2 Select Operator

The select() function allows you to choose which columns of data to work with.

rice %>%
  select( DateTime, AirTempF ) -> df 
head(df)
# A tibble: 6 × 2
  DateTime             AirTempF
  <chr>                   <dbl>
1 1/1/2014 12:00:00 AM     31.0
2 1/1/2014 12:15:00 AM     30.7
3 1/1/2014 12:30:00 AM     31.2
4 1/1/2014 12:45:00 AM     30.5
5 1/1/2014 1:00:00 AM      30.9
6 1/1/2014 1:15:00 AM      30.6

Select can also be used to reorder the columns in a data.frame object. Here are the names of the data columns as initially loaded.

names( rice )
 [1] "DateTime"                       "RecordID"                      
 [3] "PAR"                            "WindSpeed_mph"                 
 [5] "WindDir"                        "AirTempF"                      
 [7] "RelHumidity"                    "BP_HG"                         
 [9] "Rain_in"                        "H2O_TempC"                     
[11] "SpCond_mScm"                    "Salinity_ppt"                  
[13] "PH"                             "PH_mv"                         
[15] "Turbidity_ntu"                  "Chla_ugl"                      
[17] "BGAPC_CML"                      "BGAPC_rfu"                     
[19] "ODO_sat"                        "ODO_mgl"                       
[21] "Depth_ft"                       "Depth_m"                       
[23] "SurfaceWaterElev_m_levelNad83m"

Let’s say that you wanted to reorder the columns as RecordID, ODO_mgl and PH as the first three columns and leave everything else as is. There is this cool function everthying() that helps out.

rice %>%
  select( RecordID, ODO_mgl, PH, everything() ) -> df
names( df )
 [1] "RecordID"                       "ODO_mgl"                       
 [3] "PH"                             "DateTime"                      
 [5] "PAR"                            "WindSpeed_mph"                 
 [7] "WindDir"                        "AirTempF"                      
 [9] "RelHumidity"                    "BP_HG"                         
[11] "Rain_in"                        "H2O_TempC"                     
[13] "SpCond_mScm"                    "Salinity_ppt"                  
[15] "PH_mv"                          "Turbidity_ntu"                 
[17] "Chla_ugl"                       "BGAPC_CML"                     
[19] "BGAPC_rfu"                      "ODO_sat"                       
[21] "Depth_ft"                       "Depth_m"                       
[23] "SurfaceWaterElev_m_levelNad83m"

11.2.3 Filter

The function filter() works to select records (rows) based upon some criteria. So for example, if I am interested in just records when the airtemp was freezing (and the raw data are in °F). The range of values in the original data was:

range( rice$AirTempF )
[1]  3.749 74.870

but after filtering using the name of the variable and a logical operator.

rice %>%
  filter( AirTempF < 32 ) -> df
range( df$AirTempF )
[1]  3.749 31.990

Just like select(), it is possible to have several conditions, that are compounded (using a logical AND operator) by adding them to the filter() function. Here I also split the conditionals requiring the data to be above freezing air temperatures, not missing data from the PH meter, and water turbidity < 15 ntu’s. I also put each of these onto their own lines and auto-indent does a great job of making it reasonably readable.

rice %>%
  filter( AirTempF > 32, 
          !is.na(PH), 
          Turbidity_ntu < 15) -> df
nrow(df)
[1] 1449

11.2.4 Mutate

The mutate() function changes values in the table and is quite versatile. Here I will jump back to our old friend mdy_hms() from lubridate and convert the DateTime column, which is

class( rice$DateTime )
[1] "character"

and convert it into a real date and time object

rice %>%
  mutate( Date = mdy_hms(DateTime, tz = "EST") ) -> df
class( df$Date )
[1] "POSIXct" "POSIXt" 
summary( df$Date )
                 Min.               1st Qu.                Median 
"2014-01-01 00:00:00" "2014-01-22 08:22:30" "2014-02-12 16:45:00" 
                 Mean               3rd Qu.                  Max. 
"2014-02-12 16:45:00" "2014-03-06 01:07:30" "2014-03-27 09:30:00" 

You can also create several mutations in one mutation step.

rice %>%
  mutate( Date = mdy_hms(DateTime, tz = "EST"), 
          Month = month(Date, label = TRUE) ) -> df
summary( df$Month )
 Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec 
2976 2688 2535    0    0    0    0    0    0    0    0    0 

11.2.5 Arrange

We can sort entire data.frame objects based upon the values in one or more of the columns using the arrange() function.

rice %>%
  arrange( WindSpeed_mph ) -> df 
df$WindSpeed_mph[1]
[1] 0

By default, it is in ascending order, to reverse it, use the negative operator on the column name object in the function.

rice %>%
  arrange( -WindSpeed_mph ) -> df 
df$WindSpeed_mph[1]
[1] 30.65

As above, it is possible to combine many columns of data as criteria for sorting by adding more arguments to the function call.

rice %>%
  arrange( -WindSpeed_mph, WindDir ) -> df

11.2.6 Summarise

This function is the first one that does not return some version of the original data that was passed to it. Rather, this performs operations on the data and makes a brand new data.frame object.

Each argument you give to the function performs one or more operations on the data and returns a brand new data.frame object with only the the values specified.

Here is an example where I am taking the mean air and water temperature (n.b., one is in °F and the other is in °C). Notice the result is a new data.frame object with one row and two new columns defined by how I asked for the summary in the first place. I used single tick notation so I can have a space in the column names.

rice %>%
  summarize( `Air Temp` = mean( AirTempF), 
             `Water Temp` = mean(H2O_TempC, na.rm=TRUE))
# A tibble: 1 × 2
  `Air Temp` `Water Temp`
       <dbl>        <dbl>
1       38.8         5.53

11.2.7 Group & Summarize

To get more than one row in the resulting data.frame from summary(), we need to group the data in some way. The function group_by() does this and is used prior to summary(). Let’s take a look at how we can get the average air and water temp by month. To do this, I’m going to have to do several steps. I’m just going to chain them together using the %>% operator.

rice %>%
  mutate( Date = mdy_hms( DateTime, 
                          tz="EST"),
          Month = month( Date, 
                         abbr = FALSE, 
                         label=TRUE) ) %>%
  group_by( Month ) %>%
  summarize( `Air Temp` = mean( AirTempF), 
             `Water Temp` = mean( H2O_TempC, 
                                  na.rm=TRUE) )
# A tibble: 3 × 3
  Month    `Air Temp` `Water Temp`
  <ord>         <dbl>        <dbl>
1 January        34.7         3.68
2 February       39.7         5.29
3 March          42.6         7.96

As you read the code, notice how easy it is to understand what is going on because of both the pipes and because of the way I am formatting the code itself.

11.3 Flows

This last part really showed off the process of multi-step data manipulations using the pipe operator and the several verbs we introduced. These are both efficient in terms of typing as well as efficient in the way of producing research that makes sense to look at.

Here are some strategies that I use when building up these manipulation workflows.

  1. Do not think that you have to do the whole thing at once. I typically build up the workflow, one line at a time. Make sure the output from the previous line is what you think it should be then add the next one.

  2. Keep your code open and airy, it makes it easier to read and to catch any logical errors that may arrise.

  3. You can pipe into a lot of different functions. In fact, any function that takes a data frame can be the recipient of a pipe. While developing a workflow, I will often pipe into things like head(), summary(), or View() to take a look at what is coming out of my workflow to make sure it resembles what I think it should look like.

11.4 Questions

If you have any questions for me specifically on this topic, please post as an Issue in your repository, otherwise consider posting to the discussion board on Canvas.