This is the first introduction to tidyverse and is the key skill necessary to become proficient at data analysis.
library( tidyverse )
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library( lubridate )
11.1.1 The Datat
For this topic we will use some example data from the Rice Rivers Center. These data represent both atmospheric and water data collected from instrumentation on-site. I have stored these data in a spreadsheet that is shared on Google Drive as a CSV file.
Rows: 8199 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): DateTime
dbl (22): RecordID, PAR, WindSpeed_mph, WindDir, AirTempF, RelHumidity, BP_H...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary( rice )
DateTime RecordID PAR WindSpeed_mph
Length:8199 Min. :43816 Min. : 0.000 Min. : 0.000
Class :character 1st Qu.:45866 1st Qu.: 0.000 1st Qu.: 2.467
Mode :character Median :47915 Median : 0.046 Median : 4.090
Mean :47915 Mean : 241.984 Mean : 5.446
3rd Qu.:49964 3rd Qu.: 337.900 3rd Qu.: 7.292
Max. :52014 Max. :1957.000 Max. :30.650
WindDir AirTempF RelHumidity BP_HG
Min. : 0.00 Min. : 3.749 Min. :15.37 Min. :29.11
1st Qu.: 37.31 1st Qu.:31.545 1st Qu.:42.25 1st Qu.:29.87
Median :137.30 Median :37.440 Median :56.40 Median :30.01
Mean :146.20 Mean :38.795 Mean :58.37 Mean :30.02
3rd Qu.:249.95 3rd Qu.:46.410 3rd Qu.:76.59 3rd Qu.:30.21
Max. :360.00 Max. :74.870 Max. :93.00 Max. :30.58
Rain_in H2O_TempC SpCond_mScm Salinity_ppt
Min. :0.0000000 Min. :-0.140 Min. :0.0110 Min. :0.0000
1st Qu.:0.0000000 1st Qu.: 3.930 1st Qu.:0.1430 1st Qu.:0.0700
Median :0.0000000 Median : 5.450 Median :0.1650 Median :0.0800
Mean :0.0008412 Mean : 5.529 Mean :0.1611 Mean :0.0759
3rd Qu.:0.0000000 3rd Qu.: 7.410 3rd Qu.:0.1760 3rd Qu.:0.0800
Max. :0.3470000 Max. :13.300 Max. :0.2110 Max. :0.1000
NA's :1 NA's :1 NA's :1
PH PH_mv Turbidity_ntu Chla_ugl
Min. :6.43 Min. :-113.8 Min. : 6.20 Min. : 1.3
1st Qu.:7.50 1st Qu.: -47.8 1st Qu.: 15.50 1st Qu.: 3.7
Median :7.58 Median : -43.8 Median : 21.80 Median : 6.7
Mean :7.60 Mean : -44.5 Mean : 24.54 Mean :137.3
3rd Qu.:7.69 3rd Qu.: -38.9 3rd Qu.: 30.30 3rd Qu.:302.6
Max. :9.00 Max. : 28.5 Max. :187.70 Max. :330.1
NA's :1 NA's :1 NA's :1 NA's :1
BGAPC_CML BGAPC_rfu ODO_sat ODO_mgl
Min. : 188 Min. : 0.10 Min. : 87.5 Min. :10.34
1st Qu.: 971 1st Qu.: 0.50 1st Qu.: 99.2 1st Qu.:12.34
Median : 1369 Median : 0.70 Median :101.8 Median :12.88
Mean :153571 Mean : 72.91 Mean :102.0 Mean :12.88
3rd Qu.:345211 3rd Qu.:163.60 3rd Qu.:104.1 3rd Qu.:13.34
Max. :345471 Max. :163.70 Max. :120.8 Max. :14.99
NA's :1 NA's :1 NA's :1 NA's :1
Depth_ft Depth_m SurfaceWaterElev_m_levelNad83m
Min. :12.15 Min. :3.705 Min. :-32.53
1st Qu.:14.60 1st Qu.:4.451 1st Qu.:-31.78
Median :15.37 Median :4.684 Median :-31.55
Mean :15.34 Mean :4.677 Mean :-31.55
3rd Qu.:16.12 3rd Qu.:4.913 3rd Qu.:-31.32
Max. :17.89 Max. :5.454 Max. :-30.78
These data represent measurements taken every 15 minutes, 24 hours a day, 7 days a week, 365 days a year. For brevity, this file contains measurements starting at 1/1/2014 12:00:00 AM and ending at 3/27/2014 9:30:00 AM (only 8199 records here…).
If you look at the summary of the data above, you will see several things, including:
Date and time objects are character
Some measurements are in Standard and some in Imperial with units in the same file include both °F and °C, as well as measurements in meters, feet, and inches. In fact, there are duplication of data columns in different units (guess what kind of correlation they might have…)
11.2 Verbs of Analysis
When we perform any type of data manipulation, we use specific verbs. There is a limited lexicon for us to use, but the key here is how we perform these actions, and in which order they are deployed for a huge diversity in outcomes. For now, these basic verbs include:
Select: Used to grab or reorder columns of data.
Filter: Used to grab subsets of records (rows) based upon some criteria.
Mutate: Create new columns of data based upon manipulations of existing columns.
Arrange: Order the records (rows) based upon some criteria.
Group: Gather records together to perform operations on chunks of them similar to by().
Summarize: Extract summaries of data (or grouped data) based upon some defined criteria.
In the following examples, we’ll be using the rice data above. For each verb, I’m going to use the pipe operator (%>%) to send the data into the example functions and then assign the result to a dummy data.frame named df. The arguments passed to each of the verbs are where the magic happens.
11.2.1 The Output
The key to these activities is that every one of these functions takes a data.frame as input, does its operations on it, then return a data.frame object as output. The data.frame is the core data container for all of these actions.
11.2.2 Select Operator
The select() function allows you to choose which columns of data to work with.
Let’s say that you wanted to reorder the columns as RecordID, ODO_mgl and PH as the first three columns and leave everything else as is. There is this cool function everthying() that helps out.
The function filter() works to select records (rows) based upon some criteria. So for example, if I am interested in just records when the airtemp was freezing (and the raw data are in °F). The range of values in the original data was:
range( rice$AirTempF )
[1] 3.749 74.870
but after filtering using the name of the variable and a logical operator.
Just like select(), it is possible to have several conditions, that are compounded (using a logical AND operator) by adding them to the filter() function. Here I also split the conditionals requiring the data to be above freezing air temperatures, not missing data from the PH meter, and water turbidity < 15 ntu’s. I also put each of these onto their own lines and auto-indent does a great job of making it reasonably readable.
The mutate() function changes values in the table and is quite versatile. Here I will jump back to our old friend mdy_hms() from lubridate and convert the DateTime column, which is
As above, it is possible to combine many columns of data as criteria for sorting by adding more arguments to the function call.
rice %>%arrange( -WindSpeed_mph, WindDir ) -> df
11.2.6 Summarise
This function is the first one that does not return some version of the original data that was passed to it. Rather, this performs operations on the data and makes a brand new data.frame object.
Each argument you give to the function performs one or more operations on the data and returns a brand new data.frame object with only the the values specified.
Here is an example where I am taking the mean air and water temperature (n.b., one is in °F and the other is in °C). Notice the result is a new data.frame object with one row and two new columns defined by how I asked for the summary in the first place. I used single tick notation so I can have a space in the column names.
To get more than one row in the resulting data.frame from summary(), we need to group the data in some way. The function group_by() does this and is used prior to summary(). Let’s take a look at how we can get the average air and water temp by month. To do this, I’m going to have to do several steps. I’m just going to chain them together using the %>% operator.
# A tibble: 3 × 3
Month `Air Temp` `Water Temp`
<ord> <dbl> <dbl>
1 January 34.7 3.68
2 February 39.7 5.29
3 March 42.6 7.96
As you read the code, notice how easy it is to understand what is going on because of both the pipes and because of the way I am formatting the code itself.
11.3 Flows
This last part really showed off the process of multi-step data manipulations using the pipe operator and the several verbs we introduced. These are both efficient in terms of typing as well as efficient in the way of producing research that makes sense to look at.
Here are some strategies that I use when building up these manipulation workflows.
Do not think that you have to do the whole thing at once. I typically build up the workflow, one line at a time. Make sure the output from the previous line is what you think it should be then add the next one.
Keep your code open and airy, it makes it easier to read and to catch any logical errors that may arrise.
You can pipe into a lot of different functions. In fact, any function that takes a data frame can be the recipient of a pipe. While developing a workflow, I will often pipe into things like head(), summary(), or View() to take a look at what is coming out of my workflow to make sure it resembles what I think it should look like.
11.4 Questions
If you have any questions for me specifically on this topic, please post as an Issue in your repository, otherwise consider posting to the discussion board on Canvas.