13 Joins

Rarely do we work on only one data.frame, particularly when we start working with complex data and data contained within relational databases. In these cases, data are factored into several tables (akin to data.frame objects) with entries that connect the information from one table to another. Consider the following example tables

Each has a column I named Key and another with some data in it. In R they could be defined as:

df.X <- data.frame( Key = c("A","B","C"),
                    X = c(1,2,3) )
df.X

  Key X
1   A 1
2   B 2
3   C 3

and

df.Y <- data.frame( Key = c("B","C","D"),
                    Y = c(10,11,12) )
df.Y

13.1 Keys

An important component of relational data are the keys. These are unique identifiers for a particular datum from a table. In each of these examples the variable (obviously named) Key is what is called a Primary Key because it uniquely identifies each row. You can verify this by counting the number of entries then filtering only for ones with 2 or more instances.

library( tidyverse )

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

df.X %>%
  count( Key ) %>%
  filter( n > 1 )

[1] Key n  
<0 rows> (or 0-length row.names)

Notice there is nothing here as each is unique.

The column Key is a Primary Key for the df.X data because it identifies a unique row in that table. In addition to a Primary Key we can have a Foreign Key when it is used to indicate data within a separate table. For example, if I am interested to see if the smallest value in df.X$X corresponds with the smallest value in df.Y$Y, then I will be using the Key form df.X representing max(X) to find the value of Y in df.Y and evaluate if it is max(Y). This means that df.X$Key is a Foreign Key as it points to a row in the df.Y data frame.

The keys are used to link together different tables.

14 Joins

A join is where we combine information contained within two data frames.
Joins are ways to merge together data and come in four flavors.

14.1 Left Join

A left join is one where all the data from the left data frame is in the result and the data whose keys in the right data frame are present in the left one are also included. Graphically, this leads to:

Where in R we do this using the left_join() function.

df.X %>%
  left_join( df.Y, by="Key")

  Key X  Y
1   A 1 NA
2   B 2 10
3   C 3 11

14.2 Right Join

The right join does the same thing but keeps all the keys in the right data table and has missing data where the key in the left one is not in the right one.

This is accomplished using the right_join() function.

df.X %>%
  right_join( df.Y, by="Key")

  Key  X  Y
1   B  2 10
2   C  3 11
3   D NA 12

14.3 Full (or Outer) Join

This join is one where all the keys are retained adding missing data as necessary.

df.X %>%
  full_join( df.Y, by="Key")

  Key  X  Y
1   A  1 NA
2   B  2 10
3   C  3 11
4   D NA 12

14.4 Inner Join

The last one retains only those keys that are common in both.

df.X %>%
  inner_join( df.Y, by="Key")

  Key X  Y
1   B 2 10
2   C 3 11

15 Filtering Joins

We can also use joins to filter values within one data.frame. Here the semi_join() keeps everything in the left data that has a key in the right one, but importantly it does not import the right data columns into the result.

df.X %>%
  semi_join( df.Y )

Joining with `by = join_by(Key)`

  Key X
1   B 2
2   C 3

The opposite of this is the anti_join() which drops everything in the left table that has a key in the right one, leaving only the ones that are unique.

df.X %>%
  anti_join( df.Y )

Joining with `by = join_by(Key)`

  Key X
1   A 1

15.1 Questions

If you have any questions for me specifically on this topic, please post as an Issue in your repository, otherwise consider posting to the discussion board on Canvas.