<- data.frame( Key = c("A","B","C"),
df.X X = c(1,2,3) )
df.X
Key X
1 A 1
2 B 2
3 C 3
Rarely do we work on only one data.frame
, particularly when we start working with complex data and data contained within relational databases. In these cases, data are factored into several tables (akin to data.frame
objects) with entries that connect the information from one table to another. Consider the following example tables
Each has a column I named Key and another with some data in it. In R
they could be defined as:
<- data.frame( Key = c("A","B","C"),
df.X X = c(1,2,3) )
df.X
Key X
1 A 1
2 B 2
3 C 3
and
<- data.frame( Key = c("B","C","D"),
df.Y Y = c(10,11,12) )
df.Y
Key Y
1 B 10
2 C 11
3 D 12
An important component of relational data are the keys. These are unique identifiers for a particular datum from a table. In each of these examples the variable (obviously named) Key
is what is called a Primary Key because it uniquely identifies each row. You can verify this by counting the number of entries then filtering only for ones with 2 or more instances.
library( tidyverse )
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
%>%
df.X count( Key ) %>%
filter( n > 1 )
[1] Key n
<0 rows> (or 0-length row.names)
Notice there is nothing here as each is unique.
The column
Key
is a Primary Key for thedf.X
data because it identifies a unique row in that table. In addition to a Primary Key we can have a Foreign Key when it is used to indicate data within a separate table. For example, if I am interested to see if the smallest value indf.X$X
corresponds with the smallest value indf.Y$Y
, then I will be using theKey
formdf.X
representingmax(X)
to find the value ofY
indf.Y
and evaluate if it ismax(Y)
. This means thatdf.X$Key
is a Foreign Key as it points to a row in thedf.Y
data frame.
The keys are used to link together different tables.
A join is where we combine information contained within two data frames.
Joins are ways to merge together data and come in four flavors.
A left join is one where all the data from the left data frame is in the result and the data whose keys in the right data frame are present in the left one are also included. Graphically, this leads to:
Where in R
we do this using the left_join()
function.
%>%
df.X left_join( df.Y, by="Key")
Key X Y
1 A 1 NA
2 B 2 10
3 C 3 11
The right join does the same thing but keeps all the keys in the right data table and has missing data where the key in the left one is not in the right one.
This is accomplished using the right_join()
function.
%>%
df.X right_join( df.Y, by="Key")
Key X Y
1 B 2 10
2 C 3 11
3 D NA 12
This join is one where all the keys are retained adding missing data as necessary.
%>%
df.X full_join( df.Y, by="Key")
Key X Y
1 A 1 NA
2 B 2 10
3 C 3 11
4 D NA 12
The last one retains only those keys that are common in both.
%>%
df.X inner_join( df.Y, by="Key")
Key X Y
1 B 2 10
2 C 3 11
We can also use joins to filter values within one data.frame
. Here the semi_join()
keeps everything in the left data that has a key in the right one, but importantly it does not import the right data columns into the result.
%>%
df.X semi_join( df.Y )
Joining with `by = join_by(Key)`
Key X
1 B 2
2 C 3
The opposite of this is the anti_join()
which drops everything in the left table that has a key in the right one, leaving only the ones that are unique.
%>%
df.X anti_join( df.Y )
Joining with `by = join_by(Key)`
Key X
1 A 1
If you have any questions for me specifically on this topic, please post as an Issue in your repository, otherwise consider posting to the discussion board on Canvas.