x <- NA
6 Basic Data Types
6.1 Missing Data
The Absence of Data
The most fundamental type of data in R is data that does not exist! Missing data! It is represented as NA and can be in
6.2 Numerical Data
Numerical data contains all numerical representations.
By far, the most common kind of data we use in our analyses is numerical data. This may represent measured things like height, snout-vent length (whatever that is), depth, age, etc. In data analysis, we commonly take (or obtain) measurements from several items and then try to characterize them using summaries and visualization.
In R, the numerical data type can be defined as:
X <- 42
Notice how the numerical value of 42 is assigned to the variable named X. To have R print out the value of a particular variable, you can type its name in the console and it will give it to you.
X
[1] 42
6.2.1 Operators
Numeric types have a ton of normal operators that can be used. Some examples include:
The usual arithmetic operators:
x <- 10
y <- 23
x + y
[1] 33
x - y
[1] -13
x * y
[1] 230
x / y
[1] 0.4347826
You have the exponentials:
## x raised to the y
x^y
[1] 1e+23
## the inverse of an exponent is a root, here is the 23rd root of 10
x^(1/y)
[1] 1.105295
The logarithms:
## the natural log
log(x)
[1] 2.302585
## Base 10 log
log(x,base=10)
[1] 1
And the modulus operator:
y %% x
[1] 3
If you didn’t know what this one is, don’t worry. The modulus is just the remainder after division like you did in grade school. The above code means that 23 divided by 10 has a remainder of 3. I include it here just to highlight the fact that many of the operators that we will be working with in R are created by more than just a single symbol residing at the top row of your computer keyboard. There are just too few symbols on the normal keyboard to represent the breadth of operators. The authors of R decided to use combinations of symbols to handle these, and you will get used to them in no time at all.
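As a small sketch of another multi-symbol operator, the modulus has a companion, %/%, which performs integer division. The variable names quotient and remainder below are only illustrative; together the two operators split a division into its whole-number part and what is left over.

```r
# Integer division (%/%) and modulus (%%) split a division into
# a whole-number quotient and a remainder.
x <- 10
y <- 23

quotient  <- y %/% x   # how many whole times x fits into y
remainder <- y %% x    # what is left over

quotient
remainder

# Putting the pieces back together recovers the original value.
quotient * x + remainder == y
```

Here 23 contains two whole 10s with 3 left over, which matches the remainder shown above.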
6.2.2 Introspection & Coercion
The class() of a numeric type is (wait for it)… numeric (those R programmers are sure clever).
class( 42 )
[1] "numeric"
In this case class is the name of the function and there are one or more things we pass to that function. These must be enclosed in the parentheses associated with class. By convention, the parentheses sit right next to the name of the function. R will tolerate a space between the word class and the parentheses, but it makes the call harder to read, so avoid it. You’ve been warned.
Here we pass the value 42 to the class function and it does its magic and tells us what kind of data type it is. Many functions have several arguments that can be passed to them, some optional, some not. We will get more into that in the lecture covering Functions.
It is also possible to inquire if a particular variable is of a certain class. This is done by using the is.* set of functions.
is.numeric( 42 )
[1] TRUE
is.numeric( "dr dyer" )
[1] FALSE
Sometimes we may need to turn one kind of class into another kind. Consider the following:
x <- "42"
is.numeric( x )
[1] FALSE
class(x)
[1] "character"
It is a character data type because it is enclosed within a set of quotes. However, we can coerce it into a numeric type by:
y <- as.numeric( x )
is.numeric( y )
[1] TRUE
y
[1] 42
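Coercion only succeeds when the text actually looks like a number. The following is a minimal sketch (the variable names bad and mixed are made up for illustration) of what happens otherwise: R returns the missing value NA and issues a warning, which is silenced here with suppressWarnings() so the example runs quietly.

```r
# "dr dyer" has no numeric interpretation, so coercion yields NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
bad <- suppressWarnings( as.numeric( "dr dyer" ) )
is.na( bad )

# A vector is coerced element by element; only the last entry fails.
mixed <- suppressWarnings( as.numeric( c( "42", "3.14", "forty-two" ) ) )
mixed
is.na( mixed )
```

Checking the result with is.na() after a coercion is a cheap way to catch values that did not convert.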
6.3 Character Data
Character data represents textual content.
The data type character is intended to represent textual data such as actual texts, names of objects, and other content that is intended to help both you and the audience you are trying to reach better understand your data.
name <- "Dyer"
sport <- "Frolf"
The two variables above have a sequence of characters enclosed by a double quote. You can use a single quote instead, however the enclosing quoting characters must be the same (e.g., you cannot start with a single quote and end with a double).
6.3.1 Lengths
The length of a string is a measure of how many elements there are, not the number of characters within it. For example, the length of name is
length(name)
[1] 1
because it only has one element, but the number of characters within it is:
nchar(name)
[1] 4
Length is defined specifically on the number of elements in a vector, and technically the variable name is a vector of length one. If we concatenate them into a vector (go see the vector content)
phrase <- c( name, sport )
we find that it has a length of 2
length(phrase)
[1] 2
And if we ask the vector how many characters are in the elements it contains, it gives us a vector of numeric types representing the number of letters in each of the elements.
nchar(phrase)
[1] 4 5
6.3.2 Putting Character Objects Together
The binary + operator has not been defined for objects of class character, which is understandable once we consider all the different ways we may want to put the values contained in the variables together. If you try it, R will complain.
name + sport
Error in name + sport: non-numeric argument to binary operator
The paste() function is designed to take a collection of character variables and smush them together. By default, it inserts a space between each of the variables and/or values passed to it.
paste( name, "plays", sport )
[1] "Dyer plays Frolf"
Although, you can have any kind of separator you like:
paste(name, sport, sep=" is no good at ")
[1] "Dyer is no good at Frolf"
The elements you pass to paste() do not need to be held in variables; you can put quoted character values in there as well.
paste( name, " the ", sport, "er", sep="")
[1] "Dyer the Frolfer"
If you have a vector of character types, by default, it considers the pasting operation to be applied to every element of the vector.
paste( phrase , "!")
[1] "Dyer !" "Frolf !"
However, if your intention is to take the elements of the vector and paste them together, then you need to specify that using the optional collapse argument. By default, it is set to NULL, and that state tells the function to apply the paste()-ing to each element. However, if you set collapse to something other than NULL, it will use that to take all the elements and put them into a single response.
paste( phrase, collapse = " is not good at ")
[1] "Dyer is not good at Frolf"
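The sep and collapse arguments can also be combined, and base R provides paste0() as a shorthand for paste() with sep="". The sketch below reuses the name, sport, and phrase variables from above; label and shout are names invented for this example.

```r
name   <- "Dyer"
sport  <- "Frolf"
phrase <- c( name, sport )

# paste0() is paste() with sep = "", so no spaces are inserted.
label <- paste0( name, " the ", sport, "er" )
label

# sep joins the arguments within each element;
# collapse then joins the resulting elements into one string.
shout <- paste( phrase, "!", sep = "", collapse = " " )
shout
```

The first call gives the same "Dyer the Frolfer" as before, and the second removes the stray space in front of each exclamation mark while still producing a single string.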
6.3.3 String Operations
Many times, we need to extract components from within a longer character element. Here is a longer bit of text as an example.
corpus <- "An environmental impact statement (EIS), under United States environmental law, is a document required by the 1969 National Environmental Policy Act (NEPA) for certain actions 'significantly affecting the quality of the human environment'.[1] An EIS is a tool for decision making. It describes the positive and negative environmental effects of a proposed action, and it usually also lists one or more alternative actions that may be chosen instead of the action described in the EIS. Several U.S. state governments require that a document similar to an EIS be submitted to the state for certain actions. For example, in California, an Environmental Impact Report (EIR) must be submitted to the state for certain actions, as described in the California Environmental Quality Act (CEQA). One of the primary authors of the act is Lynton K. Caldwell."
6.3.4 Splits
We can split the original string into several components by specifying which particular character or set of characters we wish to use to break it apart.
As we start working with increasingly more complicated string operations, I like to use a higher-level library (part of tidyverse) called stringr. If you do not have this library already installed, you can install it using install.packages("stringr").
library( stringr )
Here is an example using the space character to pull it apart into words.
str_split( corpus, pattern=" ", simplify=TRUE)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "An" "environmental" "impact" "statement" "(EIS)," "under" "United"
[,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "States" "environmental" "law," "is" "a" "document" "required" "by"
[,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
[1,] "the" "1969" "National" "Environmental" "Policy" "Act" "(NEPA)" "for"
[,24] [,25] [,26] [,27] [,28] [,29] [,30]
[1,] "certain" "actions" "'significantly" "affecting" "the" "quality" "of"
[,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38] [,39]
[1,] "the" "human" "environment'.[1]" "An" "EIS" "is" "a" "tool" "for"
[,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47]
[1,] "decision" "making." "It" "describes" "the" "positive" "and" "negative"
[,48] [,49] [,50] [,51] [,52] [,53] [,54] [,55]
[1,] "environmental" "effects" "of" "a" "proposed" "action," "and" "it"
[,56] [,57] [,58] [,59] [,60] [,61] [,62] [,63] [,64]
[1,] "usually" "also" "lists" "one" "or" "more" "alternative" "actions" "that"
[,65] [,66] [,67] [,68] [,69] [,70] [,71] [,72] [,73]
[1,] "may" "be" "chosen" "instead" "of" "the" "action" "described" "in"
[,74] [,75] [,76] [,77] [,78] [,79] [,80] [,81] [,82]
[1,] "the" "EIS." "Several" "U.S." "state" "governments" "require" "that" "a"
[,83] [,84] [,85] [,86] [,87] [,88] [,89] [,90] [,91]
[1,] "document" "similar" "to" "an" "EIS" "be" "submitted" "to" "the"
[,92] [,93] [,94] [,95] [,96] [,97] [,98] [,99]
[1,] "state" "for" "certain" "actions." "For" "example," "in" "California,"
[,100] [,101] [,102] [,103] [,104] [,105] [,106] [,107]
[1,] "an" "Environmental" "Impact" "Report" "(EIR)" "must" "be" "submitted"
[,108] [,109] [,110] [,111] [,112] [,113] [,114] [,115]
[1,] "to" "the" "state" "for" "certain" "actions," "as" "described"
[,116] [,117] [,118] [,119] [,120] [,121] [,122]
[1,] "in" "the" "California" "Environmental" "Quality" "Act" "(CEQA)."
[,123] [,124] [,125] [,126] [,127] [,128] [,129] [,130] [,131]
[1,] "One" "of" "the" "primary" "authors" "of" "the" "act" "is"
[,132] [,133] [,134]
[1,] "Lynton" "K." "Caldwell."
which shows 134 words in the text.
I need to point out that I added the simplify=TRUE option to str_split. Had I not done that, it would have returned a list object that contained the individual vector of words. There are various reasons that it returns a list, none of which I can frankly understand; that is just the way the authors of the function made it.
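One plausible reason for the list is that splitting works on a whole vector of strings at once, and each string can break into a different number of pieces, which does not fit neatly into a single vector. The sketch below uses base R's strsplit() as a stand-in, since it behaves the same way in this respect and needs no extra packages; the sentences vector is made up for illustration.

```r
# Two strings that split into different numbers of words.
sentences <- c( "An EIS is a tool", "It describes effects" )

# strsplit() returns one character vector per input string, wrapped in a list.
words <- strsplit( sentences, split = " ", fixed = TRUE )
class( words )

# lengths() reports how many pieces each input produced.
lengths( words )

# Double brackets pull out the vector of words for a single input.
words[[1]]
```

With only one input string, as with corpus above, the list has a single element, and [[1]] retrieves the vector of words.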
6.3.5 Substrings
There are two different things you may want to do with substrings: find them and replace them. Here are some ways to figure out where they are.
str_detect(corpus, "Environment")
[1] TRUE
str_count( corpus, "Environment")
[1] 3
str_locate_all( corpus, "Environment")
[[1]]
start end
[1,] 125 135
[2,] 637 647
[3,] 754 764
We can also replace instances of one substring with another.
str_replace_all(corpus, "California", "Virginia")
[1] "An environmental impact statement (EIS), under United States environmental law, is a document required by the 1969 National Environmental Policy Act (NEPA) for certain actions 'significantly affecting the quality of the human environment'.[1] An EIS is a tool for decision making. It describes the positive and negative environmental effects of a proposed action, and it usually also lists one or more alternative actions that may be chosen instead of the action described in the EIS. Several U.S. state governments require that a document similar to an EIS be submitted to the state for certain actions. For example, in Virginia, an Environmental Impact Report (EIR) must be submitted to the state for certain actions, as described in the Virginia Environmental Quality Act (CEQA). One of the primary authors of the act is Lynton K. Caldwell."
There is a lot more fun stuff to do with string based data.
6.4 Logical Data
Logical data consists of two mutually exclusive states: TRUE or FALSE
dyer_has_good_jokes <- TRUE
dyer_has_good_jokes
[1] TRUE
6.4.1 Operators on Logical Types
There are 3 primary logical operators that can be used on logical types; one unary and two binary.
6.4.1.1 Unary Operator
The negation operator
!dyer_has_good_jokes
[1] FALSE
6.4.2 The Binary Operators
6.4.2.1 The OR operator
TRUE | FALSE
[1] TRUE
6.4.2.2 The AND operator
TRUE & FALSE
[1] FALSE
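These operators also work element by element on vectors of logical values. The following is a minimal sketch with two made-up vectors, a and b, along with the base R helpers any() and all() that reduce a logical vector to a single answer.

```r
a <- c( TRUE, TRUE,  FALSE )
b <- c( TRUE, FALSE, FALSE )

# AND and OR are applied position by position.
both   <- a & b
either <- a | b
both
either

# any() asks whether at least one value is TRUE; all() asks whether every value is.
any( both )
all( either )
```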
6.4.3 Introspection
Logical types have an introspection operator.
is.logical( dyer_has_good_jokes )
[1] TRUE
Coercion of something else to a Logical is more case-specific.
From character data:
as.logical( "TRUE" )
[1] TRUE
as.logical( "FALSE" )
[1] FALSE
Character values that R does not recognize as true or false result in NA (missing data).
as.logical( "Bob" )
[1] NA
6.4.4 Coercion
From numeric data:
- Values of 0 are FALSE
- Non-zero values are TRUE
as.logical(0)
[1] FALSE
as.logical( 323 )
[1] TRUE
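The coercion also runs in the other direction: TRUE becomes 1 and FALSE becomes 0. As a small sketch (the flags vector is invented for illustration), this is what makes it possible to count or average logical values directly.

```r
flags <- c( TRUE, FALSE, TRUE, TRUE )

# Explicit coercion to numeric.
as.numeric( flags )

# sum() counts the TRUE values; mean() gives the proportion that are TRUE.
n_true    <- sum( flags )
prop_true <- mean( flags )
n_true
prop_true
```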
6.5 Dates
Time is the next dimension.
This topic covers the basics of how we put together data based upon date and time objects. For this, we will use the following data frame with a single column of data representing dates as they are written in the US.
There are several challenges associated with working with date and time objects. Those of us who grew up with US date and time formats can easily read a value such as “2/14/2018” as Month/Day/Year, and in the kind of input data we work with in R it is commonly stored as a string of characters. Dates and times are sticky things in data analysis because they do not work the way we think they should. Here are some wrinkles:
- There are many types of calendars. We use the Gregorian calendar, but there are many other calendars in use that we may run into. Each of these calendars has a different starting year (e.g., in the Assyrian calendar it is year 6770, it is 4718 in the Chinese calendar, 2020 in the Gregorian, and 1442 in the Islamic calendar).
- The Western calendar has leap years (+1 day in February) as well as leap seconds because it is based on the Earth’s orbit around the sun; others are based upon the lunar cycle and have other corrections.
- On this planet, we nominally have 24 different time zones. Some states (looking at you Arizona) don’t feel it necessary to follow the other states around so they may be the same as PST some of the year and the same as MST the rest of the year. The province of Newfoundland decided to be half-way between time zones, so its standard time is GMT-3:30 (GMT-2:30 during daylight saving time). Some states have more than one time zone even if they are not large in size (hello Indiana).
- Dates and times are made up of odd units: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, 7 days in a week, 2 weeks in a fortnight, 28, 29, 30, or 31 days in a month, 365 or 366 days in a year, 100 years in a century, etc.
Fortunately, some smart programmers have figured this out for us already. What they did is make the second the base unit of time and designate 00:00:00 UTC on 1 January 1970 as the unix epoch. Time on most modern computers is measured from that starting point. It is much easier to measure the difference between two points in time using the seconds since the unix epoch and then translate it into one or more of these calendars than to deal with all the different calendars each time. So under the hood, most date and time values are kept in terms of epoch seconds.
unclass( Sys.time() )
[1] 1701788349
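To make the epoch idea concrete, here is a minimal sketch that goes the other way, turning a count of seconds back into a date and time. The names epoch and one_day are made up for this example, and the time zone is pinned to UTC so the result does not depend on where the code is run.

```r
# Zero seconds after the epoch is the epoch itself.
epoch <- as.POSIXct( 0, origin = "1970-01-01", tz = "UTC" )
format( epoch, "%Y-%m-%d %H:%M:%S" )

# 86400 seconds (60 * 60 * 24) later is exactly one day on.
one_day <- as.POSIXct( 86400, origin = "1970-01-01", tz = "UTC" )
format( one_day, "%Y-%m-%d" )

# Date objects use the same starting point but count whole days instead of seconds.
as.numeric( as.Date( "1970-01-02" ) )
```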
6.5.1 Basic Date Objects
R has some basic date functionality built into it. One of the easiest ways to get a date object created is to specify a date as a character string and then coerce it into a date object. By default, this requires us to represent the date as “YEAR-MONTH-DAY”, padding the month or day with a leading 0 when it is below 10 (i.e., both must be two digits).
So for example, we can specify a date object as:
class_start <- as.Date("2021-01-15")
class_start
[1] "2021-01-15"
And it is of type:
class( class_start )
[1] "Date"
If you want to make a date from a different format, you need to specify which elements appear within the string representation using format codes. These codes (and many more) can be found by looking at ?strptime.
class_end <- as.Date( "5/10/21", format = "%m/%d/%y")
class_end
[1] "2021-05-10"
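The same format codes work in the other direction, when a date object needs to be written out as text. This is a short sketch using base R's format() on the class_end date from above; the particular output layouts are chosen only for illustration, and only numeric codes are used so the result does not depend on the computer's language settings.

```r
class_end <- as.Date( "5/10/21", format = "%m/%d/%y" )

# Day first, separated by dots, with a four-digit year.
european <- format( class_end, "%d.%m.%Y" )
european

# %j is the day of the year, padded to three digits.
year_day <- format( class_end, "%Y/%j" )
year_day
```

Note that format() returns a character string, so the result is no longer a Date object.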
I like to use some higher-level date functions from the lubridate library. If you don’t have it installed, do so using the normal approach.
library( lubridate )
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
Date objects can be put into vectors and sequences just like other objects.
semester <- seq( class_start, class_end, by = "1 day")
semester
[1] "2021-01-15" "2021-01-16" "2021-01-17" "2021-01-18" "2021-01-19"
[6] "2021-01-20" "2021-01-21" "2021-01-22" "2021-01-23" "2021-01-24"
[11] "2021-01-25" "2021-01-26" "2021-01-27" "2021-01-28" "2021-01-29"
[16] "2021-01-30" "2021-01-31" "2021-02-01" "2021-02-02" "2021-02-03"
[21] "2021-02-04" "2021-02-05" "2021-02-06" "2021-02-07" "2021-02-08"
[26] "2021-02-09" "2021-02-10" "2021-02-11" "2021-02-12" "2021-02-13"
[31] "2021-02-14" "2021-02-15" "2021-02-16" "2021-02-17" "2021-02-18"
[36] "2021-02-19" "2021-02-20" "2021-02-21" "2021-02-22" "2021-02-23"
[41] "2021-02-24" "2021-02-25" "2021-02-26" "2021-02-27" "2021-02-28"
[46] "2021-03-01" "2021-03-02" "2021-03-03" "2021-03-04" "2021-03-05"
[51] "2021-03-06" "2021-03-07" "2021-03-08" "2021-03-09" "2021-03-10"
[56] "2021-03-11" "2021-03-12" "2021-03-13" "2021-03-14" "2021-03-15"
[61] "2021-03-16" "2021-03-17" "2021-03-18" "2021-03-19" "2021-03-20"
[66] "2021-03-21" "2021-03-22" "2021-03-23" "2021-03-24" "2021-03-25"
[71] "2021-03-26" "2021-03-27" "2021-03-28" "2021-03-29" "2021-03-30"
[76] "2021-03-31" "2021-04-01" "2021-04-02" "2021-04-03" "2021-04-04"
[81] "2021-04-05" "2021-04-06" "2021-04-07" "2021-04-08" "2021-04-09"
[86] "2021-04-10" "2021-04-11" "2021-04-12" "2021-04-13" "2021-04-14"
[91] "2021-04-15" "2021-04-16" "2021-04-17" "2021-04-18" "2021-04-19"
[96] "2021-04-20" "2021-04-21" "2021-04-22" "2021-04-23" "2021-04-24"
[101] "2021-04-25" "2021-04-26" "2021-04-27" "2021-04-28" "2021-04-29"
[106] "2021-04-30" "2021-05-01" "2021-05-02" "2021-05-03" "2021-05-04"
[111] "2021-05-05" "2021-05-06" "2021-05-07" "2021-05-08" "2021-05-09"
[116] "2021-05-10"
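The by argument accepts other step sizes as well, and subtracting two dates gives the span between them. Here is a minimal sketch with the same start and end dates; weekly and n_days are names invented for this example.

```r
class_start <- as.Date( "2021-01-15" )
class_end   <- as.Date( "2021-05-10" )

# One entry every seven days, starting on the first day of class.
weekly <- seq( class_start, class_end, by = "1 week" )
length( weekly )

# The sequence stops at the last weekly step that does not pass class_end.
weekly[ length( weekly ) ]

# Subtracting dates gives a time difference; as.numeric() extracts the day count.
n_days <- as.numeric( class_end - class_start )
n_days
```

The daily sequence above has 116 entries because it includes both end points of this 115-day span.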
Some helpful functions include one for the Julian ordinal day (i.e., the day of the year, counting from 1 January).
ordinal_day <- yday( semester[102] )
ordinal_day
[1] 116
The weekday as an integer (1-7 starting on Sunday), which I use to index the named values.
days_of_week <- c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")
x <- wday( semester[32] )
days_of_week[ x ]
[1] "Monday"
Since we did not specify a time, things like hour() and minute() do not provide any usable information.
6.5.2 Dates & Times
To add time to the date objects, we need to specify both date and time specifically. Here are some example data:
df <- data.frame( Date = c("8/21/2004 7:33:51 AM",
                           "7/12/2008 9:23:08 PM",
                           "2/14/2010 8:18:30 AM",
                           "12/23/2018 11:11:45 PM",
                           "2/1/2019 4:42:00 PM",
                           "5/17/2012 1:23:23 AM",
                           "12/11/2020 9:48:02 PM") )
summary( df )
Date
Length:7
Class :character
Mode :character
Just like above, if we want to turn these into date and time objects we must be able to tell the parsing algorithm what elements are represented in each entry. There are many ways to write dates and times: 10/14 or 14 Oct or October 14 or Julian day 287, etc. These are designated by a format string where we indicate what element represents a day or month or year or hour or minute or second, etc. These are found by looking at the documentation for ?strptime.
In our case, we have:
- Month as 1 or 2 digits
- Day as 1 or 2 digits
- Year as 4 digits
- a space to separate date from time
- hour (not 24-hour though)
- minutes in 2 digits
- seconds in 2 digits
- a space to separate the time from the AM/PM indicator
- the AM/PM indicator
- / separating date objects
- : separating time objects
To make the format string, we need to look up how to encode these items. The items in df for a date & time object such as 2/1/2019 4:42:00 PM have the format string:
format <- "%m/%d/%Y %I:%M:%S %p"
Now, we can convert the character string in the data frame to a date and time object.
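As a quick check that the format string describes the data, the sketch below applies it to a single entry using base R's as.POSIXct(). The names fmt and stamp are made up for this example, the time zone is set to UTC purely to keep the result reproducible, and it assumes an English-language locale, where AM and PM are the recognized indicators for %p.

```r
fmt <- "%m/%d/%Y %I:%M:%S %p"

# Parse one of the entries; %I reads the 12-hour clock and %p the AM/PM indicator.
stamp <- as.POSIXct( "2/1/2019 4:42:00 PM", format = fmt, tz = "UTC" )

# Write it back out on a 24-hour clock to confirm that PM was understood.
format( stamp, "%Y-%m-%d %H:%M:%S" )
```

If the format string did not match the text, as.POSIXct() would return NA rather than raise an error, so a check like this is worth doing before converting a whole column.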
6.5.3 Lubridate
Instead of using the built-in as.Date() functionality, I like the lubridate library¹ as it has a lot of additional functionality that we’ll play with a bit later.
df$Date <- parse_date_time( df$Date,
                            orders=format,
                            tz = "EST" )
summary( df )
Date
Min. :2004-08-21 07:33:51.00
1st Qu.:2009-04-29 14:50:49.00
Median :2012-05-17 01:23:23.00
Mean :2013-07-11 07:28:39.85
3rd Qu.:2019-01-12 19:56:52.50
Max. :2020-12-11 21:48:02.00
class( df$Date )
[1] "POSIXct" "POSIXt"
Now, we can ask Date-like questions about the data such as what day of the week was the first sample taken?
weekdays( df$Date[1] )
[1] "Saturday"
What is the range of dates?
range( df$Date )
[1] "2004-08-21 07:33:51 EST" "2020-12-11 21:48:02 EST"
What is the median of the samples?
median( df$Date )
[1] "2012-05-17 01:23:23 EST"
And what Julian ordinal day (i.e., which day of the year) is the fourth record?
yday( df$Date[4] )
[1] 357
Just for fun, I’ll add a column to the data that has weekday.
df$Weekday <- weekdays( df$Date )
df
Date Weekday
1 2004-08-21 07:33:51 Saturday
2 2008-07-12 21:23:08 Saturday
3 2010-02-14 08:18:30 Sunday
4 2018-12-23 23:11:45 Sunday
5 2019-02-01 16:42:00 Friday
6 2012-05-17 01:23:23 Thursday
7 2020-12-11 21:48:02 Friday
However, we should probably turn it into a factor (e.g., a data type with pre-defined levels—and for us here—an intrinsic order of the levels).
df$Weekday <- factor( df$Weekday,
                      ordered = TRUE,
                      levels = days_of_week
)
summary( df$Weekday )
Sunday Monday Tuesday Wednesday Thursday Friday Saturday
2 0 0 0 1 2 2
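Making the factor ordered is what allows the levels to be compared and sorted in calendar order rather than alphabetically. The following is a minimal sketch with a small made-up vector, f, using the same days_of_week levels.

```r
days_of_week <- c( "Sunday", "Monday", "Tuesday", "Wednesday",
                   "Thursday", "Friday", "Saturday" )

f <- factor( c( "Friday", "Sunday", "Thursday" ),
             levels = days_of_week,
             ordered = TRUE )

# Comparisons follow the order of the levels, not the alphabet.
f[2] < f[1]

# Sorting and max() respect the same order.
as.character( sort( f ) )
as.character( max( f ) )
```

With an unordered factor, the comparison would return NA with a warning and max() would fail with an error.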
6.5.4 Filtering on Date Objects
We can easily filter the content within a data.frame using some helper functions such as hour(), minute(), wday(), etc. Here are some examples including pulling out the weekends.
weekends <- df[ df$Weekday %in% c("Saturday","Sunday"), ]
weekends
Date Weekday
1 2004-08-21 07:33:51 Saturday
2 2008-07-12 21:23:08 Saturday
3 2010-02-14 08:18:30 Sunday
4 2018-12-23 23:11:45 Sunday
Finding items that are in the past (past being defined as the last time this document was knit).
past <- df$Date[ df$Date < Sys.time() ]
past
[1] "2004-08-21 07:33:51 EST" "2008-07-12 21:23:08 EST"
[3] "2010-02-14 08:18:30 EST" "2018-12-23 23:11:45 EST"
[5] "2019-02-01 16:42:00 EST" "2012-05-17 01:23:23 EST"
[7] "2020-12-11 21:48:02 EST"
Items that are during working hours:
work <- df$Date[ hour(df$Date) >= 9 & hour(df$Date) <= 17 ]
work
[1] "2019-02-01 16:42:00 EST"
And the total range of values in days, using normal arithmetic operations such as the minus operator.
max(df$Date) - min(df$Date)
Time difference of 5956.593 days
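The subtraction picks its own units for the answer. As a sketch of how to ask for specific units instead, base R's difftime() takes a units argument. The two time stamps below are the earliest and latest records from the data, typed in by hand with a UTC time zone so the example stands on its own; first, last, and span are names made up for illustration.

```r
first <- as.POSIXct( "2004-08-21 07:33:51", tz = "UTC" )
last  <- as.POSIXct( "2020-12-11 21:48:02", tz = "UTC" )

# Ask explicitly for the difference in days.
span <- difftime( last, first, units = "days" )
span

# The same span expressed in weeks, as a plain number.
as.numeric( span, units = "weeks" )
```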
6.6 Questions
If you have any questions for me specifically on this topic, please post as an Issue in your repository, otherwise consider posting to the discussion board on Canvas.
¹ If you get an error saying something like, “there is no package named lubridate” then use install.packages("lubridate") and install it. You only need to do this once.