6 Basic Data Types

6.1 Missing Data

The Absence of Data

The most fundamental type of data in R is data that does not exist! Missing data! It is represented as NA

x <- NA

and can be in

6.2 Numerical Data

Numerical data contains all numerical representations.

By far, the most common kind of data we use in our analyses is numerical data. This may represent measured things like height, snout-vent length (whatever that is), depth, age, etc. In data analysis, we commonly take (or obtain) measurements from several items and then try to characterize them using summaries and visualization.

In R, the numerical data type can be defined as:

X <- 42

Notice how the numerical value of 42 is assigned to the variable named X. To have R print out the value of a particular variable, you can type its name in the console and it will give it to you.

[1] 42

6.2.1 Operators

Numeric types have a ton of normal operators that can be used. Some examples include:

The usual arithmetic operators:

x <- 10
y <- 23

x + y

[1] 33

x - y

[1] -13

x * y

[1] 230

x / y

[1] 0.4347826

You have the exponentials:

## x raised to the y
x^y

[1] 1e+23

## the inverse of an exponent is a root, here is the 23rd root of 10
x^(1/y)

[1] 1.105295

The logarithms:

## the natural log
log(x)

[1] 2.302585

## Base 10 log
log(x,base=10)

[1] 1

And the modulus operator:

y %% x

[1] 3

If you didn’t know what this one is, don’t worry. The modulus is just the remainder after division like you did in grade school. The above code means that 23 divided by 10 has a remainder of 3. I include it here just to highlight the fact that many of the operators that we will be working with in R are created by more than just a single symbol residing at the top row of your computer keyboard. There are just too few symbols on the normal keyboard to represent the breadth of operators. The authors of R have decided to use combinations of symbols to handle these, and you will get used to them in no time at all.
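As a small sketch of another multi-symbol operator, here is integer division (`%/%`) next to the modulus. The variable names (`total`, `group_size`, and so on) are just ones I made up for illustration.

```r
# Integer division and modulus are both written with multi-symbol operators.
total      <- 23
group_size <- 10

whole_groups <- total %/% group_size   # complete groups of 10 that fit into 23
left_over    <- total %% group_size    # what remains once those groups are removed

# Putting the two pieces back together recovers the original value.
rebuilt <- whole_groups * group_size + left_over
rebuilt
```

The two operators are complementary: one gives the whole-number part of the division and the other gives the remainder.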

6.2.2 Introspection & Coercion

The class() of a numeric type is (wait for it)… numeric (those R programmers are sure clever).

class( 42 )

[1] "numeric"

In this case class is the name of the function and there are one or more things we pass to that function. These must be enclosed in the parentheses associated with class. By convention, the parentheses sit right next to the name of the function. R will tolerate a space between the word class and the parentheses, but it is harder to read and is best avoided.

The stuff inside the parentheses are called arguments and are the data that we pass to the function itself. In this case we pass a value or variable to the class function and it does its magic and tells us what kind of data type it is. Many functions have several arguments that can be passed to them, some optional, some not. We will get more into that in the lecture covering Functions.
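To make the idea of required and optional arguments a bit more concrete, here is a minimal sketch using the built-in round() function. The variable names are my own and only there for illustration.

```r
value <- 3.14159

# The first argument (the number to round) is required.
# The second argument, digits, is optional and defaults to 0.
whole      <- round( value )
two_places <- round( value, digits = 2 )

# Arguments can be matched by name (as above) or by position.
two_places_positional <- round( value, 2 )

whole
two_places
```

Naming the argument is more typing, but it makes it clear to the reader what the 2 is for.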

It is also possible to inquire if a particular variable is of a certain class. This is done by using the is.* set of functions.

is.numeric( 42 )

[1] TRUE

is.numeric( "dr dyer" )

[1] FALSE

Sometimes we may need to turn one kind of class into another kind. Consider the following:

x <- "42"
is.numeric( x )

[1] FALSE

class(x)

[1] "character"

It is a character data type because it is enclosed within a set of quotes. However, we can coerce it into a numeric type by:

y <- as.numeric( x )
is.numeric( y )

[1] TRUE

[1] 42
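Coercion does not always succeed. As a quick sketch (the `readings` vector is made up for this example), text that does not look like a number becomes NA, the missing data value from the start of this chapter, and R issues a warning about it.

```r
readings <- c( "42", "3.5", "forty-two" )

# as.numeric() warns that NAs were introduced by coercion;
# suppressWarnings() only silences that message for this example.
values <- suppressWarnings( as.numeric( readings ) )
values

# is.na() tells us which elements could not be converted.
is.na( values )
```

Checking with is.na() after a coercion is a cheap way to catch entries that were typed in an unexpected way.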

6.3 Character Data

Character data represents textual content.

The data type character is intended to represent textual data such as actual texts, names of objects, and other content that is intended to help both you and the audience you are trying to reach better understand your data.

name <- "Dyer"
sport <- "Frolf"

The two variables above each hold a sequence of characters enclosed by double quotes. You can use single quotes instead; however, the opening and closing quote characters must be the same (e.g., you cannot start with a single quote and end with a double).

6.3.1 Lengths

The length of a string is a measure of how many elements there are, not the number of characters within it. For example, the length of name is

length(name)

[1] 1

because it only has one element, but the number of characters within it is:

nchar(name)

[1] 4

Length is defined specifically on the number of elements in a vector, and technically the variable name is a vector of length one. If we concatenate them into a vector (go see the vector content)

phrase <- c( name, sport )

we find that it has a length of 2

length(phrase)

[1] 2

And if we ask the vector how many characters are in the elements it contains, it gives us a vector of numeric types representing the number of letters in each of the elements.

nchar(phrase)

[1] 4 5

6.3.2 Putting Character Objects Together

The binary + operator has not been defined for objects of class character, which is understandable once we consider all the different ways we may want to put the values contained in the variables together. If you try it, R will complain.

name + sport

Error in name + sport: non-numeric argument to binary operator

The paste() function is designed to take a collection of character variables and smush them together. By default, it inserts a space between each of the variables and/or values passed to it.

paste( name, "plays", sport )

[1] "Dyer plays Frolf"

Although, you can have any kind of separator you like:

paste(name, sport, sep=" is no good at ")

[1] "Dyer is no good at Frolf"

The elements you pass to paste() do not need to be held in variables, you can put quoted character values in there as well.

paste( name, " the ", sport, "er", sep="")

[1] "Dyer the Frolfer"

If you have a vector of character types, by default, it considers the pasting operation to be applied to every element of the vector.

paste( phrase , "!")

[1] "Dyer !"  "Frolf !"

However, if your intention is to take the elements of the vector and paste them together, then you need to specify that using the optional collapse argument. By default, it is set to NULL, and that state tells the function to apply the paste()-ing to each element. However, if you set collapse to something other than NULL, it will use that to take all the elements and put them into a single response.

paste( phrase, collapse = " is not good at ")

[1] "Dyer is not good at Frolf"
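The sep and collapse arguments can also be used together, and there is a shortcut function, paste0(), for the common case of no separator at all. Here is a small sketch; the `sites` labels are invented for the example.

```r
# paste0() is paste() with sep = "" already filled in.
sites <- paste0( "site_", 1:3 )
sites

# sep is used between the arguments for each element ...
labels <- paste( sites, "2021", sep = "-" )
labels

# ... and collapse is then used to join those elements into one string.
one_line <- paste( sites, "2021", sep = "-", collapse = ", " )
one_line
```

So sep works across the things you pass in, while collapse works down the resulting vector.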

6.3.3 String Operations

Many times, we need to extract components from within a longer character element. Here is a longer bit of text as an example.

corpus <- "An environmental impact statement (EIS), under United States environmental law, is a document required by the 1969 National Environmental Policy Act (NEPA) for certain actions 'significantly affecting the quality of the human environment'.[1] An EIS is a tool for decision making. It describes the positive and negative environmental effects of a proposed action, and it usually also lists one or more alternative actions that may be chosen instead of the action described in the EIS. Several U.S. state governments require that a document similar to an EIS be submitted to the state for certain actions. For example, in California, an Environmental Impact Report (EIR) must be submitted to the state for certain actions, as described in the California Environmental Quality Act (CEQA). One of the primary authors of the act is Lynton K. Caldwell."

6.3.4 Splits

We can split the original string into several components by specifying which particular character or set of characters we wish to use to break it apart.

As we start working with increasingly more complicated string operations, I like to use a higher-level library (part of tidyverse) called stringr. If you do not have this library already installed, you can install it using install.packages("stringr").

library( stringr )

Here is an example using the space character to pull it apart into words.

str_split( corpus, pattern=" ", simplify=TRUE)

     [,1] [,2]            [,3]     [,4]        [,5]     [,6]    [,7]    
[1,] "An" "environmental" "impact" "statement" "(EIS)," "under" "United"
     [,8]     [,9]            [,10]  [,11] [,12] [,13]      [,14]      [,15]
[1,] "States" "environmental" "law," "is"  "a"   "document" "required" "by" 
     [,16] [,17]  [,18]      [,19]           [,20]    [,21] [,22]    [,23]
[1,] "the" "1969" "National" "Environmental" "Policy" "Act" "(NEPA)" "for"
     [,24]     [,25]     [,26]            [,27]       [,28] [,29]     [,30]
[1,] "certain" "actions" "'significantly" "affecting" "the" "quality" "of" 
     [,31] [,32]   [,33]              [,34] [,35] [,36] [,37] [,38]  [,39]
[1,] "the" "human" "environment'.[1]" "An"  "EIS" "is"  "a"   "tool" "for"
     [,40]      [,41]     [,42] [,43]       [,44] [,45]      [,46] [,47]     
[1,] "decision" "making." "It"  "describes" "the" "positive" "and" "negative"
     [,48]           [,49]     [,50] [,51] [,52]      [,53]     [,54] [,55]
[1,] "environmental" "effects" "of"  "a"   "proposed" "action," "and" "it" 
     [,56]     [,57]  [,58]   [,59] [,60] [,61]  [,62]         [,63]     [,64] 
[1,] "usually" "also" "lists" "one" "or"  "more" "alternative" "actions" "that"
     [,65] [,66] [,67]    [,68]     [,69] [,70] [,71]    [,72]       [,73]
[1,] "may" "be"  "chosen" "instead" "of"  "the" "action" "described" "in" 
     [,74] [,75]  [,76]     [,77]  [,78]   [,79]         [,80]     [,81]  [,82]
[1,] "the" "EIS." "Several" "U.S." "state" "governments" "require" "that" "a"  
     [,83]      [,84]     [,85] [,86] [,87] [,88] [,89]       [,90] [,91]
[1,] "document" "similar" "to"  "an"  "EIS" "be"  "submitted" "to"  "the"
     [,92]   [,93] [,94]     [,95]      [,96] [,97]      [,98] [,99]        
[1,] "state" "for" "certain" "actions." "For" "example," "in"  "California,"
     [,100] [,101]          [,102]   [,103]   [,104]  [,105] [,106] [,107]     
[1,] "an"   "Environmental" "Impact" "Report" "(EIR)" "must" "be"   "submitted"
     [,108] [,109] [,110]  [,111] [,112]    [,113]     [,114] [,115]     
[1,] "to"   "the"  "state" "for"  "certain" "actions," "as"   "described"
     [,116] [,117] [,118]       [,119]          [,120]    [,121] [,122]   
[1,] "in"   "the"  "California" "Environmental" "Quality" "Act"  "(CEQA)."
     [,123] [,124] [,125] [,126]    [,127]    [,128] [,129] [,130] [,131]
[1,] "One"  "of"   "the"  "primary" "authors" "of"   "the"  "act"  "is"  
     [,132]   [,133] [,134]     
[1,] "Lynton" "K."   "Caldwell."

which shows 134 words in the text.

I need to point out that I added the simplify=TRUE option to str_split. Had I not done that, it would have returned a list object that contained the individual vector of words. There are various reasons that it returns a list, none of which I can frankly understand; that is just the way the authors of the function made it.
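One way to see what the list form looks like, and why it can be handy, is to split more than one string at a time. This is only a sketch with two short made-up sentences; each input string gets its own vector of pieces, and those vectors can have different lengths.

```r
library( stringr )

sentences <- c( "An EIS is a tool", "It describes effects" )

# Without simplify = TRUE the result is a list with one entry per input string.
pieces <- str_split( sentences, pattern = " " )
class( pieces )

# lengths() reports how many pieces each input string produced.
lengths( pieces )

# Double square brackets pull out the vector of words for one string.
first_words <- pieces[[1]]
first_words
```

A matrix would need every row to have the same number of columns, which is not the case here, so a list is the more general container.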

6.3.5 Substrings

There are two different things you may want to do with substrings: find them and replace them. Here are some ways to figure out where they are.

str_detect(corpus, "Environment")

[1] TRUE

str_count( corpus, "Environment")

[1] 3

str_locate_all( corpus, "Environment")

[[1]]
     start end
[1,]   125 135
[2,]   637 647
[3,]   754 764

We can also replace instances of one substring with another.

str_replace_all(corpus, "California", "Virginia")

[1] "An environmental impact statement (EIS), under United States environmental law, is a document required by the 1969 National Environmental Policy Act (NEPA) for certain actions 'significantly affecting the quality of the human environment'.[1] An EIS is a tool for decision making. It describes the positive and negative environmental effects of a proposed action, and it usually also lists one or more alternative actions that may be chosen instead of the action described in the EIS. Several U.S. state governments require that a document similar to an EIS be submitted to the state for certain actions. For example, in Virginia, an Environmental Impact Report (EIR) must be submitted to the state for certain actions, as described in the Virginia Environmental Quality Act (CEQA). One of the primary authors of the act is Lynton K. Caldwell."
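Once you know where a substring is, you can also pull it out by position. The sketch below uses str_locate() (the single-match sibling of str_locate_all()) and str_sub() on a shortened piece of text so that it stands on its own; `txt` and `loc` are names I picked for the example.

```r
library( stringr )

txt <- "the 1969 National Environmental Policy Act"

# str_locate() returns a matrix with the start and end of the first match.
loc <- str_locate( txt, "Environment" )
loc

# str_sub() extracts the characters between two positions.
found <- str_sub( txt, start = loc[1, "start"], end = loc[1, "end"] )
found
```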

There is a lot more fun stuff to do with string based data.

6.4 Logical Data

Logical data consists of two mutually exclusive states: TRUE or FALSE

dyer_has_good_jokes <- TRUE
dyer_has_good_jokes

[1] TRUE

6.4.1 Operators on Logical Types

There are 3 primary logical operators that can be used on logical types: one unary and two binary.

6.4.1.1 Unary Operator

The negation operator

!dyer_has_good_jokes

[1] FALSE

6.4.2 The Binary Operators

6.4.2.1 The OR operator

TRUE | FALSE

[1] TRUE

6.4.2.2 The AND operator

TRUE & FALSE

[1] FALSE
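These operators also work element-by-element on vectors, which is how they are most often used in practice. Here is a minimal sketch with a made-up vector of depth measurements.

```r
depth <- c( 3, 8, 12, 20 )

# Comparisons produce logical vectors.
shallow <- depth < 5
deep    <- depth > 10

# The binary operators combine them one element at a time.
extreme   <- shallow | deep
mid_range <- !shallow & !deep

extreme
mid_range

# any() and all() summarize a logical vector into a single value.
any( deep )
all( deep )
```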

6.4.3 Introspection

Logical types have an introspection operator.

is.logical( dyer_has_good_jokes )

[1] TRUE

Coercion of something else to a Logical is more case-specific.

From character data.

as.logical( "TRUE" )

[1] TRUE

as.logical( "FALSE" )

[1] FALSE

Other character types result in NA (missing data).

as.logical( "Bob" )

[1] NA

6.4.4 Coercion

From numeric data:
- Values of 0 are FALSE
- Non-zero values are TRUE

as.logical(0)

[1] FALSE

as.logical( 323 )

[1] TRUE
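The coercion also goes in the other direction: TRUE becomes 1 and FALSE becomes 0. A quick sketch (the `passed` vector is invented for the example) shows why that is useful for counting.

```r
passed <- c( TRUE, FALSE, TRUE, TRUE )

# Explicit coercion from logical to numeric.
as.numeric( passed )

# Arithmetic functions do the coercion for us, so sum() counts the TRUE
# values and mean() gives the proportion that are TRUE.
n_passed    <- sum( passed )
prop_passed <- mean( passed )

n_passed
prop_passed
```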

6.5 Dates

Time is the next dimension.

This topic covers the basics of how we put together data based upon date and time objects. For this, we will use the following data frame with a single column of data representing dates as they are written in the US.

There are several challenges associated with working with date and time objects. Those of us with a background in how US dates and times are written can easily interpret dates in Month/Day/Year format (e.g., “2/14/2018”), and this is commonly how dates show up, as strings of characters, in the kind of input data we work with in R. Dates and times are sticky things in data analysis because they do not work the way we think they should. Here are some wrinkles:

- There are many types of calendars; we use the Gregorian calendar. However, there are many other calendars in use that we may run into. Each of these calendars has a different starting year (e.g., in the Assyrian calendar it is year 6770, it is 4718 in the Chinese calendar, 2020 in the Gregorian, and 1442 in the Islamic calendar).
- The Western calendar has leap years (+1 day in February) as well as leap seconds because it is based on the Earth’s trip around the sun; others are based upon the lunar cycle and have other corrections.
- On this planet, we have 24 different time zones. Some states (looking at you Arizona) don’t feel it necessary to follow the other states around so they may be the same as PST some of the year and the same as MST the rest of the year. The province of Newfoundland decided to be half-way between time zones so they are GMT-3:30. Some states have more than one time zone even if they are not large in size (hello Indiana).
- Dates and times are made up of odd units: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, 7 days in a week, 2 weeks in a fortnight, 28, 29, 30, or 31 days in a month, 365 or 366 days in a year, 100 years in a century, etc.

Fortunately, some smart programmers have figured this out for us already. What they did was make the second the base unit of time and designate 00:00:00 UTC on 1 January 1970 as the Unix epoch. Time on most modern computers is measured from that starting point. It is much easier to measure the difference between two points in time using the seconds since the Unix epoch and then translate it into one or more of these calendars than to deal with all the different calendars each time. So under the hood, much of the date and time handling is kept in terms of epoch seconds.

unclass( Sys.time() )

[1] 1701788349
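To convince yourself that this is really just a count of seconds, here is a small sketch that goes in both directions. I fix the time zone to UTC so the numbers do not depend on where the code is run; the variable names are my own.

```r
# Zero seconds after the epoch is the epoch itself.
epoch_start <- as.POSIXct( 0, origin = "1970-01-01", tz = "UTC" )
format( epoch_start, "%Y-%m-%d %H:%M:%S" )

# Going the other way: a date and time turned into epoch seconds.
class_start_secs <- as.numeric( as.POSIXct( "2021-01-15 00:00:00", tz = "UTC" ) )
class_start_secs

# Dividing by the number of seconds in a day gives days since the epoch.
class_start_days <- class_start_secs / ( 60 * 60 * 24 )
class_start_days
```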

6.5.1 Basic Date Objects

R has some basic date functionality built into it. One of the easiest ways to get a date object created is to specify the date as a character string and then coerce it into a date object. By default, this expects the date to be written as “YEAR-MONTH-DAY”, with a padding 0 for any month or day below 10 (i.e., each should be two digits).

So for example, we can specify a date object as:

class_start <- as.Date("2021-01-15")
class_start

[1] "2021-01-15"

And it is of type:

class( class_start )

[1] "Date"

If you want to make the date from a different format, you need to specify what the elements within the string represent using format codes. These codes (and many more) can be found by looking at ?strptime.

class_end <- as.Date( "5/10/21", format = "%m/%d/%y")
class_end

[1] "2021-05-10"
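The same format codes work in the other direction too, for printing a date in whatever layout you need. Here is a minimal sketch using a Day.Month.Year string that I made up for the example.

```r
# %d = day, %m = month as a number, %Y = four-digit year.
d <- as.Date( "15.01.2021", format = "%d.%m.%Y" )
d

# format() turns a date object back into a character string.
us_style  <- format( d, "%m/%d/%Y" )
year_only <- format( d, "%Y" )

us_style
year_only
```

Note that what comes out of format() is character data again, not a date object.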

I like to use some higher-level date functions from the lubridate library. If you don’t have it installed, do so using the normal approach.

library( lubridate )


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

Date objects can be put into vectors and sequences just like other objects.

semester <- seq( class_start, class_end, by = "1 day")
semester

  [1] "2021-01-15" "2021-01-16" "2021-01-17" "2021-01-18" "2021-01-19"
  [6] "2021-01-20" "2021-01-21" "2021-01-22" "2021-01-23" "2021-01-24"
 [11] "2021-01-25" "2021-01-26" "2021-01-27" "2021-01-28" "2021-01-29"
 [16] "2021-01-30" "2021-01-31" "2021-02-01" "2021-02-02" "2021-02-03"
 [21] "2021-02-04" "2021-02-05" "2021-02-06" "2021-02-07" "2021-02-08"
 [26] "2021-02-09" "2021-02-10" "2021-02-11" "2021-02-12" "2021-02-13"
 [31] "2021-02-14" "2021-02-15" "2021-02-16" "2021-02-17" "2021-02-18"
 [36] "2021-02-19" "2021-02-20" "2021-02-21" "2021-02-22" "2021-02-23"
 [41] "2021-02-24" "2021-02-25" "2021-02-26" "2021-02-27" "2021-02-28"
 [46] "2021-03-01" "2021-03-02" "2021-03-03" "2021-03-04" "2021-03-05"
 [51] "2021-03-06" "2021-03-07" "2021-03-08" "2021-03-09" "2021-03-10"
 [56] "2021-03-11" "2021-03-12" "2021-03-13" "2021-03-14" "2021-03-15"
 [61] "2021-03-16" "2021-03-17" "2021-03-18" "2021-03-19" "2021-03-20"
 [66] "2021-03-21" "2021-03-22" "2021-03-23" "2021-03-24" "2021-03-25"
 [71] "2021-03-26" "2021-03-27" "2021-03-28" "2021-03-29" "2021-03-30"
 [76] "2021-03-31" "2021-04-01" "2021-04-02" "2021-04-03" "2021-04-04"
 [81] "2021-04-05" "2021-04-06" "2021-04-07" "2021-04-08" "2021-04-09"
 [86] "2021-04-10" "2021-04-11" "2021-04-12" "2021-04-13" "2021-04-14"
 [91] "2021-04-15" "2021-04-16" "2021-04-17" "2021-04-18" "2021-04-19"
 [96] "2021-04-20" "2021-04-21" "2021-04-22" "2021-04-23" "2021-04-24"
[101] "2021-04-25" "2021-04-26" "2021-04-27" "2021-04-28" "2021-04-29"
[106] "2021-04-30" "2021-05-01" "2021-05-02" "2021-05-03" "2021-05-04"
[111] "2021-05-05" "2021-05-06" "2021-05-07" "2021-05-08" "2021-05-09"
[116] "2021-05-10"

Some helpful functions include the Julian Ordinal Day (e.g., number of days since the start of the year).

ordinal_day <- yday( semester[102] )
ordinal_day

[1] 116

The weekday as an integer (1-7 starting on Sunday), which I use to index the named values.

days_of_week <- c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")
x <- wday( semester[32] )
days_of_week[ x ]

[1] "Monday"
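As a sketch of an alternative, wday() can also do the labeling itself, and it lets you change which day counts as the start of the week. The day names it returns depend on your computer's locale, so I am assuming an English one here.

```r
library( lubridate )

d <- as.Date( "2021-02-15" )

# Default numbering: Sunday is 1, so a Monday is 2.
day_number <- wday( d )

# With week_start = 1 the week begins on Monday instead.
day_number_monday_first <- wday( d, week_start = 1 )

# label = TRUE returns the name as an ordered factor; abbr = FALSE spells it out.
day_label <- wday( d, label = TRUE, abbr = FALSE )

day_number
day_number_monday_first
day_label
```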

Since we did not specify a time, things like hour() and minute() do not provide any usable information.

6.5.2 Dates & Times

To add time to the date objects, we need to specify both date and time specifically. Here are some example data:

df <- data.frame( Date = c("8/21/2004 7:33:51 AM",
                           "7/12/2008 9:23:08 PM",
                           "2/14/2010 8:18:30 AM",
                           "12/23/2018 11:11:45 PM",
                           "2/1/2019 4:42:00 PM",
                           "5/17/2012 1:23:23 AM",
                           "12/11/2020 9:48:02 PM") )
summary( df )

     Date          
 Length:7          
 Class :character  
 Mode  :character

Just like above, if we want to turn these into date and time objects we must be able to tell the parsing algorithm what elements are represented in each entry. There are many ways to write dates and times: 10/14 or 14 Oct or October 14 or Julian day 287, etc. These are designated by a format string where we indicate what element represents a day or month or year or hour or minute or second, etc. These are found by looking at the documentation for ?strptime.

In our case, we have:
- Month as 1 or 2 digits
- Day as 1 or 2 digits
- Year as 4 digits
- a space to separate date from time
- hour (not 24-hour though)
- minutes in 2 digits
- seconds in 2 digits
- a space to separate time from the AM/PM indicator
- an AM/PM indicator
- / separating date objects
- : separating time objects

To make the format string, we need to look up how to encode these items. The items in df for a date & time object such as 2/1/2019 4:42:00 PM have the format string:

format <- "%m/%d/%Y %I:%M:%S %p"

Now, we can convert the character string in the data frame to a date and time object.
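As a minimal sketch of what that looks like using only base R, here is a single entry converted with as.POSIXct(). The names `fmt` and `stamp` are my own, and the %p code (AM/PM) assumes an English-like locale.

```r
fmt <- "%m/%d/%Y %I:%M:%S %p"

# %I is the hour on a 12-hour clock and %p is the AM/PM indicator.
stamp <- as.POSIXct( "2/1/2019 4:42:00 PM", format = fmt, tz = "EST" )
stamp

# Printed on a 24-hour clock, 4:42 PM becomes 16:42.
format( stamp, "%H:%M" )
```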

6.5.3 Lubridate

Instead of using the built-in as.Date() functionality, I like the lubridate library¹ as it has a lot of additional functionality that we’ll play with a bit later.

df$Date <- parse_date_time( df$Date, 
                            orders=format, 
                            tz = "EST" )
summary( df )

      Date                       
 Min.   :2004-08-21 07:33:51.00  
 1st Qu.:2009-04-29 14:50:49.00  
 Median :2012-05-17 01:23:23.00  
 Mean   :2013-07-11 07:28:39.85  
 3rd Qu.:2019-01-12 19:56:52.50  
 Max.   :2020-12-11 21:48:02.00

class( df$Date )

[1] "POSIXct" "POSIXt"

Now, we can ask Date-like questions about the data such as what day of the week was the first sample taken?

weekdays( df$Date[1] )

[1] "Saturday"

What is the range of dates?

range( df$Date )

[1] "2004-08-21 07:33:51 EST" "2020-12-11 21:48:02 EST"

What is the median of the samples?

median( df$Date )

[1] "2012-05-17 01:23:23 EST"

And what Julian ordinal day (e.g., how many days since the start of the year) is the fourth record?

yday( df$Date[4] )

[1] 357

Just for fun, I’ll add a column to the data that has weekday.

df$Weekday <- weekdays( df$Date )
df

                 Date  Weekday
1 2004-08-21 07:33:51 Saturday
2 2008-07-12 21:23:08 Saturday
3 2010-02-14 08:18:30   Sunday
4 2018-12-23 23:11:45   Sunday
5 2019-02-01 16:42:00   Friday
6 2012-05-17 01:23:23 Thursday
7 2020-12-11 21:48:02   Friday

However, we should probably turn it into a factor (e.g., a data type with pre-defined levels—and for us here—an intrinsic order of the levels).

df$Weekday <- factor( df$Weekday, 
                        ordered = TRUE, 
                        levels = days_of_week
                        )
summary( df$Weekday )

   Sunday    Monday   Tuesday Wednesday  Thursday    Friday  Saturday 
        2         0         0         0         1         2         2

6.5.4 Filtering on Date Objects

We can easily filter the content within a data.frame using some helper functions such as hour(), minute(), wday(), etc. Here are some examples including pulling out the weekends.

weekends <- df[ df$Weekday %in% c("Saturday","Sunday"), ]
weekends

                 Date  Weekday
1 2004-08-21 07:33:51 Saturday
2 2008-07-12 21:23:08 Saturday
3 2010-02-14 08:18:30   Sunday
4 2018-12-23 23:11:45   Sunday

Finding items that are in the past (past being defined as the last time this document was knit).

past <- df$Date[ df$Date < Sys.time() ]
past

[1] "2004-08-21 07:33:51 EST" "2008-07-12 21:23:08 EST"
[3] "2010-02-14 08:18:30 EST" "2018-12-23 23:11:45 EST"
[5] "2019-02-01 16:42:00 EST" "2012-05-17 01:23:23 EST"
[7] "2020-12-11 21:48:02 EST"

Items that are during working hours

work <- df$Date[ hour(df$Date) >= 9 & hour(df$Date) <= 17 ]
work

[1] "2019-02-01 16:42:00 EST"

And the total range of values in days using normal arithmetic operations such as the minus operator.

max(df$Date) - min(df$Date)

Time difference of 5956.593 days
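When you subtract dates, R picks the units for you. If you want to choose them yourself, the base function difftime() takes a units argument. Here is a sketch using two dates written out directly so it runs on its own.

```r
start <- as.Date( "2021-01-15" )
end   <- as.Date( "2021-05-10" )

# Ask for the difference in specific units.
span_days  <- difftime( end, start, units = "days" )
span_weeks <- difftime( end, start, units = "weeks" )

span_days
span_weeks

# as.numeric() drops the units label if you just need the number.
as.numeric( span_days )
```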

6.6 Questions

If you have any questions for me specifically on this topic, please post as an Issue in your repository, otherwise consider posting to the discussion board on Canvas.

¹ If you get an error saying something like, “there is no package named lubridate” then use install.packages("lubridate") to install it. You only need to do this once.