9  Basic Graphics in R

Data visualization is key to providing effective communication of your data.

Jonathan Swan

9.1 The Data

The iris flower data set (also known as Fisher’s Iris data set) is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper entitled, The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.

These data are part of the base R distribution and contain sepal and pedal measurements for three species if congeneric plants, Iris setosa, I. versicolor, and I. virginica.

The three species of iris in the default data set.

Here is what the data summary looks like.

summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

9.2 Basic Plotting in R

The base R comes with several built-in plotting functions, each of which is accessed through a single function with a wide array of optional arguments that modify the overall appearance.

Histograms - The Density of A Single Data Vector

hist( iris$Sepal.Length)

You can see that the default values for the hist() function label the x-axis & title on the graph have the names of the variable passed to it, with a y-axis is set to “Frequency”.

  • xlab & ylab: The names attached to both x- and y-axes.
  • main: The title on top of the graph.
  • breaks: This controls the way in which the original data are partitioned (e.g., the width of the bars along the x-axis).
    • If you pass a single number, n to this option, the data will be partitioned into n bins.
    • If you pass a sequence of values to this, it will use this sequence as the boundaries of bins.
  • col: The color of the bar (not the border)
  • probability: A flag as either TRUE or FALSE (the default) to have the y-axis scaled by total likelihood of each bins rather than a count of the numbrer of elements in that range.

Density - Estimating the continuous density of data

d_sepal.length <- density( iris$Sepal.Length )
d_sepal.length

Call:
    density.default(x = iris$Sepal.Length)

Data: iris$Sepal.Length (150 obs.); Bandwidth 'bw' = 0.2736

       x               y            
 Min.   :3.479   Min.   :0.0001495  
 1st Qu.:4.790   1st Qu.:0.0341599  
 Median :6.100   Median :0.1534105  
 Mean   :6.100   Mean   :0.1905934  
 3rd Qu.:7.410   3rd Qu.:0.3792237  
 Max.   :8.721   Max.   :0.3968365  

The density() function estimates a continuous probability density function for the data and returns an object that has both x and y values. In fact, it is a special kind of object.

class(d_sepal.length)
[1] "density"

Because of this, the general plot() function knows how to plot these kinds of things.

plot( d_sepal.length )

Now, the general plot() function has A TON of options and is overloaded to be able to plot all kinds of data. In addition to xlab and ylab, we modify the following:

  • col: Color of the line.
  • lwd: Line width
  • bty: This covers the ‘box type’, which is the square box around the plot area. I typically use bty="n" because I hate those square boxes around my plots (compare the following 2 plots to see the differences). But you do you.
  • xlim & ylim: These dictate the range on both the x- and y-axes. It takes a pair of values such as c(min,max) and then limits (or extends) that axis to to fill that range.

Scatter Plots - Plotting two variables

plot( iris$Sepal.Length, iris$Sepal.Width  )

Here is the most general plot(). The form of the arguments to this function are x-data and then y-data. The visual representation of the data is determined by the optional values you pass (or if you do not pass any optional values, the default is the scatter plot shown above)

Parameter Description
type The kind of plot to show (’p’oint, ’l’ine, ’b’oth, or ’o’ver). A point plot is the default.
pch The character (or symbol) being used to plot. There 26 recognized general characters to use for plotting. The default is pch=1.
col The color of the symbols/lines that are plot.
cex The magnification size of the character being plot. The default is cex=1 and deviation from that will increase (\(cex > 1\)) or decrease (\(0 < cex < 1\)) the scaling of the symbols.
lwd The width of any lines in the plot.
lty The type of line to be plot (solid, dashed, etc.)

One of the relevant things you can use the parameter pch for is to differentiate between groups of observations (such as different species for example). Instead of giving it one value, pass it a vector of values whose length is equal to that for x- and y-axis data.

Here is an example where I coerce the iris$Species data vector into numeric types and use that for symbols.

symbol <- as.numeric(iris$Species)
symbol
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
[112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[149] 3 3
plot( iris$Sepal.Length, iris$Sepal.Width, pch=symbol )

We can use the same technique to use col instead of pch. Here I make a vector of color names and then use the previously defined in the variable symbol.

raw_colors <- c("red","gold","forestgreen")
colors <- raw_colors[ symbol ]
colors[1:10]
 [1] "red" "red" "red" "red" "red" "red" "red" "red" "red" "red"

In addition to the general form for the function plot(x,y) we used above, we can use an alternative designation based upon what is called the functional form. The functional form is how we designate functions in R, such as regression anlaysis. This basic syntax for this is y ~ x, that is the response variable (on the y-axis) is a function of the predictor (on the x-axis).

For simplicty, I’ll make x and y varibles pointing to the same same data as in the previous graph.

y <- iris$Sepal.Width
x <- iris$Sepal.Length

Then, the plot() function can be written as (including all the fancy additional stuff we just described):

plot( y ~ x , 
      col=colors, 
      pch=20, 
      bty="n", 
      xlab="Sepal Length", ylab="Sepal Width")

This is much easier to read (also notice how I used serveral lines to put in all the options to the plot function for legibility).

Bar Plots - Quantifying Counts

The barplot function takes a set of heights, one for each bar. Let’s quickly grab the mean length for sepals across all three species. There are many ways to do this, here are two, the first being more pedantic and the second more concise.

The iris data is in a data.frame that has a column designating the species. We can see which ones using unique().

unique( iris$Species )
[1] setosa     versicolor virginica 
Levels: setosa versicolor virginica

To estimate the mean for each species, we can take values in iris$Sepal.Length for each level of iris$Species using indices.

mu.Setosa <- mean( iris$Sepal.Length[ iris$Species == "setosa" ])
mu.Versicolor <- mean( iris$Sepal.Length[ iris$Species == "versicolor" ])
mu.Virginica <- mean( iris$Sepal.Length[ iris$Species == "virginica" ])

meanSepalLength <- c( mu.Setosa, mu.Versicolor, mu.Virginica )
meanSepalLength
[1] 5.006 5.936 6.588

When we plot these data using barplot() we pass the values and set the names of the bars us

barplot( meanSepalLength, 
         names.arg = c("setosa","versicolor","virginica"), 
         xlab="Iris Species",
         ylab="Mean Sepal Length")

The second way to do this is to use the by() function (see ?by for the complete help file). The by function takes the following objects:

  • The raw data to use as measurements. Here we will use iris$Sepal.Length as the raw data.
  • Data designating groups to partition the raw data into (we will use iris$Species).
  • The function that you want to use on each group. (here we will ask for the mean).
meanSepalLength <- by( iris$Sepal.Length, iris$Species, mean )
meanSepalLength
iris$Species: setosa
[1] 5.006
------------------------------------------------------------ 
iris$Species: versicolor
[1] 5.936
------------------------------------------------------------ 
iris$Species: virginica
[1] 6.588

The data returned from this function is both numeric and has a name set for each value.

is.numeric( meanSepalLength )
[1] TRUE
names( meanSepalLength )
[1] "setosa"     "versicolor" "virginica" 

Then when we pass that to barplot() the column labels are set automatically (e.g., no need to set names.arg as above).

barplot( meanSepalLength, 
         xlab = "Iris Species",
         ylab = "Average Sepal Length")

Boxplots - High density information

A boxplot contains a high amount of information content and is appropriate when the groupings on the x-axis are categorical. For each category, the graphical representation includes:

  • The median value for the raw data
  • A box indicating the area between the first and third quartile (e.g,. the values enclosing the 25% - 75% of the data). The top and bottoms are often referred to as the hinges of the box.
  • A notch (if requested), represents confidence around the estimate of the median.
  • Whiskers extending out to shows \(\pm 1.5 * IQR\) (the Inner Quartile Range)
  • Any points of the data that extend beyond the whiskers are plot as points.

For legibility, we can use the functional form for the plots as well as separate out the data.frame from the columns using the optional data= argument.

boxplot( Sepal.Length ~ Species, 
         data = iris, 
         notch=TRUE, 
         ylab="Sepal Length" )

9.3 Colors

Named Colors - There are 657 pre-defined, named colors built into the base R distribution. Here is a random selection of those values.

randomColors <- sample( colors(), size = nrow(iris) )
head(randomColors)
[1] "grey54"         "lavenderblush1" "gray32"         "peachpuff1"    
[5] "grey35"         "springgreen3"  

To use these colors, you can specify them by name for either all the elements

boxplot( Sepal.Length ~ Species, 
         data = iris, 
         col = randomColors[1],
         notch=TRUE, 
         ylab="Sepal Length" )

or for each element individually.

boxplot( Sepal.Length ~ Species, 
         data = iris, 
         col = randomColors[1:3],
         notch=TRUE, 
         ylab="Sepal Length" )

Hex Colors: You can also use hexadecimal representations of colors, which is most commonly used on the internet. A hex representation of colors consists of red, green, and blue values encoded as numbers in base 16 (e.g., the single digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F). There are a lot of great resources on the internet for color themes that report red, green, blue and hex values. I often use the coolors.co website to look for themes that go well together for slides or presentations.

Color Brewer Finally, there is an interesting website at colorbrewer2.org that has some interesting built-in palettes. There is an associated library that makes creating palettes for plots really easy and as you get more expreienced with R, you will find this very helpful. For quick visualizations and estimation of built-in color palettes, you can look at the website (below).

ColorBrewer Site or look at the colors in R

library(RColorBrewer)
display.brewer.all()

There are three basic kinds of palettes: divergent, qualitative, and sequential. Each of these built-in palletes has a maximum number of colors available (though as you see below we can use them to interpolate larger sets) as well as indications if the palette is safe for colorblind individuals.

brewer.pal.info
         maxcolors category colorblind
BrBG            11      div       TRUE
PiYG            11      div       TRUE
PRGn            11      div       TRUE
PuOr            11      div       TRUE
RdBu            11      div       TRUE
RdGy            11      div      FALSE
RdYlBu          11      div       TRUE
RdYlGn          11      div      FALSE
Spectral        11      div      FALSE
Accent           8     qual      FALSE
Dark2            8     qual       TRUE
Paired          12     qual       TRUE
Pastel1          9     qual      FALSE
Pastel2          8     qual      FALSE
Set1             9     qual      FALSE
Set2             8     qual       TRUE
Set3            12     qual      FALSE
Blues            9      seq       TRUE
BuGn             9      seq       TRUE
BuPu             9      seq       TRUE
GnBu             9      seq       TRUE
Greens           9      seq       TRUE
Greys            9      seq       TRUE
Oranges          9      seq       TRUE
OrRd             9      seq       TRUE
PuBu             9      seq       TRUE
PuBuGn           9      seq       TRUE
PuRd             9      seq       TRUE
Purples          9      seq       TRUE
RdPu             9      seq       TRUE
Reds             9      seq       TRUE
YlGn             9      seq       TRUE
YlGnBu           9      seq       TRUE
YlOrBr           9      seq       TRUE
YlOrRd           9      seq       TRUE

It is very helpful to look at the different kinds of data palettes available and I’ll show you how to use them below when we color in the states based upon population size at the end of this document.

9.4 Annotations

You can easily add text onto a graph using the text() function. Here is the correlation between the sepal length and width (the function cor.test() does the statistical test).

cor <- cor.test( iris$Sepal.Length, iris$Sepal.Width )
cor

    Pearson's product-moment correlation

data:  iris$Sepal.Length and iris$Sepal.Width
t = -1.4403, df = 148, p-value = 0.1519
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.27269325  0.04351158
sample estimates:
       cor 
-0.1175698 

We can put the correlation and the p-value on the plot

cor.text <- paste( "r = ", format( cor$estimate, digits=4), "; P = ", format( cor$p.value, digits=4 ), sep="" ) 
cor.text
[1] "r = -0.1176; P = 0.1519"

The we can the overlay this onto an existing plot. For the text() function, we need to give the x- and y- coordinates where you want it put onto the coordinate space of the existing graph.

plot( y ~ x , 
      col=colors, 
      pch=20, 
      bty="n", 
      xlab="Sepal Length", ylab="Sepal Width")
text( 7.4, 4.2, cor.text )