Lab 3: Basic Data Visualizations

Preamble

The purpose of this lab is to learn how to create basic data visualizations.

After completing this lab, you should be able to create a data visualization like Figure 1.

Figure 1: Flipper and bill length for various penguins.

You will know that you are done when you have created a rendered file where the results match the examples in this HTML file.

A note on Labs

Each lab has the same structure:

  • A preamble (i.e., what is above this section) that tells you the goals for the lab and how you will know when you have achieved them
  • A Why are we here? section that explains how this lab relates to the previous labs
  • The lab instructions, which include boxed examples that you need to complete and match to the solutions file
  • The submission instructions which detail what you need to do to finish this lab and what you need to do to prepare for the next lab

Why are we here?

In Lab 1, you set up the computational tools that we are using in this course and for most classes in the SDS program. You should have installed R and RStudio, and then rendered a Quarto document (a .qmd) into an .html file.

In this lab, we will learn how to use these computational tools on data to generate a few different data visualizations. Along the way, we will discuss how to set up your R environment using libraries and introduce a few basic concepts about data.

Lab instructions

Setting up

Download the Lab 3 template.

Warning

Before you begin:

  • Move the template file you just downloaded from your Downloads folder into the SDS 100 folder that you created in Lab 1.

  • Make sure that within RStudio, you’re working in the SDS 100 project that you created in Lab 1.

For more detailed instructions on either of the above, see Lab 2.

Optional reading

4.1–4.4 in Data Science: A first introduction is a good additional resource for the material covered in this lab.

Creating the lab environment

In Lab 1, you learned how to install R packages from CRAN. Once a package is installed, you can use it after loading it using the library() function. For example, library(tidyverse) will make all of the functionality of the tidyverse package (which includes both functions and data objects) available to you for the remainder of your R session.

Code
library(tidyverse)

Generally, every Quarto document that you write will contain library(tidyverse) at or near the top.

Question

Use the library() function to load the tidyverse package. Then run search(), and confirm that you see package:tidyverse in the output.

 [1] ".GlobalEnv"             "package:palmerpenguins" "package:lubridate"     
 [4] "package:forcats"        "package:stringr"        "package:dplyr"         
 [7] "package:purrr"          "package:readr"          "package:tidyr"         
[10] "package:tibble"         "package:ggplot2"        "package:tidyverse"     
[13] "package:stats"          "package:graphics"       "package:grDevices"     
[16] "package:utils"          "package:datasets"       "package:methods"       
[19] "Autoloads"              "package:base"          

It’s worth reflecting for a moment on the distinction between installing and loading a package. By way of analogy, consider the difference between buying a book and reading it.

  • Installing a package is like buying a book and putting it on your bookshelf. Once you have it, you don’t need to buy another copy. So you only have to install a package once. You may eventually need to upgrade the book if a 2nd edition comes out, but once it’s on your bookshelf, it’s always available to you if you want it. However, the book is still on your bookshelf and not in your brain. Therefore, you can’t actually use it.
  • Loading a package (via the library() function) is like taking the book off the shelf, placing it on your desk and opening it up. Now all of the information in that book is available to you. Once it’s on your desk, it’s available to you for as long as the book is on your desk. However, when you close your R session, all the books go back on the shelf! So you have to library() a package each time you open a new R session.

At the same time, you don’t want to library() more packages than you need, since then your desk will become cluttered.
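
The bookshelf analogy can be checked directly in R. The sketch below shows one way to ask whether a package is installed (on the shelf) and whether it is loaded in the current session (on the desk). The names pkg, is_installed, and is_loaded are our own, chosen for illustration.

```r
# Package to check; swap in any package name you like.
pkg <- "palmerpenguins"

# On the bookshelf? system.file() returns "" when a package is not installed.
is_installed <- nzchar(system.file(package = pkg))

# Open on the desk? search() lists every package attached in this session.
is_loaded <- paste0("package:", pkg) %in% search()

is_installed
is_loaded
```

If is_installed is TRUE but is_loaded is FALSE, the package is on your shelf and a call to library() will put it on your desk.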

Installing Additional Packages

In Lab 1, you learned how to install packages. Today, you need to install one additional package: palmerpenguins.

Palmer Penguins

In this lab we will use the penguins data from the palmerpenguins package. To learn more about what the variables in this data set mean, read more about the Palmer Penguins data set. If necessary, review the Lab 1 instructions for installing a package and then install palmerpenguins.

Question

Use the library() function to load the palmerpenguins package. Then run search(), and confirm that you see package:palmerpenguins in the output.

 [1] ".GlobalEnv"             "package:palmerpenguins" "package:lubridate"     
 [4] "package:forcats"        "package:stringr"        "package:dplyr"         
 [7] "package:purrr"          "package:readr"          "package:tidyr"         
[10] "package:tibble"         "package:ggplot2"        "package:tidyverse"     
[13] "package:stats"          "package:graphics"       "package:grDevices"     
[16] "package:utils"          "package:datasets"       "package:methods"       
[19] "Autoloads"              "package:base"          

With these packages installed and loaded, we have created our computational workspace.

Data

Data is core to our work in this class and in Statistics and Data Science more broadly. Data can take many forms and shapes. In this class, we will work with rectangular data: data organized so that each observation is a row and each piece of information collected about the observations is a column. Data of this form are called tidy (Wickham 2014).

Below is a brief review of data and the structure of data. For more on this topic, please read Section 1.2 in Introduction to Modern Statistics.

Observations

When we collect data, we think in terms of observational units. These are the objects of study. For example, observational units could be Smithies, individual tomato plants, or listings on a digital shopping site.

Variables

For each of our observations, we collect various pieces of information. These pieces of information are called variables. For example, for each Smithie in our dataset, we could collect: the average number of hours they sleep per night, their house, and their birth month. In a data table, our variables are the columns, with the specific values for each observation placed into the appropriate row.
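
To see this row-and-column layout in real data, the sketch below inspects the penguins data frame. It assumes that the palmerpenguins package is already installed, as described earlier in this lab.

```r
library(palmerpenguins)

# Each row is one penguin (an observation); each column is a variable.
nrow(penguins)   # number of observations
ncol(penguins)   # number of variables
names(penguins)  # the variable names

# Peek at the first few rows of the table.
head(penguins)
```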

There are several types of variables.

Numerical

Numerical variables are ones where the information collected are numbers. In our example from above, the average number of hours slept per night is a numerical variable.

Categorical

Categorical variables are those where the information collected are groups or categories. In our example from above, both the house and the birth month variables are categorical variables.

Ordinal

Ordinal variables are special kinds of categorical variables where there is an agreed-upon (sometimes called a natural) ordering of the categories. For example, the birth month variable in our above example would be an ordinal variable, since we all agree that January comes before February, which comes before March, etc. But the house variable is not an ordinal variable because there is no agreed-upon ordering for Smith houses.
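
As a rough illustration of how these three types can be represented in R, the sketch below builds a tiny data frame for the Smithie example. All values are invented, and the object name smithies and its column names are hypothetical. It uses only base R: a plain numeric vector, a factor for an unordered categorical variable, and an ordered factor for an ordinal one.

```r
# Invented data for three hypothetical Smithies.
smithies <- data.frame(
  # Numerical: average hours of sleep per night.
  sleep_hours = c(7.5, 6.0, 8.25),
  # Categorical: house, with no agreed-upon ordering.
  house = factor(c("Chapin", "Tyler", "Chapin")),
  # Ordinal: birth month, ordered January through December
  # using R's built-in month.name vector.
  birth_month = factor(
    c("March", "January", "October"),
    levels = month.name,
    ordered = TRUE
  )
)

# Show the type of each column.
str(smithies)

# Ordered factors support comparisons; unordered factors do not.
smithies$birth_month[1] > smithies$birth_month[2]  # TRUE: March comes after January
```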

Data Visualization

Visualizing your data is a great way to get to know your data. Prof. Jeff Huang (Brown University, CS) notes that he makes visualizations about his data until it no longer surprises him. That is, he makes tons and tons of images until he is no longer seeing things that he hadn’t seen before.

When starting to work with any dataset, we recommend that you get to know your data quite well. Visualizations are a great way to do this. They can help you answer questions like:

  • What is the shape of your data?
  • How are the variables related?
  • What patterns are present?

Basic data graphics with ggplot2

We will create data visualizations using the ggplot2 package. These visualizations are commonly called ggplots.

ggplot2 is one of the most downloaded R packages. The gg stands for the “Grammar of Graphics,” which is a systematic framework for composing data graphics from various simple elements (Wilkinson 2012). The ggplot2 package implements the grammar of graphics in R, and is one of the core packages in the tidyverse.

While we don’t have time in SDS 100 to cover the grammar of graphics, we will develop the basic components of a ggplot2 graphic in this lab, and follow up with more bells and whistles in Lab 6.

Components of a data graphic

Every data graphic that you make with ggplot2 will contain at least the following elements:

  • a call to the ggplot() function that contains:
    • a data argument that specifies the name of the object containing the data to be plotted
    • a mapping argument that specifies one (or more) connections between variables in the data frame and elements on the plot
  • a + operator that adds a new layer to a plot
  • a call to a geom_*() function that specifies what you actually want to draw.

Below, we develop these elements in more detail.

Univariate summaries

Do the lengths of flippers vary across different species of penguins? This question might be interesting to marine biologists trying to understand penguin evolution.

Let’s first try to understand the distribution of penguin flipper length. This kind of univariate (one variable) analysis is fundamental to statistical inquiry and exploratory data analysis.

The following simple example creates a histogram of penguin flipper length:

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm)
) + 
  geom_histogram()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

The basic structure of the code is one call to the ggplot() function, a +, and then one call to the geom_histogram() function. You might think about this as ggplot() + geom_histogram(). In a sense, you are telling R: “make a ggplot and then add to it a layer with a histogram on it.” Everything else that you see in the code are arguments to those functions.
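
One way to see the "make a ggplot and then add a layer" idea is to build the plot in two steps. The sketch below does this for a different variable (body_mass_g). The object names base_plot and hist_plot are our own.

```r
library(ggplot2)
library(palmerpenguins)

# Step 1: ggplot() alone sets up the data and axes but draws no marks.
base_plot <- ggplot(
  data = penguins,
  mapping = aes(x = body_mass_g)
)

# Step 2: + adds a histogram layer on top of the blank canvas.
hist_plot <- base_plot + geom_histogram()

base_plot  # an empty plot with a labeled x-axis
hist_plot  # the same axes, now with bars
```

Storing a plot in an object is optional. It is done here only to make the two steps visible.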

Let’s now unpack those arguments. The data argument to the ggplot() function tells R where to find the data that you want to plot. In this case, the penguins data frame¹ contains the data about flipper length that we want to plot. It is very unusual to omit the data argument, since otherwise there won’t be anything to plot.

The second argument to ggplot() is called mapping. It takes a call to the aes() function, which specifies aesthetic mappings from variables (in the data frame specified by the data argument) to attributes of the data graphic. This is at the heart of the grammar of graphics. In this case, the x coordinate (aesthetic) is mapped to the flipper_length_mm variable. This means that when R draws the histogram, it will use the values in the flipper_length_mm variable from within the penguins data frame to determine where on the x-axis to place the data points. Since a histogram is univariate, only the x aesthetic is required.
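
Because data and mapping are the first two arguments of ggplot(), you will often see them written without their names in other people's code. The sketch below, which uses body_mass_g and the hypothetical object names p_named and p_short, shows the two spellings side by side. Both should produce the same histogram.

```r
library(ggplot2)
library(palmerpenguins)

# Fully named arguments, in the style used throughout this lab.
p_named <- ggplot(data = penguins, mapping = aes(x = body_mass_g)) +
  geom_histogram()

# The same plot, with data and mapping matched by position.
p_short <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram()

p_named
p_short
```

Naming the arguments takes more typing but makes the role of each piece explicit, which is why the examples in this lab spell them out.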

Question

Modify the code snippet above to create a histogram for the bill_length_mm variable.

Alternative graphics can be drawn by changing the geometries added to the plot. To draw a boxplot instead of a histogram, simply change the call to geom_histogram() to geom_boxplot(). Note that the call to ggplot() does not need to change, because we are still mapping the flipper_length_mm variable to the x coordinate aesthetic.

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm)
) + 
  geom_boxplot()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
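
The warning printed under both plots is not an error. A likely explanation, which the sketch below checks, is that a few penguins have no recorded flipper length (stored as NA in R), and ggplot2 drops those rows before drawing. The name n_missing is our own.

```r
library(palmerpenguins)

# Count the penguins whose flipper length is missing (NA).
n_missing <- sum(is.na(penguins$flipper_length_mm))
n_missing
```

If the count matches the number of rows reported in the warning, the missing values account for the removed rows.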

Question

Use the geom_density() function to modify the code snippet above to create a density plot for the bill_length_mm variable.

Tip

If you see the prompt > at the bottom of your console with nothing next to it, R is waiting for you to start a new command and will carry out whatever computations you tell it to. If you do not see your prompt > in the last line of the console, R is waiting for you to complete a command, and you will not be able to do further computations until you see that prompt again.

Tip

Hitting the ESC key will cancel the current command and bring you back to the prompt >.

Bivariate Summaries

Things get more interesting as we incorporate more than one variable. Bivariate data graphics allow us to visualize the relationship between two variables.

If the second variable is categorical

Our original question was about how flipper length varies by species among penguins. So far, we’ve only considered the distribution of flipper length overall. This helps us understand the distribution of flipper length, but we need to incorporate the species variable into our plot if we want to understand how flipper length varies by species.

The simplest way to do this is to map the species variable to the color or fill aesthetic. Let’s continue the example using boxplots. In the code below, we add fill = species to the call to aes().

Code
ggplot(
  data = penguins,
  mapping = aes(
    x = flipper_length_mm,
    fill = species
  )
) + 
  geom_boxplot()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

ggplot2 now draws three boxplots: one for each distinct value of the species variable.

Note that if we use the color aesthetic instead of the fill aesthetic, the border of the boxplots gets the color, but not the interior. In general, different geom_*()’s use different aesthetics. Use the built-in documentation (e.g., help(geom_boxplot)) to learn more about which aesthetics each geom_*() function understands.

Code
ggplot(
  data = penguins,
  mapping = aes(
    x = flipper_length_mm,
    color = species
  )
) + 
  geom_boxplot()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

The boxplot above makes clear that Gentoo penguins have the longest median flipper length of the three species.
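
This visual comparison can be double-checked numerically. The center line of a boxplot marks the median, so the sketch below uses base R's tapply() to compute the median flipper length within each species. This is only one of several ways to do so, and the name median_flipper is our own.

```r
library(palmerpenguins)

# Median flipper length (mm) within each species.
# na.rm = TRUE skips penguins with a missing measurement.
median_flipper <- tapply(
  penguins$flipper_length_mm,
  penguins$species,
  median,
  na.rm = TRUE
)

median_flipper

# Name of the species with the largest median.
names(which.max(median_flipper))
```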

Question

Use the geom_density() function to create a density plot for the bill_length_mm variable, and use the fill aesthetic to separate the distributions by island.

An alternative way to incorporate a second categorical variable is with a facet. Facets are small multiples of data graphics in which the value of one categorical variable varies. Conceptually, facets are often used for the same purpose as colors or fills: to separate the plots by a categorical variable. However, there is no facet aesthetic in ggplot2. Instead, we use the facet_wrap() function, which is added to a plot in much the same way as an additional geom_*().

The facet_wrap() function takes a variable name as its first argument, but for technical reasons, that name needs to be wrapped in the vars() function. In Lab 6, you’ll get more practice with facets.

In the code below, we set the ncol argument to facet_wrap() to 1, so that the plots are stacked on top of each other vertically. This allows us to make easy visual comparisons across the x-axis. The figure again confirms that Gentoo penguins typically have longer flippers than the other two species.

Code
ggplot(
  data = penguins,
  mapping = aes(
    x = flipper_length_mm
  )
) + 
  geom_histogram() +
  facet_wrap(vars(species), ncol = 1)
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

Question

Use the facet_wrap() function to create a density plot for the bill_length_mm variable, separating the facets by island.

If the second variable is numerical

Now consider the relationship between flipper length and bill length. You might hypothesize that some penguins are just larger, so that penguins with longer flippers are also more likely to have longer bills. Or perhaps the opposite is true: perhaps penguins with longer flippers have shorter bills. Or perhaps there is no relationship between the two variables at all—any variation we see is just random.

The most useful graphical summary of two numeric variables is nearly always a scatterplot. We draw scatterplots with the geom_point() function, and map the x and y aesthetics to the two numeric variables.

Code
ggplot(
  data = penguins,
  mapping = aes(
    x = flipper_length_mm,
    y = bill_length_mm
  )
) + 
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

This scatterplot shows a clear positive trend: penguins with longer flippers indeed tend to have longer bills.
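
The direction of the trend can also be summarized with a single number, the correlation coefficient. This lab does not otherwise cover correlation, so treat the sketch below as an optional aside. It uses base R's cor() function, with use = "complete.obs" so that penguins with missing measurements are skipped. The name r is our own.

```r
library(palmerpenguins)

# Correlation between flipper length and bill length,
# using only penguins with both measurements recorded.
r <- cor(
  penguins$flipper_length_mm,
  penguins$bill_length_mm,
  use = "complete.obs"
)

r
```

A value above zero is consistent with the upward trend in the scatterplot.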

Multivariate analysis

The ggplot2 system allows you to create complex data graphics through two main mechanisms:

  • layering additional marks on the plot by adding more geom_*()’s (and facets, as described above)
  • incorporating additional variables by adding more aesthetic mappings

In the figure below, we add complexity to our scatterplot by coloring the points based on the island on which each penguin lives. Note that this data graphic is trivariate, since three different variables are mapped to three different aesthetics.

Code
ggplot(
  data = penguins,
  mapping = aes(
    x = flipper_length_mm,
    y = bill_length_mm,
    color = island
  )
) + 
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Now that you’ve learned the basics, use your creativity to explore the penguins data a bit more. Here are some tips that you might find helpful through the next exercises:

  • It doesn’t matter what order you put the mappings inside the aes() function.
  • It does matter in which order you add the geom_*()s, since the plot is built sequentially with each + operator. However, depending on what you are doing, you may or may not notice a difference in the actual appearance of the plot.
  • Don’t forget to wrap your mappings in the aes() function when you supply the mapping argument.
  • Don’t forget the vars() function inside facet_wrap(). [You might also see the ~, which is an older style that has the same effect.]
  • If you want to set an aesthetic to a single value (as opposed to mapping it to a variable), set the aesthetic outside of the call to aes().
  • You can also set or map aesthetics inside calls to geom_*() functions, instead of or in addition to the usual placement within ggplot(). That’s an advanced practice that generally won’t be necessary in this class.
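
The difference between mapping an aesthetic to a variable and setting it to a single value, mentioned in the tips above, can be sketched as follows. The variables used here (body_mass_g, bill_depth_mm, sex), the color "steelblue", and the object names mapped_plot and fixed_plot are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Mapping: color depends on a variable, so it goes inside aes().
mapped_plot <- ggplot(
  data = penguins,
  mapping = aes(x = body_mass_g, y = bill_depth_mm, color = sex)
) +
  geom_point()

# Setting: every point gets the same fixed color, so the value goes
# outside aes(), as an ordinary argument to geom_point().
fixed_plot <- ggplot(
  data = penguins,
  mapping = aes(x = body_mass_g, y = bill_depth_mm)
) +
  geom_point(color = "steelblue")

mapped_plot  # points colored by sex, with a legend
fixed_plot   # all points the same color, no legend
```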

Question

You can add more than one geom_*() to a plot. There are lots of different geom_*() functions. Experiment with a couple of other options.

Question

You can map additional aesthetics. geom_point() is perhaps the most versatile mark. In addition to x, y, color, and fill, it also understands alpha, shape, size, stroke, and group. Experiment by mapping a couple of these aesthetics to variables.

Question

You can map the same variable to multiple aesthetics. For example, you can map species to both the color and fill aesthetic. Try this or another double mapping.

Finally, let’s bring the different things we’ve explored together into one plot:

Question

Use ggplot2 to create a plot that looks like Figure 1.

What’s next?

In this lab, we learned how to use ggplot2 to build data visualizations. But we still have much to learn. So far, our plots do not have titles, appropriate scales, or readable axis labels. In Lab 5, we will explore how to add these critical pieces of information.

Other tools can help facilitate rapid prototyping of ggplot2 data graphics using graphical user interfaces:

  • esquisse: creates an RStudio add-in that provides a graphical user interface for creating ggplot2 graphics, and exports valid ggplot2 code.
  • mosaic: provides a similar, but simpler graphical user interface for multiple graphics systems.

You are welcome to use these tools to accelerate your learning, but ultimately, we expect you to be able to read and write ggplot2 code without their assistance.

Finishing this lab

Instructions

Step 1: Create twin document

Download the Lab 3 template (if you haven’t done so already). Complete it such that the results in the generated output match the examples in the Lab 3 solutions file.

Step 3: Complete the Moodle Quiz

Complete the Moodle quiz for this lab.

Optional reading

To prepare for next week’s lab on data wrangling, we encourage reading sections 4.1–4.3 in R for Data Science.

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.
Wilkinson, Leland. 2012. “The Grammar of Graphics.” In Handbook of Computational Statistics, 375–414. Springer. https://link.springer.com/chapter/10.1007/978-3-642-21551-3_13.

Footnotes

  1. Be aware that another rectangular data structure in R that is used extensively in the tidyverse is called a tibble. All tibbles are data.frames, but not all data.frames are tibbles. For the most part, the distinction is subtle and not important for the work you do in this class.