Lab 6: Polishing figures

Preamble

The purpose of this lab is to learn how to polish data graphics to maximize their readability and impact.

After completing this lab, you should be able to add and modify titles, subtitles, captions, legends, scales, axis labels, and tick labels for ggplot2 data graphics.

You will know that you are done when you can create this image:

Figure 1: Your goal is to create this data graphic. Note that while “Statistics and Decision Science” is not as popular as Computer Science or Mathematics, the field has the best gender balance of those depicted.

Why are we here?

In Lab 3, you learned how to create basic data graphics using ggplot2.

In this lab, we will extend your understanding of ggplot2 to include many customizations and modifications that will allow you to create production-quality data graphics.

Optional reading

R for Data Science has several chapters about data visualization. Chapter 2 covers a lot of the basics, and Chapter 10: Layers, Chapter 11: Exploratory data analysis, and Chapter 12: Communication provide much more depth in both the how and why of data visualization.

The material in this lab is covered in Chapter 2: Data visualisation of R for Data Science.

Lab instructions

Setting up

Download the Lab 6 template.

Warning

Before you begin:

  • Move the template file you just downloaded from your Downloads folder into the SDS 100 folder that you created in Lab 1.

  • Make sure that within RStudio, you’re working in the SDS 100 project that you created in Lab 1.

For more detailed instructions on either of the above, see Lab 2.

In this lab, we’ll be working with a data set that is near and dear to your heart: college majors. These data come from the American Community Survey, which is administered by the US Census Bureau. The data are included in the fivethirtyeight R package. You can read the specifics of which data were included by reading the README.md Markdown file posted on GitHub by FiveThirtyEight, a prominent online data journalism site.

This data set is also the basis for the article “The Economic Guide To Picking A College Major”, which was published on FiveThirtyEight on September 12, 2014. You may find it helpful to read the article to get a sense of what you can do with these data.

First, load the packages you will need:

library(tidyverse)
library(fivethirtyeight)

Question

Next, spend two whole minutes familiarizing yourself with the college_recent_grads data frame. You may want to do some combination of the following:

  • Read the help file: help(college_recent_grads)
  • Inspect the data visually using the RStudio data viewer: View(college_recent_grads) or click on college_recent_grads in the Environment pane
  • Inspect the structure of the data frame in the console: glimpse(college_recent_grads)

What kind of thing does each row in this data set represent?

Finally, to create the plot shown in Figure 1, you will need to restrict the data set to only those majors that belong to the major_category values of Computers & Mathematics and Engineering. We also want to show only those majors with at least 5,000 total graduates. (Note that the variable total stores the number of recent graduates in each major.)
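As a rough sketch of how two conditions can be combined in filter(), here is an example on a small made-up data frame. The toy_majors object, its column names, and its values are hypothetical and are not taken from college_recent_grads. The %in% operator keeps rows whose value matches any entry in a vector.

```r
library(dplyr) # part of the tidyverse

# Hypothetical toy data, for illustration only
toy_majors <- tibble(
  major = c("A", "B", "C", "D"),
  category = c("Arts", "Engineering", "Engineering", "Health"),
  total = c(12000, 3000, 8000, 20000)
)

# Conditions separated by a comma must both be TRUE for a row to be kept
toy_subset <- toy_majors |>
  filter(
    category %in% c("Arts", "Engineering"),
    total >= 5000
  )

# Number of rows that survived the filter
nrow(toy_subset)
```

Assigning the result to a new name (here toy_subset) leaves the original data frame untouched.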

Question

Use the data wrangling skills you learned in Lab 4 (Hint: filter()) to restrict the data set to 1) only those majors that belong to the major_category values of Computers & Mathematics and Engineering, and 2) only those majors with at least 5,000 total graduates.

To avoid overwriting college_recent_grads, assign the resulting data frame to a new object called college_majors.

Check: When you’re done, your new data frame should have 22 rows.

Make the initial plot

At its most basic level, Figure 1 is a bar graph. Typically, bar graphs use the vertical axis for a numerical quantity, and the horizontal axis for a categorical quantity. (Although this is not what is shown in Figure 1!) Let’s start with that.
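The conventional layout described above can be sketched on a small made-up data frame. The toy object and its columns are hypothetical; the point is only the shape of the code: a categorical variable on x, a numerical variable on y, and both mappings placed in ggplot().

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  fruit = c("apple", "banana", "cherry"),
  n = c(5, 2, 8)
)

# x and y are mapped in ggplot(), so any geom added later can reuse them
p <- ggplot(toy, aes(x = fruit, y = n)) +
  geom_col()

p
```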

Question

Use the geom_col() function to create a ggplot object called g that is a bar graph showing the total number of recent graduates in each major. Be sure to map major and total to the x and y aesthetics, respectively, in the call to ggplot() (i.e., not in the call to geom_col()). Don’t worry about colors, labels, or anything else just yet.

If you’re having trouble with creating an object, see R Objects in Lab 2.

Your initial plot should look something like this:

Code
g

Setting aesthetics

It was important to map the x and y aesthetics to major and total in the call to ggplot(), because we want those mappings to propagate over any geoms that we might add later.

Next, we’re going to map the fill aesthetic to the sharewomen variable. However, unlike the mappings for x and y, we only want the fill mapping to apply to geom_col().
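To illustrate the difference between the two kinds of mapping, here is a minimal sketch on made-up data (the toy data frame and the share_red column are hypothetical). A second geom, geom_text(), is included only to show that it picks up x and y from ggplot() but not the fill that was set inside geom_col().

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  fruit = c("apple", "banana", "cherry"),
  n = c(5, 2, 8),
  share_red = c(0.6, 0.0, 1.0)
)

p <- ggplot(toy, aes(x = fruit, y = n)) +
  # fill is mapped inside geom_col(), so it applies to the bars only
  geom_col(aes(fill = share_red)) +
  # geom_text() inherits x and y from ggplot(), but knows nothing about fill
  geom_text(aes(label = n), vjust = -0.5)

p
```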

Question

Add a call to geom_col() to g again. This time, map the fill aesthetic to the sharewomen variable within geom_col().

Save the resulting plot as g (so you are overwriting the previous g).

Hint: Make sure you use aes() inside your call to geom_col()!

Your plot should now look like this. Note that the legend for fill appears automatically.

Code
g

Ordering factors

The ordering of the bars in a bar graph can play a role in helping the reader find what they are looking for quickly. Generally, you probably want the bars ordered either alphabetically (the default, and the easiest to find what you are looking for), or in rank order of bar height (allowing for easy comparisons).

The fct_reorder() function from the forcats package (part of the tidyverse) allows you to put one factor in order based on the value of another variable. To do this, it needs two arguments: first, the variable that we want to reorder, and second, the variable that we want to reorder it by.
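A small sketch of what fct_reorder() does to the levels of a factor, using made-up values (fruit and n are hypothetical). By default the levels end up in ascending order of the second variable.

```r
library(forcats) # part of the tidyverse

# Hypothetical toy values, for illustration only
fruit <- factor(c("apple", "banana", "cherry"))
n <- c(5, 2, 8)

# Default levels are alphabetical
levels(fruit)

# Levels reordered by the values of n, smallest first
reordered <- fct_reorder(fruit, n)
levels(reordered)
```

Because ggplot2 draws the categories of a factor in level order, reordering the levels is what changes the order of the bars.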

Question

Using the instructions above, add aes() to g and use the fct_reorder() function to reset the x aesthetic so that the bars for major appear in order according to the value of total.

Your plot should now look like this:

Code
g

Adding a title, subtitle, and caption

A data graphic needs context, and ideally that context appears directly on the graphic. This can be added with the labs() function.

To use this function, we need to add to the graph we made. When making plots with ggplot2, this is usually done by adding new functions at the end of your code rather than modifying code within the ggplot() function. You can add functions by using the + symbol like this:

g <- g +
  new_function() # replace this with the function you're adding

Question

Use the labs() function to add a title, subtitle, and caption to g. Remember to assign the output back to g (i.e., overwrite/update g).
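For reference, the general shape of a labs() call is sketched here on a toy plot rather than on g. The data and all of the label text are made up.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  fruit = c("apple", "banana", "cherry"),
  n = c(5, 2, 8)
)

p <- ggplot(toy, aes(x = fruit, y = n)) +
  geom_col()

# Overwrite p with a version that carries a title, subtitle, and caption
p <- p +
  labs(
    title = "Fruit on hand",
    subtitle = "Counts from a made-up inventory",
    caption = "Source: hypothetical data"
  )

p
```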

Your plot should now look like this:

Code
g

Modifying scales, legends, and axes

In ggplot2, most scales, legends, and axes have default behaviors that are informed by best practices in data visualization. However, a good data graphic is often customized. In this section, we learn how to customize scales, legends, and axes.

Choosing a color palette

The default color palette in ggplot2 is actually quite poor. The viridis palette is an improvement (read this to learn why).
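As a sketch of how a continuous viridis fill scale can be attached to a plot, here is an example on made-up data (toy and share_red are hypothetical). The name argument sets the legend title.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  fruit = c("apple", "banana", "cherry"),
  n = c(5, 2, 8),
  share_red = c(0.6, 0.0, 1.0)
)

p <- ggplot(toy, aes(x = fruit, y = n)) +
  geom_col(aes(fill = share_red)) +
  # the _c suffix means the scale is for a continuous variable
  scale_fill_viridis_c(name = "Share red")

p
```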

Question

Use scale_fill_viridis_c() to specify the viridis color palette for the fill color.

Also set the name argument to something human-readable (e.g., % Women).

Code
g

Setting axis labels

Leaving bare variable names as axis labels is bad practice. Your readership does not know or care about your variable names, and they shouldn’t have to interpret your cryptic choices. Give each axis a clear, human-readable label, with explicit units.

In this case, the horizontal axis label is not helpful, since the categories are implied by the title. We may be better off omitting the label altogether.

Question

Use scale_x_discrete() to suppress the horizontal axis label by setting the name argument to NULL.

Code
g

Formatting tick labels

The number of recent graduates is given in scientific notation. This is unnecessarily complicated and difficult to read. The little marks on the axes are called ticks. We can change the format of the tick labels by specifying a function that will be applied to the default axis labels. The scales package, which is a dependency of ggplot2, provides several useful functions (all named label_*()) for formatting axis labels.

The one we want is called label_comma(). Most of the time when we use a function, R will know which package it’s from. Sometimes, though, multiple packages may have different functions with the same name. In order to specify which version of the function to use, we can use the notation package::function(); in this case, scales::label_comma().

Note that scales::label_comma() returns a function, so it works like this:

x <- 10^(0:7)
x
[1] 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1e+06 1e+07
scales::label_comma()(x)
[1] "1"          "10"         "100"        "1,000"      "10,000"    
[6] "100,000"    "1,000,000"  "10,000,000"

This may seem a bit weird, but the main takeaway for us is that the function returned by scales::label_comma() can be passed to the labels argument in any of the scale_*() functions in ggplot2.
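Here is a minimal sketch of passing that returned function to a scale, using a toy plot (the toy data frame, its columns, and the axis name are made up).

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  city = c("A", "B", "C"),
  population = c(1500000, 250000, 4200000)
)

p <- ggplot(toy, aes(x = city, y = population)) +
  geom_col() +
  scale_y_continuous(
    name = "Population (people)",
    # label_comma() is called here; the function it returns formats the ticks
    labels = scales::label_comma()
  )

p
```

Note the parentheses after label_comma: the labels argument receives the function that label_comma() returns, and ggplot2 applies it to the tick values when the plot is drawn.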

Question

Use the labels argument within the scale_y_continuous() function to force the numbers to show in the format provided by scales::label_comma() (the label_comma() function from the scales package).

Also improve the axis label by setting the name argument of scale_y_continuous().

Code
g

We have fixed the vertical axis, but our plot still isn’t legible, because the labels on the horizontal axis are running into each other.

One option for getting around that is to print the labels at a 45-degree angle. We can achieve this by overriding the default angle property of the axis.text.x argument to the theme() function.

Code
g + 
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

However, in this case, many of the axis labels are really long, so this doesn’t look good. It’s also harder to read text that is at an angle.

Another option is to wrap the text in the labels. The simplest way to do this is to use another function from scales: label_wrap(). This function takes an argument that specifies the number of characters after which to wrap the text.
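Like label_comma(), label_wrap() returns a function. A quick sketch of what that function does to a string, using a made-up label and a deliberately small width of 10 characters:

```r
# Build a labelling function that wraps text at roughly 10 characters
wrap_10 <- scales::label_wrap(10)

# Hypothetical label text, for illustration only
wrapped <- wrap_10("a fairly long axis label")

# cat() shows the line breaks that were inserted
cat(wrapped)
```

The wrapped string contains newline characters, which ggplot2 draws as line breaks in the tick labels.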

Question

Pass the scales::label_wrap() function to the labels argument in scale_x_discrete() to wrap the major field names to 25 characters.

Code
g

Flipping the orientation

This helps, but the labels still aren’t readable.

A better effect is achieved by flipping the coordinate axes. We could go back and change the way that we assigned the x and y aesthetics, but then we’d have to re-run all of our code. Instead, we will use the coord_flip() function to re-orient our existing plot.

Question

Use the coord_flip() function to flip the coordinate axes.

Code
g

One last touch to improve the tick labels is to make the font smaller. You can do this by modifying the theme(). Note that since we already flipped the axes, we are now setting the axis.text.y property. Here we set the font size to 70% of its default size.

g <- g + 
  theme(
    axis.text.y = element_text(size = rel(0.7))
  )
g

Faceting

Finally, we want to separate the “Computers & Mathematics” majors from the “Engineering” majors. We can do this by faceting: creating a set of similar subplots, one for each value of a categorical variable. In this case, we facet by the major_category variable.

The facet_wrap() function takes a variable name as its first argument, but for technical reasons, that name needs to be wrapped in the vars() function.

For this plot, we don’t need all majors to be present on the vertical axis of each facet, so we’ll set the scales argument (different from the scales package we used above!) to "free_y" (so that the y scales can be free!). We also want to make vertical comparisons across bars, so we’ll force the facets to occupy only one column by setting the ncol argument to 1.
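The pieces described above can be sketched together on a toy plot. The data frame and its columns (item, kind, n) are made up, and the plot is flipped with coord_flip() to mirror the setup in this lab.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  item = c("apple", "banana", "carrot", "leek"),
  kind = c("fruit", "fruit", "vegetable", "vegetable"),
  n = c(5, 2, 7, 3)
)

p <- ggplot(toy, aes(x = item, y = n)) +
  geom_col() +
  coord_flip() +
  facet_wrap(
    vars(kind),        # the faceting variable, wrapped in vars()
    scales = "free_y", # let the vertical axis differ between panels
    ncol = 1           # stack the panels in a single column
  )

p
```

With these settings, each panel should list only the items that belong to its own kind, while the panels still share the same numerical axis.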

Question

Use the facet_wrap() function to create facets for each major_category. Be sure to set the scales and ncol arguments as specified above.

Your final plot should look like this. Note that here we’ve set the fig-height chunk option to 8 to create a little more vertical space by putting #| fig-height: 8 at the top of the code chunk.

Code
g

Submitting this lab

Step 1: Compare solutions

Download the Lab 6 template (if you haven’t done so already). Do you end up with the same graph as at the top of the lab?

Step 2: Final Project Data Check 3

As part of your preparation for the final project, you need to identify data sources. In labs four through six, you will investigate a new data repository. Each week, you will identify one dataset that you are interested in. For each dataset, write one sentence explaining why you think this dataset could be a good fit for the project (save this for later). The data need to be downloadable (i.e., not from a package such as openintro or moderndive), and cannot be any data we have used in class. The data must also fit the following characteristics:

  • Have at least 200 observations, and no more than 10,000
  • Have at least 4 variables total (not including ID variable)
    • At least 2 categorical variables
    • At least 2 numeric variables

Submit the name and URL for the data as part of each week’s Moodle quiz.

You are welcome to use data from any source that meets the above requirements. Here are a few resources to get you started:

  • FiveThirtyEight makes the data they use in their articles publicly available.
  • The UCI Machine Learning Repository has a collection of datasets used in machine learning.
  • Kaggle has a large collection of user-uploaded datasets, often used in competitions hosted on the website.

Step 3: Complete the Moodle Quiz

After completing the data check, complete the Moodle quiz for this lab.

Optional reading

To prepare for next week’s lab on importing data, we encourage reading the following: