Lab 6: Polishing figures
Preamble
The purpose of this lab is to learn how to polish data graphics to maximize their readability and impact.
After completing this lab, you should be able to add and modify titles, subtitles, captions, legends, scales, axis labels, and tick labels for ggplot2 data graphics.
You will know that you are done when you can create this image:
Why are we here?
In Lab 3, you learned how to create basic data graphics using ggplot2.
In this lab, we will extend your understanding of ggplot2 to include many customizations and modifications that will allow you to create production-quality data graphics.
Optional reading
R for Data Science has several chapters about data visualization. Chapter 2 covers a lot of the basics, and Chapter 10: Layers, Chapter 11: Exploratory data analysis, and Chapter 12: Communication provide much more depth in both the how and why of data visualization.
The material in this lab is covered in Chapter 2: Data visualisation of R for Data Science.
Lab instructions
Setting up
Download the Lab 6 template.
Before you begin:
Move the template file you just downloaded from your Downloads folder into the SDS 100 folder that you created in Lab 1.
Make sure that within RStudio, you’re working in the SDS 100 project that you created in Lab 1.
For more detailed instructions on either of the above, see Lab 2.
In this lab, we’ll be working with a data set that is near and dear to your heart: college majors. These data come from the American Community Survey, which is administered by the US Census Bureau. The data are included in the fivethirtyeight R package. You can read the specifics of which data were included by reading the README.md Markdown file posted on GitHub by FiveThirtyEight, a prominent online data journalism site.
This data set is also the basis for the article “The Economic Guide To Picking A College Major”, which was published on FiveThirtyEight on September 12, 2014. You may find it helpful to read the article to get a sense of what you can do with these data.
First, load the packages you will need:
Question
library(tidyverse)
library(fivethirtyeight)
Next, spend two whole minutes familiarizing yourself with the college_recent_grads data frame. You may want to do some combination of the following:
- Read the help file:
help(college_recent_grads)
- Inspect the data visually using the RStudio data viewer:
View(college_recent_grads)
or click on college_recent_grads in the Environment pane
- Inspect the structure of the data frame in the console:
glimpse(college_recent_grads)
What kind of thing does each row in this data set represent?
Finally, to create the plot shown in Figure 1, you will need to restrict the data set to only those majors that belong to the major_category values of Computers & Mathematics and Engineering. We also want to show only those majors with at least 5,000 total graduates. (Note that the variable total stores the number of recent graduates in each major.)
Question
Use the data wrangling skills you learned in Lab 4 (Hint: filter()) to restrict the data set to 1) only those majors that belong to the major_category values of Computers & Mathematics and Engineering, and 2) only those majors with at least 5,000 total graduates.
To avoid overwriting college_recent_grads, assign the resulting data frame to a new object called college_majors.
Check: When you’re done, your new data frame should have 22 rows.
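As a reminder of how filter() combines conditions, here is a minimal sketch on made-up data. The pets data frame and its columns are hypothetical and are not part of this lab's data set; the point is only the pattern of matching several category values with %in% and applying a numeric threshold at the same time.

```r
library(dplyr)

# Hypothetical toy data (NOT the lab's data set)
pets <- tibble(
  name    = c("Ada", "Bo", "Cy", "Di", "Ed"),
  species = c("cat", "dog", "fish", "dog", "cat"),
  weight  = c(4, 20, 0.1, 9, 6)
)

# Conditions separated by commas inside filter() are combined with "and":
# keep rows whose species is one of two values AND whose weight is at least 5.
big_cats_dogs <- pets |>
  filter(
    species %in% c("cat", "dog"),
    weight >= 5
  )

# Assigning to a new name leaves the original pets object untouched
big_cats_dogs
```

The same two-condition pattern should carry over to the question above, with the lab's own variable names and values.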
Make the initial plot
At its most basic level, Figure 1 is a bar graph. Typically, bar graphs use the vertical axis for a numerical quantity, and the horizontal axis for a categorical quantity. (Although this is not what is shown in Figure 1!) Let’s start with that.
Question
Use the geom_col() function to create a ggplot object called g that is a bar graph showing the total number of recent graduates in each major. Be sure to map major and total to the x and y aesthetics, respectively, in the call to ggplot() (i.e., not in the call to geom_col()). Don’t worry about colors, labels, or anything else just yet.
If you’re having trouble with creating an object, see R Objects in Lab 2.
Your initial plot should look something like this:
g
Setting aesthetics
It was important to map the x and y aesthetics to major and total in the call to ggplot(), because we want those mappings to propagate to any geoms that we might add later.
Next, we’re going to map the fill aesthetic to the sharewomen variable. However, unlike the mappings for x and y, we only want the fill mapping to apply to geom_col().
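The difference between plot-level and layer-level mappings can be sketched on made-up data. The fruit data frame below is hypothetical, and the geom_text() layer is added only to show inheritance; it is not part of the lab's figure.

```r
library(ggplot2)

# Hypothetical toy data (NOT the lab's data set)
fruit <- data.frame(
  kind  = c("apple", "banana", "cherry"),
  count = c(12, 30, 7),
  sweet = c(0.4, 0.7, 0.9)
)

# x and y are mapped in ggplot(), so every layer inherits them.
# fill is mapped inside geom_col(), so only the bars use it;
# the text layer never sees the fill mapping.
p <- ggplot(fruit, aes(x = kind, y = count)) +
  geom_col(aes(fill = sweet)) +
  geom_text(aes(label = count), vjust = -0.5)

p
```

Here geom_text() places its labels correctly without repeating x and y, which is the propagation described above.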
Question
Add a call to geom_col() to g again. This time, map the fill aesthetic to the sharewomen variable within geom_col().
Save the resulting plot as g (so you are overwriting the previous g).
Hint: Make sure you use aes() inside your call to geom_col()!
Your plot should now look like this. Note that the legend for fill appears automatically.
g
Ordering factors
The ordering of the bars in a bar graph can play a role in helping the reader find what they are looking for quickly. Generally, you probably want the bars ordered either alphabetically (the default, and the easiest to find what you are looking for), or in rank order of bar height (allowing for easy comparisons).
The fct_reorder() function from the forcats package (part of the tidyverse) allows you to put one factor in order based on the value of another variable. To do this, it needs two arguments: first, the variable that we want to reorder, and second, the variable that we want to reorder it by.
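To see what fct_reorder() does outside of a plot, here is a small sketch on made-up vectors (kind and count are hypothetical names, not variables from the lab's data). It compares the default alphabetical level order with the order after reordering by a second variable.

```r
library(forcats)

# Hypothetical toy data (NOT the lab's data set)
kind  <- factor(c("apple", "banana", "cherry"))
count <- c(12, 30, 7)

# Default: levels are alphabetical
levels(kind)

# fct_reorder(<factor to reorder>, <variable to order it by>)
reordered <- fct_reorder(kind, count)

# Levels now run from the smallest count to the largest
levels(reordered)
#> [1] "cherry" "apple"  "banana"
```

Because ggplot2 draws discrete axes in level order, a factor reordered this way changes the order of the bars.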
Question
Using the instructions above, add aes() to g and use the fct_reorder() function to reset the x aesthetic so that the bars for major appear in order according to the value of total.
Your plot should now look like this:
g
Modifying scales, legends, and axes
In ggplot2, most scales, legends, and axes have default behaviors that are informed by best practices in data visualization. However, a good data graphic is often customized. In this section, we learn how to customize scales, legends, and axes.
Choosing a color palette
The default color palette in ggplot2 is actually quite poor. The viridis palette is an improvement (read this to learn why).
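A minimal sketch of applying the viridis palette to a continuous fill, using made-up data (the fruit data frame and the legend title "Sweetness" are hypothetical). The _c suffix in scale_fill_viridis_c() stands for continuous, and the name argument sets the legend title.

```r
library(ggplot2)

# Hypothetical toy data (NOT the lab's data set)
fruit <- data.frame(
  kind  = c("apple", "banana", "cherry"),
  count = c(12, 30, 7),
  sweet = c(0.4, 0.7, 0.9)
)

p <- ggplot(fruit, aes(x = kind, y = count)) +
  geom_col(aes(fill = sweet)) +
  # Continuous viridis palette for fill, with a readable legend title
  scale_fill_viridis_c(name = "Sweetness")

p
```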
Question
Use scale_fill_viridis_c() to specify the viridis color palette for the fill color.
Also set the name argument to something human-readable (e.g., % Women).
g
Setting axis labels
Leaving bare variable names as axis labels is bad practice. Your readership does not know or care about your variable names, and they shouldn’t have to interpret your cryptic choices. Give each axis a clear, human-readable label, with explicit units.
In this case, the label on the horizontal axis is not helpful, since the categories are implied by the title. We may be better off just omitting the label altogether.
Question
Use scale_x_discrete() to suppress the horizontal axis label by setting the name argument to NULL.
g
Formatting tick labels
The number of recent graduates is given in scientific notation. This is unnecessarily complicated and difficult to read. The little marks on the axes are called ticks. We can change the format of the tick labels by specifying a function that will be applied to the default axis labels. The scales package, which is a dependency of ggplot2, provides several useful functions (all named label_*()) for formatting axis labels.
The one we want is called label_comma(). Most of the time when we use a function, R will know which package it’s from. Sometimes, though, multiple packages may have different functions with the same name. In order to specify which version of the function to use, we can use the notation package::function(); in this case, scales::label_comma().
Note that scales::label_comma() returns a function, so it works like this:
x <- 10^(0:7)
x
[1] 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1e+06 1e+07
scales::label_comma()(x)
[1] "1" "10" "100" "1,000" "10,000"
[6] "100,000" "1,000,000" "10,000,000"
This may seem a bit weird, but the main takeaway for us is that the function returned by scales::label_comma() can be passed to the labels argument in any of the scale_*() functions in ggplot2.
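Here is one way this can look in a plot, sketched on made-up data (the cities data frame, the fmt object, and the axis title "Population" are all hypothetical). The formatter is created once by calling label_comma() with parentheses, and the resulting function is what gets handed to labels.

```r
library(ggplot2)

# Hypothetical toy data (NOT the lab's data set)
cities <- data.frame(
  city       = c("A", "B", "C"),
  population = c(1200000, 350000, 2750000)
)

# Calling label_comma() returns a formatting function
fmt <- scales::label_comma()
fmt(2750000)
#> [1] "2,750,000"

# Pass that function to the labels argument of a scale
p <- ggplot(cities, aes(x = city, y = population)) +
  geom_col() +
  scale_y_continuous(name = "Population", labels = fmt)

p
```

Writing labels = scales::label_comma() directly inside the scale call is equivalent; storing the function in fmt first just makes the "function that returns a function" step visible.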
Question
Use the labels argument within the scale_y_continuous() function to force the numbers to show in the format provided by scales::label_comma() (the label_comma() function from the scales package).
Also improve the axis label by setting the name argument of scale_y_continuous().
g
We have fixed the vertical axis, but our plot still isn’t legible, because the labels on the horizontal axis are running into each other.
One option for getting around that is to have the labels printed at a 45 degree angle. We can achieve this by overriding the default angle property of the axis.text.x argument to the theme() function.
g +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
However, in this case, many of the axis labels are really long, so this doesn’t look good. It’s also harder to read text that is at an angle.
Another option is to wrap the text in the labels. The simplest way to do this is to use another function from scales: label_wrap(). This function takes an argument that specifies the number of characters after which to wrap the text.
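Like label_comma(), label_wrap() returns a function. The sketch below applies it to a made-up label (the text and the width of 10 are hypothetical choices for illustration) to show that wrapping works by inserting line breaks into the string.

```r
# label_wrap(10) returns a function that wraps text to about 10 characters
wrap10 <- scales::label_wrap(10)

# Hypothetical long label (NOT taken from the lab's data set)
wrapped <- wrap10("a fairly long category label")

# The result is still a single string, now containing "\n" line breaks
cat(wrapped, "\n")
```

Passed to the labels argument of a discrete scale, the same function would be applied to every tick label.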
Question
Pass the scales::label_wrap() function to the labels argument in scale_x_discrete() to wrap the major field names to 25 characters.
g
Flipping the orientation
This helps, but the labels still aren’t readable.
A better effect is achieved by flipping the coordinate axes. We could go back and change the way that we assigned the x and y aesthetics, but then we’d have to re-run all of our code. Instead, we will use the coord_flip() function to re-orient our existing plot.
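A minimal sketch of flipping an existing plot, on made-up data (the fruit data frame is hypothetical). Note that the aesthetic mappings are left exactly as they were; only the drawing orientation changes.

```r
library(ggplot2)

# Hypothetical toy data (NOT the lab's data set)
fruit <- data.frame(
  kind  = c("apple", "banana", "cherry"),
  count = c(12, 30, 7)
)

# A vertical bar graph: kind on x, count on y
p <- ggplot(fruit, aes(x = kind, y = count)) +
  geom_col()

# Same mappings, but drawn with horizontal bars
p_flipped <- p + coord_flip()

p_flipped
```

After flipping, scale functions such as scale_x_discrete() still refer to the original x aesthetic, while theme() elements such as axis.text.y refer to the axis as it is displayed.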
Question
Use the coord_flip() function to flip the coordinate axes.
g
One last touch to improve the tick labels is to make the font smaller. You can do this by modifying the theme(). Note that since we already flipped the axes, we are now setting the axis.text.y property. Here we set the font size to 70% of its default size.
g <- g +
  theme(
    axis.text.y = element_text(size = rel(0.7))
  )
g
Faceting
Finally, we want to separate the “Computers & Mathematics” majors from the “Engineering” majors. We can do this by faceting: creating a set of subplots with the same structure, one for each value of a categorical variable. In this case, we facet by the major_category variable.
The facet_wrap() function takes a variable name as its first argument, but for technical reasons, that name needs to be wrapped in the vars() function.
For this plot, we don’t need all majors to be present on the vertical axis of each facet, so we’ll set the scales argument (different from the scales package we used above!) to "free_y" (so that the y scales can be free!). We also want to make vertical comparisons across bars, so we’ll force the facets to only occupy one column by setting the ncol argument to 1.
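The faceting arguments described above can be sketched on made-up data. The snacks data frame and its columns are hypothetical; the structure (a flipped bar graph, one facet per group, a single column) mirrors the description but is not the lab's solution.

```r
library(ggplot2)

# Hypothetical toy data: two groups with different items in each
snacks <- data.frame(
  item  = c("apple", "banana", "carrot", "leek"),
  group = c("fruit", "fruit", "vegetable", "vegetable"),
  count = c(12, 30, 7, 15)
)

p <- ggplot(snacks, aes(x = item, y = count)) +
  geom_col() +
  coord_flip() +
  facet_wrap(
    vars(group),        # the faceting variable, wrapped in vars()
    scales = "free_y",  # each facet lists only its own items
    ncol = 1            # stack the facets in a single column
  )

p
```

Without scales = "free_y", every facet would reserve space for all four items, leaving empty rows for the items that belong to the other group.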
Question
Use the facet_wrap() function to create facets for each major_category. Be sure to set the scales and ncol arguments as specified above.
Your final plot should look like this. Note that here we’ve set the fig-height chunk option to 8 to create a little more vertical space by putting #| fig-height: 8 at the top of the code chunk.
g
Submitting this lab
Step 1: Compare solutions
Download the Lab 6 template (if you haven’t done so already). Do you end up with the same graph as at the top of the lab?
Step 2: Final Project Data Check 3
As part of your preparation for the final project, you need to identify data sources. In labs four through six, you will investigate a new data repository. Each week, you will identify one dataset that you are interested in. For each dataset, write one sentence explaining why you think this dataset could be a good fit for the project (save this for later). The data need to be downloadable (i.e., not from a package such as openintro or moderndive), and cannot be any data we have used in class. The data must also fit the following characteristics:
- Have at least 200 observations, and no more than 10,000
- Have at least 4 variables total (not including ID variable)
- At least 2 categorical variables
- At least 2 numeric variables
Submit the name and URL for the data as part of each week’s Moodle quiz.
You are welcome to use data from any source that meets the above requirements. Here are a few resources to get you started:
- FiveThirtyEight makes the data they use in their articles publicly available.
- The UCI Machine Learning Repository has a collection of datasets used in machine learning.
- Kaggle has a large collection of user-uploaded datasets, often used in competitions hosted on the website.
Step 3: Complete the Moodle Quiz
After completing the data check, complete the Moodle quiz for this lab.
Optional reading
To prepare for next week’s lab on importing data, we encourage reading the following:
- Sections 2.1 and 2.4 in Data Science: A First Introduction
- Sections 8.1-8.2 in R for Data Science