Lab 4: Introduction to Data Wrangling

Preamble

The purpose of this lab is to get an introduction to the logic of data “wrangling” (transforming the data to the format you need for visualization/analysis). Data wrangling is a core skill in data science, and is increasingly valued by employers. You will be using the select(), filter(), mutate(), and arrange() commands to alter a data frame.1

After completing this lab, you should be able to transform a data frame by subsetting variables and observations, generating new variables, and sorting your data by values.

You will know that you are done when you have created a rendered version of the Lab 4 template file where the results match the examples in Lab 4 solutions file.

Why are we here?

Often, when we find data for a project it is not quite ready for use yet. Maybe the data are too large and you only need some subset of the observations or variables. Alternatively, you may have a variable that needs to be changed. For example, you might want to change a variable from being numeric (income in dollars) to a categorical measure (high, medium, low income). The process of wrangling your data is a vital step that makes your data usable for visualization or analysis.

Optional reading

4.1-4.3 in R for Data Science is a good additional resource for material covered in this lab.

Preliminary Work

Setting up

Download the Lab 4 template.

Warning

Before you begin:

  • Move the template file you just downloaded from your Downloads folder into the SDS 100 folder that you created in Lab 1.

  • Make sure that within RStudio, you’re working in the SDS 100 project that you created in Lab 1.

For more detailed instructions on either of the above, see Lab 2.

Loading packages and data

We will be relying heavily on the dplyr package found in the tidyverse. To complete this lab you will need to load the tidyverse package (see Lab 1 for instructions on how to do this).

  • We will also need to install and load the openintro package to access the dataset acs12, a 2012 survey of 2,000 Americans from the US Census Bureau’s American Community Survey (ACS) (see documentation here).
  • Each observation represents an individual respondent, meaning that our units are individual people living in the United States.
Code
library(tidyverse)

The sample code below loads the survey data directly from the openintro package and we assign it to the object acs_df. Many packages include data that you can use directly rather than needing to separately download a file and import it into R. You will learn later (in Lab 7) about importing data into R, but for now, we will stick to data already provided by packages.

Code
acs_df <- openintro::acs12
  • We want to save acs_df into our global environment so that we can observe the changes in observations or variables. You can name this object anything you want, but aim for a name that is short enough to be easy to type yet descriptive enough that you know what the object is; non-descriptive names like “object1” are not helpful. We call this object acs_df so we know it is a data frame object and that it contains ACS data.
  • If you run the command glimpse(acs_df) you will generate an overview of the data in the Console, including the number of observations and variables. This will be our raw (original) version of the dataset.
  • When we wrangle the data by dropping variables or cases, we will save a new copy of the data frame, so that we can easily compare the original data to our altered data. Because the data are built into the openintro package, we only have to type openintro::acs12 to use the original package data directly, and the assignment operator (<-) then saves it as a data frame in R that we call acs_df.
  • The double colon operator (::) provides explicit access to objects within packages. So openintro::acs12 means “get me the acs12 object from within the openintro package.”
  • After we save acs_df to our environment, we can then just use acs_df whenever we want to use the data in a command.
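As a quick illustration of the double colon operator, here is a minimal sketch using base R’s stats package (which is always available), calling a function from a package without loading it first:

```r
# Use :: to call the sd() function from the stats package
# without running library(stats) first
stats::sd(c(2, 4, 6))
#> [1] 2
```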
Question

Use the code below to:

  1. Load the tidyverse package.
  2. Import the acs12 data from the openintro package, and assign it to an object called acs_df in your environment.
Code
library(tidyverse)
acs_df <- openintro::acs12

Pipes

One crucial tool you will use in R is the pipe operator.

  • The pipe operator2 in R looks like this: |>. You may occasionally see the older version of a pipe (%>%) online or in other classes, but both pipes work the same way.
  • The pipe takes the result of the command that came before it and injects it into the first argument of the command that comes after it. So x |> f(y) means the same thing as f(x, y).
  • The pipe allows us to take several commands and combine them into a single chain. This is helpful when you want to apply several different commands to the same data frame, and it lets you read and execute the commands sequentially.
  • One way to think of the pipe operator is as a placeholder telling R “and then.” For example, the code below tells us to use the acs_df data frame, and then only keep the variables income, age, gender, hrs_work, and time_to_work, and then only keep observations where the respondent is female.
Code
subset_df <- acs_df |>
  select(income, age, gender, hrs_work, time_to_work) |>
  filter(gender == "female")

You may find the following video about the pipe operator helpful. Please note that while the video refers to the old pipe (%>%), the content also applies to the new pipe (|>).
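To see how x |> f(y) rewrites to f(x, y), here is a minimal base R sketch (no tidyverse needed; the native |> pipe requires R 4.1 or later). Both lines compute the same value:

```r
# Without the pipe: nested function calls read inside-out
round(sqrt(10), 2)
#> [1] 3.16

# With the pipe: the result of each step feeds into the
# first argument of the next command
10 |> sqrt() |> round(2)
#> [1] 3.16
```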

Lab instructions

Subsetting the columns (variables)

The select() function allows us to choose which variables to subset from a data frame.

Datasets may contain dozens or even hundreds of variables, but a researcher is often only interested in some of these variables. The extra variables or observations not only make it harder to sort through and identify the data you want, but can also slow down your computer as it reads through gigabytes of data. Before working with data, you should locate its codebook, which documents each of the variables and how they are measured. Codebooks are a major resource for identifying what data you do and do not want.

To make the data easier to work with (and potentially save huge amounts of computer memory) we can use the select() command to only keep the variables we want to analyze. Figure 1 illustrates the logic of the select() function. The number of columns (representing variables) is reduced from five to three, as we only told R to select columns 2, 3, and 5.

Figure 1: Visual depiction of the select() function. The variables in blue are selected, while the variables in grey are discarded.

The sample code below takes the acs_df data frame, and then only keeps the variables income, employment, and age. Note that in this code we have not saved our changes to the data frame. Instead, we get an output that shows us what the resulting data frame would look like.

Code
acs_df |> 
  select(income, employment, age)
# A tibble: 2,000 × 3
   income employment           age
    <int> <fct>              <int>
 1  60000 not in labor force    68
 2      0 not in labor force    88
 3     NA <NA>                  12
 4      0 not in labor force    17
 5      0 not in labor force    77
 6   1700 employed              35
 7     NA <NA>                  11
 8     NA <NA>                   7
 9     NA <NA>                   6
10  45000 employed              27
# ℹ 1,990 more rows

If we wanted to save these changes, we would need to assign them to an object by using a <-, and creating a new object name. If we named the object acs_df then we would save over our original data frame. (You can see in the sample code above we created an object called subset_df that saves our changes.)

Question

Using the above code as an example, use the select() function to keep the variables married, race, and edu. Use the template code below as a guide to help you select the columns.

Code
# CODE TEMPLATE
acs_df |> 
  select(________)
Question

Select the following variables: citizen, time_to_work, and lang.

Subsetting the rows (observations)

Another way we may want to wrangle our data is to evaluate a subset of the observations.

The filter() function allows us to subset our data by only keeping the observations that fit a condition. Perhaps we are interested in respondents who are over the age of 65, or only the non-white sample. The filter() function evaluates a logical expression based on a value of one or more columns, and only keeps those observations for which the expression is true.

Figure 2 illustrates the logic of the filter command.

Figure 2: Visual depiction of the filter() function. Here, only the 2nd and 3rd rows (in blue) meet the condition and are kept. The 1st and 4th rows (in grey) do not meet the condition and are discarded.

The code below takes our saved data frame, and then only keeps rows where gender is coded as female. Note that we use == to tell R to only keep cases where gender equals a particular category.

Tip

The == is a test for equality, whereas = is an assertion of equality. Thus, a == b will return TRUE or FALSE, whereas a = b will set the value of a to be equal to the value of b.
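The following base R sketch illustrates the difference (the variable name x here is just for illustration):

```r
x <- 5      # assignment: x now holds the value 5
x == 5      # test for equality: returns TRUE
#> [1] TRUE
x == 7      # x is not 7, so the test returns FALSE
#> [1] FALSE
```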

Run the code below, and note the number of rows in the output. Is it different than the original data frame?

Code
acs_df |>
  filter(gender == "female")

What if our variable is numeric instead of categorical? We can use > or < (or >= and <=) to instruct R which cases to keep. For example, the code below only keeps observations where the respondent's age is 30 or less.

Code
acs_df |>
  filter(age <= 30)

We can also filter by multiple conditions. For example, we may only be interested in female respondents aged 30 or under. We use & (AND) to only keep cases that fit both conditions: someone who is 30 or under and is female.

Code
acs_df |>
  filter(age <= 30 & gender == "female")

Alternatively, we use | (OR) to keep observations that meet either condition: anyone who is 30 or under, or anyone who is female. Notice the tibble below contains many more cases.

Code
acs_df |>
  filter(age <= 30 | gender == "female")
Question

Use the template below to only keep observations where time_to_work is greater than or equal to 35 minutes and income is strictly greater than fifty thousand dollars ($50000).

Code
# CODE TEMPLATE
acs_df |>
  filter( ______  & _______  )

Modifying the variables

mutate() is a powerful tool when wrangling data because it allows us to create new variables based on the values of other variables.

For example, in the ACS dataset the variable income is measured in dollars, so the scale is quite large, ranging from a low of $0 to a high of $450,000. Given how large the numbers are, we may want to divide this value by 1000 so we can get a new variable that measures income in $1000 increments.

Note that we name the new variable income_1000, but we could name it anything we want. All else being equal, short names with clear meanings are preferable. Also notice that we are permanently changing the subset_df data frame by using R’s assignment operator to overwrite the original with an updated version that includes the income_1000 column.

Code
subset_df <- subset_df |>
  mutate(income_1000 = income / 1000)

subset_df
# A tibble: 969 × 6
   income   age gender hrs_work time_to_work income_1000
    <int> <int> <fct>     <int>        <int>       <dbl>
 1  60000    68 female       40           NA        60  
 2     NA    12 female       NA           NA        NA  
 3      0    77 female       NA           NA         0  
 4   1700    35 female       40           15         1.7
 5     NA     8 female       NA           NA        NA  
 6   8600    69 female       23            5         8.6
 7   4000    67 female        8           10         4  
 8  19000    36 female       35           15        19  
 9     NA    12 female       NA           NA        NA  
10   1200    18 female       12           NA         1.2
# ℹ 959 more rows

Figure 3 illustrates the logic of the mutate() function.

Figure 3: Visual depiction of the mutate() function. The existing variables (in grey) are unchanged, while a new variable (in blue) is created.
Question

Use the mutate() command to generate a new variable in the subset_df data frame that measures age in months instead of years.

Sorting the observations

Lastly, the arrange() function is a tool that orders the rows of a data frame by the values of a variable. For example, you could sort the data by income with the lowest income values being in the first rows, and the highest in later rows.

Code
subset_df |>
  arrange(income_1000)

If you want to reverse the order, you can wrap the variable in desc().

Code
subset_df |>
  arrange(desc(income_1000))
Question

Use the template below to arrange the subset_df data frame by the hrs_work variable.

Code
# CODE TEMPLATE
subset_df |>
  arrange(_________)
Question

Generate code that will arrange the subset_df data frame by time_to_work in descending order.

Bringing it all together

Because of our pipe operator, rather than having separate commands that filter, select, mutate, or arrange our data, we can combine multiple instances of these tools into a single pipeline. For example, the code below keeps the variables age, gender, and race, and then only keeps observations with age greater than or equal to 75. Order does matter. For example, if we did not select the age variable, then we would not be able to filter the data by age (because we dropped it!).

Code
acs_df |>
  select(age, gender, race) |>
  filter(age >= 75)
Question

Using the template below, select the variables hrs_work, income, and married, and keep observations where income is less than $30,000.

Code
# CODE TEMPLATE
acs_df |>
  select() |>
  filter()
Question

Use the pipe operator to generate a variable that measures income in units of $10,000 and then arranges the data by hrs_work. Then select these two variables.

Finishing this lab

Instructions

Step 1: Create twin document

Download the Lab 4 template (if you haven’t done so already). Complete it such that the results in the generated output match the examples in the Lab 4 solutions file.

Step 2: Final Project Data Check 1

As part of your preparation for the final project, you need to identify data sources. In labs four through six, you will investigate a new data repository. Each week, you will identify one dataset that you are interested in. For each dataset, write one sentence explaining why you think this dataset could be a good fit for the project (save this for later). The data need to be downloadable (i.e., not from a package such as openintro or moderndive), and cannot be any data we have used in class. The data must also fit the following characteristics:

  • Have at least 200 observations, and no more than 10,000
  • Have at least 4 variables total (not including ID variable)
    • At least 2 categorical variables
    • At least 2 numeric variables

Submit the name and URL for the data as part of each week’s Moodle quiz.

You are welcome to use data from any source that meets the above requirements. Here are a few resources to get you started:

  • FiveThirtyEight makes the data they use in their articles publicly available.
  • The UCI Machine Learning Repository has a collection of datasets used in machine learning.
  • Kaggle has a large collection of user-uploaded datasets, often used in competitions hosted on the website.

Step 3: Complete the Moodle Quiz

After completing the data check, complete the Moodle quiz for this lab.

Optional reading

To prepare for next week’s lab on summarizing, we encourage reading section 4.4-4.5 in R for Data Science.

References

Xie, Yihui, Joseph J Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Chapman; Hall/CRC. https://doi.org/10.1201/9781138359444.

Footnotes

  1. Materials inspired by Xie, Allaire, and Grolemund (2018).↩︎

  2. The pipe operator in R was inspired by the pipe operator from UNIX (|).↩︎