library(tidyverse)
The purpose of this lab is to get an introduction to the logic of data "wrangling" (transforming the data to the format you need for visualization/analysis). Data wrangling is a core skill in data science, and is increasingly valued by employers. You will be using the select(), filter(), mutate(), and arrange() commands to alter a data frame.1
After completing this lab, you should be able to transform a data frame by subsetting variables and observations, generating new variables, and sorting your data by values.
You will know that you are done when you have created a rendered version of the Lab 4 template file where the results match the examples in Lab 4 solutions file.
Often, when we find data for a project it is not quite ready for use yet. Maybe the data are too large and you only need some subset of the observations or variables. Alternatively, you may have a variable that needs to be changed. For example, you might want to change a variable from being numeric (income in dollars) to a categorical measure (high, medium, low income). The process of wrangling your data is a vital step that makes your data usable for visualization or analysis.
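As a preview of that kind of transformation, the dollars-to-categories example can be sketched with base R's cut() function. The income values and cut points below are made up purely for illustration:

```r
# Hypothetical incomes in dollars (made-up values)
income <- c(12000, 48000, 250000)

# Recode the numeric incomes into three categories,
# using illustrative (not official) thresholds.
cut(income,
    breaks = c(-Inf, 30000, 100000, Inf),
    labels = c("low", "medium", "high"))
# returns the factor: low medium high
```

Later in this lab you will see mutate(), the tidyverse tool for adding recoded variables like this to a data frame.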
4.1-4.3 in R for Data Science is a good additional resource for material covered in this lab.
Download the Lab 4 template.
Before you begin:
Move the template file you just downloaded from your Downloads folder into the SDS 100 folder that you created in Lab 1.
Make sure that within RStudio, you’re working in the SDS 100 project that you created in Lab 1.
For more detailed instructions on either of the above, see Lab 2.
We will be relying heavily on the dplyr package found in the tidyverse. To complete this lab you will need to load the tidyverse package (see Lab 1 for instructions on how to do this). We will also use the openintro package to load the dataset acs12, which is a 2012 survey from the US Census Bureau's American Community Survey (ACS) of 2,000 Americans (see documentation here).

library(tidyverse)
The sample code below loads the survey data directly from the openintro package and we assign it to the object acs_df. Many packages include data that you can use directly rather than needing to separately download a file and import it into R. You will learn later (in Lab 7) about importing data into R, but for now, we will stick to data already provided by packages.
acs_df <- openintro::acs12
We save acs_df into our global environment so that we can observe the changes in observations or variables. You can name this anything you want, but we recommend keeping names short because you will be using this name to refer to the dataset when coding. At the same time, we want to name it something long enough that we know what the object is. Non-descriptive names like "object1" are not helpful, so we want to generate a name that is short enough to be easy to type, but specific enough that we know what the object is. We call this object acs_df so we know it is a data frame object and it contains ACS data.

If you run glimpse(acs_df), you will generate an overview of the data in the Console, including the number of observations and variables. This will be our raw (original) version of the dataset.

Because the data are included in the openintro package, we only have to type openintro::acs12 to use the original package data directly, and the assignment operator (<-) then saves it as a data frame in R that we call acs_df.

The double colon operator (::) provides explicit access to objects within packages. So openintro::acs12 means "get me the acs12 object from within the openintro package."

Once we save acs_df to our environment, we can then just use acs_df whenever we want to use the data in a command.

Use the code below to:
Load the acs12 data from the openintro package, and assign it to an object called acs_df in your environment.

library(tidyverse)
acs_df <- openintro::acs12
One crucial tool you will use in R is the pipe operator, |>. You may occasionally see the older version of the pipe (%>%) online or in other classes, but both pipes work the same way: x |> f(y) means the same thing as f(x, y).

The code below takes the acs_df data frame, then keeps only the variables income, age, gender, hrs_work, and time_to_work, and then keeps only the observations where the respondent is female.
data frame, and then only keep the variables income, age, and gender, and then only keep observations where the respondent is female.<- acs_df |>
subset_df select(income, age, gender, hrs_work, time_to_work) |>
filter(gender == "female")
You may find the following video about the pipe operator helpful. Please note that while the video refers to the old pipe (%>%), the content also applies to the new pipe (|>).
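You can check the equivalence between x |> f(y) and f(x, y) directly in the Console using ordinary base R functions (the values below are arbitrary; the native pipe requires R 4.1 or later):

```r
# The pipe inserts its left-hand side as the first argument
# of the function on its right.
sqrt(16)        # returns 4
16 |> sqrt()    # the same call, written with the pipe; returns 4

# With a second argument, x |> f(y) is f(x, y):
round(3.14159, 2)     # returns 3.14
3.14159 |> round(2)   # returns 3.14
```

This is why a pipeline reads as a series of steps: each function receives the result of the previous step as its first argument.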
The select() function allows us to choose which variables to subset from a data frame.
Datasets may contain dozens, or even hundreds of variables, but a researcher is often only interested in some of these variables. The extra variables or observations not only make it harder to sort through and identify the data you want, but can also slow down your computer as it reads through gigabytes of data. Before working with data, you should identify its codebook, which should have information about each of the variables and how they are measured. Codebooks are a major resource for you to identify what data you do and do not want.
To make the data easier to work with (and potentially save huge amounts of computer memory) we can use the select() command to keep only the variables we want to analyze. Figure 1 illustrates the logic of the select() function: the number of columns (representing variables) is reduced from five to three, as we only told R to select columns 2, 3, and 5.
The sample code below takes the acs_df data frame, and then only keeps the variables income, employment, and age. Note that in this code we have not saved our changes to the data frame. Instead, we get an output that shows us what the resulting data frame would look like.
acs_df |>
  select(income, employment, age)
# A tibble: 2,000 × 3
income employment age
<int> <fct> <int>
1 60000 not in labor force 68
2 0 not in labor force 88
3 NA <NA> 12
4 0 not in labor force 17
5 0 not in labor force 77
6 1700 employed 35
7 NA <NA> 11
8 NA <NA> 7
9 NA <NA> 6
10 45000 employed 27
# ℹ 1,990 more rows
If we wanted to save these changes, we would need to assign them to an object by using <- and creating a new object name. If we named the object acs_df then we would save over our original data frame. (You can see in the sample code above we created an object called subset_df that saves our changes.)
Using the above code as an example, use the select() function to keep the variables married, race, and edu. Use the template code below as a guide to help you select the columns.
# CODE TEMPLATE
acs_df |>
  select(________)
Select the following variables: citizen, time_to_work, and lang.
Another way we may want to wrangle our data is to evaluate a subset of the observations.
The filter() function allows us to subset our data by only keeping the observations that fit a condition. Perhaps we are interested in respondents who are over the age of 65, or only the non-white sample. The filter() function evaluates a logical expression based on the value of one or more columns, and only keeps those observations for which the expression is true.
Figure 2 illustrates the logic of the filter command.
The code below takes our saved data frame, and then only keeps rows where gender is coded as female. Note that we use == to tell R to only keep cases where gender equals a particular category.
The == is a test for equality, whereas = is an assertion of equality. Thus, a == b will return TRUE or FALSE, whereas a = b will set the value of a to be equal to the value of b.
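A quick base R illustration of the difference, using a throwaway variable a:

```r
a <- 5     # <- (like =) assigns: a now holds the value 5
a == 5     # == compares: returns TRUE
a == 6     # == compares: returns FALSE
```

If you accidentally write filter(gender = "female"), R will stop with an error reminding you to use == inside filter().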
Run the code below, and note the number of rows in the output. Is it different than the original data frame?
acs_df |>
  filter(gender == "female")
What if our variable is numeric instead of categorical? We can use > or < to instruct R which cases to keep. For example, the code below only keeps observations where the respondent's age is 30 or less.
acs_df |>
  filter(age <= 30)
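One behavior worth knowing: comparisons involving missing values return NA rather than TRUE or FALSE, and filter() keeps only the rows where the condition is TRUE. So a respondent with a missing age would be silently dropped by the filter above. A base R sketch of the comparison itself:

```r
# A comparison against a missing value is itself missing:
NA <= 30   # returns NA, not TRUE or FALSE

# filter(age <= 30) therefore drops rows where age is NA,
# because their condition is NA rather than TRUE.
```

If you wanted to keep the missing-age rows as well, you would need to say so explicitly, e.g. filter(age <= 30 | is.na(age)).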
We can also filter by multiple conditions. For example, we may only be interested in female respondents under the age of 30. We use & (AND) to only keep cases that fit both conditions: someone who is under 30 and is female.
acs_df |>
  filter(age <= 30 & gender == "female")
Alternatively, we use | (OR) to say we want to keep observations that meet either of the conditions: anyone who is under 30, or is female. Notice the tibble below contains a lot more cases.
acs_df |>
  filter(age <= 30 | gender == "female")
Use the template below to only keep observations where time_to_work is greater than or equal to 35 minutes and income is strictly greater than fifty thousand dollars ($50,000).
# CODE TEMPLATE
acs_df |>
  filter( ______ & _______ )
mutate() is a powerful tool when wrangling data because it allows us to create new variables based on the values of other variables.
For example, in the ACS dataset the variable income is measured in dollars, so the scale is quite large, ranging from a low of $0 to a high of $450,000. Given how large the numbers are, we may want to divide this value by 1000 so we can get a new variable that measures income in $1000 increments.
Note that we name the new variable income_1000, but we could name it anything we want. All else being equal, short names with clear meanings are preferable. Also notice that we are permanently changing the subset_df data set by using R's assignment operator to overwrite the original data set with our updated data set that includes the income_1000 column.
subset_df <- subset_df |>
  mutate(income_1000 = income / 1000)
subset_df
# A tibble: 969 × 6
income age gender hrs_work time_to_work income_1000
<int> <int> <fct> <int> <int> <dbl>
1 60000 68 female 40 NA 60
2 NA 12 female NA NA NA
3 0 77 female NA NA 0
4 1700 35 female 40 15 1.7
5 NA 8 female NA NA NA
6 8600 69 female 23 5 8.6
7 4000 67 female 8 10 4
8 19000 36 female 35 15 19
9 NA 12 female NA NA NA
10 1200 18 female 12 NA 1.2
# ℹ 959 more rows
Figure 3 illustrates the logic of the mutate() function.
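mutate() can also create several new variables in one call, each as its own named argument. A minimal sketch on a made-up data frame (the variable names and values here are assumptions for illustration, not from the ACS data):

```r
library(dplyr)

# Throwaway example data: three hypothetical incomes in dollars
pay_df <- data.frame(income = c(1200, 45000, 60000))

pay_df |>
  mutate(
    income_1000 = income / 1000,    # rescaled copy of income
    high_income = income > 50000    # logical flag derived from income
  )
```

Each new column can refer to existing columns, and the original columns are kept alongside the new ones.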
Use the mutate() command to generate a new variable in the subset_df data frame that measures age in months instead of years.
Lastly, the arrange() function is a tool that orders the rows of a data frame by the values of a variable. For example, you could sort the data by income, with the lowest income values being in the first rows and the highest in later rows.
subset_df |>
  arrange(income_1000)
If you want to reverse the order, you can add desc() to the command.
subset_df |>
  arrange(desc(income_1000))
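arrange() can also sort by more than one column, with later columns breaking ties within the earlier ones. A minimal sketch on a throwaway data frame (the example data here is made up for illustration):

```r
library(dplyr)

toy_df <- data.frame(
  gender = c("female", "male", "female"),
  income = c(20000, 35000, 50000)
)

# Sort by gender first; within each gender, highest income first.
toy_df |>
  arrange(gender, desc(income))
```

The two female rows are grouped together (gender sorts alphabetically), and within them the $50,000 row comes before the $20,000 row.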
Use the template below to arrange the subset_df data frame by the hrs_work variable.
# CODE TEMPLATE
subset_df |>
  arrange(_________)
Generate code that will arrange the subset_df data frame by time_to_work in descending order.
Because of our pipe operator, rather than having separate commands that filter, select, mutate, or arrange our data, we can combine multiple instances of these tools into a single pipeline. For example, the code below keeps the variables age, gender, and race, and then only keeps observations with age greater than or equal to 75. Order does matter: for example, if we did not select the age variable, then we would not be able to filter the data by age (because we dropped it!).
acs_df |>
  select(age, gender, race) |>
  filter(age >= 75)
Using the template below, select the variables hrs_work, income, and married, and keep observations where income is less than $30,000.
# CODE TEMPLATE
acs_df |>
  select() |>
  filter()
Use the pipe operator to generate a variable that measures income in units of $10,000, then arrange the data by hrs_work, and then select these two variables.
Download the Lab 4 template (if you haven’t done so already). Complete it such that the results in the generated output match the examples in the Lab 4 solutions file.
As part of your preparation for the final project, you need to identify data sources. In labs four through six, you will investigate a new data repository. Each week, you will identify one dataset that you are interested in. For each dataset, write one sentence explaining why you think this dataset could be a good fit for the project (save this for later). The data need to be downloadable (i.e., not from a package such as openintro or moderndive), and cannot be any data we have used in class. The data must also fit the following characteristics:
Submit the name and URL for the data as part of each week’s Moodle quiz.
You are welcome to use data from any source that meets the above requirements. Here are a few resources to get you started:
After completing the data check, complete the Moodle quiz for this lab.
To prepare for next week’s lab on summarizing, we encourage reading section 4.4-4.5 in R for Data Science.
1. The pipe operator in R was inspired by the pipe operator from UNIX (|).