- Three of the variables seem related to employment: employment, hrs_work, and time_to_work. You could argue that income is employment-related, but income could also be ‘passive income’, e.g., from stocks or rental properties.
- The <fct> label on edu tells us that level of education is represented using categories.
Lab 4: Introduction to Data Wrangling
Question
Insert a new code chunk in your Quarto document. In this code chunk, write code that will:
- Load the tidyverse package
- Import the acs12 data set from the openintro package, and store a copy of it as an object named acs_df in our environment.
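As a minimal sketch of the two steps above, assuming both the tidyverse and openintro packages are already installed, the data can be copied straight out of the package namespace. Using openintro::acs12 is one of several equivalent ways to get at the data; the call to glimpse() is included only because the next question refers to its output.

```r
# Load the tidyverse (attaches dplyr, which provides glimpse()).
library(tidyverse)

# acs12 ships with the openintro package; store a copy of it as acs_df.
acs_df <- openintro::acs12

# Print a compact, column-by-column overview of the data.
glimpse(acs_df)
```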
Question
Based on the output from the glimpse() function, answer the following questions:
- How many variables directly related to employment are measured in these data?
- Is level of education measured as a numeric or categorical variable?
Question
Insert a new code chunk into your Quarto document. Then, copy the template code below into this code chunk:
Modify this template so that the select() function keeps only the variables married, race, and edu.
# A tibble: 2,000 × 3
married race edu
<fct> <fct> <fct>
1 no white college
2 no white hs or lower
3 no white hs or lower
4 no white hs or lower
5 no white hs or lower
6 yes other hs or lower
7 no white hs or lower
8 no other hs or lower
9 no asian hs or lower
10 yes white hs or lower
# ℹ 1,990 more rows
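One way to produce output like the tibble above is sketched here. The lab's template code is not reproduced in this section, so this is an illustration rather than the template itself; the object name acs_demographics and the native pipe |> are assumptions (the magrittr pipe %>% would behave the same way here).

```r
library(tidyverse)

acs_df <- openintro::acs12

# Keep only the three named columns; all rows are retained.
# acs_demographics is a hypothetical name chosen for this sketch.
acs_demographics <- acs_df |>
  select(married, race, edu)

acs_demographics
```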
Question
Insert a new code chunk into your Quarto document, and write code that will:
- Select the employment, time_to_work, and hrs_work variables from the acs_df data set.
- Store this new subset of the ACS data in an object named acs_employment.
- Print out a “glimpse” of the acs_employment data.
Rows: 2,000
Columns: 3
$ employment <fct> not in labor force, not in labor force, NA, not in labor …
$ time_to_work <int> NA, NA, NA, NA, NA, 15, NA, NA, NA, 40, NA, 5, NA, NA, NA…
$ hrs_work <int> 40, NA, NA, NA, NA, 40, NA, NA, NA, 84, NA, 23, NA, NA, N…
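The three steps in this question (select, store, glimpse) might be written as follows. This is a sketch that assumes acs_df was created from openintro::acs12 as in the first question; the native pipe |> is a stylistic assumption.

```r
library(tidyverse)

acs_df <- openintro::acs12

# Select the employment-related columns and store the result.
acs_employment <- acs_df |>
  select(employment, time_to_work, hrs_work)

# Print a compact overview of the new object.
glimpse(acs_employment)
```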
Question
Insert a new code chunk into your Quarto document, and copy the template code below into your new code chunk.
Modify this template so that the filter() function only keeps observations where time_to_work is greater than or equal to 35 minutes and income is strictly greater than fifty thousand dollars ($50,000).
# A tibble: 68 × 13
income employment hrs_work race age gender citizen time_to_work lang
<int> <fct> <int> <fct> <int> <fct> <fct> <int> <fct>
1 70000 employed 50 black 50 male yes 40 english
2 85000 employed 50 white 33 male yes 65 english
3 100000 employed 40 white 27 male yes 45 english
4 55000 employed 60 white 66 male yes 50 english
5 58000 employed 50 white 59 male yes 45 english
6 60000 employed 40 other 48 male yes 45 english
7 90000 employed 50 asian 30 male yes 45 other
8 100000 employed 40 white 40 male yes 142 other
9 360000 employed 50 white 52 male yes 40 english
10 89000 employed 55 white 43 male yes 40 english
# ℹ 58 more rows
# ℹ 4 more variables: married <fct>, edu <fct>, disability <fct>,
# birth_qrtr <fct>
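A sketch of a filter() call that matches the two conditions in this question is shown below. The template itself is not reproduced here, so the object name long_commute_high_income is hypothetical. Separating conditions with a comma inside filter() combines them with a logical AND, and rows where either condition evaluates to NA are dropped.

```r
library(tidyverse)

acs_df <- openintro::acs12

# Keep rows with a commute of at least 35 minutes AND income above $50,000.
# long_commute_high_income is a hypothetical name chosen for this sketch.
long_commute_high_income <- acs_df |>
  filter(time_to_work >= 35, income > 50000)

long_commute_high_income
```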
Question
Use the mutate() command to generate a new variable in the subset_df data frame that measures age in months instead of years.
# A tibble: 2,000 × 6
income age hrs_work time_to_work income_1000 age_months
<int> <int> <int> <int> <dbl> <dbl>
1 60000 68 40 NA 60 816
2 0 88 NA NA 0 1056
3 NA 12 NA NA NA 144
4 0 17 NA NA 0 204
5 0 77 NA NA 0 924
6 1700 35 40 15 1.7 420
7 NA 11 NA NA NA 132
8 NA 7 NA NA NA 84
9 NA 6 NA NA NA 72
10 45000 27 84 40 45 324
# ℹ 1,990 more rows
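The conversion from years to months can be sketched as below. The definition of subset_df is not shown in this section, so the first pipeline is an assumption that simply rebuilds a data frame with the same columns as the printed output (income, age, hrs_work, time_to_work, income_1000); the lab's own definition may differ.

```r
library(tidyverse)

acs_df <- openintro::acs12

# Assumption: rebuild subset_df to match the columns in the printed output.
subset_df <- acs_df |>
  select(income, age, hrs_work, time_to_work) |>
  mutate(income_1000 = income / 1000)

# Add age in months: 12 months per year of age.
subset_df <- subset_df |>
  mutate(age_months = age * 12)

subset_df
```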
Question
Insert a new code chunk into your Quarto document, and copy the template code below into your new code chunk.
Modify this code template so that the arrange() function sorts the subset_df data frame by the hrs_work variable.
# A tibble: 2,000 × 6
income age hrs_work time_to_work income_1000 age_months
<int> <int> <int> <int> <dbl> <dbl>
1 300 20 1 30 0.3 240
2 750 49 1 NA 0.75 588
3 110 24 2 NA 0.11 288
4 300 84 2 NA 0.3 1008
5 100 33 4 NA 0.1 396
6 0 50 4 35 0 600
7 1800 18 4 NA 1.8 216
8 180 65 4 NA 0.18 780
9 1200 25 5 30 1.2 300
10 1300 21 5 5 1.3 252
# ℹ 1,990 more rows
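A sketch of the sort is given below. As before, subset_df is rebuilt here under the assumption that it holds the columns shown in the printed output, and sorted_by_hours is a hypothetical name. By default arrange() sorts in ascending order and places missing values (NA) at the end.

```r
library(tidyverse)

acs_df <- openintro::acs12

# Assumption: rebuild subset_df to match the columns in the printed output.
subset_df <- acs_df |>
  select(income, age, hrs_work, time_to_work) |>
  mutate(income_1000 = income / 1000,
         age_months = age * 12)

# Sort rows from fewest to most hours worked; NA values go last.
sorted_by_hours <- subset_df |>
  arrange(hrs_work)

sorted_by_hours
```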
Question
Insert a new code chunk into your Quarto document, and write code that will arrange the subset_df data frame by time_to_work in descending order.
# A tibble: 2,000 × 6
income age hrs_work time_to_work income_1000 age_months
<int> <int> <int> <int> <dbl> <dbl>
1 38000 51 35 163 38 612
2 40000 53 37 157 40 636
3 110000 28 65 152 110 336
4 50000 46 70 148 50 552
5 56000 42 50 145 56 504
6 100000 40 40 142 100 480
7 17100 44 40 142 17.1 528
8 18000 32 40 128 18 384
9 1200 50 17 125 1.2 600
10 60000 61 45 119 60 732
# ℹ 1,990 more rows
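Descending order can be requested by wrapping the sorting variable in desc(). The sketch below again rebuilds subset_df under the assumption that it contains the columns shown in the printed output; sorted_by_commute is a hypothetical name.

```r
library(tidyverse)

acs_df <- openintro::acs12

# Assumption: rebuild subset_df to match the columns in the printed output.
subset_df <- acs_df |>
  select(income, age, hrs_work, time_to_work) |>
  mutate(income_1000 = income / 1000,
         age_months = age * 12)

# Sort rows from longest to shortest commute; NA values still go last.
sorted_by_commute <- subset_df |>
  arrange(desc(time_to_work))

sorted_by_commute
```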
Question
Insert a new code chunk into your Quarto document, and copy the template code below into your new code chunk.
Modify this template so that the select() function keeps just the variables hrs_work, income, and married, and the filter() function keeps just the observations where income is less than $30,000.
# A tibble: 1,171 × 3
hrs_work income married
<int> <int> <fct>
1 NA 0 no
2 NA 0 no
3 NA 0 no
4 40 1700 yes
5 23 8600 no
6 NA 0 yes
7 NA 0 no
8 8 4000 yes
9 35 19000 yes
10 25 3400 no
# ℹ 1,161 more rows
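Chaining select() and filter() might look like the sketch below. The template is not reproduced in this section, so the order of the two steps, the native pipe |>, and the object name low_income are assumptions. Rows with a missing income are dropped, because the comparison NA < 30000 evaluates to NA rather than TRUE.

```r
library(tidyverse)

acs_df <- openintro::acs12

# Keep three columns, then keep only rows with income below $30,000.
# low_income is a hypothetical name chosen for this sketch.
low_income <- acs_df |>
  select(hrs_work, income, married) |>
  filter(income < 30000)

low_income
```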
Question
Insert a new code chunk into your Quarto document, and write a data wrangling “pipeline” that will:
- generate a new variable named income_1000 that measures income in thousands of dollars (instead of dollars)
- sort the data in ascending order of hours worked
- select just the hours worked and income in thousands of dollars variables
# A tibble: 2,000 × 2
hrs_work income_1000
<int> <dbl>
1 1 0.3
2 1 0.75
3 2 0.11
4 2 0.3
5 4 0.1
6 4 0
7 4 1.8
8 4 0.18
9 5 1.2
10 5 1.3
# ℹ 1,990 more rows
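The three steps of this pipeline could be chained as in the sketch below. It assumes the pipeline starts from acs_df (the printed output has 2,000 rows), and the object name hours_income is hypothetical.

```r
library(tidyverse)

acs_df <- openintro::acs12

# hours_income is a hypothetical name chosen for this sketch.
hours_income <- acs_df |>
  mutate(income_1000 = income / 1000) |>  # income in thousands of dollars
  arrange(hrs_work) |>                    # ascending hours worked, NA last
  select(hrs_work, income_1000)           # keep only the two columns of interest

hours_income
```

Each step receives the data frame produced by the step before it, so income_1000 must be created before select() can refer to it.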