Lab 4: Introduction to Data Wrangling

Question

Insert a new code chunk in your Quarto document. In this code chunk, write code that will:

  1. Load the tidyverse package
  2. Import the acs12 data set from the openintro package, and store a copy of it as an object named acs_df in our environment.
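
A minimal sketch of one way to complete these two steps, assuming the tidyverse and openintro packages are already installed. Attaching openintro with library() is an assumption; referring to the data as openintro::acs12 would work as well. The glimpse() call at the end is included only because the next question refers to its output.

```r
# Load the tidyverse (dplyr, ggplot2, and related packages)
library(tidyverse)

# Attach openintro so that the acs12 data set is available
library(openintro)

# Store a copy of acs12 in the environment as acs_df
acs_df <- acs12

# Preview each column: its name, its type, and its first few values
glimpse(acs_df)
```
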
Question

Based on the output from the glimpse() function, answer the following questions:

  1. How many variables directly related to employment are measured in these data?
  2. Is level of education measured as a numeric or categorical variable?

  1. Three of the variables seem directly related to employment: employment, hrs_work, and time_to_work. You could argue that income is also employment-related, but a person's income could instead be "passive income" (e.g., from stocks or rental properties).
  2. The <fct> label next to edu tells us that education is stored as a factor, so it is a categorical variable.
Question

Insert a new code chunk into your Quarto document. Then, copy the template code below into this code chunk:

Modify this template so that the select() function keeps only the variables married, race, and edu.
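
One possible completed version is sketched below. It assumes acs_df was created from openintro's acs12 as in the first question, and it uses the native pipe |> (an assumption; the magrittr pipe %>% would behave the same way here).

```r
library(tidyverse)
library(openintro)

acs_df <- acs12

# Keep only the three named columns; every row is retained
acs_df |>
  select(married, race, edu)
```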

# A tibble: 2,000 × 3
   married race  edu        
   <fct>   <fct> <fct>      
 1 no      white college    
 2 no      white hs or lower
 3 no      white hs or lower
 4 no      white hs or lower
 5 no      white hs or lower
 6 yes     other hs or lower
 7 no      white hs or lower
 8 no      other hs or lower
 9 no      asian hs or lower
10 yes     white hs or lower
# ℹ 1,990 more rows
Question

Insert a new code chunk into your Quarto document, and write code that will:

  1. Select the employment, time_to_work, and hrs_work variables from the acs_df data set.
  2. Store this new subset of the ACS data in an object named acs_employment.
  3. Print out a “glimpse” of the acs_employment data.
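
The three steps above could be sketched as follows, assuming acs_df holds a copy of openintro's acs12. The native pipe |> is an assumption; %>% would work equally well.

```r
library(tidyverse)
library(openintro)

acs_df <- acs12

# Steps 1 and 2: keep the three employment-related columns and store the result
acs_employment <- acs_df |>
  select(employment, time_to_work, hrs_work)

# Step 3: print a compact, column-by-column preview
glimpse(acs_employment)
```
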
Rows: 2,000
Columns: 3
$ employment   <fct> not in labor force, not in labor force, NA, not in labor …
$ time_to_work <int> NA, NA, NA, NA, NA, 15, NA, NA, NA, 40, NA, 5, NA, NA, NA…
$ hrs_work     <int> 40, NA, NA, NA, NA, 40, NA, NA, NA, 84, NA, 23, NA, NA, N…
Question

Insert a new code chunk into your Quarto document, and copy the template code below into your new code chunk.

Modify this template so that the filter() function only keeps observations where time_to_work is greater than or equal to 35 minutes and income is strictly greater than fifty thousand dollars ($50,000).
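
A sketch of one way the completed code might look, assuming acs_df is a copy of openintro's acs12 and using the native pipe |> (an assumption). Separating the two conditions with a comma makes filter() require both of them. Rows where either condition evaluates to NA are dropped as well.

```r
library(tidyverse)
library(openintro)

acs_df <- acs12

# Keep rows with a commute of at least 35 minutes AND income above 50000
acs_df |>
  filter(time_to_work >= 35, income > 50000)
```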

# A tibble: 68 × 13
   income employment hrs_work race    age gender citizen time_to_work lang   
    <int> <fct>         <int> <fct> <int> <fct>  <fct>          <int> <fct>  
 1  70000 employed         50 black    50 male   yes               40 english
 2  85000 employed         50 white    33 male   yes               65 english
 3 100000 employed         40 white    27 male   yes               45 english
 4  55000 employed         60 white    66 male   yes               50 english
 5  58000 employed         50 white    59 male   yes               45 english
 6  60000 employed         40 other    48 male   yes               45 english
 7  90000 employed         50 asian    30 male   yes               45 other  
 8 100000 employed         40 white    40 male   yes              142 other  
 9 360000 employed         50 white    52 male   yes               40 english
10  89000 employed         55 white    43 male   yes               40 english
# ℹ 58 more rows
# ℹ 4 more variables: married <fct>, edu <fct>, disability <fct>,
#   birth_qrtr <fct>
Question

Use the mutate() function to generate a new variable in the subset_df data frame that measures age in months instead of years.
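
A minimal sketch follows. How subset_df was originally built is not shown here, so the first step is a hypothetical reconstruction inferred from the columns in the printed output (income, age, hrs_work, time_to_work, and income_1000). The variable name age_months comes from that output; the native pipe |> is an assumption.

```r
library(tidyverse)
library(openintro)

acs_df <- acs12

# ASSUMPTION: a reconstruction of subset_df based on the columns in the output
subset_df <- acs_df |>
  select(income, age, hrs_work, time_to_work) |>
  mutate(income_1000 = income / 1000)

# Convert age from years to months and store the result back in subset_df
subset_df <- subset_df |>
  mutate(age_months = age * 12)

subset_df
```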

# A tibble: 2,000 × 6
   income   age hrs_work time_to_work income_1000 age_months
    <int> <int>    <int>        <int>       <dbl>      <dbl>
 1  60000    68       40           NA        60          816
 2      0    88       NA           NA         0         1056
 3     NA    12       NA           NA        NA          144
 4      0    17       NA           NA         0          204
 5      0    77       NA           NA         0          924
 6   1700    35       40           15         1.7        420
 7     NA    11       NA           NA        NA          132
 8     NA     7       NA           NA        NA           84
 9     NA     6       NA           NA        NA           72
10  45000    27       84           40        45          324
# ℹ 1,990 more rows
Question

Insert a new code chunk into your Quarto document, and copy the template code below into your new code chunk.

Modify this code template so that the arrange() function sorts the subset_df data frame by the hrs_work variable.
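
One way the completed code might look is sketched below. The construction of subset_df is a hypothetical reconstruction based on the columns in the printed output, and the native pipe |> is an assumption. By default arrange() sorts in ascending order and places missing values last.

```r
library(tidyverse)
library(openintro)

# ASSUMPTION: a reconstruction of subset_df based on the columns in the output
subset_df <- acs12 |>
  select(income, age, hrs_work, time_to_work) |>
  mutate(income_1000 = income / 1000, age_months = age * 12)

# Sort rows from the fewest to the most hours worked (NA values go last)
subset_df |>
  arrange(hrs_work)
```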

# A tibble: 2,000 × 6
   income   age hrs_work time_to_work income_1000 age_months
    <int> <int>    <int>        <int>       <dbl>      <dbl>
 1    300    20        1           30        0.3         240
 2    750    49        1           NA        0.75        588
 3    110    24        2           NA        0.11        288
 4    300    84        2           NA        0.3        1008
 5    100    33        4           NA        0.1         396
 6      0    50        4           35        0           600
 7   1800    18        4           NA        1.8         216
 8    180    65        4           NA        0.18        780
 9   1200    25        5           30        1.2         300
10   1300    21        5            5        1.3         252
# ℹ 1,990 more rows
Question

Insert a new code chunk into your Quarto document, and write code that will arrange the subset_df data frame by time_to_work in descending order.
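
A sketch of one possible answer. As before, the construction of subset_df is a hypothetical reconstruction based on the columns in the printed output, and the native pipe |> is an assumption. Wrapping a variable in desc() reverses the default ascending order of arrange().

```r
library(tidyverse)
library(openintro)

# ASSUMPTION: a reconstruction of subset_df based on the columns in the output
subset_df <- acs12 |>
  select(income, age, hrs_work, time_to_work) |>
  mutate(income_1000 = income / 1000, age_months = age * 12)

# Sort rows from the longest to the shortest commute (NA values still go last)
subset_df |>
  arrange(desc(time_to_work))
```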

# A tibble: 2,000 × 6
   income   age hrs_work time_to_work income_1000 age_months
    <int> <int>    <int>        <int>       <dbl>      <dbl>
 1  38000    51       35          163        38          612
 2  40000    53       37          157        40          636
 3 110000    28       65          152       110          336
 4  50000    46       70          148        50          552
 5  56000    42       50          145        56          504
 6 100000    40       40          142       100          480
 7  17100    44       40          142        17.1        528
 8  18000    32       40          128        18          384
 9   1200    50       17          125         1.2        600
10  60000    61       45          119        60          732
# ℹ 1,990 more rows
Question

Insert a new code chunk into your Quarto document, and copy the template code below into your new code chunk.

Modify this template so that the select() function keeps just the variables hrs_work, income, and married, and the filter() function keeps just the observations where income is less than $30,000.
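
One possible completed pipeline is sketched below, assuming acs_df is a copy of openintro's acs12. The order of the two steps in the original template is not known; because income is one of the selected columns, selecting first and filtering second (as here) gives the same rows as the reverse order. The native pipe |> is also an assumption.

```r
library(tidyverse)
library(openintro)

acs_df <- acs12

acs_df |>
  # Keep only the three columns of interest
  select(hrs_work, income, married) |>
  # Keep rows with income below 30000 (rows with missing income are dropped)
  filter(income < 30000)
```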

# A tibble: 1,171 × 3
   hrs_work income married
      <int>  <int> <fct>  
 1       NA      0 no     
 2       NA      0 no     
 3       NA      0 no     
 4       40   1700 yes    
 5       23   8600 no     
 6       NA      0 yes    
 7       NA      0 no     
 8        8   4000 yes    
 9       35  19000 yes    
10       25   3400 no     
# ℹ 1,161 more rows
Question

Insert a new code chunk into your Quarto document, and write a data wrangling “pipeline” that will:

  • generate a new variable named income_1000 that measures income in thousands of dollars (instead of dollars)
  • sort the data in ascending order of hours worked
  • select just the hours worked and income in thousands of dollars variables
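
The three steps above could be chained as in the sketch below. It assumes the pipeline starts from acs_df (a copy of openintro's acs12) and uses the native pipe |>; both are assumptions. The order matters: income_1000 has to exist before select() can keep it.

```r
library(tidyverse)
library(openintro)

acs_df <- acs12

acs_df |>
  # Step 1: express income in thousands of dollars
  mutate(income_1000 = income / 1000) |>
  # Step 2: sort from the fewest to the most hours worked
  arrange(hrs_work) |>
  # Step 3: keep only the two columns of interest
  select(hrs_work, income_1000)
```
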
# A tibble: 2,000 × 2
   hrs_work income_1000
      <int>       <dbl>
 1        1        0.3 
 2        1        0.75
 3        2        0.11
 4        2        0.3 
 5        4        0.1 
 6        4        0   
 7        4        1.8 
 8        4        0.18
 9        5        1.2 
10        5        1.3 
# ℹ 1,990 more rows