Lab 4: Introduction to Data Wrangling

Question

Use the code below to:

  1. Load the tidyverse package.
  2. Import the acs12 data from the openintro package, and assign it to an object called acs_df in your environment.
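A minimal sketch of the two steps above, assuming the tidyverse and openintro packages are already installed. The as_tibble() call is an addition, not part of the instructions; it only makes sure the data prints in the tibble format used throughout this lab.

```r
# Step 1: load the tidyverse (dplyr, tibble, and related packages).
library(tidyverse)

# Step 2: import acs12 from the openintro package and store it as acs_df.
# as_tibble() is a precaution in case the installed data is a plain data frame.
acs_df <- as_tibble(openintro::acs12)

# Quick check of the dimensions (rows, then columns).
dim(acs_df)
```

After running this, acs_df should appear in the Environment pane and can be used in the remaining questions.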

Question

Using the code above as an example, use the select() function to keep the variables married, race, and edu. The template code below is a guide for selecting the columns.
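One way this could look, as a sketch rather than the official template. The name selected_df is hypothetical and only used here so the result can be inspected.

```r
library(tidyverse)
acs_df <- as_tibble(openintro::acs12)

# select() keeps only the named columns, in the order they are listed.
selected_df <- acs_df %>%
  select(married, race, edu)

# Print the result; every row is kept, only the columns change.
selected_df
```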

# A tibble: 2,000 × 3
   married race  edu        
   <fct>   <fct> <fct>      
 1 no      white college    
 2 no      white hs or lower
 3 no      white hs or lower
 4 no      white hs or lower
 5 no      white hs or lower
 6 yes     other hs or lower
 7 no      white hs or lower
 8 no      other hs or lower
 9 no      asian hs or lower
10 yes     white hs or lower
# ℹ 1,990 more rows

Question

Select the following variables: citizen, time_to_work, and lang.
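The same pattern applies here. As a variation, the sketch below stores the column names in a character vector and passes it through all_of(); the names keep_vars and selected_df are hypothetical. Writing select(citizen, time_to_work, lang) directly would give the same result.

```r
library(tidyverse)
acs_df <- as_tibble(openintro::acs12)

# Column names stored as strings (hypothetical helper vector).
keep_vars <- c("citizen", "time_to_work", "lang")

# all_of() tells select() to treat the strings as column names
# and to raise an error if any of them is missing.
selected_df <- acs_df %>%
  select(all_of(keep_vars))

selected_df
```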

# A tibble: 2,000 × 3
   citizen time_to_work lang   
   <fct>          <int> <fct>  
 1 yes               NA english
 2 yes               NA english
 3 yes               NA english
 4 yes               NA other  
 5 yes               NA other  
 6 yes               15 other  
 7 yes               NA english
 8 yes               NA english
 9 yes               NA other  
10 yes               40 english
# ℹ 1,990 more rows

Question

Use the template below to keep only observations where time_to_work is greater than or equal to 35 minutes and income is strictly greater than fifty thousand dollars ($50,000).
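A sketch of how the two conditions might be combined, assuming acs_df was created as in the first question. The name filtered_df is hypothetical. Conditions separated by a comma inside filter() are joined with AND, and rows where either variable is NA are dropped.

```r
library(tidyverse)
acs_df <- as_tibble(openintro::acs12)

# Keep rows that satisfy BOTH conditions:
#   time_to_work >= 35   (greater than or equal to 35 minutes)
#   income > 50000       (strictly greater than $50,000)
filtered_df <- acs_df %>%
  filter(time_to_work >= 35, income > 50000)

filtered_df
```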

# A tibble: 68 × 13
   income employment hrs_work race    age gender citizen time_to_work lang   
    <int> <fct>         <int> <fct> <int> <fct>  <fct>          <int> <fct>  
 1  70000 employed         50 black    50 male   yes               40 english
 2  85000 employed         50 white    33 male   yes               65 english
 3 100000 employed         40 white    27 male   yes               45 english
 4  55000 employed         60 white    66 male   yes               50 english
 5  58000 employed         50 white    59 male   yes               45 english
 6  60000 employed         40 other    48 male   yes               45 english
 7  90000 employed         50 asian    30 male   yes               45 other  
 8 100000 employed         40 white    40 male   yes              142 other  
 9 360000 employed         50 white    52 male   yes               40 english
10  89000 employed         55 white    43 male   yes               40 english
# ℹ 58 more rows
# ℹ 4 more variables: married <fct>, edu <fct>, disability <fct>,
#   birth_qrtr <fct>

Question

Use the mutate() function to generate a new variable in the subset_df data frame that measures age in months instead of years.
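A sketch of the mutate() step. The lab text shown here does not say how subset_df was built, so the definition below is an assumption inferred from the printed output (five columns, female respondents only); substitute your own subset_df if it differs.

```r
library(tidyverse)
acs_df <- as_tibble(openintro::acs12)

# ASSUMPTION: subset_df is reconstructed from the printed output,
# not taken from the lab's own code.
subset_df <- acs_df %>%
  filter(gender == "female") %>%
  select(income, age, gender, hrs_work, time_to_work)

# mutate() adds a new column; age is in years, so multiply by 12.
subset_df <- subset_df %>%
  mutate(age_months = age * 12)

subset_df
```

Assigning the result back to subset_df keeps the new age_months column available for later questions.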

# A tibble: 969 × 6
   income   age gender hrs_work time_to_work age_months
    <int> <int> <fct>     <int>        <int>      <dbl>
 1  60000    68 female       40           NA        816
 2     NA    12 female       NA           NA        144
 3      0    77 female       NA           NA        924
 4   1700    35 female       40           15        420
 5     NA     8 female       NA           NA         96
 6   8600    69 female       23            5        828
 7   4000    67 female        8           10        804
 8  19000    36 female       35           15        432
 9     NA    12 female       NA           NA        144
10   1200    18 female       12           NA        216
# ℹ 959 more rows

Question

Use the template below to arrange the subset_df data frame by the hrs_work variable.
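A sketch of the arrange() step. As before, the construction of subset_df is an assumption inferred from the printed output, and arranged_df is a hypothetical name. By default arrange() sorts in ascending order and places NA values last.

```r
library(tidyverse)
acs_df <- as_tibble(openintro::acs12)

# ASSUMPTION: subset_df is reconstructed from the printed output.
subset_df <- acs_df %>%
  filter(gender == "female") %>%
  select(income, age, gender, hrs_work, time_to_work) %>%
  mutate(age_months = age * 12)

# Sort rows from the fewest to the most hours worked.
arranged_df <- subset_df %>%
  arrange(hrs_work)

arranged_df
```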

# A tibble: 969 × 6
   income   age gender hrs_work time_to_work age_months
    <int> <int> <fct>     <int>        <int>      <dbl>
 1   1800    18 female        4           NA        216
 2    180    65 female        4           NA        780
 3   1300    21 female        5            5        252
 4    680    19 female        5            5        228
 5     50    30 female        5           20        360
 6    850    19 female        6           10        228
 7   1300    78 female        6            2        936
 8      0    66 female        6           15        792
 9   1000    64 female        6           10        768
10   4000    67 female        8           10        804
# ℹ 959 more rows

Question

Generate code that will arrange the subset_df data frame by time_to_work in descending order.
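One possible answer, sketched under the same assumption about how subset_df was built (inferred from the printed output). Wrapping a variable in desc() reverses the sort order; arranged_desc_df is a hypothetical name.

```r
library(tidyverse)
acs_df <- as_tibble(openintro::acs12)

# ASSUMPTION: subset_df is reconstructed from the printed output.
subset_df <- acs_df %>%
  filter(gender == "female") %>%
  select(income, age, gender, hrs_work, time_to_work) %>%
  mutate(age_months = age * 12)

# desc() sorts from the longest commute to the shortest; NA values still go last.
arranged_desc_df <- subset_df %>%
  arrange(desc(time_to_work))

arranged_desc_df
```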

# A tibble: 969 × 6
   income   age gender hrs_work time_to_work age_months
    <int> <int> <fct>     <int>        <int>      <dbl>
 1  38000    51 female       35          163        612
 2  40000    53 female       37          157        636
 3  56000    42 female       50          145        504
 4  18000    32 female       40          128        384
 5  65000    38 female       40           90        456
 6  36000    41 female       36           90        492
 7  60000    54 female       40           90        648
 8  62000    48 female       40           80        576
 9      0    37 female       30           75        444
10 150000    62 female       40           75        744
# ℹ 959 more rows

Question

Using the template below, select the variables hrs_work, income, and married, and keep observations where income is less than $30,000.
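A sketch that chains the two steps with the pipe, assuming acs_df was created as in the first question. The name low_income_df is hypothetical. Because income is one of the selected columns, the order of select() and filter() does not change the result here.

```r
library(tidyverse)
acs_df <- as_tibble(openintro::acs12)

low_income_df <- acs_df %>%
  # Keep only the three columns of interest.
  select(hrs_work, income, married) %>%
  # Keep rows with income strictly below $30,000 (rows with NA income are dropped).
  filter(income < 30000)

low_income_df
```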

# A tibble: 1,171 × 3
   hrs_work income married
      <int>  <int> <fct>  
 1       NA      0 no     
 2       NA      0 no     
 3       NA      0 no     
 4       40   1700 yes    
 5       23   8600 no     
 6       NA      0 yes    
 7       NA      0 no     
 8        8   4000 yes    
 9       35  19000 yes    
10       25   3400 no     
# ℹ 1,161 more rows

Question

Use the pipe operator to generate a variable that measures income in units of $10,000, and then arrange the data by hrs_work. Then select these two variables.
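A sketch of the full pipeline, assuming acs_df was created as in the first question. The column name income_10000 matches the printed output; the object name income_df is hypothetical.

```r
library(tidyverse)
acs_df <- as_tibble(openintro::acs12)

income_df <- acs_df %>%
  # New variable: income expressed in units of $10,000.
  mutate(income_10000 = income / 10000) %>%
  # Sort by hours worked, ascending (NA values last).
  arrange(hrs_work) %>%
  # Keep only the sorting variable and the new variable.
  select(hrs_work, income_10000)

income_df
```

Each pipe passes the result of one step to the next, so the code reads top to bottom in the same order as the instructions.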

# A tibble: 2,000 × 2
   hrs_work income_10000
      <int>        <dbl>
 1        1        0.03 
 2        1        0.075
 3        2        0.011
 4        2        0.03 
 5        4        0.01 
 6        4        0    
 7        4        0.18 
 8        4        0.018
 9        5        0.12 
10        5        0.13 
# ℹ 1,990 more rows