Lab 5: Summarizing Data

Question

Load the mosaicData R package in your Quarto document, and then visit the documentation page for the SaratogaHouses data set.

Question

Import the Saratoga Houses data set into your environment using the command SaratogaHouses <- mosaicData::SaratogaHouses in your Quarto document. Use nrow() to compute the number of rows in SaratogaHouses.

[1] 1728

Question

How old was a “typical” home in Saratoga County during 2006? To find out, use summarize() to compute the median age of all the homes in the SaratogaHouses data set. Be sure to think about what “ingredients” you need to supply for your code to work!

  median_age
1         19
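
As a minimal sketch of the “ingredients” that summarize() needs, the example below uses a small made-up data frame rather than the lab data. The names toy_homes, toy_summary, and the column values are hypothetical and only illustrate the pattern; it assumes the dplyr package is installed and R is version 4.1 or later (for the |> pipe).

```r
library(dplyr)

# Hypothetical toy data, not the SaratogaHouses data set
toy_homes <- data.frame(
  age   = c(3, 10, 25, 40, 120),
  price = c(310000, 250000, 180000, 150000, 95000)
)

# Ingredients: (1) the data frame, (2) a name for the summary (left of =),
# (3) a summary function applied to one variable (right of =)
toy_summary <- toy_homes |>
  summarize(median_age = median(age))

toy_summary  # one row, one column; median_age is 25 for this toy data
```
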

Question

What was the average amount of living space of a home for sale in Saratoga County during 2006? Write your own code “from scratch” to compute this summary.

Be sure to think about:

  1. What variable in the data set measures home size?
  2. What function in R calculates the average?
  3. What are the units in the output?
  4. What a good name for your summary would be!

  avg_size
1 1754.976

Question

Does the price of a home depend on the type of heating system it has? To find out, use the incomplete code below as a template and compute the median price of homes, taking into account the type of heating system each home has. Be sure to look back at the variables in the data set to figure out which one tells you what type of heating system a home uses.

# A tibble: 3 × 2
  heating         median_price
  <fct>                  <dbl>
1 hot air               200000
2 hot water/steam       199700
3 electric              149000
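
The general shape of a grouped summary can be sketched on made-up data. This is not the lab's template: toy_homes, style, and toy_by_style are hypothetical names, and the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data, not the SaratogaHouses data set
toy_homes <- data.frame(
  style = c("ranch", "ranch", "ranch", "colonial", "colonial"),
  price = c(100000, 150000, 300000, 200000, 400000)
)

# group_by() names the grouping variable;
# summarize() then computes one summary row per group
toy_by_style <- toy_homes |>
  group_by(style) |>
  summarize(median_price = median(price))

toy_by_style  # colonial: 300000, ranch: 150000
```

The only change from an ungrouped summary is the extra group_by() step before summarize().
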

Question

How does the average size of a home depend on the number of bedrooms in the home? Write your own code “from scratch” to compute this summary. Be sure to think about:

  1. Which variable should be the “grouping” variable (i.e., which variable you should look at to figure out which group a home belongs to) and
  2. Which variable should be the summarized variable?

# A tibble: 7 × 2
  bedrooms avg_size
     <int>    <dbl>
1        1     885.
2        2    1202.
3        3    1628.
4        4    2273.
5        5    2476.
6        6    3060.
7        7    2521.

Question

Get some practice using the n() function to summarize the frequency of different category values by using the incomplete code below as a template to figure out how many homes use electric, gas, or oil as their heating fuel. Be sure to look back in the documentation for the SaratogaHouses data set to figure out which variable measures what kind of fuel a home uses, so you know which variable to group the data by!

# A tibble: 3 × 2
  fuel     num_homes
  <fct>        <int>
1 gas           1197
2 electric       315
3 oil            216
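
To show how n() behaves inside summarize(), here is a minimal sketch on made-up data. The names toy_pets, kind, and num_pets are hypothetical, and the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: one row per pet
toy_pets <- data.frame(
  kind = c("cat", "dog", "dog", "fish", "dog", "cat")
)

# n() takes no arguments: it counts the rows in the current group
toy_counts <- toy_pets |>
  group_by(kind) |>
  summarize(num_pets = n())

toy_counts  # cat: 2, dog: 3, fish: 1
```

Because n() counts rows rather than summarizing a variable, the grouping variable is the only column you need to choose.
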

Question

Write your own code “from scratch” to calculate how many homes in the data set do and do not have a waterfront on their property.

# A tibble: 2 × 2
  waterfront num_homes
  <fct>          <int>
1 Yes               15
2 No              1713

Question

Use the is.na() function and the penguins data set to show only the penguins with a missing body mass.

# A tibble: 2 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen             NA            NA                NA          NA
2 Gentoo  Biscoe                NA            NA                NA          NA
# ℹ 2 more variables: sex <fct>, year <int>

You should see that only 2 penguins are missing body mass observations, one Adelie and one Gentoo, out of 344 penguins in total. With so few missing values, it’s unlikely that the mean mass would change much even if the masses of these two penguins were serious outliers. In cases like this, it’s often okay to exclude the NA values and take the average of all known observations. If a large portion of the data (either overall or within a group) were missing, we might not want to do this.
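
One way to check both which rows are missing a value and how large a share of the data they represent is sketched below on made-up data. The names toy_birds, mass_g, toy_missing, and share_missing are hypothetical, and the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data with two missing masses
toy_birds <- data.frame(
  name   = c("a", "b", "c", "d"),
  mass_g = c(3500, NA, 4100, NA)
)

# is.na() returns TRUE where a value is missing;
# filter() keeps only the rows where the condition is TRUE
toy_missing <- toy_birds |>
  filter(is.na(mass_g))

# The mean of a TRUE/FALSE vector is the proportion of TRUEs,
# so this is the share of rows with a missing mass
share_missing <- mean(is.na(toy_birds$mass_g))

toy_missing    # rows "b" and "d"
share_missing  # 0.5 for this toy data
```

In this toy data half of the values are missing, which is the kind of situation where simply dropping NA values deserves more thought.
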

We tell R to exclude NA values by using the na.rm argument of the mean() function. Setting na.rm = TRUE removes NA values before the calculation. It’s an optional argument, meaning you don’t need to include it unless you want to change the function’s default behavior, which is to return NA whenever any value is missing.
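
The effect of na.rm can be seen on a short made-up vector. The names masses, default_mean, and clean_mean are hypothetical; the sketch uses only base R.

```r
# Hypothetical toy vector with one missing value
masses <- c(3500, NA, 4100)

# Default behavior (na.rm = FALSE): any NA makes the result NA
default_mean <- mean(masses)

# With na.rm = TRUE the NA is dropped first, so this averages 3500 and 4100
clean_mean <- mean(masses, na.rm = TRUE)

default_mean  # NA
clean_mean    # 3800
```
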

Question

Use the na.rm argument of the mean() function to calculate the average body mass in grams for each species of penguin with NA values removed. If you’re not sure how to add this argument, refer to the documentation for the mean() function.

# A tibble: 3 × 2
  species   avg_body_mass_g
  <fct>               <dbl>
1 Adelie              3701.
2 Chinstrap           3733.
3 Gentoo              5076.

You should now see numeric values for all three penguin species.

Many other functions that you use within summarize() will behave similarly to the mean() function here and will also have the na.rm optional argument.
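
As a sketch of that point, median(), max(), and sd() all accept na.rm in the same way. The data and the names toy_birds, colony, and toy_stats below are hypothetical, and the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data with one missing mass in colony "x"
toy_birds <- data.frame(
  colony = c("x", "x", "x", "y", "y"),
  mass_g = c(10, NA, 30, 5, 7)
)

toy_stats <- toy_birds |>
  group_by(colony) |>
  summarize(
    median_mass = median(mass_g, na.rm = TRUE),
    max_mass    = max(mass_g, na.rm = TRUE),
    sd_mass     = sd(mass_g, na.rm = TRUE),
    # Counting the NAs per group shows how much data each summary ignored
    num_missing = sum(is.na(mass_g))
  )

toy_stats
```

Reporting num_missing next to the summaries makes it easier to judge whether dropping the NA values was reasonable for each group.
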