A typical refrain from data scientists in the wild is that 80-90% of their time is spent wrangling data.
But for data scientists in the classroom, it’s hard to recreate those wild settings, which require working through the entire data science process. In this assignment, you will create the data, wrangle the data, and visualize the data. The data you will collect and visualize are based on a Google Calendar.
TODO: listen. nothing to include right here.
TODO: write down the question that you might be able to answer given the data below.
TODO: create the dataset. nothing to include right here.
(The ical package will read in files from calendars other than Google. Feel free to use any calendar system you want!) Note that in the Console (look at the window below!) you’ll need to install the two packages we used: install.packages("tidyverse") and install.packages("ical"). Don’t install the packages in the Rmd file.
library(tidyverse)
library(ical)
# Load ics file using ical package
daily_events <-
  "Math154_u6i1mi8treredmpe526b3qv7j4@group.calendar.google.com.ics" %>%
  ical_parse_df() %>%
  as_tibble()
TODO: change the name of the file you are importing. remember to include the actual data file in your GitHub repository so that the grader can knit your .Rmd file.
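Because the grader knits the .Rmd from your repository, a missing or misnamed data file is a likely reason for the document to fail. The following is a minimal sketch of a check you could run first, written in base R. The helper name check_ics_path and the file name my_calendar.ics are hypothetical and not part of the assignment.

```r
# Hypothetical helper: TRUE only if `path` exists and ends in ".ics".
check_ics_path <- function(path) {
  file.exists(path) && grepl("\\.ics$", path, ignore.case = TRUE)
}

# Made-up file name; swap in the name of your own calendar export.
ics_path <- "my_calendar.ics"

if (!check_ics_path(ics_path)) {
  message(
    "Cannot find '", ics_path, "' in ", getwd(),
    "; add the .ics file to your repository before knitting."
  )
}
```

The check only prints a message rather than stopping, so it is safe to leave in the Console while you are still setting up the repository.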
TODO: feel free to directly use the R code for wrangle & visualize (or expand it if you want to!). write a sentence or two to summarize.
daily_events <- daily_events %>% as_tibble() %>%
  mutate(
    start_datetime = lubridate::with_tz(start, tzone = "America/Los_Angeles"),
    end_datetime = lubridate::with_tz(end, tzone = "America/Los_Angeles"),
    # Ask for seconds explicitly; plain subtraction lets R choose the units
    seconds = as.numeric(difftime(end_datetime, start_datetime, units = "secs")),
    minutes = seconds / 60,
    hours = minutes / 60,
    time_unit = lubridate::floor_date(start_datetime, unit = "day")
  ) %>%
  # I only want information since September
  dplyr::filter(time_unit > "2019-09-01") %>%
  # Separate out summary and subtask
  separate(summary, c("summary", "subtask"), sep = ":") %>%
  select(-subtask)
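When you compute an event's duration from two date-times in R, the units of the result depend on how the difference is taken. The sketch below uses two made-up events (not real calendar data) to show that plain subtraction lets R pick the units, while difftime() with an explicit units argument always returns seconds.

```r
# Two made-up events: one lasting 1 hour, one lasting 2 hours.
start <- as.POSIXct(c("2019-09-02 09:00:00", "2019-09-03 13:00:00"), tz = "UTC")
end   <- as.POSIXct(c("2019-09-02 10:00:00", "2019-09-03 15:00:00"), tz = "UTC")

# Plain subtraction: R chooses the units from the size of the differences.
auto_diff  <- end - start
auto_units <- units(auto_diff)  # "hours" for these events, not "secs"

# Explicit units: the result is always in seconds, whatever the durations.
secs <- as.numeric(difftime(end, start, units = "secs"))
mins <- secs / 60
hrs  <- mins / 60

auto_units
hrs
```

If your calendar mixes very short and very long events, checking units() on the raw difference is a quick way to confirm that your minutes and hours columns mean what their names say.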
daily_events %>%
  # Summarize time spent
  group_by(time_unit, summary) %>%
  summarize(tot_min = sum(minutes) %>% as.numeric()) %>%
  mutate(tot_hr = tot_min / 60) %>%
  # Plot as a line graph
  ggplot(aes(x = time_unit, y = tot_hr, color = summary)) +
  geom_line()
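Before trusting the plot, it can help to check the per-day totals by hand on a tiny example. The sketch below repeats the same grouping and summing idea in base R on made-up events; the data frame toy_events and its values are invented for illustration, and aggregate() stands in for the group_by() and summarize() steps used above.

```r
# Made-up events; column names mirror the wrangled calendar data.
toy_events <- data.frame(
  time_unit = as.Date(c("2019-09-02", "2019-09-02", "2019-09-02", "2019-09-03")),
  summary   = c("Class", "Class", "Exercise", "Class"),
  minutes   = c(75, 50, 30, 75)
)

# Total minutes for each day and category, then convert to hours.
daily_totals <- aggregate(minutes ~ time_unit + summary, data = toy_events, FUN = sum)
daily_totals$tot_hr <- daily_totals$minutes / 60

daily_totals
```

Each row of daily_totals corresponds to one point on the line graph: one category on one day. If a category has only a single day of data, geom_line() has nothing to connect, which is worth keeping in mind when you interpret your own plot.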
TODO: You write some words here.
TODO: answer each of the questions below in your own words.
What types of questions was Hilary trying to answer in her data collection? What types of questions might you be able to answer in your data collection?
Name two of Hilary’s main hurdles in gathering accurate data. Name two of your main hurdles in gathering accurate data.
Write a few sentences contrasting data collection that is “high touch” (manual) versus “low touch” (automatic). When is high touch more accurate, when is low touch more accurate (provide an example or two)?
What additional variables (covariates) did Hilary discuss as important to the data analysis but difficult to collect or not collected in her first foray into analysis? What additional variables (covariates) would you want to collect the next time you tried to understand how you spend your day?
How much data do you think you’d need to collect in order to answer the research question of interest? Would it be hard to collect that data? Why or why not?