Motivation

A typical refrain from data scientists in the wild is that 80-90% of their time is spent wrangling data.

Image credit: https://www.infoworld.com/article/3228245/the-80-20-data-science-dilemma.html

But for data scientists in the classroom, it’s hard to recreate those wild settings that call for working through the entire data science process. In this assignment, you are required to create the data, wrangle the data, and visualize the data. The data you will be asked to collect and visualize are based on a Google calendar.

Assignment

  1. As a preliminary step, listen to the Compromised Shoe Situation episode of Not So Standard Deviations, https://overcast.fm/+FMBuKdMEI/00:30 or http://nssdeviations.com/71-compromised-shoe-situation. Note: the conversation is about 45 minutes long. Listen while folding laundry! [Roger Peng blogged about it, but the thought process in the actual podcast is better for understanding the assignment at hand: https://simplystatistics.org/2019/01/09/how-data-scientists-think-a-mini-case-study/.]

TODO: listen. nothing to include right here.

  1. Ask yourself a question which might be answered by a calendar. How much do I study per week? or How much am I sleeping? or How much am I exercising? (Feel free to make up the question and the data if you aren’t comfortable putting your calendar into your HW assignment … but the HW repo is private, FYI.)

TODO: write down the question that you might be able to answer given the data below.

  1. Create the dataset:
    • In a Google calendar, under “Other calendars” click on the “+” to add a calendar. Name it whatever you want.
    • Add events to your calendar. They can be as personal or impersonal as you want, but the events will be summarized in your HW assignment.
    • Tag the events (tagging is done by setting the color of the event). You should have at least 20 events and at least 4 different tags.
    • Try to put as many items on your calendar as possible. The events may be activities (what you are doing), locations (where you are), or any other creative way that you want to partition your day.
    • Go to Google Calendar -> Settings -> “Import & Export” -> Export. Unzip the downloaded file and put the resulting .ics dataset into the GitHub repo for this assignment.

TODO: create the dataset. nothing to include right here.

  1. Upload the dataset into R. (Note that the ical package will read in files from calendars other than Google’s. Feel free to use any calendar system you want!) You’ll need to install the two packages we use by typing install.packages("tidyverse") and install.packages("ical") in the Console (the window below in RStudio). Don’t put the install.packages() commands in the Rmd file.
library(tidyverse)
library(ical)

# Load ics file using ical package
daily_events <- 
  "Math154_u6i1mi8treredmpe526b3qv7j4@group.calendar.google.com.ics" %>% 
  ical_parse_df() %>% 
  as_tibble()

TODO: change the name of the file you are importing. remember to include the actual data file in your GitHub repository so that the grader can knit your .Rmd file.
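Before wrangling, it can be reassuring to confirm the file parsed correctly. A quick sketch — it assumes the daily_events tibble created above and peeks at the start, end, and summary columns that the wrangling code relies on:

```r
# Peek at the structure of the parsed events
glimpse(daily_events)

# The wrangling code below relies on these three columns
daily_events %>%
  select(start, end, summary) %>%
  head()
```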

  1. Wrangle, visualize, and summarize the results of the dataset. It is likely that you can just run the code below. If you want a longer time period, change the date in the filter() command. If you want to do something else, feel free to ask me for the R code, and I will just give it to you (or ask on Piazza!).

TODO: feel free to directly use the R code to wrangle & visualize (or expand it if you want to!). write a sentence or two to summarize.

Wrangle

daily_events <- daily_events %>%
  mutate(
    # Convert to the local time zone
    start_datetime = lubridate::with_tz(start, tzone = "America/Los_Angeles"),
    end_datetime = lubridate::with_tz(end, tzone = "America/Los_Angeles"),
    # Use explicit units: subtracting POSIXct times directly returns a
    # difftime whose units (secs, mins, hours, ...) vary with the gap
    minutes = as.numeric(difftime(end_datetime, start_datetime, units = "mins")),
    hours = minutes / 60,
    time_unit = lubridate::floor_date(start_datetime, unit = "day")
  ) %>%
  # I only want information since September
  dplyr::filter(time_unit > "2019-09-01") %>% 
  # Separate out summary and subtask; fill = "right" keeps events
  # without a ":" in the title from raising a warning
  separate(summary, c("summary", "subtask"), sep = ":", fill = "right") %>%
  select(-subtask)
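Once the events are wrangled, a quick sanity check can catch problems before plotting — for example, a negative duration usually signals a time-zone or parsing issue. A sketch, assuming the daily_events tibble produced above:

```r
# Sanity checks on event durations
daily_events %>%
  summarize(
    n_events = n(),          # total events kept after filtering
    min_hours = min(hours),  # should be positive
    max_hours = max(hours)   # a very large value may be an all-day event
  )

# How many events fall under each tag?
daily_events %>%
  count(summary, sort = TRUE)
```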

Visualize

daily_events %>%
  # Summarize time spent per day and per tag
  group_by(time_unit, summary) %>%
  summarize(tot_min = as.numeric(sum(minutes))) %>%
  mutate(tot_hr = tot_min / 60) %>%
  # Plot as a line graph
  ggplot(aes(x = time_unit, y = tot_hr, color = summary)) +
  geom_line()
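If your question is phrased per week (e.g., “How much do I study per week?”), the same data can be floored to weeks instead of days. A sketch, again assuming the wrangled daily_events tibble from above; geom_col() is one reasonable choice for weekly totals:

```r
daily_events %>%
  # Floor each day to the start of its week
  mutate(week = lubridate::floor_date(time_unit, unit = "week")) %>%
  # Total hours per week and per tag
  group_by(week, summary) %>%
  summarize(tot_hr = as.numeric(sum(minutes)) / 60, .groups = "drop") %>%
  ggplot(aes(x = week, y = tot_hr, fill = summary)) +
  geom_col()
```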

Summarize

TODO: You write some words here.

  1. Reflect on the process by answering the following questions:

TODO: answer each of the questions below in your own words.

  1. What types of questions was Hilary trying to answer in her data collection? What types of questions might you be able to answer in your data collection?

  2. Name two of Hilary’s main hurdles in gathering accurate data. Name two of your main hurdles in gathering accurate data.

  3. Write a few sentences contrasting data collection that is “high touch” (manual) versus “low touch” (automatic). When is high touch more accurate, when is low touch more accurate (provide an example or two)?

  4. What additional variables (covariates) did Hilary discuss as important to the data analysis but difficult to collect or not collected in her first foray of analysis? What additional variables (covariates) would you want to collect the next time you tried to understand how you spend your day?

  5. How much data do you think you’d need to collect in order to answer the research question of interest? Would it be hard to collect that data? Why or why not?