Lab 7: Importing Data

Question

Use the library() function to load the tidyverse, janitor, and googlesheets4 packages.

Question

Spend a minute reviewing the README file in the coffee quality rating GitHub repository. Which of the following is not an aspect the coffees were rated on?

  1. Body
  2. Cup Cleanliness
  3. Bitterness
  4. Moisture
  1. Bitterness
Question

Insert a new code chunk in your Quarto document, and copy the code template below into this code chunk.

Then, copy the “raw” URL for the robusta_ratings_raw.csv from GitHub, and paste this URL inside the call to read_csv(). Don’t forget to put quotation marks around the URL!

Use the <- assignment operator to store the output from the read_csv() function in an object named robusta_raw.

Finally, confirm that the data was imported correctly by printing the robusta_raw object.

# A tibble: 28 × 51
    ...1 quality_score view_certificate_1 view_certificate_2
   <dbl>         <dbl> <lgl>              <lgl>             
 1     0          83.8 NA                 NA                
 2     0          83.5 NA                 NA                
 3     0          83.2 NA                 NA                
 4     0          83   NA                 NA                
 5     0          83   NA                 NA                
 6     0          82.8 NA                 NA                
 7     0          82.6 NA                 NA                
 8     0          82.5 NA                 NA                
 9     0          82.5 NA                 NA                
10     0          82.4 NA                 NA                
# ℹ 18 more rows
# ℹ 47 more variables: `Cupping Protocol and Descriptors` <lgl>,
#   `View Green Analysis Details` <lgl>, `Request a Sample` <lgl>,
#   Species <chr>, Owner...9 <chr>, `Country of Origin` <chr>,
#   `Farm Name` <chr>, `Lot Number` <chr>, Mill <chr>, `ICO Number` <chr>,
#   Company <chr>, Altitude <chr>, Region <chr>, Producer <chr>,
#   `Number of Bags` <dbl>, `Bag Weight` <chr>, `In-Country Partner` <chr>, …
Question

Use glimpse() to get a quick overview of the robusta_raw data. Then, use the output from glimpse() to answer the following questions.

  1. How many variables are measured in these data?
  2. How many variables have names with ... in them?
  3. How many variables have names with spaces in them?
Rows: 28
Columns: 51
$ ...1                               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ quality_score                      <dbl> 83.75, 83.50, 83.25, 83.00, 83.00, …
$ view_certificate_1                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ view_certificate_2                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `Cupping Protocol and Descriptors` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `View Green Analysis Details`      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `Request a Sample`                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Species                            <chr> "Robusta", "Robusta", "Robusta", "R…
$ Owner...9                          <chr> "Ankole coffee producers coop", "Ni…
$ `Country of Origin`                <chr> "Uganda", "India", "India", "Uganda…
$ `Farm Name`                        <chr> "Kyangundu cooperative society", "S…
$ `Lot Number`                       <chr> NA, "25", NA, NA, NA, NA, NA, "7", …
$ Mill                               <chr> "Ankole coffee producers", "Sethura…
$ `ICO Number`                       <chr> "0", "14/1148/2017/21", "0000", "0"…
$ Company                            <chr> "Ankole Coffee Producers Coop", "Ka…
$ Altitude                           <chr> "1488", "3170", "1000m", "1212", "1…
$ Region                             <chr> "Sheema South Western", "Chikmagalu…
$ Producer                           <chr> "Ankole coffee producers coop", "Ni…
$ `Number of Bags`                   <dbl> 300, 320, 300, 320, 1, 200, 320, 32…
$ `Bag Weight`                       <chr> "60 kg", "60 kg", "2 kg", "60 kg", …
$ `In-Country Partner`               <chr> "Uganda Coffee Development Authorit…
$ `Harvest Year`                     <dbl> 2013, 2017, 2015, 2013, 2013, 2012,…
$ `Grading Date`                     <chr> "June 26th, 2014", "October 31st, 2…
$ Owner...24                         <chr> "Ankole coffee producers coop", "Ni…
$ Variety                            <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Ot…
$ Status                             <chr> "Completed", "Completed", "Complete…
$ `Processing Method`                <chr> NA, "Washed / Wet", NA, NA, NA, NA,…
$ NA...28                            <chr> "Sample", "Sample", "Sample", "Samp…
$ `Fragrance / Aroma`                <dbl> 7.83, 8.00, 7.92, 8.00, 8.33, 8.00,…
$ Flavor                             <dbl> 8.08, 7.75, 7.83, 7.92, 7.83, 7.92,…
$ Aftertaste                         <dbl> 7.75, 7.92, 7.92, 7.92, 7.83, 7.67,…
$ `Salt / Acid`                      <dbl> 7.92, 8.00, 8.00, 7.75, 7.75, 8.00,…
$ `Bitter / Sweet`                   <dbl> 8.00, 8.00, 7.83, 7.75, 7.58, 7.75,…
$ Mouthfeel                          <dbl> 8.25, 7.92, 7.83, 7.83, 8.25, 7.75,…
$ NA...35                            <chr> "Sample", "Sample", "Sample", "Samp…
$ `Uniform Cup`                      <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, …
$ `Clean Cup`                        <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, …
$ Balance                            <dbl> 7.92, 7.92, 7.92, 7.75, 7.75, 7.92,…
$ `Cupper Points`                    <dbl> 8.00, 8.00, 8.00, 8.08, 7.67, 7.75,…
$ `Total Cup Points`                 <chr> "Sample  83.75", "Sample  83.50", "…
$ NA...41                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Moisture                           <chr> "12 %", "0 %", "0 %", "12 %", "12 %…
$ `Category One Defects`             <chr> "0 full defects", "0 full defects",…
$ Quakers                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Color                              <chr> "Green", NA, "Green", "Green", "Gre…
$ `Category Two Defects`             <chr> "2 full defects", "2 full defects",…
$ NA...47                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Expiration                         <chr> "June 26th, 2015", "October 31st, 2…
$ `Certification Body`               <chr> "Uganda Coffee Development Authorit…
$ `Certification Address`            <chr> "Coffee House, Plot 35, Jinja Road,…
$ `Certification Contact`            <chr> "Clare Rwakatogoro - 256-41-256940/…
  1. 51
  2. 7
  3. 25
Question

Insert a new code chunk in your Quarto document, and copy the code template below into this code chunk.

Fill in the blank inside the clean_names() function with the robusta_raw data frame object. Then, use the <- assignment operator to store the output from the clean_names() function in a new data frame object named robusta_cleaned.

Finally, run the glimpse() function to inspect the newly created robusta_cleaned data frame. How have the variable names changed after using the clean_names() function?

Rows: 28
Columns: 51
$ x1                               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ quality_score                    <dbl> 83.75, 83.50, 83.25, 83.00, 83.00, 82…
$ view_certificate_1               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ view_certificate_2               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ cupping_protocol_and_descriptors <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ view_green_analysis_details      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ request_a_sample                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ species                          <chr> "Robusta", "Robusta", "Robusta", "Rob…
$ owner_9                          <chr> "Ankole coffee producers coop", "Nish…
$ country_of_origin                <chr> "Uganda", "India", "India", "Uganda",…
$ farm_name                        <chr> "Kyangundu cooperative society", "Set…
$ lot_number                       <chr> NA, "25", NA, NA, NA, NA, NA, "7", "R…
$ mill                             <chr> "Ankole coffee producers", "Sethurama…
$ ico_number                       <chr> "0", "14/1148/2017/21", "0000", "0", …
$ company                          <chr> "Ankole Coffee Producers Coop", "Kaap…
$ altitude                         <chr> "1488", "3170", "1000m", "1212", "120…
$ region                           <chr> "Sheema South Western", "Chikmagalur …
$ producer                         <chr> "Ankole coffee producers coop", "Nish…
$ number_of_bags                   <dbl> 300, 320, 300, 320, 1, 200, 320, 320,…
$ bag_weight                       <chr> "60 kg", "60 kg", "2 kg", "60 kg", "6…
$ in_country_partner               <chr> "Uganda Coffee Development Authority"…
$ harvest_year                     <dbl> 2013, 2017, 2015, 2013, 2013, 2012, 2…
$ grading_date                     <chr> "June 26th, 2014", "October 31st, 201…
$ owner_24                         <chr> "Ankole coffee producers coop", "Nish…
$ variety                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Othe…
$ status                           <chr> "Completed", "Completed", "Completed"…
$ processing_method                <chr> NA, "Washed / Wet", NA, NA, NA, NA, N…
$ na_28                            <chr> "Sample", "Sample", "Sample", "Sample…
$ fragrance_aroma                  <dbl> 7.83, 8.00, 7.92, 8.00, 8.33, 8.00, 7…
$ flavor                           <dbl> 8.08, 7.75, 7.83, 7.92, 7.83, 7.92, 7…
$ aftertaste                       <dbl> 7.75, 7.92, 7.92, 7.92, 7.83, 7.67, 7…
$ salt_acid                        <dbl> 7.92, 8.00, 8.00, 7.75, 7.75, 8.00, 7…
$ bitter_sweet                     <dbl> 8.00, 8.00, 7.83, 7.75, 7.58, 7.75, 8…
$ mouthfeel                        <dbl> 8.25, 7.92, 7.83, 7.83, 8.25, 7.75, 7…
$ na_35                            <chr> "Sample", "Sample", "Sample", "Sample…
$ uniform_cup                      <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10…
$ clean_cup                        <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10…
$ balance                          <dbl> 7.92, 7.92, 7.92, 7.75, 7.75, 7.92, 7…
$ cupper_points                    <dbl> 8.00, 8.00, 8.00, 8.08, 7.67, 7.75, 7…
$ total_cup_points                 <chr> "Sample  83.75", "Sample  83.50", "Sa…
$ na_41                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ moisture                         <chr> "12 %", "0 %", "0 %", "12 %", "12 %",…
$ category_one_defects             <chr> "0 full defects", "0 full defects", "…
$ quakers                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ color                            <chr> "Green", NA, "Green", "Green", "Green…
$ category_two_defects             <chr> "2 full defects", "2 full defects", "…
$ na_47                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ expiration                       <chr> "June 26th, 2015", "October 31st, 201…
$ certification_body               <chr> "Uganda Coffee Development Authority"…
$ certification_address            <chr> "Coffee House, Plot 35, Jinja Road, P…
$ certification_contact            <chr> "Clare Rwakatogoro - 256-41-256940/23…

All the letters have been convered to lower case, all the spaces have replaced with underscores, and the leading ... have been turned into x’s!

Question

Investigate the online repository for the avengers data from FiveThirtyEight, which is all about Marvel comic book characters.

Then, import the avengers.csv file, and then clean up the variable names. Finally, use the glimpse function to answer the following question: how many variable’s in this data set contain numeric data?

Rows: 173
Columns: 21
$ url                         <chr> "http://marvel.wikia.com/Henry_Pym_(Earth-…
$ name_alias                  <chr> "Henry Jonathan \"Hank\" Pym", "Janet van …
$ appearances                 <dbl> 1269, 1165, 3068, 2089, 2402, 612, 3458, 1…
$ current                     <chr> "YES", "YES", "YES", "YES", "YES", "YES", …
$ gender                      <chr> "MALE", "FEMALE", "MALE", "MALE", "MALE", …
$ probationary_introl         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ full_reserve_avengers_intro <chr> "Sep-63", "Sep-63", "Sep-63", "Sep-63", "S…
$ year                        <dbl> 1963, 1963, 1963, 1963, 1963, 1963, 1964, …
$ years_since_joining         <dbl> 52, 52, 52, 52, 52, 52, 51, 50, 50, 50, 50…
$ honorary                    <chr> "Full", "Full", "Full", "Full", "Full", "H…
$ death1                      <chr> "YES", "YES", "YES", "YES", "YES", "NO", "…
$ return1                     <chr> "NO", "YES", "YES", "YES", "YES", NA, "YES…
$ death2                      <chr> NA, NA, NA, NA, "YES", NA, NA, "YES", NA, …
$ return2                     <chr> NA, NA, NA, NA, "NO", NA, NA, "YES", NA, N…
$ death3                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ return3                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ death4                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ return4                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ death5                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ return5                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ notes                       <chr> "Merged with Ultron in Rage of Ultron Vol.…

3 variables contain numeric data (appearances, year, and years_since_joining).

Question

Insert a new code chunk into your Quarto document, and copy the code template below into this chunk.

Then, navigate to the Gapminder Google Sheet, and copy the URL from the address bar. Paste this the URL inside the call to the read_sheet() function, with quotation marks around the URL. Next, use the assignment operator to store the output from the read_sheet() function in an object named gapminder. Finally, confirm that the data was imported correctly by printing the gapminder object.

# A tibble: 624 × 6
   country continent  year lifeExp      pop gdpPercap
   <chr>   <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Algeria Africa     1952    43.1  9279525     2449.
 2 Algeria Africa     1957    45.7 10270856     3014.
 3 Algeria Africa     1962    48.3 11000948     2551.
 4 Algeria Africa     1967    51.4 12760499     3247.
 5 Algeria Africa     1972    54.5 14760787     4183.
 6 Algeria Africa     1977    58.0 17152804     4910.
 7 Algeria Africa     1982    61.4 20033753     5745.
 8 Algeria Africa     1987    65.8 23254956     5681.
 9 Algeria Africa     1992    67.7 26298373     5023.
10 Algeria Africa     1997    69.2 29072015     4797.
# ℹ 614 more rows
Question

Above, we discussed the difference between read_sheet() and read_csv(). Copy your code from the previous answer into a new code chunk below. Then, replace read_sheet() with read_csv(), and run the code again (this command may take a several moments to finish running).

What happens when you use read_csv() instead of read_sheet() on a Google Sheet URL? Hint: Consider the number of rows and columns. Is this what you expect?

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
# A tibble: 11 × 4,078
   <!DOCTYPE html><html la…¹ .gb_A .gb_B{-webkit-tap-hi…² .gb_E{fill:#5f6368}.…³
   <chr>                     <chr> <chr>                  <chr>                 
 1 "/*"                       <NA>  <NA>                  <NA>                  
 2 "Copyright The Closure L…  <NA>  <NA>                  <NA>                  
 3 "SPDX-License-Identifier…  <NA>  <NA>                  <NA>                  
 4 "*/"                       <NA>  <NA>                  <NA>                  
 5 "function a(d){var f"     "g"   "h;if(((f=document)==… l;((l=(k=window).DOCS…
 6 "for(var e;b.length&&(e=…  <NA>  <NA>                  <NA>                  
 7 "// Google Inc."           <NA>  <NA>                  <NA>                  
 8 "//# sourceMappingURL=fo…  <NA>  <NA>                  <NA>                  
 9 "</script><script nonce=… "Rob… "Helvetica"            Arial                 
10 "g&&(a+=15);var f=d.getE… "m=f… "k=f?null:d.getElemen… n=l?l.offsetHeight:0;…
11 "e.style.display=\"block… "e=g…  <NA>                  <NA>                  
# ℹ abbreviated names:
#   ¹​`<!DOCTYPE html><html lang="en-US"><head><script nonce="fhy_159A9h6oa_L07WYlUg">var DOCS_timing={}; DOCS_timing['pls']=new Date().getTime();</script><meta property="og:title" content="gapminder"><meta property="og:type" content="article"><meta property="og:site_name" content="Google Docs"><meta property="og:url" content="https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit?gid=780868077&amp;usp=embed_facebook"><meta property="og:image" content="https://lh7-us.googleusercontent.com/docs/AHkbwyKlA5QbblnI18BecNvdWzoAfwMyGHSz9WApv1JE3fYcedBKxPsQFkIvPkA1i6Y0mXS2wTUms0HQPhQU8YWTyJ0cMlK4LtRJIC49SfymkAcbDc69oq4=w1200-h630-p"><meta property="og:image:width" content="1200"><meta property="og:image:height" content="630"><meta name="google" content="notranslate"><meta http-equiv="X-UA-Compatible" content="IE=edge;"><meta name="referrer" content="origin"><title>gapminder - Google Sheets</title><link rel="shortcut icon" href="//ssl.gstatic.com/docs/spreadsheets/spreadsheets_2023q4.ico"><link rel="chrome-webstore-item" href="https://chrome.google.com/webstore/detail/apdfllckaahabafndbhieahigkjlhalf"><style nonce="sNlH5gLJp8yEVHWq6xR84w">.gb_x{display:-webkit-box;display:-webkit-flex;display:-moz-box;display:-ms-flexbox;display:-webkit-box;display:-moz-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-box-pack:end;-webkit-justify-content:flex-end;-moz-box-pack:end;-ms-flex-pack:end;-webkit-justify-content:flex-end;justify-content:flex-end;width:100%}.gb_y{-moz-box-sizing:border-box;-webkit-box-sizing:border-box;box-sizing:border-box;height:48px;padding:4px}.gb_z`,
#   ²​`.gb_B{-webkit-tap-highlight-color:transparent;-moz-box-sizing:border-box;-webkit-box-sizing:border-box;box-sizing:border-box;float:right;outline:none}.gb_C{-moz-box-sizing:border-box;-webkit-box-sizing:border-box;box-sizing:border-box;display:inline-block;height:48px;outline:none;padding:4px}.gb_D`,
#   ³​`.gb_E{fill:#5f6368}.gb_F .gb_E`
# ℹ 4,074 more variables: `.gb_F .gb_D{fill:rgba(255` <chr>, `255...6` <chr>,
#   `255...7` <chr>, `.87)}.gb_H .gb_A` <chr>,
#   `.gb_I .gb_A{cursor:pointer;padding:8px}.gb_J{display:none}a.gb_z{-webkit-box-align:center;-webkit-align-items:center;-moz-box-align:center;-ms-flex-align:center;-webkit-align-items:center;align-items:center;background:#1a73e8;border:1px solid transparent;-webkit-border-radius:4px;-moz-border-radius:4px;border-radius:4px;color:#fff;cursor:pointer;display:-webkit-box;display:-webkit-flex;display:-moz-box;display:-ms-flexbox;display:-webkit-box;display:-moz-box;display:-webkit-flex;display:-ms-flexbox;display:flex;font-family:Google Sans` <chr>, …

No, this result doesn’t look at all like what I expected. The number of rows and columns should be 624 and 6 for the gapminder data. In addition to having the wrong dimensions, the data appear to be a bunch of mangled text.

Question

Extract the data file named product review text.csv from the zip file you downloaded, and move it into your SDS 100 project folder.

If you’ve never extracted files from a zip file before, here are some instructions you might find helpful:

Since no R code is needed for this question, you won’t see any output below this question in your rendered HTML file. You’ll know you completed this exercise correctly if you see the product review text.csv file in your SDS 100 project folder, along with all your lab files.

Question

Use the locale() function (which comes from the readr package) to create an object representing a Chinese locale that uses the GB18030 encoding, just like the example below:

Then, try using the read_csv() function to import the product review text.csv file and store the data in an object named JD_product_reviews again. But this time, pass the locale object you created into the read_csv() function using the locale argument.

Check that the data were imported properly by printing out the JD_product_reviews object; the data should look like the screenshot above.

# A tibble: 67,846 × 2
   充电有点烫                               rel  
   <chr>                                    <chr>
 1 用了三天来评价                           irr  
 2 上网还可以                               rel  
 3 就是用一会手机发烫                       rel  
 4 不知道是怎么回事                         irr  
 5 可以可以不错不错                         irr  
 6 最便宜的苹果了                           rel  
 7 拿在手里刚刚好                           rel  
 8 比较好看的苹果机型作为备用机还是很不错的 rel  
 9 挺好的                                   irr  
10 留个纪念也好                             irr  
# ℹ 67,836 more rows