Lab 7: Importing Data

Preamble

The purpose of this lab is to import data beyond just R packages and from the internet.

After completing this lab, you should be able to import data stored as CSV files on the internet and from your local machine.

You will know that you are done when you have created a rendered file where the results match the examples in this HTML file.

Why are we here?

Nearly every lab in this course has used data, but all of that data has come from R packages. The reality is that most data is not part of an R package.

Data, like books or music, can have a variety of restrictions and protections. There is a wealth of “open-access” data, which is data that is shared publicly, often directly on the internet. In this lab, we will look at how to import this kind of publicly available data.

Lab instructions

Setting up

Download the Lab 7 template.

Warning

Before you begin:

  • Move the template file you just downloaded from your Downloads folder into the SDS 100 folder that you created in Lab 1.

  • Make sure that within RStudio, you’re working in the SDS 100 project that you created in Lab 1.

For more detailed instructions on either of the above, see Lab 2.

Optional reading

The following are good additional resources for the material covered in this lab:

Creating the lab environment

Installing and loading packages

Today we need two additional packages googlesheets4 and janitor. Please install and load both now.

In the lab, we will use the tidyverse, googlesheets4, and janitor packages. Please load them below:

Question

Use the library() function to load the tidyverse, janitor, and googlesheets4 packages.

Data beyond R packages

Today, we’ll be using data from two places:

FiveThirtyEight Data

FiveThirtyEight maintains a GitHub data repository. You can think of this as a second version of the above website, just presented in a slightly different way.1

Specifically today, we’ll be looking at the pollster-ratings dataset that is available as a csv and in another format. The link above takes you to a GitHub repository2 (a special kind of online folder) with information about the dataset, as well as different versions of it, including versions by year.

In a GitHub repo, the README.md3 file often gives an overview of what is in the whole repository. For many data repositories, this is where you will find documentation about the dataset. This documentation is similar to what you find in the helpfile for datasets from R packages.

Importing CSV files with read_csv()

Data files comes in many shapes and sizes, but one of the most common formats is CSV. C SV is an acronym standing for comma-separated variables. This means that each row of our data frame is represented on one line of a text file as a sequence of information with the value for each column separated from the next by a comma.

In this section, we will learn how to use the read_csv() function to load data from the internet into R.

Locating the URL for a file in a repository

To use read_csv(), we first need to locate the file that we want to import. Within a GitHub repository, there are usually a number of files and folders.

Today, we are going to work with the pollster-ratings-combined.csv file. From the main repository page, click on the file with this name. You should see a data table, displayed as part of a website. This is a user-friendly rendering of the CSV file created by the GitHub web server.

To see the raw file, find the grey box that starts with 540 lines… and then look on the right side of that box. You should see a button with the label Raw. Click that box.

You will now see just text without any of the user-friendly structure of the previous website. The URL (address) associated with this page is the unique location where the .csv for the pollster-ratings resides.

Copy this URL and paste it below between two double quotes:

Question

Create a new object in R called file_location and assign the URL to the pollster-ratings CSV to it.

Confirm that this worked by typing file_location so that R displays the URL.

Importing data using read_csv()

So far we have identified where the data is and stored that location as file_location, but we haven’t yet loaded the data into R.

To load our data into R, we use read_csv() from the readr package, which is part of the tidyverse. By loading the tidyverse, we have automatically loaded the readr package.

The most simple use of read_csv() is using only the location of the file as the input, that is: read_csv(file_location). We always want to save the result as an object.

Question

Use read_csv() to load the pollster-ratings data into R. Save the resulting object as pollster_raw.

After running the above line, you should now see your data object in the Global Environment pane in the top right of your RStudio. This means that you have imported the data, and we can now work with it.

Question

Use glimpse() to check the number of rows and columns of pollster_raw, and list the names of the variables.

Cleaning up Variable names

Programmers use style guidelines to influence how they write their code. R also has rules that are enforced and conventions that are followed by many people. One common (and useful) convention is not to allow spaces in variable names. While we could manually update the variable name for each column, the janitor package does this for us with its clean_names() function.

Question

Use clean_names() to sanitize the variable names in pollster_raw. Save the resulting object as pollster_clean.

Question

Use glimpse() to display a listing of the names of the variables. What do you notice about the how the variable names are stored?

Checking for file extensions

When working with a data file, the first thing you should check is what format the file is in. The format is denoted by the letters following the final period in the file’s location. These letters are known as the file extension.4

For our pollster data, the letters following the last period are csv denoting that it is a CSV file. The letters following the last period of this lab document are qmd denoting that it is a quarto markdown document. Notice that the letters following the last period of the README.md file in the pollster-rating repository are md denoting that it is a markdown document, but not specifically a Quarto document.

Your turn

Investigate the online repository for the avengers data from FiveThirtyEight, about Marvel comic book characters.

Question

Import the raw avengers dataset and then clean up the column names. Call the imported data av_data.

Question

How many rows and columns are in the data?

Data beyond R packages: Google Sheets

So far in this lab, we’ve imported data that exist publicly as CSV files. There are many ways to store data. One of the most common places to store data is on Google Sheets. In this next part, we will learn how to access the first page of a Google Sheet.

Setting up googlesheets4

Note

The initial set-up for using googlesheets4 can be challenging. Please read these instructions through fully once before you start the installation.

Acknowledgement: This section will follow examples and code from the googlesheets4 help pages.

The first time that you use googlesheets4, it will ask you to authenticate. Authenticating via your Google account allows you to access files that require authentication. While there are files that are publicly available, there are many more that require you to identify yourself via a Google account. These files can include ones that have access restricted to just a few people. In fact, you may find that for your SDS 100 project, you would like to use data that needs to be kept private (via restrictions).

This section will walk you through how to give googlesheets4 permission to access Google Sheets through your Google account. We note that if done correctly, you will only need to do this procedure once per computer and per googlesheets4 installation. For example, I have two computers, so that I had to do this twice: once per computer.

Data for this section

In this part, we will use data that comes from Gapminder. This nonprofit is based in Sweden and takes a data-driven approach to fighting ignorance. They have a number of really cool data-based projects, including Dollar Street, which humanizes data through a data set that pairs photographs with numerical data.

Our data is stored as a Google Sheet at this link.

Question

Navigate to the Google Sheet and save the URL as gapminder_loc. Confirm that this worked by typing gapminder_loc to display the URL.

Accessing Data for the first time

To access today’s dataset and begin the authentication process, run the following line:

read_sheet(gapminder_loc)

Since this is the first time that you are using googlesheets4, you will be asked a few questions; some will be in RStudio and others will open in your web-browser.

Questions in RStudio

The first set of questions will be in RStudio. You may see the following message:

The googlesheets4 package is requesting access to your Google account. Select a pre-authorised account or enter ‘0’ to obtain a new token. Press Esc/Ctrl + C to cancel.

If so, press 0 to start the authentication process.

Then you might see the following two questions:

Is it OK to cache OAuth access credentials in the folder ~/Library/Caches/gargle between R sessions?

The httpuv package enables a nicer Google auth experience, in many cases. It doesn’t seem to be installed. Would you like to install it now?

For both of these questions, it is recommended to select 1.

Steps on the web browser

RStudio will prompt your web browser to open. You will see something like this page asking you to select an account. Like me, you may have two google accounts. For this lab, we ask that you authenticate with your Smith account:

This will next take you to a page with three checkboxes. The first two will already be checked. You need to check the third one!! (See Figure 1.)

Note

Don’t forget to check the third box!

(a) Don’t leave it like this!

(b) Check the 3rd box to make it look like this!

Figure 1: Google Authentication

Check the third box (as shown in Figure 1 (b) above) and then click Continue.

You should next be taken to a blank webpage that has the following text in the top left hand corner:

Authentication complete. Please close this page and return to R.

If you see this page, you can close it and return to RStudio.

Verifying Completed Authentication in RStudio

Back in RStudio you should see something like this:

Authentication and import

The red text was likely there while you were walking through the steps in your browser. The two lines with the checks are what you will typically see when you load data using googlesheets4.

Using googlesheets4

In our authentication procedure, we glossed over the read_sheet() function. Let’s take a closer look at that now.

The function read_sheet() works nearly identically to read_csv() from the first part of this lab. To use it, we need 1) a file location (which in this case is the Google Sheet URL, and 2) the name of the object that we will assign the result to. The critical difference is that read_sheet() takes in the location of Google Sheet while read_csv() takes in the location of a CSV file.

Question

Earlier in this part of the lab, we stored the location of the Google Sheet as the object gapminder_loc. Use this variable with read_sheet() to read in the first page of this Google Sheet, clean the variable names using clean_names(), and store the resulting data frame as the object gm_data.

Question

Above, we discussed the difference between read_sheet() and read_csv(). Copy your code from the previous answer into a new code chunk below. Then, replace read_sheet() with read_csv() and gm_data with gm_data2, and run the code (note this command may take a few moments to run).

What happens when you use read_csv() instead of read_sheet() on a Google Sheet? Hint: Consider the number of rows and columns. Is this what you expect?

Reading data from your local machine

Now that we have learned a couple of methods for reading data from links, let’s apply what we learned in Lab 2 about absolute vs. local file paths and practice reading in data locally.

For this exercise we will use the Cirrhosis data available from the UCI Machine Learning Repository, collected by the Mayo Clinic.

First, go to the link above and select “Download”. This will download a .zip file. Open the zip file, and move the CSV to your SDS100 project folder.

Question

Download and read in the cirrhosis dataset from your local machine using the read_csv() function. Then clean the column names.

Submitting this lab

Step 1: Create the twin document

Download the Lab 7 template (if you haven’t done so already). Complete it such that the results in the generated output match the examples in the Lab 7 solutions file.

Step 2: Final Project - Rotation 0

Next week, you will start looking at one of your data sets from the previous weeks’ data checks in more detail. The following steps will make sure that the one you choose is a good candidate for your final project.

Create a new Quarto document. This will be the document you use for final project work over the next five labs. Within this document, follow these steps:

  1. Select one of your three data sets and import it into R using one of the techniques covered in today’s lab.
  2. Clean the column names using the clean_names() function from the janitor package.
  3. Look in your environment to see how many observations the data set has. If it is at least two hundred and no more than ten thousand, proceed to step 4. Otherwise, select another of your data sets and start over from step 1.
  4. Look at your data by clicking on its name in the environment or running View(<dataset_name>). Check: does it look generally okay? If not, stop and follow the instructions in the section below titled “Common Issues: Data doesn’t look right?”.
  5. Identify two numeric variables and calculate the mean of each. If you get non-NA numeric values for both, proceed to step 6. Otherwise, stop and follow the instructions in the section below titled “Common Issues: Numeric Variables.”
  6. Identify two categorical variables. For each one, use group_by(variable_name) (filling in the name of the variable) and summarize(count = n()) to calculate the number of values in each category. Each variable should meet two criteria: 1) there are at least two categories, and 2) most of the counts are greater than one. If this is the case, proceed to step 7. If the data set doesn’t have at least two variables that fit these criteria, it is unsuitable for the final project. Select another of your data sets and start over from step 1.
  7. If you have successfully completed all of the above steps, your data set is suitable for the final project! Name your .qmd a relevant title for your dataset (for example, “cirrhosis.qmd” instead of “rotation_0.qmd”). Submit both the .qmd file and the .html file for this rotation document on Moodle under “Project Rotation 0.”

Common Issues: Data doesn’t look right?

Data you import yourself will not necessarily be as clean as data you read directly from a URL: it may take some modifications of your function call to get something that looks good. Some important read_csv() and read_sheet()arguments that you may need to leverage:

  • skip: If your data has a header other than just column names, you will need to skip rows. This is a number of rows to skip before it starts reading data.
  • na: If you have missing data that is not represented by an NA, you will want to specify which strings to interpret as missing.

If your data is local and you don’t have a “.csv” file, see if you can Save As your document into a .csv format. Otherwise, you may need to use the more broad read_delim() function and specify what kind of delimiter your data has, such as “\t”.

If it still doesn’t look right, ask your instructor! There might be a simple solution. If you’ve resolved the issue, continue to step 5. If not, you may need to start over with a different dataset in step 1.

Common Issues: Numeric Variables

If your calculated mean is NA and there is no warning message: Use the na.rm argument as covered in lab 5 within the mean() function. Check to confirm that you now have a numeric value. Once you have non-NA numeric means for two variables, proceed to step 6.

If you see a warning message containing “argument is not numeric or logical”: This variable may look numeric, but it’s not being treated as such by R. To force R to turn it into a numeric variable, use the as.numeric() function as follows, filling in the appropriate values for dataset_name and variable_name:

dataset_name <- dataset_name |>
  mutate(variable_name = as.numeric(variable_name))

You may see a warning message saying “NAs introduced by coercion.” This means that at least one value was unable to be turned into a number. If this is the case, you will have to use the na.rm argument as covered in lab 5 within the mean() function.

Check to confirm that you now have a numeric value. Once you have non-NA numeric means for two variables, proceed to step 6.

If you’re still unable to get a numeric value for the mean: This is likely not a suitable numeric variable. If you can choose another numeric variable from this data set, do so and start this step over with that variable. If you cannot, this data set is unsuitable for the final project. Select another of your data sets and start over from step 1.

Step 3: Complete the Moodle Quiz

Complete the Moodle quiz for this lab.

Resources

The googlesheets4 section of this lab follows the example from the googlesheets4 installation instructions

Footnotes

  1. Note that neither of these websites should be confused with the fivethirtyeight R package that we used in Lab 6. While the R package does contain data from the FiveThirtyEight data repository, it is not managed by FiveThirtyEight employees (but rather, by our very own Prof. Albert Kim).↩︎

  2. We won’t be learning how to use GitHub in SDS 100. However, GitHub is introduced in SDS 192 and is used extensively in upper-level SDS classes.↩︎

  3. Note that README.md is a Markdown file! You are already familiar with the formatting of these files because Quarto is also a Markdown variant.↩︎

  4. Unfortunately, many modern operating systems try to hide file extensions from users. But they are usually still there!↩︎