Lab 2: Problem Solving

Preamble

The purpose of this lab is to introduce you to working with R, and with resources and techniques for problem solving while using R. You should reference this lab often throughout the semester for reminders on best practices for addressing errors and getting help.

A note on Labs

Each lab has the same structure:

  • A preamble (i.e., what is above this section) that tells you the goals for the lab and how you will know when you have achieved them
  • A Why are we here? section that explains how this lab relates to the previous labs
  • The lab instructions, which include boxed examples that you need to complete and match to the solutions file
  • The submission instructions, which detail what you need to do to finish this lab and to prepare for the next lab

Most labs have an accompanying template file, which is a version of this lab with just the questions. You should download the Lab 2 template, place it in your SDS 100 project folder from last week, and open the template. That way, you can answer the questions as you go through this lab.

Why are we here?

In Lab 1, you set up the computational tools that we are using in this course and for most classes in the SDS program. You should have installed R and RStudio, and then rendered a Quarto document (a .qmd) into an .html file.

In this lab, we will start to learn how to use these computational tools, and what to do when those tools break.

Along the way we will also learn how to:

  • Use R as a calculator
  • Assign outputs to objects in the R environment
  • Familiarize yourself with the R interface
  • Interpret error messages in R
  • Reference R help resources
  • Identify working directories and navigate file paths

Lab instructions

Setting up

Download the Lab 2 template.

Warning

Before you begin:

  • Move the template file you just downloaded from your Downloads folder into the SDS 100 folder that you created last week.

    • There are lots of different ways to do this (e.g., using the Finder, Windows Explorer, or command line, etc.). Perhaps the easiest way is to open the lab_02_problem_solving.qmd file, go to File -> Save As…, and navigate to your RStudio project directory.
  • Make sure that within RStudio, you’re working in the SDS 100 project that you created last week. If the Project pane in the top right of your RStudio window doesn’t show “SDS 100”, go to Open Project as shown below and navigate to your project.

Optional reading

Chapter 2: R basics and workflows is a good additional reference for the material covered in this lab.

R as Calculator & Storing Values

The first thing you can try to do with R is a familiar one - simply use it like a calculator. It is easy to forget because of all the other things it can do, but R at the lowest level is still just working with numbers.
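As a quick illustration of the calculator idea, here is a minimal sketch of a few other arithmetic operators you can type into the Console. The specific numbers are made up for this example; each line prints its result immediately.

```r
# Basic arithmetic: each line prints its result in the Console
8 - 3        # subtraction
6 * 4        # multiplication
9 / 2        # division
2^5          # exponent (2 to the power of 5)
(1 + 2) * 3  # parentheses control the order of operations
```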

Question

In the Console, enter 5 + 2 and press Enter to run the code.

R Objects

Right now, the outputs of our code - in this case the number 7 - are just being shown to us. That can be helpful, but isn’t really the point of a programming language. We want to be able to save our results and build on them later. We can do that by assigning the output of the code we run into an object.

You can think of objects as boxes that store things. To create an object, we have to ask R to take the results of some code and assign those results to an object. We do this using the assignment arrow, <-. Let’s try it out.

Question

In the Console, type in result <- 5 + 2 and press Enter to run the code.

Then, type in applesauce <- 10 * 3 and press Enter to run the code.

You can see all of the objects you currently have in the upper right pane labeled Environment. At this point, you should see two objects in your environment: an object named result (which has the value 7) and an object named applesauce (which has the value 30). Objects can be named anything, as long as they 1) don’t start with a number, and 2) don’t contain any spaces.

Note that when you created result and applesauce, R didn’t show the results of 5 + 2 or 10 * 3 to the Console. This is because creating an object is separate from showing it. Showing results is often called printing. In order to print an object, you can just write the name of it. When you run that, R will show you the object.
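To make the naming rules and the difference between creating and printing an object concrete, here is a small sketch. The object names snack_count and snack_count2 are hypothetical and chosen only for illustration; the two commented-out lines show names that R would reject.

```r
# Valid names: a number is fine as long as it is not the first character
snack_count <- 4 * 3      # nothing is printed when an object is created
snack_count2 <- 4 + 3

# These lines would cause errors if you removed the leading #:
# 2snacks <- 4 * 3        # name starts with a number
# snack count <- 4 * 3    # name contains a space

snack_count               # typing the name prints the stored value
print(snack_count2)       # print() does the same thing explicitly
```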

Question

Type result into the Console and press Enter to print it. Do the same with applesauce.

Objects are useful because we can use their names to represent their values in our code. For example, we can now add result and applesauce together, because result represents the value 7 and applesauce represents the value 30. Try it for yourself:

Question

In the Console, enter result + applesauce and press Enter to run the code.

We can even change what is stored inside those objects, then run the same code to get different results!

Question

In the Console, enter result <- 300 and press Enter to run the code. Then, try result + applesauce again. What has changed? Why?

[1] 330

It is very important to remember that anything you don’t assign to an object is lost. It is shown to you, but R will not remember it, and you will not be able to use it later.
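The following sketch contrasts a value that is only printed with one that is stored, and shows how overwriting an object changes later results. The names price and quantity are hypothetical, used only for this example.

```r
4 * 25               # printed once, but not stored anywhere

price <- 4 * 25      # stored in an object, so we can reuse it
quantity <- 3
price * quantity     # uses the stored values

price <- 10          # overwrite what is stored in price
price * quantity     # the same code now gives a different result
```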

R Functions and Arguments

Functions in R are like imperative sentences (e.g., “go”, “stay”, or “sleep”). They indicate to R that it should take some form of action. Functions are typically immediately followed by open and closed parentheses. For example, check out this function that we call when we want R to output the current date.

Code
Sys.Date()
[1] "2024-04-30"

Now imagine we approached someone with the imperative sentence: “Close” …or “Bring”. Their next questions might be “close what” or “bring what”, and we might say back “close the door” or “bring dessert”. Arguments specify the action of a function by supplying the “to what” and/or “how.” They are listed inside of the parentheses, and often have specific names. Some arguments are required for a function to run, and others are optional.
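As a rough sketch of how arguments work, the calls below use a few base R functions. Sys.Date() needs no arguments, round() has one required argument and an optional named argument (digits), and seq() is called with all of its arguments named. The particular numbers are just examples.

```r
Sys.Date()                       # no arguments needed

round(3.14159)                   # only the required argument; digits defaults to 0
round(3.14159, digits = 2)       # the optional digits argument says "how"

seq(from = 1, to = 10, by = 3)   # named arguments: start, end, and step size
```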

Question

The sum() function adds up all of its arguments, and outputs the grand total. In the Console, enter sum(5, 2) and press Enter. What were the arguments you passed to the sum() function?

R Editor Panel

Up to this point, we have been writing R code in the Console pane. You’ll notice that when you direct your attention to the console, it will look like this:

>

When we execute code in the console, we get a result immediately. The console is a great space to work when we need to run a piece of code only once. However, the console is not a great space to edit code or to save code that we’ve written. This is why most of the code that you will write in this course and other SDS courses will be composed in the editor pane (which is typically located directly above the console). You’ll notice that when you direct your attention to the editor, the lab template for this lab will be open. In the open file, you will see text after a series of line numbers that look like this:

1
2
3

This is a file where you can compose, edit, and save code. In this file, you can also write text that explains the rationale for the code you’ve written, interprets the results, or offers additional contextual information regarding the data analysis.

Code Chunks

RStudio needs a way to know when we are switching from writing text about our code to actually writing code that we wish to execute. This is where code chunks come into play. A code chunk is a space in a .qmd file where we can write code that we wish to execute. There are a few ways to create code chunks in RStudio:

First, we can place our cursor on a new line in the editor pane and enter ```{r} to indicate the start of a code chunk. Moving to a few lines beneath that, we can then enter ``` to indicate the end of a code chunk. You’ll notice that when you do this, it creates a grey box with a green triangle icon in the upper right hand corner. The box will look something like this:

Code
```{r}
#| output: false
#| eval: false

# Hello world!
# I'm a code chunk.

1 + 8
```

There’s another, simpler way to create a code chunk. Direct your attention to the upper right hand corner of your editor pane. You’ll notice a green icon with a plus sign (+) and the letter ‘C’. If you place your cursor on a new line and click that button, a code chunk will be automatically created on that line in the file. You’ll notice that the code chunk starts with ```{r} and ends with ```.

In this course, almost all of the code that we compose will be written in code chunks within .qmd files.

Question

In your template file, add result + applesauce to the end of the following code chunk. Click the green triangle button in the upper right hand corner of the code chunk to execute the code in the chunk.

Code
result <- 5 + 2
applesauce <- 10 * 3

Just like when you ran this same code in the Console, here you’ve saved the result of 5 + 2 and the result of 10 * 3 to objects in your environment. You’ve also evaluated the result of adding the values stored in these two objects.

However, unlike when you ran this in the Console, this time around you have the ability to edit your code. For example, you could place your cursor on the line where you created the object for result, change the number 5 to 4, and then re-run the code to get a new result. You can also save this file so that you can re-run this code at a later date.

Warning

It’s very important that all ```{r} have a corresponding ```. If you don’t close a code chunk with ```, you will see that the code chunk extends to the end of your file. If you have an extra ``` in your file, you will notice that run buttons disappear in all of the code chunks below that extra ```.

If you ever notice that the green, triangular run buttons have disappeared from your code chunks, first make sure you are in the source editor mode, rather than visual. You can change this setting in the top bar of the editor pane. Next, look for an extra ``` in your .qmd file.

Commenting

Any text that you enter into a code chunk must be executable R code. For instance, check out what happens when we try to write some instructions in “plain English” at the bottom of our code:

Code
result <- 5 + 2
applesauce <- 10 * 3
Place cursor here.
Error: <text>:3:7: unexpected symbol
2: applesauce <- 10 * 3
3: Place cursor
         ^

We get an error because the words “Place cursor here” are not valid R code. If we want to add explanatory text to a code chunk, we need to preface that text with a hashtag (#) like this:

Code
# This is a comment

This indicates that all of the characters following the hashtag should be ignored when it comes time to run the code.

Adding comments in code chunks is helpful for explaining what a piece of code is doing. This is important when sharing your code with others, or even reminding your future self why you took a certain approach. Check out how we use a comment to explain the code below:

Code
# Below we create a vector of colors. To create a vector, we use the function c(), which is short for combine. 

colors <- c("red", "blue", "green")

Interpreting Error Messages

Often, something won’t go as planned. R has several ways to communicate with us, the most common of which are error messages. Error messages appear when something doesn’t work as anticipated. It is tempting to see them as denoting failure, but that is not the case. Error messages are a way for R to alert us to potential problems; they are R’s way of trying to communicate with us. The real danger comes when R does not tell us that something has gone wrong.

First, it’s important to make some distinctions between the kinds of messages that R presents to us when attempting to run code:

Errors

Terminate a process that we are trying to run in R. They arise when it is not possible for R to continue evaluating a function, such as when we try to add letters together.

Warnings

Don’t terminate a process but are meant to warn us that there may be an issue with our code and its output. They arise when R recognizes potential problems with the code we’ve supplied.

Messages

Also don’t terminate a process and don’t necessarily indicate a problem but simply provide us with more potentially helpful information about the code we’ve supplied.

Check out the differences between an error and a warning in R by reviewing the output in the Console when you run the following code chunks.

Error in R

Code
sum("3", "4")
Error in sum("3", "4"): invalid 'type' (character) of argument

Warning in R

Code
vector1 <- 1:5
vector2 <- 3:6
vector1 + vector2
Warning in vector1 + vector2: longer object length is not a multiple of shorter
object length
[1]  4  6  8 10  8
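For completeness, here is a small sketch of the third kind of output, a message, next to another common warning. The message() function is base R; the text passed to it and the object names total and converted are made up for this example.

```r
# message() produces a message: purely informational, and the code keeps running
message("Starting the calculation...")
total <- sum(1, 2, 3)
total

# as.numeric() gives a warning (not an error) when it cannot convert text;
# the code still runs, but the result is NA (a missing value)
converted <- as.numeric("three")
converted
```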

So what should you do when you get an error message? How should you interpret it? Luckily, there are some clues and standardized components of the message that indicate why R can’t execute the code. Consider the following error message that you received when running the code above:

Error in sum("3", "4") : invalid 'type' (character) of argument

There are three things we should pay attention to in this message:

  1. The word “Error” indicates that this code did not run.
  2. The text immediately after the word “in” tells us which specific function did not run.
  3. The text after the colon gives us clues as to why the code did not run.

Reviewing the error above, we can guess that there was a problem with the argument that we supplied to the sum() function, and specifically that we supplied an argument of the wrong type. This is because R interprets “3” and “4” as character strings and not numbers, and it doesn’t know how to add character strings together.
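One way to investigate a “type” problem like this is to ask R what kind of value it thinks something is. The sketch below uses the base R functions class() and is.numeric() on example values (10 and "10") chosen only for illustration.

```r
class(10)          # a number: "numeric"
class("10")        # quotation marks make it text: "character"

is.numeric(10)     # TRUE
is.numeric("10")   # FALSE
```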

Question

Fix the errors in the code below, and then press the green “Play” button to run the corrected code.

Code
sum("3", "4")

Preparing to Get Help

When we do get errors in our code and need to ask for help in interpreting them, it’s important to provide collaborators with the information they need to help us. Sometimes when teaching R we will hear things like: “My code doesn’t work!” or “I’m stuck and don’t know what to do,” and it can be challenging to suss out the root of the issue without more information.

Notice that the left side of this document has a series of numbers listed vertically next to each line? These are known as line numbers. Oftentimes, if you are having an issue with your code and ask us to review it, we will say something like: “Check out line 53.” By this we mean that you should scroll the document to the 53rd line. You can similarly tell us or your peers which line of your code you are struggling with.

Referencing Resources

There are a number of resources available to help you recall how certain functions work.

Help pages

R help pages can be a great resource when you know the function you need to use, but can’t remember how to apply it or what its parameters are. Help pages typically include a description of the function, its arguments and their names, details about the function, the values it produces, a list of related functions, and examples of its use. You can access the help pages for a function by typing the name of the function with a question mark in front of it into the Console (e.g. ?log or ?sum). Some help pages are well-written and include helpful examples, while others are spotty and don’t include many examples.
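As a brief sketch, here are three base R ways to look up a function, using round() as an arbitrary example. The first two open the same help page; args() prints just the argument names in the Console, which can be enough for a quick reminder.

```r
?round          # open the help page for round()
help("round")   # the same help page, written as a function call

args(round)     # show only the argument names and their defaults
```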

Question

Access the help pages for the function sort(). Write code below to sort the vector a in decreasing order.

Code
a <- c(1, 6, 4, 4, 8, 2)

Cheatsheets

The R community has developed a series of cheatsheets that list the functions made available through some packages and their arguments. Cheatsheets are a great resource when you know what you need to do and what tools to use on a dataset in R, but can’t recall the function that enables you to do it.

Question

Imagine we collected the temperature of our home each day for the past nine days (see code below). Let’s say we wanted to find how each day ranked from coolest to hottest. In other words, we wanted to know: was day 1 the coolest day (rank 1) out of all 9 days? Or was day 1 only the 4th coolest (rank 4) day?

Use this cheatsheet to find out which R function will allow you to generate a ranking of each day’s temperature. Then, search the help page for this function to determine how to handle ties (i.e., two or more days with the same temperature) by randomly choosing one temperature to be “first”. Make sure your code includes a short comment describing how ranking the temperatures is different from sorting the temperatures.

Code
# Create a vector of temperatures
temps_to_rank <- c(68, 70, 78, 75, 69, 80, 66, 66, 79)

# Write code below to rank the days with random ties

# Replace this line with a comment to yourself (starting a line with a # in a code block or script file makes it a comment) describing how this function is different than sorting the data. 

Searching the Web

You can also search the web when you get errors in your code. Others have likely experienced that error before and gotten help from communities of data analysts and programmers. However, you should never copy and paste code directly from Stack Overflow. This violates the course policies on Academic Honesty. Instead you should use these resources to take notes and learn how to improve and revise code. Any time you reference Stack Overflow or any other Web resource to help you figure out an answer to a problem, you should cite that resource in your code. Here is how you would cite that post in APA format:

Username of asker. (Year, Month Date). Title of page. Stack Overflow. Webpage URL

Question

Using a comment, add a properly formatted citation for this Stack Overflow post to the code chunk above.

File Paths and Organization

Computers allow you to save files, like Word documents or PDFs, so that you can access them whenever you need to. But, your computer doesn’t force all your files to be saved together in one place; your computer’s storage is divided into folders. For example, your computer comes with a folder named Downloads (where files downloaded from the internet are saved by default), and a folder named Documents (where new Word documents are saved by default).

Files that are saved on your computer are called local files, while files that are stored elsewhere are called remote. Files that are in folders for syncing services like Dropbox, Google Drive, or similar services exist in an in-between state, in that they are visible on your computer, but are downloaded on-demand whenever you try to open them. This can cause problems for R, so we suggest you keep your code folders in non-synced locations.

The folders in your computer’s storage are arranged in a hierarchy, with one folder at the “top” of the hierarchy and others nested below it. Folders at lower levels of the hierarchy are stored within the folders at higher levels. For example, your Downloads and Documents folders are actually nested within your user’s home folder (which is a folder you may not commonly encounter when browsing around your computer).

Folders and files in a hierarchy on your computer

Storage and items in a hierarchy in a kitchen

A good way to think about your computer’s hierarchy of folders (usually called a file system) is to compare it to a kitchen. You have access to everything, but you don’t want all your forks, spoons, plates, etc. in one big pile on the floor. You organize them so they are easy to find and store. You may start with broad categories–you have a drawer for all your utensils–but that drawer is broken down into smaller spaces: a spot for forks, a spot for spoons, a spot for knives, etc. The same principle is true of files. You create ever more specific spaces for things so they are easy to find later.

Making effective use of your file system

Unlike a kitchen, you can create new spaces for files whenever you want in the form of folders. For example, on the first day of class you made a special folder called an R Project, which we’ll learn more about in a minute. The power to create new folders is a useful organizational tool; it allows you to keep related files grouped together, but isolated from other irrelevant files.

It may be tempting to just store all your files in one place, like your Downloads folder or your Desktop. After all, it’s fewer clicks (or perhaps 0 clicks!) when you’re saving your work, and you can quickly access those folders. But, with all your files from all your classes in one place, you can’t quickly find what you need, just as if all your kitchenware were in one giant pile. Also, if everything is in one place you can’t easily name your files, since all the names have to be unique (or you end up with several similar names like “Essay.docx”, “Essay (1).docx”, “Essay (2).docx”, etc.). This makes searching for your files with the search bar tricky.

Furthermore, other programs on your computer might access and change files in this location (e.g., your web browser might clear your Downloads folder if you ask it to clean up space). For reasons like this, it’s a good idea to make folders on your computer to store work for different tasks. For example, it’s probably a good idea to make a folder for each of your classes inside your Documents folder. And it’s probably a good idea to make a folder just for homework assignments inside each of your class folders. Something like this:

Home
├── Documents/
│   └── Classes/
│       ├── SDS 100/
│       │   └── Homework
│       ├── SDS 192
│       ├── SDS 210
│       └── SDS 270
├── Downloads/
│   └── Memes
├── Pictures/
│   ├── Vacation
│   └── School
└── Music
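If you are curious, folders like these can also be created from R itself. The sketch below builds a small piece of the hierarchy above using the base R functions dir.create() and list.files(). It works inside R’s temporary folder (tempdir()), an assumption made for this demo so that nothing on your real file system is changed.

```r
# Build part of the example hierarchy inside R's temporary folder
top <- file.path(tempdir(), "Documents")
dir.create(file.path(top, "Classes", "SDS 100", "Homework"),
           recursive = TRUE, showWarnings = FALSE)   # creates all nested folders
dir.create(file.path(top, "Classes", "SDS 192"), showWarnings = FALSE)

# List every folder nested under Classes
list.files(file.path(top, "Classes"), recursive = TRUE, include.dirs = TRUE)
```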

R Projects & The working directory

Keeping track of where you save files on your computer matters to statisticians and data scientists for more than just good organization: you need to tell R where files are so it can work with them. Without knowing exactly where these files are and what their exact names are, you won’t be able to access them. Often, the easiest way to make sure you can access files from within R is to make R’s working directory the same folder that your files are in. The simplest way to do this is to work in an RStudio Project, like the one from Lab 1.

While it’s not obvious, R always has a “working directory”: a folder on your computer where it looks for files. Opening an RStudio Project sets R’s working directory to the Project directory. You can check which working directory R is currently using by running the command getwd(), which stands for “get working directory,” in the Console.

getwd()
[1] "/home/will/Documents/Classes/SDS 100"

The getwd() function prints out the path to your current working directory. Mine will be different from yours because we are on different computers! Think of paths as addresses that give directions to get to a specific folder on your computer. For example, the result from getwd() shown above says that to get to the folder that R is currently watching on my computer, you have to:

  1. Start at the very top of the folder hierarchy
  2. Then go to the folder named home
  3. Then go to the folder named will
  4. Then go to the folder named Documents
  5. Then go to the folder named Classes
  6. Then go to the folder named SDS 100

Since this particular path begins at the very top of the folder hierarchy (you can tell by the / at the very start), it’s called an absolute path. If you paste this command into your R console, we expect the path to your current working directory will be different, since your computer isn’t an exact copy of ours!
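To see how a path breaks down into the steps listed above, here is a small sketch using base R helpers on the example path from this lab. These functions only work with the text of the path, so the folders do not need to exist on your computer.

```r
example_path <- "/home/will/Documents/Classes/SDS 100"

basename(example_path)             # the last folder in the path: "SDS 100"
dirname(example_path)              # everything before it: "/home/will/Documents/Classes"
strsplit(example_path, "/")[[1]]   # each step in the path, split at every /

# file.path() glues folder names together with / in between
file.path("/home", "will", "Documents")
```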

Question

Use getwd() to report the path to R’s current working directory.

Why does the working directory matter?

For R to work on any files, it needs to know where they are. R Projects help R do that, because it will start looking for files inside the project folder, rather than way up at your Home folder. Let’s work through an example to see how this helps.

Go to the File menu in the upper left corner of your screen, and click on New File > Text File. This will open a new text file in RStudio. Inside this text file, copy the sentence: “R is helpful, but needs very specific directions.” Save this file inside your SDS 100 project folder by going to the File menu in the upper left again, clicking Save As ..., and calling it paths.txt. If you have your SDS 100 project open, you should be able to see your new file in the Files pane in the lower right of RStudio. The video below shows an example of creating this file:

Say we wanted R to read this file for us. We have two options: give R the absolute path, starting at the very top of the folder hierarchy, or the relative path, starting from our working directory. The absolute path to this file might look like:

"/home/will/Documents/Classes/SDS 100/paths.txt"

The relative path could cut out everything before the project folder, meaning all we would need for R to find the file is:

"paths.txt"

Let’s try it together.

Question

In the console, use getwd() again to get your working directory. Copy the whole path from the output of that command (including the quotes!), and add /paths.txt to the end (inside the quotation marks).

Take the full path to your paths.txt file, and in the console, write the function readLines(" PATH TO YOUR WORKING DIRECTORY /paths.txt") and press Enter. R should read what you wrote in that text file!

Code
readLines("PUT YOUR ABSOLUTE PATH HERE")
# make sure there is only one set of quotes!

Reading files from relative paths will be super important when you start working with other people on code projects. Your partner in a group project can’t load a file from "/home/will/Documents/Classes/SDS 100/" because they won’t have a “will” folder on their computer. However, if you use a relative path, and everybody has the same files in their project, everyone will be able to use the same code.
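The sketch below shows that a relative path and an absolute path can point to the same file. To stay self-contained, it creates a throwaway folder (path_demo) and file (notes.txt) inside R’s temporary folder; both names are hypothetical. It also uses setwd() to change the working directory, which is done here only for the demonstration. In this course, opening your RStudio Project sets the working directory for you.

```r
# Create a throwaway folder and file so the sketch does not touch real work
demo_dir <- file.path(tempdir(), "path_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("R is helpful, but needs very specific directions.",
           file.path(demo_dir, "notes.txt"))

old_wd <- setwd(demo_dir)   # temporarily make demo_dir the working directory

# Relative path: R starts looking from the working directory
relative_text <- readLines("notes.txt")

# Absolute path: the working directory plus the file name
absolute_text <- readLines(file.path(getwd(), "notes.txt"))

identical(relative_text, absolute_text)   # both paths reach the same file

setwd(old_wd)               # restore the original working directory
```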

Question

Repeat the activity from the previous question, but this time, use a relative path to locate the paths.txt file. Remember, relative paths mean “look for this file starting from your current location”, and your current location should be your SDS 100 R project directory. So if you

  1. have saved paths.txt in your SDS 100 R project directory, and
  2. have opened your SDS 100 R project in RStudio

all you need to do is type in readLines("paths.txt") and press Enter.

Code
readLines("PUT YOUR RELATIVE PATH HERE")
# make sure there is only one set of quotes!

Finishing this lab

Instructions

Step 1: Create twin document

Download the Lab 2 template (if you haven’t done so already). Complete it such that the results in the generated output match the examples in the Lab 2 solutions file.

Step 2: Complete the Moodle Quiz

Complete the Moodle quiz for this lab.

Optional reading

To prepare for next week’s lab on data visualization, we encourage reading sections 4.1–4.4 in Data Science: A first introduction.
