Syllabus

About this course

Course Title

SDS 100: Reproducible Scientific Computing with Data

Course Description

The practice of data science rests upon computing environments that foster responsible uses of data and reproducible scientific inquiries. This course develops students’ ability to engage in data science work using modern workflows, open-source tools, and ethical practices. Students will learn how to author a scientific report written in a lightweight markup language (e.g., Markdown) that includes code (e.g., R), data, graphics, text, and other media. Students will also learn to reason about ethical practices in data science.

Instructors

The instructors for SDS 100 are working collaboratively to coordinate across all sections of the course. Please see your instructor’s page (in the sidebar) for their contact information, student hours, etc.

Overview

This course will introduce students to the basic functionality of the R programming language using the RStudio integrated development environment, as well as the fundamentals of using Quarto to generate reproducible scientific reports. Class will consist of 75-minute weekly labs that introduce students to ways to build reproducible statistical analyses in R.

Learning goals

There are several learning goals for the course:

  • Author a scientific report written in a lightweight markup language that includes code, data, graphics, text, and other media
  • Compute a table of summary statistics, include it in a Markdown document, and automatically reference that table by number in the document narrative
  • Compute a data graphic, include it in a Markdown document, and automatically reference that figure by number in the document narrative
  • Import data and other local or remote resources into a Markdown document that another person can compile on their computer
  • Assess the ethical implications to society of data-based research, analyses, and technology in an informed manner. Use resources, such as professional guidelines, institutional review boards, and published research, to inform ethical responsibilities.
  • Build a bibliography automatically using BibTeX and include selected references in the text

Co-requisites

This course may only be taken by students concurrently enrolled in one of the following courses: SDS 192, SDS 201, SDS 220, SDS 290, or SDS 291. It may not be repeated for credit.

We require that students enroll in this 1-credit course at the same time as they take any of the above listed courses for the first time.

Textbooks

Each course meeting is associated with at least one reading. Most of these readings are optional; however, some readings are mandatory.

Most readings will come from these free, open-access textbooks:

Supplementary textbooks

The following textbooks are not required for SDS 100. However, they may very well be useful to you, and sections of them may be referenced in some SDS 100 labs. We encourage you to make use of them.

Evaluation

This course is graded mandatory S/U, meaning that rather than receiving a letter grade you will receive a satisfactory/unsatisfactory grade. The course assessment is divided into 7 tokens.

Important

To earn a satisfactory grade in the course, you must successfully earn at least 6 of the 7 tokens by the end of the semester.

Attendance (1 token)

Attendance is mandatory. We expect you to come to class, in person and on time, each week. You earn the attendance token by having your attendance recorded on Moodle during at least 9 of the 13 class sessions. Moodle attendance will be available from 5 minutes before until 20 minutes after the start of the class period.

Weekly Lab Quizzes (1 token)

Each week, you will complete a lab that you begin in class and finish by the Monday after class. The lab assignments will cover a variety of topics in R, and also include post-lab assignments such as readings or data dives. Once you finish the lab assignment, you must complete a quiz on Moodle that covers the lab and any post-lab activities. These quizzes will include verification that you have completed the assigned lab, and may include basic content questions about the reading, or ask you to report on the output of your code from the lab. The quizzes are “open everything”, and you may use any resource (other than your fellow classmates) to complete them.

To earn the quiz token, you must:

  • Get at least 80% correct on at least 10 of the 12 weekly quizzes. You can take every quiz twice and only the higher score will be recorded.

Reproducible Data Analysis Project (2 tokens)

As part of a group, you will identify, explore, and evaluate data sources that could potentially be used for labs in future iterations of this course. The data that your group ultimately selects and reports on will be the result of a multi-step exploration and evaluation process that takes place in phases over the last 8 weeks of the course. The workload for the project is divided into approximately 80% individual work and 20% group work. The final report will include a written introduction to the data, explore the contents of the data set with visualizations, summarize the data with statistics, and discuss the ethical considerations involved in using the data. See the Project page for a more detailed description of the expectations for the project, and a breakdown of each individual phase.

There are two tokens for you to earn related to this project:

  • Project Engagement

    • You can earn your project engagement token by fully completing your individual work for each phase of the project in a timely fashion, and by fully and equally participating in your group work. Full completion of each phase means completing all required tasks, submitting all necessary materials on Moodle, and sending all necessary materials to your project group members. Timely completion of each phase means each of these things is done by the assigned deadline for each phase.

    At the end of the semester, we will ask you to reflect on each group member’s contributions to the project, including your own contributions, in order to understand whether each person in your group was a full and equal participant in the project.

  • Final Report

    • In the final phase of the project, your group as a whole will submit a data analysis report. You can earn your final report token by making sure your group’s data analysis report meets all content and formatting expectations, and is able to be fully reproduced by the instructors without editing any of its contents.

Coding Fluency (3 tokens)

SDS 100 is designed to develop your proficiency in the fundamentals of computing in R. At various points throughout the semester, you will have the opportunity to demonstrate your proficiency by completing “coding fluency” quizzes on Moodle. Each coding fluency quiz will focus on one of the following fundamental skills:

  • Data Wrangling
  • Data Visualization
  • Data Importing

To earn your Data Wrangling, Data Visualization, and Data Importing tokens, you must pass the corresponding quiz. You will have three opportunities to earn each token over the course of the semester.

Coding fluency quizzes are “open everything”. You may consult the books, the Internet, and other sources (including R and our lab assignments). However, you may not discuss a quiz with anyone other than the instructor, and you may not use any artificial intelligence tools during the quiz for any purpose.

Your responses will be read by a human being (one of your instructors), who will determine if your answers demonstrate sufficient understanding to earn the token. If your responses are insufficient to earn the token, you will have the opportunity to take up to two more coding fluency quizzes focused on the same topic later in the semester.

Format of each Coding Fluency Quiz

Each quiz will consist of four questions.

The first question will display a prompt, along with three responses that you’ll be asked to consider in subsequent questions. Exactly one of the responses is the correct answer to the prompt, and the other two are incorrect. This initial question will simply ask you to acknowledge that you understand the prompt and are ready to move on.

The remaining three questions will display the prompt and response options, and ask you to determine whether the highlighted response option is the correct option or one of the incorrect options. If you indicate that the highlighted response is incorrect, you must also give a precise explanation of why it is incorrect. If you indicate that the highlighted response is correct, no further explanation is required.

Example of a Coding Fluency Quiz Question

Consider the hypothetical coding fluency quiz question and answers below.

Prompt

Imagine that Reg I. Gigas asks three of their closest friends how many credits they are taking this semester. Rose is taking 12 credits, Drag is taking 20 credits, and Art is taking 17 credits. Reg wants to compute the average number of credits their friends are taking. How could Reg compute the average number of credits in R?

  a. mean(12, 20, 17)
  b. mean(c("12", "17", "20"))
  c. mean(c(12, 20, 17))

| Option | Satisfactory Answer | Unsatisfactory Answer | Explanation |
|--------|---------------------|-----------------------|-------------|
| a. | Incorrect. The numeric values need to be collected into a vector by placing them inside the c() function. | Incorrect. The c() function should be used. | Both answers point out that the c() function is missing, but the unsatisfactory answer does not explain how or why it should be used. |
| b. | Incorrect. The quotation marks should be removed because placing quotation marks around the numeric values creates text data, and R cannot take the mean of text data. | Incorrect because of the quotation marks. | Both answers point out that quotation marks are problematic, but the unsatisfactory answer does not explain the nature of the problem. |
| c. | Correct | Anything other than the word “correct” | To identify the response option that is the correct response option, simply type the word “correct”. |

Contrasting the satisfactory and unsatisfactory answers to this question reveals something important about the coding fluency quizzes: you must give a precise explanation of the mistakes to demonstrate your understanding and earn your token. Answers that are on the right track but lack specifics will not earn a token.
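If you are curious why the two incorrect options fail, one way to find out is to run all three in R. The sketch below is only an illustration and not part of any quiz; the variable names (credits, correct, no_vector, as_text) are our own assumptions, and suppressWarnings() is used only to keep the output tidy.

```r
# Credits reported by Reg's three friends (values taken from the prompt)
credits <- c(12, 20, 17)

# mean(c(12, 20, 17)): the values are combined into one numeric vector,
# so mean() averages all three of them
correct <- mean(credits)

# mean(12, 20, 17): only the first argument is treated as the data; the other
# two are matched to mean()'s `trim` and `na.rm` arguments, so the result
# is based on the single value 12
no_vector <- mean(12, 20, 17)

# mean(c("12", "17", "20")): quoted values are character (text) data, so
# mean() returns NA along with a warning
as_text <- suppressWarnings(mean(c("12", "17", "20")))

print(correct)
print(no_vector)
print(as_text)
```

Only the first result is the average Reg is looking for (49 / 3, or about 16.33). The second call quietly returns 12 with no error message, which is why a precise explanation of the missing c() matters.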

Technology in the Classroom

We will use the R statistical software package extensively and exclusively. R and RStudio are open source software that are available for free on Mac, Windows, and Linux operating systems. There are two ways to access R for this class:

Recommended for laptop users

RStudio Desktop: You can download R and RStudio to your computer and run things locally. The program is open source and free to install. You will need to download both R and RStudio for class. Bring a laptop to every class and come ready to install R and RStudio as part of your first lab assignment.

Recommended for Chromebook and tablet users

RStudio Server: Enrolled students will get an account on the Smith College RStudio Server shortly after enrolling in the class. The server is located at http://rstudio.smith.edu.

The advantages of using the Server version are that it is cloud-based: your work will be automatically saved and backed up, and you can work from any computer that has an Internet connection and a web browser. Note that you may need to be on a Smith College IP address to access the server; to access it off-campus, you may need to establish a VPN connection to Smith. The disadvantages of using the Server are that, like any shared resource, it has limits on how much data you can store and how quickly it processes during periods of high use. A local RStudio installation also gives you greater ability to customize your setup.

Accommodation

Smith is committed to providing support services and reasonable accommodations to all students with disabilities. If you have a documented accommodation, please share your accommodation letter with both of your instructors.

To request an accommodation, please register with the Accessibility Resource Center (ARC) at the beginning of the semester. Please call (413) 585-2071, email , or visit the ARC website for instructions on how to register and to arrange an appointment with their staff.

Policies

Inclusion

Inclusion policy

We are committed to fostering a classroom environment where all students thrive. We are committed to affirming the identities, realities and voices of all students, especially those from historically marginalized or underrepresented backgrounds. We are dedicated to creating a space where everyone in the class is respected, is free from discrimination based on race, ethnicity, sexual orientation, religion, gender identity, disability status, and other identities, and feels welcome and ready to learn at their highest potential.

If you have any concerns or suggestions for how to make this class more inclusive, please reach out to your instructor.

We are here to support your learning and growth as data scientists and people!

Attendance

Attendance accounts for one of the seven tokens. We expect you to attend class in person and to give it your full attention. Please put your phone away during class unless otherwise directed.

In keeping with Smith’s core identity and mission as an in-person, residential college, SDS affirms College policy (as per the Provost and Dean of the College) that students will attend class in person. SDS courses will not provide options for remote attendance. Students who have been determined to require a remote attendance accommodation by the Office of Disability Services will be the only exceptions to this policy. As with any other kind of ADA accommodations, please notify your instructor during the first week of classes to discuss how we can meet your accommodations.

If you are unable to attend class for any reason, please follow the materials on the course website and check with another student about what happened in class.

Collaboration

Warning

Much of this course will operate on a collaborative basis, and you are expected and encouraged to work together with a partner or in small groups to study, complete labs, and prepare for exams. However, all work that you submit for credit must be your own. Copying and pasting sentences, paragraphs, or blocks of code from another student or from online sources is not acceptable and will receive no credit. No interaction with anyone but the instructors is allowed on any exams or quizzes.

Academic Honor Code Statement

All students, staff and faculty are bound by the Smith College Honor Code, which Smith has had since 1944.

Smith College expects all students to be honest and committed to the principles of academic and intellectual integrity in their preparation and submission of course work and examinations. Students and faculty at Smith are part of an academic community defined by its commitment to scholarship, which depends on scrupulous and attentive acknowledgement of all sources of information, and honest and respectful use of college resources.

Cases of dishonesty, plagiarism, etc., will be reported to the Academic Honor Board.

Code of Conduct

As the instructor and assistants for this course, we are committed to making participation in this course a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion. Examples of unacceptable behavior by participants in this course include the use of sexual language or imagery, derogatory comments or personal attacks, deliberate misgendering or use of “dead” names, trolling, public or private harassment, insults, or other unprofessional conduct.

As the instructor and assistants we have the right and responsibility to point out and stop behavior that is not aligned to this Code of Conduct. Participants who do not follow the Code of Conduct may be reprimanded for such behavior. Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the instructor.

Important

All students, the instructor, the lab instructor, and all assistants are expected to adhere to this Code of Conduct in all settings for this course: lectures, labs, office hours, tutoring hours, and over Slack.

This Code of Conduct is adapted from the Contributor Covenant, version 1.0.0, available here.

Extensions

Extensions up to 24 hours will typically be granted when requested at least 24 hours in advance. Longer extensions, or those requested within 24 hours of a deadline, will typically not be granted. Please plan accordingly. Please note that because some of the assignments in this class are collaborative, individual extensions for group assignments will be problematic. All extended deadlines will appear on Moodle.

Resources

Moodle and course website

The course website and Moodle will be updated regularly with lecture handouts, project information, assignments, and other course resources. Grades will be recorded on Moodle. Please check both regularly.

Communication

  • Slack is the primary mechanism for course-related discussions of all kinds. Please do not email the course instructor with course-related questions! Instead, post these on #questions on Slack. If discretion is absolutely necessary, private message the instructor on Slack.

Writing

Your ability to communicate results (which may be technical in nature) to your audience (which is likely to be non-technical) is critical to your success as a data analyst. The assignments in this class will place an emphasis on the clarity of your writing.

Writing Enriched Curriculum

This course is part of Smith College’s Writing Enriched Curriculum. As such, the course supports the Writing Plan of the Program in Statistical & Data Sciences.

Please read the SDS Writing Plan for more information.

The Spinelli Center

The Spinelli Center (now in Seelye 207) supports students doing quantitative work across the curriculum. In particular, they employ:

  • SDS Tutors available from 7:00–9:00pm on Sunday–Thursday evenings in Sabin-Reed 301, except Tuesday evenings, when they are in McConnell 214. These students are trained to help you with your statistics and R questions.
  • A Data Research and Statistics Counselor (Gabriel Lewis) who keeps both drop-in hours and appointments. Students are welcome to email qlctutor@smith.edu to make an appointment.

Your fellow students are also an excellent source for explanations, tips, etc.

Schedule

The following outline gives a basic description of the course. Please see Moodle for more specific information about readings and assignments.

| Week | Topic | Tokens |
|------|-------|--------|
| 1 | Lab 1: Getting Started | |
| 2 | Lab 2: Intro to RStudio and Projects | |
| 3 | Lab 3: Basic Data Visualizations | |
| 4 | Lab 4: Introduction to Data Wrangling | |
| 5 | Lab 5: Summarizing Data | |
| 6 | Lab 6: Polishing Figures | DW |
| 7 | Lab 7: Importing Data | DV |
| 8 | Lab 8: Formatting in Quarto | DI |
| 9 | Lab 9: Ethics | DW |
| 10 | Lab 10: Referencing Figures and Tables | DV |
| 11 | Lab 11: Ethical Codes for Data Science | DI |
| 12 | Lab 12: Bibliography | |
| 13 | Lab 13: Final Group Project Meeting | PE, FR, DW, DV, DI |