Syllabus
About this course
Course Title
SDS 100: Reproducible Scientific Computing with Data
Course Description
The practice of data science rests upon computing environments that foster responsible uses of data and reproducible scientific inquiries. This course develops students’ ability to engage in data science work using modern workflows, open-source tools, and ethical practices. Students will learn how to author a scientific report written in a lightweight markup language (e.g., Markdown) that includes code (e.g., R), data, graphics, text, and other media. Students will also learn to reason about ethical practices in data science.
Instructors
The instructors for SDS 100 are working collaboratively to coordinate across all sections of the course. Please see your instructor’s page (in the sidebar) for their contact information, student hours, etc.
Overview
This course will introduce students to the basic functionality of the R programming language using the RStudio integrated development environment, as well as the fundamentals of using Quarto to generate reproducible scientific reports. Class will consist of 75-minute weekly labs that introduce students to ways to build reproducible statistical analyses in R.
Learning goals
There are several learning goals for the course:
- Author a scientific report written in a lightweight markup language that includes code, data, graphics, text, and other media
- Compute a table of summary statistics, include it in a Markdown document, and automatically reference that table by number in the document narrative
- Compute a data graphic, include it in a Markdown document, and automatically reference that figure by number in the document narrative
- Import data and other local or remote resources into a Markdown document that another person can compile on their computer
- Assess the ethical implications to society of data-based research, analyses, and technology in an informed manner. Use resources, such as professional guidelines, institutional review boards, and published research, to inform ethical responsibilities.
- Build a bibliography automatically using BibTeX and include selected references in the text
Co-requisites
This course may only be taken by students concurrently enrolled in one of the following courses: SDS 192, SDS 201, SDS 220, SDS 290, or SDS 291. It may not be repeated for credit.
We require that students enroll in this 1-credit course at the same time as they take any of the above listed courses for the first time.
Textbooks
Each course meeting is associated with at least one reading. Most of these readings are optional; however, some readings are mandatory.
Most readings will come from these free, open-access textbooks:
- Data Science: A First Introduction
Tiffany Timbers, Trevor Campbell, and Melissa Lee
CRC Press, 2022
  - Web version (free): https://datasciencebook.ca/
  - Print version (~$60): Amazon
- R for Data Science
Hadley Wickham & Garrett Grolemund
O’Reilly Media, 2017
  - Web version (free): https://r4ds.hadley.nz/
  - Print version (~$70): Amazon
Supplementary textbooks
The following textbooks are not required for SDS 100. However, they may very well be useful to you, and sections of them may be referenced in some SDS 100 labs. We encourage you to make use of them.
- ModernDive: Statistical Inference via Data Science, 1st edition. Ismay and Kim. CRC Press, 2022. Available for free online.
- Introduction to Modern Statistics, 1st edition. Çetinkaya-Rundel and Hardin. OpenIntro, 2022. Available for free online.
- Modern Data Science with R, 3rd edition. Baumer, Kaplan, and Horton. CRC Press, 2021. Available for free online.
Evaluation
This course is graded mandatory S/U, meaning that rather than receiving a letter grade you will receive a satisfactory/unsatisfactory grade. The course assessment is divided into 7 tokens.
To earn a satisfactory grade in the course, you must successfully earn at least 6 of the 7 tokens by the end of the semester.
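As a quick arithmetic check, the token rule can be sketched in R. The token counts per category come from this syllabus; the `earned` vector is a hypothetical student's record, invented purely for illustration.

```r
# Tokens available in each category, as listed in this syllabus
available <- c(attendance = 1, lab_quizzes = 1, project = 2, coding_fluency = 3)

# Hypothetical record: this student missed one coding fluency token
earned <- c(attendance = 1, lab_quizzes = 1, project = 2, coding_fluency = 2)

total_available <- sum(available)  # all tokens on offer
total_earned <- sum(earned)        # tokens this student earned

# A satisfactory grade requires at least 6 of the 7 tokens
satisfactory <- total_earned >= 6
print(satisfactory)
```

In this sketch, a student can miss any single token and still earn a satisfactory grade.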
Attendance (1 token)
Attendance is mandatory. We expect you to come to class, in person and on time, each week. You earn the attendance token by having your attendance recorded on Moodle during at least 9 of the 13 class sessions. Moodle attendance will be available from 5 minutes before until 20 minutes after the start of the class period.
Weekly Lab Quizzes (1 token)
Each week, you will be tasked with completing a weekly lab that you begin in class and finish by the Monday after class. The lab assignments will cover a variety of topics in R, and also include post-lab assignments such as readings or data dives. Once you finish the lab assignment, you must complete a quiz on Moodle that covers the lab and any post-lab activities. These quizzes will include verification that you have completed the assigned lab, and may include basic content questions about the reading, or ask you to report on the output of your code from the lab. The quizzes are open everything and you may use any resource to complete them. However, you may not discuss these quizzes with anyone other than the instructor, or use any artificial intelligence tools during the quizzes for any purpose.
To earn the quiz token, you must:
- Get at least 80% correct on at least 10 of the 12 weekly quizzes. You can take every quiz twice, and only the higher score will be recorded.
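The quiz token rule above can be sketched in R as follows. This is only an illustration: the scores are hypothetical, and the sketch assumes that an untaken second attempt is stored as `NA`.

```r
# Hypothetical percent scores on each of the 12 weekly quizzes;
# NA marks a second attempt that was not taken (an assumption of this sketch)
attempt_1 <- c(90, 70, 85, 60, 100, 80, 75, 95, 88, 50, 82, 91)
attempt_2 <- c(NA, 85, NA, 65, NA, NA, 80, NA, NA, 78, NA, NA)

# Only the higher score on each quiz is recorded; pmax() compares element-wise
recorded <- pmax(attempt_1, attempt_2, na.rm = TRUE)

# Count the quizzes with a recorded score of at least 80%
quizzes_passed <- sum(recorded >= 80)

# The token requires at least 10 of the 12 quizzes at 80% or better
earns_quiz_token <- quizzes_passed >= 10
print(quizzes_passed)
print(earns_quiz_token)
```

Here the retakes raise two quizzes to the 80% mark, which brings this hypothetical student to exactly 10 passing quizzes.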
Reproducible Data Analysis Project (2 tokens)
As part of a group, you will identify, explore, and evaluate data sources that could be used for labs in future iterations of this course. The data that your group ultimately selects and reports on will be the result of a multi-step exploration and evaluation process that takes place in phases over the last 8 weeks of the course. The workload for the project is divided into approximately 80% individual work and 20% group work. The final report will include a written introduction to the data, explore the contents of the data set with visualizations, summarize the data with statistics, and discuss the ethical considerations involved in using the data. See the Project page for a more detailed description of the expectations for the project, and a breakdown of each individual phase.
There are two tokens for you to earn related to this project:
- Project Engagement
- To earn the project engagement token, you must complete at least 3 of the first 5 project phases without serious problems. Examples of serious problems include: no submission; a submission whose content doesn’t match the expectations; and peers indicating that you didn’t contribute equally.
In other words, you can earn your project engagement token by fully completing your individual work for each phase of the project in a timely fashion, and by fully and equally participating in your group work. Full completion of each phase means completing all required tasks, submitting all necessary materials on Moodle, and sending all necessary materials to your project group members. Timely completion of each phase means each of these things is done by the assigned deadline for each phase.
At the end of the semester, we will ask you to reflect on each group member’s contributions to the project, including your own, in order to understand whether each person in your group was a full and equal participant in the project.
- Final Report
- In the final phase of the project, your group as a whole will submit a data analysis report. You can earn your final report token by making sure your group’s data analysis report meets all content and formatting expectations, and is able to be fully reproduced by the instructors without editing any of its contents.
Coding Fluency (3 tokens)
SDS 100 is designed to develop your proficiency in the fundamentals of computing in R. At various points throughout the semester, you will have the opportunity to demonstrate your proficiency by completing “coding fluency” quizzes on Moodle. Each coding fluency quiz will focus on one of the following fundamental skills:
- Data Wrangling
- Data Visualization
- Data Importing
To earn your Data Wrangling, Data Visualization, and Data Importing tokens, you must pass the corresponding quiz. You will have three opportunities to earn each token over the course of the semester.
Coding fluency quizzes are “open everything”. You may consult the books, the Internet, and other sources (including R and our lab assignments). However, you may not discuss these quizzes with anyone other than the instructor, or use any artificial intelligence tools during the quiz for any purpose.
Your responses will be read by a human being (one of your instructors), who will determine if your answers demonstrate sufficient understanding to earn the token. If your responses are insufficient to earn the token, you will have the opportunity to take up to two more coding fluency quizzes focused on the same topic later in the semester.
Format of each Coding Fluency Quiz
Each quiz will consist of four questions.
The first question will display a prompt, along with three responses that you’ll be asked to consider in subsequent questions. Exactly one of the responses is the correct answer to the prompt, and the other two are incorrect. This initial question will simply ask you to acknowledge that you understand the prompt and are ready to move on.
The remaining three questions will display the prompt and response options, and ask you to determine whether the highlighted response option is the correct option or one of the incorrect options. If you indicate that the highlighted response is incorrect, you must also give a precise explanation of why it is incorrect. If you indicate that the highlighted response is correct, no further explanation is required.
Example of a Coding Fluency Quiz Question
Consider the hypothetical coding fluency quiz question and answers below.
Imagine that Reg I. Gigas asks three of their closest friends how many credits they are taking this semester. Rose is taking 12 credits, Drag is taking 20 credits, and Art is taking 17 credits. Reg wants to compute the average number of credits their friends are taking. How could Reg compute the average number of credits in R?
a. mean(12, 20, 17)
b. mean(c("12", "17", "20"))
c. mean(c(12, 20, 17))
Option | Satisfactory Answer | Unsatisfactory Answer | Explanation |
---|---|---|---|
a. | Incorrect. The numeric values need to be collected into a vector by placing them inside the c() function. | Incorrect. The c() function should be used. | Both answers point out that the c() function is missing, but the unsatisfactory answer does not explain how or why it should be used. |
b. | Incorrect. The quotation marks should be removed because placing quotation marks around the numeric values creates text data, and R cannot take the mean of text data. | Incorrect because of the quotation marks. | Both answers point out that the quotation marks are problematic, but the unsatisfactory answer does not explain the nature of the problem. |
c. | Correct | Anything other than the word correct | To identify the response option that is correct, simply type the word correct. |
Contrasting the satisfactory and unsatisfactory answers to this question reveals something important about the coding fluency quizzes: you must give a precise explanation of the mistakes to demonstrate your understanding and earn your token. Answers that are on the right track but lack specifics will not earn a token.
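To see what each response in the example actually does, it can help to run all three. The sketch below assumes a standard R session; the variable names are ours, and the comments describe how base R's mean() handles each call.

```r
# Option a: only the first argument (12) is treated as the data;
# 20 and 17 are matched to mean()'s other arguments (trim and na.rm),
# so R quietly returns 12 instead of the average of the three values.
option_a <- mean(12, 20, 17)

# Option b: quoted values are character data, so mean() returns NA
# (with a warning, suppressed here to keep the output tidy).
option_b <- suppressWarnings(mean(c("12", "17", "20")))

# Option c: c() collects the values into one numeric vector,
# so mean() averages all three: (12 + 20 + 17) / 3.
option_c <- mean(c(12, 20, 17))

print(option_a)
print(option_b)
print(option_c)
```

Note that option a runs without an error, which is why a precise explanation matters: the code looks plausible but does not compute the average.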
Technology in the Classroom
We will use the R statistical software package extensively and exclusively. R and RStudio are open source software that are available for free on Mac, Windows, and Linux operating systems. There are two ways to access R for this class:
RStudio Desktop: You can download R and RStudio to your computer and run things locally. The program is open source and free to install. You will need to download both R and RStudio for class. Bring a laptop to every class and come ready to install R and RStudio as part of your first lab assignment.
RStudio Server: Enrolled students will get an account on the Smith College RStudio Server shortly after enrolling in the class. The server is located at http://rstudio.smith.edu.
The advantage of using the Server is that it is cloud-based: your work will be automatically saved and backed up, and you can work from any computer that has an Internet connection and a web browser. (Note: you may need to be on a Smith College IP address to access the server; to access it off-campus, you may need to establish a VPN connection to Smith.) The disadvantages of using the Server are that, like any shared resource, it limits how much data you can store and it may process more slowly during periods of high use. You will also have greater ability to customize a local RStudio installation than the Server version.
Accommodation
Smith is committed to providing support services and reasonable accommodations to all students with disabilities. If you have a documented accommodation, please share your accommodation letter with both of your instructors.
To request an accommodation, please register with the Accessibility Resource Center (ARC) at the beginning of the semester. Please call (413) 585-2071, email arc@smith.edu, or visit the ARC website for instructions on how to register and to arrange an appointment with their staff.
Policies
Inclusion
We are committed to fostering a classroom environment where all students thrive. We are committed to affirming the identities, realities and voices of all students, especially those from historically marginalized or underrepresented backgrounds. We are dedicated to creating a space where everyone in the class is respected, is free from discrimination based on race, ethnicity, sexual orientation, religion, gender identity, disability status, and other identities, and feels welcome and ready to learn at their highest potential.
If you have any concerns or suggestions for how to make this class more inclusive, please reach out to your instructor.
We are here to support your learning and growth as data scientists and people!
Attendance
Attendance comprises one of the seven tokens. We expect you to attend class in person. We expect your full attention. Please put your phone away during class unless otherwise directed.
In keeping with Smith’s core identity and mission as an in-person, residential college, SDS affirms College policy (as per the Provost and Dean of the College) that students will attend class in person. SDS courses will not provide options for remote attendance. Students who have been determined to require a remote attendance accommodation by the Accessibility Resource Center will be the only exceptions to this policy. As with any other kind of ADA accommodations, please notify your instructor during the first week of classes to discuss how we can meet your accommodations.
If you are unable to attend class for any reason, please follow the materials on the course website and check with another student about what happened in class.
Collaboration
Much of this course will operate on a collaborative basis, and you are expected and encouraged to work together with a partner or in small groups to study, complete labs, and prepare for exams. However, all work that you submit for credit must be your own. Copying and pasting sentences, paragraphs, or blocks of code from another student or from online sources is not acceptable and will receive no credit. No interaction with anyone but the instructors is allowed on any exams or quizzes.
Academic Honor Code Statement
All students, staff and faculty are bound by the Smith College Honor Code, which Smith has had since 1944.
Smith College expects all students to be honest and committed to the principles of academic and intellectual integrity in their preparation and submission of course work and examinations. Students and faculty at Smith are part of an academic community defined by its commitment to scholarship, which depends on scrupulous and attentive acknowledgement of all sources of information, and honest and respectful use of college resources.
Cases of dishonesty, plagiarism, etc., will be reported to the Academic Honor Board.
Code of Conduct
As the instructor and assistants for this course, we are committed to making participation in this course a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion. Examples of unacceptable behavior by participants in this course include the use of sexual language or imagery, derogatory comments or personal attacks, deliberate misgendering or use of “dead” names, trolling, public or private harassment, insults, or other unprofessional conduct.
As the instructor and assistants we have the right and responsibility to point out and stop behavior that is not aligned to this Code of Conduct. Participants who do not follow the Code of Conduct may be reprimanded for such behavior. Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the instructor.
All students, the instructor, the lab instructor, and all assistants are expected to adhere to this Code of Conduct in all settings for this course: lectures, labs, office hours, tutoring hours, and over Slack.
This Code of Conduct is adapted from the Contributor Covenant, version 1.0.0, available here.
Extensions
For the Weekly Lab quizzes, extensions of up to 24 hours will typically be granted when requested at least 24 hours in advance. Longer extensions, or those requested within 24 hours of a deadline, will typically not be granted. Please plan accordingly. For the Coding Fluency quizzes, extensions will typically not be granted; you can take the quiz on that topic at the next opportunity. Please note that because some of the assignments in this class are collaborative, individual extensions for group assignments will be problematic. All extended deadlines will appear on Moodle.
Resources
Moodle and course website
The course website and Moodle will be updated regularly with lecture handouts, project information, assignments, and other course resources. Grades will be recorded on Moodle. Please check both regularly.
Communication
- Slack is the primary mechanism for course-related discussions of all kinds. Please do not email the course instructor with course-related questions! Instead, post these on #questions on Slack. If discretion is absolutely necessary, private message the instructor on Slack.
Writing
Your ability to communicate results (which may be technical in nature) to your audience (which is likely to be non-technical) is critical to your success as a data analyst. The assignments in this class will place an emphasis on the clarity of your writing.
Writing Enriched Curriculum
This course is part of Smith College’s Writing Enriched Curriculum. As such, the course supports the Writing Plan of the Program in Statistical & Data Sciences.
Please read the SDS Writing Plan for more information.
The Spinelli Center
The Spinelli Center (now in Seelye 207) supports students doing quantitative work across the curriculum. In particular, they employ:
- SDS Tutors available from 7:00–9:00pm on Sunday–Thursday evenings in Sabin-Reed 301. These students are trained to help you with your statistics and R questions.
- A Data Research and Statistics Counselor (Cameron Haas) who keeps both drop-in hours and appointments. Students are welcome to email qlctutor@smith.edu to make an appointment. Your fellow students are also an excellent source for explanations, tips, etc.
Schedule
The schedule for the Spring 2025 semester is shown below. Be sure to check the schedule regularly, as changes to the schedule may occur during the semester.