Spring 2024

Overview

This class will survey techniques and systems for ingesting, efficiently processing, analyzing, and visualizing large data sets. Topics will include data cleaning, data integration, scalable systems (SQL/relational databases, NoSQL, Spark, etc.), analytics (data cubes, scalable statistics and machine learning), embeddings, RAG models, and LLMs, among others. The goal of the class is to give students working experience with data tools, along with in-depth discussion of the concepts covered in lecture. Students should have a background in programming and algorithms.

Coursework consists of a semester-long project and 6 hands-on labs of varying length, both focused on working with real data sets, plus 2 quizzes.

Enrollment may be limited.

The course web site is http://dsg.csail.mit.edu/6.S079/.

Lectures

Lectures are held twice a week, on Tuesdays and Thursdays from 2:30 to 4:00 in 32-155. Attendance at lectures is mandatory, and you are expected to show up prepared to answer questions and participate in discussion.

Topics Covered

What is Data Science?
Data Representation and Basic Operations
Common Tools for Data Science
Data Extraction & Wrangling
Data Cleaning
Entity Resolution
Machine Learning Overview
ML in Python
Risk Factors
Embeddings
Vector Stores
LLMs & Generative AI
RAG architectures
Visualization
Scaling Beyond Python
Database Performance Tuning
Parallelism in Data Processing
Scalable Data Processing (Spark, Ray)
Modern Data Warehousing
Cloud Data Tools Ecosystem

Prerequisites

Students should have taken 6.0001 (Introduction to Computer Science Programming in Python) as well as 6.0002 (Introduction to Computational Thinking and Data Science) or equivalent. If you do not have experience in these subjects and would like to take the course, please email the instructor. Prior database experience is not required. Python programming experience is assumed; familiarity with basic Linux command-line functionality is a benefit.

Units

3-0-9.

Grading

Grades are assigned based on labs, quizzes, and the final project, as well as class participation. The grading breakdown is as follows:

Each student is allowed 5 "late days", each of which may be used to turn in a lab one day (one 24-hour period) later than it is due without penalty. After all 5 late days are used, assignments will be docked one letter grade for each day they are late.

Late days may not be used for the final project report submission. Please don't hesitate to reach out to the course staff if you are struggling for any reason; we are generally happy to offer extensions with a note from S3 or GradSupport.

Collaboration Policy

In line with MIT’s policy on Academic Integrity, here are our expectations regarding collaboration and sharing of work. For most problem sets and labs, you are allowed one collaborator with whom you may solve problems, write code, and submit a single solution. Such collaborative submissions should explicitly list the collaborators. Beyond this collaborator, you may discuss your general ideas and approach for problem sets and labs with other students, but you are expected to write your own code and solutions. Here are some examples of things you should not do with anyone except your collaborator:

It's OK to help another student solve a problem in their code, but if you do this, don't use your own implementation as you do so. Note that we will use software to detect copying of lab and homework assignments.

[Public Code] Please do not make solutions of any of your 6.S079 labs public. Copyright for lab code is held by the course staff, and does not allow redistribution of derived works without prior permission. Your solutions are a derived work, so you may not distribute your problem set or project solutions publicly. This means you cannot post them in a public Dropbox folder, on a public server accessible to others, or on GitHub. (Be aware that GitHub repositories are public by default.) Keep in mind that when work on a problem set or project is copied, both the provider and the consumer of copied materials are violating academic honesty standards, as described above.

[Generative AI] You are welcome to use generative AI tools like ChatGPT to help with general coding questions, but please do not ask them to solve the specific problems in any of the labs. Examples of questions that are allowed:

A simple rule of thumb: anything that involves copying text from our problem sets or labs into one of these tools is not allowed. Examples include asking a tool to fill in a missing piece of code in one of the lab skeletons, or providing it with the schemas from a SQL problem set and asking it to write a query. For open-ended labs and the final project, it's OK to use these tools as code assistants to help you write code.

Text

There is no textbook. Readings from the Internet will be posted before most classes. The readings introduce the lecture topics in more detail and are a good resource for preparing for the exams.

Last change: 1/15/2024.