Spring 2022

Overview

This class will survey techniques and systems for ingesting, efficiently processing, analyzing, and visualizing large data sets. Topics will include data cleaning, data integration, scalable systems (relational databases, NoSQL, Spark, etc.), analytics (data cubes, scalable statistics and machine learning), fundamental statistics and machine learning, and scalable visualization of large data sets. The goal of the class is for students to gain working experience with these techniques, along with in-depth discussion of the topics covered. Students should have a background in programming and algorithms. There will be a semester-long project and paper, and hands-on labs designed to give experience with state-of-the-art data processing tools.

There will be a semester-long project, 6 labs of varying length, and 2 quizzes.

Enrollment may be limited.

The course web site is http://dsg.csail.mit.edu/6.S079/.

Lectures

Lectures are held twice a week, on Mondays and Wednesdays from 2:30 to 4:00 in 37-212. Attendance at lectures is mandatory, and you are expected to show up prepared to answer questions and participate in discussion.

Topics Covered

What is Data Science?
Data Representation and Basic Operations
Common Tools for Data Science
Data Extraction & Wrangling
Data Cleaning
Entity Resolution
Machine Learning Overview
Embeddings
ML in Python
Visualization
Dashboards / Declarative Visualizations
No-code platforms
Scaling Beyond Python
Database Performance Tuning
Parallelism in Data Processing
Scalable Data Processing (Hadoop, Spark)
Modern Data Warehousing
Cloud Data Tools Ecosystem
Risk Factors

Prerequisites

Students should have taken 6.0001 (Introduction to Computer Science Programming in Python) as well as 6.0002 (Introduction to Computational Thinking and Data Science) or equivalent. If you do not have experience in these subjects and would like to take the course, please email the instructor. Prior database experience is not required. Python programming experience is assumed; familiarity with basic use of a Linux command line is a benefit.

Units

3-0-9.

Grading

Grades are assigned based on labs, quizzes, and the final project, as well as class participation. The grading breakdown is as follows:

Each student is allowed 5 "late days", each of which may be used to turn in a lab one day (a 24-hour period) later than it is due without penalty. After all 5 late days are used, assignments will be docked one letter grade for each day they are late.

Late days may not be used for the final project report submission.

Collaboration Policy

For labs, you are allowed to discuss your answers with other students, but please write up your own answers and list your collaborators. Copying solutions from other students is never allowed. For the group project you will work in teams and hand in only one written report. Note that we will use software to detect copying of lab and homework assignments.

Text

Readings from online sources will be assigned periodically, to be completed before class.

Last change: 2/23/2022.