Fall 2019

This is the website for the Fall 2019 iteration of this course. For the current iteration, please click here.

Overview

This class will survey techniques and systems for ingesting, efficiently processing, analyzing, and visualizing large data sets. Topics will include data cleaning, data integration, scalable systems (relational databases, NoSQL, Spark, etc.), analytics (data cubes, scalable statistics and machine learning), fundamental statistics and machine learning, and scalable visualization of large data sets. The goal of the class is for students to gain working experience with the topics covered, alongside in-depth discussion of them. Students should have a background in programming and algorithms. There will be a semester-long project and paper, and hands-on labs designed to give experience with state-of-the-art data processing tools.

There will be a semester-long project and about 10 labs.

Enrollment may be limited.

The course web site is http://dsg.csail.mit.edu/6.S080/.

Lectures

Lectures are held twice a week, on Mondays and Wednesdays from 2:30 to 4:00 in E25-111. Attendance at lectures is mandatory, and you are expected to show up prepared to answer questions and participate in discussion.

Topics Covered

What is Data Science?
Data Representation and Basic Operations
Common Tools for Data Science
Data Wrangling
Data Integration
Data Cleaning
Machine Learning Overview
Data Mining
Visualization
Dashboards / Declarative Visualizations
Data Tools Ecosystem
Database Systems
Data Warehousing
Data Cubes
Performance Tuning
Data Lakes (Storage, column-oriented storage)
Scalable Data Processing (Hadoop, Spark)
NoSQL
Key Value Stores
Document Stores
Approximate Query Processing
Streaming Systems
Graph Systems

Risk Factors

The course includes a semester-long project and paper.

Prerequisites

Students should have taken 6.00 or 6.0001 (Introduction to Computer Science and Programming in Python) as well as 6.006 (Introduction to Algorithms), or equivalent. If you do not have experience in these subjects and would like to take the course, please email the instructor. Prior database experience is not required. Python programming experience is assumed; familiarity with basic Linux command-line functionality is a benefit.

Units

3-0-9.

Grading

Grades are assigned based on labs, class participation, and the final project. The grading breakdown is as follows:

Each student is allowed 5 "late days", each of which may be used to turn in one problem set or lab one day (one 24-hour period) later than it is due without penalty. After all five late days are used, assignments will be docked one letter grade for each day they are late.

Late days may not be used for the final project. Regardless of late days, problem sets must be handed in before problem set solutions are posted.

Collaboration Policy

For labs, you are allowed to discuss your answers with other students, but please write up your own answers and list your collaborators. Copying solutions from other students is never allowed. For the group project you will work in teams and hand in only one written report. Note that we will use software to detect copying of lab and homework assignments.

Text

Readings from online sources will be periodically assigned before classes.

Last change: 8/26/2019.