CS365: Foundations of Data Science (Spring’24)

Info

Instructor

When: Tue, Thu 5pm-6.15pm
Where: CAS-313
Prof: Babis Tsourakakis
Email: ctsourak@bu.edu
Office hours (CDS 912): Tue 11-noon, Thu 10.30-11.30am

Teaching Fellow

TF: Mr. Tiany Chen
Email: ctony@bu.edu
Labs: schedule
Office hours (CDS 908): Mon 4-5.30pm, Thu 3-4.30pm

GitHub

Prerequisites

Students taking this class must have taken:

CS 112
CS 131 (MA293)
CS 132 (MA242)
and CS 237 (MA581) or equivalent.

This year the prerequisites will be strictly enforced. CS 330 is highly recommended but not a prereq.

Syllabus

Topics will include probability, information theory, linear algebra, calculus, Fourier analysis, and graph theory, with a strong focus on their applicability to analyzing datasets. Finally, two lectures will be devoted to data management, specifically the classic relational model, SQL, and Datalog. A detailed syllabus is available on Piazza.

Textbooks

There will be assigned readings from the following books, all of which are available online (click for the pdf):

Machine Learning: A Probabilistic Perspective [M] by Kevin Murphy
Mathematics for Machine Learning [DFO] by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong
Foundations of Data Science [BHK] by Avrim Blum, John Hopcroft, and Ravi Kannan
Understanding Machine Learning: From Theory to Algorithms [SD] by Shai Shalev-Shwartz and Shai Ben-David
Introduction to Probability for Data Science [SC] by Stanley Chan

Programming

The class assumes familiarity with programming. The recommended languages for this class are Python 3 and Julia; R, Mathematica, and Matlab are also recommended. Other languages (C, C++, Java, etc.) are welcome but not recommended for this class.

Lectures

Note: the assigned readings are listed at the end of each lecture. Readings marked with a magnifying glass are mandatory. The rest are optional material for those who are interested and have the time to devote.

Part I: Core Concepts in Data Science (Probability, Linear Algebra, Optimization)

Introduction (1/18)
Slides available here
PART 1A: Probability and Statistics
Slides available here and Julia notebook here
Readings: [SC] Chapters 1-5 and [M] Chapter 2

cs365-probability-review-with-julia Download

PART 1B: Linear Algebra, SVD, PCA

part2b_linalg Download

PART 1C: Vector Calculus and Optimization
Slides are available here
Readings: [DFO] Chapters 5 and 7

cs365-opt Download

note-opt-cs365 Download

Part II: Data Science in Action

Topic 1: Data streams
Slides available here
Topic 2: Dimensionality reduction
- Singular Value Decomposition (SVD)
- Principal Component Analysis (PCA)
  Handwritten notes here and Jupyter notebook here
  Slides from a tutorial here
  Readings
  - [BHK] 3.1-3.5, 3.9.1, 3.9.2
- Johnson-Lindenstrauss theorem, Random Projections
  Jupyter notebook here
Topic 3: EM Algorithm
Slides available here
Readings
- What is the expectation maximization algorithm?
- Optional reading: mixtures of Gaussians (Andrew Ng’s notes)

Topic 4: Markov Chains
Slides available here
Readings
- [BHK] 4.1, 4.8
Topic 5: Time Series
Slides available here
Topic 6: What is learning? The Perceptron algorithm
Slides from CMU available here
Readings
- [SD] Chapters 2,3 and 9.1.2
Topic 7: Unsupervised learning
Readings
- [BHK] Chapter 7
- k-means demo and a YouTube video
- Spectral graph theory and its applications
- Densest subgraph problem tutorial

Topic 8: Image denoising using a Bayes classifier
Slides available here and Python code here