CS365: Foundations of Data Science (Spring’23)

Info

  • Instructor
  • Teaching Fellow
  • Piazza website
  • Github

Prerequisites

Students taking this class must have taken:

  • CS 112
  • CS 131 (MA 293)
  • CS 132 (MA 242)
  • CS 237 (MA 581) or equivalent

This year the prerequisites will be strictly enforced. CS 330 is highly recommended but not a prereq.

Syllabus

Topics will include probability, information theory, linear algebra, calculus, Fourier analysis, and graph theory, with a strong focus on their applicability to analyzing datasets. Two lectures will also be devoted to data management, and more specifically the classic relational model, SQL, and Datalog. A detailed syllabus is available on Piazza.

Textbooks

There will be assigned readings from the following books, which are available online (click each title for the PDF):

  1. Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong
  2. Foundations of Data Science by Avrim Blum, John Hopcroft, and Ravi Kannan
  3. Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David
  4. Introduction to Probability for Data Science by Stanley Chan

Programming

The class assumes familiarity with programming. The recommended languages for this class are Python 3 and Julia; R and MATLAB are also good choices. Other languages (C, C++, Java, etc.) are welcome but not recommended for this class.

Lectures

Note: at the end of each lecture, you will find the assigned readings. Readings marked with a magnifying glass are mandatory; the rest is optional material if you are further interested and have the time to devote to it.

  • Lecture 1 (1/19): data visualization – introduction, class logistics, types of data, basics of data visualization
    Slides available here.
  • Lecture 2 (1/25): probability – review of prerequisite material and other basic concepts through problem solving
    Slides available here.
  • Lecture 3 (1/26): probability – convergence of random variables, Markov’s inequality
    Slides available here.
  • Lecture 4 (1/31): probability – weak law of large numbers, confidence intervals, and a randomized algorithm for estimating π (a sketch appears after this list)
    Slides available here.
  • Lecture 5 (2/2): probability, statistical inference – Central Limit theorem, Bayes’ rule
    Slides available here.
  • Lecture 6 (2/7): practice problems (blackboard)
  • Lecture 7 (2/9): probability, statistical inference, machine learning – Naive Bayes classifier (a sketch appears after this list)
    Slides available here.
  • Lecture 8 (2/14): probability, statistical inference, machine learning – Bayes classifier for denoising images
    Slides available here and Jupyter notebook (Python3) here.
  • Lecture 9 (2/16): probability, statistical inference – definition of moments, CLT for confidence intervals, midterm practice.
    Slides available here.
  • Midterm 2/23
  • Lecture 10 (2/28): probability, statistical inference – moment generating functions, Chernoff bounds, MLE, Method of Moments (MoM), EM (a Chernoff-bound check appears after this list)
    Slides available here.
  • Lecture 11 (3/2): probability, statistical inference, machine learning – moment generating functions, Chernoff bounds, MLE, Method of Moments (MoM), EM (cont.)
    Slides available here.
  • Lecture 12 (3/14): graphs, probability – G(n,p) model, degree sequence, asymptotic approximations
    Reading material: Theorem 8.3 and Corollary 8.4, Sections 8.1, 8.1.1, 8.1.2, 8.2 (disappearance of isolated vertices) from Foundations of Data Science by Avrim Blum, John Hopcroft, Ravi Kannan
  • Lecture 13 (3/17): graphs, probability – emergence of K4s (cliques on four vertices)
    Reading material: notes from last year
  • Lectures 14, 15, 16 (3/21, 23, 28): streaming algorithms, probability – streaming model, reservoir sampling, F0 estimation (min sketch, Flajolet-Martin, HyperLogLog), F1 estimation (Morris algorithm), Heavy Hitters (Count-Min sketch); a reservoir-sampling sketch appears after this list
    Slides available here.
  • Lecture 17 (3/30): streaming algorithms, hashing – basics of hash functions, universal hashing, k-wise independent hash functions, F2 estimation (AMS sketch; a toy version appears after this list)
    Slides available here.
    Set of notes on hashing by Jeff Erickson.
  • Lectures 18, 19 (4/4, 4/6): SVD – linear algebra review, SVD, PCA (dimensionality reduction, least squares, k-rank approximation); a PCA-via-SVD sketch appears after this list
    Handwritten notes here and Jupyter notebook here.
  • Lectures 20, 21 (4/11, 4/13): vector calculus – level curves, gradient, directional derivative, Hessian, chain rule, Taylor series
    Slides available here.
  • Lectures 22, 23, 24, 25 (4/18, 4/20, 4/25, 4/28): vector calculus (cont.), optimization (a gradient-descent sketch appears after this list)
    Slides available here.
    Handwritten notes from 4/20 here.
    Note on the sufficient second-order condition from 4/25 here.
  • Lecture 26 (5/2): Rel for data management (guest lecture from RelationalAI)
    Lecture material here (class code “CS365”)
    More Rel lessons available here
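
The short Python 3 sketches below illustrate a few of the algorithms listed above. They are unofficial illustrations we provide for convenience, not course material, and all function names, parameters, and seeds are our own choices. First, a minimal Monte Carlo estimator for π (Lecture 4) together with a CLT-based 95% confidence interval (Lecture 9); the sample size is arbitrary.

    import math
    import random

    def estimate_pi(n, seed=0):
        """Throw n uniform darts at the unit square; the fraction landing in
        the quarter disk x^2 + y^2 <= 1 estimates pi/4."""
        rng = random.Random(seed)
        hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))
        p_hat = hits / n  # estimates pi/4
        # CLT: p_hat is approximately Normal(p, p*(1-p)/n), so a 95% CI for pi
        # is 4 * (p_hat +/- 1.96 * sqrt(p_hat * (1 - p_hat) / n)).
        half_width = 4 * 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
        return 4 * p_hat, half_width

    est, hw = estimate_pi(1_000_000)
    print(f"pi is roughly {est:.4f} +/- {hw:.4f}")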
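
Next, a from-scratch Bernoulli Naive Bayes classifier in the spirit of Lectures 7–8. This is a minimal sketch assuming binary features and two classes; the random data at the bottom exists only to make the example runnable.

    import numpy as np

    def fit_bernoulli_nb(X, y, alpha=1.0):
        """X: (n, d) binary matrix, y: (n,) labels in {0, 1}. Returns, per
        class, the log-prior and Laplace-smoothed feature log-probabilities."""
        params = {}
        for c in (0, 1):
            Xc = X[y == c]
            theta = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
            params[c] = (np.log(len(Xc) / len(X)), np.log(theta), np.log1p(-theta))
        return params

    def predict(params, x):
        """Pick the class maximizing log P(c) + sum_j log P(x_j | c)."""
        scores = {c: prior + (x * log_t + (1 - x) * log_not_t).sum()
                  for c, (prior, log_t, log_not_t) in params.items()}
        return max(scores, key=scores.get)

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 10))
    y = rng.integers(0, 2, size=200)
    model = fit_bernoulli_nb(X, y)
    print(predict(model, X[0]), y[0])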
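
A quick numeric sanity check for the Chernoff bounds of Lectures 10–11: for X ~ Binomial(n, 1/2), the multiplicative form P(X >= (1 + delta) * mu) <= exp(-delta^2 * mu / 3) should dominate the simulated tail probability. The values of n, delta, and the trial count below are arbitrary.

    import math
    import random

    def chernoff_demo(n=200, delta=0.2, trials=5_000, seed=0):
        """Compare the simulated upper tail of Binomial(n, 1/2) with the
        multiplicative Chernoff bound exp(-delta^2 * mu / 3), mu = n/2."""
        rng = random.Random(seed)
        mu = n / 2
        threshold = (1 + delta) * mu
        tail = sum(sum(rng.random() < 0.5 for _ in range(n)) >= threshold
                   for _ in range(trials)) / trials
        return tail, math.exp(-delta ** 2 * mu / 3)

    tail, bound = chernoff_demo()
    print(f"simulated tail {tail:.4f} <= Chernoff bound {bound:.4f}")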
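
Reservoir sampling from Lectures 14–16: one pass over a stream of unknown length, O(k) memory, and every stream item ends up in the sample with equal probability. A minimal sketch; the integer stream is a stand-in for real data.

    import random

    def reservoir_sample(stream, k, seed=0):
        """Return a uniform random sample of k items from the stream."""
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                # Item i (0-indexed) replaces a random reservoir slot with
                # probability k/(i+1), which keeps every item seen so far
                # in the sample with probability exactly k/(i+1).
                j = rng.randrange(i + 1)
                if j < k:
                    reservoir[j] = item
        return reservoir

    print(reservoir_sample(range(10_000), k=5))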
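
A toy AMS sketch for F2 estimation (Lecture 17): each stream item gets a random ±1 sign per trial, and the average of the squared signed sums estimates F2 = Σᵢ fᵢ². Note that Python’s built-in tuple hash stands in for the 4-wise independent sign functions the analysis actually requires, so treat this purely as an illustration.

    import random

    def ams_f2(stream, trials=64, seed=0):
        """Estimate F2 = sum_i f_i^2 by averaging squared signed sums."""
        rng = random.Random(seed)
        salts = [rng.getrandbits(64) for _ in range(trials)]
        z = [0] * trials
        for item in stream:
            for t, salt in enumerate(salts):
                # Pseudo-random sign in {-1, +1} for (item, trial); a stand-in
                # for a 4-wise independent sign function.
                z[t] += 1 if hash((item, salt)) & 1 else -1
        return sum(v * v for v in z) / trials

    data = [i % 10 for i in range(1_000)]  # 10 distinct items, 100 copies each
    print(ams_f2(data))                    # true F2 = 10 * 100**2 = 100000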
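
PCA via the SVD as in Lectures 18–19: center the data, take the SVD, and project onto the top-k right singular vectors. A minimal numpy sketch on made-up anisotropic data.

    import numpy as np

    def pca(X, k):
        """Project the rows of X onto the top-k principal directions."""
        Xc = X - X.mean(axis=0)                       # center each feature
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T, Vt[:k]                  # scores, directions

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3)) @ np.diag([5.0, 1.0, 0.1])  # stretched cloud
    scores, directions = pca(X, k=2)
    print(scores.shape, directions.shape)  # (100, 2) (2, 3)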
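
Finally, plain gradient descent on a convex quadratic, matching the vector calculus and optimization lectures: f(x) = ½ xᵀAx − bᵀx has gradient Ax − b, so the iterates should approach the solution of Ax = b. The matrix, step size, and iteration count are arbitrary choices for this sketch.

    import numpy as np

    def gradient_descent(A, b, lr=0.1, steps=200):
        """Minimize f(x) = 0.5 x^T A x - b^T x; grad f(x) = A x - b."""
        x = np.zeros_like(b)
        for _ in range(steps):
            x -= lr * (A @ x - b)  # step against the gradient
        return x

    A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
    b = np.array([1.0, 1.0])
    print(gradient_descent(A, b))  # should approach...
    print(np.linalg.solve(A, b))   # ...the exact solution [0.2, 0.4]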

Assignments

Educational videos

  • Persi Diaconis discussing in greater detail slide 3 from Lecture 2.
  • A nice exposition of the central limit theorem; see Lecture 9.
  • Graham Cormode is one of the inventors of the Count-Min sketch; in this presentation he goes over the Count-Min sketch we learned about in Lecture 16.
  • Jelani Nelson discussing algorithms for streaming data, including the distinct-elements algorithm based on the minimum hash value that we saw in Lectures 14 and 15.
  • Mikkel Thorup on hashing; the presentation is largely based on his lecture notes on high-speed hashing for integers and strings.
  • An overview of RelationalAI.