## Info

#### Instructor

- **When**: Tue, Thu 5–6:30 pm
- **Where**: CAS B18
- **Prof**: Babis Tsourakakis
- **Email**: ctsourak@bu.edu
- **Office hours**: Tue 9:30–10:30 am (MCS 102) and Thu 9:30–10:30 am (Zoom)

#### Teaching Fellow

- **TF**: Mr. Ryan Yu
- **Email**: ryu1@bu.edu
- **Labs**: each Monday (attendance mandatory, attend your session!)
- **Office hours**: Mon 12:20–2:15 pm (PSY B53) and Wed 5:30–7:00 pm (Zoom)

Zoom links and passcodes are available on Piazza.

## Piazza website

CS365 Github

## Prerequisites

Students taking this class must have taken:

- CS 112
- CS 131 (MA293)
- CS 132 (MA242)
- CS 237 (MA581) or equivalent.

The consent of the instructor is necessary to take the class; otherwise, you will not receive a final grade. CS 330 is *highly* recommended but, unlike the courses above, is not a mandatory requirement.

## Syllabus

Topics will include probability, information theory, linear algebra, calculus, Fourier analysis, graph theory with a strong focus on their applicability for analyzing datasets. Finally, two lectures will be devoted to data management, and more specifically the classic relational model, SQL and Datalog. A detailed syllabus is available on Piazza, with the code to sign up on Gradescope.

## Textbooks

There is no need to buy a textbook. There will be assigned readings from the following books, which are available online (click for the PDF):

- Foundations of Data Science by Avrim Blum, John Hopcroft, Ravi Kannan
- Understanding Machine Learning: From theory to algorithms by Shai Shalev-Shwartz and Shai Ben-David
- Introduction to Probability for Data Science by Stanley Chan
- Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong.

## Programming

The class assumes familiarity with programming. The recommended languages for this class are Python 3 and Julia; R and MATLAB are also recommended. Other languages (C, C++, Java, etc.) are welcome, but not recommended for this class.

## Lectures

**Note**: at the end of each lecture, you will find the assigned readings. The readings marked with a magnifying glass are mandatory; the rest is optional material for those who are further interested and have the time to devote to it.

**Lecture 1 (1/20)**: data visualization – introduction, class logistics, types of data, basics of data visualization

Slides available here.

**Lecture 2 (1/25)**: probability I – review of prerequisite material, and other basic concepts through problem solving

Slides available here.

**Lecture 3 (1/27)**: probability II – convergence of random variables, probability inequalities, weak law of large numbers, confidence intervals

Slides available here.

**Lecture 4 (2/1)**: probability III – π estimation randomized algorithm, Central Limit Theorem, moment generating functions, Chernoff bounds
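Since Python 3 is the recommended language for the class, the π-estimation idea from this lecture can be sketched in a few lines (a minimal sketch; the function name and sample counts are illustrative, not taken from the lecture):

```python
import random

def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of π: sample points uniformly from the unit
    square and count the fraction that falls inside the quarter disk."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # P(point inside) = π/4, so 4 · (fraction inside) estimates π
    return 4.0 * inside / n_samples
```

By the Central Limit Theorem, the error of this estimator shrinks like O(1/√n), which connects it to the concentration tools covered in this lecture.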

Slides available here.

**Lecture 5 (2/3)**: probability IV, statistical inference I, machine learning I – Bayes’ rule, Naive Bayes classifier
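As a rough illustration of the Naive Bayes classifier, here is a sketch with Laplace smoothing over token counts (the toy data, function names, and smoothing choice are mine; the lecture's own derivation and notation may differ):

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class labels.
    Returns log-priors and Laplace-smoothed per-class token log-likelihoods."""
    classes = set(labels)
    vocab = {t for d in docs for t in d}
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d)
    log_like = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)  # +1 per vocab word
        log_like[c] = {t: math.log((counts[c][t] + 1) / total) for t in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Pick argmax_c [ log P(c) + Σ_t log P(t | c) ], skipping unseen tokens."""
    best, best_score = None, float("-inf")
    for c in log_prior:
        score = log_prior[c] + sum(log_like[c][t] for t in doc if t in log_like[c])
        if score > best_score:
            best, best_score = c, score
    return best
```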

Slides available here.

**Lecture 6 (2/8)**: probability V, statistical inference II, machine learning II – denoising images using Bayes’ rule

Slides available here.

**Lecture 7 (2/10)**: probability VI, statistical inference III – concentration of measure (cont.), sampling theorem

Slides available here.

**Lecture 8 (2/15)**: statistical inference IV – method of moments, MLE, Bayesian inference, MAP

Slides available here.

**Lecture 9 (2/17)**: statistical inference V – EM algorithm for parametric inference

Slides available here.

**Midterm 2/24**

**Lecture 10 (3/1)**: streaming algorithms I – streaming model, missing number puzzle, reservoir sampling, moment estimation problem
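Reservoir sampling, one of the topics of this lecture, can be sketched as follows (a minimal single-pass version; the function name is illustrative):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown
    length: item i (1-indexed) enters the reservoir with probability k/i."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(x)          # fill the reservoir first
        else:
            j = rng.randrange(i)         # uniform in [0, i)
            if j < k:                    # happens with probability k/i
                reservoir[j] = x
    return reservoir
```

A short induction shows every stream element ends up in the reservoir with probability exactly k/n, even though n is unknown in advance.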

Slides available here.

**Lecture 11 (3/3)**: streaming algorithms II – F1 estimation using Morris counters
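A basic Morris counter can be sketched as follows (a simplified single-counter version; the lecture may analyze an averaged or otherwise refined variant):

```python
import random

class MorrisCounter:
    """Approximate counter: store only X ≈ log2(n), increment X with
    probability 2^-X, and estimate the count as 2^X - 1."""
    def __init__(self, seed=0):
        self.x = 0
        self.rng = random.Random(seed)

    def increment(self):
        # Flip a coin with success probability 2^-X
        if self.rng.random() < 2.0 ** (-self.x):
            self.x += 1

    def estimate(self):
        return 2.0 ** self.x - 1
```

E[2^X − 1] = n, so the estimate is unbiased; averaging independent copies is the standard way to reduce its variance.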

Slides available here.

**Lecture 12 (3/15)**: streaming algorithms III – k-wise independence, F0, F2 estimation

Slides available here and here.

**Lecture 13 (3/17)**: dimensionality reduction I, machine learning III – distance functions, k-nearest neighbors classifier, Johnson-Lindenstrauss lemma
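A Gaussian random projection is one standard way to realize the Johnson-Lindenstrauss lemma in code (a sketch; the function name and parameter choices are illustrative):

```python
import numpy as np

def jl_project(X, k, seed=0):
    """Project the rows of X from d down to k dimensions with a Gaussian
    random matrix scaled by 1/sqrt(k). By the JL lemma, all pairwise
    distances are preserved up to (1 ± eps) with high probability when
    k = O(log n / eps^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.standard_normal((d, k)) / np.sqrt(k)
    return X @ R
```

Note the projection is data-oblivious: R is drawn without looking at X, which is what makes the guarantee hold for any point set.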

Slides available here.

**Lecture 14 (3/22)**: linear algebra I – vector space, subspace, linear mapping, linear independence, basis, basics of matrices (whiteboard lecture)

Prerequisite CS132 material here.

**Lecture 15 (3/24)**: linear algebra II – projections onto subspaces, least squares, eigenvalue decomposition, spectral properties of real symmetric matrices (whiteboard lecture)

Prerequisite CS132 material here.

**Lecture 16 (3/29)**: dimensionality reduction II, matrix decompositions I – singular value decomposition (SVD)

Slides available here.

**Lecture 17 (3/31)**: dimensionality reduction III, matrix decompositions II – the math of singular value decomposition (SVD) and principal component analysis (PCA) (whiteboard lecture)

**Lecture 18 (4/5)**: dimensionality reduction IV, matrix decompositions III – PCA for dimensionality reduction
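PCA for dimensionality reduction can be sketched via the SVD from Lecture 16 (a minimal sketch; the function name and return values are illustrative choices, not the lecture's API):

```python
import numpy as np

def pca(X, k):
    """PCA via SVD: center the data, take the top-k right singular vectors
    as principal directions, and project the data onto them."""
    Xc = X - X.mean(axis=0)                      # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                          # principal directions, (k, d)
    scores = Xc @ components.T                   # projected data, (n, k)
    explained = S[:k] ** 2 / np.sum(S ** 2)      # fraction of variance kept
    return scores, components, explained
```

Working on the centered matrix via SVD avoids forming the covariance matrix explicitly, which is numerically preferable.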

Python code here and a demo here.

**Lecture 19 (4/7)**: graphs I – G(n,p) model, community detection using spectral clustering

Python code here and notes for the emergence of K4s here.

**Lecture 20 (4/12)**: graphs II – community detection using spectral clustering, Markov chains (intro)

Python code here.

**Lecture 21 (4/14)**: graphs III – Markov chains (cont.), PageRank
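PageRank can be sketched as power iteration on the teleporting Markov chain (adjacency-list input; the damping factor 0.85 is the conventional choice and the function name is illustrative):

```python
import numpy as np

def pagerank(adj, alpha=0.85, tol=1e-10):
    """Power iteration on the PageRank chain: follow a random out-link
    with probability alpha, teleport to a uniform node otherwise."""
    n = len(adj)
    P = np.zeros((n, n))
    for i, neighbors in enumerate(adj):
        if neighbors:
            P[i, neighbors] = 1.0 / len(neighbors)
        else:
            P[i, :] = 1.0 / n                # dangling node: teleport everywhere
    G = alpha * P + (1 - alpha) / n          # "Google matrix"
    r = np.full(n, 1.0 / n)
    while True:
        r_next = r @ G
        if np.abs(r_next - r).sum() < tol:   # L1 convergence check
            return r_next
        r = r_next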

Python code here.**Lecture 22 (4/19)**: vector calculus I – level curves, gradient, directional derivative, Hessian, chain rule

Slides available here.**Lecture 23 (4/21)**: vector calculus II – matrix calculus

Slides available here.**Lecture 24 (4/26)**: vector calculus III optimization I – Taylor series, formulating problems, minimization, Weierstrass theorem

Slides available here.**Lecture 25 (4/28)**:vector calculus IV optimization II – 1st and 2nd order necessary conditions, gradient descent

Slides available here.**Lecture 26 (5/3)**:vector calculus V optimization III – gradient descent, convexity, Lagrange multipliers, duality

Slides available here.