CS365: Foundations of Data Science (Spring’23)

Info

  • Instructor
  • Teaching Fellow
  • Piazza website
  • Github

Prerequisites

Students taking this class must have taken:

  • CS 112
  • CS 131 (MA 293)
  • CS 132 (MA 242)
  • CS 237 (MA 581) or equivalent

This year the prerequisites will be strictly enforced. CS 330 is highly recommended but not a prereq.

Syllabus

Topics will include probability, information theory, linear algebra, calculus, Fourier analysis, and graph theory, with a strong focus on their applicability to analyzing datasets. Two lectures will also be devoted to data management, and more specifically the classic relational model, SQL, and Datalog. A detailed syllabus is available on Piazza.

Textbooks

There will be assigned readings from the following books, which are available online (click each title for the PDF):

  1. Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong
  2. Foundations of Data Science by Avrim Blum, John Hopcroft, and Ravi Kannan
  3. Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David
  4. Introduction to Probability for Data Science by Stanley Chan

Programming

The class assumes familiarity with programming. The recommended languages for this class are Python 3 and Julia; R and MATLAB are also good choices. Other languages (C, C++, Java, etc.) are welcome but not recommended for this class.

Lectures

Note: at the end of each lecture, you will find the assigned readings. Readings marked with a magnifying glass are mandatory; the rest is optional material if you are further interested and have the time to devote to it.

  • Lecture 1 (1/19): data visualization – introduction, class logistics, types of data, basics of data visualization
    Slides available here.
  • Lecture 2 (1/25): probability – review of prerequisite material and other basic concepts through problem solving
    Slides available here.
  • Lecture 3 (1/26): probability – convergence of random variables, Markov’s inequality
    Slides available here.
  • Lecture 4 (1/31): probability – weak law of large numbers, confidence intervals, and a randomized algorithm for estimating π (a sketch appears after this list)
    Slides available here.
  • Lecture 5 (2/2): probability, statistical inference – Central Limit theorem, Bayes’ rule
    Slides available here.
  • Lecture 6 (2/7): practice problems (blackboard)
  • Lecture 7 (2/9): probability, statistical inference, machine learning – Naive Bayes classifier (a sketch appears after this list)
    Slides available here.
  • Lecture 8 (2/14): probability, statistical inference, machine learning – Bayes classifier for denoising images
    Slides available here and Jupyter notebook (Python3) here.
  • Lecture 9 (2/16): probability, statistical inference – definition of moments, CLT for confidence intervals, midterm practice.
    Slides available here.
  • Midterm 2/23
  • Lecture 10 (2/28): probability, statistical inference – moment generating functions, Chernoff bounds, MLE, Method of Moments (MoM), EM (a Chernoff-bound check appears after this list)
    Slides available here.
  • Lecture 11 (3/2): probability, statistical inference, machine learning – moment generating functions, Chernoff bounds, MLE, Method of Moments (MoM), EM (cont.)
    Slides available here.
  • Lecture 12 (3/14): graphs, probability – G(n,p) model, degree sequence, asymptotic approximations
    Reading material: Theorem 8.3 and Corollary 8.4, Sections 8.1, 8.1.1, 8.1.2, 8.2 (disappearance of isolated vertices) from Foundations of Data Science by Avrim Blum, John Hopcroft, Ravi Kannan
  • Lecture 13 (3/17): graphs, probability – emergence of K4s (cliques on four vertices)
    Reading material: notes from last year
  • Lectures 14, 15, 16 (3/21, 23, 28): streaming algorithms, probability – streaming model, reservoir sampling, F0 estimation (min sketch, Flajolet-Martin, HyperLogLog), F1 estimation (Morris algorithm), Heavy Hitters (Count-Min sketch); a reservoir-sampling sketch appears after this list
    Slides available here.
  • Lecture 17 (3/30): streaming algorithms, hashing – basics of hash functions, universal hashing, k-wise independent hash functions, F2 estimation (AMS sketch; a toy version appears after this list)
    Slides available here.
    Set of notes on hashing by Jeff Erickson.
  • Lectures 18, 19 (4/4, 4/6): SVD – linear algebra review, SVD, PCA (dimensionality reduction, least squares, k-rank approximation); a PCA-via-SVD sketch appears after this list
    Handwritten notes here and Jupyter notebook here.
  • Lectures 20, 21 (4/11, 4/13): vector calculus – level curves, gradient, directional derivative, Hessian, chain rule, Taylor series
    Slides available here.
  • Lectures 22, 23, 24, 25 (4/18, 4/20, 4/25, 4/28): vector calculus (cont.), optimization (a gradient-descent sketch appears after this list)
    Slides available here.
    Handwritten notes from 4/20 here.
    Note on the sufficient second-order condition from 4/25 here.
  • Lecture 26 (5/2): Rel for data management (guest lecture from RelationalAI)
    Lecture material here (class code “CS365”)
    More Rel lessons available here
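
The short Python 3 sketches below illustrate a few of the algorithms listed above. They are unofficial illustrations we provide for convenience, not course material, and all function names, parameters, and seeds are our own choices. First, a minimal Monte Carlo estimator for π (Lecture 4) together with a CLT-based 95% confidence interval (Lecture 9); the sample size is arbitrary.

    import math
    import random

    def estimate_pi(n, seed=0):
        """Throw n uniform darts at the unit square; the fraction landing in
        the quarter disk x^2 + y^2 <= 1 estimates pi/4."""
        rng = random.Random(seed)
        hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))
        p_hat = hits / n  # estimates pi/4
        # CLT: p_hat is approximately Normal(p, p*(1-p)/n), so a 95% CI for pi
        # is 4 * (p_hat +/- 1.96 * sqrt(p_hat * (1 - p_hat) / n)).
        half_width = 4 * 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
        return 4 * p_hat, half_width

    est, hw = estimate_pi(1_000_000)
    print(f"pi is roughly {est:.4f} +/- {hw:.4f}")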
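
Next, a from-scratch Bernoulli Naive Bayes classifier in the spirit of Lectures 7–8. This is a minimal sketch assuming binary features and two classes; the random data at the bottom exists only to make the example runnable.

    import numpy as np

    def fit_bernoulli_nb(X, y, alpha=1.0):
        """X: (n, d) binary matrix, y: (n,) labels in {0, 1}. Returns, per
        class, the log-prior and Laplace-smoothed feature log-probabilities."""
        params = {}
        for c in (0, 1):
            Xc = X[y == c]
            theta = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
            params[c] = (np.log(len(Xc) / len(X)), np.log(theta), np.log1p(-theta))
        return params

    def predict(params, x):
        """Pick the class maximizing log P(c) + sum_j log P(x_j | c)."""
        scores = {c: prior + (x * log_t + (1 - x) * log_not_t).sum()
                  for c, (prior, log_t, log_not_t) in params.items()}
        return max(scores, key=scores.get)

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 10))
    y = rng.integers(0, 2, size=200)
    model = fit_bernoulli_nb(X, y)
    print(predict(model, X[0]), y[0])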
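
A quick numeric sanity check for the Chernoff bounds of Lectures 10–11: for X ~ Binomial(n, 1/2), the multiplicative form P(X >= (1 + delta) * mu) <= exp(-delta^2 * mu / 3) should dominate the simulated tail probability. The values of n, delta, and the trial count below are arbitrary.

    import math
    import random

    def chernoff_demo(n=200, delta=0.2, trials=5_000, seed=0):
        """Compare the simulated upper tail of Binomial(n, 1/2) with the
        multiplicative Chernoff bound exp(-delta^2 * mu / 3), mu = n/2."""
        rng = random.Random(seed)
        mu = n / 2
        threshold = (1 + delta) * mu
        tail = sum(sum(rng.random() < 0.5 for _ in range(n)) >= threshold
                   for _ in range(trials)) / trials
        return tail, math.exp(-delta ** 2 * mu / 3)

    tail, bound = chernoff_demo()
    print(f"simulated tail {tail:.4f} <= Chernoff bound {bound:.4f}")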
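
Reservoir sampling from Lectures 14–16: one pass over a stream of unknown length, O(k) memory, and every stream item ends up in the sample with equal probability. A minimal sketch; the integer stream is a stand-in for real data.

    import random

    def reservoir_sample(stream, k, seed=0):
        """Return a uniform random sample of k items from the stream."""
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                # Item i (0-indexed) replaces a random reservoir slot with
                # probability k/(i+1), which keeps every item seen so far
                # in the sample with probability exactly k/(i+1).
                j = rng.randrange(i + 1)
                if j < k:
                    reservoir[j] = item
        return reservoir

    print(reservoir_sample(range(10_000), k=5))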
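
A toy AMS sketch for F2 estimation (Lecture 17): each stream item gets a random ±1 sign per trial, and the average of the squared signed sums estimates F2 = Σᵢ fᵢ². Note that Python’s built-in tuple hash stands in for the 4-wise independent sign functions the analysis actually requires, so treat this purely as an illustration.

    import random

    def ams_f2(stream, trials=64, seed=0):
        """Estimate F2 = sum_i f_i^2 by averaging squared signed sums."""
        rng = random.Random(seed)
        salts = [rng.getrandbits(64) for _ in range(trials)]
        z = [0] * trials
        for item in stream:
            for t, salt in enumerate(salts):
                # Pseudo-random sign in {-1, +1} for (item, trial); a stand-in
                # for a 4-wise independent sign function.
                z[t] += 1 if hash((item, salt)) & 1 else -1
        return sum(v * v for v in z) / trials

    data = [i % 10 for i in range(1_000)]  # 10 distinct items, 100 copies each
    print(ams_f2(data))                    # true F2 = 10 * 100**2 = 100000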
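
PCA via the SVD as in Lectures 18–19: center the data, take the SVD, and project onto the top-k right singular vectors. A minimal numpy sketch on made-up anisotropic data.

    import numpy as np

    def pca(X, k):
        """Project the rows of X onto the top-k principal directions."""
        Xc = X - X.mean(axis=0)                       # center each feature
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T, Vt[:k]                  # scores, directions

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3)) @ np.diag([5.0, 1.0, 0.1])  # stretched cloud
    scores, directions = pca(X, k=2)
    print(scores.shape, directions.shape)  # (100, 2) (2, 3)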
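
Finally, plain gradient descent on a convex quadratic, matching the vector calculus and optimization lectures: f(x) = ½ xᵀAx − bᵀx has gradient Ax − b, so the iterates should approach the solution of Ax = b. The matrix, step size, and iteration count are arbitrary choices for this sketch.

    import numpy as np

    def gradient_descent(A, b, lr=0.1, steps=200):
        """Minimize f(x) = 0.5 x^T A x - b^T x; grad f(x) = A x - b."""
        x = np.zeros_like(b)
        for _ in range(steps):
            x -= lr * (A @ x - b)  # step against the gradient
        return x

    A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
    b = np.array([1.0, 1.0])
    print(gradient_descent(A, b))  # should approach...
    print(np.linalg.solve(A, b))   # ...the exact solution [0.2, 0.4]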

Assignments

Educational videos

  • Persi Diaconis discussing in greater detail slide 3 from Lecture 2.
  • A nice exposition of the central limit theorem; see Lecture 9.
  • Graham Cormode is one of the inventors of the Count-Min sketch; in this presentation he goes over the Count-Min sketch we learned about in Lecture 16.
  • Jelani Nelson discussing algorithms for streaming data, including the distinct-elements algorithm based on the minimum hash value that we saw in Lectures 14 and 15.
  • Mikkel Thorup on hashing; the presentation is largely based on his lecture notes on high-speed hashing for integers and strings.
  • An overview of RelationalAI.