CS365: Foundations of Data Science (Spring’24)

Info

Instructor

Teaching Fellow

Github

Prerequisites

Students taking this class must have taken:

  • CS 112
  • CS 131 (MA293)
  • CS 132 (MA242) 
  • and CS 237 (MA581) or equivalent.

This year, the prerequisites will be strictly enforced. CS 330 is highly recommended but is not a prerequisite.

Syllabus

Topics will include probability, information theory, linear algebra, calculus, Fourier analysis, and graph theory, with a strong focus on their applicability to analyzing datasets. Finally, two lectures will be devoted to data management, specifically the classic relational model, SQL, and Datalog. A detailed syllabus is available on Piazza.

Textbooks

There will be assigned readings from the following books, which are available online (click a title for the PDF):

  1. Machine Learning: A Probabilistic Perspective [M] by Kevin Murphy
  2. Mathematics for Machine Learning [DFO] by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong
  3. Foundations of Data Science [BHK] by Avrim Blum, John Hopcroft, and Ravi Kannan
  4. Understanding Machine Learning: From Theory to Algorithms [SD] by Shai Shalev-Shwartz and Shai Ben-David
  5. Introduction to Probability for Data Science [SC] by Stanley Chan

Programming

The class assumes familiarity with programming. The recommended languages for this class are Python 3 and Julia. R, Mathematica, and Matlab are also recommended. Other languages (C, C++, Java, etc.) are welcome but are not recommended for this class.

Lectures

Note: the assigned readings are listed at the end of each lecture. Readings marked with a magnifying glass are mandatory. The rest are optional material for those who are interested and have the time to devote.

Part I: Core Concepts in Data Science (Probability, Linear Algebra, Optimization)

  • Introduction (1/18)
    Slides available here
  • PART 1A: Probability and Statistics
    Slides available here and Julia notebook here
    Readings: [SC] Chapters 1-5 and [M] Chapter 2
  • PART 1B: Linear Algebra, SVD, PCA
  • PART 1C: Vector Calculus and Optimization
    Slides available here
    Readings: [DFO] Chapters 5 and 7

Part II: Data Science in Action

  • Topic 1: Data streams
    Slides available here
  • Topic 2: Dimensionality reduction
    • Singular Value Decomposition (SVD)
    • Principal Component Analysis (PCA)
      Handwritten notes here and Jupyter notebook here
      Slides from a tutorial here
      Readings
      • [BHK] 3.1-3.5, 3.9.1, 3.9.2
    • Johnson-Lindenstrauss theorem, Random Projections
      Jupyter notebook here
  • Topic 3: EM Algorithm
    Slides available here
    Readings
  • Topic 8: Image denoising using Bayes classifier
    Slides available here and Python code here