## Info

#### Instructor

- **When**: Tue, Thu 5:00-6:15 pm
- **Where**: CAS-211
- **Prof**: Babis Tsourakakis
- **Email**: ctsourak@bu.edu
- **Office hours** (CDS 912): Tue and Thu 6:30-7:30 pm

#### Teaching Fellow

- **TF**: Mr. Tiany Chen
- **Email**: ctony@bu.edu
- **Labs**: schedule
- **Office hours** (CDS 362): Wed 2:00-3:30 pm, Fri 10:00-11:30 am

## Piazza website

## Github

## Prerequisites

Students taking this class must have taken:

- CS 112
- CS 131 (MA293)
- CS 132 (MA242)
- and CS 237 (MA581) or equivalent.

This year the prerequisites will be strictly enforced. CS 330 is *highly* recommended but is not a prerequisite.

## Syllabus

Topics will include probability, information theory, linear algebra, calculus, Fourier analysis, and graph theory, with a strong focus on their applicability to analyzing datasets. Finally, two lectures will be devoted to data management, specifically the classic relational model, SQL, and Datalog. A detailed syllabus is available on Piazza.

## Textbooks

There will be assigned readings from the following books, all of which are available online as PDFs:

- Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong.
- Foundations of Data Science by Avrim Blum, John Hopcroft, Ravi Kannan
- Understanding Machine Learning: From theory to algorithms by Shai Shalev-Shwartz and Shai Ben-David
- Introduction to Probability for Data Science by Stanley Chan

## Programming

The class assumes familiarity with programming. The recommended languages for this class are Python 3 and Julia; R and MATLAB are also good choices. Other languages (C, C++, Java, etc.) are welcome but not recommended for this class.

## Lectures

**Note**: the assigned readings are listed at the end of each lecture. Readings marked with a magnifying glass are mandatory; the rest are optional material for those who are interested and have the time.

**Lecture 1 (1/19)**: **data visualization** – introduction, class logistics, types of data, basics of data visualization

Slides available here.

**Lecture 2 (1/25)**: **probability** – review of prerequisite material and other basic concepts through problem solving

Slides available here.

**Lecture 3 (1/26)**: **probability** – convergence of random variables, Markov’s inequality

Slides available here.

**Lecture 4 (1/31)**: **probability** – weak law of large numbers, confidence intervals, randomized algorithm for estimating π

Slides available here.

**Lecture 5 (2/2)**: **probability**, **statistical inference** – central limit theorem, Bayes’ rule

Slides available here.

**Lecture 6 (2/7)**: practice problems (blackboard)

**Lecture 7 (2/9)**: **probability**, **statistical inference**, **machine learning** – Naive Bayes classifier

Slides available here.

**Lecture 8 (2/14)**: **probability**, **statistical inference**, **machine learning** – Bayes classifier for denoising images

Slides available here and Jupyter notebook (Python 3) here.

**Lecture 9 (2/16)**: **probability**, **statistical inference** – definition of moments, CLT for confidence intervals, midterm practice

Slides available here.

**Midterm 2/23**

**Lecture 10 (2/28)**: **probability**, **statistical inference** – moment generating functions, Chernoff bounds, MLE, Method of Moments (MoM), EM

Slides available here.

**Lecture 11 (3/2)**: **probability**, **statistical inference**, **machine learning** – moment generating functions, Chernoff bounds, MLE, Method of Moments (MoM), EM

Slides available here.

**Lecture 12 (3/14)**: **graphs**, **probability** – G(n,p) model, degree sequence, asymptotic approximations

Reading material: Theorem 8.3 and Corollary 8.4, Sections 8.1, 8.1.1, 8.1.2, 8.2 (disappearance of isolated vertices) from Foundations of Data Science by Avrim Blum, John Hopcroft, Ravi Kannan

**Lecture 13 (3/17)**: **graphs**, **probability** – emergence of K4s

Reading material: notes from last year

**Lectures 14, 15, 16 (3/21, 3/23, 3/28)**: **streaming algorithms**, **probability** – streaming model, reservoir sampling, F0 estimation (min sketch, Flajolet-Martin, HyperLogLog), F1 estimation (Morris algorithm), heavy hitters (Count-Min sketch)

Slides available here.

**Lecture 17 (3/30)**: **streaming algorithms**, **hashing** – basics of hash functions, universal hashing, k-wise independent hash functions, F2 estimation (AMS sketch)

Slides available here.

Set of notes on hashing by Jeff Erickson

**Lectures 18, 19 (4/4, 4/6)**: **SVD** – linear algebra review, SVD, PCA (dimensionality reduction, least squares, rank-k approximation)

Handwritten notes here and Jupyter notebook here.

**Lectures 20, 21 (4/11, 4/13)**: **vector calculus** – level curves, gradient, directional derivative, Hessian, chain rule, Taylor series

Slides available here.

**Lectures 22, 23, 24, 25 (4/18, 4/20, 4/25, 4/28)**: **vector calculus** (cont.), **optimization**

Slides available here.

Handwritten notes from 4/20 here.

Note on the second-order sufficient condition from 4/25 here.

**Lecture 26 (2/5)**: Rel for data management (guest lecture from RelationalAI)

Lecture material here (class code “CS365”).

More Rel lessons available here.

## Assignments

- Homework 1 (due 2/3)
- Homework 2 (due 2/10)
- Homework 3 (due 2/17)
- Midterm practice guide (on Piazza)
- Homework 4 (due 3/17)
- Homework 5 (due 3/24)
- Homework 6 (due 3/31)
- Homework 7 (due 4/7)
- Homework 8 (due 4/14)
- Homework 9 (due 4/25)
- Final practice problems (on Piazza)