CS 591 Data Analytics

CS 591 Data Analytics: Theory and Applications

Instructor: Charalampos E. Tsourakakis
Semester: Spring 2017
When: MW 2.30-3.45
Where: MCS B25

Course description

Our growing ability to measure and record almost everything creates unprecedented opportunities for optimizing our lives. For example, consider the following potential application in healthcare: can we design software that uses cell phone cameras to detect early on deadly skin diseases? Developing such applications requires –from a software perspective– multiple skills, including mathematical, algorithmic, and statistical skills. This course aims to introduce students to foundational techniques used in mining large-scale datasets: searching large volumes of data, analyzing streams of data, mining networks, and using effectively machine learning in applications. These techniques will illustrate the two prominent ways to design software for analyzing data: traditional CS techniques where explicit instructions are coded, and creating software systems that learn to perform tasks by being shown examples of desired input and output patterns.

Course overview

We will cover topics related to the following fundamental questions that one faces when dealing with real-world datasets. (a) How do we perform efficiently searches over large volumes of data? (b) How do we obtain the important properties of a high dimensional dataset via dimensionality reduction? (c) How do we deal with streams of data? (d) How do we mine data that come in the form of graphs? (e) Machine learning techniques.

Specifically, we will go over the following topics: (a) hashing, Bloom filters, LSH, (b) SVD, dimensionality reduction, (c) Misra Gries, distinct elements, AMS sketches, Count-Min sketch, (d) spectral partitioning, dense subgraphs, (e) linear regression, logistic regression, perceptrons, and feedforward neural networks.

Course prerequisites

Basic linear-algebra, calculus, probability, programming, data structures and algorithms.

Announcements

  1. We will cover the missed lecture due to President’s day (220) by appending half an hour to each lecture on 227 and 3/1. On those days, I will hold office hours from 4.15 to 5pm.
  2. Project is out. Start early! (2/10)
  3. First day of class is Monday 1/23. See you all there!

Grading

Each registered student will be required to scribe one lecture (10%), complete the class project (70%), and write a final exam (20%).

Auditing

Students that are not registered and want to audit, will have to solve two problems from the project, and write a report on a paper of their choice.

Class Project

The project will become available at the end of the third week (Week 3) here. It will consist of two parts. The first part will be a set of problems testing the class material, both from a mathematical and a coding standpoint. For the second part, you are welcome to work on topics that either extend the class material, or on a data analytics problem of your choice. Collaborations are welcome. In case you want to work on a purely theoretical topic, please coordinate with me first. Depending on the number of projects, there will be a 10 to 15min presentation of your project in class.

Scribes

Each student is required to scribe at least once. You can use this LaTex template.

Google group

Discussion among students is strongly encouraged as long as it complies with BU’s academic conduct code. For those interested in participating in an online discussion about the class, please make a request here. Make sure to use your BU email address.

Suggested textbooks

You don’t need to buy a textbook for the class. Some great textbooks I suggest are the following:

  1. Machine Learning: A Probabilistic Perspective, by Kevin P. Murphy
  2. Information Theory, Inference, and Learning Algorithms by David Mackay
  3. Mining Massive Datasets by Leskovec, Rajaraman, and Ullman.
  4. Understanding Machine Learning by Shai Shalev-Shwartz and Shai Ben-David

Office hours

I will hold office hours on each Monday and Wednesday from 3.45 to 4.30. My office is located in MCS, room 292. You are strongly encouraged to attend. In case you this time does not work well for you, please let me know.

Class material

Lecture Topic Readings Additional material
1/23 Introduction
Python notebook (by Ben Lawson)
Exact MAP for Binary Images
Max-flow min-cut theorem
The unreasonable effectiveness of data
Kleinberg-Tardos (Metric Labeling and MRFs)
Scene completion
1/25 Probability and information theory recap Chapter 2 (Mackay)
Visual Information Theory
Bayesian integration in sensorimotor learning
Bayesian reasoning implicated in some mental disorders
1/30 Concentration, Balls into Bins Birthday problem
Balls into Bins
Balanced Allocations
2/1 Hashing I Rasmus Pagh’s slides
Understanding hash functions (by Geoff Pike)
dict_example.py
DoSdjbx33a.cc
Hashing demo (Google)
2/6 Hashing and Data Streams An Optimal Algorithm for the Distinct Elements Problem Sketch Library
2/8 Hashing II Minimal Perfect Hashing
Practice problems
Cuckoo Hashing for Undergraduates
2/13 Bloom Filters Network Applications of Bloom Filters: A Survey
Python Bloom Filter
Demo
Less Hashing, Same Performance
How to write a Bloom filter in C
2/15 Streams and Puzzles Reservoir sampling
Majority element demo
Data Streams: Algorithms and Applications (Sections 1.1, 1.2)
2/20 Probabilistic Counting
Python code
Jelani’s notes I Morris’ original paper
2/27 AMS sketch

Project ideas
Piotr Indyk's slides

Chandra Chekuri’s notes AMS original paper
3/1 Count-min sketch Notes
C code (G. Cormode)
Count-Min Sketch paper
count-min sketch and its applications
3/6 Spring break
3/8 Spring break
3/13 Min-wise hashes
LSH (part 1, pptx)
Brief survey on Minhash by Edith Cohen
Sections 3.1,3.2,3.3
Yahoo Data Sketches
3/15 LSH (part 1, pptx)
LSH (part 2, pptx)
Min-Wise Independent Permutations
LSH Algorithm and Implementation
Python code
3/20 Nearest Neighbor Search I
(Hamming distance)
Wikipedia article
CACM Survey
Nearest Neighbors and Similarity Search
3/22 Nearest Neighbor Search II (SimHash) SimHash FAst Lookups of Cosine and Other Nearest Neighbors
3/27 Graphs and Networks I:
Models
Random graph models of social networks Introduction to Random Graphs
3/29 Graphs and Networks II:
Models
The Degree Sequence of a Scale-Free Random Graph Process Introduction to Random Graphs
4/3 Graphs and Networks III:
Dense Subgraphs
Dense subgraphs k-clique DSP
4/5 Graphs and Networks IV:
Spectral Graph Clustering
Spectral Graph Theory and its Applications
by Dan Spielman
Cheeger’s Inequality
4/10 Gradient, and Stochastic Gradient Descent
(Whiteboard)
Chapter 14 Gradient Descent
4/12 Machine Learning Fundamentals
(Whiteboard)
Chapter 2,3 A Theory of the Learnably
by L. Valiant
4/17 Regression
(Linear Predictors)
(Whiteboard)
Chapter 9 LinReg sklearn
4/19 Clustering
(Whiteboard)
Chapter 22 k-means+: The Advantages of Careful Seeding
4/24 Dimensionality Reduction
(Whiteboard)
Chapter 23 Roughgarden-Valiant notes
Randomized algorithms for matrices and data by Michael W. Mahoney
Dimensionality Reduction for k-Means Clustering and Low Rank Approximation by Michael Cohen et al.
4/26 Naive Bayes Classifier
(Whiteboard)
Chapter 24.2 NB scikit-learn
5/1 Project Presentations
5/3 Project Presentations
5/8 Final exam

Video lectures (external)

  • Hashing: Playlist from the 2014 Summer School on Hashing
  • Count-Min sketch, 10 years later: Talk by Muthu Muthukrishnan

 

 

%d bloggers like this: