CS 591 Data Analytics |

CS 591 Data Analytics: Theory and Applications

Instructor: Charalampos E. Tsourakakis
Semester: Spring 2017
When: MW 2.30-3.45
Where: MCS B25

Course description

Our growing ability to measure and record almost everything creates unprecedented opportunities for optimizing our lives. For example, consider the following potential application in healthcare: can we design software that uses cell phone cameras to detect early on deadly skin diseases? Developing such applications requires –from a software perspective– multiple skills, including mathematical, algorithmic, and statistical skills. This course aims to introduce students to foundational techniques used in mining large-scale datasets: searching large volumes of data, analyzing streams of data, mining networks, and using effectively machine learning in applications. These techniques will illustrate the two prominent ways to design software for analyzing data: traditional CS techniques where explicit instructions are coded, and creating software systems that learn to perform tasks by being shown examples of desired input and output patterns.

Course overview

We will cover topics related to the following fundamental questions that one faces when dealing with real-world datasets. (a) How do we perform efficiently searches over large volumes of data? (b) How do we obtain the important properties of a high dimensional dataset via dimensionality reduction? (c) How do we deal with streams of data? (d) How do we mine data that come in the form of graphs? (e) Machine learning techniques.

Specifically, we will go over the following topics: (a) hashing, Bloom filters, LSH, (b) SVD, dimensionality reduction, (c) Misra Gries, distinct elements, AMS sketches, Count-Min sketch, (d) spectral partitioning, dense subgraphs, (e) linear regression, logistic regression, perceptrons, and feedforward neural networks.

Course prerequisites

Basic linear-algebra, calculus, probability, programming, data structures and algorithms.

Announcements

We will cover the missed lecture due to President’s day (220) by appending half an hour to each lecture on 227 and 3/1. On those days, I will hold office hours from 4.15 to 5pm.
Project is out. Start early! (2/10)
First day of class is Monday 1/23. See you all there!

Grading

Each registered student will be required to scribe one lecture (10%), complete the class project (70%), and write a final exam (20%).

Auditing

Students that are not registered and want to audit, will have to solve two problems from the project, and write a report on a paper of their choice.

Class Project

The project will become available at the end of the third week (Week 3) here. It will consist of two parts. The first part will be a set of problems testing the class material, both from a mathematical and a coding standpoint. For the second part, you are welcome to work on topics that either extend the class material, or on a data analytics problem of your choice. Collaborations are welcome. In case you want to work on a purely theoretical topic, please coordinate with me first. Depending on the number of projects, there will be a 10 to 15min presentation of your project in class.

Scribes

Each student is required to scribe at least once. You can use this LaTex template.

Google group

Discussion among students is strongly encouraged as long as it complies with BU’s academic conduct code. For those interested in participating in an online discussion about the class, please make a request here. Make sure to use your BU email address.

Suggested textbooks

You don’t need to buy a textbook for the class. Some great textbooks I suggest are the following:

Machine Learning: A Probabilistic Perspective, by Kevin P. Murphy
Information Theory, Inference, and Learning Algorithms by David Mackay
Mining Massive Datasets by Leskovec, Rajaraman, and Ullman.
Understanding Machine Learning by Shai Shalev-Shwartz and Shai Ben-David

Office hours

I will hold office hours on each Monday and Wednesday from 3.45 to 4.30. My office is located in MCS, room 292. You are strongly encouraged to attend. In case you this time does not work well for you, please let me know.

Class material

Lecture	Topic	Readings	Additional material
1/23	Introduction Python notebook (by Ben Lawson)	Exact MAP for Binary Images Max-flow min-cut theorem	The unreasonable effectiveness of data Kleinberg-Tardos (Metric Labeling and MRFs) Scene completion
1/25	Probability and information theory recap	Chapter 2 (Mackay) Visual Information Theory	Bayesian integration in sensorimotor learning Bayesian reasoning implicated in some mental disorders
1/30	Concentration, Balls into Bins	Birthday problem Balls into Bins	Balanced Allocations
2/1	Hashing I	Rasmus Pagh’s slides Understanding hash functions (by Geoff Pike) dict_example.py DoSdjbx33a.cc	Hashing demo (Google)
2/6	Hashing and Data Streams	An Optimal Algorithm for the Distinct Elements Problem	Sketch Library
2/8	Hashing II	Minimal Perfect Hashing Practice problems	Cuckoo Hashing for Undergraduates
2/13	Bloom Filters	Network Applications of Bloom Filters: A Survey Python Bloom Filter Demo	Less Hashing, Same Performance How to write a Bloom filter in C
2/15	Streams and Puzzles	Reservoir sampling Majority element demo	Data Streams: Algorithms and Applications (Sections 1.1, 1.2)
2/20	Probabilistic Counting Python code	Jelani’s notes I	Morris’ original paper
2/27	AMS sketch Project ideas`Piotr Indyk's slides`	Chandra Chekuri’s notes	AMS original paper
3/1	Count-min sketch	Notes C code (G. Cormode)	Count-Min Sketch paper count-min sketch and its applications
3/6	Spring break	–	–
3/8	Spring break	–	–
3/13	Min-wise hashes LSH (part 1, pptx)	Brief survey on Minhash by Edith Cohen Sections 3.1,3.2,3.3	Yahoo Data Sketches
3/15	LSH (part 1, pptx) LSH (part 2, pptx)	Min-Wise Independent Permutations LSH Algorithm and Implementation	Python code
3/20	Nearest Neighbor Search I (Hamming distance)	Wikipedia article CACM Survey	Nearest Neighbors and Similarity Search
3/22	Nearest Neighbor Search II (SimHash)	SimHash	FAst Lookups of Cosine and Other Nearest Neighbors
3/27	Graphs and Networks I: Models	Random graph models of social networks	Introduction to Random Graphs
3/29	Graphs and Networks II: Models	The Degree Sequence of a Scale-Free Random Graph Process	Introduction to Random Graphs
4/3	Graphs and Networks III: Dense Subgraphs	Dense subgraphs	k-clique DSP
4/5	Graphs and Networks IV: Spectral Graph Clustering	Spectral Graph Theory and its Applications by Dan Spielman	Cheeger’s Inequality
4/10	Gradient, and Stochastic Gradient Descent (Whiteboard)	Chapter 14	Gradient Descent
4/12	Machine Learning Fundamentals (Whiteboard)	Chapter 2,3	A Theory of the Learnably by L. Valiant
4/17	Regression (Linear Predictors) (Whiteboard)	Chapter 9	LinReg sklearn
4/19	Clustering (Whiteboard)	Chapter 22	k-means+: The Advantages of Careful Seeding
4/24	Dimensionality Reduction (Whiteboard)	Chapter 23	Roughgarden-Valiant notes Randomized algorithms for matrices and data by Michael W. Mahoney Dimensionality Reduction for k-Means Clustering and Low Rank Approximation by Michael Cohen et al.
4/26	Naive Bayes Classifier (Whiteboard)	Chapter 24.2	NB scikit-learn
5/1	Project Presentations
5/3	Project Presentations
5/8	Final exam

Video lectures (external)

Hashing: Playlist from the 2014 Summer School on Hashing

Count-Min sketch, 10 years later: Talk by Muthu Muthukrishnan