CS 591 Data Analytics: Theory and Applications
Our growing ability to measure and record almost everything creates unprecedented opportunities for optimizing our lives. For example, consider the following potential application in healthcare: can we design software that uses cell phone cameras to detect deadly skin diseases early? From a software perspective, developing such applications requires multiple skills, including mathematical, algorithmic, and statistical skills. This course aims to introduce students to foundational techniques used in mining large-scale datasets: searching large volumes of data, analyzing streams of data, mining networks, and using machine learning effectively in applications. These techniques will illustrate the two prominent ways to design software for analyzing data: traditional CS techniques, where explicit instructions are coded, and software systems that learn to perform tasks by being shown examples of desired input and output patterns.
We will cover topics related to the following fundamental questions that one faces when dealing with real-world datasets. (a) How do we search efficiently over large volumes of data? (b) How do we obtain the important properties of a high-dimensional dataset via dimensionality reduction? (c) How do we deal with streams of data? (d) How do we mine data that come in the form of graphs? (e) How do we use machine learning techniques effectively in applications?
Specifically, we will go over the following topics: (a) hashing, Bloom filters, LSH, (b) SVD, dimensionality reduction, (c) Misra Gries, distinct elements, AMS sketches, Count-Min sketch, (d) spectral partitioning, dense subgraphs, (e) linear regression, logistic regression, perceptrons, and feedforward neural networks.
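To give a flavor of the streaming topics in (c), here is a minimal Python sketch of the Misra-Gries summary for finding frequent items in one pass. The function name `misra_gries` and the parameter `k` are illustrative choices and are not taken from the course materials; the version shown keeps at most k - 1 counters, which is one common way to state the algorithm.

```python
def misra_gries(stream, k):
    """One-pass frequent-items summary using at most k - 1 counters.

    Illustrative sketch: `stream` is any iterable of hashable items and
    `k` >= 2 controls the accuracy/space trade-off.
    """
    counters = {}
    for item in stream:
        if item in counters:
            # Item is already tracked: bump its counter.
            counters[item] += 1
        elif len(counters) < k - 1:
            # There is a free slot: start tracking the item.
            counters[item] = 1
        else:
            # No free slot: decrement every counter and drop the
            # ones that reach zero. The new item is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    stream = list("abacadaeaa")
    print(misra_gries(stream, 2))
```

For a stream of length n, every item that occurs more than n/k times is guaranteed to survive in the returned dictionary, and each stored count underestimates the true frequency by at most n/k. A second pass over the data is needed if exact counts are required.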
Basic linear algebra, calculus, probability, programming, data structures, and algorithms.
- We will make up the lecture missed due to Presidents' Day (2/20) by appending half an hour to each of the lectures on 2/27 and 3/1. On those days, I will hold office hours from 4.15 to 5pm.
- Project is out. Start early! (2/10)
- First day of class is Monday 1/23. See you all there!
Each registered student will be required to scribe one lecture (10%), complete the class project (70%), and write a final exam (20%).
Students who are not registered and want to audit will have to solve two problems from the project and write a report on a paper of their choice.
The project will become available at the end of the third week here. It will consist of two parts. The first part will be a set of problems testing the class material, from both a mathematical and a coding standpoint. For the second part, you are welcome to work either on a topic that extends the class material or on a data analytics problem of your choice. Collaborations are welcome. In case you want to work on a purely theoretical topic, please coordinate with me first. Depending on the number of projects, there will be a 10- to 15-minute presentation of your project in class.
Each student is required to scribe at least once. You can use this LaTeX template.
Discussion among students is strongly encouraged as long as it complies with BU’s academic conduct code. For those interested in participating in an online discussion about the class, please make a request here. Make sure to use your BU email address.
You don’t need to buy a textbook for the class. Some great textbooks I suggest are the following:
- Machine Learning: A Probabilistic Perspective, by Kevin P. Murphy
- Information Theory, Inference, and Learning Algorithms, by David MacKay
- Mining Massive Datasets, by Leskovec, Rajaraman, and Ullman
- Understanding Machine Learning by Shai Shalev-Shwartz and Shai Ben-David
I will hold office hours every Monday and Wednesday from 3.45 to 4.30. My office is located in MCS, room 292. You are strongly encouraged to attend. If this time does not work well for you, please let me know.
Video lectures (external)
- Hashing: Playlist from the 2014 Summer School on Hashing
- Count-Min sketch, 10 years later: Talk by Muthu Muthukrishnan