Dense Subgraph Discovery

Numerous high-impact applications rely on dense subgraph discovery, including anomaly detection and security, community detection in social networks, and the Web graph, detection of protein complexes in protein interaction networks, and extraction of highly correlated entities from a relevance graph. The latter type of graphs is used to encode pairwise similarities among entities, e.g., time series, and is ubiquitous in data science applications. Two major formulations for dense subgraph discovery are the maximum clique problem and the  densest subgraph problem (DSP). The former is a notoriously hard problem, whereas the latter is poly-time solvable and lies at the core of large-scale data mining.

Denser than the Densest Subgraph: A heuristic approach for large-near clique extraction 

In this paper, we define a novel density function, which gives subgraphs of much higher quality than densest subgraphs: the graphs found by our method are compact, dense, and with smaller diameter. We show that the proposed function can be derived from a general framework, which includes other important density functions as subcases and for which we show interesting general theoretical properties. To optimize the proposed function we provide an additive approximation algorithm and a local-search heuristic. Both algorithms are very efficient and scale well to large graphs.



  1. Denser than the Densest Subgraph: Extracting Optimal Quasi-Cliques with Quality Guarantees

The k-clique densest subgraph problem: a principled approach for large-near clique extraction


k-clique densest subgraph on a collaboration network

My work introduced  the k-clique densest subgraph problem rigorous theoretical guarantees for scalable extraction of large near-cliques from networks. The original intuition behind designing this family of objectives, which contain DSP as the special case k = 2, is that triangles are a better signature for community participation compared to edges.  The figure on the right, shows what we observe on an astrophysics collaboration network, and is representative of what we observe on real-world networks. As k grows, the size of the optimal sets S^*_k drops and the edge density f_e(S^*_k)= e(S^*_k)/ {|S^*_k| \choose 2} grows. Notice the sudden change from k=2 to k=3 and that for k=5 we are able to find a large near-clique on 62 nodes. The dots are scaled according to f_e. The same behavior is observed across social networks, autonomous systems, blog and Web graphs. Subsequent work  focuses on scaling to large-scale datasets, while maintaining



  1. Scalable Large Near-Clique Detection in Large-Scale Networks via Sampling
  2. The K-clique Densest Subgraph Problem


Densest Subgraph Problem in the Streaming Model

Bahmani et al. first provided a 2+\epsilon approximation algorithm for the DSP (\epsilon>0) that performs O(\log n/\epsilon) passes over the input in the streaming model. It was an open question whether that number of passes was the best possible. In joint work with Sayan Bhattacharya, Monika Henzinger, and Danupon Nanongkai we showed that a single pass suffices to get a  2+\epsilon approximation. In follow-up paper, my collaborators and I improved the approximation guarantee to 1+\epsilon. Specifically, we showed that sampling O(n\log n /epsilon^2) edges uniformly at random suffices to get a 1+\epsilon approximation guarantee for the DSP.


  1. Scalable Large Near-Clique Detection in Large-Scale Networks via Sampling
  2. Space- and Time-Efficient Algorithm for Maintaining Dense Subgraphs on One-Pass Dynamic Streams


Densest Subgraph Problem for Dynamic Graphs 

In our STOC 2015 paper, we provide state-of-the-art results for the DSP on time-evolving graphs. For more details, see here.

Dense Subgraph Discovery, KDD Tutorial

In 2015, Aris Gionis and I gave a tutorial at KDD on dense subgraph discovery.  You can watch the Youtube videos from here.

%d bloggers like this: