Motivation: Suppose you are given a large “who-calls-whom” network. This is a network where each vertex represents a human and we have an (undirected) edge between two humans if and only if they exchanged a phone call. An important problem that comes up in anomaly detection is finding sets of vertices that “look like” cliques. There are different ways to quantify what “looks like” precisely means, but for now we will think of this as a subgraph which is close to being fully interconnected modulo few missing edges. Large sets of such vertices are contradicting to the average human habits: who talks to say 50 people who all talk among each other as well? The problem of finding such sets of vertices is a hard problem. More importantly, this problem comes up in many real-world applications, including correlation mining, graph visualization, data mining, bioinformatics and drug discovery.
We will use the following notation: let be a simple, unweighted, undirected graph. Also let . For any set of vertices , let is the number of induced edges by .
Related work: The maximum clique problem is NP-complete and even worse -unless P=NP- there can be no polynomial time algorithm that approximates the maximum clique to within a factor better than for any , a famous result due to Hastad. Many other formulations end up being connected to the maximum clique problem. For instance, in the paper “Denser than the Densest Subgraph: Extracting Optimal Quasi-Cliques with Quality Guarantees” I suggested the optimal quasi-clique formulation which aims to maximize over all possible the objective . Here, . It is not hard to check that if one sets , one can solve the maximum clique problem.
It is impossible to exhaust the existing work related to the problem of dense subgraph discovery. However, here is a set of available algorithmic tools.
- Brute force, which is possible if is small enough.
- Heuristics of all kinds. Such methods suffer from at least one of the following drawbacks: (i) they are not well understood, (ii) they frequently fail (iii) they do not scale well.
- Maximal clique enumeration methods.
- Spectral methods
- NP-hard formulations and various relaxations
- Poly-time solvable objectives
The latter family constitutes a major algorithmic toolbox. Among poly-time solvable objectives the densest subraph problem stands out. To quote a paper due to Bahmani, Kumar and Vassilvitski, “The densest subgraph problem lies at the core of large scale data mining”.
Densest subgraph problem: The goal is to find that maximizes the degree density
. Notice that degree density is different from the edge density . Some well-known facts concerning the densest subgraph problem follow.
- The densest subgraph problem is solvable in polynomial time. The most efficient exact method relies on parametric maximum flow methods.
- There exists a 2-approximation peeling algorithm which uses linear space and runs in linear time due to Charikar.
- On many real-world instances, the densest subgraph problems returns large sets of vertices with low edge density. Here is an extreme example: on a small network with 115 vertices and 613 edges (Football network), the densest subgraph is the whole network.
k-clique densest subgraph problem: A couple of months ago, I presented at WWW’15 my work “The K-clique densest subgraph problem“. This problem generalized the densest subgraph problem. Specifically, the problem is to maximize over all possible the k-clique density . Here, is the number of induced k-cliques by . Notice that , so the 2-clique densest subgraph problem is nothing but the well-studied densest subgraph problem. It turns out that the k-clique densest subgraph problem is also solvable in polynomial time for any constant. In the same paper, I generalize prior work of Charikar to -approximate the k-clique densest subgraph problem and prior work due to Bahmani, Kumar and Vassilvitski to produce an efficient MapReduce algorithm. Here is why the k-clique densest subgraph problem is an important step towards rigorous methods that are able to succeed empirically on the hard problem of large near-clique extraction. Here is an example of what we observe on a large collaboration network when we solve exactly the k-clique DSP.
- For k=2, the optimal solution is a set of 18,686 vertices with edge density 0.11.
- For k=3, the optimal solution is a set of 76 vertices with edge density 0.80.
- For k=4, the optimal solution is a set of 62 vertices with edge density 0.96.
Scalability: Essentially, in the worst case, the run time depends exponentially on , as the number of k-cliques can grow as . The same holds for the space complexity. This is a major scalability bottleneck. In a recent paper, which is joint work with Michael Mitzenmacher, Richard Peng, Jakub Pachocki and Shen Chen Xu we provide a randomized algorithm that allows to scale the k-clique densest subgraph problem solution to large networks, while achieving highly-accurate solutions. I will post more on this soon, so stay tuned.