Recently, I compiled a collection of applications that rely on dense subgraph discovery for my KDD’15 tutorial with Aris Gionis. In general, dense subgraph discovery is a key graph mining primitive. While by “dense” we generally mean subgraphs which are large enough and contain many edges, the exact notion of dense is application dependent. The list is not exhaustive, but contains several interesting applications.
Spam detection in the Web graph: Gibson, Kumar and Tomkins show that many of dense subgraphs are link spam, namely websites that interlink in order to manipulate search engine results.
Reachability and distance query indexing: Distance labeling is the state-of-the-art method for computing shortest path queries extremely fast on massive networks at the cost of preprocessing appropriately the graph. Cohen, Halperin, Kaplan, Zwick use dense subgraph discovery as a subroutine in their proposed distance labeling algorithm.
Correlation mining: By this term, we mean a wide variety of applications of the following form. We are given a dataset which is not a graph to begin with. It could be a gene expression dataset, a market basket or time series. For each entity that appears in the dataset (e.g., gene, item or a time series) we create a vertex and we add an edge between two items if they are ”correlated enough”. The notion of correlation depends typically on the application, as well as the threshold that determines the existence of an edge. Finding dense subgraphs corresponds to finding sets of entities that are highly correlated. Such sets are then further analyzed by a domain expert who draws useful conclusions.
Piggybacking on social networks: How can one find a good trade-off between the number of data requests and flow information on a social media platform such as Twitter? On the one hand, updating the feed every msec keeps the user updated, but on the other hand it is computationally expensive. It turns out that an efficient algorithm for this task relies on dense subgraph discovery. Take a look at this paper.
Bioinformatics: Numerous applications in bioinformatics rely on dense subgraph discovery, for instance (i) DNA motif detection and (ii) detection of complex annotation patterns from gene annotation data.
Mining Twitter data: Angel et al. rely on dense subgraph discovery to mine Twitter data. Specifically, they create a graph out of Twitter data by adding a vertex for each entity and an edge between two entities if they co-occur in the same tweet. It turns out that subgraphs that are dense correspond to sets of entities that relate via a story which is “hot”.
Community detection: Chen and Saad show that dense subgraph discovery can be used to extract communities from some real-world networks.
Graph compression: Feder and Motwani propose a useful graph compression scheme that allows speeding up various graph computations. The compression relies on finding dense subgraphs. Since then, more papers with the same flavor, all relying on dense subgraph discovery, have followed, see this and this.
Security applications: See my previous post here.
Electronic commerce: Lang and Andersen show that large sets of vertices that look like bicliques have a special meaning on market graphs. These graphs are bipartite, the left set of vertices A corresponds to advertisers and the right set Q corresponds to queries. Each edge has weight equal to the amount of money advertiser
is willing to spend on query
.
Approximate subgraph isomorphism: Suppose I have two complete weighted graphs, not necessarily on the same number of vertices. The problem we wish to address is find two subgraphs within the two graphs that are (close to being) isomorphic. It is well known that the subgraph isomorphism problem is NP-hard. Nonetheless, here is a practical approach to it that relies on dense subgraph discovery. Create a graph , where
is the cartesian product of the vertex sets of the two input graphs. We add an edge between
iff
. Now, we find a dense subgraph in
.