# Provably Fast Inference of Latent Features from Networks

### Latent feature learning: where overlapping communities, correlation clustering, binary matrix factorization and extremal graph theory meet

Motivation: Suppose we are given the following.

• Agents: Five agents, represented by $\{v_1,\ldots,v_5\}$.
• Binary vertex features. An agent can either be interested or not in each out of three news categories businessentertainment, sports.  Each interest is represented by a binary feature. For instance, in the figure below, $v_1$ is interested only in business news.
• Connection rule. Two agents connect iff they share at least one common interest.

Given this information, it is straightforward to generate the graph. For instance, $v_1$ is connected to $v_2,v_3$ since they share their interest in business news. However there are no edges between $v_1$ and $v_4,v_5$ as they share no common interests.

In principle, the connection rule is probabilistic.  In that case, we can sample a graph with the correct probability.
Here is a fundamental inverse problem: we are given an unweighted, undirected graph $G(V,E)$, the number of binary features $k$, but we are not given the binary vertex features.  How do we learn (a good setting of) the binary features? This is a fundamental learning problem on networks with many related applications. This set of applications includes overlapping community detection: think that features correspond to communities as in the following figure. What we see are two overlapping communities. Vertices that belong to both communities are labeled as 11, whereas the rest are labeled as 10 or 01.

However, on real data the communities are far away from being as dense as shown above.  They are more likely to look like the following figure. As you observe the communities are sparser. Also, the overlap is denser than the two counterparts. This is not necessarily always true, but as Yang and Leskovec point out, this is frequently the case in social networks.

Despite the obvious conceptual connection between latent feature learning and overlapping community detection, the two problems are not the identical. For the graph shown in the first figure, any good algorithms  will output the same result. However, things are different for the sparser graph shown in the second figure.
Any good overlapping community detection method will output two communities, just as in figure 1. However, Bayesian non-parametric methods -a prominent class of latent feature learning methods- will attempt to explain the sparsity within each community by using a value of $k$ greater than 2, despite the existence of two communities. From this point of view, they can seen as refined overlapping community detection methods. The latent feature learning problem touches various research lines, including binary matrix factorization, machine learning and graph analysis. For more details, see the related work section in my recent WWW’15 paper.   One thing that I realized by experimenting with various state-of-art machine learning methods is that Bayesian non-parametric methods are accurate but they don’t scale well. This makes them useful for relatively small networks.  The component that appears to be the main bottleneck for scaling up such methods is the poor mixing of a Markov chain on which they rely.

Some results from Ref. [1]: Last year I became interested in the latent feature learning problem. I considered a simple probabilistic model according to which the probability of an edge is a non-decreasing function of the common interests of the the two endpoints. The probabilistic setting follows.

• Number of features. $k=\Theta(1)$ (known)
• Latent labeling. $\omega:V \rightarrow \{0,1\}^k$
• Vertex “interests”. Define for each vertex $v$ the set $\mathcal{C}(v)$ of non-zero coordinates. E.g., if $\omega(v)=010$ then $\mathcal{C}(v) =\{2\}$.
• Connection rule. Let $r$ be a threshold parameter. Vertex $v$ establishes its connections with other vertices based on the following simple rule:
• $|\mathcal{C}(v) \cap \mathcal{C}(u)| \geq r$, then add edge $(u,v)$ with probability $p$.
• If $|\mathcal{C}(v) \cap \mathcal{C}(u)| \leq r-1$, then add edge $(u,v)$ with probability $q < p$.

Notice that when $r=0$ the resulting random graph model is the $G(n,p)$ model. However, for $r \geq 1$ we obtain an interesting random graph setting. This model is reminiscent of an overlapping version of the planted partition model.  Based on this probabilistic model we can write down the likelihood function of a graph $G(V,E)$ given the latent labeling $\omega$. We use $r=1$. Specifically, $Prob(G|\omega) = \prod_{(u,v) \in E(G)} p^{X_{uv}} q^{1-X_{uv}} \times \prod_{(u,v) \notin E(G)} (1-p)^{X_{uv}} (1-q)^{1-X_{uv}}.$
Here, $X_{uv}$ is an indicator variable equal to 1 iff $u,v$ share at least one common interest. By taking logarithms we find that the log-likelihood function equals $C(n,m,p,q)+\sum_{(u,v) \in E(G)} X_{uv} \log_2( \frac{p}{q})+\sum_{(u,v) \in \bar{E}(G)}(1-X_{uv}) \log_2( \frac{1-q}{1-p})$ where $C(n,m,p,q)=m\log_2{q}+ \big( {n \choose 2} - m \big) \log_2{(1-p)}$ is a constant that does not depend on the labeling $\omega$ and $\bar{E}(G) = {[n] \choose 2} \backslash E(G)$. Here is a nice observation. If we set $p=2/3,q=1/3$ (notice that in general their sum need not be one) we obtain the overlapping correlation clustering (OCC) objective, namely $\sum_{(u,v) \in E(G)} X_{uv} +\sum_{(u,v) \in \bar{E}(G)}(1-X_{uv})$. This objective was introduced by Bonchi et al. and was shown to be NP-hard. So we know that our problem is NP-hard in general.  Bonchi et al. introduced in their work a heuristic that in certain cases works surprisingly well, given its simplicity. You can take a look in their paper for the details but here is the rough idea: perform a systematic scan of the vertices and for each vertex change its labeling in a way that increases the objective value. One could choose any labeling that increases the objective or choose greedily the labeling that results in the largest increase.   Here is another observation: their algorithm is a deterministic hill climbing heuristic even if it is not explicitly stated in their paper. Furthermore, their method works only for dense graphs. To see why, suppose that I give you a graph with $m=o(n^2)$. Then, since the first summation term (over edges) is negligible with respect to the second term (non-edges), we can set $f(v)=\underbrace{0\ldots0}_{k \text{~zeros}}$ for all vertices. Similarly, if we have more edges than non-edges, we can satisfy them by setting $f(v)=\underbrace{1\ldots 1}_{k \text{~zeros}}$. Since in real-world networks $m=o(n^2)$ the OCC formulation is not suitable. This observation also shows that obtaining a constant factor approximation algorithm is trivial. One result in [1] is  to provide a rapidly mixing Metropolis chain. whose stationary distribution favors the states that achieve a high likelihood score.  The second question that to my knowledge had not been asked by latent feature learning methods on graphs, is what kind of graph substructures force us to use large k values in order to achieve a good score. It turns out that bipartite cliques make the problem hard. This is intuitive. If my neighborhood $\{u_1,\ldots,u_r\}$ is a stable set, then I have a common interest with $u_1$. I Also have a common interest with $u_2$. But since $u_1$ and $u_2$ are not connected there have to be two different features and so on. This is interesting when you combine it with the empirical fact that  existing latent feature learning methods achieve good scores on co-authorship graphs and social networks with low $k$ values. This suggests that there are not large bipartite cliques on these networks. Indeed, Ugander et al. studied how frequently certain small subgraphs appear in real-world networks.

Source: people.cam.cornell.edu/~jugander/subgraphs/

It turns out that social networks have a strikingly small number of induced cycles of length 4, i.e., usually denoted as $C_4$s. This makes sense. If there were large bipartite cliques, then we would have many $C_4$s.

Open problems. Some interesting research directions include:

• Combinatorial understanding of latent feature learning methods for graphs.
• Richer family of probabilistic models for undirected graphs for which we can have efficient (approximation) algorithms?
• New probabilistic models and latent feature learning algorithms for directed graphs.

References

[1] “Provably Fast Inference of Latent Features from Networks“, Charalampos E. Tsourakakis, WWW’15