Graphlet Sampling Kernel

The graphlet sampling kernel decomposes graphs into graphlets (i.e. small subgraphs with \(k\) nodes where \(k \in \{ 3,4,5, \ldots \}\)) [Prvzulj07] and counts matching graphlets in the input graphs. Let \(\mathcal{G} = \{ graphlet_1,graphlet_2, \ldots, graphlet_r\}\) be the set of size-\(k\) graphlets. Let also \(f_G \in \mathbb{N}^r\) be a vector such that its \(i\)-th entry is equal to the frequency of occurrence of \(graphlet_i\) in \(G\), \(f_{G,i} = \#(graphlet_i \sqsubseteq G)\). Then, the graphlet kernel is defined as follows.

Graphlet of size \(k\) Kernel

Let \(G_i\), \(G_j\) be two graphs of size \(n \geq k\), and \(f_{G_i}, f_{G_j}\) vectors that count the occurrence of each graphlet of size \(k\) in the two graphs. Then the graphlet kernel is defined as

\begin{equation} k(G_i,G_j) = f_{G_i}^\top \ f_{G_j} \end{equation}

As is evident from the above definition, the graphlet kernel is computed by explicit feature maps. First, the representation of each graph in the feature space is computed. And then, the kernel value is computed as the dot product of the two feature vectors. The main problem of graphlet kernel is that an exaustive enumeration of graphlets is very expensive. Since there are \(\binom{n}{k}\) size-\(k\) subgraphs in a graph, computing the feature vector for a graph of size \(n\) requires \(\mathcal{O}(n^k)\) time. To account for that, Shervashidze et al. resorted to sampling [SPM+09]. Following Weissman et al. [WOS+03], they showed that by sampling a fixed number of graphlets the empirical distribution of graphlets will be sufficiently close to their actual distribution in the graph.

Below follows the implemented graphlet sampling kernel. By using the parameter sampling the user can explore various possibilities from sampling all graphlets, to sampling probabilistically based either on the number of samples or the satisfaction of a certain probabilistic on error.

GraphletSampling([n_jobs, normalize, …])

The graphlet sampling kernel.

Bibliography

Prvzulj07

Nataša Pržulj. Biological network comparison using graphlet degree distribution. Bioinformatics, 23(2):e177–e183, 2007.

SPM+09

N. Shervashidze, T. Petri, K. Mehlhorn, K. M. Borgwardt, and S. Vishwanathan. Efficient Graphlet Kernels for Large Graph Comparison. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 488–495. 2009.

WOS+03

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the $l_1$ deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.