Multiscale Laplacian Kernel
The multiscale Laplacian graph kernel can handle unlabeled graphs, graphs with discrete node labels and graphs with continuous node attributes [KP16]. It takes into account structure in graphs at a range of different scales by building a hierarchy of nested subgraphs. These subgraphs are compared to each other using another graph kernel, called the feature space Laplacian graph kernel, which is capable of lifting a base kernel defined on the vertices of two graphs to a kernel between the graphs themselves. Since exact computation of the multiscale Laplacian graph kernel is a very expensive operation, the kernel uses a randomized projection procedure similar to the popular Nyström approximation for kernel matrices [WS01].
Let \(G=(V,E)\) be an undirected graph with \(n = |V|\) vertices. The Laplacian of \(G\) is the \(n \times n\) matrix

\[ \mathbf{L} = \mathbf{D} - \mathbf{A} \]
where \(\mathbf{A}\) is the adjacency matrix of \(G\) and \(\mathbf{D}\) is a diagonal matrix such that \(\mathbf{D}_{ii} = \sum_j \mathbf{A}_{ij}\).
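For example, with NumPy (a minimal sketch; the 4-cycle used here is an arbitrary toy graph):

```python
import numpy as np

# Adjacency matrix of a small undirected example graph (a 4-cycle).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))  # D_ii = sum_j A_ij
L = D - A                   # the graph Laplacian
```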
Given two graphs \(G_1\) and \(G_2\), each with \(n\) vertices, we can define the kernel between them to be a kernel between the corresponding normal distributions \(p_1 = \mathcal{N}(\mathbf{0}, \mathbf{L}_1^{-1})\) and \(p_2 = \mathcal{N}(\mathbf{0}, \mathbf{L}_2^{-1})\), where \(\mathbf{0}\) is the \(n\)-dimensional all-zeros vector. More specifically, given two graphs \(G_1\) and \(G_2\) of \(n\) vertices with Laplacians \(\mathbf{L}_1\) and \(\mathbf{L}_2\) respectively, the Laplacian graph kernel with parameter \(\gamma\) between the two graphs is

\[ k_{LG}(G_1, G_2) = \frac{\big| \big( \frac{1}{2} \mathbf{S}_1^{-1} + \frac{1}{2} \mathbf{S}_2^{-1} \big)^{-1} \big|^{1/2}}{|\mathbf{S}_1|^{1/4} \, |\mathbf{S}_2|^{1/4}} \]
where \(\mathbf{S}_1 = \mathbf{L}_1^{-1} + \gamma \mathbf{I}\), \(\mathbf{S}_2 = \mathbf{L}_2^{-1} + \gamma \mathbf{I}\) and \(\mathbf{I}\) is the \(n \times n\) identity matrix. The Laplacian graph kernel captures similarity between the overall shapes of the two graphs. However, it assumes that both graphs have the same size, and it is not invariant to permutations of the vertices.
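A direct NumPy sketch of this formula follows. Since a graph Laplacian is singular, a small \(\eta \mathbf{I}\) term is added before inversion; this regularizer is an assumption made to keep the example well-defined, and the exact regularization scheme is a design choice:

```python
import numpy as np

def laplacian_graph_kernel(L1, L2, gamma=0.1, eta=0.1):
    """Laplacian graph kernel between two graphs of the same size,
    given their Laplacians L1 and L2.

    eta*I is added before inversion because a graph Laplacian is
    singular (an assumed regularization, not part of the formula)."""
    n = L1.shape[0]
    I = np.eye(n)
    S1 = np.linalg.inv(L1 + eta * I) + gamma * I
    S2 = np.linalg.inv(L2 + eta * I) + gamma * I
    # |(S1^{-1}/2 + S2^{-1}/2)^{-1}|^{1/2} / (|S1|^{1/4} |S2|^{1/4})
    S = np.linalg.inv(0.5 * np.linalg.inv(S1) + 0.5 * np.linalg.inv(S2))
    return (np.linalg.det(S) ** 0.5 /
            (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25))
```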
To achieve permutation invariance, the multiscale Laplacian graph kernel represents each vertex as an \(m\)-dimensional vector whose components correspond to local and permutation invariant vertex features. Such features may include, for instance, the degree of the vertex or the number of triangles in which it participates. Then, it performs a linear transformation and represents each graph as a distribution of the considered features instead of a distribution of its vertices. Let \(\mathbf{U}_1, \mathbf{U}_2 \in \mathbb{R}^{m \times n}\) be the feature mapping matrices of the two graphs, that is the matrices whose columns contain the vector representations of the vertices of the two graphs. Then, the feature space Laplacian graph kernel is defined as

\[ k_{FLG}(G_1, G_2) = \frac{\big| \big( \frac{1}{2} \mathbf{S}_1^{-1} + \frac{1}{2} \mathbf{S}_2^{-1} \big)^{-1} \big|^{1/2}}{|\mathbf{S}_1|^{1/4} \, |\mathbf{S}_2|^{1/4}} \]
where \(\mathbf{S}_1 = \mathbf{U}_1 \mathbf{L}_1^{-1} \mathbf{U}_1^\top + \gamma \mathbf{I}\), \(\mathbf{S}_2 = \mathbf{U}_2 \mathbf{L}_2^{-1} \mathbf{U}_2^\top + \gamma \mathbf{I}\) and \(\mathbf{I}\) is the \(m \times m\) identity matrix. Since the vertex features are local and invariant to vertex reordering, the feature space Laplacian graph kernel is permutation invariant. Furthermore, since the distributions now live in the space of features rather than the space of vertices, the feature space Laplacian graph kernel can be applied to graphs of different sizes.
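To make this concrete, here is a minimal NumPy sketch. The two features used (degree and triangle participation) are just illustrative choices, and \(\eta\) is again an assumed regularizer for the singular Laplacian:

```python
import numpy as np

def vertex_features(A):
    """Two illustrative local features per vertex: its degree and the
    number of triangles it participates in (rows of the m x n matrix U)."""
    deg = A.sum(axis=1)
    tri = np.diag(np.linalg.matrix_power(A, 3)) / 2.0
    return np.vstack([deg, tri])                 # U, here with m = 2

def feature_space_laplacian_kernel(A1, A2, gamma=0.1, eta=0.1):
    """Feature space Laplacian graph kernel; the graphs may differ in size."""
    def S(A):
        n = A.shape[0]
        U = vertex_features(A)                   # m x n feature mapping
        Linv = np.linalg.inv(np.diag(A.sum(axis=1)) - A + eta * np.eye(n))
        return U @ Linv @ U.T + gamma * np.eye(U.shape[0])
    S1, S2 = S(A1), S(A2)
    M = np.linalg.inv(0.5 * np.linalg.inv(S1) + 0.5 * np.linalg.inv(S2))
    return (np.linalg.det(M) ** 0.5 /
            (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25))
```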
Let \(\phi(v)\) be the representation of vertex \(v\) constructed from local vertex features as described above. The base kernel \(\kappa\) between two vertices \(v_1\) and \(v_2\) corresponds to the dot product of their feature vectors

\[ \kappa(v_1, v_2) = \langle \phi(v_1), \phi(v_2) \rangle \]
Let \(G_1\) and \(G_2\) be two graphs with vertex sets \(V_1 = \{ v_1, \ldots, v_{n_1}\}\) and \(V_2 = \{ u_1, \ldots, u_{n_2} \}\) respectively, and let \(\bar{V} = \{ \bar{v}_1, \ldots, \bar{v}_{n_1+n_2} \}\) be the union of the two vertex sets. Let also \(\mathbf{K} \in \mathbb{R}^{(n_1+n_2) \times (n_1+n_2)}\) be the kernel matrix defined as

\[ \mathbf{K}_{ij} = \kappa(\bar{v}_i, \bar{v}_j) \]
Let \(\mathbf{u}_1, \ldots, \mathbf{u}_p\) be a maximal orthonormal set of eigenvectors of \(\mathbf{K}\) with non-zero eigenvalues \(\lambda_1, \ldots, \lambda_p\). Then the vectors

\[ \xi_i = \frac{1}{\sqrt{\lambda_i}} \sum_{l=1}^{n_1+n_2} [\mathbf{u}_i]_l \, \phi(\bar{v}_l) \]

where \([\mathbf{u}_i]_l\) is the \(l^{th}\) component of vector \(\mathbf{u}_i\), form an orthonormal basis for the subspace spanned by \(\{ \phi(\bar{v}_1), \ldots, \phi(\bar{v}_{n_1+n_2}) \}\). Moreover, let \(\mathbf{Q} = [ \lambda_1^{1/2} \mathbf{u}_1, \ldots,\lambda_p^{1/2} \mathbf{u}_p ] \in \mathbb{R}^{(n_1+n_2) \times p}\), and let \(\mathbf{Q}_1, \mathbf{Q}_2\) denote the first \(n_1\) and last \(n_2\) rows of \(\mathbf{Q}\) respectively. Then, the generalized feature space Laplacian graph kernel induced from the base kernel \(\kappa\) is defined as

\[ \bar{k}_{FLG}^{\kappa}(G_1, G_2) = \frac{\big| \big( \frac{1}{2} \mathbf{S}_1^{-1} + \frac{1}{2} \mathbf{S}_2^{-1} \big)^{-1} \big|^{1/2}}{|\mathbf{S}_1|^{1/4} \, |\mathbf{S}_2|^{1/4}} \]
where \(\mathbf{S}_1 = \mathbf{Q}_1^\top \mathbf{L}_1^{-1} \mathbf{Q}_1 + \gamma \mathbf{I}\) and \(\mathbf{S}_2 = \mathbf{Q}_2^\top \mathbf{L}_2^{-1} \mathbf{Q}_2 + \gamma \mathbf{I}\), where \(\mathbf{I}\) is the \(p \times p\) identity matrix.
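A sketch of this construction, under the same assumed \(\eta\) regularization as above; eigenpairs with numerically zero eigenvalues are discarded with a tolerance:

```python
import numpy as np

def generalized_flg(L1, L2, K, gamma=0.1, eta=0.1, tol=1e-9):
    """Generalized feature space Laplacian graph kernel between two graphs
    with Laplacians L1 (n1 x n1) and L2 (n2 x n2), where K is the
    (n1 + n2) x (n1 + n2) base-kernel matrix over the joint vertex set.
    eta regularizes the singular Laplacians before inversion (assumption)."""
    n1, n2 = L1.shape[0], L2.shape[0]
    lam, U = np.linalg.eigh(K)
    keep = lam > tol                        # maximal set of non-zero eigenpairs
    Q = U[:, keep] * np.sqrt(lam[keep])     # Q = [sqrt(lam_i) u_i], (n1+n2) x p
    Q1, Q2 = Q[:n1], Q[n1:]
    p = Q.shape[1]
    S1 = Q1.T @ np.linalg.inv(L1 + eta * np.eye(n1)) @ Q1 + gamma * np.eye(p)
    S2 = Q2.T @ np.linalg.inv(L2 + eta * np.eye(n2)) @ Q2 + gamma * np.eye(p)
    S = np.linalg.inv(0.5 * np.linalg.inv(S1) + 0.5 * np.linalg.inv(S2))
    return (np.linalg.det(S) ** 0.5 /
            (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25))
```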
The multiscale Laplacian graph kernel builds a hierarchy of nested subgraphs, each centered around a vertex, and computes the generalized feature space Laplacian graph kernel between every pair of these subgraphs. Let \(G\) be a graph with vertex set \(V\), and \(\kappa\) a positive semi-definite kernel on \(V\). Assume that for each \(v \in V\) we have a nested sequence of \(L\) neighborhoods

\[ v \in N_1(v) \subseteq N_2(v) \subseteq \ldots \subseteq N_L(v) \subseteq V \]
and for each \(N_l(v)\), let \(G_l(v)\) be the corresponding induced subgraph of \(G\). The multiscale Laplacian subgraph kernels \(\mathfrak{K}_1, \ldots, \mathfrak{K}_L : V \times V \rightarrow \mathbb{R}\) are defined as follows:
- \(\mathfrak{K}_1\) is just the generalized feature space Laplacian graph kernel \(\bar{k}_{FLG}^\kappa\) induced from the base kernel \(\kappa\) between the lowest level subgraphs (i.e. the vertices): \(\mathfrak{K}_1(v, u) = \bar{k}_{FLG}^{\kappa}(G_1(v), G_1(u))\)
- For \(l=2,3,\ldots,L\), \(\mathfrak{K}_l\) is the generalized feature space Laplacian graph kernel induced from \(\mathfrak{K}_{l-1}\) between \(G_l(v)\) and \(G_l(u)\): \(\mathfrak{K}_l(v, u) = \bar{k}_{FLG}^{\mathfrak{K}_{l-1}}(G_l(v), G_l(u))\)
Then, the multiscale Laplacian graph kernel between two graphs \(G_1, G_2\) is defined as

\[ k_{MLG}(G_1, G_2) = \bar{k}_{FLG}^{\mathfrak{K}_L}(G_1, G_2) \]
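The recursion can be sketched compactly by reusing `vertex_features` and `generalized_flg` from the snippets above. Taking \(N_l(v)\) to be the \(l\)-hop ball around \(v\) is an assumption (the radius schedule is a design choice), and the quadratic pair loop mirrors the cost analysis that follows:

```python
import numpy as np

def ball(A, v, r):
    """Vertex indices of the r-hop neighborhood N_r(v)."""
    reach = np.zeros(A.shape[0], dtype=bool)
    reach[v] = True
    for _ in range(r):
        reach = reach | (A @ reach > 0)
    return np.where(reach)[0]

def mlg_kernel(A1, A2, L=2, gamma=0.1, eta=0.1):
    """Multiscale Laplacian graph kernel sketch for two adjacency matrices."""
    lap = lambda A: np.diag(A.sum(axis=1)) - A
    n1 = A1.shape[0]
    graphs = [A1, A2]
    idx = [(0, v) for v in range(n1)] + [(1, v) for v in range(A2.shape[0])]
    # Level 0: base kernel between all vertices, from local features.
    Phi = np.hstack([vertex_features(A1), vertex_features(A2)])
    K = Phi.T @ Phi
    for l in range(1, L + 1):
        K_next = np.zeros_like(K)
        balls = [ball(graphs[g], v, l) for g, v in idx]
        for i in range(len(idx)):
            for j in range(i, len(idx)):
                (gi, _), (gj, _) = idx[i], idx[j]
                bi, bj = balls[i], balls[j]
                # Joint base-kernel matrix over the two subgraph vertex sets.
                rows = np.concatenate([bi + (0 if gi == 0 else n1),
                                       bj + (0 if gj == 0 else n1)])
                K_next[i, j] = K_next[j, i] = generalized_flg(
                    lap(graphs[gi][np.ix_(bi, bi)]),
                    lap(graphs[gj][np.ix_(bj, bj)]),
                    K[np.ix_(rows, rows)], gamma, eta)
        K = K_next
    # Top level: compare the two whole graphs under the last subgraph kernel.
    return generalized_flg(lap(A1), lap(A2), K, gamma, eta)
```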
The multiscale Laplacian graph kernel computes \(\mathfrak{K}_1\) for all pairs of vertices, then computes \(\mathfrak{K}_2\) for all pairs of vertices, and so on. Hence, it requires \(\mathcal{O}(Ln^2)\) kernel evaluations. At the top levels of the hierarchy, each subgraph \(G_l(v)\) centered around a vertex may have as many as \(n\) vertices, so a single evaluation of the generalized feature space Laplacian graph kernel may take \(\mathcal{O}(n^3)\) time. This means that in the worst case, the overall cost of computing \(k_{MLG}\) is \(\mathcal{O}(Ln^5)\). Given a dataset of \(N\) graphs, computing the kernel matrix requires repeating this for all pairs of graphs, which takes \(\mathcal{O}(LN^2n^5)\) time and is clearly problematic for real-world settings.
The solution to this issue is to compute for each level \(l=1,2,\ldots,L+1\) a single joint basis for all subgraphs at the given level across all graphs. Let \(G_1, G_2, \ldots, G_N\) be a collection of graphs, \(V_1, V_2, \ldots, V_N\) their vertex sets, and assume that \(V_1, V_2, \ldots, V_N \subseteq \mathcal{V}\) for some general vertex space \(\mathcal{V}\). The joint vertex feature space of the whole graph collection is \(W = \mathrm{span} \big\{ \bigcup_{i=1}^N \bigcup_{v \in V_i} \{ \phi(v) \} \big\}\). Let \(c = \sum_{i=1}^N |V_i|\) be the total number of vertices and \(\bar{V} = (\bar{v}_1, \ldots, \bar{v}_c)\) be the concatenation of the vertex sets of all graphs. Let \(\mathbf{K}\) be the corresponding joint kernel matrix, and let \(\mathbf{u}_1, \ldots, \mathbf{u}_p\) be a maximal orthonormal set of eigenvectors of \(\mathbf{K}\) with non-zero eigenvalues \(\lambda_1,\ldots,\lambda_p\), where \(p=dim(W)\). Then the vectors

\[ \xi_i = \frac{1}{\sqrt{\lambda_i}} \sum_{l=1}^{c} [\mathbf{u}_i]_l \, \phi(\bar{v}_l) \]
form an orthonormal basis for \(W\). Moreover, let \(\mathbf{Q} = [ \lambda_1^{1/2} \mathbf{u}_1, \ldots, \lambda_p^{1/2} \mathbf{u}_p ] \in \mathbb{R}^{c \times p}\), and let \(\mathbf{Q}_1\) denote the first \(n_1\) rows of \(\mathbf{Q}\), \(\mathbf{Q}_2\) the next \(n_2\) rows, and so on. For any pair of graphs \(G_i, G_j\) of the collection, the generalized feature space Laplacian graph kernel induced from \(\kappa\) can be expressed as

\[ \bar{k}_{FLG}^{\kappa}(G_i, G_j) = \frac{\big| \big( \frac{1}{2} \bar{\mathbf{S}}_i^{-1} + \frac{1}{2} \bar{\mathbf{S}}_j^{-1} \big)^{-1} \big|^{1/2}}{|\bar{\mathbf{S}}_i|^{1/4} \, |\bar{\mathbf{S}}_j|^{1/4}} \]
where \(\bar{\mathbf{S}}_i = \mathbf{Q}_i^\top \mathbf{L}_i^{-1} \mathbf{Q}_i + \gamma \mathbf{I}\), \(\bar{\mathbf{S}}_j = \mathbf{Q}_j^\top \mathbf{L}_j^{-1} \mathbf{Q}_j + \gamma \mathbf{I}\) and \(\mathbf{I}\) is the \(p \times p\) identity matrix.
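A sketch of this shared-basis computation, again with the assumed \(\eta\) regularizer; `sizes` must list \(|V_i|\) in the order in which the joint kernel matrix was assembled:

```python
import numpy as np

def joint_flg_gram(laplacians, K, sizes, gamma=0.1, eta=0.1, tol=1e-9):
    """Gram matrix of the generalized FLG kernel over a graph collection,
    using one joint eigendecomposition of the c x c base-kernel matrix K."""
    lam, U = np.linalg.eigh(K)
    keep = lam > tol
    Q = U[:, keep] * np.sqrt(lam[keep])      # c x p joint basis coordinates
    p = Q.shape[1]
    # One p x p matrix S_i per graph, from its block of rows of Q.
    S, off = [], 0
    for Lg, n in zip(laplacians, sizes):
        Qi = Q[off:off + n]
        Linv = np.linalg.inv(Lg + eta * np.eye(n))
        S.append(Qi.T @ Linv @ Qi + gamma * np.eye(p))
        off += n
    N = len(S)
    G = np.zeros((N, N))
    for i in range(N):
        for j in range(i, N):
            M = np.linalg.inv(0.5 * np.linalg.inv(S[i]) +
                              0.5 * np.linalg.inv(S[j]))
            G[i, j] = G[j, i] = (np.linalg.det(M) ** 0.5 /
                                 (np.linalg.det(S[i]) ** 0.25 *
                                  np.linalg.det(S[j]) ** 0.25))
    return G
```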
The implementation of the multiscale Laplacian kernel can be found below
| `MultiscaleLaplacian` | Laplacian Graph Kernel as proposed in [KP16]. |
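A minimal usage sketch with GraKeL follows; `fetch_dataset` and `MultiscaleLaplacian` are part of the library, but the specific parameter values here are illustrative rather than recommended settings:

```python
from grakel.datasets import fetch_dataset
from grakel.kernels import MultiscaleLaplacian

# MUTAG: small molecular graphs with discrete node labels.
MUTAG = fetch_dataset("MUTAG", verbose=False)
G, y = MUTAG.data, MUTAG.target

# L is the depth of the subgraph hierarchy, gamma the regularization
# parameter of the S matrices (values chosen only for illustration).
mlk = MultiscaleLaplacian(normalize=True, L=2, gamma=0.01)
K = mlk.fit_transform(G)  # N x N kernel matrix over the dataset
```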
Low Rank Approximation
Computing the kernel matrix between all vertices of all graphs (\(c\) vertices in total) and storing it is a very costly procedure, and computing its eigendecomposition is even worse in terms of runtime. Moreover, \(p\) is also very large, so managing the \(\bar{\mathbf{S}}_1, \ldots, \bar{\mathbf{S}}_N\) matrices (each of which is of size \(p \times p\)) becomes infeasible. Hence, the multiscale Laplacian graph kernel replaces \(W\) with a smaller, approximate joint feature space. Let \(\tilde{V} = (\tilde{v}_1, \ldots, \tilde{v}_{\tilde{c}})\) be \(\tilde{c} \ll c\) vertices sampled from the joint vertex set. Then, the corresponding subsampled vertex feature space is \(\tilde{W} = \mathrm{span} \{ \phi(v) : v \in \tilde{V} \}\). Let \(\tilde{p} = dim(\tilde{W})\). Similarly to before, the kernel constructs an orthonormal basis \(\{ \xi_1, \ldots, \xi_{\tilde{p}} \}\) for \(\tilde{W}\) by forming the (now much smaller) kernel matrix \(\mathbf{K}_{ij} = \kappa(\tilde{v}_i, \tilde{v}_j)\), computing its eigenvalues and eigenvectors, and setting \(\xi_i = \frac{1}{\sqrt{\lambda_i}} \sum_{l=1}^{\tilde{c}} [\mathbf{u}_i]_l \phi(\tilde{v}_l)\). The resulting approximate generalized feature space Laplacian graph kernel is

\[ \tilde{k}_{FLG}^{\kappa}(G_1, G_2) = \frac{\big| \big( \frac{1}{2} \tilde{\mathbf{S}}_1^{-1} + \frac{1}{2} \tilde{\mathbf{S}}_2^{-1} \big)^{-1} \big|^{1/2}}{|\tilde{\mathbf{S}}_1|^{1/4} \, |\tilde{\mathbf{S}}_2|^{1/4}} \]
where \(\tilde{\mathbf{S}}_1 = \tilde{\mathbf{Q}}_1^\top \mathbf{L}_1^{-1} \tilde{\mathbf{Q}}_1 + \gamma \mathbf{I}\) and \(\tilde{\mathbf{S}}_2 = \tilde{\mathbf{Q}}_2^\top \mathbf{L}_2^{-1} \tilde{\mathbf{Q}}_2 + \gamma \mathbf{I}\) are the projections of \(\bar{\mathbf{S}}_1\) and \(\bar{\mathbf{S}}_2\) to \(\tilde{W}\), and \(\mathbf{I}\) is the \(\tilde{p} \times \tilde{p}\) identity matrix. Finally, the kernel introduces a further layer of approximation by restricting \(\tilde{W}\) to the space spanned by the first \(\hat{p} < \tilde{p}\) basis vectors (ordered by descending eigenvalue), effectively performing kernel PCA on \(\{ \phi(\tilde{v}) \}_{\tilde{v} \in \tilde{V}}\). The combination of these two approximations makes computing the entire stack of kernels feasible, reducing the complexity of computing the kernel matrix for a dataset of \(N\) graphs to \(\mathcal{O}(NL \tilde{c}^2 \hat{p}^3 + NL \tilde{c}^3 + N^2 \hat{p}^3)\).
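A sketch of the two approximation layers, assuming a linear base kernel \(\kappa(v, u) = \langle \phi(v), \phi(u) \rangle\) so that kernel evaluations reduce to dot products; the coordinates returned by `vertex_coordinates` play the role of \(\tilde{\mathbf{Q}}_i\):

```python
import numpy as np

def sampled_basis(Phi, c_tilde, p_hat, seed=0, tol=1e-9):
    """Kernel PCA basis from c_tilde sampled vertices, truncated to the
    top p_hat eigenpairs. Phi holds one feature column phi(v) per vertex."""
    rng = np.random.default_rng(seed)
    sample = rng.choice(Phi.shape[1], size=c_tilde, replace=False)
    Ks = Phi[:, sample].T @ Phi[:, sample]   # c_tilde x c_tilde kernel matrix
    lam, U = np.linalg.eigh(Ks)
    top = np.argsort(lam)[::-1]
    top = top[lam[top] > tol][:p_hat]        # descending, non-zero, truncated
    return sample, U[:, top], lam[top]

def vertex_coordinates(Phi_graph, Phi_sample, U, lam):
    """Coordinates <xi_i, phi(v)> of one graph's vertices in the sampled
    basis: (1/sqrt(lam_i)) * sum_l [u_i]_l * kappa(v_tilde_l, v)."""
    Kcross = Phi_sample.T @ Phi_graph        # c_tilde x n cross-kernel
    return (Kcross.T @ U) / np.sqrt(lam)     # n x p_hat matrix Q-tilde_i
```

From these coordinates, \(\tilde{\mathbf{S}}_i = \tilde{\mathbf{Q}}_i^\top \mathbf{L}_i^{-1} \tilde{\mathbf{Q}}_i + \gamma \mathbf{I}\) is formed exactly as before, now at the much smaller size \(\hat{p} \times \hat{p}\).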
The approximate multiscale Laplacian graph kernel can be found below
Bibliography
- KP16
Risi Kondor and Horace Pan. The Multiscale Laplacian Graph Kernel. In Advances in Neural Information Processing Systems, 2990–2998. 2016.
- WS01
Christopher K. I. Williams and Matthias Seeger. Using the Nyström Method to Speed Up Kernel Machines. In Advances in Neural Information Processing Systems, 682–688. 2001.