Neighborhood Hash Kernel

The neighborhood hash kernel assumes node-labeled graphs [HK09]. It compares graphs by updating their node labels and counting the number of common labels. The kernel replaces the discrete node labels with binary arrays of fixed length, and it then employs logical operations to update the labels so that they contain information regarding the neighborhood structure of each vertex.

Let \(\ell : \mathcal{V} \rightarrow \Sigma\) be a function that maps vertices to an alphabet \(\Sigma\) which is the set of possible discrete node labels. Hence, given a vertex \(v\), \(\ell(v) \in \Sigma\) is the label of vertex \(v\). The algorithm first transform each discrete node label to a bit label. A bit label is a binary array consisting of \(d\) bits as

\[s = \{ b_1, b_2, \ldots, b_d \}\]

where the constant \(d\) satisfies \(2^d - 1 \gg |\Sigma|\) and \(b_1, b_2, \ldots, b_d \in \{0, 1\}\).

The most important step of the algorithm involves a procedure that updates the labels of the vertices. To achieve that, the kernel makes use of two very common bit operations: (1) the exclusive or (\(XOR\)) operation, and (2) the bit rotation (\(ROT\)) operation. Let \(XOR(s_i, s_j) = s_i \oplus s_j\) denote the \(XOR\) operation between two bit labels \(s_i\) and \(s_j\) (ie the \(XOR\) operation is applied to all their components). The output of the operation is a new binary array whose components represent the \(XOR\) value between the corresponding components of the \(s_i\) and \(s_j\) arrays. The \(ROT_o\) operation takes as input a bit array and shifts its last \(o\) bits to the left by \(o\) bits and moves the first \(o\) bits to the right end as shown below

\[ROT_o(s) = \{ b_{o+1}, b_{o+2}, \ldots, b_d, b_1, \ldots, b_o \}\]

Below, we present in detail two procedures for updating the labels of the vertices: (1) the simple neighborhood hash, and (2) the count-sensitive neighborhood hash.

Simple Neighborhood Hash

Given a graph \(G=(V,E)\) with bit labels, the simple neighborhood hash update procedure computes a neighborhood hash for each vertex using the logical operations \(XOR\) and \(ROT\). More specifically, given a vertex \(v \in V\), let \(\mathcal{N}(v)=\{ u_1,\ldots,u_d \}\) be the set of neighbors of \(v\). Then, the kernel computes the neighborhood hash as

\[NH(v) = ROT_1 \big( \ell(v) \big) \oplus \big( \ell(u_1) \oplus \ldots \oplus \ell(u_d) \big)\]

The resulting hash \(NH(v)\) is still a bit array of length \(d\), and we regard it as the new label of \(v\). This new label represents the distribution of the node labels around \(v\). Hence, if \(v_i\) and \(v_j\) are two vertices that have the same label (ie \(\ell(v_i) = \ell(v_j)\)) and the label sets of their neighborhors are also identical, their hash values will be the same (ie \(NH(v_i) = NH(v_j))\). Otherwise, they will be different except for accidental hash collisions. The main idea behind this update procedure is that the hash value is independent of the order of the neighborhood values due to the properties of the \(XOR\) operation. Hence, one can check whether or not the distributions of neighborhood labels of two vertices are equivalent without sorting or matching these two label sets.

Count-sensitive Neighborhood Hash

The simple neighborhood hash update procedure described above suffers from some problematic hash collisions. Specifically, the neighborhood hash values for two independent nodes have a small probability of being the same even if there is no accidental hash collision. Such problematic hash collisions may affect the positive semidefiniteness of the kernel. To address that problem, the count-sensitive neighborhood hash update procedure counts the number of occurences of each label in the set. More specifically, it first uses a sorting algorithm (eg radix sort) to align the bit labels of the neighbors, and then, it extracts the unique labels (set \(\{ \ell_1, \ldots, \ell_l \}\) in the case of \(l\) unique labels) and for each label counts its number of occurences. Then, it updates each unique label based on its number of occurences as follows

\[\ell'_i = ROT_o \big( \ell_i \oplus o \big)\]

where \(\ell_i, \ell'_i\) is the initial and updated label respectively, and \(o\) is the number of occurences of that label in the set of neighbors. The above operation makes the hash values unique by depending on the number of label occurrences. Then, the count-sensitive neighborhood hash is computed as

\[CSNH(v) = ROT_1 \big( \ell(v) \big) \oplus \big( \ell'_1 \oplus \ldots \oplus \ell'_l \big)\]

Both the simple and the count-sensitive neighborhood hash can be seen as general approaches for enriching the labels of vertices based on the label distribution of their neighborhood vertices.

Kernel Calculation

The neighborhood hash update procedures presented above aggregate the information of the neighborhood vertices to each vertex. Then, given two graphs \(G\) and \(G'\), the updated labels of their vertices are compared using the following function

\[\kappa(G, G') = \frac{c}{|V| + |V'| - c}\]

where \(c\) is the number of labels the two graphs have in common. This function is equivalent to the Tanimoto coefficent which is commonly used as a similarity measure between sets of discrete values and which has been proven to be positive semidefinite [Gow71].

The label-update procedures is not necessary to be applied once, but they can be applied iteratively. By updating the bit labels several times, the new labels can capture high-order relationships between vertices. For instance, if the procedure is performed \(h\) times in total, the updated label \(\ell(v)\) of a vertex \(v\) represents the label distribution of its \(h\)-neighbors. Hence, two vertices \(v_i, v_j\) with identical labels and connections among their \(r\)-neighbors will be assigned the same label. Given a graph \(G=(V,E)\), let \(G_1, \ldots, G_h\) denote the updated graphs where the node labels have been updated \(1,\ldots,h\) times, respectively. Then, given two graphs \(G\) and \(G'\), the neighborhood hash kernel is defined as

\[k(G, G') = \frac{1}{h} \sum_{i=1}^h \kappa(G_i, G'_i)\]

The computational complexity of the neighborhood hash kernel is \(\mathcal{O}(dhn\bar{D})\) where \(n=|V|\) is the number of vertices of the graphs and \(\bar{D}\) is the average degree of their vertices.

The implementation of the neighborhood hash kernel can be found below

NeighborhoodHash([n_jobs, normalize, …])

Neighborhood hashing kernel as proposed in [HK09].

Bibliography

Gow71

John C Gower. A general coefficient of similarity and some of its properties. Biometrics, pages 857–871, 1971.

HK09(1,2)

Shohei Hido and Hisashi Kashima. A Linear-time Graph Kernel. In Proceedings of the 9th International Conference on Data Mining, 179–188. 2009.