Mining

Introduction

The goal of subgraph mining is to identify graph structures which occur a significant number of times across a set of graphs. In the context of PyeMap, this means searching for shared pathways/motifs amongst a set of protein crystal structures. In PyeMap, users can either search for all patterns which appear in a specified number of PDBs (General Pattern Mining), or search for a specified group of patterns which match a string representation of the graph of interest (Specific Pattern Mining).

General Pattern Mining

In the graph mining literature, the frequency that a pattern appears in the set of input graphs is referred to as the support for that pattern. In other words, if a pattern appears in 12/14 graphs, one would say it supports 12 graphs, or equivalently, has a support of 12 (regardless of whether it appears multiple times within a given graph).

In PyeMap, we use a Python implementation of the gSpan algorithm [Han2002], one of the most efficient and popular approaches for graph mining. The technical details are beyond the scope of this documentation (and can be found in the gSpan paper), but here we emphasize a few of its key features.

  • gSpan is a recursive algorithm which relies on the use of minimum depth first search (DFS) codes

  • gSpan is a complete algorithm, so it will find all patterns which meet a specified support threshold

  • The perfomance of gSpan is greatly sped up by increasing the minimum support threshold, as this allows more aggressive pruning of candidate subgraphs

In addition to the support number, one can also specify a minimum or maximum number of vertices for the identified subgraph patterns. For a full list of accepted kwargs, please refer to the gspan_mining package’s repository.

To perform general pattern mining with PyeMap, use the run_gspan() function.

Example

Search for all subgraph patterns comprised of 4-6 nodes with a support number of 10 or greater using PyeMap.

pg.run_gspan(10,min_num_vertices=4,max_num_vertices=6)

Specific Pattern Mining

Instead of searching for all subgraphs up to a given support threshold, one instead may be interested in finding protein subgraphs which match a previously known pattern. In this case, the problem is reduced to one of graph matching, and we simply search each PDB for a monomorphism using the NetworkX implementation of the VF2 algorithm (see the NetworkX documentation for more details).

In PyeMap, this is done for protein graphs using the find_subgraph() function. find_subgraph() accepts a string representation of a subgraph, where each character is one of the following:

  • 1-character amino acid code for a standard residue

  • X’ for unknown amino acid types

  • # for non-protein residue

  • * as a wildcard character

Branching can be specified using a syntax similar to the SMILES format, where there is no specification of bonding and each amino acid or special character described above must be separated by brackets (see example below). See the write_graph_smiles() function, and the pysmiles repository for more details.

If edge thresholds are used (see the classification section), the search will be performed for all possible combinations of edges, and thus several subgraph patterns will be found for a set of residue types. If the * wildcard character is used, subgraph pattern(s) will be found for each combination of each residue type replacing the * placeholder character(s), including the special X and # residue types.

Examples

../../_images/linear.png

Fig. 1 Examples of subgraph patterns identified using string ‘WWW*’.

../../_images/smiles_ex.png

Fig. 2 Subgraph pattern (left) and protein subgraph (right) identified using the string ‘[H]1[C][#][C]1’.

Subgraph Patterns

The end result of either mining option is a set of subgraph patterns, each of which has a unique ID with the format:

{Index}_{String representation}_{support number}

e.g. 1_WWW_18.

The string representation for each pattern is a pseudo-SMILES string generated using the pysmiles package. Importantly, these strings can be used as inputs for find_subgraph(), as they correctly encode the structure of the graph using a syntax similar to the SMILES format.

Source

pyemap.graph_mining.write_graph_smiles(...)

Returns pseudo-SMILES string for supplied graph.