PDB Group

The PDBGroup object is the primary data structure for graph mining analysis in PyeMap. It stores the emap objects which contain the graph theory models of the protein structures, the graph database defined by the classifications of nodes and edges, and the results of the mining calculation.

class pyemap.graph_mining.PDBGroup(title)

Contains all information regarding the group of proteins being analyzed, and all of the subgraph patterns identified by the gSpan algorithm.

title

Title of PDB group

Type: str

emaps

Dict of PDBs being analyzed by PyeMap. The keys are PDB IDs, meaning that only one emap object per PDB ID is allowed.

Type: dict of str: emap

subgraph_patterns

Dict of subgraph patterns found by gSpan. Keys are the unique IDs of the SubgraphPattern objects.

Type: dict of str: SubgraphPattern

fasta

List of unaligned sequences in FASTA format

Type: str

aligned_fasta

List of sequences in FASTA format after multiple sequence alignment

Type: str

__init__(title)

Initializes PDBGroup object

Parameters: title (str) – Title of PDB group

add_emap(emap_obj)

Adds an emap object to the PDB group.

Parameters: emap_obj (emap object) – Parsed PDB generated by parse() or fetch_and_parse()

Examples

>>> my_pg.add_emap(pyemap.fetch_and_parse('1u3d'))

find_subgraph(graph_specification)

Mine for a specified subgraph pattern.

Linear chains can be specified simply by listing the 1-letter amino acid codes and special characters, e.g. WWW. To specify branching, use a syntax similar to the SMILES format, except that bonding is not specified and each character must be upper case and enclosed in brackets, e.g. [H]1[C][#][C]1. If edge thresholds are used, all possible combinations of edges will be searched for.

Parameters: graph_specification (str) – String of graph to search for

Notes

Special characters:

* - wildcard character
# - non-protein residue
X - unknown residue

Examples

>>> my_pg.find_subgraph('WWW#')
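The two specification styles described above can be illustrated with a small parser. This is only a sketch of how such strings might be read, not PyeMap's own implementation: the function name parse_spec is hypothetical, and the sketch assumes that a digit after a bracketed label opens or closes a ring in the SMILES sense. Branching with parentheses is not handled.

```python
import re

# Hypothetical helper, not part of PyeMap: turns a graph specification
# into a list of node labels and a list of (index, index) edges.
_LABEL = r"[A-Z*#]"  # upper-case residue codes plus the special characters
_TOKEN = re.compile(r"\[(" + _LABEL + r")\]|(\d)")


def parse_spec(spec):
    """Return (labels, edges) for a linear or bracketed specification."""
    if "[" not in spec:
        # Linear chain such as 'WWW#': consecutive characters are connected.
        if not re.fullmatch(_LABEL + "+", spec):
            raise ValueError("invalid linear specification: %r" % spec)
        labels = list(spec)
        edges = [(i, i + 1) for i in range(len(labels) - 1)]
        return labels, edges

    labels, edges, open_rings = [], [], {}
    pos = 0
    while pos < len(spec):
        match = _TOKEN.match(spec, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        if match.group(1) is not None:
            # Bracketed node: connect it to the previous node in the string.
            labels.append(match.group(1))
            if len(labels) > 1:
                edges.append((len(labels) - 2, len(labels) - 1))
        else:
            # Ring digit (assumed SMILES-like): first use opens, second closes.
            if not labels:
                raise ValueError("ring digit before any node")
            ring, current = match.group(2), len(labels) - 1
            if ring in open_rings:
                edges.append((open_rings.pop(ring), current))
            else:
                open_rings[ring] = current
        pos = match.end()
    if open_rings:
        raise ValueError("unclosed ring label(s): %s" % sorted(open_rings))
    return labels, edges
```

Under these assumptions, 'WWW#' is read as a four-node chain and '[H]1[C][#][C]1' as a four-membered cycle in which the last C is connected back to H.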

generate_graph_database(sub=[], edge_thresh=[])

Generates graph database for mining.

Parameters

sub (list of str, optional) – List of 1-character amino acid codes to be labeled as “X”. All other included standard amino acids receive their own category.
edge_thresh (list of float, optional) – List of edge thresholds. Edges with weight below the first value are given the label 2, edges with weight between the first and second values are labeled 3, and so on.
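The labeling rules for the two parameters can be sketched as follows. The helper names node_label and edge_label are hypothetical and this is not PyeMap's code. The text does not say how a weight exactly equal to a threshold, or the case of an empty threshold list, is treated. This sketch assumes that a weight equal to a threshold falls into the higher category and that all edges receive label 2 when no thresholds are given.

```python
from bisect import bisect_right


def node_label(code, sub=()):
    """Residues listed in `sub` share the label 'X'; others keep their code."""
    return "X" if code in sub else code


def edge_label(weight, edge_thresh=()):
    """Label 2 below the first threshold, 3 below the second, and so on."""
    # bisect_right counts the thresholds that are <= weight, so a weight
    # equal to a threshold lands in the higher category (an assumption).
    return 2 + bisect_right(sorted(edge_thresh), weight)
```

With sub=['W', 'Y'] and edge_thresh=[12, 15], as in the example call for this method, tryptophan and tyrosine nodes would both be labeled X, while edges of weight 10, 13 and 20 would receive the labels 2, 3 and 4 respectively.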

Examples

>>> my_pg.generate_graph_database(['W','Y'],[12,15])

mining_report(dest=None)

Generates general report of all subgraph patterns found in the analysis.

Parameters: dest (str, optional) – Path of the file to write the report to
Returns: report – General report of all subgraph patterns found in the analysis.
Return type: str

process_emaps(chains={}, eta_moieties={}, include_residues=['Y', 'W'], **kwargs)

Processes emap objects in order to generate protein graphs.

Should be executed once all of the emap objects have been added to the group.

Parameters

chains (dict of str: list of str, optional) – Chains to include for each PDB. The special keyword ‘All’ is also accepted.
eta_moieties (dict of str: list of str, optional) – Dict containing the list of ETA moieties (specified by their residue label) to include for each PDB. By default, all ETA moieties on the included chains are included.
include_residues (list of str, optional) – List of 1-letter standard AA codes to include in the graph
**kwargs – For a list of accepted kwargs, see the documentation for process().

Examples

>>> eta_moieties = {'1u3d': ['FAD510(A)-2'], '1u3c': ['FAD510(A)-2'], '6PU0': ['FAD501(A)-2'], '4I6G': ['FAD900(A)-2'], '2J4D': ['FAD1498(A)-2']}
>>> chains = {'1u3d': ['A'], '1u3c': ['A'], '6PU0': ['A'], '4I6G': ['A'], '2J4D': ['A']}
>>> my_pg.process_emaps(chains=chains,eta_moieties=eta_moieties)

run_gspan(min_support, min_num_vertices=4, max_num_vertices=inf, **kwargs)

Mines for common subgraphs using gSpan algorithm. Results are stored as SubgraphPattern objects in the subgraph_patterns dictionary.

References

Yan, Xifeng, and Jiawei Han. “gSpan: Graph-based substructure pattern mining.” 2002 IEEE International Conference on Data Mining, 2002. Proceedings. IEEE, 2002.

Parameters

min_support (int) – Minimum support, i.e. the minimum number of graphs in the database in which a subgraph pattern must occur to be included in the results
min_num_vertices (int, optional) – Minimum number of nodes for subgraphs in the search space
max_num_vertices (int, optional) – Maximum number of nodes for subgraphs in the search space
**kwargs – See https://github.com/betterenvi/gSpan for a list of accepted kwargs.
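In gSpan, the support of a pattern is the number of graphs in the database that contain it, and min_support sets the cutoff. The toy sketch below shows only that filtering idea. It does not perform any mining, the function filter_by_support is hypothetical, and the pattern names and PDB labels are made-up placeholders.

```python
def filter_by_support(occurrences, min_support):
    """Keep patterns that occur in at least `min_support` distinct graphs.

    `occurrences` maps a pattern name to the set of graph (PDB) identifiers
    in which the pattern was found.
    """
    return {
        pattern: graph_ids
        for pattern, graph_ids in occurrences.items()
        if len(graph_ids) >= min_support
    }


# Placeholder data for illustration only.
toy_occurrences = {
    "WWW#": {"pdbA", "pdbB", "pdbC"},
    "WYW": {"pdbA"},
}
frequent = filter_by_support(toy_occurrences, min_support=2)
```

Here only the pattern found in three of the placeholder graphs survives a min_support of 2.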

Examples

>>> my_pg.run_gspan(10)

save_fasta(dest='')

Saves the aligned FASTA from the multiple sequence alignment to a file

Parameters: dest (str, optional) – Path of the file to write the aligned FASTA to
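Taken together, the methods documented here suggest the following order of calls for a complete analysis. This is a minimal sketch that uses only the calls shown on this page with their default arguments. The wrapper function mine_pdb_group and its parameter names are hypothetical, and the sketch assumes that generate_graph_database() must be called before run_gspan(). Running it requires PyeMap and access to the PDB entries.

```python
def mine_pdb_group(pdb_ids, min_support, title="my group", report_path=None):
    """Fetch the given PDB IDs, mine common subgraphs, and return the report."""
    # Imported lazily so the sketch can be loaded without PyeMap installed.
    import pyemap
    from pyemap.graph_mining import PDBGroup

    group = PDBGroup(title)
    # 1. Add one emap object per PDB ID.
    for pdb_id in pdb_ids:
        group.add_emap(pyemap.fetch_and_parse(pdb_id))
    # 2. Build the protein graphs once all emap objects have been added.
    group.process_emaps()
    # 3. Classify nodes and edges to build the graph database.
    group.generate_graph_database()
    # 4. Mine for common subgraph patterns.
    group.run_gspan(min_support)
    # 5. Summarize the results, optionally writing them to report_path.
    return group.mining_report(dest=report_path)
```

A call such as mine_pdb_group(['1u3d', '1u3c'], 2) would then return the general report as a string. The arguments of process_emaps() and generate_graph_database() can be set as described in their respective entries.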