PDB Group

The PDBGroup object is the primary data structure for graph mining analysis in PyeMap. It stores the emap objects which contain the graph theory models of the protein structures, the graph database defined by the classifications of nodes and edges, and the results of the mining calculation.

class pyemap.graph_mining.PDBGroup(title)

Contains all information regarding the group of proteins being analyzed, and all of the subgraph patterns identified by the gSpan algorithm.

title

Title of PDB group

Type

str

emaps

Dict of PDBs being analyzed by PyeMap. The keys are PDB IDs, meaning that only one emap object per PDB ID is allowed.

Type

dict of str: emap

subgraph_patterns

Dict of subgraph patterns found by gSpan. Keys are the unique IDs of the SubgraphPattern objects.

Type

dict of str: SubgraphPattern

fasta

List of unaligned sequences in FASTA format

Type

str

aligned_fasta

List of sequences in FASTA format after multiple sequence alignment

Type

str

__init__(title)

Initializes PDBGroup object

Parameters

title (str) – Title of PDB group

add_emap(emap_obj)

Adds an emap object to the PDB group.

Parameters

emap_obj (emap object) – Parsed PDB generated by parse() or fetch_and_parse()

Examples

>>> my_pg.add_emap(pyemap.fetch_and_parse('1u3d'))
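
Because emaps is keyed by PDB ID, a second structure with the same ID cannot create a second entry. The sketch below illustrates that constraint with a plain dict. The Structure class and add_structure helper are hypothetical stand-ins, and skipping duplicates (rather than overwriting or raising) is an assumption, not confirmed PyeMap behavior.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    """Hypothetical stand-in for an emap object; only the PDB ID matters here."""
    pdb_id: str


def add_structure(registry, structure):
    """Add a structure keyed by PDB ID; return False if that ID is already present."""
    if structure.pdb_id in registry:
        return False  # assumption: duplicates are skipped, not overwritten
    registry[structure.pdb_id] = structure
    return True


registry = {}
first_added = add_structure(registry, Structure("1u3d"))
second_added = add_structure(registry, Structure("1u3d"))  # same PDB ID again
```

After both calls the registry still holds a single entry for 1u3d.
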

find_subgraph(graph_specification)

Mine for a specified subgraph pattern.

Linear chains can be specified simply by listing the 1-letter amino acid codes or special characters, e.g. WWW. To specify branching, use a syntax similar to the SMILES format, except that bonding is not specified and each character must be upper case and enclosed in brackets, e.g. [H]1[C][#][C]1. If edge thresholds are used, all possible combinations of edges are searched.

Parameters

graph_specification (str) – String of graph to search for

Notes

Special characters:
  • * - wildcard character

  • # - non-protein residue

  • X - unknown residue

Examples

>>> my_pg.find_subgraph('WWW#')
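
To make the two specification styles concrete, the following sketch converts a specification string into node labels and edges. It is not PyeMap's parser: parse_spec is a hypothetical helper, it assumes that a digit after a bracketed node opens or closes a ring as in SMILES, and it does not handle any other branching syntax.

```python
import re


def parse_spec(spec):
    """Turn a graph specification into (node_labels, edges) with 0-based node indices."""
    if "[" not in spec:
        # Linear chain: one node per character, consecutive nodes connected.
        labels = list(spec)
        return labels, [(i, i + 1) for i in range(len(labels) - 1)]

    # Bracketed form: single-character nodes such as [H], plus ring-closure digits.
    tokens = re.findall(r"\[[^\]]\]|\d", spec)
    if "".join(tokens) != spec:
        raise ValueError("unsupported characters in specification: " + spec)

    labels, edges = [], []
    open_rings = {}  # ring-closure digit -> index of the node that opened it
    for token in tokens:
        if token.isdigit():
            if not labels:
                raise ValueError("ring digit before the first node")
            current = len(labels) - 1
            if token in open_rings:
                # Second occurrence of the digit: connect back to the opening node.
                edges.append((open_rings.pop(token), current))
            else:
                open_rings[token] = current
        else:
            labels.append(token[1:-1])
            if len(labels) > 1:
                # Each new node is linked to the node written just before it.
                edges.append((len(labels) - 2, len(labels) - 1))
    return labels, edges


chain = parse_spec("WWW#")
ring = parse_spec("[H]1[C][#][C]1")
```

Under these assumptions, WWW# yields a four-node path, while [H]1[C][#][C]1 yields the same path plus a closing edge between H and the last C, i.e. a four-membered cycle.
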

generate_graph_database(sub=[], edge_thresh=[])

Generates graph database for mining.

Parameters
  • sub (list of str, optional) – List of 1-character amino acid codes to be labeled as “X”. All other included standard amino acids receive their own category.

  • edge_thresh (list of float, optional) – List of edge thresholds. Edges with weight below the first value are given the label 2, edges with weight between the first and second values are given the label 3, and so on.

Examples

>>> my_pg.generate_graph_database(['W','Y'],[12,15])
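
The labeling rules described for sub and the edge thresholds can be sketched as two small functions. Both are hypothetical helpers, not PyeMap code. Two details are assumptions because the description above does not settle them: a weight exactly equal to a threshold falls into the higher bin, and a weight above the last threshold (or any weight when no thresholds are given) receives no label.

```python
from bisect import bisect_right


def edge_label(weight, edge_thresh):
    """Map an edge weight to an integer label: 2 below the first threshold, 3 for the next bin, ..."""
    bin_index = bisect_right(edge_thresh, weight)
    if bin_index == len(edge_thresh):
        return None  # assumption: no label beyond the last threshold
    return 2 + bin_index


def node_label(residue_code, sub):
    """Residues listed in sub share the category "X"; all others keep their own code."""
    return "X" if residue_code in sub else residue_code


edge_labels = [edge_label(w, [12, 15]) for w in (10, 13, 16)]
node_labels = [node_label(r, ["W", "Y"]) for r in ("W", "Y", "H")]
```

With thresholds [12, 15], weights 10 and 13 map to labels 2 and 3; with sub set to ['W', 'Y'], both W and Y collapse into X while H keeps its own label.
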

mining_report(dest=None)

Generates general report of all subgraph patterns found in the analysis.

Parameters

dest (str, optional) – Path of the file to write the report to

Returns

report – General report of all subgraph patterns found in the analysis.

Return type

str

process_emaps(chains={}, eta_moieties={}, include_residues=['Y', 'W'], **kwargs)

Processes emap objects in order to generate protein graphs.

Should be executed once all of the emap objects have been added to the group.

Parameters
  • chains (dict of str: list of str, optional) – Chains to include for each PDB. The special keyword ‘All’ is also accepted.

  • eta_moieties (dict of str: list of str, optional) – Dict containing the list of ETA moieties (specified by their residue labels) to include for each PDB. By default, all ETA moieties on the included chains are included.

  • include_residues (list of str, optional) – List of 1-letter standard AA codes to include in the graph

  • **kwargs – For a list of accepted kwargs, see the documentation for process().

Examples

>>> eta_moieties = {'1u3d': ['FAD510(A)-2'], '1u3c': ['FAD510(A)-2'], '6PU0': ['FAD501(A)-2'], '4I6G': ['FAD900(A)-2'], '2J4D': ['FAD1498(A)-2']}
>>> chains = {'1u3d': ['A'], '1u3c': ['A'], '6PU0': ['A'], '4I6G': ['A'], '2J4D': ['A']}
>>> my_pg.process_emaps(chains=chains,eta_moieties=eta_moieties)
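
Both chains and eta_moieties are keyed by PDB ID, so a misspelled key would not line up with the entries in emaps. A small check before calling process_emaps can catch this. In the sketch below, unknown_keys is a hypothetical helper, and treating PDB IDs as case-sensitive is an assumption.

```python
def unknown_keys(group_ids, *per_pdb_dicts):
    """Return keys used in the per-PDB dicts that are not PDB IDs in the group."""
    known = set(group_ids)
    return sorted({key for d in per_pdb_dicts for key in d if key not in known})


group_ids = ["1u3d", "1u3c", "6PU0"]  # e.g. the keys of the emaps dict
chains = {"1u3d": ["A"], "1u3c": ["A"], "6pu0": ["A"]}  # note the lower-case typo
eta_moieties = {"1u3d": ["FAD510(A)-2"], "1u3c": ["FAD510(A)-2"]}

problems = unknown_keys(group_ids, chains, eta_moieties)
```

Here problems contains only the mistyped key 6pu0, which can be corrected before the group is processed.
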

run_gspan(min_support, min_num_vertices=4, max_num_vertices=inf, **kwargs)

Mines for common subgraphs using gSpan algorithm. Results are stored as SubgraphPattern objects in the subgraph_patterns dictionary.

References

Yan, Xifeng, and Jiawei Han. “gSpan: Graph-based substructure pattern mining.” 2002 IEEE International Conference on Data Mining, 2002. Proceedings. IEEE, 2002.

Parameters
  • min_support (int) – Minimum support, i.e. the minimum number of protein graphs in which a subgraph pattern must occur to be included in the results

  • min_num_vertices (int, optional) – Minimum number of nodes for subgraphs in the search space

  • max_num_vertices (int, optional) – Maximum number of nodes for subgraphs in the search space

  • **kwargs – See https://github.com/betterenvi/gSpan for a list of accepted kwargs.

Examples

>>> my_pg.run_gspan(10)
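
The three search-space parameters act as filters on candidate patterns. The sketch below is not the gSpan algorithm itself; it only illustrates the acceptance criteria, using a hypothetical filter_patterns helper and made-up pattern IDs. Support is counted as the number of distinct graphs (here, PDB IDs) that contain the pattern.

```python
import math


def filter_patterns(patterns, min_support, min_num_vertices=4, max_num_vertices=math.inf):
    """Keep patterns that occur in enough graphs and have an allowed number of vertices.

    patterns maps a pattern ID to (number of vertices, PDB IDs containing the pattern).
    Returns a dict of pattern ID -> support.
    """
    kept = {}
    for pattern_id, (num_vertices, pdb_ids) in patterns.items():
        support = len(set(pdb_ids))  # one count per graph, however often it matches
        if support >= min_support and min_num_vertices <= num_vertices <= max_num_vertices:
            kept[pattern_id] = support
    return kept


# Hypothetical candidates: (vertex count, graphs in which the pattern was found)
candidates = {
    "WWW#": (4, ["1u3d", "1u3c", "6PU0"]),
    "WW": (2, ["1u3d", "1u3c", "6PU0"]),
    "WWYW": (4, ["1u3d"]),
}
frequent = filter_patterns(candidates, min_support=2)
```

Only WWW# survives: WW has too few vertices for the default min_num_vertices of 4, and WWYW occurs in a single graph.
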

save_fasta(dest='')

Saves the FASTA output of the multiple sequence alignment to a file.

Parameters

dest (str, optional) – Path of the file to write the aligned FASTA to
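
For readers unfamiliar with the format, the fasta and aligned_fasta attributes each hold several sequence records in one string. The sketch below shows how such a string is typically laid out. The to_fasta helper is hypothetical, the sequences are made up, and using the PDB ID as the record header is an assumption about how PyeMap names its records.

```python
def to_fasta(sequences, width=60):
    """Format a dict of name -> sequence as FASTA text with wrapped sequence lines."""
    records = []
    for name, seq in sequences.items():
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        records.append(">" + name + "\n" + "\n".join(lines))
    return "\n".join(records) + "\n"


# Made-up aligned sequences; "-" marks an alignment gap.
fasta_text = to_fasta({"1u3d": "MKW-Y", "1u3c": "MKWWY"})
```

Each record starts with a header line beginning with ">", followed by the sequence itself.
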