PDB Group
The PDBGroup object is the primary data structure for
graph mining analysis in PyeMap. It stores the emap objects which contain
the graph theory models of the protein structures, the graph database defined by the classifications
of nodes and edges, and the results of the mining calculation.
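The methods documented on this page are normally called in a fixed order: add emap objects, process them, generate the graph database, run gSpan, then read the report. The sketch below captures only that call order as a plain function. `mine_group` and its argument names are hypothetical and not part of PyeMap, and calling generate_graph_database() before run_gspan() is inferred from the method descriptions rather than stated outright.

```python
def mine_group(group, emap_objects, min_support, sub=(), edge_thresh=(), **process_kwargs):
    """Drive a PDBGroup-like object through the documented steps (hypothetical helper)."""
    # 1. Register every parsed structure with the group.
    for emap_obj in emap_objects:
        group.add_emap(emap_obj)
    # 2. Build the protein graphs once all emap objects have been added.
    group.process_emaps(**process_kwargs)
    # 3. Classify nodes and edges to form the graph database.
    group.generate_graph_database(list(sub), list(edge_thresh))
    # 4. Mine for common subgraphs.
    group.run_gspan(min_support)
    # 5. Return the general report as a string.
    return group.mining_report()
```

With PyeMap installed, `group` would be a pyemap.graph_mining.PDBGroup('my group') instance and `emap_objects` the results of pyemap.fetch_and_parse() calls.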
- class pyemap.graph_mining.PDBGroup(title)[source]
Contains all information regarding the group of proteins being analyzed, and all of the subgraph patterns identified by the gSpan algorithm.
- title
Title of PDB group
- Type:
str
- emaps
Dict of PDBs being analyzed by PyeMap. The keys are PDB IDs, meaning that only one
emap object per PDB ID is allowed.
- Type:
dict of str: emap
- subgraph_patterns
Dict of subgraph patterns found by gSpan. Keys are the unique IDs of the
SubgraphPattern objects.
- Type:
dict of str: SubgraphPattern
- fasta
List of unaligned sequences in FASTA format
- Type:
str
- aligned_fasta
List of sequences in FASTA format after multiple sequence alignment
- Type:
str
- add_emap(emap_obj)[source]
Adds an emap object to the PDB group.
- Parameters:
emap_obj (emap object) – Parsed PDB generated by parse() or fetch_and_parse()
Examples
>>> my_pg.add_emap(pyemap.fetch_and_parse('1u3d'))
- find_subgraph(graph_specification)[source]
Mine for a specified subgraph pattern.
Linear chains can be specified simply by listing the 1-letter amino acid codes or special characters, e.g. WWW. To specify branching, use a syntax similar to the SMILES format, where there is no specification of bonding and each character must be upper case and enclosed in brackets, e.g. [H]1[C][#][C]1. If edge thresholds are used, all possible combinations of edges will be searched for.
- Parameters:
graph_specification (str) – String of graph to search for
Notes
Special characters:
* - wildcard character
# - non-protein residue
X - unknown residue
Examples
>>> my_pg.find_subgraph('WWW#')
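To make the two specification styles concrete, here is a minimal sketch of how such a string could be read into node labels and edges. It is not PyeMap's parser: `parse_spec` is a hypothetical name, consecutive characters are assumed to be connected, a shared digit is assumed to close a ring as in SMILES, and parenthesised branches are not handled.

```python
import re

def parse_spec(spec):
    """Return (labels, edges) for a simplified graph specification string."""
    if "[" in spec:
        # Bracketed form: each label in brackets, optionally followed by ring digits.
        tokens = re.findall(r"\[([A-Z#*])\](\d*)", spec)
    else:
        # Linear form: one character per node, no ring digits.
        tokens = [(ch, "") for ch in spec]
    labels, edges, open_rings = [], set(), {}
    for idx, (label, digits) in enumerate(tokens):
        labels.append(label)
        if idx > 0:
            edges.add((idx - 1, idx))  # assumed: neighbours in the string are connected
        for digit in digits:
            if digit in open_rings:
                edges.add((open_rings.pop(digit), idx))  # ring closure
            else:
                open_rings[digit] = idx
    return labels, sorted(edges)

linear = parse_spec("WWW#")
ring = parse_spec("[H]1[C][#][C]1")
```

Under these assumptions, the linear string yields a four-node chain, while the bracketed string yields the same chain plus one extra edge joining its first and last nodes.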
- generate_graph_database(sub=[], edge_thresh=[])[source]
Generates graph database for mining.
- Parameters:
sub (list of str, optional) – List of 1-character amino acid codes to be labeled as “X”. All other included standard amino acids receive their own category.
edge_thresh (list of float, optional) – List of edge thresholds. Edges with weight below the first value are given the label 2, edges with weight between the first and second values are given the label 3, and so on.
Examples
>>> my_pg.generate_graph_database(['W','Y'],[12,15])
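The labeling rules described for the two parameters can be sketched in a few lines. This is an illustration of the stated rules, not PyeMap's code: `node_label` and `edge_label` are hypothetical names, and the treatment of a weight exactly equal to a threshold (placed in the higher bin here) is an assumption, since the description does not cover that case.

```python
from bisect import bisect_right

def node_label(residue, sub):
    """Residues listed in `sub` collapse into the category 'X'; others keep their code."""
    return "X" if residue in sub else residue

def edge_label(weight, edge_thresh):
    """Label 2 below the first threshold, 3 between the first and second, and so on."""
    # Assumption: a weight equal to a threshold falls into the higher bin.
    return 2 + bisect_right(sorted(edge_thresh), weight)

# Hypothetical edge weights, using the thresholds from the example above.
labels = [edge_label(w, [12, 15]) for w in (9.5, 13.0, 20.0)]
```

Here `labels` is `[2, 3, 4]`: one weight below 12, one between 12 and 15, and one above 15.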
- mining_report(dest=None)[source]
Generates general report of all subgraph patterns found in the analysis.
- Parameters:
dest (str, optional) – Path of the file to write the report to
- Returns:
report – General report of all subgraph patterns found in the analysis.
- Return type:
str
- process_emaps(chains={}, eta_moieties={}, include_residues=['Y', 'W'], **kwargs)[source]
Processes emap objects in order to generate protein graphs. Should be executed once all of the emap objects have been added to the group.
- Parameters:
chains (dict of str: list of str, optional) – Chains to include for each PDB. The special keyword ‘All’ is also accepted.
eta_moieties (dict of str: list of str, optional) – Dict containing a list of ETA moieties (specified by their residue label) to include for each PDB. By default, all ETA moieties on the included chains are included.
include_residues (list of str, optional) – List of 1-letter standard AA codes to include in the graph
**kwargs – For a list of accepted kwargs, see the documentation for
process().
Examples
>>> eta_moieties = {'1u3d': ['FAD510(A)-2'], '1u3c': ['FAD510(A)-2'], '6PU0': ['FAD501(A)-2'], '4I6G': ['FAD900(A)-2'], '2J4D': ['FAD1498(A)-2']}
>>> chains = {'1u3d': ['A'], '1u3c': ['A'], '6PU0': ['A'], '4I6G': ['A'], '2J4D': ['A']}
>>> my_pg.process_emaps(chains=chains,eta_moieties=eta_moieties)
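When every structure uses the same chain and a single ETA moiety, both dictionaries can be derived from one mapping so that their keys cannot drift apart. The snippet below is a small sketch using the residue labels from the example above; `fad_labels` is a hypothetical helper name.

```python
# One ETA moiety label per PDB ID (values copied from the example above).
fad_labels = {
    "1u3d": "FAD510(A)-2",
    "1u3c": "FAD510(A)-2",
    "6PU0": "FAD501(A)-2",
    "4I6G": "FAD900(A)-2",
    "2J4D": "FAD1498(A)-2",
}

# Both dicts share the same keys, so each PDB gets a chain list and a moiety list.
chains = {pdb_id: ["A"] for pdb_id in fad_labels}
eta_moieties = {pdb_id: [label] for pdb_id, label in fad_labels.items()}
```

The resulting `chains` and `eta_moieties` can then be passed to process_emaps() as keyword arguments.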
- run_gspan(min_support, min_num_vertices=4, max_num_vertices=inf, **kwargs)[source]
Mines for common subgraphs using the gSpan algorithm. Results are stored as SubgraphPattern objects in the subgraph_patterns dictionary.
References
Yan, Xifeng, and Jiawei Han. “gSpan: Graph-based substructure pattern mining.” 2002 IEEE International Conference on Data Mining, 2002. Proceedings. IEEE, 2002.
- Parameters:
min_support (int) – Minimum support (the number of graphs in the database that must contain a subgraph) for subgraphs in the search space
min_num_vertices (int, optional) – Minimum number of nodes for subgraphs in the search space
max_num_vertices (int, optional) – Maximum number of nodes for subgraphs in the search space
**kwargs – See https://github.com/betterenvi/gSpan for a list of accepted kwargs.
Examples
>>> my_pg.run_gspan(10)
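The three numeric parameters act as filters on candidate patterns. The following sketch illustrates that filtering on made-up data; it does not reproduce the gSpan search itself. `keep_pattern` and the `candidates` table are hypothetical, and treating the vertex bounds as inclusive is an assumption.

```python
import math

def keep_pattern(support, num_vertices, min_support,
                 min_num_vertices=4, max_num_vertices=math.inf):
    """True if a pattern is frequent enough and its size is within bounds."""
    # Assumption: both vertex bounds are inclusive.
    return (support >= min_support
            and min_num_vertices <= num_vertices <= max_num_vertices)

# Hypothetical candidates: name -> (support, number of vertices).
candidates = {"WWW#": (12, 4), "WW": (30, 2), "WYWW#": (7, 5)}
kept = [name for name, (support, size) in candidates.items()
        if keep_pattern(support, size, min_support=10)]
```

With a minimum support of 10 and the default size bounds, only the first candidate survives: the second is too small and the third too rare.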