Poster 9.
Pellequer J.-L., Chen S.-w. W., Imbert G., Pible O. and Vergely I., «Selecting non-redundant protein structures from the Protein Data Bank», 2004, 3Dsig: Structural Bioinformatics Special Interest Group, Glasgow, Scotland
In the post-genomic era, screening large sets of macromolecular structures
has become a necessary procedure. The Protein Data Bank, a repository maintained by the Research Collaboratory
for Structural Bioinformatics (RCSB), provides access to protein and nucleic acid structures. Two requirements
must be met before the «functional» properties of macromolecules can be screened.
First, a representative set of protein structures has to be constructed from the total of 52,000 deposited chain entries.
To account for the drastic conformational changes that proteins can undergo, a functional screening must
cover the largest possible structural variation of each protein. At the same time, we cluster similar
structures to reduce structural redundancy. Accordingly, we developed a protocol that allows us to extract
representative sets of protein structures. The root of our selection is the accession number of proteins. Using
additional data from our CUPS database (see figure), such as the resolution, R-factor, completeness, and number of
mutations, we identify a leading structure to which each of the remaining structures is compared. Structures with a
Cα RMSD lower than 0.5 Å relative to the leading structure form the first cluster. The procedure is repeated for each
remaining structure of the same protein.
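The selection loop described above can be sketched as a greedy leader clustering. This is only an illustrative sketch, not the protocol used for CUPS: the `Structure` fields, the ranking by resolution and then R-factor, and the assumption that the Cα coordinates are already superimposed and of equal length are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Structure:
    name: str          # hypothetical identifier for one chain entry
    resolution: float  # in Å, lower is better
    r_factor: float    # lower is better
    ca_coords: list    # (x, y, z) Cα positions, assumed pre-superimposed


def ca_rmsd(a, b):
    """Cα RMSD between two equally long, already superimposed coordinate lists."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(total / len(a))


def cluster_by_leader(structures, cutoff=0.5):
    """Repeatedly pick the best-ranked remaining structure as leader and
    group every remaining structure whose RMSD to it is below the cutoff."""
    # Assumed ranking: best resolution first, then lowest R-factor.
    remaining = sorted(structures, key=lambda s: (s.resolution, s.r_factor))
    clusters = []
    while remaining:
        leader, rest = remaining[0], remaining[1:]
        members, unassigned = [leader], []
        for s in rest:
            if ca_rmsd(leader.ca_coords, s.ca_coords) < cutoff:
                members.append(s)
            else:
                unassigned.append(s)
        clusters.append(members)
        remaining = unassigned
    return clusters


def shifted(coords, dx):
    """Toy helper: translate every point along x to fake a conformational change."""
    return [(x + dx, y, z) for x, y, z in coords]


base = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
structures = [
    Structure("a", 1.5, 0.18, base),
    Structure("b", 2.0, 0.20, shifted(base, 0.2)),
    Structure("c", 2.5, 0.22, shifted(base, 2.0)),
]
clusters = cluster_by_leader(structures)
representatives = [c[0].name for c in clusters]
```

In this toy input, the first member of each cluster is its leader, so `representatives` lists one structure per distinct conformation of the same protein.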
Second, a selected structure needs a complete set of atomic coordinates. This requirement concerns the integrity of
the structural coordinates and the completeness of the data in the deposited entries. Deposited structures sometimes
show the following problems: only Cα atoms are present for amino-acid residues,
non-natural amino acids are not flagged, or residues lack a single atom. To prepare structures for screening
using a molecular force field, we have developed a protocol that attempts to fix these problematic cases. We have also
built coordinates of missing side-chain atoms in disordered regions. In the current version, our program has repaired up
to 94 % of the detected coordinate errors. We chose to work with the macromolecular Crystallographic Information File (mmCIF) format.
To assess inconsistent items across all mmCIF files, we parsed the whole PDB and found an error rate of 19 %. Weekly
updates of modified mmCIF entries will be incorporated into CUPS, a MySQL relational database of cured protein
structures.
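As an illustration of the coordinate checks mentioned above, the following minimal sketch classifies a single residue from its atom names. It is not the repair program itself: the function name, the status labels, and the deliberately partial side-chain table (three residue types only) are assumptions made for the example, and any residue outside that table is simply reported as unrecognized.

```python
BACKBONE = {"N", "CA", "C", "O"}

# Partial table for illustration only; a real check would cover every
# standard residue type and known non-natural residues.
SIDE_CHAIN = {
    "GLY": set(),
    "ALA": {"CB"},
    "SER": {"CB", "OG"},
}


def check_residue(res_name, atom_names):
    """Return (status, missing_atoms) for one residue.

    Only heavy atoms are considered; extra atoms such as a terminal OXT
    are ignored.
    """
    atoms = set(atom_names)
    if res_name not in SIDE_CHAIN:
        # Could be a non-natural amino acid or just absent from the toy table.
        return "unrecognized residue", set()
    expected = BACKBONE | SIDE_CHAIN[res_name]
    missing = expected - atoms
    if atoms == {"CA"}:
        return "CA only", missing
    if missing:
        return "missing atoms", missing
    return "complete", set()


ser_status = check_residue("SER", ["N", "CA", "C", "O", "CB"])
ala_status = check_residue("ALA", ["CA"])
mse_status = check_residue("MSE", ["N", "CA", "C", "O", "CB"])
gly_status = check_residue("GLY", ["N", "CA", "C", "O"])
```

A residue reported as "missing atoms" or "CA only" would be a candidate for coordinate rebuilding, while an unrecognized residue would need to be looked up before a force field can be applied.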
