Poster 9.

Pellequer J.-L., Chen S.-w. W., Imbert G., Pible O. and Vergely I., «Selecting non-redundant protein structures from the Protein Data Bank», 2004, 3Dsig: Structural Bioinformatics Special Interest Group, Glasgow, Scotland

In the post-genomic era, screening large sets of macromolecular structures has become a necessary procedure. The Protein Data Bank, a repository maintained by the Research Collaboratory for Structural Bioinformatics (RCSB), provides access to protein and nucleic acid structures. Two requirements must be met before screening the «functional» properties of macromolecules.
First, a representative set of protein structures has to be constructed from the 52000 deposited chain entries. Because proteins can undergo drastic conformational changes, a functional screening must cover the largest possible range of structural variation for each protein. At the same time, we cluster similar structures to reduce structural redundancy. Accordingly, we developed a protocol that extracts representative sets of protein structures. The selection is rooted in the accession number of each protein. Using additional data from our CUPS database (see figure), such as resolution, R-factor, completeness and number of mutations, we identify a leading structure to which each of the remaining structures is compared. Structures with a Cα RMSD below 0.5 Å relative to the leading structure form the first cluster. The procedure is then repeated for the remaining structures of the same protein.
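The leader-based clustering described above can be illustrated with a minimal sketch. It is not the authors' program: the `Structure` class, the function names and the toy coordinates are hypothetical, the leader is ranked by resolution only (the abstract also mentions R-factor, completeness and number of mutations), and the Cα coordinates are assumed to be already superposed, so no optimal fitting is performed before the RMSD is computed.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Structure:
    entry_id: str      # hypothetical identifier for one chain entry
    resolution: float  # in Å; lower is treated as better
    ca_coords: list    # (x, y, z) tuples of Cα atoms, assumed pre-superposed


def ca_rmsd(a, b):
    """Cα RMSD between two equal-length, already superposed coordinate lists."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return sqrt(total / len(a))


def cluster_by_leader(structures, threshold=0.5):
    """Group structures of one protein around successive leading structures."""
    # Assumption: the best-resolution structure is the leader of each round.
    remaining = sorted(structures, key=lambda s: s.resolution)
    clusters = []
    while remaining:
        leader, rest = remaining[0], remaining[1:]
        members, unassigned = [leader], []
        for s in rest:
            if ca_rmsd(leader.ca_coords, s.ca_coords) < threshold:
                members.append(s)
            else:
                unassigned.append(s)
        clusters.append(members)
        remaining = unassigned  # repeat on structures not yet clustered
    return clusters


# Toy data: B is close to A, C is displaced by 2 Å.
structures = [
    Structure("A", 1.5, [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]),
    Structure("B", 2.0, [(0.1, 0.0, 0.0), (3.9, 0.0, 0.0)]),
    Structure("C", 1.8, [(0.0, 2.0, 0.0), (3.8, 2.0, 0.0)]),
]
clusters = cluster_by_leader(structures)
cluster_ids = [[s.entry_id for s in c] for c in clusters]
```

In this sketch the first member of each cluster is its leading structure, so taking `c[0]` from every cluster gives one possible representative set for the protein.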
Second, a selected structure needs a complete set of atomic coordinates. This requirement concerns the integrity of the structural coordinates and the completeness of the data in the deposited entries. Deposited structures sometimes show the following problems: only Cα atoms are given for amino-acid residues, non-natural amino acids are not flagged, or residues lack a single atom. To prepare structures for screening with a molecular force field, we developed a protocol that attempts to fix these problematic cases. We have also built coordinates for missing side-chain atoms in disordered regions. In the current version, our program has repaired up to 94 % of the detected coordinate errors. We chose to work with macromolecular Crystallographic Information Files (mmCIF). To evaluate inconsistent items, we parsed all mmCIF files in the PDB and found an error rate of 19 %. Weekly updates of modified mmCIF entries will be incorporated into CUPS, a MySQL relational database of cured protein structures.
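The three problem cases listed above (Cα-only residues, unflagged non-natural amino acids, residues with missing atoms) could be detected along the following lines. This is only a sketch under stated assumptions: the function name and status labels are hypothetical, the atom table covers an illustrative subset of residues with standard PDB atom names, hydrogens are ignored, and the repair step itself is not shown.

```python
# Heavy atoms expected in every amino-acid backbone.
BACKBONE = {"N", "CA", "C", "O"}

# Expected side-chain heavy atoms; illustrative subset, not a full table.
SIDE_CHAIN = {
    "GLY": set(),
    "ALA": {"CB"},
    "SER": {"CB", "OG"},
    "CYS": {"CB", "SG"},
    "VAL": {"CB", "CG1", "CG2"},
}


def classify_residue(res_name, atom_names):
    """Return (status, sorted list of missing atom names) for one residue."""
    atoms = set(atom_names)
    if res_name not in SIDE_CHAIN:
        # Assumption: anything outside the table is treated as non-standard.
        return "non_standard", []
    missing = sorted((BACKBONE | SIDE_CHAIN[res_name]) - atoms)
    if atoms == {"CA"}:
        return "ca_only", missing
    if missing:
        return "missing_atoms", missing
    return "complete", missing


# Hypothetical residues as they might be read from an entry.
report = [
    classify_residue("ALA", ["CA"]),
    classify_residue("SER", ["N", "CA", "C", "O", "CB"]),
    classify_residue("MSE", ["N", "CA", "C", "O", "CB", "CG", "SE", "CE"]),
    classify_residue("GLY", ["N", "CA", "C", "O"]),
]
```

Returning the missing atom names together with the status keeps detection separate from repair, so a later rebuilding step can work from the same report.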

