Moss and Bleasby BBSRC Grant Application

Development of a Class Library for Bioinformatics and Molecular Modelling

Introduction

During the past five years object technology has gained widespread acceptance in software development, not just because of code reusability, but because it takes advantage of a new generation of object oriented environments. These environments and associated tools are rapidly becoming standardised. A draft ANSI C++ standard is due to be published this year and its new Standard Template Library is likely to influence programming paradigms for years to come. The Common Object Request Broker Architecture (CORBA), which allows the distribution of objects across disparate systems, is becoming an industry standard for distributed objects and ODMG-93 has set a standard for object databases. The latest release of the X-Window system contains an object oriented application framework (FRESCO) which uses CORBA to allow the distribution of graphical objects. Similar developments will also be part of the next generation of Microsoft Windows software, where object technology will provide an object-centred rather than an application-centred user interface.

Many other changes in software development and bioinformatics have taken place. Ten years ago scientific programming was done almost exclusively in Fortran, file formats were relatively simple, graphical user interfaces had not emerged as an important issue and the explosive growth of bioinformation had only just started. Today, C has become the most widely used language for biocomputing, and the Internet and the World Wide Web have transformed the way in which the scientific community can participate in software development and gain access to the results. The Linux operating system (a Unix clone for PCs) is a remarkable testimony to what such collaboration can achieve.

Yet in the bioinformatics community today we still have a situation where there is relatively little code reuse and each generation of research student writes software from scratch to carry out routine tasks such as reading standard file formats or calculating molecular geometry. The purpose of this application is to create tools that can be reused and to promote a climate in which reuse actually takes place.

Pilot Studies

In order to explore the relevance of the above developments to bioinformatics, we conducted a pilot study in the use of object technology to speed up the biocomputing software development cycle. A class library called MOL was constructed. It included molecule classes with a limited repertoire of input/output, molecular geometry and comparison operations, together with vector and matrix classes containing methods for eigenvalue determination and matrix inversion. These classes have now been used in the rapid development of software to analyse beta-sheet topologies and to construct specialised molecular dynamics software to explore MHC protein-antigen interactions. At the same time Peter Murray-Rust (GLAXO-Wellcome and Birkbeck) developed a class library (DEMOCRITOS) which assists in reading and writing files of molecular information in the new CIF format. This has now been extended to handle protein and DNA sequences and has hypertext documentation on the World Wide Web. The above developments demonstrated that the concept of building software for molecular modelling from reusable components is both feasible and cost-effective.

Other laboratories have also been exploring the potential of object oriented design in molecular software. Phil Bourne at Columbia University has developed a class library called PDBlib for the construction of molecules from PDB files and has introduced the concept of internal and external classes to enable the library to be extensible (1,2). Thomas Ferrin and colleagues at the University of California have designed a molecular class library for chemists containing a MolGraph class which provides functionality associated with molecular connectivity. MOLBIO++ is a C++ library from Harvard for reading and writing sequences in a number of library formats, producing statistics and translating nucleotide sequences into proteins. IT companies are also developing C++ class libraries for different application areas. Silicon Graphics has the Open Inventor library for building 3D applications, Image Vision for image processing applications and object oriented tools for Digital Media development, while Hewlett-Packard was responsible for the developments which led to the new Standard Template Library.

Software toolkit for molecular modelling and bioinformatics

The purpose of the proposed project is to build on the above developments and design an object oriented toolkit for molecular modelling and sequence analysis to be made available to the bioinformatics community. The library would be aimed primarily at software developers although there would also be some illustrative user-friendly applications which could be used generally by molecular biologists.

The toolkit would be constructed in ANSI C++ and would enable the rapid construction of software for the analysis of molecular structure and sequence. The emphasis would be on cross-platform portability and the use of software that conforms to open standards rather than proprietary libraries which might be expensive and confined to a limited number of platforms. It would build on the work described above and provide:

a) Molecular classes

There would be three types of molecular class. The first type would be for the representation of molecular objects in three dimensions. These classes would provide atoms, molecules (including proteins and nucleic acids), secondary structural elements and user-specified groups of atoms. They would contain functions for constructing such objects from co-ordinate files in databank formats, for rotating and translating atomic co-ordinates and for constructing symmetry-related molecules. There would also be functions for the geometrical comparison of molecules. The second type of molecular class would represent the stereochemical properties of molecular objects such as bond lengths, angles, torsion angles, principal axes, planes, simple surfaces and space curves. The third type of class would represent dynamic atoms with functions to calculate interatomic forces and velocities, to evaluate energy and to carry out energy minimisation.
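
As an illustration only, the first two types of class might be sketched in standard C++ as follows. All names here (Vec3, Atom, Molecule, bondLength, torsionAngle) are hypothetical and are not taken from MOL or any other existing library. The sketch uses present-day C++ syntax, and the sign of the torsion angle simply follows the formula in the code rather than any stated convention.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical three-dimensional vector type used for atomic co-ordinates.
struct Vec3 {
    double x, y, z;
};

inline Vec3 operator+(const Vec3& a, const Vec3& b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator-(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
inline double norm(const Vec3& a) { return std::sqrt(dot(a, a)); }

// First type of class: a molecular object in three dimensions.
struct Atom {
    std::string name;
    Vec3 pos;
};

class Molecule {
public:
    void add(const Atom& a) { atoms_.push_back(a); }
    std::size_t size() const { return atoms_.size(); }
    const Atom& atom(std::size_t i) const { return atoms_.at(i); }

    // Translate every atom by the same shift vector.
    void translate(const Vec3& shift) {
        for (Atom& a : atoms_) a.pos = a.pos + shift;
    }

private:
    std::vector<Atom> atoms_;
};

// Second type of class, reduced here to free functions: stereochemical properties.
inline double bondLength(const Atom& a, const Atom& b) { return norm(a.pos - b.pos); }

// Torsion angle a-b-c-d in degrees, in the range -180 to 180.
// Assumes b and c are distinct and the four atoms are not collinear.
inline double torsionAngle(const Atom& a, const Atom& b, const Atom& c, const Atom& d) {
    const Vec3 b1 = b.pos - a.pos;
    const Vec3 b2 = c.pos - b.pos;
    const Vec3 b3 = d.pos - c.pos;
    const Vec3 n1 = cross(b1, b2);  // normal to the plane a-b-c
    const Vec3 n2 = cross(b2, b3);  // normal to the plane b-c-d
    const double y = dot(cross(n1, n2), b2) / norm(b2);
    const double x = dot(n1, n2);
    const double pi = std::acos(-1.0);
    return std::atan2(y, x) * 180.0 / pi;
}
```

A caller would build a Molecule from Atom objects, apply translate to move it, and pass four atoms to torsionAngle to obtain, for example, a backbone dihedral.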

b) Sequence classes

There would be classes for single protein and nucleic acid sequences and also for sets of aligned sequences. Methods would be provided for reading, visualising, aligning and comparing sequences. Other methods would predict secondary structure or protein fold from sequence.
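
A sequence class of this kind might be sketched as below. This is a minimal illustration under stated assumptions, not a proposed interface: the names NucleicSequence and percentIdentity are hypothetical, the sequence is assumed to be upper-case DNA, translation uses the standard genetic code in the first reading frame only, and gaps in aligned sequences are assumed to be written as '-'.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical class for a single nucleic acid sequence (upper-case DNA assumed).
class NucleicSequence {
public:
    explicit NucleicSequence(std::string residues) : residues_(std::move(residues)) {}

    const std::string& residues() const { return residues_; }

    // Translate in the first reading frame using the standard genetic code.
    // An incomplete trailing codon is ignored; codons containing a base
    // other than T, C, A or G are reported as 'X'. Stop codons give '*'.
    std::string translate() const {
        static const std::string bases = "TCAG";
        static const std::string code =
            "FFLLSSSSYY**CC*W"
            "LLLLPPPPHHQQRRRR"
            "IIIMTTTTNNKKSSRR"
            "VVVVAAAADDEEGGGG";
        std::string protein;
        for (std::size_t i = 0; i + 3 <= residues_.size(); i += 3) {
            std::size_t index = 0;
            bool known = true;
            for (std::size_t j = 0; j < 3; ++j) {
                const std::size_t b = bases.find(residues_[i + j]);
                if (b == std::string::npos) { known = false; break; }
                index = 4 * index + b;
            }
            protein += known ? code[index] : 'X';
        }
        return protein;
    }

private:
    std::string residues_;
};

// Percentage of identical residues between two aligned sequences of equal
// length. Columns in which either sequence has a gap ('-') are not counted.
inline double percentIdentity(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("aligned sequences must have equal length");
    }
    std::size_t compared = 0;
    std::size_t identical = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == '-' || b[i] == '-') continue;
        ++compared;
        if (a[i] == b[i]) ++identical;
    }
    return compared == 0 ? 0.0 : 100.0 * static_cast<double>(identical) / static_cast<double>(compared);
}
```

Classes for protein sequences and for sets of aligned sequences could follow the same pattern, with alignment and prediction methods added as separate components.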

c) Visual front-ends

Graphical user interfaces would be constructed using Tcl/Tk which would enable construction and manipulation of objects by dialogue boxes and simple point-and-click. Excellent progress with this type of work has already been made by CCP11 (BBSRC Collaborative Computational Project 11).

d) Methods for the visualisation of objects

Classes would be provided for viewing and interactive manipulation of molecules, their sequences and their alignments. Use would be made of the new X Consortium libraries such as Fresco with PEXlib and XIElib for 3D viewing and images.

e) Object persistence and distribution

The sharing of objects across networks is now becoming possible both through open standards such as CORBA and through proprietary standards such as Microsoft's Common Object Model (COM). The use of these new technologies will be explored with a view to providing objects which exist across host boundaries.

Project Plan

The aim is to produce reusable software components which adhere to international standards and are usable across platforms, particularly Unix and Microsoft Windows. Demonstrator applications using the components would also be provided. The software would be documented in Hypertext Markup Language (HTML) on the World Wide Web and made available by FTP. Software under development would also be available in a separate archive. In this way international help and criticism can contribute to the project at an early stage.

The starting point of the project would be the existing software. Our own pilot studies would form a base for the object oriented development. Many excellent routines already exist in programs developed in the bioinformatics community, and these should be integrated into any class library. Routines for sequence manipulation are publicly available through CCP11, routines for co-ordinate manipulation are available in crystallographic software from Birkbeck (3,4,5,6,7), and Janet Thornton has offered to contribute the software from her group at University College, Biochemistry. The public visibility of the project on the Internet would also attract other contributions.

The basic implementation strategy would be to exploit the extensibility and adaptability of the ANSI C++ Standard Template Library which provides well structured generic components which work together in a seamless way. Existing software would be adapted to provide new components and function objects. Existing software in C would be modified to use generic components but underlying algorithms would not be rewritten.
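
The style of programming this implies can be sketched as follows. The example is illustrative only: Point, Translate, WithinDistance and countNear are hypothetical names, not components of any existing library. It shows two function objects being passed to standard generic algorithms, so that the same code works with any standard container of points.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical co-ordinate type.
struct Point {
    double x, y, z;
};

// Function object that shifts a point by a fixed displacement.
class Translate {
public:
    Translate(double dx, double dy, double dz) : dx_(dx), dy_(dy), dz_(dz) {}
    Point operator()(const Point& p) const { return {p.x + dx_, p.y + dy_, p.z + dz_}; }

private:
    double dx_, dy_, dz_;
};

// Predicate function object: true if a point lies within a cutoff of a centre.
// Squared distances are compared, so no square root is needed.
class WithinDistance {
public:
    WithinDistance(const Point& centre, double cutoff) : centre_(centre), cutoff2_(cutoff * cutoff) {}
    bool operator()(const Point& p) const {
        const double dx = p.x - centre_.x;
        const double dy = p.y - centre_.y;
        const double dz = p.z - centre_.z;
        return dx * dx + dy * dy + dz * dz <= cutoff2_;
    }

private:
    Point centre_;
    double cutoff2_;
};

// Generic component: counts the points near a centre in any container
// that provides begin() and end().
template <typename Container>
std::size_t countNear(const Container& points, const Point& centre, double cutoff) {
    return static_cast<std::size_t>(
        std::count_if(points.begin(), points.end(), WithinDistance(centre, cutoff)));
}
```

A set of co-ordinates held in a std::vector<Point> could then be shifted in place with std::transform(points.begin(), points.end(), points.begin(), Translate(1.0, 0.0, 0.0)), and the same function objects would work unchanged with other containers.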

The CCP11 Internet site would make the toolkit available during development so there could be widespread feedback from the bioinformatics community who could also be encouraged to submit software for possible inclusion. In this way, software developers would be encouraged to see themselves as stakeholders in the library and would promote its use.

Project management

The project would be jointly managed between Birkbeck and CCP11. The content of the library would be partly determined by our assessment of the needs of the community and partly by the software contributions received. It would not be the aim of the project to rewrite existing software which was not written in C++. Established techniques such as the provision of wrapper classes for reusable components would be adopted in order to integrate them into the toolkit. The close involvement of one of us (AB) with the CCP11 community is an important strength and we would also take advice from contributors of software such as Janet Thornton and Peter Murray-Rust.
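
The wrapper technique can be illustrated with a short sketch. Everything in it is hypothetical: legacy_centroid is a stand-in written for this example to represent an existing C routine, and CoordinateSet is an invented wrapper class. The point is that the C++ class owns the data and presents a safe interface while the underlying routine is called unchanged.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Stand-in for an existing C routine (hypothetical). It expects natoms
// co-ordinates packed as x0,y0,z0,x1,y1,z1,... and writes the centroid
// into out[0..2]. natoms must be greater than zero.
extern "C" void legacy_centroid(const double* xyz, int natoms, double* out) {
    out[0] = out[1] = out[2] = 0.0;
    for (int i = 0; i < natoms; ++i) {
        out[0] += xyz[3 * i];
        out[1] += xyz[3 * i + 1];
        out[2] += xyz[3 * i + 2];
    }
    out[0] /= natoms;
    out[1] /= natoms;
    out[2] /= natoms;
}

// Wrapper class: owns the co-ordinate storage and hides the raw pointers
// and the atom count required by the C routine.
class CoordinateSet {
public:
    void addAtom(double x, double y, double z) {
        xyz_.push_back(x);
        xyz_.push_back(y);
        xyz_.push_back(z);
    }

    std::size_t atomCount() const { return xyz_.size() / 3; }

    // Returns the centroid, or the origin if the set is empty.
    std::array<double, 3> centroid() const {
        std::array<double, 3> c = {{0.0, 0.0, 0.0}};
        if (!xyz_.empty()) {
            legacy_centroid(xyz_.data(), static_cast<int>(atomCount()), c.data());
        }
        return c;
    }

private:
    std::vector<double> xyz_;
};
```

The same approach would apply to a routine in another language, provided it can be called through a C-compatible interface.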

Exploitation of results

The development of a class library is of little benefit unless it is widely used. Three essential requirements are publicity, documentation and training. The World Wide Web is an increasingly vital publicity tool and the Daresbury Web server provides a direct lead into the CCP11 community, while Birkbeck, as a WWW host site for a number of national societies such as the Biochemical Society and the British Crystallographic Association, provides a strong link into the bioinformatics community. Hypertext documentation will also be provided in HTML files on the Web. Hypertext is very suitable for class libraries and is widely used by commercial developers for documenting their class libraries (e.g. the class libraries associated with Borland development tools).

With regard to training, the involvement of CCP11 will again be important. Software training is an activity with which CCP11 is already familiar and we believe that CCP11 should extend this role in the same way as CCP4 has so successfully done in promoting the use of software in the crystallographic community. Andrew Coulson (Co-founder of CCP11) is happy that CCP11 should take on this role when a class library has been developed.

Resources requested

a) Personnel

This work would require knowledge of the function and structure of biomolecules and high level skills in object oriented analysis, design and code construction. A post-doctoral research assistant with relevant experience in bioinformatics and enthusiasm for modern software design is needed. The project would involve the testing of the classes on several platforms such as Windows, SG and HP Unix and therefore it will be necessary to install new C++ compilers on these platforms as they track the ANSI standard. CORBA and FRESCO support will also have to be installed and maintained. Our computer technician John Bouquiere is responsible for such tasks and we therefore ask for a contribution towards his salary. We also ask for a contribution towards the salary of our Computer Manager (Richard Westlake) in recognition of his role in supporting the Internet and World Wide Web infrastructure which is involved in this proposal. Glaxo Research and Development have already formally offered to participate in a CASE studentship at Birkbeck associated with object technology development for bioinformatics.

b) Hardware and software

We request a Silicon Graphics Indy workstation with XZ graphics and a contribution towards the cost of connection to the Department's FDDI network. The XZ would greatly assist the development of the graphics classes. Software associated with object technology undergoes rapid development and hence we request a contribution to software development tools for our workstations. Birkbeck College will contribute a PC for Windows testing but a contribution towards emerging Windows ANSI C++ compilers is requested.

c) Travel

We also request travel funds for six return trips per year between London and Daresbury for discussions between the applicants and for the attendance of the applicants and the research assistant at annual CCP11 meetings. We also request one visit to the Supercomputer Center at San Diego to consult with Phil Bourne, who is keen to support this project (1,2).

d) Consumables

We request £100 pa as a contribution towards the cost of paper and disks.

References

1. Macromolecular Query Language (MMQL) - Prototype Data Model and Implementation: Shindyalov I, Chang W and Bourne P, Prot. Eng. (1994) 7(11), 1311-1322.
2. Design and Application of PDBlib, a C++ Macromolecular Class Library: Chang W, Shindyalov I, Pu C and Bourne P, CABIOS (1994) 10(3), 309-317.
3. PROCHECK: a Program to Check the Stereochemical Quality of Protein Structures: Laskowski R A, MacArthur M W, Moss D S and Thornton J M, J. Appl. Cryst. (1993) 26, 283-291.
4. TLSANL: TLS Parameter Analysis Program for Segmented Anisotropic Refinement of Macromolecular Structures: Howlin B, Butler S A, Harris G W and Driessen H P C, J. Appl. Cryst. (1993) 26, 622-624.
5. An Algorithm for Automatically Generating Protein Topology Cartoons: Flores T P, Moss D S and Thornton J M, Prot. Eng. (1994) 7, 31-37.
6. RESTRAIN: Restrained Structure-Factor Least-Squares Refinement Program for Macromolecular Structures: Driessen H P C, Haneef I, Harris G W, Howlin B, Khan G and Moss D S, J. Appl. Cryst. (1989) 22, 510-516.
7. Use of Parallel Processing in the Study of Protein Ligand Binding: Goodfellow J M, Jones D M, Laskowski R, Moss D S, Saqi M, Thanki N and Westlake R, J. Comp. Chemistry (1990) 11, 314-325.