A Class Library for Bioinformatics and Molecular Modelling

The Bioinformatics Template Library

Mark A. Williams, Will R. Pitt, Alan J. Bleasby & David S. Moss

Biochemistry & Molecular Biology, University College London
Chiroscience Group plc, Cambridge
Bioinformatics Applications Group, HGMP-RC, Hinxton
Crystallography, Birkbeck College London

The value of a component-based approach to software development has been recognized for many years. Well designed and tested components allow faster development of new algorithms and applications, that are both more reliable and provide near optimum computational performance. We are developing a C++ library of algorithms and data structures that are commonly used within the fields of bioinformatics and molecular modelling.

The library uses templates, which are a relatively new technique that allows compilers to make very efficient use of generic components, and closely follows the design of the Standard Template Library that is part of ISO/ANSI standard C++. The BTL and STL generic components can be thought of as building blocks for the construction of domain specific classes and applications.

The aim of the BTL is to provide the generic mathematical components that will allow programmers to more rapidly construct applications that model biological entities. The BTL and STL take care of complex mathematical and memory management tasks in a reliable and efficient manner, allowing programmers to focus on the biological and physical aspects of their application specific design.

Generic Components

Many people have written small libraries of software components to represent and manipulate biological entities. However, they tend to be idiosyncratic, complex, tied closely to a specific application and incompatible with those of other authors. Consequently, there is usually little reuse of them within the wider community. We believe that the generic programming approach provides a better method for the construction and distribution of reusable components for biocomputing.

A generic component can be used in more than one context. A generic data structure can contain many different types of data, e.g. a generic list could contain single letter amino-acid names or atomic co-ordinates. A generic algorithm can be used upon data stored in a range of data structures e.g. to find the largest data element in a vector or the largest element in a list. Generic algorithms and data structures can be freely combined in a very flexible manner, providing maximum opportunity for their use.

Much of the functionality of the new ISO standard C++ library is provided in the form of a collection of generic software components that closely follow the design pattern and functional content of the Standard Template Library. In particular, they follow the STL in using template classes, a feature of C++, for compile time parameterisation of the components. The standard library components include vectors, linked lists, sets, queues and sorting algorithms.

The BTL extends the standard C++ library to cover components which are commonly used within biocomputing, such as graphs, vector and matrix algebra, sequence alignment methods, FFT algorithms, least squares methods, genetic algorithms, simulated annealing, pattern recognition/matching algorithms ....

Abstraction Encourages Code Reuse

Bioinformatics and molecular modelling involve manipulating a large variety of complex data types in many different ways. The task of representing these domains in classes is extremely time consuming and usually results in specialized and highly complex class hierarchies. The more complex a class library, the more effort is required to learn how to use it and the less likely it will be widely adopted. Additionally, learning how to use a sequence class library will not be much use if you switch areas to structure analysis.

Another significant barrier to reuse of domain-specific class libraries is that they impose a more or less fixed model of, say, a protein. In reality, even in one application area, the conception of what something like a protein is, and what you can do with it, varies between individuals, over time and with the task being performed.

The generic component approach provides a way of side-stepping these problems and of reusing substantial amounts of functional code within in a simple, consistent and flexible framework. The components within the STL and BTL perform abstract mathematical tasks without reference to any domain (except indirectly in that the scope of the components is restricted to functions that are useful in biocomputing).

The BTL and STL components are written in an abstract manner. They make no internal reference to a particular application domain. They make no reference to specific data types, e.g. int, float, char or any user-defined type, but are written in terms of dummy parameters (arguements). In the case of the STL and BTL components, the abstract parameters are often iterators. Iterators form the interface between data structure components and algorithm components. A data structure, or container, component provides a certain type of iterator that can be used to navigate through its individual data elements. Examples of STL iterator types are "forward" that allow sequential traversal of elements in the data structure and "random access" that also allows direct access to each element. An algorithm component often takes two iterators as input that indicate a range of elements within a container. To be compatible with a particular algorithm a container must provide iterators that allow the algorithm to navigate through a container in the way that it needs. The simpler the demands made by an algorithm upon its input parameters, the more generic it is likely to be. The greater the functionality of the iterators that a container provides, the greater the potential range of algorithms that can operate on it.

Non-algorithmic Components

Component	Description
Containers numeric_vector matrix	These containers are similar to many matrix and vector classes found in other C++ mathematical libraries, implementing a natural mathematical notation for basic vector and matrix algebra through overloading the usual C++ arithmetic operators. However, they are template classes, allowing the programmer to determine the type of data contained within them. They provide iterators that allow random access to the contained elements. Consequently, they can be combined with all the BTL/STL generic algorithms.
Container adaptors graph graph_with_edges	Provide two graph classes with different functionality and performance profiles. The first class has simple pointers between vertices while the second type has labelled edges. The vertices and edges can be of any type. Iterators are provided for sequential access to vertices and edges, and also for access in an order determined by the connectivity within a graph.
Function objects pdb_sort amino_acid_property random_uniform random_normal random_exponential less_absolute	Can be used as a parameter for the STL generic sort algorithms. This functor takes two input strings containing atom names and returns true if the first name comes before the second in the order found in PDB format files and false if it does not. Takes a single character as its input and returns a real number property value for the amino acid the character represents. The type of property produced (hydropathy, volume etc.) is defined when the function object is constructed. Classes which generate random numbers, from a variety of functional distributions, using well documented portable algorithms (Knuth, 1998). A random number generator is also provided by the STL however its internal implementation will probably vary from compiler to compiler and its statistical properties are not necessarily suitable for biocomputing applications. This functor takes two inputs and returns true if the absolute value of the first is less than the absolute value of the second. This is particularly useful for sorting operations.
File Processors PIR_processor ATOM_processor XYZ_processor PPM_processor	Non-template classes for extracting data from external files, which read sequences, PDB files, raw co-ordinate files and image files respectively. Member functions deliver the data in the form of STL vectors.

Using BTL components

Generic component programming provides great flexibility. Components from the BTL and STL can be mixed without conflict. New generic components can be added as and when they are needed without having to understanding the inner workings of any of the other components. The BTL and STL are simpler to use than many other class libraries because there is a great consistency and commonality in the way that components of the same type are used. It is likely that the use of the STL will become very familiar to C++ programmers. We hope that this familiarity will transfer to the BTL, and make its use almost second nature to these programmers.

Programmers can use the components in any style of programming that they choose, be it to create a large object-oriented class library for modelling proteins or short procedural single function application. Provided that the biological data with which the application programmer is dealing is contained in a C++ class that provides an appropriate iterator for navigating through the class' data structure, the BTL and STL algorithms can be used to manipulate that data. The BTL and STL containers themselves will often be sufficient for holding biological data. When additional functionality is required of the class holding the data, this can often be built around an STL or BTL container.

One important use of the BTL will be to simplify the writing of domain-specific class libraries. From a stylistic perspective, it would be nice if these libraries could largely be written as higher-level generic components. Yet for many problems this would not be the most suitable design. Not every situation is best regarded in the very abstract manner inherent in developing generic components. We are also developing a hierarchical object-oriented domain-specific library of software for use in molecular modelling. Examples of the classes are Molecule, Atom and Residue. This seems to be a sensible approach as the hierarchy of macromolecular structure classification transfers easily onto the classes. However, the consequences of this hierarchical object-oriented design are that initially our class library was rather complex and had inefficient duplication of functionality.

For example, consider the implementation of a function centroid, which determines the centre of a group of atoms. If knowledge about these atoms is contained within several classes e.g. a Molecule, a Residue, a Chain, then separate versions of the code to perform this function would normally be put into each class. This obviously leads to unnecessarily large classes and potential inconsistencies. The most effective way of avoiding such repetition is to have a generic centroid algorithm. The same algorithm can then be used to calculate the centroid in each case. The transfer of functionality in this way offers a substantial simplification of the classes representing the physical objects, freeing the developer and subsequent user to spend more time on specifically biological problems.

Under Construction

This poster describes work in progress and currently the BTL should be treated as an evolving rather than a finished product. We welcome suggestions for algorithms or data structures that could be included in future releases. We are also happy to accept donations of working code for algorithms (and test data) that could be adapted to the format of the BTL. Best of all would be to establish collaboration between specialists from different application areas to further future development and distibution of state-of-the-art techniques through this standard format.

Availability

The BTL will be free under the GNU General Public License. Source code and extended documentation of the current version is available from our web site http://www.cryst.bbk.ac.uk/~classlib/.

Acknowledgements

We would like to thank Claude Beazley, Mary Steven, Breen Sweeney, Oliver Theis & Ian Tickle for their helpful comments and suggestions. This work was supported by a BBSRC/EPSRC grant BIF05348 within the Bioinformatics Programme.

Algorithms in the BTL

Algorithm	Description	Requirements of Containers
fourier_transform	Fourier algorithms Fast Fourier transform in 1,2,and 3 dimensions	Random access iterator Sequential storage in particular order
length separation separation_squared scalar_product sum sum_precise sum_of_squares sum_of_squares_precise vector_product scalar_triple_product vector_triple_product rotate translate	Vectors operations (Sum (ai)2)1/2 (Sum (ai-bi)2)1/2 Sum (ai-bi)2 Sum aibi Sum ai first sort elements in ascending order then Sum (ai)2 first sort elements in ascending order then Sum (ai)2 a^b (triple only) a·(b^c) (triple only) a^ (b^c) (triple only) apply a rotation to each of a container of triples add a given vector to each of a container of triples	Forward iterator Sequential storage in particular order
inverse_square inverse_cholesky transpose determinant eigen_solution column_mean matrix_product matrix_matrixtranspose_product matrixtranspose_matrix_product matrix_vector_product vector_matrix_product	Matrix operations M-1 (small square matrix only) M-1(positive definite symmetric matrix only) MT (transpose - in place if desired) \|M\|(small square matrix only) eigenvectors and values for square matrix Sum j M(i,j) / (number of rows) MN MMT MTM Ma aM	Forward iterator Sequential storage in particular order
mean mean_absolute_deviation variance normal_statistics	Statistics <a> = S (ai)/ n Sum \|<a>-ai \|/ n Sum (<a>-ai)2/ (n-1) calculate mean, mean_absolute_deviation, variance, skew (1/n)Sum ((<a>-ai)/s )3 and kurtosis (1/n)Sum((<a>-ai)/s )4	Forward iterator
max_mismatch max_absolute_mismatch min_mismatch min_absolute_mismatch every_less every_less_equal	Numerical comparison of two containers maximum difference between corresponding elements max absolute difference between corresponding elements minimum difference between corresponding elements min absolute difference between corresponding elements true if ai < bi for all corresponding pairs of elements true if ai <= bi for all corresponding pairs of elements	Forward iterator Sequential storage in particular order
centroid lsqfit rotation_from_fit lsqfit_rmsd rmsd	3D dimensional co-ordinate fitting calculate geometric centre of co-ordinates superpose two sets of co-ordinates using kearsley algorithm produce the least squares fit rotation matrix calculate the root mean square distance between two sets of co-ordinates upon superposition, without superposing them (Sum (ai-bi)2/n)1/2	Forward iterator
needleman_wunsch_alignment needleman_wunsch_similarity	Sequence algorithms pairwise sequence alignment with selectable gap penalty calculate the similarity score of a pairwise alignment	Forward iterator

a ,b, c are vectors and ai, bi, ci their elements. M, M1, M2 are Matrices and M(i,j) etc. their elements.

An illustration in which STL and BTL components are used to superimpose two molecular structures

// Standard header files
#include <vector.h>
#include <iomanip.h>
// BTL header files
#include "btl_biomolecular_data.h"
#include "btl_least_squares.h"
#include "btl_matrix.h"
#include "btl_numeric_vector.h"
#include "btl_matrix_algorithms.h"
int main(int argc, char* argv[])
{
    if (argc != 3) {
        cerr << "Usage: fitpdb firstPDBFile secondPDBFile" << endl;
        exit(1);
    }
    // Create objects to represent each structure using one of the file processor classes from the BTL
    // Read information from PDB files (reading only chains M and N, and the B atoms when alternatives are given)
    ATOM_processor A; A.ReadFile(argv[1],"MN ",'B');
    ATOM_processor B; B.ReadFile(argv[2],"MN ",'B');
    // The Coords() member function of ATOM_processor returns an STL vector containing the coordinates.
    // Consequently, the number of atoms in each file can be retrieved using the standard size() member function.
    if (A.Coords().size() != B.Coords().size() ) {
        cerr << "Number of atoms unequal" << endl;
        exit(1);
    }
    bool long_way=true;
    if(long_way){
    // Do the superposition the long way in order to demonstrate the vector and matrix algorithms
    // The geometric centre of each structure is declared as a BTL numeric_vector with 3 elements of
    // BTL_REAL(0.0) (the default). The coordinates of the centres are calculated using the generic
    // BTL centroid algorithm is in this case operating on both STL and BTL vectors.
    numeric_vector<> centreA, centreB;
    centroid(A.Coords().begin(), A.Coords().end(), centreA.begin());
    centroid(B.Coords().begin(), B.Coords().end(), centreB.begin());
    // Move protein A such that the protein centres are superimposed using the generic BTL algorithm `translate'
     numeric_vector<> translation = centreB - centreA;
     translate(A.Coords().begin(), A.Coords().end(), translation.begin());
    // Determine and perform the rotation necessary to superimpose structures
    // First calculate the Kearsley matrix and determine its eigenvalues and eigenvectors
    matrix<> matfit(4,4), evector(4,4); numeric_vector<> evalue(4);
    _kearsley_matrix(A.Coords().begin(), A.Coords().end(), B.Coords().begin(), B.Coords().end(), matfit.begin());
    eigen_solution(matfit.begin(), matfit.end(), 4 ,evector.begin(), evalue.begin());
    transpose(evector.begin(), evector.end(), 4, evector.begin());
    // Then rotate A about its centre in order to effect the superposition
    matrix<> rotation(3,3);
    rotation_from_fit(evector.begin(),rotation.begin());
    rotate(A.Coords().begin(), A.Coords().end(), rotation.begin(), centreB.begin());
    }
   else {
   // Alternatively, and much shorter, the above steps are incorporated in a single algorithm in which the
    // first protein's coordinates are overwritten. Here again we apply a BTL algorithm to the coordinate
    // data held in STL vectors.
    double rmsd = 0.0;
    rmsd = lsqfit(A.Coords().begin(), A.Coords().end(), B.Coords().begin(), B.Coords().end(), rmsd);
    cout << "Root mean square distance : " << rmsd << "\n";
    }
    // The outstream operator << is overloaded to write the contents of an ATOM_processor object in PDB format.
    cout << A;
    return 0;
}