The Bioinformatics Template Library
Mark A. Williams, Will R. Pitt, Alan J. Bleasby &
David S. Moss
Biochemistry & Molecular Biology, University College
London
Chiroscience Group plc, Cambridge
Bioinformatics Applications Group, HGMP-RC, Hinxton
Crystallography, Birkbeck College London
The value of a component-based approach to software development has been recognized for many years. Well designed and tested components allow faster development of new algorithms and applications, that are both more reliable and provide near optimum computational performance. We are developing a C++ library of algorithms and data structures that are commonly used within the fields of bioinformatics and molecular modelling.
The library uses templates, which are a relatively new technique that allows compilers to make very efficient use of generic components, and closely follows the design of the Standard Template Library that is part of ISO/ANSI standard C++. The BTL and STL generic components can be thought of as building blocks for the construction of domain specific classes and applications.
The aim of the BTL is to provide the generic mathematical
components that will allow programmers to more rapidly construct applications
that model biological entities. The BTL and STL take care of complex mathematical
and memory management tasks in a reliable and efficient manner, allowing
programmers to focus on the biological and physical aspects of their application
specific design.
Generic Components
Many people have written small libraries of software components to represent and manipulate biological entities. However, they tend to be idiosyncratic, complex, tied closely to a specific application and incompatible with those of other authors. Consequently, there is usually little reuse of them within the wider community. We believe that the generic programming approach provides a better method for the construction and distribution of reusable components for biocomputing.
A generic component can be used in more than one context. A generic data structure can contain many different types of data, e.g. a generic list could contain single letter amino-acid names or atomic co-ordinates. A generic algorithm can be used upon data stored in a range of data structures e.g. to find the largest data element in a vector or the largest element in a list. Generic algorithms and data structures can be freely combined in a very flexible manner, providing maximum opportunity for their use.
Much of the functionality of the new ISO standard C++ library is provided in the form of a collection of generic software components that closely follow the design pattern and functional content of the Standard Template Library. In particular, they follow the STL in using template classes, a feature of C++, for compile time parameterisation of the components. The standard library components include vectors, linked lists, sets, queues and sorting algorithms.
The BTL extends the standard C++ library to cover
components which are commonly used within biocomputing, such as graphs,
vector and matrix algebra, sequence alignment methods, FFT algorithms,
least squares methods, genetic algorithms, simulated annealing, pattern
recognition/matching algorithms ....
Abstraction Encourages Code Reuse
Bioinformatics and molecular modelling involve manipulating a large variety of complex data types in many different ways. The task of representing these domains in classes is extremely time consuming and usually results in specialized and highly complex class hierarchies. The more complex a class library, the more effort is required to learn how to use it and the less likely it will be widely adopted. Additionally, learning how to use a sequence class library will not be much use if you switch areas to structure analysis.
Another significant barrier to reuse of domain-specific class libraries is that they impose a more or less fixed model of, say, a protein. In reality, even in one application area, the conception of what something like a protein is, and what you can do with it, varies between individuals, over time and with the task being performed.
The generic component approach provides a way of side-stepping these problems and of reusing substantial amounts of functional code within in a simple, consistent and flexible framework. The components within the STL and BTL perform abstract mathematical tasks without reference to any domain (except indirectly in that the scope of the components is restricted to functions that are useful in biocomputing).
The BTL and STL components are written in an abstract
manner. They make no internal reference to a particular application
domain. They make no reference to specific data types, e.g. int, float,
char or any user-defined type, but are written in terms of dummy parameters
(arguements). In the case of the STL and BTL components, the abstract parameters
are often iterators. Iterators form the interface between data structure
components and algorithm components. A data structure, or container,
component provides a certain type of iterator that can be used to navigate
through its individual data elements. Examples of STL iterator types are
"forward" that allow sequential traversal of elements in the data structure
and "random access" that also allows direct access to each element. An
algorithm component often takes two iterators as input that indicate a
range of elements within a container. To be compatible with a particular
algorithm a container must provide iterators that allow the algorithm to
navigate through a container in the way that it needs. The simpler the
demands made by an algorithm upon its input parameters, the more generic
it is likely to be. The greater the functionality of the iterators that
a container provides, the greater the potential range of algorithms that
can operate on it.
Non-algorithmic Components
|
|
Containers
numeric_vector matrix |
These containers are similar to many matrix and vector classes found in other C++ mathematical libraries, implementing a natural mathematical notation for basic vector and matrix algebra through overloading the usual C++ arithmetic operators. However, they are template classes, allowing the programmer to determine the type of data contained within them. They provide iterators that allow random access to the contained elements. Consequently, they can be combined with all the BTL/STL generic algorithms. |
Container adaptors
graph graph_with_edges
|
Provide two graph classes with different functionality and performance profiles. The first class has simple pointers between vertices while the second type has labelled edges. The vertices and edges can be of any type. Iterators are provided for sequential access to vertices and edges, and also for access in an order determined by the connectivity within a graph. |
Function objects
pdb_sort
amino_acid_property
random_uniform
less_absolute |
Can be used as a parameter for the STL generic
sort algorithms. This functor takes two input strings containing atom names
and returns true if the first name comes before the second in the order
found in PDB format files and false if it does not.
Takes a single character as its input and returns a real number property value for the amino acid the character represents. The type of property produced (hydropathy, volume etc.) is defined when the function object is constructed. Classes which generate random numbers, from a variety of functional distributions, using well documented portable algorithms (Knuth, 1998). A random number generator is also provided by the STL however its internal implementation will probably vary from compiler to compiler and its statistical properties are not necessarily suitable for biocomputing applications. This functor takes two inputs and returns true if the absolute value of the first is less than the absolute value of the second. This is particularly useful for sorting operations. |
File Processors
PIR_processor
|
Non-template classes for extracting data from external files, which read sequences, PDB files, raw co-ordinate files and image files respectively. Member functions deliver the data in the form of STL vectors.
|
Using BTL components
Generic component programming provides great flexibility. Components from the BTL and STL can be mixed without conflict. New generic components can be added as and when they are needed without having to understanding the inner workings of any of the other components. The BTL and STL are simpler to use than many other class libraries because there is a great consistency and commonality in the way that components of the same type are used. It is likely that the use of the STL will become very familiar to C++ programmers. We hope that this familiarity will transfer to the BTL, and make its use almost second nature to these programmers.
Programmers can use the components in any style of programming that they choose, be it to create a large object-oriented class library for modelling proteins or short procedural single function application. Provided that the biological data with which the application programmer is dealing is contained in a C++ class that provides an appropriate iterator for navigating through the class' data structure, the BTL and STL algorithms can be used to manipulate that data. The BTL and STL containers themselves will often be sufficient for holding biological data. When additional functionality is required of the class holding the data, this can often be built around an STL or BTL container.
One important use of the BTL will be to simplify the writing of domain-specific class libraries. From a stylistic perspective, it would be nice if these libraries could largely be written as higher-level generic components. Yet for many problems this would not be the most suitable design. Not every situation is best regarded in the very abstract manner inherent in developing generic components. We are also developing a hierarchical object-oriented domain-specific library of software for use in molecular modelling. Examples of the classes are Molecule, Atom and Residue. This seems to be a sensible approach as the hierarchy of macromolecular structure classification transfers easily onto the classes. However, the consequences of this hierarchical object-oriented design are that initially our class library was rather complex and had inefficient duplication of functionality.
For example, consider the implementation of a function
centroid, which determines the centre of a group of atoms. If knowledge
about these atoms is contained within several classes e.g. a Molecule,
a Residue, a Chain, then separate versions of the code to perform this
function would normally be put into each class. This obviously leads to
unnecessarily large classes and potential inconsistencies. The most
effective way of avoiding such repetition is to have a generic centroid
algorithm. The same algorithm can then be used to calculate the centroid
in each case. The transfer of functionality in this way offers a substantial
simplification of the classes representing the physical objects, freeing
the developer and subsequent user to spend more time on specifically
biological problems.
Under Construction
This poster describes work in progress and currently the
BTL should be treated as an evolving rather than a finished product. We
welcome suggestions for algorithms or data structures that could
be included in future releases. We are also happy to accept donations
of working code for algorithms (and test data) that could be adapted to
the format of the BTL. Best of all would be to establish collaboration
between specialists from different application areas to further future
development and distibution of state-of-the-art techniques through this
standard format.
Availability
The BTL will be free under the GNU General Public License.
Source code and extended documentation of the current version is available
from our web site http://www.cryst.bbk.ac.uk/~classlib/.
Acknowledgements
We would like to thank Claude Beazley, Mary Steven, Breen
Sweeney, Oliver Theis & Ian Tickle for their helpful comments and suggestions.
This work was supported by a BBSRC/EPSRC grant BIF05348 within the Bioinformatics
Programme.
Algorithms in the BTL
|
|
|
fourier_transform | Fourier algorithms
Fast Fourier transform in 1,2,and 3 dimensions |
Random access iterator
Sequential storage in particular order |
length separation separation_squared scalar_product sum sum_precise sum_of_squares sum_of_squares_precise vector_product
translate |
Vectors operations
(Sum (ai)2)1/2 (Sum (ai-bi)2)1/2 Sum (ai-bi)2 Sum aibi Sum ai first sort elements in ascending order then Sum (ai)2 first sort elements in ascending order then Sum (ai)2 a^b (triple only) a·(b^c) (triple only) a^ (b^c) (triple only) apply a rotation to each of a container of triples add a given vector to each of a container of triples |
Forward iterator
Sequential storage in particular order |
inverse_square inverse_cholesky transpose
|
Matrix operations
M-1 (small square matrix only) M-1(positive definite symmetric matrix only) MT (transpose - in place if desired) |M|(small square matrix only) eigenvectors and values for square matrix Sum j M(i,j) / (number of rows) MN MMT MTM Ma aM |
Forward iterator
Sequential storage in particular order |
mean mean_absolute_deviation variance normal_statistics
|
Statistics
<a> = S (ai)/ n Sum |<a>-ai |/ n Sum (<a>-ai)2/ (n-1) calculate mean, mean_absolute_deviation, variance, skew (1/n)Sum ((<a>-ai)/s )3 and kurtosis (1/n)Sum((<a>-ai)/s )4 |
Forward iterator
|
max_mismatch max_absolute_mismatch min_mismatch min_absolute_mismatch every_less every_less_equal |
Numerical comparison of two containers
maximum difference between corresponding elements
|
Forward iterator
Sequential storage in particular order |
centroid
lsqfit rotation_from_fit lsqfit_rmsd
rmsd |
3D dimensional co-ordinate fitting
calculate geometric centre of co-ordinates
(Sum (ai-bi)2/n)1/2 |
Forward iterator
|
needleman_wunsch_alignment
needleman_wunsch_similarity
|
Sequence algorithms
pairwise sequence alignment with selectable gap penalty calculate the similarity score of a pairwise alignment |
Forward iterator |
a ,b, c are vectors and ai,
bi, ci their elements. M, M1, M2 are Matrices and M(i,j) etc. their elements.
An illustration in which STL and BTL components are used to superimpose two molecular structures
// Standard header files
#include <vector.h>
#include <iomanip.h>
// BTL header files
#include "btl_biomolecular_data.h"
#include "btl_least_squares.h"
#include "btl_matrix.h"
#include "btl_numeric_vector.h"
#include "btl_matrix_algorithms.h"int main(int argc, char* argv[])
{
if (argc != 3) {
cerr << "Usage: fitpdb firstPDBFile secondPDBFile" << endl;
exit(1);
}// Create objects to represent each structure using one of the file processor classes from the BTL
// Read information from PDB files (reading only chains M and N, and the B atoms when alternatives are given)
ATOM_processor A; A.ReadFile(argv[1],"MN ",'B');
ATOM_processor B; B.ReadFile(argv[2],"MN ",'B');// The Coords() member function of ATOM_processor returns an STL vector containing the coordinates.
// Consequently, the number of atoms in each file can be retrieved using the standard size() member function.
if (A.Coords().size() != B.Coords().size() ) {
cerr << "Number of atoms unequal" << endl;
exit(1);
}bool long_way=true;
if(long_way){
// Do the superposition the long way in order to demonstrate the vector and matrix algorithms
// The geometric centre of each structure is declared as a BTL numeric_vector with 3 elements of
// BTL_REAL(0.0) (the default). The coordinates of the centres are calculated using the generic
// BTL centroid algorithm is in this case operating on both STL and BTL vectors.
numeric_vector<> centreA, centreB;
centroid(A.Coords().begin(), A.Coords().end(), centreA.begin());
centroid(B.Coords().begin(), B.Coords().end(), centreB.begin());// Move protein A such that the protein centres are superimposed using the generic BTL algorithm `translate'
numeric_vector<> translation = centreB - centreA;
translate(A.Coords().begin(), A.Coords().end(), translation.begin());// Determine and perform the rotation necessary to superimpose structures
// First calculate the Kearsley matrix and determine its eigenvalues and eigenvectors
matrix<> matfit(4,4), evector(4,4); numeric_vector<> evalue(4);
_kearsley_matrix(A.Coords().begin(), A.Coords().end(), B.Coords().begin(), B.Coords().end(), matfit.begin());eigen_solution(matfit.begin(), matfit.end(), 4 ,evector.begin(), evalue.begin());
transpose(evector.begin(), evector.end(), 4, evector.begin());// Then rotate A about its centre in order to effect the superposition
matrix<> rotation(3,3);
rotation_from_fit(evector.begin(),rotation.begin());
rotate(A.Coords().begin(), A.Coords().end(), rotation.begin(), centreB.begin());
}else {
// Alternatively, and much shorter, the above steps are incorporated in a single algorithm in which the
// first protein's coordinates are overwritten. Here again we apply a BTL algorithm to the coordinate
// data held in STL vectors.
double rmsd = 0.0;
rmsd = lsqfit(A.Coords().begin(), A.Coords().end(), B.Coords().begin(), B.Coords().end(), rmsd);
cout << "Root mean square distance : " << rmsd << "\n";
}// The outstream operator << is overloaded to write the contents of an ATOM_processor object in PDB format.
cout << A;
return 0;
}