Object-Oriented Database Management Systems in Java

Object-Oriented Database Management Systems
An object-oriented data model is based on the notions of objects and classes. Objects model real-world entities, and each object has an identity. A class defines a set of objects: objects of the same class share the same data structure and the same operations, called methods. Classes are organized in a class hierarchy, which constitutes the database schema. Object-oriented database management systems (OODBMSs) are well suited to building systems in newer fields such as CAD and multimedia.
Objects in a genomic database typically have a dynamic nature. We address part of this problem in the Prototype section that follows. Some OODBMSs provide schema evolution facilities [Kim 1990], but they seldom support automatic propagation of schema changes to the instances. A survey of the schema evolution problem can be found in [Zicari 1991]. In summary, OODBMSs can represent complex data (composite objects, lists of objects) and make it possible to build advanced applications, such as genomic information systems, efficiently.
In this chapter, we present an approach that uses an object-oriented database management system to store and manage genetic sequences. We also show some limitations of these systems that stem from a lack of flexibility, in particular concerning the dynamic propagation of update operations. To overcome this limitation, we introduced a mechanism inspired by Smalltalk [Goldberg 1983] that dynamically propagates update operations on objects.
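The text does not detail the propagation mechanism at this point. As a rough illustration only, the sketch below assumes it resembles Smalltalk's dependency protocol, in which an object that changes notifies its registered dependents. All names here (`Dependent`, `ObservedObject`, `StoredSequence`, `LengthResult`) are hypothetical and are not part of O2 or of the prototype.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Smalltalk-style dependency propagation.
// None of these names come from O2 or from the prototype described in the text.
interface Dependent {
    // Called when an object this dependent is registered on has changed.
    void update(Object changed, String aspect);
}

class ObservedObject {
    private final List<Dependent> dependents = new ArrayList<>();

    void addDependent(Dependent d) {
        dependents.add(d);
    }

    // Plays the role of Smalltalk's "changed:" message: notify every dependent.
    protected void changed(String aspect) {
        // Iterate over a copy so dependents may register others during notification.
        for (Dependent d : new ArrayList<>(dependents)) {
            d.update(this, aspect);
        }
    }
}

class StoredSequence extends ObservedObject {
    private String residues;

    StoredSequence(String residues) {
        this.residues = residues;
    }

    String residues() {
        return residues;
    }

    // An update operation: the change is propagated to all dependents.
    void setResidues(String residues) {
        this.residues = residues;
        changed("residues");
    }
}

// A derived value that must stay consistent with the sequence it was computed from.
class LengthResult implements Dependent {
    private int length;

    private LengthResult(int length) {
        this.length = length;
    }

    static LengthResult of(StoredSequence s) {
        LengthResult r = new LengthResult(s.residues().length());
        s.addDependent(r);
        return r;
    }

    int length() {
        return length;
    }

    @Override
    public void update(Object changed, String aspect) {
        if (changed instanceof StoredSequence && "residues".equals(aspect)) {
            length = ((StoredSequence) changed).residues().length();
        }
    }
}
```

With this arrangement, calling `setResidues` on a `StoredSequence` recomputes every `LengthResult` created from it, without the caller having to know which results depend on the sequence.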
In this section, we describe the structure of the database and the operations that can be applied to objects. We store sequences, queries, functions that predict properties of sequences, and the results of applying these functions to the sequences. The schema of our object-oriented database therefore consists of four main classes: Sequence, Function, Result, and Query.
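The four classes could be outlined in Java roughly as follows. Only the class names come from the text; the fields, constructors, and the `applyTo` and `run` methods are assumptions made for illustration, and the real schema is defined in the OODBMS rather than in plain Java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

// Illustrative sketch: class names follow the text, everything else is assumed.
class Sequence {
    final String name;
    final String residues; // the sequence itself, stored as a string

    Sequence(String name, String residues) {
        this.name = name;
        this.residues = residues;
    }
}

// A function predicting a (here numeric) property of a sequence.
class Function {
    final String name;
    private final ToDoubleFunction<Sequence> body;

    Function(String name, ToDoubleFunction<Sequence> body) {
        this.name = name;
        this.body = body;
    }

    // Applying a function to a sequence yields a Result that remembers both.
    Result applyTo(Sequence s) {
        return new Result(this, s, body.applyAsDouble(s));
    }
}

// The outcome of applying one Function to one Sequence.
class Result {
    final Function function;
    final Sequence sequence;
    final double value;

    Result(Function function, Sequence sequence, double value) {
        this.function = function;
        this.sequence = sequence;
        this.value = value;
    }
}

// A stored query: a description plus a condition on sequences.
class Query {
    final String description;
    private final Predicate<Sequence> condition;

    Query(String description, Predicate<Sequence> condition) {
        this.description = description;
        this.condition = condition;
    }

    // Return the candidate sequences that satisfy the condition.
    List<Sequence> run(List<Sequence> candidates) {
        List<Sequence> hits = new ArrayList<>();
        for (Sequence s : candidates) {
            if (condition.test(s)) {
                hits.add(s);
            }
        }
        return hits;
    }
}
```

Keeping `Result` as a separate class, rather than a field of `Sequence`, means the database can record which function produced which value for which sequence.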
Application Requirements
In our application, we needed to store data as lists and matrices, together with the treatments applied to the sequences, the learning functions, and the results of certain queries. Our study began with an RDBMS, but we found it inadequate for modeling our data, and SQL was not powerful enough to express the treatments on the sequences or the learning functions. The main benefit of an OODBMS is that methods can be embodied in the data structures themselves. Moreover, in an OODBMS these methods can be triggered from within queries. We have used this capability intensively in our implementation.
Learning techniques may be used at the DNA level, for example, to search for the sequences that code for genes. However, one of the main challenges in this field is predicting the structure of a protein from its amino-acid sequence. Only a few hundred proteins have a known three-dimensional structure. The structures of these proteins are stored in a database and used as references to help determine the structure of other proteins by means of an alignment algorithm. An alignment consists of searching for similar patterns in two sequences with a dynamic programming algorithm such as [Needleman 1970]. This is called homology modeling.
Our prototype couples the OODBMS O2 with a 1D alignment program developed by J. Gracy [Gracy 1991], which can be triggered through a method [Ripoche 1995]. Furthermore, alignments can be stored as objects and then queried, both for their content and for the way they were obtained. An alignment allows a matrix to be built [Gribskof 1987]; this matrix is itself a query that can be invoked to retrieve sequences close to the example sequences used in the alignment process. This allows us to automate the production of consensus motifs and their exploitation with an object-oriented query language such as OQL.
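To make the dynamic programming idea concrete, here is a minimal sketch of a global alignment score in the style of [Needleman 1970]. It is not the alignment program by J. Gracy used in the prototype. The class name `GlobalAlignment` and the simple match, mismatch, and linear gap scores are assumptions chosen for illustration; practical protein alignment would normally use a substitution matrix.

```java
// Minimal Needleman-Wunsch-style global alignment score.
// The scoring scheme (match / mismatch / linear gap penalty) is an assumption.
class GlobalAlignment {

    static int score(String a, String b, int match, int mismatch, int gap) {
        int n = a.length();
        int m = b.length();
        // dp[i][j] = best score aligning the first i letters of a with the first j of b
        int[][] dp = new int[n + 1][m + 1];

        // Aligning a prefix against the empty string costs one gap per letter.
        for (int i = 1; i <= n; i++) {
            dp[i][0] = i * gap;
        }
        for (int j = 1; j <= m; j++) {
            dp[0][j] = j * gap;
        }

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diagonal = dp[i - 1][j - 1] + substitution; // align the two letters
                int up = dp[i - 1][j] + gap;                    // gap in b
                int left = dp[i][j - 1] + gap;                  // gap in a
                dp[i][j] = Math.max(diagonal, Math.max(up, left));
            }
        }
        return dp[n][m];
    }
}
```

In the setting described above, such a routine would be exposed as a method on sequence objects so that it can be triggered from a query; recovering the alignment itself, rather than only its score, requires an additional traceback through the table.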
Modeling Genetic Sequences
A genome sequence consists of two parts:
- A nucleic sequence (succession of nucleic acids)
- A proteic sequence (succession of amino acids)
In a cell, the proteic sequence is the result of the translation from the nucleic sequence. A proteic sequence can be viewed, in fact, according to three aspects:
- Primary structure, which is actually the succession of amino acids
- Secondary structure, which is a succession of more elaborated structures (alpha-helix or beta-sheet)
- Tertiary structure, where the position of each amino acid in space is known
We need to model these sequences because genome projects provide large amounts of data. They are difficult to model because they can be viewed according to several aspects. Nucleic sequences are coded by four letters (CGUA for RNA and CGTA for DNA). Proteic sequences consist of a sequence of amino acids coded by an alphabet of twenty letters. Six letters are excluded: B, J, O, U, X, and Z. Each sequence is represented as a string. Examples of sequences are:
Sequence 0  ACHGKLMPACERVATR
Sequence 1  ERTACDEAPMLKNVCWCFAA
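Because each sequence is a string over a fixed alphabet, a simple validity check follows directly from the description above. The sketch below assumes upper-case input and treats the empty string as invalid; the class name `SequenceAlphabet` and its members are hypothetical.

```java
// Alphabets as described in the text; class and member names are hypothetical.
class SequenceAlphabet {
    static final String DNA = "ACGT";
    static final String RNA = "ACGU";
    // The 26 letters minus B, J, O, U, X, and Z: twenty amino-acid codes.
    static final String PROTEIN = "ACDEFGHIKLMNPQRSTVWY";

    // True if the sequence is non-empty and every letter belongs to the alphabet.
    static boolean isValid(String sequence, String alphabet) {
        if (sequence == null || sequence.isEmpty()) {
            return false; // assumption: an empty sequence is not meaningful
        }
        for (int i = 0; i < sequence.length(); i++) {
            if (alphabet.indexOf(sequence.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

Both example sequences above pass `isValid(..., SequenceAlphabet.PROTEIN)`, whereas a string containing any of the six excluded letters does not.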
