The Information Retrieval Technique
This section presents some of the techniques that we used for information retrieval.
An OQL Query Generator
Since populating our XML database, we have found OQL powerful enough to retrieve and structure any information declaratively, thanks to its extensibility (for instance, by implementing a new class method when needed). Of course, OQL is not dedicated to XML but, in our experience, the genericity of our model makes it easier to understand and quicker to learn, for people who already know SQL, than XQL or XML-QL, a current XML query language submission to the W3C. In our system, every user request is analyzed and submitted to O2 as an OQL query. There are two types of request: one asks for the documents matching specific criteria (e.g., a topic and a full-text expression), and the other asks for the textual content of one of the documents found. The following examples illustrate the kinds of query used for the first type.
Full Text Queries
Suppose that the user wants to search for the documents containing the word "XML". We want to return these documents in descending order of the number of occurrences of this word (see the previous database model):
select docname: id->document,
       count: count(select e from partition p, e in p->id->ies)
from w in TheWords, id in w->ids
where w->name = "XML"
group by id->document
order by count(select e from partition p, e in p->id->ies) desc
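The grouping and ranking performed by this query can be mimicked outside O2. The following Python sketch is only an illustration, not our O2 implementation: the in-memory dictionary the_words, its layout (a word mapped to per-document lists of occurrence positions), and the function name rank_documents are assumptions standing in for the TheWords collection and its index entries.

```python
from collections import Counter

# Hypothetical stand-in for the TheWords collection: each word maps to
# postings of the form (document name, list of (start, end) occurrences).
the_words = {
    "XML": [
        ("doc_a", [(3, 6), (40, 43)]),
        ("doc_b", [(10, 13)]),
        ("doc_c", [(0, 3), (8, 11), (25, 28)]),
    ],
    "OQL": [("doc_a", [(12, 15)])],
}


def rank_documents(words, term):
    """Return (document, occurrence count) pairs, most occurrences first."""
    counts = Counter()
    for document, occurrences in words.get(term, []):
        # Plays the role of "group by id->document" plus the count
        # computed over each partition.
        counts[document] += len(occurrences)
    # Plays the role of "order by count(...) desc"; ties are broken by
    # document name so that the result is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(rank_documents(the_words, "XML"))
# prints [('doc_c', 3), ('doc_a', 2), ('doc_b', 1)]
```

A word that is absent from the index simply yields an empty ranking.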
The OQL query above searches the word collection directly for occurrences of "XML". OQL partitioning (with the "group by" and "order by" clauses) is used intensively for ranking results. We extended this query structure to full-text expressions including the "and", "or", and "exact phrase" retrieval operators.
Structured Queries
Suppose now that the user wants to search for documents about "ODBMS". We want to return these documents in descending order of their publication date:
select e->getDocument()->name
from a in TheAttributes, e in bag(a->element)
where a->value = "ODBMS" and e->tagname = "TOPIC"
order by e->getDocument()->getDate() desc
This query shows OQL's flexibility and the diversity of possible strategies for solving the same problem. Here, we chose to query the attribute collection (rather than the document or element collections), because we know that looking first for "ODBMS" attributes is more selective than looking first for "TOPIC" elements. The choice might be different for someone interested in documents having no topic at all! The getDate method is one we developed specifically to deal with document dates.
Mixed Queries
Suppose, finally, that the user wants to search for documents satisfying both of the previous constraints. We merge the two previous queries with an additional join constraint (a purely navigational strategy may be more efficient in certain cases, but we have not optimized that part yet). In the order by clause, we give more importance to the number of occurrences than to the date (see earlier). The where clause becomes (the rest of the query now being trivial):
where w->name = "XML" and a->value = "ODBMS" and e->tagname = "TOPIC" and id->document = e->getDocument()
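To make the join and the two-level ordering concrete, here is a minimal Python sketch of the same logic. It is an illustration under stated assumptions, not the O2 query plan: xml_counts stands in for the result of the full-text part, attributes for the attribute collection (flattened to tuples), and the function name mixed_query and all sample data are hypothetical.

```python
from datetime import date

# Hypothetical result of the full-text part: occurrences of "XML" per document.
xml_counts = {"doc_a": 2, "doc_b": 1, "doc_c": 3, "doc_d": 3}

# Hypothetical stand-in for the attribute collection, flattened to
# (attribute value, tag name of the owning element, document, document date).
attributes = [
    ("ODBMS", "TOPIC", "doc_a", date(1999, 5, 1)),
    ("ODBMS", "TOPIC", "doc_c", date(1998, 11, 20)),
    ("ODBMS", "TOPIC", "doc_d", date(1999, 2, 14)),
    ("ODBMS", "KEYWORD", "doc_b", date(1999, 7, 9)),
    ("XML", "TOPIC", "doc_b", date(1999, 7, 9)),
]


def mixed_query(counts, attrs, topic):
    """Documents about `topic` that contain the word, ranked by count, then date."""
    rows = [
        (doc, counts[doc], doc_date)
        for value, tagname, doc, doc_date in attrs
        # Structured constraint: a TOPIC element carrying the requested value.
        if value == topic and tagname == "TOPIC"
        # Join constraint, as in "id->document = e->getDocument()".
        and doc in counts
    ]
    # The occurrence count dominates; the date only orders documents
    # that have the same count. Both keys are sorted in descending order.
    rows.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return [doc for doc, _, _ in rows]


print(mixed_query(xml_counts, attributes, "ODBMS"))
# prints ['doc_d', 'doc_c', 'doc_a']
```

Here doc_d and doc_c both contain three occurrences, so the more recent doc_d comes first; doc_b is excluded because its "ODBMS" attribute does not belong to a TOPIC element.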
The Navigation Interface
Starting from an OQL query, O2Web is a convenient tool for interacting with users. Far richer than the SQL view concept, the OQL class constructor makes it possible to process and present any query result in any desired form. For example, for each document found we add its related topics and, for each of these topics, a link to other documents about the same topic, each link in turn being a newly generated OQL query. Formally, if Q is a query, this means creating a class C with the same structure as Q's result, implementing
its reserved HTML presentation methods ("html_header", "html_report", etc.) in C++ or the O2C language, and submitting C(Q) to O2Web. The following query illustrates our most complex process, which enhances the textual content of a document:
EnhancedContent(element(
    select r: d->getElementByTagName("HTML"),
           l: (select struct(e->start, e->end)
               from w in TheWords, id in w->ids, e in id->ies
               where w->name = "XML" and id->document = d)
    from d in TheDocuments
    where d->name = "D"))
EnhancedContent is a class whose presentation method executes a getEnhancedContent method on the root of the retrieved document tree. getEnhancedContent recursively reconstitutes the document content, adding links between successive relevant text occurrences ("XML" in the preceding example), from its list argument of relevant positions. With this process, we can also retrieve and reconstitute just the relevant fragments of a document, such as titles, paragraphs, tables, etc. (see Figure 5.6).
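The recursive reconstitution can be sketched as follows in Python. This is not the getEnhancedContent method itself, which is written in C++ or O2C against the O2 schema; the Element class, the function name enhanced_content, the anchor naming scheme ("hit0", "hit1", ...) and the convention that positions are character offsets into the document text (tags excluded, end exclusive) are all assumptions made for the illustration. Each occurrence is further assumed to lie within a single text node.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Element:
    """Hypothetical minimal tree node: a tag whose children are text or elements."""
    tag: str
    children: list = field(default_factory=list)


def enhanced_content(root, positions):
    """Rebuild the markup of `root`, linking each relevant occurrence to the next."""
    hits = sorted(positions)
    offset = 0  # number of text characters emitted so far

    def link(index, text):
        # Every hit is a named anchor; all but the last point to their successor.
        target = f' href="#hit{index + 1}"' if index + 1 < len(hits) else ""
        return f'<a name="hit{index}"{target}>{escape(text)}</a>'

    def render_text(text):
        nonlocal offset
        out, cursor = [], 0
        for index, (start, end) in enumerate(hits):
            lo, hi = start - offset, end - offset
            if lo < 0 or hi > len(text):
                continue  # this occurrence belongs to another text node
            out.append(escape(text[cursor:lo]))
            out.append(link(index, text[lo:hi]))
            cursor = hi
        out.append(escape(text[cursor:]))
        offset += len(text)
        return "".join(out)

    def render(node):
        if isinstance(node, str):
            return render_text(node)
        # Recursive step: rebuild the children, then wrap them in the tag.
        inner = "".join(render(child) for child in node.children)
        return f"<{node.tag}>{inner}</{node.tag}>"

    return render(root)


doc = Element("HTML", [
    Element("H1", ["XML and OQL"]),
    Element("P", ["Storing XML in O2"]),
])
print(enhanced_content(doc, [(0, 3), (19, 22)]))
# prints <HTML><H1><a name="hit0" href="#hit1">XML</a> and OQL</H1><P>Storing <a name="hit1">XML</a> in O2</P></HTML>
```

Restricting the output to relevant fragments would amount to rendering only the subtrees (titles, paragraphs, tables) that contain at least one hit.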
