Five applications of bio-ontologies in data-driven science
Ontology has become the method of choice throughout most of biology and biomedicine for constructing and maintaining standardised terminology for database annotations. This was the primary motivation for the development of the Gene Ontology, and it remains a pressing requirement throughout computer-assisted science in many different fields. This is thus the first application of bio-ontology in data-driven science:
- Standardised vocabulary with definitions and synonyms for unified database annotations across different databases
Most of the OBO Foundry principles are designed to facilitate this objective, in particular the emphasis on the stability of identifiers. Stably maintained identifiers are an essential requirement if annotations to an ontology are to be created across multiple databases, because different update and release cycles will certainly result in dead links within downstream databases if IDs just disappear from the source ontology. Also, using semantics-free (numeric) identifiers means that the identifiers can remain stable while the underlying ontology changes. Having a clearly delineated scope for a particular ontology that does not overlap with other ontologies, another OBO principle, is also very helpful for database annotation standardisation, since a plethora of similar-sounding options from different ontologies bewilders curators and deters them from using ontologies for annotation.
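To make the point about stable, semantics-free identifiers concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `EX:` identifiers, the `Term` class and the `resolve` and `find_by_name` helpers are hypothetical and do not come from any real ontology or library. The idea is simply that a retired ID stays in the ontology, marked obsolete and pointing at its replacement, so that a downstream database holding the old ID can still resolve it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    term_id: str                       # semantics-free and never reused
    label: str                         # free to change between releases
    synonyms: List[str] = field(default_factory=list)
    obsolete: bool = False
    replaced_by: Optional[str] = None  # successor of an obsolete term, if any


# Hypothetical terms, for illustration only.
ONTOLOGY: Dict[str, Term] = {
    "EX:0000001": Term("EX:0000001", "glucose metabolic process",
                       synonyms=["glucose metabolism"]),
    "EX:0000002": Term("EX:0000002", "sugar breakdown",
                       obsolete=True, replaced_by="EX:0000001"),
}


def resolve(term_id: str, ontology: Dict[str, Term] = ONTOLOGY) -> Term:
    """Return the live term for an ID, following replaced_by links."""
    seen = set()
    while term_id not in seen:
        seen.add(term_id)
        term = ontology[term_id]  # a KeyError here would be a dead link
        if not term.obsolete:
            return term
        if term.replaced_by is None:
            raise LookupError(f"{term_id} is obsolete with no replacement")
        term_id = term.replaced_by
    raise ValueError(f"replacement cycle at {term_id}")


def find_by_name(name: str,
                 ontology: Dict[str, Term] = ONTOLOGY) -> Optional[Term]:
    """Find a live term whose label or synonym matches, ignoring case."""
    wanted = name.casefold()
    for term in ontology.values():
        if term.obsolete:
            continue
        if any(wanted == n.casefold() for n in [term.label, *term.synonyms]):
            return term
    return None
```

An annotation made years ago against `EX:0000002` still resolves through `resolve("EX:0000002")`, and a curator typing a synonym lands on the same term as one typing the preferred label.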
This is a very good use case for ontologies, but it by no means exhausts the potential benefits of bio-ontologies for science. Another benefit has to do with the need for aggregation in data-driven scientific research and in searching, browsing and visualising scientific data. The second application is thus:
- Hierarchical organisation for aggregation and multi-level comparison of results
Science is about looking for generalities and patterns. The discovery of patterns and generalities allows predictions to be made and contributes to our understanding of the natural world. Obviously, if we only ever look at single examples of things (such as genes), with no way to group them together to discover whether a particular group has something interesting in common (such as functions), we can't see patterns or make predictions. So, hierarchical structure is very important for organising subject matter across many different domains of science. Ontologies provide a very generic and flexible structure that can be organised hierarchically to arbitrary depth. However, to obtain the maximum benefit from the hierarchical structure for data aggregation, the ontology must be is-a complete, that is, all the entities in the ontology must be provided with a taxonomic parent (another OBO principle).
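The aggregation idea can be sketched in a few lines of Python. The term IDs, gene names and the `ancestors` and `rollup` helpers below are all hypothetical, made up for this example. Each gene is counted once at every term it is annotated to and at every is-a ancestor of that term, so results can be compared at whatever level of the hierarchy is useful.

```python
from collections import Counter
from typing import Dict, List, Set

# Hypothetical is-a hierarchy: child -> list of parents.
IS_A: Dict[str, List[str]] = {
    "EX:0000010": [],              # molecular function (root)
    "EX:0000011": ["EX:0000010"],  # catalytic activity
    "EX:0000012": ["EX:0000011"],  # kinase activity
    "EX:0000013": ["EX:0000011"],  # phosphatase activity
}

# Hypothetical gene annotations: gene -> directly annotated terms.
ANNOTATIONS: Dict[str, Set[str]] = {
    "geneA": {"EX:0000012"},
    "geneB": {"EX:0000013"},
    "geneC": {"EX:0000012"},
}


def ancestors(term: str, is_a: Dict[str, List[str]] = IS_A) -> Set[str]:
    """Return the term itself plus everything reachable via is-a."""
    found: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(is_a.get(current, []))
    return found


def rollup(annotations: Dict[str, Set[str]],
           is_a: Dict[str, List[str]] = IS_A) -> Counter:
    """Count genes per term, propagating annotations up the hierarchy."""
    counts: Counter = Counter()
    for terms in annotations.values():
        covered: Set[str] = set()
        for term in terms:
            covered |= ancestors(term, is_a)
        counts.update(covered)  # each gene counts once per term
    return counts


counts = rollup(ANNOTATIONS)
```

Here two genes are counted at kinase activity and one at phosphatase activity, but all three show up at catalytic activity. This is also where is-a completeness matters: a term with no parent would never contribute to any higher-level group.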
One of the most pressing challenges in data-driven biology is the proliferation of databases. Well, it's a good thing: more databases => more open data. But more and more bioinformaticians are spending a substantial portion of their time trying to unify the available data from multiple different sources in order to inform and address their primary research question. Unifying based on names and other metadata is hard, patchy work. But if the data are annotated with a shared ontology, the integration has already been done at the time of the annotation, and doesn't have to be re-done by every researcher who needs to consume the data. That's the third application:
- Community adoption for easy comparison of results to other project results worldwide
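As a rough illustration of why shared annotation amounts to integration done in advance, consider two hypothetical databases with completely different record layouts that both happen to annotate with the same ontology IDs. The database contents, field names and the `integrate` helper are assumptions invented for this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical database one: (protein accession, ontology term ID).
DB_ONE: List[Tuple[str, str]] = [
    ("P001", "EX:0000012"),
    ("P002", "EX:0000013"),
]

# Hypothetical database two: a different layout, but the same term IDs.
DB_TWO: List[Dict[str, str]] = [
    {"sample": "S-17", "term": "EX:0000012"},
    {"sample": "S-18", "term": "EX:0000012"},
]


def integrate(db_one: List[Tuple[str, str]],
              db_two: List[Dict[str, str]]) -> Dict[str, Dict[str, List[str]]]:
    """Group records from both sources under their shared term ID."""
    merged: Dict[str, Dict[str, List[str]]] = defaultdict(
        lambda: {"proteins": [], "samples": []}
    )
    for accession, term_id in db_one:
        merged[term_id]["proteins"].append(accession)
    for record in db_two:
        merged[record["term"]]["samples"].append(record["sample"])
    return dict(merged)


merged = integrate(DB_ONE, DB_TWO)
```

No name matching or metadata reconciliation is needed: the join key is the ontology ID that the curators of both databases already agreed on.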
Unfortunately, getting different projects to agree to use a shared ontology is hard, hard work. Using the shared ontology is also hard work -- depending on the strength of the infrastructure available to the ontology developers, they may or may not be able to address requests or resolve disagreements within a time frame that is acceptable to the consumers of the ontology, who certainly will have their own deadlines, priorities and pressures. Success in this regard comes down to availability of infrastructural funding and to good project management. Like database maintenance, ontology development and maintenance cannot be funded by research funding alone.
All of the above applications were already well served by the OBO ontology format, but additional expressivity as provided by the new OWL 2 ontology language allows for an interesting juxtaposition of the capabilities of artificial intelligence technology with the needs of scientific data management. Application four:
- Explicit relationships and underlying logical definitions for automated reasoning
Traditionally, logic-based artificial intelligence developed rather separately from computational methods in the life sciences. The latter have emphasised statistical approaches, with neural networks and similar models being used to build predictive models from large-scale data. However, where knowledge is being captured, such as in the curation of biological knowledge to support sophisticated query interfaces for biological databases, logic-based AI provides the perfect underlying formalism for capturing and encoding the content. Even better, the sort of inference that can be drawn from a properly developed logic-based knowledge base allows automated error detection, assisting curators in the management of large-scale data. Here, the uptake of this powerful technology is limited by how usable the tools for encoding logical axioms are for non-logicians (i.e. biological domain experts). Currently the most widely used OWL editing tool is the Protégé ontology editor. Protégé is a fantastic tool, but its usability still lags behind what would be required to ensure widespread adoption by a community of scientists. A two-pronged approach is needed: on the one hand, increased investment in sophisticated OWL and Protégé training aimed at biological audiences, and on the other hand, investment in greater usability for the tool itself -- or in alternative tools aimed more at biologists.
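The kind of automated error detection described above can be illustrated with a toy example. This is not an OWL reasoner and does not use any OWL library. It is a hand-rolled sketch with made-up class names that handles only subclass and disjointness axioms, assuming that a class inferred to sit under two disjoint classes signals a curation error.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical subclass axioms: class -> asserted direct superclasses.
SUBCLASS_OF: Dict[str, Set[str]] = {
    "organic molecule": {"molecule"},
    "inorganic molecule": {"molecule"},
    "glucose": {"organic molecule"},
    # Curation error: asserted under two classes declared disjoint.
    "mystery salt": {"organic molecule", "inorganic molecule"},
    # Looks harmless on its own, but inherits the problem above.
    "mystery salt hydrate": {"mystery salt"},
}

# Hypothetical disjointness axioms: nothing can belong to both classes.
DISJOINT: List[Tuple[str, str]] = [("organic molecule", "inorganic molecule")]


def superclasses(cls: str,
                 subclass_of: Dict[str, Set[str]] = SUBCLASS_OF) -> Set[str]:
    """Return the class itself plus all inferred (transitive) superclasses."""
    found: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(subclass_of.get(current, ()))
    return found


def unsatisfiable(subclass_of: Dict[str, Set[str]] = SUBCLASS_OF,
                  disjoint: List[Tuple[str, str]] = DISJOINT
                  ) -> List[Tuple[str, str, str]]:
    """List classes whose inferred superclasses include a disjoint pair."""
    problems = []
    for cls in sorted(subclass_of):
        inferred = superclasses(cls, subclass_of)
        for left, right in disjoint:
            if left in inferred and right in inferred:
                problems.append((cls, left, right))
    return problems


problems = unsatisfiable()
```

The second flagged class has only one asserted parent, so a curator looking at it in isolation would see nothing wrong. The error only surfaces through inference, which is the point of having the logic underneath.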
Finally, an exciting direction in the current generation of bio-ontologies is that ontologies are becoming increasingly interlinked. This interlinking -- for example, links from the Gene Ontology to the ChEBI ontology to explicitly represent chemical participation in biological processes, or links from Gene Ontology processes to the anatomical locations where the processes take place -- shows vast potential for enabling the sort of whole-systems scientific modelling that is needed to transform basic knowledge about biology into predictive models and simulations, which would allow scientists to design and investigate perturbations for explicit therapeutic endpoints in silico. The fifth application is thus:
- Explicit bridging relationships between different ontologies for exploring underlying mechanisms
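A small sketch shows what such bridging relationships make possible. The `PROC:`, `CHEM:` and `ANAT:` identifiers are placeholders standing in for terms from a process, a chemical and an anatomy ontology. The relation names and the helper functions are likewise assumptions chosen for illustration, not taken from any real ontology release.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical cross-ontology links: (subject, relation, object).
TRIPLES: List[Triple] = [
    ("PROC:0000001", "has_participant", "CHEM:0000001"),
    ("PROC:0000001", "occurs_in", "ANAT:0000001"),
    ("PROC:0000002", "has_participant", "CHEM:0000002"),
    ("PROC:0000002", "occurs_in", "ANAT:0000002"),
    ("PROC:0000003", "has_participant", "CHEM:0000001"),
    ("PROC:0000003", "occurs_in", "ANAT:0000002"),
]


def objects(subject: str, relation: str,
            triples: List[Triple] = TRIPLES) -> Set[str]:
    """Everything the subject points to via the given relation."""
    return {o for s, r, o in triples if s == subject and r == relation}


def subjects(relation: str, obj: str,
             triples: List[Triple] = TRIPLES) -> Set[str]:
    """Everything that points to the object via the given relation."""
    return {s for s, r, o in triples if r == relation and o == obj}


def chemicals_in(location: str, triples: List[Triple] = TRIPLES) -> Set[str]:
    """Chemicals taking part in any process that occurs in the location."""
    found: Set[str] = set()
    for process in subjects("occurs_in", location, triples):
        found |= objects(process, "has_participant", triples)
    return found


def locations_of(chemical: str, triples: List[Triple] = TRIPLES) -> Set[str]:
    """Anatomical locations of any process the chemical takes part in."""
    found: Set[str] = set()
    for process in subjects("has_participant", chemical, triples):
        found |= objects(process, "occurs_in", triples)
    return found
```

Neither question ("which chemicals are active in this tissue?" or "where in the body does this chemical take part in something?") can be answered from any one of the three ontologies alone. It is the explicit links between them that turn separate vocabularies into something you can traverse.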
Those are five that I can think of. What have I left out?