Five applications of bio-ontologies in data-driven science
Ontology has become the method of choice throughout most of biology and biomedicine for constructing and maintaining standardised terminologies for database annotation. This was the primary motivation for the development of the Gene Ontology, and it remains a pressing requirement throughout computer-assisted science in many different fields. This is thus the first application of bio-ontology in data-driven science:
- Standardised vocabulary with definitions and synonyms for unified database annotations across different databases
Most of the OBO Foundry principles are designed to facilitate this objective -- in particular, the emphasis on stable identifiers. Stably maintained identifiers are an essential requirement if annotations to an ontology are to be created across multiple databases, because different update and release cycles will certainly result in dead links in downstream databases if IDs simply disappear from the source ontology. Using semantics-free (numeric) identifiers also means that the identifiers can remain stable while the underlying ontology changes. Having a clearly delineated scope for a particular ontology that does not overlap with other ontologies, another OBO principle, is also very helpful for database annotation standardisation, since a plethora of similar-sounding options from different ontologies bewilders curators and deters them from using ontologies for annotation.
This is a very good use case for ontologies, but it by no means exhausts the potential benefits of bio-ontologies for science. Another benefit has to do with the need for aggregation in data-driven scientific research and in searching, browsing and visualising scientific data. The second application is thus:
- Hierarchical organisation for aggregation and multi-level comparison of results
Science is about looking for generalities and patterns. The discovery of patterns and generalities allows predictions to be made and contributes to our understanding of the natural world. Obviously, if we only ever look at single examples of things (such as genes), with no way to group them together to discover whether a particular group has something interesting in common (such as a function), we can't see patterns or make predictions. So, hierarchical structure is very important for organising subject matter across many different domains of science. Ontologies provide a very generic and flexible structure that can be organised hierarchically to arbitrary depth. However, to obtain the maximum benefit from the hierarchical structure for data aggregation, the ontology must be is-a complete; that is, every entity in the ontology must have a taxonomic parent (another OBO principle).
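As a toy illustration of aggregation over an is-a hierarchy, here is a minimal sketch in which genes annotated to specific terms are counted towards every ancestor term. The term names, gene names and dictionary-based representation are invented for this sketch and are not drawn from any real ontology:

```python
from collections import defaultdict

# Toy is-a hierarchy (child -> parents); terms are illustrative, not real GO terms.
is_a = {
    "glucose metabolism": ["carbohydrate metabolism"],
    "fructose metabolism": ["carbohydrate metabolism"],
    "carbohydrate metabolism": ["metabolism"],
}

def ancestors(term):
    """All terms reachable by following is-a links upward."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen

# Direct annotations of genes to the most specific term known.
annotations = {
    "geneA": "glucose metabolism",
    "geneB": "fructose metabolism",
}

# Aggregate: each gene also counts towards every ancestor of its term,
# so patterns invisible at the leaf level show up at broader groupings.
counts = defaultdict(set)
for gene, term in annotations.items():
    counts[term].add(gene)
    for anc in ancestors(term):
        counts[anc].add(gene)

print(sorted(counts["carbohydrate metabolism"]))  # ['geneA', 'geneB']
```

Note how the two genes share no direct annotation, yet are grouped together one level up -- and why a term missing its is-a parent would silently drop out of every such aggregate.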
One of the most pressing challenges in data-driven biology is the proliferation of databases. Well, it's a good thing: more databases => more open data. But more and more bioinformaticians are spending a substantial portion of their time trying to unify the available data from multiple different sources in order to inform and address their primary research question. Unifying based on names and other metadata is hard, patchy work. But if the data are annotated with a shared ontology, the integration has already been done at the time of the annotation, and doesn't have to be re-done by every researcher who needs to consume the data. That's the third application:
- Community adoption for easy comparison of results to other project results worldwide
Unfortunately, getting different projects to agree to use a shared ontology is hard, hard work. Using the shared ontology is also hard work -- depending on the strength of the infrastructure available to the ontology developers, they may or may not be able to address requests or resolve disagreements within a time frame that is acceptable to the consumers of the ontology, who certainly will have their own deadlines, priorities and pressures. Success in this regard comes down to availability of infrastructural funding and to good project management. Like database maintenance, ontology development and maintenance cannot be funded by research funding alone.
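The payoff of annotation-time integration can be sketched with a toy example (the identifiers and records below are invented): when two sources annotate against the same ontology identifiers, combining them is a simple key join rather than fuzzy matching on names and metadata:

```python
# Two hypothetical data sources annotated with the same (made-up) ontology IDs.
source_1 = {"EX:0001": {"expression": 2.4}, "EX:0002": {"expression": 0.7}}
source_2 = {"EX:0001": {"phenotype": "lethal"}, "EX:0003": {"phenotype": "viable"}}

# Because both sources share identifiers, integration is a plain key join;
# no name normalisation or fuzzy matching is needed.
merged = {}
for source in (source_1, source_2):
    for term_id, record in source.items():
        merged.setdefault(term_id, {}).update(record)

print(merged["EX:0001"])  # {'expression': 2.4, 'phenotype': 'lethal'}
```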
All of the above applications were already well served by the OBO ontology format, but additional expressivity as provided by the new OWL 2 ontology language allows for an interesting juxtaposition of the capabilities of artificial intelligence technology with the needs of scientific data management. Application four:
- Explicit relationships and underlying logical definitions for automated reasoning
Traditionally, logic-based artificial intelligence developed rather separately from computational methods in the life sciences. The latter have emphasised statistical artificial intelligence approaches, with neural networks and similar models being used to build predictive models from large-scale data. However, where knowledge is being captured, such as in the curation of biological knowledge to enable sophisticated query interfaces for biological databases, logic-based AI provides the perfect underlying formalism for capturing and encoding the content. Even better, the sort of inferences that can be drawn from a properly developed logic-based knowledge base allow automated error detection, assisting curators in the management of large-scale data. Here, the uptake of this powerful technology is limited by how usable the tools for encoding logical axioms are for non-logicians (i.e. biological domain experts). Currently the most widely used OWL editing tool is the Protégé ontology editor. Protégé is a fantastic tool, but its usability still lags behind what would be required to ensure widespread adoption by a community of scientists, i.e. non-logicians. Here, a two-pronged approach is needed: on the one hand, increased investment in sophisticated OWL and Protégé training aimed at biological audiences, and on the other, investment in improving the usability of the tool itself -- or in alternative tools aimed more squarely at biologists.
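A minimal sketch of the kind of inference involved, using invented class names and a naive transitive-closure "reasoner" -- nothing like a production OWL reasoner, but enough to show both inference and automated error detection:

```python
from itertools import product

# Toy asserted axioms (illustrative only, not from any real ontology).
sub_class_of = {("glucose metabolism", "metabolism"),
                ("metabolism", "biological process"),
                ("chlorophyll", "chemical entity")}
disjoint = {frozenset({"biological process", "chemical entity"})}

def closure(pairs):
    """Transitive closure of subClassOf -- the simplest possible reasoning step."""
    closed = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(closed, repeat=2):
            if b == c and (a, d) not in closed:
                closed.add((a, d))
                changed = True
    return closed

inferred = closure(sub_class_of)
# Inferred, never directly asserted:
assert ("glucose metabolism", "biological process") in inferred

# Automated error detection: a curation mistake placing 'chlorophyll'
# under 'metabolism' makes it a 'biological process', violating disjointness.
mistake = closure(sub_class_of | {("chlorophyll", "metabolism")})
supers = {b for (a, b) in mistake if a == "chlorophyll"}
for d in disjoint:
    if d <= supers:
        print("inconsistency detected:", sorted(d))
```

Real OWL reasoners go far beyond this -- but the example shows why encoding disjointness and other axioms pays off: the mistake is caught mechanically, not by a curator's eye.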
Finally, an exciting direction in which the current generation of bio-ontologies is expanding is that ontologies are becoming increasingly interlinked. This interlinking -- for example, links from the Gene Ontology to the ChEBI ontology to explicitly represent chemical participation in biological processes, or links from Gene Ontology processes to the anatomical locations where those processes take place -- shows vast potential for enabling the sort of whole-systems scientific modelling that is needed to transform basic knowledge about biology into predictive models and simulations, allowing scientists to design and investigate perturbations for explicit therapeutic endpoints in silico. The fifth application is thus:
- Explicit bridging relationships between different ontologies for exploring underlying mechanisms
Those are five that I can think of. What have I left out?
Standardisation and intelligent querying for chemical biology screening experiments
On Thursday, 8th March, I visited the Center for Computational Science at the University of Miami. Note, not computer science but computational science: the center is a broad-ranging interdisciplinary think-tank for projects that need to use innovative computing in order to tackle challenging questions at the frontiers of research in the life sciences, social sciences, physical sciences, and other fields. The center brings together interdisciplinary teams of experienced computer scientists, domain science researchers and software engineers to create computational solutions and new methods.
Being there for only one day meant that I didn't get to enjoy the Miami beaches as I would have liked, but I did enjoy some good views of downtown Miami and the fresh sea breeze from the hotel. The reason for the visit was to learn more about the BioAssay Ontology being developed by an interdisciplinary team headed by Dr Stephan Schürer.
The BioAssay Ontology (BAO) was developed to address the standardisation of assay descriptions in the PubChem BioAssay database. Bioassays in PubChem are deposited by screening centers performing high-throughput and high-content screening experiments across a wide variety of scientific topics, technology platforms and experimental design strategies. For various reasons having to do with historical legacy, the description of the assay experiments in PubChem was largely free text, meaning that similar experiments were frequently described differently and that it was nearly impossible to compare and aggregate results across experiments originating from different screening centers, where different methodologies and standards may be internally applicable.
BAO provides standardised terminology for all aspects of chemical biology screening experiment description organised into a hierarchy and supplemented by OWL axioms to encode additional semantics and to allow for more advanced automated reasoning across assays annotated to BAO. It is therefore a prime candidate for adoption within the EU-OPENSCREEN project. The challenges faced by EU-OPENSCREEN will be similar -- aggregation and comparison of results between experiments originating in different screening centers -- but given the pan-European nature of the project and required infrastructure, ontology-backed standardisation will be an essential component from the very beginning of the project in order to elegantly deal with the issues arising from different national languages and local or national operational paradigms and constraints.
However, I had some concerns with an earlier version of BAO that was presented at ICBO 2011. One concern was that the classification hierarchy allowed incorrect inferences to be made. For example, they had 'small molecule' classified as subClassOf 'perturbagen'. This means that all small molecules are perturbagens. Now, I am pretty sure I can think of some very inert small molecules that certainly do not perturb any biological systems (diamonds?). Other concerns were the lack of alignment to upper level ontologies and some other OBO ontologies such as OBI. However, I was delighted to discover that the new version of BAO, 2.0, currently under development, includes extensive refactoring and alignment to BFO. In the new version, small molecules that are active in experiments are much more sensibly encoded as 'small molecule' and has_role some 'perturbagen role'. BAO 2.0 has not been released yet, but will be soon. Other updates that we enjoyed during the day-long workshop included a sneak preview of new developments in the intelligent ontology-based search interface BAOSearch and a fascinating presentation of an ongoing interdisciplinary project to discover agents that are able to stimulate neuronal regeneration as treatments for spinal cord injuries, where novel forms of high-content image-based screening technology are being developed as a core component of the scientific methodology.
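The difference between the two modelling choices can be caricatured in a few lines of code. The data structures below are invented for illustration and bear no relation to BAO's actual OWL encoding:

```python
# BAO 1.x style: subclassing -- EVERY small molecule is inferred to be a perturbagen.
subclass_of = {"small molecule": "perturbagen"}

def is_perturbagen_v1(individual):
    return subclass_of.get(individual["type"]) == "perturbagen"

# BAO 2.0 style: the perturbagen role is attached only where it actually applies.
compounds = [
    {"type": "small molecule", "roles": {"perturbagen role"}},  # active in an assay
    {"type": "small molecule", "roles": set()},                 # inert compound
]

def is_perturbagen_v2(individual):
    return "perturbagen role" in individual["roles"]

print([is_perturbagen_v1(c) for c in compounds])  # [True, True] -- over-commits
print([is_perturbagen_v2(c) for c in compounds])  # [True, False]
```

The role-based encoding makes being a perturbagen a contingent fact about a molecule in a particular experimental context, rather than a necessary truth about its kind -- which is exactly the distinction the inert-compound objection turns on.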
The new version of BAO is therefore the leading contender for adoption in the EU-OPENSCREEN project for the standardisation of assay descriptions in a centralised EU-wide chemical biology database. This means that European and US-based chemical biology assays will be automatically integrated, and scientists will be able to compare and contrast results across a rapidly widening collection of openly available experiments with an increasingly global perspective.