The role of ontologies in data integration
As data repositories grow in both size and number, data integration becomes an ever more pressing challenge. In bioinformatics, bio-ontologies have long been used to assist data integration pipelines, mainly through their provision of:
(a) semantics-free content identifiers; and
(b) a wealth of relevant and useful synonyms.
But perhaps ontologies can do even more to support knowledge-based data integration. For starters, integration pipelines may be able to incorporate rules based on heterogeneous ontology relationships. This is already being done at the University of Manchester, where ChEBI chemical relationships are used to introduce chemistry-aware fuzzy integration of metabolic network data in order to plug gaps (more details here). Another example, when integrating tissue sample data, would be to traverse the part_of relation: a sample from the left lung is also a sample from the lung (traversing <left lung part_of lung>).
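To make the part_of example concrete, here is a minimal sketch in Python. The tiny anatomy fragment, the sample identifiers and the function names are all made up for illustration; a real pipeline would load the relationships from an actual anatomy ontology.

```python
# Hypothetical toy fragment of an anatomy ontology:
# each term maps to the terms it is directly part_of.
PART_OF = {
    "left lung": ["lung"],
    "right lung": ["lung"],
    "lung": ["respiratory system"],
}


def part_of_closure(term, part_of=PART_OF):
    """Return the term plus everything it is transitively part_of."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in part_of.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def samples_for(tissue, samples, part_of=PART_OF):
    """Select samples annotated to the query tissue or to any part of it."""
    return [
        sample_id
        for sample_id, annotated_tissue in samples
        if tissue in part_of_closure(annotated_tissue, part_of)
    ]


# Made-up sample annotations: (sample id, annotated tissue).
samples = [("S1", "left lung"), ("S2", "right lung"), ("S3", "liver")]
print(samples_for("lung", samples))
```

A query for "lung" picks up the samples annotated to "left lung" and "right lung" without anyone having to re-annotate them.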
Another interesting possibility would be to use the nearest common ancestor to find a possible integration path for stubborn related data. With such a method, for example, data for cholest-5-en-3-one (CHEBI:63906) and pregn-5-ene-3,20-dione (CHEBI:63837) could be integrated around their shared parent 3-oxo delta(5)-steroid (CHEBI:47907). And 4,5-dihydrocortisone (CHEBI:23736) could be integrated with cholest-5-en-3-one around the shared ancestor 3-oxo-steroid. You would need to set the allowed traversal depth in such a pipeline (i.e. it probably isn't much use to integrate around a high-level class such as "molecular entity").
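As a rough sketch of how this could work, the Python below finds the nearest common ancestor of two terms and gives up if that ancestor lies beyond a depth limit. The hierarchy is a heavily simplified toy: only the classes named above appear, the is_a links between them are assumed to be direct, and the names `IS_A`, `nearest_common_ancestor` and `max_depth` are illustrative rather than part of any ChEBI tooling.

```python
from collections import deque

# Simplified toy hierarchy (child -> direct is_a parents). The real ChEBI
# graph has many more intermediate classes; these direct links are assumed.
IS_A = {
    "cholest-5-en-3-one": ["3-oxo delta(5)-steroid"],
    "pregn-5-ene-3,20-dione": ["3-oxo delta(5)-steroid"],
    "3-oxo delta(5)-steroid": ["3-oxo-steroid"],
    "4,5-dihydrocortisone": ["3-oxo-steroid"],
    "3-oxo-steroid": ["molecular entity"],
}


def ancestor_distances(term, is_a=IS_A):
    """Map the term and each of its ancestors to the minimum number of is_a steps."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in is_a.get(current, []):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def nearest_common_ancestor(a, b, max_depth, is_a=IS_A):
    """Return the closest shared ancestor within max_depth steps of both terms, or None."""
    dist_a = ancestor_distances(a, is_a)
    dist_b = ancestor_distances(b, is_a)
    candidates = [
        t for t in dist_a
        if t in dist_b and max(dist_a[t], dist_b[t]) <= max_depth
    ]
    if not candidates:
        return None
    # Closest = smallest combined distance; ties broken by name for determinism.
    return min(candidates, key=lambda t: (dist_a[t] + dist_b[t], t))


print(nearest_common_ancestor("cholest-5-en-3-one", "pregn-5-ene-3,20-dione", max_depth=2))
print(nearest_common_ancestor("4,5-dihydrocortisone", "cholest-5-en-3-one", max_depth=2))
print(nearest_common_ancestor("4,5-dihydrocortisone", "cholest-5-en-3-one", max_depth=1))
```

With the depth limit set to 2, "molecular entity" is never offered as an integration point in this toy graph, and with a limit of 1 the second pair is simply left unintegrated.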
The below was presented at the EBI Industry Programme Workshop on data integration last week here in Hinxton.
Semantic similarity is used in many applications to exploit the knowledge contained in ontologies and give a measure of relatedness for two ontology terms, say A and B. Typically, structural features of the ontology are used in this computation, such as the distance (in number of relationships) between A and B, the nearest common ancestor they share, the number of relationships of each type that they have, and so on. ChEBI has been used to derive semantic similarity between chemical entities and their roles for various applications, one of which is described here.
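For a feel of what a purely structural measure looks like, here is a minimal Python sketch that scores two terms by the overlap of their ancestor sets (a Jaccard index). This is just one simple choice among many, picked for brevity; the toy hierarchy and the function names are hypothetical.

```python
# Hypothetical toy hierarchy: child -> direct is_a parents.
IS_A = {
    "A": ["C"],
    "B": ["C"],
    "C": ["root"],
    "D": ["root"],
}


def ancestors(term, is_a=IS_A):
    """Return the term together with all of its is_a ancestors."""
    result = {term}
    stack = [term]
    while stack:
        for parent in is_a.get(stack.pop(), []):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


def ancestor_jaccard(a, b, is_a=IS_A):
    """Similarity in [0, 1]: shared ancestors divided by all ancestors of either term."""
    anc_a = ancestors(a, is_a)
    anc_b = ancestors(b, is_a)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


print(ancestor_jaccard("A", "B"))  # siblings under C
print(ancestor_jaccard("A", "D"))  # only the root in common
```

Siblings A and B score higher than A and D, which share nothing but the root.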
Now, in new work arising from a lab with prominent expertise in semantic similarity for biomedical ontologies and related applications, an element of OWL logical expressivity has for the first time been used to enhance a measure of semantic similarity, over and above the topological or OBO-style features of hierarchy, graph structure and relationship types. In the article "Exploiting disjointness axioms to improve semantic similarity measures", recently published in Bioinformatics, authors João D. Ferreira and colleagues explain how they include disjointness axioms from the source ontology to enhance the measure of semantic similarity. The basic intuition is that the existence of a disjointness axiom between any two entities A and B in the ontology should make them less similar, i.e. give them a lower semantic similarity score, than if there were no such disjointness axiom. The same should apply to all the descendants of A and B.
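The intuition can be illustrated with a small Python sketch. To be clear, this is not the formula from the article: it simply takes any base similarity score and scales it down by an arbitrary, assumed `penalty` factor when the two terms fall under classes that are declared disjoint. The toy hierarchy, the `DISJOINT` set and the function names are all hypothetical.

```python
# Hypothetical toy hierarchy: child -> direct is_a parents.
IS_A = {
    "A1": ["A"],
    "B1": ["B"],
    "A": ["root"],
    "B": ["root"],
    "C": ["root"],
}

# Declared disjointness axioms, stored as unordered pairs of classes.
DISJOINT = {frozenset({"A", "B"})}


def ancestors(term, is_a=IS_A):
    """Return the term together with all of its is_a ancestors."""
    result = {term}
    stack = [term]
    while stack:
        for parent in is_a.get(stack.pop(), []):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


def inherits_disjointness(a, b, is_a=IS_A, disjoint=DISJOINT):
    """True if a (or one of its ancestors) is declared disjoint with b (or one of its ancestors)."""
    anc_a = ancestors(a, is_a)
    anc_b = ancestors(b, is_a)
    return any(frozenset({x, y}) in disjoint for x in anc_a for y in anc_b)


def penalised_similarity(base_score, a, b, penalty=0.5, is_a=IS_A, disjoint=DISJOINT):
    """Scale base_score by the (assumed) penalty factor when a and b are disjoint."""
    if inherits_disjointness(a, b, is_a, disjoint):
        return base_score * penalty
    return base_score


print(penalised_similarity(0.8, "A1", "B1"))  # A1 and B1 inherit the A/B axiom
print(penalised_similarity(0.8, "A1", "C"))   # no axiom applies
```

Because the check walks up the hierarchy, the descendants A1 and B1 are penalised in the same way as A and B themselves, while A1 and C keep their base score.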
ChEBI was used in the evaluation, in which semantic similarity was compared to average structural similarity for pairs of classes. (The idea of average structural similarity for classes is an interesting one in its own right and might, in the distant future, benefit the visualization of ChEBI's ontology.) The authors found that their enhanced, disjointness-aware semantic similarity measure correlated significantly better with the structural similarity between classes than did traditional measures of semantic similarity.
Read more about it here.
(Whimsically, the picture shows an infinitely repeating self-similar dragon curve.)
1. João D. Ferreira, Janna Hastings and Francisco M. Couto. Exploiting disjointness axioms to improve semantic similarity measures. Bioinformatics (2013) doi: 10.1093/bioinformatics/btt491
Read more about it in our short application note recently published in Bioinformatics.