Semantic Similarity for Complex Biomedical Data Mining

Date: 05/11/2019

1. Project Title: Semantic Similarity for Complex Biomedical Data Mining

2. Area of knowledge: Life Sciences

3. Group of disciplines: Biotechnology, Bioinformatics, Pharmaceuticals, Food Technology

4. Research project:
In the last few decades, life sciences research and healthcare practice have generated a large and diverse body of digital content, ranging from unstructured scientific papers and clinical notes to more structured scientific databases and electronic health records (EHRs). To make the most of this bounty, data mining techniques are increasingly used to support diagnosis, prognosis and patient stratification.
However, there are several obstacles to the mining of medical data, related to its inherent properties and uniqueness. Biomedical data is heterogeneous, existing in a plurality of formats; EHRs typically contain free text fields, with all the inherent challenges of natural language and manual input; and finally, biomedical vocabulary suffers from a proliferation of homonyms and polysemous terms. Classical data mining techniques struggle with these challenges because they cannot extract from the raw data the context and meaning needed to handle such complex data.
This project proposes to tackle the challenges in mining semi-structured biomedical and clinical data by exploring semantic web technologies. Ontologies and knowledge graphs provide a formalization of the meaning of terms within a domain and support more complex analyses such as the computation of semantic similarity or distance. Distance metrics are the cornerstone of several machine learning approaches, both supervised and unsupervised, and can thus represent a solution for the integration of data mining with semantic technologies. Semantic similarity measures have been highly successful in molecular biology applications; however, applying them to clinical data poses additional challenges. First, clinical data can be very heterogeneous and span different domains, so multiple ontologies need to be used. Second, clinical data may have negative annotations, where the absence of a feature may be as important as its presence, so semantic similarity measures need to take this into account.
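As a rough illustration of how an ontology can yield a distance between annotated records, the sketch below compares two annotation sets by the Jaccard overlap of their ancestor closures, in the spirit of set-based measures such as simUI. It is only a minimal sketch under stated assumptions: the toy hierarchy, the term names and the function names are hypothetical, it uses a single ontology, and it does not handle negative annotations, which are among the open challenges of the project.

```python
# Toy is-a hierarchy (child -> parents). Hypothetical terms, not a real ontology.
PARENTS = {
    "finding": [],
    "motor_symptom": ["finding"],
    "muscle_weakness": ["motor_symptom"],
    "fasciculation": ["motor_symptom"],
    "cognitive_symptom": ["finding"],
    "memory_loss": ["cognitive_symptom"],
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def closure(annotations):
    """Extend a set of annotations with every ancestor of each term."""
    extended = set()
    for term in annotations:
        extended |= ancestors(term)
    return extended


def similarity(annotations_a, annotations_b):
    """Jaccard overlap of the two ancestor closures (unweighted)."""
    closure_a, closure_b = closure(annotations_a), closure(annotations_b)
    if not closure_a and not closure_b:
        return 1.0  # two empty records are treated as identical
    return len(closure_a & closure_b) / len(closure_a | closure_b)


def distance(annotations_a, annotations_b):
    """Semantic distance derived from the similarity above."""
    return 1.0 - similarity(annotations_a, annotations_b)


# Hypothetical patient records, each annotated with ontology terms.
patient_a = {"muscle_weakness"}
patient_b = {"fasciculation"}
patient_c = {"memory_loss"}

print(similarity(patient_a, patient_b))  # share "motor_symptom" and "finding"
print(similarity(patient_a, patient_c))  # share only the root "finding"
```

In this toy setting the two patients with motor symptoms come out as more similar than the motor/cognitive pair, although no term is shared literally. A distance of this kind could then be passed to distance-based learners such as clustering or nearest-neighbour methods.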

5. Job position description:
Candidates should have an MSc or equivalent degree in Computer Science, Bioinformatics, Mathematics, Statistics, Life Sciences or related areas. Candidates should be proficient in programming and the English language. Multidisciplinary backgrounds are welcome. Research experience is a plus, particularly in machine learning and/or semantic web.
The successful candidate will conduct research to address the following challenges: (1) developing methods to automate the selection of multiple ontologies/knowledge graphs that cover the data domains; (2) designing improved methods for computing semantic similarity that take into account multiple ontologies and negative annotations; (3) investigating and adapting different machine learning methods to use semantic distance as a kernel; (4) applying the developed approaches to clinical problems backed by semi-structured data.
In particular, the student will explore a large private dataset of genotype-phenotype data and clinical temporal data already collected by the national FCT project NEUROCLINOMICS2 (2016-2019, PI Sara C. Madeira) and the European JPND project OnWebDuals (2016-2019, PI Mamede de Carvalho), concerning the study of patients with Amyotrophic Lateral Sclerosis.
The work will be supervised by Catia Pesquita and Sara Madeira. Catia Pesquita is an Assistant Professor at the Department of Informatics, Faculty of Sciences (FCUL), University of Lisbon, and coordinator of the Health and Biomedical Informatics Research Line of Excellence of LASIGE. Sara Madeira is an Associate Professor at FCUL and the coordinator of the Data and Systems Intelligence line of LASIGE. The candidate will join a multidisciplinary research team within LASIGE with longstanding expertise in knowledge graphs and semantic web technologies, as well as data mining in the biomedical domain.

6. Group leader:
Full name: Catia Luisa Santana Calisto Pesquita