17th WideHealth Seminar: Diogo Azevedo

The EU-funded WideHealth project aims to conduct research on pervasive eHealth and establish a sustainable network of research and dissemination across Europe.

Title: Zgli: A Pipeline for Clustering by Compression with Application to Patient Stratification in Spondyloarthritis
Speaker: Diogo Azevedo, LASIGE, DI/FCUL
When: April 4, 2023, 16:00 CET (15:00 in Portugal)
Where: Zoom link will be shared before the session with those who register
Registration link:

Abstract: The normalized compression distance (NCD) is a similarity measure between a pair of finite objects based on compression. Clustering methods usually use distances (e.g., Euclidean distance, Manhattan distance) to measure the similarity between objects. The NCD is yet another distance with particular characteristics that can be used to build the starting distance matrix for methods such as hierarchical clustering or K-medoids. In this work, we propose Zgli, a novel Python module that enables the user to compute the NCD between files inside a given folder. Inspired by the CompLearn Linux command line tool, this module builds on it by providing new text file compressors, a new compression-by-column option for tabular data, such as CSV files, and an encoder for small files made up of categorical data. Our results demonstrate that compression by column can yield better results than previous methods in the literature when clustering tabular data. Additionally, the categorical encoder shows that it can augment categorical data, allowing the use of the NCD for new data types. This technique is applied to data from patients with ankylosing spondylitis, and the results identify different possible patient groups, as well as possible best treatments for each group.
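
For readers unfamiliar with the NCD, the sketch below illustrates the general idea using the standard definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. It is not Zgli's API: the function names compressed_size, ncd and distance_matrix are hypothetical, and zlib is assumed as the compressor purely for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both objects
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(objects: list[bytes]) -> list[list[float]]:
    """Pairwise NCD matrix; the diagonal is set to 0.0 by convention here."""
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = ncd(objects[i], objects[j])
            matrix[i][j] = d
            matrix[j][i] = d  # mirror the value so the matrix is symmetric
    return matrix


# Illustrative data only: two copies of a repetitive text and one unrelated pattern.
text = b"the quick brown fox jumps over the lazy dog " * 20
pattern = bytes((i * 97 + 13) % 256 for i in range(1000))
matrix = distance_matrix([text, text, pattern])
```

A matrix like this could then serve as the starting distance matrix for a method such as hierarchical clustering or K-medoids, as the abstract describes. With a real compressor the NCD of an object with itself is small but usually not exactly zero, which is why the sketch fixes the diagonal by convention.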

Short bio: Diogo Azevedo recently finished his MSc studies in Data Science at the Universidade de Lisboa’s Faculdade de Ciências, and is applying for a PhD position at the same university. His research is centered on machine learning and artificial intelligence, especially in the field of medicine. He has spent the last year developing and publishing Zgli, a Python tool for clustering by compression, as well as studying techniques to support clinicians' diagnostic choices in various healthcare domains (for example, patients with rheumatic disorders and people with Parkinson’s disease). Diogo is now involved in various projects, including Datapark and Matisse, a project involving genome analysis of gliomas.