The paper “A hybrid approach toward biomedical relation extraction training corpora: combining distant supervision with crowdsourcing”, authored by LASIGE PhD student Diana Sousa, has been published in Database: The Journal of Biological Databases & Curation, a top-ranked publication. The paper’s co-authors are LASIGE integrated researcher Francisco Couto and André Lamúrias.
The work presented in the paper proposes a method to improve the quality of crowdsourced relation extraction datasets. Biomedical relation extraction (RE) datasets are vital for constructing knowledge bases and for enabling the discovery of new interactions. There are several ways to create biomedical RE datasets, some more reliable than others; annotation by domain experts is among the more reliable. However, the emerging use of crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), can reduce the cost of RE dataset construction, even if the same level of quality cannot be guaranteed. Researchers have little control over who the workers on these platforms are, how they carry out the tasks, and in what context they do so. Hence, combining distant supervision with crowdsourcing can be a more reliable alternative. The crowdsourcing workers are asked only to rectify or discard already existing annotations, which makes the process less dependent on their ability to interpret complex biomedical sentences. In this work, the authors use a previously created distantly supervised human phenotype–gene relations (PGR) dataset to perform crowdsourcing validation. The main contributions are a detailed pipeline for RE crowdsourcing validation, a new release of the PGR dataset with partial domain expert revision, and an assessment of the MTurk platform’s quality.
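To make the validation idea concrete, the following is a minimal Python sketch of how worker judgments on already existing annotations could be combined. It is not the authors' pipeline: the summary above does not say how judgments are aggregated, so the strict majority vote, the label names ("true", "false", "discard"), the function names, and the record layout are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical judgments a worker can give on a distantly supervised annotation:
# "true"/"false" rectify the relation label, "discard" removes the annotation.
VALID_LABELS = {"true", "false", "discard"}


def aggregate_votes(votes, min_agreement=0.5):
    """Return the label chosen by a strict majority of workers, or None."""
    if not votes:
        return None
    unknown = set(votes) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    label, count = Counter(votes).most_common(1)[0]
    # Require the share of votes to exceed the threshold (assumed: > 50%).
    return label if count / len(votes) > min_agreement else None


def validate_dataset(annotations, worker_votes, min_agreement=0.5):
    """Split annotations into crowd-validated ones and unresolved ones.

    annotations: list of dicts with at least an "id" and a "label" key.
    worker_votes: dict mapping annotation id to a list of worker judgments.
    """
    kept, unresolved = [], []
    for ann in annotations:
        decision = aggregate_votes(worker_votes.get(ann["id"], []), min_agreement)
        if decision is None:
            # No consensus: a candidate for domain expert revision.
            unresolved.append(ann)
        elif decision != "discard":
            # Keep the annotation with the label the crowd agreed on.
            kept.append({**ann, "label": decision})
        # Annotations the crowd agreed to discard are dropped.
    return kept, unresolved
```

In this sketch, annotations without a clear majority are set aside rather than guessed, which mirrors the idea of reserving expert effort for the cases the crowd cannot settle.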
The article is available here.