16th International SWAT4HCLS conference

Accepted submissions

Oral presentations

The Pistoia Alliance Pharma General Ontology: Experience using LinkML in Pharma

Joshua Valdez1, Philippe Rocca-Serra2,3, Markus Hartmann4, Darko Hric1, Asiyah Yu Lin5, Birgit Meldal6, Peter McQuilton7, Riccardo Mariani8, Martin Romacker9, Giovanni Nisato5, Ben Gardner3

1Novo Nordisk, Copenhagen, Denmark. 2University of Oxford, Oxford, United Kingdom. 3AstraZeneca, Cambridge, United Kingdom. 4Merck Group, Darmstadt, Germany. 5Pistoia Alliance, Philadelphia, USA. 6Pfizer, Cambridge, United Kingdom. 7GlaxoSmithKline plc, Stevenage, United Kingdom. 8Chiesi Farmaceutici, Parma, Italy. 9Hoffmann-La Roche, Basel, Switzerland

Abstract

Pharmaceutical organizations generate vast amounts of data, often fragmented across multiple systems, domains, and silos. This fragmentation is exacerbated by the proliferation of ontologies, each offering its own diverse interpretations of key concepts. The complexity grows further when pharmaceutical companies, contract research organizations (CROs), regulatory bodies, and other stakeholders need to exchange information. Although the FAIR principles (Findable, Accessible, Interoperable, Reusable) promote good data management, achieving wide interoperability at scale is challenging, with the current risk of creating “FAIR silos”: resources compliant with the FAIR principles but interoperable only within specific organizations or units.

To overcome these challenges, several members of the Pistoia Alliance initiated the Pharma General Ontology (PGO) project, which will identify, select, and recommend a set of core vocabularies for describing core pharmaceutical R&D concepts and deliver shared semantics, thereby supporting interoperability across pharmaceutical domains.

The PGO will supply a set of public URIs for key R&D concepts, establishing a community-aligned, controlled vocabulary. It is hoped this shared controlled vocabulary will enable smoother data exchange and improve understanding across the pharmaceutical sector by creating a community consensus and convergence hub. Initially, the project will focus on R&D during 2024, with plans to broaden its scope to the entire medicinal product lifecycle.
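As an illustration of what publishing such public URIs could look like in practice, here is a minimal rdflib sketch that mints a SKOS concept under a placeholder namespace. The namespace, concept name, and cross-link URI are hypothetical stand-ins, not actual PGO identifiers.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, SKOS

    # Hypothetical namespace; the actual PGO URIs are yet to be published.
    PGO = Namespace("https://example.org/pgo/")

    g = Graph()
    g.bind("skos", SKOS)
    g.bind("pgo", PGO)

    concept = PGO["ClinicalStudy"]  # placeholder concept name
    g.add((concept, RDF.type, SKOS.Concept))
    g.add((concept, SKOS.prefLabel, Literal("clinical study", lang="en")))
    g.add((concept, SKOS.definition,
           Literal("A research study involving human participants.", lang="en")))
    # Cross-link to a term in another community vocabulary (hypothetical URI).
    g.add((concept, SKOS.exactMatch,
           URIRef("https://example.org/other-vocab/clinical-study")))

    print(g.serialize(format="turtle"))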

Submission type

5. Long research paper (max 10 pages)

Categories

C4 – Data and models

Biomedical Ontology Matching using Relational Graph Neural Networks and RDFs Meta-Path Rules

Shervin Mehryar, Julie Loesch, Emma Pinckers, Alan van den Akker, Yannick Damoiseaux, Athanasios Stavrinopoulos, Nikola Prianikov, Remzi Celebi

Maastricht University, Maastricht, Netherlands

Abstract

The growing complexity of relational data in knowledge graphs necessitates advanced models to capture intricate graph structures. In the domain of health and life sciences, the use of biomedical ontologies prevails in many applications, from database management to retrieval and publication. Due to heterogeneity and a lack of standardization in creating local ontologies, the reusability and interoperability of these resources become limited, and often manual, time-consuming processes are put in place to match representations for cross-domain applications. In this paper we explore embedding-based methods as an alternative approach for entity matching among biomedical ontologies at different complexity and interoperability levels, and propose a novel framework based on Relational Graph Convolutional Networks (R-GCN) in combination with symbolic meta-rule integration. We compare our results to state-of-the-art baseline models using metrics such as Hits@k, F-scores and Mean Rank (MR) and demonstrate the effectiveness of the proposed model in improving ontology matching tasks across multiple complex biomedical datasets.
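For readers unfamiliar with the evaluation metrics mentioned above, the following minimal numpy sketch computes Hits@k and Mean Rank from the rank each correct match receives in a model's candidate ranking; it illustrates the metrics only and is not the authors' evaluation code.

    import numpy as np

    def hits_at_k(ranks, k):
        """Fraction of queries whose correct match ranks within the top k."""
        ranks = np.asarray(ranks)
        return float(np.mean(ranks <= k))

    def mean_rank(ranks):
        """Average rank of the correct match (lower is better)."""
        return float(np.mean(ranks))

    # Example: ranks of the correct target entity for five source entities.
    ranks = [1, 3, 2, 15, 1]
    print(hits_at_k(ranks, 1))   # 0.4
    print(hits_at_k(ranks, 10))  # 0.8
    print(mean_rank(ranks))      # 4.4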

Submission type

5. Long research paper (max 10 pages)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

Semantic Beacons: a framework to support federated querying over genomic variants and public Knowledge Graphs

Alexandrina Bodrug-Schepers1, Hugo Chabane2, Gabriela Montoya2, Patricia Serrano-Alvarado2, Richard Redon1, Alban Gaignard1,3

1Nantes Université, CNRS, INSERM, l’institut du thorax, Nantes, France. 2Nantes Université, LS2N, Nantes, France. 3IFB-core, Institut Français de Bioinformatique (IFB), CNRS, INSERM, INRAE, CEA, Evry, France

Abstract

Comprehensive genomic data exchange is challenging but critical for future research. Beacon API networks, promoted by initiatives like GA4GH (the Global Alliance for Genomics and Health), facilitate genomic variation data discovery while preserving privacy and data ownership. However, their use is often limited by the need for costly storage, compute-intensive data pre-processing, and periodic updates as genomic knowledge constantly progresses. This work proposes an on-the-fly approach for enriching genomic variants with biological annotations provided by established knowledge bases. It thus reduces the computational load and processing time. We explore integrating open and interoperable life sciences knowledge graphs with sensitive health genomic data discoverable through Beacon APIs. We propose this federated framework as a step towards increasing the FAIRness of genomic data.
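As a hedged sketch of the enrichment pattern described above (not the authors' framework), a SPARQL 1.1 federated query can join locally discoverable variants with a public knowledge graph via a SERVICE clause. The local endpoint and the ex:overlapsGene property are hypothetical, and the example assumes the local graph reuses Wikidata gene IRIs.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical local endpoint exposing variant data as RDF.
    sparql = SPARQLWrapper("http://localhost:8890/sparql")
    sparql.setQuery("""
    PREFIX ex: <http://example.org/variant/>
    SELECT ?variant ?gene ?disease WHERE {
      ?variant ex:overlapsGene ?gene .                # local, sensitive data
      SERVICE <https://query.wikidata.org/sparql> {   # public knowledge graph
        ?gene <http://www.wikidata.org/prop/direct/P2293> ?disease .  # genetic association
      }
    } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["variant"]["value"], row["disease"]["value"])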

Submission type

5. Long research paper (max 10 pages)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

HERO-Genomics: An Ontology for Integration and Access of Multicenter Genomic Data

Mirco Cazzaro1, Ivo G. Gut2,3, Laura Menotti1, Manuel Rueda2,3, Gianmaria Silvello1

1Department of Information Engineering, University of Padua, Padua, Italy. 2Centro Nacional de Análisis Genómico (CNAG), Barcelona, Spain. 3Universitat de Barcelona (UB), Barcelona, Spain

Abstract

The Hereditary Ontology for Genomic Data (HERO-Genomics) facilitates the structured representation of genomic information, with an initial focus on documenting genomic variations related to Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS). The current release provides a framework for capturing specific sequence variations, focusing on Single Nucleotide Variants (SNV). HERO-Genomics is a component of the broader Hereditary Ontology (HERO), developed to model the gut-brain connection from phenoclinical and genomic viewpoints. HERO serves as the backbone of the Semantic Data Integration platform for the HEREDITARY project, utilizing the Ontology-Based Data Access (OBDA) paradigm to query heterogeneous and distributed data respecting legal constraints.

Submission type

5. Long research paper (max 10 pages)

Categories

C4 – Data and models

Enhancing GISMo: Integrating Recipe Contexts into a Graph-Based Ingredient Substitution Module

Julie Loesch1, David Schimmel1, Funda Erdem2, Deryanur Kalkavan2, Zeki Bilgin2, Remzi Celebi1

1Maastricht University, Maastricht, Netherlands. 2BEKO A.S., Istanbul, Turkey

Abstract

To comply with dietary restrictions, individuals seek to replace ingredients in culinary recipes. The Graph-based Ingredient Substitution Module (GISMo) was recently proposed to facilitate the learning of ingredient substitutions within recipes. We identify potential improvements to GISMo's generation of context-aware substitution recommendations by capturing various graph representations of recipe contexts, including ingredient-recipe relationships and cooking actions. Furthermore, we introduce a novel benchmark dataset comprising substitutions for specific recipes, validated by four domain experts. To assess the proposed improvements, we conduct experiments on both existing and new substitution datasets. Our findings demonstrate that the proposed changes to GISMo can augment the diversity of food substitution recommendations without compromising prediction quality.

Submission type

5. Long research paper (max 10 pages)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

PubChemRDF on the Google Cloud Platform

Sunghwan Kim1, Bo Yu1, Jia He1, Qingliang Li1, Tiejun Cheng1, Siqian He1, Hannah Bast2, Johannes Kalmbach2, Evan Bolton1

1National Center for Biotechnology Information, Bethesda, USA. 2University of Freiburg, Department of Computer Science, Freiburg, Germany

Abstract

Knowledge graphs are a powerful tool to exploit a vast amount of heterogeneous digital data on the Web, and the Resource Description Framework (RDF) is a popular way to represent and manage knowledge graphs. As such, many public information resources provide RDF-formatted scientific data for data sharing and integration among scientific resources. An example of this is PubChemRDF, which refers to PubChem data in the RDF format, where PubChem is a rather large and complex publicly available chemical data system. The present paper describes an ongoing effort to make PubChemRDF data available in cloud computing resources using two different RDF databases, Virtuoso and QLever.

PubChemRDF data was loaded into a Virtuoso database in a series of virtual machines on the Google Cloud Platform (GCP). The Virtuoso database containing the PubChemRDF data was tested on these GCP virtual machines to investigate the effect of the number of virtual CPUs (vCPUs), memory size, and disk storage type upon query performance. In conjunction with the virtual machine pricing information, the query performance data was analyzed to find a cost-effective way to exploit PubChemRDF data in the cloud. In addition, query performance against QLever (an open-source triplestore and graph database) was investigated on a local machine to explore its potential as an alternative to Virtuoso.
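To give a flavor of the kind of measurement described, here is a minimal sketch that times a SPARQL query against an endpoint with SPARQLWrapper. The endpoint URL and query are illustrative placeholders; a real benchmark would also control caching, result sizes, and repetition counts.

    import time
    from SPARQLWrapper import SPARQLWrapper, JSON

    # Illustrative endpoint URL; substitute the endpoint under test.
    ENDPOINT = "https://example.org/sparql"
    QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 1000"

    def time_query(endpoint, query, repeats=3):
        sparql = SPARQLWrapper(endpoint)
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            sparql.query().convert()
            timings.append(time.perf_counter() - start)
        return min(timings)  # best-of-n damps network noise

    print(f"{time_query(ENDPOINT, QUERY):.2f} s")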

Submission type

5. Long research paper (max 10 pages)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

Treating Query Graphs As Connecting Trees: Mapping SPARQL Queries to FHIR API

Eric Prud’hommeaux1, Jerven Bolleman2, Claude Nanjo3

1Leiden University Medical Center, Leiden, Netherlands. 2Swiss Institute of Bioinformatics, Lausanne, Switzerland. 3University of Utah, Salt Lake City, USA

Abstract

FHIR RDF defines the RDF representation of vast swaths of clinical data. HAPI FHIR, the Java reference implementation for FHIR, can read and write FHIR RDF using FHIR’s RESTful API, which at least in theory enables merging clinical data with the larger Semantic Web infrastructure.

There exist, however, few native SPARQL endpoints for FHIR. This paper presents an approach called Fhir-Sparql, whereby SPARQL queries can interface with the FHIR REST API using a novel intermediate structure called ArcTrees, to address a large number of primary use cases for clinical data. Unlike HAPI FHIR queries, which are primarily resource-centric, Fhir-Sparql allows for composing detailed declarative queries spanning different FHIR Resource types.

Fhir-Sparql decomposes a user query into ArcTrees, which are then easily matched against a library mapping ArcTrees to FHIR REST API operations. The paper describes that process with the intention of attracting users and collaborators, and of illustrating to the community how to methodically ensure completeness in SPARQL queries over non-RDF data.
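As a toy illustration of the decomposition idea (not the Fhir-Sparql implementation itself), the sketch below groups the triple patterns of a basic graph pattern by subject variable, yielding per-resource trees that could each be matched against a REST query template. The pattern strings are our invented example.

    from collections import defaultdict

    # A basic graph pattern as (subject, predicate, object) strings;
    # names beginning with "?" are variables. Illustrative only.
    bgp = [
        ("?obs", "rdf:type", "fhir:Observation"),
        ("?obs", "fhir:Observation.code", "?code"),
        ("?obs", "fhir:Observation.subject", "?patient"),
        ("?patient", "rdf:type", "fhir:Patient"),
        ("?patient", "fhir:Patient.name", "?name"),
    ]

    def group_arcs(bgp):
        """Group triple patterns into per-subject 'arc trees'."""
        trees = defaultdict(list)
        for s, p, o in bgp:
            trees[s].append((p, o))
        return dict(trees)

    for root, arcs in group_arcs(bgp).items():
        print(root, "->", arcs)
    # Each tree rooted in a FHIR Resource type could then be translated into
    # one FHIR REST API search, joined with the others on shared variables.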

Submission type

5. Long research paper (max 10 pages)

Categories

D1 – RDF Dataset or SPARQL endpoint

FAIR Linked Data Framework for Glioblastoma Research

Zoi Katsirea1, Samira Osterop2, Valérie Dutoit3, Tong Li1, Etienne Doumazane4, Stephane Pages2, Omer Ali Bayraktar1, Josh Moore5, Andra Waagmeester6, Tannia Gracia1

1Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom. 2Wyss Center for Bio and Neuroengineering, Geneva, Switzerland. 3University Hospitals of Geneva (HUG), Geneva, Switzerland. 4Paris Brain Institute, Pitié-Salpêtrière University Hospital, Paris, France. 5German BioImaging e.V., Constance, Germany. 6Micelio, Ekeren – Antwerpen, Belgium

Abstract

Glioblastoma is an aggressive brain tumour with low survival rates and resistance to treatment, requiring innovative approaches to identify prognostic factors and therapeutic targets. Patient-derived imaging, pathological, and molecular data are critical for advancing these efforts. We have developed a dataset integrating sc-RNAseq, spatial transcriptomics, proteomics, neuropathology annotations, and clinical metadata. Efficient access, organization, and sharing of these data are essential for understanding the disease and developing therapies. This linked resource enables querying of cellular states, tissue architecture, and the tumour microenvironment by combining histological features, 3D imaging, and clinical data.

Using FAIR (Findable, Accessible, Interoperable, and Reusable) principles, glioblastoma experimental and clinical metadata from the Wellcome Sanger Institute and Wyss Center for Bio and Neuroengineering were curated and linked. Unified identifiers from ontologies and community standards were assigned, with data formatted into RDF (Resource Description Framework).

We describe efforts to standardize and promote interoperability of multimodal data collected to profile glioblastoma.

Submission type

5. Long research paper (max 10 pages)

Categories

T4 Semantic methods and AI

Quality issues on mappings between ICD10 and SNOMED CT

Emiliano Reynares1, Andrea Splendiani2

1IQVIA, Barcelona, Spain. 2IQVIA, Basel, Switzerland

Abstract

Research and clinical data are generated for a variety of needs, and as such make use of various medical vocabularies. Enabling complex research and analytics use cases requires semantic interoperability between data sources, which implies the definition and maintenance of vocabulary mappings. Both the vocabularies and the mappings are dynamic and involve a context, yet in many practical usages these dynamics and this context are not considered. Although evidence has been presented on possible quality issues in commonly used vocabulary mappings, it is unclear what the extent of such issues may be. As an initial assessment, we compared the mappings relating an ICD10 term to a single SNOMED CT concept, as defined by the OHDSI Standardized Vocabularies and SNOMED CT International. Our analysis found that 27.5% of the mappings do not match, due to differences in the level of abstraction of the mappings (47% of the mismatches), slight variations in the semantics of the terms (10%), evolution of the vocabularies (4%), and plain errors in the release of mappings (2%). Identification of the causes of the remaining mismatches (37%) will be tackled in future work. The lack of proper attention to mapping dynamics results in a lower quality of the resulting datasets, in ways that are very difficult to detect once a dataset has been generated. With this work, we make a step towards quantifying the potential impact of such data quality issues, so that proper actions can be taken.
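The comparison itself boils down to checking, per ICD10 code, whether the two sources map it to the same SNOMED CT concept. A minimal sketch of that bookkeeping, using illustrative codes rather than the study's data:

    # Illustrative one-to-one mappings from ICD10 codes to SNOMED CT concept IDs.
    ohdsi_map = {"E11": "44054006", "I10": "38341003", "J45": "195967001"}
    snomed_map = {"E11": "44054006", "I10": "59621000", "J45": "195967001"}

    shared = ohdsi_map.keys() & snomed_map.keys()
    mismatches = [c for c in shared if ohdsi_map[c] != snomed_map[c]]

    print(f"{len(mismatches) / len(shared):.1%} of shared mappings disagree")
    # Each mismatch would then be inspected and assigned a cause: abstraction-level
    # difference, semantic variation, vocabulary evolution, or release error.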

Submission type

5. Long research paper (max 10 pages)

Categories

C4 – Data and models

Imaging and curation of human TB lung tissue with linked data

Gordon Wells1, Kapongo Lumamba1, Keviershen Nargan1, Delon Naicker1, Ashendree Govender1, Rajhmun Madansein2, Kameel Maharaj2, Mpumelelo Msimang3, Paul Benson4, Threnesan Naidoo1, Zoi Katsirea5, Tannia Gracia5, Fani Memi5, Josh Moore6, Andra Waagmeester7, Adrie Steyn8

1Steyn Lab, Africa Health Research Institute, Durban, South Africa. 2Inkosi Albert Luthuli Central Hospital and University of KwaZulu-Natal, Durban, South Africa. 3Department of Anatomical Pathology, National Health Laboratory Service, Inkosi Albert Luthuli Central Hospital, Durban, South Africa. 4Department of Pathology, University of Alabama at Birmingham, Birmingham, USA. 5Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom. 6German BioImaging – Gesellschaft für Mikroskopie und Bildanalyse e.V. (GerBI-GMB), Constance, Germany. 7Micelio, Ekeren, Belgium. 8Steyn Lab, Africa Health Research Institute, Durban, South Africa

Abstract

We describe a linked data model for a tissue repository comprising samples from more than 900 patients that is being used to characterise the complex manifestations of tuberculosis. This repository comprises tissue samples, formalin-fixed paraffin-embedded blocks, and microscope slides, each with associated metadata. These samples, blocks, and slides are subjected to various imaging modalities to characterise TB at the cellular and molecular level. These include immunohistochemistry, traditional histology, RNAScope, spatial transcriptomics, metallomics, spatial proteomics and micro-computed tomography. All of these samples and imaging outputs have associated data and metadata that need to be linked for maximum scientific value. We are developing a linked data model that combines metadata capture in REDCap with imaging data stored in OMERO and related platforms.

Submission type

5. Long research paper (max 10 pages)

Categories

T3 FAIR4HCLS

Judgment Day: Symbolic AI versus Neuro-Symbolic AI in a Usability Study with UK Junior Doctors

Mercedes Arguello Casteleiro1, Chloe Henson2, Manoj Kulshrestha2, Diego Maseda Fernandez2, Julio Des Diz3, Carlos Sevillano Torrado3, Nava Maroto4, Maria Jesus Fernandez Prieto5, Tim Furmston1, Chris Wroe6, John Keane1, Robert Stevens1

1Department of Computer Science, School of Engineering, University of Manchester, Manchester, United Kingdom. 2Mid Cheshire Hospital Foundation Trust, NHS England, Crewe, United Kingdom. 3Hospital do Salnés, Villagarcía de Arousa, Spain. 4Depto. Lingüística Aplicada a la Ciencia y a la Tecnología, Universidad Politécnica de Madrid, Madrid, Spain. 5Salford Languages, University of Salford, Salford, United Kingdom. 6BMJ, London, United Kingdom

Abstract

This paper investigates two Artificial Intelligence (AI) approaches for transforming a body of evidence into knowledge graphs (KGs) supporting question answering by junior doctors. The manual symbolic AI approach focuses on COVID-19 and follows the traditional knowledge engineering approach of CommonKADS. The semi-automatic neuro-symbolic AI approach focuses on disease-treatment correlations and exploits prior knowledge and a type of 4-term analogy with embeddings from deep learning. Both AI approaches leveraged nanopublication and micropublication ontologies (statement-based formalisations) to underpin the KGs. The paper reports the results of usability testing with 13 UK junior doctors.

Submission type

5. Long research paper (max 10 pages)

Categories

T4 Semantic methods and AI

Oral presentations (short papers)

Trap of Time: Historical Common Names to Modern Taxonomy Mapping using LLMs

Jan Fillies1,2, Naouel Karam1, Alois Wieshuber3, Giada Matheisen3, Malte Rehbein4, Belen Escobari5, Sarah Fischer6, Adrian Paschke2,1,7

1Institute for Applied Informatics, Leipzig, Germany. 2FU Berlin, Berlin, Germany. 3Generaldirektion der Staatlichen Archive Bayerns, München, Germany. 4Chair of Computational Humanities, University of Passau, Passau, Germany. 5Botanic Garden and Botanical Museum Berlin, Berlin, Germany. 6Research Institute for Farm Animal Biology (FBN), Dummerstorf, Germany. 7Fraunhofer FOKUS, Berlin, Germany

Abstract

As society and scientific research evolve, so does the language used to express concepts, names of species, and descriptions of objects. This research addresses the challenge of mapping historic terms—specifically, common names for species in the field of biodiversity—to a modern taxonomy. Historic biodiversity collections, already sparse, are further complicated by the use of common names rather than scientific names, making exact alignment with modern taxonomies highly challenging. Changes in spelling, along with species being merged, split, or renamed, further add to this complexity. This research explores the use of a large language model (LLM), GPT-4o, to assist in this alignment process. Results show that, when provided with context, the LLM can accurately generate modern equivalents of historic instances, demonstrating an embedded understanding of historical semantic shifts in biodiversity terminology. In a test set, the LLM successfully matched both unchanged and altered common names to their correct scientific names (91% of cases, with the inclusion of minimal context) and modern common name counterparts (68% with minimal context), underscoring its potential to standardize historical datasets and support human annotation in the future.
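A minimal sketch of how such an alignment query might be posed to GPT-4o via the OpenAI Python client. The prompt wording and the example name are ours, not the authors' pipeline, and an OPENAI_API_KEY is assumed in the environment.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    historic_name = "Kornrade"  # a historic German common name (corncockle)
    context = "Bavarian archival record, 18th century, arable weed species."

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You map historic species common names to modern taxonomy."},
            {"role": "user",
             "content": f"Historic common name: {historic_name}\n"
                        f"Context: {context}\n"
                        "Give the current accepted scientific name and the modern "
                        "common name, or say 'unknown' if uncertain."},
        ],
    )
    print(response.choices[0].message.content)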

Submission type

3. Short research paper (max 5 pages)

Categories

T4 Semantic methods and AI

13

A multi-modal and temporal antibiotic resistance knowledge graph

Brieuc Quemeneur1,2, Audrey Bihouée1, Samuel Chaffron3, Claudine Médigue2,4, Hervé Ménager5,2, Alban Gaignard1,2

1Nantes Université, CNRS, INSERM, l’institut du thorax, Nantes, France. 2IFB-core, Institut Français de Bioinformatique (IFB), CNRS, INSERM, INRAE, CEA, Evry, France. 3Nantes Université, Centrale Nantes, CNRS, LS2N, Nantes, France. 4CNRS UMR8030, Université Evry-Val-d’Essonne, CEA, Genoscope, LABGeM, Evry, France. 5Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, Paris, France

Abstract

Understanding how antibiotic resistance genes spread is essential for protecting human, animal, and environmental health. It requires collaboration across multiple fields and expertise under One Health initiatives, emphasizing the pressing need to consolidate diverse antibiotic data from human, animal, and environmental samples. In this paper, we propose a domain-specific Knowledge Graph leveraging the SOSA ontology to uniformly represent multi-modal data and their analysis, while allowing the description of provenance metadata covering both time and geographical location. This work is driven by a national consortium of antibiotic resistance experts (ABRomics). As an experimental result, we show how this domain knowledge can be used to answer typical expert questions as well as to increase the FAIRness of antibiotic resistance data.
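To illustrate the SOSA modeling pattern mentioned above, the sketch below represents a resistance-gene detection as a sosa:Observation with a result, a time, and a feature of interest, using rdflib. All identifiers are invented placeholders, not the ABRomics graph.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, SOSA, XSD

    EX = Namespace("http://example.org/abr/")  # hypothetical namespace

    g = Graph()
    g.bind("sosa", SOSA)

    obs = EX["obs1"]
    g.add((obs, RDF.type, SOSA.Observation))
    g.add((obs, SOSA.hasFeatureOfInterest, EX["sample42"]))  # e.g. a wastewater sample
    g.add((obs, SOSA.observedProperty, EX["blaTEM-1"]))      # a resistance gene
    g.add((obs, SOSA.hasSimpleResult, Literal("present")))
    g.add((obs, SOSA.resultTime,
           Literal("2024-06-01T12:00:00", datatype=XSD.dateTime)))

    print(g.serialize(format="turtle"))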

Submission type

3. Short research paper (max 5 pages)

Categories

C3 – SWAT4Health

14

Exploring Vendor-Neutral SMART on FHIR Infrastructure for Secure Data Exchange in Research and Patient Care

Katja Hoffmann, Eveline Prochaska, Markus Wolfien, Martin Sedlmayr

Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany

Abstract

Health data is essential for advancing research and improving patient care, emphasizing the need for secure, interoperable infrastructures that enable seamless access to healthcare systems. SMART on FHIR is a framework that ensures secure data exchange by specifying APIs for authentication and authorization between applications and healthcare systems. Through a review of academic and non-academic sources, we identified open-source, vendor-neutral implementations of SMART on FHIR infrastructure, including the Alvearie Keycloak extension and the LinuxForHealth FHIR server as promising components. These solutions were prototypically implemented in a reference infrastructure to evaluate their feasibility. Despite their potential for experimental use, limitations in compatibility and functionality underscore the need for ongoing community-driven development to enhance interoperability, reduce costs, and foster innovation in healthcare.
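As background to the authorization flow the framework specifies, a SMART on FHIR server advertises its OAuth endpoints in a well-known document. The sketch below fetches it with requests; the base URL is a placeholder, and the exact fields returned depend on the server.

    import requests

    fhir_base = "https://fhir.example.org"  # placeholder FHIR server base URL

    # SMART App Launch servers publish their capabilities here.
    conf = requests.get(
        f"{fhir_base}/.well-known/smart-configuration",
        headers={"Accept": "application/json"},
        timeout=10,
    ).json()

    # An app would then drive the OAuth2 authorization-code flow with these:
    print(conf["authorization_endpoint"])
    print(conf["token_endpoint"])
    print(conf.get("capabilities", []))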

Submission type

3. Short research paper (max 5 pages)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

24

A Resolution-Alignment-Completeness System for Diagnosis Code Imputation in Clinical Knowledge Graphs

Shervin Mehryar1, Ozge Erten1, Tsvetan Asamov2, Svetla Boytcheva2, Remzi Celebi1

1Maastricht University, Maastricht, Netherlands. 2Ontotext, Sofia, Bulgaria

Abstract

The rapid growth of electronic health records (EHR) presents challenges in data integration and interoperability due to the incomplete nature of this information, limiting its effective utilization. While ontology-based data integration across diverse resources has been widely practiced, the process of codifying records remains error-prone and largely manual. Knowledge graph embeddings offer an alternative, providing efficient, high-quality data representations. In this paper, we propose an embedding-based system that applies entity resolution and alignment across medical terminologies and ontologies for imputing codified data. Through experimentation, we demonstrate the benefits of the proposed solution for semantic completion and consistency tasks in terms of NDCG@K and Sem@K.
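For reference, NDCG@K rewards placing relevant codes near the top of a ranked candidate list. A small numpy sketch of the metric follows; it is illustrative, not the authors' evaluation harness.

    import numpy as np

    def ndcg_at_k(relevances, k):
        """NDCG@K for one query; `relevances` is the ranked list of gains."""
        rel = np.asarray(relevances, dtype=float)[:k]
        discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
        dcg = float(np.sum(rel * discounts))
        ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
        idcg = float(np.sum(ideal * discounts[:ideal.size]))
        return dcg / idcg if idcg > 0 else 0.0

    # Example: the correct diagnosis code appears at rank 2 of the top-5 candidates.
    print(ndcg_at_k([0, 1, 0, 0, 0], k=5))  # ~0.63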

Submission type

3. Short research paper (max 5 pages)

Categories

T4 Semantic methods and AI

27

LLM-Based Ontology Mapping for Privacy-Preserving Healthcare Data Management

Maria Papoutsoglou, Apostolos Mavridis, Stergios Tegos, Christos Anastasiou, Georgios Meditskos

Aristotle University of Thessaloniki, Thessaloniki, Greece

Abstract

The rapid growth of healthcare data requires efficient, privacy-preserving management and analysis methods. Traditional relational databases often lack the necessary contextual understanding for advanced analytics and regulatory compliance. Using techniques such as large language models (LLMs), we can enhance relational data with semantic metadata, creating knowledge graphs. These graphs support applications such as automated data de-identification, clinical decision support, and research analytics. This paper presents a framework combining LLMs, ontologies and vector databases to improve healthcare data understanding and standardization.

Submission type

3. Short research paper (max 5 pages)

Categories

C3 – SWAT4Health

52

Utilization of Semantic Technologies to Achieve Interoperability Between Dental Data and EHRs: A Proof of Concept

Hugo Lebredo, Ruben del Rey Álvarez, Jose Emilio Labra Gayo

Universidad de Oviedo, Oviedo, Spain

Abstract

It is common for a patient's clinical history to be fragmented across the systems of the different professionals and health providers, from different medical areas and both public and private, that they have attended. Several problems arise because of this, such as the appearance of information silos, data incompatibilities between different institutions, duplication of information, and data dispersion.

This work presents a proof of concept that enables individuals to comprehensively manage their dental clinical information, allowing them to share it with other stakeholders according to their preferences. The main feature of this proof of concept is its focus on interoperability, achieved through the integration of semantic technologies, such as Shape Expressions, and widely adopted standards in the medical community, such as FHIR and SNOMED. Furthermore, it aligns with the SOLID (Social Linked Data) principles, promoting a decentralized, secure, and accessible personal clinical data ecosystem.

Submission type

4. Position paper (max 5 pages)

Categories

C3 – SWAT4Health

54

Harnessing Semantic Technologies and Large Language Models for Trusted Knowledge Identification in Healthcare and Pharma

Peter Dörr

metaphacts GmbH, Walldorf, Germany

Abstract

The combination of large language models (LLMs) and semantic technologies presents unique opportunities for transforming knowledge management in the pharmaceutical and healthcare sectors. This paper introduces a novel methodology to integrate LLMs with symbolic AI and human expertise to address critical challenges such as reproducibility, explainability, and trust in data outputs. Focusing on the drug development process, the proposed annotation pipeline and knowledge graph framework enhance the accessibility and usability of biomedical knowledge by identifying and linking data from scientific publications and other sources. Results demonstrate the capability to achieve high-quality precision and scalability, offering a transformative approach to handling vast biomedical datasets and the overwhelming amount of scientific publications.

Submission type

4. Position paper (max 5 pages)

Categories

T4 Semantic methods and AI

58

Leveraging Zarr to handle RDF data

Ángel Iglesias Préstamo1, Diego Martín Fernández1, Jose Emilio Labra Gayo1, Josh Moore2

1WESO Lab, Oviedo, Spain. 2Open Microscopy Environment, Constance, Germany

Abstract

Handling large RDF graphs often requires a considerable amount of computational resources, posing challenges for processing on personal computers or resource-constrained environments. This paper proposes leveraging Zarr, a chunked and compressed storage format, to partition RDF datasets into manageable pieces. By taking advantage of Zarr’s distributed architecture, RDF data can be accessed on-demand, enabling clients to retrieve specific subsets of the graph. This approach allows the efficient processing of datasets that exceed available memory while eliminating the need for dedicated servers, bridging the gap between lightweight client-side operations and the demands of handling complex knowledge graphs.
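A minimal sketch of the storage idea under stated assumptions: triples are dictionary-encoded to integer IDs and written to a chunked, compressed Zarr array, so a client can later read only the chunks it needs. This is our illustration of the general pattern, not the paper's implementation.

    import numpy as np
    import zarr

    triples = [
        ("ex:alice", "ex:knows", "ex:bob"),
        ("ex:bob", "ex:knows", "ex:carol"),
    ]

    # Dictionary-encode RDF terms as integers.
    terms = sorted({t for triple in triples for t in triple})
    term_id = {t: i for i, t in enumerate(terms)}
    encoded = np.array([[term_id[s], term_id[p], term_id[o]]
                        for s, p, o in triples], dtype="uint64")

    # One chunk per 100k triples; Zarr compresses each chunk independently.
    z = zarr.open("triples.zarr", mode="w",
                  shape=encoded.shape, chunks=(100_000, 3), dtype="uint64")
    z[:] = encoded

    # A client can later fetch a slice (i.e. specific chunks) on demand:
    print(zarr.open("triples.zarr", mode="r")[0:2])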

Submission type

3. Short research paper (max 5 pages)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

64

From Clinical Data to Discovery: integration of Beacon v2 with OMOP-CDM

Sergi Aguiló1, Alberto Labarga1, Miguel Ángel Mayer2, Juan Manuel Ramírez-Anguita3, Oriol López-Doriga Sagales4, Aurora Moreno-Racero4, Liina Nagirnaja4, Dmitry Repchevski1, Salvador Capella-Gutierrez1, Jordi Rambla5

1Barcelona Supercomputing Center (BSC), Barcelona, Spain. 2Hospital del Mar Research Institute, Barcelona, Spain. 3Universitat Pompeu Fabra, Barcelona, Spain. 4Centre for Genomic Regulation, Barcelona, Spain. 5Centro de regulación genética, Barcelona, Spain

Abstract

The growing volume of clinical patient data offers unique opportunities for understanding diseases using Real World Data (RWD). However, standardized storage, discovery, and access methods that preserve patient privacy are crucial. OHDSI’s Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) has become a widely used standard for organizing and analyzing healthcare data in an interoperable format, enabling large-scale observational research. Beacon v2, an open-source data discovery protocol, facilitates secure data discovery across institutions by querying OMOP CDM databases in a federated manner. Developed under the Global Alliance for Genomics and Health (GA4GH), Beacon v2 supports interoperability while protecting patient data.

Beacon v2 is built on two core concepts: the Framework, which defines query mechanisms, and the Model, which structures the data. This decoupling allows flexibility in integrating various data sources like OMOP CDM or HL7 FHIR. Beacon v2 leverages ontologies to perform queries, enabling harmonization at the API level without altering underlying databases. A new Beacon v2 Production Implementation (B2PI) simplifies its deployment, especially for clinical datasets, and a specialized Beacon4OMOP extension aligns OMOP CDM data with Beacon models.

Approved as a GA4GH standard in 2022, Beacon v2 enables parallel querying of unified networks of biomedical centers, returning aggregated responses while maintaining privacy. This federated approach enhances collaboration, ensuring interoperable and reusable data discovery. By integrating OMOP CDM databases, Beacon v2 supports better decision-making in healthcare and improved patient outcomes through secure and efficient data utilization.
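To make the query mechanism concrete, here is a hedged sketch of a Beacon v2 individuals query using an ontology-term filter, posted with requests. The endpoint URL is a placeholder, and the body follows the general Beacon v2 request shape rather than any specific deployment.

    import requests

    beacon_url = "https://beacon.example.org/api/individuals"  # placeholder

    request_body = {
        "meta": {"apiVersion": "2.0"},
        "query": {
            # Filter by an ontology term, e.g. an OMOP-harmonized condition.
            "filters": [{"id": "NCIT:C4872"}],  # breast carcinoma (NCIt)
            "requestedGranularity": "count",    # aggregate counts, not records
        },
    }

    resp = requests.post(beacon_url, json=request_body, timeout=30)
    resp.raise_for_status()
    print(resp.json()["responseSummary"])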

Submission type

3. Short research paper (max 5 pages)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

65

Merging and Validating Health Ontologies: A Framework for Ontological Integration and Evaluation

Safaa Menad, Saïd Abdeddaïm, Fatima Lina Soualmia

University of Rouen Normandy, Rouen, France

Abstract

The integration of health ontologies is essential for addressing semantic heterogeneity across biomedical systems and enabling interoperability. In this study, we propose a structured and systematic framework for merging ontologies in the healthcare domain, with a focus on harmonizing terminologies derived from diverse sources. 

The resulting ontology was validated using Protégé, leveraging reasoning tools such as ELK to ensure logical consistency and detect potential conflicts. Furthermore, the ontology is now available in the BioPortal platform. This work demonstrates the potential of an integrated approach for enhancing biomedical data interoperability while maintaining high standards of consistency and usability.

Submission type

3. Short research paper (max 5 pages)

Categories

C3 – SWAT4Health

70

Comparison of Cardiovascular Disease Trajectories: A France vs Nordic countries’ perspective under the HealthData@EU EHDS project

Gayo Diallo

Team AHeaD, Bordeaux, France

Abstract

The HealthData@EU Pilot is a two-year project building a pilot version of the European Health Data Space (EHDS) infrastructure for the secondary use of health data (known as “HealthData@EU”). The current study gives an overall description of the use case entitled ”Cardiometabolic diseases”, which aims to compare care pathways for cardiometabolic diseases in European countries and to build prediction models using artificial intelligence. More specifically, the use case addresses the issue of using nationwide health registry data to predict cardiometabolic disease trajectories using machine learning and to make comparisons between three Nordic countries (Finland, Denmark, and Norway) and France. Due to time constraints, the use case focused on cardiovascular diseases (CVD) instead of cardiometabolic diseases.

Submission type

3. Short research paper (max 5 pages)

Categories

T4 Semantic methods and AI

RDF resources for biomedical research

Mayumi Kamada1, Shuichi Kawashima2, Toshiaki Katayama2

1Kitasato University, Sagamihara, Japan. 2Database Center for Life Science, Kashiwa, Japan

Abstract

The integration of biomedical databases is essential for drug discovery and understanding disease mechanisms. Med2RDF addresses this by proposing a unified data model and providing tools to convert key biomedical datasets into Resource Description Framework (RDF) format. Target databases include those for disease variants, molecular interactions, cancer omics, and genomic variations. To enhance usability, Med2RDF offers Med2RDF-ontology for organizing shared concepts and the RDF-config for generating SPARQL queries and schema diagrams from YAML-based models. Converted RDF data is available through SPARQL endpoints on the DBCLS RDF Portal, enabling seamless integration with other life science data. This standardized and semantic approach facilitates efficient data utilization, advancing biomedical research.

Submission type

7. Resource Description

Categories

D1 – RDF Dataset or SPARQL endpoint

68

GlyCosmos: An Integrated Semantic Knowledge Base for Glycoscience

Sunmyoung Lee1, Yasunori Yamamoto2, Achille Zappa1, Kiyoko F. Aoki-Kinoshita1,3,4

1Glycan and Life Systems Integration Center (GaLSIC), Soka University, Hachioji, Japan. 2Database Center for Life Science (DBCLS), ROIS-DS, Kashiwa, Japan. 3Graduate School of Science and Engineering, Soka University, Hachioji, Japan. 4Institute for Glyco-core Research, Nagoya University, Nagoya, Japan

Abstract

GlyCosmos, a comprehensive web resource for glycoscience research, has been significantly enhanced to provide a unified platform for accessing and analyzing glycan structures, related genes, proteins, pathways, and diseases. This paper presents the latest developments in GlyCosmos, highlighting the adoption of a unified Resource Description Framework (RDF) schema and semantic web technologies. These advancements address previous challenges in data integration and enable seamless connection of diverse glycan-related datasets. The implementation of SPARQL endpoints allows for powerful querying capabilities, including federated queries, knowledge discovery, and inference-based queries. The enhanced data integration has improved search functionality, supporting multi-faceted searches across various parameters and nomenclatures. GlyCosmos now offers expanded resources for glycans, genes, lectins, and diseases, integrating data from multiple sources to provide a comprehensive view of glycobiology. This semantic web approach improves data interoperability, enhances knowledge discovery, and positions GlyCosmos as a crucial platform for integrating and disseminating glycan-related knowledge in the evolving field of glycoscience.

Submission type

3. Short research paper (max 5 pages)

Categories

D1 – RDF Dataset or SPARQL endpoint

Posters

6

Semantifying Genomic Variant Data: VCF to RDF Conversion Framework

Elias Crum1,2, Ruben Taelman1, Bart Buelens2, Gokhan Ertaylan2, Ruben Verborgh1

1Ghent University, Gent, Belgium. 2VITO NV, Mol, Belgium

Abstract

Variant Call Format (VCF) files, the standard for representing patient genomic variant data, face limitations in interoperability, data linking, querying, and semantic interpretability. We propose a framework for converting VCF data into semantic data using the Resource Description Framework (RDF) to address these limitations. Our approach includes a comprehensive ontology, a storage-efficient RDF representation using Header Dictionary Triples (HDT), and integration with clinical metadata schemas like that proposed by SPHN. Our framework and the semantic representation of VCF data will contribute to greater integration, scalability, and usability of genomic variant data in both genomic medicine and research.
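As a flavor of what such a conversion involves (a toy sketch with a hypothetical ontology namespace and toy values, not the authors' framework), one VCF data line can be turned into RDF triples with rdflib:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    VAR = Namespace("http://example.org/variant-ontology/")  # hypothetical

    # A minimal VCF data line: CHROM POS ID REF ALT ...
    vcf_line = "chr1\t12345\t.\tA\tG"
    chrom, pos, vid, ref, alt = vcf_line.split("\t")[:5]

    g = Graph()
    variant = VAR[f"{chrom}-{pos}-{ref}-{alt}"]
    g.add((variant, RDF.type, VAR.Variant))
    g.add((variant, VAR.chromosome, Literal(chrom)))
    g.add((variant, VAR.position, Literal(int(pos), datatype=XSD.integer)))
    g.add((variant, VAR.referenceAllele, Literal(ref)))
    g.add((variant, VAR.alternateAllele, Literal(alt)))

    print(g.serialize(format="turtle"))
    # The resulting graph could then be compressed as HDT for storage efficiency.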

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

C3 – SWAT4Health

9

Integrating Data Treasures – The First Knowledge Graphs of the DSMZ Digital Diversity Databases

Julia Koblitz, Lorenz Christian Reimer, the Digital Diversity Team

Leibniz Institute DSMZ, Braunschweig, Germany

Abstract

The DSMZ (German Collection of Microorganisms and Cell Cultures) hosts a wealth of biological data, spanning microbial taxonomy, enzymes, rRNA genes, cell lines, cultivation media, and more. To make these diverse datasets accessible and interoperable, the DSMZ Digital Diversity initiative provides a central hub for integrated data and establishes a framework for linking and accessing these resources (https://hub.dsmz.de). At its core is the DSMZ Digital Diversity Ontology (D3O), which standardizes and connects data from all databases to enable seamless integration and advanced exploration.

In this work, we present the first knowledge graphs of two major databases: BacDive, which provides detailed microbial strain information, and MediaDive, which focuses on data on microbial cultivation. Both knowledge graphs are accessible via SPARQL endpoints at https://sparql.dsmz.de, allowing researchers to query and analyze the data in a standardized way. These initial steps lay the groundwork for integrating additional databases, such as BRENDA, SILVA, LPSN, and StrainInfo, into a unified, queryable knowledge graph.

Our goal is to connect this vast diversity of datasets and foster collaboration toward a more open and connected future for biological databases. By sharing our approach and results, we aim to inspire others to explore the potential of linked data in the life sciences.

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

C4 – Data and models

18

Predicting Clinical Remission in Crohn’s Disease: A Comparative Study of Expert-Generated and Computer-Generated Bayesian Networks

Yomi Okegunna1,2, Maija Utriaine1, Manon J.J. Cloots1,2, Jiaxu Zhang1, Inigo Bermejo1, Loriane M.M. Verleye2, Marjolijn Duijvestein3, Zlatan Mujagic2, Andre Dekker1, Marieke J. Pierik2, Rianne R.R. Fijten1

1Department of Radiation Oncology (MAASTRO), GROW-School for Oncology and Developmental Biology, Maastricht University Medical Centre+, Maastricht, Netherlands. 2Department of Gastroenterology-Hepatology, NUTRIM, Maastricht University Medical Centre, Maastricht, Netherlands. 3Department of Gastroenterology-Hepatology, Radboud University Medical Centre, Nijmegen, Netherlands

Abstract

Background: Artificial intelligence (AI) may improve treatment optimization and aid in clinical decision-making for Crohn’s disease (CD). The study aims to compare the results of Bayesian Networks (BNs) of both an Expert Knowledge Model (EKM) and a Computer Algorithm-Generated Model (CAGM) in predicting corticosteroid-free clinical remission at 52 weeks after introducing ustekinumab and vedolizumab treatment in patients with CD.

Methods: Data were extracted from the Dutch Initiative on Crohn and Colitis (ICC) registry. Observations were conducted on patients with CD (n = 440) based on remission criteria including Harvey Bradshaw Index (HBI<5) assessment and no corticosteroid use. Data were divided into training (70%, N = 309) and validation (30%, N = 131) subsets. Based on these, two Bayesian network models were developed.

Results: The EKM contained 21 expert-defined variables and 38 edges between them, and showed AUC, accuracy, sensitivity, and specificity values of 0.70, 0.64, 0.15, and 0.92, respectively. The CAGM, on the other hand, contained 14 variables and 17 edges, and showed AUC, accuracy, sensitivity, and specificity values of 0.59, 0.62, 0.17, and 0.88, respectively. Age at diagnosis, previous medication, thrombocytes, and perianal disease were nodes retained in both models.

Conclusion: In this comparative study, after rounds of node elimination, the expert-generated model achieved a higher AUC than the computer-generated model. With their capacity to incorporate expert information and surpass algorithm-only methods in remission prediction, BNs may provide physicians with a potent, interpretable tool for individualized decision-making and enhanced diagnostic precision.
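For readers who want to reproduce this style of evaluation, the reported metrics can be computed from a validation set roughly as below: a generic scikit-learn sketch with dummy arrays, not the study's data or code.

    import numpy as np
    from sklearn.metrics import confusion_matrix, roc_auc_score

    # Dummy validation labels and predicted remission probabilities.
    y_true = np.array([0, 1, 0, 0, 1, 0, 1, 0])
    y_prob = np.array([0.2, 0.6, 0.3, 0.1, 0.4, 0.2, 0.7, 0.5])
    y_pred = (y_prob >= 0.5).astype(int)

    auc = roc_auc_score(y_true, y_prob)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall on the remission class
    specificity = tn / (tn + fp)

    print(f"AUC={auc:.2f} acc={accuracy:.2f} "
          f"sens={sensitivity:.2f} spec={specificity:.2f}")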

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

C4 – Data and models

23

SPARQL federated query debugging tool

Marek Moos, Jakub Galgonek

IOCB, Prague, Czech Republic

Abstract

This report introduces a SPARQL federated query debugging tool (https://idsm-react-debugger-1.dyn.cloud.e-infra.cz/) designed to trace the execution of federated queries. Gaining insights into complex problems often requires combining multiple RDF datasets. As SPARQL endpoints are typically treated as black boxes, we focus on using standard SPARQL Federated Queries. This approach ensures compatibility with any SPARQL-compliant endpoint without the need to modify it.

However, in practice, we have encountered several pitfalls that significantly complicate the use of SPARQL Federated Queries. These include uninformative query error responses, performance bottlenecks, and semantic changes introduced by SPARQL endpoints.

To overcome these pitfalls, a web application has been developed to monitor the entire tree structure of federated service executions in real time. Monitoring federated queries is crucial for both error detection and performance optimization. Detailed service execution data (such as request, response, duration, etc.) can help identify the specific service execution responsible for a problem, even if it is deeply nested within the service execution tree.

This tool has already proven effective in practice, helping identify and resolve several errors and performance issues that were previously beyond our ability to address.

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

31

Assessment of metadata descriptors of AI-ready datasets

Jerven Bolleman1, Leyla Jael Castro2, Alban Gaignard3, Rea Kalampaliki4, Edwin Jun Kiat Ong5, Núria Queralt-Rosinach6, Nelson David Quiñones2, Rohitha Ravinder2, Dhwani Solanki2, David Steinberg7, Claus Weiland8, Daphne Wijnbergen6

1SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland. 2ZB MED Information Centre for Life Sciences, Cologne, Germany. 3Nantes Université, CNRS, INSERM, l’institut du thorax, Nantes, France. 4Biomedical Sciences Research Center “Alexander Fleming”, Vari, Greece. 5Queen’s University of Belfast, Belfast, United Kingdom. 6Leiden University Medical Center, Leiden, Netherlands. 7University of California Santa Cruz Genomics Institute, Santa Cruz, USA. 8Senckenberg Nature Research Society, Frankfurt, Germany

Abstract

To advance the use of Artificial Intelligence, including notably Machine Learning, for the understanding of diseases and conservation of biodiversity, it is important to promote FAIR AI-ready datasets. However, it is not clear how much AI-ready metadata is covered in well-known dataset repositories such as OpenML, Hugging Face or Kaggle. During the BioHackathon Europe 2024, we tackled this problem following a programmatic approach and applying Semantic Web technologies. Here, we show our preliminary results on the coverage of the implemented Croissant metadata format and discuss its implications in ML data management and future steps.
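As a concrete starting point for this kind of programmatic audit, Hugging Face serves a Croissant (JSON-LD) record per dataset. The sketch below fetches one and checks which fields are filled; the list of fields checked is our illustrative choice, not the project's assessment criteria.

    import requests

    dataset_id = "glue"  # any public Hugging Face dataset id
    croissant = requests.get(
        f"https://huggingface.co/api/datasets/{dataset_id}/croissant",
        timeout=30,
    ).json()

    # Check coverage of a few metadata properties of interest.
    for field in ("name", "description", "license", "creator", "recordSet"):
        present = bool(croissant.get(field))
        print(f"{field:12s} {'present' if present else 'MISSING'}")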

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

T5 Hackers delight

32

Development of in-context methods with Large Language Models for statement validation towards semi-automated ontology learning in novel biological contexts

James Wilsenach1, Sebastian Ahnert2,1

1The Alan Turing Institute, London, United Kingdom. 2Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, United Kingdom

Abstract

Biological knowledge is often contextual, requiring domain-specific expertise in order to determine the relative validity of a given statement, resulting in ever more context-specific ontologies. At the level of individual annotated entities the completeness of the annotational space is also contextual. For example, differences in both the number and depth of GO annotations are evident between model and non-model organisms. In the curation of domain-specific ontologies, parts of other hierarchies are often borrowed from a variety of related ontological sources. Learning which ontological statements can be imported from one context to another is therefore relevant to the goal of completing the ontological and annotational spaces and performing more accurate downstream analyses in a given context. 

We explore the use of inferred probabilities from lightweight GPT-2-type Large Language Models to determine the contextual relevance of statements. We show that context-specific prompting can improve the ability of non-fine-tuned models to determine the correct direction of subtype relations for a simple taxonomy drawn from FOODON, the edible food ontology. This ontology was tested because of its mixture of in-sample and out-of-sample terms. The resulting measure could be used to gauge which statements are most contextually relevant when a model is appropriately prompted or fine-tuned. By refining our neurosymbolic approach, we plan to provide a tool to guide investigators when deciding which ontological statements might plausibly be imported, such as from one species or cell line to another, and to identify possible areas for further study, where statements are least likely to hold.
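A minimal sketch of the underlying scoring primitive, using Hugging Face transformers to obtain a GPT-2 log-probability for a statement with and without a context prompt. The prompts are our illustrative wording, not the authors' setup.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def logprob(text: str) -> float:
        """Total log-probability GPT-2 assigns to `text`."""
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)
        # `out.loss` is the mean negative log-likelihood per predicted token.
        return -out.loss.item() * (ids.shape[1] - 1)

    statement = "A gala apple is a type of apple."
    context = "In the FOODON food ontology, subtypes are more specific foods. "
    print(logprob(statement))
    print(logprob(context + statement))
    # A stricter comparison would score only the statement tokens given the context.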

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

34

Anonymization Ontology: A Privacy-Preserving Framework for GDPR-Compliant Anonymization of Personal Health Data

Noopur Rai, Marta Dembska, Sirko Schindler

German Aerospace Center (DLR), Institute of Data Science, Jena, Germany

Abstract

Medical research can benefit immensely from the use of personal health data. However, a large amount of health data remains unexplored due to existing data protection laws guaranteeing individuals full control over their data. For broad use in research, this health data needs to be anonymized without losing its specific characteristics. The anonymization ensures compliance with the data protection laws while providing a wealth of data, e.g., to train machine learning algorithms. In this paper, we present an ontology to describe generic anonymization software for GDPR-compliant health data anonymization as well as its characteristics and requirements.

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

C4 – Data and models

35

FAIR principles: from seed supermarkets to germplasm data centers

Alberto Cámara, Santiago Moreno, Mark D. Wilkinson

Departamento de Biotecnología-Biología Vegetal, E.T.S.I. Agronómica, Alimentaria y de Biosistemas, Universidad Politécnica de Madrid, Avda. Puerta de Hierro 2-4, 28040, Madrid, Spain

Abstract

Ex situ seed storage in germplasm banks (GBs) plays a critical role in biodiversity conservation. However, databases for wild species in GBs, often managed by resource-constrained botanical gardens, are fragmented, utilizing diverse data formats, platforms, and standards with inconsistent accessibility. This fragmentation hinders the integration of data necessary for developing effective conservation strategies for wild species. To address this, we propose applying the FAIR Principles (Findable, Accessible, Interoperable, Reusable) to wild species and crop wild relatives data by creating a federated network of FAIR GB databases. This network would enable seamless cross-resource discovery and analysis, supporting more effective conservation strategies and highlighting the importance of GB data providers. Each database would integrate a FAIR transformation layer comprising semantic models, metadata publication via FAIR Data Points, and three innovations: a query-endpoint matching algorithm, an enhanced Triple Pattern Fragments resolution algorithm for data federation, and a Virtual Platform for data analytics through data visiting.

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

43

FAIR Software as a Service: Deployment of FAIR infra made easy

Rick Overkleeft1,2, Sander van Boom1,2, Eric Prud’Hommeaux1, Kees Burger3, Jip Fransen2, Marc Padros Goossens4, Luiz Bonino da Silva Santos5,2, Rajaram Kaliyaperumal2, Marco Roos2

14MedBox Nederland B.V., Leiden, Netherlands. 2Leiden University Medical Center, Leiden, Netherlands. 3Health-RI, Utrecht, Netherlands. 4Universitair Medisch Centrum Utrecht, Utrecht, Netherlands. 5Technical University Twente, Enschede, Netherlands

Abstract

IT departments have operational expertise in deploying software services, but lack expertise in deploying and exploiting the Semantic Web technologies that FAIR services use for machine actionability. This inexperience poses a risk for delivering the FAIR foundation of a federated health data infrastructure that is ready for federated analytics, AI, and machine learning.

This project created an architecture and a first version to make deployment by IT departments of FAIR services cost-effective, scalable, and sustainable by delivering ‘FAIR Software as a Service’ (FAIR-SaaS).

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

T3 FAIR4HCLS

45

AgroLD: a Knowledge Graph for the Plant Sciences

Pierre Larmande1, Bertrand Pitollat2, Ndomassi Tando1, Yann Pomie1, Bill Gates Happi Happi1, Valentin Guignon3

1IRD, Montpellier, France. 2CIRAD, Montpellier, France. 3CGIAR, Montpellier, France

Abstract

The AgroLD Knowledge Graph is a semantic framework designed to integrate and explore data relevant to the plant sciences, with a particular focus on plant genomics. AgroLD contains around 900M triples created by combining more than 100 datasets from 15 data sources. Our objective is to offer a domain-specific knowledge platform to answer complex biological and plant science questions related to the implication of genes in, for instance, plant disease resistance or adaptive responses to climate change. In this poster, we present some results, currently focused on genomics, genetics, and trait associations.

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

D1 – RDF Dataset or SPARQL endpoint

46

DATOS.CAT: Improving the interoperability of GCAT in line with the FAIR principles with OntoBridge, dsOMOP and OMOP Beacon

Aikaterini Lymperidou1,2, David Sarrat-González3, Ramon Mateo-Navarro1,3, Guillem Bracons-Cucó4, Aurora Moreno-Racero5, Jordi Rambla de Argila5, Liina Nagirnaja5, Carles Hernandez-Ferrer5, Salvador Capella-Gutierrez6, Santiago Frid4, Juan R González3, Rafael de Cid2

1Institute for Bioengineering of Catalonia (IBEC), Barcelona, Spain. 2GCAT Genomes for Life – Germans Trias i Pujol Research Institute, Badalona, Spain. 3Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain. 4Hospital Clinic de Barcelona, Barcelona, Spain. 5Center for Genomic Regulation, Barcelona, Spain. 6Barcelona Supercomputing Center, Barcelona, Spain

Abstract

In the context of DATOS.CAT, we propose a groundbreaking pipeline applicable to population-based cohorts that improves the interoperability of the data in line with the Findable, Accessible, Interoperable, Reusable (FAIR) principles. The procedure includes: the transformation of the local database to the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) with OntoBridge, statistical analysis with dsOMOP, and the exploration of the data with the OMOP Beacon tool.

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

C3 – SWAT4Health

47

An Automated Framework to generate pairs of Natural Language Questions and SPARQL Queries from Common RDF Graph Patterns

Julio C. Rangel1, Tarcisio Mendes de Farias2, Yasunori Yamamoto3,1, Ana Claudia Sima2, Norio Kobayashi1

1Riken, Wako, Japan. 2SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland. 3Database Center for Life Science, Kashiwa, Japan

Abstract

This paper presents an approach for automatically generating SPARQL queries from natural language questions against any RDF dataset without requiring example queries or domain-specific training data. Our approach identifies frequently occurring relational paths through a community detection process on the RDF graph, thereby extracting canonical class-level patterns that occur with high frequency. Using these patterns, we produce a rich dataset of natural language questions and automatically generated SPARQL queries. This dataset is then indexed to enable similarity search of relevant natural language questions paired with their SPARQL queries, aligning with user-input questions. During inference, the system uses a semantic index to retrieve existing similar questions and SPARQL queries. If the retrieved question differs within a defined threshold, an entity linking mechanism replaces entities found in the user questions with those in the retrieved SPARQL queries, to ensure flexibility and domain independence.

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

T4 Semantic methods and AI

50

A Knowledge Graph for Enhanced Study Discovery and Semantic Exploration

Lea Gütebier1, Dagmar Waltemath1,2, Ron Henkel1

1Medical Informatics Laboratory, Institute for Community Medicine, University Medicine Greifswald, Greifswald, Germany. 2Data Integration Center, University Medicine Greifswald, Greifswald, Germany

Abstract

A knowledge graph provides a powerful framework for integrating and linking diverse study resources, offering enhanced capabilities for study discovery and exploration. We created a semantically enriched graph that facilitates the retrieval and understanding of clinical studies by consolidating data from ClinicalTrials.gov, a German portal for medical data models, and ontologies such as UMLS and MeSH. Built on a Neo4j graph database and domain-specific ETL processes, the knowledge graph allows for efficient targeted searches and semantic exploration of clinical studies. By leveraging ontological annotations, it connects studies across domains, revealing relationships and providing a comprehensive context for research. This approach significantly enhances the accessibility, findability, and usability of clinical studies for patients, researchers, and clinicians alike.
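As an illustration of the kind of targeted search such a graph enables, here is a sketch using the Neo4j Python driver; the Cypher labels, properties, and credentials are hypothetical, not the actual schema of the Greifswald graph.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))  # placeholder credentials

    # Find studies annotated with a MeSH term, via a hypothetical schema.
    query = """
    MATCH (s:Study)-[:HAS_ANNOTATION]->(t:MeshTerm {name: $term})
    RETURN s.nctId AS id, s.title AS title
    LIMIT 10
    """

    with driver.session() as session:
        for record in session.run(query, term="Diabetes Mellitus"):
            print(record["id"], record["title"])
    driver.close()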

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

C3 – SWAT4Health

51

OMExcavator: a tool for exporting and connecting domain-specific metadata in a wider knowledge graph

Stefan Dvoretskii1,2, Michele Bortolomeazzi1,3, Christian Schmidt1,3, Josh Moore4,3, Klaus Maier-Hein1,5, Marco Nolden1,2,5

1DKFZ, Heidelberg, Germany. 2HMC, Köln, Germany. 3NFDI4BIOIMAGE, Karlsruhe, Germany. 4German Bioimaging, Konstanz, Germany. 5University Clinic Heidelberg, Heidelberg, Germany

Abstract

Bioimaging data volumes have grown greatly in recent years, and this trend is paving the way for important future discoveries in biology and healthcare. Yet even though the volume of generated data is huge, its findability, interoperability, and general reusability need to be improved.

In bioimaging, the most widely used research data management (RDM) system is OMERO, which stores images and their accompanying metadata in the OME Model with the possibility of interoperable export. In this work, we developed a tool that makes the metadata records of images from OMERO servers available for semantic exploration. We used domain-specific tools to generate a generic metadata representation and link it with other resources, in this case unHIDE, an overarching knowledge graph of the Helmholtz Association that is part of the Helmholtz FAIR data space.

This tool is a critical step for the relevant communities in the bioimaging field in Germany and beyond. Moreover, others may face similar challenges in their respective research domains.

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

55

Towards dynamic consent and transparent access tracking for decentralized health records using the SOLID framework

Kalliopi Kastampolidou, Achilleas Chytas, Pantelis Natsiavas

Centre for Research and Technology Hellas | Institute of Applied Biosciences, Thessaloniki, Greece

Abstract

Managing health data in decentralized systems is challenging due to the need for fine-grained consent and transparent access workflows. Based on RDF/OWL technologies, the SOLID framework offers a unique approach to addressing these challenges by enabling individuals to store and control their data independently. This paper presents a conceptual design that integrates dynamic consent management and transparent access tracking into a decentralized health record infrastructure. By leveraging RDF's semantic capabilities, we aim to enhance interoperability and enable advanced reasoning over consent rules and data sharing, laying the foundation for future prototypes and implementations in patient-centric health data ecosystems.

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

57

Lowering Barriers to FAIR Data: A User-Friendly Tool for Semantic Data Transformation

Karolis Cremers, César Bernabé, Daphne Wijnbergen, Rosa Zwart, Anna Niehues, Marco Roos

Leiden University Medical Center, Leiden, Netherlands

Abstract

The FAIR guiding principles are essential for improving data reusability but often require researchers to perform complex transformations, such as converting data into RDF and aligning it with semantic models. These tasks demand significant technical expertise, posing barriers to adoption. We propose a user-friendly, minimalist tool that simplifies this process by providing an intuitive drag-and-drop interface for mapping data elements to metadata model classes, automatically generating YARRRML mappings, and producing FAIR-compliant RDF. Our proof of concept aims to lower the technical barriers to FAIR data transformation, empowering researchers to adopt FAIR practices more easily.
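
As an illustration of the mapping-generation step, the sketch below turns a simple column-to-property assignment, as a drag-and-drop interface might record it, into a YARRRML document; the source file, columns, and vocabulary are invented, and the actual tool is more general.

import yaml  # PyYAML

# Column-to-property assignments as a GUI might capture them (invented).
assignments = {"name": "ex:name", "birthdate": "ex:birthDate"}

yarrrml = {
    "prefixes": {"ex": "http://example.org/"},
    "mappings": {
        "person": {
            "sources": [["patients.csv~csv"]],
            "s": "ex:person/$(id)",
            "po": [[prop, "$(%s)" % col] for col, prop in assignments.items()],
        }
    },
}
print(yaml.dump(yarrrml, sort_keys=False))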

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

T5 Hackers delight

59

RDF Portal: Enhancing Accessibility and Interoperability of Life Science RDF

Shuichi Kawashima, Yasunori Yamamoto, Hirokazu Chiba, Yuki Moriya, Toshiaki Katayama

Database Center for Life Science, Chiba, Japan

Abstract

The RDF Portal provides comprehensive access to high-quality RDF data in life sciences. This paper highlights recent advancements, including the automation of update processes using RDF-config, improved metadata standardization, and expanded dataset coverage. These developments address challenges such as manual dependencies and schema inconsistencies, ensuring data reliability and interoperability. By integrating public and custom RDF datasets, the portal fosters innovation and supports diverse applications. Additionally, as a foundational resource for DBCLS and external services, the RDF Portal plays a critical role in advancing life sciences research and data integration.

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

D1 – RDF Dataset or SPARQL endpoint

61

Temporal Modeling in the Radiation Therapy Ontology

Howard Fu1, Andre Dekker2, Jonathan Bona3, Mark Phillips4

1Cornell University, Ithaca, USA. 2Maastricht University, Maastricht, Netherlands. 3University of Arkansas for Medical Sciences, Little Rock, USA. 4University of Washington, Seattle, USA

Abstract

Radiation therapy is a complex treatment modality that requires precise temporal coordination and understanding of concurrent treatment events. The Radiation Therapy Ontology (RTO) aims to provide a comprehensive framework for representing radiation therapy concepts. It will also serve as a foundation for future data modeling as the incorporation of knowledge graphs in data analysis increases. This poster explores the integration of temporal modeling into RTO using Time Ontology Rules and extending Allen’s Interval Algebra to model treatment concurrency effectively.
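
For readers unfamiliar with Allen's Interval Algebra, the sketch below classifies a handful of its thirteen relations between two treatment intervals; the dates are illustrative, and the poster's ontology- and rule-based modelling goes well beyond this minimal Python stand-in.

from datetime import date

def allen_relation(a_start, a_end, b_start, b_end):
    # Only a few of Allen's thirteen relations are distinguished here.
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start < b_start and a_end > b_end:
        return "contains"
    if b_start <= a_start and a_end <= b_end:
        return "during (or starts/finishes)"
    return "overlaps (one of the remaining relations)"

# Chemotherapy running concurrently within a longer radiation course:
print(allen_relation(date(2024, 1, 1), date(2024, 3, 1),
                     date(2024, 1, 15), date(2024, 2, 15)))  # contains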

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

C3 – SWAT4Health

62

FAIR in practice: minimum metadata schema for bioinformatics analytics by machines

Daphne Wijnbergen1, Núria Queralt-Rosinach1, Valérie Barbié2, Emma Verkinderen3, Nirupama Benis4, Annika Jacobsen1, Peter ’t Hoen5, Claudio Carta6, Marco Roos1, Eleni Mina1

1Leiden University Medical Center, Leiden, Netherlands. 2Swiss Institute of Bioinformatics, Geneva, Switzerland. 3Université Libre de Bruxelles, Brussels, Belgium. 4Amsterdam UMC location University of Amsterdam, Amsterdam, Netherlands. 5Radboud University Medical Center, Nijmegen, Netherlands. 6Istituto Superiore di Sanità, Rome, Italy

Abstract

The reuse of datasets leads to more efficient research and a reduction in the cost and time spent generating new data. Findability and reuse of datasets, as well as accessibility and interoperability, can be improved by following the FAIR principles, which emphasize machine actionability. In practice, metadata often lacks machine actionability due to incomplete standardisation and missing ontological descriptions. In this work, we identify the minimal metadata necessary for the machine actionability of bioinformatics tools and propose a schema to address current limitations. The schema covers steps for identification, selection, validation, and execution. We also align the metadata of tools with the metadata of datasets to improve machine actionability.
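
The sketch below shows what such a minimal, machine-actionable tool record might look like, together with a trivial completeness check; the field names and values are invented for illustration and are not the proposed schema itself.

# Illustrative minimal tool-metadata record; all fields are invented.
tool_metadata = {
    # identification: where a machine can find and cite the tool
    "identifier": "https://bio.tools/example-tool",
    "name": "example-tool",
    # selection: inputs/outputs as EDAM format terms
    "input": "http://edamontology.org/format_3464",   # JSON
    "output": "http://edamontology.org/format_3475",  # TSV
    # validation: pinning an exact, checkable version
    "version": "1.2.0",
    # execution: how a machine can actually run the tool
    "container": "docker://quay.io/example/example-tool:1.2.0",
}

required = {"identifier", "input", "output", "container"}
missing = required - tool_metadata.keys()
print("machine-actionable" if not missing else "missing: %s" % sorted(missing))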

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

T3 FAIR4HCLS

67

Constructing a synonymous URI dictionary in life sciences

Yasunori Yamamoto1, Takatomo Fujisawa2

1Database Center for Life Science, ROIS-DS, Kashiwa, Japan. 2Bioinformation and DDBJ Center, National Institute of Genetics (NIG), Mishima, Japan

Abstract

Many life science data have been published in RDF. In general, life science data are diverse and store knowledge about various relevant concepts and the relationships among them. These concepts include proteins, genes, compounds, and diseases, and are represented by identifiers in databases. To understand biological phenomena, it is crucial to investigate their characteristics and relationships extensively, and it would be ideal to use one and only one identifier per concept across databases. In reality, however, multiple identifiers are often used for a single concept. The Database Center for Life Science (DBCLS) constructs or collects life science RDF data and provides them at the RDF Portal. Here, we have investigated the synonymous URIs within it and examined the challenges and future work.
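
A minimal sketch of harvesting synonymous URI pairs from an RDF file and grouping them under one representative URI is given below; the portal-scale survey described in the abstract is far larger, and the file name is illustrative. Picking the lexicographically smallest URI as representative is a simplification that does not compute the full transitive closure.

from collections import defaultdict
from rdflib import Graph
from rdflib.namespace import OWL

g = Graph()
g.parse("dataset.ttl")  # any RDF file containing owl:sameAs links

synonyms = defaultdict(set)
for s, _, o in g.triples((None, OWL.sameAs, None)):
    canonical = min(str(s), str(o))  # pick a stable representative
    synonyms[canonical].update({str(s), str(o)})

for canonical, uris in synonyms.items():
    print(canonical, "<-", sorted(uris))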

Submission type

1. Poster (max 2 pages, including reports on applications, data, and models)

Categories

D1 – RDF Dataset or SPARQL endpoint

Demonstrations

63

RDF resources for biomedical research

Mayumi Kamada1, Shuichi Kawashima2, Toshiaki Katayama2

1Kitasato Univeristy, Sagamihara, Japan. 2Database Center for Life Science, Kashiwa, Japan

Abstract

The integration of biomedical databases is essential for drug discovery and understanding disease mechanisms. Med2RDF addresses this by proposing a unified data model and providing tools to convert key biomedical datasets into Resource Description Framework (RDF) format. Target databases include those for disease variants, molecular interactions, cancer omics, and genomic variations. To enhance usability, Med2RDF offers the Med2RDF-ontology for organizing shared concepts and RDF-config for generating SPARQL queries and schema diagrams from YAML-based models. Converted RDF data is available through SPARQL endpoints on the DBCLS RDF Portal, enabling seamless integration with other life science data. This standardized and semantic approach facilitates efficient data utilization, advancing biomedical research.

Submission type

7. Resource Description

Categories

D1 – RDF Dataset or SPARQL endpoint

68

GlyCosmos: An Integrated Semantic Knowledge Base for Glycoscience

Sunmyoung Lee1, Yasunori Yamamoto2, Achille Zappa1, Kiyoko F. Aoki-Kinoshita1,3,4

1Glycan and Life Systems Integration Center (GaLSIC), Soka University, Hachioji, Japan. 2Database Center for Life Science (DBCLS), ROIS-DS, Kashiwa, Japan. 3Graduate School of Science and Engineering, Soka University, Hachioji, Japan. 4Institute for Glyco-core Research, Nagoya University, Nagoya, Japan

Abstract

GlyCosmos, a comprehensive web resource for glycoscience research, has been significantly enhanced to provide a unified platform for accessing and analyzing glycan structures, related genes, proteins, pathways, and diseases. This paper presents the latest developments in GlyCosmos, highlighting the adoption of a unified Resource Description Framework (RDF) schema and semantic web technologies. These advancements address previous challenges in data integration and enable seamless connection of diverse glycan-related datasets. The implementation of SPARQL endpoints allows for powerful querying capabilities, including federated queries, knowledge discovery, and inference-based queries. The enhanced data integration has improved search functionality, supporting multi-faceted searches across various parameters and nomenclatures. GlyCosmos now offers expanded resources for glycans, genes, lectins, and diseases, integrating data from multiple sources to provide a comprehensive view of glycobiology. This semantic web approach improves data interoperability, enhances knowledge discovery, and positions GlyCosmos as a crucial platform for integrating and disseminating glycan-related knowledge in the evolving field of glycoscience.
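
As a taste of the federated querying the abstract mentions, the sketch below sends a SERVICE query from Python; the local endpoint URL and triple pattern are placeholders rather than the actual GlyCosmos schema, while the UniProt endpoint and class are real.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/sparql")  # placeholder endpoint
sparql.setQuery("""
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?glycan ?protein WHERE {
    ?glycan ?relatesTo ?protein .       # placeholder local pattern
    SERVICE <https://sparql.uniprot.org/sparql> {
        ?protein a up:Protein .         # resolved remotely via federation
    }
} LIMIT 10
""")
sparql.setReturnFormat(JSON)
print(sparql.query().convert()["results"]["bindings"])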

Submission type

3. Short research paper (max 5 pages)

Categories

D1 – RDF Dataset or SPARQL endpoint

26

LARA – building automated knowledge graphs for Life Sciences

Mark Doerr1, Stefan Born2

1University of Greifswald, Greifswald, Germany. 2Technical University Berlin, Berlin, Germany

Abstract

LARA (https://gitlab.com/larasuite) is evolving into a widely applicable open-source lab automation and research data management system for life-science applications. The focus of current development lies on a highly automatable infrastructure with a powerful, fast, and sophisticated API to enable closed-loop designs and to let machine-learning and AI applications interact with the interlinked data captured and provided by the LARA databases.

While research data is collected, an automated pipeline annotates the data semantically and creates links to web-based taxonomies and ontologies. In this way a growing knowledge graph is generated, which can be explored through the built-in SPARQL endpoint (OpenLink Virtuoso, https://virtuoso.openlinksw.com/).

In this presentation we demonstrate this semantic workflow and highlight diverse applications and use cases originating from the worlds of life science, cultural science, and machine learning in protein engineering.
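
As a minimal illustration, the following Python snippet queries such an endpoint with SPARQLWrapper; the endpoint URL (Virtuoso's default port) and the graph pattern are placeholder assumptions, not the LARA schema.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:8890/sparql")  # Virtuoso default
sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?item ?label WHERE { ?item rdfs:label ?label . } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["item"]["value"], row["label"]["value"])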

Submission type

2. Demonstration (max 2 pages; use comment box for technical requirements)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

33

Leveraging AI Agents for Efficient Data Annotation

Chang In Moon, Christina Parry, Jineta Banerjee, John Hill, Jay Hodgson, Alberto Pepe, Milen Nikolov, Sonia Carlson, Robert Allaway

Sage Bionetworks, Seattle, USA

Abstract

Effective management and analysis of biomedical datasets are critical for accelerating translational research and advancing healthcare innovation. Well-structured metadata enables data to be easily found, understood, and reused, facilitating reproducibility and advancing scientific discovery. However, curating and annotating data with metadata is extremely time-consuming. This can lead to significant delays in research projects, resulting in missed opportunities to publish findings or collaborate with other researchers, ultimately hindering scientific progress.

Sage Bionetworks has a platform called Synapse, which hosts more than 3 petabytes of biomedical research data. In this demonstration, we present Sage Bionetworks’ effort to streamline the data annotation process using artificial intelligence (AI). The demo highlights the Synapse Agent and its transformative role in metadata annotation and data integration workflows.

The Synapse Agent automates metadata annotation tasks by leveraging schema-bound file entities and contextual information from project wikis, enabling rapid and accurate assignment of metadata. It performs both simple tasks, such as extracting valid values from filenames, and complex inferential annotations guided by user input or schema rules.
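
As a toy illustration of the simpler end of this spectrum, the snippet below pulls a schema-valid value out of a filename; the real Synapse Agent is schema-bound and model-driven, and the vocabulary here is invented.

import re

valid_assays = {"rnaSeq", "wholeGenomeSeq", "atacSeq"}  # invented schema values
pattern = re.compile("|".join(sorted(valid_assays)))

def annotate(filename):
    # Assign the 'assay' annotation if the filename contains a valid value.
    m = pattern.search(filename)
    return {"assay": m.group(0)} if m else {}

print(annotate("patient01_rnaSeq_batch2.bam"))  # {'assay': 'rnaSeq'}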

This demonstration underscores Sage’s commitment to creating scalable, FAIR-compliant data management solutions for life sciences. Attendees will gain insights into the practical implementation of AI-driven tools for metadata annotation and data harmonization, offering a blueprint for addressing challenges in data integration.

Submission type

2. Demonstration (max 2 pages; use comment box for technical requirements)

Categories

T3 FAIR4HCLS

36

QASAR: A quality assurance tool for semantic resources

Gonzalo Nicolás Martínez1, Francisco Abad Navarro1, Manuel Quesada Martínez2, Astrid Duque Ramos1, Belén Juanes Cortés1, Jesualdo Tomás Fernández Breis1

1Universidad de Murcia, Murcia, Spain. 2Universidad Miguel Hernández, Murcia, Spain

Abstract

Ontologies play a crucial role in enabling interoperability and data sharing in many fields. The ontology engineering community has dedicated significant effort to developing quality assurance frameworks and methodologies, which address various quality aspects and propose different approaches to quality evaluation. However, there remains a gap in integrated tools that combine these quality frameworks and methodologies into a unified, streamlined experience for ontology developers. In this demo we propose an ontology quality assurance tool designed to address this gap through the integration of multiple quality frameworks and methodologies developed by our team over the years, providing a cohesive quality assurance workflow that assists ontology developers in producing higher-quality ontologies.

Submission type

2. Demonstration (max 2 pages; use comment box for technical requirements)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

39

MetaTron: Streamlining Collaborative Annotation for Biomedical Documents

Ornella Irrera, Stefano Marchesin, Gianmaria Silvello

University of Padova, Padova, Italy

Abstract

Manual annotation of biomedical texts, such as electronic health records and medical reports, is crucial to creating reliable corpora for training automated methods for tasks like relation extraction and entity linking. To streamline this process, we introduce MetaTron, a web-based collaborative tool that supports mention-level and document-level annotations and automatic built-in predictions.

Submission type

2. Demonstration (max 2 pages; use comment box for technical requirements)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

40

Demonstration: The MeDaX-KG on FHIR

Ilya Mazein, Tom Gebhardt, Sebastian Berthe, Kai Fitzer, Ron Henkel, Markus Mandalka, Seyed Reza Mazlooman, Lea Michaelis, Dagmar Waltemath, Benjamin Winter, Judith Wodke

University Medicine Greifswald, Greifswald, Germany

Abstract

The Medical Informatics Initiative in Germany aims to advance biomedical research and healthcare through the secondary use of high-quality health data. As part of this effort, the MeDaX project presents a pipeline for generating knowledge graphs (KG) from health research data. The pipeline generically transforms FHIR-formatted clinical data into a graph, optimises its structure, and integrates it with ontological information using the BioCypher framework. Collaborating with the Data Integration Centre at University Medicine Greifswald, we are now implementing our first clinical prototype.
Improving on the previous version of the pipeline, we transition from Neo4j’s CyFHIR plugin to NetworkX, eliminating proprietary dependencies for data transformation. We also improve on semi-automatic schema creation required for KG creation with BioCypher.
Our graph-based solution will ease interaction with clinical data for research, providing a more unified and intuitive way of accessing the data, and allow medical practitioners to further organise their information and answer work-related questions more efficiently with the help of structured graph database queries. Future work includes a specialised user interface and incorporation of additional data sources.
With this demonstration, we want to gather feedback from the semantic web community to further improve the pipeline for our first stable release.
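
Below is a minimal sketch of the FHIR-to-graph step, assuming plain FHIR JSON resources and turning reference fields into edges with NetworkX; the MeDaX pipeline adds structure optimisation and BioCypher-based ontology integration on top of anything this simple.

import networkx as nx

# Two toy FHIR resources; a Condition references its Patient.
resources = [
    {"resourceType": "Patient", "id": "p1"},
    {"resourceType": "Condition", "id": "c1",
     "subject": {"reference": "Patient/p1"}},
]

g = nx.DiGraph()
for res in resources:
    node_id = "%s/%s" % (res["resourceType"], res["id"])
    g.add_node(node_id, **{k: v for k, v in res.items() if isinstance(v, str)})
    # Turn FHIR reference fields into edges between resources.
    for key, value in res.items():
        if isinstance(value, dict) and "reference" in value:
            g.add_edge(node_id, value["reference"], label=key)

print(list(g.edges(data=True)))  # [('Condition/c1', 'Patient/p1', {'label': 'subject'})]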

Submission type

2. Demonstration (max 2 pages; use comment box for technical requirements)

Categories

C4 – Data and models

41

The Wheat and Rice Genomics Scientific Literature Knowledge Graphs

Nadia Yacoubi Ayadi1, Franck Michel2, Robert Bossy3, Marine Courtin3, Bill Gates Happi Happi4, Pierre Larmande4, Claire Nedellec3, Catherine Faron2

1Univ Lyon, Lyon, France. 2Univ. Nice, Nice, France. 3INRAE, Jouy-En-Josas, France. 4IRD, Montpellier, France

Abstract

This paper presents a generic semantic model to describe, structure, and integrate the named entities automatically extracted from scientific texts, represented as annotations. This model has been used to construct knowledge graphs from two distinct agricultural corpora consisting of PubMed scientific publications on wheat and rice genetics. The named entities to be recognized are genes, phenotypes, traits, genetic markers, and taxa. For both corpora, named entities were automatically extracted using natural language processing tools. The RDF model was populated using a mapping-based transformation pipeline implemented with the Morph-xR2RML tool, which takes CSV files as input. The resulting RDF knowledge graphs are deployed and queryable through dedicated web applications.
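
The snippet below is not the Morph-xR2RML mapping language itself, merely an rdflib illustration of the same CSV-to-RDF transformation step; the file name, columns, and vocabulary are invented.

import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")
g = Graph()

with open("entities.csv", newline="") as f:
    for row in csv.DictReader(f):  # expects columns: id, type, label
        subject = EX[row["id"]]
        g.add((subject, RDF.type, EX[row["type"]]))
        g.add((subject, EX.label, Literal(row["label"])))

print(g.serialize(format="turtle"))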

Submission type

2. Demonstration (max 2 pages; use comment box for technical requirements)

Categories

D1 – RDF Dataset or SPARQL endpoint

44

Collaborative management of Semantic Web datasets: maintaining EuroSciVoc with VocBench and ShowVoc

Anikó Gerencsér1, Baya Remaoun1, Enrico Bignotti2

1Publications Office of the European Union, Luxembourg, Luxembourg. 2Infeurope S.A., Luxembourg, Luxembourg

Abstract

This paper presents the collaborative maintenance approach for the European Science Vocabulary (EuroSciVoc) through the implementation of the VocBench and ShowVoc corporate knowledge management systems. EuroSciVoc is a standardized taxonomy designed to classify research projects within the CORDIS platform of the Publications Office of the European Union. We demonstrate how VocBench and ShowVoc, two open-source platforms leveraging Semantic Web technologies, enable efficient management and dissemination of controlled vocabularies through their specialized features and collaborative functionalities.

Submission type

2. Demonstration (max 2 pages; use comment box for technical requirements)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI

48

Using NFDI4Health Local Data Hub and Metadata-Schema for clinical research information management

Frank Meineke1, Matthias Löbe1, René Hänsel1, Masoud Abedi1, Xiaoming Hu2, Martin Golebiewski2

1Leipzig University, Leipzig, Germany. 2Heidelberg Institute for Theoretical Studies, Heidelberg, Germany

Abstract

The German National Research Data Infrastructure for Personal Health Data (NFDI4Health) aims to make structured health data internationally searchable and accessible according to the FAIR guiding principles. We have developed an overarching metadata schema (MDS) for clinical trials and epidemiological studies and have implemented it in the Local Data Hub (LDH) software. LDH installations are provided as data management platforms for data-holding organizations to organize comprehensive local research information on clinical and epidemiological research projects. Metadata can be entered into the LDHs and published via direct transmission to the central German Health Study Hub of NFDI4Health. LDHs are locally based, low-threshold entry points into research data and information sharing. We will introduce the NFDI4Health metadata schema, demonstrate the LDH system, and report on initial experiences from the first roll-outs.

Submission type

2. Demonstration (max 2 pages; use comment box for technical requirements)

Categories

T3 FAIR4HCLS

49

Demonstrating Study Discovery and Exploration Through a Knowledge Graph

Lea Gütebier1, Dagmar Waltemath1,2, Ron Henkel1

1Medical Informatics Laboratory, Institute for Community Medicine, University Medicine Greifswald, Greifswald, Germany. 2Data Integration Center, University Medicine Greifswald, Greifswald, Germany

Abstract

Efficiently finding and contextualizing clinical studies is a critical challenge for clinician scientists, and it also impacts patients. To improve access to information about clinical studies, we developed a knowledge graph that integrates diverse resources, including information from ClinicalTrials.gov and the German Portal for Medical Data Models. We enrich the knowledge graph with entries from biomedical terminologies such as the Unified Medical Language System (UMLS).
The knowledge graph facilitates enhanced interoperability and complex query capabilities, enables targeted searches and the identification of study similarities, and allows scientists to efficiently explore study eligibility criteria.
Through demonstration queries, we illustrate these capabilities.

Submission type

2. Demonstration (max 2 pages; use comment box for technical requirements)

Categories

C2 – Tools and methods based on Semantic Web, knowledge representation and AI