publications
2024
- A comparative analysis of language models for the classification of Alzheimer’s disease based on connected speech. Helena Balabin, Laure Spruyt, Ella Eycken, and 6 more authors. In Proceedings of the First Workshop on Artificial Intelligence for Brain Encoding and Decoding, Feb 2024
Early diagnosis of neurological disorders is a fast-developing field of applied artificial intelligence (AI). In this context, AI-based language models have been increasingly used to distinguish cognitively healthy individuals from those affected by Alzheimer’s disease (AD) based on their connected speech. Yet, it remains unknown how the adaptation of the language models to the language and domain of the connected speech samples impacts the classification results. Here, we construct several classification tasks from Dutch (Flemish) samples of connected speech from a cohort of 74 healthy controls and 20 subjects affected by AD. First, we compare the classification performance of Dutch and multilingual models as well as models that incorporate long-range context. Additionally, we examine how varying amounts of fine-tuning data from a separate corpus of speech samples affect domain adaptation. We demonstrate that increasing fine-tuning data leads to increased domain adaptation, but it does not necessarily translate into improved classification performance. Furthermore, our findings support the use of language-specific models over multilingual ones, even for multilingual models that were trained to incorporate wider context.
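A minimal sketch of the kind of classification setup described above, assuming a standard HuggingFace fine-tuning workflow; the Dutch model name and toy transcripts are illustrative stand-ins, not the paper’s exact configuration:

```python
# Hypothetical example: a Dutch language model as a binary
# AD-vs-healthy-control classifier over connected-speech transcripts.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "GroNLP/bert-base-dutch-cased"  # one plausible Dutch model; the paper compares several
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy transcripts with labels (1 = AD, 0 = healthy control)
texts = ["transcript of connected speech from one participant",
         "transcript from another participant"]
labels = torch.tensor([1, 0])

enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
outputs = model(**enc, labels=labels)  # forward pass with a classification loss
print(outputs.loss.item(), outputs.logits.softmax(dim=-1))
```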
- Semantic harmonization of Alzheimer’s disease datasets using AD-Mapper. Philipp Wegner, Helena Balabin, Mehmet Can Ay, and 5 more authors. [Accepted] Journal of Alzheimer’s Disease, Apr 2024
2023
- Investigating Neural Fit Approaches for Sentence Embedding Model Paradigms. Helena Balabin, Antonietta Gabriella Liuzzi, Jingyuan Sun, and 3 more authors. In Proceedings of the 26th European Conference on Artificial Intelligence, Oct 2023
In recent years, representations from brain activity patterns and pre-trained language models have been linked to each other based on neural fits to validate hypotheses about language processing. Nonetheless, open questions remain about what intrinsic properties of language processing these neural fits reflect and whether they differ across neural fit approaches, brain networks, and models. In this study, we use parallel sentence and functional magnetic resonance imaging data to perform a comprehensive analysis of four paradigms (masked language modeling, pragmatic coherence, semantic comparison, and contrastive learning) representing linguistic hypotheses about sentence processing. We include three sentence embedding models for each paradigm, resulting in a total of 12 models, and examine differences in their neural fit to four different brain networks using regression-based neural encoding and Representational Similarity Analysis (RSA). Among the different models tested, GPT-2, SkipThoughts, and S-RoBERTa yielded the strongest correlations with language network patterns, whereas contrastive learning-based models resulted in overall low neural fits. Our findings demonstrate that neural fits vary across brain networks and models representing the same linguistic hypothesis (e.g., GPT-2 and GPT-3). More importantly, we show the need for both neural encoding and RSA as complementary methods to provide a full understanding of neural fits. All code used in the analysis is publicly available: https://github.com/lcn-kul/sentencefmricomparison.
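As a complement to the abstract, a minimal RSA sketch under assumed shapes (50 sentences, random arrays in place of real embeddings and fMRI patterns):

```python
# Representational Similarity Analysis: build representational dissimilarity
# matrices (RDMs) for model embeddings and brain patterns, then correlate
# their condensed upper triangles with Spearman's rho.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
model_embeddings = rng.normal(size=(50, 768))  # 50 sentences x embedding dim
brain_patterns = rng.normal(size=(50, 2000))   # 50 sentences x voxels in one network

model_rdm = pdist(model_embeddings, metric="correlation")  # condensed RDM
brain_rdm = pdist(brain_patterns, metric="correlation")

rho, p = spearmanr(model_rdm, brain_rdm)
print(f"RSA neural fit (Spearman rho): {rho:.3f}, p = {p:.3g}")
```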
2022
- How Relevant is Selective Memory Population in Lifelong Language Learning? Vladimir Araujo, Helena Balabin, Julio Hurtado, and 2 more authors. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, Nov 2022
Lifelong language learning seeks to have models continuously learn multiple tasks in a sequential order without suffering from catastrophic forgetting. State-of-the-art approaches rely on sparse experience replay as the primary approach to prevent forgetting. Experience replay usually adopts sampling methods for the memory population; however, the effect of the chosen sampling strategy on model performance has not yet been studied. In this paper, we investigate how relevant the selective memory population is in the lifelong learning process of text classification and question-answering tasks. We found that methods that randomly store a uniform number of samples from the entire data stream lead to high performance, especially for low memory sizes, which is consistent with computer vision studies.
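The uniform random strategy the paper highlights can be implemented as reservoir sampling over the task stream; a minimal sketch, with the memory budget and stream purely illustrative:

```python
# Reservoir sampling: every example in the stream ends up in the fixed-size
# memory with equal probability, regardless of stream length.
import random

def reservoir_update(memory, example, seen_count, memory_size):
    """Keep `memory` a uniform random sample of the first `seen_count` examples."""
    if len(memory) < memory_size:
        memory.append(example)
    else:
        j = random.randint(0, seen_count - 1)  # uniform over all examples seen so far
        if j < memory_size:
            memory[j] = example

memory, budget = [], 100
for i, example in enumerate(range(10_000)):  # stand-in for a lifelong task stream
    reservoir_update(memory, example, seen_count=i + 1, memory_size=budget)
print(len(memory), memory[:5])
```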
- Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications. In Interspeech 2022, Oct 2022
Speech quality in online conferencing applications is typically assessed through human judgements in the form of the mean opinion score (MOS) metric. Since such a labor-intensive approach is not feasible for large-scale speech quality assessments in most settings, the focus has shifted towards automated MOS prediction through end-to-end training of deep neural networks (DNN). Instead of training a network from scratch, we propose to leverage the speech representations from the pre-trained wav2vec-based XLS-R model. However, the number of parameters of such a model exceeds task-specific DNNs by several orders of magnitude, which poses a challenge for resulting fine-tuning procedures on smaller datasets. Therefore, we opt to use pre-trained speech representations from XLS-R in a feature extraction rather than a fine-tuning setting, thereby significantly reducing the number of trainable model parameters. We compare our proposed XLS-R-based feature extractor to a Mel-frequency cepstral coefficient (MFCC)-based one, and experiment with various combinations of bidirectional long short-term memory (Bi-LSTM) and attention pooling feedforward (AttPoolFF) networks trained on the output of the feature extractors. We demonstrate the increased performance of pre-trained XLS-R embeddings in terms of a reduced root mean squared error (RMSE) on the ConferencingSpeech 2022 MOS prediction task.
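A minimal sketch of the feature-extraction setting, assuming the XLS-R checkpoint and head sizes shown here; the paper additionally evaluates attention pooling feedforward heads, which this sketch replaces with simple mean pooling:

```python
# Frozen XLS-R as a feature extractor; only a small Bi-LSTM regression head
# on top is trained to predict MOS.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

xlsr = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
xlsr.eval()
for p in xlsr.parameters():
    p.requires_grad = False  # feature extraction only; no fine-tuning

class MosHead(nn.Module):
    def __init__(self, feat_dim=1024, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, feats):
        h, _ = self.lstm(feats)
        return self.out(h.mean(dim=1))  # mean pooling as a simple stand-in

head = MosHead()
waveform = torch.randn(1, 16000)  # one second of 16 kHz audio as a stand-in
with torch.no_grad():
    feats = xlsr(waveform).last_hidden_state  # (batch, frames, 1024)
print(head(feats).shape)  # predicted MOS, shape (1, 1)
```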
- Deep Learning-based detection of psychiatric attributes from German mental health records. Sumit Madan, Fabian Julius Zimmer, Helena Balabin, and 7 more authors. International Journal of Medical Informatics, May 2022
Background: Health care records provide large amounts of data with real-world and longitudinal aspects, which is advantageous for predictive analyses and improvements in personalized medicine. Text-based records are a main source of information in mental health. Therefore, application of text mining to electronic health records – especially the mental state examination – is a key approach for detection of psychiatric disease phenotypes that relate to treatment outcomes.
Methods: We focused on the mental state examination (MSE) in the patients’ discharge summaries as the key part of the psychiatric records. We prepared a sample of 150 text documents that we manually annotated for psychiatric attributes and symptoms. These documents were further divided into training and test sets. We designed and implemented a system to detect the psychiatric attributes automatically and linked the pathologically assessed attributes to AMDP terminology. This workflow uses a pre-trained neural network model, which is fine-tuned on the training set and validated on the independent test set. Furthermore, a traditional NLP and rule-based component linked the recognized mentions to AMDP terminology. In a further step, we applied the system to a larger clinical dataset of 510 patients to extract their symptoms.
Results: The system identified the psychiatric attributes as well as their assessment (normal and pathological) and linked these entities to the AMDP terminology, with F1-scores of 86% and 91%, respectively, on an independent test set.
Conclusion: The development of the current text mining system and the results highlight the feasibility of text mining methods applied to MSE in electronic mental health care reports. Our findings pave the way for the secondary use of routine data in the field of mental health, facilitating further clinical data analyses.
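A minimal sketch of the attribute-detection step framed as token classification, with an illustrative German base model and toy tag set; the AMDP linking is a separate rule-based component in the paper:

```python
# Hypothetical example: token classification over psychiatric attributes.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-ATTRIBUTE", "I-ATTRIBUTE"]  # toy tag set, not the paper's schema
model_name = "bert-base-german-cased"         # a plausible German base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

enc = tokenizer("Der Patient wirkt affektverflacht.", return_tensors="pt")
logits = model(**enc).logits  # (1, seq_len, num_labels)
pred = [labels[i] for i in logits.argmax(-1)[0].tolist()]
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
print(list(zip(tokens, pred)))  # untrained predictions; fine-tuning assigns real tags
```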
- ProtSTonKGs: A Sophisticated Transformer Trained on Protein Sequences, Text, and Knowledge Graphs. Helena Balabin, Charles Tapley Hoyt, Benjamin M Gyori, and 4 more authors. In Proceedings of the 13th International Conference on Semantic Web Applications and Tools for Life Sciences, Jan 2022
While most approaches individually exploit unstructured data from the biomedical literature or structured data from biomedical knowledge graphs, their union can better exploit the advantages of such approaches, ultimately improving representations of biology. Using multimodal transformers for such purposes can improve performance on context-dependent classification tasks, as demonstrated by our previous model, the Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs (STonKGs). In this work, we introduce ProtSTonKGs, a transformer aimed at learning all-encompassing representations of protein-protein interactions. ProtSTonKGs presents an extension to our previous work by adding textual protein descriptions and amino acid sequences (i.e., structural information) to the text- and knowledge graph-based input sequence used in STonKGs. We benchmark ProtSTonKGs against STonKGs, resulting in improved F1 scores by up to 0.066 (i.e., from 0.204 to 0.270) in several tasks such as predicting protein interactions in several contexts. Our work demonstrates how multimodal transformers can be used to integrate heterogeneous sources of information, laying the foundation for future approaches that use multiple modalities for biomedical applications.
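A conceptual sketch of the multimodal input composition described above; the separator tokens and example triple are illustrative, not the actual ProtSTonKGs vocabulary:

```python
# Text tokens, KG triple tokens, and an amino acid sequence concatenated
# into a single transformer input sequence.
text_tokens = ["insulin", "activates", "the", "insulin", "receptor"]
kg_tokens = ["INS", "activates", "INSR"]         # entity/relation tokens of one triple
protein_sequence = list("MALWMRLLPLLALLALWGPD")  # amino acids as tokens (truncated)

input_sequence = (
    ["[CLS]"] + text_tokens
    + ["[SEP]"] + kg_tokens
    + ["[SEP]"] + protein_sequence + ["[SEP]"]
)
print(input_sequence)
```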
- STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs. Helena Balabin, Charles Tapley Hoyt, Colin Birkenbihl, and 6 more authors. Bioinformatics, Mar 2022
The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models. However, representations based on a single modality are inherently limited.
To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs (KGs). This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations in a shared embedding space. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against three baseline models trained on either one of the modalities (i.e. text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the more challenging tasks with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.084 (i.e. from 0.881 to 0.965). Finally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications.
We make the source code and the Python package of STonKGs available at GitHub (https://github.com/stonkgs/stonkgs) and PyPI (https://pypi.org/project/stonkgs/). The pre-trained STonKGs models and the task-specific classification models are respectively available at https://huggingface.co/stonkgs/stonkgs-150k and https://zenodo.org/communities/stonkgs.
Supplementary data are available at Bioinformatics online.
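A hedged loading sketch, assuming the released checkpoint is compatible with the generic transformers loading API; the stonkgs Python package linked above provides the supported interface:

```python
# Pull the pre-trained STonKGs checkpoint from the Hugging Face Hub
# (assumed to load through the generic AutoModel entry point).
from transformers import AutoModel

model = AutoModel.from_pretrained("stonkgs/stonkgs-150k")
print(model.config)
```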
2020
- The COVID-19 Ontology. Astghik Sargsyan, Alpha Tom Kodamullil, Shounak Baksi, and 10 more authors. Bioinformatics, Dec 2020
The COVID-19 pandemic has prompted an impressive, worldwide response by the academic community. In order to support text mining approaches as well as data description, linking and harmonization in the context of COVID-19, we have developed an ontology representing major novel coronavirus (SARS-CoV-2) entities. The ontology has a strong focus on chemical entities suited for drug repurposing, as this is a major target of ongoing COVID-19 therapeutic development.
The ontology comprises 2270 classes of concepts and 38 987 axioms (2622 logical axioms and 2434 declaration axioms). It depicts the roles of molecular and cellular entities in virus-host interactions and in the virus life cycle, as well as a wide spectrum of medical and epidemiological concepts linked to COVID-19. The performance of the ontology has been tested on Medline and the COVID-19 corpus provided by the Allen Institute.
The COVID-19 Ontology is released under a Creative Commons 4.0 License and shared via https://github.com/covid-19-ontology/covid-19. The ontology is also deposited in BioPortal at https://bioportal.bioontology.org/ontologies/COVID-19.
Supplementary data are available at Bioinformatics online.
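A minimal sketch of loading the released ontology with rdflib and counting its classes; the local file name is an assumption, with the file obtained from the GitHub repository above:

```python
# Parse the OWL (RDF/XML) release and count declared OWL classes.
from rdflib import Graph
from rdflib.namespace import RDF, OWL

g = Graph()
g.parse("covid-19.owl", format="xml")  # assumed local copy of the released file
n_classes = len(set(g.subjects(RDF.type, OWL.Class)))
print(f"{n_classes} OWL classes")
```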