Updating and improving publicly available TF-DNA interaction data processing and curation for the JASPAR database

Supervision team information: Computational Biology & Gene Regulation group, Norwegian Centre for Molecular Biosciences and Medicine (NCMBM), UiO
Supervisor: Anthony Mathelier
IBV supervisor: Eivind Valen
Co-supervisor: Ieva Rauluseviciute
e-mail address: anthony.mathelier@ncmbm.uio.n

Introduction

The JASPAR database is one of the most widely used resources for transcription factor (TF) binding profiles. It is a comprehensive, open-access resource that provides manually curated position frequency matrices (PFMs) and deep learning (DL) models for TFs in six taxonomic groups [1]. Each update requires mining TF-DNA interaction data (ChIP-seq, CUT&Tag, HT-SELEX, and others) from resources such as GEO and GTRD [2,3]. The data are then processed to discover DNA binding patterns, known as motifs. The most popular tools for this task include RSAT, MEME, STREME, and HOMER [4–7]. Manual curation is what keeps the profiles in JASPAR high quality, but the manual steps are usually time-consuming. Emerging applications of large language models (LLMs) could substantially improve the curation of incoming data [8,9].
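To make the notion of a PFM concrete, the sketch below converts a small count matrix into log-odds weights and scans a sequence for the best-scoring site. It is only an illustration: the matrix values, the pseudocount of 1, the uniform background of 0.25, and the function names `pfm_to_pwm`, `score_site`, and `best_match` are all assumptions made for this example, not a JASPAR profile or part of the JASPAR pipelines.

```python
import math

# Hypothetical 4-column PFM: counts of each base at each motif position.
# These numbers are invented for illustration and are not a real JASPAR profile.
PFM = {
    "A": [8, 0, 1, 10],
    "C": [1, 0, 8, 0],
    "G": [1, 10, 0, 0],
    "T": [0, 0, 1, 0],
}


def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights against a uniform background.

    The pseudocount and background values are assumptions for this sketch.
    """
    width = len(next(iter(pfm.values())))
    pwm = {base: [] for base in pfm}
    for i in range(width):
        # Column total including one pseudocount per base.
        total = sum(pfm[b][i] for b in pfm) + pseudocount * len(pfm)
        for b in pfm:
            prob = (pfm[b][i] + pseudocount) / total
            pwm[b].append(math.log2(prob / background))
    return pwm


def score_site(pwm, site):
    """Sum the per-position weights for one candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(site))


def best_match(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(next(iter(pwm.values())))
    windows = [
        (score_site(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(windows)


pwm = pfm_to_pwm(PFM)
best_score, start = best_match(pwm, "TTAGCATT")
print(f"best site starts at {start} with score {best_score:.2f}")
```

Only one strand is scanned here to keep the sketch short; a realistic scan would also score the reverse complement.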

Aim

The project aims to explore de novo motif discovery tools and motif quality evaluation techniques to ensure high-quality data integration and analysis. It involves working with big data and navigating data repositories and scientific databases. The project will also use the latest LLM technologies to develop a more automated method for retrieving relevant data and curating information from the scientific literature.

Methods

The student(s) will be involved in the development and update of the JASPAR data processing and curation pipelines. The methods range from simple motif discovery tools to, potentially, advanced LLMs. In the lab, we use multiple programming languages and pipeline management systems (bash, Python, R, Snakemake, Nextflow). In addition, the student(s) can be involved in web application development, data mining, processing, and curation.

Learning outcome

The student(s) will be exposed to an interdisciplinary research environment where they will be trained and gain knowledge and skills in:

  • Biological knowledge and interpretation of TF-DNA interactions and their importance for transcriptional regulation
  • High-throughput TF-DNA interaction data analysis applying various computational methods
  • Data and scientific literature curation and interpretation
  • Pipeline development using Snakemake and/or Nextflow
  • Good practices for software development

Host environment

The selected candidate(s) will be part of the Computational Biology and Gene Regulation group (CBGR) at the Norwegian Centre for Molecular Biosciences and Medicine (NCMBM), UiO, led by Prof. Anthony Mathelier. The group combines wet lab and computational techniques to shed light on the transcriptional regulation of gene expression in health and disease. Main supervision will be provided by Dr. Mathelier, with close day-to-day supervision from Dr. Ieva Rauluseviciute.

References

1. Rauluseviciute, I. et al. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 52, D174–D182 (2024).

2. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 41, D991–5 (2013).

3. Kolmykov, S. et al. GTRD: an integrated view of transcription regulation. Nucleic Acids Res. 49, D104–D111 (2021).

4. Thomas-Chollier, M. et al. A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nat. Protoc. 7, 1551–1568 (2012).

5. Thomas-Chollier, M. et al. RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res. 40, e31 (2012).

6. Bailey, T. L., Johnson, J., Grant, C. E. & Noble, W. S. The MEME suite. Nucleic Acids Res. 43, W39–49 (2015).

7. Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).

8. Chen, Z., Cao, L. & Madden, S. Lingua Manga: A Generic Large Language Model Centric System for Data Curation. arXiv [cs.DB] (2023).

9. Hatch, V. Deciphering the data deluge: how large language models are transforming scientific data curation. EMBL-EBI News https://www.ebi.ac.uk/about/news/technology-and-innovation/deciphering-the-data-deluge-how-large-language-models-are-transforming-scientific-data-curation/ (2023).

Published 3 Sep. 2025 10:51 - Last modified 3 Sep. 2025 10:51

Scope (credits)

60