To understand the workings of DNA in relation to disease, scientists at Los Alamos National Laboratory have developed the first multimodal deep-learning model of its kind, EPBDxDNABERT-2, capable of ascertaining the precise relationship between DNA and transcription factors, the proteins that regulate gene activity. The model leverages an aspect of DNA called DNA breathing, in which the double-helix structure opens and closes spontaneously, and it has the potential to aid in the design of drugs used to treat diseases that originate in gene activity.
“There are many types of transcription factors, and the human genome is incomprehensibly large,” said Anowarul Kabir, Los Alamos researcher and lead author on the paper. “So, it is necessary to find out which transcription factor binds to which location on the incredibly long DNA structure. We tried to solve that problem with artificial intelligence, particularly deep-learning algorithms.”
A deep-learning model trained on DNA
Written into every human cell in the equivalent of 3 billion English letters, DNA provides the blueprint for how human life grows and is maintained. Transcription factors bind onto parts of the DNA and affect the regulation of gene expression: how individual genes provide specific instructions for the development and function of cells. Because that expression can manifest itself in diseases, such as cancer, predicting transcription factors that bind with specific gene locations may have implications for drug development.
The foundational model used by the research team was trained on DNA sequences. The team built a DNA simulation program that captures numerous DNA dynamics and integrated it with the genomic foundation model, resulting in EPBDxDNABERT-2, capable of processing genome sequences across chromosomes and incorporating corresponding DNA dynamics as input. One such input, DNA breathing, or the local and spontaneous opening and closing of the DNA double-helix structure, correlates with transcriptional activity, such as transcription factor binding.
“The integration of the DNA breathing features with the DNABERT-2 foundational model greatly enhanced transcription factor-binding predictions,” said Los Alamos researcher Manish Bhattarai. “We give sections of DNA code as input to the model and ask the model whether it binds to a transcription factor, or not, across many cell lines. The results improved the predictive probability of binding specific gene locations with many transcription factors.”
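The idea of feeding two kinds of input to one predictor can be illustrated with a toy example. The sketch below is not the team's architecture: it assumes a one-hot sequence encoding, a made-up per-base breathing profile, mean pooling and random untrained weights, purely to show how a sequence and its dynamics could be fused into a single input that yields one bind-or-not probability per label. All names and values in it (`fuse`, `predict_binding`, the breathing numbers) are hypothetical.

```python
import math
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def fuse(seq, breathing):
    """Append one breathing value to each one-hot row, merging the two modalities."""
    if len(seq) != len(breathing):
        raise ValueError("need exactly one breathing value per base")
    return [row + [value] for row, value in zip(one_hot(seq), breathing)]

def mean_pool(rows):
    """Average the per-base rows into a single fixed-length feature vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def predict_binding(features, weights, biases):
    """Return one sigmoid output per label: an independent bind / no-bind probability."""
    probs = []
    for w, b in zip(weights, biases):
        logit = sum(wi * xi for wi, xi in zip(w, features)) + b
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs

# Hypothetical inputs: a short DNA segment and an invented breathing profile
# (one opening-propensity value per base; not real simulation output).
seq = "ACGTTGCA"
breathing = [0.10, 0.05, 0.02, 0.30, 0.35, 0.04, 0.03, 0.12]

# Random, untrained weights stand in for a trained model (assumption).
rng = random.Random(0)
n_labels = 3  # e.g. three transcription factor / cell line pairs
weights = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(n_labels)]
biases = [0.0] * n_labels

fused = fuse(seq, breathing)      # 8 rows x 5 features
pooled = mean_pool(fused)         # 5 features
probs = predict_binding(pooled, weights, biases)

for i, p in enumerate(probs):
    print(f"label {i}: binding probability {p:.3f}")
```

Because the weights are random, the printed probabilities carry no biological meaning. The point is only the data flow: two modalities in, one probability per transcription factor label out.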
Using Venado for AI algorithms
The team ran their deep-learning model on the Laboratory’s newest supercomputer, Venado, which combines a central processing unit with a graphics processing unit to drive artificial intelligence capabilities. A deep-learning model works in ways similar to the brain’s neural networks, incorporating images and text and uncovering complex patterns to generate predictions and insights.
To train the model, the team used gene sequencing data from 690 experimental results, encompassing 161 distinct transcription factors and 91 human cell types. They found that EPBDxDNABERT-2 significantly improves, by 9.6% in one key metric, the prediction of more than 660 transcription factor-DNA binding interactions. Further experiments on in vitro datasets, drawn from experiments in a controlled environment, complemented the in vivo datasets, which are drawn directly from research with living organisms, such as mice.
The team found that while DNA breathing alone can estimate transcriptional activity with reasonable accuracy, the multimodal model can also extract binding motifs, the specific DNA sequences to which transcription factors bind. These motifs are a crucial element for explaining transcription processes.
“As demonstrated by its performance across multiple, diverse datasets, our multimodal foundational model exhibits versatility, robustness and efficacy,” Bhattarai said. “This model signifies a substantial advancement in computational genomics, providing a sophisticated tool for analyzing complex biological mechanisms.”
Paper: “DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors.” Nucleic Acids Research. DOI: 10.1093/nar/gkae783
Funding: The work was supported by the National Institutes of Health and the National Science Foundation.
LA-UR-24-31984