Summary: A team of researchers from the Dana-Farber Cancer Institute, the Broad Institute of MIT and Harvard, Google, and Columbia University has created an artificial intelligence model that can predict which genes are expressed in any type of human cell. The model, called EpiBERT, was inspired by BERT, a deep learning model designed to understand and generate human-like language.
EpiBERT was trained in multiple phases on data from hundreds of human cell types. It was fed the genomic sequence, which is 3 billion base pairs long, along with chromatin accessibility maps that indicate which of those sequences are unwound from the chromosome and read by the cell. The model was first trained to learn the relationship between DNA sequence and chromatin accessibility across large chunks of the genome in a given cell type. It then used these learned relationships to predict which genes were active in the corresponding cell type. It accurately identified regulatory elements (parts of the genome recognized by transcription factors) and their influence on gene expression across many cell types, building a "grammar" that is generalizable and predictive. This grammar-building process can be likened to the way a large language model, such as ChatGPT, learns to compose meaningful sentences and paragraphs from many examples of text. EpiBERT can process accessibility data and predict functional bases as well as RNA expression for a cell type it has never seen before.
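The first training phase described above can be pictured with a toy sketch. The summary says only that the model was inspired by BERT and learned to relate DNA sequence to chromatin accessibility; the random masking of accessibility bins shown here is an assumption borrowed from how BERT is trained on text, not a confirmed detail of the authors' pipeline. The function names, the masking fraction, and the toy data are all hypothetical.

```python
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T).

    Unknown bases such as N become an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def mask_bins(signal, mask_fraction=0.15, mask_value=0.0, seed=0):
    """Hide a random subset of accessibility bins (assumed BERT-style objective).

    Returns (masked_signal, masked_indices). A model would then be trained
    to reconstruct the hidden values from the DNA sequence and the
    remaining, unmasked bins.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(signal) * mask_fraction))
    masked_indices = sorted(rng.sample(range(len(signal)), n_masked))
    masked = list(signal)
    for i in masked_indices:
        masked[i] = mask_value
    return masked, masked_indices

# Toy inputs, made up for illustration only.
sequence = "ACGTNACGGT"
accessibility = [0.1, 0.9, 0.8, 0.0, 0.2, 0.7, 0.3, 0.0, 0.5, 0.6]

encoded = one_hot(sequence)
masked_signal, targets = mask_bins(accessibility, mask_fraction=0.3)
print(len(encoded), "bases encoded;", len(targets), "bins masked at", targets)
```

In a real setting the inputs would span large stretches of the genome rather than ten bases, and the reconstruction would be done by a transformer; the sketch only shows how the two input types might be prepared.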
Significance: Every cell in the body has the same genome sequence, so the difference between two cell types lies not in which genes the genome contains, but in which genes are turned on, when, and how strongly. About 20% of the genome codes for regulatory elements that determine which genes are turned on, but very little is known about where in the genome those codes are, what their instructions look like, or how mutations affect function in a cell. EpiBERT will shed light on how genes are regulated in cells and, potentially, on how the cell's regulatory system can be mutated in ways that lead to diseases such as cancer.
Funding: The Broad Institute, the Novo Nordisk Foundation, the National Human Genome Research Institute, the Sharf Green Cancer Research Fund, the Richard and Nancy Lubin family, and the American Cancer Society. Tensor Processing Unit (TPU) access and support were provided by Google.
Source:
Journal reference:
Javed, N., et al. (2025). A multi-modal transformer for cell type-agnostic regulatory predictions. Cell Genomics. doi.org/10.1016/j.xgen.2025.100762.