The microscopic organisms that populate our bodies, soil, oceans and atmosphere play essential roles in human health and the planet’s ecosystems. However, even with modern DNA sequencing, understanding what these microbes are and how they are related to each other remains extremely difficult.
In a pair of new studies, researchers at Arizona State University are introducing powerful tools that make this work easier, more precise, and much more scalable. One tool improves the way scientists build microbial family trees. The other provides a software foundation used worldwide to analyze biological data.
Together, these advances strengthen the scientific foundations of microbiome research, disease surveillance, environmental monitoring, and emerging fields such as precision medicine.
“Our team builds open source software tools because we believe that when everyone can access and extend scientific tools, the entire community benefits and discovery is accelerated.”
Qiyun Zhu, Arizona State University
Zhu is a researcher in the Biodesign Center for Fundamental and Applied Microbiomics and an assistant professor in ASU’s School of Life Sciences. He is joined by ASU colleagues and international partners.
The first study, on improving the selection of marker genes, appears in the journal Nature Communications. The second study, describing an open source software library known as scikit-bio, appears in Nature Methods.
A family affair
Building detailed and accurate evolutionary trees is essential to understanding how microbes evolve and affect the world. Better evolutionary trees improve disease surveillance and help scientists follow how harmful microbes change over time. They also sharpen environmental research by showing how microbial communities respond to pollution or climate change. Clearer microbial identification also enhances studies of the gut microbiome and its role in health.
Unraveling how microbes are related starts with choosing the right marker genes, the genes whose sequences are compared across organisms to trace their evolutionary history.
For many years, scientists relied on the same small set of traditional marker genes. But in the growing field of metagenomics, researchers are now working with millions of genomes, often recovered directly from environmental samples. Metagenomics allows scientists to collect all the DNA in an environment and sequence it all at once, revealing entire hidden microbial communities.
These genomes are extremely valuable, but are often incomplete or uneven in quality. This makes it difficult to use a fixed set of marker genes and expect accurate evolutionary results.
To solve this, Zhu and his colleagues helped develop TMarSel (short for Tree-based Marker Selection). Instead of selecting genes by hand, TMarSel automatically searches thousands of possible gene families and selects the combination that creates the most reliable evolutionary tree. It evaluates each gene for how common it is, how informative it is, and how much it contributes to a stable, meaningful picture of microbial relationships.
The result is a flexible, data-driven way to create microbial trees that work well for large and diverse groups of organisms, even when many genomes are only partially complete.
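The idea of choosing markers from the data rather than from a fixed list can be illustrated with a toy sketch. This is not TMarSel's actual algorithm or scoring criterion: the function `select_markers`, the family names and the weighting rule are all hypothetical, and the sketch looks only at how widely each gene family is present, ignoring how informative it is for the tree. It greedily picks gene families so that genomes with few markers so far count for more, which is one simple way to cope with incomplete genomes.

```python
from typing import Dict, List, Set


def select_markers(presence: Dict[str, Set[str]],
                   genomes: List[str],
                   k: int) -> List[str]:
    """Greedily pick up to k gene families, favoring poorly covered genomes.

    presence maps a gene family name to the set of genomes that carry it.
    This is an illustrative heuristic, not the published TMarSel method.
    """
    counts = {g: 0 for g in genomes}  # markers selected so far, per genome
    candidates = set(presence)
    chosen: List[str] = []

    def gain(family: str) -> float:
        # Each genome carrying the family contributes 1 / (1 + markers it
        # already has), so genomes with few markers weigh more.
        return sum(1.0 / (1 + counts[g])
                   for g in presence[family] if g in counts)

    for _ in range(min(k, len(candidates))):
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=gain)
        chosen.append(best)
        candidates.remove(best)
        for g in presence[best]:
            if g in counts:
                counts[g] += 1
    return chosen


# Hypothetical presence/absence data for four genomes.
presence = {
    "famA": {"g1", "g2", "g3"},
    "famB": {"g1", "g2"},
    "famC": {"g3", "g4"},
    "famD": {"g4"},
}
genomes = ["g1", "g2", "g3", "g4"]
markers = select_markers(presence, genomes, 2)
print(markers)
```

In this example the second pick is `famC` rather than `famB`, because `famC` is the family that brings the so far uncovered genome `g4` into the marker set.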
Scikit-bio: Ancestry.com for germs
Zhu is also the lead developer of scikit-bio, a large open source software library that gives scientists the tools they need to analyze massive biological datasets. It is particularly useful for studying microbiomes, the communities of microbes that live in a specific environment, such as the human gut.
Biological datasets are unlike most other kinds of data: they are extremely large, very sparse, and often contain thousands of interconnected features. Standard data analysis programs are not built for this level of sparsity and complexity. Scikit-bio fills this gap by offering more than 500 functions for tasks such as:
- Comparing microbial communities.
- Calculating diversity.
- Transforming compositional data.
- Analyzing DNA, RNA and protein sequences.
- Building and modifying phylogenetic trees.
- Preparing data for machine learning.
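
To give a sense of what three of these tasks involve, here is a minimal sketch in plain Python, using only the standard library. It does not use scikit-bio and does not reproduce its API: the function names `shannon`, `bray_curtis` and `clr`, the pseudocount of 1, and the sample counts are all assumptions made for illustration. The library provides its own tested implementations of calculations like these.

```python
import math
from typing import List, Sequence


def shannon(counts: Sequence[float]) -> float:
    """Shannon diversity (natural log) of one sample's taxon counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a: Sequence[float], b: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two samples over the same taxa."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


def clr(counts: Sequence[float], pseudocount: float = 1.0) -> List[float]:
    """Centered log-ratio transform; the pseudocount avoids log(0)."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean = sum(logs) / len(logs)
    return [v - mean for v in logs]


# Hypothetical counts of four taxa in two samples.
gut = [10, 10, 0, 0]
soil = [5, 5, 5, 5]

print(shannon(gut), shannon(soil))  # diversity within each sample
print(bray_curtis(gut, soil))       # dissimilarity between the samples
print(clr(gut))                     # compositional transform of one sample
```

The sample with four evenly represented taxa has the higher Shannon diversity, and the values returned by `clr` sum to zero by construction, which is what makes them suitable for methods that assume unconstrained data.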
The project is community-driven, supported by more than 80 contributors, and maintained with rigorous testing and documentation. It has already been cited in tens of thousands of scientific papers across medicine, ecology, climate science and cancer biology. It has become an essential tool for researchers analyzing the microbiome and other large, data-rich areas of modern biology.
A new era in microbial research
As biological datasets grow, tools such as scikit-bio and TMarSel make large-scale research more reliable and reproducible.
The studies reinforce ASU’s expanding role at the intersection of biology and computing. Zhu’s work shows how combining evolutionary insight with advanced software engineering can produce tools used by scientists around the world.
As DNA sequencing continues to get faster and cheaper, scientists will uncover even more of the microbial universe. Tools like TMarSel and scikit-bio ensure that this deluge of data can be turned into real scientific knowledge.
Journal Reference:
Aton, M., et al. (2025). Scikit-bio: a fundamental Python library for analyzing biological omic data. Nature Methods. DOI:10.1038/s41592-025-02981-z. https://www.nature.com/articles/s41592-025-02981-z.
