A genomic language model called resLens could help researchers identify antibiotic resistance genes that may be missed by conventional database matching tools, offering a faster path to tracking emerging resistance while highlighting the need for careful validation.
Study: resLens: genomic language models to enhance the detection of antibiotic resistance genes.
A recent study published in npj Antimicrobials and Resistance developed a family of novel genomic language models (gLMs), named resLens, to improve the detection of antibiotic resistance genes (ARGs).
The increase in antibiotic resistance in pathogenic microbes warrants the development of more advanced tools for studying ARGs and their evolution. Most available alignment-based tools, such as k-mer approaches, best-hit algorithms, and hidden Markov model (HMM) methods, have several limitations, including poor performance on variant ARGs that do not closely match reference sequences.
Furthermore, databases capture only a fraction of resistance genes and may not keep pace with the scale and speed of resistance evolution. Deep learning methods are more flexible than alignment-based tools and have attempted to address these limitations, but many previous approaches must learn ARG and protein function representations from scratch. resLens instead uses transfer learning from a pre-trained DNA language model.
ARG dataset and resLens model design
In the present study, the researchers introduced resLens for enhanced ARG detection and analysis. ARGs were sourced from the National Center for Biotechnology Information (NCBI) Pathogen Detection refGene and ResFinder databases. These datasets were merged, and genes that were perfect duplicates or perfect sub-sequences of other genes conferring resistance to the same class of antibiotics were excluded.
Antibiotic resistance classes with ≥ 20 instances in the dataset were then retained and passed through the Prodigal tool to ensure only open reading frames (ORFs) were present. This preprocessing yielded over 7,600 ARGs in 12 classes of antibiotics. Further, GenBank was queried for bacterial non-resistance genes of comparable length to the ARGs, excluding those with >90% sequence identity to any ARG sequence.
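The de-duplication and class-size filtering described above can be illustrated with a minimal sketch. This is not the authors' code: the function name `filter_args`, the `(sequence, antibiotic_class)` tuple layout, and the reading of "perfect sub-sequence" as exact string containment are all assumptions made for illustration. The Prodigal step and the >90% identity filter for non-resistance genes are not shown.

```python
from collections import Counter


def filter_args(records, min_per_class=20):
    """Filter a list of (sequence, antibiotic_class) tuples.

    Hypothetical helper: removes exact duplicates and exact
    sub-sequences within the same class, then keeps only classes
    that still have at least `min_per_class` members.
    """
    # set() removes exact (sequence, class) duplicates; sorting longest
    # first means any containing sequence is seen before its sub-sequences.
    unique = sorted(set(records), key=lambda r: (-len(r[0]), r[0], r[1]))

    kept = []
    for seq, cls in unique:
        # Skip a gene fully contained in an already kept gene of the same class.
        if any(cls == k_cls and seq in k_seq for k_seq, k_cls in kept):
            continue
        kept.append((seq, cls))

    # Retain only classes with enough remaining examples.
    counts = Counter(cls for _, cls in kept)
    return [(seq, cls) for seq, cls in kept if counts[cls] >= min_per_class]
```

The pairwise containment check is quadratic in the number of genes, which is acceptable for a dataset of a few thousand sequences but would need an index for much larger collections.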
The ARG dataset was merged with an equal number of randomly selected non-resistance genes. This dataset was used to fine-tune the long-read (LR) model. For the short-read (SR) dataset, whole-gene sequences were split into 150 base pair (bp) reads. The datasets were divided into 80% training sets and 20% testing sets. In total, four models were fine-tuned: two for SR data and two for LR data. One model performed a binary classification of non-ARG and ARG for each dataset.
The second model then classified predicted ARGs into specific antibiotic resistance classes. The team evaluated the resLens models against five alignment-based tools (AMR++, k-mer-based antibiotic resistance gene analyzer [KARGA], ResFinder, Meta-MARC, and Resistance Gene Identifier [RGI]) and two deep learning models (DeepARG and ARGNet). The researchers observed that resLens outperformed the other models on the LR dataset.
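As a rough illustration of the data preparation and two-stage prediction described above, the sketch below fragments a gene into 150 bp reads, makes an 80/20 split, and chains a binary classifier with a class classifier. Several details are assumptions rather than facts from the study: the reads are taken as non-overlapping windows with any shorter trailing fragment dropped, the split is a simple seeded shuffle, and `is_arg` and `predict_class` are hypothetical stand-ins for the two models.

```python
import random

READ_LEN = 150  # read length in bp, as stated in the article


def split_into_reads(seq, read_len=READ_LEN):
    # Assumption: non-overlapping windows; a trailing fragment shorter
    # than read_len is discarded.
    return [seq[i:i + read_len]
            for i in range(0, len(seq) - read_len + 1, read_len)]


def train_test_split(items, test_frac=0.2, seed=0):
    # Shuffle a copy with a fixed seed, then hold out test_frac of the items.
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    return items[n_test:], items[:n_test]


def classify(seq, is_arg, predict_class):
    # Stage 1: binary ARG vs non-ARG call.
    # Stage 2: antibiotic class, applied only to predicted ARGs.
    # Both callables are hypothetical placeholders for trained models.
    return predict_class(seq) if is_arg(seq) else "non-ARG"
```

In this arrangement a sequence rejected at the first stage never reaches the class model, so errors in the binary step cap the recall of the whole pipeline.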
resLens benchmarking and performance results
However, the difference between resLens and KARGA or RGI was modest. Notably, RGI and KARGA outperformed resLens on the short-read (SR) dataset. In addition, the resLens models reproduced the class distribution of the LR test set more closely than the other models did. resLens also showed competitive wall-clock inference times on the test sets, although it was slower only than ARGNet on the LR test set and only than DeepARG and KARGA on the SR test set.
In addition, the team aimed to evaluate the performance of the model on novel ARGs. To this end, they identified two gene families, conferring resistance to aminoglycosides (aminoglycoside nucleotidyltransferase, ANT) and beta-lactams (blaADC), respectively, that had low sequence similarity to other gene families conferring resistance to the same antibiotics. The team then created an LR test set containing only ANT and blaADC family genes and an LR training set containing the other genes.
The model was fine-tuned and evaluated on the new training and test sets. It accurately classified genes held out from the training set, although performance varied by gene family and was stronger for blaADC than for ANT. For comparison with an alignment-based method, the ResFinder database was rebuilt without ANT and blaADC genes, and ResFinder was evaluated on this new test set of unseen sequences. ResFinder performed poorly, identifying 86% of the ANT genes but none of the blaADC genes.
The researchers also performed a more rigorous clustering-based analysis to test performance on more dissimilar sequences. Performance decreased, especially for binary ARG detection, indicating that resLens could generalize beyond narrow database matches but still lost accuracy under stronger distribution shifts.
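The held-out family experiment can be sketched as a split that routes whole gene families to the test set, followed by a per-family detection rate. This is an illustrative sketch only: the `(family, sequence)` tuple layout, both function names, and the `is_detected` callable (a stand-in for any detector, whether a model or an alignment tool) are assumptions, not part of the study's code.

```python
from collections import defaultdict


def leave_family_out(records, held_out_families):
    """Split (family, sequence) tuples so held-out families appear only in test."""
    held = set(held_out_families)
    train = [r for r in records if r[0] not in held]
    test = [r for r in records if r[0] in held]
    return train, test


def detection_rate_by_family(test, is_detected):
    """Fraction of test genes flagged by `is_detected`, reported per family."""
    hits, totals = defaultdict(int), defaultdict(int)
    for family, seq in test:
        totals[family] += 1
        hits[family] += bool(is_detected(seq))
    return {family: hits[family] / totals[family] for family in totals}
```

Reporting the rate per family rather than as one pooled number matters here, because a detector can do well on one held-out family and fail entirely on another.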
Limits of whole-genome testing and screening
Finally, the team applied the LR models to whole genome sequencing (WGS) data from organisms with validated resistance phenotypes. RGI and ResFinder were tested in the same way for comparison. Filtering and mapping antibiotic classes to those predicted by resLens yielded 79 genomes with validated resistance phenotypes, with one to three antibiotic classes per organism. RGI and resLens identified at least one gene corresponding to a labeled phenotype of a given genome more often than ResFinder.
However, the authors pointed out that this WGS analysis was exploratory rather than a definitive benchmark because the dataset had a limited sample size and non-exhaustive laboratory testing, and it lacked gene-level annotation of the mechanisms underlying each resistance phenotype. Manual validation of resLens predictions identified many true positives, but also false positives and ambiguous calls or misclassifications, highlighting the need to use such tools for screening and hypothesis generation rather than for definitive conclusions.
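The genome-level comparison described above, in which a tool scores a hit when at least one predicted gene matches a labeled phenotype, can be sketched as follows. The function names and the representation of each genome as a pair of class collections are assumptions for illustration. Note that this measure checks only class-level agreement and says nothing about whether the predicted gene is the true mechanism, which is consistent with the authors' caution about treating the analysis as exploratory.

```python
def has_phenotype_hit(predicted_classes, phenotype_classes):
    # True if any predicted antibiotic class matches a labeled phenotype class.
    return not set(predicted_classes).isdisjoint(phenotype_classes)


def hit_rate(genomes):
    # genomes: list of (predicted_classes, phenotype_classes) pairs,
    # one pair per genome (hypothetical layout).
    if not genomes:
        return 0.0
    return sum(has_phenotype_hit(p, y) for p, y in genomes) / len(genomes)
```

A tool that predicts many classes per genome will score well on this measure even if most of its calls are false positives, so it is best read alongside manual validation.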
Genomic language models improve ARG control
The findings show that gLMs can classify ARGs with high fidelity and speed and are less dependent on databases than other deep learning or alignment tools. resLens models outperformed deep learning tools and performed competitively with leading alignment-based tools. Overall, the results highlight the potential of gLMs to improve ARG detection, including for ARGs with limited representation in reference databases, while reducing reliance on curated reference datasets without eliminating it.
