Benchmarking NFMs on Viral Genomics

ViroBench for nucleotide foundation models

ViroBench is an interactive benchmark platform for evaluating nucleotide foundation models across diverse viral genomics tasks.

Dual-axis evaluation

4 task types · 18 scenarios · 66 NFMs · 58K viral samples

Interactive leaderboard: model rankings and a cross-task scatter analysis, in which each point is a model and the X/Y axes are task-level normalized scores.

Task taxonomy

Four task types, eighteen scenarios

ViroBench organizes viral genomics evaluation into classification tasks for biological understanding and generation tasks for sequence modeling, making the paper's benchmark design directly explorable on the web.

Research · April 2026

When Nucleotide Foundation Models Meet Viral Genomics

ViroBench Team

The Gap Between General Genomics and Viral Reality

Nucleotide foundation models are rapidly becoming a central tool for biological sequence modeling. They can read long DNA or RNA sequences, transfer across species, and support both discriminative and generative tasks. But viral genomics poses a harder question: can these models truly understand viral sequences under realistic evolutionary, taxonomic, and temporal shifts?

Viruses are not just another subset of biological sequences. They evolve quickly, span diverse genome types, and often appear in long-tailed, rapidly changing distributions. A model that performs well on general genomic benchmarks may still fail when asked to distinguish closely related viral taxa, predict host categories, or remain robust to newly recorded viral sequences.

We built ViroBench to make this gap measurable.

ViroBench asks a practical question: do nucleotide foundation models retain reliable viral understanding under taxonomic, evolutionary, and temporal shift, rather than only on easier in-distribution settings?

58,314 viral samples · 4 task families · 2 evaluation axes · 18 scenarios

Why Viral Benchmarks Need to Be Different

ViroBench is a unified benchmark for evaluating nucleotide foundation models on viral genomics tasks. Instead of testing models on a single simplified prediction problem, it organizes evaluation around two complementary axes: biological understanding and generation diagnostics. The first asks whether a model can recognize biologically meaningful viral signals; the second asks whether a model can produce or score viral sequences in a way that reflects sequence-level fidelity, coding constraints, and long-context behavior.

At its core, ViroBench contains 58,314 curated viral samples from NCBI, enriched with taxonomy, host annotations, nucleic-acid types, and data-source information. The benchmark covers four task families: Taxonomy Classification, Host Prediction, Genome Modeling, and CDS Completion. Together, these tasks provide a structured view of how nucleotide foundation models behave across viral sequence understanding and generation-oriented evaluation.

Many existing genomic benchmarks focus on regulatory elements, human genomics, or general DNA classification. These tasks are important, but they do not fully capture the distinctive challenges of viral genomics. Viral sequences introduce several forms of difficulty at once: DNA and RNA viruses follow different sequence distributions; closely related genera may be hard to separate; and newly recorded viral sequences can differ substantially from those collected earlier.

ViroBench therefore avoids relying on a single random split. For classification, it includes genus-disjoint and temporal split settings, making it harder for models to benefit from near-duplicate or closely related sequences across train and test sets. This design better reflects practical use cases, where models may need to generalize to newly observed or evolutionarily distant viruses.
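The two split settings can be illustrated with a minimal Python sketch. It is not ViroBench's actual splitting code: the `ViralRecord` fields, the toy accessions, the held-out genus, and the cutoff date are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ViralRecord:
    # Hypothetical record layout; the real benchmark schema may differ.
    accession: str
    genus: str
    recorded: date


def genus_disjoint_split(records, test_genera):
    """Hold out whole genera so that no genus appears in both train and test."""
    train = [r for r in records if r.genus not in test_genera]
    test = [r for r in records if r.genus in test_genera]
    return train, test


def temporal_split(records, cutoff):
    """Train on records dated before the cutoff, test on the rest."""
    train = [r for r in records if r.recorded < cutoff]
    test = [r for r in records if r.recorded >= cutoff]
    return train, test


# Toy records with made-up accessions and dates.
records = [
    ViralRecord("toy_001", "Alphainfluenzavirus", date(2015, 3, 1)),
    ViralRecord("toy_002", "Alphainfluenzavirus", date(2023, 6, 1)),
    ViralRecord("toy_003", "Betacoronavirus", date(2019, 12, 30)),
    ViralRecord("toy_004", "Betacoronavirus", date(2024, 1, 15)),
    ViralRecord("toy_005", "Orthoflavivirus", date(2012, 8, 9)),
]

genus_train, genus_test = genus_disjoint_split(records, {"Betacoronavirus"})
time_train, time_test = temporal_split(records, date(2020, 1, 1))
```

In the genus-disjoint case the model never sees a Betacoronavirus sequence during training, so test accuracy cannot come from near-duplicate relatives. In the temporal case the test set contains only records dated on or after the cutoff.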

What ViroBench Evaluates

ViroBench evaluates models through four task families.

Taxonomy Classification: Measures whether models can identify viral taxonomic labels under controlled split settings.

Host Prediction: Asks models to infer host categories from viral sequences and connect sequence modeling with ecological patterns.

Genome Modeling: Evaluates how well models assign likelihood to viral genome sequences across different length regimes.

CDS Completion: Tests whether generated continuations preserve sequence similarity and distributional fidelity with multi-metric diagnostics.
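For the Genome Modeling task, one common way to summarize likelihood is bits per base, sketched below. Whether ViroBench reports this exact quantity is an assumption; the function names and toy inputs are hypothetical. Normalizing by nucleotides rather than tokens keeps models with different tokenizers comparable.

```python
import math


def bits_per_base(token_log_probs, n_bases):
    """Average negative log2-likelihood per nucleotide.

    token_log_probs: natural-log probabilities the model assigned to each
    token of the sequence (assumed input format).
    n_bases: number of nucleotides those tokens cover.
    """
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    total_nats = -sum(token_log_probs)
    return total_nats / (n_bases * math.log(2))


def per_base_perplexity(token_log_probs, n_bases):
    """Perplexity per nucleotide, derived from bits per base."""
    return 2 ** bits_per_base(token_log_probs, n_bases)


# A single-nucleotide tokenizer guessing uniformly over A/C/G/T on 8 bases.
single_nt = [math.log(1 / 4)] * 8
# A 2-mer tokenizer guessing uniformly over 16 tokens on the same 8 bases.
two_mer = [math.log(1 / 16)] * 4

uniform_single = bits_per_base(single_nt, n_bases=8)
uniform_two_mer = bits_per_base(two_mer, n_bases=8)
```

Both uniform guessers come out at about 2 bits per base, which is the baseline a useful model should beat. Computing the score separately for short and long genomes shows how it changes with sequence length.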

Rather than treating generation as a single score, ViroBench reports multiple diagnostic metrics, including edit distance, alignment identity, exact match accuracy, and k-mer distribution distances.
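Three of these diagnostics can be sketched in a few lines of Python. The definitions below are assumptions, not ViroBench's exact implementations: exact match is taken as the fraction of generated sequences identical to their reference, and the k-mer distribution distance is illustrated with Jensen-Shannon divergence. Alignment identity is omitted because it requires a sequence aligner.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def exact_match_accuracy(generated, references):
    """Fraction of generated sequences identical to their reference."""
    if len(generated) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    return sum(g == r for g, r in zip(generated, references)) / len(references)


def kmer_freqs(seq, k):
    """Relative frequency of each overlapping k-mer in seq."""
    if len(seq) < k:
        raise ValueError("sequence shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    total = 0.0
    for key in set(p) | set(q):
        pi, qi = p.get(key, 0.0), q.get(key, 0.0)
        mi = 0.5 * (pi + qi)
        if pi > 0:
            total += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            total += 0.5 * qi * math.log2(qi / mi)
    return total


reference = "ATGGCGTACGTT"
generated = "ATGGCGTTCGTT"  # toy continuation with one substitution
distance = edit_distance(generated, reference)
kmer_gap = js_divergence(kmer_freqs(generated, 3), kmer_freqs(reference, 3))
```

The metrics can disagree. A continuation can be many edits away from the reference while still matching its k-mer statistics closely, which is why reporting them side by side is more informative than a single score.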

A More Diagnostic Leaderboard

The ViroBench leaderboard is designed to be more than a ranking table. It lets users compare models by task family, scenario, and metric, making it easier to identify where a model succeeds and where it fails.

A model may perform well on taxonomy classification but struggle with host prediction. Another may show strong short-sequence modeling but degrade sharply on longer viral genomes. Some models may produce sequences with plausible local statistics while failing to preserve coding-level constraints. These differences are difficult to see from a single averaged score, but they are central to understanding model behavior in viral genomics.

Looking Forward

ViroBench is intended as a living benchmark for the next generation of nucleotide foundation models. As models become longer-context, more generative, and more widely used in biological research, viral genomics offers a demanding testbed for both capability and reliability.

The goal is not only to ask which model ranks first. The more important question is: what kind of viral knowledge does each model actually learn, and where do its limits appear?

By providing standardized data, controlled task settings, and interpretable metrics, ViroBench aims to support more reproducible evaluation and more biologically grounded model development for viral genomics.
