According to reports from NVIDIA and other sources, the Arc Institute and NVIDIA have released Evo 2, an open-source biological foundation model trained on genomes spanning the tree of life. The release is a notable step for generative biology and AI-driven genomic research.
Evo 2 Features Overview
Evo 2 is a substantial step up in scale for biological AI models. It is trained on 9.3 trillion nucleotides drawn from over 128,000 complete genomes and metagenomic data. This dataset covers genetic information from humans, plants, bacteria, and other single-celled and multicellular organisms. Its predecessor, Evo 1, was limited to the genomes of single-celled organisms.
The model can process DNA sequences up to 1 million nucleotides long and reportedly achieves over 90% accuracy in classifying mutations as benign or pathogenic in genes such as BRCA1.
StripedHyena 2 Architecture
Evo 2 was developed on NVIDIA’s DGX Cloud AI platform, using over 2,000 NVIDIA H100 GPUs for model training. The new StripedHyena 2 architecture combines efficient Fourier and convolution kernels, enabling Evo 2 to handle up to one million nucleotides in a single run. That context length exceeds what standard transformer-based architectures typically handle, and it allows the model to analyze relationships across long genomic distances, from individual molecules to entire chromosomes.
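To make the phrase "Fourier and convolution kernels" concrete, here is a minimal sketch of the general idea of computing a long convolution through the Fourier transform, which costs on the order of n log n operations rather than n times the kernel length. It is an illustration only and is not StripedHyena 2 code: the function names (`fft`, `fft_convolve`, `direct_convolve`) are hypothetical, and the toy recursive FFT stands in for the optimized GPU kernels a real model would use.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT. len(values) must be a power of two.

    With invert=True this computes the unnormalized inverse transform;
    the caller divides by the length.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_convolve(signal, kernel):
    """Linear convolution computed as a pointwise product in Fourier space."""
    out_len = len(signal) + len(kernel) - 1
    size = 1
    while size < out_len:  # pad to a power of two to avoid wrap-around
        size *= 2
    fa = fft([complex(x) for x in signal] + [0j] * (size - len(signal)))
    fb = fft([complex(x) for x in kernel] + [0j] * (size - len(kernel)))
    product = [a * b for a, b in zip(fa, fb)]
    result = fft(product, invert=True)
    return [(v / size).real for v in result[:out_len]]


def direct_convolve(signal, kernel):
    """Reference implementation: the straightforward double loop."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


if __name__ == "__main__":
    signal = [1, 2, 3, 4]
    kernel = [1, 0, -1]
    print(direct_convolve(signal, kernel))
    print([round(v, 6) for v in fft_convolve(signal, kernel)])
```

Both functions produce the same values up to floating-point error. The practical difference appears at scale: for a sequence of a million positions and a comparably long kernel, the direct loop grows quadratically while the Fourier route does not.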
Evo 2 has 40 billion parameters, comparable in scale to current large language models from major tech companies. Training with the StripedHyena 2 architecture was reportedly nearly three times faster than with optimized transformer models. The model is integrated into NVIDIA’s BioNeMo framework, making it accessible to researchers worldwide.
Applications in Biology Research
Evo 2’s capabilities extend beyond basic genomic analysis to several areas of biological research.
- Disease Research: The model reportedly achieves over 90% accuracy in classifying mutations in genes such as BRCA1 as benign or pathogenic. Accuracy at this level could speed up the identification of harmful genetic variants and inform targeted therapies.
- Genome Design: Evo 2 can generate novel DNA sequences at the scale of yeast chromosomes or small bacterial genomes. It can also design DNA sequences with specific chromatin accessibility profiles, which opens new possibilities in generative epigenomics and in simulating eukaryotic gene regulation.
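One common way to score a mutation with a sequence model is to compare how likely the model finds the mutated sequence relative to the reference. The sketch below illustrates that idea under loud assumptions: `ToyMarkovModel` is a hypothetical stand-in (a small smoothed Markov chain), not Evo 2 or its API, and nothing here states how the reported BRCA1 figure was actually obtained.

```python
import math
from collections import defaultdict


class ToyMarkovModel:
    """Hypothetical stand-in for a genomic language model (order-k Markov chain)."""

    def __init__(self, order=2, alphabet="ACGT"):
        self.order = order
        self.alphabet = alphabet
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequence):
        """Count which base follows each length-k context."""
        for i in range(self.order, len(sequence)):
            context = sequence[i - self.order:i]
            self.counts[context][sequence[i]] += 1
        return self

    def log_likelihood(self, sequence):
        """Sum of log-probabilities of each base given its context."""
        total = 0.0
        for i in range(self.order, len(sequence)):
            context = sequence[i - self.order:i]
            seen = self.counts.get(context, {})
            # Add-one smoothing so unseen transitions keep a nonzero probability.
            numerator = seen.get(sequence[i], 0) + 1
            denominator = sum(seen.values()) + len(self.alphabet)
            total += math.log(numerator / denominator)
        return total


def variant_score(model, reference, position, alt_base):
    """Log-likelihood of the variant minus that of the reference.

    Negative values mean the model finds the variant less plausible.
    """
    variant = reference[:position] + alt_base + reference[position + 1:]
    return model.log_likelihood(variant) - model.log_likelihood(reference)


if __name__ == "__main__":
    # Toy "training genome": a perfectly periodic sequence.
    model = ToyMarkovModel(order=2).fit("ACGT" * 50)
    reference = "ACGTACGTACGT"
    # Substituting C -> T at position 5 breaks the pattern the model learned.
    print(variant_score(model, reference, 5, "T"))
```

In this toy setting the substitution receives a negative score because it disrupts the repeating pattern. A real genomic model would apply the same comparison with a far richer notion of what makes a sequence plausible, and any threshold separating benign from pathogenic would need calibration against labeled variants.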
Open-Source Accessibility
To promote collaboration and accelerate scientific discovery, the Arc Institute and NVIDIA have made Evo 2 fully open-source and widely accessible. The model is available through the NVIDIA BioNeMo platform and a user-friendly interface called Evo Designer. Researchers can access not only the model itself but also its training data, code, and weights. This open approach enables further development and customization by the scientific community.
This initiative aims to create an “app store for biology,” empowering researchers worldwide to tackle complex biomedical challenges using AI.