Imagine being able to speed up evolution – hypothetically – to learn which genes might have a harmful or beneficial effect on human health. Imagine, further, being able to rapidly generate new genetic sequences that could help cure disease or solve environmental challenges. Now, scientists have developed a generative AI tool that can predict the form and function of proteins coded in the DNA of all domains of life, identify molecules that could be useful for bioengineering and medicine, and allow labs to run dozens of other standard experiments with a virtual query – in minutes or hours instead of years (or millennia).
The open-source, all-access tool, known as Evo 2, was developed by a multi-institutional team co-led by Stanford’s Brian Hie, an assistant professor of chemical engineering and a faculty fellow in Stanford Data Science. Evo 2 was trained on a dataset that spans all domains of life, including humans, plants, bacteria, amoebas, and even a few extinct species. Stanford Report talked to Hie about Evo 2’s advanced capabilities, why the scientific world is so eager to get its hands on this new tool, and how Evo 2 could reshape the biological sciences.
From left to right: Michael Poli, Brian Hie, and Garyk Brixi. Biology is written in a combination of As, Cs, Gs, and Ts that can be hard to understand. The Evo 2 team, co-led by Assistant Professor Brian Hie, aims to make the language of biology more accessible to researchers. | Video: Kurt Hickman; image: Andrew Brodhead
Can you give us the lay version of how Evo 2 works?
All life is encoded in DNA using just four chemicals, known as nucleotides. These complex molecules are abbreviated using the letters A, C, G, and T. The human genome, at 3 billion nucleotides long, is just a string of these four letters. Now, if you imagine DNA as the characters in a book that is 3 billion letters long, the individual genes are the words. They are spelled differently. Some have more letters than others. And they have different purposes and meanings – that is, they have different functions.
With AI, we can search for patterns in all that code and use it to predict what the next nucleotide in the sequence is likely to be. In this way, Evo 2 is able to generate – to write – new genetic code that has never existed before. With Evo 2, you can enter a sequence of up to 1 million nucleotides. The million-nucleotide window in biology is important, as it allows us to explore long-distance interactions between two or more genes that may not be physically close to one another on the DNA molecule. The longer context window could allow us to spot connections between these long-distance collaborators that we wouldn’t even know about with a shorter window.
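The idea of predicting the next nucleotide from patterns in existing sequences can be illustrated with a deliberately tiny sketch. This is not how Evo 2 is built: the counting model, the three-letter context, the function names, and the made-up training strings below are all assumptions chosen only to show the principle on a scale that fits on a page.

```python
from collections import Counter, defaultdict

def train_kmer_model(sequences, k=3):
    """Count which nucleotide follows each k-letter context (toy model)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts

def next_nucleotide_probs(counts, context, k=3):
    """Estimate P(next letter | last k letters) with add-one smoothing."""
    seen = counts.get(context[-k:], Counter())
    total = sum(seen.values()) + 4  # +4: one pseudo-count per letter
    return {base: (seen[base] + 1) / total for base in "ACGT"}

# Made-up example sequences, not real genomic data.
training = ["ATGGCCATGGCGATGGCC", "ATGGCCTTGATGGCC"]
model = train_kmer_model(training)

probs = next_nucleotide_probs(model, "CCATGG")
print(probs)
print("most likely next nucleotide:", max(probs, key=probs.get))
```

A three-letter context can only see patterns a few letters apart. The point of a window of up to 1 million nucleotides is that the prediction can depend on sequence that sits very far away on the DNA molecule.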
How is Evo 2 different from Evo 1 – which came out just last year – and how did you advance the technology so quickly?
Honestly, Evo 1 was more effective than we thought it would be. Evo 1 was trained on only 113,000 or so genomes of simpler life forms like bacteria and archaea, known as the prokaryotes.
Evo 2, on the other hand, also includes the known genomes of 15,000 or so plants and animals – the eukaryotes – which include humans. Our dataset has now expanded from about 300 billion nucleotides to almost 9 trillion with Evo 2. It’s like a representative snapshot of all species on Earth. In terms of safety, we have left out the genomes of viruses to prevent Evo 2 from being used to create new or more dangerous diseases. Because it has the potential to improve tasks related to human disease, we felt like we needed to share Evo 2 quickly.

How is Evo 2 like ChatGPT?
With a language model like ChatGPT, you can enter some text as a prompt, and it will autocomplete the sentence based on patterns from previously written words. Evo 2 does this with DNA. If you want to design a new gene, you prompt the model with the beginning of a gene sequence, and Evo 2 will autocomplete the gene.
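The prompt-and-autocomplete loop can be sketched in a few lines. Everything here is a stand-in: `TOY_TABLE` is a hand-written, hypothetical probability table that looks only at the previous letter, whereas a real model would be learned from data and would look at a far longer stretch of sequence. The sketch shows only the loop itself: start from a prompt, pick the next letter according to the model's probabilities, append it, and repeat.

```python
import random

# Hypothetical stand-in for a trained model: for each previous letter,
# the probability of each possible next nucleotide (each row sums to 1).
TOY_TABLE = {
    "A": {"A": 0.1, "C": 0.2, "G": 0.2, "T": 0.5},
    "C": {"A": 0.3, "C": 0.4, "G": 0.2, "T": 0.1},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "T": {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
}

def autocomplete(prompt, n_new, rng):
    """Extend `prompt` by `n_new` letters, one sampled letter at a time."""
    seq = list(prompt)
    for _ in range(n_new):
        probs = TOY_TABLE[seq[-1]]
        bases = list(probs)
        weights = [probs[b] for b in bases]
        seq.append(rng.choices(bases, weights=weights, k=1)[0])
    return "".join(seq)

rng = random.Random(0)  # fixed seed so repeated runs give the same result
completed = autocomplete("ATGGCC", 12, rng)
print(completed)
```

Because each letter is sampled rather than fixed, different seeds give different completions, which is one simple way a generative model can produce sequences that do not match anything it was shown.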
Sometimes that completion will look exactly like a gene found in nature, but other times the model will make some improvements or write the gene in a different way than has ever happened in evolutionary history. In the real world, these mutations happen by chance. With Evo 2, we can be more direct and steer toward mutations that have useful functions. Evo 2 also includes machine learning models that will tell you if the sequence exists in nature and predict how this new sequence will function in real life. Then we go into the lab and synthesize the DNA and insert it into a living cell to test it using a gene editing technology like CRISPR. Essentially, Evo 2 is speeding up evolution, providing promising new genetic paths for us to explore.
How do you hope other scientists will use Evo 2?
We hope that Evo 2 will someday have clinical significance. It is really good at discovery. Evo 2 could help predict which mutations lead to pathogenicity and disease. Everyone has random mutations in their DNA and, mostly, they’re harmless. But on rare occasions, they’ll cause cancer or other disease. The model is actually very good at distinguishing which mutations are just random, harmless variations and which cause disease. Another area we are hopeful about is using Evo 2 to design new genetic sequences with specific functions of interest. A relevant next step is integrating these models with models of systems biology, which would help us learn how two or more genes interact to cause disease.
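One common way to use a sequence model for this kind of question is to ask how much less likely a sequence becomes once a mutation is introduced. The interview does not say how Evo 2 scores mutations, so the sketch below is an assumption about the general approach, not a description of the tool. The two-letter counting model, the function names, and the made-up "natural" sequences are all hypothetical.

```python
import math
from collections import Counter

def fit_bigram(sequences):
    """Count adjacent letter pairs in the example sequences (toy model)."""
    pair_counts, first_counts = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[a, b] += 1
            first_counts[a] += 1
    return pair_counts, first_counts

def log_likelihood(seq, model):
    """Sum of log P(next letter | previous letter), add-one smoothed."""
    pair_counts, first_counts = model
    return sum(
        math.log((pair_counts[a, b] + 1) / (first_counts[a] + 4))
        for a, b in zip(seq, seq[1:])
    )

def variant_score(reference, position, new_base, model):
    """Log-likelihood of the mutated sequence minus that of the reference.

    A strongly negative score means the model finds the mutant much less
    typical than the reference sequence.
    """
    mutant = reference[:position] + new_base + reference[position + 1:]
    return log_likelihood(mutant, model) - log_likelihood(reference, model)

# Made-up example sequences, not real genomic data.
natural = ["ATGGCCATGGCC", "ATGGCCATGGCC", "ATGGCCATGGCA"]
model = fit_bigram(natural)

reference = "ATGGCCATGGCC"
score = variant_score(reference, 3, "T", model)  # change G at index 3 to T
print(score)
```

In this toy setting the G-to-T change creates letter pairs that never appear in the example sequences, so the score comes out negative. Whether a low score corresponds to a harmful mutation in a real genome is something that would have to be checked against clinical and laboratory data.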
Can you talk about the collaboration needed to make something like Evo 2 happen?
Something of this scale cannot be done by a single person. The three major institutions involved are Stanford, NVIDIA – which makes the AI computer chips and the software used to run the model – and the Arc Institute, a biomedical research nonprofit that is itself a collaboration among Stanford, the University of California, Berkeley, and the University of California, San Francisco.
In terms of personnel, we had three subteams. First, the machine learning team focused on training the model and making sure that the computers ran efficiently. Then, once you train a model, you need to know it actually works as intended. So there’s a team of biologists – computational, molecular, systems, prokaryotic, eukaryotic biologists – to make sure the information we are getting back is valuable and usable. And, last, we have an experimental biology team that synthesizes the new DNA, puts it into cells, and tests the cells to make sure what we’ve created works in real life. It’s all very hard work, and I’m very grateful to everyone on the team for their help.
For more information
Hie is the Dieter Schwarz Foundation SDS Faculty Fellow at Stanford Data Science, an assistant professor of chemical engineering in the School of Engineering, a member of Bio-X, and a faculty fellow of Sarafan ChEM-H.