Science: AI simulates 500 million years of biological evolution and creates an "unprecedented" protein
2025-01-17 10:02

Proteins are critically important functional molecules in organisms, shaped by billions of years of natural selection and evolution. In this process, protein sequences and structures undergo countless random mutations and are filtered by the selection mechanisms of biological systems, ultimately yielding proteins with specific biological functions.

In recent years, with the development of deep learning and language models (LM), scientists have begun to try to apply these tools to understand biological systems, especially proteins.

Today, Science magazine published an important research result that demonstrates how language models can be used to generate and reason about protein sequences, structures, and functions, and proposes a multimodal generative model called ESM3. The model is not only capable of generating functional proteins; it can also simulate more than 500 million years of evolution, generating entirely new proteins that differ from the known protein sequences found in nature.

The ESM3 model was developed by the artificial intelligence startup EvolutionaryScale to help scientists understand, conceive, and create proteins. In this work, the researchers used ESM3 to design a new green fluorescent protein (GFP). Its sequence is very different from those of known fluorescent proteins: reaching it through the natural evolution of fluorescent proteins would take the equivalent of more than 500 million years.

This means that language models can not only interpret biological data accumulated in natural evolution, but also generate new biomolecules through analysis, opening up new paths for protein design and drug development.

AI decodes biological language

Biological organisms are essentially programmable.

This is because every organism in nature shares the same genetic code, and the proteins that form the basis of life are composed of only 20 amino acids. For this reason, some people compare these amino acids to the "alphabet" of life.

The complex protein information in organisms encodes deep biological laws and evolutionary history. In recent years, scientists have accumulated a large amount of protein data by sequencing genomes and determining protein structures, including billions of sequences and hundreds of millions of structures.

With the development of AI technology, scientists have begun to use deep learning models, such as large language models (LLMs), to "decode" this genetic information, reveal the deep patterns and logic hidden in protein sequences, and use these patterns to infer and design new protein structures and functions.

Several language models (such as ProtBERT and ProtGPT) have already shown that patterns in protein sequences can be "decoded" by language models, which helps in understanding protein function. Research in this field also shows that as model size increases, the capability and accuracy of the language model improve as well.

To this end, the researchers trained the ESM3 model on more than 3.15 billion protein sequences, 236 million protein structures, and 539 million proteins with functional annotations. The model comes in three sizes, with 1.4 billion, 7 billion, and 98 billion parameters.

Experiments show that as the parameter count increases, ESM3's generation capabilities and representation learning improve significantly. In protein structure generation in particular, the 98-billion-parameter model outperforms existing models.

As a cutting-edge achievement in this field, ESM3 is not just a traditional sequence generation model but a multimodal generative model that can handle protein sequence, three-dimensional structure, and function simultaneously.

ESM3 also demonstrates superior performance on a variety of generation tasks. It uses a method called "generative masked language modeling": parts of a protein's sequence, structure, and function are randomly masked in the input, and the model then generates the missing parts through inference.
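
The masking step described above can be illustrated with a minimal sketch. This is not ESM3's training code: the function name `mask_tokens`, the `<mask>` placeholder, and the fixed masking rate are assumptions made for illustration, and only a single sequence track is shown rather than sequence, structure, and function together.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not ESM3's actual vocabulary


def mask_tokens(tokens, mask_rate, rng):
    """Hide a random subset of positions and return (masked, targets).

    `targets` maps each hidden position to its original token, which is
    what a masked language model is trained to predict.
    """
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


rng = random.Random(0)  # seeded so the example is reproducible
sequence = list("MSKGEELFTGVVPILVELDG")  # short example amino acid string
masked, targets = mask_tokens(sequence, mask_rate=0.3, rng=rng)

# Show hidden positions as "_" for readability.
print("".join("_" if t == MASK else t for t in masked))
print(targets)
```

At generation time the same idea runs in reverse: the model starts from a partially masked input and fills in the hidden positions.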

(Source: Evolutionary Scale)

The researchers generated sequences and structures from randomly masked inputs and compared how well the results matched real proteins. They found that the model can generate high-quality protein sequences and structures, with generated structures differing from the true structures by only 0.5 Å on average.

Furthermore, the study shows that ESM3 can generate proteins with targeted functions from different prompts, which brings a high degree of flexibility to protein design. Unlike traditional methods that model directly in three-dimensional space, ESM3 discretizes three-dimensional structures into tokens, which allows them to be fed into the model together with sequence and functional information. This approach avoids a complex diffusion architecture operating in three-dimensional space and makes the generation process more efficient and controllable.
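
The idea of turning continuous structure into discrete tokens can be sketched as follows. The article does not describe how ESM3 learns its structure tokens, so this sketch assumes a simple nearest-codebook lookup (vector quantization); the two-dimensional features, the three-entry codebook, and the function names are all hypothetical.

```python
import math


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))


def tokenize_structure(features, codebook):
    """Map one continuous feature vector per residue to a discrete token id."""
    return [nearest_code(v, codebook) for v in features]


# Hypothetical 2-D "local geometry" features for four residues and a toy
# 3-entry codebook; a real tokenizer would learn both from protein structures.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (1.1, 0.1)]

structure_tokens = tokenize_structure(features, codebook)
print(structure_tokens)  # [0, 1, 2, 1]
```

Once structure is a list of integer tokens, it can sit alongside amino acid tokens in the same model input, which is the property the paragraph above relies on.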

Generating a fluorescent protein that took 500 million years to evolve

To demonstrate the huge potential of the ESM3 model for generating entirely new proteins, the researchers chose green fluorescent protein as their challenge.

Green fluorescent protein is a very important tool in biological research, used to label and track molecules and structures within cells. However, most existing fluorescent proteins come from nature, and engineered variants usually stay close to existing sequences, so it has been difficult to change their sequences significantly. In a few cases, using high-throughput experiments and machine learning, scientists have been able to introduce at most 40-50 mutations (i.e., about 80% sequence identity) while retaining the protein's fluorescence.

(Source: Evolutionary Scale)

To break through this bottleneck, the researchers gave the ESM3 model a specific functional prompt: generate a new green fluorescent protein whose sequence has low similarity to known green fluorescent protein sequences but that still retains its fluorescent properties.

First, the researchers defined a protein sequence 229 amino acids long that contains the key amino acids related to the fluorescent activity of green fluorescent protein. They also provided three-dimensional structural information for green fluorescent protein, in particular for the amino acid residues involved in forming the chromophore's active site.

After receiving these prompts, the ESM3 model generates a three-dimensional structure for the protein, taking particular care that the amino acid positions in the active site are well coordinated. Then, based on the generated structure, the model infers a suitable amino acid sequence while attempting to maintain the correct structure of the active site.
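
A prompt of this kind can be pictured as a mostly masked sequence with a few residues fixed. The sketch below is an illustration only: the `_` mask character and the helper `build_sequence_prompt` are assumptions, and the fixed positions are residues known from Aequorea victoria GFP biochemistry (Tyr66 and Gly67 in the chromophore, Arg96 and Glu222 assisting its formation), not necessarily the exact set used in the study. The structural part of the prompt is omitted.

```python
MASK = "_"  # assumed placeholder for "position left for the model to fill in"
PROMPT_LENGTH = 229  # prompt length stated in the article

# Illustrative assumption: 1-based positions of residues that form or help
# catalyze the chromophore in avGFP. Not the study's exact prompt.
KEY_RESIDUES = {66: "Y", 67: "G", 96: "R", 222: "E"}


def build_sequence_prompt(length, fixed):
    """Return a prompt string with fixed residues set and all else masked."""
    prompt = [MASK] * length
    for position, residue in fixed.items():
        if not 1 <= position <= length:
            raise ValueError(f"position {position} outside 1..{length}")
        prompt[position - 1] = residue  # convert 1-based position to index
    return "".join(prompt)


prompt = build_sequence_prompt(PROMPT_LENGTH, KEY_RESIDUES)
print(prompt[60:70])  # positions 61-70, showing the fixed Tyr66 and Gly67
print(prompt.count(MASK), "positions left for the model to generate")
```

Almost the entire sequence is left open, which is what gives the model room to propose a protein far from known ones while the fixed residues anchor the function.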

In this process, ESM3 does not merely generate new sequences from existing green fluorescent protein structures; it can also innovate on the basis of "known" structures to generate novel proteins with low sequence similarity to existing ones.

After a series of generation and optimization steps, the researchers obtained multiple new green fluorescent proteins, one of which was named esmGFP. This novel protein has a sequence similarity of 58% to existing fluorescent proteins such as tagRFP. It differs from the closest native protein (eqFP578) at 107 amino acid positions, a sequence similarity of 53%.
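
The relationship between the number of differing positions and the percentage figure can be checked with a short sketch. It assumes the comparison covers 229 aligned positions (the prompt length given earlier) and ignores alignment gaps; the helper names are hypothetical, and the two short strings in the second call are made-up examples, not real protein sequences.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned positions that hold the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


def identity_from_differences(n_differences, n_positions):
    """Percent identity implied by a count of differing positions."""
    return 100.0 * (n_positions - n_differences) / n_positions


# 107 differing positions out of an assumed 229 compared positions.
print(round(identity_from_differences(107, 229)))  # 53

# Toy aligned pair: 2 mismatches in 8 positions.
print(percent_identity("MSKGEELF", "MSKGAELY"))  # 75.0
```

Under that assumption, 107 differences correspond to roughly 53% of positions being identical, consistent with the figure quoted for eqFP578.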

The researchers further verified whether the generated green fluorescent protein actually fluoresces. The results show that although esmGFP begins to fluoresce later and takes longer to mature, its final brightness is similar to that of known green fluorescent proteins, and its fluorescence is stable.

The researchers also provided a time-calibrated phylogenetic analysis, indicating that if esmGFP was obtained through the natural evolution process of existing proteins, it would require an equivalent time of more than 500 million years.

Future potential and applications of ESM3

Another significant highlight of ESM3 is its generation and control capabilities under multi-modal conditions.

In other words, researchers can prompt the model with specific protein structures, functions, or key amino acids, and it generates new proteins that meet these conditions. For example, the model can generate proteins with specific functional sites while maintaining overall structural integrity.

In addition, by combining different prompts, the model can generate proteins that meet complex requirements. For example, when the researchers prompted it with secondary structure and functional keywords, it generated proteins that were highly consistent with those prompts.

This prompt-responsiveness and controllable nature of the ESM3 model makes it highly practical in the field of protein design, especially in generating novel proteins that are significantly different from existing known proteins.

With the help of the ESM3 model, the researchers were able not only to design a new green fluorescent protein but also to innovate in design and break through the limitations of natural evolution. This provides new possibilities for future work in fields such as protein engineering, synthetic biology, and drug development, and also provides more efficient tools for protein design and functional verification.

For example, ESM3 can greatly speed up protein design compared with natural evolution and generate new proteins that are not readily available in nature, which is a major breakthrough for both basic and applied research.

In addition, in the field of drug design, generating proteins with specific functions is an important research direction, and through ESM3, researchers can design proteins that meet specific targets, reducing experimental verification time and cost.

In the field of synthetic biology, ESM3 can help develop new synthetic pathways and generate enzymes or metabolic pathways with new functions.

The researchers also pointed out that as the model size and data volume further increase, ESM3 has the potential to generate more complex and innovative proteins. In the future, the applications of ESM3 may cover more fields, from basic research to drug design, opening up new possibilities for protein engineering.

Currently, ESM3 is in public beta via an API, enabling scientists to design proteins through programming or an interactive browser-based app. Scientists can use the EvolutionaryScale Forge API through the free academic access tier, or use the open model's code and weights.
