DeepMind’s AlphaFold 2: Solving the Protein Folding Problem

Summary

DeepMind’s AlphaFold 2 achieved a landmark result in protein folding, scoring 87 on the hardest protein class of the 2020 CASP competition: a 29-point improvement over its 2018 predecessor and 26 points ahead of the nearest competitor. The result has been compared to the ImageNet moment in computer vision and may be one of the most significant advances in both structural biology and artificial intelligence in recent decades. It could unlock the 3D structures of millions of proteins, opening new frontiers in disease treatment, drug design, and biological simulation.


Key Takeaways

  • AlphaFold 2 scored 87 on the hardest protein class of the CASP benchmark, up from 58 in 2018, an accuracy considered comparable to expensive experimental methods like X-ray crystallography
  • Protein folding — predicting a protein’s 3D structure from its amino acid sequence — has been an unsolved grand challenge for over 50 years
  • Determining a protein’s 3D structure experimentally costs ~$120,000 and takes ~1 year per protein; computational methods could make this dramatically faster and cheaper
  • Only 170,000 of 200 million known proteins have had their 3D structures mapped experimentally — AlphaFold 2 could expand this by orders of magnitude
  • The misfolding of proteins is the underlying cause of many diseases, making this breakthrough directly relevant to medicine
  • AlphaFold 2 likely replaces convolutional neural networks with transformer-based attention mechanisms — consistent with broader trends across deep learning
  • Multiple Sequence Alignment (MSA) of evolutionarily related sequences now appears to be integrated into the learning process itself, rather than used only as a feature engineering step
  • The author predicts at least one Nobel Prize will result from derivative work enabled by AlphaFold 2’s computational methods
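
The cost and coverage figures above can be put side by side with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the three input numbers are the ones quoted in the takeaways, and the "cost of mapping everything experimentally" line is a hypothetical extrapolation, not a claim from the source.

```python
import math

# Figures quoted in the takeaways above.
known_proteins = 200_000_000       # proteins with known sequences
mapped_structures = 170_000        # structures determined experimentally
cost_per_structure_usd = 120_000   # approximate experimental cost per protein

coverage = mapped_structures / known_proteins
gap_factor = known_proteins / mapped_structures
orders_of_magnitude = math.log10(gap_factor)

# Hypothetical extrapolation: mapping every remaining protein experimentally.
remaining = known_proteins - mapped_structures
remaining_cost_usd = remaining * cost_per_structure_usd

print(f"experimental coverage: {coverage:.3%}")
print(f"gap: {gap_factor:,.0f}x (about {orders_of_magnitude:.1f} orders of magnitude)")
print(f"hypothetical cost of the rest: ${remaining_cost_usd:,}")
```

Under these figures, experimentally mapped structures cover well under a tenth of a percent of known proteins, which is why "orders of magnitude" is meant literally here.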

Detailed Notes

The Protein Folding Problem

  • Proteins are chains of amino acids; humans and other eukaryotes use 21 of them (the 20 standard amino acids plus selenocysteine)
  • Proteins serve as both structural building blocks and functional workhorses of cells — acting as catalysts, transporters, and structural materials
  • A protein’s amino acid sequence almost uniquely determines its 3D structure (in most cases, a given sequence folds into a single native structure)
  • The 3D structure determines the protein’s function
  • The search space for possible folds is astronomically large, estimated at 10^143 possible configurations. Levinthal’s Paradox formalizes the puzzle: a random search over that many configurations could never finish, yet proteins in nature fold correctly and quickly
  • Protein misfolding is the root cause of many diseases
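
To make the scale of Levinthal’s Paradox concrete, the sketch below estimates how long a brute-force search over the quoted 10^143 configurations would take. The sampling rate and the age of the universe are illustrative round numbers chosen for this example, not figures from the source.

```python
# The configuration count is the estimate quoted above; everything else is an assumption.
configurations = 10 ** 143
samples_per_second = 10 ** 13          # assumption: one conformation every 0.1 picosecond
seconds_per_year = 365.25 * 24 * 3600
age_of_universe_years = 1.38e10        # approximate, for comparison only


def exhaustive_search_years(n_configs, rate_per_second):
    """Years needed to visit every configuration once at a fixed sampling rate."""
    return n_configs / rate_per_second / seconds_per_year


years = exhaustive_search_years(configurations, samples_per_second)
print(f"brute-force search: {years:.2e} years")
print(f"that is {years / age_of_universe_years:.1e} times the age of the universe")
```

Even with a generous sampling rate, exhaustive search is hopeless, so real proteins cannot be folding by trial and error over the whole space.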

The CASP Competition and AlphaFold’s Performance

  • CASP (Critical Assessment of Structure Prediction) is the primary benchmark for protein structure prediction
  • AlphaFold 1 (2018): score of 58 on the hardest protein class
  • AlphaFold 2 (2020): score of 87 — the next closest competitor scored ~61
  • This performance is considered comparable to experimental methods like X-ray crystallography
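
The notes do not name the metric behind these scores. CASP’s headline score is GDT_TS (Global Distance Test, total score), which runs from 0 to 100 and averages the percentage of residues that land within 1, 2, 4, and 8 angstroms of their experimental positions. The sketch below is a simplified version that assumes the two structures are already superposed; the official calculation also searches over superpositions, which is omitted here. The function name and the toy coordinates are made up for illustration.

```python
import math


def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS-style score (0-100) for two already-superposed C-alpha traces.

    Simplification: a single fixed superposition is assumed, whereas the real
    metric maximises the residue count under each cutoff over superpositions.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    deviations = [math.dist(p, q) for p, q in zip(predicted, experimental)]
    fractions = [
        sum(d <= cutoff for d in deviations) / len(deviations)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy 4-residue example: per-residue errors of 0.5, 1.5, 3.0 and 10.0 angstroms.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]
score = gdt_ts(predicted, experimental)
print(f"GDT_TS-style score: {score:.2f}")
```

Because the score averages four tolerance levels, a jump from 58 to 87 means far more residues are placed within a couple of angstroms of the experimental structure, not just roughly in the right region.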

How AlphaFold 1 Worked

  1. Step 1 (Machine Learning): A convolutional neural network takes the amino acid residue sequence plus features, including a Multiple Sequence Alignment (MSA) of evolutionarily related sequences, and outputs a distance matrix: for each pair of amino acids, a probability distribution over their distance in the final 3D structure
  2. Step 2 (Optimization, no ML): Gradient descent then searches for the 3D folded structure whose pairwise distances best match the predicted ones
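
The two steps above can be sketched in miniature. The code below is a toy under stated assumptions, not DeepMind’s implementation: it collapses each pair’s distance distribution to its mean, then runs plain gradient descent on raw 3D coordinates to minimise the squared mismatch with the target distances. The function names, learning rate, step count, and toy coordinates are all hypothetical choices for this example.

```python
import math
import random


def expected_distance(bin_centers, probabilities):
    """Collapse one residue pair's distance distribution into its mean."""
    total = sum(probabilities)
    return sum(c * p for c, p in zip(bin_centers, probabilities)) / total


def stress(coords, target):
    """Sum of squared errors between current and target pairwise distances."""
    n = len(coords)
    return sum(
        (math.dist(coords[i], coords[j]) - target[i][j]) ** 2
        for i in range(n) for j in range(i + 1, n)
    )


def fit_coordinates(target, steps=2000, lr=0.01, seed=0):
    """Gradient descent on 3D coordinates to match a target distance matrix."""
    rng = random.Random(seed)
    n = len(target)
    coords = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n)]
    for _ in range(steps):
        grads = [[0.0, 0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                d = math.dist(coords[i], coords[j]) or 1e-9  # avoid dividing by zero
                scale = 2.0 * (d - target[i][j]) / d         # d/dx of (d - t)^2
                for k in range(3):
                    g = scale * (coords[i][k] - coords[j][k])
                    grads[i][k] += g
                    grads[j][k] -= g
        for i in range(n):
            for k in range(3):
                coords[i][k] -= lr * grads[i][k]
    return coords


# Toy example: build the target matrix from known points, then fit from a random start.
true_points = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0),
               (2.0, 5.5, 2.0), (-1.0, 3.0, 3.5)]
target = [[math.dist(p, q) for q in true_points] for p in true_points]
initial = fit_coordinates(target, steps=0)  # the random starting coordinates
fitted = fit_coordinates(target)            # same start, after gradient descent
print(f"stress before: {stress(initial, target):.3f}")
print(f"stress after:  {stress(fitted, target):.3f}")
```

Pairwise distances do not fix orientation, so a fit like this can recover a structure only up to rotation, translation, and mirror image. Keeping the optimization separate from the network is also what the later notes mean by a pipeline that is not fully learnable.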

How AlphaFold 2 Likely Works (Speculative)

  • No full paper had been published at the time of the video; this section is based on a blog post and speculation
  • Transformers replace CNNs — attention mechanisms appear central to the new architecture
  • MSA is now likely part of the learning process itself, not just a feature engineering input
  • An iterative information-passing mechanism appears to operate between:
    • The residue sequence representation (evolutionary/sequence side)
    • The residue-to-residue distance representation (structural side)
  • A spatial graph representation is mentioned, potentially richer than a simple distance or adjacency matrix
  • Two key lessons from recent deep learning are applied here: (1) attention mechanisms boost performance, and (2) making more of the pipeline learnable yields significant gains
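
Since attention is described as central to the new architecture, a minimal sketch of the generic mechanism may help. The code below is standard single-head scaled dot-product self-attention with the learned query, key, and value projections left out for brevity. It is not AlphaFold 2’s architecture, which was unpublished at the time of these notes, and the toy residue embeddings are invented for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Single-head scaled dot-product self-attention with identity projections.

    Each position's output is a weighted average of all positions, weighted by
    how similar their embeddings are. A real transformer layer would first apply
    learned query, key and value matrices; those are omitted in this sketch.
    """
    dim = len(embeddings[0])
    outputs, weights = [], []
    for query in embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        row = softmax(scores)
        weights.append(row)
        outputs.append([
            sum(w * value[d] for w, value in zip(row, embeddings))
            for d in range(dim)
        ])
    return outputs, weights


# Toy embeddings for three residues; the first two are deliberately similar.
residues = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(residues)
for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

Unlike a convolution, which mixes only neighbouring positions, every residue here can draw on every other residue in one step. That property is plausibly useful when amino acids far apart in the sequence end up close together in the folded structure.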

Potential Applications and Future Impact

Near-term:

  • Determining unknown gene functions encoded in DNA by resolving protein structures
  • Understanding and treating diseases caused by misfolded proteins
  • Drug design — engineering proteins that correct misfolded proteins
  • Agricultural applications: insecticidal proteins, frost-protective coatings
  • Tissue regeneration via self-assembling proteins
  • Supplements, anti-aging, and advanced biomaterials for textiles

Long-term:

  • Multi-protein interaction and protein complex formation prediction (described as a far harder problem)
  • Incorporating environmental context into folding models
  • Physics-based simulation of biological systems — cells, organs, and eventually entire organisms
  • End-to-end deep learning for complex real-world life science problems beyond game-playing AI

Mentioned Concepts

  • protein folding
  • amino acids
  • protein misfolding
  • structural biology
  • X-ray crystallography
  • convolutional neural network
  • transformer
  • attention mechanism
  • multiple sequence alignment
  • deep learning
  • gradient descent
  • reinforcement learning
  • natural language processing