Proteins, Viruses, Evolution, and AI: A Deep Dive with Dmitry Korkin

Summary

Computational biologist Dmitry Korkin explores the modular architecture of proteins, the structural biology of SARS-CoV-2, the evolutionary origins of life, and the landmark achievements of DeepMind’s AlphaFold2. The conversation bridges molecular biology, evolutionary theory, and artificial intelligence to illuminate how life’s building blocks are organized, how viruses work, and how machine learning is transforming biological discovery.


Key Takeaways

  • Protein domains, not individual amino acids, are the true functional and evolutionary building blocks of proteins — they get shuffled and recombined throughout evolution
  • The SARS-CoV-2 spike protein operates as a homotrimer (three identical subunits), and recently observed mutations can cause more than one receptor-binding arm to open simultaneously, potentially increasing infectivity
  • The M protein of SARS-CoV-2 forms a structural lattice across the viral envelope and is evolutionarily more stable than the spike protein, making it a promising — though underexplored — drug target
  • AlphaFold2 represents a landmark achievement in predicting 3D protein structures, but “solving protein folding” is overstated — complex multi-domain proteins and dynamic interactions remain largely unsolved
  • Alternative splicing creates multiple distinct protein products from a single gene, adding a profound layer of complexity beyond the one-gene-one-protein model
  • Protein linkers — flexible segments connecting domains — are understudied but functionally critical, enabling dynamic spatial reorganization and protein-protein interactions
  • The RNA world hypothesis and the discovery of glycine (the simplest amino acid) in comet dust suggest the chemical precursors of life may be widespread in the universe
  • Machine learning scoring functions for protein-protein interactions were pioneered over a decade ago, foreshadowing the AlphaFold revolution

Detailed Notes

Protein Architecture: Domains as the True Building Block

  • A protein is best understood as a string of beads, where each bead is a protein domain — a structurally and functionally independent unit
  • Protein domains are the units of evolutionary shuffling: they get combined in new ways across species, generating new functions
  • Why this isn’t widely known: Early structural biology used X-ray crystallography and NMR spectroscopy, which worked best on small, single-domain proteins — reinforcing the incorrect “globular blob” model
  • Cryo-electron microscopy (cryo-EM) changed this, enabling structural resolution of large, multi-domain complexes like the spike protein
  • Linkers between domains are highly flexible and largely unstudied, but play key roles in protein-protein interactions and spatial dynamics
  • Intrinsically disordered regions (like tails/termini) are also functionally important despite lacking stable structure
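The "string of beads" picture above can be sketched as a small data structure. This is only an illustration: the `Domain` class, the `uncovered_segments` helper, the domain names, and all residue boundaries are hypothetical and do not come from any real protein annotation. The sketch lists the stretches of sequence not covered by any domain, which correspond to the tails and inter-domain linkers mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    name: str
    start: int  # 1-based residue index, inclusive
    end: int    # 1-based residue index, inclusive


def uncovered_segments(domains, protein_length):
    """Return (start, end) residue ranges that no domain covers.

    The first and last ranges (if any) are the terminal tails;
    ranges in between are inter-domain linkers.
    """
    segments = []
    cursor = 1  # next residue not yet assigned to a domain
    for dom in sorted(domains, key=lambda d: d.start):
        if dom.start > cursor:
            segments.append((cursor, dom.start - 1))
        cursor = max(cursor, dom.end + 1)
    if cursor <= protein_length:
        segments.append((cursor, protein_length))
    return segments


# Hypothetical three-domain protein of 300 residues (made-up boundaries).
TOY_DOMAINS = [
    Domain("domain_A", 10, 100),
    Domain("domain_B", 120, 200),
    Domain("domain_C", 230, 290),
]
TOY_SEGMENTS = uncovered_segments(TOY_DOMAINS, protein_length=300)
```

In this toy case the middle two segments are the linkers; a real analysis would take domain boundaries from an annotation resource rather than hard-coding them.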

SARS-CoV-2 Structural Biology

  • The virus has four structural proteins:

    • S (Spike): Homotrimer; mediates ACE2 receptor binding; ~50–90 copies per viral particle
    • E (Envelope): Pentamer; only 2–3 copies per particle
    • M (Membrane): Dimer; forms a structural lattice; ~1,000 copies per particle; makes up the bulk of the outer shell
    • N (Nucleocapsid): Protects viral RNA; likely interacts with M protein; contributes to outer shell stability
  • The spike protein’s receptor-binding domain (RBD) operates asynchronously: typically only one of the three arms opens at a time to bind the ACE2 receptor

  • A recently identified mutation (studied in collaboration with UMass Medical School) causes two arms to open simultaneously, potentially increasing binding efficiency

  • About one-third of the spike protein is embedded in the viral membrane and remains structurally unresolved; its function is still poorly understood

Drug Target Potential

  • Spike protein: Current primary target for vaccines and antibody therapies
  • M protein lattice: Emerging target; disrupting the outer shell could destroy the viral particle entirely; evolutionarily more stable (slower-moving target) than spike
  • Nanoparticle decoys: Engineered particles mimicking viral shape with integrated spike proteins could block ACE2 receptors competitively, preventing real virus entry

Viral Evolution and Mutation

  • Mutations are the primary mechanism by which viruses jump between species — each cross-species jump introduces new mutations shaped by the new host environment
  • It is unknown whether mutations acquired in animal hosts are neutral or harmful when the virus returns to humans
  • The M protein is more evolutionarily conserved than the spike protein, making it a more stable long-term drug target
  • The “arms race” between vaccine rollout and viral mutation is being monitored, but current evidence does not strongly support aggressive vaccine-escape mutations in SARS-CoV-2
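One common way to put a number on "more evolutionarily conserved" is the average per-column Shannon entropy of a multiple sequence alignment: lower entropy means less variation at that position. The sketch below is a minimal illustration of that idea, not the method used in the conversation. The two toy alignments are invented strings, not real M or spike sequences, and the code assumes equal-length rows with no special handling of gaps.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def mean_entropy(alignment):
    """Average column entropy over an alignment given as equal-length strings."""
    columns = list(zip(*alignment))
    return sum(column_entropy(col) for col in columns) / len(columns)


# Invented toy alignments (hypothetical, for illustration only).
CONSERVED_TOY = ["MADSN", "MADSN", "MADSN", "MADTN"]
VARIABLE_TOY = ["MFVFL", "MFIFL", "MLVSL", "MFLFL"]

CONSERVED_SCORE = mean_entropy(CONSERVED_TOY)
VARIABLE_SCORE = mean_entropy(VARIABLE_TOY)
```

Under this measure the protein whose alignment has the lower mean entropy would be called the more conserved one; real comparisons would also need to account for alignment quality, sampling bias, and sequence length.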

Alternative Splicing and Gene-Level Complexity

  • In eukaryotes, a single gene contains exons (coding regions) and introns (non-coding, spliced out)
  • Alternative splicing allows different combinations of exons to be assembled, producing multiple distinct protein isoforms from one gene
  • This process is not random — it is regulated dynamically in response to disease states and developmental stages (visible via RNA-seq)
  • Boundaries of exons frequently coincide with boundaries of protein domains, suggesting deep evolutionary coupling between gene structure and protein architecture
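The combinatorics of exon choice can be sketched in a few lines. The example below models only the simplest case, cassette-exon skipping, where some exons are always kept and the rest are optional. The function name `enumerate_isoforms`, the exon sequences, and the choice of which exons are constitutive are all hypothetical. Real splicing is regulated, so only a subset of these combinations would be expected to occur in a cell, and the sketch ignores reading frame entirely.

```python
from itertools import combinations


def enumerate_isoforms(exons, constitutive):
    """List transcripts obtainable by skipping optional exons.

    exons: exon sequences in genomic order.
    constitutive: indices of exons that are always included.
    Exon order is preserved; only inclusion or exclusion varies.
    """
    optional = [i for i in range(len(exons)) if i not in constitutive]
    isoforms = []
    for k in range(len(optional) + 1):
        for kept in combinations(optional, k):
            included = sorted(set(constitutive) | set(kept))
            isoforms.append("".join(exons[i] for i in included))
    return isoforms


# Hypothetical toy gene with four short exons; the first and last are always kept.
TOY_EXONS = ["ATGGCC", "GATTAC", "CCGGAA", "TTTTAA"]
TOY_ISOFORMS = enumerate_isoforms(TOY_EXONS, constitutive={0, 3})
```

With two optional exons there are four possible transcripts here; the count doubles with each additional optional exon, which is one way to see how a single gene can yield many isoforms.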

Protein Folding and AlphaFold2

  • Protein folding = the process by which a linear amino acid sequence adopts its functional 3D structure, reliably and rapidly
  • Folding begins before translation is complete — secondary structures form as the protein emerges from the ribosome
  • CASP (Critical Assessment of Protein Structure Prediction) is the biennial competition used to benchmark folding prediction methods

AlphaFold2’s Achievements

  • Predicts 3D structure, and with it the contact map (which residues are spatially close), with near-experimental accuracy for compact, single- or two-domain proteins
  • Incorporates evolutionary sequence alignments across species to extract structural conservation signals
  • Represents a genuine paradigm shift — the first machine learning system to clearly outperform physics-based methods on this problem
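To make the term "contact map" concrete, the sketch below derives one from a set of 3D coordinates: two residues count as in contact when their representative atoms lie within a distance cutoff. This is a generic illustration, not AlphaFold2's method. The 8 Å cutoff is a commonly used convention assumed here, and the coordinates are made-up points, not a real structure.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Return a symmetric boolean matrix of residue contacts.

    coords: one (x, y, z) tuple per residue, in angstroms.
    cutoff: distance threshold in angstroms (assumed value, see lead-in).
    Self-contacts on the diagonal are left as False.
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap


# Made-up coordinates: three residues spaced 3.8 A apart, plus one far away.
TOY_COORDS = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
TOY_CMAP = contact_map(TOY_COORDS)
```

The map is a much coarser description than the full 3D structure, which is why it says nothing about the flexibility or folding pathway discussed elsewhere in these notes.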

What Remains Unsolved

  • Multi-domain proteins (3–7+ domains) — especially those involved in neural systems — are far from solved
  • Dynamic and disordered proteins (like PSD-95, a key synaptic scaffolding protein with 5 domains) require understanding of flexibility, not just static structure
  • Protein-protein interactions and macromolecular complexes require a separate class of methods (benchmarked in the CAPRI competition)
  • The mechanistic process of how folding occurs in real time remains unknown — AlphaFold predicts the endpoint, not the pathway

Origin of Life and Astrobiology

  • Glycine (one of the 20 standard amino acids) was detected in dust from the comet 67P/Churyumov–Gerasimenko, suggesting that the building blocks of organic chemistry exist in space
  • The rare earth hypothesis argues Earth’s conditions (planetary shielding, distance from sun, land/water ratio, etc.) may be uniquely necessary for complex life
  • Life’s origin remains deeply mysterious — current models don’t explain the transition from chemistry to replication
  • RNA viruses suggest RNA-based life could be a viable alternative to protein-based life elsewhere

AI and Bioinformatics: Historical Context

  • Joshua Lederberg (1958 Nobel Prize in Physiology or Medicine, for work in genetics) helped create DENDRAL in the 1960s, one of the first AI expert systems, to infer molecular structures from mass spectrometry data for extraterrestrial molecule analysis
  • Expert systems transitioned into modern machine learning rather than “failing” — their core ideas (domain knowledge encoding) persist in systems like AlphaFold
  • Korkin considers AlphaFold2 one of the top three AI breakthroughs in history, alongside Deep Blue defeating Kasparov (1997) and the deep learning revolution in computer vision

Mentioned Concepts

  • protein domain
  • protein folding
  • alternative splicing
  • cryo-electron microscopy
  • ACE2 receptor
  • spike protein
  • homotrimer
  • contact map
  • CASP competition