Proteins, Viruses, Evolution, and AI: A Deep Dive with Dmitry Korkin

Summary

Computational biologist Dmitry Korkin explores the modular architecture of proteins, the structural biology of SARS-CoV-2, the evolutionary origins of life, and the landmark achievements of DeepMind’s AlphaFold2. The conversation bridges molecular biology, evolutionary theory, and artificial intelligence to illuminate how life’s building blocks are organized, how viruses work, and how machine learning is transforming biological discovery.


Key Takeaways

  • Protein domains, not individual amino acids, are the true functional and evolutionary building blocks of proteins — they get shuffled and recombined throughout evolution
  • The SARS-CoV-2 spike protein operates as a homotrimer (three identical copies); recent mutations have been observed to open more than one receptor-binding arm at the same time, potentially increasing infectivity
  • The M protein of SARS-CoV-2 forms a structural lattice across the viral envelope and is evolutionarily more stable than the spike protein, making it a promising — though underexplored — drug target
  • AlphaFold2 represents a landmark achievement in predicting 3D protein structures, but “solving protein folding” is overstated — complex multi-domain proteins and dynamic interactions remain largely unsolved
  • Alternative splicing creates multiple distinct protein products from a single gene, adding a profound layer of complexity beyond the one-gene-one-protein model
  • Protein linkers — flexible segments connecting domains — are understudied but functionally critical, enabling dynamic spatial reorganization and protein-protein interactions
  • The RNA world hypothesis and the discovery of glycine (one of the 20 standard amino acids) in comet dust suggest the chemical precursors of life may be widespread in the universe
  • Machine learning scoring functions for protein-protein interactions were pioneered over a decade ago, foreshadowing the AlphaFold revolution

Detailed Notes

Protein Architecture: Domains as the True Building Block

  • A protein is best understood as a string of beads, where each bead is a protein domain — a structurally and functionally independent unit
  • Protein domains are the units of evolutionary shuffling: they get combined in new ways across species, generating new functions
  • Why this isn’t widely known: Early structural biology used X-ray crystallography and NMR spectroscopy, which worked best on small, single-domain proteins — reinforcing the incorrect “globular blob” model
  • Cryo-electron microscopy (cryo-EM) changed this, enabling structural resolution of large, multi-domain complexes like the spike protein
  • Linkers between domains are highly flexible and largely unstudied, but play key roles in protein-protein interactions and spatial dynamics
  • Intrinsically disordered regions (like tails/termini) are also functionally important despite lacking stable structure
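
The "string of beads" picture above can be made concrete with a small data structure. The following is only a minimal sketch: the protein length, the domain names, and the residue ranges are made-up toy values, and the labels "linker" (a gap between two domains) and "tail" (a gap at either terminus) are illustrative assumptions rather than an established annotation scheme.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    name: str
    start: int  # 1-based residue index, inclusive
    end: int    # 1-based residue index, inclusive
    kind: str   # "domain", "linker", or "tail"

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def build_architecture(protein_length: int,
                       domains: List[Tuple[str, int, int]]) -> List[Segment]:
    """Turn a list of annotated domains into an ordered list of segments,
    filling every unannotated gap with a linker or a terminal tail."""
    segments: List[Segment] = []
    cursor = 1  # next residue not yet assigned to a segment
    for name, start, end in sorted(domains, key=lambda d: d[1]):
        if start < cursor or end < start or end > protein_length:
            raise ValueError(f"bad or overlapping domain: {name}")
        if start > cursor:
            # A gap before the first domain is a tail; any later gap is a linker.
            kind = "tail" if cursor == 1 else "linker"
            segments.append(Segment(f"{kind}_{cursor}_{start - 1}", cursor, start - 1, kind))
        segments.append(Segment(name, start, end, "domain"))
        cursor = end + 1
    if cursor <= protein_length:
        segments.append(Segment(f"tail_{cursor}_{protein_length}", cursor, protein_length, "tail"))
    return segments


# Hypothetical 300-residue protein with three annotated domains (toy values).
toy_domains = [("domain_A", 11, 100), ("domain_B", 121, 200), ("domain_C", 231, 290)]
architecture = build_architecture(300, toy_domains)
for seg in architecture:
    print(f"{seg.kind:<7} {seg.name:<14} {seg.start:>3}-{seg.end:<3} ({seg.length} aa)")
```

Representing linkers and tails as first-class segments, rather than as leftover gaps, reflects the point made above that these regions deserve study in their own right.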

SARS-CoV-2 Structural Biology

  • The virus has four structural proteins:

    • S (Spike): Homotrimer; mediates ACE2 receptor binding; ~50–90 copies per viral particle
    • E (Envelope): Pentamer; only 2–3 copies per particle
    • M (Membrane): Dimer; forms a structural lattice; ~1,000 copies per particle; makes up the bulk of the outer shell
    • N (Nucleocapsid): Protects viral RNA; likely interacts with M protein; contributes to outer shell stability
  • The spike protein’s receptor-binding domain (RBD) operates asynchronously: typically only one of the three arms opens at a time to bind the ACE2 receptor
  • A recently identified mutation (studied in collaboration with UMass Medical School) causes two arms to open simultaneously, potentially increasing binding efficiency
  • Roughly one-third of the spike protein is embedded in the viral membrane and remains structurally unresolved; its function is still poorly understood
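
As simple bookkeeping for the open/closed behavior of the three arms described above, one can enumerate every configuration of a trimer. This is only a combinatorial sketch: it treats each arm as a two-state switch and says nothing about how likely each configuration is, which is an assumption made purely for illustration.

```python
from collections import Counter
from itertools import product

N_ARMS = 3  # the spike is a homotrimer, so there are three RBD arms

# Every combination of open/closed across the three arms.
states = list(product(("closed", "open"), repeat=N_ARMS))

# Group the configurations by how many arms are open.
by_open_count = Counter(state.count("open") for state in states)

for n_open in range(N_ARMS + 1):
    print(f"{n_open} arm(s) open: {by_open_count[n_open]} configuration(s)")
```

In this picture the typical behavior corresponds to the one-arm-open configurations, and the mutation discussed above shifts the protein toward the two-arms-open ones.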

Drug Target Potential

  • Spike protein: Current primary target for vaccines and antibody therapies
  • M protein lattice: Emerging target; disrupting the outer shell could destroy the viral particle entirely; evolutionarily more stable (slower-moving target) than spike
  • Nanoparticle decoys: Engineered particles mimicking viral shape with integrated spike proteins could block ACE2 receptors competitively, preventing real virus entry

Viral Evolution and Mutation

  • Mutations are the primary mechanism by which viruses jump between species — each cross-species jump introduces new mutations shaped by the new host environment
  • It is unknown whether mutations acquired in animal hosts are neutral or harmful when the virus returns to humans
  • The M protein is more evolutionarily conserved than the spike protein, making it a more stable long-term drug target
  • The “arms race” between vaccine rollout and viral mutation is being monitored, but current evidence does not strongly support aggressive vaccine-escape mutations in SARS-CoV-2
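
One common way to put a number on "more evolutionarily conserved" is the per-column Shannon entropy of a multiple sequence alignment: columns where every sequence has the same residue score 0 bits, and variable columns score higher. The sketch below uses two made-up toy alignments, not real M or spike sequences, and treats a gap character as just another symbol, which is a simplifying assumption.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mean_entropy(alignment: List[str]) -> float:
    """Average column entropy over an alignment of equal-length sequences."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    entropies = [column_entropy("".join(col)) for col in zip(*alignment)]
    return sum(entropies) / len(entropies)


# Toy alignments (hypothetical sequences, for illustration only).
conserved_like = ["MKVLA", "MKVLA", "MKVLA", "MKILA"]
variable_like = ["MKVLA", "MRTLG", "QKVFA", "MEILS"]

print(f"conserved-like: {mean_entropy(conserved_like):.3f} bits per column")
print(f"variable-like:  {mean_entropy(variable_like):.3f} bits per column")
```

A lower average entropy indicates a more conserved protein, which is the sense in which the M protein is described above as a slower-moving target than the spike.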

Alternative Splicing and Gene-Level Complexity

  • In eukaryotes, a single gene contains exons (coding regions) and introns (non-coding, spliced out)
  • Alternative splicing allows different combinations of exons to be assembled, producing multiple distinct protein isoforms from one gene
  • This process is not random — it is regulated dynamically in response to disease states and developmental stages (visible via RNA-seq)
  • Boundaries of exons frequently coincide with boundaries of protein domains, suggesting deep evolutionary coupling between gene structure and protein architecture
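
To see how quickly alternative splicing multiplies the products of a single gene, the sketch below enumerates isoforms under a deliberately simplified model: exons keep their genomic order, some exons are always included, and each remaining exon is either kept or skipped. The exon names and the choice of which exons are always included are hypothetical, and real splicing is regulated, so only a subset of these combinations would be expected in a cell.

```python
from itertools import combinations
from typing import List, Set, Tuple


def exon_skipping_isoforms(exons: List[str], constitutive: Set[str]) -> List[Tuple[str, ...]]:
    """Enumerate isoforms in which every constitutive exon is present and
    each remaining (cassette) exon is independently kept or skipped."""
    optional = [e for e in exons if e not in constitutive]
    isoforms = []
    for k in range(len(optional) + 1):
        for kept in combinations(optional, k):
            included = set(kept) | constitutive
            # Rebuild the isoform in genomic order.
            isoforms.append(tuple(e for e in exons if e in included))
    return isoforms


# Hypothetical four-exon gene whose first and last exons are always included.
exons = ["E1", "E2", "E3", "E4"]
isoforms = exon_skipping_isoforms(exons, constitutive={"E1", "E4"})
for iso in isoforms:
    print("-".join(iso))
```

With n skippable exons this model gives up to 2^n isoforms, so it is best read as an upper bound on what one gene could encode, not as a prediction of what is actually expressed.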

Protein Folding and AlphaFold2

  • Protein folding = the process by which a linear amino acid sequence adopts its functional 3D structure, reliably and rapidly
  • Folding begins before translation is complete — secondary structures form as the protein emerges from the ribosome
  • CASP (Critical Assessment of Protein Structure Prediction) is the biennial competition used to benchmark folding prediction methods

AlphaFold2’s Achievements

  • Predicts 3D structures, including which residues are spatially close (the contact map), with near-experimental accuracy for compact, single- or two-domain proteins
  • Incorporates evolutionary sequence alignments across species to extract structural conservation signals
  • Represents a genuine paradigm shift — the first machine learning system to clearly outperform physics-based methods on this problem
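
For readers unfamiliar with the term, a contact map is a residue-by-residue table marking which pairs are spatially close. The sketch below computes one from known coordinates, which is the reverse of the prediction problem and is not how AlphaFold2 works. The 8 Å cutoff is a commonly used convention and is an assumption here, and the four coordinates are made-up toy points.

```python
import math
from typing import List, Sequence


def contact_map(coords: Sequence[Sequence[float]], cutoff: float = 8.0) -> List[List[bool]]:
    """Return an n x n boolean matrix where entry [i][j] is True when
    residues i and j lie within `cutoff` angstroms of each other."""
    n = len(coords)
    return [
        [math.dist(coords[i], coords[j]) <= cutoff for j in range(n)]
        for i in range(n)
    ]


# Toy coordinates in angstroms, one point per residue (hypothetical values).
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (20.0, 0.0, 0.0),
]
cmap = contact_map(toy_coords)
for row in cmap:
    print(" ".join("1" if in_contact else "." for in_contact in row))
```

The matrix is symmetric, and every residue is trivially in contact with itself, so in practice only pairs separated by several positions along the sequence are informative.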

What Remains Unsolved

  • Multi-domain proteins (3–7+ domains) — especially those involved in neural systems — are far from solved
  • Dynamic and disordered proteins (like PSD-95, a key synaptic scaffolding protein with 5 domains) require understanding of flexibility, not just static structure
  • Protein-protein interactions and macromolecular complexes require a separate class of methods (benchmarked in the CAPRI competition)
  • The mechanistic process of how folding occurs in real time remains unknown — AlphaFold predicts the endpoint, not the pathway

Origin of Life and Astrobiology

  • Glycine (one of the 20 standard amino acids) was detected in dust from comet 67P/Churyumov–Gerasimenko, suggesting that the building blocks of organic chemistry exist in space
  • The Rare Earth hypothesis argues that Earth’s conditions (planetary shielding, distance from the Sun, land/water ratio, etc.) may be uniquely necessary for complex life
  • Life’s origin remains deeply mysterious — current models don’t explain the transition from chemistry to replication
  • RNA viruses suggest RNA-based life could be a viable alternative to protein-based life elsewhere

AI and Bioinformatics: Historical Context

  • Joshua Lederberg (1958 Nobel Prize, genetics) created DENDRAL in the 1960s — one of the first AI expert systems — to infer molecular structures from mass spectrometry data for extraterrestrial molecule analysis
  • Expert systems transitioned into modern machine learning rather than “failing” — their core ideas (domain knowledge encoding) persist in systems like AlphaFold
  • Korkin ranks AlphaFold2 among the top three AI breakthroughs in history, alongside Deep Blue defeating Kasparov (1997) and the deep learning revolution in computer vision

Mentioned Concepts