Note on AlphaFold2

Original post, last modified 2025-06-03 12:46:49


This content is licensed under CC 4.0 BY-SA.


First published 2025-06-03 10:20:57

Preface

A well-known and by now old paper, from 2021.
I am just repeating what others have already said.

ref

Highly accurate protein structure prediction with AlphaFold
https://doi.org/10.1038/s41586-021-03819-2
https://github.com/google-deepmind/alphafold

Understanding AlphaFold2, an article by 简枫 on Zhihu
https://zhuanlan.zhihu.com/p/570610949

background & preliminaries

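AA (amino acid)

Note: this abbreviation is informal and usually avoided in professional writing, but I like it.

Residue. An amino acid inside a peptide chain has lost an H and an OH through dehydration condensation, so it is no longer a complete AA and is called a residue. Abbreviated R, after its first letter.

Primary structure: the AA sequence
Secondary structure:

α-helix
β-sheet
loop, often described as random coil. Worth remembering: in antibodies, the antigen-binding regions are called CDR loops.

Tertiary structure: coordinates, with shape (N_atoms, 3).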

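remark:
It is easy to get the feeling that
the concept of secondary structure does not need to exist.
In the pre-AI era, people simply could not obtain a fine-grained tertiary structure directly (David Baker devoted a lifetime to it),
so they settled for the next best thing: obtaining secondary structure of acceptable accuracy by various methods, to support downstream prediction, design, and analysis tasks.
Once tertiary structure is accurate enough, the secondary level can be dropped.
Sequence is sequence, structure is structure.
You want to argue that an intermediate-level concept such as functional groups has practical value?
emmm…
I believe,
in the era of AI,
the general direction is entity decomposition.
We model atoms only because we cannot model electron clouds, and groups only because we cannot model atoms.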

Task definition:

input: AA sequence (aka primary structure)
output: coordinates (aka tertiary structure)

framework

~ MSA

with Protein Database:
	target sequence -> MMseqs2
		->MSA 
		->Template

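Homologs

Note: the short version is that mmseqs finds the homologs. (The original AlphaFold2 pipeline runs JackHMMER and HHblits for this search; MMseqs2 is what ColabFold uses.)
A closer look shows that mmseqs is only a similarity scoring tool.
It just returns the most similar sequences in the db.
People introduce an assumption here: sequences whose similarity score exceeds some threshold are taken to be homologous.
Strictly speaking, this assumption is not correct.
Homologs are defined as a group of proteins that evolved from a common ancestor.
The sequences of homologous proteins can look completely dissimilar (evolution is disorderly).
The sequences of non-homologous proteins can look very similar (evolution is accidental).
This kind of assumption is typical AI wishful thinking.
To be strict, what is returned here should be called something like similar proteins, friend/companion proteins, or neighbor/nearest proteins.
The authors' use of the word Homologs has in practice misled many beginners.
Of course, this writing technique is also worth learning.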

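MSA (Multiple Sequence Alignment)

Aligning the primary structures (AA sequences) of all the "homologs" is called MSA.
The alignment introduces gaps. The default gap symbol is the hyphen -.
Take care to keep this consistent with the tokenizer of the NN model. (Some tokenizers use a different character as the <gap> symbol.)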
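A minimal sketch of the gap bookkeeping described above, turning aligned rows into an (n_msa, n_residue) grid of token ids. The vocabulary, the `X` unknown token, and the function name `tokenize_msa` are all made up for illustration; a real model ships its own tokenizer.

```python
# Hypothetical vocabulary: 20 standard amino acids, an unknown token, and the gap.
AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i for i, aa in enumerate(AA_ALPHABET)}
VOCAB["X"] = len(VOCAB)   # unknown residue
VOCAB["-"] = len(VOCAB)   # gap token expected by this toy tokenizer


def tokenize_msa(aligned_seqs, gap_char="-"):
    """Turn aligned sequences into an (n_msa, n_residue) grid of token ids."""
    n_residue = len(aligned_seqs[0])
    if any(len(s) != n_residue for s in aligned_seqs):
        raise ValueError("all rows of an MSA must have the same length")
    rows = []
    for seq in aligned_seqs:
        # Map the aligner's gap character onto the tokenizer's gap token.
        seq = seq.replace(gap_char, "-")
        rows.append([VOCAB.get(ch, VOCAB["X"]) for ch in seq.upper()])
    return rows


msa = ["MKV-LA", "MRVQLA", "M-V-LA"]
msa_tokens = tokenize_msa(msa)
```

If the aligner writes gaps as `.` instead, passing `gap_char="."` keeps the two sides consistent.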

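Template

The 3D structures (coordinates) of the homologs above are called templates.
Each individual structure is one template.

Note: it naturally follows that,
since one AA sequence can have infinitely many structures,
the templates obtained this way would be very numerous.
So a filtering step is inevitable:
keep the templates with high enough similarity (how?),
and drop the ones with poor similarity.

alphafold2 supports a forward pass both with and without templates.
With templates, the template information serves as a structural prior (in the paper it enters through the input embeddings) and speeds up structural convergence.
Without templates, it is free modeling.

MSA is required.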

~ feature embd

# Input
# obtained in the previous step
msa_tensor = tensor(n_msa, n_residue)
templates = Optional[tensor(n_templates, n_residue, 37, 3)]

# embd msa
msa_feature = tensor(n_msa, n_residue, n_dim)
pair_feature = tensor(n_residue, n_residue, n_dim)

# embd templates
if templates is not None:
	templates_feature = tensor(n_templates, n_residue, n_dim)
	template_pair_representation = tensor(n_residue, n_residue, n_dim)
	# update: template rows are appended to the MSA rows (n_templates != n_msa, so they cannot be added),
	# while the pair update is a plain addition
	msa_feature = concat([msa_feature, templates_feature], dim=0)
	pair_feature += template_pair_representation

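Note:
37 = the number of distinct heavy-atom names across the 20 standard amino acids (the atom37 layout):
4 backbone atoms: N, Cα, C, O.
2 shared atoms: Cβ and OXT (the C-terminal oxygen).
31 side-chain atom names that only some amino acids have (CG, CD, NZ, ...).
Hydrogens are not included. Atoms that a residue does not have are zero-filled.

Note:
The dim of pair_feature is usually smaller than that of msa_feature.
This is intuitive, because the pair tensor already scales as N^2 in the number of residues.
For simplicity n_dim is used for both here.
Keep the difference in mind.

Comment:
Reading through this, protein-related things look numerous and messy, but the degrees of freedom do not actually seem that high?
There is so much of it, hundreds of pieces in one big lump, and yet it can still be handled with templates.
No wonder so many engineering-minded people like this field.
Low degrees of freedom allow for all sorts of split-and-replace tricks.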

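Relative Position Encoding, relpos

Compute the relative position index between residues along the sequence (one index per residue, i.e. per C-alpha).

The next residue is +1,
the previous one is -1,
and so on.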
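A minimal sketch of relpos as a pair feature, assuming the offsets are clipped to a window and then one-hot encoded before a linear projection into the pair dim. The function name `relpos_one_hot` is made up, and the default window of 32 is my reading of the paper; treat it as an assumption.

```python
import torch
import torch.nn.functional as F


def relpos_one_hot(residue_index, max_offset=32):
    """One-hot relative position features, shape (n_residue, n_residue, 2*max_offset+1)."""
    # offset[i, j] = index_i - index_j, e.g. +1 / -1 for sequence neighbours
    offset = residue_index[:, None] - residue_index[None, :]
    # Clip long-range offsets, then shift to 0..2*max_offset so they can be one-hot encoded
    offset = offset.clamp(-max_offset, max_offset) + max_offset
    return F.one_hot(offset, num_classes=2 * max_offset + 1).float()


residue_index = torch.arange(5)
feat = relpos_one_hot(residue_index, max_offset=2)  # tiny window just for the demo
```

A `torch.nn.Linear(2 * max_offset + 1, pair_dim)` on top of this would give the initial contribution to pair_feature.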

extra msa ?

Skipped.

~ feature loop

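The paper calls this the recycle mechanism. Well…
It is said to mimic the process of protein folding.

This part deserves some attention.
I think the name recycle fits well.
It sets it apart from the usual iterate loop.

Usually when we iterate we flatten the structure,
and each iteration is handled by its own independent sub module.
With recycle, every cycle is handled by the same sub module.
Without residual handling such a structure would be hard to train; intuitively it looks like a residual update.

Running the same module forward several times like this immediately suggests that
computing gradients during bp becomes troublesome.
This has to be dealt with.
Sure enough, stopgrad is introduced here, and grad is computed only for the last cycle.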

Given:
- msa_feature = tensor(n_msa, n_residue, n_dim)
- pair_feature = tensor(n_residue, n_residue, n_dim2)

Do:

msa_feature, pair_feature = self.evoformer(msa_feature, pair_feature)
single_feature = Linear(msa_feature[0])  # the paper projects the first MSA row (the target sequence)
coords = self.structure_module(single_feature, pair_feature)

# handle the cycles
n_cycle = random.choice(range(1, 4))
for _ in range(n_cycle - 1):
	recycle_feat = self.recycle_embd(coords) # algorithm 32
	# the delta addition is the lazy option; concat would also work
	pair_feature = recycle_feat + LayerNorm(pair_feature)
	msa_feature = LayerNorm(msa_feature)
	# stopgrad: only the last pass through evoformer + structure_module gets gradients
	msa_feature = msa_feature.detach()
	pair_feature = pair_feature.detach()
	# next feature forward
	msa_feature, pair_feature = self.evoformer(msa_feature, pair_feature)
	single_feature = Linear(msa_feature[0])
	coords = self.structure_module(single_feature, pair_feature)

return single_feature, coords

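remark:
At inference time, n_cycle is fixed to 3.

What recycle_embd does, concretely:
from coords,
compute the pairwise C-beta distances (C-alpha for glycine),
bin them into a onehot -> projection.
I asked gpt to write some pseudocode. Input coords, output embeddings.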

import torch
import torch.nn.functional as F

# Illustrative atom index table; only the backbone atoms are needed here.
atom_order = {"N": 0, "CA": 1, "C": 2, "O": 3}

def update_structure_embeddings(predicted_coords, pair_proj, frame_proj, n_bins=15):
    """
    predicted_coords: [batch, num_residues, atom_type, 3] (e.g., atom_type=N/CA/C/O)
    pair_proj: nn.Linear(n_bins, pair_dim), frame_proj: nn.Linear(6, single_dim)
    returns: single [batch, num_residues, single_dim], pair [batch, num_residues, num_residues, pair_dim]
    """
    # 1. Take the CA coordinates as the representative point of each residue
    ca_coords = predicted_coords[:, :, atom_order["CA"], :]  # [batch, num_res, 3]

    # 2. Pairwise distance matrix, binned to a onehot and then projected
    dists = torch.cdist(ca_coords, ca_coords)  # [batch, num_res, num_res]
    edges = torch.linspace(3.0, 21.0, n_bins - 1, device=dists.device)  # illustrative bin edges
    dist_onehot = F.one_hot(torch.bucketize(dists, edges), n_bins).float()
    pair_embeddings = pair_proj(dist_onehot)   # inter-residue features

    # 3. Local coordinate frame of each residue
    frames = compute_frames(predicted_coords)  # [batch, num_res, 6] (6D rotation representation)
    single_embeddings = frame_proj(frames)     # per-residue features

    # 4. The two outputs have different ranks, so return them separately instead of concatenating
    return single_embeddings, pair_embeddings

def compute_frames(coords):
    """
    Local frame of each residue from its N/CA/C atoms (6D rotation representation).
    """
    n_coords = coords[:, :, atom_order["N"], :]
    ca_coords = coords[:, :, atom_order["CA"], :]
    c_coords = coords[:, :, atom_order["C"], :]

    # Gram-Schmidt: build two orthonormal basis vectors
    e1 = F.normalize(c_coords - ca_coords, dim=-1)   # CA -> C
    v2 = n_coords - ca_coords                        # CA -> N
    e2 = F.normalize(v2 - (e1 * v2).sum(-1, keepdim=True) * e1, dim=-1)

    # 6D rotation representation: the first two columns; the third is cross(e1, e2)
    rotation_6d = torch.cat([e1, e2], dim=-1)   # [batch, num_res, 6]
    return rotation_6d

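As I understand it, the 6D representation here is simply the first 2 columns of the rotation matrix.
Because a rotation matrix is orthogonal, the 3rd column can be derived as the cross product of the first 2.
Storing 6D saves 3 parameters compared with storing the full 9D rotation matrix.
On the Continuity of Rotation Representations in Neural Networks
https://arxiv.org/abs/1812.07035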
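A minimal sketch of going back from 6D to a full rotation matrix, following the Gram-Schmidt recipe in the paper linked above. The function name `rotation_6d_to_matrix` is my own; the input does not need to be orthonormal already, which is what makes the representation convenient as a network output.

```python
import torch
import torch.nn.functional as F


def rotation_6d_to_matrix(d6):
    """Rebuild (..., 3, 3) rotation matrices from (..., 6) vectors by Gram-Schmidt."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    # Remove the component of a2 along b1, then normalise
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    # The third axis follows from the first two
    b3 = torch.cross(b1, b2, dim=-1)
    # Stack as columns: R = [b1 | b2 | b3]
    return torch.stack([b1, b2, b3], dim=-1)


d6 = torch.randn(4, 6)
R = rotation_6d_to_matrix(d6)  # (4, 3, 3)
```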

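The structures of evoformer (48 blocks) and
structure_module (8 blocks) need a look.
See below.

Ablations come later.
They say the recycle process is crucial,
much more important than the IPA structure.
IPA (invariant point attention) is the attention module inside structure_module that attends over residues using their backbone frames.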

~ Evoformer

Interesting stuff.

The triangle mechanism is ablated later.
It matters a lot.
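A rough sketch of one piece of the triangle mechanism, the multiplicative update with "outgoing" edges, written from my reading of the paper's supplementary algorithm. The class name, the hidden size of 32, and the residual add inside `forward` are my own choices, so treat this as an illustration of the idea, not the reference implementation.

```python
import torch
import torch.nn as nn


class TriangleMultiplicationOutgoing(nn.Module):
    """Update edge (i, j) from edges (i, k) and (j, k) over every third residue k."""

    def __init__(self, pair_dim, hidden_dim=32):
        super().__init__()
        self.norm_in = nn.LayerNorm(pair_dim)
        self.proj_a = nn.Linear(pair_dim, hidden_dim)
        self.gate_a = nn.Linear(pair_dim, hidden_dim)
        self.proj_b = nn.Linear(pair_dim, hidden_dim)
        self.gate_b = nn.Linear(pair_dim, hidden_dim)
        self.norm_out = nn.LayerNorm(hidden_dim)
        self.proj_out = nn.Linear(hidden_dim, pair_dim)
        self.gate_out = nn.Linear(pair_dim, pair_dim)

    def forward(self, pair):
        # pair: (n_residue, n_residue, pair_dim)
        z = self.norm_in(pair)
        a = torch.sigmoid(self.gate_a(z)) * self.proj_a(z)
        b = torch.sigmoid(self.gate_b(z)) * self.proj_b(z)
        # Sum over k of a[i, k] * b[j, k]: the two other edges of triangle (i, j, k)
        tri = torch.einsum("ikc,jkc->ijc", a, b)
        update = torch.sigmoid(self.gate_out(z)) * self.proj_out(self.norm_out(tri))
        return pair + update  # residual update of the pair representation


layer = TriangleMultiplicationOutgoing(pair_dim=16)
pair_feature = torch.randn(7, 7, 16)
new_pair_feature = layer(pair_feature)
```

The point of the design is that an edge never gets updated in isolation: each (i, j) sees the two other sides of every triangle it belongs to, which pushes the pair representation toward geometrically consistent distances.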

~ structure_module

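The structure_module turns single_feature and pair_feature into a 3D structure by updating backbone frames, one rigid frame per residue.
In the paper the Optional[templates] enter only in the feature embd stage; they are not used to initialize the backbone frames here.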
IPA（Invariant point attention）

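backbone_frames = None # maybe
pred_rel_rotations, pred_rel_translations = self.ipa(single_feature, pair_feature, backbone_frames)

black hole initialization: every frame starts at the origin with identity rotation, and the structure grows from there.

3 rotation parameters + 3 translation components describe one residue on the backbone.
relative rotation
relative translation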
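A minimal sketch of how the 3 + 3 numbers per residue could update the backbone frames, assuming the three rotation parameters are the (b, c, d) part of a quaternion (1, b, c, d) that gets normalized, which is how I read the paper's backbone update. All function names here are my own.

```python
import torch


def black_hole_init(n_residue):
    """Every residue starts with identity rotation at the origin."""
    rots = torch.eye(3).repeat(n_residue, 1, 1)   # (n_residue, 3, 3)
    trans = torch.zeros(n_residue, 3)             # (n_residue, 3)
    return rots, trans


def quat_bcd_to_rot(bcd):
    """Three numbers (b, c, d) -> rotation matrix via the unit quaternion (1, b, c, d) / norm."""
    ones = torch.ones_like(bcd[..., :1])
    q = torch.cat([ones, bcd], dim=-1)
    a, b, c, d = (q / q.norm(dim=-1, keepdim=True)).unbind(-1)
    rows = [
        torch.stack([a*a + b*b - c*c - d*d, 2*b*c - 2*a*d, 2*b*d + 2*a*c], dim=-1),
        torch.stack([2*b*c + 2*a*d, a*a - b*b + c*c - d*d, 2*c*d - 2*a*b], dim=-1),
        torch.stack([2*b*d - 2*a*c, 2*c*d + 2*a*b, a*a - b*b - c*c + d*d], dim=-1),
    ]
    return torch.stack(rows, dim=-2)


def compose_frames(rots, trans, rel_rots, rel_trans):
    """T <- T o dT: apply the relative update inside each residue's current local frame."""
    new_rots = rots @ rel_rots
    new_trans = (rots @ rel_trans[..., None])[..., 0] + trans
    return new_rots, new_trans


rots, trans = black_hole_init(4)
rel_rots = quat_bcd_to_rot(torch.randn(4, 3))
rel_trans = torch.randn(4, 3)
rots, trans = compose_frames(rots, trans, rel_rots, rel_trans)
```

Because the update is relative, (b, c, d) = 0 with zero translation leaves a frame where it is, so a block that has nothing to add can output zeros.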

~ loss

FAPE (Frame Aligned Point Error)
Invariant to global rotation and translation (SE(3) invariant).
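A minimal sketch of FAPE under my reading of the paper: express every atom in every residue frame, compare predicted and true local coordinates, clamp, and average. The 10 Å clamp and scale and the small eps are the values I remember from the paper, so treat them as assumptions; the function names are my own.

```python
import torch


def to_local(rots, trans, points):
    """x_ij = R_i^T (x_j - t_i) for every frame i and point j, shape (n_frames, n_points, 3)."""
    diff = points[None, :, :] - trans[:, None, :]
    return torch.einsum("iab,ija->ijb", rots, diff)


def fape(pred_rots, pred_trans, pred_points,
         true_rots, true_trans, true_points,
         clamp=10.0, scale=10.0, eps=1e-4):
    """Clamped mean distance between predicted and true points, seen from every frame."""
    pred_local = to_local(pred_rots, pred_trans, pred_points)
    true_local = to_local(true_rots, true_trans, true_points)
    dist = torch.sqrt(((pred_local - true_local) ** 2).sum(-1) + eps)
    return dist.clamp(max=clamp).mean() / scale
```

Since only local coordinates are compared, moving the whole prediction by one rigid motion leaves the loss unchanged, and no separate superposition step is needed.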

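aux loss
distance loss: distogram prediction
msa loss: BERT-like prediction of masked msa positions
confidence loss: cross-entropy on the binned pLDDT prediction.
experiment resolved loss: predicts whether an atom is resolved in a high-resolution experimental structure.
violation loss

pLDDT: the predicted lDDT-Cα of residue i, i.e. the model's confidence in the local structure around it (higher means more reliable).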

$L_{pretrain}= 0.5L_{FAPE}+0.5L_{aux}+0.3L_{dist}+2.0L_{msa}+0.01L_{conf}$

$L_{finetune}= 0.5L_{FAPE}+0.5L_{aux}+0.3L_{dist}+2.0L_{msa}+0.01L_{conf}+0.01L_{expresolved}+1.0L_{vio}$
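The two formulas above differ only in the two extra finetune terms, which a small sketch makes explicit. The dictionary keys and the helper `total_loss` are made-up names; the weights are copied from the formulas.

```python
PRETRAIN_WEIGHTS = {"fape": 0.5, "aux": 0.5, "dist": 0.3, "msa": 2.0, "conf": 0.01}
# Finetune reuses the pretrain weights and adds the two extra terms
FINETUNE_WEIGHTS = {**PRETRAIN_WEIGHTS, "expresolved": 0.01, "vio": 1.0}


def total_loss(losses, weights):
    """Weighted sum over the loss terms named in `weights`."""
    return sum(w * losses[name] for name, w in weights.items())


# Dummy per-term values, only to show the call
losses = {name: 1.0 for name in FINETUNE_WEIGHTS}
pretrain_total = total_loss(losses, PRETRAIN_WEIGHTS)
finetune_total = total_loss(losses, FINETUNE_WEIGHTS)
```

The same helper works when the values are scalar tensors, so it can sit directly in a training step.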

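The ablations show:
MSA pair matters a lot, and MSA mask matters a lot.
ipa and aux have only a slight effect.
triangle matters a lot.
What is end-to-end structure gradient?

AF2 was pretrained on TPUv3 for about 7 days and finetuned for about 4 more days.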

~ Noisy Student Training

self-distillation

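Train a teacher model on labeled images.
Use the teacher to generate pseudo labels on unlabeled images.
Train a student model to fit the combination of labeled and pseudo-labeled images.

After N iterations, treat the student as the teacher,
relabel the unlabeled data, and train a new student.

data
20% labeled
80% unlabeled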

step 1.

model1 = Model(n_dim=2, n_layer=2)
train model1 with 20% labeled data
psu_y1 = model1.predict(80% unlabeled)

step 2.

model2 = Model(n_dim=4, n_layer=4)
train model2 with cat([20% labeled data, psu_y1])
psu_y2 = model2.predict(80% unlabeled)

step 3.

model3 = Model(n_dim=8, n_layer=8)
train model3 with cat([20% labeled data, psu_y2])
psu_y3 = model3.predict(80% unlabeled)
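The three steps can be sketched as one runnable loop on toy data. Everything here is illustrative: the 2D synthetic classification task, the helpers `make_model` and `fit`, and the Gaussian input noise standing in for the "noisy" part are my own choices, not what AF2 does.

```python
import torch
import torch.nn as nn


def make_model(n_dim, n_layer):
    """MLP on 2D inputs with n_layer hidden layers of width n_dim and 2 output classes."""
    layers, in_dim = [], 2
    for _ in range(n_layer):
        layers += [nn.Linear(in_dim, n_dim), nn.ReLU()]
        in_dim = n_dim
    return nn.Sequential(*layers, nn.Linear(in_dim, 2))


def fit(model, x, y, steps=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        # Input noise stands in for the noise applied to the student
        noisy_x = x + 0.1 * torch.randn_like(x)
        loss = nn.functional.cross_entropy(model(noisy_x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


torch.manual_seed(0)
x = torch.randn(500, 2)
y = (x[:, 0] + x[:, 1] > 0).long()
x_lab, y_lab = x[:100], y[:100]   # 20% labeled
x_unl = x[100:]                   # 80% unlabeled (true labels are never used)

pseudo = None
for n in (2, 4, 8):               # the student grows every round
    model = make_model(n_dim=n, n_layer=n)
    if pseudo is None:
        train_x, train_y = x_lab, y_lab
    else:
        train_x = torch.cat([x_lab, x_unl])
        train_y = torch.cat([y_lab, pseudo])
    fit(model, train_x, train_y)
    with torch.no_grad():
        pseudo = model(x_unl).argmax(-1)   # this round's student becomes the next teacher
```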
