I. Building the ligand database
1. Used split_sdf to split the SDF file.
2. Molecule ID lookup problem in the database:
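Original lookup inside getID:
# ... in getID function
for id_key in ['IDNUMBER', 'Catalog ID', 'ID']:
    try:
        return mol.GetProp(id_key)
    except:
        pass
raise ValueError('No ID found in molecule')
Revised getID with a longer key list:
def getID(mol):
    # Final ID key search list; Catalog_ID has the highest priority
    for id_key in ['Catalog_ID', 'ZID', 'Compound_ID', 'ENAMINE_ID', 'IDNUMBER', 'Catalog ID', 'ID', 'NSC']:
        try:
            # Return the value of the first ID key that exists
            return mol.GetProp(id_key)
        except KeyError:
            # RDKit raises KeyError when the property is missing; try the next key
            pass
    # Raise an error if every key fails
    raise ValueError('No ID found in molecule')
3. Force-field optimization (UFF/MMFF) fails for certain special atoms: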
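Affected atom types: Se2+2 (4), Pt+2 (1), Ca+2 (0), S_6+6 (12), Re5 (2), Fe2+2 (0), Ce+3 (0)
Before writing to LMDB, the main loop automatically filters out every molecule for which processing returned None, so these unprocessable molecules are skipped.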
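The None-filtering step described above can be sketched as follows. This is a minimal illustration, not the actual pipeline code: `filter_processed` and `fake_process` are hypothetical names, the record layout is assumed, and the real processing function (conformer generation, force-field optimization) is replaced by a stand-in that returns None for a molecule containing Pt.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def filter_processed(
    mols: Iterable[object],
    process: Callable[[object], Optional[dict]],
) -> Iterator[Tuple[int, dict]]:
    """Yield (write_index, record) only for molecules whose processing succeeded.

    write_index stays contiguous (0, 1, 2, ...) even when molecules are skipped,
    so it can be used directly as the key of the next database entry.
    """
    write_index = 0
    for mol in mols:
        record = process(mol)
        if record is None:
            # Processing failed (e.g. unsupported atom type): skip this molecule
            continue
        yield write_index, record
        write_index += 1


def fake_process(mol: str) -> Optional[dict]:
    # Hypothetical stand-in: pretend molecules containing "Pt" cannot be optimized
    if "Pt" in mol:
        return None
    return {"smi": mol}


kept = list(filter_processed(["CCO", "[Pt+2]", "c1ccccc1"], fake_process))
print(kept)
```

Keeping the write index separate from the input index means the stored keys have no gaps, which is convenient if a loader later iterates over keys 0..N-1.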
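II. Loading the weights and encoding
1. Download the .pt weights from the cloud drive.
2. unicore is a custom library and cannot be found at import time, so install it from source:
git clone https://github.com/dptech-corp/Uni-Core.git
# Or download the archive locally, upload it, and unzip it after uploading
cd /home/dataset-assist-0/tmp/zsl/AIDD/
unzip main.zip
cd Uni-Core-main
pip install -e ./  # install the unicore library (in the pytorch2.5.1 environment)
3. For the positive_ligand data, use the molecular formula directly as the ID.
4. Call the script directly to start ranking.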
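# DrugCLIP validation script analysis
Script overview
run_vs_validation.py is a standalone script that validates the DrugCLIP model on virtual screening tasks against decoy libraries of different sizes (1M, 2M, 4M). Its main functions:
- Encode the protein pocket and molecule data, and cache the results.
- Encode active and inactive molecules and compute similarities.
- Write the molecule ranking and validation results to an Excel file.
Main modules and functions
1. main(args)
- Purpose: script entry point; loads the model and task configuration, then calls run_single_validation for each of the three library sizes.
2. run_single_validation(model, task, args, max_inact_mols)
- Purpose: runs a single virtual screening task. Core logic:
  - Encode the protein pocket and cache it.
  - Encode the active and inactive molecules and cache them.
  - Merge the encoded data, compute similarities, and output the molecule ranking and results.
3. encode_data_with_cache(model, data_loader, is_pocket, cache_path, max_items=None)
- Purpose: general-purpose encoding function with caching and data truncation support; called by run_single_validation.
4. Cache helper functions
- load_cached_emb(cache_path): loads encoded data from the cache path.
- save_cached_emb(cache_path, embeddings, names, labels=None): saves encoded data to the cache path.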
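The cache helpers listed above might look roughly like the sketch below. Only the two function signatures come from the notes; the pickle payload layout, the atomic-rename detail, and the simplified `encode_with_cache` wrapper (with its `encode_fn` and `items` parameters) are assumptions made for illustration, and the toy encoder stands in for the real model.

```python
import os
import pickle
import tempfile


def save_cached_emb(cache_path, embeddings, names, labels=None):
    """Pickle embeddings together with their names (and optional labels)."""
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    payload = {"embeddings": embeddings, "names": names, "labels": labels}
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(payload, f, protocol=pickle.HIGHEST_PROTOCOL)
    # Rename only after a complete write so a crash never leaves a half-written cache
    os.replace(tmp_path, cache_path)


def load_cached_emb(cache_path):
    """Return (embeddings, names, labels), or None when no cache file exists."""
    if not os.path.exists(cache_path):
        return None
    with open(cache_path, "rb") as f:
        payload = pickle.load(f)
    return payload["embeddings"], payload["names"], payload["labels"]


def encode_with_cache(encode_fn, items, cache_path, max_items=None):
    """Hypothetical simplified wrapper: reuse the cache if present, else encode and save."""
    cached = load_cached_emb(cache_path)
    if cached is not None:
        return cached
    if max_items is not None:
        items = items[:max_items]  # truncate to the first max_items entries
    names = [name for name, _ in items]
    embeddings = [encode_fn(features) for _, features in items]
    save_cached_emb(cache_path, embeddings, names)
    return embeddings, names, None


calls = []


def toy_encode(features):
    # Stand-in for the real encoder; records each call so cache hits are visible
    calls.append(features)
    return [v * 2 for v in features]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mols_first2.pkl")
    items = [("m1", [1.0, 2.0]), ("m2", [3.0, 4.0]), ("m3", [5.0, 6.0])]
    first = encode_with_cache(toy_encode, items, path, max_items=2)
    second = encode_with_cache(toy_encode, items, path, max_items=2)

print(first[1], len(calls))
```

One design point worth noting: a cache file written with one truncation size would be silently reused for another, so it is safer to encode the size (for example 1M, 2M, 4M) in the cache file name.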
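Command-line arguments
The script accepts custom arguments such as the data paths, the output path, and the cache directories.
Key points
- Cache optimization: encoded vectors are cached with pickle, which reduces the cost of repeated computation.
- Distributed support: DistributedSampler is used to help process large-scale data.
- Flexible extension: the parameterized design supports validation against decoy libraries of different sizes.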
python /home/dataset-assist-0/tmp/zsl/AIDD/3-baseline/0302_encoder/030201_encode.py \
  /home/dataset-assist-0/tmp/zsl/AIDD/3-baseline/0302_encoder/checkpoint_best.pt \
  --user-pocket-path /home/dataset-assist-0/tmp/zsl/AIDD/2-data/protein/pocket.lmdb \
  --user-mol-path-inact /home/dataset-assist-0/tmp/zsl/AIDD/2-data/ligand/sdftolmdb/enamine_molecules.lmdb \
  --user-mol-path-act /home/dataset-assist-0/tmp/zsl/AIDD/2-data/positive_ligand/OIU.lmdb \
  --output-excel-path ./validation_results/OIU_vs_enamine.xlsx \
  --ligand-encoder-dir /home/dataset-assist-0/tmp/zsl/AIDD/2-data/ligand/encoder \
  --pocket-encoder-dir /home/dataset-assist-0/tmp/zsl/AIDD/2-data/protein/encoder \
  --active-encoder-dir /home/dataset-assist-0/tmp/zsl/AIDD/2-data/positive_ligand \
  --batch-size 1024 \
  --task drugclip \
  --data /dummy/data/path
5. The local run fails:
W1029 13:35:32.406891 83249 site-packages/torch/distributed/run.py:793]
W1029 13:35:32.406891 83249 site-packages/torch/distributed/run.py:793] *****************************************
W1029 13:35:32.406891 83249 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1029 13:35:32.406891 83249 site-packages/torch/distributed/run.py:793] *****************************************
fused_multi_tensor is not installed corrected
fused_rounding is not installed corrected
fused_multi_tensor is not installed corrected
fused_rounding is not installed corrected
fused_layer_norm is not installed corrected
fused_rms_norm is not installed corrected
fused_softmax is not installed corrected
fused_layer_norm is not installed corrected
fused_rms_norm is not installed corrected
fused_softmax is not installed corrected
FATAL ERROR: Unable to import UniMol/DrugCLIP components. Please check PYTHONPATH.
Details: cannot import name 'ConcatDataset' from 'unicore.data' (/home/dataset-assist-0/tmp/zsl/AIDD/Uni-Core-main/unicore/data/__init__.py)
FATAL ERROR: Unable to import UniMol/DrugCLIP components. Please check PYTHONPATH.
Details: cannot import name 'ConcatDataset' from 'unicore.data' (/home/dataset-assist-0/tmp/zsl/AIDD/Uni-Core-main/unicore/data/__init__.py)
W1029 13:35:39.093090 83249 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 83258 closing signal SIGTERM
E1029 13:35:39.126109 83249 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 83259) of binary: /opt/conda/envs/pytorch2.5.1/bin/python3.9
Traceback (most recent call last):
File "/opt/conda/envs/pytorch2.5.1/bin/torchrun", line 7, in <module>
sys.exit(main())
File "/opt/conda/envs/pytorch2.5.1/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/pytorch2.5.1/lib/python3.9/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/opt/conda/envs/pytorch2.5.1/lib/python3.9/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/opt/conda/envs/pytorch2.5.1/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/pytorch2.5.1/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/dataset-assist-0/tmp/zsl/AIDD/3-baseline/0302_encoder/030201_encode.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-10-29_13:35:39
host : drug-clip2-6a8298
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 83259)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
I have changed my mind now:
- Pocket: /home/dataset-assist-0/tmp/zsl/AIDD/2-data/protein/pocket.lmdb
- Ligands: /home/dataset-assist-0/tmp/zsl/AIDD/2-data/ligand/sdftolmdb/enamine_molecules.lmdb
- Active ligands: /home/dataset-assist-0/tmp/zsl/AIDD/2-data/positive_ligand/OIU.lmdb
I want to take the first 1M / 2M / 4M ligands in order and mix in the positive ligand data to form the ligand set, use the pocket as the pockets, and run virtual screening with DrugCLIP. The output should be an Excel sheet with the score of every ligand, and the rank of each positive ligand within the list should be marked and printed.
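The request above can be illustrated with a small, dependency-free sketch of the ranking step. Everything here is an assumption made for illustration: `rank_ligands` is a hypothetical helper, cosine similarity stands in for whatever score the DrugCLIP model actually produces, and the toy sizes (1, 2, 3) stand in for the 1M / 2M / 4M decoy subsets.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def rank_ligands(pocket_emb, decoys, actives, max_decoys):
    """Score the first max_decoys decoys plus all actives against one pocket.

    decoys and actives are lists of (name, embedding).
    Returns (ranking, active_ranks): ranking is a list of (name, score, is_active)
    sorted by descending score; active_ranks maps each active name to its 1-based rank.
    """
    pocket = l2_normalize(pocket_emb)
    # Take decoys in their stored order, then append every active ligand
    pool = [(name, emb, False) for name, emb in decoys[:max_decoys]]
    pool += [(name, emb, True) for name, emb in actives]
    ranking = []
    for name, emb, is_active in pool:
        score = sum(p * q for p, q in zip(pocket, l2_normalize(emb)))
        ranking.append((name, score, is_active))
    ranking.sort(key=lambda row: row[1], reverse=True)
    active_ranks = {
        name: position + 1
        for position, (name, _, is_active) in enumerate(ranking)
        if is_active
    }
    return ranking, active_ranks


# Toy embeddings; real ones would come from the pocket and molecule encoders
pocket = [1.0, 0.0]
decoys = [("d1", [0.0, 1.0]), ("d2", [1.0, 1.0]), ("d3", [-1.0, 0.0])]
actives = [("active_1", [1.0, 2.0])]

ranks_by_size = {}
for size in (1, 2, 3):
    _, active_ranks = rank_ligands(pocket, decoys, actives, max_decoys=size)
    ranks_by_size[size] = active_ranks["active_1"]

print(ranks_by_size)
```

The rows in `ranking` already carry name, score, and an active flag, so they can be handed to whichever writer produces the Excel sheet, while `active_ranks` gives the positions to print for each library size.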
Each time my ligands are encoded, I want the vectors stored with faiss to make the process more efficient. The vectors are stored as follows:
- Ligand vectors: /home/dataset-assist-0/tmp/zsl/AIDD/2-data/ligand/encoder
- Pocket vectors: /home/dataset-assist-0/tmp/zsl/AIDD/2-data/protein/encoder
- Active ligand vectors: /home/dataset-assist-0/tmp/zsl/AIDD/2-data/positive_ligand
/home/dataset-assist-0/tmp/zsl/AIDD/DrugCLIP/unimol/retrieval.py
The model you mentioned: