I. Building the ligand database

1. The SDF file was split into chunks with split_sdf
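
The notes do not show how split_sdf works. Below is a minimal sketch, assuming it cuts a multi-record SDF into fixed-size chunks at the `$$$$` record terminator. The function name `split_sdf_text` and its parameters are hypothetical and are not the actual interface of the tool.

```python
def split_sdf_text(sdf_text, records_per_chunk):
    """Split SDF text into chunks of at most `records_per_chunk` records.

    Hypothetical helper for illustration; the real split_sdf may differ.
    """
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be >= 1")

    records, current = [], []
    for line in sdf_text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":  # SDF record terminator
            records.append("\n".join(current) + "\n")
            current = []
    # Lines after the last "$$$$" form an incomplete record and are dropped.

    return [
        "".join(records[i:i + records_per_chunk])
        for i in range(0, len(records), records_per_chunk)
    ]
```

For a multi-million-molecule library the same idea would be applied while streaming the file line by line instead of holding the whole text in memory.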

2. Molecule ID lookup problem in the database:

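# ... in getID function (earlier version: only three property names are tried)
for id_key in ['IDNUMBER', 'Catalog ID', 'ID']:
    try:
        return mol.GetProp(id_key)
    except KeyError:
        # RDKit raises KeyError when the property is missing; try the next key
        pass
raise ValueError('No ID found in molecule')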
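def getID(mol):
    # Final ID-key search list; Catalog_ID has the highest priority
    for id_key in ['Catalog_ID', 'ZID', 'Compound_ID', 'ENAMINE_ID', 'IDNUMBER', 'Catalog ID', 'ID', 'NSC']:
        try:
            # Return the value of the first ID key that is present
            return mol.GetProp(id_key)
        except KeyError:
            # This property is missing on the molecule; try the next key
            pass

    # None of the keys was found, so raise an error
    raise ValueError('No ID found in molecule')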
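
Choosing the key list requires knowing which data-field tags a vendor's SDF actually uses. As a rough sketch, the tags can be counted straight from the SDF text without RDKit. The helper name `count_sdf_tags` is hypothetical, and the regular expression assumes the usual `> <TAG>` data-header layout.

```python
import re
from collections import Counter

# SDF data headers look like "> <Catalog_ID>" or ">  <ID> (1)".
_DATA_HEADER = re.compile(r"^>.*?<([^<>]+)>")


def count_sdf_tags(sdf_text):
    """Count how often each data-field tag occurs in SDF text."""
    counts = Counter()
    for line in sdf_text.splitlines():
        match = _DATA_HEADER.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Tags that appear once per record are the natural candidates for the front of the search list in getID.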

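3. Force-field optimization (UFF/MMFF) fails for molecules containing unusual atom types

Se2+2 (4), Pt+2 (1), Ca+2 (0), S_6+6 (12), Re5 (2), Fe2+2 (0), Ce+3 (0)

Before writing to LMDB, the main loop automatically filters out every molecule for which processing returned None, so molecules that cannot be handled are simply skipped.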
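
The filtering step can be sketched as follows. `featurize` stands in for whatever per-molecule processing the real pipeline runs (conformer generation, optimization, and so on); both it and `keep_processable` are hypothetical names, and the LMDB write itself is left out.

```python
def keep_processable(molecules, featurize):
    """Return (records, skipped_count), dropping molecules whose featurize() is None."""
    records, skipped = [], 0
    for mol in molecules:
        record = featurize(mol)
        if record is None:  # processing failed, e.g. unsupported atom type
            skipped += 1
            continue
        records.append(record)
    return records, skipped


def toy_featurize(smiles):
    """Toy stand-in: pretend platinum-containing molecules cannot be processed."""
    return None if "Pt" in smiles else {"smiles": smiles}
```

Returning the skipped count alongside the records makes it easy to log how many molecules were lost at this stage.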

II. Loading the weights and encoding

1. Download the .pt weights from the cloud drive

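2. unicore is a custom package and is not found by default

Installing unicore:

git clone https://github.com/dptech-corp/Uni-Core.git
# Or download the archive locally, upload it, and unzip it after uploading
cd /home/dataset-assist-0/tmp/zsl/AIDD/
unzip main.zip
cd Uni-Core-main
pip install -e ./  # install the unicore library (in the pytorch2.5.1 environment)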
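
After an editable install it can help to confirm which copy of the package Python would actually import. A small check, using only the standard library (the helper name `locate_package` is made up for this note):

```python
import importlib.util


def locate_package(name):
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Example: locate_package("unicore") should point into Uni-Core-main/unicore/
# if the editable install above succeeded, and return None otherwise.
```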

3. For the positive_ligand data, the molecular formula is used directly as the ID

4. Call the script directly to start ranking

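#DrugCLIP validation script analysis

Script overview

run_vs_validation.py is a standalone script that validates the DrugCLIP model on virtual screening against decoy libraries of different sizes (1M, 2M, 4M). Its main functions are:

  1. Encode the protein pocket and molecule data, and cache the results.
  2. Encode the active and inactive molecules and compute their similarity to the pocket.
  3. Write the molecule ranking and validation results to an Excel file.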
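
The three steps above can be sketched in plain Python. This assumes the score is the cosine similarity between the pocket embedding and each molecule embedding, which the notes do not state explicitly; `cosine` and `active_ranks` are hypothetical names, and the tiny library sizes stand in for 1M/2M/4M.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def active_ranks(pocket_vec, decoy_vecs, active_vecs, library_sizes):
    """For each library size N, rank every active among the first N decoys.

    Rank 1 is the best score. Decoys are scored once and the first N scores
    are reused for every library size.
    """
    decoy_scores = [cosine(pocket_vec, v) for v in decoy_vecs]
    active_scores = {name: cosine(pocket_vec, v) for name, v in active_vecs.items()}

    result = {}
    for n in library_sizes:
        subset = decoy_scores[:n]  # first N decoys, in stored order
        result[n] = {}
        for name, score in active_scores.items():
            better_decoys = sum(1 for s in subset if s > score)
            better_actives = sum(1 for s in active_scores.values() if s > score)
            result[n][name] = 1 + better_decoys + better_actives
    return result
```

Because the decoys are taken in order, the 1M pool is a prefix of the 2M pool, so an active's rank can only stay the same or get worse as the library grows.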

Main modules and functions

1. main(args)
2. run_single_validation(model, task, args, max_inact_mols)
3. encode_data_with_cache(model, data_loader, is_pocket, cache_path, max_items=None)
4. Cache helper functions
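
The body of encode_data_with_cache is not shown in these notes. The general compute-once, reload-later pattern it suggests could look like the sketch below. The simplified signature, the pickle format, and the name `encode_with_cache` are assumptions for illustration, not the script's actual code.

```python
import os
import pickle


def encode_with_cache(encode_fn, items, cache_path, max_items=None):
    """Encode `items` once and reuse the pickled result on later calls.

    Note: this sketch does not record `max_items` in the cache, so a cache
    written with one limit would be reused for another. A real implementation
    should encode the limit in the cache file name or metadata.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)

    if max_items is not None:
        items = items[:max_items]
    embeddings = [encode_fn(item) for item in items]

    with open(cache_path, "wb") as fh:
        pickle.dump(embeddings, fh)
    return embeddings
```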

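Command-line arguments

The script accepts custom arguments such as the data paths, the output path, and the cache directories.

Summary of key points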

python /home/dataset-assist-0/tmp/zsl/AIDD/3-baseline/0302_encoder/030201_encode.py \
    /home/dataset-assist-0/tmp/zsl/AIDD/3-baseline/0302_encoder/checkpoint_best.pt \
    --user-pocket-path /home/dataset-assist-0/tmp/zsl/AIDD/2-data/protein/pocket.lmdb \
    --user-mol-path-inact /home/dataset-assist-0/tmp/zsl/AIDD/2-data/ligand/sdftolmdb/enamine_molecules.lmdb \
    --user-mol-path-act /home/dataset-assist-0/tmp/zsl/AIDD/2-data/positive_ligand/OIU.lmdb \
    --output-excel-path ./validation_results/OIU_vs_enamine.xlsx \
    --ligand-encoder-dir /home/dataset-assist-0/tmp/zsl/AIDD/2-data/ligand/encoder \
    --pocket-encoder-dir /home/dataset-assist-0/tmp/zsl/AIDD/2-data/protein/encoder \
    --active-encoder-dir /home/dataset-assist-0/tmp/zsl/AIDD/2-data/positive_ligand \
    --batch-size 1024 \
    --task drugclip \
    --data /dummy/data/path
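
For reference, the flags used in the command above could be declared with argparse roughly as follows. The flag names come from the command itself; the types, defaults, and which flags are required are guesses, and the real script may build its parser through unicore's option handling instead.

```python
import argparse


def build_parser():
    """Standalone sketch of a parser for the flags used in the command above."""
    parser = argparse.ArgumentParser(
        description="DrugCLIP virtual-screening validation (sketch)"
    )
    parser.add_argument("checkpoint", help="path to the .pt weights")
    parser.add_argument("--user-pocket-path", required=True)
    parser.add_argument("--user-mol-path-inact", required=True)
    parser.add_argument("--user-mol-path-act", required=True)
    parser.add_argument("--output-excel-path")
    parser.add_argument("--ligand-encoder-dir")
    parser.add_argument("--pocket-encoder-dir")
    parser.add_argument("--active-encoder-dir")
    parser.add_argument("--batch-size", type=int, default=1024)  # assumed default
    parser.add_argument("--task", default="drugclip")            # assumed default
    parser.add_argument("--data")
    return parser
```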

5. The local run fails

W1029 13:35:32.406891 83249 site-packages/torch/distributed/run.py:793] 
W1029 13:35:32.406891 83249 site-packages/torch/distributed/run.py:793] *****************************************
W1029 13:35:32.406891 83249 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1029 13:35:32.406891 83249 site-packages/torch/distributed/run.py:793] *****************************************
fused_multi_tensor is not installed corrected
fused_rounding is not installed corrected
fused_multi_tensor is not installed corrected
fused_rounding is not installed corrected
fused_layer_norm is not installed corrected
fused_rms_norm is not installed corrected
fused_softmax is not installed corrected
fused_layer_norm is not installed corrected
fused_rms_norm is not installed corrected
fused_softmax is not installed corrected
FATAL ERROR: Unable to import UniMol/DrugCLIP components. Please check PYTHONPATH.
Details: cannot import name 'ConcatDataset' from 'unicore.data' (/home/dataset-assist-0/tmp/zsl/AIDD/Uni-Core-main/unicore/data/__init__.py)
FATAL ERROR: Unable to import UniMol/DrugCLIP components. Please check PYTHONPATH.
Details: cannot import name 'ConcatDataset' from 'unicore.data' (/home/dataset-assist-0/tmp/zsl/AIDD/Uni-Core-main/unicore/data/__init__.py)
W1029 13:35:39.093090 83249 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 83258 closing signal SIGTERM
E1029 13:35:39.126109 83249 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 83259) of binary: /opt/conda/envs/pytorch2.5.1/bin/python3.9
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch2.5.1/bin/torchrun", line 7, in <module>
    sys.exit(main())
  File "/opt/conda/envs/pytorch2.5.1/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pytorch2.5.1/lib/python3.9/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/opt/conda/envs/pytorch2.5.1/lib/python3.9/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch2.5.1/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch2.5.1/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/dataset-assist-0/tmp/zsl/AIDD/3-baseline/0302_encoder/030201_encode.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-10-29_13:35:39
  host      : drug-clip2-6a8298
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 83259)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
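
The ImportError in the log says that the installed unicore.data does not expose ConcatDataset. One possible explanation, not confirmed by these notes, is a version mismatch between the installed Uni-Core and the version the DrugCLIP code expects. A quick way to check which names a module is missing (the helper name `missing_names` is made up for this note):

```python
import importlib


def missing_names(module_name, names):
    """Return the entries of `names` that `module_name` does not expose."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# Example for the failure above:
#   missing_names("unicore.data", ["ConcatDataset"])
# An empty list would mean the name is available in the active environment.
```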

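I have changed my mind now:

  1. Pocket: /home/dataset-assist-0/tmp/zsl/AIDD/2-data/protein/pocket.lmdb;
  2. Ligands: /home/dataset-assist-0/tmp/zsl/AIDD/2-data/ligand/sdftolmdb/enamine_molecules.lmdb
  3. Active ligand: /home/dataset-assist-0/tmp/zsl/AIDD/2-data/positive_ligand/OIU.lmdb
    I want to take the first 1M/2M/4M ligands in order and mix them with the positive ligand data to form the ligand set; use the pocket as the pockets, run virtual screening with drugclip, and output an Excel sheet with the score of every ligand, marking and printing the rank of the positive ligand within it.
    Every time my ligands are encoded, I want the vectors stored with faiss to make this more efficient. The vectors are stored as follows:
    Ligand vectors are stored in: /home/dataset-assist-0/tmp/zsl/AIDD/2-data/ligand/encoder
    Pocket vectors are stored in: /home/dataset-assist-0/tmp/zsl/AIDD/2-data/protein/encoder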
    Active-ligand vectors are stored in:
    /home/dataset-assist-0/tmp/zsl/AIDD/2-data/positive_ligand;
    /home/dataset-assist-0/tmp/zsl/AIDD/DrugCLIP/unimol/retrieval.py
    The model you mentioned: