Dataset#

Summary of dataset sources

To switch between different datasets, simply change the dataset argument in the launch command. For example:

workshop train encoder=gear_net dataset=<DATASET_NAME> task=inverse_folding trainer=cpu
# or
python proteinworkshop/train.py encoder=gear_net dataset=<DATASET_NAME> task=inverse_folding trainer=cpu # or trainer=gpu

Where <DATASET_NAME> is given by the bracketed name in the listing below. For example, the dataset name for CATH is cath.
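
To launch the same configuration across several datasets, the command lines can be assembled programmatically. The following is a minimal sketch that only builds the command strings and does not run them. The helper name `build_train_command` is hypothetical; the arguments mirror the example above, and it is assumed (not verified here) that each listed dataset is compatible with the chosen task.

```python
import shlex


def build_train_command(dataset, encoder="gear_net", task="inverse_folding", trainer="cpu"):
    """Return the `workshop train` command line for one dataset as a string."""
    args = [
        "workshop",
        "train",
        f"encoder={encoder}",
        f"dataset={dataset}",
        f"task={task}",
        f"trainer={trainer}",
    ]
    # shlex.join quotes any argument that would need quoting in a POSIX shell.
    return shlex.join(args)


if __name__ == "__main__":
    # Dataset names taken from the bracketed names in the listing below.
    for name in ["cath", "astral", "pdb"]:
        print(build_train_command(name))
```

The printed lines can be pasted into a shell or a job script; swapping `trainer="cpu"` for `trainer="gpu"` follows the same pattern as the command above.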

Note

If you have pip-installed proteinworkshop, you can download pre-training or processed downstream datasets from Zenodo with:

workshop download <DATASET_NAME>

Unlabelled Datasets#

Structure-based Pre-training Corpuses#

Pre-training corpuses (with the exception of pdb, cath, and astral) are provided in FoldComp database format. This format is highly compressed, so the disk space requirements stay small despite the large number of structures. pdb is provided as a collection of MMTF files, which are significantly smaller than conventional .pdb or .cif files.

| Name | Description | Source | Size | Disk Size | License |
| --- | --- | --- | --- | --- | --- |
| `astral` | SCOPe domain structures | SCOPe/ASTRAL | | 1 - 2.2 Gb | Publicly available |
| `afdb_rep_v4` | Representative structures identified from the AlphaFold database by FoldSeek structural clustering | Barrio-Hernandez et al. | 2.27M Chains | 9.6 Gb | GPL-3.0 |
| `afdb_rep_dark_v4` | Dark proteome structures identified by structural clustering of the AlphaFold database | Barrio-Hernandez et al. | ~800k | 2.2 Gb | GPL-3.0 |
| `afdb_swissprot_v4` | AlphaFold2 predictions for SwissProt/UniProtKB | Kim et al. | 542k Chains | 2.9 Gb | GPL-3.0 |
| `afdb_uniprot_v4` | AlphaFold2 predictions for UniProt | Kim et al. | 214M Chains | 1 Tb | GPL-3.0 / CC-BY 4.0 |
| `cath` | CATH 4.2 40% split by CATH topologies | Ingraham et al. | ~18k chains | 4.3 Gb | CC-BY 4.0 |
| `esmatlas` | ESMAtlas predictions (full) | Kim et al. | | 1 Tb | GPL-3.0 / CC-BY 4.0 |
| `esmatlas_v2023_02` | ESMAtlas predictions (v2023_02 release) | Kim et al. | | 137 Gb | GPL-3.0 / CC-BY 4.0 |
| `highquality_clust30` | ESMAtlas High Quality predictions | Kim et al. | 37M Chains | 114 Gb | GPL-3.0 / CC-BY 4.0 |
| `igfold_paired_oas` | IGFold Predictions for Paired OAS | Ruffolo et al. | 104,994 paired Ab chains | | CC-BY 4.0 |
| `igfold_jaffe` | IGFold predictions for Jaffe2022 data | Ruffolo et al. | 1,340,180 paired Ab chains | | CC-BY 4.0 |
| `pdb` | Experimental structures deposited in the RCSB Protein Data Bank | wwPDB consortium | ~800k Chains | 23 Gb | CC0 1.0 |

Additionally, we provide several species-specific compilations (mostly reference species):

| Name | Description | Source | Size |
| --- | --- | --- | --- |
| `a_thaliana` | _Arabidopsis thaliana_ (thale cress) proteome | AlphaFold2 | |
| `c_albicans` | _Candida albicans_ (a fungus) proteome | AlphaFold2 | |
| `c_elegans` | _Caenorhabditis elegans_ (roundworm) proteome | AlphaFold2 | |
| `d_discoideum` | _Dictyostelium discoideum_ (slime mold) proteome | AlphaFold2 | |
| `d_melanogaster` | [_Drosophila melanogaster_](https://www.uniprot.org/taxonomy/7227) (fruit fly) proteome | AlphaFold2 | |
| `d_rerio` | [_Danio rerio_](https://www.uniprot.org/taxonomy/7955) (zebrafish) proteome | AlphaFold2 | |
| `e_coli` | _Escherichia coli_ (a bacterium) proteome | AlphaFold2 | |
| `g_max` | _Glycine max_ (soybean) proteome | AlphaFold2 | |
| `h_sapiens` | _Homo sapiens_ (human) proteome | AlphaFold2 | |
| `m_jannaschii` | _Methanocaldococcus jannaschii_ (an archaeon) proteome | AlphaFold2 | |
| `m_musculus` | _Mus musculus_ (mouse) proteome | AlphaFold2 | |
| `o_sativa` | _Oryza sativa_ (rice) proteome | AlphaFold2 | |
| `r_norvegicus` | _Rattus norvegicus_ (brown rat) proteome | AlphaFold2 | |
| `s_cerevisiae` | _Saccharomyces cerevisiae_ (brewer's yeast) proteome | AlphaFold2 | |
| `s_pombe` | _Schizosaccharomyces pombe_ (a fungus) proteome | AlphaFold2 | |
| `z_mays` | _Zea mays_ (corn) proteome | AlphaFold2 | |

ASTRAL (astral)#

ASTRAL provides compendia of protein domain structures, regions of proteins that can maintain their structure and function independently of the rest of the protein. Domains typically exhibit highly-specific functions and can be considered structural building blocks of proteins.

config/dataset/astral.yaml#
datamodule:
  _target_: "proteinworkshop.datasets.astral.AstralDataModule"
  path: ${env.paths.data}/Astral/ # Directory where the dataset is stored
  release: "2.08" # Version of ASTRAL to use
  identity: "95" # Percent identity clustering threshold to use
  batch_size: 32 # Batch size for dataloader
  pin_memory: True # Pin memory for dataloader
  num_workers: 4 # Number of workers for dataloader
  dataset_fraction: 1.0 # Fraction of dataset to use
  transforms: ${transforms} # Transforms to apply to dataset examples
  overwrite: False # Whether to overwrite cached dataset example files
  train_val_test: [0.8, 0.1, 0.1] # Cross-validation ratios to use for train, val, and test splits
num_classes: null # Number of classes
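
The `train_val_test` ratios above determine how many examples land in each split. The sketch below shows one plausible way to turn such ratios into integer split sizes; the function name `split_sizes` and the rounding rule (floor for train and val, remainder to test) are assumptions for illustration, not the datamodule's actual implementation.

```python
def split_sizes(n_examples, ratios=(0.8, 0.1, 0.1)):
    """Convert (train, val, test) ratios into integer sizes that sum to n_examples."""
    if len(ratios) != 3:
        raise ValueError("expected exactly three ratios: train, val, test")
    if abs(sum(ratios) - 1.0) > 1e-8:
        raise ValueError("ratios must sum to 1")
    n_train = int(n_examples * ratios[0])
    n_val = int(n_examples * ratios[1])
    # Assign the remainder to the test split so that no example is dropped.
    n_test = n_examples - n_train - n_val
    return n_train, n_val, n_test


if __name__ == "__main__":
    print(split_sizes(1000))  # (800, 100, 100)
```

Giving the remainder to the last split keeps the three sizes consistent with the dataset size even when the ratios do not divide it evenly.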

CATH (cath)#

config/dataset/cath.yaml#
datamodule:
  _target_: "proteinworkshop.datasets.cath.CATHDataModule"
  path: ${env.paths.data}/cath/ # Directory where the dataset is stored
  pdb_dir: ${env.paths.data}/pdb/ # Directory where raw PDB/mmtf files are stored
  format: "mmtf" # Format of the raw PDB/MMTF files
  num_workers: 4 # Number of workers for dataloader
  pin_memory: True # Pin memory for dataloader
  batch_size: 32 # Batch size for dataloader
  dataset_fraction: 1.0 # Fraction of the dataset to use
  transforms: ${transforms} # Transforms to apply to dataset examples
  overwrite: False # Whether to overwrite the dataset if it already exists
  in_memory: True # Whether to load the entire dataset into memory
num_classes: 23 # Number of classes

PDB (pdb)#

See also

proteinworkshop.datasets.pdb_dataset.PDBData

config/dataset/pdb.yaml#
datamodule:
  _target_: "proteinworkshop.datasets.pdb_dataset.PDBDataModule"
  path: ${env.paths.data}/pdb/ # Directory where the dataset is stored
  batch_size: 32 # Batch size for dataloader
  num_workers: 4 # Number of workers for dataloader
  pin_memory: True # Pin memory for dataloader
  transforms: ${transforms} # Transforms to apply to dataset examples
  overwrite: False # Whether to overwrite existing dataset files

  pdb_dataset:
    _target_: "proteinworkshop.datasets.pdb_dataset.PDBData"
    fraction: 1.0 # Fraction of dataset to use
    molecule_type: "protein" # Type of molecule for which to select
    experiment_types: ["diffraction", "NMR", "EM", "other"] # All experiment types
    max_length: 1000 # Exclude polypeptides greater than length 1000
    min_length: 10 # Exclude peptides of length 10
    oligomeric_min: 1 # Include proteins with at least 1 chain (monomers and larger)
    oligomeric_max: 5 # Include up to 5-meric proteins
    best_resolution: 0.0 # Include only proteins with resolution >= 0.0
    worst_resolution: 8.0 # Include only proteins with resolution <= 8.0
    has_ligands: ["ZN"] # Include only proteins containing the ligand `ZN`
    remove_ligands: [] # Exclude specific ligands from any available protein-ligand complexes
    remove_non_standard_residues: True # Include only proteins containing standard amino acid residues
    remove_pdb_unavailable: True # Include only proteins that are available to download
    split_sizes: [0.8, 0.1, 0.1] # Cross-validation ratios to use for train, val, and test splits
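
The `pdb_dataset` options above act as a chain of filters over PDB entries. The sketch below illustrates how such criteria combine; it is not the `PDBData` implementation. The `Entry` record and the `keep` function are hypothetical, and two points are assumptions: that all bounds are inclusive, and that `has_ligands` requires every listed ligand to be present.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical metadata record for one PDB entry (not the library's data model)."""
    pdb_id: str
    length: int        # number of residues
    n_chains: int      # oligomeric state
    resolution: float  # in angstroms
    ligands: tuple = ()


def keep(entry, min_length=10, max_length=1000, oligomeric_min=1, oligomeric_max=5,
         best_resolution=0.0, worst_resolution=8.0, has_ligands=("ZN",)):
    """Return True if the entry passes every filter (inclusive bounds assumed)."""
    return (
        min_length <= entry.length <= max_length
        and oligomeric_min <= entry.n_chains <= oligomeric_max
        and best_resolution <= entry.resolution <= worst_resolution
        # Assumption: every requested ligand must be present in the entry.
        and all(lig in entry.ligands for lig in has_ligands)
    )


if __name__ == "__main__":
    # Made-up example entries, for illustration only.
    entries = [
        Entry("xxx1", length=120, n_chains=2, resolution=2.1, ligands=("ZN",)),
        Entry("xxx2", length=1500, n_chains=1, resolution=1.8, ligands=("ZN",)),
        Entry("xxx3", length=300, n_chains=1, resolution=2.5, ligands=()),
    ]
    print([e.pdb_id for e in entries if keep(e)])  # ['xxx1']
```

Because the filters are combined with `and`, tightening any single option can only shrink the selected set.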

AFdb Rep. v4 (afdb_rep_v4)#

This is a dataset of approximately 2.3 million representative protein structures from the AlphaFold database, identified by structural clustering with FoldSeek.

config/dataset/afdb_rep_v4.yaml#
datamodule:
  _target_: graphein.ml.datasets.foldcomp_dataset.FoldCompLightningDataModule
  data_dir: ${env.paths.data}/afdb_rep_v4/
  database: "afdb_rep_v4"
  batch_size: 32
  num_workers: 4

  train_split: 0.98
  val_split: 0.01
  test_split: 0.01

  pin_memory: True
  use_graphein: True
  transform: ${transforms}

dataset_name: "afdb_rep_v4"
num_classes: None # number of classes

AFdb Dark Proteome (afdb_rep_dark_v4)#

config/dataset/afdb_rep_dark_v4.yaml#
datamodule:
  _target_: graphein.ml.datasets.foldcomp_dataset.FoldCompLightningDataModule
  data_dir: ${env.paths.data}/afdb_rep_dark_v4/
  database: "afdb_rep_dark_v4"
  batch_size: 32
  num_workers: 4

  train_split: 0.8
  val_split: 0.1
  test_split: 0.1

  pin_memory: True
  use_graphein: True
  transform: ${transforms}

dataset_name: "afdb_rep_dark_v4"
num_classes: None # number of classes

ESM Atlas (esmatlas_v2023_02)#

config/dataset/esmatlas_v2023_02.yaml#
datamodule:
  _target_: graphein.ml.datasets.foldcomp_dataset.FoldCompLightningDataModule
  data_dir: ${env.paths.data}/esmatlas_v2023_02/
  database: "esmatlas_v2023_02"
  batch_size: 32
  num_workers: 4

  train_split: 0.8
  val_split: 0.1
  test_split: 0.1

  pin_memory: True
  use_graphein: True
  transform: ${transforms}

dataset_name: "esmatlas_v2023_02"
num_classes: None # number of classes

ESM Atlas (High Quality) (highquality_clust30)#

config/dataset/highquality_clust30.yaml#
datamodule:
  _target_: graphein.ml.datasets.foldcomp_dataset.FoldCompLightningDataModule
  data_dir: ${env.paths.data}/highquality_clust30/
  database: "highquality_clust30"
  batch_size: 32
  num_workers: 4

  train_split: 0.8
  val_split: 0.1
  test_split: 0.1

  pin_memory: True
  use_graphein: True
  transform: ${transforms}

dataset_name: "highquality_clust30"
num_classes: None # number of classes

UniProt (Alphafold) (afdb_uniprot_v4)#

config/dataset/afdb_uniprot_v4.yaml#
datamodule:
  _target_: graphein.ml.datasets.foldcomp_dataset.FoldCompLightningDataModule
  data_dir: ${env.paths.data}/afdb_uniprot_v4/
  database: "afdb_uniprot_v4"
  batch_size: 32
  num_workers: 4

  train_split: 0.8
  val_split: 0.1
  test_split: 0.1

  pin_memory: True
  use_graphein: True
  transform: ${transforms}

dataset_name: "afdb_uniprot_v4"
num_classes: None # number of classes

SwissProt (Alphafold) (afdb_swissprot_v4)#

config/dataset/afdb_swissprot_v4.yaml#
datamodule:
  _target_: graphein.ml.datasets.foldcomp_dataset.FoldCompLightningDataModule
  data_dir: ${env.paths.data}/afdb_swissprot_v4/
  database: "afdb_swissprot_v4"
  batch_size: 32
  num_workers: 32

  train_split: 0.8
  val_split: 0.1
  test_split: 0.1

  pin_memory: True
  use_graphein: True
  transform: ${transforms}

dataset_name: "afdb_swissprot_v4"
num_classes: None # number of classes

Species-Specific Datasets#

Stay tuned!

Graph-level Datasets#

Antibody Developability (antibody_developability)#

Therapeutic antibodies must be optimised for favourable physicochemical properties, in addition to target binding affinity and specificity, to be viable development candidates. Consequently, this task frames prediction of antibody developability as a binary graph classification task indicating whether a given antibody is developable.

Dataset: We adopt the antibody developability dataset originally curated from SAbDab by TDC.

Impact: From a benchmarking perspective, this task is important as it enables targeted performance assessment of models on a specific (immunoglobulin) fold, providing insight into whether general-purpose structure-based encoders are applicable to fold-specific tasks.

config/dataset/antibody_developability.yaml#
datamodule:
  _target_: proteinworkshop.datasets.antibody_developability.AntibodyDevelopabilityDataModule
  path: ${env.paths.data}/AntibodyDevelopability # Directory where the dataset is stored
  pdb_dir: ${env.paths.data}/pdb/ # Path to all downloaded PDB files
  batch_size: 32 # Batch size for dataloader
  pin_memory: True # Pin memory for dataloader
  num_workers: 4 # Number of workers for dataloader
  in_memory: False # Load the dataset in memory
  format: "mmtf" # Format of the structure files
  obsolete_strategy: "drop" # What to do with obsolete PDB entries
  transforms: ${transforms} # Transforms to apply to dataset examples
  overwrite: False
num_classes: 2 # Number of classes

Atom3D Mutation Stability Prediction (atom3d_msp)#

This task is defined in the Atom3D benchmark.

As per their documentation:

Impact: Identifying mutations that stabilize a protein’s interactions is a key task in designing new proteins. Experimental techniques for probing these are labor intensive, motivating the development of efficient computational methods.

Dataset description: We derive a novel dataset by collecting single-point mutations from the SKEMPI database (Jankauskaitė et al., 2019) and model each mutation into the structure to produce mutated structures.

Task: We formulate this as a binary classification task where we predict whether the stability of the complex increases as a result of the mutation.

Splitting criteria: We split protein complexes by sequence identity at 30%.

Downloads: The full dataset, split data, and split indices are available for download via Zenodo (doi:10.5281/zenodo.4962515)

config/dataset/atom3d_msp.yaml#
datamodule:
  _target_: proteinworkshop.datasets.atom3d_datamodule.ATOM3DDataModule
  task: MSP
  data_dir: ${env.paths.data}/ATOM3D
  max_units: 0
  unit: edge
  batch_size: 1
  num_workers: 4
  pin_memory: false
num_classes: 2

Atom3D Protein Structure Ranking (atom3d_psr)#

This task is defined in the Atom3D benchmark.

As per their documentation:

Impact: Proteins are one of the primary workhorses of the cell, and knowing their structure is often critical to understanding (and engineering) their function.

Dataset description: The Critical Assessment of Structure Prediction (CASP) (Kryshtafovych et al., 2019) is a blind international competition for predicting protein structure.

Task: We formulate this as a regression task, where we predict the global distance test (GDT_TS) from the true structure for each of the predicted structures submitted in the last 18 years of CASP.

Splitting criteria: We split structures temporally by competition year.

Downloads: The full dataset, split data, and split indices are available for download via Zenodo (doi:10.5281/zenodo.4915648)

config/dataset/atom3d_psr.yaml#
datamodule:
  _target_: proteinworkshop.datasets.atom3d_datamodule.ATOM3DDataModule
  task: PSR
  data_dir: ${env.paths.data}/ATOM3D
  max_units: 0
  unit: edge
  batch_size: 1
  num_workers: 4
  pin_memory: false
num_classes: 1
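
The regression target for this task, GDT_TS, is commonly defined as the mean, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of residues whose position in the predicted structure lies within that cutoff of the true structure. The sketch below computes this from a list of per-residue distances. It is a simplification: it assumes the distances come from a single fixed superposition, whereas the full GDT procedure searches over superpositions for each cutoff. The function name `gdt_ts` and the example distances are illustrative only.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Mean percentage of residues within each distance cutoff (in angstroms).

    `distances` holds one predicted-to-true distance per residue, measured
    after a single superposition (a simplifying assumption).
    """
    if not distances:
        raise ValueError("distances must be non-empty")
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    # Made-up distances: 1 of 5 residues within 1 A, 2 within 2 A, and so on.
    print(round(gdt_ts([0.5, 1.5, 3.0, 6.0, 10.0]), 2))
```

A perfect prediction scores 100 and a prediction with every residue more than 8 Å away scores 0, which is why the task is set up with a single regression output (`num_classes: 1`).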

Deep Sea Protein Classification (deep_sea_proteins)#

config/dataset/deepsea.yaml#
datamodule:
  _target_: proteinworkshop.datasets.deep_sea_proteins.DeepSeaProteinsDataModule
  path: ${env.paths.data}/deep-sea-proteins/ # Directory where the dataset is stored
  pdb_dir: ${env.paths.data}/pdb/ # Directory where raw PDB/mmtf files are stored
  validation_fold: "4" # Fold to use for validation (one of '1', '2', '3', '4', 'PM_group')
  batch_size: 32 # Batch size for dataloader
  pin_memory: True # Pin memory for dataloader
  num_workers: 8 # Number of workers for dataloader
  obsolete_strategy: "drop"
  format: "mmtf" # Format of the raw PDB/MMTF files
  transforms: ${transforms}
  overwrite: False
num_classes: 2

Enzyme Commission Number Prediction (ec_reaction)#

config/dataset/ec_reaction.yaml#
datamodule:
  _target_: proteinworkshop.datasets.ec_reaction.EnzymeCommissionReactionDataset
  path: ${env.paths.data}/ECReaction/ # Directory where the dataset is stored
  pdb_dir: ${env.paths.data}/pdb/ # Directory where raw PDB/mmtf files are stored
  format: "mmtf" # Format of the raw PDB/MMTF files
  batch_size: 32 # Batch size for dataloader
  pin_memory: True # Pin memory for dataloader
  num_workers: 8 # Number of workers for dataloader
  dataset_fraction: 1.0 # Fraction of the dataset to use
  shuffle_labels: False # Whether to shuffle labels for permutation testing
  transforms: ${transforms}
  overwrite: False
  in_memory: True
num_classes: 384
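
The `shuffle_labels` option above is described as a switch for permutation testing. The idea, sketched below, is to permute the labels across examples so that the label distribution is preserved while any association between structure and label is broken; a model trained on such data gives a chance-level reference point. The function name and seeding are assumptions for illustration, not the datamodule's implementation.

```python
import random


def shuffle_labels(labels, seed=0):
    """Return a permuted copy of `labels`; the input list is left unchanged."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    labels = [0, 0, 1, 2, 2, 2]  # made-up class labels
    permuted = shuffle_labels(labels, seed=42)
    # The multiset of labels is unchanged; only their assignment to examples differs.
    print(sorted(permuted) == sorted(labels))  # True
```

If a model scores well above the shuffled-label baseline on the real labels, the gap can be attributed to genuine signal in the structures rather than to class imbalance.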

Fold Classification (fold-family, fold-superfamily, fold-fold)#

This is a multiclass graph classification task where each protein, G, is mapped to a label y ∈ {1, … , 1195} denoting the fold class.

Dataset: We adopt the fold classification dataset originally curated from SCOP 1.75 by Hermosilla et al. In particular, this dataset contains three distinct test splits across which we average a method’s results.

Impact: The utility of this task is that it serves as a litmus test for the ability of a model to distinguish different structural folds. It stands to reason that models that perform poorly on distinguishing fold classes likely learn limited or low-quality structural representations.

Splitting Criteria:

config/dataset/fold_family.yaml#
datamodule:
  _target_: "proteinworkshop.datasets.fold_classification.FoldClassificationDataModule"
  path: ${env.paths.data}/FoldClassification/ # Directory where the dataset is stored
  split: "family" # Level of fold classification to perform (`family`, `superfamily`, or `fold`)
  batch_size: 32 # Batch size for dataloader
  pin_memory: True # Pin memory for dataloader
  num_workers: 4 # Number of workers for dataloader
  dataset_fraction: 1.0 # Fraction of dataset to use
  shuffle_labels: False # Whether to shuffle labels for permutation testing
  transforms: ${transforms} # Transforms to apply to dataset examples
  overwrite: False # Whether to overwrite existing dataset files
  in_memory: True # Whether to load the entire dataset into memory
num_classes: 1195 # Number of classes
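
Because the fold classification dataset has three test splits whose results are averaged, a reported score is the mean of three per-split scores. The sketch below shows that aggregation. The split names follow the `split` options in the config above; the accuracy values are placeholders, not real results, and the helper name is hypothetical.

```python
from statistics import mean


def average_over_splits(scores):
    """Average a metric over the fold classification test splits."""
    expected = {"family", "superfamily", "fold"}
    if set(scores) != expected:
        raise ValueError(f"expected scores for exactly these splits: {sorted(expected)}")
    return mean(list(scores.values()))


if __name__ == "__main__":
    # Placeholder accuracies for illustration only.
    placeholder = {"family": 0.9, "superfamily": 0.6, "fold": 0.3}
    print(round(average_over_splits(placeholder), 3))
```

Requiring all three splits guards against accidentally reporting an average over a subset.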

Gene Ontology (go-bp, go-cc, go-mf)#

config/dataset/go-bp.yaml / config/dataset/go-cc.yaml / config/dataset/go-mf.yaml#
datamodule:
  _target_: proteinworkshop.datasets.go.GeneOntologyDataset
  path: ${env.paths.data}/GeneOntology/ # Directory where the dataset is stored
  pdb_dir: ${env.paths.data}/pdb/ # Directory where raw PDB/mmtf files are stored
  format: "mmtf" # Format of the raw PDB/MMTF files
  batch_size: 32 # Batch size for dataloader
  dataset_fraction: 1.0 # Fraction of the dataset to use
  shuffle_labels: False # Whether to shuffle labels for permutation testing
  pin_memory: True # Pin memory for dataloader
  num_workers: 8 # Number of workers for dataloader
  split: "BP" # Split of the dataset to use (`BP`, `MF`, `CC`)
  transforms: ${transforms} # Transforms to apply to dataset examples
  overwrite: False # Whether to overwrite existing dataset files
  in_memory: True
num_classes: 1943 # Number of classes

Node-level Datasets#

Atom3D Residue Identity Prediction (atom3d_res)#

This task is defined in the Atom3D benchmark.

As per their documentation:

Impact: Understanding the structural role of individual amino acids is important for engineering new proteins. We can understand this role by predicting the substitutabilities of different amino acids at a given protein site based on the surrounding structural environment.

Dataset description: We generate a novel dataset consisting of atomic environments extracted from nonredundant structures in the PDB.

Task: We formulate this as a classification task where we predict the identity of the amino acid in the center of the environment based on all other atoms.

Splitting criteria: We split residue environments by domain-level CATH protein topology class.

config/dataset/atom3d_res.yaml#
datamodule:
  _target_: proteinworkshop.datasets.atom3d_datamodule.ATOM3DDataModule
  task: RES
  data_dir: ${env.paths.data}/ATOM3D
  res_split: cath-topology
  max_units: 0
  unit: edge
  batch_size: 1
  num_workers: 4
  pin_memory: false
num_classes: 20

CCPDB Ligand Binding (ccpdb_ligand)#

config/dataset/ccpdb_ligand.yaml#
datamodule:
  _target_: proteinworkshop.datasets.cc_pdb.CCPDBDataModule
  path: ${env.paths.data}/ccpdb/ligands/ # Path to the dataset
  pdb_dir: ${env.paths.data}/pdb/ # Path to the PDB files
  name: "ligands" # Name of the ccPDB dataset

  batch_size: 32 # Batch size
  pin_memory: True # Pin memory for the dataloader
  num_workers: 4 # Number of workers for the dataloader
  format: "mmtf" # Format of the structure files
  obsolete_strategy: "drop" # What to do with obsolete PDB entries
  split_strategy: "random" # (or 'stratified') How to split the dataset
  train_fraction: 0.8 # Fraction of the dataset to use for training
  val_fraction: 0.1 # Fraction of the dataset to use for validation
  test_fraction: 0.1 # Fraction of the dataset to use for testing
  transforms: ${transforms}
  overwrite: False # Whether to overwrite the dataset if it already exists

num_classes: 7 # Number of classes

CCPDB Metal Binding (ccpdb_metal)#

config/dataset/ccpdb_metal.yaml#
datamodule:
  _target_: proteinworkshop.datasets.cc_pdb.CCPDBDataModule
  path: ${env.paths.data}/ccpdb/metal/ # Path to the dataset
  pdb_dir: ${env.paths.data}/pdb/ # Path to the PDB files
  name: "metal" # Name of the ccPDB dataset

  batch_size: 32 # Batch size
  pin_memory: True # Pin memory for the dataloader
  num_workers: 4 # Number of workers for the dataloader
  format: "mmtf" # Format of the structure files
  obsolete_strategy: "drop" # What to do with obsolete PDB entries
  split_strategy: "random" # (or 'stratified') How to split the dataset
  train_fraction: 0.8 # Fraction of the dataset to use for training
  val_fraction: 0.1 # Fraction of the dataset to use for validation
  test_fraction: 0.1 # Fraction of the dataset to use for testing
  transforms: ${transforms}
  overwrite: False # Whether to overwrite the dataset if it already exists

num_classes: 7 # Number of classes

CCPDB Nucleic Acid Binding (ccpdb_nucleic)#

config/dataset/ccpdb_nucleic.yaml#
datamodule:
  _target_: proteinworkshop.datasets.cc_pdb.CCPDBDataModule
  path: ${env.paths.data}/ccpdb/nucleic/ # Path to the dataset
  pdb_dir: ${env.paths.data}/pdb/ # Path to the PDB files
  name: "nucleic" # Name of the ccPDB dataset

  batch_size: 32 # Batch size
  pin_memory: True # Pin memory for the dataloader
  num_workers: 4 # Number of workers for the dataloader
  format: "mmtf" # Format of the structure files
  obsolete_strategy: "drop" # What to do with obsolete PDB entries
  split_strategy: "random" # (or 'stratified') How to split the dataset
  train_fraction: 0.8 # Fraction of the dataset to use for training
  val_fraction: 0.1 # Fraction of the dataset to use for validation
  test_fraction: 0.1 # Fraction of the dataset to use for testing
  transforms: ${transforms}
  overwrite: False # Whether to overwrite the dataset if it already exists

num_classes: 2 # Number of classes

CCPDB Nucleotide Binding (ccpdb_nucleotides)#

config/dataset/ccpdb_nucleotides.yaml#
datamodule:
  _target_: proteinworkshop.datasets.cc_pdb.CCPDBDataModule
  path: ${env.paths.data}/ccpdb/nucleotides/ # Path to the dataset
  pdb_dir: ${env.paths.data}/pdb/ # Path to the PDB files
  name: "nucleotides" # Name of the ccPDB dataset

  batch_size: 32 # Batch size
  pin_memory: True # Pin memory for the dataloader
  num_workers: 4 # Number of workers for the dataloader
  format: "mmtf" # Format of the structure files
  obsolete_strategy: "drop" # What to do with obsolete PDB entries
  split_strategy: "random" # (or 'stratified') How to split the dataset
  train_fraction: 0.8 # Fraction of the dataset to use for training
  val_fraction: 0.1 # Fraction of the dataset to use for validation
  test_fraction: 0.1 # Fraction of the dataset to use for testing
  transforms: ${transforms}
  overwrite: False # Whether to overwrite the dataset if it already exists

num_classes: 8 # Number of classes

Post Translational Modifications (ptm)#

config/dataset/ptm.yaml#
datamodule:
  _target_: "proteinworkshop.datasets.ptm.PTMDataModule"
  dataset_name: "ptm_13" # Options currently include (`ptm_13`, `optm`)
  path: ${env.paths.data}/PostTranslationalModification/ # Directory where the dataset is stored
  batch_size: 32 # Batch size for dataloader
  in_memory: False # Load the dataset in memory
  pin_memory: True # Pin memory for dataloader
  num_workers: 16 # Number of workers for dataloader
  transforms: ${transforms} # Transforms to apply to dataset examples
  overwrite: False # Whether to overwrite existing dataset files
num_classes: 13 # Number of classes

PPI Site Prediction (masif_site)#

We use the dataset of experimental structures curated from the PDB by Gainza et al. and retain the original splits, though we modify the labelling scheme to be based on inter-atomic proximity (3.5 Å), which can be user-defined, rather than solvent exclusion.

The dataset is composed by selecting PPI pairs from the PRISM list of nonredundant proteins, the ZDock benchmark, PDBBind and SAbDab. Sequence-based splits are performed using CD-HIT and structural splits are performed using TM-align.

config/dataset/masif_site.yaml#
datamodule:
  _target_: proteinworkshop.datasets.masif_site.MaSIFPPISP
  path: ${env.paths.data}/masif_site/ # Directory where the dataset is stored
  pdb_dir: ${env.paths.data}/pdb/ # Directory where raw PDB/mmtf files are stored
  format: "mmtf" # Format of the raw PDB/MMTF files
  batch_size: 32 # Batch size for dataloader
  pin_memory: True # Pin memory for dataloader
  num_workers: 8 # Number of workers for dataloader
  dataset_fraction: 1.0 # Fraction of the dataset to use
  shuffle_labels: False # Whether to shuffle labels for permutation testing
  transforms: ${transforms} # Transforms to apply to dataset examples
  overwrite: False # Whether to overwrite existing dataset files
num_classes: 2 # Number of classes
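
The proximity-based labelling described above can be illustrated with a small sketch: an atom is labelled as part of the interface if any atom of the partner chain lies within the distance threshold (3.5 Å by default). This is a brute-force illustration under stated assumptions, not the datamodule's implementation: the function name is hypothetical, the threshold is treated as inclusive, and labels are assigned per atom rather than per residue.

```python
import math


def label_interface_atoms(atoms_a, atoms_b, cutoff=3.5):
    """Label each atom in `atoms_a` with 1 if any atom in `atoms_b` is within `cutoff` angstroms, else 0.

    Coordinates are (x, y, z) tuples in angstroms.
    """
    return [
        int(any(math.dist(a, b) <= cutoff for b in atoms_b))
        for a in atoms_a
    ]


if __name__ == "__main__":
    # Made-up coordinates: the first atom of chain A is 3.0 A from chain B,
    # the second is more than 10 A away.
    chain_a = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    chain_b = [(0.0, 3.0, 0.0)]
    print(label_interface_atoms(chain_a, chain_b))  # [1, 0]
```

Raising `cutoff` widens the interface definition, which is the user-defined knob mentioned in the description. For full-size structures a spatial index would be preferable to this quadratic loop.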