1. Quickstart: Loading standard PDBs

Pablo tries its best to make it as easy as possible (but no easier) to get complete chemical information out of a PDB file. This is a surprisingly hard problem; see Why Pablo is different. Fortunately, most ordinary use cases are easy!

The function used to load all files in Pablo is topology_from_pdb. Many PDB files using standard atom and residue names “just work” in Pablo, as long as they aren’t missing any atoms:

from openff.pablo import topology_from_pdb

topology = topology_from_pdb("2hi7_prepared.pdb")
topology.visualize()

Even files that include ligands can load automatically, as long as the ligands have all their atoms and are named in the “standard” PDB way (according to the CCD):

from openff.pablo import topology_from_pdb

topology = topology_from_pdb("1c9h_prepared.pdb")
topology.visualize()

topology_from_pdb can also load from file-like objects. 1A4T is a rare example of a PDB file from the PDB that is chemically complete and can load without modification:

from urllib.request import urlopen
from openff.pablo import topology_from_pdb

with urlopen("https://files.rcsb.org/download/1A4T.pdb") as pdb_file_object:
    topology = topology_from_pdb(pdb_file_object)
topology.visualize()

UserWarning: Multi-model files not supported; topology will reflect first model

Note that we don’t yet support loading subsequent models from PDB files, so this fresh-from-the-PDB file issues a warning.
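If you would like to know ahead of time whether a file will trigger this warning, you can count its MODEL records before loading it. The sketch below uses only the Python standard library; count_models is a hypothetical helper written for this example, not part of Pablo, and it assumes the file follows the usual PDB convention of placing the record name in the first six columns.

```python
import io


def count_models(pdb_lines):
    """Count MODEL records in an iterable of PDB lines (str or bytes)."""
    count = 0
    for line in pdb_lines:
        # urlopen() yields bytes, open() in text mode yields str; accept both.
        if isinstance(line, bytes):
            line = line.decode("ascii", errors="replace")
        # The record name occupies columns 1-6 of every PDB line.
        if line[:6].strip() == "MODEL":
            count += 1
    return count


# A tiny two-model stand-in; the ATOM lines are abbreviated because
# only the record names matter for this check.
example = io.StringIO(
    "MODEL        1\n"
    "ATOM      1  N   GLY A   1\n"
    "ENDMDL\n"
    "MODEL        2\n"
    "ATOM      1  N   GLY A   1\n"
    "ENDMDL\n"
)
n_models = count_models(example)
print(n_models)
```

A count greater than one suggests that only the first model will be reflected in the resulting topology. Note that iterating over a file object consumes it, so reopen or rewind the file before passing it to topology_from_pdb.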

1.1. Non-standard ligands

Non-standard molecules can be loaded by providing the chemical information that Pablo doesn’t know about. This works regardless of what the atoms in the molecule are named or what residues they’re in, as long as they have CONECT records and elements. 5ap1_prepared.pdb, for instance, includes a non-standard ligand, and so needs a little extra help:

from openff.pablo import topology_from_pdb, ResidueDefinition

topology = topology_from_pdb(
    "5ap1_prepared.pdb",
    additional_definitions=[
        ResidueDefinition.anon_from_smiles(
            "O=C([O-])Cn1cc(cn1)c2ccc(cc2OCC#N)Nc3ccc(c(n3)NC4CCCCC4)C#N"
        ),
    ],
)
topology.visualize()
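Because this route depends on the ligand atoms having CONECT records and elements, it can be useful to check a file for both before loading it. The following is a minimal standard-library sketch, not Pablo’s own validation: hetatm_readiness and make_hetatm are hypothetical helpers, and the check assumes fixed-column PDB formatting (element symbol in columns 77-78, CONECT serial numbers in five-column fields starting at column 7).

```python
def hetatm_readiness(pdb_text):
    """Return (missing_element, missing_conect) lists of HETATM serials."""
    hetatm_serials = []
    missing_element = []
    bonded = set()
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "HETATM":
            serial = int(line[6:11])
            hetatm_serials.append(serial)
            # Element symbol lives in columns 77-78 (0-based slice 76:78).
            if not line[76:78].strip():
                missing_element.append(serial)
        elif record == "CONECT":
            # Up to five serial numbers, each in a five-column field.
            fields = [line[i:i + 5].strip() for i in range(6, 31, 5)]
            bonded.update(int(field) for field in fields if field)
    missing_conect = [s for s in hetatm_serials if s not in bonded]
    return missing_element, missing_conect


def make_hetatm(serial, name, element):
    """Format a minimal fixed-column HETATM line for the demo below."""
    return (
        f"HETATM{serial:5d} {name:<4s} LIG A{1:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}"
        + " " * 10
        + f"{element:>2s}"
    )


# Demo ligand: atom 3 lacks an element symbol, atom 4 has no CONECT entry.
demo = "\n".join(
    [
        make_hetatm(1, "C1", "C"),
        make_hetatm(2, "O1", "O"),
        make_hetatm(3, "N1", ""),
        make_hetatm(4, "C2", "C"),
        f"CONECT{1:5d}{2:5d}{3:5d}",
        f"CONECT{2:5d}{1:5d}",
        f"CONECT{3:5d}{1:5d}",
    ]
)
missing_element, missing_conect = hetatm_readiness(demo)
print(missing_element, missing_conect)
```

Atoms reported in either list are likely to need fixing in the file before a SMILES-based definition can be matched against them. Monatomic species such as ions legitimately have no bonds, so an entry in the second list is a prompt to look closer rather than a definite error.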

additional_definitions is a powerful, flexible mechanism that can do much more than assigning chemical information to a ligand with complete CONECT records. It can “fill in the gaps” between known residues in a partially-standard polymer, only requiring CONECT records for bonds that aren’t present in the standard residues. It can do this by matching the whole molecule, or defining a substructure, as long as all atoms and bonds are described and unambiguous. We’ll focus more on this topic in additional_definitions: Residue-free chemical templates.

In normal use, Pablo should “just work” when your inputs are made of standard residues; otherwise, you should get a helpful error message explaining how your input file needs to change or which residues need additional information to be loaded. Our API offers several ways for users to teach Pablo about nonstandard residues, and we intend to offer more in the future. Users shouldn’t have to build residue definitions atom-by-atom; instead, the API lets users copy and modify existing residue definitions, and we expect this to be the shortest path to defining custom residues in the vast majority of cases.

1.2. What are the standard names and residues?

The PDB format is designed to be interpreted in concert with the Chemical Component Dictionary, or CCD. The CCD defines standard atom names, residue codes, and resonance forms for tens of thousands of chemicals, as well as how they may bond into larger polymers.

Pablo interprets the CCD as faithfully as is practical. Unfortunately, the CCD is designed for crystallography rather than biomolecular simulation, which limits its applicability. Pablo provides many tools to augment, patch, and even replace the CCD, which we’ll dig into in Customizing CCD access and The residue library.

You can investigate the standard atom names for a given residue by inspecting the STD_CCD_CACHE:

from openff.pablo import STD_CCD_CACHE

{
    resdef.description: ", ".join(
        "|".join((atom.name, *atom.synonyms))
        for atom in resdef.atoms
    ) for resdef in STD_CCD_CACHE["GLY"]
}

{'GLYCINE': 'N, CA, C, O, OXT, H|H1, H2, HA2, HA3, HXT',
 'GLYCINE altids': 'N, CA, C, O, OXT, H|H1, HN2, HA1, HA2, HXT',
 'GLYCINE -HXT': 'N, CA, C, O, OXT, H|H1, H2, HA2, HA3',
 'GLYCINE -HXT altids': 'N, CA, C, O, OXT, H|H1, HN2, HA1, HA2',
 'GLYCINE -H2': 'N, CA, C, O, OXT, H|H1, HA2, HA3, HXT',
 'GLYCINE -H2 altids': 'N, CA, C, O, OXT, H|H1, HA1, HA2, HXT',
 'GLYCINE -HXT -H2': 'N, CA, C, O, OXT, H|H1, HA2, HA3',
 'GLYCINE -HXT -H2 altids': 'N, CA, C, O, OXT, H|H1, HA1, HA2',
 'GLYCINE +H3': 'N, CA, C, O, OXT, H|H1, H2, HA2, HA3, HXT, H3',
 'GLYCINE +H3 altids': 'N, CA, C, O, OXT, H|H1, HN2, HA1, HA2, HXT, H3',
 'GLYCINE -HXT +H3': 'N, CA, C, O, OXT, H|H1, H2, HA2, HA3, H3',
 'GLYCINE -HXT +H3 altids': 'N, CA, C, O, OXT, H|H1, HN2, HA1, HA2, H3',
 'GLYCINE -H2 +H3': 'N, CA, C, O, OXT, H|H1, HA2, HA3, HXT, H3',
 'GLYCINE -H2 +H3 altids': 'N, CA, C, O, OXT, H|H1, HA1, HA2, HXT, H3',
 'GLYCINE -HXT -H2 +H3': 'N, CA, C, O, OXT, H|H1, HA2, HA3, H3',
 'GLYCINE -HXT -H2 +H3 altids': 'N, CA, C, O, OXT, H|H1, HA1, HA2, H3'}
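The same attributes can answer the reverse question: which variants of a residue accept a particular atom name? The sketch below relies only on the attributes used in the snippet above (description, atoms, name, synonyms). So that it runs without Pablo installed, it operates on stand-in objects that copy two of the glycine entries from the output; variants_accepting is a hypothetical helper, not part of Pablo’s API.

```python
from types import SimpleNamespace


def variants_accepting(resdefs, atom_name):
    """Descriptions of the definitions in which atom_name is a name or synonym."""
    return [
        resdef.description
        for resdef in resdefs
        if any(
            atom_name == atom.name or atom_name in atom.synonyms
            for atom in resdef.atoms
        )
    ]


def stub_resdef(description, names):
    """Build a stand-in definition from 'NAME|SYNONYM' strings."""
    atoms = []
    for entry in names:
        name, *synonyms = entry.split("|")
        atoms.append(SimpleNamespace(name=name, synonyms=tuple(synonyms)))
    return SimpleNamespace(description=description, atoms=atoms)


# Stand-ins mirroring two entries of the glycine output shown above.
glycine_stubs = [
    stub_resdef(
        "GLYCINE",
        ["N", "CA", "C", "O", "OXT", "H|H1", "H2", "HA2", "HA3", "HXT"],
    ),
    stub_resdef(
        "GLYCINE altids",
        ["N", "CA", "C", "O", "OXT", "H|H1", "HN2", "HA1", "HA2", "HXT"],
    ),
]

print(variants_accepting(glycine_stubs, "HA3"))
print(variants_accepting(glycine_stubs, "H1"))
```

With Pablo installed, passing STD_CCD_CACHE["GLY"] in place of glycine_stubs should work the same way, assuming the real definitions expose these attributes as the snippet above suggests.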

Note that since we only provide support for CCD-compliant atom and residue names, some naming conventions in common use are not supported out of the box. Pablo’s modular design and replaceable residue library make it easy to write (and distribute!) your own residue library for alternative naming schemes such as Amber’s.