4. The residue library

Unlike all other PDB loading tools we know about, Pablo exposes its library of known residues to the user for inspection, modification, and even replacement at runtime. This can be done with the residue_library argument to topology_from_pdb. residue_library takes a mapping from residue IDs to a list of residue definitions (see Creating custom residue definitions). All of these definitions are matched against any residue with that ID, and all that do match must have consistent chemistry — Pablo would rather raise an error than provide ambiguous chemistry.

The fact that a residue library admits multiple residue definitions for each name allows alternate protonation states and even resonance forms to be included in a residue library. Matches are considered chemically consistent if they have identical elements, connectivity, and net formal charge, so if multiple resonance forms match and are chemically consistent than the first one is used. Only a few residues have alternative protonation states in the default library, so the user will often need to augment it — thankfully, this is easy.

4.1. The default residue library: STD_CCD_CACHE

The default value of the residue_library argument is a module-level constant called STD_CCD_CACHE. This is an instance of the CcdCache class that comes pre-loaded with a number of common residues and patches designed to make it more usable for common biomolecular workflows.

The CcdCache class provides a Mapping interface to the CCD itself; unknown residues are downloaded from the CCD when requested, and cached both in memory and on disk in the standard user cache directory. If the $XDG_CACHE_HOME environment variable is set, this is $XDG_CACHE_HOME/openff-pablo/ccd_cache; otherwise, it defaults to ~/.cache/openff-pablo/ccd_cache on Linux, and ~/Library/Caches/openff-pablo/ccd_cache on MacOS.

4.2. Augmenting the default library

The CcdCache class also provides a fluent, declarative API for easily augmenting the residue library. Unfortunately it’s not possible to prepare every residue in the CCD to the standard of, say, the canonical amino acids, and you may encounter bugs and missing protonation states. We try to make it as easy as possible to work around these issues.

If you’ve generated a fix for a common residue, let us know — we might upstream it so that everyone can use it!

You can read the entire interface in the CcdCache reference docs, but we summarize and motivate some important use cases here. In general, the way to add new residue definitions to the default library is with the with_ and with_replaced methods:

from openff.pablo import STD_CCD_CACHE, ResidueDefinition

my_residue_library = STD_CCD_CACHE.with_([
    # Create a custom ammonia definition
    ResidueDefinition.from_smiles(
        "[H:2][N:1]([H:3])[H:4]",
        atom_names={1: "N", 2: "H1", 3: "H2", 4: "H3"},
        residue_name="AMMN",
        description="AMMONIA",
    )
])

We’ll see how to make custom definitions in the next chapter! Like the other CcdCache.with_*() methods, they return the modified CcdCache rather than modify it in place. This both preserves the original library and allows calls to be chained and used directly in calls to topology_from_pdb:

from openff.pablo import STD_CCD_CACHE, topology_from_pdb, ResidueDefinition

topology = topology_from_pdb(
    "nh3_tip5p.pdb",
    residue_library=(
        STD_CCD_CACHE
            # Add ammonia with resid AMMO
            .with_([
                ResidueDefinition.from_smiles(
                    "[H:2][N:1]([H:3])[H:4]",
                    atom_names={1: "N", 2: "H1", 3: "H2", 4: "H3"},
                    residue_name="AMMN",
                    description="AMMONIA",
                ),
            ])
            # Require 5-point water
            .with_replaced([
                ResidueDefinition.from_smiles(
                    "[H:1][O:2][H:3]",
                    atom_names={1: "H1", 2: "O", 3: "H2"},
                    residue_name="HOH",
                    virtual_sites=["EP1", "EP2"],
                    description="5-POINT WATER",
                ),
            ])
    ),
)

4.2.1. Protonation states

The CCD format does not provide alternate protonation states for all residues, and for those it does many are chemical nonsense and represent fragments that may be visible crystallographically rather than something you’d want to simulate. As a result, all alternate protonation states must be specified either by Pablo or by the user. The CcdCache.with_varied_protonation function makes this simple:

my_residue_library = STD_CCD_CACHE.with_varied_protonation(
    "ABU", # GABA, a gamma amino acid with both basic and acidic groups
    acidic=["HXT"], # Atom name of existing abstractable proton
    basic=[("N", "H3")], # Existing atom to protonate, name of new proton
)
[resdef.description for resdef in my_residue_library["ABU"]]
['GAMMA-AMINO-BUTANOIC ACID',
 'GAMMA-AMINO-BUTANOIC ACID altids',
 'GAMMA-AMINO-BUTANOIC ACID -HXT',
 'GAMMA-AMINO-BUTANOIC ACID +H3',
 'GAMMA-AMINO-BUTANOIC ACID -HXT +H3',
 'GAMMA-AMINO-BUTANOIC ACID altids +H3']

with_varied_protonation adds all combinations of the specified proton abstractions and additions to all existing residues in the cache. It skips any combinations with clashing names or that try to modify or remove atoms that don’t exist. Use the acidic argument to deprotonate, and the basic argument to protonate.

4.2.2. Virtual sites

Virtual sites can be defined on existing residues using with_virtual_sites:

my_residue_library = (
    STD_CCD_CACHE
        .with_virtual_sites("HOH", ["EP1", "EP2"])
        .with_virtual_sites("SOL", ["EP1", "EP2"])
        .with_virtual_sites("WAT", ["EP1", "EP2"])
        .with_virtual_sites("HOH", ["EPW"])
        .with_virtual_sites("SOL", ["EPW"])
        .with_virtual_sites("WAT", ["EPW"])
)

with_virtual_sites adds a new residue definition for each existing definition with the given name. The new definitions require the exact set of virtual sites requested be present, and discards the virtual site atom records.

For the common case of 4- and 5-point water models with virtual sites named EPW or EP1 and EP2, the with_vsite_water method is even easier, and avoids some pitfalls of the above approach:

my_residue_library = STD_CCD_CACHE.with_vsite_water()

Note that with_virtual_sites and with_vsite_water both use the patching mechanism under the hood, and so will not apply to any additional water residue definitions added later with the with_ and with_replaced methods. They will apply to all residues already in the cache and to new residues downloaded from the CCD.

4.2.3. Alternate atom names

Synonyms for existing atoms can be added to residue definitions with ResidueDefinition.with_synonyms:

my_residue_library = STD_CCD_CACHE.with_replaced([
    resdef.with_synonyms({"H1": ["HA"], "H2": ["HB"]})
    for resdef in STD_CCD_CACHE["HOH"] + STD_CCD_CACHE["WAT"]
])

with_synonyms will raise an error if these synonyms introduce any ambiguity.

4.2.4. Arbitrary patches

Arbitrary transformations can be made with the with_patch method:

my_residue_library = STD_CCD_CACHE.with_patch(
    "ALA",
    lambda resdef: [resdef, resdef.with_synonyms({"N": ["NH"]})],
)

The patch is a function that takes a single residue definition as an argument and returns a list of residue definitions. It is applied to each residue definition already in the cache with the given residue name, and the returned definitions replace the definition that was patched. This means that residue definitions can be added by returning the original and modified definitions, split up by returning multiple modified definitions, or replaced by returning only the modified definition.

The patching mechanism will not apply to residues added later with the with_ and with_replaced methods. They will apply to all residues already in the cache and to new residues downloaded from the CCD.

Some patch functions used by the default library are found in the openff.pablo.ccd.patches module.

4.3. Defining a custom residue library

The simplest residue library is a dictionary:

from openff.pablo import topology_from_pdb, ResidueDefinition
from openff.pablo.chem import PEPTIDE_BOND

reslib = {
    "GLY": [
        ResidueDefinition.from_smiles(
            "[H:1][N:2]([H:3])[C:4]([H:5])([H:6])[C:7](=[O:8])[O:9][H:10]",
            atom_names={
                1: "H2", 2: "N", 3: "H", 4: "CA", 5: "HA1", 6: "HA2", 7: "C",
                8: "O", 9: "OXT", 10: "HXT"
            },
            residue_name="GLY",
            linking_bond=PEPTIDE_BOND,
            leaving_atoms=[1, 9, 10],
        )
    ]
}

You can read more about defining new residues in Creating custom residue definitions and the ResidueDefinition reference. This minimal residue library would be sufficient to load polyglycines of arbitrary length, as long as they have neutral termini:

topology_from_pdb("polygly.pdb", residue_library=reslib).visualize()

Of course, the default residue library would have no problem with this PDB file. Pablo gives you complete control over the chemistry that it interprets.

You can also define a completely custom instance of CcdCache; this is described in Customizing CCD access.