Preparing SMILES Strings For Downstream Applications

by Corin Wagen · Oct 3, 2025

The SMILES format is one of the most common ways to represent molecules in large chemical libraries. One challenge associated with storing molecules in SMILES format is that different tautomers and protonation states all have different SMILES representations, although these strings all represent the same fundamental chemical species. For modeling solution-phase properties like solubility, toxicity, or binding affinity, scientists typically prefer to use the SMILES that best represents the actual protonation state and tautomer of the compound in solution.

Unfortunately, identifying the precise protonation state of a molecule can be challenging to accomplish at scale. While SMILES strings can easily be converted into the "canonical" SMILES form in the RDKit, the canonical SMILES doesn't necessarily represent what will actually be in solution. For instance, the canonical SMILES for trimethylamine is CN(C)C, while the species that predominates at pH 7.4 is the protonated microstate C[NH+](C)C. Correctly identifying this SMILES string in a black-box fashion requires determining the relative acidity and basicity of every ionizable site, a non-trivial problem that cheminformatics packages like the RDKit don't attempt to solve.
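To see why the protonated form dominates for trimethylamine, here is a rough back-of-the-envelope sketch using the Henderson-Hasselbalch relation for a single basic site. The pK_a of about 9.8 for trimethylammonium is an approximate literature value that I am assuming here; it is not produced by Rowan or the RDKit, and the helper name is hypothetical. Real molecules with several coupled sites need a proper microstate treatment, which is the point of the workflow discussed in this post.

```python
def fraction_protonated(pka: float, ph: float) -> float:
    """Fraction of a single-site base present as its conjugate acid at a given pH.

    Uses the Henderson-Hasselbalch relation: [BH+]/[B] = 10 ** (pKa - pH).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Assumed, approximate literature pKa of the trimethylammonium ion
TRIMETHYLAMMONIUM_PKA = 9.8

print(f"fraction protonated at pH 7.4: {fraction_protonated(TRIMETHYLAMMONIUM_PKA, 7.4):.3f}")
# prints: fraction protonated at pH 7.4: 0.996
```

Under this assumption, well over 99% of the amine is protonated at pH 7.4, which is why CN(C)C is a poor representation of the solution-phase species.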

Rowan's macroscopic pK_a workflow provides a simple and robust way to automatically convert molecules to the protonation state and tautomer predicted to predominate at a given pH. Here's a simple Python script that uses Rowan's API to convert a SMILES string into the preferred protonation state:

import rowan

# Set ROWAN_API_KEY environment variable to your API key or set rowan.api_key directly
# rowan.api_key = "rowan-sk..."

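def get_best_microstate(smiles: str, target_ph: float = 7.4) -> str:
    """
    Converts a given input SMILES string to the most populated microstate at a given pH.

    :param smiles: the input SMILES string
    :param target_ph: the pH at which to assess microstate distribution
    :returns: the SMILES of the microstate
    :raises ValueError: if no computed pH point lies within 0.01 of target_ph
    """

    # Submit the macroscopic pKa workflow and block until it finishes
    result = rowan.submit_macropka_workflow(
        initial_smiles=smiles,
        name="example macropka",
    )

    result.wait_for_result().fetch_latest(in_place=True)

    # Each entry pairs a pH value with the weights of all microstates at that pH
    for ph, microstate_weights in result.data["microstate_weights_by_pH"]:
        if abs(target_ph - ph) < 0.01:
            # Pick the microstate with the largest weight at this pH
            ms = result.data["microstates"][microstate_weights.index(max(microstate_weights))]
            return ms["smiles"]

    # Fail loudly rather than silently returning None
    raise ValueError(f"no microstate weights found near pH {target_ph}")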


print(f"best microstate is {get_best_microstate('CN(C)C')}")

As expected, running the above Python script quickly returns:

best microstate is C[NH+](C)C
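The script above only accepts a pH grid point within 0.01 units of the target. If the pH grid in your results doesn't include the exact target value, a more forgiving variant is to take the nearest grid point instead. The sketch below is a pure-Python helper (the name pick_dominant_smiles is hypothetical) that assumes the result data has the same shape used in the script: a "microstates" list of dicts with a "smiles" key, and a "microstate_weights_by_pH" list of (pH, weights) pairs. The example_data values are hand-written toy numbers for illustration, not real Rowan output.

```python
def pick_dominant_smiles(data: dict, target_ph: float = 7.4) -> str:
    """Return the SMILES of the highest-weight microstate at the grid pH nearest target_ph."""
    rows = data["microstate_weights_by_pH"]
    if not rows:
        raise ValueError("no pH grid points in result data")

    # Choose the (pH, weights) pair whose pH is closest to the target
    _, weights = min(rows, key=lambda row: abs(row[0] - target_ph))

    # Index of the largest weight identifies the dominant microstate
    best_index = max(range(len(weights)), key=weights.__getitem__)
    return data["microstates"][best_index]["smiles"]


# Toy data in the assumed shape; the weights are made up for illustration
example_data = {
    "microstates": [{"smiles": "CN(C)C"}, {"smiles": "C[NH+](C)C"}],
    "microstate_weights_by_pH": [
        (7.0, [0.002, 0.998]),
        (7.5, [0.005, 0.995]),
        (12.0, [0.990, 0.010]),
    ],
}

print(pick_dominant_smiles(example_data, target_ph=7.4))
```

Separating the selection logic from the API call also makes it easy to unit-test without submitting a workflow.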

For small to medium-sized molecules, each macro-pK_a calculation takes approximately 20 seconds (or 0.3 credits on Rowan). This is fast enough that this workflow can be run on thousands or tens of thousands of molecules, letting scientists quickly run these calculations before initializing a docking screen or training an ML model.
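As a rough planning aid, the per-molecule figures quoted above can be turned into a batch estimate. This is only arithmetic on the numbers in this post (about 20 seconds and 0.3 credits per molecule); the function name and the concurrency parameter are hypothetical, and actual throughput depends on how many jobs run in parallel and on molecule size.

```python
SECONDS_PER_MOLECULE = 20.0   # approximate figure quoted in the post
CREDITS_PER_MOLECULE = 0.3    # approximate figure quoted in the post


def estimate_batch_cost(n_molecules: int, concurrency: int = 1) -> tuple[float, float]:
    """Return (wall-clock hours, total credits) for a batch of molecules.

    concurrency is an assumed number of jobs running at once; 1 means strictly serial.
    """
    total_seconds = n_molecules * SECONDS_PER_MOLECULE / concurrency
    total_credits = n_molecules * CREDITS_PER_MOLECULE
    return total_seconds / 3600.0, total_credits


hours, total_credits = estimate_batch_cost(10_000)
print(f"{hours:.1f} h, {total_credits:.0f} credits")
# prints: 55.6 h, 3000 credits
```

Run strictly one at a time, 10,000 molecules would take a little over two days; any parallelism shortens the wall-clock time proportionally while leaving the credit total unchanged.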

Rowan's macroscopic pK_a workflow is available to all subscribers. If your work requires studying large numbers of molecules under physiological conditions, consider subscribing to Rowan or reaching out about a plan for your organization! We'd love to partner with you and support your science.