antoniaebner committed
Commit a3a1ae9 · Parent(s): b308ee1

add code and assets

Files changed (11):
  1. .gitignore +1 -0
  2. Dockerfile +16 -0
  3. README.md +92 -1
  4. app.py +78 -0
  5. predict.py +54 -0
  6. requirements.txt +10 -0
  7. src/__init__.py +0 -0
  8. src/data.py +198 -0
  9. src/model.py +79 -0
  10. src/train.py +92 -0
  11. src/utils.py +441 -0
.gitignore ADDED
@@ -0,0 +1 @@
+ __pycache__/
Dockerfile ADDED
@@ -0,0 +1,16 @@
+ # Read the doc: https://huggingface.co/docs/hub/spaces-sdks-docker
+ # you will also find guides on how best to write your Dockerfile
+
+ FROM python:3.11
+
+ RUN useradd -m -u 1000 user
+ USER user
+ ENV PATH="/home/user/.local/bin:$PATH"
+
+ WORKDIR /app
+
+ COPY --chown=user ./requirements.txt requirements.txt
+ RUN pip install --no-cache-dir --upgrade -r requirements.txt
+
+ COPY --chown=user . /app
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -9,4 +9,95 @@ license: apache-2.0
 short_description: XGBoost baseline classifier for Tox21
 ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Tox21 XGBoost Classifier
+
+ This repository hosts a Hugging Face Space that provides an exemplary API for submitting models to the [Tox21 Leaderboard](https://huggingface.co/spaces/tschouis/tox21_leaderboard).
+
+ In this example, we train an XGBoost classifier on the Tox21 targets and save the trained model in the `assets/` folder.
+
+ **Important:** For leaderboard submission, your Space does not need to include training code. It only needs to implement inference in the `predict()` function inside `predict.py`. The `predict()` function must keep the provided skeleton: it takes a list of SMILES strings as input and returns a prediction dictionary with SMILES and targets as keys. Any preprocessing of SMILES strings must therefore be executed on the fly during inference; see the sketch below.
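+
+ A minimal sketch of the required skeleton (`TARGETS` and the dummy score stand in for your own task names and model call):
+
+ ```python
+ def predict(smiles_list: list[str]) -> dict:
+     predictions = {}
+     for smiles in smiles_list:
+         # preprocess the SMILES string and run your model here
+         predictions[smiles] = {target: 0.0 for target in TARGETS}
+     return predictions
+ ```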
+
+ # Repository Structure
+ - `predict.py` - Defines the `predict()` function required by the leaderboard (entry point for inference).
+ - `app.py` - FastAPI application wrapper (can be used as-is).
+
+ - `src/` - Core model & preprocessing logic:
+   - `data.py` - SMILES preprocessing pipeline
+   - `model.py` - XGBoost classifier wrapper
+   - `train.py` - Script to train the classifier
+   - `utils.py` - Constants and helper functions
+
+ # Quickstart with Spaces
+
+ You can easily adapt this project in your own Hugging Face account:
+
+ - Open this Space on Hugging Face.
+
+ - Click "Duplicate this Space" (top-right corner).
+
+ - Modify `src/` for your preprocessing pipeline and model class.
+
+ - Modify `predict()` inside `predict.py` to perform model inference while keeping the function skeleton unchanged to remain compatible with the leaderboard.
+
+ That’s it: your model will be available as an API endpoint for the Tox21 Leaderboard.
+
+ # Installation
+ To run (and train) the XGBoost classifier, clone the repository and install the dependencies:
+
+ ```bash
+ git clone https://huggingface.co/spaces/tschouis/tox21_xgboost_classifier
+ cd tox21_xgboost_classifier
+
+ conda create -n tox21_xgb_cls python=3.11
+ conda activate tox21_xgb_cls
+ pip install -r requirements.txt
+ ```
+
+ # Training
+
+ To train the XGBoost model from scratch:
+
+ ```bash
+ python -m src.train
+ ```
+
+ This will:
+
+ 1. Load and preprocess the Tox21 training dataset.
+ 2. Train an XGBoost classifier.
+ 3. Save the trained model to the `assets/` folder.
+ 4. Evaluate the trained XGBoost classifier on the validation split.
+
+
+ # Inference
+
+ For inference, you only need `predict.py`.
+
+ Example usage inside Python:
+
+ ```python
+ from predict import predict
+
+ smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
+ results = predict(smiles_list)
+
+ print(results)
+ ```
+
+ The output will be a nested dictionary in the format:
+
+ ```python
+ {
+     "CCO": {"target1": 0, "target2": 1, ..., "target12": 0},
+     "c1ccccc1": {"target1": 1, "target2": 0, ..., "target12": 1},
+     "CC(=O)O": {"target1": 0, "target2": 0, ..., "target12": 0}
+ }
+ ```
+
+ # Notes
+
+ - Only adapting `predict.py` for your model inference is required for leaderboard submission.
+
+ - Training (`src/train.py`) is provided for reproducibility.
+
+ - Preprocessing (here inside `src/data.py`) must be applied at inference time, not just during training.
app.py ADDED
@@ -0,0 +1,81 @@
+ """
+ This is the main entry point for the FastAPI application.
+ The app handles requests to predict toxicity for a list of SMILES strings.
+ """
+
+ # ---------------------------------------------------------------------------------------
+ # Dependencies and global variable definition
+ import os
+ from typing import List, Dict, Optional
+ from fastapi import FastAPI, Header, HTTPException
+ from pydantic import BaseModel, Field
+
+ from predict import predict as predict_func
+
+ API_KEY = os.getenv("API_KEY")  # set via Space Secrets
+
+
+ # ---------------------------------------------------------------------------------------
+ class Request(BaseModel):
+     smiles: List[str] = Field(min_items=1, max_items=1000)
+
+
+ class Response(BaseModel):
+     predictions: dict
+     model_info: Dict[str, str] = {}
+
+
+ app = FastAPI(title="toxicity-api")
+
+
+ @app.get("/")
+ def root():
+     return {
+         "message": "Toxicity Prediction API",
+         "endpoints": {
+             "/metadata": "GET - API metadata and capabilities",
+             "/healthz": "GET - Health check",
+             "/predict": "POST - Predict toxicity for SMILES",
+         },
+         "usage": "Send POST to /predict with {'smiles': ['your_smiles_here']} and Authorization header",
+     }
+
+
+ @app.get("/metadata")
+ def metadata():
+     return {
+         "name": "AwesomeTox",
+         "version": "1.0.0",
+         "max_batch_size": 256,
+         "tox_endpoints": [
+             "NR-AR",
+             "NR-AR-LBD",
+             "NR-AhR",
+             "NR-Aromatase",
+             "NR-ER",
+             "NR-ER-LBD",
+             "NR-PPAR-gamma",
+             "SR-ARE",
+             "SR-ATAD5",
+             "SR-HSE",
+             "SR-MMP",
+             "SR-p53",
+         ],
+     }
+
+
+ @app.get("/healthz")
+ def healthz():
+     return {"ok": True}
+
+
+ @app.post("/predict", response_model=Response)
+ def predict(request: Request, authorization: Optional[str] = Header(None)):
+     # verify the Authorization header against the API key configured in the Space secrets
+     if API_KEY and authorization != f"Bearer {API_KEY}":
+         raise HTTPException(status_code=401, detail="Invalid or missing API key")
+     predictions = predict_func(request.smiles)
+     return {
+         "predictions": predictions,
+         "model_info": {"name": "tox21_xgb_classifier", "version": "1.0.0"},
+     }
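A minimal sketch of calling the deployed endpoint (the Space URL is a placeholder, the token must match the `API_KEY` secret, and the `requests` package is assumed to be available):

```python
import requests

resp = requests.post(
    "https://<your-space>.hf.space/predict",  # placeholder: your Space's URL
    headers={"Authorization": "Bearer <API_KEY>"},  # placeholder: your API key
    json={"smiles": ["CCO", "c1ccccc1"]},
)
resp.raise_for_status()
print(resp.json()["predictions"])
```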
predict.py ADDED
@@ -0,0 +1,56 @@
+ """
+ This file includes the predict function for Tox21.
+ It takes a list of SMILES as input and outputs a nested dictionary with
+ SMILES and target names as keys.
+ """
+
+ # ---------------------------------------------------------------------------------------
+ # Dependencies
+ from collections import defaultdict
+
+ from src.data import preprocess_molecules
+ from src.model import Tox21XGBClassifier
+
+ # ---------------------------------------------------------------------------------------
+
+
+ def predict(smiles_list: list[str]) -> dict:
+     """Applies the classifier to a list of SMILES strings. Returns prediction=0.0 for
+     any molecule that could not be cleaned.
+
+     Args:
+         smiles_list (list[str]): list of SMILES strings
+
+     Returns:
+         dict: nested prediction dictionary, following {'<smiles>': {'<target>': <pred>}}
+     """
+     print(f"Received {len(smiles_list)} SMILES strings")
+     # preprocessing pipeline
+     features, clean_mol_mask = preprocess_molecules(
+         smiles_list,
+         load_ecdf_path="assets/ecdfs.pkl",
+         load_scaler_path="assets/scaler.pkl",
+     )
+     # indices of molecules that were dropped during cleaning
+     removed_idxs = {i for i, is_clean in enumerate(clean_mol_mask) if not is_clean}
+     print(f"{len(removed_idxs)} molecules removed during cleaning")
+
+     # setup model
+     model = Tox21XGBClassifier(seed=42)
+     model.load_model("assets/xgb_alltasks.joblib")
+
+     # make predictions
+     predictions = defaultdict(dict)
+     # make smiles list with same num_samples as features
+     clean_smiles = [smi for i, smi in enumerate(smiles_list) if i not in removed_idxs]
+     no_pred_smiles = [smi for i, smi in enumerate(smiles_list) if i in removed_idxs]
+
+     for target in model.tasks:
+         target_pred = model.predict(target, features)
+         for i, smiles in enumerate(clean_smiles):
+             predictions[smiles][target] = target_pred[i]
+
+         for smiles in no_pred_smiles:
+             predictions[smiles][target] = 0.0
+
+     return predictions
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ fastapi
+ uvicorn[standard]
+ statsmodels
+ rdkit
+ numpy
+ scikit-learn==1.7.1
+ joblib
+ tabulate
+ datasets
+ xgboost==3.0.5
src/__init__.py ADDED
File without changes
src/data.py ADDED
@@ -0,0 +1,198 @@
+ # pipeline taken from https://huggingface.co/spaces/ml-jku/mhnfs/blob/main/src/data_preprocessing/create_descriptors.py
+
+ """
+ This file includes the data preprocessing for Tox21.
+ It takes a list of SMILES as input and outputs normalized feature vectors
+ (ECFP fingerprints concatenated with RDKit descriptor quantiles).
+ """
+
+ import os
+
+ import numpy as np
+
+ from sklearn.preprocessing import StandardScaler
+ from statsmodels.distributions.empirical_distribution import ECDF
+
+ from rdkit import Chem, DataStructs
+ from rdkit.Chem import Descriptors, rdFingerprintGenerator
+ from rdkit.Chem.rdchem import Mol
+
+ from src.utils import USED_200_DESCR, Standardizer, load_pickle, write_pickle
+
+
+ def preprocess_molecules(
+     smiles_list: list[str],
+     load_ecdf_path: str = "",
+     load_scaler_path: str = "",
+     save_ecdf_path: str = "",
+     save_scaler_path: str = "",
+ ) -> tuple[np.ndarray, list[bool]]:
+     """Preprocessing pipeline for a list of molecules.
+
+     Args:
+         smiles_list (list[str]): list of SMILES
+         load_ecdf_path (str, optional): Path to load ECDFs from. Defaults to "".
+         load_scaler_path (str, optional): Path to load fitted StandardScaler from. Defaults to "".
+         save_ecdf_path (str, optional): Path to save calculated ECDFs. Defaults to "".
+         save_scaler_path (str, optional): Path to save fitted StandardScaler. Defaults to "".
+
+     Returns:
+         np.ndarray: normalized ECFP fingerprints and RDKit descriptor quantiles
+         list[bool]: mask that contains False at index `i` if the molecule in `smiles_list` at
+             index `i` could not be cleaned and was removed.
+     """
+
+     assert not (
+         load_ecdf_path and save_ecdf_path
+     ), "Cannot pass 'load_ecdf_path' and 'save_ecdf_path' simultaneously"
+     assert not (
+         load_scaler_path and save_scaler_path
+     ), "Cannot pass 'load_scaler_path' and 'save_scaler_path' simultaneously"
+
+     ecdfs = (
+         load_pickle(load_ecdf_path)
+         if load_ecdf_path and os.path.exists(load_ecdf_path)
+         else None
+     )
+     scaler = (
+         load_pickle(load_scaler_path)
+         if load_scaler_path and os.path.exists(load_scaler_path)
+         else None
+     )
+
+     # Create cleaned RDKit mol objects
+     mols, clean_mol_mask = create_cleaned_mol_objects(smiles_list)
+     print("Cleaned molecules")
+
+     # Create fingerprints and descriptors
+     ecfps = create_ecfp_fps(mols)
+     print("Created ECFP fingerprints")
+     rdkit_descrs = create_rdkit_descriptors(mols)
+     print("Created RDKit descriptors")
+
+     # Create and save ECDFs
+     if ecdfs is None:
+         print("Create ECDFs")
+         ecdfs = []
+         for column in range(rdkit_descrs.shape[1]):
+             raw_values = rdkit_descrs[:, column].reshape(-1)
+             ecdfs.append(ECDF(raw_values))
+         if save_ecdf_path:
+             write_pickle(save_ecdf_path, ecdfs)
+             print(f"Saved ECDFs under {save_ecdf_path}")
+
+     # Create quantiles
+     rdkit_descr_quantiles = create_quantiles(rdkit_descrs, ecdfs)
+     print("Created quantiles of RDKit descriptors")
+
+     # Concatenate features
+     raw_features = np.concatenate((ecfps, rdkit_descr_quantiles), axis=1)
+
+     if scaler is None:
+         scaler = StandardScaler()
+         scaler.fit(raw_features)
+         print("Fitted the StandardScaler")
+         if save_scaler_path:
+             write_pickle(save_scaler_path, scaler)
+             print(f"Saved the StandardScaler under {save_scaler_path}")
+
+     # Normalize feature vectors
+     normalized_features = scaler.transform(raw_features)
+     print("Normalized the molecule features")
+
+     return normalized_features, clean_mol_mask
+
+
+ def create_cleaned_mol_objects(smiles: list[str]) -> tuple[list[Mol], list[bool]]:
+     """This function creates cleaned RDKit mol objects from a list of SMILES.
+
+     Args:
+         smiles (list[str]): list of SMILES
+
+     Returns:
+         list[Mol]: list of cleaned molecules
+         list[bool]: mask that contains False at index `i` if the molecule in `smiles` at
+             index `i` could not be cleaned and was removed.
+     """
+     sm = Standardizer(canon_taut=True)
+
+     clean_mol_mask = list()
+     mols = list()
+     for smile in smiles:
+         mol = Chem.MolFromSmiles(smile)
+         standardized_mol, _ = sm.standardize_mol(mol)
+         is_cleaned = standardized_mol is not None
+         clean_mol_mask.append(is_cleaned)
+         if not is_cleaned:
+             continue
+         can_mol = Chem.MolFromSmiles(Chem.MolToSmiles(standardized_mol))
+         mols.append(can_mol)
+
+     return mols, clean_mol_mask
+
+
+ def create_ecfp_fps(mols: list[Mol]) -> np.ndarray:
+     """This function creates ECFP fingerprints for a list of molecules.
+
+     Args:
+         mols (list[Mol]): list of molecules
+
+     Returns:
+         np.ndarray: ECFP fingerprints of molecules
+     """
+     ecfps = list()
+
+     for mol in mols:
+         fp_sparse_vec = rdFingerprintGenerator.GetCountFPs(
+             [mol], fpType=rdFingerprintGenerator.MorganFP
+         )[0]
+         fp = np.zeros((0,), np.int8)
+         DataStructs.ConvertToNumpyArray(fp_sparse_vec, fp)
+
+         ecfps.append(fp)
+
+     return np.array(ecfps)
+
+
+ def create_rdkit_descriptors(mols: list[Mol]) -> np.ndarray:
+     """This function creates RDKit descriptors for a list of molecules.
+
+     Args:
+         mols (list[Mol]): list of molecules
+
+     Returns:
+         np.ndarray: RDKit descriptors of molecules
+     """
+     rdkit_descriptors = list()
+
+     for mol in mols:
+         descrs = []
+         for _, descr_calc_fn in Descriptors._descList:
+             descrs.append(descr_calc_fn(mol))
+
+         descrs = np.array(descrs)
+         descrs = descrs[USED_200_DESCR]
+         rdkit_descriptors.append(descrs)
+
+     return np.array(rdkit_descriptors)
+
+
+ def create_quantiles(raw_features: np.ndarray, ecdfs: list) -> np.ndarray:
+     """Create quantile values for the given features using the column-wise ECDFs.
+
+     Args:
+         raw_features (np.ndarray): values to put into quantiles
+         ecdfs (list): ECDFs to use
+
+     Returns:
+         np.ndarray: computed quantiles
+     """
+     quantiles = np.zeros_like(raw_features)
+
+     for column in range(raw_features.shape[1]):
+         raw_values = raw_features[:, column].reshape(-1)
+         ecdf = ecdfs[column]
+         q = ecdf(raw_values)
+         quantiles[:, column] = q
+
+     return quantiles
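A minimal sketch of the fit/reuse pattern for these preprocessing assets (the paths are the repo defaults; `train_smiles` and `test_smiles` are placeholder lists of SMILES strings):

```python
from src.data import preprocess_molecules

# at training time: fit the ECDFs and scaler on the training SMILES and persist them
train_features, train_mask = preprocess_molecules(
    train_smiles,
    save_ecdf_path="assets/ecdfs.pkl",
    save_scaler_path="assets/scaler.pkl",
)

# at inference time: reuse the persisted assets so features match training
test_features, test_mask = preprocess_molecules(
    test_smiles,
    load_ecdf_path="assets/ecdfs.pkl",
    load_scaler_path="assets/scaler.pkl",
)
```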
src/model.py ADDED
@@ -0,0 +1,80 @@
+ """
+ This file includes an XGBoost model wrapper for Tox21.
+ It holds one XGBClassifier per Tox21 task and makes predictions from
+ precomputed molecule features.
+ """
+
+ # ---------------------------------------------------------------------------------------
+ # Dependencies
+ import os
+ import joblib
+
+ import numpy as np
+ from xgboost import XGBClassifier
+
+ from src.utils import TASKS
+
+
+ # ---------------------------------------------------------------------------------------
+ class Tox21XGBClassifier:
+     """An XGBoost classifier that assigns a toxicity score to a given SMILES string."""
+
+     def __init__(self, seed: int = 42):
+         """Initialize an XGBoost classifier for each of the 12 Tox21 tasks.
+
+         Args:
+             seed (int, optional): seed for XGBoost to ensure reproducibility. Defaults to 42.
+         """
+         self.tasks = TASKS
+         self.model = {
+             task: XGBClassifier(n_estimators=1000, random_state=seed, n_jobs=8)
+             for task in self.tasks
+         }
+
+     def load_model(self, path: str) -> None:
+         """Loads the model from a given path
+
+         Args:
+             path (str): path to model checkpoint
+         """
+         self.model = joblib.load(path)
+
+     def save_model(self, path: str) -> None:
+         """Saves the model to a given path
+
+         Args:
+             path (str): path to save model to
+         """
+         dirname = os.path.dirname(path)
+         if dirname:
+             os.makedirs(dirname, exist_ok=True)
+
+         joblib.dump(self.model, path)
+
+     def fit(self, task: str, input_features: np.ndarray, labels: np.ndarray) -> None:
+         """Train XGBoost for a given task
+
+         Args:
+             task (str): task to train
+             input_features (np.ndarray): training features
+             labels (np.ndarray): training labels
+         """
+         assert task in self.tasks, f"Unknown task: {task}"
+         self.model[task].fit(input_features, labels)
+
+     def predict(self, task: str, features: np.ndarray) -> np.ndarray:
+         """Predicts labels for a given Tox21 target using molecule features
+
+         Args:
+             task (str): the Tox21 target to predict for
+             features (np.ndarray): molecule features used for prediction
+
+         Returns:
+             np.ndarray: predicted probability for positive class
+         """
+         assert task in self.tasks, f"Unknown task: {task}"
+         assert (
+             len(features.shape) == 2
+         ), f"Function expects 2D np.array. Current shape: {features.shape}"
+         preds = self.model[task].predict_proba(features)
+         return preds[:, 1]
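A minimal sketch of the per-task API (assumes the preprocessing assets and the checkpoint written by `src/train.py` exist):

```python
from src.data import preprocess_molecules
from src.model import Tox21XGBClassifier

# features must come from the same preprocessing pipeline used for training
features, _ = preprocess_molecules(
    ["CCO", "c1ccccc1"],
    load_ecdf_path="assets/ecdfs.pkl",
    load_scaler_path="assets/scaler.pkl",
)

model = Tox21XGBClassifier(seed=42)
model.load_model("assets/xgb_alltasks.joblib")

for task in model.tasks:
    print(task, model.predict(task, features))  # positive-class probabilities
```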
src/train.py ADDED
@@ -0,0 +1,92 @@
+ """
+ Script for fitting and saving any preprocessing assets, as well as the fitted XGBoost model
+ """
+
+ import argparse
+
+ import numpy as np
+
+ from tabulate import tabulate
+ from datasets import load_dataset
+ from sklearn.metrics import roc_auc_score
+
+ from src.data import preprocess_molecules
+ from src.model import Tox21XGBClassifier
+ from src.utils import HF_TOKEN
+
+ parser = argparse.ArgumentParser(description="XGBoost training script for the Tox21 dataset")
+
+ parser.add_argument(
+     "--save_path_model",
+     type=str,
+     default="assets/xgb_alltasks.joblib",
+ )
+
+ parser.add_argument(
+     "--path_ecdfs",
+     type=str,
+     default="assets/ecdfs.pkl",
+ )
+
+ parser.add_argument(
+     "--path_scaler",
+     type=str,
+     default="assets/scaler.pkl",
+ )
+
+
+ def main(args):
+     ds = load_dataset("tschouis/tox21", token=HF_TOKEN)
+
+     print("Preprocess train molecules")
+     train_smiles = list(ds["train"]["smiles"])
+
+     train_features, train_mol_mask = preprocess_molecules(
+         train_smiles,
+         save_ecdf_path=args.path_ecdfs,
+         save_scaler_path=args.path_scaler,
+     )
+
+     print("Preprocess validation molecules")
+     val_smiles = list(ds["validation"]["smiles"])
+     val_features, val_mol_mask = preprocess_molecules(
+         val_smiles,
+         load_ecdf_path=args.path_ecdfs,
+         load_scaler_path=args.path_scaler,
+     )
+
+     model = Tox21XGBClassifier(seed=42)
+     print("Start training.")
+     for task in model.tasks:
+         task_labels = ds["train"].to_pandas()[task].to_numpy()
+         task_labels = task_labels[train_mol_mask]
+
+         label_mask = ~np.isnan(task_labels)
+
+         print(f"Fit task {task} using {sum(label_mask)} samples")
+         model.fit(task, train_features[label_mask], task_labels[label_mask].astype(int))
+
+     print(f"Save model under {args.save_path_model}")
+     model.save_model(args.save_path_model)
+
+     print("Evaluate model")
+     results = {}
+     for task in model.tasks:
+         task_labels = ds["validation"].to_pandas()[task].to_numpy()
+         task_labels = task_labels[val_mol_mask]
+
+         label_mask = ~np.isnan(task_labels)
+
+         pred = model.predict(task, val_features[label_mask])
+         results[task] = [
+             roc_auc_score(y_true=task_labels[label_mask].astype(int), y_score=pred)
+         ]
+
+     print("Results:")
+     print(tabulate(results, headers="keys"))
+     print("Average: ", sum([val[0] for val in results.values()]) / len(results))
+
+
+ if __name__ == "__main__":
+     args = parser.parse_args()
+     main(args)
src/utils.py ADDED
@@ -0,0 +1,265 @@
+ ## These MolStandardizer classes are due to Paolo Tosco
+ ## They were taken from the FS-Mol GitHub repository
+ ## (https://github.com/microsoft/FS-Mol/blob/main/fs_mol/preprocessing/utils/
+ ## standardizer.py)
+ ## They ensure that a sequence of standardization operations is applied
+ ## https://gist.github.com/ptosco/7e6b9ab9cc3e44ba0919060beaed198e
+
+ import os
+ import pickle
+
+ from rdkit import Chem
+ from rdkit.Chem.MolStandardize import rdMolStandardize
+
+ HF_TOKEN = os.environ.get("HF_TOKEN")
+
+ TASKS = [
+     "NR-AR",
+     "NR-AR-LBD",
+     "NR-AhR",
+     "NR-Aromatase",
+     "NR-ER",
+     "NR-ER-LBD",
+     "NR-PPAR-gamma",
+     "SR-ARE",
+     "SR-ATAD5",
+     "SR-HSE",
+     "SR-MMP",
+     "SR-p53",
+ ]
+
+ # Indices of the 200 RDKit descriptors used as features (0-16 and 25-207)
+ USED_200_DESCR = [
+     0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
+     10, 11, 12, 13, 14, 15, 16, 25, 26, 27,
+     28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
+     38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
+     48, 49, 50, 51, 52, 53, 54, 55, 56, 57,
+     58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
+     68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
+     78, 79, 80, 81, 82, 83, 84, 85, 86, 87,
+     88, 89, 90, 91, 92, 93, 94, 95, 96, 97,
+     98, 99, 100, 101, 102, 103, 104, 105, 106, 107,
+     108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
+     118, 119, 120, 121, 122, 123, 124, 125, 126, 127,
+     128, 129, 130, 131, 132, 133, 134, 135, 136, 137,
+     138, 139, 140, 141, 142, 143, 144, 145, 146, 147,
+     148, 149, 150, 151, 152, 153, 154, 155, 156, 157,
+     158, 159, 160, 161, 162, 163, 164, 165, 166, 167,
+     168, 169, 170, 171, 172, 173, 174, 175, 176, 177,
+     178, 179, 180, 181, 182, 183, 184, 185, 186, 187,
+     188, 189, 190, 191, 192, 193, 194, 195, 196, 197,
+     198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
+ ]
+
+
+ class Standardizer:
+     """
+     Simple wrapper class around the RDKit MolStandardize facilities.
+     """
+
+     DEFAULT_CANON_TAUT = False
+     DEFAULT_METAL_DISCONNECT = False
+     MAX_TAUTOMERS = 100
+     MAX_TRANSFORMS = 100
+     MAX_RESTARTS = 200
+     PREFER_ORGANIC = True
+
+     def __init__(
+         self,
+         metal_disconnect=None,
+         canon_taut=None,
+     ):
+         """
+         Constructor.
+         All parameters are optional.
+         :param metal_disconnect: if True, metallorganic complexes are
+                                  disconnected
+         :param canon_taut: if True, molecules are converted to their
+                            canonical tautomer
+         """
+         super().__init__()
+         if metal_disconnect is None:
+             metal_disconnect = self.DEFAULT_METAL_DISCONNECT
+         if canon_taut is None:
+             canon_taut = self.DEFAULT_CANON_TAUT
+         self._canon_taut = canon_taut
+         self._metal_disconnect = metal_disconnect
+         self._taut_enumerator = None
+         self._uncharger = None
+         self._lfrag_chooser = None
+         self._metal_disconnector = None
+         self._normalizer = None
+         self._reionizer = None
+         self._params = None
+
+     @property
+     def params(self):
+         """Return the MolStandardize CleanupParameters."""
+         if self._params is None:
+             self._params = rdMolStandardize.CleanupParameters()
+             self._params.maxTautomers = self.MAX_TAUTOMERS
+             self._params.maxTransforms = self.MAX_TRANSFORMS
+             self._params.maxRestarts = self.MAX_RESTARTS
+             self._params.preferOrganic = self.PREFER_ORGANIC
+             self._params.tautomerRemoveSp3Stereo = False
+         return self._params
+
+     @property
+     def canon_taut(self):
+         """Return whether tautomer canonicalization will be done."""
+         return self._canon_taut
+
+     @property
+     def metal_disconnect(self):
+         """Return whether metallorganic complexes will be disconnected."""
+         return self._metal_disconnect
+
+     @property
+     def taut_enumerator(self):
+         """Return the TautomerEnumerator object."""
+         if self._taut_enumerator is None:
+             self._taut_enumerator = rdMolStandardize.TautomerEnumerator(self.params)
+         return self._taut_enumerator
+
+     @property
+     def uncharger(self):
+         """Return the Uncharger object."""
+         if self._uncharger is None:
+             self._uncharger = rdMolStandardize.Uncharger()
+         return self._uncharger
+
+     @property
+     def lfrag_chooser(self):
+         """Return the LargestFragmentChooser object."""
+         if self._lfrag_chooser is None:
+             self._lfrag_chooser = rdMolStandardize.LargestFragmentChooser(
+                 self.params.preferOrganic
+             )
+         return self._lfrag_chooser
+
+     @property
+     def metal_disconnector(self):
+         """Return the MetalDisconnector object."""
+         if self._metal_disconnector is None:
+             self._metal_disconnector = rdMolStandardize.MetalDisconnector()
+         return self._metal_disconnector
+
+     @property
+     def normalizer(self):
+         """Return the Normalizer object."""
+         if self._normalizer is None:
+             self._normalizer = rdMolStandardize.Normalizer(
+                 self.params.normalizationsFile, self.params.maxRestarts
+             )
+         return self._normalizer
+
+     @property
+     def reionizer(self):
+         """Return the Reionizer object."""
+         if self._reionizer is None:
+             self._reionizer = rdMolStandardize.Reionizer(self.params.acidbaseFile)
+         return self._reionizer
+
+     def charge_parent(self, mol_in):
+         """Sequentially apply a series of MolStandardize operations:
+         * MetalDisconnector
+         * Normalizer
+         * Reionizer
+         * LargestFragmentChooser
+         * Uncharger
+         The net result is that a desalted, normalized, neutral
+         molecule with implicit Hs is returned.
+         """
+         params = Chem.RemoveHsParameters()
+         params.removeAndTrackIsotopes = True
+         mol_in = Chem.RemoveHs(mol_in, params, sanitize=False)
+         if self._metal_disconnect:
+             mol_in = self.metal_disconnector.Disconnect(mol_in)
+         normalized = self.normalizer.normalize(mol_in)
+         Chem.SanitizeMol(normalized)
+         normalized = self.reionizer.reionize(normalized)
+         Chem.AssignStereochemistry(normalized)
+         normalized = self.lfrag_chooser.choose(normalized)
+         normalized = self.uncharger.uncharge(normalized)
+         # need this to reassess aromaticity on things like
+         # cyclopentadienyl, tropylium, azolium, etc.
+         Chem.SanitizeMol(normalized)
+         return Chem.RemoveHs(Chem.AddHs(normalized))
+
+     def standardize_mol(self, mol_in):
+         """
+         Standardize a single molecule.
+         :param mol_in: a Chem.Mol
+         :return: * (standardized Chem.Mol, n_taut) tuple
+                    if success. n_taut will be negative if
+                    tautomer enumeration was aborted due
+                    to reaching a limit
+                  * (None, error_msg) if failure
+         This calls self.charge_parent() and, if self._canon_taut
+         is True, runs tautomer canonicalization.
+         """
+         n_tautomers = 0
+         if isinstance(mol_in, Chem.Mol):
+             name = None
+             try:
+                 name = mol_in.GetProp("_Name")
+             except KeyError:
+                 pass
+             if not name:
+                 name = "NONAME"
+         else:
+             error = f"Expected SMILES or Chem.Mol as input, got {str(type(mol_in))}"
+             return None, error
+         try:
+             mol_out = self.charge_parent(mol_in)
+         except Exception as e:
+             error = f"charge_parent FAILED: {str(e).strip()}"
+             return None, error
+         if self._canon_taut:
+             try:
+                 res = self.taut_enumerator.Enumerate(mol_out, False)
+             except TypeError:
+                 # we are still on the pre-2021 RDKit API
+                 res = self.taut_enumerator.Enumerate(mol_out)
+             except Exception as e:
+                 # something else went wrong
+                 error = f"canon_taut FAILED: {str(e).strip()}"
+                 return None, error
+             n_tautomers = len(res)
+             if hasattr(res, "status"):
+                 completed = (
+                     res.status == rdMolStandardize.TautomerEnumeratorStatus.Completed
+                 )
+             else:
+                 # we are still on the pre-2021 RDKit API
+                 completed = len(res) < 1000
+             if not completed:
+                 n_tautomers = -n_tautomers
+             try:
+                 mol_out = self.taut_enumerator.PickCanonical(res)
+             except AttributeError:
+                 # we are still on the pre-2021 RDKit API
+                 mol_out = max(
+                     [(self.taut_enumerator.ScoreTautomer(m), m) for m in res]
+                 )[1]
+             except Exception as e:
+                 # something else went wrong
+                 error = f"canon_taut FAILED: {str(e).strip()}"
+                 return None, error
+         mol_out.SetProp("_Name", name)
+         return mol_out, n_tautomers
+
+
+ def load_pickle(path: str):
+     with open(path, "rb") as file:
+         content = pickle.load(file)
+     return content
+
+
+ def write_pickle(path: str, obj: object):
+     # create the parent directory if it does not exist yet
+     dirname = os.path.dirname(path)
+     if dirname:
+         os.makedirs(dirname, exist_ok=True)
+     with open(path, "wb") as file:
+         pickle.dump(obj, file)
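A minimal sketch of using the `Standardizer` on its own, mirroring how `src/data.py` calls it ("CCO" is an arbitrary example molecule):

```python
from rdkit import Chem
from src.utils import Standardizer

sm = Standardizer(canon_taut=True)
mol, n_taut = sm.standardize_mol(Chem.MolFromSmiles("CCO"))
if mol is not None:
    print(Chem.MolToSmiles(mol))  # standardized, canonical SMILES
else:
    print(f"standardization failed: {n_taut}")  # on failure the second value is an error message
```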