---
title: Tox21 Random Forest Classifier
emoji: 🚀
colorFrom: red
colorTo: purple
sdk: docker
pinned: false
license: cc-by-nc-4.0
short_description: Random Forest Baseline for Tox21
---

# Tox21 Random Forest Classifier

This repository hosts a Hugging Face Space that provides an API for submitting models to the [Tox21 Leaderboard](https://huggingface.co/spaces/ml-jku/tox21_leaderboard).

Here, **Random Forest (RF)** models are trained on the Tox21 dataset, and the trained models are provided for inference. A separate RF model is trained for each of the twelve toxic effects. The input to the model is a **SMILES** string of a small molecule, and the output is a set of 12 numeric values, one for each toxic effect in the Tox21 dataset.

**Important:** For leaderboard submission, your Space needs to include training code. The file `train.py` should train the model using the config specified inside the `config/` folder and save the final model parameters into a file inside the `checkpoints/` folder. The model should be trained using the [Tox21_dataset](https://huggingface.co/datasets/ml-jku/tox21) provided on Hugging Face. The datasets can be loaded like this:
```python
from datasets import load_dataset

# `token` is a Hugging Face access token with read permission for the dataset
ds = load_dataset("ml-jku/tox21", token=token)
train_df = ds["train"].to_pandas()
val_df = ds["validation"].to_pandas()
```

Additionally, the Space needs to implement inference in the `predict()` function inside `predict.py`. The `predict()` function must keep the provided skeleton: it should take a list of SMILES strings as input and return a nested prediction dictionary as output, with SMILES as keys and dictionaries containing targetname-prediction pairs as values. Therefore, any preprocessing of SMILES strings must be executed on-the-fly during inference.
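A minimal sketch of that skeleton is shown below. The target names and the `featurize` stub are placeholders, not the Space's real implementation; the actual code loads the trained RF model from `checkpoints/` and computes real descriptors on the fly:

```python
from typing import Dict, List

TARGETS = [f"target{i}" for i in range(1, 13)]  # placeholder target names

def featurize(smiles: str) -> list:
    # Stand-in for the on-the-fly SMILES -> descriptor preprocessing.
    return [float(len(smiles))]

def predict(smiles_list: List[str]) -> Dict[str, Dict[str, int]]:
    results = {}
    for smiles in smiles_list:
        features = featurize(smiles)
        # Stand-in prediction: a real implementation would query the
        # trained RF model for each of the 12 targets here.
        results[smiles] = {t: 0 for t in TARGETS}
    return results
```

The key contract is the shape of the return value: one entry per input SMILES, each mapping all 12 target names to a prediction.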

# Repository Structure
- `predict.py` - Defines the `predict()` function required by the leaderboard (entry point for inference).
- `app.py` - FastAPI application wrapper (can be used as-is).
- `preprocess.py` - Preprocesses SMILES strings to generate feature descriptors and saves the results as NPZ files in `data/`.
- `train.py` - Trains and saves a model using the config in the `config/` folder.
- `config/` - The config file used by `train.py`.
- `logs/` - All logs from `train.py`, the saved model, and predictions on the validation set.
- `data/` - The RF operates on numerical features; during preprocessing, `preprocess.py` creates two NPZ files of molecule features and saves them here.
- `checkpoints/` - Contains the saved model used by `predict.py`.

- `src/` - Core model & preprocessing logic:
    - `preprocess.py` - SMILES preprocessing logic
    - `model.py` - RF model class with processing, saving and loading logic
    - `utils.py` - utility functions

# Quickstart with Spaces

You can easily adapt this project to your own Hugging Face account:

- Open this Space on Hugging Face.

- Click "Duplicate this Space" (top-right corner).

- Modify `src/` for your preprocessing pipeline and model class.

- Modify `predict()` inside `predict.py` to perform model inference while keeping the function skeleton unchanged to remain compatible with the leaderboard.

- Modify `train.py` and/or `preprocess.py` according to your model and preprocessing pipeline.

- Modify the file inside `config/` to contain all hyperparameters that are set in `train.py`.
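As an illustration, the file inside `config/` might expose the RF hyperparameters like this (the key names are hypothetical; use whatever names your `train.py` actually reads):

```yaml
# Illustrative hyperparameters consumed by train.py
model:
  n_estimators: 500
  max_depth: null
  n_jobs: -1
  random_state: 42
```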

That’s it! Your model will then be available as an API endpoint for the Tox21 Leaderboard.

# Installation
To run (and train) the random forest, clone the repository and install dependencies:

```bash
git clone https://huggingface.co/spaces/ml-jku/tox21_rf_classifier
cd tox21_rf_classifier

conda create -n tox21_rf_cls python=3.11
conda activate tox21_rf_cls
pip install -r requirements.txt
```

# Training

To train the Random Forest model from scratch, run:

```bash
python preprocess.py
python train.py
```

These commands will:
1. Load and preprocess the Tox21 training dataset.
2. Train a Random Forest classifier.
3. Store the resulting model in the `checkpoints/` directory.
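The train-and-save step for a single target can be sketched roughly as follows. The file path, array shapes, and hyperparameters below are illustrative stand-ins; the real values come from the precomputed NPZ files in `data/` and the config in `config/`:

```python
import pickle
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the precomputed descriptors and labels of one target;
# the real pipeline loads these from the NPZ files in data/.
rng = np.random.default_rng(0)
X = rng.random((20, 8))
y = rng.integers(0, 2, 20)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Persist the fitted model so predict.py can load it later.
Path("checkpoints").mkdir(exist_ok=True)
with open("checkpoints/rf_target1.pkl", "wb") as f:  # illustrative path
    pickle.dump(clf, f)
```

In the actual Space, this loop runs once per toxic effect, yielding twelve fitted models in `checkpoints/`.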


# Inference

For inference, you only need `predict.py`.

Example usage inside Python:

```python
from predict import predict

smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
results = predict(smiles_list)

print(results)
```

The output will be a nested dictionary in the format:

```python
{
    "CCO": {"target1": 0, "target2": 1, ..., "target12": 0},
    "c1ccccc1": {"target1": 1, "target2": 0, ..., "target12": 1},
    "CC(=O)O": {"target1": 0, "target2": 0, ..., "target12": 0}
}
```
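For inspection, a nested dictionary in this shape maps directly onto a table, for example with pandas (the two targets shown are a shortened illustration of the full twelve):

```python
import pandas as pd

# Example result in the format returned by predict() (truncated to 2 targets).
results = {
    "CCO": {"target1": 0, "target2": 1},
    "c1ccccc1": {"target1": 1, "target2": 0},
}

# Rows are SMILES strings, columns are targets.
df = pd.DataFrame.from_dict(results, orient="index")
print(df)
```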

# Notes

- Adapting `predict.py`, `train.py`, `config/`, and `checkpoints/` is required for leaderboard submission.

- Preprocessing must also be implemented inside `predict.py`, not only in `train.py`, since `predict()` receives raw SMILES strings at inference time.