SmartBERT V3 CodeBERT

Overview

SmartBERT V3 is a domain-adapted pre-trained programming language model for smart contract code understanding, built upon CodeBERT-base-mlm.

The model continues pre-training from SmartBERT V2 on a substantially larger corpus of smart contracts, which improves robustness and yields richer semantic representations of function-level smart contract code.

SmartBERT V3 is particularly suitable for tasks such as:

Smart contract intent detection
Code representation learning
Code similarity analysis
Vulnerability detection
Smart contract classification

Compared with SmartBERT V2, this version expands the training corpus from 16,000 to 80,000 smart contracts, improving the model’s ability to capture semantic patterns in smart contract functions.

Training Data

SmartBERT V3 was trained on a total of 80,000 smart contracts, including:

16,000 contracts used in SmartBERT V2
64,000 additional smart contracts collected from public blockchain repositories

The contracts are written primarily in Solidity and are processed at the function level to better capture the fine-grained semantic structure of smart contract code.
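
The card does not describe how contracts were split into functions. As a rough illustration only, the sketch below pulls function definitions out of Solidity source text by counting braces. The helper `extract_functions` is hypothetical and is not the authors' pipeline; it ignores comments, string literals, constructors, and modifiers, so a real Solidity parser would be the safer choice in practice.

```python
import re


def extract_functions(source: str) -> list[str]:
    """Return the text of each `function` definition or declaration in source.

    Hypothetical helper for illustration; assumes well-formed code with no
    braces or the word `function` inside comments or string literals.
    """
    functions = []
    for match in re.finditer(r"\bfunction\b", source):
        start = match.start()
        depth = 0
        for i in range(start, len(source)):
            ch = source[i]
            if ch == ";" and depth == 0:
                # Bodiless declaration, e.g. inside an interface.
                functions.append(source[start:i + 1])
                break
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    # Closing brace of the function body.
                    functions.append(source[start:i + 1])
                if depth <= 0:
                    break
    return functions


contract = """
contract Token {
    function totalSupply() external view returns (uint256);
    function mint(uint256 amount) public { supply += amount; }
}
"""
for fn in extract_functions(contract):
    print(fn)
```

Each extracted string would then be one training sample, which matches the function-level granularity described above.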

Training Objective

The model is trained using the Masked Language Modeling (MLM) objective, following the same training paradigm as CodeBERT.

During training:

A subset of tokens in the input code is randomly masked
The model learns to predict these masked tokens from surrounding context

This process enables the model to learn deeper syntactic and semantic representations of smart contract programs.
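
The masking step can be sketched in a few lines of plain Python. This is a simplified illustration, not the training code: the masking rate of 0.15 is the common BERT-style default and is an assumption, since the card does not state the rate used. Library implementations such as the Transformers MLM data collator also sometimes substitute a random token or keep the original token instead of always inserting the mask, which this sketch omits. The name `mask_tokens` is hypothetical.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token, as used in the usage example


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with MASK_TOKEN; return (masked, labels).

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a loss would only be computed at masked positions.
    mask_prob=0.15 is an assumed default, not a value from the model card.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = "function totalSupply ( ) external view returns ( uint256 ) ;".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

The model sees `masked` as input and is trained to recover the non-`None` entries of `labels` from the surrounding tokens.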

Training Setup

Training was conducted using the HuggingFace Transformers framework.

Hardware: 2 × NVIDIA A100 (80 GB)
Training Duration: Over 30 hours
Training Dataset: 80,000 smart contracts
Evaluation Dataset: 1,500 smart contracts

Example training configuration:

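from transformers import TrainingArguments

# Placeholders: point these at your own output directory and, if resuming,
# at the checkpoint path to continue from.
OUTPUT_DIR = "./output"
checkpoint = None

training_args = TrainingArguments(
  output_dir=OUTPUT_DIR,
  overwrite_output_dir=True,
  num_train_epochs=20,
  per_device_train_batch_size=64,
  save_steps=10000,             # save a checkpoint every 10,000 steps
  save_total_limit=2,           # keep only the two most recent checkpoints
  evaluation_strategy="steps",  # named eval_strategy in recent Transformers releases
  eval_steps=10000,             # evaluate every 10,000 steps
  resume_from_checkpoint=checkpoint
)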

Preprocessing

During preprocessing, all newline (\n) and tab (\t) characters in the function code were replaced with a single space to ensure a consistent input format for tokenization.
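
A minimal sketch of this normalization step follows. The function name `flatten_whitespace` is hypothetical, and the card does not say whether a run of consecutive newlines and tabs collapses into one space; this version assumes each character is replaced one-to-one.

```python
def flatten_whitespace(code: str) -> str:
    """Replace every newline and tab character with a single space.

    Assumption: replacement is per character, so "\n\t" becomes two spaces.
    """
    return code.replace("\n", " ").replace("\t", " ")


snippet = "function totalSupply()\n\texternal view returns (uint256);"
print(flatten_whitespace(snippet))
```

If collapsed whitespace is preferred, `" ".join(code.split())` is a common alternative, though it also collapses runs of ordinary spaces. Whichever variant is used, inputs at inference time should go through the same step so they match the training format.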

Base Model

SmartBERT V3 builds upon the following models:

Original Model: CodeBERT-base-mlm (microsoft/codebert-base-mlm)
Intermediate Model: SmartBERT V2

Usage

Example usage with HuggingFace Transformers:

from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained('web3se/SmartBERT-v3')
tokenizer = RobertaTokenizer.from_pretrained('web3se/SmartBERT-v3')

code_example = "function totalSupply() external view <mask> (uint256);"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

outputs = fill_mask(code_example)
print(outputs)
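
For an input with a single mask, the fill-mask pipeline returns a list of candidate dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to rank such a list. The helper `top_predictions` is hypothetical, and `sample_outputs` contains made-up scores and token ids in the same shape as the pipeline output, not real predictions from SmartBERT V3.

```python
def top_predictions(outputs, k=3):
    """Return the k highest-scoring (token_str, score) pairs from fill-mask output."""
    ranked = sorted(outputs, key=lambda item: item["score"], reverse=True)
    # token_str often carries a leading space from the tokenizer, so strip it.
    return [(item["token_str"].strip(), item["score"]) for item in ranked[:k]]


# Made-up values for illustration only (token ids and sequences are placeholders).
sample_outputs = [
    {"score": 0.05, "token": 1, "token_str": " return", "sequence": "..."},
    {"score": 0.90, "token": 0, "token_str": " returns", "sequence": "..."},
    {"score": 0.01, "token": 2, "token_str": " view", "sequence": "..."},
]
print(top_predictions(sample_outputs, k=2))
```

With real model output, pass `outputs` from the example above in place of `sample_outputs`.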

How to Use

To train and deploy the SmartBERT V3 model for Web API services, please refer to our GitHub repository: web3se-lab/SmartBERT.

Citation

@article{huang2025smart,
  title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
  author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
  journal={arXiv preprint arXiv:2508.20086},
  year={2025}
}

Acknowledgement

Institute of Intelligent Computing Technology, Suzhou, CAS
Macau University of Science and Technology
CAS Mino (中科劢诺)

Paper: Smart Contract Intent Detection with Pre-trained Programming Language Model (arXiv:2508.20086, published Aug 27, 2025)