PEFT documentation
BOFT
BOFT
Orthogonal Butterfly (BOFT) is a generic method designed for finetuning foundation models. It improves the parameter efficiency of the finetuning paradigm — Orthogonal Finetuning (OFT), by taking inspiration from Cooley-Tukey fast Fourier transform, showing favorable results across finetuning different foundation models, including large vision transformers, large language models and text-to-image diffusion models.
The abstract from the paper is:
Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm — Orthogonal Finetuning (OFT) — for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language.
BOFT focuses on preserving a pretrained model’s generative capabilities while being significantly more parameter-efficient than standard OFT. Like OFT, BOFT maintains the same cosine similarity (hyperspherical energy) between all pairwise neurons in a layer by applying an orthogonal transformation to the pretrained weight matrix, ensuring the semantic relationships among neurons are preserved.
Instead of using a block-diagonal orthogonal matrix, BOFT factorizes the orthogonal transformation into a product of sparse butterfly matrices (originally introduced in the Cooley–Tukey FFT). Unlike OFT’s block-diagonal rotations, which only mix inputs within each block, the butterfly structure guarantees that every input can influence every output, producing a dense connectivity with just O(d log d) parameters. This factorization preserves expressivity while drastically reducing the parameter count compared to OFT (at the expense of computation time).
In practice, BOFT multiplies each pretrained weight matrix by a sequence of butterfly-structured orthogonal factors, enabling efficient and expressive neuron rotations. This makes BOFT well-suited for controllable generation and tasks where maintaining the pretrained model’s subject representation is critical, while also scaling to larger models with lower memory and compute overhead.
BOFT can be applied to any subset of weight matrices in a neural network to reduce the number of trainable parameters. Given the target layers for injecting BOFT parameters, the number of trainable parameters can be determined based on the size of the weight matrices.
Benchmark overview
Merge BOFT weights into the base model
Similar to LoRA, the weights learned by BOFT can be integrated into the pretrained weight matrices using the [~BOFTModel.merge_and_unload() function. This function merges the adapter weights with the base model which allows you to effectively use the newly merged model as a standalone model.

This works because during training, the orthogonal weight matrix (R in the diagram above) and the pretrained weight matrices are separate. But once training is complete, these weights can actually be merged (multiplied) into a new weight matrix that is equivalent.
BOFT Example Usage
For an example of the BOFT method application to various downstream tasks, please refer to the following guides:
Take a look at the following step-by-step guides on how to finetune a model with BOFT:
For the task of image classification, one can initialize the BOFT config for a DinoV2 model as follows:
import transformers
from transformers import AutoModelForSeq2SeqLM, BOFTConfig
from peft import BOFTConfig, get_peft_model
config = BOFTConfig(
boft_block_size=4,
boft_n_butterfly_factor=2,
target_modules=["query", "value", "key", "output.dense", "mlp.fc1", "mlp.fc2"],
boft_dropout=0.1,
bias="boft_only",
modules_to_save=["classifier"],
)
model = transformers.Dinov2ForImageClassification.from_pretrained(
"facebook/dinov2-large",
num_labels=100,
)
boft_model = get_peft_model(model, config)API
BOFTConfig
class peft.BOFTConfig
< source >( task_type: Optional[Union[str, TaskType]] = None peft_type: Optional[Union[str, PeftType]] = None auto_mapping: Optional[dict] = None peft_version: Optional[str] = None base_model_name_or_path: Optional[str] = None revision: Optional[str] = None inference_mode: bool = False boft_block_size: int = 4 boft_block_num: int = 0 boft_n_butterfly_factor: int = 1 target_modules: Optional[Union[list[str], str]] = None exclude_modules: Optional[Union[list[str], str]] = None boft_dropout: float = 0.0 fan_in_fan_out: bool = False bias: str = 'none' modules_to_save: Optional[list[str]] = None init_weights: bool = True layers_to_transform: Optional[Union[list[int], int]] = None layers_pattern: Optional[Union[list[str], str]] = None )
Parameters
- boft_block_size (
int) — BOFT matrix block size across different layers, expressed inint. Bigger block sizes results in more dense update matrices with more trainable parameters. Chooseboft_block_sizeto be divisible by most layer’s input dimension (in_features), e.g., 4, 8, 16. Also, please only specify eitherboft_block_sizeorboft_block_num, but not both simultaneously or leaving both to 0, becauseboft_block_sizexboft_block_nummust equal the layer’s input dimension. - boft_block_num (
int) — Number of BOFT blocks per injected layer. Biggerboft_block_numresult in sparser update matrices with fewer trainable parameters. Note, please chooseboft_block_numto be divisible by most layer’s input dimension (in_features), e.g., 4, 8, 16. Only specify eitherboft_block_sizeorboft_block_num, but not both simultaneously or leaving both to 0, becauseboft_block_sizexboft_block_nummust equal the layer’s input dimension. - boft_n_butterfly_factor (
int) — Number of butterfly factors across different layers. Forboft_n_butterfly_factor=1, BOFT is the same as vanilla OFT, forboft_n_butterfly_factor=2, the effective block size of OFT becomes twice as big and the number of blocks become half. - target_modules (
Union[List[str],str]) — The names of the modules to apply the adapter to. - exclude_modules (
Optional[Union[List[str], str]]) — The names of the modules to not apply the adapter. When passing a string, a regex match will be performed. When passing a list of strings, either an exact match will be performed or it is checked if the name of the module ends with any of the passed strings. - boft_dropout (
float) — The multiplicative dropout probability, by setting OFT blocks to identity during training, similar to the dropout layer in LoRA. - fan_in_fan_out (
bool) — Set this to True if the layer to replace stores weight like (fan_in, fan_out). For example, gpt-2 usesConv1Dwhich stores weights like (fan_in, fan_out) and hence this should be set toTrue. - bias (
str) — Bias type for BOFT. Can be ‘none’, ‘all’ or ‘boft_only’. If ‘all’ or ‘boft_only’, the corresponding biases will be updated during training. Be aware that this means that, even when disabling the adapters, the model will not produce the same output as the base model would have without adaptation. - modules_to_save (
List[str]) —List of modules apart from BOFT layers to be set as trainable and saved in the final checkpoint. - layers_to_transform (
Union[List[int],int]) — The layer indexes to transform, if this argument is specified, it will apply the BOFT transformations on the layer indexes that are specified in this list. If a single integer is passed, it will apply the BOFT transformations on the layer at this index. - layers_pattern (
Optional[Union[List[str], str]]) — The layer pattern name, used only iflayers_to_transformis different fromNoneand if the layer pattern is not in the common layers pattern. This should target thenn.ModuleListof the model, which is often called'layers'or'h'.
This is the configuration class to store the configuration of a BOFTModel.
BOFTModel
class peft.BOFTModel
< source >( model peft_config: Union[PeftConfig, dict[str, PeftConfig]] adapter_name: str low_cpu_mem_usage: bool = False state_dict: Optional[dict[str, torch.Tensor]] = None ) → torch.nn.Module
Parameters
- model (transformers.PreTrainedModel) — The model to be adapted.
- config (BOFTConfig) — The configuration of the BOFT model.
- adapter_name (
str) — The name of the adapter, defaults to"default". - low_cpu_mem_usage (
bool,optional, defaults toFalse) — Create empty adapter weights on meta device. Useful to speed up the loading process.
Returns
torch.nn.Module
The BOFT model.
Creates BOFT and OFT model from a pretrained transformers model. Paper: https://huggingface.co/papers/2311.06243 https://huggingface.co/papers/2306.07280
Example:
>>> import transformers
>>> from peft import BOFTConfig, get_peft_model
>>> config = BOFTConfig(
... boft_block_size=8,
... boft_n_butterfly_factor=1,
... target_modules=["query", "value", "key", "output.dense", "mlp.fc1", "mlp.fc2"],
... boft_dropout=0.1,
... bias="boft_only",
... modules_to_save=["classifier"],
... )
>>> model = transformers.Dinov2ForImageClassification.from_pretrained(
... "facebook/dinov2-large",
... num_labels=100,
... )
>>> boft_model = get_peft_model(model, config)Attributes:
- model (transformers.PreTrainedModel) — The model to be adapted.
- peft_config (BOFTConfig): The configuration of the BOFT model.