Is it bitnet {-1,0,1}?

by Remek - opened Apr 9, 2024

Apr 9, 2024

I looked through many bitnet1.58 implementations and noticed that they all use the method suggested in "The Era of 1-bit LLMs: Training Tips, Code and FAQ". The weights of the models currently trained with this recipe are not numbers in the set {-1, 0, 1} but values in the interval (0,1). Is this the way it should be?

The formula describing the quantization of weights ("The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits").
Implementation proposal ("The Era of 1-bit LLMs: Training Tips, Code and FAQ").
Weights quantization test.
Model during training.
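To make the question concrete, here is a minimal pure-Python sketch of absmean ternary quantization, assuming the scale is the mean absolute weight as in the paper's formula. The function names `absmean_ternary` and `fake_quant` and the sample weights are hypothetical, chosen only for illustration; this is not the code from the FAQ. It shows that multiplying the scale back in after rounding yields three distinct float values rather than {-1, 0, 1}.

```python
def absmean_ternary(weights, eps=1e-5):
    """Return (ternary, gamma): values in {-1, 0, 1} plus the absmean scale."""
    # gamma is the mean absolute weight, floored at eps to avoid division by zero.
    gamma = max(sum(abs(w) for w in weights) / len(weights), eps)
    # RoundClip: scale, round to the nearest integer, clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / gamma))) for w in weights]
    return ternary, gamma


def fake_quant(weights, eps=1e-5):
    """Quantize, then multiply the scale back in (hypothetical 'simulated' form)."""
    ternary, gamma = absmean_ternary(weights, eps)
    return [t * gamma for t in ternary]


# Illustrative weights only, not taken from any real model.
weights = [0.31, -0.02, -0.47, 0.08, 0.12]
ternary, gamma = absmean_ternary(weights)
dequant = fake_quant(weights)

print(ternary)               # [1, 0, -1, 0, 1]
print(sorted(set(dequant)))  # three distinct floats: -gamma, 0.0, +gamma
```

If a checkpoint stores the rescaled form, its weights would look like floats even though each tensor holds only three distinct values.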

qmsoqm

Apr 10, 2024

I think you're correct Remek. The scaling factor is only used inside of the RoundClip, not after it. I guess there is an error in their code.

That may also be why this model's weights take up so much space: they are stored in fp32, which shouldn't be necessary if its parameters are integers.

EasonWei

Jun 12, 2024

Also curious about it.

EasonWei

Jun 12, 2024

I think the reason is that they are using a high-precision GEMM to simulate the low-precision forward pass (int8 activation * ternary weight).
See the original BitNet paper (arXiv 2310.11453), eq. 11.

The correct process during linear forward is:

     # notice the following quant func will not include `/scale`
     x_quant, x_scaling_factor = activation_quant(x)
     w_quant, w_scaling_factor = weight_quant(w)
     output = low_precision_compute(x_quant, w_quant)
     output = output / x_scaling_factor / w_scaling_factor

Instead, they use a high-precision compute function to simulate this process, so the previous steps can be rewritten like this:

     # notice the following quant func will not include `/scale`
     x_quant, x_scaling_factor = activation_quant(x)
     x_quant = x_quant / x_scaling_factor
     w_quant, w_scaling_factor = weight_quant(w)
     w_quant = w_quant / w_scaling_factor
     output = high_precision_compute(x_quant, w_quant)
     # output = output / x_scaling_factor / w_scaling_factor

So I think their implementation is consistent with their claim.
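A small numerical check of that equivalence, as a hedged sketch: it assumes per-tensor absmax scaling to the int8 range for activations and absmean scaling for weights, and uses a plain dot product as a stand-in for both `low_precision_compute` and `high_precision_compute`. The helper bodies and sample values below are hypothetical and only illustrate the algebra.

```python
import math


def activation_quant(x):
    """Absmax quantization to the int8 range; returns (ints, scale)."""
    scale = 127.0 / max(max(abs(v) for v in x), 1e-5)
    return [max(-128, min(127, round(v * scale))) for v in x], scale


def weight_quant(w):
    """Absmean ternary quantization; returns (values in {-1, 0, 1}, scale)."""
    scale = 1.0 / max(sum(abs(v) for v in w) / len(w), 1e-5)
    return [max(-1, min(1, round(v * scale))) for v in w], scale


def dot(a, b):
    """Stand-in for a GEMM: dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


# Illustrative inputs only.
x = [0.5, -1.2, 0.03, 2.0]
w = [0.31, -0.02, -0.47, 0.08]

x_q, x_s = activation_quant(x)
w_q, w_s = weight_quant(w)

# Path A: integer dot product first, then divide by both scales.
out_int = dot(x_q, w_q) / x_s / w_s

# Path B: divide each operand by its scale first, then a float dot product.
out_sim = dot([v / x_s for v in x_q], [v / w_s for v in w_q])

print(out_int, out_sim)
assert math.isclose(out_int, out_sim, rel_tol=1e-9)
```

The two paths differ only in where the divisions happen, so they agree up to floating-point rounding. In path B the weight operand holds floats, which matches what Remek observed in the checkpoints.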

Huixiu

Sep 28, 2024

Yes, I think so, EasonWei! It is just for convenience: instead of dequantizing after the multiplication, the simulation applies each scaling factor inside its own quantization step.
