This is the final, chat-ready checkpoint in GGUF format. Note: for very short prompts, you should prefill the `<think>` tag at the start of the assistant response to ensure the model reasons properly.

Note: use the F32 GGUF for highest quality; BF16 is ideal; Q8_0 suits edge devices. (On a Samsung Z Fold 5 with flash attention, 5k context, and some layers offloaded to the GPU, throughput reaches around 40 tokens/s.)
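The prefill trick above can be sketched as plain prompt construction. This is an illustrative sketch assuming a ChatML-style chat template (as used by SmolLM2-family instruct models); verify against the model's actual chat template (e.g. via `tokenizer.apply_chat_template`) before relying on the exact format.

```python
# Sketch: prefill the assistant's <think> tag for a short prompt.
# Assumes a ChatML-style template; the exact template may differ.

def build_prompt(user_message: str, system: str = "You are a helpful assistant.") -> str:
    """Build a prompt whose assistant turn already opens with <think>."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n<think>"  # generation continues inside the reasoning block
    )

prompt = build_prompt("Hi!")
print(prompt.endswith("<think>"))  # True: the model picks up mid-reasoning
```

Because generation starts inside the `<think>` block, the model is forced to produce reasoning even when the prompt alone would not trigger it.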

Fijik-1.5 2.6B

Trained on H200, A100, and some RTX 2000 Ada GPUs, Fijik-1.5 2.6B boasts serious performance at a fraction of the price while keeping incredible inference speeds. The model runs at about 300 tokens/s on a single RTX 3080 (BF16 GGUF) and supports 32k context (in theory scalable to 128k with minimal quality loss) while keeping a memory footprint low enough that many users can share a single server, or a single user can run it on an edge device.

What it is

Fijik 1.5 is a generalist LLM with a knowledge cutoff of March 2025, though with limited information after July 2024. The original model was pre-trained on 2T tokens by Hugging Face (this model is based on SmolLM2-135M: https://huggingface.co/HuggingFaceTB/SmolLM2-135M). We then turned the original model into a 32-expert "Franken"-MoE. Obviously, after that stage it was nowhere near finished, so heavy CPT (continual pre-training) was done; this also allowed us to scale the context from 8,192 tokens to 32k, and technically the model should work up to 128k tokens.

This model is completely uncensored, and thus not ideal for production use cases where safety is a must.

The model should be used for:

  • General chat applications
  • A fun, quick local model
  • Code suggestions / generation
  • Fine-tuning for domain-specific tasks (e.g., front-end-only generation, title generation, tool calling, etc.)

The model should NOT be used for:

  • Anything that needs lots of knowledge (the model is too small for that)
  • Medical, legal, or other high-risk fields
  • Math (from internal testing, the model is not good at math, though it could be fine-tuned to excel at it)

Overall, it is a special little model: it has a different style from other similarly sized LLMs, is completely uncensored, and is a very small MoE.

Model information

| Feature | Value |
| --- | --- |
| Chat model? | Yes |
| architecture | Mixtral |
| max_position_embeddings | 32,768 |
| intermediate_size | 1,536 |
| num_hidden_layers | 30 |
| hidden_size | 576 |
| num_experts_per_tok | 4 |
| num_attention_heads | 9 |
| vocab_size | 49,166 |
| rope_theta | 500,000 |
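As a sanity check, the total and active parameter counts implied by this configuration can be recomputed. This is a sketch under stated assumptions: SmolLM2-style grouped-query attention with 3 KV heads, tied input/output embeddings, and the 32-expert / 4-active MoE layout described above; none of these are confirmed by the config table itself.

```python
# Rough parameter count from the config above (a sketch; assumes
# GQA with 3 KV heads, tied embeddings, 32 experts with 4 active).
hidden, inter, layers, heads, kv_heads, vocab = 576, 1536, 30, 9, 3, 49166
n_experts, active_experts = 32, 4
head_dim = hidden // heads  # 64

attn = hidden * hidden * 2 + hidden * (kv_heads * head_dim) * 2  # q,o + k,v projections
expert = 3 * hidden * inter       # gate/up/down projections per expert
router = hidden * n_experts       # MoE gate per layer
norms = 2 * hidden                # the two RMSNorms per layer

per_layer_total = attn + router + n_experts * expert + norms
per_layer_active = attn + router + active_experts * expert + norms
embed = vocab * hidden            # counted once (tied with the LM head)

total = layers * per_layer_total + embed + hidden   # + final norm
active = layers * per_layer_active + embed + hidden

print(f"total ~ {total/1e9:.2f}B, active ~ {active/1e6:.0f}M")
# prints: total ~ 2.60B, active ~ 374M
```

The result lines up with the model's "2.6B / a380m" naming (total vs. active parameters per token).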

CPT (continual pre-training)

To make a properly decent base model for the size, CPT had to be done, both to make the experts actual experts and to extend the context and improve the knowledge of the model.

The CPT data was ~60% synthetic and ~40% non-synthetic (across all CPT stages combined).

5 stages of continual pre-training were done.

Stage 1 started with low batch sizes, high noise, and forced diversity. The dataset included slightly lower-quality general 2025 Wiki articles, older Wiki articles, synthetic math from DeepSeek R1, a mix of synthetic and non-synthetic code, and some synthetic web datasets (like Cosmopedia). Stage 2 was similar but with larger batches, and at stage 3 gpt-oss reasoning traces were added. Through stage 4, overall cleaner datasets and slightly higher learning rates were used (full training, not LoRA, for all CPT stages). Stage 5 used a dataset very similar to stage 4's, but with added DeepSeek R1 reasoning traces, fewer sources, more data focused on code generation (from Qwen3 480B and DeepSeek R1), gpt-oss-generated and cleaned articles, and more 2025 data, at a 32k context length.

By doing this, the model got an effective knowledge cutoff date of March 2025, but with limited information past July 2024.

SFT (supervised fine-tuning)

For SFT, a ~549M-token high-quality, diverse dataset was used. It was almost completely synthetic, with many examples generated by DeepSeek R1 and Qwen3 80B.

Estimated data mix:

  • ~12% tool/json
  • ~27% code generation (front-end, backend, competitive coding)
  • ~43% general chats / instruction following
  • ~18% math

These percentages are estimated from the raw dataset mix; the real proportions are unknown.
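Applied to the ~549M-token total, the estimated mix translates into rough per-category token counts (illustrative arithmetic only; as noted, real proportions are unknown):

```python
# Approximate token counts implied by the estimated SFT mix.
total_tokens = 549_000_000
mix = {"tool/json": 0.12, "code": 0.27, "general chat": 0.43, "math": 0.18}

for name, share in mix.items():
    print(f"{name}: ~{share * total_tokens / 1e6:.0f}M tokens")
# e.g. general chat: ~236M tokens, math: ~99M tokens

assert abs(sum(mix.values()) - 1.0) < 1e-9  # shares cover the full dataset
```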

RL (reinforcement learning)

SFT was not enough, especially these days. After SFT, three different rounds of DPO (direct preference optimization) were done, which improved instruction following significantly; yet that was still not enough, so more RL was done.
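For reference, the DPO objective trains the policy to prefer the chosen response over the rejected one relative to a frozen reference model. A minimal numeric sketch of the standard DPO loss (toy log-probabilities, not values from this training run):

```python
import math

# Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)).
# The log-probabilities below are illustrative toy numbers.
def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """pi_*/ref_*: policy/reference log-probs of chosen (w) and rejected (l)."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy prefers the chosen response more than the reference does -> low loss.
low = dpo_loss(pi_w=-5.0, pi_l=-9.0, ref_w=-6.0, ref_l=-6.0)
# Policy prefers the rejected response -> high loss.
high = dpo_loss(pi_w=-9.0, pi_l=-5.0, ref_w=-6.0, ref_l=-6.0)
print(low < high)  # True
```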

After the three DPO stages, DeepSeek-R1-like GRPO was done. (Note: DPO and GRPO were done with LoRA, except for the final DPO stage discussed below.) The GRPO used very hard rewards, so the model had a "hard time" earning good reward, but this actually helped: before the GRPO stage(s), the model had significant looping issues, more incoherent outputs, and worse instruction following. GRPO helped it think for less time, loop less, and be better overall.
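The "group-relative" part of GRPO can be sketched in a few lines: rewards for a group of completions sampled from the same prompt are normalized within the group, so even under hard rewards where most samples score low, the rare high-scoring sample gets a strong positive advantage. The reward values below are illustrative.

```python
# Sketch of GRPO's group-relative advantage: normalize each sampled
# completion's reward against its group's mean and std (illustrative).
def group_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Hard rewards: most samples score ~0, one succeeds.
adv = group_advantages([0.0, 0.0, 1.0, 0.2])
print(adv)  # the successful sample gets a large positive advantage
```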

But still, a little more was done. After this, two final stages were done:

  • DPO (final): a different DPO dataset with more coding, stricter instruction following, and generalist chat (e.g., "Hi! What are you?"), run with full fine-tuning enabled (no LoRA).
  • GRPO (final): two epochs of the same dataset and rewards as the previous GRPO stages, as a last push.

Benchmarks

None done yet; coming soon.

How to run

Ideally, this model should be run with a system prompt, though it works perfectly fine without one. It uses standard Qwen3 tool calls, but it should be fine-tuned to excel at tool calling, as it currently has some issues with it.
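For reference, Qwen3-style tool calls use a hermes-style JSON block inside `<tool_call>` tags. A sketch of what one such block looks like (the `get_weather` tool is hypothetical; verify the exact format against the model's chat template):

```python
import json

# Sketch of a Qwen3-style (hermes-style) tool call block.
# "get_weather" is a hypothetical tool, not part of this model.
call = {"name": "get_weather", "arguments": {"city": "Paris"}}
block = "<tool_call>\n" + json.dumps(call) + "\n</tool_call>"
print(block)
```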

Recommended sampling parameters:

  • Temperature: 0.35
  • Top-k: 35
  • Repetition penalty: 1.1
  • Top-p: 0.85
  • Min-p: 0.1 (optional)
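With llama.cpp's OpenAI-compatible server (`llama-server`), these parameters can be passed per request. A sketch of the request body (the model name and endpoint are placeholders; field names follow llama.cpp's completion API, where repetition penalty is `repeat_penalty`):

```python
import json

# Request body for llama.cpp's /v1/chat/completions endpoint using the
# recommended sampling parameters (model name is a placeholder).
payload = {
    "model": "fijik-1.5-2.6b",
    "messages": [
        {"role": "user", "content": "Why is the sky blue?"},
    ],
    "temperature": 0.35,
    "top_k": 35,
    "top_p": 0.85,
    "min_p": 0.1,              # optional
    "repeat_penalty": 1.1,     # llama.cpp's name for repetition penalty
    "max_tokens": 1024,
}
print(json.dumps(payload, indent=2))
```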

Test it out with a simple prompt like "Why is the sky blue?" Keep in mind that this model supports multi-turn conversations, but it expects previous assistant responses to retain their reasoning; removing reasoning from previous responses could save compute and context, but would break the model.
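In practice, that means the conversation history passed back to the model should keep each assistant turn's `<think>...</think>` block intact. A sketch (the reasoning text itself is illustrative):

```python
# Multi-turn history for this model: previous assistant messages keep
# their <think>...</think> blocks. Stripping them would break the model.
history = [
    {"role": "user", "content": "Why is the sky blue?"},
    {
        "role": "assistant",
        "content": (
            "<think>The user asks about sky color; the key idea is "
            "Rayleigh scattering.</think>"
            "The sky looks blue because shorter (blue) wavelengths of "
            "sunlight scatter more strongly off air molecules."
        ),
    },
    {"role": "user", "content": "Does that also explain red sunsets?"},
]

# Sanity check: every prior assistant turn still carries its reasoning.
ok = all(
    "<think>" in m["content"] for m in history if m["role"] == "assistant"
)
print(ok)  # True
```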

When fine-tuning, you need at minimum 8 GB of memory for basic QLoRA at low context; ideally, 16 GB.

Special thanks

This wouldn't have been possible without HuggingFaceTB (they trained SmolLM2 135M), Unsloth, MergeKit, and Transformers.

For questions, open a community discussion.

GGUF comparison (Qwen3 0.6B vs Fijik 1.5 a380m, both at Q8_0)

(Note: single-shot; obviously F32 would provide better responses.)

Prompt: Why is the sky blue?

qwen3 0.6b

The sky appears blue because of **Rayleigh scattering** of sunlight by molecules in the Earth's atmosphere. Here's a breakdown:
  1. Sunlight Composition: The Sun emits light across the visible spectrum, with wavelengths ranging from red (longer) to violet (shorter).
  2. Scattering Mechanism: Shorter wavelengths (blue) are scattered more efficiently than longer ones (red or violet). This scattering causes blue light to spread out in all directions.
  3. Perception: Our eyes perceive the blue light as the dominant color, even though we can see the sun's rays. The red and violet parts of the spectrum are scattered less, so they don't reach our eyes as much.

This phenomenon explains why the sky appears blue despite the sun being visible in its natural form.

fijik 1.5 a0.38b

The sky looks blue **because** sunlight is scattered by air molecules and light waves are bent differently at different altitudes. Here’s how: - Sunlight travels through Earth’s atmosphere, hitting particles like dust or smoke (which scatter shorter wavelengths). - The sun’s rays have a "blue peak" in the blue part of the spectrum because it has a longer wavelength than violet/red. - At altitudes where ozone and water vapor are present, UV light from the Sun breaks up certain molecules into shorter wavelengths like ultraviolet (UV) radiation (which is weaker). Blue light scatters more efficiently, so it’s seen as blue in the sky at sunrise or sunset.

Why do some colors have different wavelengths?

  • Red/violet are scattered a lot by other molecules because they scatter shorter wavelengths better than violet and red.
    • Example: A 10 km altitude (26,000 m) above the Earth’s surface has an ozone layer with high concentrations of oxygen and nitrogen dioxide gas. The Sun emits UV radiation that scatters blue light more efficiently, making it visible at sunrise/sunset but less intense in other parts of the sky.
  • Blue is scattered by water molecules (oxygen and nitrogen) because they scatter shorter wavelengths better than violet or red.
    • Example: A 10 km altitude above Earth’s surface has a very high concentration of ozone, which absorbs UV radiation from the Sun but doesn’t reflect it back to space. Blue light scatters best in the atmosphere where oxygen and nitrogen are dense enough for ozone molecules to form stable layers (e.g., at about 250 km altitude).
  • Why some colors have different wavelengths?
    • UV radiation from the Sun has a shorter wavelength than visible light, so it’s scattered by air molecules more efficiently. Red and violet scatter best because they have longer wavelengths for their own reasons (e.g., red is about 405 nm, violet ~381 nm). Blue scatters even better because of its shorter wavelength.
    • Ozone/water vapor in the stratosphere are important for Earth’s health and climate: ozone protects us from UV radiation by absorbing UV light (a key factor contributing to skin cancer), while water droplets help trap moisture, reducing evaporation. The ozone layer is a shield against ultraviolet damage.

Key takeaway

The sky looks blue because sunlight scatters through the atmosphere like rain drops on a clear day, but it’s not just about color; it’s how Earth’s delicate balance of atmospheric gases (ozone, water vapor) and ozone/water vapor chemistry is critical for life. The colors you see are shaped by these complex interactions at different altitudes where UV radiation interacts with molecules in the atmosphere, making them visible to us as blue or violet.
