cono3-mini
Compact. Capable. Code-native.
cono3-mini is a 9B-parameter language model purpose-built for autonomous software engineering. Starting from the Qwen3.5-9B foundation, we applied LoRA-based supervised fine-tuning on a large-scale corpus of real coding agent sessions — teaching the model not just what to write, but how to navigate codebases, recover from mistakes, and work iteratively like a human developer.
Why cono3-mini?
Most code models are trained on static code completion. cono3-mini takes a different approach — it was fine-tuned on a large collection of real-world coding agent sessions where AI systems solved engineering problems end-to-end: reading files, running commands, interpreting errors, editing code, and verifying results.
This gives cono3-mini behaviors you won't find in a typical code LLM:
- Reads before writing — inspects existing code and project structure before making changes
- Handles failures gracefully — parses compiler/linter output and self-corrects
- Applies surgical edits — produces minimal diffs rather than rewriting entire files
- Reasons step-by-step — uses `<think>...</think>` blocks to decompose complex tasks (a parsing sketch follows this list)
- Supports 262K tokens — natively handles very large contexts, extensible beyond 1M
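As a minimal illustration of the reasoning format, the sketch below separates a `<think>...</think>` block from the final answer in a generated completion. The `split_reasoning` helper is hypothetical (it is not part of the model's API) and assumes at most one think block per completion.

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a completion into (reasoning, answer) around a <think> block.

    Hypothetical helper for illustration; assumes at most one block.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

reasoning, answer = split_reasoning(
    "<think>The function nests callbacks; convert each to await.</think>Here is the refactor..."
)
print(reasoning)  # -> The function nests callbacks; convert each to await.
print(answer)     # -> Here is the refactor...
```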
The model is fully open under Apache 2.0 with no usage restrictions.
Getting Started
Using Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "theblackhacker/cono3-mini"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = [
    {"role": "system", "content": "You are an expert software engineer."},
    {"role": "user", "content": "Refactor this function to use async/await instead of callbacks."},
]

# Render the chat template, then sample with the recommended settings.
# do_sample=True is required for temperature/top_p/top_k to take effect.
text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
out = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True, temperature=0.6, top_p=0.95, top_k=20
)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```
Using vLLM
```bash
vllm serve theblackhacker/cono3-mini --tensor-parallel-size 1 --max-model-len 65536
```
```python
from openai import OpenAI

# Point the OpenAI-compatible client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
resp = client.chat.completions.create(
    model="theblackhacker/cono3-mini",  # must match the name the server was started with
    messages=[{"role": "user", "content": "How do I set up a GitHub Actions CI pipeline for a Rust project?"}],
    temperature=0.6,
)
print(resp.choices[0].message.content)
```
How It Was Trained
cono3-mini was created by fine-tuning Qwen3.5-9B using LoRA (rank 64, alpha 32) on a curated dataset of agentic coding trajectories. The training data captures full coding sessions — not isolated Q&A pairs — from multiple frontier models, operating within real scaffolding environments.
| Setting | Value |
| --- | --- |
| Base | Qwen3.5-9B |
| Technique | LoRA SFT (r=64, α=32) |
| Data | Curated agentic coding sessions |
| Packing | Sequence packing |
| Tooling | Axolotl |
| Precision | bf16 |
| Optimizer | AdamW, cosine decay |
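For a sense of what the adapter setup looks like in code, here is a minimal sketch of an equivalent PEFT configuration. The `r` and `lora_alpha` values come from the table above; `target_modules` and `lora_dropout` are assumptions, since the card does not list them, and the actual run was configured through Axolotl rather than PEFT directly.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,               # from the table above
    lora_alpha=32,      # from the table above
    lora_dropout=0.05,  # assumed; not stated on the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```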
Under the Hood
cono3-mini retains the Qwen3.5 hybrid architecture that alternates Gated Delta Network layers (efficient linear attention) with standard multi-head attention. This design excels at long-range dependencies while keeping memory and compute practical at 9B scale.
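To see the layer pattern for yourself, you can inspect the model configuration. The sketch below is a rough probe, not an official API: the attribute that records per-layer mixer types (assumed here to be `layer_types`) varies across architectures and transformers versions.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("theblackhacker/cono3-mini")
# Hybrid configs often record which layers use linear attention vs. full
# attention; "layer_types" is an assumed attribute name and may differ.
print(getattr(config, "layer_types", config))
```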
Inference Settings
| Parameter | Value |
| --- | --- |
| Temperature | 0.6 |
| Top-P | 0.95 |
| Top-K | 20 |
| Presence Penalty | 0.0 |
Tip: For tool-use or agentic workflows, drop temperature to 0.2–0.4 for more predictable outputs.
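As a usage example for that tip, the request below reuses the vLLM server from Getting Started with a lower temperature; 0.3 is simply one point in the recommended 0.2–0.4 range.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
resp = client.chat.completions.create(
    model="theblackhacker/cono3-mini",
    # Lower temperature within the recommended 0.2-0.4 range for tool use.
    messages=[{"role": "user", "content": "List the shell commands to run this project's test suite, one per line."}],
    temperature=0.3,
)
print(resp.choices[0].message.content)
```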
Known Limitations
- Primarily evaluated on English-language tasks; multilingual performance may vary.
- Best results come from prompting patterns similar to the agent scaffolding used in training data.
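The exact scaffolding format used in training is not documented on this card, so the system prompt below is purely hypothetical: an illustration of the agent-style framing the second point refers to, not a reproduction of the training setup.

```python
# Hypothetical agent-style system prompt; the actual training scaffolding
# is not documented on this card.
SYSTEM_PROMPT = (
    "You are an autonomous software engineer. Read the relevant files before "
    "editing, make minimal diffs, run the tests after each change, and wrap "
    "your reasoning in <think>...</think> before your final answer."
)
```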
Credits
Built on top of Qwen3.5-9B by the Qwen team. Training powered by Axolotl.
Citation
```bibtex
@misc{cono3mini2026,
  title  = {cono3-mini: Autonomous Coding Agent Built on Qwen3.5-9B},
  author = {Cono3},
  year   = {2026},
  url    = {https://huggingface.co/theblackhacker/cono3-mini}
}
```