| --- |
| license: apache-2.0 |
| language: |
| - en |
| base_model: |
| - Salesforce/codet5-small |
| tags: |
| - cpp |
| - complete |
| --- |
| |
|
|
| # Codelander |
|
|
| --- |
|
|
| ## Overview |
|
|
| Codelander is a **CodeT5** model fine-tuned for **C++ code completion**. |
| Given a code prefix, it draws on **C++ syntax** and **common programming patterns** seen during training to suggest a continuation as you type. |
|
|
| --- |
|
|
| ## Key Features |
|
|
| - Context-aware completions for C++ functions, classes, and control structures |
| - Handles complex C++ syntax including **templates, STL, and modern C++ features** |
| - Trained on **competitive programming solutions** from high-quality Codeforces submissions |
| - Low latency suitable for **real-time editor integration** |
|
|
| --- |
|
|
| ## Model Performance |
|
|
| | Metric | Value | |
| |---------------------|---------| |
| | Training Loss | 1.2475 | |
| | Validation Loss | 1.0016 | |
| | Training Epochs | 3 | |
| | Training Steps | 14010 | |
| | Samples per second | 6.275 | |
|
|
| --- |
|
|
| ## Installation & Usage |
|
|
| ### Direct Integration with Hugging Face Transformers |
|
|
| ```python |
| from transformers import AutoModelForSeq2SeqLM, AutoTokenizer |
| |
| # Load the fine-tuned model and its tokenizer from the Hugging Face Hub |
| model = AutoModelForSeq2SeqLM.from_pretrained("outlander23/codelander") |
| tokenizer = AutoTokenizer.from_pretrained("outlander23/codelander") |
| |
| def get_completion(code_prefix, max_new_tokens=100): |
|     """Return a sampled completion for the given C++ code prefix.""" |
|     # Truncate the prompt to the model's 512-token context window |
|     inputs = tokenizer( |
|         f"complete C++ code: {code_prefix}", |
|         return_tensors="pt", truncation=True, max_length=512, |
|     ) |
|     outputs = model.generate( |
|         **inputs,  # passes input_ids and attention_mask |
|         max_new_tokens=max_new_tokens, |
|         temperature=0.7, |
|         top_p=0.9, |
|         do_sample=True, |
|     ) |
|     return tokenizer.decode(outputs[0], skip_special_tokens=True) |
| ``` |
|
|
| --- |
|
|
| ## Model Architecture |
|
|
| - Base Model: **Salesforce/codet5-base** |
| - Parameters: **220M** |
| - Context Window: **512 tokens** |
| - Fine-tuning: **Seq2Seq training on C++ code snippets** |
| - Training Time: **~5 hours** |
|
|
| --- |
|
|
| ## Training Data |
|
|
| - Dataset: **open-r1/codeforces-submissions** |
| - Selection: **Accepted C++ solutions only** |
| - Size: **50,000+ code samples** |
| - Processing: **Prefix-suffix pairs with random splits** |
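
The "prefix-suffix pairs with random splits" step could look roughly like the sketch below. This is an illustration under assumptions, not the actual preprocessing script: the function name, the character-level cut, and the `min_frac`/`max_frac` bounds are hypothetical. Only the `complete C++ code: ` prompt prefix is taken from the usage example.

```python
import random

PROMPT_PREFIX = "complete C++ code: "  # same prompt prefix as the usage example


def make_prefix_suffix_pair(source, rng, min_frac=0.2, max_frac=0.8):
    """Split one source file at a random character offset.

    min_frac and max_frac are hypothetical bounds that keep both
    halves non-trivial for reasonably long inputs.
    """
    lo = max(1, int(len(source) * min_frac))
    hi = max(lo, int(len(source) * max_frac))
    cut = rng.randint(lo, hi)  # inclusive on both ends
    return {"input": PROMPT_PREFIX + source[:cut], "target": source[cut:]}


rng = random.Random(0)  # seeded so the split is reproducible
sample = "int main() {\n    int n;\n    std::cin >> n;\n    std::cout << n * 2;\n}\n"
pair = make_prefix_suffix_pair(sample, rng)
```

Concatenating the prefix (minus the prompt) and the target reproduces the original submission, so each accepted solution can yield several training pairs by drawing different cut points.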
|
|
| --- |
|
|
| ## Limitations |
|
|
| - May generate syntactically correct but semantically incorrect code |
| - Limited knowledge of **domain-specific libraries** not present in training data |
| - May occasionally produce **incomplete code fragments** |
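
One inexpensive way to catch incomplete fragments before showing them to a user is a delimiter-balance check on the prefix plus the completion. The helper below is a hypothetical heuristic, not part of the model or of any library. It does not skip string literals, character literals, or comments, so treat its answer as a hint only.

```python
def has_balanced_delimiters(code):
    """Return True if (), [] and {} are balanced and properly nested.

    Heuristic only: brackets inside string literals, character
    literals, or comments are counted like any other bracket.
    """
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in code:
        if ch in "([{":
            stack.append(ch)
        elif ch in closers:
            # A closer must match the most recent unmatched opener
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


prefix = "int factorial(int n) {\n    if (n <= 1) {\n        return 1;\n    } else {\n"
completion = "        return n * factorial(n - 1);\n    }\n}\n"
is_complete = has_balanced_delimiters(prefix + completion)
```

A caller could request more tokens or discard the suggestion when the check fails.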
|
|
| --- |
|
|
| ## Example Completions |
|
|
| ### Example 1: Factorial Function |
|
|
| **Input:** |
| ```cpp |
| int factorial(int n) { |
|     if (n <= 1) { |
|         return 1; |
|     } else { |
| ``` |
|
|
| **Completion:** |
| ```cpp |
|         return n * factorial(n - 1); |
|     } |
| } |
| ``` |
|
|
| --- |
|
|
|
| ## Training Details |
|
|
| - Training completed on: **2025-08-28 12:51:09 UTC** |
| - Training epochs: **3/3** |
| - Total steps: **14010** |
| - Training loss: **1.2475** |
|
|
| ### Epoch Performance |
|
|
| | Epoch | Training Loss | Validation Loss | |
| |-------|---------------|-----------------| |
| | 1 | 1.2638 | 1.1004 | |
| | 2 | 1.1551 | 1.0250 | |
| | 3 | 1.1081 | 1.0016 | |
|
|
| --- |
|
|
| ## Compatibility |
|
|
| - Compatible with **Transformers 4.30.0+** |
| - Optimized for **Python 3.8+** |
| - Supports both **CPU and GPU inference** |
|
|
| --- |
|
|
| ## Credits |
|
|
| Made with ❤️ by **outlander23** |
|
|
| > "Good code is its own best documentation." - *Steve McConnell* |
|
|
| --- |