# TinyTimV1: Fine-tuning TinyLlama on Finnegans Wake
A project exploring the fine-tuning of TinyLlama-1.1B on James Joyce's Finnegans Wake to generate Joyce-inspired text.
## Overview
This project fine-tunes the TinyLlama-1.1B-Chat model on the complete text of James Joyce's Finnegans Wake, producing a language model that generates text in Joyce's distinctive experimental style. The model learns to replicate the complex wordplay, neologisms, and stream-of-consciousness narrative techniques characteristic of Joyce's final work.
## Files
- `process_wake.py` - Preprocesses the raw text, removes page numbers, and splits it into manageable chunks
- `fine_tune_joyce.py` - Main training script using HuggingFace Transformers
- `text_gen.py` - Text generation script for the fine-tuned model
- `finn_wake.txt` - Complete text of Finnegans Wake (1.51 MB)
- `finn_wake.csv` - Processed dataset in CSV format
- `finn_wake_dataset/` - Tokenized dataset directory
## Usage
### 1. Data Preprocessing
```bash
python process_wake.py
```
This removes page numbers and splits the text into 100-word chunks for training; the sketch below illustrates the idea.
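A minimal sketch of this step, assuming page numbers appear as bare digits on their own lines and that the CSV uses a `text` column; the actual `process_wake.py` may differ:

```python
import re

import pandas as pd

# Read the raw text of Finnegans Wake.
with open("finn_wake.txt", encoding="utf-8") as f:
    text = f.read()

# Assumption: page numbers sit alone on a line as bare digits.
text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)

# Split the text into 100-word chunks for training.
words = text.split()
chunks = [" ".join(words[i:i + 100]) for i in range(0, len(words), 100)]

# Write the processed dataset; the "text" column name is an assumption.
pd.DataFrame({"text": chunks}).to_csv("finn_wake.csv", index=False)
```

### 2. Fine-tuning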
```bash
python fine_tune_joyce.py
```
Fine-tunes TinyLlama on the processed dataset for 3 epochs on CPU.
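A hedged sketch of the training setup, matching the parameters listed under Model Details; the output directory name and exact `TrainingArguments` are assumptions, and `fine_tune_joyce.py` may differ:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama models ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize the 100-word chunks, truncating to the 128-token limit.
dataset = load_dataset("csv", data_files="finn_wake.csv")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tiny_tim_v1",        # assumed name
        num_train_epochs=3,
        per_device_train_batch_size=1,
        save_steps=500,                  # checkpoint every 500 steps
        no_cuda=True,                    # CPU training (use_cpu=True in newer versions)
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("tiny_tim_v1")
```

### 3. Text Generation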
```bash
python text_gen.py
```
Generates Joyce-inspired text using the fine-tuned model.
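An illustrative generation call using the sampling settings listed under Model Details; the fine-tuned checkpoint path is an assumption:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "tiny_tim_v1"  # assumed output directory from fine-tuning
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

prompt = "ae left to go to ireland and found a fairy"
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling settings from Model Details: temperature 0.7, top-k 50, top-p 0.95.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```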
## Model Details
- **Base Model:** TinyLlama-1.1B-Chat-v1.0
- **Training Data:** Finnegans Wake (~1.5 MB of text)
- **Training Parameters:** 3 epochs, batch size 1, max sequence length 128 tokens
- **Generation Parameters:** temperature 0.7, top-k 50, top-p 0.95
## Example Output

Input: "ae left to go to ireland and found a fairy"

The model generates text continuing in Joyce's experimental style, with invented words, Irish references, and complex linguistic play.

## Requirements

- transformers
- datasets
- pandas
- torch

## Installation

```bash
pip install transformers datasets pandas torch
```

## Notes
- Training was performed on CPU due to resource constraints
- Model checkpoints are saved every 500 steps
- Training can be resumed from a checkpoint, as shown below
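With the `trainer` object from the fine-tuning sketch above, resuming uses the standard HuggingFace `Trainer` API:

```python
# Picks up from the most recent checkpoint-* directory in output_dir.
trainer.train(resume_from_checkpoint=True)
```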