metadata
title: TextSyncMimi Speech Editing
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: cc-by-4.0
TextSyncMimi Speech Editing Demo
Interactive demo for TextSyncMimi, a text-synchronous neural audio codec that enables token-level speech editing.
What This Demo Does
- Generate Speech: Use OpenAI TTS to create two audio samples with different voices and speaking styles
- Token-Level Analysis: See how the text is split into tokens by the LLaMA-3.1 tokenizer
- Speech Embedding Swapping: Swap speech characteristics at specific token positions
- Real-time Editing: Hear the results instantly
How to Use
Step 1: Configure Voices
- Enter your text transcript
- Select two different OpenAI TTS voices (e.g., "alloy" and "echo")
- (Optional) Add style instructions like "speak slowly" or "sound excited"
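The three inputs in Step 1 map naturally onto a text-to-speech request. The sketch below is illustrative and is not taken from app.py: `build_tts_request` is a hypothetical helper, and the rule "use gpt-4o-mini-tts when style instructions are given, otherwise tts-1" is an assumption read off the TTS Provider entry under Technical Details. The dictionary keys mirror the `model`, `voice`, `input`, and `instructions` parameters of the OpenAI Python SDK's speech endpoint.

```python
def build_tts_request(text: str, voice: str, instructions: str = "") -> dict:
    """Build keyword arguments for an OpenAI text-to-speech request.

    Hypothetical helper: the model-selection rule is an assumption,
    not the demo's confirmed behavior.
    """
    text = text.strip()
    if not text:
        raise ValueError("transcript must not be empty")

    request = {"input": text, "voice": voice}
    style = instructions.strip()
    if style:
        # Style instructions are only sent along with gpt-4o-mini-tts.
        request["model"] = "gpt-4o-mini-tts"
        request["instructions"] = style
    else:
        # No style instructions: fall back to the plain TTS model.
        request["model"] = "tts-1"
    return request


# One request per voice, each with its own optional style.
request_1 = build_tts_request("Hello, how are you today?", "alloy", "speak slowly")
request_2 = build_tts_request("Hello, how are you today?", "echo")
```

Each dictionary could then be unpacked into the SDK call that produces the audio for that voice.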
Step 2: Generate Audio
- Click "Generate & Process" to create both audio samples
- The model will show you the tokenization and generate a baseline reconstruction
Step 3: Swap Embeddings
- Enter token indices to swap (e.g., "0,2,5")
- Click "Perform Swap" to hear Voice 1 with Voice 2's characteristics at those positions
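The index field accepts a comma-separated string such as "0,2,5". A minimal sketch of how such input could be validated before swapping is shown below; `parse_token_indices` and its `num_tokens` argument are hypothetical names, and the demo's own parsing may differ.

```python
def parse_token_indices(raw: str, num_tokens: int) -> list[int]:
    """Parse a string like "0,2,5" into a sorted list of unique token indices.

    Hypothetical helper; num_tokens is the length of the tokenized transcript.
    """
    indices = set()
    for part in raw.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas and surrounding whitespace
        index = int(part)  # raises ValueError on non-numeric input
        if not 0 <= index < num_tokens:
            raise ValueError(
                f"token index {index} is outside the valid range 0..{num_tokens - 1}"
            )
        indices.add(index)
    return sorted(indices)


selected = parse_token_indices("0,2,5", num_tokens=6)
```

Rejecting out-of-range indices up front gives a clearer error than an indexing failure later in the swap.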
Examples
Example 1: Word-Level Swapping
Text: "Hello, how are you today?"
- Tokens 0-1: "Hello" (swap these)
- Result: The first word has Voice 2's style; the rest keeps Voice 1's style
Example 2: Prosody Transfer
- Voice 1: "speak slowly and calmly"
- Voice 2: "speak quickly with excitement"
- Swap indices: Middle of sentence
- Result: Sentence starts calm, becomes excited mid-way
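Both examples reduce to the same operation: keep Voice 1's sequence and take Voice 2's entry at the selected token positions. The sketch below assumes one speech embedding per text token and that both recordings share the same transcript, so position i refers to the same token in both. `swap_embeddings` is a hypothetical name, and plain lists stand in for the model's real tensors.

```python
def swap_embeddings(voice1, voice2, indices):
    """Return a copy of voice1's per-token embeddings with voice2's
    embeddings substituted at the given token positions.

    Illustrative only: assumes both sequences are aligned token by token.
    """
    if len(voice1) != len(voice2):
        raise ValueError("both voices must cover the same number of tokens")
    edited = list(voice1)  # shallow copy, so voice1 itself is left untouched
    for i in indices:
        edited[i] = voice2[i]
    return edited


# Toy 2-dimensional embeddings for a 3-token transcript.
voice1 = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
voice2 = [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]]

# Swap tokens 0 and 1, as in Example 1.
edited = swap_embeddings(voice1, voice2, [0, 1])
```

The edited sequence would then be decoded back to audio in place of Voice 1's original embeddings.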
For Users
Just try the demo! The OpenAI API key is already configured. Enter text, select voices, and experiment with speech editing.
For Developers (Running Your Own Copy)
Want to run your own version? Here's how:
- Duplicate this Space or create a new one
- Copy the files (app.py, requirements.txt, README.md)
- Add your OpenAI API key as a Secret:
  - Go to Space Settings → Repository secrets
  - Click "New secret"
  - Name: OPENAI_API_KEY
  - Value: Your OpenAI API key
  - Click "Add secret"
- The Space will automatically restart with your key (securely stored, never exposed)
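Space secrets are exposed to the running app as environment variables, so the key can be read at startup without appearing in the repository. A minimal sketch follows; `get_openai_api_key` is a hypothetical helper, and whether app.py reads the key exactly this way is an assumption.

```python
import os


def get_openai_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise a clear error.

    Hypothetical helper; env defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to the Space's repository secrets"
        )
    return key
```

Failing early with an explicit message makes a missing secret easier to diagnose than an authentication error from the first TTS request.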
Technical Details
- Model: TextSyncMimi-v1 (loaded from HuggingFace Hub)
- Tokenizer: LLaMA-3.1 (128K vocabulary, loaded from HuggingFace)
- Text Embeddings: Embeddings built into the model (4096-dim)
- Audio Codec: Mimi (24 kHz, 12.5 frames per second)
- TTS Provider: OpenAI (gpt-4o-mini-tts with instructions, or tts-1)
- Security: API keys stored securely in Space secrets
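The codec figures above imply a fixed frame size: 24,000 samples per second divided by 12.5 frames per second gives 1,920 samples, or 80 ms of audio, per frame. The sketch below works through that arithmetic. The constant and function names are illustrative, and rounding a partial final frame up is an assumption about padding rather than documented codec behavior.

```python
import math

SAMPLE_RATE_HZ = 24_000  # audio sample rate from the list above
FRAME_RATE_HZ = 12.5     # codec frames per second from the list above

SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # 1920 samples
FRAME_DURATION_MS = 1000 / FRAME_RATE_HZ                 # 80.0 ms


def num_codec_frames(num_samples: int) -> int:
    """Number of codec frames needed to cover num_samples of audio.

    Assumes a partial final frame is padded, hence the rounding up.
    """
    return math.ceil(num_samples / SAMPLES_PER_FRAME)


# A 2-second clip at 24 kHz has 48,000 samples.
frames_for_two_seconds = num_codec_frames(2 * SAMPLE_RATE_HZ)
```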
Links
- 🤗 Model Card