---
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
license: mit
title: VoiceAPI
tags:
- tts
- text-to-speech
- indian-languages
- vits
- multilingual
- speech-synthesis
language:
- hi
- bn
- mr
- te
- kn
- en
- bho
- mai
- mag
- hne
- gu
---

# 🎙️ VoiceAPI - Multi-lingual Indian Language TTS

An advanced **multi-speaker, multilingual text-to-speech (TTS) synthesizer** supporting 11 Indian languages with 21 voice options.


## 🌟 Features

- **11 Indian Languages**: Hindi, Bengali, Marathi, Telugu, Kannada, Gujarati, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
- **21 Voice Options**: Male and female voices for every language except Gujarati, which has a single female (MMS) voice
- **High-Quality Audio**: 22050 Hz sample rate, natural prosody
- **REST API**: Simple GET/POST endpoints for easy integration
- **Real-time Synthesis**: Fast inference on CPU/GPU

## 🗣️ Supported Languages

| Language | Code | Female | Male | Script |
|----------|------|--------|------|--------|
| Hindi | hi | ✅ | ✅ | देवनागरी |
| Bengali | bn | ✅ | ✅ | বাংলা |
| Marathi | mr | ✅ | ✅ | देवनागरी |
| Telugu | te | ✅ | ✅ | తెలుగు |
| Kannada | kn | ✅ | ✅ | ಕನ್ನಡ |
| Gujarati | gu | ✅ (MMS) | - | ગુજરાતી |
| Bhojpuri | bho | ✅ | ✅ | देवनागरी |
| Chhattisgarhi | hne | ✅ | ✅ | देवनागरी |
| Maithili | mai | ✅ | ✅ | देवनागरी |
| Magahi | mag | ✅ | ✅ | देवनागरी |
| English | en | ✅ | ✅ | Latin |
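
For client code, the table above can be mirrored as a small lookup. This is only a sketch: `VOICES`, `voice_count`, and `has_voice` are illustrative names that do not come from the VoiceAPI source, and the data is copied by hand from the table.

```python
# Illustrative lookup built from the table above; not part of the VoiceAPI source.
# Maps language code -> (language name, available voice genders).
VOICES = {
    "hi": ("Hindi", ("female", "male")),
    "bn": ("Bengali", ("female", "male")),
    "mr": ("Marathi", ("female", "male")),
    "te": ("Telugu", ("female", "male")),
    "kn": ("Kannada", ("female", "male")),
    "gu": ("Gujarati", ("female",)),  # MMS model, female voice only
    "bho": ("Bhojpuri", ("female", "male")),
    "hne": ("Chhattisgarhi", ("female", "male")),
    "mai": ("Maithili", ("female", "male")),
    "mag": ("Magahi", ("female", "male")),
    "en": ("English", ("female", "male")),
}


def voice_count(voices=VOICES):
    """Total number of (language, gender) voice combinations."""
    return sum(len(genders) for _, genders in voices.values())


def has_voice(code, gender, voices=VOICES):
    """Return True if the table lists a voice of this gender for the language code."""
    return code in voices and gender in voices[code][1]
```

With this data, `voice_count()` gives 21 and `has_voice("gu", "male")` is `False`, which matches the table.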

## 📡 API Usage

### Endpoint

```
https://harshil748-voiceapi.hf.space/
```

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `text` | string | Yes | Text to synthesize (lowercase for English) |
| `lang` | string | Yes | Language name (hindi, bengali, etc.) |
| `speaker_wav` | file | Yes | Reference WAV file (for API compatibility) |
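
The rules in the table can be applied before a request is sent. The sketch below uses only the standard library. `build_query` and `build_url` are hypothetical helpers, not part of the API, and the `/Get_Inference` path is taken from the examples in this README. The set of accepted `lang` values is not listed in full here, so the helper does not validate language names.

```python
from urllib.parse import urlencode

# Endpoint path taken from the examples in this README.
INFERENCE_URL = "https://harshil748-voiceapi.hf.space/Get_Inference"


def build_query(text, lang):
    """Return the query parameters for a synthesis request.

    Hypothetical helper: it only applies the rules stated in the table
    (both fields required, English text lowercased).
    """
    text = text.strip()
    lang = lang.strip().lower()
    if not text:
        raise ValueError("text is required")
    if not lang:
        raise ValueError("lang is required")
    if lang == "english":
        text = text.lower()
    return {"text": text, "lang": lang}


def build_url(text, lang, base_url=INFERENCE_URL):
    """Return the full request URL with percent-encoded query parameters."""
    return f"{base_url}?{urlencode(build_query(text, lang))}"
```

`urlencode` percent-encodes non-ASCII text as UTF-8, so Devanagari or Bengali input can be passed in directly without manual escaping.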

### Example (Python)

```python
import requests

# Inference endpoint of the hosted Space
base_url = 'https://harshil748-voiceapi.hf.space/Get_Inference'
wav_path = 'reference.wav'

params = {
    'text': 'नमस्ते, आप कैसे हैं?',
    'lang': 'hindi',
}

# Send the reference WAV as a multipart upload; POST matches the cURL example
with open(wav_path, 'rb') as audio_file:
    response = requests.post(base_url, params=params, files={'speaker_wav': audio_file}, timeout=120)

if response.status_code == 200:
    with open('output.wav', 'wb') as f:
        f.write(response.content)
    print("Audio saved as 'output.wav'")
else:
    print(f"Request failed with status {response.status_code}: {response.text}")
```

### Example (cURL)

```bash
curl -X POST "https://harshil748-voiceapi.hf.space/Get_Inference?text=hello&lang=english" \
  -F "speaker_wav=@reference.wav" \
  -o output.wav
```

## 🏗️ Model Architecture

- **Base Model**: VITS (Variational Inference with adversarial learning for Text-to-Speech)
- **Encoder**: Transformer-based text encoder (6 layers, 192 hidden channels)
- **Decoder**: HiFi-GAN neural vocoder
- **Duration Predictor**: Stochastic duration predictor for natural prosody
- **Sample Rate**: 22050 Hz (16000 Hz for Gujarati MMS)
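
The two sample rates above give clients a cheap sanity check on returned audio. The sketch below reads the rate from a WAV header with the standard-library `wave` module. It assumes the API returns uncompressed PCM WAV data. `expected_sample_rate`, `wav_sample_rate`, and `make_silence` are illustrative names, and `make_silence` exists only to exercise the check without a network call.

```python
import io
import wave

DEFAULT_SAMPLE_RATE = 22050   # VITS voices, per the list above
GUJARATI_SAMPLE_RATE = 16000  # Gujarati MMS voice, per the list above


def expected_sample_rate(lang):
    """Sample rate (Hz) this README states for a language name."""
    return GUJARATI_SAMPLE_RATE if lang.lower() == "gujarati" else DEFAULT_SAMPLE_RATE


def wav_sample_rate(wav_bytes):
    """Read the sample rate from the header of in-memory PCM WAV data."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getframerate()


def make_silence(sample_rate, seconds=0.1):
    """Build a short mono 16-bit silent WAV, used here only to exercise the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buffer.getvalue()
```

In a client, comparing `wav_sample_rate(response.content)` with `expected_sample_rate("hindi")` before saving the file catches truncated or non-audio responses early.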

## 📊 Training

### Datasets Used

| Dataset | Languages | Source | License |
|---------|-----------|--------|---------|
| OpenSLR-103 | Hindi | [OpenSLR](https://www.openslr.org/103/) | CC BY 4.0 |
| OpenSLR-37 | Bengali | [OpenSLR](https://www.openslr.org/37/) | CC BY 4.0 |
| OpenSLR-64 | Marathi | [OpenSLR](https://www.openslr.org/64/) | CC BY 4.0 |
| OpenSLR-66 | Telugu | [OpenSLR](https://www.openslr.org/66/) | CC BY 4.0 |
| OpenSLR-79 | Kannada | [OpenSLR](https://www.openslr.org/79/) | CC BY 4.0 |
| OpenSLR-78 | Gujarati | [OpenSLR](https://www.openslr.org/78/) | CC BY 4.0 |
| Common Voice | Hindi, Bengali | [Mozilla](https://commonvoice.mozilla.org/) | CC0 |
| IndicTTS | Multiple | [IIT Madras](https://www.iitm.ac.in/donlab/tts/) | Research |
| Indic-Voices | Multiple | [AI4Bharat](https://ai4bharat.iitm.ac.in/indic-voices/) | CC BY 4.0 |

### Training Configuration

- **Epochs**: 1000
- **Batch Size**: 32
- **Learning Rate**: 2e-4
- **Optimizer**: AdamW
- **FP16 Training**: Enabled
- **Hardware**: NVIDIA V100/A100 GPUs

See the `training/` directory for full training scripts and configurations.
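
As a rough illustration, the hyperparameters listed above could be collected in a single config object. The `TrainingConfig` class and its field names are hypothetical and do not reflect the actual schema of the files under `training/configs/`. The step count assumes one optimizer step per batch (no gradient accumulation) and that the final partial batch is kept.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters listed above; field names are illustrative."""
    epochs: int = 1000
    batch_size: int = 32
    learning_rate: float = 2e-4
    optimizer: str = "AdamW"
    fp16: bool = True

    def total_steps(self, num_utterances):
        """Optimizer steps over the full run for a dataset of the given size."""
        steps_per_epoch = math.ceil(num_utterances / self.batch_size)
        return steps_per_epoch * self.epochs
```

For a hypothetical dataset of 10,000 utterances, `TrainingConfig().total_steps(10_000)` comes to 313 steps per epoch times 1000 epochs, or 313,000 steps.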

## 🚀 Deployment

This API is deployed on HuggingFace Spaces using Docker:

```dockerfile
FROM python:3.10-slim
# ... installs dependencies
# Downloads models from Harshil748/VoiceAPI-Models
# Runs FastAPI server on port 7860
```

Models are hosted separately at [Harshil748/VoiceAPI-Models](https://huggingface.co/Harshil748/VoiceAPI-Models) (~8GB).

## 📁 Project Structure

```
VoiceAPI/
├── app.py                 # HuggingFace Spaces entry point
├── Dockerfile             # Docker configuration
├── requirements.txt       # Python dependencies
├── download_models.py     # Model downloader
├── src/
│   ├── api.py             # FastAPI REST server
│   ├── engine.py          # TTS inference engine
│   ├── config.py          # Voice configurations
│   └── tokenizer.py       # Text tokenization
└── training/
    ├── train_vits.py      # VITS training script
    ├── prepare_dataset.py # Data preparation
    ├── export_model.py    # Model export
    ├── datasets.csv       # Dataset links
    └── configs/           # Training configs
```

## 📜 License

- **Code**: MIT License
- **Models**: CC BY 4.0 (following SYSPIN licensing)
- **Datasets**: Individual licenses (see `training/datasets.csv`)

## 🙏 Acknowledgments

- [SYSPIN IISc SPIRE Lab](https://syspin.iisc.ac.in/) for pre-trained VITS models
- [Facebook MMS](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) for Gujarati TTS
- [Coqui TTS](https://github.com/coqui-ai/TTS) for the TTS library
- [AI4Bharat](https://ai4bharat.iitm.ac.in/) for Indian language resources

## 📧 Contact

Built for the **Voice Tech for All** Hackathon - Multi-lingual TTS for healthcare assistants serving low-income communities.