---
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
license: mit
title: VoiceAPI
tags:
  - tts
  - text-to-speech
  - indian-languages
  - vits
  - multilingual
  - speech-synthesis
language:
  - hi
  - bn
  - mr
  - te
  - kn
  - en
  - bho
  - mai
  - mag
  - hne
  - gu
---

🎙️ VoiceAPI - Multi-lingual Indian Language TTS

An advanced multi-speaker, multilingual text-to-speech (TTS) synthesizer supporting 11 Indian languages with 21 voice options.

🌟 Features

11 Indian Languages: Hindi, Bengali, Marathi, Telugu, Kannada, Gujarati, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
21 Voice Options: Male and female voices for every language except Gujarati, which has a single female voice (MMS)
High-Quality Audio: 22050 Hz sample rate (16000 Hz for Gujarati MMS), natural prosody
REST API: Simple GET/POST endpoints for easy integration
Real-time Synthesis: Fast inference on CPU/GPU

🗣️ Supported Languages

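| Language | Code | Female | Male | Script |
|---|---|---|---|---|
| Hindi | hi | ✅ | ✅ | देवनागरी |
| Bengali | bn | ✅ | ✅ | বাংলা |
| Marathi | mr | ✅ | ✅ | देवनागरी |
| Telugu | te | ✅ | ✅ | తెలుగు |
| Kannada | kn | ✅ | ✅ | ಕನ್ನಡ |
| Gujarati | gu | ✅ (MMS) | - | ગુજરાતી |
| Bhojpuri | bho | ✅ | ✅ | देवनागरी |
| Chhattisgarhi | hne | ✅ | ✅ | देवनागरी |
| Maithili | mai | ✅ | ✅ | देवनागरी |
| Magahi | mag | ✅ | ✅ | देवनागरी |
| English | en | ✅ | ✅ | Latin |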
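
The table can be captured as a small lookup structure, which also makes the 11-language, 21-voice count easy to check. This is only a sketch: the `VOICES` name and the `<code>_<gender>` ID scheme are illustrative assumptions and may not match what `src/config.py` actually uses.

```python
# Hypothetical voice registry mirroring the Supported Languages table.
# The dict name and the "<code>_<gender>" ID format are assumptions.
VOICES = {
    "hi": ("Hindi", ["female", "male"]),
    "bn": ("Bengali", ["female", "male"]),
    "mr": ("Marathi", ["female", "male"]),
    "te": ("Telugu", ["female", "male"]),
    "kn": ("Kannada", ["female", "male"]),
    "gu": ("Gujarati", ["female"]),  # MMS model, female voice only
    "bho": ("Bhojpuri", ["female", "male"]),
    "hne": ("Chhattisgarhi", ["female", "male"]),
    "mai": ("Maithili", ["female", "male"]),
    "mag": ("Magahi", ["female", "male"]),
    "en": ("English", ["female", "male"]),
}


def voice_ids():
    """Return every voice ID as '<code>_<gender>', in table order."""
    return [
        f"{code}_{gender}"
        for code, (_, genders) in VOICES.items()
        for gender in genders
    ]


print(len(VOICES), "languages,", len(voice_ids()), "voices")
# prints: 11 languages, 21 voices
```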

📡 API Usage

Endpoint

```
https://harshil748-voiceapi.hf.space/
```

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `text` | string | Yes | Text to synthesize (lowercase for English) |
| `lang` | string | Yes | Language name (hindi, bengali, etc.) |
| `speaker_wav` | file | Yes | Reference WAV file (for API compatibility) |
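
As a rough illustration of how the `text` and `lang` parameters combine into a request URL, the sketch below builds the query string without making a network call. The helper name `build_inference_url` is hypothetical, and lowercasing the language name is an assumption about what the server accepts. The `speaker_wav` file still has to be attached to the request separately.

```python
from urllib.parse import urlencode

# Inference endpoint path as used in this README's examples.
BASE_URL = "https://harshil748-voiceapi.hf.space/Get_Inference"


def build_inference_url(text: str, lang: str) -> str:
    """Build the query URL for a synthesis request (no network call).

    Hypothetical helper: normalizes the language name and lowercases
    English text, as the parameter table asks.
    """
    lang = lang.strip().lower()
    if lang == "english":
        text = text.lower()
    # urlencode percent-encodes non-ASCII text (e.g. Devanagari) as UTF-8.
    return f"{BASE_URL}?{urlencode({'text': text, 'lang': lang})}"


print(build_inference_url("Hello", "English"))
# prints: https://harshil748-voiceapi.hf.space/Get_Inference?text=hello&lang=english
```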

Example (Python)

```python
import requests

base_url = 'https://harshil748-voiceapi.hf.space/Get_Inference'
WavPath = 'reference.wav'

params = {
    'text': 'नमस्ते, आप कैसे हैं?',
    'lang': 'hindi',
}

# Send the reference WAV as multipart form data alongside the query parameters.
with open(WavPath, "rb") as AudioFile:
    response = requests.get(base_url, params=params, files={'speaker_wav': AudioFile.read()})

if response.status_code == 200:
    # The response body is the synthesized audio.
    with open('output.wav', 'wb') as f:
        f.write(response.content)
    print("Audio saved as 'output.wav'")
else:
    print(f"Request failed with status {response.status_code}")
```

Example (cURL)

```bash
curl -X POST "https://harshil748-voiceapi.hf.space/Get_Inference?text=hello&lang=english" \
  -F "speaker_wav=@reference.wav" \
  -o output.wav
```

🏗️ Model Architecture

Base Model: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
Encoder: Transformer-based text encoder (6 layers, 192 hidden channels)
Decoder: HiFi-GAN neural vocoder
Duration Predictor: Stochastic duration predictor for natural prosody
Sample Rate: 22050 Hz (16000 Hz for Gujarati MMS)
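
Because the output sample rate depends on the model, a client may want to confirm that a response matches the rate it expects. The sketch below reads the rate from a WAV header with the standard-library `wave` module. It assumes the API returns uncompressed PCM WAV data and that Gujarati is requested with the language name `gujarati`; the function names are illustrative and not part of this project.

```python
import io
import wave

# Sample rates stated in this README.
DEFAULT_RATE_HZ = 22050
GUJARATI_MMS_RATE_HZ = 16000


def expected_sample_rate(lang: str) -> int:
    """Return the documented sample rate for a language name.

    Assumes Gujarati is requested as 'gujarati' (not confirmed above).
    """
    if lang.strip().lower() == "gujarati":
        return GUJARATI_MMS_RATE_HZ
    return DEFAULT_RATE_HZ


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Read the sample rate from the header of PCM WAV data."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def has_expected_rate(wav_bytes: bytes, lang: str) -> bool:
    """Check that WAV data matches the documented rate for `lang`."""
    return wav_sample_rate(wav_bytes) == expected_sample_rate(lang)
```

With the Python example's `response`, a check could look like `has_expected_rate(response.content, 'hindi')`.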

📊 Training

Datasets Used

| Dataset | Languages | Source | License |
|---|---|---|---|
| OpenSLR-103 | Hindi | OpenSLR | CC BY 4.0 |
| OpenSLR-37 | Bengali | OpenSLR | CC BY 4.0 |
| OpenSLR-64 | Marathi | OpenSLR | CC BY 4.0 |
| OpenSLR-66 | Telugu | OpenSLR | CC BY 4.0 |
| OpenSLR-79 | Kannada | OpenSLR | CC BY 4.0 |
| OpenSLR-78 | Gujarati | OpenSLR | CC BY 4.0 |
| Common Voice | Hindi, Bengali | Mozilla | CC0 |
| IndicTTS | Multiple | IIT Madras | Research |
| Indic-Voices | Multiple | AI4Bharat | CC BY 4.0 |

Training Configuration

Epochs: 1000
Batch Size: 32
Learning Rate: 2e-4
Optimizer: AdamW
FP16 Training: Enabled
Hardware: NVIDIA V100/A100 GPUs

See `training/` directory for full training scripts and configurations.

🚀 Deployment

This API is deployed on HuggingFace Spaces using Docker:

```dockerfile
FROM python:3.10-slim

# ... installs dependencies
# Downloads models from Harshil748/VoiceAPI-Models
# Runs FastAPI server on port 7860
```

Models are hosted separately at Harshil748/VoiceAPI-Models (~8GB).

📁 Project Structure

```
VoiceAPI/
├── app.py                  # HuggingFace Spaces entry point
├── Dockerfile              # Docker configuration
├── requirements.txt        # Python dependencies
├── download_models.py      # Model downloader
├── src/
│   ├── api.py              # FastAPI REST server
│   ├── engine.py           # TTS inference engine
│   ├── config.py           # Voice configurations
│   └── tokenizer.py        # Text tokenization
└── training/
    ├── train_vits.py       # VITS training script
    ├── prepare_dataset.py  # Data preparation
    ├── export_model.py     # Model export
    ├── datasets.csv        # Dataset links
    └── configs/            # Training configs
```

📜 License

Code: MIT License
Models: CC BY 4.0 (following SYSPIN licensing)
Datasets: Individual licenses (see training/datasets.csv)

🙏 Acknowledgments

SYSPIN IISc SPIRE Lab for pre-trained VITS models
Facebook MMS for Gujarati TTS
Coqui TTS for the TTS library
AI4Bharat for Indian language resources

📧 Contact

Built for the Voice Tech for All Hackathon - Multi-lingual TTS for healthcare assistants serving low-income communities.