metadata
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
license: mit
title: VoiceAPI
tags:
  - tts
  - text-to-speech
  - indian-languages
  - vits
  - multilingual
  - speech-synthesis
language:
  - hi
  - bn
  - mr
  - te
  - kn
  - en
  - bho
  - mai
  - mag
  - hne
  - gu

🎙️ VoiceAPI - Multi-lingual Indian Language TTS

A multi-speaker, multilingual text-to-speech (TTS) synthesizer supporting 11 languages (10 Indian languages plus English) with 21 voice options.

🌟 Features

  • 11 Languages: Hindi, Bengali, Marathi, Telugu, Kannada, Gujarati, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
  • 21 Voice Options: Male and female voices for every language except Gujarati, which has a single MMS voice
  • High-Quality Audio: 22050 Hz sample rate (16000 Hz for Gujarati MMS), natural prosody
  • REST API: Simple GET/POST endpoints for easy integration
  • Real-time Synthesis: Fast inference on CPU/GPU

🗣️ Supported Languages

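| Language | Code | Female | Male | Script |
|---|---|---|---|---|
| Hindi | hi | ✅ | ✅ | देवनागरी |
| Bengali | bn | ✅ | ✅ | বাংলা |
| Marathi | mr | ✅ | ✅ | देवनागरी |
| Telugu | te | ✅ | ✅ | తెలుగు |
| Kannada | kn | ✅ | ✅ | ಕನ್ನಡ |
| Gujarati | gu | ✅ (MMS) | - | ગુજરાતી |
| Bhojpuri | bho | ✅ | ✅ | देवनागरी |
| Chhattisgarhi | hne | ✅ | ✅ | देवनागरी |
| Maithili | mai | ✅ | ✅ | देवनागरी |
| Magahi | mag | ✅ | ✅ | देवनागरी |
| English | en | ✅ | ✅ | Latin |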
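
Because the API's `lang` parameter takes a language name rather than a code, a client may find a small lookup from the codes above useful. The following is a minimal sketch: `LANGUAGE_NAMES` and `lang_name_for_code` are hypothetical helpers, not part of this project, and the lowercase names other than `hindi`, `bengali`, and `english` are assumed to follow the same pattern as the documented examples.

```python
# Hypothetical client-side lookup from language code to the `lang` value.
# Only "hindi", "bengali" and "english" appear in this README; the other
# names are assumed to follow the same lowercase pattern.
LANGUAGE_NAMES = {
    "hi": "hindi",
    "bn": "bengali",
    "mr": "marathi",
    "te": "telugu",
    "kn": "kannada",
    "gu": "gujarati",
    "bho": "bhojpuri",
    "hne": "chhattisgarhi",
    "mai": "maithili",
    "mag": "magahi",
    "en": "english",
}


def lang_name_for_code(code: str) -> str:
    """Return the assumed `lang` value for a code, or raise ValueError."""
    try:
        return LANGUAGE_NAMES[code.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported language code: {code!r}") from None
```

If the server rejects one of the assumed names, the mapping is the only place that needs adjusting.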

📡 API Usage

Endpoint

```
https://harshil748-voiceapi.hf.space/
```

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `text` | string | Yes | Text to synthesize (lowercase for English) |
| `lang` | string | Yes | Language name (hindi, bengali, etc.) |
| `speaker_wav` | file | Yes | Reference WAV file (for API compatibility) |
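
The parameter rules above can be applied on the client before a request is sent. This is a sketch only: `prepare_params` is a hypothetical helper, and whether the server itself normalizes case or whitespace is not stated in this README.

```python
def prepare_params(text: str, lang: str) -> dict:
    """Build the query parameters for the inference endpoint (hypothetical helper)."""
    lang = lang.strip().lower()
    text = text.strip()
    if not text:
        raise ValueError("text must not be empty")
    if lang == "english":
        # The parameter table asks for lowercase English text.
        text = text.lower()
    return {"text": text, "lang": lang}
```

The returned dictionary can be passed as `params=` to `requests`, with the reference WAV supplied separately as the `speaker_wav` file field.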

Example (Python)

```python
import requests

base_url = 'https://harshil748-voiceapi.hf.space/Get_Inference'
WavPath = 'reference.wav'

params = {
    'text': 'नमस्ते, आप कैसे हैं?',
    'lang': 'hindi',
}

# Send the reference WAV as a multipart upload alongside the query parameters.
with open(WavPath, "rb") as AudioFile:
    response = requests.get(base_url, params=params, files={'speaker_wav': AudioFile.read()})

# On success, the response body is the synthesized audio.
if response.status_code == 200:
    with open('output.wav', 'wb') as f:
        f.write(response.content)
    print("Audio saved as 'output.wav'")
```

Example (cURL)

```bash
curl -X POST "https://harshil748-voiceapi.hf.space/Get_Inference?text=hello&lang=english" \
  -F "speaker_wav=@reference.wav" \
  -o output.wav
```

🏗️ Model Architecture

  • Base Model: VITS (Variational Inference with adversarial learning for Text-to-Speech)
  • Encoder: Transformer-based text encoder (6 layers, 192 hidden channels)
  • Decoder: HiFi-GAN neural vocoder
  • Duration Predictor: Stochastic duration predictor for natural prosody
  • Sample Rate: 22050 Hz (16000 Hz for Gujarati MMS)
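
Since the sample rate differs between the VITS voices and the Gujarati MMS voice, a client may want to check a saved file before further processing. The sketch below uses Python's standard `wave` module; the helper names are hypothetical, and it assumes the API returns uncompressed PCM WAV audio, which this README does not confirm.

```python
import wave

# Sample rates as stated in this README.
DEFAULT_SAMPLE_RATE = 22050
GUJARATI_MMS_SAMPLE_RATE = 16000


def expected_sample_rate(lang: str) -> int:
    """Return the documented sample rate for a language name."""
    if lang.strip().lower() == "gujarati":
        return GUJARATI_MMS_SAMPLE_RATE
    return DEFAULT_SAMPLE_RATE


def wav_matches_expected_rate(path: str, lang: str) -> bool:
    """Check a PCM WAV file's header against the documented rate (hypothetical helper)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == expected_sample_rate(lang)
```

For example, `wav_matches_expected_rate('output.wav', 'hindi')` would compare the saved file's header against 22050 Hz.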

📊 Training

Datasets Used

| Dataset | Languages | Source | License |
|---|---|---|---|
| OpenSLR-103 | Hindi | OpenSLR | CC BY 4.0 |
| OpenSLR-37 | Bengali | OpenSLR | CC BY 4.0 |
| OpenSLR-64 | Marathi | OpenSLR | CC BY 4.0 |
| OpenSLR-66 | Telugu | OpenSLR | CC BY 4.0 |
| OpenSLR-79 | Kannada | OpenSLR | CC BY 4.0 |
| OpenSLR-78 | Gujarati | OpenSLR | CC BY 4.0 |
| Common Voice | Hindi, Bengali | Mozilla | CC0 |
| IndicTTS | Multiple | IIT Madras | Research |
| Indic-Voices | Multiple | AI4Bharat | CC BY 4.0 |

Training Configuration

  • Epochs: 1000
  • Batch Size: 32
  • Learning Rate: 2e-4
  • Optimizer: AdamW
  • FP16 Training: Enabled
  • Hardware: NVIDIA V100/A100 GPUs
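
For reference, the hyperparameters listed above can be captured in a small typed structure. This is an illustrative sketch, not the format used by the project's training configs: the `TrainingConfig` class and its field names are assumptions, and only the values come from this README.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainingConfig:
    # Values copied from the list above; field names are illustrative.
    epochs: int = 1000
    batch_size: int = 32
    learning_rate: float = 2e-4
    optimizer: str = "AdamW"
    fp16: bool = True


config = TrainingConfig()
print(asdict(config))
```

Freezing the dataclass keeps the settings immutable once a run has started, which makes them safe to log alongside checkpoints.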

See `training/` directory for full training scripts and configurations.

🚀 Deployment

This API is deployed on HuggingFace Spaces using Docker:

```dockerfile
FROM python:3.10-slim

# ... installs dependencies

# Downloads models from Harshil748/VoiceAPI-Models

# Runs FastAPI server on port 7860
```

Models are hosted separately at Harshil748/VoiceAPI-Models (~8GB).

📁 Project Structure

```
VoiceAPI/
├── app.py                  # HuggingFace Spaces entry point
├── Dockerfile              # Docker configuration
├── requirements.txt        # Python dependencies
├── download_models.py      # Model downloader
├── src/
│   ├── api.py              # FastAPI REST server
│   ├── engine.py           # TTS inference engine
│   ├── config.py           # Voice configurations
│   └── tokenizer.py        # Text tokenization
└── training/
    ├── train_vits.py       # VITS training script
    ├── prepare_dataset.py  # Data preparation
    ├── export_model.py     # Model export
    ├── datasets.csv        # Dataset links
    └── configs/            # Training configs
```

📜 License

  • Code: MIT License
  • Models: CC BY 4.0 (following SYSPIN licensing)
  • Datasets: Individual licenses (see training/datasets.csv)

🙏 Acknowledgments

📧 Contact

Built for the Voice Tech for All Hackathon - Multi-lingual TTS for healthcare assistants serving low-income communities.