0xWerz committed
Commit 1b6c34a · 1 Parent(s): 0f6a7a4
Dockerfile CHANGED
@@ -1,7 +1,7 @@
  # Read the doc: https://huggingface.co/docs/hub/spaces-sdks-docker
  # you will also find guides on how best to write your Dockerfile

- FROM python:3.9
+ FROM python:3.11

  RUN useradd -m -u 1000 user
  USER user
LICENSE ADDED
@@ -0,0 +1,201 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
README copy.md ADDED
@@ -0,0 +1,130 @@
+ # kokoro
+
+ An inference library for [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M). You can [`pip install kokoro`](https://pypi.org/project/kokoro/).
+
+ > **Kokoro** is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
+
+ ### Usage
+ You can run this basic cell on [Google Colab](https://colab.research.google.com/). [Listen to samples](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/SAMPLES.md).
+ ```py
+ !pip install -q "kokoro>=0.9.4" soundfile
+ !apt-get -qq -y install espeak-ng > /dev/null 2>&1
+ from kokoro import KPipeline
+ from IPython.display import display, Audio
+ import soundfile as sf
+ import torch
+ pipeline = KPipeline(lang_code='a')
+ text = '''
+ [Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
+ '''
+ generator = pipeline(text, voice='af_heart')
+ for i, (gs, ps, audio) in enumerate(generator):
+     print(i, gs, ps)
+     display(Audio(data=audio, rate=24000, autoplay=i==0))
+     sf.write(f'{i}.wav', audio, 24000)
+ ```
+ Under the hood, `kokoro` uses [`misaki`](https://pypi.org/project/misaki/), a G2P library at https://github.com/hexgrad/misaki
+
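(Editorial side note, not part of the committed README: the G2P stage can be inspected on its own when debugging pronunciation. The sketch below reuses the `pipeline.g2p(...)` call that appears later in this commit in `examples/phoneme_example.py` and `examples/export.py`; treat the exact return values as an assumption based on those scripts rather than documented API.)

```py
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')  # American English

# Pattern borrowed from examples/phoneme_example.py in this commit:
# for English lang codes, g2p returns a phoneme string plus a token list.
phonemes, tokens = pipeline.g2p("How are you today?")
print(phonemes)  # phoneme string the model will consume
print(tokens)    # per-word tokens, useful for custom [word](/phonemes/) overrides
```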
+ ### Advanced Usage
+ You can run this advanced cell on [Google Colab](https://colab.research.google.com/).
+ ```py
+ # 1️⃣ Install kokoro
+ !pip install -q "kokoro>=0.9.4" soundfile
+ # 2️⃣ Install espeak, used for English OOD fallback and some non-English languages
+ !apt-get -qq -y install espeak-ng > /dev/null 2>&1
+
+ # 3️⃣ Initialize a pipeline
+ from kokoro import KPipeline
+ from IPython.display import display, Audio
+ import soundfile as sf
+ import torch
+ # 🇺🇸 'a' => American English, 🇬🇧 'b' => British English
+ # 🇪🇸 'e' => Spanish es
+ # 🇫🇷 'f' => French fr-fr
+ # 🇮🇳 'h' => Hindi hi
+ # 🇮🇹 'i' => Italian it
+ # 🇯🇵 'j' => Japanese: pip install misaki[ja]
+ # 🇧🇷 'p' => Brazilian Portuguese pt-br
+ # 🇨🇳 'z' => Mandarin Chinese: pip install misaki[zh]
+ pipeline = KPipeline(lang_code='a') # <= make sure lang_code matches voice, reference above.
+
+ # This text is for demonstration purposes only, unseen during training
+ text = '''
+ The sky above the port was the color of television, tuned to a dead channel.
+ "It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
+ It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.
+
+ These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.
+
+ [Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
+ '''
+ # text = '「もしおれがただ偶然、そしてこうしようというつもりでなくここに立っているのなら、ちょっとばかり絶望するところだな」と、そんなことが彼の頭に思い浮かんだ。'
+ # text = '中國人民不信邪也不怕邪,不惹事也不怕事,任何外國不要指望我們會拿自己的核心利益做交易,不要指望我們會吞下損害我國主權、安全、發展利益的苦果!'
+ # text = 'Los partidos políticos tradicionales compiten con los populismos y los movimientos asamblearios.'
+ # text = 'Le dromadaire resplendissant déambulait tranquillement dans les méandres en mastiquant de petites feuilles vernissées.'
+ # text = 'ट्रांसपोर्टरों की हड़ताल लगातार पांचवें दिन जारी, दिसंबर से इलेक्ट्रॉनिक टोल कलेक्शनल सिस्टम'
+ # text = "Allora cominciava l'insonnia, o un dormiveglia peggiore dell'insonnia, che talvolta assumeva i caratteri dell'incubo."
+ # text = 'Elabora relatórios de acompanhamento cronológico para as diferentes unidades do Departamento que propõem contratos.'
+
+ # 4️⃣ Generate, display, and save audio files in a loop.
+ generator = pipeline(
+     text, voice='af_heart', # <= change voice here
+     speed=1, split_pattern=r'\n+'
+ )
+ # Alternatively, load voice tensor directly:
+ # voice_tensor = torch.load('path/to/voice.pt', weights_only=True)
+ # generator = pipeline(
+ #     text, voice=voice_tensor,
+ #     speed=1, split_pattern=r'\n+'
+ # )
+
+ for i, (gs, ps, audio) in enumerate(generator):
+     print(i)  # i => index
+     print(gs) # gs => graphemes/text
+     print(ps) # ps => phonemes
+     display(Audio(data=audio, rate=24000, autoplay=i==0))
+     sf.write(f'{i}.wav', audio, 24000) # save each audio file
+ ```
+
+ ### Windows Installation
+ To install espeak-ng on Windows:
+ 1. Go to [espeak-ng releases](https://github.com/espeak-ng/espeak-ng/releases)
+ 2. Click on **Latest release**
+ 3. Download the appropriate `*.msi` file (e.g. **espeak-ng-20191129-b702b03-x64.msi**)
+ 4. Run the downloaded installer
+
+ For advanced configuration and usage on Windows, see the [official espeak-ng Windows guide](https://github.com/espeak-ng/espeak-ng/blob/master/docs/guide.md)
+
+ ### macOS Apple Silicon GPU Acceleration
+
+ On Mac M1/M2/M3/M4 devices, you can explicitly set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to enable GPU acceleration.
+
+ ```bash
+ PYTORCH_ENABLE_MPS_FALLBACK=1 python run-your-kokoro-script.py
+ ```
+
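(Editorial side note, not part of the committed README: `KPipeline` also accepts an explicit `device` argument, as `examples/device_examples.py` in this commit shows for `'cuda'` and `'cpu'`. The sketch below selects Apple's MPS backend when present; whether `'mps'` is accepted by this argument is an assumption, and the fallback variable above is still needed for operators MPS does not implement.)

```py
import torch
from kokoro import KPipeline

# Prefer the MPS backend when available, otherwise fall back to CPU.
# The device= keyword mirrors examples/device_examples.py; 'mps' support is assumed.
device = 'mps' if torch.backends.mps.is_available() else 'cpu'
pipeline = KPipeline(lang_code='a', device=device)
```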
+ ### Conda Environment
+ Use the following conda `environment.yml` if you're facing any dependency issues.
+ ```yaml
+ name: kokoro
+ channels:
+   - defaults
+ dependencies:
+   - python==3.9
+   - libstdcxx~=12.4.0 # Needed to load espeak correctly. Try removing this if you're facing issues with espeak fallback.
+   - pip:
+     - kokoro>=0.3.1
+     - soundfile
+     - misaki[en]
+ ```
+
+ ### Acknowledgements
+ - 🛠️ [@yl4579](https://huggingface.co/yl4579) for architecting StyleTTS 2.
+ - 🏆 [@Pendrokar](https://huggingface.co/Pendrokar) for adding Kokoro as a contender in the TTS Spaces Arena.
+ - 📊 Thank you to everyone who contributed synthetic training data.
+ - ❤️ Special thanks to all compute sponsors.
+ - 👾 Discord server: https://discord.gg/QuGxSWBfQy
+ - 🪽 Kokoro is a Japanese word that translates to "heart" or "spirit". Kokoro is also a [character in the Terminator franchise](https://terminator.fandom.com/wiki/Kokoro) along with [Misaki](https://github.com/hexgrad/misaki?tab=readme-ov-file#acknowledgements).
+
+ <img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />
+ # kokoro
demo/README.md ADDED
@@ -0,0 +1,15 @@
+ ---
+ title: Kokoro TTS
+ emoji: ❤️
+ colorFrom: indigo
+ colorTo: pink
+ sdk: gradio
+ sdk_version: 5.12.0
+ app_file: app.py
+ pinned: true
+ license: apache-2.0
+ short_description: Upgraded to v1.0!
+ disable_embedding: true
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
demo/app.py ADDED
@@ -0,0 +1,182 @@
+ import spaces
+ from kokoro import KModel, KPipeline
+ import gradio as gr
+ import os
+ import random
+ import torch
+
+ CUDA_AVAILABLE = torch.cuda.is_available()
+ models = {gpu: KModel().to('cuda' if gpu else 'cpu').eval() for gpu in [False] + ([True] if CUDA_AVAILABLE else [])}
+ pipelines = {lang_code: KPipeline(lang_code=lang_code, model=False) for lang_code in 'ab'}
+ pipelines['a'].g2p.lexicon.golds['kokoro'] = 'kˈOkəɹO'
+ pipelines['b'].g2p.lexicon.golds['kokoro'] = 'kˈQkəɹQ'
+
+ @spaces.GPU(duration=30)
+ def forward_gpu(ps, ref_s, speed):
+     return models[True](ps, ref_s, speed)
+
+ def generate_first(text, voice='af_heart', speed=1, use_gpu=CUDA_AVAILABLE):
+     pipeline = pipelines[voice[0]]
+     pack = pipeline.load_voice(voice)
+     use_gpu = use_gpu and CUDA_AVAILABLE
+     for _, ps, _ in pipeline(text, voice, speed):
+         ref_s = pack[len(ps)-1]
+         try:
+             if use_gpu:
+                 audio = forward_gpu(ps, ref_s, speed)
+             else:
+                 audio = models[False](ps, ref_s, speed)
+         except gr.exceptions.Error as e:
+             if use_gpu:
+                 gr.Warning(str(e))
+                 gr.Info('Retrying with CPU. To avoid this error, change Hardware to CPU.')
+                 audio = models[False](ps, ref_s, speed)
+             else:
+                 raise gr.Error(e)
+         return (24000, audio.numpy()), ps
+     return None, ''
+
+ # Arena API
+ def predict(text, voice='af_heart', speed=1):
+     return generate_first(text, voice, speed, use_gpu=False)[0]
+
+ def tokenize_first(text, voice='af_heart'):
+     pipeline = pipelines[voice[0]]
+     for _, ps, _ in pipeline(text, voice):
+         return ps
+     return ''
+
+ def generate_all(text, voice='af_heart', speed=1, use_gpu=CUDA_AVAILABLE):
+     pipeline = pipelines[voice[0]]
+     pack = pipeline.load_voice(voice)
+     use_gpu = use_gpu and CUDA_AVAILABLE
+     first = True
+     for _, ps, _ in pipeline(text, voice, speed):
+         ref_s = pack[len(ps)-1]
+         try:
+             if use_gpu:
+                 audio = forward_gpu(ps, ref_s, speed)
+             else:
+                 audio = models[False](ps, ref_s, speed)
+         except gr.exceptions.Error as e:
+             if use_gpu:
+                 gr.Warning(str(e))
+                 gr.Info('Switching to CPU')
+                 audio = models[False](ps, ref_s, speed)
+             else:
+                 raise gr.Error(e)
+         yield 24000, audio.numpy()
+         if first:
+             first = False
+             yield 24000, torch.zeros(1).numpy()
+
+ with open('en.txt', 'r') as r:
+     random_quotes = [line.strip() for line in r]
+
+ def get_random_quote():
+     return random.choice(random_quotes)
+
+ def get_gatsby():
+     with open('gatsby5k.md', 'r') as r:
+         return r.read().strip()
+
+ def get_frankenstein():
+     with open('frankenstein5k.md', 'r') as r:
+         return r.read().strip()
+
+ CHOICES = {
+     '🇺🇸 🚺 Heart ❤️': 'af_heart',
+     '🇺🇸 🚺 Bella 🔥': 'af_bella',
+     '🇺🇸 🚺 Nicole 🎧': 'af_nicole',
+     '🇺🇸 🚺 Aoede': 'af_aoede',
+     '🇺🇸 🚺 Kore': 'af_kore',
+     '🇺🇸 🚺 Sarah': 'af_sarah',
+     '🇺🇸 🚺 Nova': 'af_nova',
+     '🇺🇸 🚺 Sky': 'af_sky',
+     '🇺🇸 🚺 Alloy': 'af_alloy',
+     '🇺🇸 🚺 Jessica': 'af_jessica',
+     '🇺🇸 🚺 River': 'af_river',
+     '🇺🇸 🚹 Michael': 'am_michael',
+     '🇺🇸 🚹 Fenrir': 'am_fenrir',
+     '🇺🇸 🚹 Puck': 'am_puck',
+     '🇺🇸 🚹 Echo': 'am_echo',
+     '🇺🇸 🚹 Eric': 'am_eric',
+     '🇺🇸 🚹 Liam': 'am_liam',
+     '🇺🇸 🚹 Onyx': 'am_onyx',
+     '🇺🇸 🚹 Santa': 'am_santa',
+     '🇺🇸 🚹 Adam': 'am_adam',
+     '🇬🇧 🚺 Emma': 'bf_emma',
+     '🇬🇧 🚺 Isabella': 'bf_isabella',
+     '🇬🇧 🚺 Alice': 'bf_alice',
+     '🇬🇧 🚺 Lily': 'bf_lily',
+     '🇬🇧 🚹 George': 'bm_george',
+     '🇬🇧 🚹 Fable': 'bm_fable',
+     '🇬🇧 🚹 Lewis': 'bm_lewis',
+     '🇬🇧 🚹 Daniel': 'bm_daniel',
+ }
+ for v in CHOICES.values():
+     pipelines[v[0]].load_voice(v)
+
+ TOKEN_NOTE = '''
+ 💡 Customize pronunciation with Markdown link syntax and /slashes/ like `[Kokoro](/kˈOkəɹO/)`
+
+ 💬 To adjust intonation, try punctuation `;:,.!?—…"()“”` or stress `ˈ` and `ˌ`
+
+ ⬇️ Lower stress `[1 level](-1)` or `[2 levels](-2)`
+
+ ⬆️ Raise stress 1 level `[or](+2)` 2 levels (only works on less stressed, usually short words)
+ '''
+
+ with gr.Blocks() as generate_tab:
+     out_audio = gr.Audio(label='Output Audio', interactive=False, streaming=False, autoplay=True)
+     generate_btn = gr.Button('Generate', variant='primary')
+     with gr.Accordion('Output Tokens', open=True):
+         out_ps = gr.Textbox(interactive=False, show_label=False, info='Tokens used to generate the audio, up to 510 context length.')
+         tokenize_btn = gr.Button('Tokenize', variant='secondary')
+         gr.Markdown(TOKEN_NOTE)
+     predict_btn = gr.Button('Predict', variant='secondary', visible=False)
+
+ STREAM_NOTE = ['⚠️ There is an unknown Gradio bug that might yield no audio the first time you click `Stream`.']
+ STREAM_NOTE = '\n\n'.join(STREAM_NOTE)
+
+ with gr.Blocks() as stream_tab:
+     out_stream = gr.Audio(label='Output Audio Stream', interactive=False, streaming=True, autoplay=True)
+     with gr.Row():
+         stream_btn = gr.Button('Stream', variant='primary')
+         stop_btn = gr.Button('Stop', variant='stop')
+     with gr.Accordion('Note', open=True):
+         gr.Markdown(STREAM_NOTE)
+         gr.DuplicateButton()
+
+ API_OPEN = True
+ with gr.Blocks() as app:
+     with gr.Row():
+         with gr.Column():
+             text = gr.Textbox(label='Input Text', info="Arbitrarily many characters supported")
+             with gr.Row():
+                 voice = gr.Dropdown(list(CHOICES.items()), value='af_heart', label='Voice', info='Quality and availability vary by language')
+                 use_gpu = gr.Dropdown(
+                     [('ZeroGPU 🚀', True), ('CPU 🐌', False)],
+                     value=CUDA_AVAILABLE,
+                     label='Hardware',
+                     info='GPU is usually faster, but has a usage quota',
+                     interactive=CUDA_AVAILABLE
+                 )
+             speed = gr.Slider(minimum=0.5, maximum=2, value=1, step=0.1, label='Speed')
+             random_btn = gr.Button('🎲 Random Quote 💬', variant='secondary')
+             with gr.Row():
+                 gatsby_btn = gr.Button('🥂 Gatsby 📕', variant='secondary')
+                 frankenstein_btn = gr.Button('💀 Frankenstein 📗', variant='secondary')
+         with gr.Column():
+             gr.TabbedInterface([generate_tab, stream_tab], ['Generate', 'Stream'])
+     random_btn.click(fn=get_random_quote, inputs=[], outputs=[text])
+     gatsby_btn.click(fn=get_gatsby, inputs=[], outputs=[text])
+     frankenstein_btn.click(fn=get_frankenstein, inputs=[], outputs=[text])
+     generate_btn.click(fn=generate_first, inputs=[text, voice, speed, use_gpu], outputs=[out_audio, out_ps])
+     tokenize_btn.click(fn=tokenize_first, inputs=[text, voice], outputs=[out_ps])
+     stream_event = stream_btn.click(fn=generate_all, inputs=[text, voice, speed, use_gpu], outputs=[out_stream])
+     stop_btn.click(fn=None, cancels=stream_event)
+     predict_btn.click(fn=predict, inputs=[text, voice, speed], outputs=[out_audio])
+
+ if __name__ == '__main__':
+     app.queue(api_open=API_OPEN).launch(server_name="0.0.0.0", server_port=40001, show_api=API_OPEN)
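(Editorial side note, not part of the committed app.py: because the app launches with `api_open=True`, it can also be driven remotely with `gradio_client`. The Space id and endpoint name below are assumptions for illustration only; `view_api()` prints the endpoints the running app actually exposes.)

```py
from gradio_client import Client

client = Client("hexgrad/Kokoro-TTS")  # assumed Space id; point this at your own deployment
print(client.view_api())               # lists the real endpoint names and signatures

# Assumed mapping onto generate_first(text, voice, speed, use_gpu)
audio, phonemes = client.predict(
    "The sky above the port was the color of television.",
    "af_heart",                 # voice
    1,                          # speed
    False,                      # use_gpu
    api_name="/generate_first", # assumed endpoint name; check view_api() output
)
```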
demo/en.txt ADDED
The diff for this file is too large to render. See raw diff
 
demo/frankenstein5k.md ADDED
@@ -0,0 +1,11 @@
+ You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings. I arrived here yesterday, and my first task is to assure my dear sister of my welfare and increasing confidence in the success of my undertaking.
+
+ I am already far north of London, and as I walk in the streets of Petersburgh, I feel a cold northern breeze play upon my cheeks, which braces my nerves and fills me with delight. Do you understand this feeling? This breeze, which has travelled from the regions towards which I am advancing, gives me a foretaste of those icy climes. Inspirited by this wind of promise, my daydreams become more fervent and vivid. I try in vain to be persuaded that the pole is the seat of frost and desolation; it ever presents itself to my imagination as the region of beauty and delight. There, Margaret, the sun is for ever visible, its broad disk just skirting the horizon and diffusing a perpetual splendour. There—for with your leave, my sister, I will put some trust in preceding navigators—there snow and frost are banished; and, sailing over a calm sea, we may be wafted to a land surpassing in wonders and in beauty every region hitherto discovered on the habitable globe. Its productions and features may be without example, as the phenomena of the heavenly bodies undoubtedly are in those undiscovered solitudes. What may not be expected in a country of eternal light? I may there discover the wondrous power which attracts the needle and may regulate a thousand celestial observations that require only this voyage to render their seeming eccentricities consistent for ever. I shall satiate my ardent curiosity with the sight of a part of the world never before visited, and may tread a land never before imprinted by the foot of man. These are my enticements, and they are sufficient to conquer all fear of danger or death and to induce me to commence this laborious voyage with the joy a child feels when he embarks in a little boat, with his holiday mates, on an expedition of discovery up his native river. But supposing all these conjectures to be false, you cannot contest the inestimable benefit which I shall confer on all mankind, to the last generation, by discovering a passage near the pole to those countries, to reach which at present so many months are requisite; or by ascertaining the secret of the magnet, which, if at all possible, can only be effected by an undertaking such as mine.
+
+ These reflections have dispelled the agitation with which I began my letter, and I feel my heart glow with an enthusiasm which elevates me to heaven, for nothing contributes so much to tranquillise the mind as a steady purpose—a point on which the soul may fix its intellectual eye. This expedition has been the favourite dream of my early years. I have read with ardour the accounts of the various voyages which have been made in the prospect of arriving at the North Pacific Ocean through the seas which surround the pole. You may remember that a history of all the voyages made for purposes of discovery composed the whole of our good Uncle Thomas’s library. My education was neglected, yet I was passionately fond of reading. These volumes were my study day and night, and my familiarity with them increased that regret which I had felt, as a child, on learning that my father’s dying injunction had forbidden my uncle to allow me to embark in a seafaring life.
+
+ These visions faded when I perused, for the first time, those poets whose effusions entranced my soul and lifted it to heaven. I also became a poet and for one year lived in a paradise of my own creation; I imagined that I also might obtain a niche in the temple where the names of Homer and Shakespeare are consecrated. You are well acquainted with my failure and how heavily I bore the disappointment. But just at that time I inherited the fortune of my cousin, and my thoughts were turned into the channel of their earlier bent.
+
+ Six years have passed since I resolved on my present undertaking. I can, even now, remember the hour from which I dedicated myself to this great enterprise. I commenced by inuring my body to hardship. I accompanied the whale-fishers on several expeditions to the North Sea; I voluntarily endured cold, famine, thirst, and want of sleep; I often worked harder than the common sailors during the day and devoted my nights to the study of mathematics, the theory of medicine, and those branches of physical science from which a naval adventurer might derive the greatest practical advantage. Twice I actually hired myself as an under-mate in a Greenland whaler, and acquitted myself to admiration. I must own I felt a little proud when my captain offered me the second dignity in the vessel and entreated me to remain with the greatest earnestness, so valuable did he consider my services.
+
+ And now, dear Margaret, do I not deserve to accomplish some great purpose?
demo/gatsby5k.md ADDED
@@ -0,0 +1,17 @@
+ In my younger and more vulnerable years my father gave me some advice that I’ve been turning over in my mind ever since.
+
+ “Whenever you feel like criticizing anyone,” he told me, “just remember that all the people in this world haven’t had the advantages that you’ve had.”
+
+ He didn’t say any more, but we’ve always been unusually communicative in a reserved way, and I understood that he meant a great deal more than that. In consequence, I’m inclined to reserve all judgements, a habit that has opened up many curious natures to me and also made me the victim of not a few veteran bores. The abnormal mind is quick to detect and attach itself to this quality when it appears in a normal person, and so it came about that in college I was unjustly accused of being a politician, because I was privy to the secret griefs of wild, unknown men. Most of the confidences were unsought—frequently I have feigned sleep, preoccupation, or a hostile levity when I realized by some unmistakable sign that an intimate revelation was quivering on the horizon; for the intimate revelations of young men, or at least the terms in which they express them, are usually plagiaristic and marred by obvious suppressions. Reserving judgements is a matter of infinite hope. I am still a little afraid of missing something if I forget that, as my father snobbishly suggested, and I snobbishly repeat, a sense of the fundamental decencies is parcelled out unequally at birth.
+
+ And, after boasting this way of my tolerance, I come to the admission that it has a limit. Conduct may be founded on the hard rock or the wet marshes, but after a certain point I don’t care what it’s founded on. When I came back from the East last autumn I felt that I wanted the world to be in uniform and at a sort of moral attention forever; I wanted no more riotous excursions with privileged glimpses into the human heart. Only Gatsby, the man who gives his name to this book, was exempt from my reaction—Gatsby, who represented everything for which I have an unaffected scorn. If personality is an unbroken series of successful gestures, then there was something gorgeous about him, some heightened sensitivity to the promises of life, as if he were related to one of those intricate machines that register earthquakes ten thousand miles away. This responsiveness had nothing to do with that flabby impressionability which is dignified under the name of the “creative temperament”—it was an extraordinary gift for hope, a romantic readiness such as I have never found in any other person and which it is not likely I shall ever find again. No—Gatsby turned out all right at the end; it is what preyed on Gatsby, what foul dust floated in the wake of his dreams that temporarily closed out my interest in the abortive sorrows and short-winded elations of men.
+
+ My family have been prominent, well-to-do people in this Middle Western city for three generations. The Carraways are something of a clan, and we have a tradition that we’re descended from the Dukes of Buccleuch, but the actual founder of my line was my grandfather’s brother, who came here in fifty-one, sent a substitute to the Civil War, and started the wholesale hardware business that my father carries on today.
+
+ I never saw this great-uncle, but I’m supposed to look like him—with special reference to the rather hard-boiled painting that hangs in father’s office. I graduated from New Haven in 1915, just a quarter of a century after my father, and a little later I participated in that delayed Teutonic migration known as the Great War. I enjoyed the counter-raid so thoroughly that I came back restless. Instead of being the warm centre of the world, the Middle West now seemed like the ragged edge of the universe—so I decided to go East and learn the bond business. Everybody I knew was in the bond business, so I supposed it could support one more single man. All my aunts and uncles talked it over as if they were choosing a prep school for me, and finally said, “Why—[ye-es](/jˈɛ ɛs/),” with very grave, hesitant faces. Father agreed to finance me for a year, and after various delays I came East, permanently, I thought, in the spring of twenty-two.
+
+ The practical thing was to find rooms in the city, but it was a warm season, and I had just left a country of wide lawns and friendly trees, so when a young man at the office suggested that we take a house together in a commuting town, it sounded like a great idea. He found the house, a weather-beaten cardboard bungalow at eighty a month, but at the last minute the firm ordered him to Washington, and I went out to the country alone. I had a dog—at least I had him for a few days until he ran away—and an old Dodge and a Finnish woman, who made my bed and cooked breakfast and muttered Finnish wisdom to herself over the electric stove.
+
+ It was lonely for a day or so until one morning some man, more recently arrived than I, stopped me on the road.
+
+ “How do you get to West Egg village?” he asked helplessly.
demo/packages.txt ADDED
@@ -0,0 +1 @@
+ espeak-ng
demo/requirements.txt ADDED
@@ -0,0 +1,3 @@
+ kokoro>=0.7.13
+ gradio
+ pip
examples/device_examples.py ADDED
@@ -0,0 +1,45 @@
+ """
+ Quick example showing how device selection can be controlled and verified.
+ """
+ import time
+ from kokoro import KPipeline
+ from loguru import logger
+
+ def generate_audio(pipeline, text):
+     for _, _, audio in pipeline(text, voice='af_bella'):
+         samples = audio.shape[0] if audio is not None else 0
+         assert samples > 0, "No audio generated"
+         return samples
+
+ def time_synthesis(device=None):
+     try:
+         start = time.perf_counter()
+         pipeline = KPipeline(lang_code='a', device=device)
+         samples = generate_audio(pipeline, "The quick brown fox jumps over the lazy dog.")
+         ms = (time.perf_counter() - start) * 1000
+         logger.info(f"✓ {device or 'auto':<6} | {ms:>5.1f}ms total | {samples:>6,d} samples")
+     except RuntimeError as e:
+         logger.error(f"✗ {'cuda' if 'CUDA' in str(e) else device or 'auto':<6} | {'not available' if 'CUDA' in str(e) else str(e)}")
+
+ def compare_shared_model():
+     try:
+         start = time.perf_counter()
+         en_us = KPipeline(lang_code='a')
+         en_uk = KPipeline(lang_code='a', model=en_us.model)
+
+         for pipeline in [en_us, en_uk]:
+             generate_audio(pipeline, "Testing model reuse.")
+
+         ms = (time.perf_counter() - start) * 1000
+         logger.info(f"✓ reuse | {ms:>5.1f}ms for both models")
+     except Exception as e:
+         logger.error(f"✗ reuse | {str(e)}")
+
+ if __name__ == '__main__':
+     logger.info("Device Selection & Performance")
+     logger.info("-" * 40)
+     time_synthesis()
+     time_synthesis('cuda')
+     time_synthesis('cpu')
+     logger.info("-" * 40)
+     compare_shared_model()
examples/export.py ADDED
@@ -0,0 +1,149 @@
+ import argparse
+ import os
+ import torch
+ import onnx
+ import onnxruntime as ort
+ import sounddevice as sd
+
+ from kokoro import KModel, KPipeline
+ from kokoro.model import KModelForONNX
+
+ def export_onnx(model, output):
+     onnx_file = output + "/" + "kokoro.onnx"
+
+     input_ids = torch.randint(1, 100, (48,)).numpy()
+     input_ids = torch.LongTensor([[0, *input_ids, 0]])
+     style = torch.randn(1, 256)
+     speed = torch.randint(1, 10, (1,)).int()
+
+     torch.onnx.export(
+         model,
+         args = (input_ids, style, speed),
+         f = onnx_file,
+         export_params = True,
+         verbose = True,
+         input_names = [ 'input_ids', 'style', 'speed' ],
+         output_names = [ 'waveform', 'duration' ],
+         opset_version = 17,
+         dynamic_axes = {
+             'input_ids': {0: "batch_size", 1: 'input_ids_len' },
+             'style': {0: "batch_size"},
+             "speed": {0: "batch_size"}
+         },
+         do_constant_folding = True,
+     )
+
+     print('export kokoro.onnx ok!')
+
+     onnx_model = onnx.load(onnx_file)
+     onnx.checker.check_model(onnx_model)
+     print('onnx check ok!')
+
+ def load_input_ids(pipeline, text):
+     if pipeline.lang_code in 'ab':
+         _, tokens = pipeline.g2p(text)
+         for gs, ps, tks in pipeline.en_tokenize(tokens):
+             if not ps:
+                 continue
+     else:
+         ps, _ = pipeline.g2p(text)
+
+     if len(ps) > 510:
+         ps = ps[:510]
+
+     input_ids = list(filter(lambda i: i is not None, map(lambda p: pipeline.model.vocab.get(p), ps)))
+     print(f"text: {text} -> phonemes: {ps} -> input_ids: {input_ids}")
+     input_ids = torch.LongTensor([[0, *input_ids, 0]]).to(pipeline.model.device)
+     return ps, input_ids
+
+ def load_voice(pipeline, voice, phonemes):
+     pack = pipeline.load_voice(voice).to('cpu')
+     return pack[len(phonemes) - 1]
+
+ def load_sample(model):
+     pipeline = KPipeline(lang_code='a', model=model.kmodel, device='cpu')
+     text = '''
+     In today's fast-paced tech world, building software applications has never been easier — thanks to AI-powered coding assistants.
+     '''
+     text = '''
+     The sky above the port was the color of television, tuned to a dead channel.
+     '''
+     voice = 'checkpoints/voices/af_heart.pt'
+
+     pipeline = KPipeline(lang_code='z', model=model.kmodel, device='cpu')
+     text = '''
+     2月15日晚,猫眼专业版数据显示,截至发稿,《哪吒之魔童闹海》(或称《哪吒2》)今日票房已达7.8亿元,累计票房(含预售)超过114亿元。
+     '''
+     voice = 'checkpoints/voices/zf_xiaoxiao.pt'
+
+     phonemes, input_ids = load_input_ids(pipeline, text)
+     style = load_voice(pipeline, voice, phonemes)
+     speed = torch.IntTensor([1])
+
+     return input_ids, style, speed
+
+ def inference_onnx(model, output):
+     onnx_file = output + "/" + "kokoro.onnx"
+     session = ort.InferenceSession(onnx_file)
+
+     input_ids, style, speed = load_sample(model)
+
+     outputs = session.run(None, {
+         'input_ids': input_ids.numpy(),
+         'style': style.numpy(),
+         'speed': speed.numpy(),
+     })
+
+     output = torch.from_numpy(outputs[0])
+     print(f'output: {output.shape}')
+     print(output)
+
+     audio = output.numpy()
+     sd.play(audio, 24000)
+     sd.wait()
+
+ def check_model(model):
+     input_ids, style, speed = load_sample(model)
+     output, duration = model(input_ids, style, speed)
+
+     print(f'output: {output.shape}')
+     print(f'duration: {duration.shape}')
+     print(output)
+
+     audio = output.numpy()
+     sd.play(audio, 24000)
+     sd.wait()
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser("Export kokoro Model to ONNX", add_help=True)
+     parser.add_argument("--inference", "-t", help="test kokoro.onnx model", action="store_true")
+     parser.add_argument("--check", "-m", help="check kokoro model", action="store_true")
+     parser.add_argument(
+         "--config_file", "-c", type=str, default="checkpoints/config.json", help="path to config file"
+     )
+     parser.add_argument(
+         "--checkpoint_path", "-p", type=str, default="checkpoints/kokoro-v1_0.pth", help="path to checkpoint file"
+     )
+     parser.add_argument(
+         "--output_dir", "-o", type=str, default="onnx", help="output directory"
+     )
+
+     args = parser.parse_args()
+
+     # cfg
+     config_file = args.config_file  # change the path of the model config file
+     checkpoint_path = args.checkpoint_path  # change the path of the model
+     output_dir = args.output_dir
+
+     # make dir
+     os.makedirs(output_dir, exist_ok=True)
+
+     kmodel = KModel(config=config_file, model=checkpoint_path, disable_complex=True)
+     model = KModelForONNX(kmodel).eval()
+
+     if args.inference:
+         inference_onnx(model, output_dir)
+     elif args.check:
+         check_model(model)
+     else:
+         export_onnx(model, output_dir)
examples/make_triton_compatible.py ADDED
@@ -0,0 +1,92 @@
+ """
+ This script makes the ONNX model compatible with Triton inference server.
+ """
+
+ import sys
+ import numpy as np
+ import onnx
+ import onnxruntime as ort
+ import onnx_graphsurgeon as gs
+
+
+ def add_squeeze(graph, speed_input, speed_unsqueezed):
+     """
+     Add a squeeze operation to the speed input to change its shape from [batch_size, 1] to [batch_size].
+     """
+     # Create a squeeze node
+     squeeze_node = gs.Node(
+         op="Squeeze",
+         name="speed_squeeze",
+         inputs=[speed_unsqueezed],
+         outputs=[gs.Variable(name="speed_squeezed", dtype=speed_unsqueezed.dtype)]
+     )
+
+     ## Find first node that has speed_unsqueezed as input
+     insert_idx = 0
+     for idx, node in enumerate(graph.nodes):
+         for i, input_name in enumerate(node.inputs):
+             if input_name.name == speed_unsqueezed.name:
+                 insert_idx = idx
+                 break
+         if insert_idx != 0:
+             break
+
+     ## Add squeeze node to the graph, just before its first consumer
+     insert_idx = max(0, insert_idx - 1)
+     graph.nodes.insert(insert_idx, squeeze_node)
+
+     # Update the speed input to point to the squeezed output
+     for node in graph.nodes:
+         for i, input_name in enumerate(node.inputs):
+             if input_name.name == speed_input.name and not node.name == "speed_squeeze":
+                 node.inputs[i] = squeeze_node.outputs[0]
+
+     return graph
+
+
+ def main():
+     if len(sys.argv) != 2:
+         print("Usage: python make_triton_compatible.py <onnx_model_path>")
+         sys.exit(1)
+
+     onnx_model_path = sys.argv[1]
+     onnx_model = onnx.load(onnx_model_path)
+     onnx.checker.check_model(onnx_model)
+     print("Model is valid")
+
+     graph = gs.import_onnx(onnx_model)
+
+     ## get input_id for speed
+     speed_idx, speed = None, None
+     for idx, input_ in enumerate(graph.inputs):
+         if input_.name == "speed":
+             speed_idx = idx
+             speed = input_
+
+     ## Add squeeze to change speed shape from [batch_size, 1] to [batch_size]
+     if speed is not None:
+         print(f"Found speed input: {speed.name}")
+         print(f"Found speed input shape: {speed.shape}")
+         print(f"Found speed input dtype: {speed.dtype}")
+         print(f"Found speed input: {speed}")
+         print(f"Found speed input: {type(speed)}")
+
+         # Update the speed input to have shape [batch_size, 1]
+         speed_unsqueezed = gs.Variable(name="speed", dtype=speed.dtype, shape=[speed.shape[0], 1])
+         graph.inputs[speed_idx] = speed_unsqueezed
+
+         graph = add_squeeze(graph, speed, speed_unsqueezed)
+
+         # Export the modified graph back to ONNX
+         modified_model = gs.export_onnx(graph)
+         onnx.checker.check_model(modified_model)
+
+         # Save the modified model
+         output_path = onnx_model_path.replace('.onnx', '_triton.onnx')
+         onnx.save(modified_model, output_path)
+         print(f"Modified model saved to: {output_path}")
+     else:
+         print("Speed input not found in the model")
+
+
+ if __name__ == "__main__":
+     main()
examples/phoneme_example.py ADDED
@@ -0,0 +1,62 @@
+ from kokoro import KPipeline, KModel
+ import torch
+ from scipy.io import wavfile
+
+ def save_audio(audio: torch.Tensor, filename: str):
+     """Helper function to save audio tensor as WAV file"""
+     if audio is not None:
+         # Ensure audio is on CPU and in the right format
+         audio_cpu = audio.cpu().numpy()
+
+         # Save using scipy.io.wavfile
+         wavfile.write(
+             filename,
+             24000,  # Kokoro uses 24kHz sample rate
+             audio_cpu
+         )
+         print(f"Audio saved as '{filename}'")
+     else:
+         print("No audio was generated")
+
+ def main():
+     # Initialize pipeline with American English
+     pipeline = KPipeline(lang_code='a')
+
+     # The phoneme string for:
+     # "How are you today? I am doing reasonably well, thank you for asking"
+     phonemes = "hˌW ɑɹ ju tədˈA? ˌI ɐm dˈuɪŋ ɹˈizənəbli wˈɛl, θˈæŋk ju fɔɹ ˈæskɪŋ"
+
+     try:
+         print("\nExample 1: Using generate_from_tokens with raw phonemes")
+         results = list(pipeline.generate_from_tokens(
+             tokens=phonemes,
+             voice="af_bella",
+             speed=1.0
+         ))
+         if results:
+             save_audio(results[0].audio, 'phoneme_output_new.wav')
+
+         # Example 2: Using generate_from_tokens with pre-processed tokens
+         print("\nExample 2: Using generate_from_tokens with pre-processed tokens")
+         # get the tokens through G2P or any other method
+         text = "How are you today? I am doing reasonably well, thank you for asking"
+         _, tokens = pipeline.g2p(text)
+
+         # Then generate from tokens
+         for result in pipeline.generate_from_tokens(
+             tokens=tokens,
+             voice="af_bella",
+             speed=1.0
+         ):
+             # Each result may contain timestamps if available
+             if result.tokens:
+                 for token in result.tokens:
+                     if hasattr(token, 'start_ts') and hasattr(token, 'end_ts'):
+                         print(f"Token: {token.text} ({token.start_ts:.2f}s - {token.end_ts:.2f}s)")
+             save_audio(result.audio, f'token_output_{hash(result.phonemes)}.wav')
+
+     except Exception as e:
+         print(f"An error occurred: {str(e)}")
+
+ if __name__ == "__main__":
+     main()
kokoro/__init__.py ADDED
@@ -0,0 +1,23 @@
+ __version__ = '0.9.4'
+
+ from loguru import logger
+ import sys
+
+ # Remove default handler
+ logger.remove()
+
+ # Add custom handler with clean format including module and line number
+ logger.add(
+     sys.stderr,
+     format="<green>{time:HH:mm:ss}</green> | <cyan>{module:>16}:{line}</cyan> | <level>{level: >8}</level> | <level>{message}</level>",
+     colorize=True,
+     level="INFO"  # "DEBUG" to enable logger.debug("message") and up prints
+                   # "ERROR" to enable only logger.error("message") prints
+                   # etc
+ )
+
+ # Disable before release or as needed
+ logger.disable("kokoro")
+
+ from .model import KModel
+ from .pipeline import KPipeline
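(Editorial side note, not part of the committed file: since the package calls `logger.disable("kokoro")` at import time, downstream code that wants to see these log lines has to opt back in through loguru's standard enable call.)

```py
from loguru import logger
import kokoro  # importing runs the configuration above, which disables "kokoro" logs

logger.enable("kokoro")  # re-enable the package's log output (handler level is INFO here)
pipeline = kokoro.KPipeline(lang_code='a')
```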
kokoro/__main__.py ADDED
@@ -0,0 +1,148 @@
1
+ """Kokoro TTS CLI
2
+ Example usage:
3
+ python3 -m kokoro --text "The sky above the port was the color of television, tuned to a dead channel." -o file.wav --debug
4
+
5
+ echo "Bom dia mundo, como vão vocês" > text.txt
6
+ python3 -m kokoro -i text.txt -l p --voice pm_alex > audio.wav
7
+
8
+ Common issues:
9
+ pip not installed: `uv pip install pip`
10
+ (Temporary workaround while https://github.com/explosion/spaCy/issues/13747 is not fixed)
11
+
12
+ espeak not installed: `apt-get install espeak-ng`
13
+ """
14
+
15
+ import argparse
16
+ import wave
17
+ from pathlib import Path
18
+ from typing import Generator, TYPE_CHECKING
19
+
20
+ import numpy as np
21
+ from loguru import logger
22
+
23
+ languages = [
24
+ "a", # American English
25
+ "b", # British English
26
+ "h", # Hindi
27
+ "e", # Spanish
28
+ "f", # French
29
+ "i", # Italian
30
+ "p", # Brazilian Portuguese
31
+ "j", # Japanese
32
+ "z", # Mandarin Chinese
33
+ ]
34
+
35
+ if TYPE_CHECKING:
36
+ from kokoro import KPipeline
37
+
38
+
39
+ def generate_audio(
40
+ text: str, kokoro_language: str, voice: str, speed=1
41
+ ) -> Generator["KPipeline.Result", None, None]:
42
+ from kokoro import KPipeline
43
+
44
+ if not voice.startswith(kokoro_language):
45
+ logger.warning(f"Voice {voice} is not made for language {kokoro_language}")
46
+ pipeline = KPipeline(lang_code=kokoro_language)
47
+ yield from pipeline(text, voice=voice, speed=speed, split_pattern=r"\n+")
48
+
49
+
50
+ def generate_and_save_audio(
51
+ output_file: Path, text: str, kokoro_language: str, voice: str, speed=1
52
+ ) -> None:
53
+ with wave.open(str(output_file.resolve()), "wb") as wav_file:
54
+ wav_file.setnchannels(1) # Mono audio
55
+ wav_file.setsampwidth(2) # 2 bytes per sample (16-bit audio)
56
+ wav_file.setframerate(24000) # Sample rate
57
+
58
+ for result in generate_audio(
59
+ text, kokoro_language=kokoro_language, voice=voice, speed=speed
60
+ ):
61
+ logger.debug(result.phonemes)
62
+ if result.audio is None:
63
+ continue
64
+ audio_bytes = (result.audio.numpy() * 32767).astype(np.int16).tobytes()
65
+ wav_file.writeframes(audio_bytes)
66
+
67
+
68
+ def main() -> None:
69
+ parser = argparse.ArgumentParser()
70
+ parser.add_argument(
71
+ "-m",
72
+ "--voice",
73
+ default="af_heart",
74
+ help="Voice to use",
75
+ )
76
+ parser.add_argument(
77
+ "-l",
78
+ "--language",
79
+ help="Language to use (defaults to the one corresponding to the voice)",
80
+ choices=languages,
81
+ )
82
+ parser.add_argument(
83
+ "-o",
84
+ "--output-file",
85
+ "--output_file",
86
+ type=Path,
87
+ help="Path to output WAV file",
88
+ required=True,
89
+ )
90
+ parser.add_argument(
91
+ "-i",
92
+ "--input-file",
93
+ "--input_file",
94
+ type=Path,
95
+ help="Path to input text file (default: stdin)",
96
+ )
97
+ parser.add_argument(
98
+ "-t",
99
+ "--text",
100
+ help="Text to use instead of reading from stdin",
101
+ )
102
+ parser.add_argument(
103
+ "-s",
104
+ "--speed",
105
+ type=float,
106
+ default=1.0,
107
+ help="Speech speed",
108
+ )
109
+ parser.add_argument(
110
+ "--debug",
111
+ action="store_true",
112
+ help="Print DEBUG messages to console",
113
+ )
114
+ args = parser.parse_args()
115
+ if args.debug:
116
+ logger.level("DEBUG")
117
+ logger.debug(args)
118
+
119
+ lang = args.language or args.voice[0]
120
+
121
+ if args.text is not None and args.input_file is not None:
122
+ raise Exception("You cannot specify both 'text' and 'input_file'")
123
+ elif args.text:
124
+ text = args.text
125
+ elif args.input_file:
126
+ file: Path = args.input_file
127
+ text = file.read_text()
128
+ else:
129
+ import sys
130
+ print("Press Ctrl+D to stop reading input and start generating", flush=True)
131
+ text = '\n'.join(sys.stdin)
132
+
133
+ logger.debug(f"Input text: {text!r}")
134
+
135
+ out_file: Path = args.output_file
136
+ if not out_file.suffix == ".wav":
137
+ logger.warning("The output file name should end with .wav")
138
+ generate_and_save_audio(
139
+ output_file=out_file,
140
+ text=text,
141
+ kokoro_language=lang,
142
+ voice=args.voice,
143
+ speed=args.speed,
144
+ )
145
+
146
+
147
+ if __name__ == "__main__":
148
+ main()
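A minimal usage sketch for the CLI module above, assuming the package and espeak-ng are installed; the voice, text, and output path are illustrative:

# Programmatic use of the helper defined above; all literals are illustrative.
from pathlib import Path

generate_and_save_audio(
    output_file=Path("out.wav"),   # illustrative output path
    text="Hello from Kokoro.",
    kokoro_language="a",           # American English
    voice="af_heart",
    speed=1.0,
)
# Roughly equivalent CLI call via the console script declared in pyproject.toml:
#   kokoro -t "Hello from Kokoro." -o out.wav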
kokoro/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (761 Bytes).
 
kokoro/__pycache__/custom_stft.cpython-311.pyc ADDED
Binary file (7.32 kB).
 
kokoro/__pycache__/istftnet.cpython-311.pyc ADDED
Binary file (30.9 kB).
 
kokoro/__pycache__/model.cpython-311.pyc ADDED
Binary file (12.5 kB).
 
kokoro/__pycache__/modules.cpython-311.pyc ADDED
Binary file (16.8 kB).
 
kokoro/__pycache__/pipeline.cpython-311.pyc ADDED
Binary file (25.4 kB).
 
kokoro/custom_stft.py ADDED
@@ -0,0 +1,197 @@
1
+ import numpy as np
3
+ import torch
4
+ import torch.nn as nn
5
+ import torch.nn.functional as F
6
+
7
+ class CustomSTFT(nn.Module):
8
+ """
9
+ STFT/iSTFT without unfold/complex ops, using conv1d and conv_transpose1d.
10
+
11
+ - forward STFT => Real-part conv1d + Imag-part conv1d
12
+ - inverse STFT => Real-part conv_transpose1d + Imag-part conv_transpose1d + sum
13
+ - avoids F.unfold, so easier to export to ONNX
14
+ - uses replicate or constant padding for 'center=True' to approximate 'reflect'
15
+ (reflect is not supported for dynamic shapes in ONNX)
16
+ """
17
+
18
+ def __init__(
19
+ self,
20
+ filter_length=800,
21
+ hop_length=200,
22
+ win_length=800,
23
+ window="hann",
24
+ center=True,
25
+ pad_mode="replicate", # or 'constant'
26
+ ):
27
+ super().__init__()
28
+ self.filter_length = filter_length
29
+ self.hop_length = hop_length
30
+ self.win_length = win_length
31
+ self.n_fft = filter_length
32
+ self.center = center
33
+ self.pad_mode = pad_mode
34
+
35
+ # Number of frequency bins for real-valued STFT with onesided=True
36
+ self.freq_bins = self.n_fft // 2 + 1
37
+
38
+ # Build window
39
+ assert window == 'hann', window
40
+ window_tensor = torch.hann_window(win_length, periodic=True, dtype=torch.float32)
41
+ if self.win_length < self.n_fft:
42
+ # Zero-pad up to n_fft
43
+ extra = self.n_fft - self.win_length
44
+ window_tensor = F.pad(window_tensor, (0, extra))
45
+ elif self.win_length > self.n_fft:
46
+ window_tensor = window_tensor[: self.n_fft]
47
+ self.register_buffer("window", window_tensor)
48
+
49
+ # Precompute forward DFT (real, imag)
50
+ # PyTorch stft uses e^{-j 2 pi k n / N} => real=cos(...), imag=-sin(...)
51
+ n = np.arange(self.n_fft)
52
+ k = np.arange(self.freq_bins)
53
+ angle = 2 * np.pi * np.outer(k, n) / self.n_fft # shape (freq_bins, n_fft)
54
+ dft_real = np.cos(angle)
55
+ dft_imag = -np.sin(angle) # note negative sign
56
+
57
+ # Combine window and dft => shape (freq_bins, filter_length)
58
+ # We'll make 2 conv weight tensors of shape (freq_bins, 1, filter_length).
59
+ forward_window = window_tensor.numpy() # shape (n_fft,)
60
+ forward_real = dft_real * forward_window # (freq_bins, n_fft)
61
+ forward_imag = dft_imag * forward_window
62
+
63
+ # Convert to PyTorch
64
+ forward_real_torch = torch.from_numpy(forward_real).float()
65
+ forward_imag_torch = torch.from_numpy(forward_imag).float()
66
+
67
+ # Register as Conv1d weight => (out_channels, in_channels, kernel_size)
68
+ # out_channels = freq_bins, in_channels=1, kernel_size=n_fft
69
+ self.register_buffer(
70
+ "weight_forward_real", forward_real_torch.unsqueeze(1)
71
+ )
72
+ self.register_buffer(
73
+ "weight_forward_imag", forward_imag_torch.unsqueeze(1)
74
+ )
75
+
76
+ # Precompute inverse DFT
77
+ # Real iFFT formula => scale = 1/n_fft, doubling for bins 1..freq_bins-2 if n_fft even, etc.
78
+ # For simplicity, we won't do the "DC/nyquist not doubled" approach here.
79
+ # If you want perfect real iSTFT, you can add that logic.
80
+ # This version just yields good approximate reconstruction with Hann + typical overlap.
81
+ inv_scale = 1.0 / self.n_fft
82
+ n = np.arange(self.n_fft)
83
+ angle_t = 2 * np.pi * np.outer(n, k) / self.n_fft # shape (n_fft, freq_bins)
84
+ idft_cos = np.cos(angle_t).T # => (freq_bins, n_fft)
85
+ idft_sin = np.sin(angle_t).T # => (freq_bins, n_fft)
86
+
87
+ # Multiply by window again for typical overlap-add
88
+ # We also incorporate the scale factor 1/n_fft
89
+ inv_window = window_tensor.numpy() * inv_scale
90
+ backward_real = idft_cos * inv_window # (freq_bins, n_fft)
91
+ backward_imag = idft_sin * inv_window
92
+
93
+ # We'll implement iSTFT as real+imag conv_transpose with stride=hop.
94
+ self.register_buffer(
95
+ "weight_backward_real", torch.from_numpy(backward_real).float().unsqueeze(1)
96
+ )
97
+ self.register_buffer(
98
+ "weight_backward_imag", torch.from_numpy(backward_imag).float().unsqueeze(1)
99
+ )
100
+
101
+
102
+
103
+ def transform(self, waveform: torch.Tensor):
104
+ """
105
+ Forward STFT => returns magnitude, phase
106
+ Output shape => (batch, freq_bins, frames)
107
+ """
108
+ # waveform shape => (B, T). conv1d expects (B, 1, T).
109
+ # Optional center pad
110
+ if self.center:
111
+ pad_len = self.n_fft // 2
112
+ waveform = F.pad(waveform, (pad_len, pad_len), mode=self.pad_mode)
113
+
114
+ x = waveform.unsqueeze(1) # => (B, 1, T)
115
+ # Convolution to get real part => shape (B, freq_bins, frames)
116
+ real_out = F.conv1d(
117
+ x,
118
+ self.weight_forward_real,
119
+ bias=None,
120
+ stride=self.hop_length,
121
+ padding=0,
122
+ )
123
+ # Imag part
124
+ imag_out = F.conv1d(
125
+ x,
126
+ self.weight_forward_imag,
127
+ bias=None,
128
+ stride=self.hop_length,
129
+ padding=0,
130
+ )
131
+
132
+ # magnitude, phase
133
+ magnitude = torch.sqrt(real_out**2 + imag_out**2 + 1e-14)
134
+ phase = torch.atan2(imag_out, real_out)
135
+ # Handle the case where imag_out is 0 and real_out is negative to correct ONNX atan2 to match PyTorch
136
+ # In this case, PyTorch returns pi, ONNX returns -pi
137
+ correction_mask = (imag_out == 0) & (real_out < 0)
138
+ phase[correction_mask] = torch.pi
139
+ return magnitude, phase
140
+
141
+
142
+ def inverse(self, magnitude: torch.Tensor, phase: torch.Tensor, length=None):
143
+ """
144
+ Inverse STFT => returns waveform shape (B, 1, T).
145
+ """
146
+ # magnitude, phase => (B, freq_bins, frames)
147
+ # Re-create real/imag => shape (B, freq_bins, frames)
148
+ real_part = magnitude * torch.cos(phase)
149
+ imag_part = magnitude * torch.sin(phase)
150
+
151
+ # conv_transpose1d expects (B, in_channels, input_length); with in_channels=freq_bins
152
+ # and "frames" as the time dimension, real_part/imag_part are already in the right layout.
153
156
+
157
+ # real iSTFT => convolve with "backward_real", "backward_imag", and sum
158
+ # We'll do 2 conv_transpose calls, each giving (B, 1, time),
159
+ # then add them => (B, 1, time).
160
+ real_rec = F.conv_transpose1d(
161
+ real_part,
162
+ self.weight_backward_real, # shape (freq_bins, 1, filter_length)
163
+ bias=None,
164
+ stride=self.hop_length,
165
+ padding=0,
166
+ )
167
+ imag_rec = F.conv_transpose1d(
168
+ imag_part,
169
+ self.weight_backward_imag,
170
+ bias=None,
171
+ stride=self.hop_length,
172
+ padding=0,
173
+ )
174
+ # sum => (B, 1, time)
175
+ waveform = real_rec - imag_rec # typical real iFFT has minus for imaginary part
176
+
177
+ # If we used "center=True" in forward, we should remove pad
178
+ if self.center:
179
+ pad_len = self.n_fft // 2
180
+ # Because of transposed convolution, total length might have extra samples
181
+ # We remove `pad_len` from start & end if possible
182
+ waveform = waveform[..., pad_len:-pad_len]
183
+
184
+ # If a specific length is desired, clamp
185
+ if length is not None:
186
+ waveform = waveform[..., :length]
187
+
188
+ # shape => (B, 1, T)
189
+ return waveform
190
+
191
+ def forward(self, x: torch.Tensor):
192
+ """
193
+ Full STFT -> iSTFT pass: returns time-domain reconstruction.
194
+ Keeps the same interface as the torch.stft-based TorchSTFT.
195
+ """
196
+ mag, phase = self.transform(x)
197
+ return self.inverse(mag, phase, length=x.shape[-1])
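A small round-trip sketch for CustomSTFT above, using a random one-second waveform; the filter/hop sizes mirror the defaults and the reconstruction is approximate by design:

import torch

stft = CustomSTFT(filter_length=800, hop_length=200, win_length=800)
wave = torch.randn(1, 24000)                       # (B, T): one second at 24 kHz
mag, phase = stft.transform(wave)                  # (B, freq_bins, frames) = (1, 401, 121)
recon = stft.inverse(mag, phase, length=wave.shape[-1])
print(mag.shape, recon.shape)                      # recon is (B, 1, T); not bit-exact, by design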
kokoro/istftnet.py ADDED
@@ -0,0 +1,421 @@
1
+ # ADAPTED from https://github.com/yl4579/StyleTTS2/blob/main/Modules/istftnet.py
2
+ from kokoro.custom_stft import CustomSTFT
3
+ from torch.nn.utils.parametrizations import weight_norm
4
+ import math
5
+ import torch
6
+ import torch.nn as nn
7
+ import torch.nn.functional as F
8
+
9
+
10
+ # https://github.com/yl4579/StyleTTS2/blob/main/Modules/utils.py
11
+ def init_weights(m, mean=0.0, std=0.01):
12
+ classname = m.__class__.__name__
13
+ if classname.find("Conv") != -1:
14
+ m.weight.data.normal_(mean, std)
15
+
16
+ def get_padding(kernel_size, dilation=1):
17
+ return int((kernel_size*dilation - dilation)/2)
18
+
19
+
20
+ class AdaIN1d(nn.Module):
21
+ def __init__(self, style_dim, num_features):
22
+ super().__init__()
23
+ # affine should be False; however, a bug in the old torch.onnx.export (not the newer dynamo exporter) loses the channel dimension when affine=False. affine=True adds learnable parameters, but that does not matter here since the module is only used in inference mode.
24
+ self.norm = nn.InstanceNorm1d(num_features, affine=True)
25
+ self.fc = nn.Linear(style_dim, num_features*2)
26
+
27
+ def forward(self, x, s):
28
+ h = self.fc(s)
29
+ h = h.view(h.size(0), h.size(1), 1)
30
+ gamma, beta = torch.chunk(h, chunks=2, dim=1)
31
+ return (1 + gamma) * self.norm(x) + beta
32
+
33
+
34
+ class AdaINResBlock1(nn.Module):
35
+ def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5), style_dim=64):
36
+ super(AdaINResBlock1, self).__init__()
37
+ self.convs1 = nn.ModuleList([
38
+ weight_norm(nn.Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
39
+ padding=get_padding(kernel_size, dilation[0]))),
40
+ weight_norm(nn.Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
41
+ padding=get_padding(kernel_size, dilation[1]))),
42
+ weight_norm(nn.Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
43
+ padding=get_padding(kernel_size, dilation[2])))
44
+ ])
45
+ self.convs1.apply(init_weights)
46
+ self.convs2 = nn.ModuleList([
47
+ weight_norm(nn.Conv1d(channels, channels, kernel_size, 1, dilation=1,
48
+ padding=get_padding(kernel_size, 1))),
49
+ weight_norm(nn.Conv1d(channels, channels, kernel_size, 1, dilation=1,
50
+ padding=get_padding(kernel_size, 1))),
51
+ weight_norm(nn.Conv1d(channels, channels, kernel_size, 1, dilation=1,
52
+ padding=get_padding(kernel_size, 1)))
53
+ ])
54
+ self.convs2.apply(init_weights)
55
+ self.adain1 = nn.ModuleList([
56
+ AdaIN1d(style_dim, channels),
57
+ AdaIN1d(style_dim, channels),
58
+ AdaIN1d(style_dim, channels),
59
+ ])
60
+ self.adain2 = nn.ModuleList([
61
+ AdaIN1d(style_dim, channels),
62
+ AdaIN1d(style_dim, channels),
63
+ AdaIN1d(style_dim, channels),
64
+ ])
65
+ self.alpha1 = nn.ParameterList([nn.Parameter(torch.ones(1, channels, 1)) for i in range(len(self.convs1))])
66
+ self.alpha2 = nn.ParameterList([nn.Parameter(torch.ones(1, channels, 1)) for i in range(len(self.convs2))])
67
+
68
+ def forward(self, x, s):
69
+ for c1, c2, n1, n2, a1, a2 in zip(self.convs1, self.convs2, self.adain1, self.adain2, self.alpha1, self.alpha2):
70
+ xt = n1(x, s)
71
+ xt = xt + (1 / a1) * (torch.sin(a1 * xt) ** 2) # Snake1D
72
+ xt = c1(xt)
73
+ xt = n2(xt, s)
74
+ xt = xt + (1 / a2) * (torch.sin(a2 * xt) ** 2) # Snake1D
75
+ xt = c2(xt)
76
+ x = xt + x
77
+ return x
78
+
79
+
80
+ class TorchSTFT(nn.Module):
81
+ def __init__(self, filter_length=800, hop_length=200, win_length=800, window='hann'):
82
+ super().__init__()
83
+ self.filter_length = filter_length
84
+ self.hop_length = hop_length
85
+ self.win_length = win_length
86
+ assert window == 'hann', window
87
+ self.window = torch.hann_window(win_length, periodic=True, dtype=torch.float32)
88
+
89
+ def transform(self, input_data):
90
+ forward_transform = torch.stft(
91
+ input_data,
92
+ self.filter_length, self.hop_length, self.win_length, window=self.window.to(input_data.device),
93
+ return_complex=True)
94
+ return torch.abs(forward_transform), torch.angle(forward_transform)
95
+
96
+ def inverse(self, magnitude, phase):
97
+ inverse_transform = torch.istft(
98
+ magnitude * torch.exp(phase * 1j),
99
+ self.filter_length, self.hop_length, self.win_length, window=self.window.to(magnitude.device))
100
+ return inverse_transform.unsqueeze(-2) # unsqueeze to stay consistent with conv_transpose1d implementation
101
+
102
+ def forward(self, input_data):
103
+ self.magnitude, self.phase = self.transform(input_data)
104
+ reconstruction = self.inverse(self.magnitude, self.phase)
105
+ return reconstruction
106
+
107
+
108
+ class SineGen(nn.Module):
109
+ """ Definition of sine generator
110
+ SineGen(samp_rate, upsample_scale, harmonic_num = 0,
111
+ sine_amp = 0.1, noise_std = 0.003,
112
+ voiced_threshold = 0,
113
+ flag_for_pulse=False)
114
+ samp_rate: sampling rate in Hz
115
+ harmonic_num: number of harmonic overtones (default 0)
116
+ sine_amp: amplitude of sine waveform (default 0.1)
117
+ noise_std: std of Gaussian noise (default 0.003)
118
+ voiced_threshold: F0 threshold for U/V classification (default 0)
119
+ flag_for_pulse: whether this SineGen is used inside PulseGen (default False)
120
+ Note: when flag_for_pulse is True, the first time step of a voiced
121
+ segment is always sin(torch.pi) or cos(0)
122
+ """
123
+ def __init__(self, samp_rate, upsample_scale, harmonic_num=0,
124
+ sine_amp=0.1, noise_std=0.003,
125
+ voiced_threshold=0,
126
+ flag_for_pulse=False):
127
+ super(SineGen, self).__init__()
128
+ self.sine_amp = sine_amp
129
+ self.noise_std = noise_std
130
+ self.harmonic_num = harmonic_num
131
+ self.dim = self.harmonic_num + 1
132
+ self.sampling_rate = samp_rate
133
+ self.voiced_threshold = voiced_threshold
134
+ self.flag_for_pulse = flag_for_pulse
135
+ self.upsample_scale = upsample_scale
136
+
137
+ def _f02uv(self, f0):
138
+ # generate uv signal
139
+ uv = (f0 > self.voiced_threshold).type(torch.float32)
140
+ return uv
141
+
142
+ def _f02sine(self, f0_values):
143
+ """ f0_values: (batchsize, length, dim)
144
+ where dim indicates fundamental tone and overtones
145
+ """
146
+ # convert to F0 in rad. The integer part n can be ignored
147
+ # because 2 * torch.pi * n doesn't affect phase
148
+ rad_values = (f0_values / self.sampling_rate) % 1
149
+ # initial phase noise (no noise for fundamental component)
150
+ rand_ini = torch.rand(f0_values.shape[0], f0_values.shape[2], device=f0_values.device)
151
+ rand_ini[:, 0] = 0
152
+ rad_values[:, 0, :] = rad_values[:, 0, :] + rand_ini
153
+ # instantaneous phase sine[t] = sin(2*pi \sum_{i=1}^{t} rad)
154
+ if not self.flag_for_pulse:
155
+ rad_values = F.interpolate(rad_values.transpose(1, 2), scale_factor=1/self.upsample_scale, mode="linear").transpose(1, 2)
156
+ phase = torch.cumsum(rad_values, dim=1) * 2 * torch.pi
157
+ phase = F.interpolate(phase.transpose(1, 2) * self.upsample_scale, scale_factor=self.upsample_scale, mode="linear").transpose(1, 2)
158
+ sines = torch.sin(phase)
159
+ else:
160
+ # If necessary, make sure that the first time step of every
161
+ # voiced segment is sin(pi) or cos(0)
162
+ # This is used for pulse-train generation
163
+ # identify the last time step in unvoiced segments
164
+ uv = self._f02uv(f0_values)
165
+ uv_1 = torch.roll(uv, shifts=-1, dims=1)
166
+ uv_1[:, -1, :] = 1
167
+ u_loc = (uv < 1) * (uv_1 > 0)
168
+ # get the instantaneous phase
169
+ tmp_cumsum = torch.cumsum(rad_values, dim=1)
170
+ # different batch needs to be processed differently
171
+ for idx in range(f0_values.shape[0]):
172
+ temp_sum = tmp_cumsum[idx, u_loc[idx, :, 0], :]
173
+ temp_sum[1:, :] = temp_sum[1:, :] - temp_sum[0:-1, :]
174
+ # stores the accumulation of i.phase within
175
+ # each voiced segments
176
+ tmp_cumsum[idx, :, :] = 0
177
+ tmp_cumsum[idx, u_loc[idx, :, 0], :] = temp_sum
178
+ # rad_values - tmp_cumsum: remove the accumulation of i.phase
179
+ # within the previous voiced segment.
180
+ i_phase = torch.cumsum(rad_values - tmp_cumsum, dim=1)
181
+ # get the sines
182
+ sines = torch.cos(i_phase * 2 * torch.pi)
183
+ return sines
184
+
185
+ def forward(self, f0):
186
+ """ sine_tensor, uv = forward(f0)
187
+ input F0: tensor(batchsize=1, length, dim=1)
188
+ f0 for unvoiced steps should be 0
189
+ output sine_tensor: tensor(batchsize=1, length, dim)
190
+ output uv: tensor(batchsize=1, length, 1)
191
+ """
192
+ f0_buf = torch.zeros(f0.shape[0], f0.shape[1], self.dim, device=f0.device)
193
+ # fundamental component
194
+ fn = torch.multiply(f0, torch.FloatTensor([[range(1, self.harmonic_num + 2)]]).to(f0.device))
195
+ # generate sine waveforms
196
+ sine_waves = self._f02sine(fn) * self.sine_amp
197
+ # generate uv signal
198
+ # uv = torch.ones(f0.shape)
199
+ # uv = uv * (f0 > self.voiced_threshold)
200
+ uv = self._f02uv(f0)
201
+ # noise: for unvoiced should be similar to sine_amp
202
+ # std = self.sine_amp/3 -> max value ~ self.sine_amp
203
+ # for voiced regions is self.noise_std
204
+ noise_amp = uv * self.noise_std + (1 - uv) * self.sine_amp / 3
205
+ noise = noise_amp * torch.randn_like(sine_waves)
206
+ # first: set the unvoiced part to 0 by uv
207
+ # then: additive noise
208
+ sine_waves = sine_waves * uv + noise
209
+ return sine_waves, uv, noise
210
+
211
+
212
+ class SourceModuleHnNSF(nn.Module):
213
+ """ SourceModule for hn-nsf
214
+ SourceModule(sampling_rate, upsample_scale, harmonic_num=0, sine_amp=0.1,
215
+ add_noise_std=0.003, voiced_threshod=0)
216
+ sampling_rate: sampling_rate in Hz
217
+ harmonic_num: number of harmonic above F0 (default: 0)
218
+ sine_amp: amplitude of sine source signal (default: 0.1)
219
+ add_noise_std: std of additive Gaussian noise (default: 0.003)
220
+ note that amplitude of noise in unvoiced is decided
221
+ by sine_amp
222
+ voiced_threshold: threshold to set U/V given F0 (default: 0)
223
+ Sine_source, noise_source = SourceModuleHnNSF(F0_sampled)
224
+ F0_sampled (batchsize, length, 1)
225
+ Sine_source (batchsize, length, 1)
226
+ noise_source (batchsize, length 1)
227
+ uv (batchsize, length, 1)
228
+ """
229
+ def __init__(self, sampling_rate, upsample_scale, harmonic_num=0, sine_amp=0.1,
230
+ add_noise_std=0.003, voiced_threshod=0):
231
+ super(SourceModuleHnNSF, self).__init__()
232
+ self.sine_amp = sine_amp
233
+ self.noise_std = add_noise_std
234
+ # to produce sine waveforms
235
+ self.l_sin_gen = SineGen(sampling_rate, upsample_scale, harmonic_num,
236
+ sine_amp, add_noise_std, voiced_threshod)
237
+ # to merge source harmonics into a single excitation
238
+ self.l_linear = nn.Linear(harmonic_num + 1, 1)
239
+ self.l_tanh = nn.Tanh()
240
+
241
+ def forward(self, x):
242
+ """
243
+ Sine_source, noise_source = SourceModuleHnNSF(F0_sampled)
244
+ F0_sampled (batchsize, length, 1)
245
+ Sine_source (batchsize, length, 1)
246
+ noise_source (batchsize, length 1)
247
+ """
248
+ # source for harmonic branch
249
+ with torch.no_grad():
250
+ sine_wavs, uv, _ = self.l_sin_gen(x)
251
+ sine_merge = self.l_tanh(self.l_linear(sine_wavs))
252
+ # source for noise branch, in the same shape as uv
253
+ noise = torch.randn_like(uv) * self.sine_amp / 3
254
+ return sine_merge, noise, uv
255
+
256
+
257
+ class Generator(nn.Module):
258
+ def __init__(self, style_dim, resblock_kernel_sizes, upsample_rates, upsample_initial_channel, resblock_dilation_sizes, upsample_kernel_sizes, gen_istft_n_fft, gen_istft_hop_size, disable_complex=False):
259
+ super(Generator, self).__init__()
260
+ self.num_kernels = len(resblock_kernel_sizes)
261
+ self.num_upsamples = len(upsample_rates)
262
+ self.m_source = SourceModuleHnNSF(
263
+ sampling_rate=24000,
264
+ upsample_scale=math.prod(upsample_rates) * gen_istft_hop_size,
265
+ harmonic_num=8, voiced_threshod=10)
266
+ self.f0_upsamp = nn.Upsample(scale_factor=math.prod(upsample_rates) * gen_istft_hop_size)
267
+ self.noise_convs = nn.ModuleList()
268
+ self.noise_res = nn.ModuleList()
269
+ self.ups = nn.ModuleList()
270
+ for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
271
+ self.ups.append(weight_norm(
272
+ nn.ConvTranspose1d(upsample_initial_channel//(2**i), upsample_initial_channel//(2**(i+1)),
273
+ k, u, padding=(k-u)//2)))
274
+ self.resblocks = nn.ModuleList()
275
+ for i in range(len(self.ups)):
276
+ ch = upsample_initial_channel//(2**(i+1))
277
+ for j, (k, d) in enumerate(zip(resblock_kernel_sizes,resblock_dilation_sizes)):
278
+ self.resblocks.append(AdaINResBlock1(ch, k, d, style_dim))
279
+ c_cur = upsample_initial_channel // (2 ** (i + 1))
280
+ if i + 1 < len(upsample_rates):
281
+ stride_f0 = math.prod(upsample_rates[i + 1:])
282
+ self.noise_convs.append(nn.Conv1d(
283
+ gen_istft_n_fft + 2, c_cur, kernel_size=stride_f0 * 2, stride=stride_f0, padding=(stride_f0+1) // 2))
284
+ self.noise_res.append(AdaINResBlock1(c_cur, 7, [1,3,5], style_dim))
285
+ else:
286
+ self.noise_convs.append(nn.Conv1d(gen_istft_n_fft + 2, c_cur, kernel_size=1))
287
+ self.noise_res.append(AdaINResBlock1(c_cur, 11, [1,3,5], style_dim))
288
+ self.post_n_fft = gen_istft_n_fft
289
+ self.conv_post = weight_norm(nn.Conv1d(ch, self.post_n_fft + 2, 7, 1, padding=3))
290
+ self.ups.apply(init_weights)
291
+ self.conv_post.apply(init_weights)
292
+ self.reflection_pad = nn.ReflectionPad1d((1, 0))
293
+ self.stft = (
294
+ CustomSTFT(filter_length=gen_istft_n_fft, hop_length=gen_istft_hop_size, win_length=gen_istft_n_fft)
295
+ if disable_complex
296
+ else TorchSTFT(filter_length=gen_istft_n_fft, hop_length=gen_istft_hop_size, win_length=gen_istft_n_fft)
297
+ )
298
+
299
+ def forward(self, x, s, f0):
300
+ with torch.no_grad():
301
+ f0 = self.f0_upsamp(f0[:, None]).transpose(1, 2) # bs,n,t
302
+ har_source, noi_source, uv = self.m_source(f0)
303
+ har_source = har_source.transpose(1, 2).squeeze(1)
304
+ har_spec, har_phase = self.stft.transform(har_source)
305
+ har = torch.cat([har_spec, har_phase], dim=1)
306
+ for i in range(self.num_upsamples):
307
+ x = F.leaky_relu(x, negative_slope=0.1)
308
+ x_source = self.noise_convs[i](har)
309
+ x_source = self.noise_res[i](x_source, s)
310
+ x = self.ups[i](x)
311
+ if i == self.num_upsamples - 1:
312
+ x = self.reflection_pad(x)
313
+ x = x + x_source
314
+ xs = None
315
+ for j in range(self.num_kernels):
316
+ if xs is None:
317
+ xs = self.resblocks[i*self.num_kernels+j](x, s)
318
+ else:
319
+ xs += self.resblocks[i*self.num_kernels+j](x, s)
320
+ x = xs / self.num_kernels
321
+ x = F.leaky_relu(x)
322
+ x = self.conv_post(x)
323
+ spec = torch.exp(x[:,:self.post_n_fft // 2 + 1, :])
324
+ phase = torch.sin(x[:, self.post_n_fft // 2 + 1:, :])
325
+ return self.stft.inverse(spec, phase)
326
+
327
+
328
+ class UpSample1d(nn.Module):
329
+ def __init__(self, layer_type):
330
+ super().__init__()
331
+ self.layer_type = layer_type
332
+
333
+ def forward(self, x):
334
+ if self.layer_type == 'none':
335
+ return x
336
+ else:
337
+ return F.interpolate(x, scale_factor=2, mode='nearest')
338
+
339
+
340
+ class AdainResBlk1d(nn.Module):
341
+ def __init__(self, dim_in, dim_out, style_dim=64, actv=nn.LeakyReLU(0.2), upsample='none', dropout_p=0.0):
342
+ super().__init__()
343
+ self.actv = actv
344
+ self.upsample_type = upsample
345
+ self.upsample = UpSample1d(upsample)
346
+ self.learned_sc = dim_in != dim_out
347
+ self._build_weights(dim_in, dim_out, style_dim)
348
+ self.dropout = nn.Dropout(dropout_p)
349
+ if upsample == 'none':
350
+ self.pool = nn.Identity()
351
+ else:
352
+ self.pool = weight_norm(nn.ConvTranspose1d(dim_in, dim_in, kernel_size=3, stride=2, groups=dim_in, padding=1, output_padding=1))
353
+
354
+ def _build_weights(self, dim_in, dim_out, style_dim):
355
+ self.conv1 = weight_norm(nn.Conv1d(dim_in, dim_out, 3, 1, 1))
356
+ self.conv2 = weight_norm(nn.Conv1d(dim_out, dim_out, 3, 1, 1))
357
+ self.norm1 = AdaIN1d(style_dim, dim_in)
358
+ self.norm2 = AdaIN1d(style_dim, dim_out)
359
+ if self.learned_sc:
360
+ self.conv1x1 = weight_norm(nn.Conv1d(dim_in, dim_out, 1, 1, 0, bias=False))
361
+
362
+ def _shortcut(self, x):
363
+ x = self.upsample(x)
364
+ if self.learned_sc:
365
+ x = self.conv1x1(x)
366
+ return x
367
+
368
+ def _residual(self, x, s):
369
+ x = self.norm1(x, s)
370
+ x = self.actv(x)
371
+ x = self.pool(x)
372
+ x = self.conv1(self.dropout(x))
373
+ x = self.norm2(x, s)
374
+ x = self.actv(x)
375
+ x = self.conv2(self.dropout(x))
376
+ return x
377
+
378
+ def forward(self, x, s):
379
+ out = self._residual(x, s)
380
+ out = (out + self._shortcut(x)) * torch.rsqrt(torch.tensor(2))
381
+ return out
382
+
383
+
384
+ class Decoder(nn.Module):
385
+ def __init__(self, dim_in, style_dim, dim_out,
386
+ resblock_kernel_sizes,
387
+ upsample_rates,
388
+ upsample_initial_channel,
389
+ resblock_dilation_sizes,
390
+ upsample_kernel_sizes,
391
+ gen_istft_n_fft, gen_istft_hop_size,
392
+ disable_complex=False):
393
+ super().__init__()
394
+ self.encode = AdainResBlk1d(dim_in + 2, 1024, style_dim)
395
+ self.decode = nn.ModuleList()
396
+ self.decode.append(AdainResBlk1d(1024 + 2 + 64, 1024, style_dim))
397
+ self.decode.append(AdainResBlk1d(1024 + 2 + 64, 1024, style_dim))
398
+ self.decode.append(AdainResBlk1d(1024 + 2 + 64, 1024, style_dim))
399
+ self.decode.append(AdainResBlk1d(1024 + 2 + 64, 512, style_dim, upsample=True))
400
+ self.F0_conv = weight_norm(nn.Conv1d(1, 1, kernel_size=3, stride=2, groups=1, padding=1))
401
+ self.N_conv = weight_norm(nn.Conv1d(1, 1, kernel_size=3, stride=2, groups=1, padding=1))
402
+ self.asr_res = nn.Sequential(weight_norm(nn.Conv1d(512, 64, kernel_size=1)))
403
+ self.generator = Generator(style_dim, resblock_kernel_sizes, upsample_rates,
404
+ upsample_initial_channel, resblock_dilation_sizes,
405
+ upsample_kernel_sizes, gen_istft_n_fft, gen_istft_hop_size, disable_complex=disable_complex)
406
+
407
+ def forward(self, asr, F0_curve, N, s):
408
+ F0 = self.F0_conv(F0_curve.unsqueeze(1))
409
+ N = self.N_conv(N.unsqueeze(1))
410
+ x = torch.cat([asr, F0, N], axis=1)
411
+ x = self.encode(x, s)
412
+ asr_res = self.asr_res(asr)
413
+ res = True
414
+ for block in self.decode:
415
+ if res:
416
+ x = torch.cat([x, asr_res, F0, N], axis=1)
417
+ x = block(x, s)
418
+ if block.upsample_type != "none":
419
+ res = False
420
+ x = self.generator(x, s, F0_curve)
421
+ return x
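A quick shape-check sketch for the AdaIN-style residual block above, with random tensors; the dimensions are illustrative and not the model's real configuration:

import torch

style_dim, channels, frames = 128, 64, 50      # illustrative sizes
block = AdaINResBlock1(channels, kernel_size=3, dilation=(1, 3, 5), style_dim=style_dim)
x = torch.randn(2, channels, frames)            # (B, C, T) feature map
s = torch.randn(2, style_dim)                   # per-utterance style vector
y = block(x, s)                                 # Snake1D activations + AdaIN conditioning
print(y.shape)                                  # same (B, C, T) shape as the input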
kokoro/model.py ADDED
@@ -0,0 +1,151 @@
1
+ from .istftnet import Decoder
2
+ from .modules import CustomAlbert, ProsodyPredictor, TextEncoder
3
+ from dataclasses import dataclass
4
+ from huggingface_hub import hf_hub_download
5
+ from loguru import logger
6
+ from transformers import AlbertConfig
7
+ from typing import Dict, Optional, Union
8
+ import json
9
+ import torch
10
+
11
+ class KModel(torch.nn.Module):
12
+ '''
13
+ KModel is a torch.nn.Module with 2 main responsibilities:
14
+ 1. Init weights, downloading config.json + model.pth from HF if needed
15
+ 2. forward(phonemes: str, ref_s: FloatTensor) -> (audio: FloatTensor)
16
+
17
+ You likely only need one KModel instance, and it can be reused across
18
+ multiple KPipelines to avoid redundant memory allocation.
19
+
20
+ Unlike KPipeline, KModel is language-blind.
21
+
22
+ KModel stores self.vocab and thus knows how to map phonemes -> input_ids,
23
+ so there is no need to repeatedly download config.json outside of KModel.
24
+ '''
25
+
26
+ MODEL_NAMES = {
27
+ 'hexgrad/Kokoro-82M': 'kokoro-v1_0.pth',
28
+ 'hexgrad/Kokoro-82M-v1.1-zh': 'kokoro-v1_1-zh.pth',
29
+ }
30
+
31
+ def __init__(
32
+ self,
33
+ repo_id: Optional[str] = None,
34
+ config: Union[Dict, str, None] = None,
35
+ model: Optional[str] = None,
36
+ disable_complex: bool = False
37
+ ):
38
+ super().__init__()
39
+ if repo_id is None:
40
+ repo_id = 'hexgrad/Kokoro-82M'
41
+ print(f"WARNING: Defaulting repo_id to {repo_id}. Pass repo_id='{repo_id}' to suppress this warning.")
42
+ self.repo_id = repo_id
43
+ if not isinstance(config, dict):
44
+ if not config:
45
+ logger.debug("No config provided, downloading from HF")
46
+ config = hf_hub_download(repo_id=repo_id, filename='config.json')
47
+ with open(config, 'r', encoding='utf-8') as r:
48
+ config = json.load(r)
49
+ logger.debug(f"Loaded config: {config}")
50
+ self.vocab = config['vocab']
51
+ self.bert = CustomAlbert(AlbertConfig(vocab_size=config['n_token'], **config['plbert']))
52
+ self.bert_encoder = torch.nn.Linear(self.bert.config.hidden_size, config['hidden_dim'])
53
+ self.context_length = self.bert.config.max_position_embeddings
54
+ self.predictor = ProsodyPredictor(
55
+ style_dim=config['style_dim'], d_hid=config['hidden_dim'],
56
+ nlayers=config['n_layer'], max_dur=config['max_dur'], dropout=config['dropout']
57
+ )
58
+ self.text_encoder = TextEncoder(
59
+ channels=config['hidden_dim'], kernel_size=config['text_encoder_kernel_size'],
60
+ depth=config['n_layer'], n_symbols=config['n_token']
61
+ )
62
+ self.decoder = Decoder(
63
+ dim_in=config['hidden_dim'], style_dim=config['style_dim'],
64
+ dim_out=config['n_mels'], disable_complex=disable_complex, **config['istftnet']
65
+ )
66
+ if not model:
67
+ model = hf_hub_download(repo_id=repo_id, filename=KModel.MODEL_NAMES[repo_id])
68
+ for key, state_dict in torch.load(model, map_location='cpu', weights_only=True).items():
69
+ assert hasattr(self, key), key
70
+ try:
71
+ getattr(self, key).load_state_dict(state_dict)
72
+ except Exception:
73
+ logger.debug(f"Retrying {key} with 'module.' prefix stripped from state_dict keys")
74
+ state_dict = {k[7:]: v for k, v in state_dict.items()}
75
+ getattr(self, key).load_state_dict(state_dict, strict=False)
76
+
77
+ @property
78
+ def device(self):
79
+ return self.bert.device
80
+
81
+ @dataclass
82
+ class Output:
83
+ audio: torch.FloatTensor
84
+ pred_dur: Optional[torch.LongTensor] = None
85
+
86
+ @torch.no_grad()
87
+ def forward_with_tokens(
88
+ self,
89
+ input_ids: torch.LongTensor,
90
+ ref_s: torch.FloatTensor,
91
+ speed: float = 1
92
+ ) -> tuple[torch.FloatTensor, torch.LongTensor]:
93
+ input_lengths = torch.full(
94
+ (input_ids.shape[0],),
95
+ input_ids.shape[-1],
96
+ device=input_ids.device,
97
+ dtype=torch.long
98
+ )
99
+
100
+ text_mask = torch.arange(input_lengths.max()).unsqueeze(0).expand(input_lengths.shape[0], -1).type_as(input_lengths)
101
+ text_mask = torch.gt(text_mask+1, input_lengths.unsqueeze(1)).to(self.device)
102
+ bert_dur = self.bert(input_ids, attention_mask=(~text_mask).int())
103
+ d_en = self.bert_encoder(bert_dur).transpose(-1, -2)
104
+ s = ref_s[:, 128:]
105
+ d = self.predictor.text_encoder(d_en, s, input_lengths, text_mask)
106
+ x, _ = self.predictor.lstm(d)
107
+ duration = self.predictor.duration_proj(x)
108
+ duration = torch.sigmoid(duration).sum(axis=-1) / speed
109
+ pred_dur = torch.round(duration).clamp(min=1).long().squeeze()
110
+ indices = torch.repeat_interleave(torch.arange(input_ids.shape[1], device=self.device), pred_dur)
111
+ pred_aln_trg = torch.zeros((input_ids.shape[1], indices.shape[0]), device=self.device)
112
+ pred_aln_trg[indices, torch.arange(indices.shape[0])] = 1
113
+ pred_aln_trg = pred_aln_trg.unsqueeze(0).to(self.device)
114
+ en = d.transpose(-1, -2) @ pred_aln_trg
115
+ F0_pred, N_pred = self.predictor.F0Ntrain(en, s)
116
+ t_en = self.text_encoder(input_ids, input_lengths, text_mask)
117
+ asr = t_en @ pred_aln_trg
118
+ audio = self.decoder(asr, F0_pred, N_pred, ref_s[:, :128]).squeeze()
119
+ return audio, pred_dur
120
+
121
+ def forward(
122
+ self,
123
+ phonemes: str,
124
+ ref_s: torch.FloatTensor,
125
+ speed: float = 1,
126
+ return_output: bool = False
127
+ ) -> Union['KModel.Output', torch.FloatTensor]:
128
+ input_ids = list(filter(lambda i: i is not None, map(lambda p: self.vocab.get(p), phonemes)))
129
+ logger.debug(f"phonemes: {phonemes} -> input_ids: {input_ids}")
130
+ assert len(input_ids)+2 <= self.context_length, (len(input_ids)+2, self.context_length)
131
+ input_ids = torch.LongTensor([[0, *input_ids, 0]]).to(self.device)
132
+ ref_s = ref_s.to(self.device)
133
+ audio, pred_dur = self.forward_with_tokens(input_ids, ref_s, speed)
134
+ audio = audio.squeeze().cpu()
135
+ pred_dur = pred_dur.cpu() if pred_dur is not None else None
136
+ logger.debug(f"pred_dur: {pred_dur}")
137
+ return self.Output(audio=audio, pred_dur=pred_dur) if return_output else audio
138
+
139
+ class KModelForONNX(torch.nn.Module):
140
+ def __init__(self, kmodel: KModel):
141
+ super().__init__()
142
+ self.kmodel = kmodel
143
+
144
+ def forward(
145
+ self,
146
+ input_ids: torch.LongTensor,
147
+ ref_s: torch.FloatTensor,
148
+ speed: float = 1
149
+ ) -> tuple[torch.FloatTensor, torch.LongTensor]:
150
+ waveform, duration = self.kmodel.forward_with_tokens(input_ids, ref_s, speed)
151
+ return waveform, duration
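A hedged sketch of driving KModel directly with a phoneme string; in normal use KPipeline supplies both the phonemes and the 256-dim reference style, so the random ref_s below only illustrates shapes and will produce noise, not speech:

import torch

model = KModel(repo_id='hexgrad/Kokoro-82M').eval()   # downloads config.json + weights from HF
phonemes = "həlˈoʊ wˈɜɹld"                            # illustrative phoneme string
ref_s = torch.randn(1, 256)                           # first 128 dims condition the decoder, last 128 the predictor
out = model(phonemes, ref_s, speed=1, return_output=True)
print(out.audio.shape, out.pred_dur)                  # 24 kHz waveform and per-token durations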
kokoro/modules.py ADDED
@@ -0,0 +1,183 @@
1
+ # https://github.com/yl4579/StyleTTS2/blob/main/models.py
2
+ from .istftnet import AdainResBlk1d
3
+ from torch.nn.utils.parametrizations import weight_norm
4
+ from transformers import AlbertModel
5
+ import numpy as np
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.nn.functional as F
9
+
10
+
11
+ class LinearNorm(nn.Module):
12
+ def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'):
13
+ super(LinearNorm, self).__init__()
14
+ self.linear_layer = nn.Linear(in_dim, out_dim, bias=bias)
15
+ nn.init.xavier_uniform_(self.linear_layer.weight, gain=nn.init.calculate_gain(w_init_gain))
16
+
17
+ def forward(self, x):
18
+ return self.linear_layer(x)
19
+
20
+
21
+ class LayerNorm(nn.Module):
22
+ def __init__(self, channels, eps=1e-5):
23
+ super().__init__()
24
+ self.channels = channels
25
+ self.eps = eps
26
+ self.gamma = nn.Parameter(torch.ones(channels))
27
+ self.beta = nn.Parameter(torch.zeros(channels))
28
+
29
+ def forward(self, x):
30
+ x = x.transpose(1, -1)
31
+ x = F.layer_norm(x, (self.channels,), self.gamma, self.beta, self.eps)
32
+ return x.transpose(1, -1)
33
+
34
+
35
+ class TextEncoder(nn.Module):
36
+ def __init__(self, channels, kernel_size, depth, n_symbols, actv=nn.LeakyReLU(0.2)):
37
+ super().__init__()
38
+ self.embedding = nn.Embedding(n_symbols, channels)
39
+ padding = (kernel_size - 1) // 2
40
+ self.cnn = nn.ModuleList()
41
+ for _ in range(depth):
42
+ self.cnn.append(nn.Sequential(
43
+ weight_norm(nn.Conv1d(channels, channels, kernel_size=kernel_size, padding=padding)),
44
+ LayerNorm(channels),
45
+ actv,
46
+ nn.Dropout(0.2),
47
+ ))
48
+ self.lstm = nn.LSTM(channels, channels//2, 1, batch_first=True, bidirectional=True)
49
+
50
+ def forward(self, x, input_lengths, m):
51
+ x = self.embedding(x) # [B, T, emb]
52
+ x = x.transpose(1, 2) # [B, emb, T]
53
+ m = m.unsqueeze(1)
54
+ x.masked_fill_(m, 0.0)
55
+ for c in self.cnn:
56
+ x = c(x)
57
+ x.masked_fill_(m, 0.0)
58
+ x = x.transpose(1, 2) # [B, T, chn]
59
+ lengths = input_lengths if input_lengths.device == torch.device('cpu') else input_lengths.to('cpu')
60
+ x = nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
61
+ self.lstm.flatten_parameters()
62
+ x, _ = self.lstm(x)
63
+ x, _ = nn.utils.rnn.pad_packed_sequence(x, batch_first=True)
64
+ x = x.transpose(-1, -2)
65
+ x_pad = torch.zeros([x.shape[0], x.shape[1], m.shape[-1]], device=x.device)
66
+ x_pad[:, :, :x.shape[-1]] = x
67
+ x = x_pad
68
+ x.masked_fill_(m, 0.0)
69
+ return x
70
+
71
+
72
+ class AdaLayerNorm(nn.Module):
73
+ def __init__(self, style_dim, channels, eps=1e-5):
74
+ super().__init__()
75
+ self.channels = channels
76
+ self.eps = eps
77
+ self.fc = nn.Linear(style_dim, channels*2)
78
+
79
+ def forward(self, x, s):
80
+ x = x.transpose(-1, -2)
81
+ x = x.transpose(1, -1)
82
+ h = self.fc(s)
83
+ h = h.view(h.size(0), h.size(1), 1)
84
+ gamma, beta = torch.chunk(h, chunks=2, dim=1)
85
+ gamma, beta = gamma.transpose(1, -1), beta.transpose(1, -1)
86
+ x = F.layer_norm(x, (self.channels,), eps=self.eps)
87
+ x = (1 + gamma) * x + beta
88
+ return x.transpose(1, -1).transpose(-1, -2)
89
+
90
+
91
+ class ProsodyPredictor(nn.Module):
92
+ def __init__(self, style_dim, d_hid, nlayers, max_dur=50, dropout=0.1):
93
+ super().__init__()
94
+ self.text_encoder = DurationEncoder(sty_dim=style_dim, d_model=d_hid, nlayers=nlayers, dropout=dropout)
95
+ self.lstm = nn.LSTM(d_hid + style_dim, d_hid // 2, 1, batch_first=True, bidirectional=True)
96
+ self.duration_proj = LinearNorm(d_hid, max_dur)
97
+ self.shared = nn.LSTM(d_hid + style_dim, d_hid // 2, 1, batch_first=True, bidirectional=True)
98
+ self.F0 = nn.ModuleList()
99
+ self.F0.append(AdainResBlk1d(d_hid, d_hid, style_dim, dropout_p=dropout))
100
+ self.F0.append(AdainResBlk1d(d_hid, d_hid // 2, style_dim, upsample=True, dropout_p=dropout))
101
+ self.F0.append(AdainResBlk1d(d_hid // 2, d_hid // 2, style_dim, dropout_p=dropout))
102
+ self.N = nn.ModuleList()
103
+ self.N.append(AdainResBlk1d(d_hid, d_hid, style_dim, dropout_p=dropout))
104
+ self.N.append(AdainResBlk1d(d_hid, d_hid // 2, style_dim, upsample=True, dropout_p=dropout))
105
+ self.N.append(AdainResBlk1d(d_hid // 2, d_hid // 2, style_dim, dropout_p=dropout))
106
+ self.F0_proj = nn.Conv1d(d_hid // 2, 1, 1, 1, 0)
107
+ self.N_proj = nn.Conv1d(d_hid // 2, 1, 1, 1, 0)
108
+
109
+ def forward(self, texts, style, text_lengths, alignment, m):
110
+ d = self.text_encoder(texts, style, text_lengths, m)
111
+ m = m.unsqueeze(1)
112
+ lengths = text_lengths if text_lengths.device == torch.device('cpu') else text_lengths.to('cpu')
113
+ x = nn.utils.rnn.pack_padded_sequence(d, lengths, batch_first=True, enforce_sorted=False)
114
+ self.lstm.flatten_parameters()
115
+ x, _ = self.lstm(x)
116
+ x, _ = nn.utils.rnn.pad_packed_sequence(x, batch_first=True)
117
+ x_pad = torch.zeros([x.shape[0], m.shape[-1], x.shape[-1]], device=x.device)
118
+ x_pad[:, :x.shape[1], :] = x
119
+ x = x_pad
120
+ duration = self.duration_proj(nn.functional.dropout(x, 0.5, training=False))
121
+ en = (d.transpose(-1, -2) @ alignment)
122
+ return duration.squeeze(-1), en
123
+
124
+ def F0Ntrain(self, x, s):
125
+ x, _ = self.shared(x.transpose(-1, -2))
126
+ F0 = x.transpose(-1, -2)
127
+ for block in self.F0:
128
+ F0 = block(F0, s)
129
+ F0 = self.F0_proj(F0)
130
+ N = x.transpose(-1, -2)
131
+ for block in self.N:
132
+ N = block(N, s)
133
+ N = self.N_proj(N)
134
+ return F0.squeeze(1), N.squeeze(1)
135
+
136
+
137
+ class DurationEncoder(nn.Module):
138
+ def __init__(self, sty_dim, d_model, nlayers, dropout=0.1):
139
+ super().__init__()
140
+ self.lstms = nn.ModuleList()
141
+ for _ in range(nlayers):
142
+ self.lstms.append(nn.LSTM(d_model + sty_dim, d_model // 2, num_layers=1, batch_first=True, bidirectional=True))
143
+ self.lstms.append(AdaLayerNorm(sty_dim, d_model))
144
+ self.dropout = dropout
145
+ self.d_model = d_model
146
+ self.sty_dim = sty_dim
147
+
148
+ def forward(self, x, style, text_lengths, m):
149
+ masks = m
150
+ x = x.permute(2, 0, 1)
151
+ s = style.expand(x.shape[0], x.shape[1], -1)
152
+ x = torch.cat([x, s], axis=-1)
153
+ x.masked_fill_(masks.unsqueeze(-1).transpose(0, 1), 0.0)
154
+ x = x.transpose(0, 1)
155
+ x = x.transpose(-1, -2)
156
+ for block in self.lstms:
157
+ if isinstance(block, AdaLayerNorm):
158
+ x = block(x.transpose(-1, -2), style).transpose(-1, -2)
159
+ x = torch.cat([x, s.permute(1, 2, 0)], axis=1)
160
+ x.masked_fill_(masks.unsqueeze(-1).transpose(-1, -2), 0.0)
161
+ else:
162
+ lengths = text_lengths if text_lengths.device == torch.device('cpu') else text_lengths.to('cpu')
163
+ x = x.transpose(-1, -2)
164
+ x = nn.utils.rnn.pack_padded_sequence(
165
+ x, lengths, batch_first=True, enforce_sorted=False)
166
+ block.flatten_parameters()
167
+ x, _ = block(x)
168
+ x, _ = nn.utils.rnn.pad_packed_sequence(
169
+ x, batch_first=True)
170
+ x = F.dropout(x, p=self.dropout, training=False)
171
+ x = x.transpose(-1, -2)
172
+ x_pad = torch.zeros([x.shape[0], x.shape[1], m.shape[-1]], device=x.device)
173
+ x_pad[:, :, :x.shape[-1]] = x
174
+ x = x_pad
175
+
176
+ return x.transpose(-1, -2)
177
+
178
+
179
+ # https://github.com/yl4579/StyleTTS2/blob/main/Utils/PLBERT/util.py
180
+ class CustomAlbert(AlbertModel):
181
+ def forward(self, *args, **kwargs):
182
+ outputs = super().forward(*args, **kwargs)
183
+ return outputs.last_hidden_state
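A dummy-input sketch for TextEncoder above; the sizes (hidden size 512, 178 symbols, kernel 5, depth 3) are assumptions here, and the random ids are only for a shape check:

import torch

enc = TextEncoder(channels=512, kernel_size=5, depth=3, n_symbols=178)
ids = torch.randint(0, 178, (1, 12))            # (B, T) token ids
lengths = torch.tensor([12])
mask = torch.zeros(1, 12, dtype=torch.bool)     # True marks padded positions
with torch.no_grad():
    out = enc(ids, lengths, mask)
print(out.shape)                                 # (B, channels, T)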
kokoro/pipeline.py ADDED
@@ -0,0 +1,442 @@
1
+ from .model import KModel
2
+ from dataclasses import dataclass
3
+ from huggingface_hub import hf_hub_download
4
+ from loguru import logger
5
+ from misaki import en, espeak
6
+ from typing import Callable, Generator, List, Optional, Tuple, Union
7
+ import re
8
+ import torch
9
+ import os
10
+
11
+ ALIASES = {
12
+ 'en-us': 'a',
13
+ 'en-gb': 'b',
14
+ 'es': 'e',
15
+ 'fr-fr': 'f',
16
+ 'hi': 'h',
17
+ 'it': 'i',
18
+ 'pt-br': 'p',
19
+ 'ja': 'j',
20
+ 'zh': 'z',
21
+ }
22
+
23
+ LANG_CODES = dict(
24
+ # pip install misaki[en]
25
+ a='American English',
26
+ b='British English',
27
+
28
+ # espeak-ng
29
+ e='es',
30
+ f='fr-fr',
31
+ h='hi',
32
+ i='it',
33
+ p='pt-br',
34
+
35
+ # pip install misaki[ja]
36
+ j='Japanese',
37
+
38
+ # pip install misaki[zh]
39
+ z='Mandarin Chinese',
40
+ )
41
+
42
+ class KPipeline:
43
+ '''
44
+ KPipeline is a language-aware support class with 2 main responsibilities:
45
+ 1. Perform language-specific G2P, mapping (and chunking) text -> phonemes
46
+ 2. Manage and store voices, lazily downloaded from HF if needed
47
+
48
+ You are expected to have one KPipeline per language. If you have multiple
49
+ KPipelines, you should reuse one KModel instance across all of them.
50
+
51
+ KPipeline is designed to work with a KModel, but this is not required.
52
+ There are 2 ways to pass an existing model into a pipeline:
53
+ 1. On init: us_pipeline = KPipeline(lang_code='a', model=model)
54
+ 2. On call: us_pipeline(text, voice, model=model)
55
+
56
+ By default, KPipeline will automatically initialize its own KModel. To
57
+ suppress this, construct a "quiet" KPipeline with model=False.
58
+
59
+ A "quiet" KPipeline yields (graphemes, phonemes, None) without generating
60
+ any audio. You can use this to phonemize and chunk your text in advance.
61
+
62
+ A "loud" KPipeline _with_ a model yields (graphemes, phonemes, audio).
63
+ '''
64
+ def __init__(
65
+ self,
66
+ lang_code: str,
67
+ repo_id: Optional[str] = None,
68
+ model: Union[KModel, bool] = True,
69
+ trf: bool = False,
70
+ en_callable: Optional[Callable[[str], str]] = None,
71
+ device: Optional[str] = None
72
+ ):
73
+ """Initialize a KPipeline.
74
+
75
+ Args:
76
+ lang_code: Language code for G2P processing
77
+ model: KModel instance, True to create new model, False for no model
78
+ trf: Whether to use transformer-based G2P
79
+ device: Override default device selection ('cuda' or 'cpu', or None for auto)
80
+ If None, will auto-select cuda if available
81
+ If 'cuda' and not available, will explicitly raise an error
82
+ """
83
+ if repo_id is None:
84
+ repo_id = 'hexgrad/Kokoro-82M'
85
+ print(f"WARNING: Defaulting repo_id to {repo_id}. Pass repo_id='{repo_id}' to suppress this warning.")
86
+ self.repo_id = repo_id
87
+ lang_code = lang_code.lower()
88
+ lang_code = ALIASES.get(lang_code, lang_code)
89
+ assert lang_code in LANG_CODES, (lang_code, LANG_CODES)
90
+ self.lang_code = lang_code
91
+ self.model = None
92
+ if isinstance(model, KModel):
93
+ self.model = model
94
+ elif model:
95
+ if device == 'cuda' and not torch.cuda.is_available():
96
+ raise RuntimeError("CUDA requested but not available")
97
+ if device == 'mps' and not torch.backends.mps.is_available():
98
+ raise RuntimeError("MPS requested but not available")
99
+ if device == 'mps' and os.environ.get('PYTORCH_ENABLE_MPS_FALLBACK') != '1':
100
+ raise RuntimeError("MPS requested but fallback not enabled")
101
+ if device is None:
102
+ if torch.cuda.is_available():
103
+ device = 'cuda'
104
+ elif os.environ.get('PYTORCH_ENABLE_MPS_FALLBACK') == '1' and torch.backends.mps.is_available():
105
+ device = 'mps'
106
+ else:
107
+ device = 'cpu'
108
+ try:
109
+ self.model = KModel(repo_id=repo_id).to(device).eval()
110
+ except RuntimeError as e:
111
+ if device == 'cuda':
112
+ raise RuntimeError(f"""Failed to initialize model on CUDA: {e}.
113
+ Try setting device='cpu' or check CUDA installation.""")
114
+ raise
115
+ self.voices = {}
116
+ if lang_code in 'ab':
117
+ try:
118
+ fallback = espeak.EspeakFallback(british=lang_code=='b')
119
+ except Exception as e:
120
+ logger.warning("EspeakFallback not Enabled: OOD words will be skipped")
121
+ logger.warning({str(e)})
122
+ fallback = None
123
+ self.g2p = en.G2P(trf=trf, british=lang_code=='b', fallback=fallback, unk='')
124
+ elif lang_code == 'j':
125
+ try:
126
+ from misaki import ja
127
+ self.g2p = ja.JAG2P()
128
+ except ImportError:
129
+ logger.error("You need to `pip install misaki[ja]` to use lang_code='j'")
130
+ raise
131
+ elif lang_code == 'z':
132
+ try:
133
+ from misaki import zh
134
+ self.g2p = zh.ZHG2P(
135
+ version=None if repo_id.endswith('/Kokoro-82M') else '1.1',
136
+ en_callable=en_callable
137
+ )
138
+ except ImportError:
139
+ logger.error("You need to `pip install misaki[zh]` to use lang_code='z'")
140
+ raise
141
+ else:
142
+ language = LANG_CODES[lang_code]
143
+ logger.warning(f"Using EspeakG2P(language='{language}'). Long texts are split into ~400-character chunks at sentence boundaries; add '\\n' breaks to control splitting manually.")
144
+ self.g2p = espeak.EspeakG2P(language=language)
145
+
146
+ def load_single_voice(self, voice: str):
147
+ if voice in self.voices:
148
+ return self.voices[voice]
149
+ if voice.endswith('.pt'):
150
+ f = voice
151
+ else:
152
+ f = hf_hub_download(repo_id=self.repo_id, filename=f'voices/{voice}.pt')
153
+ if not voice.startswith(self.lang_code):
154
+ v = LANG_CODES.get(voice, voice)
155
+ p = LANG_CODES.get(self.lang_code, self.lang_code)
156
+ logger.warning(f'Language mismatch, loading {v} voice into {p} pipeline.')
157
+ pack = torch.load(f, weights_only=True)
158
+ self.voices[voice] = pack
159
+ return pack
160
+
161
+ """
162
+ load_voice is a helper function that lazily downloads and loads a voice:
163
+ Single voice can be requested (e.g. 'af_bella') or multiple voices (e.g. 'af_bella,af_jessica').
164
+ If multiple voices are requested, they are averaged.
165
+ Delimiter is optional and defaults to ','.
166
+ """
167
+ def load_voice(self, voice: Union[str, torch.FloatTensor], delimiter: str = ",") -> torch.FloatTensor:
168
+ if isinstance(voice, torch.FloatTensor):
169
+ return voice
170
+ if voice in self.voices:
171
+ return self.voices[voice]
172
+ logger.debug(f"Loading voice: {voice}")
173
+ packs = [self.load_single_voice(v) for v in voice.split(delimiter)]
174
+ if len(packs) == 1:
175
+ return packs[0]
176
+ self.voices[voice] = torch.mean(torch.stack(packs), dim=0)
177
+ return self.voices[voice]
178
+
179
+ @staticmethod
180
+ def tokens_to_ps(tokens: List[en.MToken]) -> str:
181
+ return ''.join(t.phonemes + (' ' if t.whitespace else '') for t in tokens).strip()
182
+
183
+ @staticmethod
184
+ def waterfall_last(
185
+ tokens: List[en.MToken],
186
+ next_count: int,
187
+ waterfall: List[str] = ['!.?…', ':;', ',—'],
188
+ bumps: List[str] = [')', '”']
189
+ ) -> int:
190
+ for w in waterfall:
191
+ z = next((i for i, t in reversed(list(enumerate(tokens))) if t.phonemes in set(w)), None)
192
+ if z is None:
193
+ continue
194
+ z += 1
195
+ if z < len(tokens) and tokens[z].phonemes in bumps:
196
+ z += 1
197
+ if next_count - len(KPipeline.tokens_to_ps(tokens[:z])) <= 510:
198
+ return z
199
+ return len(tokens)
200
+
201
+ @staticmethod
202
+ def tokens_to_text(tokens: List[en.MToken]) -> str:
203
+ return ''.join(t.text + t.whitespace for t in tokens).strip()
204
+
205
+ def en_tokenize(
206
+ self,
207
+ tokens: List[en.MToken]
208
+ ) -> Generator[Tuple[str, str, List[en.MToken]], None, None]:
209
+ tks = []
210
+ pcount = 0
211
+ for t in tokens:
212
+ # American English: ɾ => T
213
+ t.phonemes = '' if t.phonemes is None else t.phonemes#.replace('ɾ', 'T')
214
+ next_ps = t.phonemes + (' ' if t.whitespace else '')
215
+ next_pcount = pcount + len(next_ps.rstrip())
216
+ if next_pcount > 510:
217
+ z = KPipeline.waterfall_last(tks, next_pcount)
218
+ text = KPipeline.tokens_to_text(tks[:z])
219
+ logger.debug(f"Chunking text at {z}: '{text[:30]}{'...' if len(text) > 30 else ''}'")
220
+ ps = KPipeline.tokens_to_ps(tks[:z])
221
+ yield text, ps, tks[:z]
222
+ tks = tks[z:]
223
+ pcount = len(KPipeline.tokens_to_ps(tks))
224
+ if not tks:
225
+ next_ps = next_ps.lstrip()
226
+ tks.append(t)
227
+ pcount += len(next_ps)
228
+ if tks:
229
+ text = KPipeline.tokens_to_text(tks)
230
+ ps = KPipeline.tokens_to_ps(tks)
231
+ yield ''.join(text).strip(), ''.join(ps).strip(), tks
232
+
233
+ @staticmethod
234
+ def infer(
235
+ model: KModel,
236
+ ps: str,
237
+ pack: torch.FloatTensor,
238
+ speed: Union[float, Callable[[int], float]] = 1
239
+ ) -> KModel.Output:
240
+ if callable(speed):
241
+ speed = speed(len(ps))
242
+ return model(ps, pack[len(ps)-1], speed, return_output=True)
243
+
244
+ def generate_from_tokens(
245
+ self,
246
+ tokens: Union[str, List[en.MToken]],
247
+ voice: str,
248
+ speed: float = 1,
249
+ model: Optional[KModel] = None
250
+ ) -> Generator['KPipeline.Result', None, None]:
251
+ """Generate audio from either raw phonemes or pre-processed tokens.
252
+
253
+ Args:
254
+ tokens: Either a phoneme string or list of pre-processed MTokens
255
+ voice: The voice to use for synthesis
256
+ speed: Speech speed modifier (default: 1)
257
+ model: Optional KModel instance (uses pipeline's model if not provided)
258
+
259
+ Yields:
260
+ KPipeline.Result containing the input tokens and generated audio
261
+
262
+ Raises:
263
+ ValueError: If no voice is provided or token sequence exceeds model limits
264
+ """
265
+ model = model or self.model
266
+ if model and voice is None:
267
+ raise ValueError('Specify a voice: pipeline.generate_from_tokens(..., voice="af_heart")')
268
+
269
+ pack = self.load_voice(voice).to(model.device) if model else None
270
+
271
+ # Handle raw phoneme string
272
+ if isinstance(tokens, str):
273
+ logger.debug("Processing phonemes from raw string")
274
+ if len(tokens) > 510:
275
+ raise ValueError(f'Phoneme string too long: {len(tokens)} > 510')
276
+ output = KPipeline.infer(model, tokens, pack, speed) if model else None
277
+ yield self.Result(graphemes='', phonemes=tokens, output=output)
278
+ return
279
+
280
+ logger.debug("Processing MTokens")
281
+ # Handle pre-processed tokens
282
+ for gs, ps, tks in self.en_tokenize(tokens):
283
+ if not ps:
284
+ continue
285
+ elif len(ps) > 510:
286
+ logger.warning(f"Unexpected len(ps) == {len(ps)} > 510 and ps == '{ps}'")
287
+ logger.warning("Truncating to 510 characters")
288
+ ps = ps[:510]
289
+ output = KPipeline.infer(model, ps, pack, speed) if model else None
290
+ if output is not None and output.pred_dur is not None:
291
+ KPipeline.join_timestamps(tks, output.pred_dur)
292
+ yield self.Result(graphemes=gs, phonemes=ps, tokens=tks, output=output)
293
+
294
+ @staticmethod
295
+ def join_timestamps(tokens: List[en.MToken], pred_dur: torch.LongTensor):
296
+ # Multiply by 600 to go from pred_dur frames to sample_rate 24000
297
+ # Equivalent to dividing pred_dur frames by 40 to get timestamp in seconds
298
+ # We will count nice round half-frames, so the divisor is 80
299
+ MAGIC_DIVISOR = 80
300
+ if not tokens or len(pred_dur) < 3:
301
+ # We expect at least 3: <bos>, token, <eos>
302
+ return
303
+ # We track 2 counts, measured in half-frames: (left, right)
304
+ # This way we can cut space characters in half
305
+ # TODO: Is -3 an appropriate offset?
306
+ left = right = 2 * max(0, pred_dur[0].item() - 3)
307
+ # Updates:
308
+ # left = right + (2 * token_dur) + space_dur
309
+ # right = left + space_dur
310
+ i = 1
311
+ for t in tokens:
312
+ if i >= len(pred_dur)-1:
313
+ break
314
+ if not t.phonemes:
315
+ if t.whitespace:
316
+ i += 1
317
+ left = right + pred_dur[i].item()
318
+ right = left + pred_dur[i].item()
319
+ i += 1
320
+ continue
321
+ j = i + len(t.phonemes)
322
+ if j >= len(pred_dur):
323
+ break
324
+ t.start_ts = left / MAGIC_DIVISOR
325
+ token_dur = pred_dur[i: j].sum().item()
326
+ space_dur = pred_dur[j].item() if t.whitespace else 0
327
+ left = right + (2 * token_dur) + space_dur
328
+ t.end_ts = left / MAGIC_DIVISOR
329
+ right = left + space_dur
330
+ i = j + (1 if t.whitespace else 0)
331
+
332
+ @dataclass
333
+ class Result:
334
+ graphemes: str
335
+ phonemes: str
336
+ tokens: Optional[List[en.MToken]] = None
337
+ output: Optional[KModel.Output] = None
338
+ text_index: Optional[int] = None
339
+
340
+ @property
341
+ def audio(self) -> Optional[torch.FloatTensor]:
342
+ return None if self.output is None else self.output.audio
343
+
344
+ @property
345
+ def pred_dur(self) -> Optional[torch.LongTensor]:
346
+ return None if self.output is None else self.output.pred_dur
347
+
348
+ ### MARK: BEGIN BACKWARD COMPAT ###
349
+ def __iter__(self):
350
+ yield self.graphemes
351
+ yield self.phonemes
352
+ yield self.audio
353
+
354
+ def __getitem__(self, index):
355
+ return [self.graphemes, self.phonemes, self.audio][index]
356
+
357
+ def __len__(self):
358
+ return 3
359
+ #### MARK: END BACKWARD COMPAT ####
360
+
361
+     def __call__(
+         self,
+         text: Union[str, List[str]],
+         voice: Optional[str] = None,
+         speed: Union[float, Callable[[int], float]] = 1,
+         split_pattern: Optional[str] = r'\n+',
+         model: Optional[KModel] = None
+     ) -> Generator['KPipeline.Result', None, None]:
+         model = model or self.model
+         if model and voice is None:
+             raise ValueError('Specify a voice: en_us_pipeline(text="Hello world!", voice="af_heart")')
+         pack = self.load_voice(voice).to(model.device) if model else None
+
+         # Convert input to list of segments
+         if isinstance(text, str):
+             text = re.split(split_pattern, text.strip()) if split_pattern else [text]
+
+         # Process each segment
+         for graphemes_index, graphemes in enumerate(text):
+             if not graphemes.strip():  # Skip empty segments
+                 continue
+
+             # English processing (unchanged)
+             if self.lang_code in 'ab':
+                 logger.debug(f"Processing English text: {graphemes[:50]}{'...' if len(graphemes) > 50 else ''}")
+                 _, tokens = self.g2p(graphemes)
+                 for gs, ps, tks in self.en_tokenize(tokens):
+                     if not ps:
+                         continue
+                     elif len(ps) > 510:
+                         logger.warning(f"Unexpected len(ps) == {len(ps)} > 510 and ps == '{ps}'")
+                         ps = ps[:510]
+                     output = KPipeline.infer(model, ps, pack, speed) if model else None
+                     if output is not None and output.pred_dur is not None:
+                         KPipeline.join_timestamps(tks, output.pred_dur)
+                     yield self.Result(graphemes=gs, phonemes=ps, tokens=tks, output=output, text_index=graphemes_index)
+
+             # Non-English processing with chunking
+             else:
+                 # Split long text into smaller chunks (roughly 400 characters each)
+                 # Using sentence boundaries when possible
+                 chunk_size = 400
+                 chunks = []
+
+                 # Try to split on sentence boundaries first
+                 sentences = re.split(r'([.!?]+)', graphemes)
+                 current_chunk = ""
+
+                 for i in range(0, len(sentences), 2):
+                     sentence = sentences[i]
+                     # Add the punctuation back if it exists
+                     if i + 1 < len(sentences):
+                         sentence += sentences[i + 1]
+
+                     if len(current_chunk) + len(sentence) <= chunk_size:
+                         current_chunk += sentence
+                     else:
+                         if current_chunk:
+                             chunks.append(current_chunk.strip())
+                         current_chunk = sentence
+
+                 if current_chunk:
+                     chunks.append(current_chunk.strip())
+
+                 # If no chunks were created (no sentence boundaries), fall back to character-based chunking
+                 if not chunks:
+                     chunks = [graphemes[i:i+chunk_size] for i in range(0, len(graphemes), chunk_size)]
+
+                 # Process each chunk
+                 for chunk in chunks:
+                     if not chunk.strip():
+                         continue
+
+                     ps, _ = self.g2p(chunk)
+                     if not ps:
+                         continue
+                     elif len(ps) > 510:
+                         logger.warning(f'Truncating len(ps) == {len(ps)} > 510')
+                         ps = ps[:510]
+
+                     output = KPipeline.infer(model, ps, pack, speed) if model else None
+                     yield self.Result(graphemes=chunk, phonemes=ps, output=output, text_index=graphemes_index)
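
Not part of the diff above: a minimal usage sketch of the pipeline changes, showing per-segment audio plus the word-level timestamps that join_timestamps attaches to Result.tokens. It assumes the 'af_heart' voice and American English G2P are installed locally, and that misaki's MToken exposes a text field alongside the start_ts/end_ts set above.

# Illustrative sketch (assumptions noted in the text above).
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')
for i, result in enumerate(pipeline("Hello world. This is Kokoro.", voice='af_heart')):
    if result.audio is not None:
        sf.write(f'segment_{i}.wav', result.audio, 24000)
    # start_ts/end_ts are in seconds: join_timestamps divides half-frame counts
    # by 80 (600 samples per frame at 24 kHz); they are only set when pred_dur exists.
    for token in result.tokens or []:
        print(token.text, token.start_ts, token.end_ts)
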
pyproject.toml ADDED
@@ -0,0 +1,40 @@
+ [build-system]
+ requires = ["hatchling"]
+ build-backend = "hatchling.build"
+
+ [project]
+ name = "kokoro"
+ version = "0.9.4"
+ description = "TTS"
+ readme = "README.md"
+ authors = [
+     { name="hexgrad", email="[email protected]" }
+ ]
+ license = { file = "LICENSE" }
+ classifiers = [
+     "Programming Language :: Python :: 3",
+     "License :: OSI Approved :: Apache Software License",
+     "Operating System :: OS Independent"
+ ]
+ requires-python = ">=3.10, <3.14"
+ dependencies = [
+     "fastapi>=0.121.3",
+     "huggingface_hub",
+     "loguru",
+     "misaki[en]>=0.9.4",
+     "numpy",
+     "torch",
+     "transformers",
+     "uvicorn>=0.38.0",
+ ]
+
+ [project.scripts]
+ kokoro = "kokoro.__main__:main"
+
+ [tool.hatch.build.targets.wheel]
+ only-include = ["kokoro"]
+ only-packages = true
+
+ [project.urls]
+ Homepage = "https://github.com/hexgrad/kokoro"
+ Repository = "https://github.com/hexgrad/kokoro"
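
For reference (not in the diff): the [project.scripts] table above is what registers a `kokoro` console command when the package is installed. A quick sketch for confirming how that entry point resolves, assuming the package has been installed with pip or uv and Python 3.10+ (matching requires-python):

# Inspect the console script registered by [project.scripts].
from importlib.metadata import entry_points

for ep in entry_points(group='console_scripts'):
    if ep.name == 'kokoro':
        print(ep.value)  # expected: kokoro.__main__:main
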
run.sh ADDED
@@ -0,0 +1,9 @@
+ #!/bin/bash
+ # Get the directory where the script is located
+ SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
+
+ # Enable MPS fallback for macOS Apple Silicon
+ export PYTORCH_ENABLE_MPS_FALLBACK=1
+
+ # Run the python script with uv using existing .venv
+ cd "$SCRIPT_DIR" && VIRTUAL_ENV="$SCRIPT_DIR/.venv" uv run --no-sync talk.py "$@"
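
If you bypass run.sh, the one thing it does besides pointing uv at the existing .venv is export PYTORCH_ENABLE_MPS_FALLBACK before Python starts. A rough Python-side equivalent is sketched below; treating talk.py as an importable module named `talk` is an assumption about how you launch it, and the uv/--no-sync behavior is not replicated.

# Rough equivalent of run.sh when launching from Python directly.
# The env var must be set before torch is imported anywhere.
import os
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

import talk  # assumes talk.py is on sys.path / the current working directory
talk.main()
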
talk.py ADDED
@@ -0,0 +1,65 @@
+ import sys
+ import os
+ import soundfile as sf
+ from kokoro import KPipeline
+ import subprocess
+ import argparse
+
+ def play_audio(file_path):
+     """Plays audio using the macOS afplay command."""
+     try:
+         subprocess.run(["afplay", file_path], check=True)
+     except FileNotFoundError:
+         print("Error: 'afplay' command not found. Are you on macOS?")
+     except Exception as e:
+         print(f"Error playing audio: {e}")
+
+ def main():
+     parser = argparse.ArgumentParser(description='Kokoro TTS Generator')
+     parser.add_argument('text', nargs='*', help='Text to speak')
+     parser.add_argument('-l', '--lang', default='a', help='Language code (a=Am. English, b=Br. English, e=Spanish, f=French, h=Hindi, i=Italian, j=Japanese, p=Portuguese, z=Chinese)')
+     parser.add_argument('-v', '--voice', default='af_heart', help='Voice ID (default: af_heart)')
+
+     args = parser.parse_args()
+
+     # Combine text arguments or read from stdin
+     if args.text:
+         text = " ".join(args.text)
+     else:
+         print("Enter text to speak (Ctrl+D to finish):")
+         try:
+             text = sys.stdin.read()
+         except KeyboardInterrupt:
+             print("\nExiting.")
+             sys.exit(0)
+
+     if not text.strip():
+         print("No text provided.")
+         return
+
+     # Initialize pipeline
+     try:
+         print(f"Initializing pipeline for language '{args.lang}'...")
+         pipeline = KPipeline(lang_code=args.lang)
+     except Exception as e:
+         print(f"Failed to initialize KPipeline: {e}")
+         print("Note: Japanese ('j') and Chinese ('z') require extra dependencies: pip install 'misaki[ja]' or 'misaki[zh]'")
+         sys.exit(1)
+
+     print(f"Generating audio with voice '{args.voice}'...")
+
+     # Generate audio
+     try:
+         generator = pipeline(text, voice=args.voice, speed=1, split_pattern=r'\n+')
+
+         for i, (gs, ps, audio) in enumerate(generator):
+             filename = f'output_{i}.wav'
+             sf.write(filename, audio, 24000)
+             print(f"Playing segment {i}...")
+             play_audio(filename)
+
+     except Exception as e:
+         print(f"Error during generation: {e}")
+
+ if __name__ == "__main__":
+     main()
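
talk.py unpacks each pipeline Result as a plain (gs, ps, audio) tuple, which works through the backward-compat __iter__ kept in the pipeline diff above. A hedged example of driving the CLI from another Python process; the interpreter name, the local path to talk.py, and the presence of macOS afplay for playback are assumptions.

# Sketch: run the talk.py CLI from another process.
import subprocess

subprocess.run(
    ["python", "talk.py", "-l", "a", "-v", "af_heart", "Hello from Kokoro"],
    check=True,
)
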
uv.lock ADDED
The diff for this file is too large to render. See raw diff