HierSpeech++ Demo

HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis

Sang-Hoon Lee, Ha-Yeong Choi, Seung-Bin Kim, Seong-Whan Lee

Korea University

Overall framework of HierSpeech++

On this page, all audio samples are generated by HierSpeech++ (v1). We will soon release HierSpeech++ (v2), a multilingual version of HierSpeech++.

We use the LibriTTS dataset to train the TTS model.

An online TTS demo is available on [Hugging Face Spaces].

HierSpeech++

Zero-shot TTS with Expressive Dataset

Abstract

In this work, we once again significantly improve the naturalness and speaker similarity of the synthetic speech, even in the zero-shot speech synthesis scenarios.

Prompt 1 (Whisper)

Prompt 2 (Angry)

Prompt 3 (Laughing)

Prompt 4 (Sleepy)

Prompt 5 (Steve Jobs)

HierSpeech++

HierSpeech++

HierSpeech++

HierSpeech++

HierSpeech++

Zero-shot TTS (LibriTTS)

All speakers are unseen during training

Sentence 1 (121_121726_000029_000003)

In Germany, they generally "Hock the Kaiser."

GT

YourTTS

HierSpeech

Vall-E-X

XTTS (v1)

HierSpeech++
(LT460)

HierSpeech++
(LT960)

HierSpeech++
(LT960, others)

Sentence 2 (237_126133_000047_000000)

"Don't mind it, Polly," whispered Jasper; "twasn't her fault."

GT

YourTTS

HierSpeech

Vall-E-X

XTTS (v1)

HierSpeech++
(LT460)

HierSpeech++
(LT960)

HierSpeech++
(LT960, others)

Sentence 3 (260_123288_000020_000000)

"The sail! the sail!" I cry, motioning to lower it.

GT

YourTTS

HierSpeech

Vall-E-X

XTTS (v1)

HierSpeech++
(LT460)

HierSpeech++
(LT960)

HierSpeech++
(LT960, others)

Sentence 4 (908_31957_000017_000001)

A ring of amethyst I could not wear here, plainer to my sight, Than that first kiss.

GT

YourTTS

HierSpeech

Vall-E-X

XTTS (v1)

HierSpeech++
(LT460)

HierSpeech++
(LT960)

HierSpeech++
(LT960, others)

Sentence 5 (5683_32879_000025_000000)

'No, indeed, Dorcas--never, and never will; and I think, though I have learned to fear death, I would rather die than let Stanley even suspect it.'

GT

YourTTS

HierSpeech

Vall-E-X

XTTS (v1)

HierSpeech++
(LT460)

HierSpeech++
(LT960)

HierSpeech++
(LT960, others)

Sentence 6 (7021_85628_000026_000000)

The Princess certainly was beautiful, and he would have dearly liked to be kissed by her, but the cap which his mother had made he would not give up on any condition.

GT

YourTTS

HierSpeech

Vall-E-X

XTTS (v1)

HierSpeech++
(LT460)

HierSpeech++
(LT960)

HierSpeech++
(LT960, others)

Comparison with other models

All speakers are unseen during training

Sentence 1

And lay me down in my cold bed and leave my shining lot.

GT

Prompt

Vall-E

NaturalSpeech 2

StyleTTS 2

HierSpeech++

Sentence 2

Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.

GT

Prompt

Vall-E

NaturalSpeech 2

StyleTTS 2

HierSpeech++

Sentence 3

The army found the people in poverty and left them in comparative wealth.

GT

Prompt

Vall-E

NaturalSpeech 2

StyleTTS 2

HierSpeech++

Sentence 4

Thus did this humane and right minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings.

GT

Prompt

Vall-E

NaturalSpeech 2

StyleTTS 2

HierSpeech++

Anime Characters

The prompt and Mega-TTS audio samples are taken from the Mega-TTS demo page.

Sentence 1

Let's go drink until we can't feel feelings anymore.

Prompt (Sponge Bob)

Mega-TTS

HierSpeech++

Sentence 2

Uh, it's not like the internet to go crazy about something small and stupid.

Prompt (Peter Griffin)

Mega-TTS

HierSpeech++

Sentence 3

Then I would never talk to that person about boa constrictors, or primeval forests, or stars. I would bring myself down to his level.

Prompt (Rick)

Mega-TTS

HierSpeech++

Sentence 4

In what a disgraceful light might it not strike so vain a man!

Prompt (Morty)

Mega-TTS

HierSpeech++

Zero-shot Voice Conversion (VCTK)

All speakers are unseen during training

Source Speaker Target Speaker Converted

GT (p228)

GT (p233)

AutoVC

VoiceMixer

DiffVC

Diff-HierVC

DDDM-VC

YourTTS

HierVST (Ours)

HierSpeech++
(LT460)

HierSpeech++
(LT960)

HierSpeech++
(LT960, others)

GT (p233)

GT (p227)

AutoVC

VoiceMixer

DiffVC

Diff-HierVC

DDDM-VC

YourTTS

HierVST (Ours)

HierSpeech++
(LT460)

HierSpeech++
(LT960)

HierSpeech++
(LT960, others)

GT (p240)

GT (p236)

AutoVC

VoiceMixer

DiffVC

Diff-HierVC

DDDM-VC

YourTTS

HierVST (Ours)

HierSpeech++
(LT460)

HierSpeech++
(LT960)

HierSpeech++
(LT960, others)

SpeechSR

Speech Super-resolution (16 kHz → 48 kHz)

For more accurate listening, it is recommended to conduct a simple audible frequency test.

WARNING: HIGH-FREQUENCY SAMPLES WITH LOUD VOLUME MAY HAVE PAINFUL SOUNDS.

2000 Hz       22000 Hz
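
If the two test tones above are not at hand, comparable ones can be generated locally. The following is a minimal sketch using only the Python standard library, not the files used on this page: the function names, output file names, one-second duration, and 0.1 amplitude are all illustrative assumptions. It writes at 48 kHz because a 22000 Hz tone can only be represented when the sampling rate is above 44 kHz (the Nyquist limit), which is also why 16 kHz audio carries no content above 8 kHz.

```python
import math
import struct
import wave

SAMPLE_RATE = 48_000  # Hz; the Nyquist limit is 24 kHz, so 22 kHz is representable


def sine_tone(freq_hz, duration_s=1.0, amplitude=0.1, sample_rate=SAMPLE_RATE):
    """Return 16-bit PCM samples of a sine tone.

    The default amplitude (0.1 of full scale) is kept deliberately low,
    since loud high-frequency tones can be painful.
    """
    if freq_hz >= sample_rate / 2:
        raise ValueError("frequency must be below the Nyquist limit")
    n_samples = int(duration_s * sample_rate)
    peak = int(amplitude * 32767)
    return [
        int(peak * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n_samples)
    ]


def write_wav(target, samples, sample_rate=SAMPLE_RATE):
    """Write mono 16-bit PCM samples to a path or a writable binary file object."""
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(samples)}h", *samples))


if __name__ == "__main__":
    # Hypothetical output names; the two frequencies match the test above.
    for freq in (2000, 22000):
        write_wav(f"tone_{freq}hz.wav", sine_tone(freq))
```

If the 2000 Hz tone is audible but the 22000 Hz tone is not, differences between the 16 kHz and 48 kHz samples below may be harder to judge on that playback setup. Start at a low system volume.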

Sentence 1

"No," said Trot, positively, "there's been enough patching in this country and I won't have any more of it."

GT (16 kHz)

AudioSR

SpeechSR (Ours)

HierSpeech++ (16 kHz)

HierSpeech++ (+AudioSR)

HierSpeech++ (+SpeechSR)

Sentence 2

But tell me, please, what you intend to do with this new lot of the Powder of Life, which Dr. Pipt is making.

GT (16 kHz)

AudioSR

SpeechSR (Ours)

HierSpeech++ (16 kHz)

HierSpeech++ (+AudioSR)

HierSpeech++ (+SpeechSR)

Sentence 3

The end he had been born to serve yet did not see had led him to escape by an unseen path and now it beckoned to him once more and a new adventure was about to be opened to him.

GT (16 kHz)

AudioSR

SpeechSR (Ours)

HierSpeech++ (16 kHz)

HierSpeech++ (+AudioSR)

HierSpeech++ (+SpeechSR)