HierSpeech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-supervised Representation for Speech Synthesis

 

 

Sang-Hoon Lee, Seung-Bin Kim, Ji-Hyun Lee, Eunwoo Song, Min-Jae Hwang, Seong-Whan Lee

ABSTRACT

This paper presents HierSpeech, a high-quality end-to-end text-to-speech (TTS) system based on a hierarchical conditional variational autoencoder (VAE) utilizing self-supervised speech representations. Recently, single-stage TTS systems, which directly generate raw speech waveforms from text, have attracted interest thanks to their ability to generate high-quality audio within a fully end-to-end training pipeline. However, there is still room for improvement in conventional TTS systems. Since it is challenging to infer both the linguistic and acoustic attributes from the text directly, missing the details of attributes, specifically linguistic information, is inevitable, which results in mispronunciation and over-smoothing problems in their synthetic speech. To address this problem, we leverage self-supervised speech representations as additional linguistic representations to bridge the information gap between text and speech. Then, the hierarchical conditional VAE is adopted to connect these representations and to learn each attribute hierarchically by improving the linguistic capability in latent representations. Compared with the state-of-the-art TTS system, HierSpeech achieves a comparative mean opinion score of +0.303, and reduces the phoneme error rate of synthesized speech from 9.16% to 5.78% on the VCTK dataset. Furthermore, we extend our model to HierSpeech-U, an untranscribed text-to-speech system. Specifically, HierSpeech-U can adapt to a novel speaker by utilizing self-supervised speech representations without text transcripts. The experimental results reveal that our method outperforms publicly available TTS models, and show the effectiveness of speaker adaptation with untranscribed speech.

 

 

We compare HierSpeech with the following TTS models:

1. Tacotron 2: Autoregressive TTS model [Official Demo page][Unofficial Code]

2. Glow-TTS: Flow-based TTS model with Monotonic Alignment Search and normalizing flow [Official Demo page][Official Code]

3. PortaSpeech: Non-autoregressive TTS model with VAE and normalizing flow [Official Demo page][Official Code]

4. VITS: Single-stage end-to-end TTS model with VAE augmented normalizing flow and adversarial training [Official Demo page][Official Code]

 

Multi-speaker (VCTK Dataset)

Sentence 1 (p238):

When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.

GT

ASR: When a man looks for something beyond his reach his friends say he is looking for the pot of gold at the end of the rinbo.

GT (HiFi-GAN)

ASR: When a man looks for something beyond his reach his friends say he is looking for the pot of gold at the end of the rinbo.

Tacotron 2

ASR: When a man looks for something beyond his reach his friends say he is looking for the pot of gold at the end of the rainbow.

Glow-TTS

ASR: When a man looks for something beyond his reach his friends say he is looking for the pot of gold at the end of the rainbow.

PortaSpeech

ASR: While a man looks for something beyond his reach his fenceay he is looking for the pot of gord at the end of the rainbow.

VITS

ASR: Ur man looks for something beyond his reach his friends see he is looking for the pot of gold at the end of the rainbow.

HierSpeech (Ours)

ASR: When a man looks for something beyond his reach his friends say he is looking for the pot of gold at the end of the rainbow.

Sentence 2 (p243):

Throughout the centuries people have explained the rainbow in various ways.

GT

ASR: Oughout the senturies people have explained the rainbow in various way.

GT (HiFi-GAN)

ASR: Oughout the senturies people have explained the rainbow in various wa.

Tacotron 2

ASR: Roughout the centuries people have explained the rainbow in various way.

Glow-TTS

ASR: Ura de senchis people have explained the rainbow in various way.

PortaSpeech

ASR: Ou at the senturies people have explained the rainbow (in) various way.

VITS

ASR: Hroughout the centuries people have explained a rainbow (in) various way.

HierSpeech (Ours)

ASR: Roughout te centus people have expplained the rainbow i various way.

Sentence 3 (p362):

We are being realistic about the challenges ahead.

GT

ASR: We are being realistic about the challenges ahead.

GT (HiFi-GAN)

ASR: We are being realistic about the challenges ahead.

Tacotron 2

ASR: We are being realistic about the challenges ahead.

Glow-TTS

ASR: W are being realistic about the challenges a hea.

PortaSpeech

ASR: Weare being realistic about the challenges ahea.

VITS

ASR: We are being realistic about the challenges a it.

HierSpeech (Ours)

ASR: We are being realistic about the challenges ahead.

Sentence 4 (p256):

This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue.

GT

ASR: This is a very common type of bo one showing mainly red and yellow with little or no green or blue.

GT (HiFi-GAN)

ASR: This is a very common type of bow one showing mainly red and yellow with little or no green or blue.

Tacotron 2

ASR: This is a very common type of bow one showing mainly red and yellow with little or no green or blue.

Glow-TTS

ASR: His is a very common type of bow one showing mainly red and yellow with little or no green or blue.

PortaSpeech

ASR: His is a very common type of bow one showing mainly red and yelow with little or no green or ble.

VITS

ASR: This is a very common type of bow one showing mainly red and yellow with lessel ar no green ubli.

HierSpeech (Ours)

ASR: This is a very common type of bow one showing mainly red and yellow with little or no green or blue.

Sentence 5 (p282):

We also need a small plastic snake and a big toy frog for the kids.

GT

ASR: We also need a small plastick snake and a big toy frog for the k.

GT (HiFi-GAN)

ASR: We also need a small plastick snake and a big toy frog for the c.

Tacotron 2

ASR: E also need a small plasick snake and big toy frog for the ka.

Glow-TTS

ASR: We also need a small plastic smake and a big ty frock for the ti.

PortaSpeech

ASR: We also kneed a small plastic snake and a beg to frok for the ca.

VITS

ASR: We also need a small plasteck snake and a big toy frolk for the ki.

HierSpeech (Ours)

ASR: We also need a small plastic snake and a big toy frog for the kids.

Sentence 6 (p245):

The rainbow is a division of white light into many beautiful colors.

GT

ASR: The rainbow is a division of white lice into many beautiful colors.

GT (HiFi-GAN)

ASR: The rainbow is a division of white lices into many beautiful colors.

Tacotron 2

ASR: The rainbow is a division o white light into many beautiful colours.

Glow-TTS

ASR: The rainbor is a division of white light into many beautiful colors.

PortaSpeech

ASR: Th e rainbow is a division of white like in too many beautiful colors.

VITS

ASR: The rainbow is a division of white light into many beautiful colors.

HierSpeech (Ours)

ASR: The rainbow is a division of white light into many beautiful colors.

Sentence 7 (p276):

Ask her to bring these things with her from the store.

GT

ASR: Ask her to bring these things with her from the store.

GT (HiFi-GAN)

ASR: Askher to bring these things with her from the store.

Tacotron 2

ASR: Ask her to bring these things wither from the store.

Glow-TTS

ASR: Ask her to bring these things with her from the store.

PortaSpeech

ASR: Asked her to bring these things with her from the store.

VITS

ASR: Ask her to bring these things with her friend the store.

HierSpeech (Ours)

ASR: Ask her to bring these things with her from the store.
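The ASR transcripts above can be compared with the input text numerically. The paper reports a phoneme error rate; the sketch below is only a word-level stand-in that uses the same edit-distance idea and needs no phonemizer. The function names (`normalize`, `edit_distance`, `word_error_rate`) and the text normalization are illustrative assumptions, not the authors' evaluation code. The example strings are the Sentence 3 (p362) text and its VITS and HierSpeech transcripts.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z' ]+", " ", text.lower()).split()


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance between word sequences, divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


text = "We are being realistic about the challenges ahead."
vits_asr = "We are being realistic about the challenges a it."
hier_asr = "We are being realistic about the challenges ahead."

print(word_error_rate(text, vits_asr))  # 0.25 (one substitution, one insertion, 8 reference words)
print(word_error_rate(text, hier_asr))  # 0.0
```

A phoneme-level rate would follow the same pattern after converting both strings to phoneme sequences, which requires a grapheme-to-phoneme tool not shown here.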

Multi-speaker (LibriTTS)

Sentence 1:

You were not sure i said and was placated by the sound of a faint sigh that passed between us like the flight of a bird in the night.

GT

ASR: You were not sure i said and was placated by the sound of a faint sigh that passed between us like the flight of a bird in the night.

VITS

ASR: You were not sure i said and was placated by the sound of a faete side that pass between us like the flight of a bird in the night.

HierSpeech (Ours)

ASR: You were not sure i said and was placated by the sound of a faint sigh that passed between us like the flight of a bird in the night.

Sentence 2:

By the edge of the river they stopped and said farewell.

GT

ASR: By the edge of the river they stopped and said farewell.

VITS

ASR: By the edge of the river they stopped and said farewell.

HierSpeech (Ours)

ASR: By the edge of the river they stopped and said farewell.

Sentence 3:

And then i can imagine that i'm dressed gorgeously.

GT

ASR: And then i can imagine that i'm dressed gorgeous.

VITS

ASR: And then i can imagine that iam dress score to.

HierSpeech (Ours)

ASR: And then i can imagine that i'm dressed gorgeously.

Sentence 4:

And we may be sure that the eldest boy in that brood never forgot the day.

GT

ASR: And we may be sure that the eldest boy in that brood nver forgot the day.

VITS

ASR: We may be sure that the eldest boy an that brood never forgot the day.

HierSpeech (Ours)

ASR: And we may be sure that the eldest boy in that brood never forgot the day.

Sentence 5:

And, remember, that he had intended to tell me who he was when he arrived, only he was so ill.

GT

ASR: And remember that he had intended to tell me who he was when he arrived only he was so ill.

VITS

ASR: And remember that he had intended to tell me who he was when he arrived only he was so lill.

HierSpeech (Ours)

ASR: And remember that he had intended to tell me who he was when he arrived only he was so ill.

Sentence 6:

When anne dressed for it she tossed aside the pearl beads she usually wore and took from her trunk the small box that had come to green gables on christmas day. in it was a thread-like gold chain with a tiny pink enamel heart as a pendant.

GT

ASR: When anne dressed for it she tossed aside the pearl beads she usually wore and took from her trunk the small box that had come to green gables on christmas day in it was a thread like gold chain with a tiny pink enamel heart as a pendant.

VITS

ASR: An anderos for as she dust decide the perl beas she were 'n too ofron the small boxe come to green gables on christmas day and it was a threadlik gold chain with ha tiny pink and amehars an.

HierSpeech (Ours)

ASR: When un dressed for as she tossed us side the pearl beast she usually wore and from her trunk the small box that had come to green gables on christmas day and it was a thril like gold chain with a tiny pink and emmal hartas a pink.

Sentence 7:

But there was just nothing to be done with him.

GT

ASR: But there was just nothing to be done with him.

VITS

ASR: That there was just nothing to be done with him.

HierSpeech (Ours)

ASR: But there was just nothing to be done with.

Sentence 8:

It is, of course, now too late for me to give any advice in reference to the proposed scheme of captain fox.

GT

ASR: It is of course now too late for me to give any advice in reference to the proposed scheme of captain fox.

VITS

ASR: It is of course now too late for me to give any advice in reference to the proposed scheme of captain fo.

HierSpeech (Ours)

ASR: It is of course now too late for me to give any advice in reference to the proposed scheme of captain fox.

Sentence 9:

Carol was angry.

GT

ASR: Carol was angry.

VITS

ASR: Terol was angry.

HierSpeech (Ours)

ASR: Carol was angry.

Sentence 10:

What is his name.

GT

ASR: What is his name.

VITS

ASR: What is his name.

HierSpeech (Ours)

ASR: What is his name.

Sentence 11:

Lemon syrup.

GT

ASR: Emon cyrro.

VITS

ASR: Lemon rup.

HierSpeech (Ours)

ASR: Lemon syrup.

Sentence 12:

The two youngest miss thorpes were by themselves in the parlour; and, on anne's quitting it to call her sister, catherine took the opportunity of asking the other for some particulars of their yesterday's party.

GT

ASR: The two youngest miss thorpes were by themselves in the parlour and on anne's quitting it to call her sister catherine took the opportunity of asking the other for some particulars of their yesterday's part.

VITS

ASR: T n as miss thortnsells to parlo and an an's quittingo to cale her sister catherine to the up ertit of s any other for some particulars ith yesterday per.

HierSpeech (Ours)

ASR: The two youngest miss thorpes were by themselves in the parlour and on annes quitting it to call her sister catherine took the opportunity of asking the other for some particulars of their yesterday's barr.

Sentence 13:

'without self-knowledge,' says one of the greatest students of the human heart that ever lived, 'you have no real root in yourselves.

GT

ASR: Without self knowledge says one of the greatest students of the human heart that ever lived you have no real root in yourself.

VITS

ASR: Without self knowledge says one of the greatest students of the human heart that ever lived you have no real rood in yourselves.

HierSpeech (Ours)

ASR: Without self knowledge says one of the greatest students of the human heart that ever live you have no real root in yoursel.

Sentence 14:

We need not give up the conclusions to which our labors in dream interpretation lead us even though we must consider those conclusions strange.

GT

ASR: We need not give up the conclusions to which our labors in dream interpretation lead us even though we must consider those conclusions strange.

VITS

ASR: We need not give up the conclusions to which our labors in dream and interpretation lead us even though we must consider those conclusions strang.

HierSpeech (Ours)

ASR: We need not give up the conclusions to which our labors and dream interpretation lead us even though we must consider those conclusions strange.

Sentence 15:

Her mouth stiffened, the muscles of the cheek contracted on the right side of her pale, nervous face.

GT

ASR: Her mouth stiffened the muscles of the cheek contracted on the right side of her pale nervous face.

VITS

ASR: Her mouth stiffened the muscles of the cheek contracted on the right sige of her pale nervous face.

HierSpeech (Ours)

ASR: Her mouth stiffened the muscles of the cheek contracted on the right side of her pale nervous face.

Speaker Adaptation

We train the baseline HierSpeech on VCTK (98 speakers) and LibriTTS (1,151 speakers). Starting from this pre-trained HierSpeech, we fine-tune each model with 20 samples per speaker for 10 novel speakers from VCTK.

Sentence 1:

Aristotle thought that the rainbow was caused by reflection of the sun's rays by the rain.

GT

ASR: Aristotle thought that the rainbow was caused by the reflection f the sun's rays on the rain.

HierSpeech (Zero-shot)

ASR: Aristotle thought t the rainbow was caused by reflection the sun's rays by the rain.

HierSpeech

ASR: Aristotle thought that the rainbow was caused by reflection of a sun's rays by the rain.

HierSpeech-U

ASR: Aristotle thought that the rainbow was caused by reflection of the sun's rays by the rain.

Sentence 2:

The norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky.

GT

ASR: The norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky.

HierSpeech (Zero-shot)

ASR: The norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky.

HierSpeech

ASR: The norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky.

HierSpeech-U

ASR: E norsemen considered the rainbow as the bridge over which the gods passed from earth to their home in the sky.

Voice Conversion

None of the source speech was seen during training.

Source speaker: p267 (Female)

Target speaker: p227 (Male)

AutoVC

VoiceMixer

VITS

HierSpeech

Target speaker: p234 (Female)

AutoVC

VoiceMixer

VITS

HierSpeech

Source speaker: p259 (Male)

Target speaker: p237 (Male)

AutoVC

VoiceMixer

VITS

HierSpeech

Target speaker: p231 (Female)

AutoVC

VoiceMixer

VITS

HierSpeech

Source speaker: p239 (Female)

Target speaker: p270 (Male)

AutoVC

VoiceMixer

VITS

HierSpeech

Target speaker: p231 (Female)

AutoVC

VoiceMixer

VITS

HierSpeech

Ablation Study

PP: Phoneme Predictor, AE: Acoustic Encoder

Sentence 1:

The rainbow is a division of white light into many beautiful colors.

GT

VITS (300k)

VITS w flow 8 (300k)

VITS w PP (300k)

HierSpeech (300k)

HierSpeech w.o AE (300k)

HierSpeech w.o AE and PP (300k)

Sentence 2:

Please call stella.

GT

VITS (300k)

VITS w flow 8 (300k)

VITS w PP (300k)

HierSpeech (300k)

HierSpeech w.o AE (300k)

HierSpeech w.o AE and PP (300k)