
Phrase break prediction with bidirectional encoder representations in Japanese text-to-speech synthesis

Submitted to INTERSPEECH 2021

Authors
  • Kosuke Futamata
  • Byeongseon Park
  • Ryuichi Yamamoto
  • Kentaro Tachibana

Abstract

We propose a novel phrase break prediction method that combines implicit features extracted from a pre-trained large language model (BERT) and explicit features extracted from a BiLSTM with linguistic features. In conventional BiLSTM-based methods, word representations and/or sentence representations are used as independent components. The proposed method takes both representations into account to extract the latent semantics, which cannot be captured by previous methods. The objective evaluation results show that the proposed method obtains an absolute improvement of 3.2 points in F1 score over conventional BiLSTM-based methods using linguistic features. Moreover, the perceptual listening test results verify that a TTS system using our proposed method achieved a mean opinion score of 4.39 in prosody naturalness, which is highly competitive with the score of 4.37 for synthesized speech with ground-truth phrase breaks.

[Figure: Phrase break prediction model architecture]
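The core idea above is to fuse per-token implicit features from BERT with explicit features from a BiLSTM over linguistic features, then classify each token as break or no-break. The following is a minimal shape-level sketch of that fusion; all dimensions, the random stand-ins for the two encoders, and the single linear classifier are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical dimensions -- illustrative only, not the paper's settings.
SEQ_LEN = 6      # number of tokens in the sentence
BERT_DIM = 768   # implicit features from a pre-trained BERT encoder
LSTM_DIM = 256   # explicit features from a BiLSTM over linguistic features
NUM_CLASSES = 2  # break / no-break after each token

rng = np.random.default_rng(0)

# Stand-ins for the per-token outputs of the two encoders.
bert_feats = rng.normal(size=(SEQ_LEN, BERT_DIM))    # implicit (contextual) features
bilstm_feats = rng.normal(size=(SEQ_LEN, LSTM_DIM))  # explicit (linguistic) features

# Fuse by concatenation, then project to class logits and apply softmax.
fused = np.concatenate([bert_feats, bilstm_feats], axis=-1)  # (SEQ_LEN, 1024)
W = rng.normal(scale=0.01, size=(BERT_DIM + LSTM_DIM, NUM_CLASSES))
logits = fused @ W
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

print(probs.shape)  # one break/no-break distribution per token
```

In practice each row of `probs` would be thresholded (or argmaxed) to decide whether a phrase break is inserted after the corresponding token before synthesis.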


Audio samples

Seven systems are used for the subjective evaluations.

  • Reference (Natural): Recorded speech in the test set.
  • Reference (TTS): Synthesized speech from the test set.
  • Rule-based: A rule-based method that inserts phrase breaks only after punctuation.
  • BiLSTM (Tokens): A conventional BiLSTM-based method using only the source token sequence [1].
  • BiLSTM (Features): The BiLSTM (Tokens) method augmented with designed linguistic features [2].
  • BERT: A method that uses only the BERT model.
  • BiLSTM (Features) + BERT: The proposed method that combines BiLSTM (Features) and BERT.

A FastSpeech 2-based acoustic model [3] and a Parallel WaveGAN vocoder [4] are used for speech generation.



Sample 1

[Audio players for: Rule-based, BiLSTM (Tokens), BiLSTM (Features), BERT, BiLSTM (Features) + BERT (Proposed), Reference (TTS), Reference (Natural)]

Sample 2

[Audio players for: Rule-based, BiLSTM (Tokens), BiLSTM (Features), BERT, BiLSTM (Features) + BERT (Proposed), Reference (TTS), Reference (Natural)]

Sample 3

[Audio players for: Rule-based, BiLSTM (Tokens), BiLSTM (Features), BERT, BiLSTM (Features) + BERT (Proposed), Reference (TTS), Reference (Natural)]

Sample 4

[Audio players for: Rule-based, BiLSTM (Tokens), BiLSTM (Features), BERT, BiLSTM (Features) + BERT (Proposed), Reference (TTS), Reference (Natural)]

Sample 5

[Audio players for: Rule-based, BiLSTM (Tokens), BiLSTM (Features), BERT, BiLSTM (Features) + BERT (Proposed), Reference (TTS), Reference (Natural)]

References
  • [1]: A. Vadapalli and S. V. Gangashetty, “An investigation of recurrent neural network architectures using word embeddings for phrase break prediction,” in Proceedings of INTERSPEECH 2016 (ISCA).
  • [2]: V. Klimkov, A. Nadolski, A. Moinet, B. Putrycz, R. Barra-Chicote, T. Merritt, and T. Drugman, “Phrase break prediction for long-form reading TTS: Exploiting text structure information,” in Proceedings of INTERSPEECH 2017 (ISCA).
  • [3]: Y. Ren, C. Hu, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and high-quality end-to-end text-to-speech,” in Proceedings of ICLR (arXiv).
  • [4]: R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proceedings of ICASSP (arXiv).
Acknowledgements

This work was supported by Clova Voice, NAVER Corp., Seongnam, Korea.