
Phrase break prediction with bidirectional encoder representations in Japanese text-to-speech synthesis

Submitted to INTERSPEECH 2021

Authors
  • Kosuke Futamata
  • Byeongseon Park
  • Ryuichi Yamamoto
  • Kentaro Tachibana

Abstract

We propose a novel phrase break prediction method that combines implicit features extracted from a pre-trained large language model (BERT) and explicit features extracted from a BiLSTM with linguistic features. In conventional BiLSTM-based methods, word representations and/or sentence representations are used as independent components. The proposed method takes both representations into account to extract the latent semantics, which cannot be captured by previous methods. The objective evaluation results show that the proposed method obtains an absolute improvement of 3.2 points in F1 score over conventional BiLSTM-based methods using linguistic features. Moreover, the perceptual listening test results verify that a TTS system using our proposed method achieved a mean opinion score of 4.39 in prosody naturalness, which is highly competitive with the score of 4.37 for synthesized speech with ground-truth phrase breaks.

[Figure: Phrase break prediction model architecture]
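The core idea above is to fuse per-token implicit features from BERT with explicit features from a BiLSTM over linguistic features, then classify each token as break or no-break. The following is a minimal shape-level sketch of that fusion; all dimensions, the random stand-ins for the two encoders, and the single linear classifier are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical dimensions -- illustrative only, not the paper's settings.
SEQ_LEN = 6      # number of tokens in the sentence
BERT_DIM = 768   # implicit features from a pre-trained BERT encoder
LSTM_DIM = 256   # explicit features from a BiLSTM over linguistic features
NUM_CLASSES = 2  # break / no-break after each token

rng = np.random.default_rng(0)

# Stand-ins for the per-token outputs of the two encoders.
bert_feats = rng.normal(size=(SEQ_LEN, BERT_DIM))    # implicit (contextual) features
bilstm_feats = rng.normal(size=(SEQ_LEN, LSTM_DIM))  # explicit (linguistic) features

# Fuse by concatenation, then project to class logits and apply softmax.
fused = np.concatenate([bert_feats, bilstm_feats], axis=-1)  # (SEQ_LEN, 1024)
W = rng.normal(scale=0.01, size=(BERT_DIM + LSTM_DIM, NUM_CLASSES))
logits = fused @ W
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

print(probs.shape)  # one break/no-break distribution per token
```

In practice each row of `probs` would be thresholded (or argmaxed) to decide whether a phrase break is inserted after the corresponding token before synthesis.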


Audio samples

Seven systems are used for the subjective evaluations.

  • Reference (Natural): Recorded speech in the test set.
  • Reference (TTS): Synthesized speech from the test set.
  • Rule-based: A rule-based method that inserts phrase breaks only after punctuation.
  • BiLSTM (Tokens): A conventional BiLSTM-based method using only the source token sequence [1].
  • BiLSTM (Features): The BiLSTM (Tokens) method augmented with designed linguistic features [2].
  • BERT: A method that uses only the BERT model.
  • BiLSTM (Features) + BERT: The proposed method that combines BiLSTM (Features) and BERT.

A FastSpeech 2-based acoustic model [3] and a Parallel WaveGAN vocoder [4] are used for speech generation.



Sample 1

[Audio players for: Rule-based, BiLSTM (Tokens), BiLSTM (Features), BERT, BiLSTM (Features) + BERT (Proposed), Reference (TTS), Reference (Natural)]

Sample 2

[Audio players for: Rule-based, BiLSTM (Tokens), BiLSTM (Features), BERT, BiLSTM (Features) + BERT (Proposed), Reference (TTS), Reference (Natural)]

Sample 3

[Audio players for: Rule-based, BiLSTM (Tokens), BiLSTM (Features), BERT, BiLSTM (Features) + BERT (Proposed), Reference (TTS), Reference (Natural)]

Sample 4

[Audio players for: Rule-based, BiLSTM (Tokens), BiLSTM (Features), BERT, BiLSTM (Features) + BERT (Proposed), Reference (TTS), Reference (Natural)]

Sample 5

[Audio players for: Rule-based, BiLSTM (Tokens), BiLSTM (Features), BERT, BiLSTM (Features) + BERT (Proposed), Reference (TTS), Reference (Natural)]

References
  • [1]: A. Vadapalli and S. V. Gangashetty, “An investigation of recurrent neural network architectures using word embeddings for phrase break prediction,” in Proceedings of INTERSPEECH 2016 (ISCA).
  • [2]: V. Klimkov, A. Nadolski, A. Moinet, B. Putrycz, R. Barra-Chicote, T. Merritt, and T. Drugman, “Phrase break prediction for long-form reading TTS: Exploiting text structure information,” in Proceedings of INTERSPEECH 2017 (ISCA).
  • [3]: Y. Ren, C. Hu, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and high-quality end-to-end text-to-speech,” in Proceedings of ICLR (arXiv).
  • [4]: R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proceedings of ICASSP (arXiv).
Acknowledgements

This work was supported by Clova Voice, NAVER Corp., Seongnam, Korea.