Voicebuilding for Text-to-Speech Synthesis

Ingmar Steiner

11–15.04.2016

Introduction

Formalities

  • This course is a project seminar (for LST/CoLi students) or a regular seminar (for CS students).
  • Successful participation in the lecture “Text-to-Speech Synthesis” (Prof. Möbius) is a mandatory prerequisite.
  • To pass this course, you will need to build TTS voices and submit them, along with a written report (5–10 pp.) that explains the entire process, including any problems encountered and their resolution. The report is due six weeks after the end of the seminar (27.5.2016).
  • Register through LSF/HISPOS by 15.4.2016.
  • Mailing list for questions and discussion:

Course Overview

  • Split into groups of 3–4 people each; each group should have at least
    • one native English speaker, and
    • one programmer/hacker, and
    • one phonetician
  • Prepare prompt list
  • Record speech corpus in studio
  • Process recordings (including automatic segmentation)
  • Build TTS voice
  • Use MaryTTS (developed here)

MaryTTS

Install MaryTTS from source

  1. Prerequisites: JDK 7 or later, git, Maven
  2. Clone the source repository:

    git clone https://github.com/marytts/marytts.git -b v5.2beta2
  3. Enter your MaryTTS directory and install:

    cd marytts
    mvn install
  4. Run local MaryTTS server

    target/marytts-5.2-beta2/bin/marytts-server
  5. Surf to http://localhost:59125/
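
Besides the web interface, the server also answers plain HTTP requests. A minimal smoke test from the command line, assuming the standard /process parameters of the MaryTTS 5.x HTTP interface:

    # synthesize "Hello world" to a WAV file via the HTTP API
    curl -G http://localhost:59125/process \
         --data-urlencode "INPUT_TEXT=Hello world" \
         --data-urlencode "INPUT_TYPE=TEXT" \
         --data-urlencode "OUTPUT_TYPE=AUDIO" \
         --data-urlencode "AUDIO=WAVE_FILE" \
         --data-urlencode "LOCALE=en_US" \
         -o hello.wav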

Debugging in Eclipse

See https://github.com/marytts/marytts/wiki/Eclipse

Install a new voice

  1. Run MaryTTS component installer

    target/marytts-5.2-beta2/bin/marytts-component-installer
  2. Select voice and click “Install selected”
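
To verify that a newly installed voice is visible to the (restarted) server, the list of installed voices can be queried; the /voices endpoint below follows the MaryTTS 5.x conventions:

    # list the voices known to the running MaryTTS server
    curl http://localhost:59125/voices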

TTS Synthesis Process

Text

Hello world

Allophones

MaryXML

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
  <p>
    <s>
      <phrase>
        <t accent="L+H*" g2p_method="lexicon" ph="h @ - ' l @U" pos="UH">
          Hello
          <syllable ph="h @">
            <ph p="h"/>
            <ph p="@"/>
          </syllable>
          <syllable accent="L+H*" ph="l @U" stress="1">
            <ph p="l"/>
            <ph p="@U"/>
          </syllable>
        </t>
        <t accent="!H*" g2p_method="lexicon" ph="' w r= l d" pos="NN">
          world
          <syllable accent="!H*" ph="w r= l d" stress="1">
            <ph p="w"/>
            <ph p="r="/>
            <ph p="l"/>
            <ph p="d"/>
          </syllable>
        </t>
        <boundary breakindex="5" tone="L-L%"/>
      </phrase>
    </s>
  </p>
</maryxml>
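
MaryXML like the document above does not have to be written by hand; the running server can emit its intermediate representations directly. A hedged example, assuming the same /process parameters as before (an OUTPUT_TYPE such as ALLOPHONES or INTONATION yields MaryXML of this kind):

    # request the intermediate MaryXML instead of audio
    curl -G http://localhost:59125/process \
         --data-urlencode "INPUT_TEXT=Hello world" \
         --data-urlencode "INPUT_TYPE=TEXT" \
         --data-urlencode "OUTPUT_TYPE=INTONATION" \
         --data-urlencode "LOCALE=en_US"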

Target Features

phone  pos_in_syl  accented  ph_cplace
h      0           0         g
@      1           0         0
l      0           1         a
@U     1           1         0
w      0           1         l
r=     1           1         0
l      2           1         a
d      3           1         a
_      0           0         0

(small selection)
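
The full feature vectors can be inspected in the same way; assuming the TARGETFEATURES output type of MaryTTS 5.x, the server prints one vector per target unit:

    # dump the target feature vectors for a test sentence
    curl -G http://localhost:59125/process \
         --data-urlencode "INPUT_TEXT=Hello world" \
         --data-urlencode "INPUT_TYPE=TEXT" \
         --data-urlencode "OUTPUT_TYPE=TARGETFEATURES" \
         --data-urlencode "LOCALE=en_US"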

Audio

The target feature vectors are used to generate or retrieve the audio.

Voicebuilding

Synthesis “backwards”

Compute feature vectors from text, then assign them to the provided data.

Example

Prepare voicebuilding project

  1. Install software dependencies (as needed)
  2. Clone voicebuilding project

    git clone https://github.com/psibre/voice-cmu-slt --recursive
    cd voice-cmu-slt

Build with Gradle

./gradlew legacyInit
./gradlew assemble

Run the voice

  • Run an ad-hoc MaryTTS server with this voice

    ./gradlew run

or

  1. Copy the distributions to the MaryTTS directory’s installer cache (under download)
  2. Install with the component installer
  3. Start the MaryTTS server

then

Surf to http://localhost:59125
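
The new voice can also be addressed explicitly via the VOICE parameter of /process; the voice name cmu-slt below is an assumption derived from the project name:

    # synthesize with the freshly built voice (voice name assumed)
    curl -G http://localhost:59125/process \
         --data-urlencode "INPUT_TEXT=The quick brown fox jumps over the lazy dog." \
         --data-urlencode "INPUT_TYPE=TEXT" \
         --data-urlencode "OUTPUT_TYPE=AUDIO" \
         --data-urlencode "AUDIO=WAVE_FILE" \
         --data-urlencode "LOCALE=en_US" \
         --data-urlencode "VOICE=cmu-slt" \
         -o test.wav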

Details

Legacy Init

  1. Retrieve CMU SLT Arctic data dependency
  2. Unpack and convert audio files
  3. Unpack and convert label files
  4. Extract text prompts and generate utterance list

Task Execution Graph

Pitchmarking

Using Praat

./gradlew legacyPraatPitchmarker

input: wav/*.wav
output: pm/*.pm

MCEP extraction

Using ch_track from the Edinburgh Speech Tools (EST)

./gradlew legacyMCEPMaker

input: wav/*.wav
output: mcep/*.mcep

G2P and labeling

Predict phone sequence from text using MaryTTS

./gradlew generateAllophones

input: text/*.txt
output: prompt_allophones/*.xml

Check phonetic alignment

./gradlew legacyTranscriptionAligner

input: prompt_allophones/*.xml, lab/*.lab
output: allophones/*.xml

Unit features

Compute and assign a feature vector to each unit using MaryTTS

./gradlew legacyPhoneUnitFeatureComputer legacyHalfPhoneUnitFeatureComputer

input: allophones/*.xml, mary/features.txt
output: phonefeatures/*.pfeats, halfphonefeatures/*.hpfeats

Data files

Compile “timeline” files for audio, utterances, and acoustic features

./gradlew legacyWaveTimelineMaker legacyBasenameTimelineMaker legacyMCepTimelineMaker

input: wav/*.wav, pm/*.pm, mcep/*.mcep
output: mary/timeline_waveforms.mry, mary/timeline_basenames.mry, mary/timeline_mcep.mry

These contain the actual data from the wav and mcep files, in pitch-synchronous “datagram” packets.

Acoustic models

Phone-level and halfphone-level unit and feature files

./gradlew legacyPhoneUnitfileWriter legacyHalfPhoneUnitfileWriter legacyPhoneFeatureFileWriter legacyHalfPhoneFeatureFileWriter

input: pm/*.pm, phonelab/*.lab, phonefeatures/*.pfeats, halfphonelab/*.hplab
output: mary/phoneUnits.mry, mary/halfphoneUnits.mry, mary/phoneFeatures.mry, mary/phoneUnitFeatureDefinition.txt, mary/halfphoneFeatures.mry, mary/halfphoneUnitFeatureDefinition.txt

CARTs for Duration and F0

Using wagon from EST

./gradlew legacyDurationCARTTrainer legacyF0CARTTrainer

input: mary/phoneUnits.mry, mary/phoneFeatures.mry, mary/timeline_waveforms.mry
output: mary/dur.tree, mary/f0.left.tree, mary/f0.mid.tree, mary/f0.right.tree

Distributable voice package

Ready for deployment in MaryTTS installation

./gradlew assemble

input: mary/cart.mry, featureSequence.txt, mary/dur.tree, mary/f0.left.tree, mary/f0.mid.tree, mary/f0.right.tree, mary/halfphoneFeatures_ac.mry, mary/joinCostFeatures.mry, mary/joinCostWeights.txt, mary/halfphoneUnits.mry, mary/timeline_basenames.mry, mary/timeline_waveforms.mry
output: my_voice.zip, my_voice-component.xml

Your Turn

Prompt list creation

  1. Prerequisites: TeX Live, SoX (compiled with MP3 support)
  2. Surf to https://github.com/psibre/arctic-prompts
  3. Download or clone it
  4. Typeset with Gradle (see the sketch after this list)
  5. Postrequisites: Adobe Reader, Adobe Flash
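
A minimal sketch of step 4, assuming you are inside the cloned arctic-prompts directory; the exact task name is not specified here, so check the task list first:

    # list the Gradle tasks that the arctic-prompts project offers
    ./gradlew tasks

    # then run the task that typesets the prompts (task name assumed; use the list above)
    ./gradlew build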

Speech recording

  • Each group plans and carries out recordings for ~1 h of speech data
  • Use a phonetically balanced prompt set, e.g., TIMIT or ARCTIC
  • Use collaborative versioning tools to share this data within the team, e.g., Dropbox, git-annex, etc. (see the sketch below)
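
A minimal sketch for preparing and sharing the recordings with git-annex; the 16 kHz mono target format and the directory layout are assumptions, not course requirements:

    # downsample a studio recording to 16 kHz mono WAV (target format assumed)
    sox session1/arctic_a0001_raw.wav -r 16000 -c 1 wav/arctic_a0001.wav

    # track the large audio files with git-annex instead of plain git
    git init && git annex init "studio workstation"
    git annex add wav/
    git commit -m "Add first recording session"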

Phonetic segmentation

Use forced alignment for automatic segmentation

  • MAUS,
  • EHMM,
  • HTK,
  • CMU Sphinx,
  • Julius,
  • Kaldi,

and let’s not forget: manual labor!

Good luck!