Voicebuilding for Text-to-Speech Synthesis

Ingmar Steiner

20–31.03.2017

Preliminaries

Formalities

  • This course is a project seminar (for LST/CoLi students), or a regular seminar (for CS students).
  • Successful participation in the lecture “Text-to-Speech Synthesis” (Prof. Möbius) is a mandatory prerequisite.
  • To pass this course, you will need to build TTS voices and submit them along with a written report (5–10 pp.). The report must explain the entire process, including any problems encountered and how they were resolved. It is due six weeks after the end of the seminar (i.e., 12.5.2017).
  • Register through LSF/HISPOS before 31.3.2017.
  • Mailing list for questions, discussion: voicebuildingsem@ml.coli.uni-saarland.de

Course Overview

  • Split into groups of 3–4 people each; each group should have at least
    • one native English speaker, and
    • one programmer/hacker, and
    • one phonetician
  • Prepare prompt list
  • Record speech corpus in studio
  • Process recordings (including automatic segmentation)
  • Build TTS voice
  • Use MaryTTS (invented here)

MaryTTS

MaryTTS Overview

Installer

TTS Synthesis Process

Text

Hello world

Allophones

MaryXML

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
  <p>
    <s>
      <phrase>
        <t accent="L+H*" g2p_method="lexicon" ph="h @ - ' l @U" pos="UH">
          Hello
          <syllable ph="h @">
            <ph p="h"/>
            <ph p="@"/>
          </syllable>
          <syllable accent="L+H*" ph="l @U" stress="1">
            <ph p="l"/>
            <ph p="@U"/>
          </syllable>
        </t>
        <t accent="!H*" g2p_method="lexicon" ph="' w r= l d" pos="NN">
          world
          <syllable accent="!H*" ph="w r= l d" stress="1">
            <ph p="w"/>
            <ph p="r="/>
            <ph p="l"/>
            <ph p="d"/>
          </syllable>
        </t>
        <boundary breakindex="5" tone="L-L%"/>
      </phrase>
    </s>
  </p>
</maryxml>
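
To inspect a document like this programmatically, the phone sequence of each token can be read with any XML parser. The following is a minimal sketch using Python's standard library. The namespace URI and the element and attribute names are taken from the example above; the function name `tokens_with_phones` and the shortened copy of the document are only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the MaryXML example above.
NS = {"m": "http://mary.dfki.de/2002/MaryXML"}

# Shortened copy of the example (XML declaration and accent attributes omitted).
ALLOPHONES = """
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" version="0.5" xml:lang="en-US">
  <p><s><phrase>
    <t ph="h @ - ' l @U" pos="UH">
      Hello
      <syllable ph="h @"><ph p="h"/><ph p="@"/></syllable>
      <syllable ph="l @U" stress="1"><ph p="l"/><ph p="@U"/></syllable>
    </t>
    <t ph="' w r= l d" pos="NN">
      world
      <syllable ph="w r= l d" stress="1">
        <ph p="w"/><ph p="r="/><ph p="l"/><ph p="d"/>
      </syllable>
    </t>
    <boundary breakindex="5" tone="L-L%"/>
  </phrase></s></p>
</maryxml>
"""


def tokens_with_phones(xml_text):
    """Return a list of (word, [phones]) pairs, one per <t> element."""
    root = ET.fromstring(xml_text)
    result = []
    for token in root.iterfind(".//m:t", NS):
        # The word itself is the text directly inside <t>, before the first <syllable>.
        word = (token.text or "").strip()
        phones = [ph.get("p") for ph in token.iterfind(".//m:ph", NS)]
        result.append((word, phones))
    return result


for word, phones in tokens_with_phones(ALLOPHONES):
    print(word, phones)
```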

Target Features

phone pos_in_syl accented ph_cplace
h 0 0 g
@ 1 0 0
l 0 1 a
@U 1 1 0
w 0 1 l
r= 1 1 0
l 2 1 a
d 3 1 a
_ 0 0 0

(small selection)
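
Two of the columns above, pos_in_syl and accented, appear to follow directly from the syllable structure in the MaryXML. The sketch below reproduces them for the phones of “Hello world”. It is not the MaryTTS feature processor code; the syllable list is transcribed by hand from the example, and the function name `target_features` is hypothetical. ph_cplace is left out because it needs the phone set definition.

```python
# Each syllable: (phones, carries_accent), transcribed by hand from the MaryXML example.
SYLLABLES = [
    (["h", "@"], False),
    (["l", "@U"], True),
    (["w", "r=", "l", "d"], True),
]


def target_features(syllables):
    """Return one feature dict per phone (illustrative subset of features only)."""
    rows = []
    for phones, carries_accent in syllables:
        for position, phone in enumerate(phones):
            rows.append({
                "phone": phone,
                "pos_in_syl": position,           # index of the phone within its syllable
                "accented": int(carries_accent),  # 1 if the syllable has an accent
            })
    return rows


for row in target_features(SYLLABLES):
    print(row["phone"], row["pos_in_syl"], row["accented"])
```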

Audio

Target feature vectors used to generate/retrieve audio:

Voicebuilding

Synthesis “backwards”

Compute feature vectors from text

then assign them to the provided data.

Example

Prepare voicebuilding project

  1. Install software dependencies (as needed)
  2. Clone voicebuilding project

    git clone https://github.com/marytts/voice-cmu-slt -b v5.2
    cd voice-cmu-slt

Build with Gradle

./gradlew legacyInit
./gradlew build

Run the voice

  • Run an ad-hoc MaryTTS server with this voice

    ./gradlew run

Details

Legacy Init

  1. Retrieve CMU SLT Arctic data dependency
  2. Unpack and convert audio files
  3. Unpack and convert label files
  4. Extract text prompts and generate utterance list

Task Execution Graph

Pitchmarking

Using Praat

./gradlew legacyPraatPitchmarker
input: wav/*.wav
output: pm/*.pm

MCEP extraction

Using ch_track from EST

./gradlew legacyMCEPMaker
input: wav/*.wav
output: mcep/*.mcep

G2P and labeling

Predict phone sequence from text using MaryTTS

./gradlew generateAllophones
input: text/*.txt
output: prompt_allophones/*.xml

Check phonetic alignment

./gradlew legacyTranscriptionAligner
input: prompt_allophones/*.xml, lab/*.lab
output: allophones/*.xml
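
To get a feel for what checking the alignment involves, the predicted phone sequence can be compared against the labeled one and every difference listed. The sketch below does this with Python's `difflib`; it is not the algorithm used by the legacyTranscriptionAligner task, and the function name and the example label sequence (with `_` as pause symbol) are assumptions for illustration.

```python
from difflib import SequenceMatcher


def alignment_mismatches(predicted, labeled):
    """List the differences between two phone sequences.

    Returns (operation, predicted_phones, labeled_phones) tuples for every
    region where the sequences disagree.
    """
    matcher = SequenceMatcher(None, predicted, labeled, autojunk=False)
    return [
        (tag, predicted[i1:i2], labeled[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Phones predicted from text vs. a hypothetical label sequence with pauses.
predicted = ["h", "@", "l", "@U", "w", "r=", "l", "d"]
labeled = ["_", "h", "@", "l", "@U", "_", "w", "r=", "l", "d", "_"]

for operation, pred, lab in alignment_mismatches(predicted, labeled):
    print(operation, pred, lab)
```

Here the only differences are pauses present in the labels but not in the prediction, which is the harmless case; substituted or deleted phones deserve a closer look.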

Unit features

Compute and assign feature vector to each unit using MaryTTS

./gradlew legacyPhoneUnitFeatureComputer legacyHalfPhoneUnitFeatureComputer
input: allophones/*.xml, mary/features.txt
output: phonefeatures/*.pfeats, halfphonefeatures/*.hpfeats

Data files

Compile “timeline” files for audio, utterances, and acoustic features

./gradlew legacyWaveTimelineMaker legacyBasenameTimelineMaker legacyMCepTimelineMaker
input: wav/*.wav, pm/*.pm, mcep/*.mcep
output: mary/timeline_waveforms.mry, mary/timeline_basenames.mry, mary/timeline_mcep.mry

These contain the actual data from the wav and mcep files, in pitch-synchronous “datagram” packets.
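
As a rough illustration of what “pitch-synchronous” means here, the sketch below cuts a sample sequence into packets at the pitchmark times. It assumes that each packet runs from one pitchmark to the next, and it says nothing about the actual binary layout of the .mry timeline files; the function name is hypothetical.

```python
def pitch_synchronous_datagrams(samples, sample_rate, pitchmarks):
    """Cut samples into consecutive packets ending at each pitchmark time (seconds)."""
    datagrams = []
    start = 0
    for time in pitchmarks:
        end = min(round(time * sample_rate), len(samples))
        if end > start:  # skip pitchmarks that would yield an empty packet
            datagrams.append(samples[start:end])
            start = end
    return datagrams


# Toy signal: 10 samples at 10 Hz, with pitchmarks at 0.3 s, 0.7 s and 1.0 s.
packets = pitch_synchronous_datagrams(list(range(10)), 10, [0.3, 0.7, 1.0])
print(packets)
```

Because the packets follow the pitchmarks, their length varies with the local pitch period instead of being a fixed frame size.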

Acoustic models

Phone-level and halfphone-level unit and features files

./gradlew legacyPhoneUnitfileWriter legacyHalfPhoneUnitfileWriter legacyPhoneFeatureFileWriter legacyHalfPhoneFeatureFileWriter
input: pm/*.pm, phonelab/*.lab, phonefeatures/*.pfeats, halfphonelab/*.hplab
output: mary/phoneUnits.mry, mary/halfphoneUnits.mry, mary/phoneFeatures.mry, mary/phoneUnitFeatureDefinition.txt, mary/halfphoneFeatures.mry, mary/halfphoneUnitFeatureDefinition.txt

CARTs for Duration and F0

Using wagon from EST

./gradlew legacyDurationCARTTrainer legacyF0CARTTrainer
input: mary/phoneUnits.mry, mary/phoneFeatures.mry, mary/timeline_waveforms.mry
output: mary/dur.tree, mary/f0.left.tree, mary/f0.mid.tree, mary/f0.right.tree

Distributable voice package

Ready for deployment in MaryTTS installation

./gradlew assemble
input: mary/cart.mry, featureSequence.txt, mary/dur.tree, mary/f0.left.tree, mary/f0.mid.tree, mary/f0.right.tree, mary/halfphoneFeatures_ac.mry, mary/joinCostFeatures.mry, mary/joinCostWeights.txt, mary/halfphoneUnits.mry, mary/timeline_basenames.mry, mary/timeline_waveforms.mry
output: my_voice.zip, my_voice-component.xml

Your Turn

Groups

Each group will need:

  • One designated native speaker
  • Praat scripting skills
  • Software engineering (Java) skills
  • Git skills

Prompt list

Phonetically balanced, e.g.,

Slideshow preparation

  1. Convert prompt list to Markdown or LaTeX source
  2. Compile to a slideshow using Pandoc or similar
  3. Generate beep and embed it via JavaScript or media9
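
Steps 1 and 3 can be scripted. The sketch below is one possible way, assuming the prompt list is available as (utterance ID, text) pairs: it writes one Markdown slide per prompt, using a level-1 heading per slide, and generates a short sine beep as a WAV file. The utterance IDs, file names, and beep parameters (1 kHz, 0.2 s) are illustrative choices, not requirements.

```python
import math
import struct
import wave


def prompts_to_markdown(prompts):
    """Return Markdown source with one slide (level-1 heading) per prompt."""
    return "\n".join(f"# {utt_id}\n\n{text}\n" for utt_id, text in prompts)


def beep_samples(frequency=1000.0, duration=0.2, sample_rate=48000, amplitude=0.5):
    """Return a sine tone as a list of 16-bit integer samples."""
    count = round(duration * sample_rate)
    return [
        int(amplitude * 32767 * math.sin(2 * math.pi * frequency * i / sample_rate))
        for i in range(count)
    ]


def write_wav(path, samples, sample_rate=48000):
    """Write mono 16-bit PCM samples to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # bytes per sample
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))


if __name__ == "__main__":
    # Hypothetical prompts; replace with the real prompt list.
    prompts = [("utt0001", "Hello world."), ("utt0002", "This is a test.")]
    with open("prompts.md", "w", encoding="utf-8") as handle:
        handle.write(prompts_to_markdown(prompts))
    write_wav("beep.wav", beep_samples())
```

The resulting prompts.md could then be compiled with, for example, `pandoc -s -t revealjs prompts.md -o prompts.html`. Embedding the beep so that it plays on every slide transition still has to be added on top of this.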

Recording procedure

Presentation laptop with HDMI output

Video: displayed in studio booth (prompt text)
Audio: recording clock source (48 kHz)

DAW (Cubase/ProTools) records multiple channels:

  1. Close-talking mic
  2. Mounted mic
  3. Beeps from presentation (acoustic prompt/slide transition markers)
  4. MIDI track with manual markers (optional)

Data processing

Data formats

  • Store audio as FLAC
  • Serialize structured annotations to YAML or JSON
  • Easy to manipulate with many programming languages!
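
As an example of such a serialization, the sketch below stores per-utterance annotations as JSON using only the Python standard library. The field names (`id`, `text`, `start`, `end`) are hypothetical; use whatever structure fits your project.

```python
import json


def dump_annotations(utterances):
    """Serialize a list of utterance annotations to a readable JSON string."""
    return json.dumps(utterances, indent=2, ensure_ascii=False, sort_keys=True)


def load_annotations(text):
    """Parse annotations previously written by dump_annotations()."""
    return json.loads(text)


# Hypothetical annotations: prompt text plus start/end times in seconds.
utterances = [
    {"id": "utt0001", "text": "Hello world.", "start": 1.0, "end": 3.0},
    {"id": "utt0002", "text": "This is a test.", "start": 3.0, "end": 5.5},
]

serialized = dump_annotations(utterances)
print(serialized)
```

Sorted keys and fixed indentation keep the files stable between runs, which makes changes easy to review in Git.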

Audio post-processing

  • Normalization
  • Noise reduction (optional)
  • Detect event times in “beep” channel
  • Assign utterance start/end times
  • Split speech channel into utterances
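
The steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not a production tool: it assumes the channels are already loaded as lists of floats in the range -1 to 1, that one beep marks the start of each utterance, and that a fixed amplitude threshold is enough to find the beeps. All function names and default values are hypothetical.

```python
def peak_normalize(samples, target_peak=0.9):
    """Scale samples so that the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def detect_onsets(samples, sample_rate, threshold=0.1, min_gap=0.5):
    """Return onset times (s) of events in the beep channel.

    A new event starts when the signal exceeds the threshold after at least
    min_gap seconds without any sample above the threshold.
    """
    onsets = []
    last_active = None
    for index, sample in enumerate(samples):
        if abs(sample) >= threshold:
            time = index / sample_rate
            if last_active is None or time - last_active >= min_gap:
                onsets.append(time)
            last_active = time
    return onsets


def utterance_spans(onsets, total_duration):
    """Each utterance runs from one beep onset to the next (or to the end)."""
    ends = onsets[1:] + [total_duration]
    return list(zip(onsets, ends))


def split_channel(samples, sample_rate, spans):
    """Cut the speech channel into one sample list per (start, end) span."""
    return [
        samples[round(start * sample_rate):round(end * sample_rate)]
        for start, end in spans
    ]
```

In practice the spans usually need trimming, since the beep itself and the silence around the utterance are still inside each span.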

Phonetic segmentation

Forced alignment with one of

Don’t forget to analyze and check for errors!
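
A simple automatic sanity check is to flag segments with implausible durations, as well as gaps or overlaps between neighbouring segments. The sketch below assumes the segmentation has already been read into (start, end, label) tuples in seconds; the duration limits are arbitrary starting values, and the function name is hypothetical.

```python
def suspicious_segments(segments, min_dur=0.02, max_dur=0.5):
    """Return (label, start, reason) for every segment that looks wrong."""
    flagged = []
    previous_end = None
    for start, end, label in segments:
        duration = end - start
        if duration < min_dur:
            flagged.append((label, start, "too short"))
        elif duration > max_dur:
            flagged.append((label, start, "too long"))
        # Consecutive segments should share a boundary.
        if previous_end is not None and abs(start - previous_end) > 1e-6:
            flagged.append((label, start, "gap or overlap"))
        previous_end = end
    return flagged


# Hypothetical segmentation with three problems.
segments = [
    (0.0, 0.1, "h"),
    (0.1, 0.105, "@"),   # only 5 ms long
    (0.105, 0.9, "l"),   # almost 800 ms long
    (1.0, 1.1, "@U"),    # starts 100 ms after the previous segment ends
]

for label, start, reason in suspicious_segments(segments):
    print(f"{start:.3f} {label}: {reason}")
```

Flagged segments are only candidates; listen to them and inspect them in Praat before changing anything.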

Project SCM and data distribution

Use Git.

But don’t store big binary files (such as audio) in Git!

Use solutions such as

Voicebuilding

Legacy VoiceImportTools

Pros

  • Well-documented (see Wiki)
  • Tried and true
  • HMM voicebuilding

Cons

  • Various pitfalls
  • No caching
  • No task dependency management

Gradle Voicebuilding Plugin

Pros

  • Agile
  • Task dependency management
  • Caching

Cons

  • Experimental
  • Barely documented

DevOps!

Requirements

  • Java 7 or 8
  • Praat
  • SoX
  • Edinburgh Speech Tools

Installation options

HTS Voicebuilding with Docker

  1. Install Docker
  2. Create a fresh directory and download this Dockerfile
  3. Register for an HTK account
  4. Run

    docker build \
    --build-arg HTKUSER=***** \
    --build-arg HTKPASSWORD=***** \
    -t marytts-builder-hsmm .

Prepare for HTS Voicebuilding

  1. Download (or build from source) the MaryTTS Builder and unpack it to some location
  2. Run the bin/voiceimport.sh script found in that location from within your voicebuilding project’s build directory
    • Click “Settings” in the GUI and set the db.marybase property to /marytts; click “Save”
    • Run the “HMMVoiceFeatureSelection” component, confirm the dialog

Run the HTS Voicebuilding

Now run

docker run -v $PWD:$PWD -t marytts-builder-hsmm bash -c \
"cd $PWD; \
/marytts/target/marytts-builder-5.2/bin/voiceimport.sh \
HMMVoiceDataPreparation \
HMMVoiceConfigure \
HMMVoiceMakeData \
HMMVoiceMakeVoice"

This will take a long time. Follow the progress by tailing the hts/log-SOMETIMESTAMP log file.

Good luck!

Assemble the HTS voice

Finally, run

docker run -v $PWD:$PWD -t marytts-builder-hsmm bash -c \
"cd $PWD; \
/marytts/target/marytts-builder-5.2/bin/voiceimport.sh \
HMMVoiceCompiler"

This will collect all resources and generate code under the build/mary/voice-YOUR_VOICE_NAME directory. It will also attempt to run Maven and fail (don’t worry about that).

Copy this buildscript into the generated Maven project directory, then run gradle build or gradle run.

Have fun!

Background image © Nevit Dilmen, CC-BY-SA