Voicebuilding for Text-to-Speech Synthesis

Ingmar Steiner

20–31.03.2017

Preliminaries

Formalities

This course is a project seminar (for LST/CoLi students), or a regular seminar (for CS students).
Successful participation in the lecture “Text-to-Speech Synthesis” (Prof. Möbius) is a mandatory prerequisite.
To pass this course, you will need to build TTS voices and submit them, along with a written report. The report (5–10 pages) must explain the entire process, including any problems encountered and how they were resolved. This report is due six weeks after the end of the seminar (i.e., 12.5.2017).
Register through LSF/HISPOS before 31.3.2017.
Mailing list for questions, discussion: voicebuildingsem@ml.coli.uni-saarland.de

Course Overview

Split into groups of 3–4 people each; each group should have at least
- one native English speaker, and
- one programmer/hacker, and
- one phonetician
Prepare prompt list
Record speech corpus in studio
Process recordings (including automatic segmentation)
Build TTS voice
Use MaryTTS (invented here)

MaryTTS

MaryTTS Overview

Open-source, multilingual TTS platform in Java
http://mary.dfki.de/
Development hosted at https://github.com/marytts/marytts

Installer

TTS Synthesis Process

Text

Hello world

Allophones

MaryXML

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
  <p>
    <s>
      <phrase>
        <t accent="L+H*" g2p_method="lexicon" ph="h @ - ' l @U" pos="UH">
          Hello
          <syllable ph="h @">
            <ph p="h"/>
            <ph p="@"/>
          </syllable>
          <syllable accent="L+H*" ph="l @U" stress="1">
            <ph p="l"/>
            <ph p="@U"/>
          </syllable>
        </t>
        <t accent="!H*" g2p_method="lexicon" ph="' w r= l d" pos="NN">
          world
          <syllable accent="!H*" ph="w r= l d" stress="1">
            <ph p="w"/>
            <ph p="r="/>
            <ph p="l"/>
            <ph p="d"/>
          </syllable>
        </t>
        <boundary breakindex="5" tone="L-L%"/>
      </phrase>
    </s>
  </p>
</maryxml>

Target Features

phone	pos_in_syl	accented	ph_cplace
h	0	0	g
@	1	0	0
l	0	1	a
@U	1	1	0
w	0	1	l
r=	1	1	0
l	2	1	a
d	3	1	a
_	0	0	0

(small selection)
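
As a rough illustration of how such features relate to the MaryXML above, the following Python sketch derives the phone, pos_in_syl, and accented columns from a trimmed copy of the allophones document. This is not how MaryTTS computes features internally (it uses its own feature processors); the function name `phone_rows` is made up for this example, and ph_cplace is left out because it depends on the allophone set definition rather than on the XML itself.

```python
import xml.etree.ElementTree as ET

MARY_NS = "http://mary.dfki.de/2002/MaryXML"

# Trimmed copy of the MaryXML for "Hello world" (unneeded attributes dropped).
MARYXML = """
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" version="0.5" xml:lang="en-US">
  <p><s><phrase>
    <t ph="h @ - ' l @U">Hello
      <syllable ph="h @"><ph p="h"/><ph p="@"/></syllable>
      <syllable accent="L+H*" ph="l @U" stress="1"><ph p="l"/><ph p="@U"/></syllable>
    </t>
    <t ph="' w r= l d">world
      <syllable accent="!H*" ph="w r= l d" stress="1">
        <ph p="w"/><ph p="r="/><ph p="l"/><ph p="d"/>
      </syllable>
    </t>
    <boundary breakindex="5" tone="L-L%"/>
  </phrase></s></p>
</maryxml>
"""


def phone_rows(xml_text):
    """Return (phone, pos_in_syl, accented) tuples in document order.

    Hypothetical helper: a boundary is represented as the pause unit "_".
    """
    root = ET.fromstring(xml_text.strip())
    rows = []
    for elem in root.iter():
        if elem.tag == f"{{{MARY_NS}}}syllable":
            # A syllable counts as accented if it carries an accent attribute.
            accented = 1 if elem.get("accent") else 0
            for pos, ph in enumerate(elem.findall(f"{{{MARY_NS}}}ph")):
                rows.append((ph.get("p"), pos, accented))
        elif elem.tag == f"{{{MARY_NS}}}boundary":
            rows.append(("_", 0, 0))
    return rows


if __name__ == "__main__":
    for phone, pos_in_syl, accented in phone_rows(MARYXML):
        print(phone, pos_in_syl, accented, sep="\t")
```

The rows printed by this sketch correspond to the first three columns of the table above.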

Audio

Target feature vectors are used to generate or retrieve audio.

Voicebuilding

Synthesis “backwards”

Compute feature vectors from text, then assign them to the provided data.

Example

CMU Arctic database
Public domain speech corpora designed for synthesis
Data freely available at http://festvox.org/cmu_arctic/

Prepare voicebuilding project

Install software dependencies (as needed)

Clone voicebuilding project

```
git clone https://github.com/marytts/voice-cmu-slt -b v5.2
cd voice-cmu-slt
```

Build with Gradle

```
./gradlew legacyInit
./gradlew build
```

Run the voice

Run an ad-hoc MaryTTS server with this voice
```
./gradlew run
```

Details

Legacy Init

Retrieve CMU SLT Arctic data dependency
Unpack and convert audio files
Unpack and convert label files
Extract text prompts and generate utterance list

Task Execution Graph

Pitchmarking

Using Praat

```
./gradlew legacyPraatPitchmarker
```

input: wav/*.wav
output: pm/*.pm

MCEP extraction

Using ch_track from EST

```
./gradlew legacyMCEPMaker
```

input: wav/*.wav
output: mcep/*.mcep

G2P and labeling

Predict phone sequence from text using MaryTTS

```
./gradlew generateAllophones
```

input: text/*.txt
output: prompt_allophones/*.xml

Check phonetic alignment

```
./gradlew legacyTranscriptionAligner
```

input: prompt_allophones/*.xml, lab/*.lab
output: allophones/*.xml
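
To give an idea of what checking the alignment involves, here is a minimal Python sketch that compares a predicted phone sequence with the labels from a segmentation file and reports where they differ. It is not the algorithm used by legacyTranscriptionAligner. The helper names are hypothetical, and the label format (optional header lines, a `#` separator, then one `end_time code label` line per segment) is an assumption about the lab files.

```python
from difflib import SequenceMatcher


def read_lab_labels(lab_text):
    """Return the label column of a lab file (assumed layout, see above)."""
    labels = []
    in_body = False
    for line in lab_text.splitlines():
        line = line.strip()
        if not in_body:
            # Everything up to and including the "#" line is header.
            in_body = line == "#"
            continue
        if line:
            labels.append(line.split()[-1])
    return labels


def mismatches(predicted, labeled):
    """List (operation, predicted_slice, labeled_slice) for all differences."""
    matcher = SequenceMatcher(a=predicted, b=labeled, autojunk=False)
    return [
        (tag, predicted[i1:i2], labeled[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    predicted = ["h", "@", "l", "@U", "w", "r=", "l", "d"]
    # Made-up label file in which the speaker paused between the two words.
    lab = "#\n0.10 125 h\n0.18 125 @\n0.25 125 l\n0.40 125 @U\n0.55 125 _\n" \
          "0.62 125 w\n0.75 125 r=\n0.82 125 l\n0.90 125 d\n"
    for diff in mismatches(predicted, read_lab_labels(lab)):
        print(diff)
```

In this made-up example the only difference is a pause present in the labels but not in the predicted sequence, which is the kind of discrepancy an aligner has to reconcile.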

Unit features

Compute and assign feature vector to each unit using MaryTTS

```
./gradlew legacyPhoneUnitFeatureComputer legacyHalfPhoneUnitFeatureComputer
```

input: allophones/*.xml, mary/features.txt
output: phonefeatures/*.pfeats, halfphonefeatures/*.hpfeats

Data files

Compile “timeline” files for audio, utterances, and acoustic features

```
./gradlew legacyWaveTimelineMaker legacyBasenameTimelineMaker legacyMCepTimelineMaker
```

input: wav/*.wav, pm/*.pm, mcep/*.mcep
output: mary/timeline_waveforms.mry, mary/timeline_basenames.mry, mary/timeline_mcep.mry

These contain the actual data from the wav and mcep files, in pitch-synchronous “datagram” packets.
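
The idea of pitch-synchronous packets can be sketched in a few lines of Python: the waveform is cut at each pitchmark, so that every packet spans one pitch period. This only illustrates the concept; it says nothing about the actual binary layout of the .mry timeline files, and the function name `split_into_datagrams` is invented for the example.

```python
def split_into_datagrams(samples, pitchmarks, sample_rate):
    """Cut a sample sequence into packets ending at each pitchmark time.

    samples: sequence of audio samples
    pitchmarks: increasing pitchmark times in seconds
    sample_rate: samples per second
    """
    datagrams = []
    start = 0
    for t in pitchmarks:
        end = min(round(t * sample_rate), len(samples))
        if end > start:  # skip empty or out-of-order packets
            datagrams.append(samples[start:end])
            start = end
    return datagrams


if __name__ == "__main__":
    # Toy signal: 10 samples at 10 Hz with three pitchmarks.
    packets = split_into_datagrams(list(range(10)), [0.2, 0.5, 1.0], 10)
    print([len(p) for p in packets])
```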

Acoustic models

Phone-level and halfphone-level unit and features files

```
./gradlew legacyPhoneUnitfileWriter legacyHalfPhoneUnitfileWriter legacyPhoneFeatureFileWriter legacyHalfPhoneFeatureFileWriter
```

input: pm/*.pm, phonelab/*.lab, phonefeatures/*.pfeats, halfphonelab/*.hplab
output: mary/phoneUnits.mry, mary/halfphoneUnits.mry, mary/phoneFeatures.mry, mary/phoneUnitFeatureDefinition.txt, mary/halfphoneFeatures.mry, mary/halfphoneUnitFeatureDefinition.txt

CARTs for Duration and F0

Using wagon from EST

```
./gradlew legacyDurationCARTTrainer legacyF0CARTTrainer
```

input: mary/phoneUnits.mry, mary/phoneFeatures.mry, mary/timeline_waveforms.mry
output: mary/dur.tree, mary/f0.left.tree, mary/f0.mid.tree, mary/f0.right.tree

Distributable voice package

Ready for deployment in MaryTTS installation

```
./gradlew assemble
```

input: mary/cart.mry, featureSequence.txt, mary/dur.tree, mary/f0.left.tree, mary/f0.mid.tree, mary/f0.right.tree, mary/halfphoneFeatures_ac.mry, mary/joinCostFeatures.mry, mary/joinCostWeights.txt, mary/halfphoneUnits.mry, mary/timeline_basenames.mry, mary/timeline_waveforms.mry
output: my_voice.zip, my_voice-component.xml

Your Turn

Groups

Each group will need:

One designated native speaker
Praat scripting skills
Software engineering (Java) skills
Git skills

Prompt list

Phonetically balanced, e.g.,

TIMIT
CMU Arctic
BITS (German)

Slideshow preparation

Convert prompt list to Markdown or LaTeX source
Compile to
- HTML5 (with reveal.js)
- PDF (with beamer)
using Pandoc or similar
Generate beep and embed it via JavaScript or media9
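
One possible way to do the conversion step is sketched below in Python: each prompt becomes one slide, using a level-2 heading per slide so that Pandoc starts a new slide at every heading. The function name, the prompt IDs, and the prompt texts are placeholders for illustration, not part of any existing tool or corpus.

```python
def prompts_to_markdown(prompts):
    """Render (utterance_id, text) pairs as Markdown, one slide per prompt.

    Each slide is a level-2 heading (the ID) followed by the prompt text.
    """
    slides = [f"## {utt_id}\n\n{text}\n" for utt_id, text in prompts]
    return "\n".join(slides)


if __name__ == "__main__":
    # Placeholder prompts; a real list would come from TIMIT, CMU Arctic, etc.
    prompts = [
        ("p0001", "Hello world."),
        ("p0002", "Good morning."),
    ]
    with open("prompts.md", "w", encoding="utf-8") as f:
        f.write(prompts_to_markdown(prompts))
```

The resulting file could then be compiled with, for example, `pandoc -t revealjs -s prompts.md -o prompts.html` or `pandoc -t beamer prompts.md -o prompts.pdf`.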

Recording procedure

Presentation laptop with HDMI output

Video: Displayed in studio booth (prompt text)
Audio: Recording clock source (48 kHz)

DAW (Cubase/ProTools) records multiple channels:

Close-talking mic
Mounted mic
Beeps from presentation (acoustic prompt/slide transition markers)
MIDI track with manual markers (optional)

Data processing

Data formats

Store audio as FLAC
Serialize structured annotations to YAML or JSON
Easy to manipulate with many programming languages!
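
As a small example of what such annotations might look like, the following Python sketch writes one record per utterance to JSON and reads it back. The field names (`id`, `text`, `start`, `end`, `audio`) and the file names are assumptions chosen for illustration; any consistent schema would do.

```python
import json


def utterance_record(utt_id, text, start, end, audio_path):
    """Build one annotation record (hypothetical schema)."""
    return {
        "id": utt_id,
        "text": text,
        "start": start,   # seconds from the beginning of the session recording
        "end": end,
        "audio": audio_path,
    }


def dump_records(records):
    """Serialize records to a JSON string, keeping non-ASCII text readable."""
    return json.dumps(records, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    records = [
        utterance_record("p0001", "Hello world.", 0.5, 2.0, "flac/p0001.flac"),
        utterance_record("p0002", "Good morning.", 2.0, 3.4, "flac/p0002.flac"),
    ]
    text = dump_records(records)
    assert json.loads(text) == records  # round trip is lossless
    print(text)
```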

Audio post-processing

Normalization
Noise reduction (optional)
Detect event times in “beep” channel
Assign utterance start/end times
Split speech channel into utterances
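
The last three steps can be sketched as follows, in plain Python and under simplifying assumptions: the beep channel is a list of float samples in the range -1 to 1, beeps are clearly louder than the background, and each utterance runs from one beep onset to the next. The function names and default thresholds are made up for this example; a real pipeline would need tuning and error checking.

```python
import math


def detect_beep_onsets(samples, sample_rate, threshold=0.5, min_gap=0.5):
    """Return times (s) where the signal gets loud after min_gap s of quiet."""
    onsets = []
    last_loud = None
    for i, x in enumerate(samples):
        if abs(x) >= threshold:
            t = i / sample_rate
            if last_loud is None or t - last_loud >= min_gap:
                onsets.append(t)
            last_loud = t
    return onsets


def utterance_intervals(onsets, total_duration):
    """Pair each onset with the next one (or the end of the recording)."""
    bounds = list(onsets) + [total_duration]
    return list(zip(bounds[:-1], bounds[1:]))


def split_channel(samples, sample_rate, intervals):
    """Cut a channel into one sample list per (start, end) interval."""
    return [
        samples[round(start * sample_rate):round(end * sample_rate)]
        for start, end in intervals
    ]


def synthetic_beep_channel(sample_rate, duration, beep_times,
                           beep_len=0.1, freq=440.0):
    """Generate silence with short sine beeps, for trying out the functions."""
    n = round(duration * sample_rate)
    samples = [0.0] * n
    for start in beep_times:
        first = round(start * sample_rate)
        for k in range(round(beep_len * sample_rate)):
            if first + k < n:
                samples[first + k] = 0.9 * math.sin(2 * math.pi * freq * k / sample_rate)
    return samples


if __name__ == "__main__":
    rate = 8000
    beeps = synthetic_beep_channel(rate, 3.0, [0.5, 2.0])
    onsets = detect_beep_onsets(beeps, rate)
    print(utterance_intervals(onsets, 3.0))
```

In practice the same intervals would be applied to the speech channel rather than to the beep channel itself.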

Phonetic segmentation

Forced alignment with one of

Don’t forget to analyze and check for errors!
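
One simple check, sketched here in Python, is to flag segments whose duration is implausibly short or long, since these often point to alignment errors. The thresholds of 20 ms and 500 ms are arbitrary assumptions for illustration, and `suspicious_segments` is a hypothetical helper, not part of any alignment tool.

```python
def suspicious_segments(segments, min_dur=0.02, max_dur=0.5):
    """Return (label, duration) for segments outside [min_dur, max_dur].

    segments: iterable of (label, start_time, end_time) in seconds
    """
    flagged = []
    for label, start, end in segments:
        dur = end - start
        if dur < min_dur or dur > max_dur:
            flagged.append((label, round(dur, 3)))
    return flagged


if __name__ == "__main__":
    # Made-up segmentation with one very short and one very long segment.
    segments = [("h", 0.00, 0.08), ("@", 0.08, 0.085), ("l", 0.085, 0.9)]
    for label, dur in suspicious_segments(segments):
        print(f"check segment {label!r}: {dur} s")
```

Flagged segments are only candidates for manual inspection, for instance in Praat.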

Project SCM and data distribution

Use Git.

But don’t store big binary files (such as audio) in Git!

Use solutions such as

git-lfs
git-annex
Cloud storage via
- Dropbox
- GitHub release assets
- Bitbucket downloads
- etc.

Voicebuilding

Legacy VoiceImportTools

Pros

Well-documented (see Wiki)
Tried and true
HMM voicebuilding

Cons

Various pitfalls
No caching
No task dependency management

Gradle Voicebuilding Plugin

Pros

Agile
Task dependency management
Caching

Cons

Experimental
Barely documented

DevOps!

Requirements

Java 7 or 8
Praat
SoX
Edinburgh Speech Tools

Installation options

OSX (via Homebrew)
Linux (via APT)
VirtualBox (running Linux guest)
Docker (running Linux container)

HTS Voicebuilding with Docker

Install Docker
Create a fresh directory and download this Dockerfile
Register for an HTK account

Run

```
docker build \
  --build-arg HTKUSER=***** \
  --build-arg HTKPASSWORD=***** \
  -t marytts-builder-hsmm .
```

Prepare for HTS Voicebuilding

Download (or build from source) the MaryTTS Builder and unpack it to some location
From within your voicebuilding project’s build directory, run the bin/voiceimport.sh script found under that location
- Click “Settings” in the GUI and set the db.marybase property to /marytts; click “Save”
- Run the “HMMVoiceFeatureSelection” component, confirm the dialog

Run the HTS Voicebuilding

Then run

```
docker run -v $PWD:$PWD -t marytts-builder-hsmm bash -c \
"cd $PWD; \
/marytts/target/marytts-builder-5.2/bin/voiceimport.sh \
HMMVoiceDataPreparation \
HMMVoiceConfigure \
HMMVoiceMakeData \
HMMVoiceMakeVoice"
```

This will take a long time. Follow the progress by tailing the hts/log-SOMETIMESTAMP log file.

Good luck!

Assemble the HTS voice

Finally, run

```
docker run -v $PWD:$PWD -t marytts-builder-hsmm bash -c \
"cd $PWD; \
/marytts/target/marytts-builder-5.2/bin/voiceimport.sh \
HMMVoiceCompiler"
```

This will collect all resources and generate code under the build/mary/voice-YOUR_VOICE_NAME directory. It will also attempt to run Maven and fail (don’t worry about that).

Copy this buildscript into the generated Maven project directory, then run gradle build or gradle run.

Have fun!