Voicebuilding for Text-to-Speech Synthesis

Ingmar Steiner

11–15.04.2016

Introduction

Formalities

  • This course is a project seminar (for LST/CoLi students) or a regular seminar (for CS students).
  • Successful participation in the lecture “Text-to-Speech Synthesis” (Prof. Möbius) is a mandatory prerequisite.
  • To pass this course, you will need to build TTS voices and submit them, along with a written report (5–10 pp.) that explains the entire process, including any problems encountered and their resolution. The report is due six weeks after the end of the seminar (27.5.2016).
  • Register through LSF/HISPOS by 15.4.2016.
  • Mailing list for questions and discussion:

Course Overview

  • Split into groups of 3–4 people each; each group should have at least
    • one native English speaker, and
    • one programmer/hacker, and
    • one phonetician
  • Prepare prompt list
  • Record speech corpus in studio
  • Process recordings (including automatic segmentation)
  • Build TTS voice
  • Use MaryTTS (developed here)

MaryTTS

Install MaryTTS from source

  1. Prerequisites: JDK 7 or later, git, Maven
  2. Clone the source repository:

    git clone https://github.com/marytts/marytts.git -b v5.2beta2
  3. Enter your MaryTTS directory and install:

    cd marytts
    mvn install
  4. Run local MaryTTS server

    target/marytts-5.2-beta2/bin/marytts-server
  5. Surf to http://localhost:59125/
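
Besides the web interface, the server also answers plain HTTP requests. A minimal smoke test from the command line, assuming the standard /process parameters of the MaryTTS 5.x HTTP interface:

    # synthesize "Hello world" to a WAV file via the HTTP API
    curl -G http://localhost:59125/process \
         --data-urlencode "INPUT_TEXT=Hello world" \
         --data-urlencode "INPUT_TYPE=TEXT" \
         --data-urlencode "OUTPUT_TYPE=AUDIO" \
         --data-urlencode "AUDIO=WAVE_FILE" \
         --data-urlencode "LOCALE=en_US" \
         -o hello.wav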

Debugging in Eclipse

See https://github.com/marytts/marytts/wiki/Eclipse

Install a new voice

  1. Run MaryTTS component installer

    target/marytts-5.2-beta2/bin/marytts-component-installer
  2. Select voice and click “Install selected”
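
To verify that a newly installed voice is visible to the (restarted) server, the list of installed voices can be queried; the /voices endpoint below follows the MaryTTS 5.x conventions:

    # list the voices known to the running MaryTTS server
    curl http://localhost:59125/voices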

TTS Synthesis Process

Text

Hello world

Allophones

MaryXML

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
  <p>
    <s>
      <phrase>
        <t accent="L+H*" g2p_method="lexicon" ph="h @ - ' l @U" pos="UH">
          Hello
          <syllable ph="h @">
            <ph p="h"/>
            <ph p="@"/>
          </syllable>
          <syllable accent="L+H*" ph="l @U" stress="1">
            <ph p="l"/>
            <ph p="@U"/>
          </syllable>
        </t>
        <t accent="!H*" g2p_method="lexicon" ph="' w r= l d" pos="NN">
          world
          <syllable accent="!H*" ph="w r= l d" stress="1">
            <ph p="w"/>
            <ph p="r="/>
            <ph p="l"/>
            <ph p="d"/>
          </syllable>
        </t>
        <boundary breakindex="5" tone="L-L%"/>
      </phrase>
    </s>
  </p>
</maryxml>
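
MaryXML like the document above does not have to be written by hand; the running server can emit its intermediate representations directly. A hedged example, assuming the same /process parameters as before (an OUTPUT_TYPE such as ALLOPHONES or INTONATION yields MaryXML of this kind):

    # request the intermediate MaryXML instead of audio
    curl -G http://localhost:59125/process \
         --data-urlencode "INPUT_TEXT=Hello world" \
         --data-urlencode "INPUT_TYPE=TEXT" \
         --data-urlencode "OUTPUT_TYPE=INTONATION" \
         --data-urlencode "LOCALE=en_US"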

Target Features

phone  pos_in_syl  accented  ph_cplace
h      0           0         g
@      1           0         0
l      0           1         a
@U     1           1         0
w      0           1         l
r=     1           1         0
l      2           1         a
d      3           1         a
_      0           0         0

(small selection)
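
The full feature vectors can be inspected in the same way; assuming the TARGETFEATURES output type of MaryTTS 5.x, the server prints one vector per target unit:

    # dump the target feature vectors for a test sentence
    curl -G http://localhost:59125/process \
         --data-urlencode "INPUT_TEXT=Hello world" \
         --data-urlencode "INPUT_TYPE=TEXT" \
         --data-urlencode "OUTPUT_TYPE=TARGETFEATURES" \
         --data-urlencode "LOCALE=en_US"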

Audio

The target feature vectors are used to generate or retrieve the audio.

Voicebuilding

Synthesis “backwards”

Compute feature vectors from text, then assign them to the provided data.

Example

Prepare voicebuilding project

  1. Install software dependencies (as needed)
  2. Clone voicebuilding project

    git clone https://github.com/psibre/voice-cmu-slt --recursive
    cd voice-cmu-slt

Build with Gradle

./gradlew legacyInit
./gradlew assemble

Run the voice

  • Run an ad-hoc MaryTTS server with this voice

    ./gradlew run

or

  1. Copy the distributions to the MaryTTS directory’s installer cache (under download)
  2. Install with the component installer
  3. Start the MaryTTS server

then

Surf to http://localhost:59125
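
The new voice can also be addressed explicitly via the VOICE parameter of /process; the voice name cmu-slt below is an assumption derived from the project name:

    # synthesize with the freshly built voice (voice name assumed)
    curl -G http://localhost:59125/process \
         --data-urlencode "INPUT_TEXT=The quick brown fox jumps over the lazy dog." \
         --data-urlencode "INPUT_TYPE=TEXT" \
         --data-urlencode "OUTPUT_TYPE=AUDIO" \
         --data-urlencode "AUDIO=WAVE_FILE" \
         --data-urlencode "LOCALE=en_US" \
         --data-urlencode "VOICE=cmu-slt" \
         -o test.wav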

Details

Legacy Init

  1. Retrieve CMU SLT Arctic data dependency
  2. Unpack and convert audio files
  3. Unpack and convert label files
  4. Extract text prompts and generate utterance list

Task Execution Graph

Pitchmarking

Using Praat

./gradlew legacyPraatPitchmarker

input: wav/*.wav
output: pm/*.pm

MCEP extraction

Using ch_track from the Edinburgh Speech Tools (EST)

./gradlew legacyMCEPMaker

input: wav/*.wav
output: mcep/*.mcep

G2P and labeling

Predict phone sequence from text using MaryTTS

./gradlew generateAllophones

input: text/*.txt
output: prompt_allophones/*.xml

Check phonetic alignment

./gradlew legacyTranscriptionAligner

input: prompt_allophones/*.xml, lab/*.lab
output: allophones/*.xml

Unit features

Compute and assign a feature vector to each unit using MaryTTS

./gradlew legacyPhoneUnitFeatureComputer legacyHalfPhoneUnitFeatureComputer

input: allophones/*.xml, mary/features.txt
output: phonefeatures/*.pfeats, halfphonefeatures/*.hpfeats

Data files

Compile “timeline” files for audio, utterances, and acoustic features

./gradlew legacyWaveTimelineMaker legacyBasenameTimelineMaker legacyMCepTimelineMaker

input: wav/*.wav, pm/*.pm, mcep/*.mcep
output: mary/timeline_waveforms.mry, mary/timeline_basenames.mry, mary/timeline_mcep.mry

These contain the actual data from the wav and mcep files, in pitch-synchronous “datagram” packets.

Acoustic models

Phone-level and halfphone-level unit and feature files

./gradlew legacyPhoneUnitfileWriter legacyHalfPhoneUnitfileWriter legacyPhoneFeatureFileWriter legacyHalfPhoneFeatureFileWriter

input: pm/*.pm, phonelab/*.lab, phonefeatures/*.pfeats, halfphonelab/*.hplab
output: mary/phoneUnits.mry, mary/halfphoneUnits.mry, mary/phoneFeatures.mry, mary/phoneUnitFeatureDefinition.txt, mary/halfphoneFeatures.mry, mary/halfphoneUnitFeatureDefinition.txt

CARTs for Duration and F0

Using wagon from EST

./gradlew legacyDurationCARTTrainer legacyF0CARTTrainer

input: mary/phoneUnits.mry, mary/phoneFeatures.mry, mary/timeline_waveforms.mry
output: mary/dur.tree, mary/f0.left.tree, mary/f0.mid.tree, mary/f0.right.tree

Distributable voice package

Ready for deployment in MaryTTS installation

./gradlew assemble

input: mary/cart.mry, featureSequence.txt, mary/dur.tree, mary/f0.left.tree, mary/f0.mid.tree, mary/f0.right.tree, mary/halfphoneFeatures_ac.mry, mary/joinCostFeatures.mry, mary/joinCostWeights.txt, mary/halfphoneUnits.mry, mary/timeline_basenames.mry, mary/timeline_waveforms.mry
output: my_voice.zip, my_voice-component.xml

Your Turn

Prompt list creation

  1. Prerequisites: TeX Live, SoX (compiled with MP3 support)
  2. Surf to https://github.com/psibre/arctic-prompts
  3. Download or clone it
  4. Typeset with Gradle (see the sketch after this list)
  5. Postrequisites: Adobe Reader, Adobe Flash
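
A minimal sketch of step 4, assuming you are inside the cloned arctic-prompts directory; the exact task name is not specified here, so check the task list first:

    # list the Gradle tasks that the arctic-prompts project offers
    ./gradlew tasks

    # then run the task that typesets the prompts (task name assumed; use the list above)
    ./gradlew build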

Speech recording

  • Each group plans and carries out recordings for ~1 h of speech data
  • Use a phonetically balanced prompt set, e.g., TIMIT or ARCTIC
  • Use collaborative versioning tools to share this data within the team, e.g., Dropbox, git-annex, etc. (see the sketch below)
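
A minimal sketch for preparing and sharing the recordings with git-annex; the 16 kHz mono target format and the directory layout are assumptions, not course requirements:

    # downsample a studio recording to 16 kHz mono WAV (target format assumed)
    sox session1/arctic_a0001_raw.wav -r 16000 -c 1 wav/arctic_a0001.wav

    # track the large audio files with git-annex instead of plain git
    git init && git annex init "studio workstation"
    git annex add wav/
    git commit -m "Add first recording session"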

Phonetic segmentation

Use forced alignment for automatic segmentation

  • MAUS,
  • EHMM,
  • HTK,
  • CMU Sphinx,
  • Julius,
  • Kaldi,

and let’s not forget: manual labor!

Good luck!