Voicebuilding for Text-to-Speech Synthesis

Ingmar Steiner

2018-07-30 – 2018-08-10

Preliminaries

Course Overview

Split into groups (3–4 people each); ideally, each group should include
- one native English speaker,
- one programmer/hacker, and
- one phonetician
Prepare prompt list
Record speech corpus in studio
Process recordings (including automatic segmentation)
Build TTS voice(s)
Use MaryTTS (invented here)

Formalities

This course is a seminar or project seminar, depending on your course of study.
Successful completion of the lecture “Text-to-Speech Synthesis” (Prof. Möbius) is a mandatory prerequisite.
To pass this course, you will need to build TTS voices and submit them, along with a written report (5–10 pp.). The report must explain the entire end-to-end process, including any problems encountered and how they were resolved. This report is due six weeks after the end of the seminar (i.e., 2018-09-21).
Mailing list for questions, discussion: voicebuildingsem@ml.coli.uni-saarland.de
Register in HISPOS by 2018-08-03.

MaryTTS

MaryTTS Overview

Open-source, multilingual TTS platform in Java
http://mary.dfki.de/
Development hosted at https://github.com/marytts/marytts

Installer

TTS Synthesis Process

Text

Hello world

Allophones

MaryXML

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
  <p>
    <s>
      <phrase>
        <t accent="L+H*" g2p_method="lexicon" ph="h @ - ' l @U" pos="UH">
          Hello
          <syllable ph="h @">
            <ph p="h"/>
            <ph p="@"/>
          </syllable>
          <syllable accent="L+H*" ph="l @U" stress="1">
            <ph p="l"/>
            <ph p="@U"/>
          </syllable>
        </t>
        <t accent="!H*" g2p_method="lexicon" ph="' w r= l d" pos="NN">
          world
          <syllable accent="!H*" ph="w r= l d" stress="1">
            <ph p="w"/>
            <ph p="r="/>
            <ph p="l"/>
            <ph p="d"/>
          </syllable>
        </t>
        <boundary breakindex="5" tone="L-L%"/>
      </phrase>
    </s>
  </p>
</maryxml>
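Because the allophone document is plain XML, the predicted phone sequence can be read back with any XML library. The following is a minimal sketch in Python (standard library only), assuming a MaryXML string shaped like the one above. The function name `phones_per_token` and the trimmed sample are illustrative and not part of MaryTTS.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the <maryxml> root element.
NS = {"m": "http://mary.dfki.de/2002/MaryXML"}

# Trimmed copy of the example document (some attributes omitted for brevity).
SAMPLE = """
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" version="0.5" xml:lang="en-US">
  <p><s><phrase>
    <t ph="h @ - ' l @U" pos="UH">
      Hello
      <syllable ph="h @"><ph p="h"/><ph p="@"/></syllable>
      <syllable ph="l @U" stress="1"><ph p="l"/><ph p="@U"/></syllable>
    </t>
    <t ph="' w r= l d" pos="NN">
      world
      <syllable ph="w r= l d" stress="1">
        <ph p="w"/><ph p="r="/><ph p="l"/><ph p="d"/>
      </syllable>
    </t>
    <boundary breakindex="5" tone="L-L%"/>
  </phrase></s></p>
</maryxml>
"""


def phones_per_token(maryxml_text):
    """Return a list of (word, [phones]) pairs, one per <t> element."""
    root = ET.fromstring(maryxml_text)
    result = []
    for token in root.iterfind(".//m:t", NS):
        # The word is the text that precedes the first <syllable> child.
        word = (token.text or "").strip()
        phones = [ph.get("p") for ph in token.iterfind("m:syllable/m:ph", NS)]
        result.append((word, phones))
    return result


if __name__ == "__main__":
    for word, phones in phones_per_token(SAMPLE):
        print(word, " ".join(phones))
```

The same approach extends to other attributes, such as `stress` on `<syllable>` or `tone` on `<boundary>`.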

Target Features

phone	pos_in_syl	accented	ph_cplace
h	0	0	g
@	1	0	0
l	0	1	a
@U	1	1	0
w	0	1	l
r=	1	1	0
l	2	1	a
d	3	1	a
_	0	0	0

(small selection)
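The table above is tab-separated, so it can be loaded as one record per phone. A small Python sketch, assuming only the four columns shown here (the real feature set is much larger); the helper names are hypothetical:

```python
import csv
import io

# Copied from the table above: header row first, tab-separated.
TARGET_FEATURES = (
    "phone\tpos_in_syl\taccented\tph_cplace\n"
    "h\t0\t0\tg\n"
    "@\t1\t0\t0\n"
    "l\t0\t1\ta\n"
    "@U\t1\t1\t0\n"
    "w\t0\t1\tl\n"
    "r=\t1\t1\t0\n"
    "l\t2\t1\ta\n"
    "d\t3\t1\ta\n"
    "_\t0\t0\t0\n"
)


def read_target_features(text):
    """Parse the tab-separated table into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def accented_phones(rows):
    """Return the phones whose 'accented' feature is set."""
    return [row["phone"] for row in rows if row["accented"] == "1"]


if __name__ == "__main__":
    rows = read_target_features(TARGET_FEATURES)
    print(accented_phones(rows))
```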

Audio

Target feature vectors are used to generate or retrieve audio.

Voicebuilding

Synthesis “backwards”

Compute feature vectors from text, then assign them to the provided data.

Example

CMU Arctic database
Public domain speech corpora designed for synthesis
Data freely available at http://festvox.org/cmu_arctic/

Prepare voicebuilding project

Install software dependencies (as needed)
- Git
- Praat
- SoX
- Edinburgh Speech Tools
Clone example voicebuilding project: https://github.com/marytts/voice-cmu-slt

Build with Gradle

./gradlew legacyInit
./gradlew build

Run the voice

Run an ad-hoc MaryTTS server with this voice: ./gradlew run
Navigate to http://localhost:59125
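Besides the web page, the running server can be queried over HTTP. The sketch below assumes the `/process` endpoint and the parameter names of the MaryTTS 5.x HTTP interface, so check them against your server version. The helper names and the output file name are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_process_url(text, host="localhost", port=59125, locale="en_US"):
    """Build a synthesis request URL (parameter names assumed from MaryTTS 5.x)."""
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": locale,
    }
    return f"http://{host}:{port}/process?{urlencode(params)}"


def synthesize(text, out_path="hello.wav"):
    """Fetch synthesized audio; requires the server started by ./gradlew run."""
    with urlopen(build_process_url(text)) as response, open(out_path, "wb") as out:
        out.write(response.read())


if __name__ == "__main__":
    # Only prints the URL; call synthesize() once the server is up.
    print(build_process_url("Hello world"))
```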

Details

Legacy Init

Retrieve CMU SLT Arctic data dependency
Unpack and convert audio files
Unpack and convert label files
Extract text prompts and generate utterance list

Task Execution Graph

Pitchmarking

Using Praat

./gradlew praatPitchmarker

input: wav/*.wav
output: pm/*.pm

MCEP extraction

Using ch_track from EST

./gradlew mcepExtractor

input: wav/*.wav
output: mcep/*.mcep

G2P and labeling

Predict phone sequence from text using MaryTTS

./gradlew generateAllophones

input: text/*.txt
output: prompt_allophones/*.xml

Check phonetic alignment

./gradlew legacyTranscriptionAligner

input: prompt_allophones/*.xml, lab/*.lab
output: allophones/*.xml

Unit features

Compute and assign feature vector to each unit using MaryTTS

./gradlew legacyPhoneUnitFeatureComputer legacyHalfPhoneUnitFeatureComputer

input: allophones/*.xml, mary/features.txt
output: phonefeatures/*.pfeats, halfphonefeatures/*.hpfeats

Data files

Compile “timeline” files for audio, utterances, and acoustic features

./gradlew legacyWaveTimelineMaker legacyBasenameTimelineMaker legacyMCepTimelineMaker

input: wav/*.wav, pm/*.pm, mcep/*.mcep
output: mary/timeline_waveforms.mry, mary/timeline_basenames.mry, mary/timeline_mcep.mry

These contain the actual data from the wav and mcep files, in pitch-synchronous “datagram” packets.
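To make the idea of pitch-synchronous packets concrete, here is a conceptual Python sketch that cuts a sample sequence at pitchmark times. It assumes each pitchmark closes one packet and it does not reproduce the binary `.mry` timeline format; the function name is hypothetical.

```python
def pitch_synchronous_datagrams(samples, pitchmarks, sample_rate):
    """Split samples into packets, each ending at one pitchmark (times in seconds)."""
    # Convert pitchmark times to sample indices; the first packet starts at 0.
    boundaries = [0] + [round(time * sample_rate) for time in pitchmarks]
    return [samples[start:end] for start, end in zip(boundaries, boundaries[1:])]


if __name__ == "__main__":
    # Toy signal: 16 samples at 16 Hz, with pitchmarks at 0.25 s, 0.5 s and 1.0 s.
    packets = pitch_synchronous_datagrams(list(range(16)), [0.25, 0.5, 1.0], 16)
    for packet in packets:
        print(len(packet), packet)
```

Packets vary in length because pitch periods do, which is why a timeline stores them as individual datagrams rather than fixed-size frames.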

Acoustic models

Phone-level and halfphone-level unit and feature files

./gradlew legacyPhoneUnitfileWriter legacyHalfPhoneUnitfileWriter legacyPhoneFeatureFileWriter legacyHalfPhoneFeatureFileWriter

input: pm/*.pm, phonelab/*.lab, phonefeatures/*.pfeats, halfphonelab/*.hplab
output: mary/phoneUnits.mry, mary/halfphoneUnits.mry, mary/phoneFeatures.mry, mary/phoneUnitFeatureDefinition.txt, mary/halfphoneFeatures.mry, mary/halfphoneUnitFeatureDefinition.txt

CARTs for Duration and F0

Using wagon from EST

./gradlew legacyDurationCARTTrainer legacyF0CARTTrainer

input: mary/phoneUnits.mry, mary/phoneFeatures.mry, mary/timeline_waveforms.mry
output: mary/dur.tree, mary/f0.left.tree, mary/f0.mid.tree, mary/f0.right.tree

Distributable voice package

Ready for deployment in MaryTTS installation

./gradlew assemble

input: mary/cart.mry, featureSequence.txt, mary/dur.tree, mary/f0.left.tree, mary/f0.mid.tree, mary/f0.right.tree, mary/halfphoneFeatures_ac.mry, mary/joinCostFeatures.mry, mary/joinCostWeights.txt, mary/halfphoneUnits.mry, mary/timeline_basenames.mry, mary/timeline_waveforms.mry
output: my_voice.zip, my_voice-component.xml

Legacy VoiceImportTools

Pros

Well-documented (see Wiki)
Compatible with stable MaryTTS version
HMM voicebuilding

Cons

Various pitfalls
No caching
No task dependency management

Gradle Voicebuilding Plugin

Pros

Agile
Task dependency management
Caching
Parallel processing

Cons

Experimental
Barely documented

Build automation

Build tool criteria

Fast & efficient
Easy to use
Cross-platform
Minimal requirements
Flexible & extensible
Automate build/test/release lifecycle

Gradle

First released in 2007
Runs on Java
Groovy-based build scripts with custom DSL
Builds Java, C/C++, Android, anything else
Extensible via plugins

Simple Gradle build script

build.gradle

task foo {
  doLast {
    println "Doing stuff."
  }
}

task bar {
  dependsOn foo
  doLast {
    println "Doing more stuff."
  }
}
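Running `gradle bar` with this script executes `foo` first, because `bar` depends on it. The following Python sketch illustrates that ordering with a plain depth-first walk. It is a stand-in for the concept, not Gradle's actual scheduler, and it does not detect dependency cycles.

```python
def execution_order(task, depends_on, done=None):
    """Return the tasks to run, dependencies first, for the requested task."""
    done = [] if done is None else done
    for dependency in depends_on.get(task, []):
        execution_order(dependency, depends_on, done)
    if task not in done:
        done.append(task)
    return done


if __name__ == "__main__":
    # Mirrors the build script above: bar dependsOn foo.
    graph = {"foo": [], "bar": ["foo"]}
    print(execution_order("bar", graph))
```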

Custom task class

buildSrc/src/main/groovy/DoStuff.groovy

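import org.gradle.api.DefaultTask
import org.gradle.api.file.RegularFileProperty
import org.gradle.api.tasks.*

class DoStuff extends DefaultTask {

    // newInputFile()/newOutputFile() are the Gradle 4.x factory methods;
    // Gradle 5+ uses project.objects.fileProperty() instead
    @InputFile
    final RegularFileProperty inputFile = newInputFile()

    @OutputFile
    final RegularFileProperty outputFile = newOutputFile()

    @TaskAction
    void doStuff() {
        // open output file for writing
        outputFile.get().asFile.withWriter { writer ->
            // iterate over lines in input file
            inputFile.get().asFile.eachLine { line ->
                // write line contents to output, twice (String * 2 repeats the string)
                writer.println line * 2
            }
        }
    }
}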

Lazy task input/output

src/foo.txt

foo
bar
baz

build.gradle

task foo(type: DoStuff) {
    inputFile = file('src/foo.txt')
    outputFile = layout.buildDirectory.file('bar.txt')
}

task bar(type: DoStuff) {
    inputFile = foo.outputFile
    outputFile = layout.buildDirectory.file('baz.txt')
}
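To see what the two chained tasks produce, the logic of `DoStuff` can be replayed outside Gradle. This Python sketch re-implements only the line-doubling step and feeds the output of the first run into the second, as `bar` does with `foo.outputFile`. The helper names and the working directory are hypothetical.

```python
from pathlib import Path


def do_stuff(input_file, output_file):
    """Write every input line to the output file, repeated twice on the same line."""
    lines = Path(input_file).read_text().splitlines()
    Path(output_file).write_text("".join(line * 2 + "\n" for line in lines))


def run_chain(workdir):
    """Emulate task foo (foo.txt -> bar.txt) followed by task bar (bar.txt -> baz.txt)."""
    workdir = Path(workdir)
    source = workdir / "foo.txt"
    source.write_text("foo\nbar\nbaz\n")
    do_stuff(source, workdir / "bar.txt")
    do_stuff(workdir / "bar.txt", workdir / "baz.txt")
    return (workdir / "baz.txt").read_text().splitlines()


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as directory:
        print(run_chain(directory))
```

In Gradle, wiring `inputFile = foo.outputFile` additionally makes `bar` depend on `foo` without an explicit `dependsOn`; the sketch simply calls the steps in order.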

Custom plugin

buildSrc/src/main/groovy/DoPlugin.groovy

import org.gradle.api.Plugin
import org.gradle.api.Project

class DoPlugin implements Plugin<Project> {

    @Override
    void apply(Project project) {
        // inputFile of 'foo' is left unset here; the applying build script provides it
        project.task('foo', type: DoStuff) {
            outputFile = project.layout.buildDirectory.file('bar.txt')
        }

        project.task('bar', type: DoStuff) {
            inputFile = project.foo.outputFile
            outputFile = project.layout.buildDirectory.file('baz.txt')
        }
    }
}

build.gradle

apply plugin: DoPlugin

foo.inputFile = file('src/foo.txt')

Further reading

Writing Custom Plugins

Dependencies

build.gradle

repositories {
    ivy {
        url 'https://catalog.ldc.upenn.edu/docs'
        layout 'pattern', {
            artifact 'LDC93S1/[module].[ext]'
        }
    }
}

configurations {
    data
}

dependencies {
    data group: 'edu.upenn.ldc.timit', name: 'PROMPTS', version: '1988-10-31', ext: 'TXT'
}

tasks.register 'processPrompts', ProcessPrompts, {
    srcFiles = files(configurations.data)
    destDir = layout.buildDirectory.dir('text')
}
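The Ivy pattern layout determines which URL the `data` dependency is fetched from: the `[module]` and `[ext]` tokens are filled from the `name` and `ext` of the dependency. A rough Python sketch of that substitution follows; the function is hypothetical and ignores tokens the pattern above does not use.

```python
def resolve_ivy_pattern(base_url, pattern, **tokens):
    """Fill [token] placeholders in an artifact pattern and join it to the base URL."""
    path = pattern
    for name, value in tokens.items():
        path = path.replace(f"[{name}]", value)
    return f"{base_url.rstrip('/')}/{path}"


if __name__ == "__main__":
    # Values taken from the repositories and dependencies blocks above.
    print(
        resolve_ivy_pattern(
            "https://catalog.ldc.upenn.edu/docs",
            "LDC93S1/[module].[ext]",
            module="PROMPTS",
            ext="TXT",
        )
    )
```

Because the pattern contains neither a group nor a version token, the `group` and `version` declared on the dependency do not affect the resolved location.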

Further reading

Declaring Dependencies

Custom dependency processing class

buildSrc/src/main/groovy/ProcessPrompts.groovy

import org.gradle.api.DefaultTask
import org.gradle.api.file.*
import org.gradle.api.tasks.*

class ProcessPrompts extends DefaultTask {

    @InputFiles
    FileCollection srcFiles = project.files()

    @OutputDirectory
    final DirectoryProperty destDir = newOutputDirectory()

    @TaskAction
    void process() {
        project.copy {
            from srcFiles
            into destDir.get().asFile
            eachFile { fileDetails ->
                fileDetails.file.eachLine { line ->
                    if (line.startsWith(';'))
                        return
                    (line =~ /(.+) \((.+)\)/).each { all, prompt, code ->
                        destDir.file("${code}.txt").get().asFile.withWriter { writer ->
                            writer.println prompt
                        }
                    }
                }
                fileDetails.exclude()
            }
        }
    }
}
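The core of this task is the line filter and the regular expression. For experimenting with them outside Gradle, here is an equivalent sketch in Python. It assumes prompt lines of the form `text (code)` and comment lines starting with `;`, as the Groovy code does. The function name is hypothetical.

```python
import re

# Same pattern as in the Groovy task: prompt text, a space, then the code in parentheses.
PROMPT_PATTERN = re.compile(r"(.+) \((.+)\)")


def parse_prompts(lines):
    """Map each prompt code to its text, skipping comment lines."""
    prompts = {}
    for line in lines:
        if line.startswith(";"):
            continue
        match = PROMPT_PATTERN.search(line)
        if match:
            text, code = match.groups()
            prompts[code] = text
    return prompts


if __name__ == "__main__":
    sample = [
        "; this line is a comment",
        "She had your dark suit in greasy wash water all year. (sa1)",
    ]
    for code, text in parse_prompts(sample).items():
        print(f"{code}.txt <- {text}")
```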

Your Turn

Groups

Each group will need:

One designated native speaker
Praat scripting skills
Software engineering (Java) skills
Git skills

Prompt list

Phonetically balanced, e.g.,

TIMIT
CMU Arctic
BITS (German)

Slideshow preparation

Convert prompt list to Markdown or LaTeX source
Compile to either
- HTML5 (with reveal.js or similar) or
- PDF (with beamer)
using Pandoc or similar
Generate beep and embed it via JavaScript or media9
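The beep itself can be generated with a few lines of code. This is a minimal Python sketch (standard library only) that writes a mono 16-bit sine tone. The 48 kHz rate matches the recording clock mentioned under "Recording procedure", while the 1 kHz frequency, 0.2 s duration, amplitude, and file name are assumptions to adjust.

```python
import math
import struct
import wave


def write_beep(path, frequency=1000.0, duration=0.2, sample_rate=48000, amplitude=0.5):
    """Write a mono 16-bit PCM sine tone to path and return the number of frames."""
    frame_count = round(duration * sample_rate)
    frames = bytearray()
    for index in range(frame_count):
        value = amplitude * math.sin(2 * math.pi * frequency * index / sample_rate)
        # Scale to the signed 16-bit range and pack as little-endian.
        frames += struct.pack("<h", int(value * 32767))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return frame_count


if __name__ == "__main__":
    print(write_beep("beep.wav"), "frames written")
```

A short, pure tone is easy to detect later in the dedicated beep channel, which is the reason for keeping it free of fades or melodies.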

Example: TIMIT

See https://github.com/psibre/timit-prompts

Upgrade to use Gradle Pandoc reveal.js plugin
Resolve TIMIT text prompts as data dependency

Recording procedure

Presentation laptop with HDMI output

Video: Displayed in studio booth (prompt text)
Audio: Recording clock source (48 kHz)

DAW (Cubase/ProTools) records multiple channels:

Close-talking mic
Mounted mic
Beeps from presentation (acoustic prompt/slide transition markers)
MIDI track with manual markers (optional)

Data processing

Data formats

Store audio as FLAC
Serialize structured annotations to YAML or JSON
Easy to manipulate with many programming languages!
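As an illustration of such structured annotations, the sketch below serializes per-utterance records to JSON in Python. The field names (`prompt`, `text`, `start`, `end`) and the example times are hypothetical; choose whatever schema suits your group and document it in the report.

```python
import json


def utterance_record(prompt_id, text, start, end):
    """Bundle one utterance's metadata; times are in seconds."""
    return {"prompt": prompt_id, "text": text, "start": start, "end": end}


def dump_annotations(records):
    """Serialize the records as human-readable JSON."""
    return json.dumps(records, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    records = [
        utterance_record(
            "sa1", "She had your dark suit in greasy wash water all year.", 1.25, 4.5
        )
    ]
    print(dump_annotations(records))
```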

Integrate Gradle FLAML Plugin

Audio post-processing

Normalization
Noise reduction (optional)
Detect event times in “beep” channel
Assign utterance start/end times
Split speech channels into utterances
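The last three steps can be sketched as follows in plain Python. The sketch assumes the beep channel is silent except for the beeps, that samples are floats in the range -1 to 1, and that each utterance runs from its beep to the next one. The threshold, minimum gap, and function names are all hypothetical and need tuning on real recordings.

```python
def detect_onsets(beep_channel, sample_rate, threshold=0.1, min_gap=0.5):
    """Return the start time (s) of each beep in an otherwise quiet channel."""
    onsets = []
    last_loud = None
    for index, sample in enumerate(beep_channel):
        if abs(sample) >= threshold:
            time = index / sample_rate
            # A loud sample after a long enough pause starts a new beep.
            if last_loud is None or time - last_loud >= min_gap:
                onsets.append(time)
            last_loud = time
    return onsets


def utterance_spans(onsets, total_duration):
    """Pair each onset with the next one; the last span ends with the recording."""
    ends = onsets[1:] + [total_duration]
    return list(zip(onsets, ends))


def split_channel(speech_channel, spans, sample_rate):
    """Cut the speech channel into one sample list per (start, end) span."""
    return [
        speech_channel[round(start * sample_rate):round(end * sample_rate)]
        for start, end in spans
    ]


if __name__ == "__main__":
    rate = 100  # toy sample rate to keep the example small
    beeps = [0.0] * 500
    for begin in (50, 250):
        beeps[begin:begin + 10] = [1.0] * 10
    spans = utterance_spans(detect_onsets(beeps, rate), len(beeps) / rate)
    print(spans)
    print([len(chunk) for chunk in split_channel(list(range(500)), spans, rate)])
```

Real recordings usually need some padding or silence trimming around these spans, which is omitted here.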

Phonetic segmentation

Forced alignment with one of

Integrate Gradle MaryTTS Kaldi MFA plugin

Don’t forget to analyze and check for errors!
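One simple, automatable check is to flag segments with implausible durations, which often point to alignment errors. This is a rough Python sketch, assuming segments are available as `(start, end, label)` tuples in seconds. The duration limits are arbitrary placeholders rather than recommended values.

```python
def suspicious_segments(segments, min_duration=0.02, max_duration=0.5):
    """Return (label, duration) for segments outside the expected duration range."""
    flagged = []
    for start, end, label in segments:
        duration = end - start
        if duration < min_duration or duration > max_duration:
            flagged.append((label, round(duration, 3)))
    return flagged


if __name__ == "__main__":
    # Toy alignment: the second segment is too short, the third too long.
    example = [(0.0, 0.08, "h"), (0.08, 0.085, "@"), (0.085, 0.9, "l")]
    for label, duration in suspicious_segments(example):
        print(f"check segment {label!r}: {duration} s")
```

Flagged segments still need to be inspected by ear and eye, for example in Praat.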

Project SCM and data distribution

Use Git.

But don’t store big binary files (such as audio) in Git!

Use solutions such as

Git LFS
git-annex
Cloud storage via
- Dropbox
- GitHub release assets
- Bitbucket downloads
- etc.

Your mission

Declare, resolve, and process your recordings as a data dependency
Use SCM to version your build logic code and metadata (labels)
Apply and use Gradle voicebuilding plugins
Document all of the above in your group report
Provide access to your data and SCM repository

Build environments

OSX (via Homebrew)
Linux (via APT)
VirtualBox (running Linux guest)
Docker (running Linux container)

Have fun!