Voicebuilding for Text-to-Speech Synthesis

Ingmar Steiner

26.02.–09.03.2018

Preliminaries

Course Overview

  • Split into groups (3–4 people each); ideally, each group should include
    • one native English speaker, and
    • one programmer/hacker, and
    • one phonetician
  • Prepare prompt list
  • Record speech corpus in studio
  • Process recordings (including automatic segmentation)
  • Build TTS voice(s)
  • Use MaryTTS (invented here)

Formalities

  • This course is a seminar or project seminar, depending on your course of study.
  • Successful completion of the lecture “Text-to-Speech Synthesis” (Prof. Möbius) is a mandatory prerequisite.
  • To pass this course, you must build TTS voices and submit them together with a written report (5–10 pp.). The report must explain the entire end-to-end process, including any problems encountered and how you resolved them. It is due six weeks after the end of the seminar (i.e., 20.4.2018).
  • Mailing list for questions, discussion: voicebuildingsem@ml.coli.uni-saarland.de
  • Register in HISPOS by 26.2.2018.

MaryTTS

MaryTTS Overview

Installer

TTS Synthesis Process

Text

Hello world

Allophones

MaryXML

Target Features

phone  pos_in_syl  accented  ph_cplace
h      0           0         g
@      1           0         0
l      0           1         a
@U     1           1         0
w      0           1         l
r=     1           1         0
l      2           1         a
d      3           1         a
_      0           0         0

(small selection)

Audio

Target feature vectors are used to generate/retrieve audio.

Voicebuilding

Synthesis “backwards”

Compute feature vectors from text, then assign them to the provided data.

Example

Prepare voicebuilding project

  1. Install software dependencies (as needed)
  2. Clone example voicebuilding project: https://github.com/marytts/voice-cmu-slt

Build with Gradle

./gradlew legacyInit
./gradlew build

Run the voice

Details

Legacy Init

  1. Retrieve CMU SLT Arctic data dependency
  2. Unpack and convert audio files
  3. Unpack and convert label files
  4. Extract text prompts and generate utterance list

Task Execution Graph

Pitchmarking

Using Praat

./gradlew praatPitchmarker
input: wav/*.wav
output: pm/*.pm

MCEP extraction

Using ch_track from EST

./gradlew mcepExtractor
input: wav/*.wav
output: mcep/*.mcep
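
Pitchmarking and MCEP extraction both map wav/*.wav to one output file per utterance, so a quick consistency check can reveal utterances for which a step produced nothing. The following is a minimal sketch, assuming that each output file shares the basename of its input (the slides only show the glob patterns); the function name check_outputs and the demo file names are made up for illustration.

```sh
#!/bin/sh
# Report every wav/NAME.wav that lacks a non-empty pm/NAME.pm or mcep/NAME.mcep.
# Assumption: outputs reuse the basename of the input file.
check_outputs() {
    root=$1
    missing=0
    for wav in "$root"/wav/*.wav; do
        [ -e "$wav" ] || continue              # the glob matched nothing
        name=$(basename "$wav" .wav)
        for kind in pm mcep; do
            if [ ! -s "$root/$kind/$name.$kind" ]; then
                echo "missing: $kind/$name.$kind"
                missing=$((missing + 1))
            fi
        done
    done
    [ "$missing" -eq 0 ]                       # exit status: success only if complete
}

# Demo on a throwaway directory: utt2 has a pitchmark file but no MCEP file.
demo=$(mktemp -d)
mkdir -p "$demo/wav" "$demo/pm" "$demo/mcep"
echo data > "$demo/wav/utt1.wav"
echo data > "$demo/wav/utt2.wav"
echo data > "$demo/pm/utt1.pm"
echo data > "$demo/pm/utt2.pm"
echo data > "$demo/mcep/utt1.mcep"

report=$(check_outputs "$demo" || true)
echo "$report"
```

In a real project, the function would be called on the project directory after running the two Gradle tasks.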

G2P and labeling

Predict phone sequence from text using MaryTTS

./gradlew generateAllophones
input: text/*.txt
output: prompt_allophones/*.xml

Check phonetic alignment

./gradlew legacyTranscriptionAligner
input: prompt_allophones/*.xml, lab/*.lab
output: allophones/*.xml

Unit features

Compute and assign feature vector to each unit using MaryTTS

./gradlew legacyPhoneUnitFeatureComputer legacyHalfPhoneUnitFeatureComputer
input: allophones/*.xml, mary/features.txt
output: phonefeatures/*.pfeats, halfphonefeatures/*.hpfeats

Data files

Compile “timeline” files for audio, utterances, and acoustic features

./gradlew legacyWaveTimelineMaker legacyBasenameTimelineMaker legacyMCepTimelineMaker
input: wav/*.wav, pm/*.pm, mcep/*.mcep
output: mary/timeline_waveforms.mry, mary/timeline_basenames.mry, mary/timeline_mcep.mry

These contain the actual data from the wav and mcep files, in pitch-synchronous “datagram” packets.

Acoustic models

Phone-level and halfphone-level unit and feature files

./gradlew legacyPhoneUnitfileWriter legacyHalfPhoneUnitfileWriter legacyPhoneFeatureFileWriter legacyHalfPhoneFeatureFileWriter
input: pm/*.pm, phonelab/*.lab, phonefeatures/*.pfeats, halfphonelab/*.hplab
output: mary/phoneUnits.mry, mary/halfphoneUnits.mry, mary/phoneFeatures.mry, mary/phoneUnitFeatureDefinition.txt, mary/halfphoneFeatures.mry, mary/halfphoneUnitFeatureDefinition.txt

CARTs for Duration and F0

Using wagon from EST

./gradlew legacyDurationCARTTrainer legacyF0CARTTrainer
input: mary/phoneUnits.mry, mary/phoneFeatures.mry, mary/timeline_waveforms.mry
output: mary/dur.tree, mary/f0.left.tree, mary/f0.mid.tree, mary/f0.right.tree

Distributable voice package

Ready for deployment in MaryTTS installation

./gradlew assemble
input: mary/cart.mry, featureSequence.txt, mary/dur.tree, mary/f0.left.tree, mary/f0.mid.tree, mary/f0.right.tree, mary/halfphoneFeatures_ac.mry, mary/joinCostFeatures.mry, mary/joinCostWeights.txt, mary/halfphoneUnits.mry, mary/timeline_basenames.mry, mary/timeline_waveforms.mry
output: my_voice.zip, my_voice-component.xml

Legacy VoiceImportTools

Pros

  • Well-documented (see Wiki)
  • Compatible with stable MaryTTS version
  • HMM voicebuilding

Cons

  • Various pitfalls
  • No caching
  • No task dependency management

Gradle Voicebuilding Plugin

Pros

  • Agile
  • Task dependency management
  • Caching
  • Parallel processing

Cons

  • Experimental
  • Barely documented

Build automation

Build tool criteria

  • Fast & efficient
  • Easy to use
  • Cross-platform
  • Minimal requirements
  • Flexible & extensible
  • Automate build/test/release lifecycle

Gradle

  • First released in 2007
  • Runs on Java
  • Groovy-based build scripts with custom DSL
  • Builds Java, C/C++, Android, anything else
  • Extensible via plugins

Simple Gradle build script

build.gradle

task foo {
  doLast {
    println "Doing stuff."
  }
}

task bar {
  dependsOn foo
  doLast {
    println "Doing more stuff."
  }
}

Custom task class

buildSrc/src/main/groovy/DoStuff.groovy

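import org.gradle.api.DefaultTask
import org.gradle.api.file.RegularFileProperty
import org.gradle.api.tasks.InputFile
import org.gradle.api.tasks.OutputFile
import org.gradle.api.tasks.TaskAction

class DoStuff extends DefaultTask {

    // lazily evaluated input file
    @InputFile
    final RegularFileProperty inputFile = newInputFile()

    // lazily evaluated output file
    @OutputFile
    final RegularFileProperty outputFile = newOutputFile()

    @TaskAction
    void doStuff() {
        // open output file for writing
        outputFile.get().asFile.withWriter { writer ->
            // iterate over lines in input file
            inputFile.get().asFile.eachLine { line ->
                // write line contents to output, twice
                writer.println line * 2
            }
        }
    }
}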

Lazy task input/output

src/foo.txt

foo
bar
baz

build.gradle

task foo(type: DoStuff) {
    inputFile = file('src/foo.txt')
    outputFile = layout.buildDirectory.file('bar.txt')
}

task bar(type: DoStuff) {
    inputFile = foo.outputFile
    outputFile = layout.buildDirectory.file('baz.txt')
}
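
To make the data flow between the two tasks concrete, their effect on the files can be imitated in plain shell. This is only a sketch of the file transformation: it does not model Gradle's lazy wiring or up-to-date checks, and the helper name do_stuff and the temporary directory are illustrative.

```sh
#!/bin/sh
# Plain-shell imitation of the two chained DoStuff tasks (no Gradle involved).
work=$(mktemp -d)
mkdir -p "$work/src" "$work/build"
printf 'foo\nbar\nbaz\n' > "$work/src/foo.txt"

# Same effect as DoStuff.doStuff(): write every input line twice, back to back.
do_stuff() {
    awk '{ print $0 $0 }' "$1" > "$2"
}

do_stuff "$work/src/foo.txt"   "$work/build/bar.txt"   # task foo
do_stuff "$work/build/bar.txt" "$work/build/baz.txt"   # task bar reads foo's output

cat "$work/build/baz.txt"
```

Because each line is doubled twice, build/baz.txt ends up with every word repeated four times per line (foofoofoofoo, and so on).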

Custom plugin

buildSrc/src/main/groovy/DoPlugin.groovy

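import org.gradle.api.Plugin
import org.gradle.api.Project

class DoPlugin implements Plugin<Project> {

    @Override
    void apply(Project project) {
        // foo's inputFile is left unset here; the build script provides it
        project.task('foo', type: DoStuff) {
            outputFile = project.layout.buildDirectory.file('bar.txt')
        }

        // bar consumes whatever foo produces
        project.task('bar', type: DoStuff) {
            inputFile = project.foo.outputFile
            outputFile = project.layout.buildDirectory.file('baz.txt')
        }
    }
}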

build.gradle

apply plugin: DoPlugin

foo.inputFile = file('src/foo.txt')

Dependencies

build.gradle

repositories {
    ivy {
        url 'https://catalog.ldc.upenn.edu/docs'
        layout 'pattern', {
            artifact 'LDC93S1/[module].[ext]'
        }
    }
}

configurations {
    data
}

dependencies {
    data group: 'edu.upenn.ldc.timit', name: 'PROMPTS', version: '1988-10-31', ext: 'TXT'
}

task processPrompts(type: ProcessPrompts) {
    config = configurations.data
    destDir = layout.buildDirectory.dir('text')
}

Custom dependency processing class

buildSrc/src/main/groovy/ProcessPrompts.groovy

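import org.gradle.api.DefaultTask
import org.gradle.api.artifacts.Configuration
import org.gradle.api.file.DirectoryProperty
import org.gradle.api.provider.Property
import org.gradle.api.tasks.Input
import org.gradle.api.tasks.OutputDirectory
import org.gradle.api.tasks.TaskAction

class ProcessPrompts extends DefaultTask {

    @Input
    Property<Configuration> config = project.objects.property(Configuration)

    @OutputDirectory
    final DirectoryProperty destDir = newOutputDirectory()

    @TaskAction
    void process() {
        project.copy {
            from config.get()
            into destDir.get().asFile
            eachFile { fileDetails ->
                fileDetails.file.eachLine { line ->
                    // skip comment lines ("return" acts like "continue" inside the closure)
                    if (line.startsWith(';'))
                        return
                    // split "prompt text (code)" and write the prompt to <code>.txt
                    (line =~ /(.+) \((.+)\)/).each { all, prompt, code ->
                        destDir.file("${code}.txt").get().asFile.withWriter { writer ->
                            writer.println prompt
                        }
                    }
                }
                // do not copy the original prompt file itself
                fileDetails.exclude()
            }
        }
    }
}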
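
The per-line logic of this task can be tried out without Gradle. The sketch below imitates it in plain shell on a made-up sample file that follows the "prompt text (code)" layout the regular expression expects; the sample sentences, the codes utt1/utt2, and the temporary directory are illustrative and not taken from the real prompt file.

```sh
#!/bin/sh
# Plain-shell imitation of ProcessPrompts: one text file per prompt code.
work=$(mktemp -d)
mkdir -p "$work/text"

# Made-up sample in the "prompt (code)" layout.
cat > "$work/PROMPTS.TXT" <<'EOF'
; lines starting with a semicolon are comments
This is the first sample prompt. (utt1)
Here is a second (parenthesized) one. (utt2)
EOF

while IFS= read -r line; do
    case $line in
        ';'*) continue ;;                  # skip comment lines
        ?*' ('?*')')
            code=${line##*" ("}            # text after the last " ("
            code=${code%")"}               # drop the closing parenthesis
            prompt=${line%" ("*}           # text before the last " ("
            printf '%s\n' "$prompt" > "$work/text/$code.txt"
            ;;
    esac
done < "$work/PROMPTS.TXT"

ls "$work/text"
cat "$work/text/utt2.txt"
```

Like the greedy regular expression in the Groovy code, the parameter expansions split at the last " (" of the line, so parentheses inside the prompt text itself are preserved.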

Your Turn

Groups

Each group will need:

  • One designated native speaker
  • Praat scripting skills
  • Software engineering (Java) skills
  • Git skills

Prompt list

Phonetically balanced, e.g.,

Slideshow preparation

  1. Convert prompt list to Markdown or LaTeX source
  2. Compile to either HTML or PDF using Pandoc or similar
  3. Generate beep and embed it via JavaScript or media9
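
Step 1 above might look like the following sketch, which turns a prompt list into Markdown with one slide per prompt. The tab-separated input format, the file names, and the sample prompts are assumptions for illustration; embedding the beep (step 3) is not covered here.

```sh
#!/bin/sh
# Turn a prompt list into Markdown with one slide per prompt.
work=$(mktemp -d)

# Made-up prompt list: "code<TAB>prompt text" per line (format is an assumption).
printf 'utt1\tThis is the first sample prompt.\nutt2\tHere is a second one.\n' \
    > "$work/prompts.tsv"

# Each prompt becomes a level-1 heading (the prompt code) followed by its text;
# Pandoc's slide show writers start a new slide at such headings.
awk -F '\t' '{ printf "# %s\n\n%s\n\n", $1, $2 }' "$work/prompts.tsv" \
    > "$work/prompts.md"

cat "$work/prompts.md"

# A possible next step (not run here):
#   pandoc -s -t revealjs -o prompts.html prompts.md
```

Keeping the prompt code in the slide heading makes it easier to match each recorded utterance to its prompt later on.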

Recording procedure

Presentation laptop with HDMI output

Video: displayed in studio booth (prompt text)
Audio: recording clock source (48 kHz)

DAW (Cubase/ProTools) records multiple channels:

  1. Close-talking mic
  2. Mounted mic
  3. Beeps from presentation (acoustic prompt/slide transition markers)
  4. MIDI track with manual markers (optional)

Data processing

Data formats

  • Store audio as FLAC
  • Serialize structured annotations to YAML or JSON
  • Easy to manipulate with many programming languages!

Audio post-processing

  • Normalization
  • Noise reduction (optional)
  • Detect event times in “beep” channel
  • Assign utterance start/end times
  • Split speech channels into utterances
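
The beep detection and time assignment steps can be sketched as follows. The sketch assumes that the beep channel has already been exported as a text file of "time level" pairs (seconds and a normalized level) and that each beep marks a slide transition, so that one utterance lies between two consecutive beeps. The input format, the threshold of 0.5, and all file names are illustrative assumptions.

```sh
#!/bin/sh
# Detect beep onsets by thresholding and derive utterance intervals from them.
work=$(mktemp -d)

# Made-up level track of the beep channel: "time level" per line.
cat > "$work/beep_levels.txt" <<'EOF'
0.0 0.0
0.1 0.9
0.2 0.8
0.3 0.0
2.0 0.0
2.1 0.9
2.2 0.0
4.5 0.9
4.6 0.0
EOF

# A beep onset is the first frame at or above the threshold after a quiet frame.
# Utterance i is taken to span from onset i to onset i+1 (this still includes
# the beep itself; real processing would skip the beep duration).
awk -v thresh=0.5 '
    $2 >= thresh && !inbeep { onset[++n] = $1; inbeep = 1 }
    $2 <  thresh            { inbeep = 0 }
    END {
        for (i = 1; i < n; i++)
            printf "{\"utterance\": %d, \"start\": %s, \"end\": %s}\n", i, onset[i], onset[i + 1]
    }
' "$work/beep_levels.txt" > "$work/utterances.jsonl"

cat "$work/utterances.jsonl"
```

Writing one JSON object per line keeps the result easy to read from other tools when splitting the speech channels afterwards.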

Phonetic segmentation

Forced alignment with one of

Don’t forget to analyze and check for errors!

Project SCM and data distribution

Use Git.

But don’t store big binary files (such as audio) in Git!

Use solutions such as

Your mission

  1. Declare, resolve, and process your recordings as a data dependency
  2. Use SCM to version your build logic code and metadata (labels)
  3. Apply and use Gradle voicebuilding plugins
  4. Document all of the above in your group report
  5. Provide access to your data and SCM repository

Build environments

Have fun!