Text-to-speech Programs

I finally got around to shopping for a text-to-speech program for
dictation practice and audiobook making. Here's my experience, for
anyone interested:

First, those who've tried Windows XP's built-in TTS program ("Narrator")
might be pleased to hear that third-party products sound incomparably
better. I looked at at least a dozen commercial TTS developers, about
half of which offer desktop products. I narrowed those down by three
requirements:

1) High-quality sound
2) Output can be directed to a sound (mp3) file
3) Reading speed can be adjusted in WPM (many products only provide
presets like "fast", "medium" and "slow"). Note: I didn't find any that
could be adjusted in "1.4-syllables-per-minute"!

This left just two candidates: AT&T's "Natural Voices"
(www.naturalvoices.att.com), and Cepstral (www.cepstral.com).
The internet consensus seemed to be that AT&T's are far and away
the best sounding. I wondered whether that consensus is just due to the
product's high profile; their 22KHz voices are unarguably good, but in
some aspects, like phrasing and dynamic range, I thought IBM's newer
products, and Loquendo's, were better (alas, neither of those companies
offers an end-user desktop package).

Cepstral's various voices are definitely in the same class as AT&T's;
I think their phrasing is even better. Cepstral offers versions not
only for Windows, but also OSX, Linux and Solaris.

AT&T's 16KHz "Natural Voices" can be bought at www.nextuptech.com for
$35USD + $19.95USD for an mp3 file-creating extension. I'm not sure
where the 22KHz versions can be purchased. www.cepstral.com sells
their selection of voices (with file-output capability built-in) for
$29.99USD each. Both stores offer online demos, and Cepstral offers a
limited free-trial version.

I opted for Cepstral's 22KHz "Callie" voice for Linux, and I'm really
happy with it. The only reservations I'd note are about computer voices
in general, and for transcription practice, those aren't problems.
Pronunciation seems about 99% accurate; most of its errors are on
homographs (words spelled alike but pronounced differently, such as
present-/past-tense "read"), etc. These are occasionally distracting,
but almost negligible. Cepstral's WPM adjustment seems to stretch the
words and not the breaks between them, which sounds unnatural as speeds
drop below 80WPM; maybe that's standard behaviour for such programs?
In any case, it doesn't make the speech hard to understand, and there
are ways to change that behaviour. It's great to be able to take material I'm
actually interested in hearing, and to be able to bump the speed up at
my own pace.
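
One way the slow-speed behaviour could be changed is sketched below. It is
only an illustration, not necessarily what I or Cepstral actually do: it
assumes the engine honours the standard SSML <break> element between words,
and the function names (pause_per_word_ms, pad_words) are made up for this
example. The idea is to let the voice speak each word at a comfortable rate
and make up the difference with silence, so that the overall pace still
averages out to the target WPM.

```python
from xml.sax.saxutils import escape


def pause_per_word_ms(target_wpm, spoken_wpm):
    """Silence (in ms) to add after each word so that words spoken at
    spoken_wpm average out to target_wpm overall."""
    if target_wpm <= 0 or spoken_wpm <= 0:
        raise ValueError("rates must be positive")
    if target_wpm >= spoken_wpm:
        return 0  # target is not slower than the voice; no padding needed
    # Time budget per word at the target rate, minus time spent speaking it.
    seconds = 60.0 / target_wpm - 60.0 / spoken_wpm
    return round(seconds * 1000)


def pad_words(text, target_wpm, spoken_wpm):
    """Return text with an SSML <break> between words, sized as above."""
    words = escape(text).split()  # escape &, <, > before adding markup
    pause = pause_per_word_ms(target_wpm, spoken_wpm)
    if pause == 0:
        return " ".join(words)
    separator = ' <break time="%dms"/> ' % pause
    return separator.join(words)
```

For example, pad_words("take this down", 60, 120) puts a 500ms break
between the words: each word gets one second at 60WPM, and half of that is
spent speaking at 120WPM. How natural the result sounds will depend on the
voice.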

The program has been even more effective than I expected for audiobook
making. I've complimented Cepstral's phrasing ability, but lest you
expect Laurence Olivier, my wife says they all sound like they're reading
a shopping list. I don't notice that after about ten minutes, and the
voice has so far sustained my comprehension all the way through two large
classic novels, a couple of books of the Bible, and some dense philosophy
(not that I have that much free time!---I just drive a lot).

Creating an audiobook from some www.gutenberg.org plaintext is a snap.
Creating a nicely paced audiobook with proper pauses on paragraphs,
em-dashes, braces, emphases on exclamation points, and organised into
chapter-per-mp3-file is also easy---in unix. I'm not sure how you'd
do that much fancy chapter breaking, SSML markup and mp3 making under
Windows except by hand.
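
For anyone curious what the chapter breaking and SSML markup might look
like, here is a rough sketch in Python rather than my actual unix commands.
Everything in it is an assumption made for illustration: the chapter
heading pattern ("CHAPTER" or "Chapter" at the start of a line), the pause
lengths, the output file names, and the function names themselves. It only
produces one SSML file per chapter; feeding those files to the TTS engine
and encoding the result to mp3 is left out, as is emphasis on exclamation
points.

```python
import os
import re
from xml.sax.saxutils import escape

# Assumed values; tune to taste.
PARAGRAPH_PAUSE_MS = 700
DASH_PAUSE_MS = 400
SPEAK_OPEN = ('<speak version="1.0" '
              'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">')

# Assumed heading style, e.g. "CHAPTER I" or "Chapter 12. The Storm".
CHAPTER_RE = re.compile(r"^(?:CHAPTER|Chapter)[ \t]+\S+.*$", re.MULTILINE)
# Two or more hyphens, or a real em-dash character.
DASH_RE = re.compile(r"\s*(?:--+|\u2014)\s*")


def split_chapters(text):
    """Return (heading, body) pairs; text before the first heading is dropped."""
    matches = list(CHAPTER_RE.finditer(text))
    chapters = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append((m.group(0).strip(), text[m.end():end].strip()))
    return chapters


def paragraph_to_ssml(paragraph):
    """Unwrap hard line breaks, escape the text, and pause at dashes."""
    flat = escape(" ".join(paragraph.split()))
    flat = DASH_RE.sub(' <break time="%dms"/> ' % DASH_PAUSE_MS, flat)
    return "<p>%s</p>" % flat


def chapter_to_ssml(heading, body):
    """Build one SSML document with a pause between paragraphs."""
    paragraphs = [p for p in re.split(r"\n\s*\n", body) if p.strip()]
    pieces = ["<p>%s</p>" % escape(heading)]
    pieces += [paragraph_to_ssml(p) for p in paragraphs]
    para_break = '\n<break time="%dms"/>\n' % PARAGRAPH_PAUSE_MS
    return SPEAK_OPEN + "\n" + para_break.join(pieces) + "\n</speak>"


def write_chapters(text, out_dir):
    """Write chapter_01.ssml, chapter_02.ssml, ... and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for n, (heading, body) in enumerate(split_chapters(text), start=1):
        path = os.path.join(out_dir, "chapter_%02d.ssml" % n)
        with open(path, "w", encoding="utf-8") as f:
            f.write(chapter_to_ssml(heading, body))
        paths.append(path)
    return paths
```

Real Gutenberg files vary a lot in how they mark chapters, so the heading
pattern is the part most likely to need adjusting for a given book.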

In sum, I'd recommend this program to anyone looking for motivation to
practice transcription.

(by routine-sibling for everyone)
