Multilingual text to speech

I wanted to generate an audio file from text (so a text to speech problem). But different fragments of text are in different languages. And, because I'm a developer and I don't like to click on things, I wanted to do this with minimal user interaction. How would one do this?

Let's split it in two separate problems:

  1. text to speech, and
  2. manipulating the audio fragments to create a joined up version

1. Text to speech

Here, because I'm using a Mac, I will just use the say utility that comes with the operating system. I suppose other things exist in other operating systems (maybe even OTHER things exist in macOS too!), but I will not focus on that in this post.

a) Say something

Let's start with getting the computer to say something. A simple use of say would be:

say 'hello world'

And say will read that out through your speakers. It uses macOS' "Speech Synthesis manager", the same component that powers other spoken announcements and assistive technology bits in macOS, so the quality is pretty decent.

Neat, and yet another example of how making systems accessible benefits everyone, not just users with disabilities.

b) Say something in another language

Now, if you tried to read out this sentence in Spanish:

say 'hola a todo el mundo'

It would read it out, but unless your operating system is in Spanish, the pronunciation will be quite wrong, because this text is in Spanish, and not in whatever default language your computer is (mine is in UK English).

However, we could tell it to use a Spanish voice, and then it is much better:

say -v Monica 'hola a todo el mundo'

The names of available voices (and which language they know how to read) can be found out by entering this (the quotes stop shells such as zsh from treating the ? as a filename wildcard):

say -v '?'

It returns a big list, with three columns for name of the voice, language code and a description:

Alex                en_US    # Most people recognize me by my voice.
Alice               it_IT    # Salve, mi chiamo Alice e sono una voce italiana.
Alva                sv_SE    # Hej, jag heter Alva. Jag är en svensk röst.
Amelie              fr_CA    # Bonjour, je m’appelle Amelie. Je suis une voix canadienne.
Anna                de_DE    # Hallo, ich heiße Anna und ich bin eine deutsche Stimme.
...

You could filter the list with grep if you roughly know what you're looking for.

For example, if you wanted to only show German-speaking voices, you could use this (German language codes start with de_, for Deutsch):

say -v '?' | grep de_

And it would output something like this:

Anna                de_DE    # Hallo, ich heiße Anna und ich bin eine deutsche Stimme.
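If you want just the voice names for a language, so you can pass one straight to -v, grep is not quite enough because it returns the whole line. Here's a minimal sketch that parses the listing instead. The function name voices_for_lang is made up for this example, and it assumes the listing keeps the three-column layout shown above, with the language code being the first token that looks like xx_YY. It joins everything before that token, since some voice names can contain spaces.

```shell
# voices_for_lang PREFIX
# Reads a `say -v '?'` listing on standard input and prints the names of
# the voices whose language code starts with PREFIX (e.g. de_ or en_GB).
# Assumption: the language code is the first field shaped like xx_YY.
voices_for_lang() {
  awk -v prefix="$1" '
    {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^[a-z][a-z]_[A-Z0-9]+$/) {
          if (index($i, prefix) == 1) {
            # Everything before the language code is the voice name.
            name = $1
            for (j = 2; j < i; j++) name = name " " $j
            print name
          }
          break
        }
      }
    }'
}
```

You would use it as say -v '?' | voices_for_lang de_, which prints one voice name per line.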

You can also go to the System Preferences Accessibility panel, Spoken Content section, and open up the System voice drop down on the right side to select the Customise... option. There you can add more voices or download a bigger version with better quality for the pre-installed voices.

c) Say something from a text file

Entering the text we want to hear like this is fine, but it does not work for huge lengths of text, as there are technical limits to how long shell commands can be (you might have seen the "Argument list too long" error in the past). It's also just nice in general to decouple things: commands on one side, data on another.

The good news is that if we want say to use a text file as input data, that's possible by using the -f option:

say -v Monica -f spanish.txt

d) Save it to an audio file

Another thing that we want to do is to output to an audio file rather than just hearing it instantly and letting it fade away into the mists of time.

The -o option lets you specify an AIFF file to output to:

say -v Monica -f spanish.txt -o spanish.aiff

Alright!
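If you have several languages to process, typing that out each time gets repetitive. Here's a minimal sketch of a helper that derives the output name from the text file name. The function name speak_to_file and the SAY variable are my own invention for illustration (SAY lets you swap in echo for a dry run); it only uses the -v, -f and -o options shown above.

```shell
# Command used for synthesis. Override it (e.g. SAY=echo) to preview the
# commands without producing any audio.
SAY="${SAY:-say}"

# speak_to_file VOICE TEXT_FILE
# Reads TEXT_FILE with VOICE into an AIFF file of the same base name,
# e.g. spanish.txt -> spanish.aiff.
speak_to_file() {
  "$SAY" -v "$1" -f "$2" -o "${2%.txt}.aiff"
}
```

With that defined, speak_to_file Monica spanish.txt would produce spanish.aiff, and you would call it once per language with a voice that suits each file (Alex and Amelie from the listing above, for instance, if you have them installed).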

Suppose we generated three files (spanish.aiff, english.aiff and french.aiff) using their respective .txt source files, but we want only one file that we can, for example, load in our favourite audio player to listen to our creation anywhere we go.

How do we do this? This is when we move on to the next problem:

2. Audio file manipulation

If they were just plain text files, we could simply concatenate them together, one after the other, perhaps with some text separator like a \n, and we would be done.

But AIFF files contain a header at the beginning of the file. This describes properties of the rest of the contents of the file, such as the sampling rate, number of channels, etc. If we simply concatenate the files, we might either get nothing, or glitchy sounds as the computer tries to interpret the headers of the later files as if they were audio data (I'm imagining something like the cassette tape loading sounds, or an analogue modem trying to connect to the internet).
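You can see the start of that header for yourself. As far as I understand the format, an AIFF file begins with the four bytes FORM, then a four-byte size, then the form type AIFF (or AIFC for the compressed AIFF-C variant). Here's a rough sketch that checks for that signature; the function name is_aiff is hypothetical, and it only looks at those first twelve bytes rather than validating the whole file.

```shell
# is_aiff FILE
# Succeeds when FILE starts with the AIFF container signature:
# bytes 1-4 are "FORM" and bytes 9-12 are "AIFF" or "AIFC".
is_aiff() {
  magic=$(dd if="$1" bs=1 count=4 2>/dev/null)
  form_type=$(dd if="$1" bs=1 skip=8 count=4 2>/dev/null)
  [ "$magic" = "FORM" ] && { [ "$form_type" = "AIFF" ] || [ "$form_type" = "AIFC" ]; }
}
```

A file made by joining two AIFF files with cat would still pass this check, because only the first header sits where a player expects one. The second header ends up in the middle of the data, which is exactly the problem.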

What we need is a tool to join the files together. Something that can parse the headers and load the audio data, and do something with those.

I am sure others exist, but I've used SoX in the past and so that's what I'm going to use for this little experiment. It describes itself as "the Swiss Army knife of sound processing programs" and it's exactly what we need.

SoX is hosted at SourceForge. I'm very surprised there are still projects there after everyone moved to more "social" networks like GitHub or GitLab, but there we go.

You can install it using Homebrew; no need to compile it yourself:

brew install sox

Then joining files together is as easy as this:

sox spanish.aiff english.aiff french.aiff all.wav

I've made it output to all.wav instead of AIFF for demonstrative purposes, but both AIFF and WAV use a huge amount of space unless you deliberately "compress" them (and even then they are not as space efficient as other formats).

We can also output a compressed version directly, no need for intermediate files:

sox spanish.aiff english.aiff french.aiff -r 44.1k -C 160.5 all.mp3

Ta-da! all.mp3 contains the synthesised text and it's about half the size of the WAV version. I specified a CD-quality sample rate (-r 44.1k) and a mid-range bit rate and encoder quality (-C 160.5) because otherwise sox used some really low-quality defaults and it all sounded very lo-fi. Which is fine if that's what you're after! You can play with the parameters and see, for example, what happens if you encode at 8k.

We have accomplished our goal! We read content from three different languages and synthesised the spoken text into a single MP3.
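If the number of fragments varies, the same command can be wrapped so it accepts any list of input files. This is only a sketch: the function name join_to_mp3 and the SOX variable are assumptions of mine (SOX can be set to echo for a dry run), and the encoding parameters are simply the ones from the command above.

```shell
# Command used for joining. Override it (e.g. SOX=echo) to preview the
# command without touching any audio files.
SOX="${SOX:-sox}"

# join_to_mp3 OUTPUT.mp3 INPUT [INPUT ...]
# Joins the inputs, in the order given, into one MP3 at a 44.1 kHz
# sample rate with the -C 160.5 compression setting.
join_to_mp3() {
  if [ "$#" -lt 2 ]; then
    echo "usage: join_to_mp3 OUTPUT.mp3 INPUT [INPUT ...]" >&2
    return 1
  fi
  output=$1
  shift
  "$SOX" "$@" -r 44.1k -C 160.5 "$output"
}
```

Calling join_to_mp3 all.mp3 spanish.aiff english.aiff french.aiff runs the same sox command as before. The output name goes first so that everything after it can be treated as inputs.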

NOTE: Technically speaking, WAV and AIFF are lossless file formats, whereas MP3 or OGG are lossy, which means we save space at the expense of worse sound quality, but for this particular example a lossy format will be fine as we're not listening to a philharmonic orchestra.

You can now load this onto any suitable audio player and enjoy the synthetically generated file.

Here's a repository you can clone which contains some text files and a script to run what this article explains:

git clone xxx
cd xxx
./install.sh
./generate.sh

The resulting mp3 will be placed in the output folder.

Free ideas for you

I don't know about you, but each time I make something I end up having lots of alternative ideas for things you could make if you took matters in a different direction. I do not have the time to explore all of those, so here's my round of free ideas for you!

You could use this to transform text you have no time to read into a sort of podcast that you can then load in your podcast app. Of course you'd have to create the feed file and host the files somewhere.

Likewise you could use this to create really boring lists to fall asleep to. Find some list of something (e.g. place names) and feed it to say, then listen to that as you battle insomnia (and win). I'm fairly confident something like place names and post codes, which can easily become a sort of regular rhythm, can be quite good for falling asleep. And what about a list of prime numbers? (Tip: use commas between the numbers for a more natural pause when feeding them to say).
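For the prime numbers idea, you don't even need to find a list: it can be generated. Here's a small sketch using plain trial division, which is slow but perfectly fine for a bedtime-sized list. The function name primes_up_to is made up for this example, and the output is comma-separated as per the tip above.

```shell
# primes_up_to LIMIT
# Prints the primes from 2 to LIMIT on one line, separated by ", ".
primes_up_to() {
  n=2
  out=""
  while [ "$n" -le "$1" ]; do
    d=2
    is_prime=1
    # Trial division: look for a divisor no larger than the square root of n.
    while [ $((d * d)) -le "$n" ]; do
      if [ $((n % d)) -eq 0 ]; then
        is_prime=0
        break
      fi
      d=$((d + 1))
    done
    if [ "$is_prime" -eq 1 ]; then
      # Prepend ", " only when the list already has an entry.
      out="${out:+$out, }$n"
    fi
    n=$((n + 1))
  done
  printf '%s\n' "$out"
}
```

You could save the result with primes_up_to 1000 > primes.txt and then feed that file to say with the -f and -o options, as before.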

A more eccentric idea could be to feed it garbled lists or content that makes no actual sense; I'm thinking about something like going to Archive.org, finding some book that has been scanned and converted to text with Optical Character Recognition, but still has some sort of glitches here and there because it has not been reviewed by a human. I wonder what that would sound like, and whether that would make you eventually stop listening or just agitate you more.

Regarding sound ideas: there are some novelty voices in English (and they're literally marked as such), just in case you want to "add some pizzazz" to your generated mp3. I think the Trinoids voice must have been used for lots of dance tracks!

sox also offers other effects such as bending, echoes, flanging, etc. I do not need those but maybe you'll get ideas if you look at the documentation.

Also: for the "this already exists in macOS", yadda yadda yadda troupe

Yes, a form of this is already sort of baked into the operating system: if you are in certain macOS apps (mostly, native apps) and select a bunch of text and right click on it then go to Services, you can select Add to Music as a spoken track.

But it requires a lot of user interaction (i.e. clicking), it's not automatable, and the result ends up somewhere in iTunes (whoops, I meant Music), so you have to open it or figure out where it is before you can use it.

I suppose it works very well if you assume that everyone only speaks English and nothing else exists in the world or if it exists, it's acceptable to butcher the words when you pronounce them.

In my tests, even Safari seems to ignore paragraphs marked as using a different language (using the HTML lang attribute), so this does not work for me.