Text to Speech Webinar Part 1

By Shauna Humphries:

Speaker 1 – Victoria Petrovka, Customer Sales/Solutions Manager from DeepZen:

Who are DeepZen:

Startup company founded in London, dedicated to producing high-quality voice solutions and technology for audiobooks, dubbing, synthetic media, gaming and metaverse apps.
They specialize in emotion control and long-form audio production with advanced text analysis capabilities.

DeepZen’s Main Goal: Providing a tool for high-quality, quick and efficient audiobook/voice-over production.

Solutions offered by DeepZen:

AI integration: can be used by any type of app with real-time speech synthesis needs, such as Digital Phone or Voicemail.
Audiobook Production Platform: allows for scale-up of audiobook production and quicker, more cost-efficient product delivery.
New multilingual solutions for video translation, transcription and dubbing.

DeepZen’s Market Challenge:

The provision of a greater number of multilingual audio products. Only 8% of e-books are in audio format and 90% of these are in English.
In response to this, DeepZen developed a fully scalable audiobook production platform that supports multilingual projects. A focus on human intonation and emotional range was critical to meeting customer needs. DeepZen’s production rate is five times faster than the norm and more cost-efficient.

DeepZen’s Technology:

DeepZen uses Natural Language Processing and Text to Speech systems developed in-house specifically for audiobook/long-form content.
The technology replicates the human voice and enables high-quality audio. All voices are licensed from real narrators, and DeepZen’s library holds a digital corpus of narrators whose voices were used in company projects.

The European Accessibility Act: a directive that ensures the accessibility of a number of products/services within the Member States. Focus on e-books and audio-visual media services.

How DeepZen successfully converts e-books to audio format:

Development of a fully scalable portal that allows for all projects to be successfully managed in one place.
Ability to process multiple projects at once. Allows you to prepare ePubs for conversion and track progress from upload to delivery.

DeepZen’s Main Clients – publishers, education, public libraries, media companies and not-for-profit organisations.

2nd Speaker – Ronan Maguirk, Irish Language Translator and IT Professional, Civil Service, Dublin.

Ronan has an MPhil in speech and language processing from Trinity College, Dublin.
Ronan has previous experience in banking and is proficient in a wide range of access technologies.
By 2013 Ronan had discovered that no Irish language speech synthesizer was available. The need for TTS support for Irish and other minority languages became apparent.

NVDA:

Discovered by Ronan between 2010 and 2012.
Supports TTS in multiple languages using the eSpeak synthesizer. Provided the first opportunity for text to speech in Irish. eSpeak languages are implemented by visually impaired (VI) enthusiasts.

eSpeak Project:

Ronan became involved in the eSpeak project and succeeded in getting a first implementation of Irish.
The system had many drawbacks, including a robotic and unnatural-sounding voice. It was also not entirely reliable and did not take dialects into consideration. The focus was on school Irish.
However, it proved useful for document editing and computer coding, so it was a promising development.

Irish Localisation Project:

Ronan initiated this project with NVDA and others, which led to an Irish language interface being implemented in NVDA.
It was designed to tackle a problem with Irish and screen readers: screen readers often speak phrases that do not appear on screen, such as link, heading, button, combo-box etc. Software localisation was necessary to make these frequently used phrases available in Irish.
This project is ongoing, and the interface now has 2700 messages which require an update with each new release.
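
The update work described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the message list, the `untranslated` helper and the sample translations are hypothetical and are not taken from NVDA's actual localisation files or tooling. It shows how a new release that adds a message leaves a gap for the translator to fill.

```python
def untranslated(source_messages, translations):
    """Return the source messages that have no usable translation yet."""
    return [
        msg for msg in source_messages
        if not translations.get(msg, "").strip()
    ]


# Hypothetical messages a screen reader speaks but which never appear on screen.
release_messages = ["link", "heading", "button", "combo box"]

# Illustrative Irish translations carried over from an earlier release.
irish = {"link": "nasc", "heading": "ceannteideal", "button": "cnaipe"}

# Anything listed here still needs a translation before the release ships.
todo = untranslated(release_messages, irish)
print(todo)
```

With around 2700 messages, a check of this kind is what turns each new release into a short, targeted list of strings to translate rather than a full review.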

Liblouis Free Open-Source Project:

Ronan joined this project to tackle the issue of braille and Irish.
Already a member of MBA (Irish equivalent of UKAAF), he developed the Irish Braille code, which is now in NVDA.

Abair project:

Ronan is currently working on this project with NVDA and Trinity. It offers a solution to the challenge of Text to Speech and Irish dialects.
The project involves a SAPI (Speech Application Programming Interface) implementation of the three main Irish language voices, one for each main Irish dialect.

Irish and Microsoft:

Microsoft have developed some Irish language voices; however, these are only available in Azure at present.
They are not yet available as OneCore voices. They are best used to help generate audiobooks or read on-screen text.

Text to Speech Challenges with Irish:

In the case of multilingual documents, the synthesizer needs to be able to switch smoothly between languages for markup purposes and for the purpose of performing Spell Check in both languages. At present the user is required to switch languages manually.
There is also debate about how to make e-books accessible for screen readers, how best to read Irish with Text to Speech, and which software to use. At present Adobe Digital Editions is the most used. There is still a lot of work to be done with Irish in the open-source world.
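
The language switching described above is usually expressed through markup. As a minimal sketch, assuming the synthesizer accepts SSML 1.1 and honours its `lang` element (not every engine does), a document already split into language-tagged segments could be wrapped as follows. The `to_ssml` helper and the sample segments are hypothetical and only illustrate the idea; detecting the language of each segment is a separate problem not covered here.

```python
from xml.sax.saxutils import escape


def to_ssml(segments, default_lang="en-IE"):
    """Wrap (language, text) segments in SSML so the voice can switch language."""
    parts = []
    for lang, text in segments:
        safe = escape(text)  # escape &, < and > so the markup stays well-formed
        if lang == default_lang:
            parts.append(safe)
        else:
            # Mark text in another language with the SSML <lang> element.
            parts.append(f'<lang xml:lang="{lang}">{safe}</lang>')
    body = " ".join(parts)
    return (
        '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{default_lang}">{body}</speak>'
    )


# Hypothetical mixed English and Irish sentence.
segments = [
    ("en-IE", "The Irish word for welcome is"),
    ("ga-IE", "fáilte"),
    ("en-IE", "& it is widely used."),
]
ssml = to_ssml(segments)
print(ssml)
```

Tagging the language in the document itself is what would remove the need for the user to switch languages manually.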

3rd Speaker – Willem Kunis, Partnerships Assistant at CereProc

Who is CereProc: A speech synthesis company based in Edinburgh that produces voices for a wide range of TTS apps, and DNN voices for mobile that require no internet connection.

CereVoice Unit Selection System:

First established in 2006, this system chops recorded audio into small fragments and then glues the fragments back together to form new speech.
While this is extremely fast, it does have some drawbacks, including large file sizes (needed to ensure the required audio is available) and small but noticeable errors such as strange starts and stops.
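
The chop-and-glue idea can be shown with a toy sketch. This is not CereProc's implementation: the unit names, the sample values and the `select_and_join` helper are hypothetical, and the greedy choice below stands in for the more thorough search that production systems typically use. Each unit has several recorded candidates, and the one whose first sample sits closest to the audio produced so far is chosen, so that the join is less audible.

```python
def select_and_join(unit_names, inventory):
    """Pick one recorded candidate per unit and concatenate the samples."""
    output = []
    for name in unit_names:
        candidates = inventory[name]
        if output:
            # Join cost: the jump between the audio so far and the candidate's start.
            best = min(candidates, key=lambda c: abs(c[0] - output[-1]))
        else:
            best = candidates[0]
        output.extend(best)
    return output


# Hypothetical inventory: each unit name maps to one or more recorded fragments.
inventory = {
    "h-e": [[0.0, 0.2, 0.4]],
    "e-l": [[0.9, 0.5, 0.1], [0.4, 0.3, 0.2]],
    "l-o": [[0.8, 0.6, 0.0], [0.2, 0.1, 0.0]],
}

audio = select_and_join(["h-e", "e-l", "l-o"], inventory)
print(audio)
```

Both drawbacks noted above are visible here: the inventory has to store several recordings of every unit, which is why file sizes grow, and any join where no candidate lines up well produces an audible jump.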

Voice Cloning Project:

In 2010 the company cloned the voice of American film critic Roger Ebert, building the clone primarily from archive material. This was the first cloning project built from archive material.

First At-Home Voice Cloning System:

Established in 2012, this allows users to speak sentences aloud and record them on their own devices; the company can then build a voice clone from those recordings.

CereWave AI System:

Established in 2019, CereWave AI is a Deep Neural Network (DNN) system that builds voices whose speech is generated entirely on the computer. It allows for smoother replication of emotion and intonation.

Custom Voice Tender for NHS Wales:

Established in 2023, this is CereProc’s current project, targeted towards children and teenagers who have lost the ability to speak due to various ailments.
The project comprises 16 voices covering Welsh and Welsh-accented English, in North and South Welsh accents. It is the first acknowledgement on behalf of TTS of the importance of regional accent variation.

CereVoice System’s Key Components:

Emotional Genre: The range of emotions the voices have, which can be tailored to customer needs.
Singing TTS: Created from MusicXML files; demonstrates voice control.
On Mobile DNN: Includes the Android app CerePlay, which allows you to download voices and have them on your mobile with no need for an internet connection.
Vocal Puppetry: Allows you to speak audio, and the data is applied to a synthetic voice. Emphasis on clarifying the real meaning of speech utterances, intonation, and pronunciation. Targeted towards people with speaking difficulties, who can then deliver their speech how they wish.

CereProc’s Partners:

The company developed voices with the NLB (Norwegian Library for the Blind) and developed the C-Server system, which allowed the library to generate audio content in bulk. The focus was on providing audio for e-content published monthly or weekly.
Voices were also developed for Aschehoug in multiple languages using CereProc’s cloud system, allowing easy access for students.
CereProc’s work with Lego involved providing a voice for Lego’s new app, which allows VI children to play with their toys. Since the voice was built to provide audio instructions, it needed to be clear and concise. For example, when describing a red brick, the voice would say “red, four by four, square,” and that piece is then attached.