How Voices for the RHVoice Speech Synthesizer Are Created

Audio description: a coloured photo. An editing program is displayed on a dark computer screen. The monitor is divided into two parts. The lower part shows the sound data, represented as green blocks of equal height. The upper part is for video and graphics; it shows pink blocks of various heights.

RHVoice is a speech synthesizer created by Olga Yakovleva. Many people with visual impairments cannot imagine their lives without it: together with screen readers it helps them browse the Internet, run blogs, work with files and so on.

For a long time RHVoice offered only a limited range of voices. But in 2020 Artyom Plaksin, a visually impaired web developer, and his team created a laboratory of new voices for RHVoice. Also in 2020, a big new feature was introduced: the developers presented the synthesized voice of the famous designer, traveler and blogger Artemy Lebedev.

Artyom Plaksin tells the Special View portal how voices for RHVoice are created.

Audio description: a coloured photo. A brightly lit room with rows of black padded chairs. A stout young man in dark glasses is sitting in the front row. His name is Beka Gozashvili. He is wearing a white polo shirt. Black headphones are hanging around his neck. His hands lie upon a dark case in his lap.

How The Idea Was Born

Artyom Plaksin supervises IT projects and works as a webmaster and as a tester who gauges how accessible programs are to blind users. Over the past three years he has launched several social IT projects. Among them are the Data2data («Данные в данные») service, which helps blind people convert PDF files into text and text into speech, and Tiflo Cloud, a cloud storage service accessible to blind users.

Plaksin also heads the Tiflo Host («Тифло Хост») charitable project. As part of the project he works on adapting the Russian segment of the Internet for visually impaired people and provides blind users with accessible hosting for their social projects.

"At first I had no intention of creating a voice for the speech synthesizer. But at some point, while watching Artemy Lebedev's YouTube blog, I noticed what an excellent voice he has: clear, steady, with no hoarseness or defects. Then I thought it would be nice to try to synthesize his voice so that blind people could benefit from it", Artyom says.

He wrote to Artemy Lebedev describing his idea, and Lebedev was interested. Lebedev was sent a speech database: a text comprising 1160 specially selected sentences in Russian. He was to read it at a studio using good sound recording equipment. There was no need to create the speech database from scratch: it had been put together when the first RHVoice voices were developed and was publicly available online.

According to Artyom Plaksin, at the beginning of the project he knew enough about speech synthesis, but he lacked the essential experience of creating voices on his own. So Beka Gozashvili, a blind programmer, joined the project. He had previously created an RHVoice language module for Georgian.

"Beka got excited about the idea, too, and we got down to work. We realized at once that we needed highly skilled sound specialists. The recordings had to be cut into separate sentences, and noise and background sounds had to be removed. Not an easy job, and a highly important one", Artyom Plaksin says.

He invited Sergey Parshakov, a sound editor, and Denis Shishkin, a sound recording engineer, to join the project.

Voice Creation By Stages

Artemy Lebedev read 1160 sentences, which amounted to 100 minutes of recording. Mr. Plaksin notes that speakers usually take 2 to 6 hours to record the phrases, and that editing then yields 1.5 to 3 hours of audio material.

"Recording the speech database is the first and most important part of the work. The quality of the synthesized voice depends largely on the quality and clarity of the recording. Artemy Lebedev sent us the material rather quickly, but unfortunately the sound wasn't recorded at a professional studio, so it took us a lot of time to clean up the noise and background sounds. Nevertheless, we were very happy that our speaker was inspired by the idea and responded to our request", Mr. Plaksin says.

At the second stage of the project Sergey Parshakov, the sound editor, had to extract separate sentences from the recordings and remove mispronunciations. The resulting number of audio files, each containing a sentence 3 to 5 seconds long, had to equal the number of phrases in the text module file.
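The matching requirement described above can be illustrated with a small check. This is only a sketch under stated assumptions, not the team's actual tooling: it assumes the clips are WAV files, that the text file holds one sentence per line, and that the 3 to 5 second range is treated as a hard limit. All function names and the folder layout are hypothetical.

```python
import wave
from pathlib import Path


def wav_duration(path):
    """Return the length of a WAV file in seconds."""
    with wave.open(str(path), "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def check_clips(sentences, durations, min_s=3.0, max_s=5.0):
    """Compare the sentence list with the clip durations and list any problems."""
    problems = []
    # The number of clips must equal the number of sentences.
    if len(sentences) != len(durations):
        problems.append(
            f"count mismatch: {len(sentences)} sentences, {len(durations)} clips"
        )
    # Flag every clip whose length falls outside the expected range.
    for index, seconds in enumerate(durations, start=1):
        if not min_s <= seconds <= max_s:
            problems.append(
                f"clip {index}: {seconds:.2f} s is outside {min_s}-{max_s} s"
            )
    return problems


def check_folder(text_file, clip_dir):
    """Hypothetical layout: one sentence per line, clips sorted by file name."""
    lines = Path(text_file).read_text(encoding="utf-8").splitlines()
    sentences = [line for line in lines if line.strip()]
    durations = [wav_duration(p) for p in sorted(Path(clip_dir).glob("*.wav"))]
    return check_clips(sentences, durations)
```

Calling check_folder with the path to the sentence file and the folder of clips would return an empty list when the two line up, and otherwise a list of messages that an editor could work through one by one.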

Audio description: a coloured photo. Against a dark background, two silvery cylindrical studio microphones hang from two black holders.

"The third stage is working with the sound. This is Denis Shishkin's domain. He removes unwanted noise and selects the frequency and other parameters for the sound. It is essential that the sound engineer brings out the distinctive features of the speaker's voice and manner, because this is what makes the voice recognizable to users", Artyom Plaksin says.

He describes it as long and meticulous work that only a highly skilled sound engineer can do properly. The quality of the recording matters too, which is why the developers always recommend that prospective speakers record only at a professional recording studio.

"The next stage is loading the prepared files into a special software environment where the recorded sound is matched with the text. At that point every letter in the text file of the language module is assigned its corresponding articulation. Through burning-in (meaning repeated listening and editing) the audio files are turned into a synthesized voice. This is what Beka does", Artyom says.

Delicate adjustments at this stage keep the voice from sounding too robotic. With a correctly selected frequency, every letter gets its own sound, one that is not monotonous and has minimal errors. The adjustments may take a long time: according to Mr. Plaksin, creating any synthesized voice takes 2 to 3 months of consistent work.

Voice Catalog

Several thousand people downloaded Artemy Lebedev's synthesized voice in the first days after its publication. The result inspired the developers, and Mr. Plaksin proposed creating a laboratory of voices. Today the lab offers several voices of famous bloggers, reporters and show presenters for download.

The synthesized voice Evgeny is based on the voice of the well-known comedian Evgeny Chebatkov. The voice Victoria is based on the voice of Natalya Arsenyeva, a radio host and the author of the I Was There («Я там был») travel blog. In collaboration with Ukrainian colleagues, Artyom's team created its first Ukrainian voice, Volodimir, based on the voice of Vladimir Beglov, a reporter and lecturer.

"Many audio book lovers with visual impairments appreciate the voice of the famous Soviet and Russian actor and audio book narrator Yury Zaborovsky. Sadly, Yury Zaborovsky passed away at the end of 2020. We decided to create his synthesized voice as a tribute to him. This was a real challenge for our team because we had to base it on existing audio books", Artyom Plaksin says.

First of all, Artyom approached Zaborovsky's widow with the idea. When she gave her consent, Artyom and his team selected about 1600 phrases from the existing recordings.

"There are plenty of books narrated by Yury Zaborovsky, but this didn't make our work much easier. Firstly, his voice changed over the years, and that was totally unacceptable for our task. Secondly, as many audio book listeners remember, the actor narrated with a lot of emotion, which is not quite suitable for speech synthesis. We had to hunt for less emotional phrases, and what is more, they had to meet all the requirements for compiling a speech database", Mr. Plaksin says.

Because of these problems the voice creation process stalled. However, luck was on the developers' side: they found recordings dating from 2004, and this material proved suitable for synthesis. Several months later the voice appeared in the laboratory catalog. It was published on the birthday of the distinguished actor and narrator.

About The Plans For The Future

Many blind people already use the voices created by the laboratory. According to Artyom Plaksin, these voices are used to read aloud books, web pages, apps and so on. In addition, add-ons for screen readers can be downloaded from the RHVoice Lab website.

"In the future we are planning to create voices in other languages. We are very enthusiastic about creating a good-quality English voice, because users have grown rather tired of the ones currently available. We would also like to make a Tatar voice, which is in high demand. We already have complete speech databases and are now looking for good speakers. What is more, we are planning to broaden the range of languages, but to do this we will have to compile speech databases from scratch, and that will require the assistance of experienced linguists. Our project is entirely non-profit, which is why we will look for grants or sponsorship to accomplish our goals", Artyom Plaksin says.

Users can download the voice they need from the catalog and, if they wish, donate to support the project. RHVoice Lab's development and news can be followed on its blog.