Government wants Catalans to read out loud to improve voice assistant project

Reading short sentences will help enrich the database of Project AINA, which contains around 95 million short phrases

Catalan digital policy minister, Jordi Puigneró, on February 15, 2022 at Barcelona Supercomputing Center talking about Project AINA (by Aina Martí) / Gerard Escaich Folch

Gerard Escaich Folch | Barcelona

February 15, 2022 05:03 PM

The Catalan government will launch a campaign on Wednesday asking people to read short sentences out loud to improve Project AINA, which facilitates the development of voice assistants in the Catalan language. The campaign was announced on Tuesday afternoon during a press conference at the Barcelona Supercomputing Center (BSC).

The goal is to get thousands of Catalan speakers of different genders and ages, from different areas across the territory, to record themselves reading short sentences already available online.

The project will use Common Voice, an open-source Mozilla platform focused on creating voice databases worldwide that voice assistant developers can use.

"AINA has come to conquer new territories, such as new digital platforms, new devices such as mobile phones, toasters…," Jordi Puigneró, the Catalan digital policy minister, said. The project is promoted by the digital policies department in collaboration with the Barcelona Supercomputing Center.

The idea is to teach the Catalan language to machines so "citizens can relate using their own language," the minister explained, adding that they want to "avoid the digital disappearance of the Catalan language."

"Project AINA is the infrastructure, the artificial intelligence (AI) that will help" Catalan compete with other languages, Jordi Puigneró told Catalan News.

"We want people with different accents speaking Catalan, those that have a Spanish background or even an English background that can speak the language," Marta Villegas, a researcher at BSC, said.

To help out with the campaign, people will have to visit Project AINA’s website and start recording some of the sentences available.

Authorities recommend registering so that researchers have basic information such as gender, age, and Catalan accent on record. However, they recognize that some people would prefer to remain anonymous, so there is also an option to participate without registering.

"We need to increase the number of voices that help AINA understand the Catalan language. We ask citizens to spend around five to ten minutes reading to AINA every day. This way the AI can learn the language faster and quicker. With more data, it will be possible to accelerate the process that AINA needs," Puigneró said to this media outlet.

Voice assistants need big databases

To create a voice assistant, companies need massive databases of textual and audiovisual material. Project AINA, launched in December 2020, is currently at this stage.

The Barcelona Supercomputing Center has been preparing non-copyrighted material that will be shared with the Common Voice platform. They have been working in collaboration with the Catalan Public Television and Radio Broadcaster (CCMA), with the Parliament, and with the Catalan encyclopedia.

"The English corpus (the scientific name for the database) has 825GB of data. To train the Spanish voice assistant system, they used 570GB of data, while the biggest Catalan language corpus, until 2020, was of 10GB," Villegas said.

The Common Voice platform currently has around 1,000 hours of material in the Catalan language. With the ‘La Nostra Llengua és la Teva Veu’ (Our Language is Your Voice) campaign, the digital policy ministry wants to reach 2,000 hours by the end of the year.

To do so, the Catalan government will invest up to €3 million in Project AINA this year.

Second text corpus version

A new phase of Project AINA also kicks off this year. So far, the system has a first version of the text corpus with up to 1.77 billion words and 95 million sentences.

The next step is to increase the number of words and sentences while covering different Catalan accents and different registers, such as colloquial, literary, or the one used in the public administration.