GSoC: Utilizing Speech-to-Speech Translation to Facilitate Multilingual File Storage and Networking

Hannes Leipold
4 min read · Jun 15, 2021

This is a blog post, updated periodically during the GSoC project, focusing on using speech and text translation to store text, audio, and video files in multiple languages and, from that, to build a multilingual network for users. The task is to build a multilingual message board where users can translate their text, audio, and video files into the native language of the other users they wish to communicate with, or simply translate them for themselves.

Given the need for people around the world, from different backgrounds and with different linguistic heritages, to communicate with one another, a central task for researchers is to build systems that facilitate a multilingual communication network.

Machine translation for text (MTT) is one of the research areas where modern deep neural networks have proven highly competitive compared to other frameworks, and they are increasingly competitive with human translation. Machine translation for audio, however, still lags human translation and remains one of the most important and interesting research areas for MT systems.

The system must remain malleable and flexible as the field advances, and so must accommodate different pipelining strategies. As a baseline, we build a system that uses Automatic Speech Recognition (ASR) to produce transcripts, then MTT to produce translated transcripts, and lastly Text-to-Speech (TTS) to produce translated speech. However, we also want the system to allow any two, or all three, of these tasks to be combined into a single model (hence four options). Moreover, different ASR, MTT, and TTS choices expose different important auxiliary information, such as the pauses between words, prosodic acoustic information about the speaker, and matching the pacing of the translated speech to the pace of the speaker. Flexibility in the type of subsystems used is therefore important.
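To make the idea of swappable or fused stages concrete, here is a minimal sketch. The `Stage` type and `compose()` helper are hypothetical illustrations, not the project's actual interfaces; the point is only that each stage maps one artifact to the next, so stages can be replaced or merged.

```python
# Minimal sketch of the pipeline abstraction (hypothetical names).
from typing import Callable, List

# A stage maps one artifact to the next, e.g. audio -> transcript -> translation -> audio.
Stage = Callable[[object], object]

def compose(stages: List[Stage]) -> Stage:
    """Chain individual stages into a single callable pipeline."""
    def pipeline(data):
        for stage in stages:
            data = stage(data)
        return data
    return pipeline

# Baseline: three separate subsystems.
#   baseline = compose([asr, mtt, tts])
# A fused speech-to-translated-text model would replace the first two stages:
#   fused = compose([speech_to_translated_text, tts])
```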

SCHEDULE:

Week 01: Refamiliarization with SQL database systems and building the first relational database, maintained locally, to store some basic data. Building the first schema for what will be kept in the database; the necessary components will shift as we learn more from experimentation in later weeks.
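A first-pass schema of this kind might look like the following sketch. The table and column names here are illustrative assumptions, not the project's actual schema.

```python
# Sketch of a locally maintained first-pass schema (illustrative names only).
import sqlite3

conn = sqlite3.connect("messages.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS users (
    id       INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE,
    language TEXT NOT NULL             -- preferred language code, e.g. 'en'
);
CREATE TABLE IF NOT EXISTS messages (
    id         INTEGER PRIMARY KEY,
    sender_id  INTEGER NOT NULL REFERENCES users(id),
    body       TEXT,                   -- original text for text messages
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
""")
conn.commit()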

Week 02: With the database in the back of our minds, we shift focus to building a BASELINE translation system for speech. We use Mozilla’s DeepSpeech project, Google Translate, and Google TTS to translate a single audio file.
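A rough sketch of that baseline is below, assuming the `deepspeech`, `googletrans`, and `gTTS` Python packages as the interface to those three tools; the project's actual wiring may differ.

```python
# Baseline sketch: DeepSpeech (ASR) -> Google Translate (MTT) -> gTTS (TTS).
import wave
import numpy as np
import deepspeech
from googletrans import Translator
from gtts import gTTS

def translate_audio(wav_path, model_path, scorer_path, target_lang="de"):
    # 1) ASR: DeepSpeech expects 16 kHz, 16-bit mono PCM audio.
    model = deepspeech.Model(model_path)
    model.enableExternalScorer(scorer_path)
    with wave.open(wav_path, "rb") as wav:
        audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
    transcript = model.stt(audio)

    # 2) MTT: translate the transcript into the target language.
    translated = Translator().translate(transcript, dest=target_lang).text

    # 3) TTS: synthesize translated speech.
    gTTS(translated, lang=target_lang).save("translated.mp3")
    return transcript, translated
```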

Week 03: Progressing from last week, we keep the focus on audio files and adapt the SQL database to store basic audio file information. The audio files themselves are stored in a directory, and the SQL database maintains the locations of the files.
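One way this can look in practice is sketched below: the file stays on disk and the database only records its location plus basic metadata. Table, column, and path names are illustrative assumptions.

```python
# Sketch: register an audio file on disk and store only its path in SQLite.
import os
import shutil
import sqlite3

AUDIO_DIR = "media/audio"   # hypothetical media directory

def register_audio(conn, src_path, sender_id, language):
    conn.execute("""CREATE TABLE IF NOT EXISTS audio_files (
        id        INTEGER PRIMARY KEY,
        sender_id INTEGER NOT NULL,
        language  TEXT NOT NULL,
        path      TEXT NOT NULL
    )""")
    os.makedirs(AUDIO_DIR, exist_ok=True)
    dest = os.path.join(AUDIO_DIR, os.path.basename(src_path))
    shutil.copy(src_path, dest)                  # the file itself lives in the directory
    cur = conn.execute(
        "INSERT INTO audio_files (sender_id, language, path) VALUES (?, ?, ?)",
        (sender_id, language, dest),
    )
    conn.commit()
    return cur.lastrowid                         # database row pointing at the file
```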

Week 04: Using Flask, we develop basic functionality for the website: a message board, plus back-end pipelining through controllers and templates to retrieve data from the database and place it on the message board.
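A minimal Flask sketch of that controller-to-template flow is below. The route, template name, and query are illustrative assumptions rather than the project's actual code.

```python
# Sketch: a Flask controller pulls rows from the database and hands them to a template.
import sqlite3
from flask import Flask, render_template

app = Flask(__name__)

@app.route("/board")
def board():
    conn = sqlite3.connect("messages.db")
    conn.row_factory = sqlite3.Row
    messages = conn.execute("SELECT body, created_at FROM messages").fetchall()
    conn.close()
    # board.html (a Jinja template) loops over the rows and renders the board.
    return render_template("board.html", messages=messages)
```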

Week 05: Using AngularJS, integrated with Flask, we develop the JavaScript functionality for things like recording messages and storing user information. Session handling is a key component: it lets the system track whether a user is logged in and use their personal information to decide what kind of messages to create.
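For the session handling side, a sketch using Flask's built-in signed-cookie sessions is shown below; the login flow and field names are illustrative assumptions.

```python
# Sketch: tracking the logged-in user with Flask sessions.
from flask import Flask, request, session, redirect

app = Flask(__name__)
app.secret_key = "change-me"          # required for signed session cookies

@app.route("/login", methods=["POST"])
def login():
    # In the real system the credentials would be checked against the users table.
    session["user_id"] = request.form["username"]
    session["language"] = request.form.get("language", "en")
    return redirect("/")

@app.route("/logout")
def logout():
    session.clear()
    return redirect("/")
```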

First Evaluations

We now have a website with core functionality. With the first evaluations coming up, it is a good moment to look at what lies ahead as well as at what we have. We have a system that lets us use state-of-the-art machine learning techniques for multilingual data storage and messaging, but some functionality is still incomplete:

  1. Changes to the message storing system so that it runs fully on SQL and can scale to many users.
  2. Changes to the website to make it easier and friendlier for actual users, including CSS, JS, and more HTML features.
  3. Segmentation for longer audio clips, so the transcription system can be used on longer corpora.
  4. Generalizing the system toward potential audio-to-translated-text or audio-to-translated-audio subsystems that may exist in the future.

Week 06: Played around with pyAudioAnalysis (following DeepSpeech’s examples) to see how this tool can segment long audio sequences (30 seconds or more) into more manageable chunks. Some of these experiments are shown in Single_file_translation.py on the GitHub repo.
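A sketch of that kind of segmentation is below, assuming the `silence_removal` helper available in recent pyAudioAnalysis releases; parameter values are illustrative and not taken from Single_file_translation.py.

```python
# Sketch: split a long recording into non-silent chunks with pyAudioAnalysis.
from pyAudioAnalysis import audioBasicIO
from pyAudioAnalysis import audioSegmentation as aS

def split_on_silence(wav_path):
    fs, signal = audioBasicIO.read_audio_file(wav_path)
    # Returns [start, end] pairs (in seconds) of non-silent regions, which can
    # then be fed to DeepSpeech one chunk at a time.
    segments = aS.silence_removal(signal, fs, 0.020, 0.020,
                                  smooth_window=1.0, weight=0.3, plot=False)
    return segments
```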

Week 07: Began a SQLAlchemy implementation for messages that covers some of the core functionality and works for untranslated text messages.
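A rough Flask-SQLAlchemy sketch of such a message model is below; the field names are illustrative assumptions rather than the project's actual model.

```python
# Sketch: a Flask-SQLAlchemy message model (illustrative fields).
from datetime import datetime
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///messages.db"
db = SQLAlchemy(app)

class Message(db.Model):
    id          = db.Column(db.Integer, primary_key=True)
    sender      = db.Column(db.String(80), nullable=False)
    kind        = db.Column(db.String(10), default="text")   # 'text', 'audio', or 'video'
    body        = db.Column(db.Text)                          # original text, if any
    media_path  = db.Column(db.String(255))                   # on-disk location for audio/video
    uploaded_at = db.Column(db.DateTime, default=datetime.utcnow)
```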

Week 08: SQLAlchemy for text messages is fully implemented. Focused on issues around recording audio and video files and how those files are transferred to the database. Audio and video files arrive as .webm files and are processed using ffmpeg and moviepy. Users can now record audio or video files and have them stored through the database.
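A sketch of that processing step is below: turning a recorded .webm clip into the 16 kHz mono WAV that DeepSpeech expects, either through moviepy for video clips or a direct ffmpeg call. File names are illustrative.

```python
# Sketch: convert recorded .webm clips to 16 kHz mono WAV for transcription.
import subprocess
from moviepy.editor import VideoFileClip

def webm_video_to_wav(webm_path, wav_path):
    clip = VideoFileClip(webm_path)
    clip.audio.write_audiofile(wav_path, fps=16000)   # extract and resample the audio track
    clip.close()

def webm_audio_to_wav(webm_path, wav_path):
    subprocess.run(["ffmpeg", "-y", "-i", webm_path,
                    "-ar", "16000", "-ac", "1", wav_path], check=True)
```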

Week 09: Full SQLAlchemy implementation for all messages, translated or not, now fully stored in the database. Users can decide what type of message they would like to send and then send it to other users. The original messages and their translated pairs are kept in a directory, whose layout is recorded in the SQLAlchemy message database.
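One way to lay out such a directory is sketched below, with only the resulting paths recorded in the database. The naming convention here is an illustrative assumption, not the project's actual layout.

```python
# Sketch: keep an original message and its translation side by side on disk.
import os

MEDIA_ROOT = "media/messages"   # hypothetical root directory

def message_paths(message_id, src_lang, dst_lang, ext="webm"):
    folder = os.path.join(MEDIA_ROOT, str(message_id))
    os.makedirs(folder, exist_ok=True)
    original = os.path.join(folder, f"original_{src_lang}.{ext}")
    translated = os.path.join(folder, f"translated_{dst_lang}.{ext}")
    return original, translated   # both paths get stored with the message row
```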

Week 10: Users have a message board that can now retrieve important information about the messages as well as the messages themselves, and display them sorted by their date of upload. Segmentation for larger files is still missing. Generalization to systems that combine several of the transcription, translation, and speech tasks is not as far along as hoped. Video and audio files can only be processed in English so far(!). The full core implementation is there, and, while the website is not secure, sessions, user identity, and messages are kept and stored correctly when needed. It is really, really cool to record messages and then see them and their translated versions appear on the message board!
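Continuing the Flask-SQLAlchemy sketch from Week 07, the date-sorted board query could look roughly like this (it reuses the hypothetical `app` and `Message` names from that earlier sketch):

```python
# Sketch: fetch every message, newest first, for the board view.
from flask import render_template

@app.route("/board")
def board():
    messages = Message.query.order_by(Message.uploaded_at.desc()).all()
    return render_template("board.html", messages=messages)
```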
