A Project Proposal
How I'd engineer an original & disruptive Video Chat app offering Real-Time CC translations
August 15, 2021

Value Proposition:
A website visitor can establish a video chat connection with another user who may speak a different language, and the two will be able to understand each other and have a conversation by reading real-time closed-caption translations shown below the video feed.
Motivation:
My partner's family lives in Brazil and speaks only Portuguese, while I live in America and communicate in English. Being able to video chat and actually understand what we are saying to each other would enable us to finally get to know each other, which is something that is incredibly important to me, my partner, and their family.
Key Technology:
The key technology that will enable this translation feature, which is not currently available on popular platforms such as Zoom, is WebAssembly. By using WebAssembly to offload the computationally expensive work of turning speech in one language into text in another from the server to the browser, the web app can facilitate a WebRTC video chat session accompanied by translated closed captions at near-native performance.
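To make "offloading to the browser" concrete, here is a minimal sketch of the pattern I have in mind: the browser fetches a compiled WebAssembly module and hands it chunks of raw audio samples to process locally. The module name (translator.wasm) and its exports (alloc, process_samples) are placeholders for whatever speech-to-text/translation library I end up choosing, not a real API.

```javascript
// Hypothetical sketch: load a wasm build of the speech/translation engine and
// pass it a chunk of Float32 audio samples. "translator.wasm", alloc(), and
// process_samples() are placeholder names, not a real library's API.
async function loadTranslator() {
  const { instance } = await WebAssembly.instantiateStreaming(
    fetch("translator.wasm"),
    {} // imports would go here if the chosen module needs any
  );
  return instance;
}

function processChunk(instance, samples /* Float32Array */) {
  const { memory, alloc, process_samples } = instance.exports;
  // Copy the samples into the module's linear memory...
  const ptr = alloc(samples.length * 4);
  new Float32Array(memory.buffer, ptr, samples.length).set(samples);
  // ...then let the wasm code crunch them at near-native speed.
  return process_samples(ptr, samples.length);
}
```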
Steps:
Here I break down the project into a set of deliverables that, if executed in order, will culminate in a feature-complete product. By approaching the work this way, I can make measurable progress continuously and ship incremental updates toward the product described in the value proposition.
[ ] Take a pre-recorded English audio clip, access it within a WebAssembly module, calculate the amplitude of its volume at each point in time, and then display that data in the DOM in the form of a graph (a rough sketch of this step follows the list).
[ ] Deploy this MVP to a live hosted environment to mark the successful use of WebAssembly.
[ ] Use WebRTC to establish a peer-to-peer connection between two clients on my local network and transmit pre-recorded audio loops between them in real time, playing the received audio out loud (see the connection sketch after this list).
[ ] Deploy this MVP to mark the successful use of WebRTC.
[ ] As the data streams transmit between the client browsers, use the WebAssembly module from step 1 to process the audio as it is received in small intervals (spanning the time of a single loop). When the amplitude calculation of a given clip is complete, display the graph in the DOM under a synchronized clock/timer visual.
[ ] Deploy this MVP to mark the successful integration of WebAssembly with WebRTC.
[ ] Replace the pre-recorded audio loops with real-time input from the users' microphones (the getUserMedia sketch after this list covers this step and the camera step that follows).
[ ] Deploy this MVP to mark the successful accessing and transmission of the users' microphone feeds.
[ ] Stream video along with audio using real-time input from the users' cameras.
[ ] Deploy this MVP to mark the successful accessing and transmission of the users' video feeds in addition to audio feeds.
[ ] Replace the timer visuals with the video feeds and display the audio amplitude graph below the video instead, while simultaneously playing the audio along with the video.
[ ] Deploy this MVP to mark the successful implementation of WebRTC video chat accompanied with an offloaded WebAssembly process.
[ ] Replace the audio amplitude calculation with an open-source English speech-to-text engine that can run in the WebAssembly execution environment. Then display the same-language closed captions it produces under the video feeds (a rough captioning sketch follows the list).
[ ] Deploy this MVP to mark the successful delivery of a Zoom-like video chat with English subtitles.
[ ] Expand the speech-to-text algorithm to produce Portuguese subtitles given English audio.
[ ] Expand the speech-to-text algorithm to produce English subtitles given Portuguese audio.
[ ] Stretch goal: expand the speech-to-text algorithm to produce subtitles to/from a range of languages in addition to English and Portuguese.
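For the first deliverable, the browser side might look something like this: fetch and decode the pre-recorded clip with the Web Audio API, hand the raw samples to the wasm module from the earlier sketch, and plot the returned amplitude values on a canvas. The amplitude_envelope export and its write-the-result-in-place convention are invented for illustration.

```javascript
// Sketch for deliverable 1. Assumes the wasm module exposes hypothetical
// alloc() and amplitude_envelope() exports, the latter overwriting the input
// buffer with one amplitude value per window and returning how many it wrote.
async function graphClip(instance, clipUrl, canvas) {
  const audioCtx = new AudioContext();
  const encoded = await (await fetch(clipUrl)).arrayBuffer();
  const clip = await audioCtx.decodeAudioData(encoded);
  const samples = clip.getChannelData(0); // mono is enough for a graph

  const { memory, alloc, amplitude_envelope } = instance.exports;
  const ptr = alloc(samples.length * 4);
  new Float32Array(memory.buffer, ptr, samples.length).set(samples);

  const windowSize = 1024; // roughly 23 ms of audio at 44.1 kHz
  const count = amplitude_envelope(ptr, samples.length, windowSize);
  const envelope = new Float32Array(memory.buffer, ptr, count);

  // Draw the amplitude envelope as a simple bar graph in the DOM.
  const ctx = canvas.getContext("2d");
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  const barWidth = canvas.width / count;
  envelope.forEach((amp, i) => {
    const barHeight = Math.min(amp, 1) * canvas.height;
    ctx.fillRect(i * barWidth, canvas.height - barHeight, barWidth, barHeight);
  });
}
```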
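For the WebRTC deliverables, a minimal local-network sketch could look like the following: turn a looping pre-recorded clip into a MediaStream with the Web Audio API, attach its tracks to an RTCPeerConnection, and exchange the offer, answer, and ICE candidates over a signaling channel, which WebRTC leaves to the application and which I only stub out here as sendToPeer (a tiny WebSocket relay would do). The clip name loop.mp3 is a placeholder.

```javascript
// Sketch for the peer-to-peer audio-loop step. sendToPeer() stands in for a
// real signaling channel; the answering peer mirrors this with createAnswer().
async function startCall(sendToPeer) {
  const pc = new RTCPeerConnection();

  // Turn a looping pre-recorded clip into a MediaStream we can send.
  const audioCtx = new AudioContext();
  const encoded = await (await fetch("loop.mp3")).arrayBuffer();
  const buffer = await audioCtx.decodeAudioData(encoded);
  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.loop = true;
  const dest = audioCtx.createMediaStreamDestination();
  source.connect(dest);
  source.start();

  dest.stream.getTracks().forEach((track) => pc.addTrack(track, dest.stream));

  // Play whatever the remote peer sends us (autoplay policies may require a
  // user gesture before this succeeds).
  pc.ontrack = ({ streams: [remote] }) => {
    const el = new Audio();
    el.srcObject = remote;
    el.play();
  };

  // Forward ICE candidates to the other peer as they are discovered.
  pc.onicecandidate = ({ candidate }) => {
    if (candidate) sendToPeer({ candidate });
  };

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendToPeer({ offer });
  return pc;
}
```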
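Swapping the pre-recorded loop for live input is mostly a one-call change: getUserMedia prompts for microphone (and, later, camera) permission and returns a MediaStream whose tracks feed the same RTCPeerConnection. The #local-video element is a placeholder for wherever the self-view ends up in the page.

```javascript
// Sketch for the microphone/camera deliverables: replace the looping clip with
// live tracks from getUserMedia (the browser handles the permission prompt).
async function addLiveMedia(pc, withVideo = false) {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: true,
    video: withVideo, // flip to true for the camera deliverable
  });
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));

  // Show our own camera feed locally while the call is up.
  if (withVideo) {
    const preview = document.querySelector("#local-video");
    preview.srcObject = stream;
    preview.play();
  }
  return stream;
}
```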
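The captioning deliverables depend heavily on which speech-to-text engine I choose, so this last sketch only shows the shape of the integration. It assumes a hypothetical transcriber wrapper around the wasm build rather than real exports, and assumes some upstream code (for example an AudioWorklet tapping the remote stream) calls it with a fresh chunk of samples every second or so.

```javascript
// Sketch for the captioning deliverables. `transcriber` is a hypothetical
// wrapper around whichever open-source engine gets compiled to wasm; the real
// API will differ. Same-language captions come first, translation later.
async function updateCaptions(transcriber, samples, captionEl) {
  const text = await transcriber.transcribe(samples);
  if (text.trim()) {
    captionEl.textContent = text; // the closed-caption line under the video feed
  }
}
```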
Architectural design questions:
Which language/wasm library should I use for the WebAssembly component of this project, keeping in mind:
My plan to incorporate an open-source speech-to-text translation algorithm
My desire to use this algorithm in conjunction with WebRTC
My dream of having a smooth deployment process
Would it be better to code the front-end in vanilla JavaScript or React, considering the camera and microphone permissions required to video chat from within the browser? Or perhaps learning ASP.NET & Blazor would be best?
Where would a back-end fit into the early stages of this project?