Major advances of Moshi in Data Science
📅 Published on July 30, 2024
I. What are the major advances of Moshi?
While the multimodal nature of Moshi isn’t groundbreaking in itself (models like Gemini and SeamlessM4T already accept text or audio input), and the ability to detect and reproduce emotions isn’t new either (ElevenLabs already does this), Moshi still brings significant innovations to the table.
1. Asynchronous dual-stream channels
Moshi’s use of asynchronous dual-stream channels is a notable innovation. This approach, although extremely complex to implement, allows for improved performance, but in practice it often leads to interruptions during the conversation.
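To picture how this works, here is a toy sketch (the class and method names are invented for illustration, not Moshi’s actual API): at every time step the model ingests one frame of the user’s stream while emitting one frame on its own stream, so listening and speaking overlap instead of alternating turns.

```python
# Toy sketch of dual-stream (full-duplex) decoding. All names here are
# hypothetical stand-ins, not Moshi's real interface.

class DummyDuplexModel:
    """Stand-in model: replies based on what it has heard so far."""

    def initial_state(self):
        return []

    def ingest(self, state, user_frame):
        state.append(user_frame)          # "listen": consume the user's frame
        return state

    def emit(self, state):
        frame = f"reply-to({state[-1]})"  # "speak": emit a frame immediately
        return frame, state


def converse(model, user_stream):
    state = model.initial_state()
    for user_frame in user_stream:        # frames arrive continuously
        state = model.ingest(state, user_frame)
        out_frame, state = model.emit(state)
        yield out_frame                   # output overlaps input, so either
                                          # side can interrupt the other


for frame in converse(DummyDuplexModel(), ["hi", "wait-", "actually..."]):
    print(frame)
```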
2. Achieving low latency
For our own AI models, Lisa and Adriana, we aimed for a 100 ms response time but couldn’t get below 500 ms. To push latency down, we employed several techniques:
Hardware optimization:
- High-performance GPU with extremely high TFLOPS
- Powerful CPU for batch preparation
- Optimized network components (switches, cables, network card, and PCIe)
Streaming technology:
- Development of proprietary streaming technology to compress audio data during transfer to the LLM
Sequence limitation:
- Reduction of input prompt length (fewer tokens = less computational power required)
Punctuation detection:
- Starting inference as soon as a period is detected in the text (see the sketch after this list)
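To make that last point concrete, here is a minimal sketch of punctuation-triggered inference; `start_inference` is a hypothetical callback standing in for the real LLM call:

```python
# Minimal sketch: fire LLM inference at the first sentence boundary in a
# streaming transcript instead of waiting for the user to stop speaking.
# `start_inference` is a hypothetical callback; a production version would
# also handle abbreviations ("e.g.", "Dr.") and decimal points.

SENTENCE_ENDINGS = {".", "?", "!"}


def stream_transcript(chunks, start_inference):
    buffer = ""
    for chunk in chunks:                      # partial text from the STT stage
        buffer += chunk
        if buffer and buffer[-1] in SENTENCE_ENDINGS:
            start_inference(buffer.strip())   # don't wait for end of speech
            buffer = ""                       # start on the next sentence


stream_transcript(
    ["What is", " the weather", " today?", " And tomorrow", "?"],
    start_inference=lambda sentence: print("LLM <-", sentence),
)
```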
II. Asynchronous processing
We work asynchronously in a sequential manner rather than with two different streams, which avoids prediction-synchronization issues. Despite these techniques, our pipeline still consists of three distinct stages:
STT (transcription) => LLM (inference) => TTS (vocal synthesis)
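As a rough illustration of that layout (the stage functions below are placeholders, not our production code), each stage awaits the previous one:

```python
import asyncio

# Rough sketch of the three-stage pipeline:
# STT (transcription) -> LLM (inference) -> TTS (vocal synthesis).
# The sleeps are stand-ins for real model calls.

async def stt(audio: bytes) -> str:
    await asyncio.sleep(0.05)      # stand-in for real transcription
    return "transcribed text"

async def llm(prompt: str) -> str:
    await asyncio.sleep(0.20)      # stand-in for real inference
    return f"answer to: {prompt}"

async def tts(text: str) -> bytes:
    await asyncio.sleep(0.05)      # stand-in for real vocal synthesis
    return text.encode()

async def handle_utterance(audio: bytes) -> bytes:
    text = await stt(audio)        # stage 1
    answer = await llm(text)       # stage 2
    return await tts(answer)       # stage 3: each stage waits on the last

asyncio.run(handle_utterance(b"raw pcm audio"))
```

Every awaited stage adds to the total response time, which is exactly the overhead an integrated model removes.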
3. Integrated model for end-to-end processing
Moshi integrates automatic speech recognition (ASR) and inference, producing output audio within a single model and enabling asynchronous processing. This removes the need for a “start” button for speech input, since a separate transcription step is no longer required.
4. Local execution on Apple hardware
Another significant innovation of Moshi is its ability to run locally on a MacBook Pro. This is a remarkable achievement, given the challenges of working with non-NVIDIA GPUs in Data Science. Optimizing a model of this scale to perform efficiently on Apple hardware (M1 and M2) is a major technological breakthrough.
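For a sense of what local, non-NVIDIA inference looks like in practice, here is a minimal PyTorch sketch using the Metal (MPS) backend on Apple silicon; the model below is a stand-in, not Moshi itself:

```python
import torch

# Minimal sketch of local inference on Apple silicon via PyTorch's Metal
# (MPS) backend. The model is a placeholder; Moshi's actual local build
# uses its own optimized stack. Falls back to CPU when MPS is unavailable.

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = torch.nn.Sequential(                 # placeholder for a real checkpoint
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
).to(device).eval()

with torch.no_grad():
    x = torch.randn(1, 512, device=device)   # fake input features
    y = model(x)                             # runs on the M1/M2 GPU via MPS
print(y.shape, y.device)
```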
III. Potential impact on NVIDIA’s market
Could this innovation affect NVIDIA’s stock? Only time will tell.
For a detailed explanation, watch the complete video.