Bytedance Researchers Present Cross Language Agent – Simultaneous Interpretation (CLASI): A High-Quality And Human-Like Simultaneous Speech Translation (SiST) System

Simultaneous speech translation (SiST), the translation of spoken words into another language in real time, is one of the most difficult challenges in translation, and it paves the way for instantaneous communication across language barriers. Machine-assisted autonomous interpretation has attracted considerable attention in natural language processing (NLP). Traditional simultaneous translation systems typically use a cascade of streaming Automatic Speech Recognition (ASR), punctuation, and Machine Translation (MT) models. Unfortunately, the ASR module is a common source of latency and error propagation in such cascaded systems.

Academic SiST models and commercial SiST engines have come a long way, yet translation quality still needs to improve. In human evaluations of currently available SiST systems, the systems delivered less than 42% of the correct information to listeners, which significantly limits the efficacy of communication from a user-centered standpoint. A human interpreter, by contrast, can convey at least 70% of the intended meaning and often more than 95%. The researchers therefore use 80% to denote highly qualified human interpreters in this work. They propose LLMs for the SiST task because of their enormous success in machine and speech translation.

Integrating an LLM into SiST is not straightforward. First, the read-write policy requires the LLM to produce only a partial translation of the input speech received so far. Second, LLMs cannot reliably learn rare terms or terminology from training data, so reaching human-equivalent performance is challenging. Finally, performance on the SiST task is still hindered by a shortage of training data. In response to these challenges, researchers from ByteDance have introduced CLASI, a Cross-Lingual Agent that achieves Simultaneous Interpretation through the repeated execution of various operations.

CLASI overcomes the first obstacle by emulating human interpreters' approach of segmenting full sentences into smaller, more manageable pieces based on syntactic markers and contextual meaning. It does so through a data-driven policy learning method, which lets CLASI learn and apply a rigorous read-write policy for SiST. To address the second obstacle, the CLASI agent is equipped with two additional modules: a memory that records speech context and an external knowledge database that holds terminology and matched translations. However, the external knowledge database can introduce noise and slow the system down. To mitigate this, the researchers propose Multi-Modal Retrieval Augmented Generation (MM-RAG), which uses a multi-modal retriever to search the external database for relevant information, thereby improving the efficiency of the CLASI agent.
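The round structure described above (read input, decide whether a segment is complete, translate, update memory) can be illustrated with a toy loop. This is only a sketch under strong assumptions, not the CLASI implementation: text chunks stand in for audio, a punctuation check stands in for the learned read-write policy, and `ToyInterpreter`, `interpret`, and the `translate` callback are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Assumption: punctuation at the end of a chunk marks a segment boundary.
# CLASI learns this decision from data; this rule is only a placeholder.
BOUNDARY_MARKS = (",", ".", "?", "!", ";")


@dataclass
class ToyInterpreter:
    # Hypothetical translator callback: (segment, memory) -> translation.
    translate: Callable[[str, List[str]], str]
    memory: List[str] = field(default_factory=list)  # past source segments
    buffer: List[str] = field(default_factory=list)  # chunks read, not yet translated

    def step(self, chunk: str) -> Optional[str]:
        """READ one chunk; WRITE a translation only when a segment closes."""
        self.buffer.append(chunk)
        if not chunk.endswith(BOUNDARY_MARKS):
            return None  # READ action: wait for more input
        segment = " ".join(self.buffer)
        self.buffer.clear()
        output = self.translate(segment, list(self.memory))
        self.memory.append(segment)  # record context for later rounds
        return output


def interpret(agent: ToyInterpreter, chunks: List[str]) -> List[str]:
    """Feed chunks one at a time and collect the translations that are emitted."""
    outputs = []
    for chunk in chunks:
        result = agent.step(chunk)
        if result is not None:
            outputs.append(result)
    return outputs
```

For instance, with `translate=lambda seg, mem: seg.upper()` as a stand-in translator, feeding the chunks `["hello", "world,", "how", "are", "you?"]` emits one output per completed segment and leaves both segments in `agent.memory`.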

The retrieved information and the memory context are added to the LLM agent's prompt to improve the translation through in-context learning. To tackle the data scarcity of the SiST task, the researchers use a three-stage training methodology: pretraining, continual training, and fine-tuning. The LLM and the audio encoder are first pretrained separately on massive internal datasets. The model is then continually trained on billions of tokens of low-quality synthetic speech translation data to align the speech and text modalities. So that the LLM makes better use of the contextual information from the retriever and from preceding translations, this stage also incorporates several tasks that improve its in-context learning capability. Finally, a small quantity of human-annotated data is used to fine-tune the model, making it more robust and producing better translations by mimicking the behavior of professional human interpreters. Because SiST frequently involves compaction, abstraction, and paraphrasing, the traditional automatic evaluation metrics for simultaneous interpretation may not accurately reflect its performance.
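The prompt-augmentation step mentioned at the start of the paragraph above, where retrieved terminology and memory context are placed in the LLM prompt, might look roughly like the following. Everything here is an assumption for illustration: the article does not give the prompt format, `retrieve_terms` uses plain substring matching on text rather than a multi-modal retriever over speech, and both function names are hypothetical.

```python
from typing import Dict, List


def retrieve_terms(segment: str, term_db: Dict[str, str], top_k: int = 3) -> Dict[str, str]:
    """Toy text-only retriever: keep database terms that occur in the segment.

    Stand-in for MM-RAG, which retrieves with a multi-modal retriever.
    """
    lowered = segment.lower()
    hits = {src: tgt for src, tgt in term_db.items() if src.lower() in lowered}
    # Cap the number of terms so the prompt does not fill up with noise.
    return dict(list(hits.items())[:top_k])


def build_prompt(segment: str, memory: List[str], terms: Dict[str, str],
                 max_memory: int = 2) -> str:
    """Assemble a prompt from terminology, recent context, and the new segment."""
    lines = ["Translate the current segment."]
    if terms:
        lines.append("Terminology:")
        lines += [f"- {src} => {tgt}" for src, tgt in terms.items()]
    if memory:
        lines.append("Previous context:")
        # Only the most recent segments are kept, to bound prompt length.
        lines += [f"- {past}" for past in memory[-max_memory:]]
    lines.append(f"Current segment: {segment}")
    return "\n".join(lines)
```

Limiting both the retrieved terms (`top_k`) and the memory window (`max_memory`) reflects the concern raised in the article that external knowledge can add noise and slow the system down; the specific limits are arbitrary.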

The researchers therefore propose a new evaluation metric, Valid Information Proportion (VIP), which aligns with human interpreters. The primary goal of SiST is real-time communication, and VIP indicates the proportion of information that is transmitted accurately. In human evaluations on challenging real-world long-speech datasets covering diverse topics, the proposed method significantly outperforms other available systems. For example, in the Chinese-to-English direction, CLASI achieves a VIP score of 81.3%, which is above the 80% level the researchers use to denote highly qualified human interpreters. This promising result indicates a bright future for SiST.
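As a rough illustration of a proportion-style metric like VIP, the sketch below assumes that a human evaluator marks each translated segment as valid or not and that the score is the share of valid segments. The article does not state the exact formula, so this definition and the function name `valid_information_proportion` are assumptions.

```python
from typing import Sequence


def valid_information_proportion(judgments: Sequence[bool]) -> float:
    """Share of segments a human evaluator judged to convey the information correctly."""
    if not judgments:
        raise ValueError("at least one judged segment is required")
    valid = sum(1 for is_valid in judgments if is_valid)
    return valid / len(judgments)


# Four judged segments, three of them valid.
score = valid_information_proportion([True, True, False, True])
print(f"VIP = {score:.1%}")  # prints "VIP = 75.0%"
```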

The results in the Chinese-to-English and English-to-Chinese directions were much better than those of commercial systems, but the team notes that more languages should be covered in the future. In the presented implementation of CLASI, each translation round triggers a full action sequence. Because the model can translate simple inputs accurately without any external knowledge, some actions are optional in simple translation scenarios, and the model could be trained to skip them in the future.

The VIP metric is proposed for human evaluation, which underscores the need for more reliable automated quality and latency metrics in the future. The evidence also points to the potential of reinforcement learning from human feedback (RLHF) to enhance LLM performance. While CLASI outperforms prior state-of-the-art systems, further research is needed into multi-modal reward models and RL approaches for SiST. Promising areas of study also include multi-modal integration, such as end-to-end video-to-video or speech-to-speech generation.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Bytedance Researchers Present Cross Language Agent – Simultaneous Interpretation (CLASI): A High-Quality And Human-Like Simultaneous Speech Translation (SiST) System appeared first on MarkTechPost.
