    Bytedance Researchers Present Cross Language Agent – Simultaneous Interpretation (CLASI): A High-Quality And Human-Like Simultaneous Speech Translation (SiST) System

    August 6, 2024

    Simultaneous speech translation (SiST), the task of translating spoken words into another language in real time, is one of the most difficult challenges in translation, and it paves the way for instantaneous communication across language barriers. Machine-assisted automatic interpretation has attracted considerable attention in natural language processing (NLP). Traditional simultaneous translation systems typically chain streaming Automatic Speech Recognition (ASR), punctuation, and Machine Translation (MT) models into a cascade. Unfortunately, the ASR module in such cascaded systems is a common source of latency and error propagation.

    Academic SiST models and commercial SiST engines have come a long way, yet translation quality still needs to improve. Studies that used human evaluators to assess currently available SiST systems found that they deliver less than 42% of the correct information to listeners, which significantly limits the effectiveness of communication from a user-centered standpoint. A human interpreter, by contrast, can convey at least 70% of the intended meaning and often more than 95%. The researchers therefore use 80% as the benchmark for highly qualified human interpreters in this work. Because LLMs have been so successful in machine translation and speech translation, they are a natural candidate for the SiST task.

    Integrating an LLM into SiST is not straightforward, however. First, the read-write policy requires the LLM to produce partial translations from partial input speech. Second, LLMs struggle to learn rare terms or terminology from training data, which makes human-equivalent performance challenging. Finally, performance on the SiST task is still held back by a shortage of training data. In response to these challenges, researchers from ByteDance have introduced CLASI, a Cross-Lingual Agent that achieves Simultaneous Interpretation through the repeated execution of various operations.

    CLASI overcomes the first obstacle by emulating the way human interpreters segment full sentences into smaller, more manageable pieces based on syntactic markers and contextual meaning. This is achieved through a data-driven policy learning method, which enables CLASI to learn and apply a rigorous read-write policy for SiST. To address the second obstacle, the CLASI agent was enhanced with two additional modules: a memory that records speech context and an external knowledge database containing terminology and matched translations. However, the external knowledge database can introduce noise and slow the system down. To mitigate this, the researchers propose a new method called Multi-Modal Retrieval Augmented Generation (MM-RAG), which uses a multi-modal retriever to search the external database for relevant information, thereby improving the efficiency of the CLASI agent.
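
The retrieval step can be pictured with a small sketch. This is not the authors' MM-RAG implementation: it assumes the speech segment and every terminology entry have already been embedded into the same vector space, and the names `retrieve_terms`, `TERM_DB`, `top_k`, and `min_score` are hypothetical. It only shows the general idea of keeping the few closest entries and discarding weak matches so that the database adds less noise to the prompt.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_terms(query_vec, term_db, top_k=2, min_score=0.5):
    """Return up to top_k entries whose similarity to the query is at least min_score."""
    scored = [(cosine(query_vec, entry["vec"]), entry) for entry in term_db]
    # Drop weak matches first: this is the (assumed) noise filter.
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]


# Toy terminology database; the vectors are made up for illustration.
TERM_DB = [
    {"source": "大语言模型", "target": "large language model", "vec": [0.9, 0.1, 0.0]},
    {"source": "同声传译", "target": "simultaneous interpretation", "vec": [0.1, 0.9, 0.1]},
    {"source": "强化学习", "target": "reinforcement learning", "vec": [0.0, 0.2, 0.9]},
]

# Stand-in for the embedding of an incoming speech segment.
hits = retrieve_terms([0.2, 0.95, 0.05], TERM_DB, top_k=1)
print([h["target"] for h in hits])
```

In this toy setup the similarity threshold matters as much as `top_k`: entries below it are never returned, even when fewer than `top_k` candidates remain.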

    They add the retrieved information and the memory context to the LLM agent’s prompt to improve the translation through in-context learning. To tackle the data scarcity of the SiST task, they use a three-stage training methodology: pretraining, continual training, and fine-tuning. The LLM and the audio encoder are pretrained separately on massive internal datasets. The team then continues training the model on billions of tokens of low-quality synthetic speech translation data to align the speech and text modalities. So that the LLM can make better use of the contextual information from the retriever and from preceding translations, they also incorporate several tasks that improve its in-context learning capability. Finally, they fine-tune the model on a small amount of human-annotated data, which makes it more robust and produces better translations by mimicking the behavior of professional human interpreters. Since SiST frequently involves compression, abstraction, and paraphrasing, traditional automatic evaluation metrics for simultaneous interpretation may not accurately reflect its performance.
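
As a rough illustration of how memory context and retrieved terminology might be placed into a prompt for in-context learning, consider the sketch below. The prompt wording, the `build_prompt` function, and the `max_memory` parameter are assumptions made for this example; the article does not describe CLASI's actual prompt format.

```python
def build_prompt(memory, retrieved_terms, max_memory=3):
    """Assemble a text prompt from recent translations and terminology hints.

    memory: list of (source, translation) pairs from earlier rounds, oldest first.
    retrieved_terms: list of (source_term, target_term) pairs from the retriever.
    """
    lines = ["Translate the incoming speech segment from Chinese to English."]

    # Keep only the most recent rounds so the prompt stays short.
    recent = memory[-max_memory:] if max_memory > 0 else []
    if recent:
        lines.append("Previous context:")
        for src, tgt in recent:
            lines.append(f"- {src} => {tgt}")

    if retrieved_terms:
        lines.append("Terminology hints:")
        for src, tgt in retrieved_terms:
            lines.append(f"- {src} = {tgt}")

    lines.append("Translation:")
    return "\n".join(lines)


prompt = build_prompt(
    memory=[("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")],
    retrieved_terms=[("同声传译", "simultaneous interpretation")],
    max_memory=2,
)
print(prompt)
```

Sections with no content are omitted entirely, so a simple round with no memory and no retrieved terms produces only the instruction and the `Translation:` cue.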

    They also propose a new evaluation metric, Valid Information Proportion (VIP), which aligns with how human interpreters are assessed. The primary goal of SiST is real-time communication, and VIP indicates the proportion of information that is conveyed accurately. In human evaluations conducted on challenging real-world long-speech datasets covering diverse topics, the researchers found that the proposed method significantly outperforms other available systems. For example, in the Chinese-to-English direction, CLASI achieves a VIP score of 81.3%, which is on par with the 80% benchmark used for highly qualified human interpreters. This promising result indicates a bright future for SiST.
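
To make the metric concrete, here is a minimal sketch of a VIP-style score. It assumes that human raters give one yes/no judgment per source segment on whether its information was conveyed accurately; the paper's actual annotation protocol may differ, and the function name `valid_information_proportion` is hypothetical.

```python
def valid_information_proportion(judgments):
    """Fraction of segments that raters marked as accurately conveyed.

    judgments: non-empty list of booleans, one per evaluated segment.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    valid = sum(1 for is_valid in judgments if is_valid)
    return valid / len(judgments)


# Four segments, three of which were judged as accurately conveyed.
score = valid_information_proportion([True, True, False, True])
print(f"VIP = {score:.1%}")  # prints "VIP = 75.0%"
```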

    The results in the Chinese-to-English and English-to-Chinese directions were much better than those of commercial systems, but the team notes that more languages should be covered in the future. In the presented implementation of CLASI, each translation round triggers a full action sequence. Since the model can translate accurately without external knowledge in simple scenarios, some of these actions are optional. In the future, the model could be trained to skip the unnecessary steps.

    The researchers propose the Valid Information Proportion (VIP) metric for improved human evaluation, but they also underscore the need for more reliable automated quality and latency measurements in the future. The evidence further points to the potential of reinforcement learning from human feedback (RLHF) to enhance LLM performance. While CLASI outperforms prior state-of-the-art systems, there is a clear need for additional research into better multi-modal reward models and RL approaches for SiST. Promising areas of study include multi-modal integration, such as end-to-end video-to-video or speech-to-speech generation.

    All credit for this research goes to the researchers of this project.

    The post Bytedance Researchers Present Cross Language Agent – Simultaneous Interpretation (CLASI): A High-Quality And Human-Like Simultaneous Speech Translation (SiST) System appeared first on MarkTechPost.
