
    TransEvalnia: A Prompting-Based System for Fine-Grained, Human-Aligned Translation Evaluation Using LLMs

    August 1, 2025

    Translation systems powered by LLMs have become so advanced that they can outperform human translators in some cases. As LLMs improve, especially in complex tasks such as document-level or literary translation, it becomes increasingly challenging to make further progress and to accurately evaluate that progress. Traditional automated metrics, such as BLEU, are still used but fail to explain why a score is given. With translation quality reaching near-human levels, users require evaluations that extend beyond numerical metrics, providing reasoning across key dimensions, such as accuracy, terminology, and audience suitability. This transparency enables users to assess evaluations, identify errors, and make more informed decisions. 

    While BLEU has long been the standard for evaluating machine translation (MT), its usefulness is fading as modern systems now rival or outperform human translators. Newer metrics such as BLEURT, COMET, and MetricX fine-tune powerful language models to assess translation quality more accurately. Large models such as GPT and PaLM 2 can now offer zero-shot or structured evaluations, even generating MQM-style feedback, and techniques such as pairwise comparison have further improved alignment with human judgments. Recent studies have shown that asking models to explain their choices improves decision quality; yet such rationale-based methods remain underutilized in MT evaluation, despite their growing potential.
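
    To make concrete why a single BLEU number explains so little, here is a minimal sentence-level BLEU sketch (uniform n-gram weights, no smoothing). Real BLEU is corpus-level and typically smoothed, so this is illustrative only: it collapses all quality dimensions into one number and gives no reasoning.

    ```python
    import math
    from collections import Counter

    def bleu(candidate, reference, max_n=4):
        """Sentence-level BLEU: geometric mean of modified n-gram
        precisions times a brevity penalty. A simplified sketch."""
        cand, ref = candidate.split(), reference.split()
        precisions = []
        for n in range(1, max_n + 1):
            cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
            ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped counts
            precisions.append(overlap / max(sum(cand_ngrams.values()), 1))
        if min(precisions) == 0:
            return 0.0  # without smoothing, one empty n-gram level zeroes the score
        log_avg = sum(math.log(p) for p in precisions) / max_n
        bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
        return bp * math.exp(log_avg)
    ```

    Note how brittle the score is: a translation with no 4-gram overlap scores 0 regardless of adequacy, and the number itself says nothing about *which* spans were wrong.
    
    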

    Researchers at Sakana.ai have developed TransEvalnia, a translation evaluation and ranking system that uses prompting-based reasoning to assess translation quality. It provides detailed feedback using selected MQM dimensions, ranks translations, and assigns scores on a 5-point Likert scale, including an overall rating. The system performs competitively with, or even better than, the leading MT-Ranker model across several language pairs and tasks, including English-Japanese, Chinese-English, and more. Tested with LLMs like Claude 3.5 and Qwen-2.5, its judgments aligned well with human ratings. The team also tackled position bias and has released all data, reasoning outputs, and code for public use. 
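
    The core idea of prompting an LLM for dimension-wise, MQM-style feedback can be sketched as a prompt builder. The dimension names and wording below are assumptions for illustration, not the paper's exact prompts:

    ```python
    # Illustrative MQM-style evaluation prompt; dimension list and phrasing
    # are hypothetical stand-ins, not TransEvalnia's actual templates.
    MQM_DIMENSIONS = [
        "accuracy",
        "terminology",
        "linguistic conventions",
        "audience appropriateness",
    ]

    def build_eval_prompt(source, translation, dimensions=MQM_DIMENSIONS):
        """Assemble a prompt asking an LLM judge for span-level reasoning,
        per-dimension 1-5 Likert scores, and an overall rating."""
        lines = [
            "You are an expert translation evaluator.",
            f"Source text: {source}",
            f"Translation: {translation}",
            "For each dimension below, reason about the translation span by span,",
            "assign a 1-5 Likert score per dimension, then give an overall 1-5 score.",
        ]
        lines += [f"- {d}" for d in dimensions]
        return "\n".join(lines)
    ```

    The returned string would be sent to a judge model such as Claude 3.5 Sonnet or Qwen-2.5; the key design choice is forcing reasoning *before* scores, which the paper reports improves alignment with human raters.
    
    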

    The methodology centers on evaluating translations across key quality aspects, including accuracy, terminology, audience suitability, and clarity. For poetic texts like haikus, emotional tone replaces standard grammar checks. Translations are broken down and assessed span by span, scored on a 1–5 scale, and then ranked. To reduce bias, the study compares three evaluation strategies: single-step, two-step, and a more reliable interleaving method. A “no-reasoning” method is also tested but lacks transparency and is prone to bias. Finally, human experts reviewed selected translations to compare their judgments with those of the system, offering insights into its alignment with professional standards. 
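
    Position bias means an LLM judge may favor whichever translation it sees first. A common mitigation, querying the judge in both presentation orders and combining the verdicts, can be sketched as follows; this is a simplified stand-in, not the paper's exact interleaving procedure:

    ```python
    def rank_with_interleaving(judge, src, trans_a, trans_b):
        """Ask the judge twice with swapped order; only a consistent
        preference yields a winner. `judge(src, x, y)` is a stand-in
        callable returning "first" or "second"."""
        v1 = judge(src, trans_a, trans_b)  # A shown first
        v2 = judge(src, trans_b, trans_a)  # B shown first
        a_wins = (v1 == "first") + (v2 == "second")
        b_wins = (v1 == "second") + (v2 == "first")
        if a_wins > b_wins:
            return "A"
        if b_wins > a_wins:
            return "B"
        return "tie"  # contradictory verdicts expose position bias
    ```

    A judge that always prefers the first-shown option produces contradictory verdicts under the swap and lands on "tie", which is exactly the inconsistency the paper's interleaved strategy is designed to suppress.
    
    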

    The researchers evaluated translation ranking systems on datasets with human scores, comparing their TransEvalnia models (Qwen and Sonnet) with MT-Ranker, COMET-22/23, XCOMET-XXL, and MetricX-XXL. On WMT-2024 en-es, MT-Ranker performed best, likely owing to rich training data. On most other datasets, however, TransEvalnia matched or outperformed MT-Ranker; for example, Qwen's no-reasoning variant won on WMT-2023 en-de. Position bias was analyzed via inconsistency scores, where interleaved methods often showed the lowest bias (e.g., 1.04 on Hard en-ja). Human raters gave Sonnet the highest overall Likert scores (4.37–4.61), and Sonnet's evaluations correlated well with human judgment (Spearman's ρ ≈ 0.51–0.54).
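
    The Spearman figures above measure rank agreement between system scores and human ratings. A minimal sketch of the computation (Pearson correlation on ranks, with average ranks for ties):

    ```python
    def spearman_r(xs, ys):
        """Spearman rank correlation: rank both lists (ties get the
        average rank), then take the Pearson correlation of the ranks."""
        def ranks(v):
            order = sorted(range(len(v)), key=lambda i: v[i])
            r = [0.0] * len(v)
            i = 0
            while i < len(order):
                j = i
                while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                    j += 1  # extend over a run of tied values
                avg = (i + j) / 2 + 1
                for k in range(i, j + 1):
                    r[order[k]] = avg
                i = j + 1
            return r
        rx, ry = ranks(xs), ranks(ys)
        n = len(xs)
        mx, my = sum(rx) / n, sum(ry) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
        sx = sum((a - mx) ** 2 for a in rx) ** 0.5
        sy = sum((b - my) ** 2 for b in ry) ** 0.5
        return cov / (sx * sy)
    ```

    A value near 0.5, as reported for Sonnet, indicates moderately strong agreement with human rankings; +1 would mean the system orders translations exactly as humans do.
    
    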

    In conclusion, TransEvalnia is a prompting-based system for evaluating and ranking translations using LLMs like Claude 3.5 Sonnet and Qwen. The system provides detailed scores across key quality dimensions, inspired by the MQM framework, and selects the better translation among options. It often matches or outperforms MT-Ranker on several WMT language pairs, although MetricX-XXL leads on WMT due to fine-tuning. Human raters found Sonnet’s outputs to be reliable, and scores showed a strong correlation with human judgments. Fine-tuning Qwen improved performance notably. The team also explored solutions to position bias, a persistent challenge in ranking systems, and shared all evaluation data and code. 


    Check out the Paper here.

    The post TransEvalnia: A Prompting-Based System for Fine-Grained, Human-Aligned Translation Evaluation Using LLMs appeared first on MarkTechPost.

