Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      10 Top Generative AI Development Companies for Enterprise Node.js Projects

      August 30, 2025

      Prompting Is A Design Act: How To Brief, Guide And Iterate With AI

      August 29, 2025

      Best React.js Development Services in 2025: Features, Benefits & What to Look For

      August 29, 2025

      August 2025: AI updates from the past month

      August 29, 2025

      This 3-in-1 charger has a retractable superpower that’s a must for travel

      August 31, 2025

      How a legacy hardware company reinvented itself in the AI age

      August 31, 2025

      The 13+ best Walmart Labor Day deals 2025: Sales on Apple, Samsung, LG, and more

      August 31, 2025

      You can save up to $700 on my favorite Bluetti power stations for Labor Day

      August 31, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      Call for Speakers – JS Conf Armenia 2025

      August 30, 2025
      Recent

      Call for Speakers – JS Conf Armenia 2025

      August 30, 2025

      Streamlining Application Automation with Laravel’s Task Scheduler

      August 30, 2025

      A Fluent Path Builder for PHP and Laravel

      August 30, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      Windows 11 KB5064081 24H2 adds taskbar clock, direct download links for .msu offline installer

      August 30, 2025
      Recent

      Windows 11 KB5064081 24H2 adds taskbar clock, direct download links for .msu offline installer

      August 30, 2025

      My Family Cinema not Working? 12 Quick Fixes

      August 30, 2025

      Super-linter – collection of linters and code analyzers

      August 30, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»TransEvalnia: A Prompting-Based System for Fine-Grained, Human-Aligned Translation Evaluation Using LLMs

    TransEvalnia: A Prompting-Based System for Fine-Grained, Human-Aligned Translation Evaluation Using LLMs

    August 1, 2025

    Translation systems powered by LLMs have become so advanced that they can outperform human translators in some cases. As LLMs improve, especially in complex tasks such as document-level or literary translation, it becomes increasingly challenging to make further progress and to accurately evaluate that progress. Traditional automated metrics, such as BLEU, are still used but fail to explain why a score is given. With translation quality reaching near-human levels, users require evaluations that extend beyond numerical metrics, providing reasoning across key dimensions, such as accuracy, terminology, and audience suitability. This transparency enables users to assess evaluations, identify errors, and make more informed decisions. 

    While BLEU has long been the standard for evaluating machine translation (MT), its usefulness is fading as modern systems now rival or outperform human translators. Newer metrics, such as BLEURT, COMET, and MetricX, fine-tune powerful language models to assess translation quality more accurately. Large models, such as GPT and PaLM2, can now offer zero-shot or structured evaluations, even generating MQM-style feedback. Techniques such as pairwise comparison have also enhanced alignment with human judgments. Recent studies have shown that asking models to explain their choices improves decision quality; yet, such rationale-based methods are still underutilized in MT evaluation, despite their growing potential. 

    Researchers at Sakana.ai have developed TransEvalnia, a translation evaluation and ranking system that uses prompting-based reasoning to assess translation quality. It provides detailed feedback using selected MQM dimensions, ranks translations, and assigns scores on a 5-point Likert scale, including an overall rating. The system performs competitively with, or even better than, the leading MT-Ranker model across several language pairs and tasks, including English-Japanese, Chinese-English, and more. Tested with LLMs like Claude 3.5 and Qwen-2.5, its judgments aligned well with human ratings. The team also tackled position bias and has released all data, reasoning outputs, and code for public use. 

    The methodology centers on evaluating translations across key quality aspects, including accuracy, terminology, audience suitability, and clarity. For poetic texts like haikus, emotional tone replaces standard grammar checks. Translations are broken down and assessed span by span, scored on a 1–5 scale, and then ranked. To reduce bias, the study compares three evaluation strategies: single-step, two-step, and a more reliable interleaving method. A “no-reasoning” method is also tested but lacks transparency and is prone to bias. Finally, human experts reviewed selected translations to compare their judgments with those of the system, offering insights into its alignment with professional standards. 

    The researchers evaluated translation ranking systems using datasets with human scores, comparing their TransEvalnia models (Qwen and Sonnet) with MT-Ranker, COMET-22/23, XCOMET-XXL, and MetricX-XXL. On WMT-2024 en-es, MT-Ranker performed best, likely due to rich training data. However, in most other datasets, TransEvalnia matched or outperformed MT-Ranker; for example, Qwen’s no-reasoning approach led to a win on WMT-2023 en-de. Position bias was analyzed using inconsistency scores, where interleaved methods often had the lowest bias (e.g., 1.04 on Hard en-ja). Human raters gave Sonnet the highest overall Likert scores (4.37–4.61), with Sonnet’s evaluations correlating well with human judgment (Spearman’s R~0.51–0.54). 

    In conclusion, TransEvalnia is a prompting-based system for evaluating and ranking translations using LLMs like Claude 3.5 Sonnet and Qwen. The system provides detailed scores across key quality dimensions, inspired by the MQM framework, and selects the better translation among options. It often matches or outperforms MT-Ranker on several WMT language pairs, although MetricX-XXL leads on WMT due to fine-tuning. Human raters found Sonnet’s outputs to be reliable, and scores showed a strong correlation with human judgments. Fine-tuning Qwen improved performance notably. The team also explored solutions to position bias, a persistent challenge in ranking systems, and shared all evaluation data and code. 


    Check out the Paper here. Feel free to check our Tutorials page on AI Agent and Agentic AI for various applications. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

    The post TransEvalnia: A Prompting-Based System for Fine-Grained, Human-Aligned Translation Evaluation Using LLMs appeared first on MarkTechPost.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleGoogle AI Introduces the Test-Time Diffusion Deep Researcher (TTD-DR): A Human-Inspired Diffusion Framework for Advanced Deep Research Agents
    Next Article The Battlefield 6 PC system requirements are here in time for Open Beta, and they’re kind, too — here are the specs you’ll need to run the explosive FPS

    Related Posts

    Machine Learning

    How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

    August 31, 2025
    Machine Learning

    Introducing auto scaling on Amazon SageMaker HyperPod

    August 30, 2025
    Leave A Reply Cancel Reply

    For security, use of Google's reCAPTCHA service is required which is subject to the Google Privacy Policy and Terms of Use.

    Continue Reading

    CVE-2024-56343 – IBM Verify Identity Access Digital Credentials Denial of Service

    Common Vulnerabilities and Exposures (CVEs)

    $17 Million Black Market Empire Crushed in Cybercrime Sting

    Development

    This premium Lenovo laptop is nearly checks all the boxes for me – including battery life

    News & Updates

    CVE-2025-49589 – PCSX2 Stack-Based Buffer Overflow Vulnerability

    Common Vulnerabilities and Exposures (CVEs)

    Highlights

    CVE-2025-53908 – RomM Path Traversal Vulnerability

    July 16, 2025

    CVE ID : CVE-2025-53908

    Published : July 16, 2025, 8:15 p.m. | 6 hours, 47 minutes ago

    Description : RomM is a self-hosted rom manager and player. Versions prior to 3.10.3 and 4.0.0-beta.3 have an authenticated path traversal vulnerability in the `/api/raw` endpoint. Anyone running the latest version of RomM and has multiple users, even unprivileged users, such as the kiosk user in the official implementation, may be affected. This allows the leakage of passwords and users that may be stored on the system. Versions 3.10.3 and 4.0.0-beta.3 contain a patch.

    Severity: 0.0 | NA

    Visit the link for more details, such as CVSS details, affected products, timeline, and more…

    CVE-2025-5299 – SourceCodester Client Database Management System Unrestricted File Upload Vulnerability

    May 28, 2025

    CVE-2024-51103 – PHPGURUKUL Student Management System SQL Injection Vulnerability

    May 23, 2025

    CVE-2024-45094 – IBM DS8900F and DS8A00 HMC Stored Cross-Site Scripting Vulnerability

    May 27, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.