TransEvalnia: A Prompting-Based System for Fine-Grained, Human-Aligned Translation Evaluation Using LLMs

Translation systems powered by LLMs have become so advanced that they can outperform human translators in some cases. As LLMs improve, especially in complex tasks such as document-level or literary translation, it becomes increasingly challenging to make further progress and to accurately evaluate that progress. Traditional automated metrics, such as BLEU, are still used but fail to explain why a score is given. With translation quality reaching near-human levels, users require evaluations that extend beyond numerical metrics, providing reasoning across key dimensions, such as accuracy, terminology, and audience suitability. This transparency enables users to assess evaluations, identify errors, and make more informed decisions.

While BLEU has long been the standard for evaluating machine translation (MT), its usefulness is fading as modern systems now rival or outperform human translators. Newer metrics, such as BLEURT, COMET, and MetricX, fine-tune powerful language models to assess translation quality more accurately. Large models, such as GPT and PaLM2, can now offer zero-shot or structured evaluations, even generating MQM-style feedback. Techniques such as pairwise comparison have also enhanced alignment with human judgments. Recent studies have shown that asking models to explain their choices improves decision quality; yet, such rationale-based methods are still underutilized in MT evaluation, despite their growing potential.

Researchers at Sakana AI have developed TransEvalnia, a translation evaluation and ranking system that uses prompting-based reasoning to assess translation quality. It provides detailed feedback along selected MQM dimensions, ranks translations, and assigns scores on a 5-point Likert scale, including an overall rating. The system performs competitively with, or even better than, the leading MT-Ranker model across several language pairs and tasks, including English-Japanese and Chinese-English. When tested with LLMs such as Claude 3.5 Sonnet and Qwen-2.5, its judgments aligned well with human ratings. The team also tackled position bias and has released all data, reasoning outputs, and code for public use.

The methodology centers on evaluating translations across key quality aspects, including accuracy, terminology, audience suitability, and clarity. For poetic texts like haikus, emotional tone replaces standard grammar checks. Translations are broken down and assessed span by span, scored on a 1–5 scale, and then ranked. To reduce bias, the study compares three evaluation strategies: single-step, two-step, and a more reliable interleaving method. A “no-reasoning” method is also tested but lacks transparency and is prone to bias. Finally, human experts reviewed selected translations to compare their judgments with those of the system, offering insights into its alignment with professional standards.
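To make the scoring and ranking steps more concrete, here is a minimal Python sketch. It is not the authors' code. The dimension names come from the article, but the `Evaluation` layout, the use of a plain mean as the overall score, and the `interleave` helper are all assumptions made for illustration. In particular, the article does not spell out how interleaving works; the sketch assumes it means presenting the candidates' evaluations grouped by dimension rather than by translation.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean

# Dimension names are taken from the article; everything else is hypothetical.
DIMENSIONS = ["accuracy", "terminology", "audience suitability", "clarity"]


@dataclass
class Evaluation:
    label: str                  # e.g. "Translation 1"
    scores: dict[str, int]      # dimension -> 1-5 Likert score
    rationales: dict[str, str]  # dimension -> the LLM's written reasoning

    def overall(self) -> float:
        # Assumption: the overall rating is the plain mean of dimension scores.
        return mean(self.scores[d] for d in DIMENSIONS)


def interleave(evals: list[Evaluation]) -> str:
    """Lay out evaluations dimension by dimension instead of one
    translation at a time (one possible reading of 'interleaving')."""
    lines = []
    for dim in DIMENSIONS:
        lines.append(f"[{dim}]")
        for ev in evals:
            lines.append(f"{ev.label}: {ev.scores[dim]}/5. {ev.rationales[dim]}")
    return "\n".join(lines)


def rank(evals: list[Evaluation]) -> list[Evaluation]:
    """Order candidates from best to worst by overall score."""
    return sorted(evals, key=lambda ev: ev.overall(), reverse=True)


def make_eval(label: str, scores: list[int]) -> Evaluation:
    """Build an Evaluation with placeholder rationales for the demo."""
    return Evaluation(
        label=label,
        scores=dict(zip(DIMENSIONS, scores)),
        rationales={d: f"placeholder rationale for {d}" for d in DIMENSIONS},
    )


candidates = [
    make_eval("Translation 1", [3, 4, 4, 3]),
    make_eval("Translation 2", [5, 4, 4, 5]),
]
best = rank(candidates)[0]
print(best.label, best.overall())
print(interleave(candidates))
```

In a real pipeline the scores and rationales would come from an LLM response, and the interleaved text would be passed to a final ranking prompt. Both steps are omitted here.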

The researchers evaluated translation ranking systems on datasets with human scores, comparing their TransEvalnia models (Qwen and Sonnet) with MT-Ranker, COMET-22/23, XCOMET-XXL, and MetricX-XXL. On WMT-2024 en-es, MT-Ranker performed best, likely due to rich training data. On most other datasets, however, TransEvalnia matched or outperformed MT-Ranker; for example, Qwen’s no-reasoning approach won on WMT-2023 en-de. Position bias was analyzed using inconsistency scores, where interleaved methods often had the lowest bias (e.g., 1.04 on Hard en-ja). Human raters gave Sonnet the highest overall Likert scores (4.37–4.61), and Sonnet’s evaluations correlated well with human judgment (Spearman’s ρ ≈ 0.51–0.54).
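The two kinds of measurement mentioned above can be sketched in a few lines of standard-library Python. `spearman` computes the usual rank correlation (Pearson correlation of tie-averaged ranks). `order_flip_rate` is a simple stand-in for a position-bias check: the fraction of pairs whose winner changes when the two translations are presented in the opposite order. It is not the paper's inconsistency score, whose definition the article does not give, and all function names here are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def order_flip_rate(winners_ab, winners_ba):
    """Fraction of pairs whose chosen winner differs between the two
    presentation orders (an assumed, simplified bias measure)."""
    flips = sum(a != b for a, b in zip(winners_ab, winners_ba))
    return flips / len(winners_ab)


# Made-up numbers, for illustration only.
system_scores = [4, 5, 3, 4, 2]
human_scores = [4, 4, 3, 5, 1]
print(spearman(system_scores, human_scores))
print(order_flip_rate(["A", "B", "A", "A"], ["A", "A", "A", "B"]))
```

A flip rate of 0 would mean the ranker picks the same winner regardless of presentation order; higher values indicate more sensitivity to position.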

In conclusion, TransEvalnia is a prompting-based system for evaluating and ranking translations using LLMs like Claude 3.5 Sonnet and Qwen. The system provides detailed scores across key quality dimensions, inspired by the MQM framework, and selects the better translation among options. It often matches or outperforms MT-Ranker on several WMT language pairs, although MetricX-XXL leads on WMT due to fine-tuning. Human raters found Sonnet’s outputs to be reliable, and scores showed a strong correlation with human judgments. Fine-tuning Qwen improved performance notably. The team also explored solutions to position bias, a persistent challenge in ranking systems, and shared all evaluation data and code.


The post TransEvalnia: A Prompting-Based System for Fine-Grained, Human-Aligned Translation Evaluation Using LLMs appeared first on MarkTechPost.
