Despite promising results on synthetic benchmarks (e.g. Vending-Bench, SpreadsheetBench, DSBench), frontier models consistently underperform once they are deployed in complex,…