CS-Bench: A Bilingual (Chinese-English) Benchmark Dedicated to Evaluating the Performance of LLMs in Computer Science

Large language models (LLMs) have significantly shaped artificial intelligence and show strong potential across many fields. However, enabling LLMs to use computer science knowledge effectively, and so serve people more efficiently, remains a key challenge. Existing studies cover multiple fields, computer science among them, but none offers a comprehensive evaluation focused specifically on LLMs' performance in computer science. Without such an evaluation, it is hard to assess the field thoroughly or to guide LLM development toward stronger computer science capabilities.

Recent research has explored LLMs' potential in various industries and scientific fields. Studies of LLMs in computer science, however, fall into two main categories: broad evaluation benchmarks in which computer science makes up only a small fraction, and explorations of specific LLM applications within computer science. Neither approach provides a comprehensive evaluation of LLMs' foundational knowledge and reasoning abilities in the field. And while individual capabilities such as mathematics, coding, and logical reasoning have been well studied, research on how they combine and relate to one another remains sparse.

Researchers from Beijing University of Posts and Telecommunications propose CS-Bench, which they present as the first benchmark dedicated to evaluating LLMs' performance in computer science. CS-Bench is characterized by high quality, diverse task forms, varying capacities, and bilingual evaluation. It comprises approximately 5,000 carefully curated test items spanning 26 sections across 4 key computer science domains. The benchmark includes multiple-choice, assertion, fill-in-the-blank, and open-ended questions to better simulate real-world scenarios and to assess LLMs' robustness to different task formats. CS-Bench evaluates both knowledge-type and reasoning-type questions and supports evaluation in both Chinese and English.

CS-Bench covers four key domains: Data Structure and Algorithm (DSA), Computer Organization (CO), Computer Network (CN), and Operating System (OS). These are divided into 26 fine-grained subfields, and the varied task forms add assessment dimensions and simulate real-world scenarios. The data comes from several sources, including publicly available online channels, adapted blog articles, and authorized teaching materials. A team of computer science graduates processes the data: they parse questions and answers, label question types, and ensure quality through thorough manual checks. In total, the benchmark contains 4,838 samples across the task formats and supports bilingual assessment.
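
To make the structure described above concrete, here is a minimal sketch of how one benchmark item could be represented and tallied. The field names (`domain`, `task_format`, `kind`, `language`) and the `Item` and `count_by` helpers are assumptions made for illustration; they are not CS-Bench's actual schema or tooling, and the three sample questions are invented placeholders rather than items from the benchmark.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical label sets taken from the article's description,
# not from the benchmark's real data files.
DOMAINS = ("DSA", "CO", "CN", "OS")
FORMATS = ("multiple-choice", "assertion", "fill-in-the-blank", "open-ended")
KINDS = ("knowledge", "reasoning")
LANGUAGES = ("en", "zh")


@dataclass(frozen=True)
class Item:
    """One test item; every field name here is an illustrative assumption."""
    domain: str
    task_format: str
    kind: str
    language: str
    question: str
    answer: str

    def __post_init__(self):
        # Reject labels outside the four dimensions the article lists.
        checks = (
            (self.domain, DOMAINS),
            (self.task_format, FORMATS),
            (self.kind, KINDS),
            (self.language, LANGUAGES),
        )
        for value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"unexpected label: {value!r}")


def count_by(items, field):
    """Count items per value of the given field, e.g. per domain."""
    return Counter(getattr(item, field) for item in items)


# Invented placeholder items, not real CS-Bench questions.
SAMPLE_ITEMS = [
    Item("DSA", "multiple-choice", "knowledge", "en",
         "Which structure is last-in, first-out? A) queue B) stack", "B"),
    Item("OS", "assertion", "knowledge", "en",
         "A process can contain multiple threads.", "True"),
    Item("CN", "fill-in-the-blank", "reasoning", "en",
         "A /24 IPv4 network has ___ usable host addresses.", "254"),
]

if __name__ == "__main__":
    print(count_by(SAMPLE_ITEMS, "domain"))
    print(count_by(SAMPLE_ITEMS, "kind"))
```

Labelling every item along the same four dimensions is what would allow results to be sliced by domain, task format, question type, or language later on.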

Evaluation results show that models' overall scores range from 39.86% to 72.29%. GPT-4 and GPT-4o set the highest standard on CS-Bench and are the only models to exceed 70%. Open-source models such as Qwen1.5-110B and Llama3-70B have surpassed previously strong closed-source models, and newer models improve significantly on earlier versions. All models score lower on reasoning questions than on knowledge questions, indicating that reasoning poses the greater challenge. LLMs generally perform best in Data Structure and Algorithm and worst in Operating System. Stronger models are better at applying knowledge to reasoning and are more robust across task formats.
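
The knowledge-versus-reasoning comparison above amounts to grouping per-question outcomes and comparing the group scores. The sketch below assumes plain accuracy over correct/incorrect outcomes, which may differ from how the paper actually scores each task format; `accuracy_by_group`, `overall_accuracy`, and the toy outcomes are hypothetical and are not results from CS-Bench.

```python
from collections import defaultdict


def accuracy_by_group(results):
    """Return {group: accuracy in percent} for (group, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in results:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: 100.0 * correct[group] / totals[group] for group in totals}


def overall_accuracy(results):
    """Return accuracy in percent over all outcomes, ignoring the groups."""
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(int(ok) for _, ok in results) / len(results)


# Made-up outcomes for a single imaginary model, for illustration only.
TOY_RESULTS = [
    ("knowledge", True), ("knowledge", True),
    ("knowledge", False), ("knowledge", True),
    ("reasoning", True), ("reasoning", False),
    ("reasoning", False), ("reasoning", True),
]

if __name__ == "__main__":
    scores = accuracy_by_group(TOY_RESULTS)
    gap = scores["knowledge"] - scores["reasoning"]
    print(scores)                         # {'knowledge': 75.0, 'reasoning': 50.0}
    print(overall_accuracy(TOY_RESULTS))  # 62.5
    print(gap)                            # 25.0
```

The same grouping works for any label attached to a question, so swapping the group key for a domain or task format would give the per-domain and per-format comparisons the article mentions.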

This study introduces CS-Bench to shed light on LLMs' performance in computer science. Even top-performing models like GPT-4o have significant room for improvement. The benchmark also highlights how closely computer science, mathematics, and coding abilities are connected in LLMs. These findings suggest directions for strengthening LLMs in the field and for understanding their cross-abilities and applications, paving the way for future advances in AI and computer science.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

The post CS-Bench: A Bilingual (Chinese-English) Benchmark Dedicated to Evaluating the Performance of LLMs in Computer Science appeared first on MarkTechPost.
