
    CoSyn: An AI Framework that Leverages the Coding Capabilities of Text-only Large Language Models (LLMs) to Automatically Create Synthetic Text-Rich Multimodal Data

    February 26, 2025

    Vision-language models (VLMs) have demonstrated impressive capabilities in general image understanding, but they face significant challenges when processing text-rich visual content such as charts, documents, diagrams, and screenshots. These specialized images require reasoning that combines textual comprehension with spatial understanding, a skill set critical for analyzing scientific literature, improving accessibility features, and enabling AI agents to function effectively in real-world environments. Current VLMs struggle with these tasks primarily because high-quality training data that realistically represents the diverse text-embedded visual formats found in practice is scarce. This data limitation has created a performance gap in scenarios that require nuanced interpretation of structured visual information, and it hampers deployment in specialized domains where text-rich image processing is essential.

    Several approaches have been developed to enhance vision-language models for processing visual content. Early architectures explored different integration strategies, including cross-attention mechanisms, Q-Former structures, and MLP projection layers, to bridge visual and linguistic features. However, these models often suffer from a significant imbalance: their language components substantially outweigh their visual processing capabilities, which leads to hallucinations when high-quality training data is scarce. Existing benchmarks for text-rich image understanding (charts, documents, infographics, diagrams, screenshots) remain limited in size, scope, and diversity, making them suitable for evaluation but inadequate for comprehensive training. Previous synthetic data generation efforts have typically focused on narrow domains, using small sets of chart types with handcrafted question templates. Some approaches use text-only LLMs to generate annotations from tables or descriptions, while others explore code-based rendering of synthetic charts. Despite these advances, current synthetic datasets remain constrained in topic diversity, figure variety, and rendering methodology. These limitations hinder generalization to novel, out-of-distribution tasks.

    A team of researchers from the University of Pennsylvania and the Allen Institute for Artificial Intelligence introduced the Code Guided Synthetic Data Generation System (CoSyn), a flexible framework that addresses the challenges in text-rich image processing by creating diverse synthetic multimodal training data. The system uses the code generation capabilities of text-only LLMs to produce both data and rendering code for various text-rich visual formats, using 11 supported rendering tools including Python, HTML, and LaTeX. CoSyn generates not only the images but also corresponding textual instructions grounded in the underlying code representation, creating vision-language instruction-tuning datasets. The researchers used this framework to develop CoSyn-400K, a large-scale, diverse synthetic dataset designed specifically for text-rich image understanding.
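
To make the idea of code as the underlying representation concrete, here is a minimal sketch that assumes an HTML table stands in for the text-rich image. The sample data and the names `render_html_table` and `make_qa_from_data` are hypothetical and are not part of CoSyn. In CoSyn an LLM writes both the data and the rendering code, and an LLM rather than a fixed rule writes the instructions.

```python
from html import escape

# Hypothetical structured data of the kind an LLM might produce for a topic.
DATA = {
    "title": "Quarterly Book Sales",
    "columns": ["Quarter", "Units Sold"],
    "rows": [["Q1", 120], ["Q2", 95], ["Q3", 143], ["Q4", 110]],
}


def render_html_table(data):
    """Return HTML source that a browser could rasterize into an image."""
    head = "".join(f"<th>{escape(str(c))}</th>" for c in data["columns"])
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
        for row in data["rows"]
    )
    return (
        f"<table><caption>{escape(data['title'])}</caption>"
        f"<tr>{head}</tr>{body}</table>"
    )


def make_qa_from_data(data):
    """Derive a question-answer pair from the source data, never from pixels."""
    best_row = max(data["rows"], key=lambda row: row[1])
    question = f"Which quarter had the highest units sold in '{data['title']}'?"
    return {"question": question, "answer": best_row[0]}


html_source = render_html_table(DATA)
qa = make_qa_from_data(DATA)
print(html_source)
print(qa)
```

Because the answer is computed from the same source that produces the image, the question-answer pair stays consistent with what is rendered, without any model having to read the image.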

    The CoSyn system starts from a natural language query such as “generate a dataset of book covers.” It first selects one of 20 generation pipelines built on 11 rendering tools, including Matplotlib, Plotly, LaTeX, HTML, Mermaid, and specialized tools such as LilyPond for sheet music and RDKit for chemical structures. The selected pipeline then runs four stages. Topic generation, guided by sampled personas, proposes a subject for the image. Data generation populates content specific to that topic. Code generation produces executable code that renders the synthetic image with the appropriate tool. Finally, using only the code as context, the system prompts language models to generate the corresponding textual instructions, including questions, answers, and chain-of-thought explanations. To increase diversity beyond what sampling parameters alone can achieve, CoSyn draws on 200K unique personas during topic generation, which counters the tendency of language models to produce repetitive output. The implementation uses the DataDreamer library for multi-stage generation, with Claude-3.5-Sonnet for code generation and GPT-4o-mini for instruction data generation.
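
The four stages can be sketched as a chain of prompts. This is an illustrative outline under stated assumptions, not CoSyn's implementation and not the DataDreamer API: `stub_llm`, `generate_example`, the prompt wording, and the three-item persona list are all hypothetical, and the stub returns canned text so the sketch runs offline.

```python
import random

# Hypothetical persona pool; the article reports 200K personas in CoSyn.
PERSONAS = [
    "a marine biologist tracking coral reef health",
    "a high-school music teacher",
    "a small-town librarian",
]


def stub_llm(prompt):
    """Placeholder for a real LLM call; returns canned text."""
    return f"[LLM output for: {prompt[:50]}]"


def generate_example(query, rng, llm=stub_llm):
    """Run the four stages for one synthetic example."""
    persona = rng.choice(PERSONAS)
    # Stage 1: topic generation, conditioned on a sampled persona.
    topic = llm(f"As {persona}, propose a topic for: {query}")
    # Stage 2: data generation for the chosen topic.
    data = llm(f"Write the content for this topic: {topic}")
    # Stage 3: code generation that would render the data as an image.
    code = llm(f"Write rendering code for this data: {data}")
    # Stage 4: instruction generation, with only the code as context.
    instructions = llm(
        f"Given only this code, write a question, answer, and reasoning: {code}"
    )
    return {
        "persona": persona,
        "topic": topic,
        "data": data,
        "code": code,
        "instructions": instructions,
    }


example = generate_example("generate a dataset of book covers", random.Random(0))
for stage, text in example.items():
    print(f"{stage}: {text}")
```

Passing `llm` as a parameter keeps the stage logic separate from the model choice, which mirrors the article's description of different models handling code generation and instruction generation. Rendering the generated code into an actual image is omitted here.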

    The model trained on CoSyn’s synthetic data performs strongly across text-rich image understanding benchmarks. Evaluated on seven benchmarks, the 7B-parameter model achieves the highest average performance, surpassing the second-best model (Llama 3.2 11B) by 3.9%. The model ranks first on four of the seven benchmarks and second on the remaining three, which shows consistent capability across diverse text-rich image tasks. Notably, even the zero-shot version of the model, which has seen no training instances from the evaluation datasets, outperforms most competing open and closed models, including those fine-tuned on benchmark training data. This result suggests that the skills acquired from CoSyn’s synthetic data transfer to downstream tasks without domain-specific training examples. Ablation studies further show that combining synthetic data with auxiliary and evaluation datasets yields the best performance (80.9%), compared with 75.9% for models trained on evaluation data alone.
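
As a quick check on the size of the ablation gain, the two reported averages can be compared directly. The variable names below are illustrative; the two scores are the only values taken from the article.

```python
# Average scores reported in the ablation, in percent.
eval_only = 75.9   # trained on evaluation data alone
combined = 80.9    # synthetic + auxiliary + evaluation data

# Absolute gain in percentage points and gain relative to the baseline.
absolute_gain = round(combined - eval_only, 1)
relative_gain = round(100 * (combined - eval_only) / eval_only, 1)

print(f"absolute gain: {absolute_gain} points")
print(f"relative gain: {relative_gain}%")
```

The difference is 5.0 percentage points, or about 6.6% relative to the evaluation-data-only baseline.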

    The CoSyn framework uses synthetic data generation to improve vision-language model performance on text-rich image understanding tasks. By drawing on the code generation capabilities of text-only LLMs, the system creates diverse, high-quality training data that helps models generalize across domains. The analysis indicates that CoSyn-generated data mitigates biases present in existing datasets, resulting in models that perform robustly on realistic, human-written queries rather than only on template-based questions. The reported improvements in zero-shot learning, multi-hop reasoning, and novel domain adaptation highlight the role synthetic data can play in developing VLMs that handle complex text-rich visual content in practical applications.


    All credit for this research goes to the researchers of this project.

    The post CoSyn: An AI Framework that Leverages the Coding Capabilities of Text-only Large Language Models (LLMs) to Automatically Create Synthetic Text-Rich Multimodal Data appeared first on MarkTechPost.
