
    TabTreeFormer: Enhancing Synthetic Tabular Data Generation Through Tree-Based Inductive Biases and Dual-Quantization Tokenization

    January 9, 2025

Synthetic tabular data generation has become increasingly important in fields like healthcare and financial services, where privacy concerns often restrict the use of real-world data. While autoregressive transformers, masked transformers, and transformer-based diffusion models have shown significant success in generating high-quality synthetic data with strong fidelity, utility, and privacy guarantees, they face an important limitation. Unlike their counterparts in computer vision and natural language processing, which use domain-specific priors to enhance performance, current transformer models for tabular data generation largely ignore the inductive biases specific to tabular data, and the field has yet to incorporate such domain-specific architectural enhancements effectively.

Early approaches to tabular data generation depended heavily on MLPs and CNNs as backbone architectures, using GANs and VAEs as generation methods. The field then progressed to transformer-based solutions, introducing models like TabMT, which uses masked transformers with ordered embeddings, and TabSyn, which runs diffusion in a latent space. While these approaches demonstrated improved utility and fidelity, they failed to account for crucial inductive biases specific to tabular data, such as non-smoothness and low feature correlation. In parallel, tree-based gradient boosting algorithms like XGBoost, LightGBM, and CatBoost have dominated traditional tabular tasks thanks to their natural ability to capture feature relationships.

Researchers from the Asian Institute of Digital Finance at the National University of Singapore, Betterdata AI in Singapore, Tufts University in Massachusetts, United States, and the Department of Electrical and Computer Engineering, Singapore, have proposed TabTreeFormer, a hybrid transformer architecture. The model integrates a tree-based component that preserves tabular-specific inductive biases, capturing non-smooth and low-correlated patterns through its discrete and non-rotationally-invariant design. A key innovation is the dual-quantization tokenizer, which captures multimodal continuous distributions and enhances the learning of numerical value distributions. The architecture improves generation quality while significantly reducing model size by limiting vocabulary size and sequence length.
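
The article doesn't detail the tokenizer's mechanics, so the sketch below is only one plausible reading of a dual-quantization scheme: a coarse K-means pass separates the modes of a multimodal numeric column, and a fine quantile pass places each value within its mode, so every number becomes a pair of discrete tokens. The class name, parameters, and binning choices are all hypothetical.

```python
# Hypothetical sketch of a dual-quantization tokenizer for one numeric column.
# Assumption: a coarse K-means pass captures the modes of a multimodal
# distribution; a fine per-mode quantile pass encodes position within a mode.
import numpy as np
from sklearn.cluster import KMeans

class DualQuantTokenizer:
    def __init__(self, n_coarse=8, n_fine=16):
        self.n_coarse, self.n_fine = n_coarse, n_fine

    def fit(self, x):
        x = np.asarray(x, dtype=float).reshape(-1, 1)
        self.kmeans = KMeans(n_clusters=self.n_coarse, n_init=10).fit(x)
        # Per-mode quantile edges define the fine-grained bins.
        self.edges = []
        for c in range(self.n_coarse):
            vals = x[self.kmeans.labels_ == c, 0]
            qs = np.linspace(0, 1, self.n_fine + 1)[1:-1]
            self.edges.append(np.quantile(vals, qs) if len(vals) else np.array([]))
        return self

    def transform(self, x):
        x = np.asarray(x, dtype=float)
        coarse = self.kmeans.predict(x.reshape(-1, 1))
        fine = np.array([np.searchsorted(self.edges[c], v)
                         for v, c in zip(x, coarse)])
        return np.stack([coarse, fine], axis=1)  # two tokens per value

# A bimodal column becomes (mode token, within-mode token) pairs.
x = np.concatenate([np.random.normal(0, 1, 500), np.random.normal(10, 1, 500)])
tokens = DualQuantTokenizer(n_coarse=2, n_fine=8).fit(x).transform(x)
```

Because both vocabularies are small (here 2 + 8 symbols per column), a scheme like this keeps the transformer's vocabulary and sequence length compact, which is consistent with the reported reduction in model size.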

The TabTreeFormer architecture integrates LightGBM as its tree-based component, with hyperparameters optimized using Optuna, and employs Distill-GPT2 as its transformer backbone. The model comes in three configurations, Small (1M parameters), Medium (5M parameters), and Large (40M parameters), offering deployment flexibility based on computational resources and performance requirements. The implementation was tested on an NVIDIA RTX 4090 and evaluated across 10 diverse OpenML datasets, including adult, bank, breast, credit, diabetes, and iris. The model’s performance was benchmarked against a comprehensive range of existing methods, including ARF, CTAB-GAN+, CTGAN, TVAE, TabDDPM, TabSyn, GReaT, and REaLTabFormer.
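
The article names the tuning tool but not the search space; the sketch below shows how the LightGBM component's hyperparameters might be tuned with Optuna. The parameter ranges, the cross-validated accuracy objective, and the stand-in dataset are illustrative assumptions.

```python
# Illustrative sketch: tuning LightGBM hyperparameters with Optuna.
# The search space and objective are assumptions, not the paper's setup;
# load_breast_cancer stands in for a dataset like OpenML's 'breast'.
import lightgbm as lgb
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
    }
    model = lgb.LGBMClassifier(**params, verbose=-1)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```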

TabTreeFormer shows strong performance across multiple evaluation metrics. In terms of fidelity, it achieves comparable results in marginal density distribution (Shape) while capturing multimodal distributions better than other autoregressive transformers. The model significantly outperforms all baselines on pair-wise correlation (Trend) metrics, excelling at correlations involving categorical features and showing marked improvement on datasets with larger feature sets. In utility evaluations measured by Machine Learning Efficiency (MLE), TabTreeFormer-S matches leading baselines like REaLTabFormer and TabSyn, while the Medium and Large versions surpass them by approximately 40%.
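
Machine Learning Efficiency is conventionally measured with a train-on-synthetic, test-on-real protocol: fit a downstream model on the synthetic table and score it on held-out real data. A minimal sketch of that protocol follows; the choice of LightGBM as the downstream model and ROC-AUC as the metric are illustrative assumptions, not necessarily the paper's exact setup.

```python
# Sketch of a Machine Learning Efficiency (MLE) check: train a downstream
# classifier on synthetic data, evaluate it on held-out real data. The
# downstream model (LightGBM) and metric (ROC-AUC) are assumptions here.
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def mle_score(real_X, real_y, synth_X, synth_y):
    # Hold out part of the real data for testing; synthetic data is train-only.
    _, X_test, _, y_test = train_test_split(
        real_X, real_y, test_size=0.3, random_state=0)
    model = lgb.LGBMClassifier(verbose=-1).fit(synth_X, synth_y)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

A generator scores well on MLE when this number approaches the score of the same model trained on the real training split.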

In this paper, the researchers introduced TabTreeFormer, a significant advance in synthetic tabular data generation that combines the inductive biases of tree-based models with a transformer architecture and adds a dual-quantization tokenizer for numerical value representation. The comprehensive evaluation across multiple datasets and metrics shows superior performance in fidelity, utility, and privacy compared to existing approaches. The model’s ability to capture complex data distributions and inter-feature relationships while maintaining a smaller footprint makes it a promising solution for practical applications in privacy-preserving data generation.


Check out the Paper. All credit for this research goes to the researchers of this project.

    The post TabTreeFormer: Enhancing Synthetic Tabular Data Generation Through Tree-Based Inductive Biases and Dual-Quantization Tokenization appeared first on MarkTechPost.
