    TabTreeFormer: Enhancing Synthetic Tabular Data Generation Through Tree-Based Inductive Biases and Dual-Quantization Tokenization

    January 9, 2025

    The generation of synthetic tabular data has become increasingly important in fields like healthcare and financial services, where privacy concerns often restrict the use of real-world data. Autoregressive transformers, masked transformers, and diffusion models with transformers have shown significant success in generating high-quality synthetic data with strong fidelity, utility, and privacy guarantees, but they face an important limitation. Unlike their counterparts in computer vision and natural language processing, which use domain-specific priors to enhance performance, current transformer models for tabular data generation largely ignore the inductive biases specific to tabular data and have yet to incorporate domain-specific architectural enhancements effectively.

    Early approaches to tabular data generation depended heavily on MLPs and CNNs as backbone architectures, with GANs and VAEs as generation methods. The field then progressed to transformer-based solutions, introducing models like TabMT, which uses masked transformers with ordered embedding, and TabSyn, which employs diffusion models in latent space. While these approaches improved utility and fidelity, they failed to account for crucial inductive biases specific to tabular data, such as non-smoothness and low-correlated features. In parallel, tree-based gradient boosting algorithms like XGBoost, LightGBM, and CatBoost have dominated traditional tabular data tasks due to their natural ability to capture feature relationships.

    Researchers from the Asian Institute of Digital Finance, National University of Singapore, Betterdata AI, Singapore, Tufts University, Massachusetts, United States, and the Department of Electrical and Computer Engineering, Singapore have proposed TabTreeFormer, a hybrid transformer architecture. The model integrates a tree-based component that preserves tabular-specific inductive biases, focusing on non-smooth and low-correlated patterns through its discrete and non-rotationally invariant design. A key innovation is the dual-quantization tokenizer, which captures multimodal continuous distributions and improves the learning of numerical value distributions. The architecture improves data generation quality and significantly reduces model size by limiting vocabulary size and sequence length.
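The article does not spell out how the dual-quantization tokenizer works, so the following is only an illustrative sketch of one way to encode a numeric value with two quantizers. It assumes (this is not confirmed by the source) that a coarse k-means cluster ID captures which mode a value belongs to, and that a within-cluster quantile bin captures where the value sits inside that mode. The names `kmeans_1d`, `nearest`, and `DualQuantizer` are hypothetical.

```python
from bisect import bisect_right
from statistics import mean


def nearest(centers, v):
    """Index of the center closest to v."""
    return min(range(len(centers)), key=lambda i: abs(v - centers[i]))


def kmeans_1d(values, k, iters=20):
    """Tiny 1-D Lloyd's k-means; centers start at evenly spaced order statistics."""
    s = sorted(values)
    centers = [s[int((i + 0.5) * len(s) / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(centers, v)].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


class DualQuantizer:
    """Assumed scheme: (k-means cluster ID, quantile bin inside that cluster)."""

    def __init__(self, n_clusters=2, n_bins=4):
        self.n_clusters = n_clusters
        self.n_bins = n_bins

    def fit(self, values):
        self.centers = kmeans_1d(values, self.n_clusters)
        members = [[] for _ in self.centers]
        for v in values:
            members[nearest(self.centers, v)].append(v)
        # Inner bin edges are the within-cluster quantiles.
        self.edges = []
        for m in members:
            m.sort()
            self.edges.append(
                [m[int(j * len(m) / self.n_bins)] for j in range(1, self.n_bins)]
                if m else []
            )
        return self

    def encode(self, v):
        """Return the two tokens for v: (cluster ID, bin ID within the cluster)."""
        c = nearest(self.centers, v)
        return c, bisect_right(self.edges[c], v)


# A bimodal column: one mode near 1, another near 9.
column = [1.0, 1.1, 1.2, 1.3, 9.0, 9.1, 9.2, 9.3]
quantizer = DualQuantizer(n_clusters=2, n_bins=2).fit(column)
print([quantizer.encode(v) for v in (1.0, 1.3, 9.0, 9.3)])
```

Under this assumed scheme, the cluster IDs and the bin IDs can share two small vocabularies (`n_clusters + n_bins` tokens) instead of one vocabulary of `n_clusters * n_bins` tokens at the same resolution, which is one plausible reading of how a tokenizer could limit vocabulary size.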

    The TabTreeFormer architecture integrates LightGBM as its tree-based component, with hyperparameters optimized using Optuna, and employs DistilGPT2 as its transformer backbone. The model comes in three configurations: Small (1M parameters), Medium (5M parameters), and Large (40M parameters), offering flexibility in deployment based on computational resources and performance requirements. The implementation was tested on an NVIDIA RTX 4090 and evaluated across 10 diverse datasets from OpenML, including adult, bank, breast, credit, diabetes, and iris. Its performance was benchmarked against a broad range of existing methods: ARF, CTAB-GAN+, CTGAN, TVAE, TabDDPM, TabSyn, GReaT, and REaLTabFormer.

    TabTreeFormer performs strongly across multiple evaluation metrics. In terms of fidelity, it achieves results comparable to the baselines in marginal density distribution (Shape) while capturing multimodal distributions better than other autoregressive transformers. The model significantly outperforms all baselines in pair-wise correlation (Trend) metrics, excelling at correlations involving categorical features and showing marked improvement on datasets with larger feature sets. In utility evaluations measured by Machine Learning Efficiency (MLE), TabTreeFormer-S matches the performance of leading baselines like REaLTabFormer and TabSyn, while the Medium and Large versions surpass them by approximately 40%.
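The article does not define how the Shape and Trend scores are computed. The sketch below assumes they follow definitions commonly used in synthetic data evaluation for numerical columns: Shape as the complement of the two-sample Kolmogorov-Smirnov statistic, and Trend as a correlation similarity between a pair of columns. The function names are hypothetical, and categorical columns (which the article highlights) would need different statistics that are not shown here.

```python
from bisect import bisect_right
from math import sqrt


def ks_complement(real, synth):
    """Assumed Shape score: 1 - KS statistic; 1.0 means identical empirical CDFs."""
    r, s = sorted(real), sorted(synth)
    points = sorted(set(r) | set(s))
    # Largest gap between the two empirical CDFs over all observed values.
    d = max(
        abs(bisect_right(r, x) / len(r) - bisect_right(s, x) / len(s))
        for x in points
    )
    return 1.0 - d


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def correlation_similarity(real_x, real_y, synth_x, synth_y):
    """Assumed Trend score: 1 - |corr_real - corr_synth| / 2, in [0, 1]."""
    return 1.0 - abs(pearson(real_x, real_y) - pearson(synth_x, synth_y)) / 2.0


# Identical marginals score 1.0; fully disjoint marginals score 0.0.
print(ks_complement([1, 2, 3, 4], [1, 2, 3, 4]))
print(ks_complement([1, 2], [3, 4]))
# A synthetic pair that flips the sign of a perfect correlation scores 0.0.
print(correlation_similarity([1, 2, 3], [1, 2, 3], [1, 2, 3], [3, 2, 1]))
```

A dataset-level score would typically average these per-column and per-pair values, but the exact aggregation used in the paper is not stated in this article.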

    In summary, TabTreeFormer represents a significant advancement in synthetic tabular data generation: it combines the inductive biases of tree-based models with a transformer architecture and introduces a dual-quantization tokenizer for representing numerical values. The evaluation across multiple datasets and metrics shows superior performance in fidelity, utility, and privacy compared to existing approaches. The model's ability to capture complex data distributions and inter-feature relationships while maintaining a smaller footprint makes it a promising solution for practical applications in privacy-preserving data generation.


    All credit for this research goes to the researchers of this project.

    The post TabTreeFormer: Enhancing Synthetic Tabular Data Generation Through Tree-Based Inductive Biases and Dual-Quantization Tokenization appeared first on MarkTechPost.
