Chest X-rays are essential for diagnosing pulmonary and cardiac conditions, including pneumonia and lung lesions, and are widely used in resource-limited settings. The rise of AI has greatly advanced automated medical image analysis, which benefits from large, curated datasets. Recently, the focus has shifted to multimodal models, such as large language models and vision-language models, which require extensive and diverse training data. This study uses Digitally Reconstructed Radiography (DRR) to generate synthetic X-ray images from the CT-RATE dataset, which pairs binary pathology labels with detailed radiological reports, making it valuable for training AI classifiers for disease diagnosis.
Researchers from the Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Clinical Center, and the National Center for Biotechnology Information, National Library of Medicine, have introduced DRR-RATE, a dataset of synthetic X-ray images generated from computed tomography (CT) data using ray tracing techniques. Unlike conventional radiographs, DRRs offer controlled and reproducible imaging conditions by simulating the path of X-rays through CT volumes. Each DRR pixel's intensity is determined by the attenuation coefficients of the tissues along the ray path, reflecting cumulative X-ray absorption. DRRs have important applications in radiation therapy planning, surgical preparation, education, and algorithm development: they facilitate precise dose calculations in therapy and accurate 2D-3D image registration for surgery, and their realistic depictions of various conditions support medical education. Ongoing research aims to improve DRR generation speed and image quality.
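The attenuation-based rendering described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' pipeline: it assumes a simple parallel-beam projection along one volume axis (rather than full perspective ray tracing) and a standard linear conversion from Hounsfield units to attenuation coefficients, then applies the Beer-Lambert law to each ray.

```python
import numpy as np

def simple_drr(ct_hu: np.ndarray, voxel_size_mm: float = 1.0,
               mu_water: float = 0.02) -> np.ndarray:
    """Project a CT volume (in Hounsfield units) into a frontal DRR.

    Parallel-beam approximation: each output pixel integrates the linear
    attenuation coefficients along axis 0 of the volume, then applies the
    Beer-Lambert law I = I0 * exp(-integral of mu dl).
    """
    # Convert HU to linear attenuation coefficients (mm^-1):
    # mu = mu_water * (1 + HU/1000), clipped so air does not go negative.
    mu = np.clip(mu_water * (1.0 + ct_hu / 1000.0), 0.0, None)
    # Line integral of mu along the projection axis.
    path_integral = mu.sum(axis=0) * voxel_size_mm
    intensity = np.exp(-path_integral)  # transmitted fraction of the beam
    # Invert and rescale to [0, 1] so dense tissue appears bright,
    # matching radiograph display convention.
    drr = 1.0 - intensity
    return (drr - drr.min()) / (np.ptp(drr) + 1e-8)

# Toy volume: air everywhere, with one dense soft-tissue-like block.
vol = np.full((64, 64, 64), -1000.0)   # air, in HU
vol[:, 24:40, 24:40] = 300.0           # dense block spanning the ray path
image = simple_drr(vol)                # rays through the block come out bright
```

Real DRR generators additionally model source geometry, beam divergence, and energy-dependent attenuation, which this sketch omits.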
Several significant large-scale chest X-ray datasets have been pivotal in advancing medical imaging research. For instance, ChestX-ray8 and ChestX-ray14, released by the US National Institutes of Health (NIH), contain over 112,000 scans from more than 30,000 individuals, with disease labels extracted from radiological reports using NLP techniques. CheXpert, another notable dataset, includes 224,316 radiographs from 65,240 patients at Stanford Health Care, also labeled using NLP methods. PadChest, comprising over 160,000 images, offers detailed annotations from radiologists at Hospital San Juan in Spain. MIMIC-CXR and VinDr-CXR further expand research capabilities with extensive datasets annotated by radiologists from major medical centers. Collectively, these datasets support research in disease detection and AI applications in radiology and related fields.
The DRR-RATE dataset, an extension of the CT-RATE dataset, features 50,188 chest CT volumes from 21,304 patients, each paired with a radiology text report and binary labels for 18 pathology classes. The dataset was expanded by varying the reconstruction matrix applied to the original DICOM studies, enhancing its utility for medical imaging research. Patient demographics show a diverse age range and gender distribution across the training and validation subsets. DRR images are generated with ray tracing algorithms that simulate X-ray projections from CT data, enabling multimodal research that bridges the CT and X-ray imaging modalities. The dataset is publicly accessible under a CC BY-NC-SA license.
In experiments with the DRR-RATE dataset, CheXnet was trained and evaluated for chest X-ray classification, with performance compared against the CheXpert dataset. Using five-fold cross-validation, CheXnet achieved notable results: Cardiomegaly and Pleural Effusion showed robust performance, with AUC scores of 0.92 and 0.95, respectively, indicating high predictive accuracy. Atelectasis and Consolidation exhibited moderate AUC values of 0.72 and 0.74, suggesting decent but less consistent performance, while Lung Nodule and Lung Opacity had lower AUC scores of around 0.66 and 0.67, indicating room for improvement. When CheXnet was trained on CheXpert and tested on DRR-RATE, performance decreased slightly for most conditions, owing to domain differences between real radiographs and DRR images.
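The per-pathology AUC evaluation reported above can be reproduced with a standard multi-label setup. The sketch below is illustrative only: the class list is a subset from the article, and the labels and scores are simulated stand-ins for a classifier's sigmoid outputs (such as CheXnet's), since the actual predictions are not available here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Subset of the pathology classes evaluated in the article.
classes = ["Cardiomegaly", "Pleural Effusion", "Atelectasis",
           "Consolidation", "Lung Nodule", "Lung Opacity"]

rng = np.random.default_rng(0)
n = 500
# Simulated binary ground-truth labels, one column per pathology.
y_true = rng.integers(0, 2, size=(n, len(classes)))
# Simulated scores loosely correlated with the labels (hypothetical
# stand-in for a model's per-class sigmoid probabilities).
y_score = np.clip(y_true * 0.3 + rng.random((n, len(classes))), 0.0, 1.0)

# One ROC-AUC per pathology, as reported in the experiments.
aucs = {c: roc_auc_score(y_true[:, i], y_score[:, i])
        for i, c in enumerate(classes)}
for name, auc in aucs.items():
    print(f"{name}: AUC = {auc:.3f}")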
DRR-RATE is a synthetic chest X-ray dataset derived from CT scans, offering labeled images paired with radiological reports. By rendering CT-derived pathologies in X-ray form, DRR-RATE enriches training data for diagnostic models and supports research across imaging modalities. Evaluating baseline CheXnet models trained on the DRR-RATE and CheXpert datasets revealed robust performance, particularly in detecting Cardiomegaly, Consolidation, and Pleural Effusion. However, challenges remain for subtler conditions such as Atelectasis, Lung Nodule, and Lung Opacity, potentially due to resolution limitations in DRR images. Nonetheless, DRR-RATE marks a significant stride in synthesizing medical imaging data, bolstering AI-driven diagnostic capabilities and advancing medical research.
The post DRR-RATE: A Large Scale Synthetic Chest X-ray Dataset Complete with Labels and Radiological Reports appeared first on MarkTechPost.