Re-LAION 5B Dataset Released: Improving Safety and Transparency in Web-Scale Datasets for Foundation Model Research Through Rigorous Content Filtering

LAION, a prominent non-profit organization dedicated to advancing machine learning research by developing open and transparent datasets, has recently released Re-LAION 5B. This updated version of the LAION-5B dataset marks a milestone in the organization's ongoing efforts to ensure the safety and legal compliance of web-scale datasets used in foundation model research. The new dataset addresses critical issues related to potentially illegal content, notably Child Sexual Abuse Material (CSAM), identified in the original LAION-5B.

Background and Motivation

The original LAION-5B dataset, released in 2022, was designed as a web-scale dataset of text paired with links to images, and it has been instrumental in training and evaluating foundation models. These models, which improve their performance as they scale in terms of data, model size, and computational resources, are crucial for advancing the field of machine learning. However, the vastness and openness of the internet, from which the data was sourced, made it very difficult to ensure that the dataset was entirely free of illegal content.

In a December 2023 report led by researcher David Thiel, the Stanford Internet Observatory identified 1,008 links within the LAION-5B dataset that potentially pointed to CSAM. This discovery prompted LAION to take immediate action, temporarily withdrawing the dataset from public access. The findings underscored the limitations of the filtering mechanisms originally employed by LAION, despite the organization's best efforts to exclude such material.

The Re-LAION 5B Update

Re-LAION 5B is the result of a comprehensive safety revision carried out in collaboration with several key partners, including the Internet Watch Foundation (IWF), the Canadian Centre for Child Protection (C3P), and the Stanford Internet Observatory. These organizations provided LAION with lists of MD5 and SHA hashes corresponding to known CSAM and other illegal content. Using these hashes, LAION systematically identified and removed 2,236 suspect links from the dataset. This total includes the 1,008 links initially identified by the Stanford Internet Observatory.

Importantly, the filtering process used to create Re-LAION 5B allowed potentially illegal content to be removed without requiring LAION's researchers to directly access or inspect that content, thereby avoiding legal and ethical pitfalls. The updated dataset, now free of links to suspected CSAM, is available in two versions: Re-LAION-5B research and Re-LAION-5B research-safe. The former retains a higher threshold for potentially sensitive content, while the latter further filters out the majority of Not Safe For Work (NSFW) material.
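The hash-based filtering described above can be illustrated with a minimal Python sketch. It is not LAION's actual pipeline: it assumes each metadata row already carries a precomputed hash of the image it links to, and the field names ("url", "md5"), the function name filter_by_hash, and the sample values are all hypothetical. The point it shows is that rows can be dropped by comparing hash strings alone, without ever opening the linked content.

```python
from typing import Iterable, Iterator


def filter_by_hash(
    rows: Iterable[dict],
    blocked_hashes: Iterable[str],
    hash_field: str = "md5",
) -> Iterator[dict]:
    """Yield only the rows whose precomputed hash is not on the blocklist.

    Only hash strings are compared; the linked content is never fetched.
    """
    # Normalize case so "DEADBEEF" and "deadbeef" are treated as the same hash.
    blocked = {h.lower() for h in blocked_hashes}
    for row in rows:
        value = row.get(hash_field)
        if value is not None and value.lower() in blocked:
            continue  # suspect link: drop the row
        yield row


# Hypothetical metadata rows and blocklist; the hash values are placeholders.
rows = [
    {"url": "https://example.org/a.jpg", "md5": "0123abcd"},
    {"url": "https://example.org/b.jpg", "md5": "DEADBEEF"},
    {"url": "https://example.org/c.jpg", "md5": None},
]
blocklist = {"deadbeef"}

kept = list(filter_by_hash(rows, blocklist))
print([row["url"] for row in kept])
```

In this sketch, rows with no hash pass through unchanged; a stricter pipeline might instead drop or flag them for separate handling.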

Ensuring Ongoing Safety and Compliance

LAION's commitment to safety and transparency extends beyond the release of Re-LAION 5B. The organization has made the metadata from the updated dataset available to third parties, enabling them to clean their derivatives of LAION-5B by applying similar filtering techniques. This approach enhances the safety of derivative datasets and preserves the usability of LAION-5B as a reference dataset for ongoing research.
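One way a third party might clean a derivative dataset against the updated metadata is sketched below. This is an assumption about how such a cleanup could work, not a documented LAION procedure: it treats the set of URLs still present in the updated metadata as an allowlist and keeps only derivative rows whose URL appears in it. The function name clean_derivative, the "url" key, and the sample data are hypothetical.

```python
from typing import Iterable, List


def clean_derivative(
    derivative_rows: Iterable[dict],
    retained_urls: Iterable[str],
    key: str = "url",
) -> List[dict]:
    """Keep only derivative rows whose URL is still present in the updated metadata."""
    retained = set(retained_urls)  # set membership keeps each lookup O(1)
    return [row for row in derivative_rows if row.get(key) in retained]


# Hypothetical data: URLs retained in the updated metadata, and a derivative subset.
updated_metadata_urls = [
    "https://example.org/a.jpg",
    "https://example.org/c.jpg",
]
derivative = [
    {"url": "https://example.org/a.jpg", "caption": "a cat"},
    {"url": "https://example.org/b.jpg", "caption": "removed upstream"},
]

cleaned = clean_derivative(derivative, updated_metadata_urls)
print([row["url"] for row in cleaned])
```

An allowlist is the conservative choice here: any row that cannot be matched to the updated metadata is dropped, including rows the derivative may have added from other sources.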

The release of Re-LAION 5B also sets a new standard for safety in creating web-scale datasets. By partnering with expert organizations like IWF and C3P, LAION has demonstrated the importance of collaboration in addressing the challenges posed by the vast and often unregulated body of content on the public web. This collaborative approach offers a model for other organizations engaged in similar work, highlighting the value of shared expertise and resources in ensuring the safety and integrity of research datasets.

A Call to Action for the Research Community

In light of the improvements made in Re-LAION 5B, LAION strongly encourages all researchers and organizations still using the original LAION-5B dataset to migrate to the updated version. By doing so, they can ensure that their work is based on a dataset that has been vetted for safety and legal compliance. LAION also recommends that organizations creating datasets from public web data partner with entities like IWF and C3P to obtain hash lists and other resources necessary for effective filtering.

LAION's experience underscores the need for the broader research community to adopt and adhere to best practices for handling potential safety issues. This includes timely and direct communication of findings and proactive measures to address risks associated with large-scale web-derived datasets.

Conclusion

Re-LAION 5B is a significant step forward in LAION's mission to provide open, transparent, and safe datasets for the machine learning research community. By addressing the issues identified in the original LAION-5B dataset and setting a new standard for safety in web-scale datasets, LAION has reaffirmed its commitment to advancing the field of ML responsibly and ethically. As researchers and professionals continue to explore the potential of foundation models, datasets like Re-LAION 5B will play an important role in ensuring that this work is conducted on a solid and safe foundation.

The post Re-LAION 5B Dataset Released: Improving Safety and Transparency in Web-Scale Datasets for Foundation Model Research Through Rigorous Content Filtering appeared first on MarkTechPost.
