
    Gretel AI Releases a New Multilingual Synthetic Financial Dataset on HuggingFace 🤗 for AI Developers Tackling Personally Identifiable Information PII Detection

    June 14, 2024

    Detecting personally identifiable information (PII) in documents involves navigating regulations such as the EU’s General Data Protection Regulation (GDPR) and a range of U.S. financial data protection laws. These regulations mandate the secure handling of sensitive data, including customer identifiers, financial records, and other personal information. Because data formats vary widely and each domain has its own requirements, PII detection needs a tailored approach, which is where Gretel’s synthetic dataset comes into play.

    Empowering PII Detection with Domain-Specific Datasets

    Every organization has unique data formats and domain-specific requirements that may not be fully captured by existing Named Entity Recognition (NER) models or sample datasets. Gretel’s Navigator tool allows developers to create customized synthetic datasets tailored to their needs, which reduces the time and cost associated with traditional manual labeling. With Gretel Navigator, developers can rapidly create large-scale, diverse, privacy-preserving datasets that reflect the characteristics and challenges of their domain, so that PII detection models are better prepared for real-world scenarios and unique document types. One such dataset is Gretel’s multilingual Financial Document Dataset, released on Hugging Face this week.

    Key Features of the Synthetic Financial Document Dataset

    Extensive Records: 55,940 records, partitioned into 50,776 training samples and 5,164 test samples.

    Coverage of Financial Document Formats: Includes 100 distinct financial document formats with 20 specific subtypes for each format.

    Synthetic PII: Contains 29 distinct PII types, aligned with Python Faker library generators for easy detection and replacement.

    Full-Length Documents: The average length of documents is 1,357 characters.

    Multilingual Support: Supports English, Spanish, Swedish, German, Italian, Dutch, and French.

    Quality Assurance: The LLM-as-a-Judge technique with the Mistral-7B language model is used to ensure data quality and evaluate conformance, quality, toxicity, bias, and groundedness.
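
The "detection and replacement" idea behind the Faker-aligned PII types can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the span format (character offsets plus a label), the label names, and the stand-in surrogate generators, which in practice could be swapped for Faker provider methods. It is not Gretel's own pipeline.

```python
from typing import Any, Callable, Dict, List

# Hypothetical span format: character offsets plus a PII label.
Span = Dict[str, Any]

# Stand-in generators keyed by label (assumed label names). In practice
# these could be Faker provider methods that return realistic values.
SURROGATES: Dict[str, Callable[[], str]] = {
    "name": lambda: "Jane Doe",
    "iban": lambda: "GB00TEST00000000000000",
}


def replace_pii(text: str, spans: List[Span]) -> str:
    """Replace each labeled span with a surrogate value."""
    # Work from the last span to the first so earlier offsets stay valid
    # even when a replacement changes the length of the text.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        start, end, label = span["start"], span["end"], span["label"]
        # Unknown labels fall back to a visible placeholder such as <label>.
        make_value = SURROGATES.get(label, lambda: f"<{label}>")
        text = text[:start] + make_value() + text[end:]
    return text


text = "Payee: Maria Keller, IBAN DE89370400440532013000."
name, iban = "Maria Keller", "DE89370400440532013000"
spans = [
    {"start": text.index(name), "end": text.index(name) + len(name), "label": "name"},
    {"start": text.index(iban), "end": text.index(iban) + len(iban), "label": "iban"},
]
print(replace_pii(text, spans))
# prints: Payee: Jane Doe, IBAN GB00TEST00000000000000.
```

Processing spans from right to left is the key design choice: a surrogate rarely has the same length as the value it replaces, so replacing left to right would shift every later offset.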


    Use Cases of the Synthetic Financial Document Dataset

    Training NER Models: Detect and label PII in various domains.

    Testing PII Scanning Systems: Evaluate PII scanning systems on realistic, full-length synthetic documents unique to different domains.

    Evaluating De-identification Systems: Assess the performance of de-identification systems on realistic documents containing PII.

    Developing Data Privacy Solutions: Create and test data privacy solutions for the financial industry.
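
As a rough illustration of the scanner-testing use case, the sketch below scores a PII scanner's predictions against labeled spans using exact-match precision, recall, and F1. The tuple span format, the label names, and the example offsets are assumptions made for this example; exact matching is only one of several reasonable scoring conventions.

```python
from typing import List, Set, Tuple

# A span is (start, end, label); offsets and label names are illustrative.
Span = Tuple[int, int, str]


def span_scores(gold: List[Span], predicted: List[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over labeled spans."""
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(predicted)
    # A prediction counts only if offsets and label both match exactly.
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [(7, 19, "name"), (26, 48, "iban"), (60, 70, "date")]
predicted = [(7, 19, "name"), (26, 48, "account_number")]
precision, recall, f1 = span_scores(gold, predicted)
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")
# prints: 0.50 0.33 0.40
```

Here the scanner finds the name correctly, mislabels the IBAN, and misses the date, so one of two predictions is right (precision 0.50) and one of three gold spans is recovered (recall 0.33).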

    Quality Assessment and Usage

    The quality of this dataset’s synthetic PII and documents is ensured through the LLM-as-a-Judge technique using the Mistral-7B language model. Each generated record is evaluated based on several criteria: conformance, quality, toxicity, bias, and groundedness. Records with high toxicity or bias scores or low groundedness, quality, or conformance scores are removed to maintain the dataset’s integrity. This rigorous quality assessment ensures the dataset is reliable and suitable for training robust PII detection models.
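
The filtering step described above can be sketched as a simple threshold gate over per-record judge scores. The score scale (0 to 100), the threshold values, and the record layout are all hypothetical; the article does not state the cutoffs or field names Gretel used.

```python
from typing import Dict, List

# Hypothetical thresholds on an assumed 0-100 scale; the article does not
# state the actual cutoffs.
MIN_SCORE = {"conformance": 80, "quality": 80, "groundedness": 80}
MAX_SCORE = {"toxicity": 10, "bias": 10}


def passes_quality_gate(scores: Dict[str, float]) -> bool:
    """Keep a record only if it clears every threshold."""
    high_enough = all(scores[name] >= limit for name, limit in MIN_SCORE.items())
    low_enough = all(scores[name] <= limit for name, limit in MAX_SCORE.items())
    return high_enough and low_enough


def filter_records(records: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Drop records with low quality scores or high toxicity/bias scores."""
    return [record for record in records if passes_quality_gate(record)]


records = [
    {"conformance": 95, "quality": 90, "groundedness": 88, "toxicity": 0, "bias": 2},
    {"conformance": 92, "quality": 91, "groundedness": 60, "toxicity": 0, "bias": 1},
    {"conformance": 97, "quality": 93, "groundedness": 90, "toxicity": 35, "bias": 0},
]
kept = filter_records(records)
print(len(kept))
# prints: 1
```

Only the first record survives: the second fails the groundedness minimum and the third exceeds the toxicity maximum.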


    Supporting the Open Data Community

    Gretel’s commitment to promoting open data and fostering collaboration within the AI community is evident in the release of this dataset. Gretel aims to accelerate the development of more accurate, unbiased, and trustworthy AI systems by sharing high-quality, diverse, and ethically sourced datasets. The synthetic financial document dataset is just one example of this commitment, providing a valuable resource for developers and researchers to build robust PII detection solutions.

    Conclusion

    Gretel’s synthetic financial document dataset is a useful addition to the tooling for PII detection. By providing a comprehensive dataset alongside a way to generate customized ones, Gretel helps AI developers build more effective, domain-specific PII detection systems. The initiative addresses technical challenges of PII detection and supports data privacy and compliance across industries. As AI evolves, resources like this dataset can help ensure that sensitive data is handled securely and responsibly.

    Colab Notebook

    Sources

    https://gretel.ai/blog/gretel-unlocks-pii-detection-with-synthetic-financial-document-dataset

    https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual

    https://www.linkedin.com/feed/update/urn:li:activity:7206723643932868608/

    The post Gretel AI Releases a New Multilingual Synthetic Financial Dataset on HuggingFace 🤗 for AI Developers Tackling Personally Identifiable Information PII Detection appeared first on MarkTechPost.
