Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      Sunshine And March Vibes (2025 Wallpapers Edition)

      June 2, 2025

      The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

      June 2, 2025

      How To Fix Largest Contentful Paint Issues With Subpart Analysis

      June 2, 2025

      How To Prevent WordPress SQL Injection Attacks

      June 2, 2025

      How Red Hat just quietly, radically transformed enterprise server Linux

      June 2, 2025

      OpenAI wants ChatGPT to be your ‘super assistant’ – what that means

      June 2, 2025

      The best Linux VPNs of 2025: Expert tested and reviewed

      June 2, 2025

      One of my favorite gaming PCs is 60% off right now

      June 2, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      `document.currentScript` is more useful than I thought.

      June 2, 2025
      Recent

      `document.currentScript` is more useful than I thought.

      June 2, 2025

      Adobe Sensei and GenAI in Practice for Enterprise CMS

      June 2, 2025

      Over The Air Updates for React Native Apps

      June 2, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      You can now open ChatGPT on Windows 11 with Win+C (if you change the Settings)

      June 2, 2025
      Recent

      You can now open ChatGPT on Windows 11 with Win+C (if you change the Settings)

      June 2, 2025

      Microsoft says Copilot can use location to change Outlook’s UI on Android

      June 2, 2025

      TempoMail — Command Line Temporary Email in Linux

      June 2, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»This AI Study Saves Researchers from Metadata Chaos with a Comparative Analysis of Extraction Techniques for Scholarly Documents

    This AI Study Saves Researchers from Metadata Chaos with a Comparative Analysis of Extraction Techniques for Scholarly Documents

    January 15, 2025

    Scientific metadata in research literature holds immense significance, as highlighted by flourishing research in scientometrics—a discipline dedicated to analyzing scholarly literature. Metadata improves the findability and accessibility of scientific documents by indexing and linking papers in a massive graph. Today, the research community has realized the importance of metadata. However, its awareness and consideration were negligible in the past, especially for non-technical disciplines such as social sciences, which made their publications less discoverable. Over time, many standards have been established to ensure uniformity and standardization. Moreover, metadata automation has progressed significantly, aided by advanced natural language processing (NLP) and computer vision techniques. NLP, in particular, has been leading metadata extraction. Still, there remains a significant issue that hinders its application in small and mid-sized publications, which often have a variety of templates and layouts. This article discusses the latest research comparing methods for metadata extraction from scholarly documents.

    Researchers from the Fraunhofer Institute for Applied Information Technology took on this challenge and explored various feature learning and classification approaches for scientific PDFs. The authors employed techniques across domains, from classical methods to the latest innovations. They utilized techniques such as Conditional Random Fields, BiLSTM with BERT representations, and innovative multimodal and TextMap methods. The approaches chosen by the authors overcome the limitations of generative LLMs, which require data in a specified structure, making them incompatible with diverse publication formats. The authors leveraged the strengths of BERT and other architectures to address the uniqueness and variability of different documents, including embedded multimodal content.

    The research team also curated two challenging labeled datasets to address the lack of ground truths for training DNN-based tools. For the first dataset, SSOAR-MVD, they synthesized 50,000 samples using predefined templates and available data. The other S-PMRD dataset was derived from the Semantic Scholar Open Research Corpus.

    In the paper, the research team assumed that metadata is typically present only on the first page of a PDF document and that its availability may vary across documents. They initially employed Conditional Random Fields to tackle this task by dividing it into two sub-goals: identification and extraction. Identification was facilitated by analyzing font changes, including color, size, style, and alignment variations. The identified lines then served as input to the extraction layer of CRF. The authors subsequently used BiLSTM with BERT embeddings and a BiLSTM-CRF approach with embeddings obtained from BERT. They also experimented with Grobid, a machine-learning library designed to parse sections of documents such as headers, titles, author information, or other metadata into XML/TEI format.

    Furthermore, they employed Fast RCNN and vision-language-based models. Lastly, the authors conducted experiments using the TextMap approach, which applies a two-phase processing method to handle spatial representation and semantic mapping. They innovatively integrated spatial and semantic components through a carefully designed interpolation process.

    The results from the above experiments were noteworthy. The first model, CRF, performed remarkably well for attributes with structured and predictable formats, such as dates, with an F1 score reaching 0.73. However, as data patterns diminished and complexity and variability increased, such as in the case of titles or authors’ names, its performance dwindled. BiLSTM demonstrated robustness in capturing the sequence and context of data, with an F1 score reaching as high as 0.9 for abstracts and dates. The BiLSTM-CRF performed moderately, as the capabilities of LSTM supported CRF, but it could not surpass the performance of BiLSTM alone. Grobid, despite its simple design, exceeded previous scores, achieving the highest F1-score of 0.96 in author extraction. Fast RCNN demonstrated high precision and recall across various metadata categories, achieving higher accuracies in recognizing titles, abstracts, and journals. In the TextMap method, the best output was obtained with Word2Vec embeddings, where performance reached 0.9 in F1-score.

    Conclusion: The authors compared various classical and advanced machine-learning tools for accurate metadata extraction. The paper highlighted the strengths and shortcomings of each method, enabling users to select the most suitable approach based on dataset content, desired accuracy, and physical constraints.


    Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 65k+ ML SubReddit.

    🚨 Recommend Open-Source Platform: Parlant is a framework that transforms how AI agents make decisions in customer-facing scenarios. (Promoted)

    The post This AI Study Saves Researchers from Metadata Chaos with a Comparative Analysis of Extraction Techniques for Scholarly Documents appeared first on MarkTechPost.

    Source: Read More 

    Hostinger
    Facebook Twitter Reddit Email Copy Link
    Previous ArticleMinMo: A Multimodal Large Language Model with Approximately 8B Parameters for Seamless Voice Interaction
    Next Article CRW – The Group Connection App

    Related Posts

    Machine Learning

    How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

    June 2, 2025
    Machine Learning

    MiMo-VL-7B: A Powerful Vision-Language Model to Enhance General Visual Understanding and Multimodal Reasoning

    June 2, 2025
    Leave A Reply Cancel Reply

    Continue Reading

    Archana Venugopal Takes Charge as Senior VP & CISO at National Commodity Exchange

    Development

    Exclusive report: What’s next for Surface in 2025, including a possible 11-inch mini Surface Laptop?

    Development

    CVE-2025-31198 – Apple macOS Symlink Path Handling Vulnerability

    Common Vulnerabilities and Exposures (CVEs)
    CISA Warns of CrushFTP Exploit Letting Attackers Bypass Authentication

    CISA Warns of CrushFTP Exploit Letting Attackers Bypass Authentication

    Development

    Highlights

    Development

    How to Start a Creative Venture With No Money

    July 29, 2024

    Starting a creative venture with no money can be challenging, but it’s definitely possible with…

    Art Exhibition Management System using Python Django and MySQL

    June 9, 2024

    Microsoft Makes Windows Recall Opt-in, Encrypted in Response to Privacy Concerns

    June 7, 2024

    The AI app gold gush: Move fast and build smart

    February 17, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.