
    How to Remove Strikethrough Text from PDFs Using Python

    January 14, 2025

    In this blog post, I will share my journey of developing a Python-based solution to remove strikethrough text from PDFs. This solution is specifically designed for PDFs where strikethrough is applied as a style rather than an annotation.

    The Challenge

    Strikethrough text in PDFs can be tricky to handle, especially when it is applied as a style rather than as an annotation. Standard PDF manipulation libraries often fall short in these cases, so I built a practical workaround in Python.

    The Solution

    The solution involves three main steps: converting the PDF to a DOCX file, removing the strikethrough text from the DOCX file, and converting the modified DOCX file back to a PDF.
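    As a rough sketch of how the three steps chain together, the snippet below derives the file that each step would produce from a single input path. The helper name pipeline_paths and the sample path are assumptions made for illustration; the naming scheme mirrors the _modified suffix used later in this post.

```python
from pathlib import Path


def pipeline_paths(pdf_path):
    """Return the file each pipeline step would write for one input PDF.

    Hypothetical helper: all files are placed next to the input PDF.
    """
    pdf = Path(pdf_path)
    return {
        # Step 1: PDF -> DOCX
        "docx": pdf.with_suffix(".docx"),
        # Step 2: DOCX with strikethrough runs removed
        "modified_docx": pdf.with_name(f"{pdf.stem}_modified.docx"),
        # Step 3: cleaned DOCX -> PDF
        "output_pdf": pdf.with_name(f"{pdf.stem}_modified.pdf"),
    }


paths = pipeline_paths("reports/contract.pdf")
for step, path in paths.items():
    print(f"{step}: {path}")
```

    Building the names from the file stem, rather than with a plain string replace on '.docx', avoids touching a '.docx' that happens to appear elsewhere in the path.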

    Dependencies

    Before diving into the code, install the necessary Python dependencies. You will need:
    • pdf2docx for converting PDF to DOCX
    • python-docx for manipulating DOCX files
    • docx2pdf for converting DOCX back to PDF (it drives Microsoft Word, so Word must be installed; Windows and macOS only)

    You can install these dependencies using pip:

    pip install pdf2docx python-docx docx2pdf

    Step-by-Step Guide to Remove Strikethrough Text from PDFs

    Step 1: Convert PDF to DOCX

    The first step is to convert the PDF file to a DOCX file. This allows us to manipulate the text more easily. We use the pdf2docx library for this conversion. Here is the code for the conversion function:

    import sys  # needed for sys.exit below

    from pdf2docx import Converter

    def convert_pdf_to_word(pdf_file, docx_file):
        """Convert PDF to DOCX format."""
        try:
            cv = Converter(pdf_file)
            cv.convert(docx_file, start=0, end=None)
            cv.close()
            print(f"Converted PDF to DOCX: {pdf_file} -> {docx_file}")
        except Exception as e:
            print(f"Error during PDF to DOCX conversion: {e}")
            sys.exit(1)

    In this function, we create an instance of the Converter class, passing pdf_file as an argument. Calling convert with start=0 and end=None converts every page, and close releases the resources the converter holds. On success, the function prints a confirmation message. If any step raises an exception, the function prints an error message and exits the script with status 1.
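    One detail worth noting is that close is skipped if convert raises. A possible way to guarantee the cleanup is contextlib.closing, which calls close on any object when the with block exits. The sketch below uses a hypothetical FakeConverter stand-in rather than the real pdf2docx class, purely so the behaviour can be seen without a PDF on hand.

```python
from contextlib import closing


class FakeConverter:
    """Hypothetical stand-in for pdf2docx.Converter, used only for this demo."""

    def __init__(self, pdf_file):
        self.pdf_file = pdf_file
        self.closed = False

    def convert(self, docx_file):
        # Simulate a conversion failure
        raise RuntimeError(f"cannot convert {self.pdf_file}")

    def close(self):
        self.closed = True


converter = FakeConverter("broken.pdf")
try:
    # closing() calls converter.close() on exit, even when convert() raises
    with closing(converter) as cv:
        cv.convert("broken.docx")
except RuntimeError as error:
    print(f"Error during conversion: {error}")

print(f"closed: {converter.closed}")
```

    The same pattern should apply to any object that exposes a close method, so the stand-in could be swapped for the real converter.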

    Step 2: Remove Strikethrough Text

    Once we have the DOCX file, we can remove the strikethrough text. This step involves iterating through the paragraphs and runs in the DOCX file and checking for the strikethrough style. We use the python-docx library for this task. Here is the code for the strikethrough removal function:

    import sys  # needed for sys.exit below

    from docx import Document

    def remove_strikethrough_text(docx_file):
        """Remove all strikethrough text from a DOCX file."""
        try:
            document = Document(docx_file)
            modified = False
            for paragraph in document.paragraphs:
                for run in paragraph.runs:
                    if run.font.strike:
                        print(f"Removing strikethrough text: {run.text}")
                        run.text = ''
                        modified = True
            if modified:
                modified_docx_file = docx_file.replace('.docx', '_modified.docx')
                document.save(modified_docx_file)
                print(f"Strikethrough text removed. Saved to: {modified_docx_file}")
                return modified_docx_file
            else:
                print("No strikethrough text found.")
                return docx_file
        except Exception as e:
            print(f"Error during strikethrough text removal: {e}")
            sys.exit(1)

    In this function, we create an instance of the Document class, passing docx_file as an argument. We iterate through each paragraph in the document and then through each run within that paragraph. If the strike attribute of the run’s font is True, we print the text being removed and set the run’s text to an empty string. If any strikethrough text was removed, we save the modified document to a new file with _modified appended to the original filename and return that path. If none was found, we return the original DOCX file. Note that document.paragraphs covers only body paragraphs, so strikethrough text inside tables, headers, or footers is left in place.
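    The run-level check can be tried in isolation. In python-docx, font.strike is a tri-state value: True, False, or None when the setting is inherited. The sketch below applies the same loop to hypothetical stand-in objects built with SimpleNamespace, so it runs without a DOCX file; clear_struck_runs and make_run are illustrative names, not part of python-docx.

```python
from types import SimpleNamespace


def clear_struck_runs(paragraphs):
    """Blank the text of every run whose font.strike is True.

    Accepts any objects exposing .runs, .text and .font.strike, which is
    the shape of python-docx paragraphs and runs. Returns the number of
    runs that were cleared.
    """
    cleared = 0
    for paragraph in paragraphs:
        for run in paragraph.runs:
            # font.strike may be True, False, or None (inherited);
            # only an explicit True is treated as struck here.
            if run.font.strike is True:
                run.text = ""
                cleared += 1
    return cleared


def make_run(text, strike):
    """Hypothetical stand-in for a python-docx Run, for demonstration only."""
    return SimpleNamespace(text=text, font=SimpleNamespace(strike=strike))


paragraph = SimpleNamespace(runs=[
    make_run("Keep this. ", None),
    make_run("Delete this. ", True),
    make_run("Keep this too.", False),
])

count = clear_struck_runs([paragraph])
remaining = "".join(run.text for run in paragraph.runs)
print(count, repr(remaining))
```

    With a real document, the same function could be passed document.paragraphs, and also cell.paragraphs for each cell of each table in document.tables if strikethrough text inside tables should be removed as well.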

    Step 3: Convert DOCX Back to PDF

    The final step is to convert the modified DOCX file back to a PDF file. This ensures that the strikethrough text is removed in the final PDF. We use the docx2pdf library for this conversion. Here is the code for the conversion function:

    import sys  # needed for sys.exit below

    from docx2pdf import convert
    
    def convert_docx_to_pdf(docx_file, output_pdf):
        """Convert DOCX back to PDF format."""
        try:
            convert(docx_file, output_pdf)
            print(f"Converted DOCX to PDF: {docx_file} -> {output_pdf}")
        except Exception as e:
            print(f"Error during DOCX to PDF conversion: {e}")
            sys.exit(1)

    This function calls docx2pdf’s convert function, passing docx_file and output_pdf as arguments to perform the conversion. On success, it prints a confirmation message. If an error occurs, the exception is caught, an error message is printed, and the script exits with status 1.
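    Since docx2pdf relies on Microsoft Word, this step is tied to Windows and macOS. A small guard like the one below could flag an unsupported platform before the pipeline starts. The function name docx2pdf_supported is an assumption for illustration, and the check only looks at the platform; it does not confirm that Word is actually installed.

```python
import sys


def docx2pdf_supported(platform=None):
    """Return True on platforms that docx2pdf targets.

    docx2pdf drives Microsoft Word, so it targets Windows ("win32")
    and macOS ("darwin"). This does not verify that Word is installed.
    """
    platform = sys.platform if platform is None else platform
    return platform in ("win32", "darwin")


if not docx2pdf_supported():
    print("docx2pdf is unlikely to work here; consider another converter for step 3.")
```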

    Main Execution Block

    The following block is the script’s entry point. It runs only when the script is executed directly. It first checks that exactly one command-line argument was provided and that the specified PDF file exists. It then defines the intermediate file paths and performs the three steps: converting the PDF to a DOCX file, removing strikethrough text from the DOCX file, and converting the modified DOCX back to a PDF. Finally, it prints the location of the modified PDF and deletes the intermediate DOCX files. Any unexpected exception is caught and printed, and the script exits with status 1.

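    import os
    import sys

    if __name__ == "__main__":
        if len(sys.argv) != 2:
            # Tell the user how to call the script instead of exiting silently
            print(f"Usage: python {sys.argv[0]} <pdf_file_path>")
            sys.exit(1)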
    
        pdf_file = sys.argv[1]
    
        if not os.path.exists(pdf_file):
            print(f"Error: File not found - {pdf_file}")
            sys.exit(1)
    
        try:
            # Define intermediate file paths
            base_name = os.path.splitext(pdf_file)[0]
            temp_docx_file = f"{base_name}.docx"
            output_pdf_file = f"{base_name}_modified.pdf"
    
            # Step 1: Convert PDF to DOCX
            convert_pdf_to_word(pdf_file, temp_docx_file)
    
            # Step 2: Remove strikethrough text
            final_docx_file = remove_strikethrough_text(temp_docx_file)
    
            # Step 3: Convert modified DOCX back to PDF
            convert_docx_to_pdf(final_docx_file, output_pdf_file)
    
            print(f"Modified PDF saved to: {output_pdf_file}")
    
            # Clean up intermediate DOCX files
            if os.path.exists(temp_docx_file):
                os.remove(temp_docx_file)
            if final_docx_file != temp_docx_file and os.path.exists(final_docx_file):
                os.remove(final_docx_file)
    
        except Exception as e:
            print(f"Error: {e}")
            sys.exit(1)
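    As an alternative to checking sys.argv by hand, the argument handling could be delegated to argparse, which prints a usage message automatically when the argument is missing. This is a minimal sketch, not the script’s actual behaviour; the parse_args wrapper and the example file name are assumptions.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the single positional PDF path the script expects."""
    parser = argparse.ArgumentParser(
        description="Remove strikethrough text from a PDF."
    )
    parser.add_argument("pdf_file", type=Path, help="path to the input PDF")
    return parser.parse_args(argv)


# An explicit list is passed here for demonstration;
# calling parse_args() with no argument reads sys.argv instead.
args = parse_args(["example.pdf"])
print(args.pdf_file)
```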
    

    Complete Script

    import sys
    import os
    from pdf2docx import Converter
    from docx import Document
    from docx2pdf import convert
    
    def convert_pdf_to_word(pdf_file, docx_file):
        """Convert PDF to DOCX format."""
        try:
            cv = Converter(pdf_file)
            cv.convert(docx_file, start=0, end=None)
            cv.close()
            print(f"Converted PDF to DOCX: {pdf_file} -> {docx_file}")
        except Exception as e:
            print(f"Error during PDF to DOCX conversion: {e}")
            sys.exit(1)
    
    def remove_strikethrough_text(docx_file):
        """Remove all strikethrough text from a DOCX file."""
        try:
            document = Document(docx_file)
            modified = False
    
            for paragraph in document.paragraphs:
                for run in paragraph.runs:
                    if run.font.strike:
                        print(f"Removing strikethrough text: {run.text}")
                        run.text = ''
                        modified = True
    
            if modified:
                modified_docx_file = docx_file.replace('.docx', '_modified.docx')
                document.save(modified_docx_file)
                print(f"Strikethrough text removed. Saved to: {modified_docx_file}")
                return modified_docx_file
            else:
                print("No strikethrough text found.")
                return docx_file
        except Exception as e:
            print(f"Error during strikethrough text removal: {e}")
            sys.exit(1)
    
    def convert_docx_to_pdf(docx_file, output_pdf):
        """Convert DOCX back to PDF format."""
        try:
            convert(docx_file, output_pdf)
            print(f"Converted DOCX to PDF: {docx_file} -> {output_pdf}")
        except Exception as e:
            print(f"Error during DOCX to PDF conversion: {e}")
            sys.exit(1)
    
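    if __name__ == "__main__":
        if len(sys.argv) != 2:
            # Tell the user how to call the script instead of exiting silently
            print(f"Usage: python {sys.argv[0]} <pdf_file_path>")
            sys.exit(1)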
    
        pdf_file = sys.argv[1]
    
        if not os.path.exists(pdf_file):
            print(f"Error: File not found - {pdf_file}")
            sys.exit(1)
    
        try:
            # Define intermediate file paths
            base_name = os.path.splitext(pdf_file)[0]
            temp_docx_file = f"{base_name}.docx"
            output_pdf_file = f"{base_name}_modified.pdf"
    
            # Step 1: Convert PDF to DOCX
            convert_pdf_to_word(pdf_file, temp_docx_file)
    
            # Step 2: Remove strikethrough text
            final_docx_file = remove_strikethrough_text(temp_docx_file)
    
            # Step 3: Convert modified DOCX back to PDF
            convert_docx_to_pdf(final_docx_file, output_pdf_file)
    
            print(f"Modified PDF saved to: {output_pdf_file}")
    
            # Clean up intermediate DOCX files
            if os.path.exists(temp_docx_file):
                os.remove(temp_docx_file)
            if final_docx_file != temp_docx_file and os.path.exists(final_docx_file):
                os.remove(final_docx_file)
    
        except Exception as e:
            print(f"Error: {e}")
            sys.exit(1)
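    One caveat with the cleanup above: each step calls sys.exit on failure, which raises SystemExit and skips the os.remove calls, so intermediate DOCX files can be left behind. A possible alternative is to write them into a temporary directory that is deleted automatically. In the sketch below, run_with_scratch_dir and demo_steps are hypothetical names, and demo_steps only writes a placeholder file in place of the real conversion steps.

```python
import tempfile
from pathlib import Path


def run_with_scratch_dir(work):
    """Call work(scratch_dir) and delete the scratch directory afterwards.

    `work` is a hypothetical callback standing in for the conversion
    steps; it receives a Path where it may write intermediate files.
    """
    # The directory and its contents are removed when the with block
    # exits, whether work() returns normally or raises.
    with tempfile.TemporaryDirectory() as tmp:
        return work(Path(tmp))


def demo_steps(scratch_dir):
    # Stand-in for steps 1 and 2: write a placeholder intermediate file
    temp_docx = scratch_dir / "input.docx"
    temp_docx.write_bytes(b"placeholder")
    return temp_docx


leftover = run_with_scratch_dir(demo_steps)
print(leftover.exists())
```

    In a real run, only the final PDF would be written outside the scratch directory, so nothing else needs to be removed by hand.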
    

    Run the Script

    Execute the script by running the following command, replacing <script_name> with the name of your script file and <pdf_file_path> with the path to your PDF file:

    python <script_name>.py <pdf_file_path>

    This Python-based solution removes strikethrough text from PDFs by combining the pdf2docx, python-docx, and docx2pdf libraries: it converts the PDF to DOCX, deletes the struck-through runs, and converts the result back to PDF. Because the document makes a round trip through DOCX, complex layouts may shift slightly, so compare the output with the original before relying on it.
