How to Remove Strikethrough Text from PDFs Using Python

In this blog post, I will share my journey of developing a Python-based solution to remove strikethrough text from PDFs. This solution is specifically designed for PDFs where strikethrough is applied as a style rather than an annotation.

The Challenge

Strikethrough text in PDFs can be tricky to handle, especially when it is applied as a style rather than as an annotation. Standard PDF manipulation libraries often fall short in these cases, so I built a practical workaround in Python.

The Solution

The solution involves three main steps: converting the PDF to a DOCX file, removing the strikethrough text from the DOCX file, and converting the modified DOCX file back to a PDF.

Dependencies

Before diving into the code, install the necessary Python dependencies. You will need:
• pdf2docx for converting PDF to DOCX
• python-docx for manipulating DOCX files
• docx2pdf for converting DOCX back to PDF

You can install these dependencies using pip:

pip install pdf2docx python-docx docx2pdf
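
Before running the script, it can help to confirm that all three packages are importable. The sketch below is one way to do that with the standard library only. The helper name missing_packages is hypothetical, and the mapping reflects the import names used later in this post (python-docx is imported as docx).

```python
from importlib.util import find_spec

# pip package name -> module name used in the import statements
REQUIRED = {
    "pdf2docx": "pdf2docx",
    "python-docx": "docx",
    "docx2pdf": "docx2pdf",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All dependencies are installed.")
```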

Step-by-Step Guide to Remove Strikethrough Text from PDFs

Step 1: Convert PDF to DOCX

The first step is to convert the PDF file to a DOCX file. This allows us to manipulate the text more easily. We use the pdf2docx library for this conversion. Here is the code for the conversion function:

import sys

from pdf2docx import Converter

def convert_pdf_to_word(pdf_file, docx_file):
    """Convert PDF to DOCX format."""
    try:
        cv = Converter(pdf_file)
        cv.convert(docx_file, start=0, end=None)
        cv.close()
        print(f"Converted PDF to DOCX: {pdf_file} -> {docx_file}")
    except Exception as e:
        print(f"Error during PDF to DOCX conversion: {e}")
        sys.exit(1)

In this function, we create an instance of the Converter class, passing pdf_file as an argument. The convert method performs the conversion; start=0 and end=None cover every page. The close method then releases the resources the converter holds. On success, a confirmation message is printed. If an error occurs, the exception is caught, an error message is printed, and the script exits with status 1.
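
One caveat with the function above: if convert raises, cv.close() is never reached. A minimal sketch of one way to guarantee cleanup uses contextlib.closing from the standard library. The function name convert_pdf_to_word_safely and the converter_factory parameter are hypothetical additions; the parameter lets the function be exercised without a real PDF, and by default it falls back to pdf2docx's Converter.

```python
from contextlib import closing


def convert_pdf_to_word_safely(pdf_file, docx_file, converter_factory=None):
    """Convert PDF to DOCX, closing the converter even if conversion fails."""
    if converter_factory is None:
        # Imported lazily so the sketch can be tried without pdf2docx installed
        from pdf2docx import Converter
        converter_factory = Converter

    # closing() calls cv.close() when the block exits, with or without an error
    with closing(converter_factory(pdf_file)) as cv:
        cv.convert(docx_file, start=0, end=None)
    return docx_file
```

Any exception still propagates to the caller, so the existing try/except wrapper could stay around the call.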

Step 2: Remove Strikethrough Text

Once we have the DOCX file, we can remove the strikethrough text. This step involves iterating through the paragraphs and runs in the DOCX file and checking for the strikethrough style. We use the python-docx library for this task. Here is the code for the strikethrough removal function:

import sys

from docx import Document

def remove_strikethrough_text(docx_file):
    """Remove all strikethrough text from a DOCX file."""
    try:
        document = Document(docx_file)
        modified = False
        for paragraph in document.paragraphs:
            for run in paragraph.runs:
                if run.font.strike:
                    print(f"Removing strikethrough text: {run.text}")
                    run.text = ''
                    modified = True
        if modified:
            modified_docx_file = docx_file.replace('.docx', '_modified.docx')
            document.save(modified_docx_file)
            print(f"Strikethrough text removed. Saved to: {modified_docx_file}")
            return modified_docx_file
        else:
            print("No strikethrough text found.")
            return docx_file
    except Exception as e:
        print(f"Error during strikethrough text removal: {e}")
        sys.exit(1)

In this function, we create an instance of the Document class, passing docx_file as an argument. We iterate through each paragraph in the document and then through each run within that paragraph. If the strike attribute of the run’s font is True, we print the text being removed and set the run’s text to an empty string. If any strikethrough text was removed, we save the modified document to a new file with _modified appended to the original filename and return its path. If none was found, we return the original DOCX file path.
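
Note that document.paragraphs only returns body-level paragraphs, so struck-through text inside table cells is not touched by the function above. The sketch below is one possible extension, not part of the original script. The helper names iter_body_and_table_paragraphs and strip_struck_runs are hypothetical, and the code relies only on the python-docx attributes paragraphs, tables, rows, cells, runs, and font.strike. It does not descend into nested tables, headers, or footers.

```python
def iter_body_and_table_paragraphs(document):
    """Yield body paragraphs, then paragraphs from every top-level table cell."""
    yield from document.paragraphs
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                yield from cell.paragraphs


def strip_struck_runs(paragraphs):
    """Blank the text of every struck-through run; return how many were cleared."""
    removed = 0
    for paragraph in paragraphs:
        for run in paragraph.runs:
            # font.strike is True, False, or None (inherited); only True is
            # truthy, so inherited strikethrough is left alone here
            if run.font.strike and run.text:
                run.text = ""
                removed += 1
    return removed
```

With python-docx, this would be used by opening the file with document = Document(docx_file), calling strip_struck_runs(iter_body_and_table_paragraphs(document)), and then saving with document.save(...). Merged cells can show up more than once in row.cells; the run.text check keeps them from being counted twice.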

Step 3: Convert DOCX Back to PDF

The final step is to convert the modified DOCX file back to a PDF file. This ensures that the strikethrough text is removed in the final PDF. We use the docx2pdf library for this conversion. Here is the code for the conversion function:

import sys

from docx2pdf import convert

def convert_docx_to_pdf(docx_file, output_pdf):
    """Convert DOCX back to PDF format."""
    try:
        convert(docx_file, output_pdf)
        print(f"Converted DOCX to PDF: {docx_file} -> {output_pdf}")
    except Exception as e:
        print(f"Error during DOCX to PDF conversion: {e}")
        sys.exit(1)

This function calls docx2pdf’s convert function, passing docx_file and output_pdf as arguments. On success, a confirmation message is printed. If an error occurs, the exception is caught, an error message is printed, and the script exits with status 1.
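
docx2pdf drives Microsoft Word to do the conversion, so this step needs Word installed and, as far as the docx2pdf documentation describes, runs only on Windows and macOS. A small guard like the following sketch can fail early on other platforms. The function name docx2pdf_supported is hypothetical, and the platform list is an assumption based on that documentation.

```python
import platform


def docx2pdf_supported(system=None):
    """Return True on platforms where docx2pdf can drive Microsoft Word."""
    system = system or platform.system()
    # platform.system() reports "Windows" on Windows and "Darwin" on macOS
    return system in ("Windows", "Darwin")


if __name__ == "__main__":
    if not docx2pdf_supported():
        print("docx2pdf needs Microsoft Word on Windows or macOS.")
```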

Main Execution Block

The following block of code is the main execution section of the script. It starts by checking if the script is being run directly. It then verifies that the correct number of command-line arguments is provided and that the specified PDF file exists. If these conditions are met, the script defines intermediate file paths and performs the three main steps: converting the PDF to a DOCX file, removing strikethrough text from the DOCX file, and converting the modified DOCX back to a PDF. After completing these steps, it prints the location of the modified PDF file and cleans up any intermediate files. If errors occur during execution, they are caught and printed, and the script exits gracefully.

import os
import sys

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: python {sys.argv[0]} <pdf_file_path>")
        sys.exit(1)

    pdf_file = sys.argv[1]

    if not os.path.exists(pdf_file):
        print(f"Error: File not found - {pdf_file}")
        sys.exit(1)

    try:
        # Define intermediate file paths
        base_name = os.path.splitext(pdf_file)[0]
        temp_docx_file = f"{base_name}.docx"
        output_pdf_file = f"{base_name}_modified.pdf"

        # Step 1: Convert PDF to DOCX
        convert_pdf_to_word(pdf_file, temp_docx_file)

        # Step 2: Remove strikethrough text
        final_docx_file = remove_strikethrough_text(temp_docx_file)

        # Step 3: Convert modified DOCX back to PDF
        convert_docx_to_pdf(final_docx_file, output_pdf_file)

        print(f"Modified PDF saved to: {output_pdf_file}")

        # Clean up intermediate DOCX files
        if os.path.exists(temp_docx_file):
            os.remove(temp_docx_file)
        if final_docx_file != temp_docx_file and os.path.exists(final_docx_file):
            os.remove(final_docx_file)

    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)
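
One limitation of the block above is that the intermediate DOCX files are left behind when a step fails, because the cleanup code runs only after all three steps succeed. A minimal sketch of an alternative keeps the intermediates in a temporary directory that is deleted automatically. The helper name plan_paths and the example filename report.pdf are hypothetical.

```python
import tempfile
from pathlib import Path


def plan_paths(pdf_file, work_dir):
    """Return (temp_docx, modified_docx, output_pdf) for a given input PDF."""
    pdf_path = Path(pdf_file)
    work_dir = Path(work_dir)
    temp_docx = work_dir / f"{pdf_path.stem}.docx"
    modified_docx = work_dir / f"{pdf_path.stem}_modified.docx"
    # The final PDF is written next to the input file
    output_pdf = pdf_path.with_name(f"{pdf_path.stem}_modified.pdf")
    return temp_docx, modified_docx, output_pdf


if __name__ == "__main__":
    # The directory and everything in it is removed when the block exits
    with tempfile.TemporaryDirectory() as work_dir:
        for path in plan_paths("report.pdf", work_dir):
            print(path)
```

The three conversion steps would run inside the with block, so the DOCX files disappear even when one of the steps exits early.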

Complete Script

import sys
import os
from pdf2docx import Converter
from docx import Document
from docx2pdf import convert

def convert_pdf_to_word(pdf_file, docx_file):
    """Convert PDF to DOCX format."""
    try:
        cv = Converter(pdf_file)
        cv.convert(docx_file, start=0, end=None)
        cv.close()
        print(f"Converted PDF to DOCX: {pdf_file} -> {docx_file}")
    except Exception as e:
        print(f"Error during PDF to DOCX conversion: {e}")
        sys.exit(1)

def remove_strikethrough_text(docx_file):
    """Remove all strikethrough text from a DOCX file."""
    try:
        document = Document(docx_file)
        modified = False

        for paragraph in document.paragraphs:
            for run in paragraph.runs:
                if run.font.strike:
                    print(f"Removing strikethrough text: {run.text}")
                    run.text = ''
                    modified = True

        if modified:
            modified_docx_file = docx_file.replace('.docx', '_modified.docx')
            document.save(modified_docx_file)
            print(f"Strikethrough text removed. Saved to: {modified_docx_file}")
            return modified_docx_file
        else:
            print("No strikethrough text found.")
            return docx_file
    except Exception as e:
        print(f"Error during strikethrough text removal: {e}")
        sys.exit(1)

def convert_docx_to_pdf(docx_file, output_pdf):
    """Convert DOCX back to PDF format."""
    try:
        convert(docx_file, output_pdf)
        print(f"Converted DOCX to PDF: {docx_file} -> {output_pdf}")
    except Exception as e:
        print(f"Error during DOCX to PDF conversion: {e}")
        sys.exit(1)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: python {sys.argv[0]} <pdf_file_path>")
        sys.exit(1)

    pdf_file = sys.argv[1]

    if not os.path.exists(pdf_file):
        print(f"Error: File not found - {pdf_file}")
        sys.exit(1)

    try:
        # Define intermediate file paths
        base_name = os.path.splitext(pdf_file)[0]
        temp_docx_file = f"{base_name}.docx"
        output_pdf_file = f"{base_name}_modified.pdf"

        # Step 1: Convert PDF to DOCX
        convert_pdf_to_word(pdf_file, temp_docx_file)

        # Step 2: Remove strikethrough text
        final_docx_file = remove_strikethrough_text(temp_docx_file)

        # Step 3: Convert modified DOCX back to PDF
        convert_docx_to_pdf(final_docx_file, output_pdf_file)

        print(f"Modified PDF saved to: {output_pdf_file}")

        # Clean up intermediate DOCX files
        if os.path.exists(temp_docx_file):
            os.remove(temp_docx_file)
        if final_docx_file != temp_docx_file and os.path.exists(final_docx_file):
            os.remove(final_docx_file)

    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)

Run the Script

Execute the script by running the following command, replacing <script_name> with the name of your script and <pdf_file_path> with the path to your PDF file:

python <script_name>.py <pdf_file_path>

This Python-based solution removes strikethrough text from PDFs by combining the pdf2docx, python-docx, and docx2pdf libraries. By converting the PDF to DOCX, deleting the struck-through runs, and converting the result back to PDF, it removes the strikethrough text while leaving the remaining text in place. Because the document makes a round trip through DOCX, compare the output with the original to confirm that the layout survived the conversion.
