In this blog post, I will share my journey of developing a Python-based solution to remove strikethrough text from PDFs. This solution is specifically designed for PDFs where strikethrough is applied as a style rather than an annotation.
The Challenge
Strikethrough text in PDFs can be tricky to handle, especially when it is applied as a style rather than an annotation. Standard PDF manipulation libraries often fall short in these cases. Determined to find a solution, I used Python to build a practical workaround.
The Solution
The solution involves three main steps: converting the PDF to a DOCX file, removing the strikethrough text from the DOCX file, and converting the modified DOCX file back to a PDF.
Dependencies
Before diving into the code, install the necessary Python dependencies. You will need:
• pdf2docx for converting PDF to DOCX
• python-docx for manipulating DOCX files
• docx2pdf for converting DOCX back to PDF (it drives Microsoft Word, so Word must be installed on Windows or macOS)
You can install these dependencies using pip:
pip install pdf2docx python-docx docx2pdf
Step-by-Step Guide to Remove Strikethrough Text from PDFs
Step 1: Convert PDF to DOCX
The first step is to convert the PDF file to a DOCX file. This allows us to manipulate the text more easily. We use the pdf2docx library for this conversion. Here is the code for the conversion function:
import sys

from pdf2docx import Converter


def convert_pdf_to_word(pdf_file, docx_file):
    """Convert PDF to DOCX format."""
    try:
        cv = Converter(pdf_file)
        # start=0 and end=None convert every page
        cv.convert(docx_file, start=0, end=None)
        cv.close()
        print(f"Converted PDF to DOCX: {pdf_file} -> {docx_file}")
    except Exception as e:
        print(f"Error during PDF to DOCX conversion: {e}")
        sys.exit(1)
In this function, we create an instance of the Converter class, passing pdf_file as an argument. Calling convert with start=0 and end=None converts every page, and close releases the resources the converter holds. On success, the function prints the source and destination paths. If any step raises an exception, the function prints the error and exits with status 1.
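One detail worth noting: in the function above, close is reached only when convert succeeds. A minimal sketch of one way to guarantee the cleanup, using contextlib.closing from the standard library. The function name convert_pdf_to_word_safely and the converter_cls parameter are my own illustrative additions, not part of pdf2docx; the parameter exists only so the flow can be tried with a stand-in class.

```python
from contextlib import closing


def convert_pdf_to_word_safely(pdf_file, docx_file, converter_cls=None):
    """Convert a PDF to DOCX, closing the converter even if conversion fails."""
    if converter_cls is None:
        # Imported lazily so the sketch can be exercised without pdf2docx installed
        from pdf2docx import Converter as converter_cls
    # closing() calls cv.close() on exit, whether or not convert() raised
    with closing(converter_cls(pdf_file)) as cv:
        cv.convert(docx_file, start=0, end=None)
    return docx_file
```

Any exception still propagates to the caller, so the existing try/except around the conversion can stay as it is.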
Step 2: Remove Strikethrough Text
Once we have the DOCX file, we can remove the strikethrough text. This step involves iterating through the paragraphs and runs in the DOCX file and checking for the strikethrough style. We use the python-docx library for this task. Here is the code for the strikethrough removal function:
import os
import sys

from docx import Document


def remove_strikethrough_text(docx_file):
    """Remove all strikethrough text from a DOCX file."""
    try:
        document = Document(docx_file)
        modified = False

        for paragraph in document.paragraphs:
            for run in paragraph.runs:
                if run.font.strike:
                    print(f"Removing strikethrough text: {run.text}")
                    run.text = ''
                    modified = True

        if modified:
            # Insert _modified before the extension only
            base, ext = os.path.splitext(docx_file)
            modified_docx_file = f"{base}_modified{ext}"
            document.save(modified_docx_file)
            print(f"Strikethrough text removed. Saved to: {modified_docx_file}")
            return modified_docx_file
        else:
            print("No strikethrough text found.")
            return docx_file
    except Exception as e:
        print(f"Error during strikethrough text removal: {e}")
        sys.exit(1)
In this function, we create an instance of the Document class, passing docx_file as an argument. We iterate through each paragraph in the document and then through each run within the paragraph. If the strike attribute of the run’s font is True, we print the text being removed and set the run’s text to an empty string. If any strikethrough text was removed, we save the modified document to a new file with _modified appended to the original filename and return that path. If no strikethrough text was found, we return the original DOCX file.
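The loop above only visits document.paragraphs, which covers body text but not paragraphs inside tables, and it checks single strikethrough only. Here is a hedged sketch of a wider sweep. It relies on python-docx exposing paragraphs and tables on both the document and each table cell, and on the font's double_strike property; the helper names iter_all_paragraphs and clear_struck_runs are mine, chosen for illustration.

```python
def iter_all_paragraphs(container):
    """Yield body paragraphs, then paragraphs inside table cells (recursively)."""
    yield from container.paragraphs
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # A cell has .paragraphs and .tables, like the document body
                yield from iter_all_paragraphs(cell)


def clear_struck_runs(document):
    """Blank every non-empty run marked as struck through; return the count."""
    removed = 0
    for paragraph in iter_all_paragraphs(document):
        for run in paragraph.runs:
            struck = run.font.strike or run.font.double_strike
            # Skipping empty runs avoids double counting merged cells,
            # which can be reached more than once through row.cells
            if struck and run.text:
                run.text = ""
                removed += 1
    return removed
```

A strike value of None means the setting is inherited from a style, so this sketch, like the original loop, only catches strikethrough applied directly to a run.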
Step 3: Convert DOCX Back to PDF
The final step is to convert the modified DOCX file back to a PDF file. This ensures that the strikethrough text is removed in the final PDF. We use the docx2pdf library for this conversion. Here is the code for the conversion function:
import sys

from docx2pdf import convert


def convert_docx_to_pdf(docx_file, output_pdf):
    """Convert DOCX back to PDF format."""
    try:
        convert(docx_file, output_pdf)
        print(f"Converted DOCX to PDF: {docx_file} -> {output_pdf}")
    except Exception as e:
        print(f"Error during DOCX to PDF conversion: {e}")
        sys.exit(1)
This function calls docx2pdf's convert function, passing docx_file and output_pdf as arguments. On success, it prints the source and destination paths. If an exception is raised, the function prints the error and exits with status 1.
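Since docx2pdf works by automating Microsoft Word, this step can only succeed on Windows or macOS. A small pre-flight check, sketched below, lets the script fail early with a clear message instead of after the first two steps have already run. The names docx2pdf_supported and convert_docx_to_pdf_checked are hypothetical helpers of my own, and the check covers the operating system only, not whether Word is actually installed.

```python
import platform


def docx2pdf_supported(system=None):
    """Return True when the OS is one where docx2pdf can drive Microsoft Word."""
    system = system or platform.system()
    return system in ("Windows", "Darwin")  # Darwin is macOS


def convert_docx_to_pdf_checked(docx_file, output_pdf):
    """Check the platform first, then hand off to docx2pdf."""
    if not docx2pdf_supported():
        raise RuntimeError("docx2pdf needs Microsoft Word on Windows or macOS")
    # Imported lazily, only once we know the platform is supported
    from docx2pdf import convert
    convert(docx_file, output_pdf)
    return output_pdf
```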
Main Execution Block
The following block of code is the main execution section of the script. It starts by checking if the script is being run directly. It then verifies that the correct number of command-line arguments is provided and that the specified PDF file exists. If these conditions are met, the script defines intermediate file paths and performs the three main steps: converting the PDF to a DOCX file, removing strikethrough text from the DOCX file, and converting the modified DOCX back to a PDF. After completing these steps, it prints the location of the modified PDF file and cleans up any intermediate files. If errors occur during execution, they are caught and printed, and the script exits gracefully.
import os
import sys

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python <script_name>.py <pdf_file_path>")
        sys.exit(1)

    pdf_file = sys.argv[1]

    if not os.path.exists(pdf_file):
        print(f"Error: File not found - {pdf_file}")
        sys.exit(1)

    try:
        # Define intermediate file paths
        base_name = os.path.splitext(pdf_file)[0]
        temp_docx_file = f"{base_name}.docx"
        output_pdf_file = f"{base_name}_modified.pdf"

        # Step 1: Convert PDF to DOCX
        convert_pdf_to_word(pdf_file, temp_docx_file)

        # Step 2: Remove strikethrough text
        final_docx_file = remove_strikethrough_text(temp_docx_file)

        # Step 3: Convert modified DOCX back to PDF
        convert_docx_to_pdf(final_docx_file, output_pdf_file)

        print(f"Modified PDF saved to: {output_pdf_file}")

        # Clean up intermediate DOCX files
        if os.path.exists(temp_docx_file):
            os.remove(temp_docx_file)
        if final_docx_file != temp_docx_file and os.path.exists(final_docx_file):
            os.remove(final_docx_file)
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)
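The cleanup at the end of this block runs only when all three steps succeed. Because the step functions call sys.exit(1) on failure, a failed conversion can leave the intermediate DOCX files next to the input PDF. One possible alternative, sketched here under my own naming, keeps the intermediates in a temporary directory that is deleted however the block exits. The run_pipeline function and its three callable parameters are illustrative stand-ins for the step functions above.

```python
import tempfile
from pathlib import Path


def run_pipeline(pdf_file, to_docx, strip_strikes, to_pdf):
    """Run the three steps with intermediates kept in a throwaway directory.

    to_docx, strip_strikes and to_pdf stand in for the three step functions;
    they are passed in so the flow can be tried with simple stubs.
    """
    pdf_path = Path(pdf_file)
    output_pdf = pdf_path.with_name(f"{pdf_path.stem}_modified.pdf")
    with tempfile.TemporaryDirectory() as tmp:
        temp_docx = str(Path(tmp) / f"{pdf_path.stem}.docx")
        to_docx(str(pdf_path), temp_docx)
        final_docx = strip_strikes(temp_docx)  # may return a new path
        to_pdf(final_docx, str(output_pdf))
    # The temporary directory and everything in it is gone at this point
    return output_pdf
```

With the functions from this post, the call would look like run_pipeline(pdf_file, convert_pdf_to_word, remove_strikethrough_text, convert_docx_to_pdf).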
Complete Script
import os
import sys

from docx import Document
from docx2pdf import convert
from pdf2docx import Converter


def convert_pdf_to_word(pdf_file, docx_file):
    """Convert PDF to DOCX format."""
    try:
        cv = Converter(pdf_file)
        # start=0 and end=None convert every page
        cv.convert(docx_file, start=0, end=None)
        cv.close()
        print(f"Converted PDF to DOCX: {pdf_file} -> {docx_file}")
    except Exception as e:
        print(f"Error during PDF to DOCX conversion: {e}")
        sys.exit(1)


def remove_strikethrough_text(docx_file):
    """Remove all strikethrough text from a DOCX file."""
    try:
        document = Document(docx_file)
        modified = False

        for paragraph in document.paragraphs:
            for run in paragraph.runs:
                if run.font.strike:
                    print(f"Removing strikethrough text: {run.text}")
                    run.text = ''
                    modified = True

        if modified:
            # Insert _modified before the extension only
            base, ext = os.path.splitext(docx_file)
            modified_docx_file = f"{base}_modified{ext}"
            document.save(modified_docx_file)
            print(f"Strikethrough text removed. Saved to: {modified_docx_file}")
            return modified_docx_file
        else:
            print("No strikethrough text found.")
            return docx_file
    except Exception as e:
        print(f"Error during strikethrough text removal: {e}")
        sys.exit(1)


def convert_docx_to_pdf(docx_file, output_pdf):
    """Convert DOCX back to PDF format."""
    try:
        convert(docx_file, output_pdf)
        print(f"Converted DOCX to PDF: {docx_file} -> {output_pdf}")
    except Exception as e:
        print(f"Error during DOCX to PDF conversion: {e}")
        sys.exit(1)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python <script_name>.py <pdf_file_path>")
        sys.exit(1)

    pdf_file = sys.argv[1]

    if not os.path.exists(pdf_file):
        print(f"Error: File not found - {pdf_file}")
        sys.exit(1)

    try:
        # Define intermediate file paths
        base_name = os.path.splitext(pdf_file)[0]
        temp_docx_file = f"{base_name}.docx"
        output_pdf_file = f"{base_name}_modified.pdf"

        # Step 1: Convert PDF to DOCX
        convert_pdf_to_word(pdf_file, temp_docx_file)

        # Step 2: Remove strikethrough text
        final_docx_file = remove_strikethrough_text(temp_docx_file)

        # Step 3: Convert modified DOCX back to PDF
        convert_docx_to_pdf(final_docx_file, output_pdf_file)

        print(f"Modified PDF saved to: {output_pdf_file}")

        # Clean up intermediate DOCX files
        if os.path.exists(temp_docx_file):
            os.remove(temp_docx_file)
        if final_docx_file != temp_docx_file and os.path.exists(final_docx_file):
            os.remove(final_docx_file)
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)
Run the Script
Execute the script by running the following command, replacing <script_name> with the name of your script file and <pdf_file_path> with the path to your PDF file:
python <script_name>.py <pdf_file_path>
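If the script later grows more options, the standard library's argparse module can validate the arguments and generate the usage message. A minimal sketch follows; the parse_args helper and the --output flag are hypothetical additions for illustration and are not part of the script above.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Remove strikethrough text from a PDF."
    )
    parser.add_argument("pdf_file", help="path to the input PDF")
    # Hypothetical option: where to write the cleaned PDF
    parser.add_argument("-o", "--output", default=None,
                        help="output path (default: <input>_modified.pdf)")
    return parser.parse_args(argv)
```

Calling parse_args() with no arguments reads the real command line, and running the script with -h prints the generated help text.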
This Python-based solution removes strikethrough text from PDFs by combining the pdf2docx, python-docx, and docx2pdf libraries. By converting the PDF to DOCX, modifying the DOCX, and converting it back to PDF, we remove the struck-through text while leaving the remaining text in place. Because the document passes through two format conversions, layout and fonts in the output can differ from the original, so review the result before relying on it.