Count Characters And Words In PDF Files Using Python In Linux

If you deal with PDF files regularly on Linux and need to count the number of words or characters inside them, Python makes it easy. Whether you’re processing academic documents, contracts, or reports, automating this task saves time and improves accuracy.

In this post, I’ll walk you through a Python script that extracts text from a PDF and calculates the number of words and characters.

Prerequisites

Before you begin, make sure:

You’re using a Linux system (Ubuntu, Fedora, etc.)
Python 3 is installed
You have pip available to install Python packages
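These prerequisites can also be checked from Python itself. The snippet below is a minimal sketch that uses only the standard library; the helper name missing_packages is made up for this illustration. It checks for Python 3.6 or newer because the scripts in this post use f-strings.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The scripts in this post use f-strings, which need Python 3.6 or newer.
    if sys.version_info < (3, 6):
        print("Python 3.6 or newer is required")
    else:
        print("Python %d.%d found" % sys.version_info[:2])

    # pip is itself an importable package, so the same check works for it.
    if missing_packages(["pip"]):
        print("pip is missing")
    else:
        print("pip is available")
```

After Step 1, the same helper can be called with ["PyPDF2", "pdfplumber"] to confirm that the libraries were installed.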

Step 1: Install Required Libraries

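We’ll use PyPDF2 to read PDFs. Note that PyPDF2 is no longer actively developed (its maintainers continue the project under the name pypdf), but the script below works with PyPDF2 2.0 or later. Install it using pip: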

pip install PyPDF2

Alternatively, you can use pdfplumber for more accurate extraction if your PDFs are complex (with columns, tables, etc.).

pip install pdfplumber

Step 2: Python Script to Count Words and Characters

Here’s a basic example using PyPDF2:

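import PyPDF2

def count_pdf_text(file_path):
    """Return (word_count, char_count) for the text PyPDF2 can extract."""
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # extract_text() can come back empty for image-only pages, hence the
        # "or ''". Joining with a newline keeps the last word of one page
        # from fusing with the first word of the next.
        full_text = '\n'.join(page.extract_text() or '' for page in reader.pages)

    words = full_text.split()      # split on any run of whitespace
    word_count = len(words)
    char_count = len(full_text)    # includes spaces and line breaks

    return word_count, char_count

# Example usage
pdf_file = 'example.pdf'
words, characters = count_pdf_text(pdf_file)
print(f"Word count: {words}")
print(f"Character count: {characters}")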
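Note that len(full_text) in the script above counts every character, including spaces and line breaks. If a count without whitespace is needed, the sketch below shows one way to compute both from an already-extracted string. The helper name text_stats is hypothetical and not part of PyPDF2.

```python
def text_stats(text):
    """Return word and character counts for an already-extracted string."""
    words = text.split()  # split() with no arguments splits on any whitespace run
    return {
        "words": len(words),
        "chars_with_whitespace": len(text),
        # Summing word lengths leaves out every space, tab and newline.
        "chars_without_whitespace": sum(len(word) for word in words),
    }


sample = "Hello PDF world\nSecond line"
print(text_stats(sample))
```

Which of the two character counts is the right one depends on what the number is for, so it helps to report both.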

Optional: Use pdfplumber for Better Accuracy

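import pdfplumber

def count_pdf_text(file_path):
    """Return (word_count, char_count) for the text pdfplumber can extract."""
    with pdfplumber.open(file_path) as pdf:
        # Join pages with a newline so words at page boundaries stay separate.
        full_text = '\n'.join(page.extract_text() or '' for page in pdf.pages)

    words = full_text.split()
    word_count = len(words)
    char_count = len(full_text)    # includes spaces and line breaks

    return word_count, char_count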

Step 3: Run the Script

Save your script as count_pdf.py and run it from your terminal:

python3 count_pdf.py

Make sure to replace example.pdf with the path to your actual PDF file.
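Instead of editing the script for every file, the path can be passed on the command line. Below is a minimal sketch using the standard argparse module. It assumes PyPDF2 is installed as in Step 1, and the names parse_args and main are illustrative choices rather than anything the library requires.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse command-line arguments; argv=None makes argparse read sys.argv."""
    parser = argparse.ArgumentParser(
        description="Count words and characters in a PDF file"
    )
    # Optional positional argument; falls back to the name used in Step 2.
    parser.add_argument("pdf_path", nargs="?", default="example.pdf",
                        help="path to the PDF file (default: example.pdf)")
    return parser.parse_args(argv)


def count_pdf_text(file_path):
    """Mirror of the PyPDF2 example from Step 2 (pages joined by newlines)."""
    import PyPDF2  # imported lazily so argument parsing works without it

    with open(file_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        full_text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return len(full_text.split()), len(full_text)


def main(argv=None):
    args = parse_args(argv)
    if not os.path.isfile(args.pdf_path):
        print(f"File not found: {args.pdf_path}")
        return
    words, characters = count_pdf_text(args.pdf_path)
    print(f"Word count: {words}")
    print(f"Character count: {characters}")


if __name__ == "__main__":
    main()
```

With this in place, the script could be run as python3 count_pdf.py report.pdf, where report.pdf is a placeholder file name.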

Final Tips

If you need to process many files, loop through a directory using os.listdir().
You can redirect output to a file if needed (> results.txt).
Watch out for scanned PDFs: their pages are images rather than text, so they need OCR (for example, with Tesseract) before any words can be counted.
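The first and last tips can be combined into a small batch helper. The sketch below lists the PDF files in a directory with os.listdir() and flags any file whose extraction yields no words, which usually points to a scanned document that needs OCR. The function names are hypothetical, and count_func stands for either count_pdf_text function from Step 2.

```python
import os


def list_pdfs(directory):
    """Return sorted paths of the .pdf files directly inside `directory`."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.lower().endswith(".pdf")
    )


def count_directory(directory, count_func):
    """Apply count_func(path) -> (words, chars) to every PDF in `directory`."""
    results = {}
    for path in list_pdfs(directory):
        words, chars = count_func(path)
        results[path] = {
            "words": words,
            "characters": chars,
            # No extractable words usually means the pages are images.
            "needs_ocr": words == 0,
        }
    return results
```

For example, count_directory('reports', count_pdf_text) would return one entry per PDF in a folder named reports (a made-up directory name). Printing those entries and redirecting the output with > results.txt keeps a record of the run.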

Conclusion

Counting characters and words in a PDF using Python on Linux is fast and efficient with the right tools. Whether you’re automating reports or analyzing documents, this method gives you control over your data.

Have questions or want to automate this for multiple files? Drop a comment below or reach out.
