If you deal with PDF files regularly on Linux and need to count the number of words or characters inside them, Python makes it easy. Whether you’re processing academic documents, contracts, or reports, automating this task saves time and avoids the mistakes that come with counting by hand.
In this post, I’ll walk you through a Python script that extracts text from a PDF and calculates the number of words and characters.
Prerequisites
Before you begin, make sure:
- You’re using a Linux system (Ubuntu, Fedora, etc.)
- Python 3 is installed
- You have pip available to install Python packages
Step 1: Install Required Libraries
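We’ll use PyPDF2 to read PDFs. Note that PyPDF2 is no longer developed; its successor is pypdf, which provides the same PdfReader class, so the script below should also work with pypdf after changing the import. Install PyPDF2 using pip: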
pip install PyPDF2
Alternatively, you can use pdfplumber for more accurate extraction if your PDFs are complex (with columns, tables, etc.).
pip install pdfplumber
Step 2: Python Script to Count Words and Characters
Here’s a basic example using PyPDF2:
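import PyPDF2

def count_pdf_text(file_path):
    """Return (word_count, char_count) for the text extracted from a PDF."""
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # extract_text() can come back empty for pages without a text layer
        page_texts = [page.extract_text() or '' for page in reader.pages]
    # Join pages with a newline so the last word of one page
    # does not merge with the first word of the next
    full_text = '\n'.join(page_texts)
    words = full_text.split()
    word_count = len(words)
    char_count = len(full_text)  # includes spaces and line breaks
    return word_count, char_count

# Example usage
pdf_file = 'example.pdf'
words, characters = count_pdf_text(pdf_file)
print(f"Word count: {words}")
print(f"Character count: {characters}")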
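The character count in the script is len(full_text), so it includes spaces and line breaks. If you also want a count without whitespace, a small helper along these lines could report both. This is only a sketch: text_stats is a name made up for this example (it is not part of PyPDF2), and it works on any string you have already extracted.

```python
def text_stats(text):
    """Return word and character counts for an already-extracted string."""
    # split() with no arguments splits on any run of whitespace
    words = text.split()
    return {
        "words": len(words),
        # every character, including spaces and newlines
        "characters": len(text),
        # characters that belong to words only
        "characters_no_whitespace": sum(len(word) for word in words),
    }


print(text_stats("Hello  world\n"))
# {'words': 2, 'characters': 13, 'characters_no_whitespace': 10}
```

Which number you report depends on what the count is for; word processors typically show both.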
Optional: Use pdfplumber for Better Accuracy
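import pdfplumber

def count_pdf_text(file_path):
    """Return (word_count, char_count) using pdfplumber for extraction."""
    with pdfplumber.open(file_path) as pdf:
        # extract_text() can come back empty for pages without a text layer
        page_texts = [page.extract_text() or '' for page in pdf.pages]
    # Join pages with a newline so words on adjacent pages do not merge
    full_text = '\n'.join(page_texts)
    words = full_text.split()
    word_count = len(words)
    char_count = len(full_text)  # includes spaces and line breaks
    return word_count, char_count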
Step 3: Run the Script
Save your script as count_pdf.py and run it from your terminal (on many Linux distributions the Python 3 interpreter is called python3):
python3 count_pdf.py
Make sure to replace example.pdf with the path to your actual PDF file.
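Rather than editing the file name inside the script each time, you could pass the path on the command line. Here is a minimal sketch using argparse from the standard library. The names parse_args and pdf_path are chosen just for this example, and the wiring shown in the comments assumes count_pdf_text from Step 2 lives in the same file.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None makes argparse read sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Count words and characters in a PDF file."
    )
    parser.add_argument("pdf_path", help="path to the PDF to analyze")
    return parser.parse_args(argv)


# Possible wiring at the bottom of count_pdf.py, replacing the hard-coded name:
#     args = parse_args()
#     words, characters = count_pdf_text(args.pdf_path)
#     print(f"Word count: {words}")
#     print(f"Character count: {characters}")
```

With that in place, the script would be run as python3 count_pdf.py report.pdf, and argparse prints a usage message if the path is missing.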
Final Tips
- If you need to process many files, loop through a directory using os.listdir().
- You can redirect output to a file if needed (> results.txt).
- Watch out for scanned PDFs: they need OCR (like with Tesseract).
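The first and last tips can be combined in a rough sketch like the one below. It assumes a counting function that behaves like count_pdf_text from Step 2 (takes a path, returns a word count and a character count) and passes it in as the counter argument. The names list_pdfs, count_directory, and needs_ocr are invented for this example. Treating a zero word count as a sign of a scanned PDF is only a heuristic, not a reliable test.

```python
import os


def list_pdfs(directory):
    """Return sorted paths of the .pdf files directly inside directory."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.lower().endswith(".pdf")
    )


def count_directory(directory, counter):
    """Apply counter(path) -> (words, chars) to every PDF in directory."""
    results = {}
    for path in list_pdfs(directory):
        words, chars = counter(path)
        results[os.path.basename(path)] = {
            "words": words,
            "characters": chars,
            # Heuristic: no extractable words often means an image-only
            # (scanned) PDF that would need OCR before counting.
            "needs_ocr": words == 0,
        }
    return results


# Possible usage, with count_pdf_text defined as in Step 2:
#     for name, stats in count_directory("reports", count_pdf_text).items():
#         print(name, stats)
```

Passing the counter in as a parameter means the same loop works with either the PyPDF2 or the pdfplumber version of the function.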
Conclusion
Counting characters and words in a PDF using Python on Linux takes only a few lines once PyPDF2 or pdfplumber has extracted the text. Whether you’re automating reports or analyzing documents, the same function can be reused for a single file or a whole folder.
Have questions or want to automate this for multiple files? Drop a comment below or reach out.