
How To Convert A PDF File To Markdown (With Images) In Linux

If you work with technical docs, wikis, or static site generators, you’ve probably run into this:
You have a PDF, but you really need it as Markdown — with images intact.

Good news: Linux has everything you need to make it happen.

Below is a step-by-step guide that works for both text-heavy PDFs and PDFs packed with diagrams, charts, and screenshots.


Step 1 – Install the Required Tools

We’ll use two main tools:

  1. pdftohtml (from poppler-utils) – Converts PDFs to HTML and writes out the images along the way.
  2. pandoc – Converts between document formats, including HTML to Markdown.

Install them with:

sudo apt update && sudo apt install poppler-utils pandoc

For Fedora/RHEL-based systems:

sudo dnf install poppler-utils pandoc
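
Before moving on, it can help to confirm that both tools are actually on your PATH. Here is a minimal sketch of such a check; missing_tools is just an illustrative helper name, not something either package provides.

```shell
# Print the name of every command from the argument list that is not on PATH.
# missing_tools is a hypothetical helper used only for this example.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the command
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools pdftohtml pandoc)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "pdftohtml and pandoc are both installed."
fi
```

If anything is reported as missing, rerun the install command for your distribution before continuing.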

Step 2 – Convert PDF to HTML (Preserving Images)

First, turn your PDF into HTML while keeping the images:

pdftohtml -c -noframes -p myfile.pdf output.html

Flags explained:

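  • -c → Generates "complex" output that tries to reproduce the original page layout.
  • -noframes → Writes a single HTML file instead of a frameset split across several files.
  • -p → Replaces links to .pdf files with links to .html files. It is optional and has no effect on images; pdftohtml writes images by default unless you pass -i.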

This will give you:

  • output.html → Your PDF in HTML format.
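  • Image files (PNG or JPG) → Written next to output.html and prefixed with the output name, for example output001.png. pdftohtml does not create a separate image folder, and exact file names vary by mode and version.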
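
To see which image files the generated HTML actually points at, you could pull the src attributes out of output.html. This is a rough sketch using grep and sed; list_html_images is a made-up helper name, and it assumes double-quoted src attributes, which may not hold for every pdftohtml version.

```shell
# List the unique paths referenced by src="..." attributes in an HTML file.
# list_html_images is a hypothetical helper used only for this example.
list_html_images() {
    grep -o 'src="[^"]*"' "$1" \
        | sed -e 's/^src="//' -e 's/"$//' \
        | sort -u
}

# Only run against output.html if it exists in the current directory
if [ -f output.html ]; then
    list_html_images output.html
fi
```

Each path printed should correspond to a file sitting next to output.html.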

Step 3 – Convert HTML to Markdown

Now that you have HTML, use Pandoc to convert it to Markdown:

pandoc output.html -f html -t markdown -o myfile.md

This creates myfile.md in Pandoc's own Markdown dialect. If your target is GitHub or a stricter renderer, use -t gfm or -t commonmark instead.
If the HTML referenced images, the Markdown will contain image links to the extracted image files.


Step 4 – Organize Images for Your Markdown

Keep the extracted images where the Markdown links expect to find them.
Pandoc keeps the relative paths from the HTML, so if your Markdown says:

![Figure 1](output_images/image1.png)

…then output_images/image1.png must exist relative to the directory that holds your .md file.

If you plan to upload this to a site or Git repo, upload or commit the images together with the Markdown so the relative links keep working.
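
A quick way to catch broken links before publishing is to check that every image the Markdown references actually exists. The sketch below makes several assumptions: check_md_images is a made-up helper name, and it only understands simple ![alt](path) links that sit on a single line, with no titles and no remote URLs.

```shell
# Report every image referenced as ![alt](path) in a Markdown file whose
# target does not exist relative to the Markdown file's own directory.
# check_md_images is a hypothetical helper used only for this example.
check_md_images() {
    md_dir=$(dirname "$1")
    grep -o '!\[[^]]*]([^)]*)' "$1" \
        | sed -e 's/^!\[[^]]*](//' -e 's/)$//' \
        | while IFS= read -r path; do
            [ -e "$md_dir/$path" ] || printf 'missing: %s\n' "$path"
        done
}

# Only run against myfile.md if it exists in the current directory
if [ -f myfile.md ]; then
    check_md_images myfile.md
fi
```

No output means every simple image link resolved to a file on disk.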


Step 5 – Clean Up the Markdown (Optional)

PDF → HTML → Markdown isn’t always perfect. You might see:

  • Extra line breaks
  • Odd spacing
  • Overly long lines

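To normalize the Markdown syntax, you can round-trip the file through Pandoc. Note that --wrap=preserve keeps the existing line breaks; to rewrap overly long lines, use --wrap=auto together with --columns=80 instead: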

pandoc myfile.md -f markdown -t markdown --wrap=preserve -o myfile_clean.md

Or open it in your favorite Markdown editor (like Typora, Obsidian, or VS Code) and do some manual cleanup.
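
For the extra line breaks and odd spacing in particular, a small filter built from standard tools can do a first pass. This is only a sketch: tidy_md is a hypothetical helper name, and it treats every line the same, so it will also strip the two trailing spaces that Markdown uses for hard line breaks and will touch the contents of code blocks.

```shell
# Strip trailing whitespace, then squeeze runs of blank lines down to one.
# tidy_md is a hypothetical helper; it reads a file and writes to stdout.
tidy_md() {
    sed -e 's/[[:space:]]*$//' "$1" | cat -s
}

# Write the tidied copy to a new file so the original stays untouched
if [ -f myfile.md ]; then
    tidy_md myfile.md > myfile_tidy.md
fi
```

Writing to a separate file makes it easy to diff the result against the original before replacing it.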


Bonus – One-Liner Command

If you want to chain everything into one line:

pdftohtml -c -noframes -p myfile.pdf temp.html && pandoc temp.html -f html -t markdown -o myfile.md

pdftohtml still writes the images next to temp.html, so keep them together with myfile.md.
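
If you convert PDFs regularly, the same two commands can be wrapped in a small shell function. The following is a sketch under stated assumptions: pdf2md and pdf2md_all are made-up names, the input is assumed to end in .pdf, and the intermediate HTML file is left in place rather than cleaned up.

```shell
# Convert one PDF to Markdown using the same two commands as the one-liner.
# pdf2md is a hypothetical helper; output files are named after the input.
pdf2md() {
    pdf=$1
    base=${pdf%.pdf}    # strip the .pdf extension to build output names
    pdftohtml -c -noframes -p "$pdf" "$base.html" \
        && pandoc "$base.html" -f html -t markdown -o "$base.md"
}

# Convert every PDF in a directory (defaults to the current one).
pdf2md_all() {
    dir=${1:-.}
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue    # skip the literal pattern if nothing matches
        pdf2md "$f"
    done
}
```

After pasting this into your shell or a script, pdf2md myfile.pdf produces myfile.html and myfile.md, and pdf2md_all docs does the same for every PDF in a docs directory.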


Final Thoughts

Converting PDFs to Markdown on Linux is straightforward once you know the toolchain.
By using pdftohtml to preserve images and pandoc to do the format conversion, you get a usable Markdown file with the images saved alongside it for reuse.

If you need perfect formatting for publishing, expect to do a little cleanup — but this method will save you hours compared to manually copying and pasting.

© 2025 Computer Everywhere