PDF to Markdown for RAG Pipelines | Better PDF Parsing for LLMs
Convert PDFs into structured Markdown for RAG pipelines and LLM workflows. Preserve headings and document structure before chunking and embedding.
PDF to Markdown for RAG Pipelines
If you've built a RAG pipeline before, you've probably discovered something frustrating: PDFs are one of the worst input formats for retrieval systems.
Most enterprise knowledge still lives inside PDFs: technical documentation, compliance reports, research papers, contracts, internal manuals. But extracting clean semantic structure from PDFs is much harder than most tutorials admit.
You can use our local-first PDF to Markdown Converter to turn text-based PDFs into structured Markdown before chunking and embedding.
Why Plain Text Extraction Fails
The naive workflow: Extract text → Chunk text → Generate embeddings → Store vectors. Unfortunately, raw PDF extraction usually destroys document structure.
Heading Hierarchy Disappears
A PDF visually contains sections and subsections. Plain text extraction often collapses everything into one flat stream of text. Your embeddings lose semantic boundaries.
Headers and Footers Pollute Chunks
Repeated page headers like "Confidential — Internal Use Only" often appear inside embeddings repeatedly. This creates noisy retrieval results.
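One way to catch these repeats is to count how many pages each line appears on. The sketch below assumes you have the extracted text one string per page; the function names and the 60% threshold are illustrative choices, not part of any particular tool.

```python
from collections import Counter


def find_repeated_lines(pages, min_fraction=0.6):
    """Return lines that occur on at least `min_fraction` of the pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        unique_lines = {line.strip() for line in page.splitlines() if line.strip()}
        counts.update(unique_lines)
    # Require at least two occurrences so single-page documents are untouched.
    threshold = max(2, int(len(pages) * min_fraction))
    return {line for line, n in counts.items() if n >= threshold}


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop likely running headers and footers from every page."""
    repeated = find_repeated_lines(pages, min_fraction)
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned
```

Exact matching misses headers that embed a page number, so treat this as a first pass and spot-check the output.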
Multi-Column PDFs Break Reading Order
Academic papers are especially problematic. Naive extraction frequently interleaves lines from the two columns, producing unreadable text. Embeddings generated from a corrupted reading order are usually useless.
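To see why order matters, here is a minimal sketch of column-aware ordering. It assumes a simple two-column page and text blocks that carry coordinates; the dict keys ("x", "y", "text") and the function name are hypothetical, and real layouts with full-width titles or figures need more than a midpoint split.

```python
def order_two_column_blocks(blocks, page_width):
    """Return block texts in reading order for a simple two-column page.

    Each block is a dict with hypothetical keys: "x" (left edge),
    "y" (top edge, growing downward) and "text".
    """
    midpoint = page_width / 2
    left = [b for b in blocks if b["x"] < midpoint]
    right = [b for b in blocks if b["x"] >= midpoint]
    # Read the whole left column top to bottom, then the right column.
    ordered = sorted(left, key=lambda b: b["y"]) + sorted(right, key=lambda b: b["y"])
    return [b["text"] for b in ordered]
```

Sorting the same blocks by "y" alone reproduces the interleaving problem: the first line of the left column is followed by the first line of the right column.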
Why Markdown Works Better for RAG
Markdown preserves structure. Instead of flat text like "Performance Requirements The system shall support..." you get "## Performance Requirements" followed by the body text. Chunk boundaries now become meaningful. Instead of cutting at arbitrary token windows, you can:
- Split by headings
- Split by sections
- Split by semantic blocks
- Split by document hierarchy
Boundaries that follow the document's own structure keep each chunk on a single topic, which typically improves retrieval quality.
Using pdftomd.app in a RAG Workflow
The converter fits best for:
- One-off ingestion
- Prototyping
- Privacy-sensitive documents
- Local preprocessing
- Validating chunk quality quickly
Typical workflow: Convert PDF locally → Export Markdown → Run cleanup pass → Chunk Markdown → Generate embeddings → Store vectors. Because conversion happens entirely in-browser, sensitive documents never leave your machine.
Practical Markdown Chunking Strategies
1. Header-Based Chunking
Simple and effective for structured documents.
    import re

    def chunk_by_headers(markdown_text):
        # Split just before every line that starts a level-2 heading ("## ").
        sections = re.split(r'\n(?=## )', markdown_text)
        chunks = []
        for section in sections:
            # Skip empty pieces such as leading blank text.
            if section.strip():
                chunks.append(section.strip())
        return chunks

One important caveat: if your source document itself contains Markdown examples or fenced code blocks with lines starting with ##, this regex can split incorrectly inside code sections. For production systems, dedicated Markdown parsers are safer.
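If you want to stay dependency-free, the fenced-code caveat can be handled with a small line-based splitter that tracks whether it is inside a fence. This is a sketch, not a full Markdown parser: the function name is made up for illustration, and it toggles on any line starting with three backticks or tildes, which ignores fence-length rules and nested fences.

```python
# Fence markers are built programmatically so this example can itself
# be placed inside a fenced block.
FENCE_MARKERS = ("`" * 3, "~" * 3)


def chunk_by_headers_fence_aware(markdown_text):
    """Split on level-2 headings, ignoring '## ' lines inside fenced code."""
    chunks, current = [], []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE_MARKERS):
            # Opening or closing fence: flip the state.
            in_fence = not in_fence
        elif line.startswith("## ") and not in_fence and current:
            # A real heading starts a new chunk.
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

Like the regex version, it splits only at level-2 headings, so deeper subsections stay inside their parent chunk.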
2. LangChain Markdown Splitting
For most modern RAG systems, this is usually the better approach.
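    # Depending on your LangChain version, this import may instead live in
    # the langchain_text_splitters package.
    from langchain.text_splitter import MarkdownTextSplitter

    splitter = MarkdownTextSplitter(
        chunk_size=1000,   # measured in characters by default
        chunk_overlap=100
    )
    # markdown_content is the Markdown string exported from the converter.
    chunks = splitter.create_documents([markdown_content])

This prefers heading boundaries whenever possible while still respecting the size limit. Note that chunk_size is measured in characters by default, not tokens.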
3. Context-Aware Chunking
Adding heading context often improves retrieval quality significantly.
    {
      "section": "Authentication > OAuth Tokens",
      "content": "..."
    }

That metadata becomes extremely valuable during retrieval and reranking.
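Records in that shape can be built by walking the Markdown line by line and keeping a stack of the headings currently in scope. The following is a minimal sketch; the function name is illustrative, and it does not skip heading-like lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_with_heading_context(markdown_text):
    """Return one {"section", "content"} record per heading body."""
    path = []     # stack of (level, title) for the headings in scope
    body = []     # lines collected under the current heading
    records = []

    def flush():
        content = "\n".join(body).strip()
        if content:
            section = " > ".join(title for _, title in path)
            records.append({"section": section, "content": content})
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            # Emit the text gathered under the previous heading first.
            flush()
            level = len(match.group(1))
            # Leave any headings at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return records
```

Text that appears before the first heading gets an empty "section" value, so decide whether to keep or drop those records.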
What Usually Needs Cleanup
- Repeated page headers
- Standalone page numbers
- Broken tables
- Malformed bullet lists
- Duplicated text blocks
- Corrupted Unicode characters
A quick cleanup pass often improves retrieval quality more than changing embedding models.
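A cleanup pass does not need to be elaborate. The sketch below covers three items from the list: standalone page numbers, back-to-back duplicated blocks, and Unicode compatibility forms such as ligatures. The page-number pattern and the function name are assumptions for illustration. The pattern will also remove a line that contains only a year, and NFKC normalization rewrites characters like superscripts, so review the output on a sample document first.

```python
import re
import unicodedata

# Matches lines such as "12", "Page 3" or "page 3 of 10".
PAGE_NUMBER = re.compile(
    r"^\s*(?:page\s+)?\d{1,4}(?:\s+of\s+\d{1,4})?\s*$", re.IGNORECASE
)


def cleanup_markdown(markdown_text):
    """Normalize Unicode, drop standalone page numbers and repeated blocks."""
    # NFKC folds compatibility characters, e.g. the "fi" ligature (U+FB01).
    text = unicodedata.normalize("NFKC", markdown_text)
    lines = [line for line in text.splitlines() if not PAGE_NUMBER.match(line)]
    # Blocks are runs of lines separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", "\n".join(lines))
    cleaned = []
    for block in blocks:
        block = block.strip()
        # Skip empty blocks and immediate repeats of the previous block.
        if block and (not cleaned or block != cleaned[-1]):
            cleaned.append(block)
    return "\n\n".join(cleaned)
```

Normalization does not repair genuinely corrupted characters, and broken tables or malformed lists still need their own handling.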
Where Browser-Based Conversion Fits
This tool is intentionally lightweight. It works well when: privacy matters, you want immediate results, you need quick preprocessing, documents are text-based, you don't want cloud APIs. It is not intended to replace large OCR pipelines.
When You Need Heavier Tools
Scanned PDFs
Image-only PDFs require OCR. Tools worth evaluating: Marker, Docling, AWS Textract, Azure Document Intelligence.
Math-Heavy Academic Papers
Scientific PDFs with dense equations are difficult for nearly every parser. Specialized tools like Mathpix or Nougat generally perform better for equation reconstruction.
High-Volume Batch Pipelines
If you're processing thousands of PDFs automatically, browser-based workflows quickly stop scaling. At that point, server-side parsers become more practical.