TL;DR: I had 12,000+ scanned government inspection reports in PDF format. Every OCR tool I tried either capped free usage at 5-10 pages, charged $300+/month, or produced garbage output. SayPDF's AI-powered OCR handled the entire batch with accurate text extraction, table preservation, and zero page limits on their web tool. This is the story of how I went from manually retyping data to a fully automated pipeline.
The Problem: 12,000 Scanned PDFs and a Deadline
Last February, I was six weeks into a new role as a data analyst at an environmental consultancy in London. My first big project landed on my desk: analyze historical building inspection reports to identify patterns in structural deficiencies across the UK.
The catch? The data existed only as scanned PDFs. We're talking 12,247 documents, most of them scanned from paper reports dating back to 2015. Each report ran between 3 and 15 pages. Tables, handwritten notes in margins, stamps, photographs mixed with text - the full chaos of real-world documents.
My manager gave me three months. "Should be straightforward," he said. "Just extract the data and put it in a spreadsheet."
If you've ever tried to extract structured data from scanned PDFs, you know that "straightforward" is the last word you'd use.
Week 1: The Free Tool Safari
I started where everyone starts - Google. "Free OCR PDF to Excel." The results look promising until you actually try them.
Here's what I found across 15 different tools:
The Typical "Free" OCR Experience
- Tool A: "Free OCR!" - converts 3 pages, then paywall. $29.99/month for "unlimited" (which was actually 100 pages/month)
- Tool B: Free tier exists, but OCR is premium-only. The free version just handles native PDFs (useless for scanned docs)
- Tool C: No page limit, but the OCR quality was so poor it was faster to retype manually
- Tool D: Good OCR, 5-page limit, $49/month for the plan I'd need
- Tool E: Desktop app, decent quality, but crashed on files over 10 pages. Consistently.
I burned through an entire week testing tools. By Friday, I'd successfully converted exactly 47 pages out of ~90,000. At 47 pages a week, I'd finish in roughly 37 years.
Week 2: The Enterprise Quote Reality Check
I escalated to my manager. "We might need to buy proper OCR software."
Enterprise OCR vendors quoted us between $500 and $2,000 per month. Adobe Acrobat Pro was the "affordable" option at $22.99/month per user, but even that struggled with our specific documents - handwritten annotations threw off the recognition, and tables came out as garbled text blocks.
One vendor wanted a $15,000 annual license for "unlimited batch processing with table recognition." For a 15-person consultancy, that was a non-starter.
Meanwhile, my deadline was ticking.
Week 3: Finding SayPDF (By Accident)
I found SayPDF while looking for something else entirely. I was searching for "PDF to Excel with table detection" on a developer forum, and someone had mentioned it in a thread about document processing pipelines.
My first reaction was skepticism. The site claimed AI-powered OCR with high accuracy. I'd heard that before. But two things caught my attention:
- No page limit on the web tools. Not "free for 5 pages" or "free trial for 7 days." Actually usable without immediately hitting a paywall.
- It specifically mentioned handling scanned documents and handwritten text. Most tools bury "scanned PDF support" deep in their premium tier.
I uploaded a test document - one of our nastier inspection reports with a mix of typed text, handwritten notes, a stamped header, and a data table. This was my litmus test. Every other tool had failed on this specific document.
The result came back in about 15 seconds. And it was... actually good.
The table structure was preserved. The handwritten margin notes were recognized (not perfectly, but ~85% accuracy, which was more than enough to work with). The typed text was essentially flawless. I sat there for a solid minute just staring at the output.
The Numbers: What Actually Happened
Here's how I broke it down:
Manual approach (what I was doing before): Each document took approximately 8-12 minutes to extract data from by hand. For 12,247 documents, that's roughly 1,600 hours of work even at the fast end of that range (and closer to 2,400 hours at the slow end) - about 10 months of full-time data entry at best. My manager's "three months" estimate wasn't even in the ballpark.
With SayPDF's web tools for smaller batches, and their API for the bulk: The initial testing and workflow setup took about 2 days. The actual conversion processing ran over a weekend. Post-processing and data validation took another week. Total hands-on time: approximately 60 hours.
Even accounting for the cost of the API credits for the bulk processing (which came out to roughly $180 total for the entire batch), the ROI was absurd. We would have spent more on coffee for the intern we'd need to hire for the manual work.
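If you want to rerun the manual estimate with your own numbers before pitching a tool to your manager, the arithmetic is easy to script. This is a minimal sketch: the document count and the 8-12 minute range are from this project, the 160-hour working month is my own assumption, and `estimateManualEffort` is just an illustrative helper name.

```javascript
// Rough manual-effort estimate.
// ASSUMPTION: 160 working hours per month (about 40 hours/week); use your own.
function estimateManualEffort(docCount, minutesPerDoc, hoursPerMonth = 160) {
  const hours = (docCount * minutesPerDoc) / 60;
  return {
    hours: Math.round(hours),
    months: Number((hours / hoursPerMonth).toFixed(1)),
  };
}

// Fast and slow ends of the 8-12 minutes-per-document range.
const low = estimateManualEffort(12247, 8);
const high = estimateManualEffort(12247, 12);
console.log('fast end:', low, 'slow end:', high);
```

Running both ends of the range matters: the slow end is about 50% more work than the fast end, which is the difference between a bad estimate and a hopeless one.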
What Made the Difference
I've thought about why SayPDF worked where others failed, and it comes down to a few things:
1. Real OCR on the Free Tier
This is the big one. Most "free PDF converters" only handle native PDFs - documents where the text is already embedded digitally. When you upload a scanned document (which is essentially a picture of text), they either fail silently, produce garbage, or tell you OCR is a premium feature.
SayPDF's web tools actually run AI-powered OCR on scanned documents. That was the first hurdle most tools couldn't clear.
2. Table Recognition That Actually Works
Our inspection reports were full of tables - compliance checklists, measurement data, rating grids. Most OCR tools treat tables as plain text and lose all structure. You end up with columns merged together and rows jumbled.
SayPDF preserved the table structure. When I converted to Excel, the data was in the right cells. Not perfect every time, but workable - maybe 90% of tables came through clean, and the remaining 10% needed minor fixes rather than complete reconstruction.
3. The API for Scale
For the first couple hundred documents, the web interface was fine. But for 12,000+ files, I needed automation. SayPDF offers a REST API that let me write a simple script to batch-process everything.
// The core of my processing script (Node.js)
const fs = require('fs');
const FormData = require('form-data');
const axios = require('axios');

// Upload one scanned PDF and return the API's response body.
async function convertPDF(filePath) {
  const form = new FormData();
  form.append('file', fs.createReadStream(filePath));

  const response = await axios.post(
    'https://api.saypdf.com/api/v1/convert',
    form,
    {
      headers: {
        ...form.getHeaders(), // multipart content-type with boundary
        'x-api-key': process.env.SAYPDF_API_KEY
      },
      params: { target: 'xlsx', ocr: true } // Excel output, OCR on
    }
  );
  return response.data;
}
Nothing fancy. But it meant I could point my script at a folder of 12,000 PDFs, hit run, and come back Monday morning to find structured Excel files waiting for me.
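The "point it at a folder" part is where most of the practical risk sits, so here is a rough sketch of what a batch driver around a function like `convertPDF` could look like. It is not my exact script: `listPdfs`, `processAll`, and the `concurrency` and `retries` defaults are illustrative names and values I'm assuming here, and it uses only Node's standard library. The conversion function is passed in as a parameter, so nothing in it depends on any particular API.

```javascript
const fsp = require('fs/promises');
const path = require('path');

// List every .pdf file directly inside a folder (no recursion).
async function listPdfs(dir) {
  const entries = await fsp.readdir(dir, { withFileTypes: true });
  return entries
    .filter((e) => e.isFile() && e.name.toLowerCase().endsWith('.pdf'))
    .map((e) => path.join(dir, e.name));
}

// Run `convert` over all files with a fixed number of concurrent workers.
// A failing file is retried, then recorded, so one bad scan cannot
// abort a weekend-long run.
async function processAll(files, convert, { concurrency = 4, retries = 2 } = {}) {
  const results = { ok: [], failed: [] };
  let next = 0; // index of the next unclaimed file, shared by all workers

  async function worker() {
    while (next < files.length) {
      const file = files[next++];
      let lastError;
      let done = false;
      for (let attempt = 0; attempt <= retries && !done; attempt++) {
        try {
          const output = await convert(file);
          results.ok.push({ file, output });
          done = true;
        } catch (err) {
          lastError = err;
        }
      }
      if (!done) results.failed.push({ file, error: String(lastError) });
    }
  }

  const workerCount = Math.min(concurrency, files.length);
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}
```

With the function from earlier, a run would look something like `processAll(await listPdfs('./reports'), convertPDF)` inside an async function, where `./reports` is a placeholder path. Keeping the list of failed files is the important design choice: you want to rerun only those on Monday morning rather than guess which ones are missing.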
4. Multi-Language Support
Some of our older reports had sections in Welsh (we cover all of the UK). Most OCR tools choke on mixed-language documents. SayPDF handled it without any special configuration - I didn't even have to specify the languages.
What I'd Do Differently
If I had to do this project again, I'd skip the two weeks of tool evaluation entirely and start with SayPDF from day one. But here's some practical advice for anyone in a similar situation:
- Test with your worst document first. Don't test OCR tools with clean, modern PDFs. Find the ugliest, most challenging document in your collection and use that as your benchmark.
- Start with the web tools, then move to the API. The web interface is great for validating quality before you commit to a batch processing approach.
- Budget for post-processing. Even the best OCR isn't 100% perfect. Build in time for spot-checking and correction. I spent about 15% of my total time on validation.
- Use the Excel output for tabular data. If your PDFs are mostly tables (invoices, reports, forms), convert directly to Excel rather than Word. It'll save you a reformatting step.
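For the spot-checking step, it helps to pick the files to review at random rather than eyeballing whichever ones happen to be at the top of the folder. A minimal sketch of one way to do that follows; the helper name `pickSpotCheckSample` and the 5% default are assumptions for illustration, not a recommendation backed by anything more than my own experience.

```javascript
// Pick a random sample of converted files for manual review.
// Uses a partial Fisher-Yates shuffle. `random` is injectable so the
// selection can be made repeatable in tests.
// ASSUMPTION: the 5% default fraction is arbitrary; tune it to your risk.
function pickSpotCheckSample(files, fraction = 0.05, random = Math.random) {
  const pool = [...files]; // copy so the caller's array is not reordered
  const sampleSize = Math.min(pool.length, Math.ceil(pool.length * fraction));
  for (let i = 0; i < sampleSize; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, sampleSize);
}
```

If the sample turns up more errors than you are comfortable with, raise the fraction and sample again before trusting the rest of the batch.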
The Bigger Picture
This project changed how I think about document processing. There are millions of organizations sitting on mountains of scanned documents - medical records, legal contracts, financial statements, government archives - and the data in those documents is effectively locked away because extraction has been too expensive or too painful.
Tools like SayPDF are making that accessible. Not just for enterprises with five-figure software budgets, but for small consultancies like ours, for independent researchers, for nonprofits, for anyone who needs to turn a stack of scanned papers into usable data.
My three-month project? I finished it in five weeks. My manager still thinks I'm some kind of wizard. I haven't corrected him.
Ready to Stop Retyping PDFs?
Try SayPDF's AI-powered OCR. No signup required for the web tools. Upload a scanned document and see the difference.