PDFsharp & MigraDoc Foundation • View topic - Enumerate OCR'd text using PdfSharp?


All times are UTC


Enumerate OCR'd text using PdfSharp?

Moderator: Stefan Lange


cshearercooper

Post subject: Enumerate OCR'd text using PdfSharp?

Posted: Mon Dec 30, 2024 8:37 pm

Joined: Tue Dec 17, 2024 11:56 pm
Posts: 8

I've got a PDF file that was OCR'd by Mobi PDF. I double-checked the OCR by closing Mobi PDF, re-opening the file, selecting a phrase, and then copy/pasting that phrase into Notepad. The pasted text is correct, so the OCR is good.

The challenge is that I'd now like to look at the output of the OCR using PdfSharp, but I can't find the text anywhere. All I can see is that in the contents of the page, which I read by calling

Code:

ContentReader.ReadContent(Page);

there is a Dictionary operator "/Part <</MCID 0 >>".

I've been reading up on marked-content identifiers but it's all new to me and I can't figure out how to find the content that the MCID is referring to.

How can I find the actual content in the PDF file? Or am I barking up the wrong tree, and the OCR text is actually stored somewhere completely different?

Thanks,
Chris


TH-Soft

Post subject: Re: Enumerate OCR'd text using PdfSharp?

Posted: Mon Jan 06, 2025 8:37 am

PDFsharp Guru

Joined: Sat Mar 14, 2015 10:15 am
Posts: 1028
Location: CCAA

Hi!

cshearercooper wrote:

there is a Dictionary operator "/Part <</MCID 0 >>".

Without context, I cannot say what it is.

The commands that draw the text may use glyph indexes, as in "<002B0048004F004F0052000F0003003A00520055004F00470004> Tj".
See also: https://docs.pdfsharp.net/PDFsharp/Topi ... pping.html

There may be a table that maps glyph indexes to Unicode. Without that table, you cannot decode the text.
For an OCRed file, this table should be present.
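To illustrate the decoding step described above, here is a minimal sketch in Python. It is not PDFsharp code. It assumes that the hex string uses 2-byte glyph indexes and that the mapping table has already been extracted from the PDF into a plain dictionary. The names `glyph_indexes`, `decode`, and `SAMPLE_TO_UNICODE` are hypothetical, and the sample table is invented for illustration only; in a real file the table comes from the font's own glyph-to-Unicode mapping.

```python
def glyph_indexes(hex_string: str) -> list[int]:
    """Split a PDF hex string such as '<002B0048>' into 2-byte glyph indexes.

    Assumption: every glyph index is exactly 4 hex digits (2 bytes).
    """
    digits = "".join(hex_string.strip().strip("<>").split())
    if len(digits) % 4 != 0:
        raise ValueError("expected a whole number of 2-byte glyph indexes")
    return [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]


def decode(hex_string: str, to_unicode: dict[int, str]) -> str:
    """Map each glyph index to text; unknown indexes become U+FFFD."""
    return "".join(to_unicode.get(g, "\ufffd") for g in glyph_indexes(hex_string))


# Hypothetical mapping table, made up for this example.
# A real table must be read from the PDF file itself.
SAMPLE_TO_UNICODE = {
    0x0003: " ", 0x0004: "!", 0x000F: ",", 0x002B: "H", 0x003A: "W",
    0x0047: "d", 0x0048: "e", 0x004F: "l", 0x0052: "o", 0x0055: "r",
}

SAMPLE = "<002B0048004F004F0052000F0003003A00520055004F00470004>"

if __name__ == "__main__":
    print(glyph_indexes(SAMPLE))
    print(decode(SAMPLE, SAMPLE_TO_UNICODE))
```

With an empty table, every glyph index decodes to the replacement character, which matches the point above: without the mapping table the text cannot be recovered from the glyph indexes alone.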

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)
