PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly
All times are UTC


 Post subject: Reading PDF contents?
PostPosted: Tue Aug 19, 2008 10:54 am 
Joined: Tue Aug 19, 2008 9:55 am
Posts: 1
Hello, I've just started using PDFsharp and I was wondering how to read the content of a PDF.

I tried looping through the Pages.Elements property of the PdfDocument class, but I get an error that a DictionaryEntry cannot be converted to the type DictionaryElements.

Alternatively, I tried using the PdfContent class returned by the CreateSingleContent method of a PdfPage, but all I get is a handful of cryptic values (something like "7 0 R" or "120 B") as the whole content of a PDF containing text and a table with at least 50 values.

Also, is there a difference between reading normal text and reading the contents of a table?

Thanks in advance.


 Post subject:
PostPosted: Wed Aug 20, 2008 11:26 am 
Joined: Wed Aug 20, 2008 11:21 am
Posts: 3
I was able to extract the images of a page with the code below, but I am still unable to find the text.

Put the code below in any click event handler:

PdfDocument document = PdfReader.Open("C:\\HelloWorld.pdf", PdfDocumentOpenMode.ReadOnly);

int imageCount = 0;
// Iterate pages
foreach (PdfPage page in document.Pages)
{
    // Get resources dictionary
    PdfDictionary resources = page.Elements.GetDictionary("/Resources");
    if (resources != null)
    {
        // Get external objects dictionary
        PdfDictionary xObjects = resources.Elements.GetDictionary("/XObject");
        if (xObjects != null)
        {
            PdfItem[] items = xObjects.Elements.Values;
            // Iterate references to external objects
            foreach (PdfItem item in items)
            {
                PdfReference reference = item as PdfReference;
                if (reference != null)
                {
                    PdfDictionary xObject = reference.Value as PdfDictionary;
                    // Is external object an image?
                    if (xObject != null && xObject.Elements.GetString("/Subtype") == "/Image")
                    {
                        imageCount++;
                        ExportImage(xObject, imageCount);
                    }
                }
            }
        }
    }
}


The following functions are used:

/// <summary>
/// Dispatches on the image's /Filter entry.
/// Only the JPEG exporter is shown in this post.
/// </summary>
static void ExportImage(PdfDictionary image, int count)
{
    string filter = image.Elements.GetName("/Filter");
    switch (filter)
    {
        case "/DCTDecode":
            ExportJpegImage(image, count);
            break;

        case "/FlateDecode":
            // ExportAsPngImage is not included in this post.
            ExportAsPngImage(image, count);
            break;
    }
}

/// <summary>
/// Exports a JPEG image.
/// </summary>
static void ExportJpegImage(PdfDictionary image, int count)
{
    // JPEG is natively supported in PDF (/DCTDecode), so exporting the image
    // is just a matter of writing the raw stream bytes to a file.
    byte[] stream = image.Stream.Value;

    // File.WriteAllBytes requires "using System.IO;".
    File.WriteAllBytes("C:\\poc_image_" + count + ".jpeg", stream);
}


 Post subject:
PostPosted: Thu Aug 21, 2008 7:41 am 
Joined: Thu Aug 21, 2008 7:23 am
Posts: 5
Hi. For text extraction you can use the PDFBox library. To use it from .NET you also have to add a reference to IKVM to your project. An easy solution is the Text Mining Tool (which uses PDFBox). Just google it.


 Post subject:
PostPosted: Tue Aug 26, 2008 12:56 pm 
Joined: Wed Aug 20, 2008 11:21 am
Posts: 3
But I actually needed to find the position of each text and image object as well.
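
For the text part of the question, here is a rough sketch that stays within PDFsharp. It assumes a PDFsharp version that ships the PdfSharp.Pdf.Content namespace (ContentReader, CSequence, COperator, CString); older builds may not have it. The class name TextDump and the helper Collect are made up for this example. It walks each page's content stream and collects the string operands of the Tj and TJ text-showing operators. The strings come back as stored in the file, so fonts with custom or multi-byte encodings will likely produce unreadable output, and table cells are just ordinary text runs with no table structure attached.

```csharp
using System;
using System.Text;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;

// Hypothetical helper class, not part of PDFsharp.
static class TextDump
{
    static void Main(string[] args)
    {
        // args[0] is the path of the PDF file to inspect.
        PdfDocument document = PdfReader.Open(args[0], PdfDocumentOpenMode.ReadOnly);

        foreach (PdfPage page in document.Pages)
        {
            // Parse the page's content stream(s) into a tree of content objects.
            CSequence content = ContentReader.ReadContent(page);

            StringBuilder text = new StringBuilder();
            Collect(content, text);
            Console.WriteLine(text.ToString());
        }
    }

    // Recursively appends the string operands of Tj/TJ operators to 'text'.
    static void Collect(CObject obj, StringBuilder text)
    {
        COperator op = obj as COperator;
        if (op != null)
        {
            // Tj shows one string; TJ shows an array of strings and kerning numbers.
            if (op.OpCode.Name == "Tj" || op.OpCode.Name == "TJ")
            {
                foreach (CObject operand in op.Operands)
                    Collect(operand, text);
            }
            return;
        }

        // CArray derives from CSequence, so this also covers TJ arrays.
        CSequence sequence = obj as CSequence;
        if (sequence != null)
        {
            foreach (CObject child in sequence)
                Collect(child, text);
            return;
        }

        // Raw string bytes as stored in the PDF; no font decoding is applied here.
        CString str = obj as CString;
        if (str != null)
            text.Append(str.Value);
    }
}
```

Positions are not handled in this sketch. In the PDF content model they come from the operators that precede the text in the same sequence (Td, TD, Tm for text, cm and Do for images), so getting coordinates would mean tracking those operands while walking the sequence.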

