PDFsharp & MigraDoc Foundation
http://forum.pdfsharp.de/

Image compression
http://forum.pdfsharp.de/viewtopic.php?f=4&t=82

Author:  peteratoce [ Tue Feb 20, 2007 2:18 pm ]
Post subject:  Image compression

It would be most welcome if the library could compress images (I mean real compression, not a reduction in resolution, although that is sometimes appropriate).
Here are the results of some tests I did:
I started with a 100 page TIF file (A4, resolution 1200 dpi). BTW, such high resolution is absolutely necessary when a scanned document is to be printed on an offset press.

First, I opened the TIF in Acrobat (V 7) and saved as PDF. The file size barely grew, from 30.507.933 Bytes to 30.561.981 Bytes.

Then I used PDFsharp to do the equivalent (TIF acquired through System.Drawing.Image.FromFile, each page passed to PDFsharp through XImage.FromGdiPlusImage and then inserted in the output PDF with XGraphics.DrawImage). The conversion took about four times as long, and the resultant file size was 100.594.087 Bytes, i.e. more than three times as much.

Another consideration is the amount of memory needed during conversion. My understanding is that all newly created PDF pages have to be kept in memory by PDFsharp, until they are finally saved to file. My first test, done with a similar TIF file, but with 1012 pages in it, ran into an OutOfMemoryException. I expect that pages with compressed images on them would need far less memory during processing.
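
To put rough numbers on the memory concern, here is a back-of-the-envelope sketch in Python. The A4 dimensions (210 x 297 mm), the byte-padded rows and the helper name raw_page_bytes are assumptions made for illustration; the sketch only estimates the size of an uncompressed bitmap, not what PDFsharp or GDI+ actually allocates.

```python
import math

MM_PER_INCH = 25.4


def raw_page_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Estimate the uncompressed size of one scanned page.

    Returns (width_px, height_px, size_in_bytes). Rows are padded to
    whole bytes, as in a typical packed bitmap (an assumption).
    """
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    bytes_per_row = math.ceil(width_px * bits_per_pixel / 8)
    return width_px, height_px, bytes_per_row * height_px


if __name__ == "__main__":
    # A4 at 1200 dpi, as bilevel, 8-bit gray and 24-bit RGB.
    for bpp in (1, 8, 24):
        w, h, size = raw_page_bytes(210, 297, 1200, bpp)
        print(f"{w} x {h} px at {bpp} bpp: {size / 2**20:.1f} MiB")
```

Under these assumptions a single bilevel page is already in the region of 17 million bytes uncompressed, and 24 times that if it is expanded to RGB, so holding hundreds of such pages in memory at once adds up quickly.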

Author:  Thomas Hoevel [ Tue Feb 20, 2007 5:38 pm ]
Post subject: 

I cannot explain why the file is so much bigger.

Have you tried a release build? The debug build by default produces "verbose" PDF files that are bigger.

Images in the PDF file use lossless LZ compression (except for JPEG images - those are copied byte by byte into the PDF file).

Not sure if the verbose mode can account for a factor 3 - I don't expect that.

I'd like to know which image format and compression was used for the TIFF file. If it was JPEG or CCITT/FAX then this could be the reason - PDFsharp uses the standard LZ compression, but other methods may be better for your scanned image.
Or maybe the image got converted to 24 bit RGB - this could explain factor 3.

PDFsharp does not read the files - it relies on GDI+ to read them; the 8-bit-to-24-bit-conversion could occur here.
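
The effect of such a conversion can be tried out with a small Python sketch. It is not PDFsharp code: zlib's Deflate stands in for the "LZ compression" mentioned above (an assumption), the test page is synthetic, and the helper names are made up for illustration. It compresses the same black-and-white image once packed at 1 bit per pixel and once expanded to 24-bit RGB.

```python
import zlib


def make_bilevel_rows(width, height):
    """Build a synthetic page as rows of 0/1 pixels (1 = white)."""
    return [
        [0 if (y % 8 == 0 and (x // 16) % 2 == 0) else 1 for x in range(width)]
        for y in range(height)
    ]


def pack_1bpp(rows):
    """Pack 0/1 pixels into bytes, 8 pixels per byte, MSB first."""
    out = bytearray()
    for row in rows:
        for i in range(0, len(row), 8):
            chunk = row[i:i + 8]
            chunk = chunk + [0] * (8 - len(chunk))  # pad last byte of a row
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | bit
            out.append(byte)
    return bytes(out)


def expand_24bpp(rows):
    """Expand 0/1 pixels to RGB triples (black = 000000, white = FFFFFF)."""
    out = bytearray()
    for row in rows:
        for bit in row:
            out += b"\xff\xff\xff" if bit else b"\x00\x00\x00"
    return bytes(out)


if __name__ == "__main__":
    rows = make_bilevel_rows(640, 64)
    for name, raw in (("1 bpp", pack_1bpp(rows)), ("24 bpp", expand_24bpp(rows))):
        packed = zlib.compress(raw, 9)
        assert zlib.decompress(packed) == raw  # Deflate is lossless
        print(f"{name}: raw {len(raw)} bytes, deflated {len(packed)} bytes")
```

How much the two deflated sizes differ depends heavily on the image content, so the numbers from a synthetic page say little about real scans; the sketch is only a way to experiment with the idea.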

Long story short: we do compress image data. I'd like to know what happens there.

BTW: all pages are kept in memory. With 1000 scanned pages this really could be a problem, but for most applications this approach is appropriate.

Author:  peteratoce [ Mon Mar 05, 2007 8:48 am ]
Post subject: 

Differences in file size really seem to be caused by differing compression schemes:
A 100 page TIF (CCITT G4): 30.507.933 Bytes,
the same TIF (LZW): 100.349.200 Bytes.

PDFsharp created a file of size 100.521.556 Bytes from the G4, so the result is consistent.

I wish somebody (perhaps a knowledgeable user?) would turn his/her attention to image import and export in the library, including questions of different (= optimal) compression schemes for differing content types! From my experience I can say that GDI+ as an intermediate would have to go, though...

And, it would be nice to have more control over memory allocation, creation of temporary files or whatever is necessary to successfully process really large files.

Peter

Author:  grzeslag [ Fri Jul 10, 2009 1:08 pm ]
Post subject:  Re: Image compression

Hi

Is there any way to store images with CCITT G4 compression?
I see the same effect: almost all images converted by this library are more than 200% bigger :(

Grzegorz

Author:  Thomas Hoevel [ Mon Jul 13, 2009 9:19 am ]
Post subject:  Re: Image compression

PDF doesn't support TIFF.
It supports JPEG and LZW.

AFAIR it doesn't support G4 (but I'll check that eventually).

Current implementations of PDFsharp use LZW for any image that is not JPEG.

Author:  robert_baumann [ Sat Jul 18, 2009 4:54 pm ]
Post subject:  Re: Image compression

You might take a look at this article:
http://www.codeproject.com/KB/GDI-plus/ ... uick&fr=26

It describes how to convert images into bitonal format, which is required for CCITT4 compression, and how to handle multiple page TIFFs.

You would need to go into PdfSharp.Pdf.Advanced.PdfImage, and create a method like "InitializeCcitt4Tiff()", and have some flag in the XImage that specifies it should be bitonal. This flag would be used in the ctor of the PdfImage class.

Good luck, and post back if you have code to contribute!

Author:  peteratoce [ Sun Jul 19, 2009 12:59 pm ]
Post subject:  Re: Image compression

First of all, it is true that "PDF doesn't support TIFF", but it does support the same encoded (= compressed) image formats that are most widely found in TIFF files. Besides LZW/Flate and DCT (JPEG), which are appropriate for color and grayscale images, CCITT Fax G3 and G4 are also available for monochrome images. PDF, not surprisingly, inherited these capabilities from PostScript.

When you deal with images in PDF then you deal with so called "Filtered Streams". They consist of the encoded image data and, in addition, of information in the stream dictionary about the appropriate filter(s) needed to decode the data.
The above is knowledge I took from the specs, but, as I am an empirically minded person, I wanted to verify this for myself. So I created an image of a small black square (15 x 15 pixels) in the middle of an empty page (35 x 35 pixels) and saved that to a TIFF G4 file. Then I saved the same image to a PDF file. When I looked at the results in a binary editor, I could see that the identical encoded image data can be found in both files, namely "ff c9 c3 1f ff ff ff ff fc 7f f0 01 00 10" (hex representation). In the PDF it looks like this:
<<
/Type /XObject
/Subtype /Image
/Name /Im0
/Filter [ /CCITTFaxDecode ]
/DecodeParms [ << /K -1 /Columns 35 /Rows 35 >> ]
/Width 35
/Height 35
/ColorSpace /DeviceGray
/BitsPerComponent 1
/Length 7 0 R
>>
stream
-- here the binary data --
endstream
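
As a rough illustration of the structure above, the following Python sketch wraps already encoded G4 data in such a stream. The function name is hypothetical, and unlike the listing it writes /Filter and /DecodeParms without the array brackets and gives /Length directly instead of through an indirect reference; it omits /Name and does not produce a complete PDF file (no object numbers, page content or cross-reference table).

```python
def ccitt_g4_image_stream(width, height, g4_data):
    """Return dictionary plus stream bytes for a bilevel image XObject.

    g4_data must already be CCITT Group 4 encoded; it is copied
    unchanged, so no re-encoding takes place.
    """
    dictionary = (
        "<<\n"
        "/Type /XObject\n"
        "/Subtype /Image\n"
        "/Filter /CCITTFaxDecode\n"
        # /K -1 selects pure two-dimensional (Group 4) encoding.
        f"/DecodeParms << /K -1 /Columns {width} /Rows {height} >>\n"
        f"/Width {width}\n"
        f"/Height {height}\n"
        "/ColorSpace /DeviceGray\n"
        "/BitsPerComponent 1\n"
        f"/Length {len(g4_data)}\n"
        ">>\n"
    ).encode("ascii")
    return dictionary + b"stream\n" + g4_data + b"\nendstream\n"


if __name__ == "__main__":
    # The 14 bytes quoted in the post for the 35 x 35 test image.
    sample = bytes.fromhex("ff c9 c3 1f ff ff ff ff fc 7f f0 01 00 10")
    print(ccitt_g4_image_stream(35, 35, sample).decode("latin-1"))
```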

(Remark: When I imported the TIFF file in Acrobat and saved to PDF, the image was re-encoded to Flate, with an increase of size)

This leads me to the question whether it shouldn't be possible to directly import G4 encoded pages from a TIFF file into a PDF document without re-encoding the images.
Instead of GDI+, one would probably have to use libtiff (or GraphicsMagick) to access the image(s) and metadata.
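
For a first experiment one does not even need an imaging library. Below is a minimal Python sketch, using only the standard library, that pulls the raw G4 data of the first page out of a TIFF without decoding it. It is a sketch under several assumptions: classic (non-BigTIFF) files, a single strip per page, and the tag numbers of the TIFF 6.0 specification. It ignores FillOrder and PhotometricInterpretation, which a real importer would have to respect, and the function names are made up.

```python
import struct

# Tag numbers from the TIFF 6.0 specification.
TAG_WIDTH, TAG_HEIGHT, TAG_COMPRESSION = 256, 257, 259
TAG_STRIP_OFFSETS, TAG_STRIP_BYTE_COUNTS = 273, 279
COMPRESSION_CCITT_G4 = 4
# struct codes and byte sizes for field types BYTE, ASCII, SHORT, LONG.
TYPE_CODES = {1: ("B", 1), 2: ("B", 1), 3: ("H", 2), 4: ("I", 4)}


def read_first_ifd(data):
    """Parse the first image file directory into {tag: [values]}."""
    order = {b"II": "<", b"MM": ">"}[data[:2]]
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (count,) = struct.unpack_from(order + "H", data, ifd_offset)
    tags = {}
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i
        tag, ftype, n = struct.unpack_from(order + "HHI", data, entry)
        if ftype not in TYPE_CODES:
            continue  # RATIONAL etc. are not needed for this sketch
        code, size = TYPE_CODES[ftype]
        # Values of up to 4 bytes are stored inline, larger ones at an offset.
        if size * n <= 4:
            pos = entry + 8
        else:
            (pos,) = struct.unpack_from(order + "I", data, entry + 8)
        tags[tag] = list(struct.unpack_from(f"{order}{n}{code}", data, pos))
    return tags


def extract_g4_strip(data):
    """Return (width, height, raw_g4_bytes) for the first page."""
    tags = read_first_ifd(data)
    if tags.get(TAG_COMPRESSION, [1])[0] != COMPRESSION_CCITT_G4:
        raise ValueError("first page is not CCITT G4 compressed")
    offsets = tags[TAG_STRIP_OFFSETS]
    counts = tags[TAG_STRIP_BYTE_COUNTS]
    if len(offsets) != 1:
        # Strips are encoded independently, so simply joining them
        # would not give one valid G4 stream.
        raise ValueError("multi-strip pages are not handled here")
    start = offsets[0]
    return tags[TAG_WIDTH][0], tags[TAG_HEIGHT][0], data[start:start + counts[0]]
```

The returned bytes could then be placed in a /CCITTFaxDecode stream with /Columns and /Rows set to the returned width and height. For multi-page files one would follow the next-IFD offset stored after the last directory entry, which this sketch does not do.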

Attachments:
File comment: file examples
square.zip [2 KiB]
Downloaded 1092 times

Author:  Thomas Hoevel [ Mon Jul 20, 2009 9:09 am ]
Post subject:  Re: Image compression

I stand corrected.

I put "/CCITTFaxDecode" on our TODO list, but we won't address it before the release scheduled for this summer is out.

"/CCITTFaxDecode" is a lossless compression, so using GDI+ and re-compressing the image costs nothing but CPU time (but maybe there's a GDI+ flag that indicates FAX compression (I'll check that)).

Author:  kostadinnm [ Wed Sep 30, 2009 11:12 am ]
Post subject:  Re: Image compression

Thomas Hoevel wrote:
I stand corrected.

I put "/CCITTFaxDecode" on our TODO list, but we won't address it before the release scheduled for this summer is out.

"/CCITTFaxDecode" is a lossless compression, so using GDI+ and re-compressing the image costs nothing but CPU time (but maybe there's a GDI+ flag that indicates FAX compression (I'll check that)).


Any progress on this?

Author:  Thomas Hoevel [ Wed Sep 30, 2009 12:55 pm ]
Post subject:  Re: Image compression

kostadinnm wrote:
Any progress on this?

Not yet - I have to work for projects we get paid for ...

Author:  michael.hidalgo [ Tue Aug 23, 2011 3:35 pm ]
Post subject:  Re: Image compression

Any luck with this? I have a PDF file that contains 2 images, both compressed using CCITTFaxDecode, but I cannot extract them using PDFsharp.

Author:  Thomas Hoevel [ Tue Aug 23, 2011 3:42 pm ]
Post subject:  Re: Image compression

Extracting images was left as an exercise to the reader (see Export Images sample).

Back to topic: CCITT compression is implemented in the publicly available version of PDFsharp.

Author:  michael.hidalgo [ Mon Aug 29, 2011 4:19 pm ]
Post subject:  Re: Image compression

What do you mean by the publicly available version of PDFsharp? Can you point me to that version?

Thanks

Author:  Thomas Hoevel [ Tue Aug 30, 2011 8:33 am ]
Post subject:  Re: Image compression

michael.hidalgo wrote:
Can you point me to that version?

http://pdfsharp.codeplex.com/releases/view/37054

Please note that PDFsharp only supports encoding of CCITT images (the topic of this thread), but not decoding (your question).

Author:  michael.hidalgo [ Tue Aug 30, 2011 2:22 pm ]
Post subject:  Re: Image compression

Thanks for the information

All times are UTC