Thursday, April 2, 2009

Scanning in DjVu

My new project requires scanning colour documents into DjVu format, so I thought I would write about this somewhat unfamiliar file format.

Have you heard of “déjà vu”? As I understand it, the French phrase literally means “already seen”, or something like “familiar” or “already experienced”. It is used to describe the weird feeling most of us have had, where we come across a new situation or person and feel as though it has happened before, although we cannot recall the exact occasion. There could be several religious interpretations of this, but as far as I know there is no accepted scientific explanation yet (at least I couldn't find any).

I don't know why they chose the same name, but DjVu is a file format similar to PDF that produces significantly smaller files. It was developed by AT&T, and the commercial rights were later transferred to LizardTech. Last year they were transferred again, to Celartem Technology, the parent company of LizardTech. However, DjVu is a free file format, which means the specifications and the reference libraries are freely available. As with PDF, any user can view a DjVu document by installing a freely available browser plug-in. The commercial ownership applies only to the encoding technology.

Below are some interesting comparisons from DjVu.org (I have yet to test these in practice):
  • Scanned pages at 300 DPI in full color can be compressed from 25MB down to files of 30 to 100KB.
  • Black-and-white pages at 300 DPI typically occupy 5 to 30KB when compressed.
  • For color document images that contain both text and pictures, DjVu files are typically 5 to 10 times smaller than JPEG at similar quality.
  • For black-and-white pages, DjVu files are typically 10 to 20 times smaller than JPEG and five times smaller than GIF.
  • DjVu files are also about 3 to 8 times smaller than black-and-white PDF files produced from scanned documents.
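
The 25MB figure in the first bullet can be sanity-checked with a little arithmetic. The sketch below assumes a US Letter page (8.5 x 11 inches) scanned at 24 bits per pixel and treats 1KB as 1,000 bytes. None of these assumptions come from DjVu.org, and the function names are only illustrative.

```python
def raw_scan_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Uncompressed size of a scanned page in bytes (3 bytes = 24-bit colour)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


def compression_ratio(raw_bytes, compressed_bytes):
    """How many times smaller the compressed file is than the raw scan."""
    return raw_bytes / compressed_bytes


# Assumed page: US Letter at 300 DPI in full colour.
raw = raw_scan_bytes(8.5, 11, 300)
print(f"raw scan: {raw / 1_000_000:.1f} MB")
for kb in (30, 100):
    print(f"{kb} KB file -> about {compression_ratio(raw, kb * 1000):.0f}:1")
```

Under these assumptions the raw page comes to roughly 25MB, which matches the quoted figure, and the claimed 30 to 100KB range corresponds to compression ratios of very roughly 250:1 to 840:1.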

This is a graphical comparison done by LizardTech:

[Figure: graphical comparison by LizardTech]

Several technologies in DjVu make it possible to have very clear images at such small file sizes. The first is the compression scheme. Unlike other formats, DjVu compresses a page as three images: the foreground image, the background image, and the mask image. The mask, which is stored at high resolution, holds the text layer and uses a special compression technique: it compresses a particular character shape only once and, for every later occurrence of the same character, records only its location. The other two layers are stored in colour at low resolution. Because of this, a DjVu file with a lot of text is significantly smaller than a similar PDF file.

Decompression of a DjVu file is also done in several steps, so the user gets an initial view very quickly and the full-quality image is displayed a few moments later.
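
To make the mask idea a little more concrete, here is a toy Python sketch. It is not the actual DjVu codec, which is far more sophisticated. The names `encode_mask`, `decode_mask`, and `compose_pixel` are made up for illustration, glyphs are represented as plain strings rather than bitmaps, and the resolution difference between the layers is ignored.

```python
def encode_mask(glyphs):
    """Store each distinct glyph shape once; record only positions for repeats.

    `glyphs` is a list of (shape, x, y) tuples. `shape` stands in for a
    glyph bitmap (a plain string here, as a simplifying assumption).
    """
    dictionary = []   # distinct shapes, each stored a single time
    index_of = {}     # shape -> its index in `dictionary`
    placements = []   # (dictionary index, x, y) for every occurrence
    for shape, x, y in glyphs:
        if shape not in index_of:
            index_of[shape] = len(dictionary)
            dictionary.append(shape)
        placements.append((index_of[shape], x, y))
    return dictionary, placements


def decode_mask(dictionary, placements):
    """Rebuild the original glyph list from the dictionary and positions."""
    return [(dictionary[i], x, y) for i, x, y in placements]


def compose_pixel(mask_bit, foreground, background):
    """Where the mask is set, show the foreground colour; elsewhere the background."""
    return foreground if mask_bit else background


# Hypothetical page fragment: the letter "e" occurs three times.
page = [("e", 0, 0), ("x", 10, 0), ("e", 20, 0), ("e", 30, 0)]
dictionary, placements = encode_mask(page)
print(dictionary)
print(placements)
```

In this example the shape for "e" is stored once even though it appears three times, so the more a page repeats the same characters, the more the dictionary saves.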

These features make DjVu an ideal format for scanning colour text documents for electronic distribution. Who knows, DjVu may even replace PDF, especially for scanned colour documents such as textbooks. The famous Million Book Collection is an example of extensive use of the DjVu format. It offers more than 1.5 million full-text books freely in open formats such as HTML, TIFF and DjVu.


5 comments:

  1. Thanks for the good overview of the DjVu file format. There is one really big drawback: it is proprietary. Over the last 6 years, Mixed Raster Compression (also known as layered compression) has been applied brilliantly to PDF and PDF/A documents, which can achieve the same amazing compression rates while maintaining image quality. Moreover, the compression is built into the true document standard the world uses: PDF. If any of your readers want to learn more about a non-proprietary way of achieving this great compression, check out a leader in this space: LuraTech.

  2. Seems a better option. I am going to try this.

  3. As Mark says, DjVu is old news. It is possible to use the same MRC compression techniques in PDF files where you get the benefit of a widely used, open file format with multiple viewer implementations and all of the other PDF capabilities including myriad security options, commenting, bookmarks, etc.

    LuraTech has a nice scan-to-MRC-PDF technology, as does my employer, CVision Technologies, and my previous employer, Adobe Systems, as well as my friends at IRIS and Spigraph; the list goes on and on. One vendor still supports DjVu for storage of highly compressed color document images; dozens support PDF.

  4. Dear Amla, I am looking for a solution to get a DjVu file directly from the scanner, as most scanning software does not support DjVu. I know that DjVu Solo, which LizardTech offers for non-commercial use, can work with TWAIN, but it does not produce a bundled DjVu file; instead it produces one file for each scanned page. Do you recommend any solution?

    Replies
    1. Sorry, I haven't heard of any other solution. I would be interested to know whether there is a specific reason you are continuing with DjVu.
