Thursday, September 01, 2011

Broken PDFs in Simple Scan

Since version 2.32 Simple Scan has had a bug where it generates PDF files with invalid cross-reference tables.  The good news is this bug is now fixed, and will work correctly in simple-scan 3.2; thanks to Rafał Mużyło who diagnosed this.  You may not have noticed this bug as a number of PDF readers handle these types of failures and rebuild the table (e.g. Evince).  It was noticed that some versions of Adobe Reader do not handle these failures.

I've added a command line option that can fix existing PDF files that you have generated with Simple Scan.  To use run the following:

simple-scan --fix-pdf ~/Documents/*.pdf

It should be safe to run this on all PDF documents but PLEASE BACKUP FIRST. It will copy the existing document to DocumentName.pdf~ before replacing it with the fixed version so you have those in case anything goes wrong.

If you can't wait for the next simple-scan, you can also run this Python program (i.e. python fixpdf.py broken.pdf > fixed.pdf)

import sys
import re
lines = file (sys.argv[1]).readlines ()
xref_offset = int(lines[-2])
xref_offset = 0
for (n, line) in enumerate (lines):
        # Fix PDF header and binary comment
        if (n == 0 or n == 1) and line.startswith ('%%'):
                xref_offset -= 1
                line = line[1:]
        # Fix xref format
        match = re.match ('(\d\d\d\d\d\d\d\d\d\d) 0000 n\n', line)
        if match != None:
                offset = int (match.groups ()[0])
                line = '%010d 00000 n \n' % (offset + xref_offset)
        # Fix xref offset
        if n == len(lines) - 2:
                line = '%d\n' % (int (line) + xref_offset)
        # Fix EOF marker
        if n == len(lines) - 1 and line.startswith ('%%%%'):
            line = line[2:]
        print line,
Post a Comment