Thursday, September 01, 2011

Broken PDFs in Simple Scan

Since version 2.32, Simple Scan has had a bug that causes it to generate PDF files with invalid cross-reference tables.  The good news is that this bug is now fixed, and PDF output will be correct in simple-scan 3.2; thanks to Rafał Mużyło, who diagnosed the problem.  You may not have noticed the bug, because a number of PDF readers (e.g. Evince) tolerate this kind of failure and rebuild the table.  Some versions of Adobe Reader, however, do not.
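If you want a rough idea of whether a given file is affected, here is a minimal Python 3 sketch. It assumes the damage takes the three shapes that the repair script later in this post looks for: a doubled '%' at the start of the header, a four-digit generation number in the cross-reference entries (a valid entry is exactly 20 bytes: a 10-digit offset, a 5-digit generation number, 'n' or 'f', and a two-byte line ending), and a doubled '%%' in front of the EOF marker. The function names are made up for this example, and a match inside binary stream data could produce a false positive.

```python
import re

# Assumed shape of a malformed entry: 10-digit offset, a generation number
# with only four digits, 'n', and a bare newline (18 bytes instead of 20).
BROKEN_ENTRY = re.compile(rb'^\d{10} \d{4} n\n', re.MULTILINE)


def count_broken_xref_entries(data):
    """Count lines that look like malformed cross-reference entries."""
    return len(BROKEN_ENTRY.findall(data))


def looks_affected(data):
    """Heuristic: True if any of the three assumed symptoms is present."""
    return (data.startswith(b'%%PDF')
            or data.rstrip().endswith(b'%%%%EOF')
            or count_broken_xref_entries(data) > 0)
```

To try it on a real file, read the file in binary mode, e.g. looks_affected(open('scan.pdf', 'rb').read()), where scan.pdf is a placeholder name.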

I've added a command line option that can fix existing PDF files that you have generated with Simple Scan.  To use it, run the following:

simple-scan --fix-pdf ~/Documents/*.pdf

It should be safe to run this on all PDF documents, but PLEASE BACK UP FIRST. It copies each existing document to DocumentName.pdf~ before replacing it with the fixed version, so you still have the original in case anything goes wrong.
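One way to make that extra backup is to copy every PDF into a separate directory before running the fix. The following is only a sketch: the function name and the directory names in the usage note are placeholders, and it copies only the *.pdf files found directly in the source directory.

```python
import shutil
from pathlib import Path


def backup_pdfs(source_dir, backup_dir):
    """Copy every *.pdf in source_dir into backup_dir and return the copies."""
    source = Path(source_dir).expanduser()
    target = Path(backup_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for pdf in sorted(source.glob('*.pdf')):
        # copy2 also preserves timestamps where the platform allows it
        copied.append(Path(shutil.copy2(pdf, target / pdf.name)))
    return copied
```

For example, backup_pdfs('~/Documents', '~/pdf-backup') would copy the documents matched by the command above into ~/pdf-backup, overwriting any files of the same name already there.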

If you can't wait for the next simple-scan release, you can also run this Python 2 program (i.e. python fixpdf.py broken.pdf > fixed.pdf):

import sys
import re
lines = file (sys.argv[1]).readlines ()
xref_offset = int(lines[-2])
xref_offset = 0
for (n, line) in enumerate (lines):
    # Fix PDF header and binary comment (each has one '%' too many)
    if (n == 0 or n == 1) and line.startswith ('%%'):
        xref_offset -= 1
        line = line[1:]
    # Fix xref format (5-digit generation number, trailing space before newline)
    match = re.match (r'(\d{10}) 0000 n\n', line)
    if match is not None:
        offset = int (match.group (1))
        line = '%010d 00000 n \n' % (offset + xref_offset)
    # Fix xref offset (second-to-last line of the file)
    if n == len (lines) - 2:
        line = '%d\n' % (int (line) + xref_offset)
    # Fix EOF marker
    if n == len (lines) - 1 and line.startswith ('%%%%'):
        line = line[2:]
    # Trailing comma: write the line without adding another newline
    print line,
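The program above uses Python 2 syntax (file () and the print statement). If you only have Python 3, the following is a rough port of the same logic, not the code that ships in simple-scan. It works on bytes because PDF files contain binary data, and, like the original, it assumes the second-to-last line of the file holds the xref offset. The function name fix_pdf is made up for this sketch.

```python
import io
import re
import sys

# Malformed entry as written by the buggy versions (same pattern as above)
XREF_ENTRY = re.compile(rb'(\d{10}) 0000 n\n')


def fix_pdf(data):
    """Return a repaired copy of the PDF bytes in data."""
    lines = io.BytesIO(data).readlines()  # splits on b'\n' only
    last = len(lines) - 1
    shift = 0  # how far the object offsets move once the header is fixed
    fixed = []
    for n, line in enumerate(lines):
        # Fix PDF header and binary comment
        if n in (0, 1) and line.startswith(b'%%'):
            shift -= 1
            line = line[1:]
        # Fix xref format
        match = XREF_ENTRY.match(line)
        if match is not None:
            line = b'%010d 00000 n \n' % (int(match.group(1)) + shift)
        # Fix xref offset
        if n == last - 1:
            line = b'%d\n' % (int(line.strip()) + shift)
        # Fix EOF marker
        if n == last and line.startswith(b'%%%%'):
            line = line[2:]
        fixed.append(line)
    return b''.join(fixed)


if __name__ == '__main__' and len(sys.argv) == 2:
    with open(sys.argv[1], 'rb') as pdf:
        sys.stdout.buffer.write(fix_pdf(pdf.read()))
```

Saved under a name of your choice (fixpdf3.py, say), it is used the same way: python3 fixpdf3.py broken.pdf > fixed.pdf. Writing through sys.stdout.buffer keeps the binary data intact.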

4 comments:

Matteo Nardi said...

Hey, I just wanted to let you know that even my dad (a 50-something, non-tech dad!) loves Simple Scan! ...and he's grateful he doesn't need my help anymore when scanning documents :)
Thanks for the efforts!

Stef said...

I <3 Simple Scan. No seriously, it rocks. No nonsense, automatic file name generation. Perfect tool.

Unknown said...

I just noticed two lines in your Python code.

xref_offset = int(lines[-2])
xref_offset = 0


I am a Python newbie, but I guess something is wrong with these two lines.

Should the second one be something like:
offset = 0

Robert Ancell said...

Enchanter - good catch! The first xref_offset line shouldn't be there, but it still works correctly.