Wednesday, January 23, 2008

Fixing Vendor Data: Malformed PDFs and Internet Explorer

I encountered an interesting problem today dealing with PDFs. The online Chilton applications I manage have over 10 gigabytes (75,000+ files) of technical service bulletins in PDF form. Today we got a support call about some of them rendering in the browser as plain ASCII. I attempted to reproduce the problem on my system but found nothing wrong. After considering that it might be a browser issue, I tried Internet Explorer and, sure enough, I could reproduce it. I tried several PDFs in Internet Explorer, Firefox and Opera. Only Internet Explorer had the problem. Instead of launching Adobe Reader and displaying the PDF in it, it showed the raw contents of the PDF file in the browser as plain text.

I started comparing the contents of the PDFs to see if the offending ones looked corrupt. While my PDF reading skills aren't quite those of Adobe Reader, I could tell that the files didn't seem to be messed up. The PDF headers and footers looked normal. One thing I noticed was a point difference in the PDF version number in the file itself. I couldn't imagine this being the culprit. After all, all three browsers use the same Adobe Reader plug-in and the offending files worked fine in Firefox and Opera.

I wondered if there might be something amiss with the MIME types that was throwing something off. This was another long shot, but I checked anyway. After a few minutes with Fiddler, all looked kosher.

I went back to the PDF contents. After looking at some that worked and others that didn't, I noticed that the ones that caused the problem had several blank lines at the beginning. In a test with a single file I removed what I thought to be harmless blank lines and voilà, the problem was solved. Apparently Internet Explorer doesn't just push the PDF off to the reader based on the HTTP content type (application/pdf) but actually reads the file. When it encounters something other than the typical opening bytes of a PDF (%PDF-), it decides to dump the file out as plain text instead of handing it off to the appropriate application. It seems that the other browsers don't do this, and the extra bytes in the file are harmless to the PDF itself.
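To make the check concrete, here is a minimal sketch in Python of how one might spot the affected files. It is not the tool I actually used; the function names and the 1024-byte search window are assumptions made for illustration.

```python
PDF_MAGIC = b"%PDF-"


def leading_junk_length(data: bytes, search_limit: int = 1024) -> int:
    """Return how many bytes come before the first %PDF- marker.

    Returns 0 for a file that starts cleanly with %PDF-, and -1 if the
    marker does not appear in full within the first `search_limit` bytes
    (an arbitrary window chosen for this sketch).
    """
    return data[:search_limit].find(PDF_MAGIC)


def check_file(path: str) -> int:
    """Read the start of the file at `path` and report its leading junk length."""
    with open(path, "rb") as handle:
        return leading_junk_length(handle.read(1024))
```

A result greater than zero would flag a file like the ones that broke in Internet Explorer.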

I ended up writing a fairly simple app to search each file byte by byte to find what should be the correct starting point. Once the series of leading line-feed bytes (byte value 10) is stripped out, the files are cleaned up and work just fine.
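The cleanup itself could look something like the following Python sketch. Again, this is an illustration rather than the original app: the function names are hypothetical, and it assumes the only junk is a run of line-feed bytes (value 10) in front of the %PDF- marker. Anything else is left alone and reported as an error.

```python
PDF_MAGIC = b"%PDF-"


def strip_leading_newlines(data: bytes) -> bytes:
    """Drop leading line-feed bytes (value 10) so the data starts at %PDF-."""
    cleaned = data.lstrip(b"\n")
    if not cleaned.startswith(PDF_MAGIC):
        # Refuse to guess: the prefix was not just line feeds, or this is not a PDF.
        raise ValueError("data does not start with %PDF- after stripping line feeds")
    return cleaned


def clean_file(path: str) -> bool:
    """Rewrite the file at `path` in place if it had leading line feeds.

    Returns True if the file was changed, False if it was already clean.
    """
    with open(path, "rb") as handle:
        data = handle.read()
    cleaned = strip_leading_newlines(data)
    if cleaned == data:
        return False
    with open(path, "wb") as handle:
        handle.write(cleaned)
    return True
```

Running something like clean_file over a copy of the bulletin folder first, rather than the live files, would be the cautious way to try it.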

Yet another story to reinforce that you can't always trust your vendor's data.
