I started comparing the contents of the PDFs to see if the offending ones looked corrupt. While my PDF reading skills are not quite those of Adobe Reader, I could tell that the files didn't seem to be messed up. The PDF headers and footers looked normal. One thing I noticed was a point difference in the PDF version number in the file itself. I couldn't imagine this being the culprit. After all, all three browsers use the same Adobe Reader plug-in, and the offending files worked fine in Firefox and Opera.
I wondered if there might be something amiss with the MIME types that was throwing something off. This was another long shot, but I checked anyway. After a few minutes with Fiddler, all looked kosher.
I went back to the PDF contents. After looking at some that worked and others that didn't, I noticed that the ones that caused the problem had several blank lines at the beginning. In a test with a single file I removed what I thought to be harmless blank lines and voilà, the problem was solved. Apparently Internet Explorer doesn't just push the PDF off to the reader based on the HTTP content type (application/pdf) but actually reads the file. When it encountered something other than the typical opening bytes of a PDF (%PDF-), it decided to dump the file out as plain text instead of handing it off to the appropriate application. It seems that the other browsers don't do this, and the extra bytes in the file are harmless to the PDF itself.
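For anyone who wants to check their own files for the same symptom, here is a minimal sketch in Python of the check I was doing by eye. The function names are made up for illustration, and it assumes the only junk in front of the header is line feeds (byte value 10), which is what I saw in my files.

```python
PDF_MAGIC = b"%PDF-"  # the bytes a PDF is expected to open with


def starts_with_pdf_header(data: bytes) -> bool:
    """Return True when the very first bytes are the PDF signature."""
    return data.startswith(PDF_MAGIC)


def count_leading_line_feeds(data: bytes) -> int:
    """Count the line feed bytes (value 10) that precede any other content."""
    return len(data) - len(data.lstrip(b"\n"))


if __name__ == "__main__":
    sample = b"\n\n\n%PDF-1.4\n% rest of the file"
    print(starts_with_pdf_header(sample))
    print(count_leading_line_feeds(sample))
```

A file that fails the first check but has a non-zero count from the second is a likely candidate for the plain-text behavior described above.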
I ended up writing a fairly simple app that searches each file byte by byte to find what should be the correct starting point. Once the leading run of whitespace bytes (byte value 10, the line feed) is stripped out, the files are clean and work just fine.
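The original app isn't reproduced here, but the idea can be sketched in a few lines of Python. This is an approximation rather than the code I actually ran: the function names are hypothetical, and it deliberately leaves a file untouched unless the bytes right after the line feeds are the %PDF- signature.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # expected opening bytes of a PDF
LINE_FEED = 10        # the byte value found at the start of the bad files


def strip_leading_line_feeds(data: bytes) -> bytes:
    """Walk the data byte by byte and drop line feeds that precede %PDF-."""
    offset = 0
    while offset < len(data) and data[offset] == LINE_FEED:
        offset += 1
    # Only trim when the PDF signature follows immediately; otherwise the
    # file is something unexpected and is returned unchanged.
    if data[offset:offset + len(PDF_MAGIC)] != PDF_MAGIC:
        return data
    return data[offset:]


def clean_pdf_file(source: Path, destination: Path) -> bool:
    """Write a cleaned copy of source to destination; return True if bytes were removed."""
    original = source.read_bytes()
    cleaned = strip_leading_line_feeds(original)
    destination.write_bytes(cleaned)
    return cleaned != original
```

Writing to a separate destination keeps the vendor's original file around in case the cleanup needs to be redone.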
Yet another story to reinforce that you can't always trust your vendor's data.