[GreenKeys] another m28 document... Warning - long - if you're not interested in document preservation - JUST SKIP THIS.

Mon Jan 14 10:28:13 EST 2008

Sheldon Daitch wrote:

> dumb question time.

The only "dumb" question is the one not asked. Everyone has to learn 
along the way - and if they aren't afforded the opportunity to "peek 
over the shoulder" - then Q&A is the next best thing.

> I don't know very much about PDF file creation, as we have two models
> of HP high end scanners used for PDF preparation.  But they are line
> scanners, that is, if the displayed file is expanded (magnified), the
> scan lines become very apparent.  We don't have the "creation" type PDF
> program that would take a text document and convert it to PDF.

The key to scanning is the software used - and most software takes some
skill (learning, trial and error, and practice) to get a good result.

First - the quality of the scanner is almost irrelevant - as long as
it's capable of a minimum of 300dpi (dots per inch of "resolution").
Even with an old scanner - good software can produce excellent results -
as the software can "drive" the scanner to produce the desired scan.
This particular document was scanned on a very old and tired Microtek.
Its basic scan accuracy is 300dpi - yet software can "push" it to
2400dpi - but the trade-off is time. That is the "major" difference in
scanners - your newer high quality scanners can turn out an excellent
scan in a single, comparatively fast pass. My old beast can turn out a
similar quality scan - but it'll take the software maybe ten minutes to
get the job done.  The average scan time per page on this document was
less than a minute - so the quality was only "acceptable" - and
considering the source, a printed copy (even the graphics) looks better
than the originals - which are old, faded, yellowed, etc...
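
A rough illustration of why "pushing" the resolution costs so much: the
amount of data grows with the square of the dpi. The sketch below assumes
a US Letter page (8.5 x 11 inches), which is only a guess at the page
size, and the function name is made up for this example. It counts
pixels only; it says nothing about how any particular scanner or driver
actually spends its time.

```python
# Pixel counts for one scanned page at two resolutions.
# Assumption: US Letter page, 8.5 x 11 inches (the real page size of the
# document is not stated).

def scan_pixels(width_in, height_in, dpi):
    """Return (width_px, height_px, total_px) for a page scanned at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px

low = scan_pixels(8.5, 11, 300)
high = scan_pixels(8.5, 11, 2400)

print(low)                # (2550, 3300, 8415000)
print(high)               # (20400, 26400, 538560000)
print(high[2] / low[2])   # 64.0 -> 8x the dpi means 64x the pixels
```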

> In the first 6 pages or so, all text, there is no evidence of the
> document being scanned, it is like it has been "typeset" into a PDF
> document.  May I assume the first text pages were not simply scanned?

Yup...  Here is a case of using the best software for the desired
result. The first six pages - being all text - were scanned and
converted by Omnipage Pro 12.  This is a scan-to-text OCR (Optical
Character Recognition) software package that is very good at its task.
We (Sherry and I) found less than a half dozen outright errors when we
proofed it (it subbed a 9 for a 4 in one place), and one of the
"errors" was actually a spelling error in the original document
(perceptable vs. perceptible). Since we try to keep the new document as
"true as possible" to the original - the "spelling error" was retained.
The other errors were abbreviations - which its spelling checker
sometimes guesses wrong - and when the guess is close to some "real
word" it doesn't trip the "proofer" to present it for Ignore/Change
confirmation.
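
If you keep both the raw OCR output and the hand-proofed copy as plain
text, a word-by-word comparison gives you a record of what the proofing
pass changed. This is only a sketch using Python's standard difflib
module - it is not something Omnipage does, the function name is made
up, and the sample lines are invented rather than taken from the actual
document.

```python
import difflib

def ocr_differences(ocr_text, proofed_text):
    """List (ocr_words, proofed_words) pairs where the two texts disagree."""
    ocr_words = ocr_text.split()
    proofed_words = proofed_text.split()
    matcher = difflib.SequenceMatcher(None, ocr_words, proofed_words,
                                      autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(ocr_words[i1:i2]),
                            " ".join(proofed_words[j1:j2])))
    return changes

# Invented sample: the OCR read a 4 as a 9. The original's own spelling
# ("perceptable") is kept in both versions, so it is not reported.
ocr = "Adjust R9 until the change is barely perceptable"
proofed = "Adjust R4 until the change is barely perceptable"
print(ocr_differences(ocr, proofed))   # [('R9', 'R4')]
```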

The table on page two (parts list) presented a little formatting issue -
as I allowed the program to "guess" at the page's entire content - when
I should have manually "drawn" the regions. Omnipage can either "guess"
the regions of a page (plain text areas; formatted text (tables); and
graphics areas) - or you can manually draw these areas to ensure the
results more closely match the original. Here again - more time - better
results.

> On the other hand, the last four pages, all the drawings pages, they
> are scanned, and the scan lines become very apparent about 400% display.

OK - two different issues here - one is scan mode, and the other is
post processing. As noted - different software does different things. If
the graphic elements are simple and the original is good - Omnipage Pro
can handle it pretty well (its base graphic mode is TIFF). However, if
the graphics are complex (high resolution, detailed, or poor quality
originals) then I use one of two different pieces of software to scan
the graphics.  Whatever I use to scan - Photoshop is the "post
processor" as it can do magic (the amount of "magic" is directly
proportional to the time invested). If I'm doing "onesies" - then
Microtek's ScanWizard plug-in to Photoshop is used. If there is a bunch
to be scanned - then Vuescan is the program of choice - as it is VERY
powerful, and automates a great deal of the "repetitive" process of
scanning things (I say things - because that's also the software I use
to drive a Nikon 35mm slide scanner - which can scan 50 slides at a time
using its autoloader).

The last four pages were scanned with Microtek's ScanWizard - 300dpi -
line art mode.  That mode is the key. High res graphics are scanned in
either gray scale (B&W) or high res color.  The problem is that those
modes "retain" all the detail - including the faded, yellowed
background. You COULD go in with Photoshop and manually clean that up -
but that takes time. You can change "mode" with Photoshop (from gray
scale or RGB to bitmap) and let it "drop out" the background
"clutter/noise" that way - but if you do it "page wide" it either
doesn't get it all - or it gets too much and loses some (desired)
detail. If, on the other hand - you set the scanning software to bitmap
mode (line art) - it *dynamically* adjusts as it scans - (usually) doing
quite a good job of separating the desired "stuff" from the chaff. Again
- point being - you *could* do a better job manually - *IF* you have the
time.  And around here - time is in short supply.
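
To show why a "page wide" cutoff struggles on an unevenly yellowed page,
here is a small sketch that compares one global threshold with a simple
local one (each pixel is compared with the average of its neighbors). I
don't know what ScanWizard actually does internally in line art mode -
the local-mean method, the function names, and the tiny made-up "page"
below are all stand-ins for illustration. Values are gray levels, 0 =
black, 255 = white.

```python
def global_threshold(gray, cutoff):
    """Bitmap conversion with one cutoff for the whole page (1 = ink)."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]

def local_threshold(gray, radius=1, offset=10):
    """Mark a pixel as ink when it is at least `offset` darker than the
    mean of its (2*radius+1)-square neighborhood."""
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                gray[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            mean = sum(window) / len(window)
            row.append(1 if gray[y][x] < mean - offset else 0)
        out.append(row)
    return out

# Made-up page: the background darkens from left (230) to right (125).
# One faded ink dot (150) sits on the clean side, one dark ink dot (40)
# on the yellowed side.
background = [230, 215, 200, 185, 170, 155, 140, 125]
page = [list(background) for _ in range(3)]
page[1][1] = 150
page[1][6] = 40

print(global_threshold(page, 100)[1])  # misses the faded dot
print(global_threshold(page, 160)[1])  # finds it, but the dark background turns to "ink"
print(local_threshold(page)[1])        # finds both dots, background dropped
```

On this toy page no single cutoff works: anything low enough to drop the
yellowed background also drops the faded ink.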

Ok... now that we have "text" pages - they are edited in MS Word (they
were saved out of Omnipage Pro directly into a Word document).  They are
spell checked and proofread against the originals, and any formatting
issues are fixed. Even though the font - Times New Roman in this case -
is very similar to the original - it doesn't "lay out" exactly the same
(letters per line, etc.) as the original document - so some "tweaking"
is done to make it "look the same".

The graphics are already in Photoshop - so a little touchup here and
there - sharpen, crop/size, convert to 72dpi (so it displays the same on
computer screens as it prints); and save. In the case of page 10 - the
original print is messed up bad - it obviously went through the press
"crooked" and smeared a bit. So some cleanup - and replace most of the
text so it's readable.
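
The arithmetic behind the 72dpi step, as a sketch. It assumes a screen
that shows one image pixel per screen dot at 72 dots per inch, and a US
Letter page scanned at 300dpi - both are assumptions for the example,
and the function names are made up.

```python
def resample_size(width_px, height_px, from_dpi, to_dpi):
    """Pixel dimensions after resampling so the printed size stays the same."""
    return (round(width_px * to_dpi / from_dpi),
            round(height_px * to_dpi / from_dpi))

def on_screen_inches(width_px, screen_dpi=72):
    """Width on screen when each image pixel takes one screen dot."""
    return width_px / screen_dpi

# Assumed scan: 8.5 x 11 inch page at 300dpi = 2550 x 3300 pixels.
print(resample_size(2550, 3300, 300, 72))   # (612, 792)
print(on_screen_inches(612))                # 8.5 -> same width as the print
print(round(on_screen_inches(2550), 1))     # 35.4 -> the unconverted scan is far wider
```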

Now open Adobe Acrobat - import the Word document - check - (yeah -
looks ok) - import the four pages of graphics - oops, one text legend got
messed up - back to Photoshop - fix, save - re-import to Acrobat.

Save a master - then save again using the "reduce file size" option,
which limits "compatibility" - but I figure most people have at least
Reader 5.x - so that's the option I choose - and it cuts the final file
size by 2/3. Upload to the server via FTP - let everyone know it's there.

> I am not sure what questions I ought to
> ask on how you did it.

Well - now that you have an "overview" of the process - and the tools 
that I use - and there are certainly others... you can jump in - and 
lend a hand preserving these old documents before they are lost to time 
and dust... As noted - I wish I had more time, but right now that's a 
luxury I don't have.  Pesky doctor wants to do "a procedure" on me 
tomorrow - so I have to get things in order before "reporting in". Then 
it's back to business at hand. Of course being busy helps pay the bills
- so I guess that's better than too much time on my hands at this point
;-)  (pesky doctors LIKE to do procedures - with their hand out, of
course!).

I am considering getting a newer scanner - not that this one isn't
"adequate" for the job from a quality viewpoint - but with several
thousands of pages to scan (many Navy training manuals, plus the Test
Methods and Practices and Reference Data volumes of the EIMB - the
1964 edition that still has TUBES in it!)... it'd sure be nice to
turn a scanner loose like I can the slide scanner...

best regards...
-- 
randy guttery

A Tender Tale - a page dedicated to those Ships and Crews
so vital to the United States Silent Service:
http://tendertale.com