Best Practices for Portable Document Format (PDF) Creation

October 2013

The Office of Scientific and Technical Information (OSTI) is responsible for permanently storing the Department of Energy's (DOE) scientific and technical information (STI) collection.  The STI must be collected in a digital format that can be preserved and remain accessible for years to come.  In the late 1990s, OSTI selected the Portable Document Format (PDF) as the preferred format for receiving STI.  OSTI continues to use this format for the submission and storage of STI documents.  Following these best practices for generating PDF files enhances preservation, content, and accessibility.

Preservation of STI

A key component of OSTI’s mission is to preserve STI, both historical and current, and OSTI’s centralized collection dates back to the 1940s.  In previous years, OSTI received and maintained STI on paper.  As resources permit, OSTI is converting the paper STI into PDF documents for long-term digital storage.  OSTI currently generates PDF/A-1a compliant PDFs with 300 DPI scanned images.  These PDFs are generated with commercial Optical Character Recognition (OCR) software, which converts the printed words into electronic text for enhanced indexing and searching.

STI Content

STI is the output of federally funded research, and all STI, whether unclassified/unlimited, Controlled Unclassified Information (CUI), or Classified, is to be provided to OSTI.  Publicly available (i.e., unlimited) STI must be easily accessible and searchable.  OSTI extracts text from all submitted PDFs for use in full-text searching.  If a PDF arrives at OSTI without extractable text, an OCR process is used to produce a PDF/A-1a compliant PDF with 300 DPI scanned images.  Besides supporting full-text indexing, the extracted text also serves the next important goal: accessibility.

Making STI Accessible

All US Government agencies are required to make content found on their websites accessible to persons with disabilities.  Correctly generated PDFs meet these accessibility requirements.  Digitally born documents have the best potential for accessibility, but scanned paper documents can also be successfully converted to accessible documents.

PDF/A Compliance

OSTI prefers PDFs that meet one of the PDF/A compliance standards.  PDF/A compliance helps ensure the PDF will be readable well into the future.  Many recent PDF-generating software packages have options for making PDF/A compliant files.  PDF/A-1a achieves two of OSTI's goals: preservation and accessibility.  Generally, a PDF/A-1a PDF has these attributes:

  • The PDF is saved as PDF version 1.4.
  • All fonts used must be embedded.  OSTI has experienced problems with some PDFs that use exotic fonts not available on the typical system used to view the PDF.
  • The PDF should use device independent color.
  • The PDF should contain a minimum set of XMP metadata (handled by the PDF generating software).
  • Document hierarchy should be included.
  • The PDF should be tagged.
  • The PDF should use Unicode character maps.
  • The language needs to be specified for the entire document, each page, or each text object.
  • The PDF cannot be encrypted.
  • The PDF should not use LZW compression.
  • The PDF should not have embedded files.
  • The PDF should not have external content references.
  • The PDF should not have multimedia content.
  • The PDF should not use JavaScript.
  • There should be no transparency.
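
A few of the attributes above can be screened for before submission.  The following Python sketch is a rough heuristic, not a PDF/A validator: it scans the raw bytes of a file for the header version and for name tokens commonly associated with LZW compression, embedded files, and JavaScript.  The function names and the marker table are illustrative assumptions, not part of any OSTI tool.  Tokens stored inside compressed object streams will not be seen, so an empty result does not demonstrate compliance.

```python
import re

# Assumed marker table: raw name tokens that hint at features the list above
# says a PDF/A-1a file should not contain. This is a byte scan, not a parser.
FORBIDDEN_MARKERS = {
    "LZW compression": b"/LZWDecode",
    "embedded files": b"/EmbeddedFiles",
    "JavaScript": b"/JavaScript",
}


def pdf_header_version(data: bytes):
    """Return (major, minor) from the '%PDF-x.y' header, or None if absent."""
    match = re.search(rb"%PDF-(\d)\.(\d)", data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def quick_pdfa_screen(data: bytes):
    """Return a list of human-readable findings; an empty list means no hits."""
    findings = []
    version = pdf_header_version(data)
    if version is None:
        findings.append("no %PDF-x.y header found")
    elif version > (1, 4):
        findings.append(
            f"header version {version[0]}.{version[1]} is later than 1.4"
        )
    for label, marker in FORBIDDEN_MARKERS.items():
        if marker in data:
            findings.append(f"possible {label} ({marker.decode('ascii')} found)")
    return findings
```

A typical use would be to pass the bytes of a file to `quick_pdfa_screen` and review any findings by hand.  A dedicated PDF/A validator remains the appropriate tool for confirming compliance.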

PDF/A-1a compliance is difficult to obtain.  PDF/A-1b is less strict, and PDF/A-2 and PDF/A-3 allow more flexibility.  However, meeting any of the PDF/A requirements alone is not enough to conform to OSTI's PDF best practices.

Three Classes of PDFs

There are three classes of PDFs that can be generated for OSTI:  1) searchable image PDFs, 2) formatted text and graphics PDFs, and 3) hybrid PDFs (a mix of searchable image pages and formatted text and graphics pages).  PDFs from scans of paper should always be stored as images with searchable text, where each page of the original document is one complete PDF page.  PDFs that are born digitally should be stored as formatted text and graphics, where each page is made up of formatted text and any pictures on the page remain as graphics.  A hybrid PDF contains a mixture of the two types of pages, usually for a specific reason such as a scanned signature page.  OSTI's best practices for generating each of the three classes of PDFs are detailed below.

Generating Searchable Image PDFs from Paper Documents

If the document to be submitted to OSTI is known to exist only on paper or microfiche, these guidelines apply:

  1. The paper document should be scanned on a good quality scanner at 300 DPI or more.
  2. Monochrome pages should be scanned and saved using the CCITT Group 4 compression algorithm.  Color pages should be scanned using lossless compression, unless generating a PDF/A-2 PDF, which allows JPEG 2000 compression.
  3. The final step is to OCR the content to create a searchable image PDF.  Without this step, an image-only PDF is generated, and OSTI cannot extract the text it needs to create an index of all words in the document.  Remember that scanning and OCR software should be configured to generate PDF/A compliant PDFs or PDFs that meet as many of the PDF/A standards as possible.
  4. An optional step is to correct any OCR mistakes.  OSTI realizes that correcting OCR mistakes is a time-consuming operation, so this step is not required.
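
To illustrate why the resolution in step 1 and the compression choice in step 2 matter, the short Python sketch below works out the pixel dimensions and uncompressed size of one scanned page.  The US Letter page size (8.5 by 11 inches) and the function names are assumptions chosen for illustration; actual file sizes depend on the scanner and the compression applied.

```python
def scan_pixels(width_in: float, height_in: float, dpi: int = 300):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Raw image size, with each pixel row padded to a whole number of bytes."""
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Assumed example: one US Letter page at 300 DPI.
width_px, height_px = scan_pixels(8.5, 11, dpi=300)        # 2550 x 3300 pixels
mono_size = uncompressed_bytes(width_px, height_px, 1)     # 1,052,700 bytes
color_size = uncompressed_bytes(width_px, height_px, 24)   # 25,245,000 bytes
```

Under these assumptions, a single monochrome page is roughly 1 MB before compression and a 24-bit color page is roughly 25 MB, which is why the compression method for each page type deserves attention when scanning long documents.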

Generating Formatted Text and Graphics PDFs from Digitally Born Documents

If the document to be submitted to OSTI can be accessed using word processing software (e.g., Microsoft Word), the best practice is to generate a PDF directly from the word processor.  Many recent word processing applications have built-in PDF converters.  Before using the built-in converter, change the preferences to generate a PDF/A-1a compliant PDF or a PDF that meets as many of the PDF/A standards as possible.  If there is no built-in converter, both commercial and free PDF generators are available that work much like a printer.  Some organizations may have a PDF printer already installed.  Look for a printer with "PDF" in the name.  Again, these will need to be configured to generate a PDF/A-1a compliant PDF or a PDF that meets as many of the PDF/A standards as possible.  The resulting PDF should be a formatted text and graphics PDF that looks nearly identical to the document in the word processor.  Formatted text and graphics PDFs are a great way to preserve STI in a very accessible form.

Generating Hybrid PDFs

OSTI has received submissions where some pages are searchable images and other pages are formatted text and graphics.  Close examination of the content showed that the searchable image pages usually contain signatures, while the remaining pages are formatted text and graphics.  For its internal paper scanning, OSTI uses PDF assembly (or architecting) software to insert two formatted text disclaimer pages immediately after the first page; the rest of the pages are searchable image pages.  Many commercial and a few free software packages exist for this function, and some organizations may have site licenses for such software.  Assembling PDFs is an advanced topic beyond the scope of this guide, but it is worth investigating if signature pages or standard disclaimer pages are fairly common needs.  Refer to the previous best practices for searchable image pages and formatted text and graphics pages.

Will OSTI Automatically Reject PDFs?

Yes.  Several PDF features are detected by OSTI's processing software and will cause the PDF to be rejected.  These are:

  • Encryption - OSTI will attempt to extract text from the PDF.  If encryption prevents text extraction, the PDF will be rejected.
  • Passwords - If a password is required to extract text from a PDF, the PDF will be rejected.
  • Corruption - If a PDF is so corrupted or damaged that text cannot be extracted from it, it will be rejected.  Corruption or damage, if present, usually occurs during PDF uploading.
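
Submitters may wish to screen files for these conditions before uploading.  The Python sketch below is a minimal pre-submission check under stated assumptions; it is not OSTI's processing software, whose internals are not described in this guide.  It treats a missing `%PDF-` header or a missing `%%EOF` marker near the end of the file as signs of possible damage, and an `/Encrypt` entry as a sign of encryption or password protection.  The function name `rejection_risks` is hypothetical.

```python
def rejection_risks(data: bytes):
    """Return a list of warnings for conditions that may lead to rejection."""
    risks = []
    # A PDF normally begins with a '%PDF-' header.
    if not data.startswith(b"%PDF-"):
        risks.append("missing %PDF- header (possible corruption)")
    # A complete PDF normally ends with an '%%EOF' marker; a truncated
    # upload often lacks it.
    if b"%%EOF" not in data[-1024:]:
        risks.append("missing %%EOF marker near end of file (possible truncation)")
    # Encrypted and password-protected PDFs carry an /Encrypt entry.
    if b"/Encrypt" in data:
        risks.append("/Encrypt entry found (file may be encrypted or password protected)")
    return risks
```

For example, reading a file with `Path("report.pdf").read_bytes()` from `pathlib` (where `report.pdf` is a placeholder name) and passing the result to `rejection_risks` returns an empty list when none of these warning signs is present.  An empty list does not guarantee acceptance, since text extraction can fail for other reasons.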

 

For additional information and/or questions, contact stip@osti.gov.

 


Last updated: October 7, 2013