BFO PDF Library 2.24: a PDF/UA deep dive

BFO PDF Library 2.24, with improved PDF/UA

We released the PDF Library 2.24 today with vastly improved support for editing and verifying the PDF Structure Tree. This is the structure that is used to make a PDF accessible (among other uses), and is required to achieve both PDF/UA and PDF/A-1a, 2a and 3a compliance.

We've had some support for this structure for quite a while now. So what's new?

  • The PDF Structure Tree can be returned as a DOM Document (which is not new), and that Document is live and can be edited (which is new).
  • When inserting tags into the tree via the beginTag method, you can control how the tags are ordered. For example, this means it's possible to draw the backgrounds for some elements first, then the content, without having the order dictated by the order of the drawing operations.
  • It's now virtually impossible to create an invalid tag structure with the API; we repair most types of damage before saving.
  • Merging multiple documents with Structure Trees will work; pages can be combined from multiple sources and the structure will move with them. If the final order is not correct, the tree can be modified to fix it.
  • It's possible to create Structure tags or Attributes with XML namespaces - this is new in PDF 2.0.
  • PDF/UA validation and repair has been significantly improved.

The Structure Tree: an overview

PDF without a Structure Tree is just a sequence of operations on a page: move here, set this font, draw this text, draw the line, add this image. So imposing a structure onto this needs a little bit of lateral thinking.

Java code:

   Map<String,Object> atts = new HashMap<String,Object>();
   atts.put("id", "p1");
   page.beginTag("Document", null);
   page.beginTag("P", atts);
   page.setStyle(style);
   page.drawText("Hello");
   page.endTag();
   atts.put("id", "s1");
   page.beginTag("Span", atts);
   page.drawRectangle(0, 0, 20, 40);
   page.endTag();
   atts.put("id", "p2");
   page.beginTag("P", atts);
   page.drawImage(img, 0, 0, 1, 1);
   page.endTag();
   page.endTag();

PDF content:

   1 0 0 rg              % set color to red
   /P <</MCID 0>> BDC    % begin section 0
   BT                    % begin text
   /R1 24 Tf             % set font R1, 24pt
   (Hello) Tj            % draw text "Hello"
   ET                    % end text
   EMC                   % end section 0
   /Span <</MCID 1>> BDC % begin section 1
   0 0 20 40 re f        % draw rectangle
   EMC                   % end section 1
   /P <</MCID 2>> BDC    % begin section 2
   /R2 Do                % draw image R2
   EMC                   % end section 2

The way Adobe decided to do this was to add "markers" into the stream. Each page (or canvas) can be divided into marked sections, each with a unique number. A tree is then constructed separately that points to those numbered sections. We represent each of these sequences in the Document returned from getStructureTree as elements in a special namespace. Here is the tree you might get back from this method with the above code:

<StructTreeRoot xmlns="http://iso.org/pdf/ssn" xmlns:bfo="urn:bfopdf">
  <Document>
    <P id="p1">
      <bfo:content mcid="0" page="0">Hello</bfo:content>
    </P>
    <Span id="s1">
      <bfo:content mcid="1" page="0" />
    </Span>
    <P id="p2">
      <bfo:content mcid="2" page="0" />
    </P>
  </Document>
</StructTreeRoot>
  

This tree is live, so if you want to swap the order of the two paragraphs, or put the Span inside a Paragraph, this is easily done. For example:

   Document doc = pdf.getStructureTree();
   Element p1 = doc.getElementById("p1");
   Element s1 = doc.getElementById("s1");
   p1.appendChild(s1);
  
<StructTreeRoot xmlns="http://iso.org/pdf/ssn" xmlns:bfo="urn:bfopdf">
  <Document>
    <P id="p1">
      <bfo:content mcid="0" page="0">Hello</bfo:content>
      <Span id="s1">
        <bfo:content mcid="1" page="0" />
      </Span>
    </P>
    <P id="p2">
      <bfo:content mcid="2" page="0" />
    </P>
  </Document>
</StructTreeRoot>
  

Why would you want to do this? One example is when combining pages from multiple documents. When moving pages from one PDF to another, the destination PDF will import just enough of the structure from the source PDF to include all the content on the imported pages. But it's likely this resulting structure won't accurately represent the desired result. Being able to edit the Document using the standard DOM package means any changes can be made to the DOM quickly and easily. Or at least as quickly and easily as you can do anything with the DOM package.
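As a rough illustration of that kind of edit, the sketch below swaps the two paragraphs using only org.w3c.dom calls. It runs against a plain parsed copy of the XML shown earlier, not the live tree, so the class name, the byId helper and the embedded XML string are all stand-ins. On the real Document the same insertBefore call should apply, but that is an assumption rather than something verified here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SwapParagraphs {
    // A copy of the example tree from this article, used as a stand-in for the live tree.
    static final String XML =
        "<StructTreeRoot xmlns=\"http://iso.org/pdf/ssn\" xmlns:bfo=\"urn:bfopdf\">"
      + "<Document>"
      + "<P id=\"p1\"><bfo:content mcid=\"0\" page=\"0\">Hello</bfo:content></P>"
      + "<Span id=\"s1\"><bfo:content mcid=\"1\" page=\"0\"/></Span>"
      + "<P id=\"p2\"><bfo:content mcid=\"2\" page=\"0\"/></P>"
      + "</Document></StructTreeRoot>";

    // A plain parsed DOM has no ID typing, so find the element by its "id" attribute.
    static Element byId(Document doc, String id) {
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (id.equals(e.getAttribute("id"))) {
                return e;
            }
        }
        return null;
    }

    // Moves p2 in front of p1 and returns the ids of the children of <Document>, in order.
    static String reorder() {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(XML)));
            Element p1 = byId(doc, "p1");
            Element p2 = byId(doc, "p2");
            Node parent = p1.getParentNode();
            // insertBefore moves a node that is already in the tree; no removeChild is needed.
            parent.insertBefore(p2, p1);
            StringBuilder sb = new StringBuilder();
            for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element) {
                    sb.append(((Element) n).getAttribute("id")).append(' ');
                }
            }
            return sb.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(reorder());
    }
}
```

Because insertBefore relocates an existing node, the content elements travel with their parent and nothing has to be copied.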

Quirks

The Document returned from the PDF is not a regular XML document, although we try to present it as one by using the DOM interface. There are some key differences you should be aware of if you're planning on working with this Document.

Call Document.normalizeDocument() to incorporate changes to the PDF

As well as editing the Document tree directly via the DOM interface, it's possible to add content into the tree by calling the beginTag/endTag methods, or by migrating pages into or out of the PDF. These changes will not immediately be reflected in the Document, and neither will the automatic creation of namespace prefixes for PDF 2, extracted text and so on (see below).

To ensure the Document you are looking at is complete, call Document.normalizeDocument() after any of these changes and before you plan to analyse or edit the Document via the DOM interface.

page.beginTag("P", null);
page.drawText("Hello", 100, 800);
page.endTag();
Document doc = pdf.getStructureTree();
dump(doc);
     
    // Where's my content?
    <StructTreeRoot xmlns="http://iso.org/pdf/ssn" xmlns:bfo="urn:bfopdf"/>

doc.normalizeDocument();
dump(doc);

    // It's added in the call to normalizeDocument()
    <StructTreeRoot xmlns="http://iso.org/pdf/ssn" xmlns:bfo="urn:bfopdf">
      <P>
        <bfo:content mcid="0" page="0">Hello</bfo:content>
      </P>
    </StructTreeRoot>
  

You can create Elements but not Text, and some Elements are read-only

With a regular DOM Document you could create a new Text node by calling Document.createTextNode. This can't be done with the Document returned from PDF.getStructureTree(). Content in this tree must exist on the page, and the only way to do that is to mark a section of the page with beginTag/endTag as shown above.

Likewise, it's not possible to create or change attributes on an Element in either the special "bfo" namespace or the root element, and it's not possible to create processing instructions, entities and so on.

Pseudo-namespaces are used for certain attributes

PDF defines several standard attributes which can be applied to elements, and groups them into sets by "Owner". For example, to set the number of rows a <TH> element spans, you would set the RowSpan attribute with the Table owner. We represent this in the tree as an attribute with the "Table" prefix in the "urn:Table" namespace.

   Map<String,Object> atts = new HashMap<String,Object>();
   atts.put("Table:RowSpan", "2");
   page.beginTag("TH", atts);
    
    
  
<StructTreeRoot xmlns:bfo="urn:bfopdf"
        xmlns="http://iso.org/pdf/ssn"
        xmlns:Table="urn:Table">
  <TH Table:RowSpan="2" />
</StructTreeRoot>
  

The prefix "Table" as well as "Layout", "List", "PrintField" and "Artifact" are bound to special namespaces in this way, and cannot be reset.
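Reading such an attribute back through the DOM would normally go through the namespace-aware accessors. The sketch below shows this on a plain parsed document that mimics the tree above. It assumes the "urn:Table" namespace described in this article; the class name and the embedded XML are illustrative only, and it does not touch the library itself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class OwnerAttributes {
    // Stand-in for a Structure Tree holding a TH with a Table-owned attribute.
    static final String XML =
        "<StructTreeRoot xmlns=\"http://iso.org/pdf/ssn\" xmlns:Table=\"urn:Table\">"
      + "<TH Table:RowSpan=\"2\"/></StructTreeRoot>";

    // Returns the RowSpan value of the first TH, looked up by namespace URI and local name.
    static String rowSpan() {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // required for the *NS methods to work as expected
            Document doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(XML)));
            Element th = (Element) doc
                .getElementsByTagNameNS("http://iso.org/pdf/ssn", "TH").item(0);
            return th.getAttributeNS("urn:Table", "RowSpan");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rowSpan());
    }
}
```

Looking the attribute up by URI rather than by the "Table:RowSpan" string keeps the code independent of whichever prefix happens to be bound.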

PDF 1.x Documents may have no namespace or a fixed namespace.

Other than the pseudo-namespaces above and the magical "bfo" namespace we use for content, a StructureTree in PDF 1.x has no namespace. The concept isn't part of ISO 32000-1. However, that specification makes a distinction between a Structure Tree which meets a set of specific requirements (the same requirements used by PDF/UA and PDF/A) and one which does not.

Documents which claim to meet these requirements set the "Marked" property under the Document Catalog to true. When we load a PDF that makes this claim, we set the namespace on the root element to http://iso.org/pdf/ssn (a value first defined in PDF 2.0, but specified to apply to documents that match the requirements in ISO 32000-1). Documents that don't make this claim have no namespace.

PDF 2.x allows namespaces, but no namespace prefixes

PDF 2.x introduced a few changes in this area. The set of approved tags was changed (some were added, some removed), and namespaces were introduced for both elements and attributes. So when we open a PDF 2.0 document that claims to meet the requirements outlined above, we set the namespace to the value http://iso.org/pdf2/ssn, as defined in ISO 32000-2.

While namespaces are allowed, the concept of a prefix is not part of the specification. We will assign prefixes automatically to nodes in the tree to make the XML look correct, but they are not stored in the PDF. This has several consequences:

  1. All prefixes are defined on the root element.
  2. There is no need to set a prefix with an "xmlns" attribute (if you do, we'll migrate it to the root)

When manipulating the Document with the DOM package, namespaced elements and attributes can be created in the normal way. When creating tags with the beginTag/endTag methods, the namespace URI is specified as a prefix to the element or attribute name, separated with a newline.

   Map<String,Object> atts = new HashMap<String,Object>();
   atts.put("http://a.com\nFoo", "val");
   page.beginTag("http://b.com\nP", atts);
    
    
    
  
<StructTreeRoot xmlns:bfo="urn:bfopdf"
        xmlns="http://iso.org/pdf/ssn"
        xmlns:ns0="http://a.com"
        xmlns:ns1="http://b.com">
  <ns1:P ns0:Foo="val" />
</StructTreeRoot>
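Building those newline-separated names by hand is easy to get wrong, so a tiny helper may be worthwhile. The sketch below is one way to do it. The names qualify and split are hypothetical, not part of the library, and split is only the inverse for illustration; it makes no claim about how the library parses the string internally.

```java
class NamespacedNames {
    // Joins a namespace URI and a local name with a newline, the form
    // this article describes for beginTag. A null or empty URI yields the bare name.
    static String qualify(String uri, String localName) {
        if (uri == null || uri.isEmpty()) {
            return localName;
        }
        return uri + "\n" + localName;
    }

    // Splits a name back into { uri, localName }; uri is null when there is no newline.
    static String[] split(String name) {
        int i = name.indexOf('\n');
        if (i < 0) {
            return new String[] { null, name };
        }
        return new String[] { name.substring(0, i), name.substring(i + 1) };
    }

    public static void main(String[] args) {
        String key = qualify("http://a.com", "Foo");
        String[] parts = split(key);
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```

With this in place the attribute key in the example above could be written as qualify("http://a.com", "Foo") rather than as a literal containing "\n".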
  

Characters are allowed that are invalid in XML

It's possible to create an element or attribute with a name that is invalid in XML - containing spaces, punctuation and so on. We've actually seen quite a few documents constructed this way while testing; it seems to be something that's done by Adobe InDesign.

This won't cause a problem unless you are trying to import an element or attribute from the Structure Tree into a regular DOM, in which case illegal characters will throw an Exception. The solution is to set a parameter on the DomConfig object, as shown below. The "fix-invalid-xml" parameter will not change the values internally, but will change the way they are presented in the DOM interface so that they appear as legal XML values.

   Element e = document.getElementById("id1");
   System.out.println(e.getTagName());   // Output is "Tag name" - space is invalid!
   Node copy = dom.importNode(e, true);  // Throws an exception.
   document.getDomConfig().setParameter("fix-invalid-xml", true);
   System.out.println(e.getTagName());   // Output is "Tag_name" - now it's valid.
   copy = dom.importNode(e, true);       // Node is imported, all is well
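To give a feel for what such a mapping might look like, here is a small stand-alone sketch. It is not the library's algorithm: the only documented data point is "Tag name" becoming "Tag_name", so the rule used here (replace every character that is not allowed at its position with an underscore) is an assumption, and the character tests are a simplified approximation of the XML Name production. The class and method names are hypothetical.

```java
class XmlNameFixer {
    // Approximation of XML NameStartChar: letters and underscore.
    // The colon is deliberately excluded, as it separates prefix and local name.
    static boolean isNameStart(char c) {
        return Character.isLetter(c) || c == '_';
    }

    // Approximation of XML NameChar: NameStartChar plus digits, hyphen and full stop.
    static boolean isNameChar(char c) {
        return isNameStart(c) || Character.isDigit(c) || c == '-' || c == '.';
    }

    // Replaces each character that is illegal at its position with an underscore.
    static String fix(String name) {
        if (name.isEmpty()) {
            return "_";
        }
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean ok = (i == 0) ? isNameStart(c) : isNameChar(c);
            sb.append(ok ? c : '_');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fix("Tag name"));
    }
}
```

A substitution like this is lossy ("Tag name" and "Tag_name" would collide), which is presumably why the parameter only changes how names are presented and leaves the stored values alone.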
  

Text content is not always included

Extracting the text from a newly loaded PDF is quite a slow operation, and it requires an "extended edition plus viewer" license. For that reason we don't always populate the <bfo:content> elements with their text content (we do if you've created the content yourself, of course - this only applies to PDFs that have been loaded).

To complete the tree you need to either set the "extract-text" parameter on the DomConfig to true, or call PDFParser.getStructureTree instead of PDF.getStructureTree (this approach exists for legacy reasons; they do the same thing).

PDF pdf = new PDF(new PDFReader(new File("HelloWorld.pdf")));
Document doc = pdf.getStructureTree();
dump(doc);

    // No text within the bfo:content element
    <StructTreeRoot xmlns="http://iso.org/pdf/ssn" xmlns:bfo="urn:bfopdf">
      <Document>
        <P id="p1">
          <bfo:content mcid="0" page="0" />
        </P>
      </Document>
    </StructTreeRoot>
  
doc.getDomConfig().setParameter("extract-text", true);
doc.normalizeDocument();
dump(doc);
    
    // Text content is there
    <StructTreeRoot xmlns="http://iso.org/pdf/ssn" xmlns:bfo="urn:bfopdf">
      <Document>
        <P id="p1">
          <bfo:content mcid="0" page="0">Hello, World</bfo:content>
        </P>
      </Document>
    </StructTreeRoot>
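Once the text is present, it can be gathered with ordinary DOM calls. The sketch below collects the text of every content element in document order. It works on plain XML strings that copy the two dumps above rather than on a loaded PDF, and the class name and constants are illustrative; the "urn:bfopdf" namespace is the one shown in this article.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ContentText {
    static final String BFO_NS = "urn:bfopdf";

    // Copies of the two dumps above: before and after text extraction.
    static final String WITHOUT_TEXT =
        "<StructTreeRoot xmlns=\"http://iso.org/pdf/ssn\" xmlns:bfo=\"urn:bfopdf\">"
      + "<Document><P id=\"p1\"><bfo:content mcid=\"0\" page=\"0\"/></P></Document>"
      + "</StructTreeRoot>";
    static final String WITH_TEXT =
        "<StructTreeRoot xmlns=\"http://iso.org/pdf/ssn\" xmlns:bfo=\"urn:bfopdf\">"
      + "<Document><P id=\"p1\"><bfo:content mcid=\"0\" page=\"0\">Hello, World</bfo:content>"
      + "</P></Document></StructTreeRoot>";

    // Returns the text of all non-empty content elements, one per line, in document order.
    static String collect(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            NodeList contents = doc.getElementsByTagNameNS(BFO_NS, "content");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < contents.getLength(); i++) {
                String text = contents.item(i).getTextContent();
                if (text.isEmpty()) {
                    continue; // no text was extracted for this marked section
                }
                if (sb.length() > 0) {
                    sb.append('\n');
                }
                sb.append(text);
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect(WITH_TEXT));
    }
}
```

An empty result from a tree that clearly has content elements is a hint that text extraction has not been enabled.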
  

When saving, simple repairs will be made unless you say otherwise

There are various requirements placed on the Structure Tree by profiles like PDF/UA. For example, the <THead> must always be before any <TBody> in a <Table>. If these restrictions are not met we will try to repair them, notifying you of this by emitting a warning code beginning with "SD".

If for some reason you don't want this to happen (for example, if you're trying to verify that your input is correct, automatic repairs may not be helpful), this can be turned off with a parameter to the DomConfig, as shown below. Any errors will then throw an exception when saving instead.

Document doc = pdf.getStructureTree();
doc.getDomConfig().setParameter("fix-structure", false);
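With repairs turned off, it may be useful to test for a particular problem yourself before saving. As a sketch of the idea, the code below checks the one rule mentioned above, that no THead follows a TBody inside a Table. It operates on a plain parsed DOM, the class and method names are hypothetical, and it is in no way a substitute for the library's own validation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TableOrderCheck {
    // True if no THead child appears after a TBody child of the given Table element.
    static boolean theadBeforeTbody(Element table) {
        boolean seenBody = false;
        for (Node n = table.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element)) {
                continue;
            }
            String name = n.getLocalName();
            if ("TBody".equals(name)) {
                seenBody = true;
            } else if ("THead".equals(name) && seenBody) {
                return false;
            }
        }
        return true;
    }

    // Parses the XML and checks its root element, which is expected to be a Table.
    static boolean check(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // so that getLocalName() returns the bare tag name
            Document doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return theadBeforeTbody(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<Table><THead/><TBody/></Table>"));
        System.out.println(check("<Table><TBody/><THead/></Table>"));
    }
}
```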
  

Sorting content added via the beginTag method

It's quite common to want content in the tree in a different order to the way the same content is placed on the page - for example, if you draw the backgrounds of various objects first, then the text content on top. While it's possible to just dump everything onto the page and then move the content around later in the tree, another approach makes use of two special attributes that can be passed to beginTag: "bfo:sort" and "bfo:uuid". Both are optional.

The "bfo:uuid" attribute can be any String, and is used to uniquely identify an element. Sibling elements with the same UUID in the tree are merged when normalizeDocument is called; the content is all moved to the first element of the set.

   Map<String,Object> atts = new HashMap<String,Object>();
   atts.put("bfo:uuid", 1);
   page.beginTag("P", atts);
   page.drawText("abc", 100, 100);
   page.endTag();
   atts.put("bfo:uuid", 2);
   page.beginTag("Span", atts);
   page.drawText("def", 100, 100);
   page.endTag();
   atts.put("bfo:uuid", 1);
   page.beginTag("P", atts);
   page.drawText("ghi", 100, 100);
   page.endTag();
  
<StructTreeRoot xmlns:bfo="urn:bfopdf"
        xmlns="http://iso.org/pdf/ssn">
 <P>
  <bfo:content mcid="0">abc</bfo:content>
  <bfo:content mcid="2">ghi</bfo:content>
 </P>
 <Span>
  <bfo:content mcid="1">def</bfo:content>
 </Span>
</StructTreeRoot>
 
 
  

Further control is available with the "bfo:sort" attribute, which should be an instance of java.lang.Comparable (a java.lang.Integer is a good choice). Sibling elements will be sorted on this key, before they are merged on their uuid.

A very common case is trying to convert an existing XML document to a PDF structure. The easiest way to do this is to ensure that the "bfo:uuid" and "bfo:sort" attributes are both set to an Integer which is the index in document order of the original node. This will allow you to add content to the page in any order you like; so long as the beginTag/endTag calls are nested properly and the "bfo:sort" and "bfo:uuid" attributes are set, the resulting tree will be in the same order as the input tree.
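The sort-then-merge rule can be modelled in a few lines of plain Java, which may help when reasoning about what the tree will look like. This is an in-memory sketch only: Tag and normalize are hypothetical names, nothing here calls the library, and how elements lacking either attribute are treated is not modelled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SortAndMerge {
    // A sibling element with its sort key, uuid and the content drawn inside it.
    static final class Tag {
        final String name;
        final int sort;
        final Object uuid;
        final List<String> content = new ArrayList<>();

        Tag(String name, int sort, Object uuid, String text) {
            this.name = name;
            this.sort = sort;
            this.uuid = uuid;
            this.content.add(text);
        }
    }

    // Sorts siblings on their sort key, then merges tags sharing a uuid into the first one.
    static List<Tag> normalize(List<Tag> siblings) {
        List<Tag> sorted = new ArrayList<>(siblings);
        // List.sort is stable, so tags with equal keys keep their drawing order.
        sorted.sort(Comparator.comparingInt((Tag t) -> t.sort));
        Map<Object, Tag> firstByUuid = new LinkedHashMap<>();
        for (Tag t : sorted) {
            Tag first = firstByUuid.get(t.uuid);
            if (first == null) {
                firstByUuid.put(t.uuid, t);
            } else {
                first.content.addAll(t.content); // move content to the first of the set
            }
        }
        return new ArrayList<>(firstByUuid.values());
    }

    // Content is drawn with the Span first, yet the P ends up first in the tree.
    static String demo() {
        List<Tag> drawn = new ArrayList<>();
        drawn.add(new Tag("Span", 2, 2, "def"));
        drawn.add(new Tag("P", 1, 1, "abc"));
        drawn.add(new Tag("P", 1, 1, "ghi"));
        StringBuilder sb = new StringBuilder();
        for (Tag t : normalize(drawn)) {
            sb.append(t.name).append(t.content).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Using the same integer for both keys, as suggested above, means each original node maps to exactly one element and the sort alone restores document order.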

One last tip: the "bfo:location" attribute can be set to any String, and will be included in any warning or error messages printed about the Structure Tree. Set it to the location of the original element to aid debugging.

Accessing the Document "Role Map"

An aspect of the Structure Tree that is not part of XML is the ability to remap tags. For example, if you wished to represent both <pre> and <p> in PDF/UA you would have a problem, as only <P> is a recognised tag. You can solve this by mapping the <pre> tag to <P> by way of the document role-map. This is retrieved from the DomConfig as before.

Document doc = pdf.getStructureTree();
Map<String,String> rolemap = (Map<String,String>)doc.getDomConfig().getParameter("role-map");
rolemap.put("pre", "P");
page.beginTag("pre", null);     // Now valid in PDF/UA, as pre is mapped to P
  

Element names retrieved via the DOM interface are always the original values before remapping; in the example above, the Element.getTagName() method would return "pre".
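Since the DOM reports the original name, code that needs the effective tag has to consult the map itself. The sketch below does that with a plain Map. It assumes mappings may chain (a custom tag mapped to another custom tag that is itself mapped), and it guards against cycles; resolve and the sample entries are hypothetical and not part of the library.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class RoleMapLookup {
    // Follows the role map from the given tag until an unmapped name is reached.
    // The "seen" set stops the loop if the map contains a cycle.
    static String resolve(Map<String, String> roleMap, String tag) {
        Set<String> seen = new HashSet<>();
        String current = tag;
        while (roleMap.containsKey(current) && seen.add(current)) {
            current = roleMap.get(current);
        }
        return current;
    }

    // Resolves a tag against a small sample map: code -> pre -> P.
    static String demo(String tag) {
        Map<String, String> roleMap = new HashMap<>();
        roleMap.put("pre", "P");
        roleMap.put("code", "pre");
        return resolve(roleMap, tag);
    }

    public static void main(String[] args) {
        System.out.println(demo("code"));
    }
}
```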

Differences from previous releases

Finally, some small changes were made to the beginTag method which are incompatible with previous releases.

  • "ID", "C", "T", "text" and "E" were aliases for "id", "class", "title", "ActualText" and "abbr" respectively. These aliases have been removed.
  • Standard attributes could be specified without their Owners; for example you could specify "RowSpan" as an alias for "Table:RowSpan". These aliases have been removed.
  • We've added a lot of new features to OutputProfile to better profile PDF/UA, and a few of the older ones have been removed. There's no real reason to reference those individual features, so this is unlikely to affect anyone. A quick recompile against 2.24 will identify if that's the case - if it is, drop us a line at support@bfo.com.

Conclusion

For the most part it will be easiest to create a Structure Tree with a larger project built on top of the PDF API, such as our Report Generator. Most of the changes in this release are designed to facilitate that, but there are others:

  • The improvements to PDF/UA validation.
  • The ability to merge PDFs with a Structure Tree and get a valid (and useful) result.
  • The fix to ensure content is not considered damaged by Acrobat.

We hope that these features will make 2.24 a useful upgrade for many.