Odds and Ends: Dates, Timezones and Apostrophes

Musings on Dates and Timezones

Things are quiet as we launch into 2023, but it's a new year so the topic of Dates seems a good one to discuss.

PDF dates

Let's first go back 22 years, to the publication of the third edition of the PDF Reference. Section 3.8.2 describes the Date format used in PDF, and was unchanged from the previous editions. Here's what it looks like: (D:YYYYMMDDHHmmSSOHH'mm'), where O is +, - or Z and everything after the year is optional.

Dates looked like that all the way through to the sixth edition (PDF 1.7), at which point PDF transitioned from an Adobe spec to one managed by ISO - ISO32000 was essentially PDF 1.7, recreated for ISO. But something was lost on the way.

The trailing apostrophe is no longer specified. Why? Who knows - it still appears in some of the examples. Does it matter? Yes!

Here's a PDF I created at 10:41 UTC in Acrobat after setting my computer's timezone to India Standard Time (UTC+05:30) - so local time was 16:11. The /CreationDate metadata uses the old, pre-ISO format (with apostrophe), but I then re-saved the PDF with our API, modified to write the /ModDate in the new ISO format (without apostrophe). So here's what the "Info Dictionary" (which contains the file metadata) looks like inside:

  /CreationDate (D:20230117161128+05'30')
  /ModDate (D:20230117161158+05'30)
  /Creator(Acrobat Pro 22.3.20310)
  

And here's what the metadata looks like in Acrobat, once I'm back in my normal UTC timezone.

Oh dear. Looks like that trailing apostrophe really matters to Acrobat - without it, the minutes part of the timezone offset is dropped, so the offset is rounded down to the whole hour.
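For anyone reading these dates rather than writing them, the safe approach is to treat the final apostrophe as optional. Here's a minimal Python sketch of that idea - the function name and the regular expression are our own illustration, not code from any particular PDF tool, and it makes no attempt to cover every malformed date found in the wild.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSSOHH'mm' - every field after the year is optional.
# The "D:" prefix and the final apostrophe are both treated as optional,
# so the pre-ISO and ISO spellings of the offset are accepted alike.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})(?:'(?:(\d{2})'?)?)?)?$"
)

def parse_pdf_date(text):
    """Parse a PDF date string into a datetime (naive if no timezone)."""
    m = _PDF_DATE.match(text.strip())
    if m is None:
        raise ValueError(f"not a PDF date: {text!r}")
    # Missing fields fall back to the defaults the format implies.
    year, month, day, hour, minute, second = (
        int(value) if value is not None else default
        for value, default in zip(m.groups()[:6], (0, 1, 1, 0, 0, 0))
    )
    zulu, sign, tz_hours, tz_minutes = m.groups()[6:]
    if zulu:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(tz_hours), minutes=int(tz_minutes or 0))
        tz = timezone(-offset if sign == "-" else offset)
    else:
        tz = None  # no timezone in the string: leave the datetime naive
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)
```

With this, both dates from the Info Dictionary above come back with an offset of five and a half hours, which is what the author of the file intended.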

XMP dates

Dates aren't used in PDF very much beyond this "Info Dictionary", and that was... well, not technically deprecated, but replaced in the third edition of the PDF Reference (aka PDF 1.4) by the XMP metadata object, an XML-based format. The XMP specification wasn't published until after that edition of the PDF Reference, but the first revision of XMP (dated 2004) described dates like this:

ISO8601: finally, a date format which engineers can get behind. Except, of course, it's not that simple. ISO8601 is a complex beast and anyway, that reference isn't to ISO8601, it's to a W3C note. Let's take a look.

We have a choice of exactly six formats, and if you're going to include a time, you must include the timezone. This was still the case in the next revision (September 2005 - XMP specification revisions weren't numbered, so we have to reference the publication date on the cover), but by the time of the July 2010 revision the language had changed:

The time zone designator need not be present in XMP. When not present, the time zone is unknown, and an XMP processor should not assume anything about the missing time zone.
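To make the difference between the two rules concrete, here's a small Python sketch of a parser for the W3C-style formats. The `require_timezone` flag is our own invention: set to True it models the earlier rule (a time must carry a time zone designator), set to False it models the July 2010 wording, where a missing designator simply yields a datetime with no timezone attached.

```python
import re
from datetime import datetime, timedelta, timezone

# YYYY, YYYY-MM, YYYY-MM-DD, then optionally Thh:mm, Thh:mm:ss or
# Thh:mm:ss.s, followed by an optional designator (Z, +hh:mm or -hh:mm).
_XMP_DATE = re.compile(
    r"^(\d{4})(?:-(\d{2})(?:-(\d{2})"
    r"(?:T(\d{2}):(\d{2})(?::(\d{2})(?:\.(\d+))?)?"
    r"(Z|[+-]\d{2}:\d{2})?)?)?)?$"
)

def parse_xmp_date(text, require_timezone):
    """Parse an XMP date; the datetime is naive when no designator is given."""
    m = _XMP_DATE.match(text)
    if m is None:
        raise ValueError(f"not an XMP date: {text!r}")
    year, month, day, hour, minute, second, fraction, tzd = m.groups()
    if hour is not None and tzd is None and require_timezone:
        raise ValueError("time given without a time zone designator")
    tz = None  # unknown time zone: assume nothing about it
    if tzd == "Z":
        tz = timezone.utc
    elif tzd:
        offset = timedelta(hours=int(tzd[1:3]), minutes=int(tzd[4:6]))
        tz = timezone(offset if tzd[0] == "+" else -offset)
    # Fractional seconds are truncated to microseconds.
    micro = int((fraction or "")[:6].ljust(6, "0"))
    # Note: a datetime cannot record that only a year or month was given,
    # so that precision information is lost in this sketch.
    return datetime(int(year), int(month or 1), int(day or 1),
                    int(hour or 0), int(minute or 0), int(second or 0),
                    micro, tzinfo=tz)
```

The same string, "2023-01-17T16:11:28", is rejected with `require_timezone=True` and accepted with `require_timezone=False`.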

Does this matter? Again, yes!

PDF/A dates

Now we get to PDF/A-1, which refers to PDF 1.4 and therefore XMP - which, remember, was unspecified at the time PDF 1.4 was published, so let's assume it refers to the first edition. Here's what it has to say on Metadata.

Now, for the first time, we have the requirement that "PDF Dates" (as specified in the Info Dictionary) must match the equivalent "XMP Dates" in the XMP metadata. The problem is that if the Info Dictionary contains a Date without a timezone, there's no way to represent that in XMP - because in 2004, timezones were required in XMP dates.
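A short Python sketch shows where the conversion gets stuck. The function name and the `allow_missing_timezone` flag are hypothetical, it only handles complete date-times, and it says nothing about how any given validator actually compares the two values.

```python
import re

# A complete PDF date-time, with an optional timezone and an optional
# final apostrophe. Partial dates (year only, etc.) are out of scope here.
_PDF_DATE_TIME = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})(?:'(?:(\d{2})'?)?)?)?$"
)

def pdf_date_to_xmp(pdf_date, allow_missing_timezone=False):
    """Rewrite a PDF date-time string in the XMP (W3C-style) notation."""
    m = _PDF_DATE_TIME.match(pdf_date)
    if m is None:
        raise ValueError(f"not a complete PDF date-time: {pdf_date!r}")
    year, month, day, hour, minute, second, zulu, sign, tzh, tzm = m.groups()
    stamp = f"{year}-{month}-{day}T{hour}:{minute}:{second}"
    if zulu:
        return stamp + "Z"
    if sign:
        return f"{stamp}{sign}{tzh}:{tzm or '00'}"
    if allow_missing_timezone:
        # Only expressible under the later XMP wording.
        return stamp
    raise ValueError("no timezone: cannot be written as an XMP date "
                     "when the time zone designator is mandatory")
```

Under the 2004 rules (the default here), "D:20230117161128" has no XMP equivalent at all, so there is nothing for the Info Dictionary value to match.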

By the time PDF/A-2 came along the Info dictionary was to be officially ignored, but it still referenced ISO32000-1, which referenced the September 2005 edition of XMP. So timezones are still mandatory for all dates in PDF/A-2 and PDF/A-3. It's not until PDF/A-4 (which references ISO32000-2:2020, which references the now-ISO-standardized XMP, ISO16684-1:2019) that we once again have the ability to create dates with times that do not have a timezone specified.

But really, does this matter?

A bit. As with many specifications in the PDF ecosystem, some reading between the lines is required. First, the BFO PDF Library has always created (and for now will continue to create) "PDF Dates" with the trailing apostrophe, as specified in PDF 1.4 but not as specified in ISO32000. We do this for the simple reason that if we don't, Acrobat gets the dates wrong, and as all other PDF tools seem to accept both formats, we go with the incorrect but more widely accepted format.
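As an illustration of that choice (and emphatically not the library's actual code), here's a Python sketch of a date writer with the apostrophe as a switch. The function name and the `trailing_apostrophe` parameter are hypothetical, and writing a zero offset as "Z" is just one option we picked for the sketch.

```python
from datetime import datetime, timedelta, timezone

def format_pdf_date(moment, trailing_apostrophe=True):
    """Format a datetime as a PDF date string.

    trailing_apostrophe=True gives the PDF 1.4 style (+05'30'),
    False gives the ISO32000 style (+05'30).
    """
    text = moment.strftime("D:%Y%m%d%H%M%S")
    offset = moment.utcoffset()
    if offset is None:
        return text  # naive datetime: write no timezone at all
    total_minutes = int(offset.total_seconds()) // 60
    if total_minutes == 0:
        return text + "Z"  # sketch choice for UTC
    sign = "+" if total_minutes > 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    tail = "'" if trailing_apostrophe else ""
    return f"{text}{sign}{hours:02d}'{minutes:02d}{tail}"
```

Defaulting the switch to True mirrors the reasoning above: the older spelling is the one every reader we know of handles correctly.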

For PDF/A-1 validation, if a PDF Date has no timezone it will be invalid - this type of Date cannot be represented in XMP. However, many other tools don't do this (Acrobat treats missing timezones as UTC), and it's an open question in the PDF/A working group as to whether this is a restriction we should revisit. Fortunately, the vast, vast majority of date-times also specify timezones (as they should), so it doesn't come up much.

Summary

This gives you some insight into how far down we have to dig when implementing our PDF Library, and perhaps highlights the difficulty of getting this right when you're dealing with dozens of specification versions over many years. It certainly helped fill a publishing hole in January, when we don't have much to write about.