Tracking timezone abbreviation removals in the IANA tz database

Started by Tom Laneover 9 years ago5 messageshackers

tgl@sss.pgh.pa.us

over 9 years ago

It turns out that some of the zone abbreviations shown in the IANA
timezone database don't correspond to any real-world usage, but were
more or less made up by the timezone database crew. And of late,
those folk have started to remove these invented abbreviations in favor
of just using numeric offsets. For example, part of the 2016f update
did this:

# Asia/Novosibirsk covers:
# 54 RU-NVS Novosibirsk Oblast

+# From Stepan Golosunov (2016-05-30):
+# http://asozd2.duma.gov.ru/main.nsf/(Spravka)?OpenAgent&RN=1085784-6
+# moves Novosibirsk oblast from UTC+6 to UTC+7.
+# From Stepan Golosunov (2016-07-04):
+# The law was signed yesterday and published today on
+# http://publication.pravo.gov.ru/Document/View/0001201607040064
+
 Zone Asia/Novosibirsk	 5:31:40 -	LMT	1919 Dec 14  6:00
-			 6:00	-	NOVT	1930 Jun 21 # Novosibirsk Time
-			 7:00	Russia	NOV%sT	1991 Mar 31  2:00s
-			 6:00	Russia	NOV%sT	1992 Jan 19  2:00s
-			 7:00	Russia	NOV%sT	1993 May 23 # say Shanks & P.
-			 6:00	Russia	NOV%sT	2011 Mar 27  2:00s
-			 7:00	-	NOVT	2014 Oct 26  2:00s
-			 6:00	-	NOVT
+			 6:00	-	+06	1930 Jun 21
+			 7:00	Russia	+07/+08	1991 Mar 31  2:00s
+			 6:00	Russia	+06/+07	1992 Jan 19  2:00s
+			 7:00	Russia	+07/+08	1993 May 23 # say Shanks & P.
+			 6:00	Russia	+06/+07	2011 Mar 27  2:00s
+			 7:00	-	+07	2014 Oct 26  2:00s
+			 6:00	-	+06	2016 Jul 24  2:00s
+			 7:00	-	+07

According to the Russia rules, %s expands to nothing for standard time or
"S" for summer time, so that the old entries here had the effect of
defining NOVT and NOVST as meaning variously UTC+6, UTC+7, or UTC+8
at different times in the past. Now those abbreviations are gone,
so far as IANA is concerned.

I had thought that this wouldn't really affect us because PG's
interpretations of zone abbreviations are driven by the info in
timezone/tznames/ rather than the IANA files in timezone/data/.
I forgot however that the "dynamic abbreviation" code relies on
being able to find matching abbreviations in the IANA data.
For example, timezone/tznames/Default lists

NOVST Asia/Novosibirsk # Novosibirsk Summer Time (obsolete)

but that zone abbreviation now fails entirely:

# select '2016-09-02 08:00:00 NOVST'::timestamptz;
ERROR: time zone abbreviation "novst" is not used in time zone "Asia/Novosibirsk"

(This is also the cause of bug #14307.)

So the question is what to do about it.

An easy answer is to just delete such entries from the timezone/tznames/
files altogether. If we believe the IANA crew's assumption that nobody
uses these abbreviations in the real world, then that seems like it
removes useless maintenance overhead for little cost. I'm a little
worried though that somebody might have followed IANA's lead and actually
started using these abbreviations, in which case we'd get complaints.

Another possibility is to keep these entries but get rid of the dynamic
zone aliases, reducing them to plain numeric UTC offsets. Probably the
value to use would be whatever was shown as the most recent active value
in the IANA list ... although that's open to interpretation. For
instance, according to the above we'd likely define NOVT as UTC+7,
but then logically NOVST ought to be UTC+8, which doesn't match up
with the fact that when it actually last appeared in the IANA data
it meant UTC+7. So that sounds like it'd be a bit of a mess involving
some judgment calls.

In the end, part of the reason we've got these files is so that users
can make their own decisions about abbreviation meanings. So the
ultimate answer to any complaints is going to be "if you think NOVT
means thus-and-such then put in an entry that says so".

So I'm leaning to the just-remove-it answer for any deleted abbreviation
that relies on a dynamic definition. Names that never had more than one
UTC offset can remain in the tznames list.

Comments?

BTW, we should be glad that the IANA crew decided to use ISO not POSIX
sign convention for these numeric zone abbreviations, or we'd really
have a mess on our hands :-(.

Also, the fact that this failure made it into releases shows that I'm
missing some now-necessary validation steps in the zone data update
procedure. Will add something about that to the timezone/README file.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: Tom Lane (#1)

Re: Tracking timezone abbreviation removals in the IANA tz database

I wrote:

So I'm leaning to the just-remove-it answer for any deleted abbreviation
that relies on a dynamic definition. Names that never had more than one
UTC offset can remain in the tznames list.

After a bit more thought and consumption of caffeine, I am thinking that
that won't really be good enough. It's clear that the IANA crowd intend
to continue removing made-up abbreviations, in fact there's a pile of
that in their queue right now:
http://mm.icann.org/pipermail/tz/2016-August/023941.html

The trouble from our perspective is that a lot of those abbreviations have
shifted meaning over the years (if they had any real-world usage maybe
they'd have stayed more stable...), which means that we are using dynamic
abbreviations for them, which means breakage as soon as the tznames list
diverges from the IANA database. Which would be manageable if we only
shipped those filesets together, but in installations built with
--with-system-tzdata (which at least ought to include most vendor
distributions of PG), they don't come from the same place. This means
that even if we fix removals when we make new releases, other
abbreviations may break the next time the vendor updates their tzdata
package.

We could maybe tolerate individual abbreviations failing like this, but
as noted in bug #14307, the pg_timezone_abbrevs view fails altogether if
there are any broken abbreviations in the active timezone_abbreviations
list. So that makes this a problem for all users whether or not they care
about the specific abbreviations affected.

This leads me to think that we need to redefine the dynamic abbreviations
feature so it's a bit more robust in the face of this type of situation.

A really simple change (at least logically, haven't looked at the code)
would be to say that a dynamic abbreviation is just a macro for the
referenced zone name, that is if we have

NOVT Asia/Novosibirsk

in the timezone_abbreviations list then writing NOVT in a timezone value
is exactly equivalent to writing Asia/Novosibirsk. However, that breaks
backwards compatibility (at least for non-broken abbreviations). Our
convention up to now has been that if you write a standard-time zone
abbreviation then what it means is your local standard-time UTC offset,
even if DST is currently prevailing in your zone. For example, DST is
currently in force in the USA, so writing "America/New_York" means UTC-4,
but "EST" means UTC-5 regardless. Likewise "EDT" means UTC-4 and will
still mean that when winter comes.

So the idea I'm toying with (again, haven't tried to code this) is to say
that *if* we can match the abbreviation to something in the referenced
zone then we'll use that, but otherwise we fall back to treating the
abbreviation as a macro for the zone name. This would ensure that updates
to the IANA data could not break existing timezone_abbreviations entries
(at least, not unless IANA were to remove a zone name altogether, but they
have never done that AFAIR). An update could cause an abbreviation's
meaning to change, but that's true already, in fact it's kind of the
whole point.

If we were to do that, then perhaps we would not need to remove existing
timezone_abbreviations entries even if IANA deems them obsolete. I'd
still be inclined to remove them from our sample data files, but users
would easily be able to put them back if they wanted.

Thoughts?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: Tom Lane (#2)

Re: Tracking timezone abbreviation removals in the IANA tz database

I wrote:

So the idea I'm toying with (again, haven't tried to code this) is to say
that *if* we can match the abbreviation to something in the referenced
zone then we'll use that, but otherwise we fall back to treating the
abbreviation as a macro for the zone name.

This turned out to require only fairly localized code changes. So
I now propose that we leave the tznames files as-is and apply the
attached patch (plus documentation changes, which I've not done yet).
Any objections?

regards, tom lane

Robert Haas

robertmhaas@gmail.com

over 9 years ago

In reply to: Tom Lane (#1)

Re: Tracking timezone abbreviation removals in the IANA tz database

On Fri, Sep 2, 2016 at 8:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I had thought that this wouldn't really affect us because PG's
interpretations of zone abbreviations are driven by the info in
timezone/tznames/ rather than the IANA files in timezone/data/.
I forgot however that the "dynamic abbreviation" code relies on
being able to find matching abbreviations in the IANA data.
For example, timezone/tznames/Default lists

NOVST Asia/Novosibirsk # Novosibirsk Summer Time (obsolete)

but that zone abbreviation now fails entirely:

# select '2016-09-02 08:00:00 NOVST'::timestamptz;
ERROR: time zone abbreviation "novst" is not used in time zone "Asia/Novosibirsk"

(This is also the cause of bug #14307.)

So the question is what to do about it.

An easy answer is to just delete such entries from the timezone/tznames/
files altogether. If we believe the IANA crew's assumption that nobody
uses these abbreviations in the real world, then that seems like it
removes useless maintenance overhead for little cost. I'm a little
worried though that somebody might have followed IANA's lead and actually
started using these abbreviations, in which case we'd get complaints.

Another possibility is to keep these entries but get rid of the dynamic
zone aliases, reducing them to plain numeric UTC offsets. Probably the
value to use would be whatever was shown as the most recent active value
in the IANA list ... although that's open to interpretation. For
instance, according to the above we'd likely define NOVT as UTC+7,
but then logically NOVST ought to be UTC+8, which doesn't match up
with the fact that when it actually last appeared in the IANA data
it meant UTC+7. So that sounds like it'd be a bit of a mess involving
some judgment calls.

In the end, part of the reason we've got these files is so that users
can make their own decisions about abbreviation meanings. So the
ultimate answer to any complaints is going to be "if you think NOVT
means thus-and-such then put in an entry that says so".

So I'm leaning to the just-remove-it answer for any deleted abbreviation
that relies on a dynamic definition. Names that never had more than one
UTC offset can remain in the tznames list.

Comments?

I tried to understand these emails today and mostly failed. I think
this stuff needs to be better documented.
src/timezones/tznames/README pretty much presumes that you more or
less know what's going on:

# This directory contains files with timezone sets for PostgreSQL. The problem
# is that time zone abbreviations are not unique throughout the world and you
# might find out that a time zone abbreviation in the `Default' set collides
# with the one you wanted to use. This can be fixed by selecting a timezone
# set that defines the abbreviation the way you want it. There might already
# be a file here that serves your needs. If not, you can create your own.

Of course, it doesn't bother to define the term 'timezone set', a
phrase that is not used in any other file in the source tree.
Eventually, after some poking around, I figured out that this is
described in the documentation under "B.3. Date/Time Configuration
Files", but I didn't notice that when I searched the documentation for
it. I only figured it out after I actually looked at the contents of
"src/timezone/tznames/Default" and saw that it recommended looking at
the "Date/Time Support", which I had done, but which I gave up on
doing pretty quickly because there was no section whose title included
the phrase "time zone".

Once I read that section, I ran across this:

# The offset is the equivalent offset in seconds from UTC, positive being east
# from Greenwich and negative being west. For example, -18000 would be five
# hours west of Greenwich, or North American east coast standard time. D
# indicates that the zone name represents local daylight-savings time rather
# than standard time. Alternatively, a time_zone_name can be given, in which
# case that time zone definition is consulted, and the abbreviation's meaning
# in that zone is used.

I dimly understand that "the abbreviation's meaning in that zone" is
referring to an EST vs. EDT type of distinction, but a novice reader
of this documentation will probably not understand that from this
language.

Similarly, your email mentions "dynamic zone aliases" but:

[rhaas pgsql]$ git grep -i 'dynamic zone alias'
[rhaas pgsql]$ git grep -i 'dynamic zone'
[rhaas pgsql]$ git grep -i 'zone alias'
[rhaas pgsql]$

I'm guessing that's referring to the same thing as the above
documentation paragraph, wherein a "time_zone_name" is given, but it
doesn't really refer to the time zone name, but rather the DST or
non-DST configuration of that time zone. But I'm not sure I'm right
about that.

Another thing that's not very well-documented is the use of our own
tzdata vs. the system's tzdata. There is a paragraph on it in the
installation instructions:

https://www.postgresql.org/docs/current/static/install-procedure.html

...but what happens if you use it? Are dynamic zone aliases and time
zone sets and such still in the picture, or have you forfeited that
stuff by going over to the system's tzdata?

Searching the documentation for "time zone" through the web site
yields a positively uninspiring set of results:

https://www.postgresql.org/search/?u=%2Fdocs%2F9.6%2F&q=time+zone

I suggest that if you (or someone else who understands it, if there is
any such person) were willing to either improve the existing
documentation or maybe even write a whole new chapter on how we do
time zone handling, it would be most welcome. I basically feel that
such documentation as we already have presumes expert-level knowledge
of what's in the IANA files, what's in our configuration files, and
how everything fits together; and I doubt many people have that
knowledge.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: Robert Haas (#4)

Re: Tracking timezone abbreviation removals in the IANA tz database

Robert Haas <robertmhaas@gmail.com> writes:

I suggest that if you (or someone else who understands it, if there is
any such person) were willing to either improve the existing
documentation or maybe even write a whole new chapter on how we do
time zone handling, it would be most welcome.

Well, I'll put that on my to-do list, but it's a very long list.
And TBH, I thought the SGML docs around this were adequate already,
so I might not be in the best position to rewrite them to satisfy you.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tracking timezone abbreviation removals in the IANA tz database

Attachments: