IANA timezone abbreviations versus timezone_abbreviations

Started by Tom Laneover 1 year ago14 messageshackers

tgl@sss.pgh.pa.us

over 1 year ago

Although we've never documented this, it's been true for ages
that timestamptz_out shows timezone abbreviations that are taken
from the IANA tzdb data (in datestyles that use non-numeric
timezone fields, which is all but ISO). Meanwhile, timestamptz_in
recognizes timezone abbreviations if they match an entry in the
timezone_abbreviations file. Those two sources of truth are not
entirely consistent, leading to fun results like these:

regression=# set datestyle = postgres;
SET
regression=# set timezone = 'America/Montevideo';
SET
regression=# select '1900-01-01 00:00'::timestamptz;
timestamptz
------------------------------
Mon Jan 01 00:00:00 1900 LMT
(1 row)

regression=# select '1900-01-01 00:00'::timestamptz::text::timestamptz;
ERROR: invalid input syntax for type timestamp with time zone: "Mon Jan 01 00:00:00 1900 LMT"

(because LMT for "local mean time" is not in timezone_abbreviations)

regression=# select '1912-01-01 00:00'::timestamptz;
timestamptz
------------------------------
Mon Jan 01 00:00:00 1912 MMT
(1 row)

regression=# select '1912-01-01 00:00'::timestamptz::text::timestamptz;
timestamptz
------------------------------
Sun Dec 31 13:45:09 1911 MMT
(1 row)

(because this zone uses MMT for "Montevideo Mean Time" while
timezone_abbreviations thinks it means "Myanmar Time").

You can get unfortunate results even for current timestamps,
because we've not hesitated to put some North-American-centric
interpretations into the default abbreviations list:

regression=# set timezone = 'Asia/Shanghai';
SET
regression=# select '2024-12-13 00:00'::timestamptz;
timestamptz
------------------------------
Fri Dec 13 00:00:00 2024 CST
(1 row)

regression=# select '2024-12-13 00:00'::timestamptz::text::timestamptz;
timestamptz
------------------------------
Fri Dec 13 14:00:00 2024 CST

(because this IANA zone uses CST for "China Standard Time"
while timezone_abbreviations thinks it means (USA) "Central
Standard Time").

This mess was brought up in pgsql-bugs [1]/messages/by-id/CAJ7c6TOATjJqvhnYsui0=CO5XFMF4dvTGH+skzB--jNhqSQu5g@mail.gmail.com, but the solution
I propose here is invasive enough that I think it needs
discussion on -hackers.

What I think we should do about this is to teach timestamp
input to look into the current IANA time zone to see if it
knows the given abbreviation, and if so use that meaning
regardless of what timezone_abbreviations might say. This
isn't particularly hard, and it doesn't appear to cost
anything speed-wise, but is there anybody out there who
is relying on the current behavior?

I can imagine that somebody might be using an interpretation
that is contrary to IANA's ideas; but it seems fairly unlikely
with current IANA data, because they largely got rid of the
made-up abbreviations their data used to be full of. Anyplace
where we find a non-numeric abbreviation in the IANA data is
probably someplace where that abbreviation is widely current,
and people wouldn't expect it to mean something different.

On the positive side, this gives us a far better story for
abbreviations that conflict in different parts of the world.
timestamptz_in will now automatically do the right thing
given a correct timezone setting, without having to manually
adjust the abbreviation list. So between that and getting
rid of the round-trip hazards seen above, I think there is
sufficient reason to do this.

The only other way I can envision to remove the round-trip hazard
is to stop using alphabetic abbreviations at all in timestamp
output, and use numeric GMT offsets regardless of datestyle.
I doubt that would make many people happy. It would certainly
break a bunch of our own regression tests, and I expect it would
break other people's applications too.

(To be clear, I'm only proposing this for v18 not for back-patch.
While it's certainly fixing live bugs, there have not been that
many complaints, and the behavioral change is surely more than
we want for a back branch.)

Draft patches attached. Any thoughts?

regards, tom lane

[1]: /messages/by-id/CAJ7c6TOATjJqvhnYsui0=CO5XFMF4dvTGH+skzB--jNhqSQu5g@mail.gmail.com

Andreas Karlsson

andreas.karlsson@percona.com

over 1 year ago

In reply to: Tom Lane (#1)

Re: IANA timezone abbreviations versus timezone_abbreviations

On 12/13/24 12:33 AM, Tom Lane wrote:

What I think we should do about this is to teach timestamp
input to look into the current IANA time zone to see if it
knows the given abbreviation, and if so use that meaning
regardless of what timezone_abbreviations might say. This
isn't particularly hard, and it doesn't appear to cost
anything speed-wise, but is there anybody out there who
is relying on the current behavior?

I am not convinced this is an improvement. While this patch removes the
round-trip hazard it also makes it confusing to use the
timezone_abbreviations GUC since it can be overridden by IANA data based
on your current timezone. So you need to know all the, sometimes weird,
names for your current timezone. Seems unnecessarily hard to reason
about and wouldn't most people who use timezone_abbreviations rely on
the current behavior?

But that said I personally only use ISO timestamps with numerical
offsets. Partially to avoid all this mess.

Andreas

Tom Lane

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Andreas Karlsson (#2)

Re: IANA timezone abbreviations versus timezone_abbreviations

Andreas Karlsson <andreas@proxel.se> writes:

On 12/13/24 12:33 AM, Tom Lane wrote:

What I think we should do about this is to teach timestamp
input to look into the current IANA time zone to see if it
knows the given abbreviation, and if so use that meaning
regardless of what timezone_abbreviations might say.

I am not convinced this is an improvement. While this patch removes the
round-trip hazard it also makes it confusing to use the
timezone_abbreviations GUC since it can be overridden by IANA data based
on your current timezone. So you need to know all the, sometimes weird,
names for your current timezone. Seems unnecessarily hard to reason
about and wouldn't most people who use timezone_abbreviations rely on
the current behavior?

Presumably they're not that weird to the locals?

I am not sure what you mean by "people who use
timezone_abbreviations". I think that's about everyone --- it's
not like the default setting doesn't contain any abbreviations.
(If it didn't then we'd not have such a problem...)

But that said I personally only use ISO timestamps with numerical
offsets. Partially to avoid all this mess.

If you only use ISO notation then this doesn't matter to you
either way.

regards, tom lane

Bruce Momjian

bruce@momjian.us

over 1 year ago

In reply to: Tom Lane (#3)

Re: IANA timezone abbreviations versus timezone_abbreviations

On Mon, Dec 16, 2024 at 02:57:59PM -0500, Tom Lane wrote:

Andreas Karlsson <andreas@proxel.se> writes:

On 12/13/24 12:33 AM, Tom Lane wrote:

What I think we should do about this is to teach timestamp
input to look into the current IANA time zone to see if it
knows the given abbreviation, and if so use that meaning
regardless of what timezone_abbreviations might say.

I am not convinced this is an improvement. While this patch removes the
round-trip hazard it also makes it confusing to use the
timezone_abbreviations GUC since it can be overridden by IANA data based
on your current timezone. So you need to know all the, sometimes weird,
names for your current timezone. Seems unnecessarily hard to reason
about and wouldn't most people who use timezone_abbreviations rely on
the current behavior?

Presumably they're not that weird to the locals?

I am not sure what you mean by "people who use
timezone_abbreviations". I think that's about everyone --- it's
not like the default setting doesn't contain any abbreviations.
(If it didn't then we'd not have such a problem...)

But that said I personally only use ISO timestamps with numerical
offsets. Partially to avoid all this mess.

If you only use ISO notation then this doesn't matter to you
either way.

Yes, your patch seems like a big improvement.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Do not let urgent matters crowd out time for investment in the future.

Jelte Fennema-Nio

postgres@jeltef.nl

over 1 year ago

In reply to: Bruce Momjian (#4)

Re: IANA timezone abbreviations versus timezone_abbreviations

On Sun Dec 29, 2024 at 12:47 AM CET, Bruce Momjian wrote:

On Mon, Dec 16, 2024 at 02:57:59PM -0500, Tom Lane wrote:
Yes, your patch seems like a big improvement.

+    Before consulting the <varname>timezone_abbreviations</varname> file,
+    <productname>PostgreSQL</productname> checks to see whether an
+    abbreviation used in datetime input is defined in the IANA time zone
+    database entry currently selected by the
+    <xref linkend="guc-timezone"/> run-time parameter.  If so the time
+    zone's meaning is used, for consistency with datetime output.  The
+    <varname>timezone_abbreviations</varname> file is mainly useful for
+    allowing datetime input to recognize abbreviations for time zones
+    other than the active zone.

I think it would be good to add some additional clarify here. It was
fairly confusing to me. Especially the last sentence, due to the use of
"active zone", even though it's really talking about the currently
active abbreviations list. Probably my confusion mostly came from the
fact that I wasn't aware that timezone abbreviations were localized, but
I doubt I'm alone in not knowing this. Maybe something like this (feel
free to improve further):

Before consulting the <varname>timezone_abbreviations</varname>
file, <productname>PostgreSQL</productname> checks to see whether an
abbreviation used in datetime input is defined in the currently active
IANA time zone database. The abbreviations for these entries are
localized based on the <xref linkend="guc-timezone"/>
run-time parameter, so depending on the configured <xref
linkend="guc-timezone"/> abbreviations will differ. If it is found the
IANA time zone database, then that meaning is used for consistency with
datetime output. The <varname>timezone_abbreviations</varname> file is
only useful for allowing datetime input to recognize
abbreviations for time zones that are not defined in the currently
active IANA time zone database.

Jelte Fennema-Nio

postgres@jeltef.nl

over 1 year ago

In reply to: Tom Lane (#3)

Re: IANA timezone abbreviations versus timezone_abbreviations

On Mon Dec 16, 2024 at 8:57 PM CET, Tom Lane wrote:

Andreas Karlsson <andreas@proxel.se> writes:

On 12/13/24 12:33 AM, Tom Lane wrote:
I am not convinced this is an improvement. While this patch removes the
round-trip hazard it also makes it confusing to use the
timezone_abbreviations GUC since it can be overridden by IANA data based
on your current timezone. So you need to know all the, sometimes weird,
names for your current timezone. Seems unnecessarily hard to reason
about and wouldn't most people who use timezone_abbreviations rely on
the current behavior?

Presumably they're not that weird to the locals?

I am not sure what you mean by "people who use
timezone_abbreviations". I think that's about everyone --- it's
not like the default setting doesn't contain any abbreviations.
(If it didn't then we'd not have such a problem...)

Maybe changing the default value of timezone_abbreviations is a better
solution to the problem, or in addition to the proposed patch.

Tom Lane

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Jelte Fennema-Nio (#5)

Re: IANA timezone abbreviations versus timezone_abbreviations

"Jelte Fennema-Nio" <postgres@jeltef.nl> writes:

I think it would be good to add some additional clarify here. It was
fairly confusing to me. Especially the last sentence, due to the use of
"active zone", even though it's really talking about the currently
active abbreviations list. Probably my confusion mostly came from the
fact that I wasn't aware that timezone abbreviations were localized, but
I doubt I'm alone in not knowing this. Maybe something like this (feel
free to improve further):

Hmm, I don't like your phrasing using "IANA time zone database",
because that makes it sound like we'll take any abbreviation that's
found anywhere in that whole data set. What the proposal actually
does is to recognize any abbreviation that is used, or has been
used in the past, in the IANA time zone selected by our current
timezone setting. (And, perhaps more to the point, to give it the
meaning it has or had in that zone.) Not sure about concise wording
for that.

regards, tom lane

Jelte Fennema-Nio

postgres@jeltef.nl

over 1 year ago

In reply to: Jelte Fennema-Nio (#6)

Re: IANA timezone abbreviations versus timezone_abbreviations

On Sun Dec 29, 2024 at 11:49 PM CET, Jelte Fennema-Nio wrote:

Maybe changing the default value of timezone_abbreviations is a better
solution to the problem, or in addition to the proposed patch.

To clarify, I meant maybe changing the default of timezone_abbreviations
to be empty. It sounds like the timezone_abbreviations setting is a lot
less useful now than it was in the past.

Jelte Fennema-Nio

postgres@jeltef.nl

over 1 year ago

In reply to: Tom Lane (#7)

Re: IANA timezone abbreviations versus timezone_abbreviations

On Sun Dec 29, 2024 at 11:56 PM CET, Tom Lane wrote:

Hmm, I don't like your phrasing using "IANA time zone database",
because that makes it sound like we'll take any abbreviation that's
found anywhere in that whole data set. What the proposal actually
does is to recognize any abbreviation that is used, or has been
used in the past, in the IANA time zone selected by our current
timezone setting. (And, perhaps more to the point, to give it the
meaning it has or had in that zone.) Not sure about concise wording
for that.

Okay, yeah I definitely misunderstood what was happening here. So
scratch my previous attempt at clarifying.

The current situation seems utterly messed up though. One thing that
shocks me is that we're, by default and without warning, parsing IST as
Israel Standard Time instead of the timezone that 17% of the world's
population uses: Indian Standard Time. And even with this patch that
behaviour only changes if you set your timezone to Asia/India. Which
seems suboptimal, because even as a European myself, IST means Indian
Standard Time.

I definitely think this is a step in the right direction, but I'm not
sure that it goes far enough. How about in addition we change the
following:

1. Change the default of timezone_abbreviations to an empty list.
2. When parsing search for the abbreviation string in the IANA timezone
database.
a. If it's a unique match, use that timezone.
b. If it's not unique, but it matches an abbreviation of the current
timezone, use that timezone.
c. If it's part of the timezone_abbreviations list, use that timezone.
d. Else, throw an error.

I think that would result in a lot more sensible behaviour.

Another option would be to put "c" at the top of the list, which would
allow overriding the IANA timezone database with that file. As long as
we don't do that by default I think that would be fine.

And I guess a third option would be to remove conflicts from the Default
timezone_abbreviations list.

To be clear, for backwards compatibility we should probably keep the old
Default list in any of these cases so people can easily switch back in
case this change breaks their application.

#10

Tom Lane

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Jelte Fennema-Nio (#9)

Re: IANA timezone abbreviations versus timezone_abbreviations

"Jelte Fennema-Nio" <postgres@jeltef.nl> writes:

The current situation seems utterly messed up though. One thing that
shocks me is that we're, by default and without warning, parsing IST as
Israel Standard Time instead of the timezone that 17% of the world's
population uses: Indian Standard Time. And even with this patch that
behaviour only changes if you set your timezone to Asia/India. Which
seems suboptimal, because even as a European myself, IST means Indian
Standard Time.

This argument would hold more water if we'd gotten more than
approximately zero field complaints about the current setup.
I think most folk who want that just use the documented solution
of setting timezone_abbreviations = 'India'.

Certainly there's an argument to be made that we should have gone
for a minimal rather than maximal default list of abbreviations.
But it's a couple of decades too late to be making that argument ---
at this point backwards compatibility is a huge consideration IMV.
(The current design dates from d8b5c95ca of 2006-07-25, just to
clarify how long this has stood.)

1. Change the default of timezone_abbreviations to an empty list.

The villagers would be on our doorstep with pitchforks if we did
that. It's remarkable how many of these abbreviations have live
constituencies --- try searching the commit log for "abbreviation"
to find a lot of commits that added zone abbrevs based on user
complaints. Now, most of those complaints were ten or more years
back, so maybe people no longer care as much ... but I doubt that's
the way to bet.

2. When parsing search for the abbreviation string in the IANA timezone
database.

Searching the whole IANA database isn't very practical I fear.
We do something approximately that costly to identify the default
timezone setting --- and that is code that got moved into initdb
precisely because it was too slow to be tolerable as part of
postmaster start. It might work to calculate the set of abbrevs
known in the IANA data once and cache it; but where, and how
would we know when to update the cache?

Another thing that complicates all this is that our
timezone_abbreviations data reflects a very old state of the IANA
database, from when they were of a mindset that every zone should have
alphabetic timezone abbrevs even if they had to make them up. A few
of those choices achieved real-world currency, but most didn't, and
starting in about 2015 the IANA crew removed (most of?) the ones
they'd made up in favor of numerical UTC offsets. We kept them all,
for fear of user complaints if we removed them. So we're now in a
situation where timezone_abbreviations knows a lot of abbrevs that
are no longer to be found in IANA, and it's very unclear how many
people might be depending on those entries. It's probably more than
none though. (Conversely, IANA does still have some historical
abbrevs such as "LMT" that aren't in timezone_abbreviations.)

Anyway, I'm of the opinion that we should be after a minimal change
not a grand rewrite that removes all ambiguity. Timezones are such
a political mess that we probably couldn't reach 100% clarity anyway.
We could definitely do major damage to the system's usability
though, if we don't tread carefully.

regards, tom lane

#11

Aleksander Alekseev

aleksander@timescale.com

over 1 year ago

In reply to: Tom Lane (#1)

Re: IANA timezone abbreviations versus timezone_abbreviations

Hi Tom,

This mess was brought up in pgsql-bugs [1], but the solution
I propose here is invasive enough that I think it needs
discussion on -hackers.

[...]

I tested and reviewed the patch. It fixes the originally reported bug
and looks good to me.

The only other way I can envision to remove the round-trip hazard
is to stop using alphabetic abbreviations at all in timestamp
output, and use numeric GMT offsets regardless of datestyle.
I doubt that would make many people happy. It would certainly
break a bunch of our own regression tests, and I expect it would
break other people's applications too.

I think we need to introduce alternative data types e.g. datetime /
datetimetz pair if we want to do this.

--
Best regards,
Aleksander Alekseev

#12

Tom Lane

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Aleksander Alekseev (#11)

Re: IANA timezone abbreviations versus timezone_abbreviations

Aleksander Alekseev <aleksander@timescale.com> writes:

I tested and reviewed the patch. It fixes the originally reported bug
and looks good to me.

Thanks for looking at it! I'm not sure we have full consensus on
this yet, but at least there seem to be a majority in favor.
So here's a v2 with some loose ends cleaned up.

In 0001, I edited the docs more heavily in hopes of addressing
Jelte's concern about the docs being confusing. The code and
test changes are the same as before.

0002 is also the same as before.

0003 is new work that fixes the pg_timezone_abbrevs view so
that its output matches the new behavior.

regards, tom lane

#13

Tom Lane

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Tom Lane (#12)

Re: IANA timezone abbreviations versus timezone_abbreviations

I wrote:

0003 is new work that fixes the pg_timezone_abbrevs view so
that its output matches the new behavior.

Hmph ... the cfbot doesn't like that one bit. It appears that the
"LMT" abbrev is known in my machine's system-provided tzdata tree,
but not when building our own timezone files. I wonder if this is
exposing an existing bug (perhaps our copy of zic is too far out of
date?). Will look into it tomorrow.

regards, tom lane

#14

Tom Lane

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Tom Lane (#13)

Re: IANA timezone abbreviations versus timezone_abbreviations

I wrote:

Hmph ... the cfbot doesn't like that one bit. It appears that the
"LMT" abbrev is known in my machine's system-provided tzdata tree,
but not when building our own timezone files. I wonder if this is
exposing an existing bug (perhaps our copy of zic is too far out of
date?). Will look into it tomorrow.

Found it: there is indeed pre-existing breakage in
pg_interpret_timezone_abbrev(), but it doesn't manifest when using
--with-system-tzdata on RHEL8, so I'd missed it in my local testing.

0001-0003 are the same as last time (so I kept the patch version
as v2), and 0004 adds a fix for the pre-existing problem.

regards, tom lane

IANA timezone abbreviations versus timezone_abbreviations

Attachments:

Attachments:

Attachments: