Strategies for converting UTC data to local windows for arbitrary resolutions and timezones
Hi all,
I'm attempting to generate some reports using user-driven timezones for UTC
data, and I'm having trouble writing a query that meets the following
criteria:
1) Data should be averaged at one of daily, hourly, or 15 minute
granularities (user-driven).
2) Data over a given period should reflect the localized value of the UTC
data for a user-provided time zone. E.g.:
a) The 1 hour period starting at '2023-10-03 12:00:00.000000-0400'
(America/New_York) should reflect data with timestamps between '2023-10-03
16:00:00.000000Z' and '2023-10-03 17:00:00.000000Z'.
b) The 1 day period starting at '2023-10-03 00:00:00.000000-0400'
should reflect data with timestamps between '2023-10-03 04:00:00.000000Z'
and '2023-10-04 04:00:00.000000Z'.
3) When a period interacts with a DST change in the given timezone, the
data should not be clumped together. E.g.:
a) Data points occurring at 2023-11-05 05:30:00.000000Z and 2023-11-05
06:30:00.000000Z should be treated as falling into separate buckets when
the time zone is America/New_York: (2023-11-05 01:30:00.000000-0400 and
2023-11-05 01:30:00.000000-0500, respectively). This should be true for
either the 15 minute or 1 hour intervals.
b) Some clumping for day resolution seems ok! E.g. the 1 day period
starting at '2023-11-05 00:00:00.000000-0400' can and probably should
contain 25 hours' worth of data. Certainly it should not reflect the data
falling between '2023-11-05 04:00:00.000000Z' and '2023-11-05
04:00:00.000000Z' + '24 hours'::interval (= '2023-11-06 04:00:00.000000Z'),
because those bounds correspond to the local times '2023-11-05
00:00:00.000000-0400' and '2023-11-05 23:00:00.000000-0500'.
4) It would be relatively simple to handle the 15 minute and 1 hour periods
if all timezone offsets were multiples of 1 hour (in that case, all
operations could be done in UTC and then converted after the fact), but
unfortunately some time zones have 30-minute-based offsets, which breaks
this approach.
5) Ideally, the solution would not involve messing with the
server/connection's value of timezone. (Though I would be interested if
there was a solution that relaxed this constraint and was relatively
safe/compatible with transactions and psycopg2.)
6) Ideally, my query would return periods that are missing data (though I
could plausibly fill these in at the layer above). This points toward
generate_series but...
7) Sigh: I can't upgrade to 16. I gather that 16 has a timezone parameter
for generate_series, which I believe might help. But tragically,
Digitalocean doesn't yet support 16, and it's not practical to migrate
elsewhere. Based on historical release -> support timing, I'd imagine they
will not support it until Q2 2024, which is too late for this feature. If
anyone had the inside scoop about when they'd likely support it, I'd
welcome it!
This looks pretty hairy written out as above, but I actually think it
reflects most people's intuitions about what data a local period "should"
correspond to (though I'd welcome feedback on this point).
Here are some thoughts about approaches that I've tried, and what their
drawbacks seem to be. For all these, I'll use the following CTE to
demonstrate some data that crosses a DST boundary:
```
with original_data as (
    select
        ('2023-11-05 00:00:00.000000Z'::timestamptz)
            + (15 * x || ' minutes')::interval as "t"
    from
        generate_series(0, 1000) as x
)
```
1) date_trunc: it seems like as of v12, date_trunc accepts a third argument
of timezone, which essentially plays the role of the server timezone
setting for the scope of the function. This is very handy, and *I believe*
solves my issues for the hour/day periods:
```
[etc]
select
    date_trunc('day', t, 'America/New_York'),
    min(t),
    max(t),
    count(*)
from original_data
group by 1
order by 1;

       date_trunc       |          min           |          max           | count
------------------------+------------------------+------------------------+-------
 2023-11-04 04:00:00+00 | 2023-11-05 00:00:00+00 | 2023-11-05 03:45:00+00 |    16
 2023-11-05 04:00:00+00 | 2023-11-05 04:00:00+00 | 2023-11-06 04:45:00+00 |   100
 2023-11-06 05:00:00+00 | 2023-11-06 05:00:00+00 | 2023-11-07 04:45:00+00 |    96
 2023-11-07 05:00:00+00 | 2023-11-07 05:00:00+00 | 2023-11-08 04:45:00+00 |    96
 2023-11-08 05:00:00+00 | 2023-11-08 05:00:00+00 | 2023-11-09 04:45:00+00 |    96
 2023-11-09 05:00:00+00 | 2023-11-09 05:00:00+00 | 2023-11-10 04:45:00+00 |    96
[etc]
```
This checks out, but unfortunately date_trunc can't produce 15-minute
buckets, since it only truncates to named fields. I think, by the way, that
this behavior is identical to what I would've gotten if my server timezone
were "America/New_York" and I ran it without the 3rd argument, though I'd
be curious to hear if there are discrepancies.
On this point, I've read up on some of the history around this feature, and
was a little puzzled by this assertion in this thread
(/messages/by-id/87in1k73nr.fsf@news-spur.riddles.org.uk):
If you have a timestamp-in-UTC column and want to do a date_trunc in some
other specified zone (that's not the session timezone), you need FOUR uses
of AT TIME ZONE to do it correctly:
date_trunc('day', col AT TIME ZONE 'UTC' AT TIME ZONE 'Asia/Kathmandu')
AT TIME ZONE 'Asia/Kathmandu' AT TIME ZONE 'UTC'
I *think* that this doesn't comport with behavior I've seen, because at a
DST boundary the initial `col AT TIME ZONE 'UTC' AT TIME ZONE
'Asia/Kathmandu'` would drop the offset, which from then on would be
unrecoverable. For example:
```
select date_trunc('hour', '2023-11-05 05:30:00.000000'::timestamp
    AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York')
    AT TIME ZONE 'America/New_York' AT TIME ZONE 'UTC';
      timezone
---------------------
 2023-11-05 06:00:00

select date_trunc('hour', '2023-11-05 06:30:00.000000'::timestamp
    AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York')
    AT TIME ZONE 'America/New_York' AT TIME ZONE 'UTC';
      timezone
---------------------
 2023-11-05 06:00:00
```
This is what I meant above by "clumping" in point (3). I believe this issue
is ameliorated by setting the database timezone:
```
set timezone to 'America/New_York';

select date_trunc('hour', '2023-11-05 06:30:00.000000'::timestamp AT TIME ZONE 'UTC');
       date_trunc
------------------------
 2023-11-05 01:00:00-05

select date_trunc('hour', '2023-11-05 05:30:00.000000'::timestamp AT TIME ZONE 'UTC');
       date_trunc
------------------------
 2023-11-05 01:00:00-04
```
The same (correct, in my opinion) behavior seems to occur when using the
third argument to date_trunc, so all is well on that front. I'm being
pedantic about this because, if I've misunderstood and the four-`at time
zone` construction is a legitimate strategy, it could plausibly be applied
to the approaches below. Hoping to be incorrect here!
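Here's the information-loss point in miniature, using Python only to illustrate (a plain "timestamp" is modeled by a naive datetime):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")
# Two distinct UTC instants, one hour apart, straddling the fall-back
a = datetime(2023, 11, 5, 5, 30, tzinfo=timezone.utc)
b = datetime(2023, 11, 5, 6, 30, tzinfo=timezone.utc)
# Dropping the offset (as conversion to timestamp-without-time-zone does)
# collapses them into the same local wall-clock time, unrecoverably
naive_a = a.astimezone(ny).replace(tzinfo=None)
naive_b = b.astimezone(ny).replace(tzinfo=None)
assert naive_a == naive_b == datetime(2023, 11, 5, 1, 30)
```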
2) date_bin: this will take a "stride", which should accommodate all the
periods I'm interested in, and "origin" to help deal with initial day/hour
offsets. But I'm having trouble getting it to respect time zone/DST stuff.
For example:
```
[etc]
select
    date_bin('24 hours', t, '2023-11-03 00:00:00.000000-0400'::timestamptz),
    min(t),
    max(t),
    count(*)
from
    original_data
group by 1
order by 1;

        date_bin        |          min           |          max           | count
------------------------+------------------------+------------------------+-------
 2023-11-04 00:00:00-04 | 2023-11-04 20:00:00-04 | 2023-11-04 23:45:00-04 |    16
 2023-11-05 00:00:00-04 | 2023-11-05 00:00:00-04 | 2023-11-05 22:45:00-05 |    96
 2023-11-05 23:00:00-05 | 2023-11-05 23:00:00-05 | 2023-11-06 22:45:00-05 |    96
 2023-11-06 23:00:00-05 | 2023-11-06 23:00:00-05 | 2023-11-07 22:45:00-05 |    96
 2023-11-07 23:00:00-05 | 2023-11-07 23:00:00-05 | 2023-11-08 22:45:00-05 |    96
 2023-11-08 23:00:00-05 | 2023-11-08 23:00:00-05 | 2023-11-09 22:45:00-05 |    96
 2023-11-09 23:00:00-05 | 2023-11-09 23:00:00-05 | 2023-11-10 22:45:00-05 |    96
[etc]
```
As you can see, the days get "bumped" by DST and end up off by one hour
(until the spring change). I actually think this makes intuitive sense for
the concept of a "stride"--it just means date_bin is less useful for
timezones with DST.
3) generate_series: If I could get this working, it would be ideal, because
it would also help fill in gaps in my data with null rows. But again, I
can't seem to, except in v16, have it respect timezones other than the
server's:
```
set timezone to 'UTC';
select
    days as start_time,
    lead(days) over (order by days) as end_time
from generate_series(
    '2023-11-03 00:00:00.000000-0400'::timestamptz,
    '2023-11-07 00:00:00.000000-0500'::timestamptz,
    '1 day'::interval
) days;

       start_time       |        end_time
------------------------+------------------------
 2023-11-03 04:00:00+00 | 2023-11-04 04:00:00+00
 2023-11-04 04:00:00+00 | 2023-11-05 04:00:00+00
 2023-11-05 04:00:00+00 | 2023-11-06 04:00:00+00
 2023-11-06 04:00:00+00 | 2023-11-07 04:00:00+00
[etc.]

set timezone to 'America/New_York';
[same as above]

       start_time       |        end_time
------------------------+------------------------
 2023-11-03 00:00:00-04 | 2023-11-04 00:00:00-04
 2023-11-04 00:00:00-04 | 2023-11-05 00:00:00-04
 2023-11-05 00:00:00-04 | 2023-11-06 00:00:00-05
 2023-11-06 00:00:00-05 | 2023-11-07 00:00:00-05
[etc.]
```
The latter is correct for these purposes, but relying on the session's
timezone setting doesn't seem especially reliable or practical (see point 5).
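One way I can imagine mimicking v16's three-argument generate_series without touching the session timezone is to step in naive local (wall-clock) time and attach the zone per element; the SQL analogue would generate plain timestamps and apply AT TIME ZONE per row. A Python sketch of the idea (function name is mine):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")

def local_series(start_naive, stop_naive, step, tz):
    # Step in naive local ("wall clock") time, then attach the zone,
    # so boundaries stay at local midnight across DST changes.
    out, t = [], start_naive
    while t <= stop_naive:
        out.append(t.replace(tzinfo=tz))
        t += step
    return out

days = local_series(datetime(2023, 11, 3), datetime(2023, 11, 7),
                    timedelta(days=1), ny)
# Five local midnights; the UTC offset flips from -04 to -05 across Nov 5
assert len(days) == 5
assert days[0].utcoffset() == timedelta(hours=-4)
assert days[-1].utcoffset() == timedelta(hours=-5)
```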
Can anyone think of any other approaches to this problem? I'd be thrilled
if I could manage it without resorting to manually post-processing in
pandas or something, because I have a number of datapoints, and each
individual one is quite wide.
Apologies for the length of this message--just trying to be thorough. I
sincerely appreciate any help or pointers!
Best,
Lincoln
--
Lincoln Swaine-Moore
On Tue, Oct 3, 2023 at 7:33 PM Lincoln Swaine-Moore <lswainemoore@gmail.com>
wrote:
That's a long email to digest but a couple thoughts that may help.
I've always disliked the term "timestamp with time zone" and always
mentally translate that to "point in time". As long as data is stored as a
timestamp with time zone (point in time), it can be manipulated in any time
zone you desire including handling DST.
Time calculation always involves some assumption of meaning and PostgreSQL
makes some reasonable ones. I'm in US Pacific time and if I run:
select '2023-11-05'::timestamptz ;
I get:
timestamptz
------------------------
2023-11-05 00:00:00-07
If I add a day (crossing the DST boundary):
select '2023-11-05'::timestamptz + '1 day'::interval;
?column?
------------------------
2023-11-06 00:00:00-08
Observe that I end up at midnight the following day. But if I instead add
24 hours:
select '2023-11-05'::timestamptz + '24 hours'::interval;
?column?
------------------------
2023-11-05 23:00:00-08
24 hours is exactly what I get.
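The same distinction shows up in Python, where adding a timedelta to an aware datetime is wall-clock arithmetic, and absolute arithmetic requires a round trip through UTC (illustrative only):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

pt = ZoneInfo("America/Los_Angeles")
midnight = datetime(2023, 11, 5, tzinfo=pt)  # 2023-11-05 00:00:00-07

# "1 day": wall-clock arithmetic lands on the next local midnight
next_midnight = midnight + timedelta(days=1)
assert next_midnight == datetime(2023, 11, 6, 0, 0, tzinfo=pt)

# "24 hours": absolute arithmetic via UTC lands on 23:00 local
plus_24h = (midnight.astimezone(timezone.utc)
            + timedelta(hours=24)).astimezone(pt)
assert plus_24h == datetime(2023, 11, 5, 23, 0, tzinfo=pt)

# The fall-back day is 25 hours long, so the two differ by an hour
assert next_midnight - plus_24h == timedelta(hours=1)
```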
You are generating 15-minute intervals the hard way. You can do it directly
and have DST handled for you:
select generate_series('2023-11-05'::timestamptz,
'2023-11-06'::timestamptz, '15 minutes'::interval);
generate_series
------------------------
2023-11-05 00:00:00-07
2023-11-05 00:15:00-07
2023-11-05 00:30:00-07
2023-11-05 00:45:00-07
2023-11-05 01:00:00-07
2023-11-05 01:15:00-07
2023-11-05 01:30:00-07
2023-11-05 01:45:00-07
2023-11-05 01:00:00-08
2023-11-05 01:15:00-08
...
2023-11-06 00:00:00-08
(101 rows)
Note that 01:00:00-07 is a different point in time than 01:00:00-08.
Sticking with timestamp with time zone, aka a fully qualified point in
time, removes any ambiguity. Also observe that I get the correct number of
"bins". If I run the same thing but for March 12 2023 (spring forward)
I'll get 93 rows but on "normal days" there will be 97.
I suspect your best bet will be to store the data as type timestamp with
time zone and to set the time zone before running your queries.
Note that setting the time zone is a client/connection setting so if you
set it within a transaction, it will stay set when the transaction
concludes. But time manipulation is tricky and trying to DIY reinvent the
wheel is painful and often buggy. Let PostgreSQL do the work for you.
Cheers,
Steve
On 04.10.2023 at 05:36, Steve Crawford wrote:
On Tue, Oct 3, 2023 at 7:33 PM Lincoln Swaine-Moore
<lswainemoore@gmail.com> wrote:Hi all,
I'm attempting to generate some reports using user-driven
timezones for UTC data, and I'm having trouble writing a query
that meets the following criteria:1) Data should be averaged at one of daily, hourly, or 15 minute
granularities (user-driven).
2) Data over a given period should reflect the localized value of
the UTC data for a user-provided time zone. E.g.:
a) The 1 hour period starting at '2023-10-03
12:00:00.000000-0400' (America/New_York) should reflect data with
timestamps between '2023-10-03 16:00:00.000000Z' and '2023-10-03
17:00:00.000000Z'.
b) The 1 day period starting at '2023-10-03
00:00:00.000000-0400' should reflect data with timestamps between
'2023-10-03 04:00:00.000000Z' and '2023-10-04 04:00:00.000000Z'.
3) When a period interacts with a DST change in the given
timezone, the data should not be clumped together. E.g.:
a) Data points occurring at 2023-11-05 05:30:00.000000Z and
2023-11-05 06:30:00.000000Z should be treated as falling into
separate buckets when the time zone is America/New_York:
(2023-11-05 01:30:00.000000-0400 and 2023-11-05
01:30:00.000000-0500, respectively). This should be true for
either the 15 minute or 1 hour intervals.
b) Some clumping for day resolution seems ok! E.g. the 1 day
period starting at '2023-11-05 00:00:00.000000-0400' can and
probably should contain 25 hours' worth of data. Certainly it
should not reflect the data falling between '2023-11-05
00:04:00.000000Z' and '2023-11-05 00:04:00.000000Z' + '24
hours'::interval (= '2023-11-06 00:04:00.000000Z'), because that
would be the local times of '2023-11-05 00:00:00.000000-0400' and
'2023-11-04 23:00:00.000000-0500'.
4) It would be relatively simple to do 15 minute and 1 hour
periods were all timezone offsets multiples of 1 hour (in that
case, all operations could be done in UTC and then converted after
the fact), but unfortunately some time zones have 30 min-based
offsets, which interferes with this approach.
5) Ideally, the solution would not involve messing with the
server/connection's value of timezone. (Though I would be
interested if there was a solution that relaxed this constraint
and was relatively safe/compatible with transactions and psycopg2.)
6) Ideally, my query would return periods that are missing data
(though could plausibly fill these in in the layer above). This
points toward generate_series but...
7) Sigh: I can't upgrade to 16. I gather that 16 has a timezone
parameter for generate_series, which I believe might help. But
tragically, Digitalocean doesn't yet support 16, and it's not
practical to migrate elsewhere. Based on historical release ->
support timing, I'd imagine they will not support it until Q2
2024, which is too late for this feature. If anyone had the inside
scoop about when they'd likely support it, I'd welcome it!This looks pretty hairy written out as above, but I actually think
it reflects most people's intuitions about what data a local
period "should" correspond to (though I'd welcome feedback on this
point).Here are some thoughts about approaches that I've tried, and what
their drawbacks seem to be. For all these, I'll use the following
CTE to demonstrate some data that crosses a DST boundary:```
with original_data as (
select
('2023-11-05 00:00:00.000000Z'::timestamptz) + (15 * x ||
' minutes')::interval as "t"
from
generate_series(0, 1000) as x
)
```1) date_trunc: it seems like as of v12, date_trunc accepts a third
argument of timezone, which essentially plays the role of the
server timezone setting for the scope of the function. This is
very handy, and *I believe* solves my issues for the hour/day periods:
```
[etc]
select
date_trunc('day', t, 'America/New_York'),
min(t),
max(t),
count(*)
from original_data
group by 1
order by 1;date_trunc | min | max
| count
------------------------+------------------------+------------------------+-------
2023-11-04 04:00:00+00 | 2023-11-05 00:00:00+00 | 2023-11-05
03:45:00+00 | 16
2023-11-05 04:00:00+00 | 2023-11-05 04:00:00+00 | 2023-11-06
04:45:00+00 | 100
2023-11-06 05:00:00+00 | 2023-11-06 05:00:00+00 | 2023-11-07
04:45:00+00 | 96
2023-11-07 05:00:00+00 | 2023-11-07 05:00:00+00 | 2023-11-08
04:45:00+00 | 96
2023-11-08 05:00:00+00 | 2023-11-08 05:00:00+00 | 2023-11-09
04:45:00+00 | 96
2023-11-09 05:00:00+00 | 2023-11-09 05:00:00+00 | 2023-11-10
04:45:00+00 | 96[etc]
```This checks out, but unfortunately doesn't seem to work for 15
minutes. I think, by the way, that this behavior is identical to
what I would've gotten if my server timezone was
"America/New_York" and I ran it without the 3rd argument, though
I'd be curious to hear if there are discrepancies.On this point, I've read up on some of the history around this
feature, and was a little puzzled by this assertion in this
thread:
/messages/by-id/87in1k73nr.fsf@news-spur.riddles.org.uk:If you have a timestamp-in-UTC column and want to do a
date_trunc in some other specified zone (that's not the session
timezone), you need FOUR uses of AT TIME ZONE to do it correctly:date_trunc('day', col AT TIME ZONE 'UTC' AT TIME ZONE
'Asia/Kathmandu') AT TIME ZONE 'Asia/Kathmandu' AT TIME ZONE 'UTC'
I *think* that this doesn't comport with behavior I've seen,
because at a DST boundary the initial `col AT TIME ZONE 'UTC' AT
TIME ZONE 'Asia/Kathmandu'` would drop the offset, which from then
on would be unrecoverable. For example:```
select date_trunc('hour', '2023-11-05 05:30:00.000000'::timestamp
AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York') AT TIME ZONE
'America/New_York' AT TIME ZONE 'UTC';
timezone
---------------------
2023-11-05 06:00:00select date_trunc('hour', '2023-11-05 06:30:00.000000'::timestamp
AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York') AT TIME ZONE
'America/New_York' AT TIME ZONE 'UTC';
timezone
---------------------
2023-11-05 06:00:00
```This is what I meant above by "clumping" in point (3). I believe
this issue is ameliorated by setting the database timezone:```
set timezone to 'America/New_York';
select date_trunc('hour', '2023-11-05 06:30:00.000000'::timestamp
AT TIME ZONE 'UTC');
date_trunc
------------------------
2023-11-05 01:00:00-05select date_trunc('hour', '2023-11-05 05:30:00.000000'::timestamp
AT TIME ZONE 'UTC');
date_trunc
------------------------
2023-11-05 01:00:00-04
```The same (correct, in my opinion) behavior seems to occur when
using the third argument to date_trunc, so all is well on that
front, but I'm being pedantic about this because if I was
misunderstanding and usage of four `at time zone`s was a
legitimate strategy, it could plausibly be applied to the below.
Hoping to be incorrect here!2) date_bin: this will take a "stride", which should accommodate
all the periods I'm interested in, and "origin" to help deal with
initial day/hour offsets. But I'm having trouble getting it to
respect time zone/DST stuff. For example:```
[etc]
select
date_bin('24 hours', t, '2023-11-03
00:00:00.000000-0400'::timestamptz),
min(t),
max(t),
count(*)
from
original_data
group by 1
order by 1;date_bin | min | max
| count
------------------------+------------------------+------------------------+-------
2023-11-04 00:00:00-04 | 2023-11-04 20:00:00-04 | 2023-11-04
23:45:00-04 | 16
2023-11-05 00:00:00-04 | 2023-11-05 00:00:00-04 | 2023-11-05
22:45:00-05 | 96
2023-11-05 23:00:00-05 | 2023-11-05 23:00:00-05 | 2023-11-06
22:45:00-05 | 96
2023-11-06 23:00:00-05 | 2023-11-06 23:00:00-05 | 2023-11-07
22:45:00-05 | 96
2023-11-07 23:00:00-05 | 2023-11-07 23:00:00-05 | 2023-11-08
22:45:00-05 | 96
2023-11-08 23:00:00-05 | 2023-11-08 23:00:00-05 | 2023-11-09
22:45:00-05 | 96
2023-11-09 23:00:00-05 | 2023-11-09 23:00:00-05 | 2023-11-10
22:45:00-05 | 96
[etc]
```As you can see, the days get "bumped" by DST, and are off by one
(until the spring). I actually think this makes intuitive sense
for this concept of "stride"--it just means that it's not as
useful for timezones with DST.3) generate_series: If I could get this working, it would be
ideal, because it would also help fill in gaps in my data with
null rows. But again, I can't seem to, except in v16, have it
respect timezones other than the server's:```
set timezone to 'UTC';
select
days as start_time,
lead(days) over (order BY days) as end_time
from generate_series(
'2023-11-03 00:00:00.000000-0400'::timestamptz,
'2023-11-07 00:00:00.000000-0500'::timestamptz,
'1 day'::interval
) days;start_time | end_time
------------------------+------------------------
2023-11-03 04:00:00+00 | 2023-11-04 04:00:00+00
2023-11-04 04:00:00+00 | 2023-11-05 04:00:00+00
2023-11-05 04:00:00+00 | 2023-11-06 04:00:00+00
2023-11-06 04:00:00+00 | 2023-11-07 04:00:00+00
[etc.]set timezone to 'America/New_York';
[same as above]start_time | end_time
------------------------+------------------------
2023-11-03 00:00:00-04 | 2023-11-04 00:00:00-04
2023-11-04 00:00:00-04 | 2023-11-05 00:00:00-04
2023-11-05 00:00:00-04 | 2023-11-06 00:00:00-05
2023-11-06 00:00:00-05 | 2023-11-07 00:00:00-05
[etc.]
```The latter being correct for these purposes, but not seeming super
reliable/practical (see: point 5).Can anyone think of any other approaches to this problem? I'd be
thrilled if I could manage it without resorting to manually
post-processing in pandas or something, because I have a number of
datapoints, and each individual one is quite wide.Apologies for the length of this message--just trying to be
thorough. I sincerely appreciate any help or pointers!Best,
Lincoln--
Lincoln Swaine-MooreThat's a long email to digest but a couple thoughts that may help.
I've always disliked the term "timestamp with time zone" and always
mentally translate that to "point in time". As long as data is stored
as a timestamp with time zone (point in time), it can be manipulated
in any time zone you desire including handling DST.Time calculation always involves some assumption of meaning and
PostgreSQL makes some reasonable ones. I'm in US Pacific time and if I
run:
select '2023-11-05'::timestamptz ;
I get:
timestamptz
------------------------
2023-11-05 00:00:00-07If I add a day (crossing the DST boundary):
select '2023-11-05'::timestamptz + '1 day'::interval;
?column?
------------------------
2023-11-06 00:00:00-08Observe that I end up at midnight the following day. But if I instead
add 24 hours:
select '2023-11-05'::timestamptz + '24 hours'::interval;
?column?
------------------------
2023-11-05 23:00:00-0824 hours is exactly what I get.
You are generating 15-minute intervals the hard way. You can do it
directly and have DST handled for you:
select generate_series('2023-11-05'::timestamptz,
'2023-11-06'::timestamptz, '15 minutes'::interval);
generate_series
------------------------
2023-11-05 00:00:00-07
2023-11-05 00:15:00-07
2023-11-05 00:30:00-07
2023-11-05 00:45:00-07
2023-11-05 01:00:00-07
2023-11-05 01:15:00-07
2023-11-05 01:30:00-07
2023-11-05 01:45:00-07
2023-11-05 01:00:00-08
2023-11-05 01:15:00-08
...
2023-11-06 00:00:00-08
(101 rows)

Note that 01:00:00-07 is a different point in time than 01:00:00-08.
Sticking with timestamp with time zone, aka a fully qualified point in
time, removes any ambiguity. Also observe that I get the correct
number of "bins". If I run the same thing but for March 12 2023
(spring forward) I'll get 93 rows, but on "normal days" there will be 97.

I suspect your best bet will be to store the data as type timestamp
with time zone and to set the time zone before running your queries.

Note that setting the time zone is a client/connection setting, so if
you set it within a transaction, it will stay set when the transaction
concludes. But time manipulation is tricky, and trying to DIY reinvent
the wheel is painful and often buggy. Let PostgreSQL do the work for you.

Cheers,
Steve
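Steve's bin counts can be double-checked client-side by stepping along the UTC timeline; a minimal Python/zoneinfo sketch (mine, not from the thread, using the same Pacific-time dates):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

def series(start, end, step):
    """Step in absolute time (the UTC timeline), like generate_series
    over timestamptz with a sub-day interval."""
    utc = ZoneInfo("UTC")
    t, stop = start.astimezone(utc), end.astimezone(utc)
    out = []
    while t <= stop:
        out.append(t.astimezone(start.tzinfo))
        t += step
    return out

pacific = ZoneInfo("America/Los_Angeles")
q = timedelta(minutes=15)
fall_back = series(datetime(2023, 11, 5, tzinfo=pacific),
                   datetime(2023, 11, 6, tzinfo=pacific), q)
spring_fwd = series(datetime(2023, 3, 12, tzinfo=pacific),
                    datetime(2023, 3, 13, tzinfo=pacific), q)
# fall-back day: 101 rows; spring-forward day: 93; a normal day: 97
```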
UTC is an excellent basis for timestamps as a linear quantity in the db
(as Steve wrote: a "point in time"); no gaps, no duplicates.
For the sake of simplicity, I only included the possible date-bin
variants so that the results can be compared; ordering and grouping with
just one date_bin etc. can be easily customized...
SET TIMEZONE='Etc/UTC';
SELECT
sub.gs
,date_bin('15 minutes', sub.gs, '2023-01-01') AS norm15minutes
,date_bin('1 hours', sub.gs, '2023-01-01') AS norm1hour
,date_bin('1 days', sub.gs, '2023-01-01') AS norm1day
FROM (SELECT generate_series('2023-11-03 00:00:00Z'::timestamptz,
'2023-11-06 00:00:00Z'::timestamptz, '5 minutes'::interval) AS gs) AS sub;
For the WHERE clause, likewise keep everything in UTC (the conversion of
the "user time zone" parameters takes place beforehand). Conversion of
the results into the "user time zone" takes place in the client app.
--
regards, marian wendt
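date_bin over a UTC scale is just epoch arithmetic, which is easy to mirror in a client; a hypothetical Python equivalent (mine, for cross-checking results, not from the thread):

```python
from datetime import datetime, timedelta, timezone

def date_bin(stride, ts, origin):
    """UTC equivalent of Postgres date_bin(): snap ts down to the
    nearest stride boundary counted from origin."""
    return origin + ((ts - origin) // stride) * stride

origin = datetime(2023, 1, 1, tzinfo=timezone.utc)
ts = datetime(2023, 11, 3, 0, 7, 12, tzinfo=timezone.utc)
date_bin(timedelta(minutes=15), ts, origin)  # 2023-11-03 00:00:00+00
```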
Steve Crawford <scrawford@pinpointresearch.com> writes:
On Tue, Oct 3, 2023 at 7:33 PM Lincoln Swaine-Moore <lswainemoore@gmail.com>
wrote:

5) Ideally, the solution would not involve messing with the
server/connection's value of timezone. (Though I would be interested if
there was a solution that relaxed this constraint and was relatively
safe/compatible with transactions and psycopg2.)
Note that setting the time zone is a client/connection setting so if you
set it within a transaction, it will stay set when the transaction
concludes. But time manipulation is tricky and trying to DIY reinvent the
wheel is painful and often buggy. Let PostgreSQL do the work for you.
Expanding on that philosophy: you should be able to set the timezone
locally within a function, so that it wouldn't be that hard to make a
wrapper for generate_series that emulates the 4-argument version added
in v16.
Rather than messing with manually saving and restoring the prevailing
zone, I'd let the function SET infrastructure do it for me. Sadly,
that SET clause only takes literal constant arguments, so it'd go
roughly like this:
create function generate_series(timestamptz, timestamptz, interval, text)
returns setof timestamptz
strict immutable language plpgsql as
$$
begin
perform set_config('timezone', $4, true);
return query select generate_series($1, $2, $3);
end
$$ set timezone = 'UTC';
Setting the zone to UTC is a useless step, but that triggers
restoring the previous zone when the function exits; simpler
and probably faster than coding the save/restore explicitly.
Side note: whether this is really "immutable" is a matter for
debate, since time zone definitions tend to change over time.
But we chose to mark the new 4-argument version that way,
so you might as well do so too.
regards, tom lane
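For the day-granularity case, the wall-clock stepping that this wrapper (and v16's 4-argument generate_series) provides can be sketched in Python; the function name and approach here are illustrative only, not how Postgres implements it:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

def generate_series_tz(start, stop, step, tzname):
    """Day-granularity sketch of the 4-argument generate_series: the
    step arithmetic happens in the named zone, so day steps land on
    local wall-clock boundaries even across DST."""
    tz = ZoneInfo(tzname)
    t, end = start.astimezone(tz), stop.astimezone(tz)
    out = []
    while t <= end:
        out.append(t)
        # wall-clock addition: keep the local reading, re-resolve the offset
        t = (t.replace(tzinfo=None) + step).replace(tzinfo=tz)
    return out

utc = ZoneInfo("UTC")
days = generate_series_tz(
    datetime(2023, 11, 4, 4, tzinfo=utc),   # 2023-11-04 00:00-04
    datetime(2023, 11, 7, 5, tzinfo=utc),   # 2023-11-07 00:00-05
    timedelta(days=1),
    "America/New_York",
)
# every element is local midnight; the Nov 5 -> Nov 6 step spans 25 hours
```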
On 04.10.2023 16:11, Tom Lane wrote:
[quoted message trimmed]
As far as Lincoln describes it, the series is not the real problem here,
but is just intended to be a simplified example of his actual data. The
consideration that you can apply the time zone using a function should
hold here...

The following SELECT should show, as an example, that the desired result
can be achieved (TIMEZONE set to 'Etc/UTC'). Variations of date_bin for
15 minutes or 1 hour should work similarly...
SELECT
sub.gs AS ts_in_utc
,sub.gs AT TIME ZONE 'America/New_York' AS ts_in_local_tz
,date_bin('1 days', sub.gs AT TIME ZONE 'America/New_York',
'2023-01-01')
FROM (SELECT generate_series('2023-11-03 00:00:00Z'::timestamptz,
'2023-11-07 00:00:00Z'::timestamptz, '5 minutes'::interval) AS gs) AS sub
WHERE
sub.gs >= '2023-11-05 00:00:00 America/New_York'::timestamptz AND
sub.gs < '2023-11-06 00:00:00 America/New_York'::timestamptz
--
regards, marian wendt
Thank you to all who have weighed in! Very much appreciated.
A few thoughts based on what I've read:
As far as Lincoln describes it, the series is not the real problem here,
but is just intended to be a simplified example of his actual data.
This both is and isn't the case. I was using generate_series to create some
data for testing purposes, but I also would love to be able to use
generate_series for the logic as well.
SELECT
sub.gs AS ts_in_utc
,sub.gs AT TIME ZONE 'America/New_York' AS ts_in_local_tz
,date_bin('1 days', sub.gs AT TIME ZONE 'America/New_York',
'2023-01-01')
FROM (SELECT generate_series('2023-11-03 00:00:00Z'::timestamptz,
'2023-11-07 00:00:00Z'::timestamptz, '5 minutes'::interval) AS gs) AS sub
WHERE
sub.gs >= '2023-11-05 00:00:00 America/New_York'::timestamptz AND
sub.gs < '2023-11-06 00:00:00 America/New_York'::timestamptz
I believe this query will be funky around DST borders, because `sub.gs AT
TIME ZONE 'America/New_York'` will be localized in a way that erases the
difference between hours with different offsets, which are genuinely
different. For instance, I ran this and there are two rows within it that
look like:
` 2023-11-05 05:00:00+00 | 2023-11-05 01:00:00 | 2023-11-05 00:00:00`
and
` 2023-11-05 06:00:00+00 | 2023-11-05 01:00:00 | 2023-11-05 00:00:00`
I think that the non-unique second column will pose an issue for the date
binning at a resolution finer than 1 day.
Note that setting the time zone is a client/connection setting so if you
set it within a transaction, it will stay set when the transaction
concludes. But time manipulation is tricky and trying to DIY reinvent the
wheel is painful and often buggy. Let PostgreSQL do the work for you.
Yeah, I'm definitely interested in keeping as much of the DST stuff outside
my code as possible. I guess my concern is that I think there are other
places where this database is being used in my codebase that may rely on
the database time setting being UTC (or really, GMT, though I don't think
there's a difference for these purposes). It would be best if all of my
application's code declared its intentions about the time zone of the
database connection before running its query, but I don't think that's a
feasible change to make right now. That's what's motivating my pursuit of
finding a way to write these queries without changing this setting, through
appropriate casting and such.
create function generate_series(timestamptz, timestamptz, interval, text)
returns setof timestamptz
strict immutable language plpgsql as
$$
begin
perform set_config('timezone', $4, true);
return query select generate_series($1, $2, $3);
end
$$ set timezone = 'UTC';
This is a nice suggestion, and in fact, it would be fine from my
perspective to reset to UTC every time. My concern is only around the
safety of the final `set timezone`. Under what circumstances/when can I
count on that being set? E.g. if a query using that function was cancelled
before finishing, would the connection timezone remain as $4? I guess the
3rd parameter to set_config is `is_local` (based on
https://pgpedia.info/s/set_config.html). Does that mean I could run this
outside this context of a function, and expect the setting to go back to
UTC on a rollback? Apologies if these are naive questions.
Thanks again for all the help.
Best,
Lincoln
On Wed, Oct 4, 2023 at 11:09 AM Marian Wendt <marian.wendt@yahoo.com> wrote:
[quoted message trimmed]
--
Lincoln Swaine-Moore
Lincoln Swaine-Moore <lswainemoore@gmail.com> writes:
create function generate_series(timestamptz, timestamptz, interval, text)
returns setof timestamptz
strict immutable language plpgsql as
$$
begin
perform set_config('timezone', $4, true);
return query select generate_series($1, $2, $3);
end
$$ set timezone = 'UTC';
This is a nice suggestion, and in fact, it would be fine from my
perspective to reset to UTC every time. My concern is only around the
safety of the final `set timezone`. Under what circumstances/when can I
count on that being set? E.g. if a query using that function was cancelled
before finishing, would the connection timezone remain as $4?
No. The function call mechanism will ensure that timezone goes back
to its previous state at function exit. (In the case of an error
exit, that's actually handled by the transaction abort logic, but the
result is the same.) Because of that, I think it doesn't really
matter whether the set_config call says "true" or "false", but saying
that it's a local setting seems less confusing.
Does that mean I could run this
outside this context of a function, and expect the setting to go back to
UTC on a rollback?
An actual rollback would undo the effects of set_config, yes. You
only need this function wrapper to ensure that subsequent operations
in the same transaction don't see the setting change.
regards, tom lane
No. The function call mechanism will ensure that timezone goes back
to its previous state at function exit.
An actual rollback would undo the effects of set_config, yes. You
only need this function wrapper to ensure that subsequent operations
in the same transaction don't see the setting change.
Excellent, thank you. So just to be explicit here, I could either run this
function, or set/run my query/set back, with the same behavior/safety
guarantees as if I was using the generate_series function with timezone
from v16?
On Wed, Oct 4, 2023 at 12:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
[quoted message trimmed]
--
Lincoln Swaine-Moore
Lincoln Swaine-Moore <lswainemoore@gmail.com> writes:
Excellent, thank you. So just to be explicit here, I could either run this
function, or set/run my query/set back, with the same behavior/safety
guarantees as if I was using the generate_series function with timezone
from v16?
Yeah. One thing to keep in mind is that that might have different
behavior in terms of the evaluation of the arguments to the function,
ie which timezone setting is your input parsed according to.
regards, tom lane
Yeah. One thing to keep in mind is that that might have different
behavior in terms of the evaluation of the arguments to the function,
ie which timezone setting is your input parsed according to.
I see. You mean, in the event that it doesn't conform to an entry in
`pg_timezone_names`? I do have control over the possible options the user
can provide, so it should be straightforward to make sure those all have
entries.
Thanks,
Lincoln
Lincoln Swaine-Moore <lswainemoore@gmail.com> writes:
Yeah. One thing to keep in mind is that that might have different
behavior in terms of the evaluation of the arguments to the function,
ie which timezone setting is your input parsed according to.
I see. You mean, in the event that it doesn't conform to an entry in
`pg_timezone_names`? I do have control over the possible options the user
can provide, so it should be straightforward to make sure those all have
entries.
If you mean that your input will always include an explicit zone
specification, then this doesn't affect you. What I was thinking
about was that
select generate_series('2023-10-04 13:30', ...)
is going to mean different things depending on the zone setting
that prevails when that constant is parsed.
regards, tom lane
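The same parse-time ambiguity exists in any client language; an illustrative Python sketch (not from the thread) of one wall-clock literal yielding two different instants depending on the assumed zone:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

naive = datetime.fromisoformat("2023-10-04 13:30")  # no zone in the literal
as_utc = naive.replace(tzinfo=ZoneInfo("UTC"))
as_ny = naive.replace(tzinfo=ZoneInfo("America/New_York"))
# same wall-clock reading, but instants four hours apart
```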
If you mean that your input will always include an explicit zone
specification, then this doesn't affect you. What I was thinking
about was that

select generate_series('2023-10-04 13:30', ...)
is going to mean different things depending on the zone setting
that prevails when that constant is parsed.
Gotcha--I thought you meant the timezone argument to the function ($4). I
can make sure that all the datetime arguments to the function are "with
timezone", so there should be no ambiguity. But I guess if I didn't, the
timezone given by $4 would be the one parsing the naive timestamps, which I
think would also be ok from my perspective.
Thanks!
...
Yeah, I'm definitely interested in keeping as much of the DST stuff
outside my code as possible. I guess my concern is that I think there are
other places where this database is being used in my codebase that may rely
on the database time setting being UTC (or really, GMT, though I don't
think there's a difference for these purposes). It would be best if all of
my application's code declared its intentions about the time zone of the
database connection before running its query, but I don't think that's a
feasible change to make right now. That's what's motivating my pursuit of
finding a way to write these queries without changing this setting, through
appropriate casting and such.
...
Really UTC (not that it matters for calculation purposes):
https://en.wikipedia.org/wiki/Coordinated_Universal_Time
As to "other places" in your (or others) code, provided that you are
storing and manipulating your data as timestamp WITH time zone (point in
time), it can be inserted, reported, displayed, calculated, etc. in
whatever time zone any client or piece of code prefers. But when displaying
data around DST changes, fall-back in particular, the offset must be
included in your output to disambiguate things like 01:30:00-07 from
01:30:00-08. Both are 1:30am but are one hour apart.
Cheers,
Steve
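The disambiguation Steve recommends carries over to client-side formatting: render the offset (e.g. with %z) near fall-back. A Python sketch of the 01:30 example (mine, not from the thread):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

la, utc = ZoneInfo("America/Los_Angeles"), ZoneInfo("UTC")
early = datetime(2023, 11, 5, 8, 30, tzinfo=utc).astimezone(la)
late = datetime(2023, 11, 5, 9, 30, tzinfo=utc).astimezone(la)
early.strftime("%H:%M:%S%z")  # '01:30:00-0700'
late.strftime("%H:%M:%S%z")   # '01:30:00-0800'
```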
On 4 Oct 2023, at 17:58, Lincoln Swaine-Moore <lswainemoore@gmail.com> wrote:
[quoted message trimmed]
What I do in such cases is to add an extra column with the UTC timestamp to serve as a linear scale to the local timestamps. That also helps with ordering buckets in reports and such during DST changes (especially the ones where an hour repeats).
Filtering in the queries occurs on the UTC scale, with the local timestamps calculated back to UTC, so that it doesn’t matter whether the local time has 23, 24, 25 or 24.5 or 23.5 or whatever number of hours on a date-range - it all maps back because UTC always has 24 hours.
Something that I also do is to create calendar tables and views for the buckets, with 2 timestamps per bucket: the start of the bucket and the start of the next bucket. That gives you a range to put actual timestamps between (not BETWEEN between, because that’s inclusive). You can store and index that, as opposed to generated results using generate_series - basically I materialise those.
For hours and quarter hours I found it to be fairly convenient to base a view on a join between a date calendar and an (quarter of an) hour per UTC day table, but materialising that with some indexes may perform better (at the cost of disk space). I do materialise that currently, but our database server doesn’t have a lot of memory so I’m often not hitting the cache and performance suffers a bit (infrastructure is about to change for the better though).
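A minimal sketch of the two-timestamps-per-bucket idea, with Python standing in for the materialised calendar table (names and structure here are hypothetical, not Alban's actual schema):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

ny, utc = ZoneInfo("America/New_York"), ZoneInfo("UTC")

def day_buckets(first_local_midnight, n):
    """Build day buckets as [start, next_start) pairs on the UTC timeline,
    walking local midnights so the fall-back bucket is 25 hours wide."""
    buckets, t = [], first_local_midnight
    for _ in range(n):
        nxt = (t.replace(tzinfo=None) + timedelta(days=1)).replace(tzinfo=ny)
        buckets.append((t.astimezone(utc), nxt.astimezone(utc)))
        t = nxt
    return buckets

buckets = day_buckets(datetime(2023, 11, 4, tzinfo=ny), 3)
widths = [e - s for s, e in buckets]  # 24h, 25h, 24h around Nov 5
```

Actual timestamps are then assigned with `start <= ts < next_start`, matching the half-open ranges Alban describes.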
Regards,
Alban Hertroys
--
There is always an exception to always.
Really UTC (not that it matters for calculation purposes)
Sorry, yes--I just meant that literally when I run `show timezone` on the
server in question, I get `GMT`.
As to "other places" in your (or others) code, provided that you are
storing and manipulating your data as timestamp WITH time zone (point in
time)
Unfortunately this is not the case for my codebase/schema. The data is
stored WITHOUT. I glossed over this fact in my post, because they are in
fact UTC times that have had their (0) offset truncated, and prior to my
usage of any solution I can use `at time zone 'UTC'` to properly handle
them. But I don't think it's the case that in the application more
generally this is being handled gracefully, and so the behavior that is
currently (correctly) running is being propped up by the database's time
zone setting. In the long term, I think it might be appropriate to create a
new column that is simply `t at time zone 'UTC'`, and use that everywhere,
but that's not feasible for the task I'm handling right now.
Thanks!
What I do in such cases is to add an extra column with the UTC timestamp
to serve as a linear scale to the local timestamps. That also helps with
ordering buckets in reports and such during DST changes (especially the
ones where an hour repeats).
For hours and quarter hours I found it to be fairly convenient to base a
view on a join between a date calendar and an (quarter of an) hour per UTC
day table, but materialising that with some indexes may perform better (at
the cost of disk space). I do materialise that currently, but our database
server doesn’t have a lot of memory so I’m often not hitting the cache and
performance suffers a bit (infrastructure is about to change for the better
though).
That's an interesting idea, but I'm not sure I fully understand. Assuming
you're aggregating data: what do you group by? For instance, at an hourly
resolution, if you group by both the UTC timestamp and the local one, you
might end up, say, dividing an hour-long bucket in two for time zones with
half-hour-based offsets, no?
Thanks for the detailed writeup! Definitely helpful to learn more about
what people are using in production to handle this sort of thing.
--
Lincoln Swaine-Moore
Slightly off topic, but has anyone tried TimescaleDB for timeseries databases?
The issues discussed here are still there as they apply to the underlying Postgres ORDBMS.
We solve the problem (around 4 billion records of instrument sensor readings) by using UTC for the "native" timestamp, and working in that. Even though we are ½ way around the world. The local times can easily be determined & applied if desired, but by standardising on the reference time zone at the start, things have "just worked", for around 15 years now.
Brent Wood
Principal Technician, Fisheries
NIWA
DDI: +64 (4) 3860529
________________________________
From: Lincoln Swaine-Moore <lswainemoore@gmail.com>
Sent: Thursday, October 5, 2023 08:30
To: Alban Hertroys <haramrae@gmail.com>
Cc: Marian Wendt <marian.wendt@yahoo.com>; pgsql-general <pgsql-general@lists.postgresql.org>
Subject: Re: Strategies for converting UTC data to local windows for arbitrary resolutions and timezones
[quoted message trimmed]
On 4 Oct 2023, at 21:30, Lincoln Swaine-Moore <lswainemoore@gmail.com> wrote:
That's an interesting idea, but I'm not sure I fully understand. Assuming you're aggregating data: what do you group by? For instance, at an hourly resolution, if you group by both the UTC timestamp and the local one, you might end up, say, dividing an hour-long bucket in two for time zones with half-hour-based offsets, no?
Thanks for the detailed writeup! Definitely helpful to learn more about what people are using in production to handle this sort of thing.
Frankly, I haven’t had to deal with half-hour-based offsets since I got this idea. I’m using it with whole-hour-offsets, where it doesn’t affect bin boundaries.
I suppose you could enrich your data in a similar fashion by adding a (virtual) column with the (client) time zone offset, so you could group by local timestamp + offset. That’s not going to match index expressions though, I fear…
For sorting, UTC timestamps would probably still be a useful addition, but they’re simple to add by either converting back from the local timestamps or by taking the min and max of the UTC-based column on the above grouping. Both solutions require that offset, obviously.
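A hypothetical sketch of that local-timestamp-plus-offset grouping (Python, mine, not from the thread), showing the repeated 01:xx hour staying in separate groups:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")
utc = timezone.utc

# sample readings every 30 minutes across the fall-back hour
readings = [datetime(2023, 11, 5, 5, 0, tzinfo=utc) + timedelta(minutes=30 * i)
            for i in range(4)]

groups = defaultdict(list)
for ts in readings:
    local = ts.astimezone(ny)
    # hour-truncated local time plus the offset keeps the two 01:xx hours apart
    key = (local.replace(minute=0, second=0, microsecond=0, tzinfo=None),
           local.utcoffset())
    groups[key].append(ts)
# two groups: (01:00, -04:00) and (01:00, -05:00), two readings each
```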
Now of course there are only 2 hours a year where this happens. Our data scientists chose to ignore the problem for simplicity’s sake and be slightly off with their numbers on those dates.
Regards,
Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.