truncating timestamps on arbitrary intervals
Hi,
When analyzing time-series data, it's useful to be able to bin
timestamps into equally spaced ranges. date_trunc() is only able to
bin on a specified whole unit. In the attached patch for the March
commitfest, I propose a new function date_trunc_interval(), which can
truncate to arbitrary intervals, e.g.:
select date_trunc_interval('15 minutes', timestamp '2020-02-16
20:48:40'); date_trunc_interval
---------------------
2020-02-16 20:45:00
(1 row)
With this addition, it might be possible to turn the existing
date_trunc() functions into wrappers. I haven't done that here because
it didn't seem practical at this point. For one, the existing
functions have special treatment for weeks, centuries, and millennia.
Note: I've only written the implementation for the type timestamp
without timezone. Adding timezone support would be pretty simple, but
I wanted to get feedback on the basic idea first before making it
complete. I've also written tests and very basic documentation.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v1-datetrunc_interval.patchapplication/octet-stream; name=v1-datetrunc_interval.patchDownload+194-1
On Wed, Feb 26, 2020 at 10:50:19AM +0800, John Naylor wrote:
Hi,
When analyzing time-series data, it's useful to be able to bin
timestamps into equally spaced ranges. date_trunc() is only able to
bin on a specified whole unit.
Thanks for adding this very handy feature!
In the attached patch for the March
commitfest, I propose a new function date_trunc_interval(), which can
truncate to arbitrary intervals, e.g.:select date_trunc_interval('15 minutes', timestamp '2020-02-16
20:48:40'); date_trunc_interval
---------------------
2020-02-16 20:45:00
(1 row)
I believe the following should error out, but doesn't.
# SELECT date_trunc_interval('1 year 1 ms', TIMESTAMP '2001-02-16 20:38:40');
date_trunc_interval
═════════════════════
2001-01-01 00:00:00
(1 row)
With this addition, it might be possible to turn the existing
date_trunc() functions into wrappers. I haven't done that here because
it didn't seem practical at this point. For one, the existing
functions have special treatment for weeks, centuries, and millennia.
I agree that turning it into a wrapper would be separate work.
Note: I've only written the implementation for the type timestamp
without timezone. Adding timezone support would be pretty simple,
but I wanted to get feedback on the basic idea first before making
it complete. I've also written tests and very basic documentation.
Please find attached an update that I believe fixes the bug I found in
a principled way.
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Attachments:
v2-0001-Add-date_trunc_interval-interval-timestamp.patchtext/x-diff; charset=iso-8859-1Download+214-1
On Wed, Feb 26, 2020 at 3:51 PM David Fetter <david@fetter.org> wrote:
I believe the following should error out, but doesn't.
# SELECT date_trunc_interval('1 year 1 ms', TIMESTAMP '2001-02-16 20:38:40');
date_trunc_interval
═════════════════════
2001-01-01 00:00:00
(1 row)
You're quite right. I forgot to add error checking for
second-and-below units. I've added your example to the tests. (I
neglected to mention in my first email that because I chose to convert
the interval to the pg_tm struct (seemed easiest), it's not
straightforward how to allow multiple unit types, and I imagine the
use case is small, so I had it throw an error.)
Please find attached an update that I believe fixes the bug I found in
a principled way.
Thanks for that! I made a couple adjustments and incorporated your fix
into v3: While working on v1, I noticed the DTK_FOO macros already had
an idiom for bitmasking (see utils/datetime.h), so I used that instead
of a bespoke enum. Also, since the bitmask is checked once, I removed
the individual member checks, allowing me to remove all the gotos.
There's another small wrinkle: Since we store microseconds internally,
it's neither convenient nor useful to try to error out for things like
'2 ms 500 us', since that is just as well written as '2500 us', and
stored exactly the same. I'm inclined to just skip the millisecond
check and just use microseconds, but I haven't done that yet.
Also, I noticed this bug in v1:
SELECT date_trunc_interval('17 days', TIMESTAMP '2001-02-16 20:38:40.123456');
date_trunc_interval
---------------------
2001-01-31 00:00:00
(1 row)
This is another consequence of month and day being 1-based. Fixed,
with new tests.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v3-datetrunc_interval.patchapplication/octet-stream; name=v3-datetrunc_interval.patchDownload+211-1
John Naylor <john.naylor@2ndquadrant.com> writes:
[ v3-datetrunc_interval.patch ]
A few thoughts:
* In general, binning involves both an origin and a stride. When
working with plain numbers it's almost always OK to set the origin
to zero, but it's less clear to me whether that's all right for
timestamps. Do we need another optional argument? Even if we
don't, "zero" for tm_year is 1900, which is going to give results
that surprise somebody.
* I'm still not convinced that the code does the right thing for
1-based months or days. Shouldn't you need to subtract 1, then
do the modulus, then add back 1?
* Speaking of modulus, would it be clearer to express the
calculations like
timestamp -= timestamp % interval;
(That's just a question, I'm not sure.)
* Code doesn't look to have thought carefully about what to do with
negative intervals, or BC timestamps.
* The comment
* Justify all lower timestamp units and throw an error if any
* of the lower interval units are non-zero.
doesn't seem to have a lot to do with what the code after it actually
does. Also, you need explicit /* FALLTHRU */-type comments in that
switch, or pickier buildfarm members will complain.
* Seems like you could jam all the unit-related error checking into
that switch's default: case, where it will cost nothing if the
call is valid:
switch (unit)
{
...
default:
if (unit == 0)
// complain about zero interval
else
// complain about interval with multiple components
}
* I'd use ERRCODE_INVALID_PARAMETER_VALUE for any case of disallowed
contents of the interval.
regards, tom lane
On Wed, Feb 26, 2020 at 06:38:57PM +0800, John Naylor wrote:
On Wed, Feb 26, 2020 at 3:51 PM David Fetter <david@fetter.org> wrote:
I believe the following should error out, but doesn't.
# SELECT date_trunc_interval('1 year 1 ms', TIMESTAMP '2001-02-16 20:38:40');
date_trunc_interval
═════════════════════
2001-01-01 00:00:00
(1 row)You're quite right. I forgot to add error checking for
second-and-below units. I've added your example to the tests. (I
neglected to mention in my first email that because I chose to convert
the interval to the pg_tm struct (seemed easiest), it's not
straightforward how to allow multiple unit types, and I imagine the
use case is small, so I had it throw an error.)
I suspect that this could be sanely expanded to span some sets of
adjacent types in a future patch, e.g. year + month or hour + minute.
Please find attached an update that I believe fixes the bug I found in
a principled way.Thanks for that! I made a couple adjustments and incorporated your fix
into v3: While working on v1, I noticed the DTK_FOO macros already had
an idiom for bitmasking (see utils/datetime.h),
Oops! Sorry I missed that.
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, Feb 26, 2020 at 11:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
John Naylor <john.naylor@2ndquadrant.com> writes:
[ v3-datetrunc_interval.patch ]
A few thoughts:
* In general, binning involves both an origin and a stride. When
working with plain numbers it's almost always OK to set the origin
to zero, but it's less clear to me whether that's all right for
timestamps. Do we need another optional argument? Even if we
don't, "zero" for tm_year is 1900, which is going to give results
that surprise somebody.
Not sure.
A surprise I foresee in general might be: '1 week' is just '7 days',
and not aligned on "WOY". Since the function is passed an interval and
not text, we can't raise a warning. But date_trunc() already covers
that, so probably not a big deal.
* I'm still not convinced that the code does the right thing for
1-based months or days. Shouldn't you need to subtract 1, then
do the modulus, then add back 1?
Yes, brain fade on my part. Fixed in the attached v4.
* Speaking of modulus, would it be clearer to express the
calculations like
timestamp -= timestamp % interval;
(That's just a question, I'm not sure.)
Seems nicer, so done that way.
* Code doesn't look to have thought carefully about what to do with
negative intervals, or BC timestamps.
By accident, negative intervals currently behave the same as positive
ones. We could make negative intervals round up rather than truncate
(or vice versa). I don't know the best thing to do here.
As for BC, changed so it goes in the correct direction at least, and added test.
* The comment
* Justify all lower timestamp units and throw an error if any
* of the lower interval units are non-zero.
doesn't seem to have a lot to do with what the code after it actually
does. Also, you need explicit /* FALLTHRU */-type comments in that
switch, or pickier buildfarm members will complain.
Stale comment from an earlier version, fixed. Not sure if "justify" is
the right term, but "zero" is a bit misleading. Added fall thru's.
* Seems like you could jam all the unit-related error checking into
that switch's default: case, where it will cost nothing if the
call is valid:switch (unit)
{
...
default:
if (unit == 0)
// complain about zero interval
else
// complain about interval with multiple components
}
Done.
* I'd use ERRCODE_INVALID_PARAMETER_VALUE for any case of disallowed
contents of the interval.
Done.
Also removed the millisecond case, since it's impossible, or at least
not worth it, to distinguish from the microsecond case.
Note: I haven't done any additional docs changes in v4.
TODO: with timezone
possible TODO: origin parameter
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v4-datetrunc_interval.patchapplication/octet-stream; name=v4-datetrunc_interval.patchDownload+216-1
On Wed, Feb 26, 2020 at 11:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
* In general, binning involves both an origin and a stride. When
working with plain numbers it's almost always OK to set the origin
to zero, but it's less clear to me whether that's all right for
timestamps. Do we need another optional argument? Even if we
don't, "zero" for tm_year is 1900, which is going to give results
that surprise somebody.
I tried the simplest way in the attached v5. Examples (third param is origin):
-- same result as no origin:
select date_trunc_interval('5 min'::interval, TIMESTAMP '2020-02-01
01:01:01', TIMESTAMP '2020-02-01');
date_trunc_interval
---------------------
2020-02-01 01:00:00
(1 row)
-- shift bins by 2.5 min:
select date_trunc_interval('5 min'::interval, TIMESTAMP '2020-02-1
01:01:01', TIMESTAMP '2020-02-01 00:02:30');
date_trunc_interval
---------------------
2020-02-01 00:57:30
(1 row)
-- align weeks to start on Sunday
select date_trunc_interval('7 days'::interval, TIMESTAMP '2020-02-11
01:01:01.0', TIMESTAMP '1900-01-02');
date_trunc_interval
---------------------
2020-02-09 00:00:00
(1 row)
I've put off adding documentation on the origin piece pending comments
about the approach.
I haven't thought seriously about timezone yet, but hopefully it's
just work and nothing to think too hard about.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v5-datetrunc_interval.patchapplication/octet-stream; name=v5-datetrunc_interval.patchDownload+254-1
On Fri, 13 Mar 2020 at 03:13, John Naylor <john.naylor@2ndquadrant.com>
wrote:
On Wed, Feb 26, 2020 at 11:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
* In general, binning involves both an origin and a stride. When
working with plain numbers it's almost always OK to set the origin
to zero, but it's less clear to me whether that's all right for
timestamps. Do we need another optional argument? Even if we
don't, "zero" for tm_year is 1900, which is going to give results
that surprise somebody.
- align weeks to start on Sunday
select date_trunc_interval('7 days'::interval, TIMESTAMP '2020-02-11
01:01:01.0', TIMESTAMP '1900-01-02');
date_trunc_interval
---------------------
2020-02-09 00:00:00
(1 row)
I'm confused by this. If my calendars are correct, both 1900-01-02
and 2020-02-11 are Tuesdays. So if the date being adjusted and the origin
are both Tuesday, shouldn't the day part be left alone when truncating to 7
days? Also, I'd like to confirm that the default starting point for 7 day
periods (weeks) is Monday, per ISO. I know it's very fashionable in North
America to split the weekend in half but it's not the international
standard.
Perhaps the starting point for dates should be either 0001-01-01 (the
proleptic beginning of the CE calendar) or 2001-01-01 (the beginning of the
current 400-year repeating cycle of leap years and weeks, and a Monday,
giving the appropriate ISO result for truncating to 7 day periods).
On Fri, Mar 13, 2020 at 7:48 PM Isaac Morland <isaac.morland@gmail.com> wrote:
On Fri, 13 Mar 2020 at 03:13, John Naylor <john.naylor@2ndquadrant.com> wrote:
- align weeks to start on Sunday
select date_trunc_interval('7 days'::interval, TIMESTAMP '2020-02-11
01:01:01.0', TIMESTAMP '1900-01-02');
date_trunc_interval
---------------------
2020-02-09 00:00:00
(1 row)I'm confused by this. If my calendars are correct, both 1900-01-02 and 2020-02-11 are Tuesdays. So if the date being adjusted and the origin are both Tuesday, shouldn't the day part be left alone when truncating to 7 days?
Thanks for taking a look! The non-intuitive behavior you found is
because the patch shifts the timestamp before converting to the
internal pg_tm type. The pg_tm type stores day of the month, which is
used for the calculation. It's not counting the days since the origin.
Then the result is shifted back.
To get more logical behavior, perhaps the optional parameter is better
as an offset instead of an origin. Alternatively (or additionally),
the function could do the math on int64 timestamps directly.
Also, I'd like to confirm that the default starting point for 7 day periods (weeks) is Monday, per ISO.
That's currently the behavior in the existing date_trunc function,
when passed the string 'week'. Given that keyword, it calculates the
week of the year.
When using the proposed function with arbitrary intervals, it uses day
of the month, as found in the pg_tm struct. It doesn't treat 7 days
differently then 5 or 10 without user input (origin or offset), since
there is nothing special about 7 day intervals as such internally. To
show the difference between date_trunc, and date_trunc_interval as
implemented in v5 with no origin:
select date_trunc('week', d), count(*) from generate_series(
'2020-02-01'::timestamp, '2020-03-31', '1 day') d group by 1 order by
1;
date_trunc | count
---------------------+-------
2020-01-27 00:00:00 | 2
2020-02-03 00:00:00 | 7
2020-02-10 00:00:00 | 7
2020-02-17 00:00:00 | 7
2020-02-24 00:00:00 | 7
2020-03-02 00:00:00 | 7
2020-03-09 00:00:00 | 7
2020-03-16 00:00:00 | 7
2020-03-23 00:00:00 | 7
2020-03-30 00:00:00 | 2
(10 rows)
select date_trunc_interval('7 days'::interval, d), count(*) from
generate_series( '2020-02-01'::timestamp, '2020-03-31', '1 day') d
group by 1 order by 1;
date_trunc_interval | count
---------------------+-------
2020-02-01 00:00:00 | 7
2020-02-08 00:00:00 | 7
2020-02-15 00:00:00 | 7
2020-02-22 00:00:00 | 7
2020-02-29 00:00:00 | 1
2020-03-01 00:00:00 | 7
2020-03-08 00:00:00 | 7
2020-03-15 00:00:00 | 7
2020-03-22 00:00:00 | 7
2020-03-29 00:00:00 | 3
(10 rows)
Resetting the day every month is counterintuitive if not broken, and
as I mentioned it might make more sense to use the int64 timestamp
directly, at least for intervals less than one month. I'll go look
into doing that.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello,
On 3/13/2020 4:13 PM, John Naylor wrote:
I've put off adding documentation on the origin piece pending comments
about the approach.I haven't thought seriously about timezone yet, but hopefully it's
just work and nothing to think too hard about.
Thank you for the patch. I looked it and tested a bit.
There is one interesting case which might be mentioned in the
documentation or in the tests is the following. The function has
interesting behaviour with real numbers:
=# select date_trunc_interval('0.1 year'::interval, TIMESTAMP
'2020-02-01 01:21:01');
date_trunc_interval
---------------------
2020-02-01 00:00:00
=# select date_trunc_interval('1.1 year'::interval, TIMESTAMP
'2020-02-01 01:21:01');
ERROR: only one interval unit allowed for truncation
It is because the second interval has two interval units:
=# select '0.1 year'::interval;
interval
----------
1 mon
=# select '1.1 year'::interval;
interval
--------------
1 year 1 mon
--
Artur
On Sun, Mar 15, 2020 at 2:26 PM I wrote:
To get more logical behavior, perhaps the optional parameter is better
as an offset instead of an origin. Alternatively (or additionally),
the function could do the math on int64 timestamps directly.
For v6, I changed the algorithm to use pg_tm for months and years, and
int64 for all smaller units. Despite the split, I think it's easier to
read now, and certainly shorter. This has the advantage that now
mixing units is allowed, as long as you don't mix months/years with
days or smaller, which often doesn't make sense and is not very
practical. (not yet documented) One consequence of this is that when
operating on months/years, and the origin contains smaller units, the
smaller units are ignored. Example:
select date_trunc_interval('12 months'::interval, timestamp
'2012-03-01 01:21:01', timestamp '2011-03-22');
date_trunc_interval
---------------------
2012-03-01 00:00:00
(1 row)
Even though not quite a full year has passed, it ignores the days in
the origin time and detects a difference in 12 calendar months. That
might be fine, although we could also throw an error and say origins
must be in the form of 'YYYY-01-01 00:00:00' when truncating on months
and/or years.
I added a sketch of documentation for the origin parameter and more tests.
On Fri, Mar 13, 2020 at 7:48 PM Isaac Morland <isaac.morland@gmail.com> wrote:
I'm confused by this. If my calendars are correct, both 1900-01-02 and 2020-02-11 are Tuesdays. So if the date being adjusted and the origin are both Tuesday, shouldn't the day part be left alone when truncating to 7 days? Also, I'd like to confirm that the default starting point for 7 day periods (weeks) is Monday, per ISO.
This is fixed.
select date_trunc_interval('7 days'::interval, timestamp '2020-02-11
01:01:01.0', TIMESTAMP '1900-01-02');
date_trunc_interval
---------------------
2020-02-11 00:00:00
(1 row)
select date_trunc_interval('7 days'::interval, d), count(*) from
generate_series( '2020-02-01'::timestamp, '2020-03-31', '1 day') d
group by 1 order by 1;
date_trunc_interval | count
---------------------+-------
2020-01-27 00:00:00 | 2
2020-02-03 00:00:00 | 7
2020-02-10 00:00:00 | 7
2020-02-17 00:00:00 | 7
2020-02-24 00:00:00 | 7
2020-03-02 00:00:00 | 7
2020-03-09 00:00:00 | 7
2020-03-16 00:00:00 | 7
2020-03-23 00:00:00 | 7
2020-03-30 00:00:00 | 2
(10 rows)
Perhaps the starting point for dates should be either 0001-01-01 (the proleptic beginning of the CE calendar) or 2001-01-01 (the beginning of the current 400-year repeating cycle of leap years and weeks, and a Monday, giving the appropriate ISO result for truncating to 7 day periods).
I went ahead with 2001-01-01 for the time being.
On Thu, Mar 19, 2020 at 4:20 PM Artur Zakirov <zaartur@gmail.com> wrote:
=# select date_trunc_interval('1.1 year'::interval, TIMESTAMP
'2020-02-01 01:21:01');
ERROR: only one interval unit allowed for truncation
For any lingering cases like this (i don't see any), maybe an error
hint is in order. The following works now, as expected for 1 year 1
month:
select date_trunc_interval('1.1 year'::interval, timestamp '2002-05-01
01:21:01');
date_trunc_interval
---------------------
2002-02-01 00:00:0
I'm going to look into implementing timezone while awaiting comments on v6.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v6-datetrunc_interval.patchapplication/octet-stream; name=v6-datetrunc_interval.patchDownload+309-1
Hello John,
this looks like a nice feature. I'm wondering how it relates to the
following use-case:
When drawing charts, the user can select pre-defined widths on times
(like "15 min", "1 hour").
The data for these slots is fitted always to intervalls that start in 0
in the slot, e.g. if the user selects "15 min", the interval always
starts at xx:00, xx:15, xx:30 or xx:45. This is to aid caching of the
resulting data, and so that requesting the same chart at xx:00 and xx:01
actually draws the same chart, and not some bar with only one minute
data at at the end.
In PSQL, this works out to using this as GROUP BY and then summing up
all values:
SELECT FLOOR(EXTRACT(EPOCH from thetime) / 3600) * 3600, SUM(events)
FROM mytable ... GROUP BY 1;
whereas here 3600 means "hourly".
It is of course easy for things like "1 hour", but for yearly I had to
come up with things like:
EXTRAC(YEAR FROM thetime) * 12 + EXTRACT(MONTH FROM thetime)
and its gets even more involved for weeks, weekdays or odd things like
fortnights.
So would that mean one could replace this by:
date_trunc_interval('3600 seconds'::interval, thetime)
and it would be easier, and (hopefully) faster?
Best regards,
Tels
Import Notes
Resolved by subject fallback
On Tue, Mar 24, 2020 at 9:34 PM Tels <nospam-pg-abuse@bloodgate.com> wrote:
Hello John,
this looks like a nice feature. I'm wondering how it relates to the
following use-case:When drawing charts, the user can select pre-defined widths on times
(like "15 min", "1 hour").The data for these slots is fitted always to intervalls that start in 0
in the slot, e.g. if the user selects "15 min", the interval always
starts at xx:00, xx:15, xx:30 or xx:45. This is to aid caching of the
resulting data, and so that requesting the same chart at xx:00 and xx:01
actually draws the same chart, and not some bar with only one minute
data at at the end.
Hi Tels, thanks for your interest! Sounds like the exact use case I had in mind.
It is of course easy for things like "1 hour", but for yearly I had to
come up with things like:EXTRAC(YEAR FROM thetime) * 12 + EXTRACT(MONTH FROM thetime)
and its gets even more involved for weeks, weekdays or odd things like
fortnights.
To be clear, this particular case was already handled by the existing
date_trunc, but only single units and a few other special cases. I
understand if you have to write code to handle 15 minutes, you need to
use that structure for all cases.
Fortnight would be trivial:
date_trunc_interval('14 days'::interval, thetime [, optional start of
the fortnight])
...but weekdays would still need something extra.
So would that mean one could replace this by:
date_trunc_interval('3600 seconds'::interval, thetime)
and it would be easier, and (hopefully) faster?
Certainly easier and more flexible. It's hard to make guesses about
performance, but with your example above where you have two SQL
function calls plus some expression evaluation, I think a single
function would be faster in many cases.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I wrote:
I'm going to look into implementing timezone while awaiting comments on v6.
I attempted this in the attached v7. There are 4 new functions for
truncating timestamptz on an interval -- with and without origin, and
with and without time zone.
Parts of it are hackish, and need more work, but I think it's in
passable enough shape to get feedback on. The origin parameter logic
was designed with timestamps-without-time-zone in mind, and
retrofitting time zone on top of that was a bit messy. There might be
bugs.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v7-datetrunc_interval.patchapplication/octet-stream; name=v7-datetrunc_interval.patchDownload+657-1
On 3/30/2020 9:30 PM, John Naylor wrote:
I attempted this in the attached v7. There are 4 new functions for
truncating timestamptz on an interval -- with and without origin, and
with and without time zone.
Thank you for new version of the patch.
I'm not sure that I fully understand the 'origin' parameter. Is it valid
to have a value of 'origin' which is greater than a value of 'timestamp'
parameter?
I get some different results in such case:
=# select date_trunc_interval('2 year', timestamp '2020-01-16 20:38:40',
timestamp '2022-01-17 00:00:00');
date_trunc_interval
---------------------
2020-01-01 00:00:00
=# select date_trunc_interval('3 year', timestamp '2020-01-16 20:38:40',
timestamp '2022-01-17 00:00:00');
date_trunc_interval
---------------------
2022-01-01 00:00:00
So here I'm not sure which result is correct.
It seems that the patch is still in progress, but I have some nitpicking.
+ <entry><literal><function>date_trunc_interval(<type>interval</type>, <type>timestamptz</type>, <type>text</type>)</function></literal></entry> + <entry><type>timestamptz </type></entry>
It seems that 'timestamptz' in both argument and result descriptions
should be replaced by 'timestamp with time zone' (see other functions
descriptions). Though it is okay to use 'timestamptz' in SQL examples.
timestamp_trunc_interval_internal() and
timestamptz_trunc_interval_internal() have similar code. I think they
can be rewritten to avoid code duplication.
--
Artur
On Tue, Mar 31, 2020 at 4:34 PM Artur Zakirov <zaartur@gmail.com> wrote:
Thank you for new version of the patch.
Thanks for taking a look! Attached is v8, which addresses your points,
adds tests and fixes some bugs. There are still some WIP detritus in
the timezone code, so I'm not claiming it's ready, but it's much
closer. I'm fairly confident in the implementation of timestamp
without time zone, however.
I'm not sure that I fully understand the 'origin' parameter. Is it valid
to have a value of 'origin' which is greater than a value of 'timestamp'
parameter?
That is the intention. The returned values should be
origin +/- (n * interval)
where n is an integer.
I get some different results in such case:
=# select date_trunc_interval('2 year', timestamp '2020-01-16 20:38:40',
timestamp '2022-01-17 00:00:00');
date_trunc_interval
---------------------
2020-01-01 00:00:00
This was correct per how I coded it, but I have rethought where to
draw the bins for user-specified origins. I have decided that the
above is inconsistent with units smaller than a month. We shouldn't
"cross" the bin until the input has reached Jan. 17, in this case. In
v8, the answer to the above is
date_trunc_interval
---------------------
2018-01-17 00:00:00
(1 row)
(This could probably be better documented)
=# select date_trunc_interval('3 year', timestamp '2020-01-16 20:38:40',
timestamp '2022-01-17 00:00:00');
date_trunc_interval
---------------------
2022-01-01 00:00:00So here I'm not sure which result is correct.
This one is just plain broken. The result should always be equal or
earlier than the input. In v8 the result is now:
date_trunc_interval
---------------------
2019-01-17 00:00:00
(1 row)
It seems that the patch is still in progress, but I have some nitpicking.
+ <entry><literal><function>date_trunc_interval(<type>interval</type>, <type>timestamptz</type>, <type>text</type>)</function></literal></entry> + <entry><type>timestamptz </type></entry>It seems that 'timestamptz' in both argument and result descriptions
should be replaced by 'timestamp with time zone' (see other functions
descriptions). Though it is okay to use 'timestamptz' in SQL examples.
Any and all nitpicks welcome! I have made these match the existing
date_trunc documentation more closely.
timestamp_trunc_interval_internal() and
timestamptz_trunc_interval_internal() have similar code. I think they
can be rewritten to avoid code duplication.
I thought so too (and noticed the same about the existing date_trunc),
but it's more difficult than it looks.
Note: I copied some tests from timestamp to timestamptz with a few
tweaks. A few tz tests still don't pass. I'm not yet sure if the
problem is in the test, or my code. Some detailed review of the tests
and their results would be helpful.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v8-datetrunc_interval.patchapplication/octet-stream; name=v8-datetrunc_interval.patchDownload+1108-1
In v9, I've simplified the patch somewhat to make it easier for future
work to build on.
- When truncating on month-or-greater intervals, require the origin to
align on month. This removes the need to handle weird corner cases
that have no straightforward behavior.
- Remove hackish and possibly broken code to allow origin to be after
the input timestamp. The default origin is Jan 1, 1 AD, so only AD
dates will behave correctly by default. This is not enforced for now,
since it may be desirable to find a way to get this to work in a nicer
way.
- Rebase docs over PG13 formatting changes.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v9-datetrunc-interval.patchapplication/octet-stream; name=v9-datetrunc-interval.patchDownload+780-1
On 2020-06-30 06:34, John Naylor wrote:
In v9, I've simplified the patch somewhat to make it easier for future
work to build on.- When truncating on month-or-greater intervals, require the origin to
align on month. This removes the need to handle weird corner cases
that have no straightforward behavior.
- Remove hackish and possibly broken code to allow origin to be after
the input timestamp. The default origin is Jan 1, 1 AD, so only AD
dates will behave correctly by default. This is not enforced for now,
since it may be desirable to find a way to get this to work in a nicer
way.
- Rebase docs over PG13 formatting changes.
This looks pretty solid now. Are there any more corner cases and other
areas with unclear behavior that you are aware of?
A couple of thoughts:
- After reading the discussion a few times, I'm not so sure anymore
whether making this a cousin of date_trunc is the right way to go. As
you mentioned, there are some behaviors specific to date_trunc that
don't really make sense in date_trunc_interval, and maybe we'll have
more of those. Also, date_trunc_interval isn't exactly a handy name.
Maybe something to think about. It's obviously fairly straightforward
to change it.
- There were various issues with the stride interval having months and
years. I'm not sure we even need that. It could be omitted unless you
are confident that your implementation is now sufficient.
- Also, negative intervals could be prohibited, but I suppose that
matters less.
- I'm curious about the origin being set to 0001-01-01. This seems to
work correctly in that it sets the origin to a Monday, which is what we
wanted, but according to Google that day was a Saturday. Something to
do with Julian vs. Gregorian calendar? Maybe we should choose a date
that is a bit more recent and easier to reason with.
- Then again, I'm thinking that maybe we should make the origin
mandatory. Otherwise, the default answers when having strides larger
than a day are entirely arbitrary, e.g.,
=> select date_trunc_interval('10 year', '0196-05-20 BC'::timestamp);
0190-01-01 00:00:00 BC
=> select date_trunc_interval('10 year', '0196-05-20 AD'::timestamp);
0191-01-01 00:00:00
Perhaps the origin could be defaulted if the interval is less than a day
or something like that.
- I'm wondering whether we need the date_trunc_interval(interval,
timestamptz, timezone) variant. Isn't that the same as
date_trunc_interval(foo AT ZONE xyz, value)?
On Thu, Nov 12, 2020 at 9:56 AM Peter Eisentraut <
peter.eisentraut@enterprisedb.com> wrote:
On 2020-06-30 06:34, John Naylor wrote:
In v9, I've simplified the patch somewhat to make it easier for future
work to build on.- When truncating on month-or-greater intervals, require the origin to
align on month. This removes the need to handle weird corner cases
that have no straightforward behavior.
- Remove hackish and possibly broken code to allow origin to be after
the input timestamp. The default origin is Jan 1, 1 AD, so only AD
dates will behave correctly by default. This is not enforced for now,
since it may be desirable to find a way to get this to work in a nicer
way.
- Rebase docs over PG13 formatting changes.This looks pretty solid now. Are there any more corner cases and other
areas with unclear behavior that you are aware of?
Hi Peter,
Thanks for taking a look!
I believe there are no known corner cases aside from not throwing an error
if origin > input, but I'll revisit that when we are more firm on what
features we want support.
A couple of thoughts:
- After reading the discussion a few times, I'm not so sure anymore
whether making this a cousin of date_trunc is the right way to go. As
you mentioned, there are some behaviors specific to date_trunc that
don't really make sense in date_trunc_interval, and maybe we'll have
more of those.
As far as the behaviors, I'm not sure exactly what you what you were
thinking of, but here are two issues off the top of my head:
- If the new functions are considered variants of date_trunc(), there is
the expectation that the options work the same way, in particular the
timezone parameter. You asked specifically about that below, so I'll
address that separately.
- In the "week" case, the boundary position depends on the origin, since a
week-long interval is just 7 days.
Also, date_trunc_interval isn't exactly a handy name.
Maybe something to think about. It's obviously fairly straightforward
to change it.
Effectively, it puts timestamps into bins, so maybe date_bin() or something
like that?
- There were various issues with the stride interval having months and
years. I'm not sure we even need that. It could be omitted unless you
are confident that your implementation is now sufficient.
Months and years were a bit tricky, so I'd be happy to leave that out if
there is not much demand for it. date_trunc() already has quarters,
decades, centuries, and millenia.
- Also, negative intervals could be prohibited, but I suppose that
matters less.
Good for the sake of completeness. I think they happen to work in v9 by
accident, but it would be better not to expose that.
- I'm curious about the origin being set to 0001-01-01. This seems to
work correctly in that it sets the origin to a Monday, which is what we
wanted, but according to Google that day was a Saturday. Something to
do with Julian vs. Gregorian calendar?
Right, working backwards from our calendar today, it's Monday, but at the
time it would theoretically be Saturday, barring leap year miscalculations.
Maybe we should choose a date
that is a bit more recent and easier to reason with.
2001-01-01 would also be a Monday aligned with centuries and millenia, so
that would be my next suggestion. If we don't care to match with
date_trunc() on those larger units, we could also use 1900-01-01, so the
vast majority of dates in databases are after the origin.
- Then again, I'm thinking that maybe we should make the origin
mandatory. Otherwise, the default answers when having strides larger
than a day are entirely arbitrary, e.g.,=> select date_trunc_interval('10 year', '0196-05-20 BC'::timestamp);
0190-01-01 00:00:00 BC=> select date_trunc_interval('10 year', '0196-05-20 AD'::timestamp);
0191-01-01 00:00:00
Right. In the first case, the default origin is also after the input, and
crosses the AD/BC boundary. Tricky to get right.
Perhaps the origin could be defaulted if the interval is less than a day
or something like that.
If we didn't allow months and years to be units, it seems the default would
always make sense?
- I'm wondering whether we need the date_trunc_interval(interval,
timestamptz, timezone) variant. Isn't that the same as
date_trunc_interval(foo AT ZONE xyz, value)?
I based this on 600b04d6b5ef6 for date_trunc(), whose message states:
date_trunc(field, timestamptz, zone_name)
is the same as
date_trunc(field, timestamptz at time zone zone_name) at time zone zone_name
so without the shorthand, you need to specify the timezone twice, once for
the calculation, and once for the output.
--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 23, 2020 at 1:44 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
On Thu, Nov 12, 2020 at 9:56 AM Peter Eisentraut <
peter.eisentraut@enterprisedb.com> wrote:
- After reading the discussion a few times, I'm not so sure anymore
whether making this a cousin of date_trunc is the right way to go. As
you mentioned, there are some behaviors specific to date_trunc that
don't really make sense in date_trunc_interval, and maybe we'll have
more of those.
For v10, I simplified the behavior by decoupling the behavior from
date_trunc() and putting in some restrictions as discussed earlier. It's
much simpler now. It could be argued that it goes too far in that
direction, but it's easy to reason about and we can put back some features
as we see fit.
Also, date_trunc_interval isn't exactly a handy name.
Maybe something to think about. It's obviously fairly straightforward
to change it.Effectively, it puts timestamps into bins, so maybe date_bin() or
something like that?
For v10 I went with date_bin() so we can see how that looks.
- There were various issues with the stride interval having months and
years. I'm not sure we even need that. It could be omitted unless you
are confident that your implementation is now sufficient.Months and years were a bit tricky, so I'd be happy to leave that out if
there is not much demand for it. date_trunc() already has quarters,
decades, centuries, and millenia.
I removed months and years for this version, but that can be reconsidered
of course. The logic is really simple now.
- Also, negative intervals could be prohibited, but I suppose that
matters less.
I didn't go this far, but probably should before long.
- Then again, I'm thinking that maybe we should make the origin
mandatory. Otherwise, the default answers when having strides larger
than a day are entirely arbitrary, e.g.,
I've tried this and like the resulting simplification.
- I'm wondering whether we need the date_trunc_interval(interval,
timestamptz, timezone) variant. Isn't that the same as
date_trunc_interval(foo AT ZONE xyz, value)?I based this on 600b04d6b5ef6 for date_trunc(), whose message states:
date_trunc(field, timestamptz, zone_name)
is the same as
date_trunc(field, timestamptz at time zone zone_name) at time zone
zone_name
so without the shorthand, you need to specify the timezone twice, once
for the calculation, and once for the output.
In light of making the origin mandatory, it no longer makes sense to have a
time zone parameter, since we can specify the time zone on the origin; and
if desired on the output as well.
--
John Naylor
EDB: http://www.enterprisedb.com