width_bucket function for timestamps

Started by Jeremy Drakeover 19 years ago12 messages
#1Jeremy Drake
pgsql@jdrake.com
1 attachment(s)

I just came across this code I wrote about a year ago which implements a
function equivilant to width_bucket for timestamps.

I wrote this when I was trying to plot some data over time, and I had more
points than I needed. This function allowed me to create a pre-determined
number of "bins" to average the data inside of so that I could get a sane
number of points. Part of the problem was that there were so many data
points, that a sql implementation of the function (or plpgsql, I forget,
it was a year ago) was painfully slow. This C function provided much
better performance than any other means at my disposal.

I wanted to share this code since it may be useful for someone else, but I
don't know exactly what to do with it. So I am putting it out there, and
asking what the proper home for such a function might be. I believe it
would be generally useful for people, but it is so small that it hardly
seems like a reasonable pgFoundry project. Maybe there is a home for such
a thing in the core distribution in a future release?

The code can be found at
http://www.jdrake.com/postgresql/bintimestamp.tar.gz for a buildable PGXS
module, or I attached just the C code. There is no documentation, the
parameters work the same as the width_bucket function. The code is not
necessarily the most readable in the world, I was trying to get as much
speed out of it as possible, since I was calling it over a million times
as a group by value.

Thanks for any pointers...

--
Fortune's Office Door Sign of the Week:

Incorrigible punster -- Do not incorrige.

Attachments:

bintimestamp.ctext/plain; charset=US-ASCII; name=bintimestamp.cDownload
/*****************************************************************************
 * file:        $RCSfile: bintimestamp.c,v $ $Revision: 1.1 $
 * module:      timestamp
 * authors:     jeremyd
 * last mod:    $Author: jeremyd $ at $Date: 2005/10/28 20:26:38 $
 * 
 * created:     Fri Oct 28 13:26:38 PDT 2005
 * 
 *****************************************************************************/

#include <string.h>
#include <math.h>
#include "postgres.h"

#include "fmgr.h"
#include "libpq/pqformat.h"
#include "utils/builtins.h"
#include "funcapi.h"
#include "utils/timestamp.h"

#ifndef JROUND
#	define JROUND(x) (x)
#endif

Datum timestamp_get_bin_size(PG_FUNCTION_ARGS);
Datum timestamp_bin(PG_FUNCTION_ARGS);

PG_FUNCTION_INFO_V1(timestamp_get_bin_size);
Datum
timestamp_get_bin_size(PG_FUNCTION_ARGS)
{
	Timestamp start = PG_GETARG_TIMESTAMP(0);
	Timestamp stop = PG_GETARG_TIMESTAMP(1);
	int32 nbuckets = PG_GETARG_INT32(2);
	Interval * retval = (Interval *)palloc (sizeof(Interval));

	if (!retval)
	{
		ereport(ERROR, (errcode(ERRCODE_OUT_OF_MEMORY), errmsg("insufficient memory for Interval allocation")));
		PG_RETURN_NULL();
	}

	memset (retval, 0, sizeof(Interval));

	retval->time = JROUND ((stop - start) / nbuckets);

	PG_RETURN_INTERVAL_P(retval);
}

PG_FUNCTION_INFO_V1(timestamp_bin);
Datum
timestamp_bin(PG_FUNCTION_ARGS)
{
	/*Timestamp op = PG_GETARG_TIMESTAMP(0);*/
	Timestamp start = PG_GETARG_TIMESTAMP(1);
	/*Timestamp stop = PG_GETARG_TIMESTAMP(2);*/
	Timestamp binsz;
	/*int32 nbuckets = PG_GETARG_INT32(3)*/;

	binsz = (PG_GETARG_TIMESTAMP(2) - start) / PG_GETARG_INT32(3);

	PG_RETURN_TIMESTAMP(JROUND((int)((PG_GETARG_TIMESTAMP(0) - start) / binsz) * binsz + start));
}
#2Jim C. Nasby
jim@nasby.net
In reply to: Jeremy Drake (#1)
Re: width_bucket function for timestamps

Sinte we already have width_bucket, I'd argue this should go in core. If
someone's feeling adventurous, there should probably be a double
precision version as well. Hrm... and maybe text...

Doesn't the backend already have something like this for calculating
histograms?

On Sun, Oct 08, 2006 at 10:30:47PM -0700, Jeremy Drake wrote:

I just came across this code I wrote about a year ago which implements a
function equivilant to width_bucket for timestamps.

I wrote this when I was trying to plot some data over time, and I had more
points than I needed. This function allowed me to create a pre-determined
number of "bins" to average the data inside of so that I could get a sane
number of points. Part of the problem was that there were so many data
points, that a sql implementation of the function (or plpgsql, I forget,
it was a year ago) was painfully slow. This C function provided much
better performance than any other means at my disposal.

I wanted to share this code since it may be useful for someone else, but I
don't know exactly what to do with it. So I am putting it out there, and
asking what the proper home for such a function might be. I believe it
would be generally useful for people, but it is so small that it hardly
seems like a reasonable pgFoundry project. Maybe there is a home for such
a thing in the core distribution in a future release?

The code can be found at
http://www.jdrake.com/postgresql/bintimestamp.tar.gz for a buildable PGXS
module, or I attached just the C code. There is no documentation, the
parameters work the same as the width_bucket function. The code is not
necessarily the most readable in the world, I was trying to get as much
speed out of it as possible, since I was calling it over a million times
as a group by value.

Thanks for any pointers...

--
Fortune's Office Door Sign of the Week:

Incorrigible punster -- Do not incorrige.

/*****************************************************************************
* file: $RCSfile: bintimestamp.c,v $ $Revision: 1.1 $
* module: timestamp
* authors: jeremyd
* last mod: $Author: jeremyd $ at $Date: 2005/10/28 20:26:38 $
*
* created: Fri Oct 28 13:26:38 PDT 2005
*
*****************************************************************************/

#include <string.h>
#include <math.h>
#include "postgres.h"

#include "fmgr.h"
#include "libpq/pqformat.h"
#include "utils/builtins.h"
#include "funcapi.h"
#include "utils/timestamp.h"

#ifndef JROUND
# define JROUND(x) (x)
#endif

Datum timestamp_get_bin_size(PG_FUNCTION_ARGS);
Datum timestamp_bin(PG_FUNCTION_ARGS);

PG_FUNCTION_INFO_V1(timestamp_get_bin_size);
Datum
timestamp_get_bin_size(PG_FUNCTION_ARGS)
{
Timestamp start = PG_GETARG_TIMESTAMP(0);
Timestamp stop = PG_GETARG_TIMESTAMP(1);
int32 nbuckets = PG_GETARG_INT32(2);
Interval * retval = (Interval *)palloc (sizeof(Interval));

if (!retval)
{
ereport(ERROR, (errcode(ERRCODE_OUT_OF_MEMORY), errmsg("insufficient memory for Interval allocation")));
PG_RETURN_NULL();
}

memset (retval, 0, sizeof(Interval));

retval->time = JROUND ((stop - start) / nbuckets);

PG_RETURN_INTERVAL_P(retval);
}

PG_FUNCTION_INFO_V1(timestamp_bin);
Datum
timestamp_bin(PG_FUNCTION_ARGS)
{
/*Timestamp op = PG_GETARG_TIMESTAMP(0);*/
Timestamp start = PG_GETARG_TIMESTAMP(1);
/*Timestamp stop = PG_GETARG_TIMESTAMP(2);*/
Timestamp binsz;
/*int32 nbuckets = PG_GETARG_INT32(3)*/;

binsz = (PG_GETARG_TIMESTAMP(2) - start) / PG_GETARG_INT32(3);

PG_RETURN_TIMESTAMP(JROUND((int)((PG_GETARG_TIMESTAMP(0) - start) / binsz) * binsz + start));
}

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#2)
Re: width_bucket function for timestamps

"Jim C. Nasby" <jim@nasby.net> writes:

Sinte we already have width_bucket, I'd argue this should go in core. If
someone's feeling adventurous, there should probably be a double
precision version as well. Hrm... and maybe text...

It's not clear to me why we have width_bucket operating on numeric and
not float8 --- that seems like an oversight, if not outright
misunderstanding of the type hierarchy. But if we had the float8
version, I think Jeremy's problem would be solved just by applying
the float8 version to "extract(epoch from timestamp)". I don't really
see the use-case for putting N versions of the function in there.

regards, tom lane

#4Jim C. Nasby
jim@nasby.net
In reply to: Tom Lane (#3)
Re: width_bucket function for timestamps

On Mon, Oct 09, 2006 at 12:02:12PM -0400, Tom Lane wrote:

"Jim C. Nasby" <jim@nasby.net> writes:

Sinte we already have width_bucket, I'd argue this should go in core. If
someone's feeling adventurous, there should probably be a double
precision version as well. Hrm... and maybe text...

It's not clear to me why we have width_bucket operating on numeric and
not float8 --- that seems like an oversight, if not outright
misunderstanding of the type hierarchy. But if we had the float8
version, I think Jeremy's problem would be solved just by applying
the float8 version to "extract(epoch from timestamp)". I don't really
see the use-case for putting N versions of the function in there.

Well, it would be nice to have a timestamp version so that users didn't
have to keep typing "extract(epoch from timestamp)"... but yeah, I
suspect that would work fine for timestamps. For intervals I suspect you
could just convert to seconds (if we're going to add timestamps, it
seems like we should add intervals as well).
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#4)
Re: width_bucket function for timestamps

"Jim C. Nasby" <jim@nasby.net> writes:

On Mon, Oct 09, 2006 at 12:02:12PM -0400, Tom Lane wrote:

... I think Jeremy's problem would be solved just by applying
the float8 version to "extract(epoch from timestamp)".

Thinko there ... I meant to type "extract(epoch from interval)".

Well, it would be nice to have a timestamp version so that users didn't
have to keep typing "extract(epoch from timestamp)"... but yeah, I
suspect that would work fine for timestamps. For intervals I suspect you
could just convert to seconds (if we're going to add timestamps, it
seems like we should add intervals as well).

This is exactly the slippery slope I don't care to start down.

regards, tom lane

#6Jim C. Nasby
jim@nasby.net
In reply to: Tom Lane (#5)
Re: width_bucket function for timestamps

On Mon, Oct 09, 2006 at 01:49:37PM -0400, Tom Lane wrote:

"Jim C. Nasby" <jim@nasby.net> writes:

On Mon, Oct 09, 2006 at 12:02:12PM -0400, Tom Lane wrote:

... I think Jeremy's problem would be solved just by applying
the float8 version to "extract(epoch from timestamp)".

Thinko there ... I meant to type "extract(epoch from interval)".

Except that the patch is for timestamp support, not intervals.

Well, it would be nice to have a timestamp version so that users didn't
have to keep typing "extract(epoch from timestamp)"... but yeah, I
suspect that would work fine for timestamps. For intervals I suspect you
could just convert to seconds (if we're going to add timestamps, it
seems like we should add intervals as well).

This is exactly the slippery slope I don't care to start down.

I guess I'm confused as to how this is any different from other
functions where we've provided multiple input arguments, such as the
aggregate functions. It certainly doesn't seem like it'd take a lot of
extra code to support this...
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#6)
Re: width_bucket function for timestamps

"Jim C. Nasby" <jim@nasby.net> writes:

On Mon, Oct 09, 2006 at 01:49:37PM -0400, Tom Lane wrote:

This is exactly the slippery slope I don't care to start down.

I guess I'm confused as to how this is any different from other
functions where we've provided multiple input arguments, such as the
aggregate functions.

The salient reason is that the spec only defines width_bucket for numeric
input arguments, whereas stuff like max/min is defined *by the spec* for
other data types.

Since there's no spec-based argument for allowing width_bucket for other
datatypes, and only an (IMHO) very weak use-case for it, I don't think
we should add the clutter.

regards, tom lane

#8Jeremy Drake
pgsql@jdrake.com
In reply to: Tom Lane (#3)
Re: width_bucket function for timestamps

On Mon, 9 Oct 2006, Tom Lane wrote:

It's not clear to me why we have width_bucket operating on numeric and
not float8 --- that seems like an oversight, if not outright
misunderstanding of the type hierarchy.

Would that make the below a lot faster?

But if we had the float8
version, I think Jeremy's problem would be solved just by applying
the float8 version to "extract(epoch from timestamp)". I don't really
see the use-case for putting N versions of the function in there.

I found the function I used before I implemented the C version. It was
significantly slower, which is why I wrote the C version.

-- given a date range and a number of buckets, round the given date to one
-- of the buckets such that any number of dates within the date range passed
-- in to this function will only return up to the number of buckets unique
-- values
CREATE OR REPLACE FUNCTION date_width_bucket
(tm TIMESTAMP WITHOUT TIME ZONE,
low TIMESTAMP WITHOUT TIME ZONE,
high TIMESTAMP WITHOUT TIME ZONE,
nbuckets INTEGER
) RETURNS TIMESTAMP WITHOUT TIME ZONE AS $$
SELECT ((EXTRACT(epoch FROM $3) - EXTRACT(epoch FROM $2)) / $4) *
(width_bucket(EXTRACT(epoch FROM $1)::NUMERIC,
EXTRACT(epoch FROM $2)::NUMERIC,
EXTRACT(epoch FROM $3)::NUMERIC,
$4)
- 1) * '1 second'::INTERVAL + $2;
$$ LANGUAGE sql IMMUTABLE STRICT;

--
I don't think they could put him in a mental hospital. On the other
hand, if he were already in, I don't think they'd let him out.

#9Jim C. Nasby
jim@nasby.net
In reply to: Tom Lane (#7)
Re: width_bucket function for timestamps

On Mon, Oct 09, 2006 at 03:49:50PM -0400, Tom Lane wrote:

"Jim C. Nasby" <jim@nasby.net> writes:

On Mon, Oct 09, 2006 at 01:49:37PM -0400, Tom Lane wrote:

This is exactly the slippery slope I don't care to start down.

I guess I'm confused as to how this is any different from other
functions where we've provided multiple input arguments, such as the
aggregate functions.

The salient reason is that the spec only defines width_bucket for numeric
input arguments, whereas stuff like max/min is defined *by the spec* for
other data types.

Since there's no spec-based argument for allowing width_bucket for other
datatypes, and only an (IMHO) very weak use-case for it, I don't think
we should add the clutter.

Catalog or code clutter? ISTM that it wouldn't take much extra work at
all to provide this for timestamps or intervals...

In any case, having a faster version that used double certainly seems
like it'd be useful. It'd probably allow the OP to go back to his
original, simple version.
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeremy Drake (#8)
Re: width_bucket function for timestamps

Jeremy Drake <pgsql@jdrake.com> writes:

I found the function I used before I implemented the C version. It was
significantly slower, which is why I wrote the C version.

I would imagine that most of the problem is the NUMERIC arithmetic
that's doing.

regards, tom lane

#11Neil Conway
neilc@samurai.com
In reply to: Tom Lane (#3)
Re: width_bucket function for timestamps

On Mon, 2006-10-09 at 12:02 -0400, Tom Lane wrote:

It's not clear to me why we have width_bucket operating on numeric and
not float8

I asked about this when I originally implemented width_bucket(), I
recall[1]http://archives.postgresql.org/pgsql-patches/2004-04/msg00259.php. At the time, there was scepticism about whether it was even
worth implementing width_bucket(), let alone providing multiple
implementations of it. I'd be happy to provide a float8 implementation
(I may even have one lying around somewhere...)

-Neil

[1]: http://archives.postgresql.org/pgsql-patches/2004-04/msg00259.php

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#11)
Re: width_bucket function for timestamps

Neil Conway <neilc@samurai.com> writes:

On Mon, 2006-10-09 at 12:02 -0400, Tom Lane wrote:

It's not clear to me why we have width_bucket operating on numeric and
not float8

I asked about this when I originally implemented width_bucket(), I
recall[1]. At the time, there was scepticism about whether it was even
worth implementing width_bucket(), let alone providing multiple
implementations of it. I'd be happy to provide a float8 implementation
(I may even have one lying around somewhere...)

It's probably too late for 8.2, unless some other compelling reason to
force an initdb surfaces. But if you can find the code, please stick it
in when 8.3 devel starts. As it stands, one must do an explicit
coercion to numeric to use width_bucket() with floats; which is tedious,
slow, and nowhere required by the spec.

regards, tom lane