Approximate join on timestamps

Started by Phil Endecottabout 19 years ago7 messagesgeneral
Jump to latest
#1Phil Endecott
spam_from_postgresql_general@chezphil.org

Dear Experts,

I have two tables containing chronological data, and I want to join
them using the timestamps. The challenge is that the timestamps only
match approximately.

My first attempt was something like

t1 join t2 on (abs(t1.t-t2.t)<'1 min'::interval)

Of course there is no "abs" for intervals, and I couldn't think of
anything better than this

t1 join t2 on (t1.t-t2.t<'1 min'::interval and t2.t-t1.t<'1 min'::interval)

What indexes could I add to make this moderately efficient?

But that query isn't really good enough. There is no single "epsillon"
value that works for this data set. I really want to find the closest match.

I feel that it ought to be possible to step through the two tables in
timestamp order matching up elements. Is there any way to express this
is SQL?

(One detail is that the left table has fewer rows than the right table,
and I want one output row for each row in the left table.)

Many thanks for any suggestions.

Phil.

(You are welcome to CC: me in any replies.)

#2Rhys A.D. Stewart
rhys.stewart@gmail.com
In reply to: Phil Endecott (#1)
Fwd: Approximate join on timestamps

---------- Forwarded message ----------
From: Rhys Stewart <rhys.stewart@gmail.com>
Date: Mar 20, 2007 6:50 PM
Subject: Re: [GENERAL] Approximate join on timestamps
To: Phil Endecott <spam_from_postgresql_general@chezphil.org>

had a similar problem a while back. so i made and abs_time function:

CREATE OR REPLACE FUNCTION abs_time(interval)
RETURNS interval AS
$BODY$
BEGIN
if
$1 < '00:00:00'::interval
then
return ($1 * -1)::interval;
else
return $1;
END IF;
END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;
ALTER FUNCTION abs_time(interval) OWNER TO postgres;

hopes this gets you somewhere

Show quoted text

On 3/20/07, Phil Endecott <spam_from_postgresql_general@chezphil.org> wrote:

Dear Experts,

I have two tables containing chronological data, and I want to join
them using the timestamps. The challenge is that the timestamps only
match approximately.

My first attempt was something like

t1 join t2 on (abs(t1.t-t2.t)<'1 min'::interval)

Of course there is no "abs" for intervals, and I couldn't think of
anything better than this

t1 join t2 on (t1.t-t2.t<'1 min'::interval and t2.t-t1.t<'1 min'::interval)

What indexes could I add to make this moderately efficient?

But that query isn't really good enough. There is no single "epsillon"
value that works for this data set. I really want to find the closest match.

I feel that it ought to be possible to step through the two tables in
timestamp order matching up elements. Is there any way to express this
is SQL?

(One detail is that the left table has fewer rows than the right table,
and I want one output row for each row in the left table.)

Many thanks for any suggestions.

Phil.

(You are welcome to CC: me in any replies.)

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#3Jorge Godoy
jgodoy@gmail.com
In reply to: Phil Endecott (#1)
Re: Approximate join on timestamps

"Phil Endecott" <spam_from_postgresql_general@chezphil.org> writes:

I have two tables containing chronological data, and I want to join them using
the timestamps. The challenge is that the timestamps only match approximately.

My first attempt was something like

t1 join t2 on (abs(t1.t-t2.t)<'1 min'::interval)

Of course there is no "abs" for intervals, and I couldn't think of anything
better than this

t1 join t2 on (t1.t-t2.t<'1 min'::interval and t2.t-t1.t<'1 min'::interval)

What indexes could I add to make this moderately efficient?

But that query isn't really good enough. There is no single "epsillon" value
that works for this data set. I really want to find the closest match.

I feel that it ought to be possible to step through the two tables in
timestamp order matching up elements. Is there any way to express this is SQL?

(One detail is that the left table has fewer rows than the right table, and I
want one output row for each row in the left table.)

Many thanks for any suggestions.

Untested, but what about something like a function that does (pseudocode below):

select (min(t1.t) > ref_time) as above_t1
select (max(t1.t) < ref_time) as below_t1
if ((above_t1 - below_t1) =< '0 seconds'::interval then
return above_t1
else
return below_t1

to find out the nearest time with regards to t1 when compared to a reference
time that should be the time you're looking for.

Do the same for t2...

I haven't checked the docs if there's something that already makes your life
easier :-)

--
Jorge Godoy <jgodoy@gmail.com>

#4Jorge Godoy
jgodoy@gmail.com
In reply to: Rhys A.D. Stewart (#2)
Re: Fwd: Approximate join on timestamps

"Rhys Stewart" <rhys.stewart@gmail.com> writes:

had a similar problem a while back. so i made and abs_time function:

CREATE OR REPLACE FUNCTION abs_time(interval)
RETURNS interval AS
$BODY$
BEGIN
if
$1 < '00:00:00'::interval
then
return ($1 * -1)::interval;
else
return $1;
END IF;
END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;

I believe that you can declare this IMMUTABLE. For a given interval it will
always return the same value, so you can benefit from some optimization.

http://www.postgresql.org/docs/8.2/interactive/xfunc-volatility.html

An IMMUTABLE function cannot modify the database and is guaranteed to
return the same results given the same arguments forever. This
category allows the optimizer to pre-evaluate the function when a
query calls it with constant arguments. For example, a query like
SELECT ... WHERE x = 2 + 2 can be simplified on sight to SELECT
... WHERE x = 4, because the function underlying the integer addition
operator is marked IMMUTABLE.

--
Jorge Godoy <jgodoy@gmail.com>

#5Brent Wood
b.wood@niwa.co.nz
In reply to: Phil Endecott (#1)
Re: Approximate join on timestamps

Phil Endecott wrote:

Dear Experts,

I have two tables containing chronological data, and I want to join
them using the timestamps. The challenge is that the timestamps only
match approximately.

Hi Phil,

This is how we dealt with a similar situation. It may be suitable for
you....

We have about 200,000,000 records timestamped to facilitate this using a
more or less data warehousing approach.

We generate timestamps at one minute intervals, then assign the
appropriate values (readings) to each timestamp, by using the last
recorded reading before that time (separate record for each instrument)
within an appropriate interval (so missing values are not populated with
historic values). Note that this will discard all records for each
reading except for the last one per interval.

We also add a column ("timer") which has values of 1, 2, 5, 10, 20, 30,
60, 720, 1440 depending on the hour & minutes of the timestamp.

so a "where timer >=60" returns hourly readings, "where timer =720"
gives midday readings, "where timer >=10" gives every 10 minute reading,
etc

This then gets a clustered index on timestamp (and is partitioned on
year) and a 24 way self-relation (ie: 24 instrument readings joined by
timestamp returns 3 months of 10 minute values in 20 odd seconds on a
fastish desktop box.

HTH,

Brent Wood

#6Klint Gore
kg@kgb.une.edu.au
In reply to: Phil Endecott (#1)
Re: Approximate join on timestamps

On Tue, 20 Mar 2007 23:30:46 +0000, "Phil Endecott" <spam_from_postgresql_general@chezphil.org> wrote:

I have two tables containing chronological data, and I want to join
them using the timestamps. The challenge is that the timestamps only
match approximately.

My first attempt was something like

t1 join t2 on (abs(t1.t-t2.t)<'1 min'::interval)

Of course there is no "abs" for intervals, and I couldn't think of
anything better than this

t1 join t2 on (t1.t-t2.t<'1 min'::interval and t2.t-t1.t<'1 min'::interval)

How about using extract(epoch from t) to turn it into a numeric value?

select distinct on (t1.primary_key) *
from t1
join t2 on extract(epoch from t2.t) < extract(epoch from t1.t) + 30
and extract(epoch from t2.t) > extract(epoch from t1.t) - 30
order by t1.something,
abs(extract(epoch from t2.t) - extract(epoch from t1.t));

What indexes could I add to make this moderately efficient?

If t is timestamp without time zone then you might be able to use an
index on it

create index t1_epoch_idx on t1 ((extract(epoch from t)))
create index t2_epoch_idx on t2 ((extract(epoch from t)))

But that query isn't really good enough. There is no single "epsillon"
value that works for this data set. I really want to find the closest match.

see order by. the +/- 30 in the above query can be used for tolerance
on the join.

(One detail is that the left table has fewer rows than the right table,
and I want one output row for each row in the left table.)

see distinct on.

klint.

+---------------------------------------+-----------------+
: Klint Gore                            : "Non rhyming    :
: EMail   : kg@kgb.une.edu.au           :  slang - the    :
: Snail   : A.B.R.I.                    :  possibilities  :
: Mail      University of New England   :  are useless"   :
:           Armidale NSW 2351 Australia :     L.J.J.      :
: Fax     : +61 2 6772 5376             :                 :
+---------------------------------------+-----------------+
#7Alban Hertroys
alban@magproductions.nl
In reply to: Phil Endecott (#1)
Re: Approximate join on timestamps

Phil Endecott wrote:

Dear Experts,

I have two tables containing chronological data, and I want to join them
using the timestamps. The challenge is that the timestamps only match
approximately.

My first attempt was something like

t1 join t2 on (abs(t1.t-t2.t)<'1 min'::interval)

Of course there is no "abs" for intervals, and I couldn't think of
anything better than this

t1 join t2 on (t1.t-t2.t<'1 min'::interval and t2.t-t1.t<'1
min'::interval)

What about:
t1 join t2
on ((t1.t - interval '30s', t1.t + interval '30s') overlaps (t2.t -
interval '30s', t2.t + interval '30s'))

No need for abs(interval) or repeating conditions that way. My first
attempt was using 'between' instead of 'overlaps', but I don't think
that'll work correctly.

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
7500 AK Enschede

// Integrate Your World //