writing a MIN(RECORD) aggregate

Started by Sam Masonabout 18 years ago12 messageshackers
Jump to latest
#1Sam Mason
sam@samason.me.uk

Hi,

I'm trying to write a version of the MIN aggregate for values of RECORD
type. I'm somewhat stuck on getting type information about the argument
out, I can determine how many attributes it's got but I can't seem to do
any better than that. Does anyone have any good pointers into the code
for places to help me understand what's happening?

The reason for doing this is mainly because I think it'd be nicer to be
doing things like:

SELECT i, (MIN((j,k))).k
FROM tbl
GROUP BY i;

instead of:

SELECT DISTINCT ON (i) i, k
FROM tbl
ORDER BY i,j,k;

Which as far as I can tell should produce identical results, except the
first has cleaner semantics. It also allows you to combine MIN and MAX
in the same query, giving the value of k for the smallest and largest j
in this example--requiring two queries if it was done using the DISTINCT
ON method.

Sam

#2Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Sam Mason (#1)
Re: writing a MIN(RECORD) aggregate

On Mar 20, 2008, at 2:23 PM, Sam Mason wrote:

I'm trying to write a version of the MIN aggregate for values of
RECORD
type. I'm somewhat stuck on getting type information about the
argument
out, I can determine how many attributes it's got but I can't seem
to do
any better than that. Does anyone have any good pointers into the
code
for places to help me understand what's happening?

The reason for doing this is mainly because I think it'd be nicer
to be
doing things like:

SELECT i, (MIN((j,k))).k
FROM tbl
GROUP BY i;

How is that any better than SELECT i, min(k) FROM tbl GROUP BY i ?

instead of:

SELECT DISTINCT ON (i) i, k
FROM tbl
ORDER BY i,j,k;

Which as far as I can tell should produce identical results, except
the
first has cleaner semantics. It also allows you to combine MIN and
MAX
in the same query, giving the value of k for the smallest and
largest j
in this example--requiring two queries if it was done using the
DISTINCT
ON method.

I don't see how min(record) even allows for that.

I'm not saying that min/avg/max/etc(RECORD) wouldn't be useful; I'm
just failing to see the use in these examples.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#3Sam Mason
sam@samason.me.uk
In reply to: Jim Nasby (#2)
Re: writing a MIN(RECORD) aggregate

On Mon, Mar 24, 2008 at 05:27:04PM -0500, Decibel! wrote:

On Mar 20, 2008, at 2:23 PM, Sam Mason wrote:

SELECT i, (MIN((j,k))).k
FROM tbl
GROUP BY i;

How is that any better than SELECT i, min(k) FROM tbl GROUP BY i ?

Because I want the value of k associated with the minimum value of j.
For example, if I have data looking like:

i j k
1 3 7
1 4 8
2 5 10
2 6 9

I want to get this out:

i k
1 7
2 10

I would get this if I used the DISTINCT ON or if MIN was valid over
records. With your code I'd get this:

i k
1 7
2 9

I'm not saying that min/avg/max/etc(RECORD) wouldn't be useful;

AVG wouldn't work, because it relies on treating it's parameter as a
numeric field over which summation and division are valid operations.
MIN/MAX just relies on there being a (total) ordering operator available
and with PG there pretty much always is.

I'm just failing to see the use in these examples.

Did the example above make things any clearer?

I've also just realised that PG's current handling of NULLs inside
records is also going to cause problems. The main problem seems to be
that the IS NULL operator isn't consistent with comparison operators.
For example:

(1,NULL) IS NULL --> FALSE
(1,NULL) = (1,NULL) --> NULL

I'm not sure if it's just my intuition is off, or whether there is an
invariant (e.g. a comparison returns NULL if-and-only-if either side
evaluate TRUE to IS NULL) that's being broken.

Thanks,
Sam

#4Bruce Momjian
bruce@momjian.us
In reply to: Sam Mason (#3)
Re: writing a MIN(RECORD) aggregate

"Sam Mason" <sam@samason.me.uk> writes:

On Mon, Mar 24, 2008 at 05:27:04PM -0500, Decibel! wrote:

On Mar 20, 2008, at 2:23 PM, Sam Mason wrote:

SELECT i, (MIN((j,k))).k
FROM tbl
GROUP BY i;

How is that any better than SELECT i, min(k) FROM tbl GROUP BY i ?

Because I want the value of k associated with the minimum value of j.
For example, if I have data looking like:

I have nothing against having min(record) and it does seem like it would let
you do this at least for reasonably simple cases.

But I'm more eager to see full OLAP window functions which would let you do
this and a whole lot else as well.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

#5Sam Mason
sam@samason.me.uk
In reply to: Bruce Momjian (#4)
Re: writing a MIN(RECORD) aggregate

On Mar 25, 2008, at 4:43PM, Gregory Stark wrote:

On Mar 20, 2008, at 2:23 PM, Sam Mason wrote:

SELECT i, (MIN((j,k))).k
FROM tbl
GROUP BY i;

I have nothing against having min(record) and it does seem like it would let
you do this at least for reasonably simple cases.

The main reason for this was that I've needed min(record) a few times
before and thought it would be reasonably easy to code.

But I'm more eager to see full OLAP window functions which would let you do
this and a whole lot else as well.

I've never used window functions before so don't think about them when
solving my problems. If they were available I'd probably start using
them. From the small bit of reading that I've done around them, they
seem very imperative in nature. I'm not sure if this is a good or a bad
thing.

In a database that did support them, how would I write my query with
them? My first naive attempt was this:

SELECT i, MIN(k) OVER (PARTITION BY j)
FROM tbl
GROUP BY i;

This is obviously wrong, but I don't see how to get to where I need to
be.

Sam

#6Bruce Momjian
bruce@momjian.us
In reply to: Sam Mason (#5)
Re: writing a MIN(RECORD) aggregate

"Sam Mason" <sam@samason.me.uk> writes:

SELECT i, MIN(k) OVER (PARTITION BY j)
FROM tbl
GROUP BY i;

This is obviously wrong, but I don't see how to get to where I need to
be.

I'm not entirely sure myself. I think it might involve RANK OVER j though.

I suspect it will look more like the DISTINCT ON solution than the min(record)
solution.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

#7Sam Mason
sam@samason.me.uk
In reply to: Bruce Momjian (#6)
Re: writing a MIN(RECORD) aggregate

On Tue, Mar 25, 2008 at 06:58:06PM +0000, Gregory Stark wrote:

"Sam Mason" <sam@samason.me.uk> writes:

SELECT i, MIN(k) OVER (PARTITION BY j)
FROM tbl
GROUP BY i;

This is obviously wrong, but I don't see how to get to where I need to
be.

I'm not entirely sure myself. I think it might involve RANK OVER j though.

The main thing I wanted to avoid was an explosion of sub-queries that
you get with DISTINCT ON style queries. For example, with record style
syntax, I can do:

SELECT i, (MIN((j,k))).k AS ka, (MIN((mycode(j),k))).k AS kb
FROM tbl
GROUP BY i;

whereas using DISTINCT ON I'd have to do:

SELECT a.i, a.k AS ka, b.k as kb
FROM (
SELECT DISTINCT ON (i) i, k
FROM tbl
ORDER BY i, j) a, (
SELECT DISTINCT ON (i) i, k
FROM tbl
ORDER BY i, mycode(j)) b
WHERE a.i = b.i;

Which gets unmanageable quickly. Any idea how window functions would
cope with this sort of complexity? Or is this what you meant by:

I suspect it will look more like the DISTINCT ON solution than the min(record)
solution.

Thanks,
Sam

#8Bruce Momjian
bruce@momjian.us
In reply to: Sam Mason (#7)
Re: writing a MIN(RECORD) aggregate

"Sam Mason" <sam@samason.me.uk> writes:

On Tue, Mar 25, 2008 at 06:58:06PM +0000, Gregory Stark wrote:
The main thing I wanted to avoid was an explosion of sub-queries that
you get with DISTINCT ON style queries. For example, with record style
syntax, I can do:

SELECT i, (MIN((j,k))).k AS ka, (MIN((mycode(j),k))).k AS kb
FROM tbl
GROUP BY i;

whereas using DISTINCT ON I'd have to do:

...

Which gets unmanageable quickly. Any idea how window functions would
cope with this sort of complexity? Or is this what you meant by:

I suspect it will look more like the DISTINCT ON solution than the min(record)
solution.

The flip side is that if you want to get several fields based on min(j) the
min(record) approach requires you to write that expression several times (and
the database to calculate it several times).

I think the window functions might (assuming an ideal implementation) get the
best of both worlds. You would be able to do something with multiple
partitions so you could ask of a few columns where rank over j = 1 and a few
more columns where rank over k = 1.

But, uh, I'm not sure. I'll have to sit down with the spec and see if that's
true. Furthermore it may be wishful thinking to hope that the implementation
will do anything special with the special case where you're only selecting
records where rank = 1.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

#9Sam Mason
sam@samason.me.uk
In reply to: Bruce Momjian (#8)
Re: writing a MIN(RECORD) aggregate

On Tue, Mar 25, 2008 at 07:54:17PM +0000, Gregory Stark wrote:

"Sam Mason" <sam@samason.me.uk> writes:

SELECT i, (MIN((j,k))).k AS ka, (MIN((mycode(j),k))).k AS kb
FROM tbl
GROUP BY i;

The flip side is that if you want to get several fields based on min(j) the
min(record) approach requires you to write that expression several times (and
the database to calculate it several times).

No. My demos have only used one column because that's the smallest
useful demo.

SELECT i, r.k, r.l
FROM (
SELECT i, MIN((j,k,l)) AS r
FROM tbl
GROUP BY i) x;

The reason for the sub-select is only because SQL doesn't provide any
other way to name expressions. Hum, or at least this should work...
There doesn't seem to be any nice way of getting fields out of a record!

If I really want to do this, it's going to turn into quite an overhaul
of record handling in PG. It would also remove the nice syntactic trick
that a.b identifies the field "b" from table "a", and s.a.b means that
the above is in schema "s".

I think the window functions might (assuming an ideal implementation) get the
best of both worlds. You would be able to do something with multiple
partitions so you could ask of a few columns where rank over j = 1 and a few
more columns where rank over k = 1.

But, uh, I'm not sure. I'll have to sit down with the spec and see if that's
true. Furthermore it may be wishful thinking to hope that the implementation
will do anything special with the special case where you're only selecting
records where rank = 1.

I don't really understand what you're saying above. Optimisation is
another can of worms that shouldn't be opened until we know how this
sort of thing is going to be used.

Sam

#10Bruce Momjian
bruce@momjian.us
In reply to: Sam Mason (#9)
Re: writing a MIN(RECORD) aggregate

"Sam Mason" <sam@samason.me.uk> writes:

The reason for the sub-select is only because SQL doesn't provide any
other way to name expressions. Hum, or at least this should work...
There doesn't seem to be any nice way of getting fields out of a record!

If I really want to do this, it's going to turn into quite an overhaul
of record handling in PG. It would also remove the nice syntactic trick
that a.b identifies the field "b" from table "a", and s.a.b means that
the above is in schema "s".

Yeah, to disambiguate it you have to use (r).i

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning

#11Sam Mason
sam@samason.me.uk
In reply to: Bruce Momjian (#10)
Re: writing a MIN(RECORD) aggregate

On Wed, Mar 26, 2008 at 01:03:18AM +0000, Gregory Stark wrote:

"Sam Mason" <sam@samason.me.uk> writes:

The reason for the sub-select is only because SQL doesn't provide any
other way to name expressions. Hum, or at least this should work...
There doesn't seem to be any nice way of getting fields out of a record!

Yeah, to disambiguate it you have to use (r).i

OK, that sort of makes sense. The next problem is that PG doesn't
remember the column names:

SELECT (ROW(i)).i FROM (SELECT 1) x(i);

Results in PG saying it doesn't know where "i" is inside the row, which
seems a little strange. I think it's this detail that accounts for
my problems in trying to get this all working before. This seems to
suggest that there are two record-like data structures in PG, one for
the records returned as part of the SELECT list and another that I'm
using here.

As a side case, would it be nice if:

SELECT (SELECT 1 AS a, 2 AS b);

resulted in a record with two members?

Sam

#12Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Sam Mason (#3)
Re: writing a MIN(RECORD) aggregate

On Mar 25, 2008, at 11:33 AM, Sam Mason wrote:

On Mon, Mar 24, 2008 at 05:27:04PM -0500, Decibel! wrote:

On Mar 20, 2008, at 2:23 PM, Sam Mason wrote:

SELECT i, (MIN((j,k))).k
FROM tbl
GROUP BY i;

How is that any better than SELECT i, min(k) FROM tbl GROUP BY i ?

Because I want the value of k associated with the minimum value of j.

Ahh, makes sense. FWIW...

SELECT i, (SELECT k FROM ... WHERE i = i.i ORDER BY j LIMIT 1)
FROM (SELECT DISTINCT i FROM ...) i
;

If you needed more than just k, I think there's a way to do that in
the FROM clause, too.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload