Finding uniques across a big join

Started by John D. Burger · over 20 years ago · 16 messages · general
#1John D. Burger
john@mitre.org

I could use some help with the following:

I have a database of geographic entities with attributes spread across
several tables. I need to determine which entities are unique with
respect to some of those attributes. I'm using the following query:

select p2.gazPlaceID from
(select p1.featureType, n1.placeNameID, c1.containerID
from gazPlaces as p1
join gazNamings as n1 using (gazPlaceID)
join gazContainers as c1 using (gazPlaceID)
group by p1.featureType, n1.placeNameID, c1.containerID
having count(*) = 1) as uniqs,
gazPlaces as p2
join gazNamings as n2 using (gazPlaceID)
join gazContainers as c2 using (gazPlaceID)
where uniqs.featureType = p2.featureType
and uniqs.placeNameID = n2.placeNameID
and uniqs.containerID = c2.containerID;

The basic idea is to compute featureType-placeNameID-containerID
combinations with a three-way join, determine which of those have a
count of 1, and then join that back to the same three-way join to get
the gazPlaceIDs corresponding to the unique combos (whew!).
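A toy-scale sketch of the approach (hypothetical miniature tables, run through SQLite from Python purely for illustration — not the real gazetteer data):

```python
import sqlite3

# Hypothetical miniature version of the gazetteer schema, for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE gazPlaces     (gazPlaceID INTEGER PRIMARY KEY, featureType INTEGER);
    CREATE TABLE gazNamings    (gazPlaceID INTEGER, placeNameID INTEGER);
    CREATE TABLE gazContainers (gazPlaceID INTEGER, containerID INTEGER);
""")
con.executemany("INSERT INTO gazPlaces VALUES (?, ?)",
                [(1, 10), (2, 10), (3, 20)])
con.executemany("INSERT INTO gazNamings VALUES (?, ?)",
                [(1, 100), (2, 100), (3, 300)])
con.executemany("INSERT INTO gazContainers VALUES (?, ?)",
                [(1, 1000), (2, 1000), (3, 3000)])

# Places 1 and 2 share the combination (10, 100, 1000); place 3's is unique.
rows = con.execute("""
    SELECT p2.gazPlaceID
    FROM (SELECT p1.featureType, n1.placeNameID, c1.containerID
          FROM gazPlaces AS p1
          JOIN gazNamings AS n1 USING (gazPlaceID)
          JOIN gazContainers AS c1 USING (gazPlaceID)
          GROUP BY p1.featureType, n1.placeNameID, c1.containerID
          HAVING count(*) = 1) AS uniqs
    JOIN gazPlaces AS p2 ON p2.featureType = uniqs.featureType
    JOIN gazNamings AS n2 ON n2.gazPlaceID = p2.gazPlaceID
                         AND n2.placeNameID = uniqs.placeNameID
    JOIN gazContainers AS c2 ON c2.gazPlaceID = p2.gazPlaceID
                            AND c2.containerID = uniqs.containerID
""").fetchall()
print(rows)  # [(3,)] -- only place 3 has a unique combination
```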

gazPlaces has about 6M entries, gazNamings and gazContainers each about
10M. All of the fields above are integers, and I have indexes on
everything relevant, but the query still takes about eight hours. My
question is not (necessarily) how to improve the efficiency of this
query, but whether anyone can think of a faster way to compute the
uniques. Again, the goal is to find entries in gazPlaces that are the
only ones with their particular combination of feature type, name and
container.

Any help is appreciated!

- John Burger
MITRE

#2Martijn van Oosterhout
kleptog@svana.org
In reply to: John D. Burger (#1)
Re: Finding uniques across a big join

On Tue, Nov 29, 2005 at 09:58:49PM -0500, John D. Burger wrote:

I could use some help with the following:

I have a database of geographic entities with attributes spread across
several tables. I need to determine which entities are unique with
respect to some of those attributes. I'm using the following query:

<snip>

If you put the gazPlaceID as a result of the uniqs subquery, that would
avoid the second lookup, right? Whether it's much faster is the
question. So something like:

select p1.gazPlaceID
from gazPlaces as p1
join gazNamings as n1 using (gazPlaceID)
join gazContainers as c1 using (gazPlaceID)
group by p1.gazPlaceID, p1.featureType, n1.placeNameID, c1.containerID
having count(*) = 1

Secondly, what does the plan look like? Is it materialising or sorting
at any stage?

Finally, what version of postgres?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#3Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: John D. Burger (#1)
Re: Finding uniques across a big join

It will probably be a win to come up with a list of potential records
from each table, instead of after doing the 3-way join. so something
like:

(SELECT gazPlaceID FROM gazPlaces GROUP BY featureType HAVING count(*)=1)
JOIN
(SELECT ...)

If you post the output of explain (or explain analyze is even better)
then people could probably make better suggestions.

On Tue, Nov 29, 2005 at 09:58:49PM -0500, John D. Burger wrote:

I could use some help with the following:

I have a database of geographic entities with attributes spread across
several tables. I need to determine which entities are unique with
respect to some of those attributes. I'm using the following query:

select p2.gazPlaceID from
(select p1.featureType, n1.placeNameID, c1.containerID
from gazPlaces as p1
join gazNamings as n1 using (gazPlaceID)
join gazContainers as c1 using (gazPlaceID)
group by p1.featureType, n1.placeNameID, c1.containerID
having count(*) = 1) as uniqs,
gazPlaces as p2
join gazNamings as n2 using (gazPlaceID)
join gazContainers as c2 using (gazPlaceID)
where uniqs.featureType = p2.featureType
and uniqs.placeNameID = n2.placeNameID
and uniqs.containerID = c2.containerID;

The basic idea is to compute featureType-placeNameID-containerID
combinations with a three-way join, determine which of those have a
count of 1, and then join that back to the same three-way join to get
the gazPlaceIDs corresponding to the unique combos (whew!).

gazPlaces has about 6M entries, gazNamings and gazContainers each about
10M. All of the fields above are integers, and I have indexes on
everything relevant, but the query still takes about eight hours. My
question is not (necessarily) how to improve the efficiency of this
query, but whether anyone can think of a faster way to compute the
uniques. Again, the goal is to find entries in gazPlaces that are the
only ones with their particular combination of feature type, name and
container.

Any help is appreciated!

- John Burger
MITRE

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#4John D. Burger
john@mitre.org
In reply to: Martijn van Oosterhout (#2)
Re: Finding uniques across a big join

On Nov 30, 2005, at 01:55, Martijn van Oosterhout wrote:

On Tue, Nov 29, 2005 at 09:58:49PM -0500, John D. Burger wrote:

I could use some help with the following:

I have a database of geographic entities with attributes spread across
several tables. I need to determine which entities are unique with
respect to some of those attributes. I'm using the following query:

<snip>

If you put the gazPlaceID as a result of the uniqs subquery, that would
avoid the second lookup, right? Whether it's much faster is the
question. So something like:

select p1.gazPlaceID
from gazPlaces as p1
join gazNamings as n1 using (gazPlaceID)
join gazContainers as c1 using (gazPlaceID)
group by p1.gazPlaceID, p1.featureType, n1.placeNameID,
c1.containerID
having count(*) = 1

The problem is that then every row is unique, because gazPlaceID is a
primary key. As far as I can see, I need to group on just the other
three columns - they're the ones for which I'm interested in
uniqueness.

Secondly, what does the plan look like? Is it materialising or sorting
at any stage?
Finally, what version of postgres?

Version is 7.4.8 for Solaris. Below is (a version of) the query again,
as well as the plan. No materialization, I think, but it appears to be
sorting the first three-way join to do the counts, then sorting the
second one to merge. Cost estimates are way off, as the final result
has almost 10M rows, but everything is analyzed, with indexes on every
column, although none of them get used.

Again, any suggestions on tweaking this query, or a completely
different approach to finding the entities with unique combinations,
would be much appreciated.

- John Burger
MITRE

select p2.gazPlaceID, u.*
into table tempCanonical_nameMatchEquiver_3435_1
from
(select p1.featureType, n1.placeNameID, c1.containerID
from gazPlaces as p1
join gazNamings as n1 using (gazPlaceID)
join gazContainers as c1 using (gazPlaceID)
group by p1.featureType, n1.placeNameID, c1.containerID
having count(*) = 1) as u,
gazPlaces as p2
join gazNamings as n2 using (gazPlaceID)
join gazContainers as c2 using (gazPlaceID)
where u.featureType = p2.featureType
and u.placeNameID = n2.placeNameID
and u.containerID = c2.containerID;

Hash Join  (cost=4009535.96..4316595.92 rows=306 width=16)
  Hash Cond: (("outer".gazplaceid = "inner".gazplaceid) AND ("outer".containerid = "inner".containerid))
  ->  Seq Scan on gazcontainers c2  (cost=0.00..141636.45 rows=9193945 width=8)
  ->  Hash  (cost=4006472.81..4006472.81 rows=282029 width=20)
        ->  Merge Join  (cost=3777226.54..4006472.81 rows=282029 width=20)
              Merge Cond: (("outer".featuretype = "inner".featuretype) AND ("outer".placenameid = "inner".placenameid))
              ->  Subquery Scan u  (cost=2107001.20..2259698.67 rows=5552635 width=12)
                    ->  GroupAggregate  (cost=2107001.20..2204172.32 rows=5552635 width=12)
                          Filter: (count(*) = 1)
                          ->  Sort  (cost=2107001.20..2120882.79 rows=5552635 width=12)
                                Sort Key: p1.featuretype, n1.placenameid, c1.containerid
                                ->  Hash Join  (cost=688064.17..1217844.46 rows=5552635 width=12)
                                      Hash Cond: ("outer".gazplaceid = "inner".gazplaceid)
                                      ->  Seq Scan on gaznamings n1  (cost=0.00..156331.05 rows=10147805 width=8)
                                      ->  Hash  (cost=642816.39..642816.39 rows=6128713 width=16)
                                            ->  Hash Join  (cost=160244.91..642816.39 rows=6128713 width=16)
                                                  Hash Cond: ("outer".gazplaceid = "inner".gazplaceid)
                                                  ->  Seq Scan on gazcontainers c1  (cost=0.00..141636.45 rows=9193945 width=8)
                                                  ->  Hash  (cost=120982.13..120982.13 rows=6128713 width=8)
                                                        ->  Seq Scan on gazplaces p1  (cost=0.00..120982.13 rows=6128713 width=8)
              ->  Sort  (cost=1670225.33..1685547.11 rows=6128713 width=16)
                    Sort Key: p2.featuretype, n2.placenameid
                    ->  Hash Join  (cost=160244.91..684040.19 rows=6128713 width=16)
                          Hash Cond: ("outer".gazplaceid = "inner".gazplaceid)
                          ->  Seq Scan on gaznamings n2  (cost=0.00..156331.05 rows=10147805 width=8)
                          ->  Hash  (cost=120982.13..120982.13 rows=6128713 width=8)
                                ->  Seq Scan on gazplaces p2  (cost=0.00..120982.13 rows=6128713 width=8)

#5John D. Burger
john@mitre.org
In reply to: Jim Nasby (#3)
Re: Finding uniques across a big join

Jim C. Nasby wrote:

It will probably be a win to come up with a list of potential records
from each table, instead of after doing the 3-way join. so something
like:

(SELECT gazPlaceID FROM gazPlaces GROUP BY featureType HAVING
count(*)=1)
JOIN
(SELECT ...)

Hmm, not sure I understand. Joining the uniques wrt each of the three
columns does not result in the unique triples, or even a superset of
them, at least in this case.

If you post the output of explain (or explain analyze is even better)
then people could probably make better suggestions.

Okay, I just posted the query plan. I will try to run an EXPLAIN
ANALYZE tonight.

Again, I'm also interested in completely different approaches to
discovering the entities with unique attribute combinations.
Intuitively, the query I posted is doing "too much work", because it's
computing the total count for each combo, when all I really need is to
know if the count is 1 or greater than 1.

Thanks!

- John Burger
MITRE

#6Martijn van Oosterhout
kleptog@svana.org
In reply to: John D. Burger (#4)
Re: Finding uniques across a big join

On Wed, Nov 30, 2005 at 01:20:19PM -0500, John D. Burger wrote:

select p1.gazPlaceID
from gazPlaces as p1
join gazNamings as n1 using (gazPlaceID)
join gazContainers as c1 using (gazPlaceID)
group by p1.gazPlaceID, p1.featureType, n1.placeNameID,
c1.containerID
having count(*) = 1

The problem is that then every row is unique, because gazPlaceID is a
primary key. As far as I can see, I need to group on just the other
three columns - they're the ones for which I'm interested in
uniqueness.

AIUI, according to the JOIN conditions in the query you have
n1.gazPlaceID = c1.gazPlaceID = p1.gazPlaceID so grouping by one of
those shouldn't affect the query result.

Are the tables wide? Maybe you're losing a lot of time transferring
data you don't need. Other than that I can't think of any neat
tricks...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


#7Scott Marlowe
smarlowe@g2switchworks.com
In reply to: John D. Burger (#1)
Re: Finding uniques across a big join

On Tue, 2005-11-29 at 20:58, John D. Burger wrote:

I could use some help with the following:

I have a database of geographic entities with attributes spread across
several tables. I need to determine which entities are unique with
respect to some of those attributes. I'm using the following query:

select p2.gazPlaceID from
(select p1.featureType, n1.placeNameID, c1.containerID
from gazPlaces as p1
join gazNamings as n1 using (gazPlaceID)
join gazContainers as c1 using (gazPlaceID)
group by p1.featureType, n1.placeNameID, c1.containerID
having count(*) = 1) as uniqs,
gazPlaces as p2
join gazNamings as n2 using (gazPlaceID)
join gazContainers as c2 using (gazPlaceID)
where uniqs.featureType = p2.featureType
and uniqs.placeNameID = n2.placeNameID
and uniqs.containerID = c2.containerID;

The basic idea is to compute featureType-placeNameID-containerID
combinations with a three-way join, determine which of those have a
count of 1, and then join that back to the same three-way join to get
the gazPlaceIDs corresponding to the unique combos (whew!).

gazPlaces has about 6M entries, gazNamings and gazContainers each about
10M. All of the fields above are integers, and I have indexes on
everything relevant, but the query still takes about eight hours. My
question is not (necessarily) how to improve the efficiency of this
query, but whether anyone can think of a faster way to compute the
uniques. Again, the goal is to find entries in gazPlaces that are the
only ones with their particular combination of feature type, name and
container.

OK, let's assume that the basic part of it, before the group by, has
been put into a view, so we can then do:

select pkey1, field2, field3, field4 from view;

And we know that pkey1 is unique, but we want the records where pkey1 is
the only thing different between them, right?

select
v1.pkey1,
v1.field2,
v1.field3,
v1.field4,
v2.pkey1,
v2.field2,
v2.field3,
v2.field4
from
view v1
join
view v2
on (
v1.field2=v2.field2 and
v1.field3=v2.field3 and
v1.field4=v2.field4 and
v1.pkey1<>v2.pkey1
)

How does that work?

#8John D. Burger
john@mitre.org
In reply to: Scott Marlowe (#7)
Re: Finding uniques across a big join

Scott Marlowe wrote:

OK, let's assume that the basic part of it, before the group by, has
been put into a view, so we can then do:

select pkey1, field2, field3, field4 from view;

And we know that pkey1 is unique, but we want the records where pkey1
is
the only thing different between them, right?

Hmm, I'm explaining this really badly :). I should have defined a view
like you suggest to help simplify it. What I want is the pkeys (and
the field values) where no other pkey has that triple of field values.
That's why my earlier query does a group by the fields and then having
count(*) = 1. Also, FWIW, pkey1 is unique in its original table, but
not in the view, since some of the other tables are one-to-many.

select
v1.pkey1,
v1.field2,
v1.field3,
v1.field4,
v2.pkey1,
v2.field2,
v2.field3,
v2.field4
from
view v1
join
view v2
on (
v1.field2=v2.field2 and
v1.field3=v2.field3 and
v1.field4=v2.field4 and
v1.pkey1<>v2.pkey1
)

How does that work?

Won't this be a massive cross product of all pkey pairs that have the
same field values?

Here's what I'm currently using, in terms of your very helpful view:

select v1.pkey1, v1.field2, v1.field3, v1.field4
from view as v1
join
(select v2.field2, v2.field3, v2.field4
from view as v2
group by v2.field2, v2.field3, v2.field4
having count(*) = 1)
using (field2, field3, field4);

This is the one that takes eight hours. :( Another way to express what
I want is this:

select v1.pkey1, v1.field2, v1.field3, v1.field4
from view as v1
where not exists
(select true from view as v2
where v1.field2 = v2.field2
and v1.field3 = v2.field3
and v1.field4 = v2.field4
and v1.pkey1 <> v2.pkey1);

That looks like a horrible nested loop, but I suppose I should try it
to make sure it is indeed slower than the previous query.
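As a sanity check, the two formulations do agree on a toy stand-in for the view (hypothetical data, run through SQLite from Python purely for illustration):

```python
import sqlite3

# Hypothetical toy stand-in for the three-way-join view.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE v (pkey1 INTEGER, field2 INTEGER, field3 INTEGER, field4 INTEGER)")
con.executemany("INSERT INTO v VALUES (?, ?, ?, ?)", [
    (1, 7, 8, 9),   # shares its triple with pkey 2
    (2, 7, 8, 9),
    (3, 7, 8, 0),   # unique triple
])

# GROUP BY / HAVING formulation: find the unique triples, then join back.
group_by = con.execute("""
    SELECT v1.pkey1
    FROM v AS v1
    JOIN (SELECT field2, field3, field4
          FROM v
          GROUP BY field2, field3, field4
          HAVING count(*) = 1) AS u
    USING (field2, field3, field4)
""").fetchall()

# NOT EXISTS formulation: no other row has the same triple.
not_exists = con.execute("""
    SELECT v1.pkey1
    FROM v AS v1
    WHERE NOT EXISTS
      (SELECT 1 FROM v AS v2
       WHERE v1.field2 = v2.field2
         AND v1.field3 = v2.field3
         AND v1.field4 = v2.field4
         AND v1.pkey1 <> v2.pkey1)
""").fetchall()

print(group_by, not_exists)  # both: [(3,)]
```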

- John Burger
MITRE

#9Scott Marlowe
smarlowe@g2switchworks.com
In reply to: John D. Burger (#8)
Re: Finding uniques across a big join

On Wed, 2005-11-30 at 16:27, John D. Burger wrote:

Scott Marlowe wrote:

select
v1.pkey1,
v1.field2,
v1.field3,
v1.field4,
v2.pkey1,
v2.field2,
v2.field3,
v2.field4
from
view v1
join
view v2
on (
v1.field2=v2.field2 and
v1.field3=v2.field3 and
v1.field4=v2.field4 and
v1.pkey1<>v2.pkey1
)

How does that work?

Won't this be a massive cross product of all pkey pairs that have the
same field values?

Yes, assuming there are a lot of them. OTOH, if there are only a few
duplicates you're looking for...

How many are you expecting, percentage wise, to get back?

Here's what I'm currently using, in terms of your very helpful view:

select v1.pkey1, v1.field2, v1.field3, v1.field4
from view as v1
join
(select v2.field2, v2.field3, v2.field4
from view as v2
group by v2.field2, v2.field3, v2.field4
having count(*) = 1)
using (field2, field3, field4);

This is the one that takes eight hours. :( Another way to express what
I want is this:

select v1.pkey1, v1.field2, v1.field3, v1.field4
from view as v1
where not exists
(select true from view as v2
where v1.field2 = v2.field2
and v1.field3 = v2.field3
and v1.field4 = v2.field4
and v1.pkey1 <> v2.pkey1);

That looks like a horrible nested loop, but I suppose I should try it
to make sure it is indeed slower than the previous query.

If you can allocate enough shared memory for the set to fit in memory,
you might be able to get a hash agg method, which is much faster than
most other methods for this kind of thing, since it requires no sort.

As a side point, I'm currently mushing 888,000,000 six-character codes up
against each other to check for duplicates. I have 6 machines doing
this, at 1 million codes compared to 1 million codes every 0.5 seconds
aggregate. That gets me down to about 1 week. So, 8 hours is seeming
quite fast. :)

#10Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: John D. Burger (#5)
Re: Finding uniques across a big join

On Wed, Nov 30, 2005 at 01:29:17PM -0500, John D. Burger wrote:

Jim C. Nasby wrote:

It will probably be a win to come up with a list of potential records
from each table, instead of after doing the 3-way join. so something
like:

(SELECT gazPlaceID FROM gazPlaces GROUP BY featureType HAVING
count(*)=1)
JOIN
(SELECT ...)

Hmm, not sure I understand. Joining the uniques wrt each of the three
columns does not result in the unique triples, or even a superset of
them, at least in this case.

So featureType is *unique* within gazPlaces?

If you look at the query plan, it's estimating 120k to scan one of the
tables, but if you look at the hash join that joins the 3rd table to the
first two it's estimating 1.2M. That's because it's dealing with a
monstrous dataset; it's joining every single row with every other
possible row.

What I was trying to do was eliminate rows from that join that would
mean you couldn't have a count of only 1. But my example actually
changes the output, so that won't work.

If you post the output of explain (or explain analyze is even better)
then people could probably make better suggestions.

Okay, I just posted the query plan. I will try to run an EXPLAIN
ANALYZE tonight.

Again, I'm also interested in completely different approaches to
discovering the entities with unique attribute combinations.
Intuitively, the query I posted is doing "too much work", because it's
computing the total count for each combo, when all I really need is to
know if the count is 1 or greater than 1.

The problem I see is that as you add more criteria (which is what each
JOIN does), you reduce the count that you might get for every
combination. Of course adding more rows increases the potential count.
Combine the two and I don't think there's any other way to come up with
the list of what's truly unique.

What I'm not sure of is if knowing certain things about gazPlaceID would
allow you to make some assumptions that could simplify the query. I
can't think of anything that would help there.

Basically, I think you're stuck with the ugly 3-way join that's the core
of the slowness. But, there is a gain to be had: doing the second set of
joins bumps you from 1.2M to over 4M. Someone else suggested adding
gazPlaceID to the GROUP BY; I definitely think you should do that.
Because of the equality conditions I can't conceive of any way that
it would change anything. What *might* make a difference is which table
you pull gazPlaceID from; again, because of the equality condition it
doesn't matter which table you get them from, but I don't know if the
optimizer understands that. So changing which table you pull that from
could drastically affect how the 3-way join is done, which could also
drastically affect the cost involved. Ideally you'd like the first 2-way
join to be the one that results in the smallest amount of overlap.

Also, if you're on 8.1, you might try experimenting with different
indexes on the 3 tables. I think this could be an ideal candidate for a
bitmap scan; I believe it's possible to build the entire set you want
with a scan of 6 bitmaps, but I don't know if there's that much brains
in the bitmap code yet. Actually, I think what would be ideal would be
to build 3 bitmaps for the 3 fields you really want to group by and
compare those; the trick would be building those bitmaps using indexes
of gazPlaceID and the field you care about from each table.

BTW, it might be easier for someone to come up with a solution if you
post a plain-english description of what you're trying to find out,
along with any facts about the data (such as what's unique, etc).
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#11John D. Burger
john@mitre.org
In reply to: Scott Marlowe (#9)
Re: Finding uniques across a big join

Scott Marlowe wrote:

Won't this be a massive cross product of all pkey pairs that have the
same field values?

Yes, assuming there are a lot of them. OTOH, if there are only a few
duplicates you're looking for...

I'm not looking for duplicates, I'm looking for uniques - note the
Subject line :). Here's an analogous problem: everybody in the room
has eye color, hair color, and skin color. I want to find the people
whose particular three-way combination is unique - no one else has that
eye-hair-skin combo. The analogue to my current query is then like
this:

select p1.personID, p1.eyeColor, p1.hairColor, p1.skinColor
from persons as p1
join
(select p2.eyeColor, p2.hairColor, p2.skinColor
from persons as p2
group by p2.eyeColor, p2.hairColor, p2.skinColor
having count(*) = 1)
using (eyeColor, hairColor, skinColor);

The inner select finds the unique combinations, the outer one goes back
and finds the peopleID corresponding to each unique combo. And the
persons table is actually a view on a big three-way join.
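Run on toy data (SQLite from Python, purely illustrative), the analogy behaves as described:

```python
import sqlite3

# Toy persons table for the eye/hair/skin analogy (illustrative only).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE persons
               (personID INTEGER PRIMARY KEY,
                eyeColor TEXT, hairColor TEXT, skinColor TEXT)""")
con.executemany("INSERT INTO persons VALUES (?, ?, ?, ?)", [
    (1, 'blue',  'blond', 'fair'),
    (2, 'blue',  'blond', 'fair'),   # same combo as person 1
    (3, 'brown', 'black', 'dark'),   # combo held by no one else
])

# Inner select finds the unique combos; outer join recovers the personID.
rows = con.execute("""
    SELECT p1.personID, p1.eyeColor, p1.hairColor, p1.skinColor
    FROM persons AS p1
    JOIN (SELECT p2.eyeColor, p2.hairColor, p2.skinColor
          FROM persons AS p2
          GROUP BY p2.eyeColor, p2.hairColor, p2.skinColor
          HAVING count(*) = 1) AS u
    USING (eyeColor, hairColor, skinColor)
""").fetchall()
print(rows)  # [(3, 'brown', 'black', 'dark')]
```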

Jim Nasby wrote:

Someone else suggested adding gazPlaceID to the GROUP BY; I definately
think you should do that.

That changes the semantics of what I want. If I group by personID
above, then every FOUR-way combo is of course unique. What I'd like to
do is group by the three attributes, and select for personID as well.
But of course you can't select for columns you haven't grouped by.

Sorry, I can't think of any other ways to explain what I'm doing. But
thank you for your replies.

- John Burger
MITRE

#12Bruno Wolff III
bruno@wolff.to
In reply to: John D. Burger (#11)
Re: Finding uniques across a big join

On Wed, Nov 30, 2005 at 20:44:30 -0500,
"John D. Burger" <john@mitre.org> wrote:

That changes the semantics of what I want. If I group by personID
above, then every FOUR-way combo is of course unique. What I'd like to
do is group by the three attributes, and select for personID as well.
But of course you can't select for columns you haven't grouped by.

Assuming that personID is an ordered type, you can select max(personID)
in the GROUP BY and save the join at the end.

#13John D. Burger
john@mitre.org
In reply to: Bruno Wolff III (#12)
Re: Finding uniques across a big join

Bruno Wolff III wrote:

That changes the semantics of what I want. If I group by personID
above, then every FOUR-way combo is of course unique. What I'd like
to
do is group by the three attributes, and select for personID as well.
But of course you can't select for columns you haven't grouped by.

Assuming that personID is an ordered type, you can select max(personID)
in the GROUP BY and save the join at the end.

I'm not sure what this means - do you mean:

select p2.eyeColor, p2.hairColor, p2.skinColor
from persons as p2
group by max(p2.personID), p2.eyeColor, p2.hairColor, p2.skinColor
having count(*) = 1;

I don't know what that does. If you mean:

select max(p2.personID), p2.eyeColor, p2.hairColor, p2.skinColor
from persons as p2
group by p2.personID, p2.eyeColor, p2.hairColor, p2.skinColor
having count(*) = 1;

then I don't think that works either - if I include personID in the
GROUP BY, then the COUNT doesn't do what I want, right? I just want
uniques wrt the three attribute fields. If I group by personID, then
personID counts towards uniqueness.

Thanks for all the suggestions, folks.

- John Burger
MITRE

#14Marcin Inkielman
marn@wit.edu.pl
In reply to: John D. Burger (#8)
Re: Finding uniques across a big join

John D. Burger wrote:

select v1.pkey1, v1.field2, v1.field3, v1.field4
from view as v1
join
(select v2.field1, v2.field2, v2.field3
from view as v2
group by v2.field2, v2.field3, v2.field4
having count(*) = 1)
using (field2, field3, field4);

This is the one that takes eight hours. :( Another way to express
what I want is this:

select v1.pkey1, v1.field2, v1.field3, v1.field4
from view as v1
where not exists
(select true from view as v2
where v1.field2 = v2.field2
and v1.field3 = v2.field3
and v1.field4 = v2.field4
and v1.pkey1 <> v2.pkey1);

That looks like a horrible nested loop, but I suppose I should try it
to make sure it is indeed slower than the previous query.

Hi!

Did you try the second query? I guess it should take considerably less
time than the first one. Usually I do "these things" like this...
This is the only way the planner can use indexes. The query plan you
sent us shows that mostly seq scans are used.

Regards,

Marcin Inkielman

#15Bruce Momjian
bruce@momjian.us
In reply to: John D. Burger (#13)
Re: Finding uniques across a big join

"John D. Burger" <john@mitre.org> writes:

I don't know what that does. If you mean:

select max(p2.personID), p2.eyeColor, p2.hairColor, p2.skinColor
from persons as p2
group by p2.personID, p2.eyeColor, p2.hairColor, p2.skinColor
having count(*) = 1;

then I don't think that works either - if I include personID in the GROUP BY,

select max(personid) as personid, eyecolor, haircolor, skincolor
from persons
group by eyecolor, haircolor, skincolor
having count(*) = 1
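On the persons analogy (toy data, SQLite from Python, for illustration), this single-pass form picks out the same unique rows without the join back:

```python
import sqlite3

# Toy persons table (illustrative only).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE persons
               (personID INTEGER PRIMARY KEY,
                eyeColor TEXT, hairColor TEXT, skinColor TEXT)""")
con.executemany("INSERT INTO persons VALUES (?, ?, ?, ?)", [
    (1, 'blue',  'blond', 'fair'),
    (2, 'blue',  'blond', 'fair'),   # duplicated combination
    (3, 'brown', 'black', 'dark'),   # unique combination
])

# One pass: group on the three attributes only; max(personID) is safe
# because each surviving group holds exactly one row (count(*) = 1).
rows = con.execute("""
    SELECT max(personID) AS personID, eyeColor, hairColor, skinColor
    FROM persons
    GROUP BY eyeColor, hairColor, skinColor
    HAVING count(*) = 1
""").fetchall()
print(rows)  # [(3, 'brown', 'black', 'dark')]
```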

--
greg

#16John D. Burger
john@mitre.org
In reply to: Bruce Momjian (#15)
Re: Finding uniques across a big join

Greg Stark wrote:

select max(personid) as personid, eyecolor, haircolor, skincolor
from persons
group by eyecolor, haircolor, skincolor
having count(*) = 1

Aha, I understand Bruno's suggestion now! I was actually trying to
think of some way of using an aggregate on personid, but couldn't
figure it out. Of course max does just what I want in this case, since
max on one value gives that value - min and some other aggregate
functions would work too. Very clever!

On my actual problem, where "persons" is actually three joined tables,
my original query took eight hours. The new query, modeled after the
above, runs in almost exactly a tenth of the time.

Thanks for all the suggestions!

- John D. Burger
MITRE