Question on a select

Started by Madison Kellyover 21 years ago11 messagesgeneral

linux@alteeve.com

over 21 years ago

Hi all,

This is my first post here so please let me know if I miss any list
guidelines. :)

I was hoping to get some help, advice or pointers to an answer for a
somewhat odd (to me at least) SELECT. What I am trying to do is select
that values from one table where matching values do not exist in another
table.

For example:

Let's say 'table_a' has the columns 'a_name, a_type, a_dir,
a_<others>' and 'table_b' has the columns 'b_name, b_type, b_dir,
b_<others>' where 'others' are columns unique to each table. What I need
to do is select all the values in 'a_name, a_type, a_dir' from 'table_a'
where there is no matching entries in "table_b's" 'b_name, b_type, b_dir'.

I know I could do something like:

SELECT a_name, a_type, a_dir FROM table_a;

and then loop through all the returned values and for each do a
matching select from 'table_b' and use my program to catch the ones not
in 'table_b'. This is not very efficient though and I will be searching
through tables that could have several hundred thousand entries so the
inefficiency would be amplified. Is there some way to use a join or
something similar to do this?

Thank you all!

Madison

Vincent Hikida

vhikida@inreach.com

over 21 years ago

In reply to: Madison Kelly (#1)

Re: Question on a select

There are several ways. I am making the simplifying assumption that name,
type and dir cannot be NULL in either table. If they are the query is a
little more complicated.

The following are a couple of many techniques.

SELECT a.a_name
, a.a_type
, a.a_dir
FROM a_table a
WHERE NOT EXISTS
( SELECT NULL
FROM b_table b
WHERE b.b_name = a.a_name
AND b.b_type = a.a_type
AND b.b_dir = a.a_dir
)

SELECT a.a_name
, a.a_type
, a.a_dir
FROM a_table a
LEFT JOIN b_table b
ON a.a_table = b.b_table
AND a.a_type = b.b_type
AND a.a_dir = b.b_type
WHERE b.b_table IS NULL // assumes that b.b_table
is a not null column.

Let's say that dir could be null and dir is a string, then (assuming that
dir can never be 'xyz') you could say something like

COALESCE(a.a_dir,'xyz') = COALESCE(b.b_dir,'xyz')

Since NULL never equal NULL, if you want NULL in one table to match a NULL
in another table, you need to change it to something not NULL. However this
depends on what you want in your application.

Queries like this are used often to check the integrity of your data.
Examples of this are 1) What orders don't have order items? 2) What books
have no authors? etc.

----- Original Message -----
From: "Madison Kelly" <linux@alteeve.com>
To: "PgSQL General List" <pgsql-general@postgresql.org>
Sent: Saturday, January 01, 2005 7:32 PM
Subject: [GENERAL] Question on a select

Show quoted text

Hi all,

This is my first post here so please let me know if I miss any list
guidelines. :)

I was hoping to get some help, advice or pointers to an answer for a
somewhat odd (to me at least) SELECT. What I am trying to do is select
that values from one table where matching values do not exist in another
table.

For example:

Let's say 'table_a' has the columns 'a_name, a_type, a_dir, a_<others>'
and 'table_b' has the columns 'b_name, b_type, b_dir, b_<others>' where
'others' are columns unique to each table. What I need to do is select all
the values in 'a_name, a_type, a_dir' from 'table_a' where there is no
matching entries in "table_b's" 'b_name, b_type, b_dir'.

I know I could do something like:

SELECT a_name, a_type, a_dir FROM table_a;

and then loop through all the returned values and for each do a matching
select from 'table_b' and use my program to catch the ones not in
'table_b'. This is not very efficient though and I will be searching
through tables that could have several hundred thousand entries so the
inefficiency would be amplified. Is there some way to use a join or
something similar to do this?

Thank you all!

Madison

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Bruno Wolff III

bruno@wolff.to

over 21 years ago

In reply to: Madison Kelly (#1)

Re: Question on a select

On Sat, Jan 01, 2005 at 22:32:17 -0500,
Madison Kelly <linux@alteeve.com> wrote:

Hi all,

This is my first post here so please let me know if I miss any list
guidelines. :)

I was hoping to get some help, advice or pointers to an answer for a
somewhat odd (to me at least) SELECT. What I am trying to do is select
that values from one table where matching values do not exist in another
table.

For example:

Let's say 'table_a' has the columns 'a_name, a_type, a_dir,
a_<others>' and 'table_b' has the columns 'b_name, b_type, b_dir,
b_<others>' where 'others' are columns unique to each table. What I need
to do is select all the values in 'a_name, a_type, a_dir' from 'table_a'
where there is no matching entries in "table_b's" 'b_name, b_type, b_dir'.

SELECT a_name, a_type, a_dir, a_<others> FROM table_a
WHERE a_name, a_type, a_dir NOT IN (
SELECT b_name, b_type, b_dir FROM table_b)
;

In pre 7.4 versions or if there are NULLs in the key columns for table_b
then you probably want to use NOT EXISTS (with a moodified WHERE clause)
instead on NOT IN.

Madison Kelly

linux@alteeve.com

over 21 years ago

In reply to: Vincent Hikida (#2)

Re: Question on a select

Vincent Hikida wrote:

There are several ways. I am making the simplifying assumption that
name, type and dir cannot be NULL in either table. If they are the query
is a little more complicated.

The following are a couple of many techniques.

SELECT a.a_name
, a.a_type
, a.a_dir
FROM a_table a
WHERE NOT EXISTS
( SELECT NULL
FROM b_table b
WHERE b.b_name = a.a_name
AND b.b_type = a.a_type
AND b.b_dir = a.a_dir
)

SELECT a.a_name
, a.a_type
, a.a_dir
FROM a_table a
LEFT JOIN b_table b
ON a.a_table = b.b_table
AND a.a_type = b.b_type
AND a.a_dir = b.b_type
WHERE b.b_table IS NULL // assumes that
b.b_table is a not null column.

Let's say that dir could be null and dir is a string, then (assuming
that dir can never be 'xyz') you could say something like

COALESCE(a.a_dir,'xyz') = COALESCE(b.b_dir,'xyz')

Since NULL never equal NULL, if you want NULL in one table to match a
NULL in another table, you need to change it to something not NULL.
However this depends on what you want in your application.

Queries like this are used often to check the integrity of your data.
Examples of this are 1) What orders don't have order items? 2) What
books have no authors? etc.

This is wonderful! Thank you for responding so quickly! :) I should
mention that I am still very much learning so I apologise in advance if
I miss the obvious. ^.^;

They are all 'not null' and I am trying to do exactly the kind of
task you described. I tried the first example on my DB and got a syntax
error:

tle-bu=> SELECT a.file_name, a.file_parent_dir, a.file_type FROM
file_info_1 a WHERE NOT EXIST (SELECT NULL FROM file_set_1 b WHERE
b.fs_name=a.file_name, b.fs_parent_dir=a.file_parent_dir,
b.fs_type=a.file_type);
ERROR: syntax error at or near "SELECT" at character 88

The second example you gave seems to work perfectly (as I will show
near the bottom of this email). What are the benefits and down sides of
each method? Is there a simple reason why the first method failed
(probably a typo I imagine...)?

A little more specifics about my DB:

'file_info_1' and 'file_set_1' are two tables I use to store
information of files and directories (this is a backup program).

'file_info_#' stores dynamic info like file size, owner and such.
This table is dropped and recreated before a new scan of the partition
creates a mass 'COPY' load (the '_1' indicates the first partition).

'file_set_#' stores static information such as "has the file been
selected for backup" which is why I keep it in a separate table. I want
to run this select first to write entries for newly added files and
directories (the values will match the file's parent) and then again in
reverse to remove from 'file_set_#' entries that no longer exist on the
partition.

If it helps, here is the structure of the tables:

CREATE TABLE file_info_ID (
file_acc_time bigint not null,
file_group_name varchar(255) not null,
file_group_uid int not null,
file_mod_time bigint not null,
file_name varchar(255) not null,
file_parent_dir varchar(255) not null,
file_perm varchar(10) not null,
file_size bigint not null,
file_type varchar(2) not null default 'f',
file_user_name varchar(255) not null,
file_user_uid int not null
);

CREATE TABLE file_set_# (
fs_backup boolean not null default 't',
fs_display boolean not null default 'f',
fs_name varchar(255) not null,
fs_parent_dir varchar(255) not null,
fs_restore boolean not null default 'f',
fs_type varchar(2) not null default 'f'
);

And here is some sample data that I have to work with (yes, it's a
win2k partition... I use it to test other aspects of my program and, if
I blow it away, I won't be upset. ^.^; All of this is being done on a
Fedora Core 3 install in case it makes a difference):

tle-bu=> SELECT file_type, file_parent_dir, file_name FROM file_info_1
WHERE file_parent_dir='/' LIMIT 30;
file_type | file_parent_dir | file_name
-----------+-----------------+------------------------
d | / | .
d | / | downloads
d | / | Documents and Settings
d | / | Program Files
f | / | io.sys
f | / | msdos.sys
f | / | _NavCClt.Log
d | / | WUTemp
d | / | Recycled
f | / | pagefile.sys
d | / | winnt
f | / | ntldr
f | / | ntdetect.com
f | / | boot.ini
f | / | config.sys
f | / | autoexec.bat
f | / | t5r4e3w2q1.exe
f | / | 1q2w3e4r5t.exe
f | / | logon.exe
f | / | arcldr.exe
f | / | arcsetup.exe
(21 rows)

tle-bu=> SELECT fs_type, fs_parent_dir, fs_name FROM file_set_1 WHERE
fs_parent_dir='/' LIMIT 30;
fs_type | fs_parent_dir | fs_name
---------+---------------+------------------------
d | / | .
d | / | downloads
d | / | Documents and Settings
d | / | Program Files
d | / | WUTemp
d | / | Recycled
d | / | winnt
(7 rows)

In this example I deleted manually all the 'f' entries so that when I
do the select I should get:

file_type | file_parent_dir | file_name
-----------+-----------------+------------------------
f | / | io.sys
f | / | msdos.sys
f | / | _NavCClt.Log
f | / | pagefile.sys
f | / | ntldr
f | / | ntdetect.com
f | / | boot.ini
f | / | config.sys
f | / | autoexec.bat
f | / | t5r4e3w2q1.exe
f | / | 1q2w3e4r5t.exe
f | / | logon.exe
f | / | arcldr.exe
f | / | arcsetup.exe

Which is exactly what your second example provides:

tle-bu=> SELECT a.file_name, a.file_parent_dir, a.file_type FROM
file_info_1 a LEFT JOIN file_set_1 b ON a.file_name=b.fs_name AND
a.file_parent_dir=b.fs_parent_dir AND a.file_type=b.fs_type WHERE
b.fs_name IS NULL;
file_name | file_parent_dir | file_type
----------------+-----------------+-----------
1q2w3e4r5t.exe | / | f
arcldr.exe | / | f
arcsetup.exe | / | f
autoexec.bat | / | f
boot.ini | / | f
config.sys | / | f
io.sys | / | f
logon.exe | / | f
msdos.sys | / | f
_NavCClt.Log | / | f
ntdetect.com | / | f
ntldr | / | f
pagefile.sys | / | f
t5r4e3w2q1.exe | / | f
(14 rows)

Thank you very much for your help!

Madison

Madison Kelly

linux@alteeve.com

over 21 years ago

In reply to: Bruno Wolff III (#3)

Re: Question on a select

Bruno Wolff III wrote:

SELECT a_name, a_type, a_dir, a_<others> FROM table_a
WHERE a_name, a_type, a_dir NOT IN (
SELECT b_name, b_type, b_dir FROM table_b)
;

In pre 7.4 versions or if there are NULLs in the key columns for table_b
then you probably want to use NOT EXISTS (with a moodified WHERE clause)
instead on NOT IN.

Hi Bruno,

Thank you for replying! I tried your example but I am getting a
syntax error:

tle-bu=> SELECT file_name, file_parent_dir, file_type FROM file_info_1
WHERE file_name, file_parent_dir, file_type NOT IN (SELECT fs_name,
fs_parent_dir, fs_type FROM file_set_1);
ERROR: syntax error at or near "," at character 78

I just replied to Vincent's post with a lot of detail on what I am
trying to do and how my DB is constructed. His second example worked but
I also had a syntax error on his first example. This program will be
working with very large data sets so I would love to get your method
working so that I could try benchmarking them to see which, in my
application, would be most effective.

Thank you very kindly for helping!

Madison

Bruno Wolff III

bruno@wolff.to

over 21 years ago

In reply to: Madison Kelly (#5)

Re: Question on a select

On Sun, Jan 02, 2005 at 01:58:20 -0500,
Madison Kelly <linux@alteeve.com> wrote:

Bruno Wolff III wrote:

SELECT a_name, a_type, a_dir, a_<others> FROM table_a
WHERE a_name, a_type, a_dir NOT IN (
SELECT b_name, b_type, b_dir FROM table_b)
;

In pre 7.4 versions or if there are NULLs in the key columns for table_b
then you probably want to use NOT EXISTS (with a moodified WHERE clause)
instead on NOT IN.

Hi Bruno,

Thank you for replying! I tried your example but I am getting a
syntax error:

tle-bu=> SELECT file_name, file_parent_dir, file_type FROM file_info_1
WHERE file_name, file_parent_dir, file_type NOT IN (SELECT fs_name,
fs_parent_dir, fs_type FROM file_set_1);
ERROR: syntax error at or near "," at character 78

There should be parenthesis around the list to test.
WHERE a_name, a_type, a_dir NOT IN (
should be
WHERE (a_name, a_type, a_dir) NOT IN (

I just replied to Vincent's post with a lot of detail on what I am
trying to do and how my DB is constructed. His second example worked but
I also had a syntax error on his first example. This program will be
working with very large data sets so I would love to get your method
working so that I could try benchmarking them to see which, in my
application, would be most effective.

I believe that the NOT IN query should run comparably to the LEFT JOIN
example supplied by the other person (at least in recent versions of
Postgres). I would expect this to run faster than using NOT EXISTS.
You probably want to try all 3. The semantics of the three ways of doing
this are not all equivalent if there are NULLs in the data being used
to eliminate rows. As you indicated you don't have NULLs this shouldn't
be a problem.

Another way to write this is using set different (EXCEPT or EXCEPT ALL)
using the key fields and then joining back to table a to pick up the
other fields. However this will almost certianly be slower than the
other methods.

Madison Kelly

linux@alteeve.com

over 21 years ago

In reply to: Bruno Wolff III (#6)

Re: Question on a select

Bruno Wolff III wrote:

There should be parenthesis around the list to test.
WHERE a_name, a_type, a_dir NOT IN (
should be
WHERE (a_name, a_type, a_dir) NOT IN (

That did it (I think)!

I believe that the NOT IN query should run comparably to the LEFT JOIN
example supplied by the other person (at least in recent versions of
Postgres). I would expect this to run faster than using NOT EXISTS.
You probably want to try all 3. The semantics of the three ways of doing
this are not all equivalent if there are NULLs in the data being used
to eliminate rows. As you indicated you don't have NULLs this shouldn't
be a problem.

Another way to write this is using set different (EXCEPT or EXCEPT ALL)
using the key fields and then joining back to table a to pick up the
other fields. However this will almost certianly be slower than the
other methods.

Something odd, now that I have the other method working (I think)...

returns the results in roughly 1 or 2 seconds on a test data set of
15,000 entries. I have an index on both 'file_info_1' covering
'fs_name', 'fs_parent_dir' and 'fs_type' and on 'file_set_1' covering
'file_name', 'file_parent_dir' and 'file_type'. When I try the seconds
method though:

tle-bu=> SELECT file_name, file_parent_dir, file_type FROM file_info_1
WHERE (file_name, file_parent_dir, file_type) NOT IN (SELECT fs_name,
fs_parent_dir, fs_type FROM file_set_1);

It took so long to process that after roughly three minutes I stopped
the query for fear of overheating my laptop (which happend a while back
forcing a thermal shut down).

The indexes are:

CREATE INDEX file_info_#_display_idx ON file_info_# (file_type,
file_parent_dir, file_name);
CREATE INDEX file_set_#_sync_idx ON file_set_# (fs_name, fs_parent_dir,
fs_type)

Are these not effective for the second query? If not, what should I
change or add? If so, would you have any insight into why there is such
an incredible difference in performance?

Thanks very much again!!

Madison

Vincent Hikida

vhikida@inreach.com

over 21 years ago

In reply to: Madison Kelly (#1)

Re: Question on a select

They are all 'not null' and I am trying to do exactly the kind of task
you described. I tried the first example on my DB and got a syntax error:

tle-bu=> SELECT a.file_name, a.file_parent_dir, a.file_type FROM
file_info_1 a WHERE NOT EXIST (SELECT NULL FROM file_set_1 b WHERE
b.fs_name=a.file_name, b.fs_parent_dir=a.file_parent_dir,
b.fs_type=a.file_type);
ERROR: syntax error at or near "SELECT" at character 88

I've quickly read the thread and I don't think you got an answer as to why
you are getting a syntax error here. Your query shows something line

WHERE b.fs_name=a.file_name, b.fs_parent_dir=a.file_parent_dir

you need to put an AND instead of a comma:

WHERE b.fs_name=a.file_name AND b.fs_parent_dir=a.file_parent_dir

As for which of the queries is best I don't know. My background is as an
Oracle developer. I think that Bruno already suggested testing the three
queries. There is a trace utility which shows some of what happens under the
covers of a query. I've used it extensively in Oracle but have never used it
in Postgresql.

If I understand what you said, the NOT IN was significantly slower. That has
been my experience in Oracle long time ago so I've tended to shy away from
that syntax. I'm sure optimizers are much better now then when I
experimented with NOT IN but my coworker who tried it in Oracle was getting
a slower response than with a subselect about a year ago. Theoretically if 3
queries are logically equivalent as the three queries you've been given, an
optimizer should find the same best query plan to execute it. I don't think
that optimizers are that smart yet.

The outer join is probably doing either a sort merge or a hash join. In your
application this should be the best option. (A sort merge sorts both tables
first or at least the key columns and then merges the tables together.)

Bruno said that the subselect would be slower. It may be that he thinks it
will do a nested loop. That is that it will read each row in table A and try
to find that concatenated key in table B's index. I don't think that a
nested loop would be very good in your particular application.

As for the indexes you set up, I think they are correct indexes.

Vincent

Vincent Hikida

vhikida@inreach.com

over 21 years ago

In reply to: Madison Kelly (#1)

Re: Question on a select

The indexes are:

CREATE INDEX file_info_#_display_idx ON file_info_# (file_type,
file_parent_dir, file_name);
CREATE INDEX file_set_#_sync_idx ON file_set_# (fs_name, fs_parent_dir,
fs_type)

Are these not effective for the second query? If not, what should I
change or add? If so, would you have any insight into why there is such an
incredible difference in performance?

I didn't look at your indexes closely enough. When you have concatenated
index, you want to have the most selective colum first. I guess that
file_type is not very selective. file_name is probably the most selective.
In the above, the index on file_set_# is optimal. The index on file_info_#
is suboptimal.

However, if the query is doing a hash join or sort merge, an index is not
used so the index doesn't matter. However, you probably do other queries
that do use the index so it should be fixed.

Vincent

#10

Madison Kelly

linux@alteeve.com

over 21 years ago

In reply to: Vincent Hikida (#9)

Re: Question on a select

Vincent Hikida wrote:

The indexes are:

CREATE INDEX file_info_#_display_idx ON file_info_# (file_type,
file_parent_dir, file_name);
CREATE INDEX file_set_#_sync_idx ON file_set_# (fs_name,
fs_parent_dir, fs_type)

Are these not effective for the second query? If not, what should I
change or add? If so, would you have any insight into why there is
such an incredible difference in performance?

I didn't look at your indexes closely enough. When you have concatenated
index, you want to have the most selective colum first. I guess that
file_type is not very selective. file_name is probably the most
selective. In the above, the index on file_set_# is optimal. The index
on file_info_# is suboptimal.

However, if the query is doing a hash join or sort merge, an index is
not used so the index doesn't matter. However, you probably do other
queries that do use the index so it should be fixed.

Vincent

Thank you, Vincent!

I didn't realize that the order made a difference. A sign of how much
learning I need to do. :p For reference, I think 'file_parent_dir' and
'fs_parent_dir' are the most important because I do an 'ORDER BY
[fs|file]_parent_dir ASC' on most queries. I've made the changes, thank
you again!

Madison

#11

Pierre-Frédéric Caillaud

lists@boutiquenumerique.com

over 21 years ago

In reply to: Madison Kelly (#10)

Re: Question on a select

I didn't realize that the order made a difference. A sign of how much
learning I need to do. :p For reference, I think 'file_parent_dir' and
'fs_parent_dir' are the most important because I do an 'ORDER BY
[fs|file]_parent_dir ASC' on most queries. I've made the changes, thank
you again!

If you SELECT ... WHERE condition on A order by B :

an index on A will be used, but an index on B won't

If you SELECT ... WHERE condition on A order by A, B :

an index on A,B will be used because it will give the rows in already
sorted order