question about efficiency

Started by Johnson, Shaunnover 23 years ago5 messagesgeneral

Jump to latest

Johnson, Shaunn

SJohnson6@bcbsm.com

over 23 years ago

Howdy:

Running PostgreSQL 7.2.1 on RedHat Linux 7.2.

Have a question about a query and it's efficiency:

One of the users created a query that looks like this:

[snip]
"create table dev_lbpclaimsum as
SELECT
claimsum2001.c_contract_num,
claimsum2001.c_mbr_num,
claimsum2001.c_proc_cd,
claimsum2001.c_proc_mod,
claimsum2001.d_from_dt,
claimsum2001.d_thru_dt,
claimsum2001.i_pd,
claimsum2001.c_serv_prov
FROM
claimsum2001 a join dev_pb_proc b on a.c_proc_cd = substr
(b.c_proc_cd,1,6)

[/snip]

After looking monitoring the query, it seems that the job took over a day
and still never completed. I looked at it and replaced the
'claimsum2001' with 'a' and did an explain on both:

[snip]
[with alias 'a' ]
Nested Loop (cost=0.00..64051744.91 rows=15892944 width=84)
-> Seq Scan on dev_pb_proc b (cost=0.00..20.00 rows=1000 width=10)
-> Index Scan using stateclaim01_proc_cd_i on claimsum2001 a
(cost=0.00..63813.33 rows=15893 width=74)

[/with alias]

[original way]

psql:./jans_stuff.sql:14: NOTICE: Adding missing FROM-clause entry for
table "claimsum2001"
psql:./jans_stuff.sql:14: NOTICE: QUERY PLAN:

Nested Loop (cost=0.00..19629582277130.89 rows=214697805924711 width=95)
-> Nested Loop (cost=0.00..64051744.91 rows=15892944 width=21)
-> Seq Scan on dev_pb_proc b (cost=0.00..20.00 rows=1000 width=10)
-> Index Scan using stateclaim01_proc_cd_i on claimsum2001 a
(cost=0.00..63813.33 rows=15893 width=11)
-> Seq Scan on claimsum2001 (cost=0.00..1100019.02 rows=13509002
width=74)

EXPLAIN
[/original way]

[/snip]

I created the table in the modified way, it took maybe an hour or so (which
is great!)

But, my question is:

Can someone explain why there's a difference between using
claimsum2001.(whatever) and a.(whatever)?

I wouldn't have thought that it would have mattered in efficiency, just been
more convenient.

Thanks!

-X

Stephan Szabo

sszabo@megazone23.bigpanda.com

over 23 years ago

In reply to: Johnson, Shaunn (#1)

Re: question about efficiency

On Mon, 11 Nov 2002, Johnson, Shaunn wrote:

Howdy:

Running PostgreSQL 7.2.1 on RedHat Linux 7.2.

Have a question about a query and it's efficiency:

One of the users created a query that looks like this:

[snip]
"create table dev_lbpclaimsum as
SELECT
claimsum2001.c_contract_num,
claimsum2001.c_mbr_num,
claimsum2001.c_proc_cd,
claimsum2001.c_proc_mod,
claimsum2001.d_from_dt,
claimsum2001.d_thru_dt,
claimsum2001.i_pd,
claimsum2001.c_serv_prov
FROM
claimsum2001 a join dev_pb_proc b on a.c_proc_cd = substr
(b.c_proc_cd,1,6)

This query is (afaik) illegal in plain SQL and for postgres assumes you
want a join with claimsum2001 as claimsum2001. See notice,

psql:./jans_stuff.sql:14: NOTICE: Adding missing FROM-clause entry for
table "claimsum2001"

It's basically converted it into a three way join.

Can someone explain why there's a difference between using
claimsum2001.(whatever) and a.(whatever)?

Once you alias claimsum2001 as a there isn't a table claimsum2001
in the query, but you're asking for columns from the non-existant
table in the select clause. PostgreSQL tries to be helpful by
adding a from entry for the table whereas it'd technically be illegal.

Tom Lane

tgl@sss.pgh.pa.us

over 23 years ago

In reply to: Stephan Szabo (#2)

Re: question about efficiency

Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:

It's basically converted it into a three way join.

More to the point, an *unconstrained* join between "claimsum2001
as claimsum2001" and the intended joined table. The reason it
ran for so long was it was generating a huge number of rows ---
as many as would be in the cross product.

regards, tom lane

Richard Huxton

dev@archonet.com

over 23 years ago

In reply to: Tom Lane (#3)

Re: question about efficiency

Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:

It's basically converted it into a three way join.

More to the point, an *unconstrained* join between "claimsum2001
as claimsum2001" and the intended joined table. The reason it
ran for so long was it was generating a huge number of rows ---
as many as would be in the cross product.

Would it be possible to have some sort of configurable limit on this:

max_result_set_size = 10,000,000
max_result_set_action = ignore | debug | warn | error

The planner knows how many rows it's expecting to return, and a warning
message would show up unconstrained queries like this.

- Richard Huxton

Tom Lane

tgl@sss.pgh.pa.us

over 23 years ago

In reply to: Richard Huxton (#4)

Re: question about efficiency

<dev@archonet.com> writes:

Would it be possible to have some sort of configurable limit on this:

max_result_set_size = 10,000,000
max_result_set_action = ignore | debug | warn | error

The planner knows how many rows it's expecting to return,

Personally, I would never trust the planner's estimate far enough
to use it as the basis for triggering an error ;-)

regards, tom lane