GEQO optimizer (was Re: Backend message type 0x44 arrived while idle)
I wrote:
Oleg Bartunov <oleg@sai.msu.su> writes:
While testing 6.5 CVS to see what progress Postgres has made in handling
big joins, I got the following error messages:
I think there are still some nasty bugs in the GEQO planner.
I have just committed some changes that fix bugs in the GEQO planner
and limit its memory usage. It should now be possible to use GEQO even
for queries that join a very large number of tables --- at least from
the standpoint of not running out of memory during planning. (It can
still take a while :-(. I think that the default GEQO parameter
settings may be configured to use too many generations, but haven't
poked at this yet.)
I have observed that the regular optimizer requires about 50MB to plan
some ten-way joins, and can exceed my system's 128MB process data limit
on some eleven-way joins. We currently have the GEQO threshold set at
11, which prevents the latter case by default --- but 50MB is a lot.
I wonder whether we shouldn't back the GEQO threshold off to 10.
(When I suggested setting it to 11, I was only looking at speed relative
to GEQO, not memory usage. There is now a *big* difference in memory
usage...) Comments?
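For experimentation in the meantime, the threshold can be overridden per
session rather than recompiled; a minimal sketch, assuming the 6.5-era
SET GEQO syntax (treat the exact values as illustrative):

    -- Enable GEQO only for queries joining 10 or more tables:
    SET GEQO TO 'ON=10';

    -- Or disable the genetic optimizer entirely, so the standard
    -- exhaustive planner is used regardless of join count:
    SET GEQO TO 'OFF';

    -- Restore the compiled-in default:
    RESET GEQO;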
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
I have observed that the regular optimizer requires about 50MB to plan
some ten-way joins, and can exceed my system's 128MB process data limit
on some eleven-way joins. We currently have the GEQO threshold set at
11, which prevents the latter case by default --- but 50MB is a lot.
I wonder whether we shouldn't back the GEQO threshold off to 10.
(When I suggested setting it to 11, I was only looking at speed relative
to GEQO, not memory usage. There is now a *big* difference in memory
usage...) Comments?
You chose 11 by comparing GEQO with non-GEQO. I think you will find
that with your improved GEQO, GEQO is faster at a smaller number of
joins, which would prevent the memory problem. Can you check the speeds
again?
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Sun, 16 May 1999, Bruce Momjian wrote:
You chose 11 by comparing GEQO with non-GEQO. I think you will find
that with your improved GEQO, GEQO is faster at a smaller number of
joins, which would prevent the memory problem. Can you check the speeds
again?
I confirm that a big join with 11 tables no longer eats all memory+swap
on my Linux box as it did before, but it runs *forever* :-). It has
already taken 18 minutes of CPU (P200, 64Mb)! Will wait.
8438 postgres 12 0 11104 3736 2620 R 0 98.6 5.9 18:16 postmaster
This query doesn't (explicitly) use GEQO:
select t0.a, t1.a as t1, t2.a as t2, t3.a as t3, t4.a as t4,
       t5.a as t5, t6.a as t6, t7.a as t7, t8.a as t8, t9.a as t9,
       t10.a as t10
from t0, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10
where t1.a_id = t0.a_t1_id and t2.a_id = t0.a_t2_id
  and t3.a_id = t0.a_t3_id and t4.a_id = t0.a_t4_id
  and t5.a_id = t0.a_t5_id and t6.a_id = t0.a_t6_id
  and t7.a_id = t0.a_t7_id and t8.a_id = t0.a_t8_id
  and t9.a_id = t0.a_t9_id and t10.a_id = t0.a_t10_id;
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Oleg Bartunov <oleg@sai.msu.su> writes:
I confirm that a big join with 11 tables no longer eats all memory+swap
on my Linux box as it did before, but it runs *forever* :-). It has
already taken 18 minutes of CPU (P200, 64Mb)! Will wait.
18 minutes??? It takes barely over a minute on my aging 75MHz HP-PA
box. (Practically all of which is planning time, since there are only
10 tuples to join... or are you doing this on a realistically sized
set of tables now?)
This query doesn't (explicitly) use GEQO:
select t0.a, t1.a as t1, t2.a as t2, t3.a as t3, t4.a as t4,
       t5.a as t5, t6.a as t6, t7.a as t7, t8.a as t8, t9.a as t9,
       t10.a as t10
from t0, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10
where t1.a_id = t0.a_t1_id and t2.a_id = t0.a_t2_id
  and t3.a_id = t0.a_t3_id and t4.a_id = t0.a_t4_id
  and t5.a_id = t0.a_t5_id and t6.a_id = t0.a_t6_id
  and t7.a_id = t0.a_t7_id and t8.a_id = t0.a_t8_id
  and t9.a_id = t0.a_t9_id and t10.a_id = t0.a_t10_id;
No, but since there are 11 tables mentioned, it will be sent to the GEQO
optimizer anyway with the default GEQO threshold of 11...
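A quick way to see the threshold in action, assuming the same SET GEQO
syntax sketched earlier (the numbers are illustrative):

    -- Raise the threshold above 11 so this query stays with the
    -- standard optimizer:
    SET GEQO TO 'ON=12';

    -- Or lower it so GEQO kicks in for even smaller joins:
    SET GEQO TO 'ON=8';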
regards, tom lane
On Mon, 17 May 1999, Tom Lane wrote:
18 minutes??? It takes barely over a minute on my aging 75MHz HP-PA
box. (Practically all of which is planning time, since there are only
10 tuples to join... or are you doing this on a realistically sized
set of tables now?)
Oops,
I found the problem. I modified my test script to add 'vacuum analyze'
after creating the test data, and now it works really fast! Great!
Now I'm wondering why I need vacuum analyze after creating test data
and indices? What's the state of the discussion in hackers?
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Tom,
I was so happy with the problem solved that I decided to play with more
joins :-) The queries were processed very quickly, but at 14 tables the
backend died:
COPY t13 FROM STDIN USING DELIMITERS '|';
vacuum analyze;
VACUUM
select t0.a, t1.a as t1, t2.a as t2, t3.a as t3, t4.a as t4,
       t5.a as t5, t6.a as t6, t7.a as t7, t8.a as t8, t9.a as t9,
       t10.a as t10, t11.a as t11, t12.a as t12, t13.a as t13
from t0, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13
where t1.a_id = t0.a_t1_id and t2.a_id = t0.a_t2_id
  and t3.a_id = t0.a_t3_id and t4.a_id = t0.a_t4_id
  and t5.a_id = t0.a_t5_id and t6.a_id = t0.a_t6_id
  and t7.a_id = t0.a_t7_id and t8.a_id = t0.a_t8_id
  and t9.a_id = t0.a_t9_id and t10.a_id = t0.a_t10_id
  and t11.a_id = t0.a_t11_id and t12.a_id = t0.a_t12_id
  and t13.a_id = t0.a_t13_id;
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally
before or while processing the request.
We have lost the connection to the backend, so further processing is impossible. Terminating.
Tom, could you try my script on your machine?
I attached the script. You need perl to run it:
mkjoindata.pl | psql test
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Attachments:
Oops,
it seems that was my fault: I didn't specify @nitems (the table sizes)
for all tables. Now it works fine.
Tom, even if it was my fault, why did postgres die?
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Oleg Bartunov <oleg@sai.msu.su> writes:
I found the problem. I modified my test script to add 'vacuum analyze'
after creating the test data, and now it works really fast! Great!
Now I'm wondering why I need vacuum analyze after creating test data
and indices?
VACUUM ANALYZE would create pg_statistic entries for the tables,
which'd allow the optimizer to make better estimates of restriction
and join selectivities. I expect that it changes the plan being used;
what does EXPLAIN say with and without the analyze?
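A minimal way to make that comparison with the test tables from the
script (cut down to two tables here just for brevity):

    -- With no statistics, the planner relies on default selectivity
    -- guesses:
    EXPLAIN select t0.a, t1.a as t1
    from t0, t1
    where t1.a_id = t0.a_t1_id;

    vacuum analyze;   -- populate per-column statistics

    -- Same query again; the row-count estimates, and possibly the
    -- join method, should now reflect the real data:
    EXPLAIN select t0.a, t1.a as t1
    from t0, t1
    where t1.a_id = t0.a_t1_id;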
regards, tom lane
Oleg Bartunov <oleg@sai.msu.su> writes:
it seems that was my fault: I didn't specify @nitems (the table sizes)
for all tables. Now it works fine.
Tom, even if it was my fault, why did postgres die?
I don't know --- I don't see it here. I just ran your script as given,
and it worked. (It produced zero rows of output, since the missing
nitems values meant that no data was loaded into the last few tables ...
but there was no backend crash.)
Is it crashing because of running out of memory, or something else?
Can you provide a backtrace?
regards, tom lane
On Mon, 17 May 1999, Tom Lane wrote:
Is it crashing because of running out of memory, or something else?
No, memory is fine. It just dies.
Can you provide a backtrace?
Will try to reproduce the crash. How can I debug psql?
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Bruce Momjian <maillist@candle.pha.pa.us> writes:
You chose 11 by comparing GEQO with non-GEQO. I think you will find
that with your improved GEQO, GEQO is faster at a smaller number of
joins, which would prevent the memory problem. Can you check the speeds
again?
Bruce, I have rerun a couple of tests and am getting numbers like these
(planning time in seconds):

                    # tables joined
                  ...      10      11  ...
    STD OPTIMIZER          24     115
    GEQO                   45      55
This is after tweaking the GEQO parameters to improve speed slightly
in the default case. (Setting EFFORT=LOW reduces the 11-way plan time
to about 40 sec, setting EFFORT=HIGH makes it about 70.)
The breakpoint for speed is still clearly at GEQO threshold 11.
*However*, the regular optimizer uses close to 120MB of memory to
plan these 11-way joins, and that's excessive (especially since that's
not even counting the space that will be used for execution...).
Until we can do something about reclaiming space more effectively,
I recommend reducing the default GEQO threshold to 10.
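For reference, the EFFORT and generation settings mentioned above are,
if memory serves, read from a pg_geqo file in the data directory in
this release; treat the file name and exact parameter names as an
assumption. A sketch with purely illustrative values:

    # $PGDATA/pg_geqo -- GEQO tuning parameters (sketch)
    Effort       low    # low / medium / high
    Pool_Size    128    # candidate plans kept per generation
    Generations  200    # iterations of the genetic algorithm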
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
The breakpoint for speed is still clearly at GEQO threshold 11.
*However*, the regular optimizer uses close to 120MB of memory to
plan these 11-way joins, and that's excessive (especially since that's
not even counting the space that will be used for execution...).
Until we can do something about reclaiming space more effectively,
I recommend reducing the default GEQO threshold to 10.
Agreed.
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026