Strange Create Index behaviour

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Gary Doades (#1)

Re: Strange Create Index behaviour

Gary Doades <gpd@gpdnet.co.uk> writes:

Platform: FreeBSD 6.0, Postgresql 8.1.2 compiled from the ports collection.

If the column that is having the index created has a certain
distribution of values then create index takes a very long time. If the
data values (integer in this case) a fairly evenly distributed then
create index is very quick, if the data values are all the same it is
very quick. I discovered that in the slow cases the column had
approximately half the values as zero and the rest fairly spread out.

Interesting. I tried your test script and got fairly close times
for all the cases on two different machines:
old HPUX machine: shortest 5800 msec, longest 7960 msec
new Fedora 4 machine: shortest 461 msec, longest 608 msec
(the HPUX machine was doing other stuff at the same time, so some
of its variation is probably only noise).

So what this looks like to me is a corner case that FreeBSD's qsort
fails to handle well.

You might try forcing Postgres to use our private copy of qsort, as we
do on Solaris for similar reasons. (The easy way to do this by hand
is to configure as normal, then alter the LIBOBJS setting in
src/Makefile.global to add "qsort.o", then proceed with normal build.)
However, I think that our private copy is descended from *BSD sources,
so it might have the same failure mode. It'd be worth finding out.

The final interesting thing is that as I increase shared buffers to 2000
or 3000 the problem gets *worse*

shared_buffers is unlikely to impact index build time noticeably in
recent PG releases. maintenance_work_mem would affect it a lot, though.
What setting were you using for that?

Can anyone else try these test cases on other platforms?

regards, tom lane

gpd@gpdnet.co.uk

about 20 years ago

In reply to: Tom Lane (#2)

Re: Strange Create Index behaviour

Tom Lane wrote:

Interesting. I tried your test script and got fairly close times
for all the cases on two different machines:
old HPUX machine: shortest 5800 msec, longest 7960 msec
new Fedora 4 machine: shortest 461 msec, longest 608 msec
(the HPUX machine was doing other stuff at the same time, so some
of its variation is probably only noise).

So what this looks like to me is a corner case that FreeBSD's qsort
fails to handle well.

You might try forcing Postgres to use our private copy of qsort, as we
do on Solaris for similar reasons. (The easy way to do this by hand
is to configure as normal, then alter the LIBOBJS setting in
src/Makefile.global to add "qsort.o", then proceed with normal build.)
However, I think that our private copy is descended from *BSD sources,
so it might have the same failure mode. It'd be worth finding out.

The final interesting thing is that as I increase shared buffers to 2000
or 3000 the problem gets *worse*

shared_buffers is unlikely to impact index build time noticeably in
recent PG releases. maintenance_work_mem would affect it a lot, though.
What setting were you using for that?

Can anyone else try these test cases on other platforms?

Thanks for that.

I've since tried it on Windows (pg 8.1.2) and the times were all
similar, around 1200ms so it might just be BSD.

I'll have to wait until tomorrow to get back to my BSD box. FreeBSD
ports makes it easy to install, so I'll have to figure out how to get in
and change things manually. I guess the appropriate files are still left
around after the ports make command finishes, so I just edit the file
and make again?

If it can't be fixed though I guess we may have a problem using BSD. I'm
surprised this hasn't been brought up before, the case doesn't seem
*that* rare. Maybe not that many using FreeBSD?

I'd certainly be interested if anyone else can repro it on FreeBSD though.

Regards,
Gary.

gpd@gpdnet.co.uk

about 20 years ago

In reply to: Tom Lane (#2)

Re: Strange Create Index behaviour

Tom Lane wrote:

shared_buffers is unlikely to impact index build time noticeably in
recent PG releases. maintenance_work_mem would affect it a lot, though.
What setting were you using for that?

Also, i tried upping maintenance_work_mem to 65536 and it didn't make
much difference (maybe 10% faster for the "normal" cases). Upping the
shared_buffers *definitely* makes the bad cases worse though, but I
agree I don't see why...

Regards,
Gary.

simon@2ndQuadrant.com

about 20 years ago

In reply to: Gary Doades (#1)

Re: Strange Create Index behaviour

On Wed, 2006-02-15 at 20:00 +0000, Gary Doades wrote:

I have put together a test case

Please enable trace_sort=on and then repeat tests and post the
accompanying log file.

I think this is simply the sort taking longer depending upon the data
distribution, but I'd like to know for sure.

Thanks,

Best Regards, Simon Riggs

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Tom Lane (#2)

Re: Strange Create Index behaviour

I wrote:

Interesting. I tried your test script and got fairly close times
for all the cases on two different machines:
old HPUX machine: shortest 5800 msec, longest 7960 msec
new Fedora 4 machine: shortest 461 msec, longest 608 msec

So what this looks like to me is a corner case that FreeBSD's qsort
fails to handle well.

I tried forcing PG to use src/port/qsort.c on the Fedora machine,
and lo and behold:
new Fedora 4 machine: shortest 434 msec, longest 8530 msec

So it sure looks like this script does expose a problem on BSD-derived
qsorts. Curiously, the case that's much the worst for me is the third
in the script, while the shortest time is the first case, which was slow
for Gary. So I'd venture that the *BSD code has been tweaked somewhere
along the way, in a manner that moves the problem around without really
fixing it. (Anyone want to compare the actual FreeBSD source to what
we have?)

This is pretty relevant stuff, because there was a thread recently
advocating that we stop using the platform qsort on all platforms:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php

It's really interesting to see a case where port/qsort is radically
worse than other qsorts ... unless we figure that out and fix it,
I think the idea of using port/qsort everywhere has just taken a
major hit.

regards, tom lane

gpd@gpdnet.co.uk

about 20 years ago

In reply to: Tom Lane (#6)

Re: Strange Create Index behaviour

Tom Lane wrote:

I tried forcing PG to use src/port/qsort.c on the Fedora machine,
and lo and behold:
new Fedora 4 machine: shortest 434 msec, longest 8530 msec

So it sure looks like this script does expose a problem on BSD-derived
qsorts. Curiously, the case that's much the worst for me is the third
in the script, while the shortest time is the first case, which was slow
for Gary. So I'd venture that the *BSD code has been tweaked somewhere
along the way, in a manner that moves the problem around without really
fixing it. (Anyone want to compare the actual FreeBSD source to what
we have?)

If I run the script again, it is not always the first case that is slow,
it varies from run to run, which is why I repeated it quite a few times
for the test.

Interestingly, if I don't delete the table after a run, but just drop
and re-create the index repeatedly it stays a pretty consistent time,
either repeatedly good or repeatedly bad!

Regards,
Gary.

gpd@gpdnet.co.uk

about 20 years ago

In reply to: Tom Lane (#6)

Re: Strange Create Index behaviour

Tom Lane wrote:

So it sure looks like this script does expose a problem on BSD-derived
qsorts. Curiously, the case that's much the worst for me is the third
in the script, while the shortest time is the first case, which was slow
for Gary. So I'd venture that the *BSD code has been tweaked somewhere
along the way, in a manner that moves the problem around without really
fixing it. (Anyone want to compare the actual FreeBSD source to what
we have?)

It's really interesting to see a case where port/qsort is radically
worse than other qsorts ... unless we figure that out and fix it,
I think the idea of using port/qsort everywhere has just taken a
major hit.

More specifically to BSD, is there any way I can use a non-BSD qsort for
building Postresql server?

Regards,
Gary.

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Simon Riggs (#5)

Re: Strange Create Index behaviour

Simon Riggs <simon@2ndquadrant.com> writes:

Please enable trace_sort=on and then repeat tests and post the
accompanying log file.

I did this on my Fedora machine with port/qsort.c, and got the results
attached. Curiously, this run has the spikes in completely different
places than the prior one did. So the random component of the test data
is affecting the results quite a lot. There seems absolutely no doubt
that we are looking at data-dependent qsort misbehavior, though. The
CPU time eaten by performsort accounts for all but about 100 msec of the
elapsed time reported on the psql side.

regards, tom lane

LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.15u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.00s/12.43u sec elapsed 12.44 sec
LOG: internal sort ended, 9861 KB used: CPU 0.01s/12.51u sec elapsed 12.52 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.00s/0.78u sec elapsed 0.78 sec
LOG: internal sort ended, 9861 KB used: CPU 0.02s/0.85u sec elapsed 0.87 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.01s/0.14u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.01s/0.96u sec elapsed 0.97 sec
LOG: internal sort ended, 9861 KB used: CPU 0.02s/1.03u sec elapsed 1.06 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.00s/0.31u sec elapsed 0.32 sec
LOG: internal sort ended, 9861 KB used: CPU 0.02s/0.38u sec elapsed 0.40 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.00s/7.91u sec elapsed 7.92 sec
LOG: internal sort ended, 9861 KB used: CPU 0.02s/7.99u sec elapsed 8.01 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.01s/0.13u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.01s/0.61u sec elapsed 0.63 sec
LOG: internal sort ended, 9861 KB used: CPU 0.04s/0.67u sec elapsed 0.71 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.01s/0.13u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.01s/11.52u sec elapsed 11.54 sec
LOG: internal sort ended, 9861 KB used: CPU 0.03s/11.59u sec elapsed 11.62 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.00s/0.45u sec elapsed 0.46 sec
LOG: internal sort ended, 9861 KB used: CPU 0.02s/0.55u sec elapsed 0.57 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.00s/0.45u sec elapsed 0.46 sec
LOG: internal sort ended, 9861 KB used: CPU 0.04s/0.54u sec elapsed 0.57 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.02s/0.12u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.02s/0.44u sec elapsed 0.46 sec
LOG: internal sort ended, 9861 KB used: CPU 0.03s/0.55u sec elapsed 0.58 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.02s/0.13u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.02s/0.44u sec elapsed 0.46 sec
LOG: internal sort ended, 9861 KB used: CPU 0.03s/0.54u sec elapsed 0.58 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.02s/0.13u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.02s/0.44u sec elapsed 0.46 sec
LOG: internal sort ended, 9861 KB used: CPU 0.04s/0.54u sec elapsed 0.59 sec

#10

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Gary Doades (#7)

Re: Strange Create Index behaviour

Gary Doades <gpd@gpdnet.co.uk> writes:

Interestingly, if I don't delete the table after a run, but just drop
and re-create the index repeatedly it stays a pretty consistent time,
either repeatedly good or repeatedly bad!

This is consistent with the theory of a data-dependent performance
problem in qsort. If you don't generate a fresh set of random test
data, then you get repeatable runtimes. With a new set of test data,
you might or might not hit the not-so-sweet-spot that we seem to have
detected.

regards, tom lane

#11

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Gary Doades (#7)

qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Gary Doades <gpd@gpdnet.co.uk> writes:

If I run the script again, it is not always the first case that is slow,
it varies from run to run, which is why I repeated it quite a few times
for the test.

For some reason I hadn't immediately twigged to the fact that your test
script is just N repetitions of the exact same structure with random data.
So it's not so surprising that you get random variations in behavior
with different test data sets.

I did some experimentation comparing the qsort from Fedora Core 4
(glibc-2.3.5-10.3) with our src/port/qsort.c. For those who weren't
following the pgsql-performance thread, the test case is just this
repeated a lot of times:

create table atest(i int4, r int4);
insert into atest (i,r) select generate_series(1,100000), 0;
insert into atest (i,r) select generate_series(1,100000), random()*100000;
\timing
create index idx on atest(r);
\timing
drop table atest;

I did this 100 times and sorted the reported runtimes. (Investigation
with trace_sort = on confirms that the runtime is almost entirely spent
in qsort() called from our performsort --- the Postgres overhead is
about 100msec on this machine.) Results are below.

It seems clear that our qsort.c is doing a pretty awful job of picking
qsort pivots, while glibc is mostly managing not to make that mistake.
I haven't looked at the glibc code yet to see what they are doing
differently.

I'd say this puts a considerable damper on my enthusiasm for using our
qsort all the time, as was recently debated in this thread:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php
We need to fix our qsort.c before pushing ahead with that idea.

regards, tom lane

100 runtimes for glibc qsort, sorted ascending:

Time: 459.860 ms
Time: 460.209 ms
Time: 460.704 ms
Time: 461.317 ms
Time: 461.538 ms
Time: 461.652 ms
Time: 461.988 ms
Time: 462.573 ms
Time: 462.638 ms
Time: 462.716 ms
Time: 462.917 ms
Time: 463.219 ms
Time: 463.455 ms
Time: 463.650 ms
Time: 463.723 ms
Time: 463.737 ms
Time: 463.750 ms
Time: 463.852 ms
Time: 463.964 ms
Time: 463.988 ms
Time: 464.003 ms
Time: 464.135 ms
Time: 464.372 ms
Time: 464.458 ms
Time: 464.496 ms
Time: 464.551 ms
Time: 464.599 ms
Time: 464.655 ms
Time: 464.656 ms
Time: 464.722 ms
Time: 464.814 ms
Time: 464.827 ms
Time: 464.878 ms
Time: 464.899 ms
Time: 464.905 ms
Time: 464.987 ms
Time: 465.055 ms
Time: 465.138 ms
Time: 465.159 ms
Time: 465.194 ms
Time: 465.310 ms
Time: 465.316 ms
Time: 465.375 ms
Time: 465.450 ms
Time: 465.535 ms
Time: 465.595 ms
Time: 465.680 ms
Time: 465.769 ms
Time: 465.865 ms
Time: 465.892 ms
Time: 465.903 ms
Time: 466.003 ms
Time: 466.154 ms
Time: 466.164 ms
Time: 466.203 ms
Time: 466.305 ms
Time: 466.344 ms
Time: 466.364 ms
Time: 466.388 ms
Time: 466.502 ms
Time: 466.593 ms
Time: 466.725 ms
Time: 466.794 ms
Time: 466.798 ms
Time: 466.904 ms
Time: 466.971 ms
Time: 466.997 ms
Time: 467.122 ms
Time: 467.146 ms
Time: 467.221 ms
Time: 467.224 ms
Time: 467.244 ms
Time: 467.277 ms
Time: 467.587 ms
Time: 468.142 ms
Time: 468.207 ms
Time: 468.237 ms
Time: 468.471 ms
Time: 468.663 ms
Time: 468.700 ms
Time: 469.235 ms
Time: 469.840 ms
Time: 470.472 ms
Time: 471.140 ms
Time: 472.811 ms
Time: 472.959 ms
Time: 474.858 ms
Time: 477.210 ms
Time: 479.571 ms
Time: 479.671 ms
Time: 482.797 ms
Time: 488.852 ms
Time: 514.639 ms
Time: 529.287 ms
Time: 612.185 ms
Time: 660.748 ms
Time: 742.227 ms
Time: 866.814 ms
Time: 1234.848 ms
Time: 1267.398 ms

100 runtimes for port/qsort.c, sorted ascending:

Time: 418.905 ms
Time: 420.611 ms
Time: 420.764 ms
Time: 420.904 ms
Time: 421.706 ms
Time: 422.466 ms
Time: 422.627 ms
Time: 423.189 ms
Time: 423.302 ms
Time: 425.096 ms
Time: 425.731 ms
Time: 425.851 ms
Time: 427.253 ms
Time: 430.113 ms
Time: 432.756 ms
Time: 432.963 ms
Time: 440.502 ms
Time: 440.640 ms
Time: 450.452 ms
Time: 458.143 ms
Time: 459.212 ms
Time: 467.706 ms
Time: 468.006 ms
Time: 468.574 ms
Time: 470.003 ms
Time: 472.313 ms
Time: 483.622 ms
Time: 492.395 ms
Time: 509.564 ms
Time: 531.037 ms
Time: 533.366 ms
Time: 535.610 ms
Time: 575.523 ms
Time: 582.688 ms
Time: 593.545 ms
Time: 647.364 ms
Time: 660.612 ms
Time: 677.312 ms
Time: 680.288 ms
Time: 697.626 ms
Time: 833.066 ms
Time: 834.511 ms
Time: 851.819 ms
Time: 920.443 ms
Time: 926.731 ms
Time: 954.289 ms
Time: 1045.214 ms
Time: 1059.200 ms
Time: 1062.328 ms
Time: 1136.018 ms
Time: 1260.091 ms
Time: 1276.883 ms
Time: 1319.351 ms
Time: 1438.854 ms
Time: 1475.457 ms
Time: 1538.211 ms
Time: 1549.004 ms
Time: 1744.642 ms
Time: 1771.258 ms
Time: 1959.530 ms
Time: 2300.140 ms
Time: 2589.641 ms
Time: 2612.780 ms
Time: 3100.024 ms
Time: 3284.125 ms
Time: 3379.792 ms
Time: 3750.278 ms
Time: 4302.278 ms
Time: 4780.624 ms
Time: 5000.056 ms
Time: 5092.604 ms
Time: 5168.722 ms
Time: 5292.941 ms
Time: 5895.964 ms
Time: 7003.164 ms
Time: 7099.449 ms
Time: 7115.083 ms
Time: 7384.940 ms
Time: 8214.010 ms
Time: 8700.771 ms
Time: 9331.225 ms
Time: 10503.360 ms
Time: 12496.026 ms
Time: 12982.474 ms
Time: 15192.390 ms
Time: 15392.161 ms
Time: 15958.295 ms
Time: 18375.693 ms
Time: 18617.706 ms
Time: 18927.515 ms
Time: 19898.018 ms
Time: 20865.979 ms
Time: 21000.907 ms
Time: 21297.585 ms
Time: 21714.518 ms
Time: 25423.235 ms
Time: 27543.052 ms
Time: 28314.182 ms
Time: 29400.278 ms
Time: 34142.534 ms

#12

simon@2ndQuadrant.com

about 20 years ago

In reply to: Tom Lane (#10)

Re: Strange Create Index behaviour

On Wed, 2006-02-15 at 16:51 -0500, Tom Lane wrote:

Gary Doades <gpd@gpdnet.co.uk> writes:

Interestingly, if I don't delete the table after a run, but just drop
and re-create the index repeatedly it stays a pretty consistent time,
either repeatedly good or repeatedly bad!

This is consistent with the theory of a data-dependent performance
problem in qsort. If you don't generate a fresh set of random test
data, then you get repeatable runtimes. With a new set of test data,
you might or might not hit the not-so-sweet-spot that we seem to have
detected.

Agreed. Good analysis...

Best Regards, Simon Riggs

#13

gpd@gpdnet.co.uk

about 20 years ago

In reply to: Tom Lane (#11)

Re: qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Tom Lane wrote:

For some reason I hadn't immediately twigged to the fact that your test
script is just N repetitions of the exact same structure with random data.
So it's not so surprising that you get random variations in behavior
with different test data sets.

It seems clear that our qsort.c is doing a pretty awful job of picking
qsort pivots, while glibc is mostly managing not to make that mistake.
I haven't looked at the glibc code yet to see what they are doing
differently.

I'd say this puts a considerable damper on my enthusiasm for using our
qsort all the time, as was recently debated in this thread:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php
We need to fix our qsort.c before pushing ahead with that idea.

[snip]

Time: 28314.182 ms
Time: 29400.278 ms
Time: 34142.534 ms

Ouch! That confirms my problem. I generated the random test case because
it was easier than including the dump of my tables, but you can
appreciate that tables 20 times the size are basically crippled when it
comes to creating an index on them.

Examining the dump and the associated times during restore it looks like
I have 7 tables with this approximate distribution, thus the
ridiculously long restore time. Better not re-index soon!

Is this likely to hit me in a random fashion during normal operation,
joins, sorts, order by for example?

So the options are:
1) Fix the included qsort.c code and use that
2) Get FreeBSD to fix their qsort code
3) Both

I guess that 1 is the real solution in case anyone else's qsort is
broken in the same way. Then at least you *could* use it all the time :)

Regards,
Gary.

#14

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Gary Doades (#13)

Re: qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Gary Doades <gpd@gpdnet.co.uk> writes:

Is this likely to hit me in a random fashion during normal operation,
joins, sorts, order by for example?

Yup, anytime you're passing data with that kind of distribution
through a sort.

So the options are:
1) Fix the included qsort.c code and use that
2) Get FreeBSD to fix their qsort code
3) Both

I guess that 1 is the real solution in case anyone else's qsort is
broken in the same way. Then at least you *could* use it all the time :)

It's reasonable to assume that most of the *BSDen have basically the
same qsort code. Ours claims to have come from NetBSD sources, but
I don't doubt that they all trace back to a common ancestor.

regards, tom lane

#15

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Gary Doades (#13)

Re: qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Gary Doades <gpd@gpdnet.co.uk> writes:

Ouch! That confirms my problem. I generated the random test case because
it was easier than including the dump of my tables, but you can
appreciate that tables 20 times the size are basically crippled when it
comes to creating an index on them.

Actually... we only use qsort when we have a sorting problem that fits
within the allowed sort memory. The external-sort logic doesn't go
through that code at all. So all the analysis we just did on your test
case doesn't necessarily apply to sort problems that are too large for
the sort_mem setting.

The test case would be sorting 200000 index entries, which'd probably
occupy at least 24 bytes apiece of sort memory, so probably about 5 meg.
A problem 20 times that size would definitely not fit in the default
16MB maintenance_work_mem. Were you using a large value of
maintenance_work_mem for your restore?

regards, tom lane

#16

rjpeace@earthlink.net

about 20 years ago

In reply to: Tom Lane (#11)

Re: qsort again (was Re: [PERFORM] Strange Create Index

This behavior is consistent with the pivot choosing algorithm
assuming certain distribution(s) for the data. For instance,
median-of-three partitioning is known to be pessimal when the data is
geometrically or hyper-geometrically distributed. Also, care must be
taken that sometimes is not when there are many equal values in the
data. Even pseudo random number generator based pivot choosing
algorithms are not immune if the PRNG is flawed in some way.

How are we choosing our pivots?

At 06:28 PM 2/15/2006, Tom Lane wrote:

Show quoted text

I did some experimentation comparing the qsort from Fedora Core 4
(glibc-2.3.5-10.3) with our src/port/qsort.c. For those who weren't
following the pgsql-performance thread, the test case is just this
repeated a lot of times:

create table atest(i int4, r int4);
insert into atest (i,r) select generate_series(1,100000), 0;
insert into atest (i,r) select generate_series(1,100000), random()*100000;
\timing
create index idx on atest(r);
\timing
drop table atest;

I did this 100 times and sorted the reported runtimes. (Investigation
with trace_sort = on confirms that the runtime is almost entirely spent
in qsort() called from our performsort --- the Postgres overhead is
about 100msec on this machine.) Results are below.

It seems clear that our qsort.c is doing a pretty awful job of picking
qsort pivots, while glibc is mostly managing not to make that mistake.
I haven't looked at the glibc code yet to see what they are doing
differently.

I'd say this puts a considerable damper on my enthusiasm for using our
qsort all the time, as was recently debated in this thread:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php
We need to fix our qsort.c before pushing ahead with that idea.

regards, tom lane

100 runtimes for glibc qsort, sorted ascending:

Time: 459.860 ms
<snip>
Time: 488.852 ms
Time: 514.639 ms
Time: 529.287 ms
Time: 612.185 ms
Time: 660.748 ms
Time: 742.227 ms
Time: 866.814 ms
Time: 1234.848 ms
Time: 1267.398 ms

100 runtimes for port/qsort.c, sorted ascending:

Time: 418.905 ms
<snip>
Time: 20865.979 ms
Time: 21000.907 ms
Time: 21297.585 ms
Time: 21714.518 ms
Time: 25423.235 ms
Time: 27543.052 ms
Time: 28314.182 ms
Time: 29400.278 ms
Time: 34142.534 ms

#17

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Tom Lane (#15)

Re: qsort again (was Re: [PERFORM] Strange Create Index behaviour)

I wrote:

Gary Doades <gpd@gpdnet.co.uk> writes:

Ouch! That confirms my problem. I generated the random test case because
it was easier than including the dump of my tables, but you can
appreciate that tables 20 times the size are basically crippled when it
comes to creating an index on them.

Actually... we only use qsort when we have a sorting problem that fits
within the allowed sort memory. The external-sort logic doesn't go
through that code at all. So all the analysis we just did on your test
case doesn't necessarily apply to sort problems that are too large for
the sort_mem setting.

I increased the size of the test case by 10x (basically s/100000/1000000/)
which is enough to push it into the external-sort regime. I get
amazingly stable runtimes now --- I didn't have the patience to run 100
trials, but in 30 trials I have slowest 11538 msec and fastest 11144 msec.
So this code path is definitely not very sensitive to this data
distribution.

While these numbers aren't glittering in comparison to the best-case
qsort times (~450 msec to sort 10% as much data), they are sure a lot
better than the worst-case times. So maybe a workaround for you is
to decrease maintenance_work_mem, counterintuitive though that be.
(Now, if you *weren't* using maintenance_work_mem of 100MB or more
for your problem restore, then I'm not sure I know what's going on...)

We still ought to try to fix qsort of course.

regards, tom lane

#18

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Ron Peacetree (#16)

Re: qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Ron <rjpeace@earthlink.net> writes:

How are we choosing our pivots?

See qsort.c: it looks like median of nine equally spaced inputs (ie,
the 1/8th points of the initial input array, plus the end points),
implemented as two rounds of median-of-three choices.  With half of the
data inputs zero, it's not too improbable for two out of the three
samples to be zeroes in which case I think the med3 result will be zero
--- so choosing a pivot of zero is much more probable than one would
like, and doing so in many levels of recursion causes the problem.

I think. I'm not too sure if the code isn't just being sloppy about the
case where many data values are equal to the pivot --- there's a special
case there to switch to insertion sort, and maybe that's getting invoked
too soon. It'd be useful to get a line-level profile of the behavior of
this code in the slow cases...

regards, tom lane

#19

Dann Corbit

DCorbit@connx.com

about 20 years ago

In reply to: Tom Lane (#18)

Re: qsort again (was Re: [PERFORM] Strange Create Index behaviour)

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
owner@postgresql.org] On Behalf Of Tom Lane
Sent: Wednesday, February 15, 2006 5:22 PM
To: Ron
Cc: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

Index

behaviour)

Ron <rjpeace@earthlink.net> writes:

How are we choosing our pivots?

See qsort.c: it looks like median of nine equally spaced inputs (ie,
the 1/8th points of the initial input array, plus the end points),
implemented as two rounds of median-of-three choices. With half of

the

data inputs zero, it's not too improbable for two out of the three
samples to be zeroes in which case I think the med3 result will be

zero

--- so choosing a pivot of zero is much more probable than one would
like, and doing so in many levels of recursion causes the problem.

Adding some randomness to the selection of the pivot is a known
technique to fix the oddball partitions problem. However, Bentley and
Sedgewick proved that every quick sort algorithm has some input set that
makes it go quadratic (hence the recent popularity of introspective
sort, which switches to heapsort if quadratic behavior is detected. The
C++ template I submitted was an example of introspective sort, but
PostgreSQL does not use C++ so it was not helpful).

I think. I'm not too sure if the code isn't just being sloppy about

the

case where many data values are equal to the pivot --- there's a

special

case there to switch to insertion sort, and maybe that's getting

invoked

too soon.

Here are some cases known to make qsort go quadratic:
1. Data already sorted
2. Data reverse sorted
3. Data organ-pipe sorted or ramp
4. Almost all data of the same value

There are probably other cases. Randomizing the pivot helps some, as
does check for in-order or reverse order partitions.

Imagine if 1/3 of the partitions fall into a category that causes
quadratic behavior (have one of the above formats and have more than
CUTOFF elements in them).

It is doubtful that the switch to insertion sort is causing any sort of
problems. It is only going to be invoked on tiny sets, for which it has
a fixed cost that is probably less that qsort() function calls on sets
of the same size.

It'd be useful to get a line-level profile of the behavior of
this code in the slow cases...

I guess that my in-order or presorted tests [which often arise when
there are very few distinct values] may solve the bad partition
problems. Don't forget that the algorithm is called recursively.

regards, tom lane

---------------------------(end of

broadcast)---------------------------

Show quoted text

TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Import Notes

Resolved by subject fallback

#20

simon@2ndQuadrant.com

about 20 years ago

In reply to: Gary Doades (#1)

Re: Strange Create Index behaviour

On Wed, 2006-02-15 at 20:00 +0000, Gary Doades wrote:

I have put together a test case that demonstrates the problem (see
below). I create a simple table, as close in structure to one of my
problem tables and populate an integer column with 100,000 zeros follow
by 100,000 random integers between 0 and 100,000. Then create an index
on this column. I then drop the table and repeat. The create index
should take around 1-2 seconds. A fair proportion of the time it takes
50 seconds!!!

If I fill the same row with all random data the create index always
takes a second or two. If I fill the column with all zeros everything is
still OK.

Aside from the importance of investigating sort behaviour, have you
tried to build a partial index WHERE col > 0 ? That way you wouldn't
even be indexing the zeros.

Best Regards, Simon Riggs

#21

Christopher Kings-Lynne

chriskl@familyhealth.com.au

about 20 years ago

In reply to: Gary Doades (#13)

#22

simon@2ndQuadrant.com

about 20 years ago

In reply to: Tom Lane (#17)

#23

Neil Conway

neilc@samurai.com

about 20 years ago

In reply to: Tom Lane (#11)

#24

Qingqing Zhou

zhouqq@cs.toronto.edu

about 20 years ago

In reply to: Gary Doades (#1)

#25

rjpeace@earthlink.net

about 20 years ago

In reply to: Tom Lane (#18)

#26

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Qingqing Zhou (#24)

#27

Qingqing Zhou

zhouqq@cs.toronto.edu

about 20 years ago

In reply to: Gary Doades (#1)

#28

Qingqing Zhou

zhouqq@cs.toronto.edu

about 20 years ago

In reply to: Gary Doades (#1)

#29

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Qingqing Zhou (#27)

#30

gpd@gpdnet.co.uk

about 20 years ago

In reply to: Tom Lane (#17)

#31

Steinar H. Gunderson

sgunderson@bigfoot.com

about 20 years ago

In reply to: Ron Peacetree (#25)

#32

Florian Weimer

fw@deneb.enyo.de

about 20 years ago

In reply to: Neil Conway (#23)

#33

kleptog@svana.org

about 20 years ago

In reply to: Florian Weimer (#32)

#34

Sven Geisler

sgeisler@aeccom.com

about 20 years ago

In reply to: Martijn van Oosterhout (#33)

#35

rjpeace@earthlink.net

about 20 years ago

In reply to: Steinar H. Gunderson (#31)

#36

rjpeace@earthlink.net

about 20 years ago

In reply to: Florian Weimer (#32)

#37

schabi@logix-tt.com

about 20 years ago

In reply to: Ron Peacetree (#35)

#38

Jonah H. Harris

jonah.harris@gmail.com

about 20 years ago

In reply to: Markus Schaber (#37)

#39

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Gary Doades (#30)

#40

kleptog@svana.org

about 20 years ago

In reply to: Ron Peacetree (#35)

#41

gpd@gpdnet.co.uk

about 20 years ago

In reply to: Tom Lane (#39)

#42

rjpeace@earthlink.net

about 20 years ago

In reply to: Martijn van Oosterhout (#40)

#43

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Ron Peacetree (#42)

#44

Craig A. James

cjames@modgraph-usa.com

about 20 years ago

In reply to: Markus Schaber (#37)

#45

rjpeace@earthlink.net

about 20 years ago

In reply to: Ron Peacetree (#42)

#46

kleptog@svana.org

about 20 years ago

In reply to: Ron Peacetree (#45)

#47

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Craig A. James (#44)

#48

Scott Lamb

slamb@slamb.org

about 20 years ago

In reply to: Ron Peacetree (#45)

#49

Dann Corbit

DCorbit@connx.com

about 20 years ago

In reply to: Tom Lane (#47)

#50

rjpeace@earthlink.net

about 20 years ago

In reply to: Scott Lamb (#48)

#51

Neil Conway

neilc@samurai.com

about 20 years ago

In reply to: Steinar H. Gunderson (#31)

#52

Mark Lewis

mark.lewis@mir3.com

about 20 years ago

In reply to: Tom Lane (#47)

#53

schabi@logix-tt.com

about 20 years ago

In reply to: Mark Lewis (#52)

#54

kleptog@svana.org

about 20 years ago

In reply to: Mark Lewis (#52)

#55

Bruce Momjian

bruce@momjian.us

about 20 years ago

In reply to: Markus Schaber (#53)

#56

PFC

lists@peufeu.com

about 20 years ago

In reply to: Mark Lewis (#52)

#57

Mark Lewis

mark.lewis@mir3.com

about 20 years ago

In reply to: Bruce Momjian (#55)

#58

schabi@logix-tt.com

about 20 years ago

In reply to: PFC (#56)

#59

Steinar H. Gunderson

sgunderson@bigfoot.com

about 20 years ago

In reply to: PFC (#56)

#60

David Lang

dlang@invendra.net

about 20 years ago

In reply to: Mark Lewis (#57)

#61

rjpeace@earthlink.net

about 20 years ago

In reply to: Ron Peacetree (#50)

#62

Jens-Wolfhard Schicke

ml+pgsql-hackers@asco.de

about 20 years ago

In reply to: Dann Corbit (#49)

#63

kleptog@svana.org

about 20 years ago

In reply to: Jens-Wolfhard Schicke (#62)

#64

Ragnar

gnari@hive.is

about 20 years ago

In reply to: Ron Peacetree (#61)

#65

schabi@logix-tt.com

about 20 years ago

In reply to: David Lang (#60)

#66

schabi@logix-tt.com

about 20 years ago

In reply to: Ron Peacetree (#61)

#67

PFC

lists@peufeu.com

about 20 years ago

In reply to: Markus Schaber (#66)

#68

rjpeace@earthlink.net

about 20 years ago

In reply to: Ragnar (#64)

#69

rjpeace@earthlink.net

about 20 years ago

In reply to: Markus Schaber (#66)

#70

kleptog@svana.org

about 20 years ago

In reply to: Ron Peacetree (#69)

#71

Scott Lamb

slamb@slamb.org

about 20 years ago

In reply to: Mark Lewis (#52)

#72

kleptog@svana.org

about 20 years ago

In reply to: Scott Lamb (#71)

#73

rjpeace@earthlink.net

about 20 years ago

In reply to: Martijn van Oosterhout (#70)

#74

Mark Lewis

mark.lewis@mir3.com

about 20 years ago

In reply to: David Lang (#60)

#75

Ragnar

gnari@hive.is

about 20 years ago

In reply to: Ron Peacetree (#68)

#76

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Mark Lewis (#74)

#77

Gregory Maxwell

gmaxwell@gmail.com

about 20 years ago

In reply to: Ragnar (#75)

#78

rjpeace@earthlink.net

about 20 years ago

In reply to: Dann Corbit (#19)

#79

Bruce Momjian

bruce@momjian.us

about 20 years ago

In reply to: Dann Corbit (#19)

#80

Jonah H. Harris

jonah.harris@gmail.com

about 20 years ago

In reply to: Bruce Momjian (#79)

#81

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Tom Lane (#11)

#82

Craig A. James

cjames@modgraph-usa.com

about 20 years ago

In reply to: Tom Lane (#81)

#83

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Craig A. James (#82)

#84

Craig A. James

cjames@modgraph-usa.com

about 20 years ago

In reply to: Tom Lane (#83)

#85