Strange Create Index behaviour

Started by Gary Doadesabout 20 years ago86 messageshackers
Jump to latest
#1Gary Doades
gpd@gpdnet.co.uk

Platform: FreeBSD 6.0, Postgresql 8.1.2 compiled from the ports collection.

Not sure if this belongs in performance or bugs..

A pg_restore of my 2.5GB database was taking up to 2 hours to complete
instead of the expected 10-15 minutes. Checking the server it was mostly
CPU bound. Testing further I discovered that is was spending huge
amounts of CPU time creating some indexes.

It took a while to find out, but basically it boils down to this:

If the column that is having the index created has a certain
distribution of values then create index takes a very long time. If the
data values (integer in this case) a fairly evenly distributed then
create index is very quick, if the data values are all the same it is
very quick. I discovered that in the slow cases the column had
approximately half the values as zero and the rest fairly spread out.
One column started off with around 400,000 zeros and the rest of the
following rows spread between values of 1 to 500,000.

I have put together a test case that demonstrates the problem (see
below). I create a simple table, as close in structure to one of my
problem tables and populate an integer column with 100,000 zeros follow
by 100,000 random integers between 0 and 100,000. Then create an index
on this column. I then drop the table and repeat. The create index
should take around 1-2 seconds. A fair proportion of the time it takes
50 seconds!!!

If I fill the same row with all random data the create index always
takes a second or two. If I fill the column with all zeros everything is
still OK.

When my tables that I am trying to restore are over 2 million rows the
creating one index can take an hour!! (almost all CPU time).

All other areas of performance, once the dump is restored and analysed
seem to be OK, even large hash/merge joins and sorts

This is entirely repeatable in FreeBSD in that around half the time
create index will be incredibly slow.

All postgresql.conf settings are at the defaults for the test initially
(fresh install)

The final interesting thing is that as I increase shared buffers to 2000
or 3000 the problem gets *worse*

The following text is output from the test script..

select version();
version

------------------------------------------------------------------------------------------------
PostgreSQL 8.1.2 on i386-portbld-freebsd6.0, compiled by GCC cc (GCC)
3.4.4 [FreeBSD] 20050518
(1 row)

\timing
Timing is on.

----- Many slow cases, note the 50+ seconds cases

create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 81.859 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),0,now(),now();
INSERT 0 100000
Time: 1482.141 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1543.508 ms
create index idx on atest(r);
CREATE INDEX
Time: 56685.230 ms

drop table atest;
DROP TABLE
Time: 4.616 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 6.889 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),0,now(),now();
INSERT 0 100000
Time: 2009.787 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1828.663 ms
create index idx on atest(r);
CREATE INDEX
Time: 3991.257 ms

drop table atest;
DROP TABLE
Time: 3.796 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 19.965 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),0,now(),now();
INSERT 0 100000
Time: 1625.059 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 2622.827 ms
create index idx on atest(r);
CREATE INDEX
Time: 1082.799 ms

drop table atest;
DROP TABLE
Time: 4.627 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 2.953 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),0,now(),now();
INSERT 0 100000
Time: 2068.744 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 2671.420 ms
create index idx on atest(r);
CREATE INDEX
Time: 8047.660 ms

drop table atest;
DROP TABLE
Time: 3.675 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 2.582 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),0,now(),now();
INSERT 0 100000
Time: 1723.987 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 2263.131 ms
create index idx on atest(r);
CREATE INDEX
Time: 50050.308 ms

drop table atest;
DROP TABLE
Time: 52.744 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 25.370 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),0,now(),now();
INSERT 0 100000
Time: 2052.733 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 2631.317 ms
create index idx on atest(r);
CREATE INDEX
Time: 61440.897 ms

drop table atest;
DROP TABLE
Time: 26.137 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 24.794 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),0,now(),now();
INSERT 0 100000
Time: 2851.977 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1553.046 ms
create index idx on atest(r);
CREATE INDEX
Time: 1774.920 ms

---- Fast (Normal?) cases

drop table atest;
DROP TABLE
Time: 4.422 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 2.543 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1516.246 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1407.400 ms
create index idx on atest(r);
CREATE INDEX
Time: 903.503 ms

drop table atest;
DROP TABLE
Time: 3.820 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 22.861 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1455.556 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 2037.996 ms
create index idx on atest(r);
CREATE INDEX
Time: 718.286 ms

drop table atest;
DROP TABLE
Time: 4.503 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 3.448 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1523.540 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1261.473 ms
create index idx on atest(r);
CREATE INDEX
Time: 727.707 ms

drop table atest;
DROP TABLE
Time: 3.564 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 2.897 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1447.504 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1403.525 ms
create index idx on atest(r);
CREATE INDEX
Time: 754.577 ms

drop table atest;
DROP TABLE
Time: 4.633 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 3.196 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1618.544 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1530.450 ms
create index idx on atest(r);
CREATE INDEX
Time: 802.980 ms
drop table atest;
DROP TABLE
Time: 4.707 ms
mserver#

Regards,
Gary.

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gary Doades (#1)
Re: Strange Create Index behaviour

Gary Doades <gpd@gpdnet.co.uk> writes:

Platform: FreeBSD 6.0, Postgresql 8.1.2 compiled from the ports collection.

If the column that is having the index created has a certain
distribution of values then create index takes a very long time. If the
data values (integer in this case) a fairly evenly distributed then
create index is very quick, if the data values are all the same it is
very quick. I discovered that in the slow cases the column had
approximately half the values as zero and the rest fairly spread out.

Interesting. I tried your test script and got fairly close times
for all the cases on two different machines:
old HPUX machine: shortest 5800 msec, longest 7960 msec
new Fedora 4 machine: shortest 461 msec, longest 608 msec
(the HPUX machine was doing other stuff at the same time, so some
of its variation is probably only noise).

So what this looks like to me is a corner case that FreeBSD's qsort
fails to handle well.

You might try forcing Postgres to use our private copy of qsort, as we
do on Solaris for similar reasons. (The easy way to do this by hand
is to configure as normal, then alter the LIBOBJS setting in
src/Makefile.global to add "qsort.o", then proceed with normal build.)
However, I think that our private copy is descended from *BSD sources,
so it might have the same failure mode. It'd be worth finding out.

The final interesting thing is that as I increase shared buffers to 2000
or 3000 the problem gets *worse*

shared_buffers is unlikely to impact index build time noticeably in
recent PG releases. maintenance_work_mem would affect it a lot, though.
What setting were you using for that?

Can anyone else try these test cases on other platforms?

regards, tom lane

#3Gary Doades
gpd@gpdnet.co.uk
In reply to: Tom Lane (#2)
Re: Strange Create Index behaviour

Tom Lane wrote:

Interesting. I tried your test script and got fairly close times
for all the cases on two different machines:
old HPUX machine: shortest 5800 msec, longest 7960 msec
new Fedora 4 machine: shortest 461 msec, longest 608 msec
(the HPUX machine was doing other stuff at the same time, so some
of its variation is probably only noise).

So what this looks like to me is a corner case that FreeBSD's qsort
fails to handle well.

You might try forcing Postgres to use our private copy of qsort, as we
do on Solaris for similar reasons. (The easy way to do this by hand
is to configure as normal, then alter the LIBOBJS setting in
src/Makefile.global to add "qsort.o", then proceed with normal build.)
However, I think that our private copy is descended from *BSD sources,
so it might have the same failure mode. It'd be worth finding out.

The final interesting thing is that as I increase shared buffers to 2000
or 3000 the problem gets *worse*

shared_buffers is unlikely to impact index build time noticeably in
recent PG releases. maintenance_work_mem would affect it a lot, though.
What setting were you using for that?

Can anyone else try these test cases on other platforms?

Thanks for that.

I've since tried it on Windows (pg 8.1.2) and the times were all
similar, around 1200ms so it might just be BSD.

I'll have to wait until tomorrow to get back to my BSD box. FreeBSD
ports makes it easy to install, so I'll have to figure out how to get in
and change things manually. I guess the appropriate files are still left
around after the ports make command finishes, so I just edit the file
and make again?

If it can't be fixed though I guess we may have a problem using BSD. I'm
surprised this hasn't been brought up before, the case doesn't seem
*that* rare. Maybe not that many using FreeBSD?

I'd certainly be interested if anyone else can repro it on FreeBSD though.

Regards,
Gary.

#4Gary Doades
gpd@gpdnet.co.uk
In reply to: Tom Lane (#2)
Re: Strange Create Index behaviour

Tom Lane wrote:

shared_buffers is unlikely to impact index build time noticeably in
recent PG releases. maintenance_work_mem would affect it a lot, though.
What setting were you using for that?

Also, i tried upping maintenance_work_mem to 65536 and it didn't make
much difference (maybe 10% faster for the "normal" cases). Upping the
shared_buffers *definitely* makes the bad cases worse though, but I
agree I don't see why...

Regards,
Gary.

#5Simon Riggs
simon@2ndQuadrant.com
In reply to: Gary Doades (#1)
Re: Strange Create Index behaviour

On Wed, 2006-02-15 at 20:00 +0000, Gary Doades wrote:

I have put together a test case

Please enable trace_sort=on and then repeat tests and post the
accompanying log file.

I think this is simply the sort taking longer depending upon the data
distribution, but I'd like to know for sure.

Thanks,

Best Regards, Simon Riggs

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: Strange Create Index behaviour

I wrote:

Interesting. I tried your test script and got fairly close times
for all the cases on two different machines:
old HPUX machine: shortest 5800 msec, longest 7960 msec
new Fedora 4 machine: shortest 461 msec, longest 608 msec

So what this looks like to me is a corner case that FreeBSD's qsort
fails to handle well.

I tried forcing PG to use src/port/qsort.c on the Fedora machine,
and lo and behold:
new Fedora 4 machine: shortest 434 msec, longest 8530 msec

So it sure looks like this script does expose a problem on BSD-derived
qsorts. Curiously, the case that's much the worst for me is the third
in the script, while the shortest time is the first case, which was slow
for Gary. So I'd venture that the *BSD code has been tweaked somewhere
along the way, in a manner that moves the problem around without really
fixing it. (Anyone want to compare the actual FreeBSD source to what
we have?)

This is pretty relevant stuff, because there was a thread recently
advocating that we stop using the platform qsort on all platforms:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php

It's really interesting to see a case where port/qsort is radically
worse than other qsorts ... unless we figure that out and fix it,
I think the idea of using port/qsort everywhere has just taken a
major hit.

regards, tom lane

#7Gary Doades
gpd@gpdnet.co.uk
In reply to: Tom Lane (#6)
Re: Strange Create Index behaviour

Tom Lane wrote:

I tried forcing PG to use src/port/qsort.c on the Fedora machine,
and lo and behold:
new Fedora 4 machine: shortest 434 msec, longest 8530 msec

So it sure looks like this script does expose a problem on BSD-derived
qsorts. Curiously, the case that's much the worst for me is the third
in the script, while the shortest time is the first case, which was slow
for Gary. So I'd venture that the *BSD code has been tweaked somewhere
along the way, in a manner that moves the problem around without really
fixing it. (Anyone want to compare the actual FreeBSD source to what
we have?)

If I run the script again, it is not always the first case that is slow,
it varies from run to run, which is why I repeated it quite a few times
for the test.

Interestingly, if I don't delete the table after a run, but just drop
and re-create the index repeatedly it stays a pretty consistent time,
either repeatedly good or repeatedly bad!

Regards,
Gary.

#8Gary Doades
gpd@gpdnet.co.uk
In reply to: Tom Lane (#6)
Re: Strange Create Index behaviour

Tom Lane wrote:

So it sure looks like this script does expose a problem on BSD-derived
qsorts. Curiously, the case that's much the worst for me is the third
in the script, while the shortest time is the first case, which was slow
for Gary. So I'd venture that the *BSD code has been tweaked somewhere
along the way, in a manner that moves the problem around without really
fixing it. (Anyone want to compare the actual FreeBSD source to what
we have?)

It's really interesting to see a case where port/qsort is radically
worse than other qsorts ... unless we figure that out and fix it,
I think the idea of using port/qsort everywhere has just taken a
major hit.

More specifically to BSD, is there any way I can use a non-BSD qsort for
building Postresql server?

Regards,
Gary.

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#5)
Re: Strange Create Index behaviour

Simon Riggs <simon@2ndquadrant.com> writes:

Please enable trace_sort=on and then repeat tests and post the
accompanying log file.

I did this on my Fedora machine with port/qsort.c, and got the results
attached. Curiously, this run has the spikes in completely different
places than the prior one did. So the random component of the test data
is affecting the results quite a lot. There seems absolutely no doubt
that we are looking at data-dependent qsort misbehavior, though. The
CPU time eaten by performsort accounts for all but about 100 msec of the
elapsed time reported on the psql side.

regards, tom lane

LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.15u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.00s/12.43u sec elapsed 12.44 sec
LOG: internal sort ended, 9861 KB used: CPU 0.01s/12.51u sec elapsed 12.52 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.00s/0.78u sec elapsed 0.78 sec
LOG: internal sort ended, 9861 KB used: CPU 0.02s/0.85u sec elapsed 0.87 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.01s/0.14u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.01s/0.96u sec elapsed 0.97 sec
LOG: internal sort ended, 9861 KB used: CPU 0.02s/1.03u sec elapsed 1.06 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.00s/0.31u sec elapsed 0.32 sec
LOG: internal sort ended, 9861 KB used: CPU 0.02s/0.38u sec elapsed 0.40 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.00s/7.91u sec elapsed 7.92 sec
LOG: internal sort ended, 9861 KB used: CPU 0.02s/7.99u sec elapsed 8.01 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.01s/0.13u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.01s/0.61u sec elapsed 0.63 sec
LOG: internal sort ended, 9861 KB used: CPU 0.04s/0.67u sec elapsed 0.71 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.01s/0.13u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.01s/11.52u sec elapsed 11.54 sec
LOG: internal sort ended, 9861 KB used: CPU 0.03s/11.59u sec elapsed 11.62 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.00s/0.45u sec elapsed 0.46 sec
LOG: internal sort ended, 9861 KB used: CPU 0.02s/0.55u sec elapsed 0.57 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.00s/0.45u sec elapsed 0.46 sec
LOG: internal sort ended, 9861 KB used: CPU 0.04s/0.54u sec elapsed 0.57 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.02s/0.12u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.02s/0.44u sec elapsed 0.46 sec
LOG: internal sort ended, 9861 KB used: CPU 0.03s/0.55u sec elapsed 0.58 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.02s/0.13u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.02s/0.44u sec elapsed 0.46 sec
LOG: internal sort ended, 9861 KB used: CPU 0.03s/0.54u sec elapsed 0.58 sec
LOG: begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG: performsort starting: CPU 0.02s/0.13u sec elapsed 0.15 sec
LOG: performsort done: CPU 0.02s/0.44u sec elapsed 0.46 sec
LOG: internal sort ended, 9861 KB used: CPU 0.04s/0.54u sec elapsed 0.59 sec

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gary Doades (#7)
Re: Strange Create Index behaviour

Gary Doades <gpd@gpdnet.co.uk> writes:

Interestingly, if I don't delete the table after a run, but just drop
and re-create the index repeatedly it stays a pretty consistent time,
either repeatedly good or repeatedly bad!

This is consistent with the theory of a data-dependent performance
problem in qsort. If you don't generate a fresh set of random test
data, then you get repeatable runtimes. With a new set of test data,
you might or might not hit the not-so-sweet-spot that we seem to have
detected.

regards, tom lane

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gary Doades (#7)
qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Gary Doades <gpd@gpdnet.co.uk> writes:

If I run the script again, it is not always the first case that is slow,
it varies from run to run, which is why I repeated it quite a few times
for the test.

For some reason I hadn't immediately twigged to the fact that your test
script is just N repetitions of the exact same structure with random data.
So it's not so surprising that you get random variations in behavior
with different test data sets.

I did some experimentation comparing the qsort from Fedora Core 4
(glibc-2.3.5-10.3) with our src/port/qsort.c. For those who weren't
following the pgsql-performance thread, the test case is just this
repeated a lot of times:

create table atest(i int4, r int4);
insert into atest (i,r) select generate_series(1,100000), 0;
insert into atest (i,r) select generate_series(1,100000), random()*100000;
\timing
create index idx on atest(r);
\timing
drop table atest;

I did this 100 times and sorted the reported runtimes. (Investigation
with trace_sort = on confirms that the runtime is almost entirely spent
in qsort() called from our performsort --- the Postgres overhead is
about 100msec on this machine.) Results are below.

It seems clear that our qsort.c is doing a pretty awful job of picking
qsort pivots, while glibc is mostly managing not to make that mistake.
I haven't looked at the glibc code yet to see what they are doing
differently.

I'd say this puts a considerable damper on my enthusiasm for using our
qsort all the time, as was recently debated in this thread:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php
We need to fix our qsort.c before pushing ahead with that idea.

regards, tom lane

100 runtimes for glibc qsort, sorted ascending:

Time: 459.860 ms
Time: 460.209 ms
Time: 460.704 ms
Time: 461.317 ms
Time: 461.538 ms
Time: 461.652 ms
Time: 461.988 ms
Time: 462.573 ms
Time: 462.638 ms
Time: 462.716 ms
Time: 462.917 ms
Time: 463.219 ms
Time: 463.455 ms
Time: 463.650 ms
Time: 463.723 ms
Time: 463.737 ms
Time: 463.750 ms
Time: 463.852 ms
Time: 463.964 ms
Time: 463.988 ms
Time: 464.003 ms
Time: 464.135 ms
Time: 464.372 ms
Time: 464.458 ms
Time: 464.496 ms
Time: 464.551 ms
Time: 464.599 ms
Time: 464.655 ms
Time: 464.656 ms
Time: 464.722 ms
Time: 464.814 ms
Time: 464.827 ms
Time: 464.878 ms
Time: 464.899 ms
Time: 464.905 ms
Time: 464.987 ms
Time: 465.055 ms
Time: 465.138 ms
Time: 465.159 ms
Time: 465.194 ms
Time: 465.310 ms
Time: 465.316 ms
Time: 465.375 ms
Time: 465.450 ms
Time: 465.535 ms
Time: 465.595 ms
Time: 465.680 ms
Time: 465.769 ms
Time: 465.865 ms
Time: 465.892 ms
Time: 465.903 ms
Time: 466.003 ms
Time: 466.154 ms
Time: 466.164 ms
Time: 466.203 ms
Time: 466.305 ms
Time: 466.344 ms
Time: 466.364 ms
Time: 466.388 ms
Time: 466.502 ms
Time: 466.593 ms
Time: 466.725 ms
Time: 466.794 ms
Time: 466.798 ms
Time: 466.904 ms
Time: 466.971 ms
Time: 466.997 ms
Time: 467.122 ms
Time: 467.146 ms
Time: 467.221 ms
Time: 467.224 ms
Time: 467.244 ms
Time: 467.277 ms
Time: 467.587 ms
Time: 468.142 ms
Time: 468.207 ms
Time: 468.237 ms
Time: 468.471 ms
Time: 468.663 ms
Time: 468.700 ms
Time: 469.235 ms
Time: 469.840 ms
Time: 470.472 ms
Time: 471.140 ms
Time: 472.811 ms
Time: 472.959 ms
Time: 474.858 ms
Time: 477.210 ms
Time: 479.571 ms
Time: 479.671 ms
Time: 482.797 ms
Time: 488.852 ms
Time: 514.639 ms
Time: 529.287 ms
Time: 612.185 ms
Time: 660.748 ms
Time: 742.227 ms
Time: 866.814 ms
Time: 1234.848 ms
Time: 1267.398 ms

100 runtimes for port/qsort.c, sorted ascending:

Time: 418.905 ms
Time: 420.611 ms
Time: 420.764 ms
Time: 420.904 ms
Time: 421.706 ms
Time: 422.466 ms
Time: 422.627 ms
Time: 423.189 ms
Time: 423.302 ms
Time: 425.096 ms
Time: 425.731 ms
Time: 425.851 ms
Time: 427.253 ms
Time: 430.113 ms
Time: 432.756 ms
Time: 432.963 ms
Time: 440.502 ms
Time: 440.640 ms
Time: 450.452 ms
Time: 458.143 ms
Time: 459.212 ms
Time: 467.706 ms
Time: 468.006 ms
Time: 468.574 ms
Time: 470.003 ms
Time: 472.313 ms
Time: 483.622 ms
Time: 492.395 ms
Time: 509.564 ms
Time: 531.037 ms
Time: 533.366 ms
Time: 535.610 ms
Time: 575.523 ms
Time: 582.688 ms
Time: 593.545 ms
Time: 647.364 ms
Time: 660.612 ms
Time: 677.312 ms
Time: 680.288 ms
Time: 697.626 ms
Time: 833.066 ms
Time: 834.511 ms
Time: 851.819 ms
Time: 920.443 ms
Time: 926.731 ms
Time: 954.289 ms
Time: 1045.214 ms
Time: 1059.200 ms
Time: 1062.328 ms
Time: 1136.018 ms
Time: 1260.091 ms
Time: 1276.883 ms
Time: 1319.351 ms
Time: 1438.854 ms
Time: 1475.457 ms
Time: 1538.211 ms
Time: 1549.004 ms
Time: 1744.642 ms
Time: 1771.258 ms
Time: 1959.530 ms
Time: 2300.140 ms
Time: 2589.641 ms
Time: 2612.780 ms
Time: 3100.024 ms
Time: 3284.125 ms
Time: 3379.792 ms
Time: 3750.278 ms
Time: 4302.278 ms
Time: 4780.624 ms
Time: 5000.056 ms
Time: 5092.604 ms
Time: 5168.722 ms
Time: 5292.941 ms
Time: 5895.964 ms
Time: 7003.164 ms
Time: 7099.449 ms
Time: 7115.083 ms
Time: 7384.940 ms
Time: 8214.010 ms
Time: 8700.771 ms
Time: 9331.225 ms
Time: 10503.360 ms
Time: 12496.026 ms
Time: 12982.474 ms
Time: 15192.390 ms
Time: 15392.161 ms
Time: 15958.295 ms
Time: 18375.693 ms
Time: 18617.706 ms
Time: 18927.515 ms
Time: 19898.018 ms
Time: 20865.979 ms
Time: 21000.907 ms
Time: 21297.585 ms
Time: 21714.518 ms
Time: 25423.235 ms
Time: 27543.052 ms
Time: 28314.182 ms
Time: 29400.278 ms
Time: 34142.534 ms

#12Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#10)
Re: Strange Create Index behaviour

On Wed, 2006-02-15 at 16:51 -0500, Tom Lane wrote:

Gary Doades <gpd@gpdnet.co.uk> writes:

Interestingly, if I don't delete the table after a run, but just drop
and re-create the index repeatedly it stays a pretty consistent time,
either repeatedly good or repeatedly bad!

This is consistent with the theory of a data-dependent performance
problem in qsort. If you don't generate a fresh set of random test
data, then you get repeatable runtimes. With a new set of test data,
you might or might not hit the not-so-sweet-spot that we seem to have
detected.

Agreed. Good analysis...

Best Regards, Simon Riggs

#13Gary Doades
gpd@gpdnet.co.uk
In reply to: Tom Lane (#11)
Re: qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Tom Lane wrote:

For some reason I hadn't immediately twigged to the fact that your test
script is just N repetitions of the exact same structure with random data.
So it's not so surprising that you get random variations in behavior
with different test data sets.

It seems clear that our qsort.c is doing a pretty awful job of picking
qsort pivots, while glibc is mostly managing not to make that mistake.
I haven't looked at the glibc code yet to see what they are doing
differently.

I'd say this puts a considerable damper on my enthusiasm for using our
qsort all the time, as was recently debated in this thread:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php
We need to fix our qsort.c before pushing ahead with that idea.

[snip]

Time: 28314.182 ms
Time: 29400.278 ms
Time: 34142.534 ms

Ouch! That confirms my problem. I generated the random test case because
it was easier than including the dump of my tables, but you can
appreciate that tables 20 times the size are basically crippled when it
comes to creating an index on them.

Examining the dump and the associated times during restore it looks like
I have 7 tables with this approximate distribution, thus the
ridiculously long restore time. Better not re-index soon!

Is this likely to hit me in a random fashion during normal operation,
joins, sorts, order by for example?

So the options are:
1) Fix the included qsort.c code and use that
2) Get FreeBSD to fix their qsort code
3) Both

I guess that 1 is the real solution in case anyone else's qsort is
broken in the same way. Then at least you *could* use it all the time :)

Regards,
Gary.

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gary Doades (#13)
Re: qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Gary Doades <gpd@gpdnet.co.uk> writes:

Is this likely to hit me in a random fashion during normal operation,
joins, sorts, order by for example?

Yup, anytime you're passing data with that kind of distribution
through a sort.

So the options are:
1) Fix the included qsort.c code and use that
2) Get FreeBSD to fix their qsort code
3) Both

I guess that 1 is the real solution in case anyone else's qsort is
broken in the same way. Then at least you *could* use it all the time :)

It's reasonable to assume that most of the *BSDen have basically the
same qsort code. Ours claims to have come from NetBSD sources, but
I don't doubt that they all trace back to a common ancestor.

regards, tom lane

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gary Doades (#13)
Re: qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Gary Doades <gpd@gpdnet.co.uk> writes:

Ouch! That confirms my problem. I generated the random test case because
it was easier than including the dump of my tables, but you can
appreciate that tables 20 times the size are basically crippled when it
comes to creating an index on them.

Actually... we only use qsort when we have a sorting problem that fits
within the allowed sort memory. The external-sort logic doesn't go
through that code at all. So all the analysis we just did on your test
case doesn't necessarily apply to sort problems that are too large for
the sort_mem setting.

The test case would be sorting 200000 index entries, which'd probably
occupy at least 24 bytes apiece of sort memory, so probably about 5 meg.
A problem 20 times that size would definitely not fit in the default
16MB maintenance_work_mem. Were you using a large value of
maintenance_work_mem for your restore?

regards, tom lane

#16Ron Peacetree
rjpeace@earthlink.net
In reply to: Tom Lane (#11)
Re: qsort again (was Re: [PERFORM] Strange Create Index

This behavior is consistent with the pivot choosing algorithm
assuming certain distribution(s) for the data. For instance,
median-of-three partitioning is known to be pessimal when the data is
geometrically or hyper-geometrically distributed. Also, care must be
taken that sometimes is not when there are many equal values in the
data. Even pseudo random number generator based pivot choosing
algorithms are not immune if the PRNG is flawed in some way.

How are we choosing our pivots?

At 06:28 PM 2/15/2006, Tom Lane wrote:

Show quoted text

I did some experimentation comparing the qsort from Fedora Core 4
(glibc-2.3.5-10.3) with our src/port/qsort.c. For those who weren't
following the pgsql-performance thread, the test case is just this
repeated a lot of times:

create table atest(i int4, r int4);
insert into atest (i,r) select generate_series(1,100000), 0;
insert into atest (i,r) select generate_series(1,100000), random()*100000;
\timing
create index idx on atest(r);
\timing
drop table atest;

I did this 100 times and sorted the reported runtimes. (Investigation
with trace_sort = on confirms that the runtime is almost entirely spent
in qsort() called from our performsort --- the Postgres overhead is
about 100msec on this machine.) Results are below.

It seems clear that our qsort.c is doing a pretty awful job of picking
qsort pivots, while glibc is mostly managing not to make that mistake.
I haven't looked at the glibc code yet to see what they are doing
differently.

I'd say this puts a considerable damper on my enthusiasm for using our
qsort all the time, as was recently debated in this thread:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php
We need to fix our qsort.c before pushing ahead with that idea.

regards, tom lane

100 runtimes for glibc qsort, sorted ascending:

Time: 459.860 ms
<snip>
Time: 488.852 ms
Time: 514.639 ms
Time: 529.287 ms
Time: 612.185 ms
Time: 660.748 ms
Time: 742.227 ms
Time: 866.814 ms
Time: 1234.848 ms
Time: 1267.398 ms

100 runtimes for port/qsort.c, sorted ascending:

Time: 418.905 ms
<snip>
Time: 20865.979 ms
Time: 21000.907 ms
Time: 21297.585 ms
Time: 21714.518 ms
Time: 25423.235 ms
Time: 27543.052 ms
Time: 28314.182 ms
Time: 29400.278 ms
Time: 34142.534 ms

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#15)
Re: qsort again (was Re: [PERFORM] Strange Create Index behaviour)

I wrote:

Gary Doades <gpd@gpdnet.co.uk> writes:

Ouch! That confirms my problem. I generated the random test case because
it was easier than including the dump of my tables, but you can
appreciate that tables 20 times the size are basically crippled when it
comes to creating an index on them.

Actually... we only use qsort when we have a sorting problem that fits
within the allowed sort memory. The external-sort logic doesn't go
through that code at all. So all the analysis we just did on your test
case doesn't necessarily apply to sort problems that are too large for
the sort_mem setting.

I increased the size of the test case by 10x (basically s/100000/1000000/)
which is enough to push it into the external-sort regime. I get
amazingly stable runtimes now --- I didn't have the patience to run 100
trials, but in 30 trials I have slowest 11538 msec and fastest 11144 msec.
So this code path is definitely not very sensitive to this data
distribution.

While these numbers aren't glittering in comparison to the best-case
qsort times (~450 msec to sort 10% as much data), they are sure a lot
better than the worst-case times. So maybe a workaround for you is
to decrease maintenance_work_mem, counterintuitive though that be.
(Now, if you *weren't* using maintenance_work_mem of 100MB or more
for your problem restore, then I'm not sure I know what's going on...)

We still ought to try to fix qsort of course.

regards, tom lane

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ron Peacetree (#16)
Re: qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Ron <rjpeace@earthlink.net> writes:

How are we choosing our pivots?

See qsort.c: it looks like median of nine equally spaced inputs (ie,
the 1/8th points of the initial input array, plus the end points),
implemented as two rounds of median-of-three choices.  With half of the
data inputs zero, it's not too improbable for two out of the three
samples to be zeroes in which case I think the med3 result will be zero
--- so choosing a pivot of zero is much more probable than one would
like, and doing so in many levels of recursion causes the problem.

I think. I'm not too sure if the code isn't just being sloppy about the
case where many data values are equal to the pivot --- there's a special
case there to switch to insertion sort, and maybe that's getting invoked
too soon. It'd be useful to get a line-level profile of the behavior of
this code in the slow cases...

regards, tom lane

#19Dann Corbit
DCorbit@connx.com
In reply to: Tom Lane (#18)
Re: qsort again (was Re: [PERFORM] Strange Create Index behaviour)

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
owner@postgresql.org] On Behalf Of Tom Lane
Sent: Wednesday, February 15, 2006 5:22 PM
To: Ron
Cc: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

Index

behaviour)

Ron <rjpeace@earthlink.net> writes:

How are we choosing our pivots?

See qsort.c: it looks like median of nine equally spaced inputs (ie,
the 1/8th points of the initial input array, plus the end points),
implemented as two rounds of median-of-three choices. With half of

the

data inputs zero, it's not too improbable for two out of the three
samples to be zeroes in which case I think the med3 result will be

zero

--- so choosing a pivot of zero is much more probable than one would
like, and doing so in many levels of recursion causes the problem.

Adding some randomness to the selection of the pivot is a known
technique to fix the oddball partitions problem. However, Bentley and
Sedgewick proved that every quick sort algorithm has some input set that
makes it go quadratic (hence the recent popularity of introspective
sort, which switches to heapsort if quadratic behavior is detected. The
C++ template I submitted was an example of introspective sort, but
PostgreSQL does not use C++ so it was not helpful).

I think. I'm not too sure if the code isn't just being sloppy about

the

case where many data values are equal to the pivot --- there's a

special

case there to switch to insertion sort, and maybe that's getting

invoked

too soon.

Here are some cases known to make qsort go quadratic:
1. Data already sorted
2. Data reverse sorted
3. Data organ-pipe sorted or ramp
4. Almost all data of the same value

There are probably other cases. Randomizing the pivot helps some, as
does check for in-order or reverse order partitions.

Imagine if 1/3 of the partitions fall into a category that causes
quadratic behavior (have one of the above formats and have more than
CUTOFF elements in them).

It is doubtful that the switch to insertion sort is causing any sort of
problems. It is only going to be invoked on tiny sets, for which it has
a fixed cost that is probably less that qsort() function calls on sets
of the same size.

It'd be useful to get a line-level profile of the behavior of
this code in the slow cases...

I guess that my in-order or presorted tests [which often arise when
there are very few distinct values] may solve the bad partition
problems. Don't forget that the algorithm is called recursively.

regards, tom lane

---------------------------(end of

broadcast)---------------------------

Show quoted text

TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

#20Simon Riggs
simon@2ndQuadrant.com
In reply to: Gary Doades (#1)
Re: Strange Create Index behaviour

On Wed, 2006-02-15 at 20:00 +0000, Gary Doades wrote:

I have put together a test case that demonstrates the problem (see
below). I create a simple table, as close in structure to one of my
problem tables and populate an integer column with 100,000 zeros follow
by 100,000 random integers between 0 and 100,000. Then create an index
on this column. I then drop the table and repeat. The create index
should take around 1-2 seconds. A fair proportion of the time it takes
50 seconds!!!

If I fill the same row with all random data the create index always
takes a second or two. If I fill the column with all zeros everything is
still OK.

Aside from the importance of investigating sort behaviour, have you
tried to build a partial index WHERE col > 0 ? That way you wouldn't
even be indexing the zeros.

Best Regards, Simon Riggs

#21Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Gary Doades (#13)
#22Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#17)
#23Neil Conway
neilc@samurai.com
In reply to: Tom Lane (#11)
#24Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Gary Doades (#1)
#25Ron Peacetree
rjpeace@earthlink.net
In reply to: Tom Lane (#18)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Qingqing Zhou (#24)
#27Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Gary Doades (#1)
#28Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Gary Doades (#1)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Qingqing Zhou (#27)
#30Gary Doades
gpd@gpdnet.co.uk
In reply to: Tom Lane (#17)
#31Steinar H. Gunderson
sgunderson@bigfoot.com
In reply to: Ron Peacetree (#25)
#32Florian Weimer
fw@deneb.enyo.de
In reply to: Neil Conway (#23)
#33Martijn van Oosterhout
kleptog@svana.org
In reply to: Florian Weimer (#32)
#34Sven Geisler
sgeisler@aeccom.com
In reply to: Martijn van Oosterhout (#33)
#35Ron Peacetree
rjpeace@earthlink.net
In reply to: Steinar H. Gunderson (#31)
#36Ron Peacetree
rjpeace@earthlink.net
In reply to: Florian Weimer (#32)
#37Markus Schaber
schabi@logix-tt.com
In reply to: Ron Peacetree (#35)
#38Jonah H. Harris
jonah.harris@gmail.com
In reply to: Markus Schaber (#37)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gary Doades (#30)
#40Martijn van Oosterhout
kleptog@svana.org
In reply to: Ron Peacetree (#35)
#41Gary Doades
gpd@gpdnet.co.uk
In reply to: Tom Lane (#39)
#42Ron Peacetree
rjpeace@earthlink.net
In reply to: Martijn van Oosterhout (#40)
#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ron Peacetree (#42)
#44Craig A. James
cjames@modgraph-usa.com
In reply to: Markus Schaber (#37)
#45Ron Peacetree
rjpeace@earthlink.net
In reply to: Ron Peacetree (#42)
#46Martijn van Oosterhout
kleptog@svana.org
In reply to: Ron Peacetree (#45)
#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Craig A. James (#44)
#48Scott Lamb
slamb@slamb.org
In reply to: Ron Peacetree (#45)
#49Dann Corbit
DCorbit@connx.com
In reply to: Tom Lane (#47)
#50Ron Peacetree
rjpeace@earthlink.net
In reply to: Scott Lamb (#48)
#51Neil Conway
neilc@samurai.com
In reply to: Steinar H. Gunderson (#31)
#52Mark Lewis
mark.lewis@mir3.com
In reply to: Tom Lane (#47)
#53Markus Schaber
schabi@logix-tt.com
In reply to: Mark Lewis (#52)
#54Martijn van Oosterhout
kleptog@svana.org
In reply to: Mark Lewis (#52)
#55Bruce Momjian
bruce@momjian.us
In reply to: Markus Schaber (#53)
#56PFC
lists@peufeu.com
In reply to: Mark Lewis (#52)
#57Mark Lewis
mark.lewis@mir3.com
In reply to: Bruce Momjian (#55)
#58Markus Schaber
schabi@logix-tt.com
In reply to: PFC (#56)
#59Steinar H. Gunderson
sgunderson@bigfoot.com
In reply to: PFC (#56)
#60David Lang
dlang@invendra.net
In reply to: Mark Lewis (#57)
#61Ron Peacetree
rjpeace@earthlink.net
In reply to: Ron Peacetree (#50)
#62Jens-Wolfhard Schicke
ml+pgsql-hackers@asco.de
In reply to: Dann Corbit (#49)
#63Martijn van Oosterhout
kleptog@svana.org
In reply to: Jens-Wolfhard Schicke (#62)
#64Ragnar
gnari@hive.is
In reply to: Ron Peacetree (#61)
#65Markus Schaber
schabi@logix-tt.com
In reply to: David Lang (#60)
#66Markus Schaber
schabi@logix-tt.com
In reply to: Ron Peacetree (#61)
#67PFC
lists@peufeu.com
In reply to: Markus Schaber (#66)
#68Ron Peacetree
rjpeace@earthlink.net
In reply to: Ragnar (#64)
#69Ron Peacetree
rjpeace@earthlink.net
In reply to: Markus Schaber (#66)
#70Martijn van Oosterhout
kleptog@svana.org
In reply to: Ron Peacetree (#69)
#71Scott Lamb
slamb@slamb.org
In reply to: Mark Lewis (#52)
#72Martijn van Oosterhout
kleptog@svana.org
In reply to: Scott Lamb (#71)
#73Ron Peacetree
rjpeace@earthlink.net
In reply to: Martijn van Oosterhout (#70)
#74Mark Lewis
mark.lewis@mir3.com
In reply to: David Lang (#60)
#75Ragnar
gnari@hive.is
In reply to: Ron Peacetree (#68)
#76Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Lewis (#74)
#77Gregory Maxwell
gmaxwell@gmail.com
In reply to: Ragnar (#75)
#78Ron Peacetree
rjpeace@earthlink.net
In reply to: Dann Corbit (#19)
#79Bruce Momjian
bruce@momjian.us
In reply to: Dann Corbit (#19)
#80Jonah H. Harris
jonah.harris@gmail.com
In reply to: Bruce Momjian (#79)
#81Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#11)
#82Craig A. James
cjames@modgraph-usa.com
In reply to: Tom Lane (#81)
#83Tom Lane
tgl@sss.pgh.pa.us
In reply to: Craig A. James (#82)
#84Craig A. James
cjames@modgraph-usa.com
In reply to: Tom Lane (#83)
#85Tom Lane
tgl@sss.pgh.pa.us
In reply to: Craig A. James (#84)
#86Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Craig A. James (#84)