GiST concurrency commited

Started by Teodor Sigaev almost 21 years ago. 26 messages. List: hackers
#1Teodor Sigaev
teodor@sigaev.ru

Do we have a list named something like 'testing focus for 8.1'? If it exists,
then GiST concurrency and recovery testing should be added to it, especially
recovery after a crash. Of course, Oleg and I are now going to begin a large
test program.

While running tests with concurrent select/insert/update/delete/vacuum/vacuum
full, I found that postgres sometimes crashes in index_beginscan_internal on
FunctionCall3, because the structure 'procedure' becomes zeroed. As I understand
it, LockRelation can invalidate part of the Relation structure, so I moved
GET_REL_PROCEDURE after LockRelation. It seems to me this patch should be
backpatched, or else another fix is needed. The problem occurred 2-4 times per
million statements executed by 4 flows.

And there is one more problem: approximately once per 2-4 million
statements, I got traps:
TRAP: FailedAssertion("!((*curpage)->offsets_used == num_tuples)", File:
"vacuum.c", Line: 2766)
LOG: server process (PID 15847) was terminated by signal 6
Sorry, but I couldn't debug this trap, and my knowledge of this piece of code
is very limited. Postgres didn't create a core file. I don't believe this
problem is related to my GiST framework, because it is about heap pages. I
suspect the trap occurs during a concurrent vacuum, but I am not sure.

PS
My concurrency testing scripts:
http://www.sigaev.ru/gist/
concur.pl - generator of SQL statements
concur.sh - simple wrapper around concur.pl which reinitializes the db and creates the db and table.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Teodor Sigaev (#1)
Re: GiST concurrency commited

Teodor Sigaev <teodor@sigaev.ru> writes:

While running tests with concurrent
select/insert/update/delete/vacuum/vacuum full, I found that postgres
sometimes crashes in index_beginscan_internal on FunctionCall3, because
the structure 'procedure' becomes zeroed. As I understand it, LockRelation
can invalidate part of the Relation structure, so I moved
GET_REL_PROCEDURE after LockRelation.

Oooh, good catch.

It seems to me this patch
should be backpatched, or else another fix is needed.

No, it's not an issue in the back branches, because until recently
GET_REL_PROCEDURE only fetched the function OID.
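
The ordering problem discussed above can be illustrated with a toy model. This is only a sketch, not PostgreSQL code: ToyRelation, get_procedure and lock are hypothetical stand-ins for the relation cache entry, GET_REL_PROCEDURE and LockRelation, and it assumes that taking the lock may process an invalidation that wipes cached lookup data in place.

```python
# Toy model only: these classes and functions are hypothetical and are not
# PostgreSQL code. They mimic "fetch cached data, then lock" vs the reverse.

class ToyRelation:
    """Stand-in for a relation cache entry holding looked-up procedure info."""

    def __init__(self):
        self.proc_info = {}               # cached lookup data; empty means "zeroed"
        self.invalidation_pending = False

    def get_procedure(self):
        # Loose analogue of GET_REL_PROCEDURE: fill the cache on first use.
        if not self.proc_info:
            self.proc_info["fn"] = lambda key: "scan(%s)" % key
        return self.proc_info

    def lock(self):
        # Loose analogue of LockRelation: acquiring the lock can process a
        # pending invalidation, which wipes the cached data in place.
        if self.invalidation_pending:
            self.proc_info.clear()
            self.invalidation_pending = False


def beginscan_fetch_then_lock(rel, key):
    proc = rel.get_procedure()    # fetched before the lock
    rel.lock()                    # may zero what `proc` refers to
    fn = proc.get("fn")
    if fn is None:
        raise RuntimeError("procedure info was zeroed while taking the lock")
    return fn(key)


def beginscan_lock_then_fetch(rel, key):
    rel.lock()                    # invalidations are processed first
    proc = rel.get_procedure()    # cache is refilled if it was wiped
    return proc["fn"](key)
```

In this model the failure only shows up when an invalidation happens to be pending at lock time, which is consistent with a crash seen only a few times per million statements. A version that fetched only an identifier (such as a function OID) before the lock would not hold a reference into the wiped data.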

And there is one more problem: approximately once per 2-4 million
statements, I got traps:
TRAP: FailedAssertion("!((*curpage)->offsets_used == num_tuples)", File:
"vacuum.c", Line: 2766)
LOG: server process (PID 15847) was terminated by signal 6

Odd. Will look at it later (after feature freeze), if you don't find
the cause beforehand.

regards, tom lane

#3Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#2)
Re: GiST concurrency commited

I think the whole GiST limitations page can be removed now...

http://developer.postgresql.org/docs/postgres/limitations.html

Chris

#4Teodor Sigaev
teodor@sigaev.ru
In reply to: Tom Lane (#2)
Re: GiST concurrency commited

And there is one more problem: approximately once per 2-4 million
statements, I got traps:
TRAP: FailedAssertion("!((*curpage)->offsets_used == num_tuples)", File:
"vacuum.c", Line: 2766)
LOG: server process (PID 15847) was terminated by signal 6

Odd. Will look at it later (after feature freeze), if you don't find
the cause beforehand.

It's definitely a bug in the vacuum code; I got the same trap without any GiST
indexes (to reproduce, just comment out the 'create index' command in my script).

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#5Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Teodor Sigaev (#1)
Re: GiST concurrency commited

"Teodor Sigaev" <teodor@sigaev.ru> writes

concur.pl - generator of SQL statements

retrieving it is forbidden ...

Regards,
Qingqing

#6Teodor Sigaev
teodor@sigaev.ru
In reply to: Qingqing Zhou (#5)
Re: GiST concurrency commited

Sorry, fixed.

Qingqing Zhou wrote:

"Teodor Sigaev" <teodor@sigaev.ru> writes

concur.pl - generator of SQL statements

retrieving it is forbidden ...

Regards,
Qingqing


--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#7Bruce Momjian
bruce@momjian.us
In reply to: Christopher Kings-Lynne (#3)
Re: GiST concurrency commited

Christopher Kings-Lynne wrote:

I think the whole GiST limitations page can be removed now...

http://developer.postgresql.org/docs/postgres/limitations.html

Done.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Teodor Sigaev (#1)
VACUUM/t_ctid bug (was Re: GiST concurrency commited)

Awhile back, Teodor Sigaev <teodor@sigaev.ru> wrote:

And there is one more problem: approximately once per 2-4 million
statements, I got traps:
TRAP: FailedAssertion("!((*curpage)->offsets_used == num_tuples)", File:
"vacuum.c", Line: 2766)
LOG: server process (PID 15847) was terminated by signal 6
Sorry, but I couldn't debug this trap, and my knowledge of this piece of code
is very limited. Postgres didn't create a core file. I don't believe this
problem is related to my GiST framework, because it is about heap pages. I
suspect the trap occurs during a concurrent vacuum, but I am not sure.

PS
My concurrency testing scripts:
http://www.sigaev.ru/gist/
concur.pl - generator of SQL statements
concur.sh - simple wrapper around concur.pl which reinitializes the db and creates the db and table.

I have committed changes that I believe fix this problem:
http://archives.postgresql.org/pgsql-committers/2005-08/msg00213.php
But it needs more testing. Would you update to CVS tip and see if you
still see the failure?

Also, if anyone else has some vacuum + concurrent update test cases,
any testing you can do in CVS tip would be useful. This patch is big
and ugly enough that back-patching it into all the supported back
branches is a pretty scary prospect. I don't think we have a lot of
choice --- it is a data-loss risk --- but we need to beat the heck
out of the CVS-tip version before we start pushing it into the release
branches.

My current intention is to leave it just in CVS tip for the next few
days, and not to start developing back-branch versions until after
we've made the first 8.1 beta release. The back-ports are going to
be painful (the code involved has changed often enough that I fear
each branch will need a custom tailored patch) ... so I really don't
want to start without some confidence that the CVS-tip patch is right.

In other words ... if you can test this ... HELP!!!

regards, tom lane

#9Gavin Sherry
swm@linuxworld.com.au
In reply to: Tom Lane (#8)
Re: VACUUM/t_ctid bug (was Re: GiST concurrency commited)

On Sat, 20 Aug 2005, Tom Lane wrote:

I have committed changes that I believe fix this problem:
http://archives.postgresql.org/pgsql-committers/2005-08/msg00213.php
But it needs more testing. Would you update to CVS tip and see if you
still see the failure?

I've written some quick scripts. One just vacuums constantly (999 vacuums
to 1 vacuum full) while three other scripts randomly insert
into, update and delete from 3 tables. There's a mix of small and large
transactions. The tables have a single int column. It is set up to run 3
million transactions across the 3 scripts.

I will try and jump onto one of the larger OSDL machines to test as well.

Gavin

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gavin Sherry (#9)
Re: VACUUM/t_ctid bug (was Re: GiST concurrency commited)

Gavin Sherry <swm@linuxworld.com.au> writes:

I've written some quick scripts. One just vacuums constantly (999 vacuums
to 1 vacuum full) while three other scripts randomly insert
into, update and delete from 3 tables. There's a mix of small and large
transactions. The tables have a single int column. It is set up to run 3
million transactions across the 3 scripts.

Note that since the issues have mainly to do with update chains, it'd be
good to stress cases where a row is updated multiple times before being
deleted. And use at least one long-running transaction, so that VACUUM
can't just throw away the update chain.
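
A minimal sketch of the workload shape suggested above: one session holds a snapshot open while another inserts a row, updates it several times and then deletes it. The table name chain_test and its columns (id, v) are assumptions made for illustration, as are the function names; the code only builds SQL strings and does not connect to a database.

```python
def update_chain_statements(row_id, n_updates, table="chain_test"):
    """Insert one row, update it n_updates times, then delete it.

    Each UPDATE creates a new row version, so the row ends up with a chain
    of n_updates + 1 versions before the DELETE.
    """
    stmts = ["INSERT INTO %s (id, v) VALUES (%d, 0)" % (table, row_id)]
    for i in range(1, n_updates + 1):
        stmts.append("UPDATE %s SET v = %d WHERE id = %d" % (table, i, row_id))
    stmts.append("DELETE FROM %s WHERE id = %d" % (table, row_id))
    return stmts


def long_running_reader_statements(table="chain_test"):
    """Open a transaction and take a snapshot; the caller keeps it open.

    No COMMIT is emitted on purpose: the session should stay idle in the
    transaction while the writers and VACUUM run, and commit at the end.
    """
    return [
        "BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE",
        "SELECT count(*) FROM %s" % table,
    ]
```

The reader statements would be sent on one connection first, the update chains on other connections, and VACUUM in yet another session, so that the old row versions are still visible to the open snapshot when VACUUM runs.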

regards, tom lane

#11Teodor Sigaev
teodor@sigaev.ru
In reply to: Tom Lane (#8)
Re: VACUUM/t_ctid bug (was Re: GiST concurrency commited)

In other words ... if you can test this ... HELP!!!

I'll run tests.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#12Gavin Sherry
swm@linuxworld.com.au
In reply to: Tom Lane (#10)
Re: VACUUM/t_ctid bug (was Re: GiST concurrency commited)

On Sat, 20 Aug 2005, Tom Lane wrote:

Gavin Sherry <swm@linuxworld.com.au> writes:

I've written some quick scripts. One just vacuums constantly (999 vacuums
to 1 vacuum full) while three other scripts randomly insert
into, update and delete from 3 tables. There's a mix of small and large
transactions. The tables have a single int column. It is set up to run 3
million transactions across the 3 scripts.

Note that since the issues have mainly to do with update chains, it'd be
good to stress cases where a row is updated multiple times before being
deleted. And use at least one long-running transaction, so that VACUUM
can't just throw away the update chain.

Right.

I modified the test to have multiple updates of a given row mixed with
concurrent long-running read transactions. Vacuum was running repeatedly
in a concurrent session. I did not encounter any problems.

However, the results are inconclusive since I ran the same test against
HEAD from 10 days ago and didn't manage to trigger the problem Teodor's
script did. I'll take a better look tomorrow.

Gavin

#13Teodor Sigaev
teodor@sigaev.ru
In reply to: Tom Lane (#8)
Re: VACUUM/t_ctid bug (was Re: GiST concurrency commited)

I have committed changes that I believe fix this problem:
http://archives.postgresql.org/pgsql-committers/2005-08/msg00213.php
But it needs more testing. Would you update to CVS tip and see if you
still see the failure?

It seems the patch works correctly. My tests passed with approximately 1e8 SQL
statements without any failure. The tests ran on two boxes, a PIII/1133MHz and
a Quad Xeon/500MHz, with four threads.

Next, I'm going to increase concurrency up to 12 parallel threads.

PS GiST passed these tests too.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#14Teodor Sigaev
teodor@sigaev.ru
In reply to: Teodor Sigaev (#13)
Re: VACUUM/t_ctid bug (was Re: GiST concurrency commited)

Next, I'm going to increase concurrency up to 12 parallel threads.

All is ok; the test passed with approximately 40 million statements.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#15Hannu Krosing
hannu@tm.ee
In reply to: Teodor Sigaev (#14)
Re: VACUUM/t_ctid bug (was Re: GiST concurrency commited)

On K, 2005-08-24 at 15:52 +0400, Teodor Sigaev wrote:

Next, I'm going to increase concurrency up to 12 parallel threads.

All is ok; the test passed with approximately 40 million statements.

I have sent a patch to patches list enabling concurrent vacuums to
actually reclaim space while another long vacuum is running, i.e.
vacuums will no longer block each other from removing deleted tuples.

see:

http://archives.postgresql.org/pgsql-patches/2005-08/msg00304.php

Could you perhaps test this patch as well, while you already have a
setup for testing parallel vacuums under big loads ?

Or perhaps you can share the setup/scripts/data so that I could run your
test myself as well ?

--
Hannu Krosing <hannu@skype.net>

#16Teodor Sigaev
teodor@sigaev.ru
In reply to: Hannu Krosing (#15)
Re: VACUUM/t_ctid bug (was Re: GiST concurrency commited)

http://archives.postgresql.org/pgsql-patches/2005-08/msg00304.php

Could you perhaps test this patch as well, while you already have a
setup for testing parallel vacuums under big loads ?

Ok, I'll do it.

Or perhaps you can share the setup/scripts/data so that I could run your
test myself as well ?

Statements generator:
http://www.sigaev.ru/gist/concur.pl
Usage:
./concur.pl -d DATABASE [-n N] [-c N] [-l]
-d DATABASE - DATABASE name
-n N - number of rows
-c N - number of flow
-l - linear mode

The script guarantees the same result (i.e. the same data in the table) for the
same -n and -c options; the result doesn't depend on the -l option. The script
also guarantees the absence of deadlocks.

% perl concur.pl -d qq -n 100000 -c 4
Start: parallel mode with 4 flows
3 flow finish. Stats: ni:25000 nu:555 nd:81 nv:4(nf:1) nt:232
0 flow finish. Stats: ni:25000 nu:554 nd:77 nv:2(nf:0) nt:247
1 flow finish. Stats: ni:25000 nu:548 nd:65 nv:4(nf:1) nt:249
2 flow finish. Stats: ni:25000 nu:552 nd:79 nv:7(nf:2) nt:263
All flow finish; status: 0; elapsed time: 159.25 sec

The script prints some statistics per flow:
ni - number of insert statements (the sum of these should equal -n N)
nu - number of update statements
nd - number of delete statements
nv - number of vacuums (including full vacuums)
nf - number of full vacuums
nt - number of transactions
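
As an illustration of how the per-flow statistics can be checked, here is a small sketch that parses the "flow finish" lines shown above and sums the counters. The regular expression is inferred from the sample output only, and the function names are hypothetical; they are not part of concur.pl.

```python
import re

# Pattern inferred from the sample output above, e.g.
# "3 flow finish. Stats: ni:25000 nu:555 nd:81 nv:4(nf:1) nt:232"
STATS_RE = re.compile(
    r"^(?P<flow>\d+) flow finish\. Stats: "
    r"ni:(?P<ni>\d+) nu:(?P<nu>\d+) nd:(?P<nd>\d+) "
    r"nv:(?P<nv>\d+)\(nf:(?P<nf>\d+)\) nt:(?P<nt>\d+)$"
)


def parse_flow_stats(output):
    """Return one dict of integer counters per 'flow finish' line."""
    rows = []
    for line in output.splitlines():
        m = STATS_RE.match(line.strip())
        if m:
            rows.append({k: int(v) for k, v in m.groupdict().items()})
    return rows


def totals(rows):
    """Sum each counter across all flows."""
    keys = ("ni", "nu", "nd", "nv", "nf", "nt")
    return {k: sum(r[k] for r in rows) for k in keys}


SAMPLE = """\
Start: parallel mode with 4 flows
3 flow finish. Stats: ni:25000 nu:555 nd:81 nv:4(nf:1) nt:232
0 flow finish. Stats: ni:25000 nu:554 nd:77 nv:2(nf:0) nt:247
1 flow finish. Stats: ni:25000 nu:548 nd:65 nv:4(nf:1) nt:249
2 flow finish. Stats: ni:25000 nu:552 nd:79 nv:7(nf:2) nt:263
All flow finish; status: 0; elapsed time: 159.25 sec
"""

if __name__ == "__main__":
    t = totals(parse_flow_stats(SAMPLE))
    # For the run above (-n 100000) the inserts should sum to 100000.
    print(t["ni"] == 100000, t)
```

For the sample run, the ni counters sum to 100000, matching the -n 100000 option. A run in which some flows aborted would show fewer parsed lines than the -c value.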

A simple wrapper for manipulating the test table and looping the test (you
should edit it to set the right paths):
http://www.sigaev.ru/gist/concur.sh

These scripts were written to test GiST concurrency.

I suspect the generator needs some changes to increase the number of
updates and vacuums for your purpose.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Teodor Sigaev (#16)
Re: VACUUM/t_ctid bug (was Re: GiST concurrency commited)

Teodor Sigaev <teodor@sigaev.ru> writes:

http://www.sigaev.ru/gist/concur.pl
http://www.sigaev.ru/gist/concur.sh

BTW, these scripts seem to indicate that there's a GIST or
contrib/intarray problem in the 8.0 branch. I was trying to use 'em
to test REL8_0_STABLE branch tip to verify my t_ctid chain backpatch,
and I pretty consistently see "Problem with update":

Start: parallel mode with 4 flows
Problem with update {77,77}:0 count:1 at concur.pl line 91.
Issuing rollback() for database handle being DESTROY'd without explicit disconnect().
Problem with update {43,24}:3 count:1 at concur.pl line 91.
Issuing rollback() for database handle being DESTROY'd without explicit disconnect().
Problem with update {43,43}:2 count:1 at concur.pl line 91.
Issuing rollback() for database handle being DESTROY'd without explicit disconnect().
1 flow finish. Stats: ni:75000 nu:1661 nd:216 nv:13(nf:3) nt:780
All flow finish; status: 255; elapsed time: 265.48 sec

Is this something that can be fixed for 8.0.4?

regards, tom lane

#18Teodor Sigaev
teodor@sigaev.ru
In reply to: Hannu Krosing (#15)
Re: VACUUM/t_ctid bug (was Re: GiST concurrency commited)

http://archives.postgresql.org/pgsql-patches/2005-08/msg00304.php

Could you perhaps test this patch as well, while you already have a
setup for testing parallel vacuums under big loads ?

I didn't find any problem with your patch during testing with 1e8 statements...

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#19Hannu Krosing
hannu@tm.ee
In reply to: Teodor Sigaev (#18)
Re: VACUUM/t_ctid bug (was Re: GiST concurrency commited)

On R, 2005-08-26 at 16:47 +0400, Teodor Sigaev wrote:

http://archives.postgresql.org/pgsql-patches/2005-08/msg00304.php

Could you perhaps test this patch as well, while you already have a
setup for testing parallel vacuums under big loads ?

I didn't find any problem with your patch during testing with 1e8 statements...

Thank You Very Much !

--
Hannu Krosing <hannu@skype.net>

#20Teodor Sigaev
teodor@sigaev.ru
In reply to: Tom Lane (#17)
Re: VACUUM/t_ctid bug (was Re: GiST concurrency commited)

The problem found in GiST isn't simple to resolve. I'm working on it. The
problem is with the update query...

Tom Lane wrote:

Teodor Sigaev <teodor@sigaev.ru> writes:

http://www.sigaev.ru/gist/concur.pl
http://www.sigaev.ru/gist/concur.sh

BTW, these scripts seem to indicate that there's a GIST or
contrib/intarray problem in the 8.0 branch. I was trying to use 'em
to test REL8_0_STABLE branch tip to verify my t_ctid chain backpatch,
and I pretty consistently see "Problem with update":

Start: parallel mode with 4 flows
Problem with update {77,77}:0 count:1 at concur.pl line 91.
Issuing rollback() for database handle being DESTROY'd without explicit disconnect().
Problem with update {43,24}:3 count:1 at concur.pl line 91.
Issuing rollback() for database handle being DESTROY'd without explicit disconnect().
Problem with update {43,43}:2 count:1 at concur.pl line 91.
Issuing rollback() for database handle being DESTROY'd without explicit disconnect().
1 flow finish. Stats: ni:75000 nu:1661 nd:216 nv:13(nf:3) nt:780
All flow finish; status: 255; elapsed time: 265.48 sec

Is this something that can be fixed for 8.0.4?

regards, tom lane

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#21Teodor Sigaev
teodor@sigaev.ru
In reply to: Tom Lane (#17)
#22Mario Weilguni
mweilguni@sime.com
In reply to: Teodor Sigaev (#21)
#23Teodor Sigaev
teodor@sigaev.ru
In reply to: Mario Weilguni (#22)
#24Mario Weilguni
mweilguni@sime.com
In reply to: Teodor Sigaev (#23)
#25Teodor Sigaev
teodor@sigaev.ru
In reply to: Mario Weilguni (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Teodor Sigaev (#21)