GIN data corruption bug(s) in 9.6devel
Hi,
while repeating some full-text benchmarks on master, I've discovered
that there's a data corruption bug somewhere. What happens is that while
loading data into a table with GIN indexes (using multiple parallel
connections), I sometimes get this:
TRAP: FailedAssertion("!(((PageHeader) (page))->pd_special >=
(__builtin_offsetof (PageHeaderData, pd_linp)))", File: "ginfast.c",
Line: 537)
LOG: server process (PID 22982) was terminated by signal 6: Aborted
DETAIL: Failed process was running: autovacuum: ANALYZE messages
The details of the assert are always exactly the same - it's always
autovacuum and it trips on exactly the same check. And the backtrace
always looks like this (full backtrace attached):
#0 0x00007f133b635045 in raise () from /lib64/libc.so.6
#1 0x00007f133b6364ea in abort () from /lib64/libc.so.6
#2 0x00000000007dc007 in ExceptionalCondition
(conditionName=conditionName@entry=0x81a088 "!(((PageHeader)
(page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))",
errorType=errorType@entry=0x81998b "FailedAssertion",
fileName=fileName@entry=0x83480a "ginfast.c",
lineNumber=lineNumber@entry=537) at assert.c:54
#3 0x00000000004894aa in shiftList (stats=0x0, fill_fsm=1 '\001',
newHead=26357, metabuffer=130744, index=0x7f133c0f7518) at ginfast.c:537
#4 ginInsertCleanup (ginstate=ginstate@entry=0x7ffd98ac9160,
vac_delay=vac_delay@entry=1 '\001', fill_fsm=fill_fsm@entry=1 '\001',
stats=stats@entry=0x0) at ginfast.c:908
#5 0x00000000004874f7 in ginvacuumcleanup (fcinfo=<optimized out>) at
ginvacuum.c:662
...
It's not perfectly deterministic - sometimes I had to repeat the whole load
multiple times (up to 5x, and each load takes ~30 minutes).
I'm pretty sure this is not an external issue, because I've reproduced it
on a different machine (different CPU / kernel / libraries / compiler).
It's however interesting that on the other machine I've also observed a
different kind of lockup, where sessions get stuck on semop() in
gininsert (again, full backtrace attached):
#0 0x0000003f3d4eaf97 in semop () from /lib64/libc.so.6
#1 0x000000000067a41f in PGSemaphoreLock (sema=0x7f93290405d8) at
pg_sema.c:387
#2 0x00000000006df613 in LWLockAcquire (lock=0x7f92a4dce900,
mode=LW_EXCLUSIVE) at lwlock.c:1049
#3 0x00000000004878c6 in ginHeapTupleFastInsert
(ginstate=0x7ffd969c88f0, collector=0x7ffd969caeb0) at ginfast.c:250
#4 0x000000000047423a in gininsert (fcinfo=<value optimized out>) at
gininsert.c:531
...
I'm not sure whether this is a different manifestation of the same issue
or another bug. The systems are not exactly the same - one has a single
socket (i5) while the other one has 2 (Xeon), the compilers and kernels
are different and so on.
I've also seen cases when the load seemingly completed OK, but trying to
dump the table to disk using COPY resulted in
ERROR: compressed data is corrupted
which I find rather strange as there was no previous error, and also
COPY should only dump table data (while the asserts were in GIN code
handling index pages, unless I'm mistaken). Seems like a case of
insufficient locking where two backends scribble on the same page
somehow, and then autovacuum hits the assert. Or maybe not, not sure.
I've been unable to reproduce the issue on REL9_5_STABLE (despite
running the load ~20x on each machine), so that seems safe, and the
issue was introduced by some of the newer commits.
I've already spent too much CPU time on this, so perhaps someone with
better knowledge of the GIN code can take care of this. To reproduce it
you may use the same code I did - it's available here:
https://bitbucket.org/tvondra/archie
it's a PoC database of the pgsql mailing lists with full-text search. It's
a bit messy, but the process is rather simple:
1) clone the repo
$ git clone https://bitbucket.org/tvondra/archie.git
2) create a directory for downloading the mbox files
$ mkdir archie-data
3) download the mbox files (~4.5GB of data) using the download
script (make sure archie/bin is on PATH)
$ cd archie-data
$ export PATH=../archie/bin:$PATH
$ ../archie/download
4) use the "run" script (attached) to run the load n-times on a given
commit
$ run.sh master 10
NOTICE: The run script is the messiest one, you'll have to
edit it to fix paths etc.
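For reference, the load goes into a "messages" table with GIN full-text
indexes on the subject and body columns. A minimal sketch of the schema
involved (the real definitions live in the repo, so treat the exact column
and index names here as assumptions):

CREATE TABLE messages (
    id      serial PRIMARY KEY,
    subject text,
    body    text
);
-- GIN indexes on the full-text representation of both columns
CREATE INDEX messages_subject_gin ON messages
    USING gin (to_tsvector('english', subject));
CREATE INDEX messages_body_gin ON messages
    USING gin (to_tsvector('english', body));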
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 5, 2015 at 2:18 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> while repeating some full-text benchmarks on master, I've discovered
> that there's a data corruption bug somewhere. What happens is that while
> loading data into a table with GIN indexes (using multiple parallel
> connections), I sometimes get this:
> TRAP: FailedAssertion("!(((PageHeader) (page))->pd_special >=
> (__builtin_offsetof (PageHeaderData, pd_linp)))", File: "ginfast.c",
> Line: 537)
> [rest of quote and backtrace trimmed]
This looks like it is probably the same bug discussed here:
/messages/by-id/CAMkU=1xALfLhUUohFP5v33RzedLVb5aknNUjcEuM9KNBKrB6-Q@mail.gmail.com
And here:
/messages/by-id/56041B26.2040902@sigaev.ru
The bug theoretically exists in 9.5, but it wasn't until 9.6 (commit
e95680832854cf300e64c) that free pages were recycled aggressively
enough that it actually becomes likely to be hit.
There are some proposed patches in those threads, but discussion on
them seems to have stalled out. Can you try one and see if it fixes
the problems you are seeing?
Cheers,
Jeff
On 11/05/2015 11:44 PM, Jeff Janes wrote:
> This looks like it is probably the same bug discussed here:
> /messages/by-id/CAMkU=1xALfLhUUohFP5v33RzedLVb5aknNUjcEuM9KNBKrB6-Q@mail.gmail.com
> And here:
> /messages/by-id/56041B26.2040902@sigaev.ru
> The bug theoretically exists in 9.5, but it wasn't until 9.6 (commit
> e95680832854cf300e64c) that free pages were recycled aggressively
> enough that it actually becomes likely to be hit.
I have only quickly skimmed the discussions, but my impression was that
it's mostly about removing stuff that shouldn't be removed and such. But
maybe there are race conditions that cause data corruption. I don't
really want to dive too deeply into this, I've already spent too much
time trying to reproduce it.
> There are some proposed patches in those threads, but discussion on
> them seems to have stalled out. Can you try one and see if it fixes
> the problems you are seeing?
I can do that - I see there are three patches in the two threads:
1) gin_pending_lwlock.patch (Jeff Janes)
2) gin_pending_pagelock.patch (Jeff Janes)
3) gin_alone_cleanup-2.patch (Teodor Sigaev)
Should I test all of them? Or is (1) obsoleted by (2) for example?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 5, 2015 at 3:50 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>> I can do that - I see there are three patches in the two threads:
>> 1) gin_pending_lwlock.patch (Jeff Janes)
>> 2) gin_pending_pagelock.patch (Jeff Janes)
>> 3) gin_alone_cleanup-2.patch (Teodor Sigaev)
>> Should I test all of them? Or is (1) obsoleted by (2) for example?
1 is obsolete. Either 2 or 3 should fix the bug, provided this is the
bug you are seeing. They have different performance side effects, but
as far as fixing the bug they should be equivalent.
Cheers,
Jeff
Hi,
On 11/06/2015 01:05 AM, Jeff Janes wrote:
>> I can do that - I see there are three patches in the two threads: [...]
>> Should I test all of them? Or is (1) obsoleted by (2) for example?
> 1 is obsolete. Either 2 or 3 should fix the bug, provided this is the
> bug you are seeing. They have different performance side effects, but
> as far as fixing the bug they should be equivalent.
OK, I'll do testing with those two patches then, and I'll also note the
performance difference (the data load was very stable). Of course, it's
just one particular workload.
I'll post an update after the weekend.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 11/06/2015 02:09 AM, Tomas Vondra wrote:
> OK, I'll do testing with those two patches then, and I'll also note the
> performance difference (the data load was very stable). Of course, it's
> just one particular workload. I'll post an update after the weekend.
I've finally managed to test the two patches. Sorry for the delay.
I've repeated the workload on 9.5, 9.6 and 9.6 with (1) or (2), looking
for lockups or data corruption. I've also measured the duration of the
script, to see what the impact on performance is. The configuration
(shared_buffers, work_mem ...) was exactly the same in all cases.
9.5 : runtime ~1380 seconds
9.6 : runtime ~1380 seconds (but lockups and data corruption)
9.6+(1) : runtime ~1380 seconds
9.6+(2) : runtime ~1290 seconds
So both patches seem to do the trick, but (2) is faster. Not sure if
this is expected. (BTW all the results are without asserts enabled).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Dec 19, 2015 at 3:19 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I've repeated the workload on 9.5, 9.6 and 9.6 with (1) or (2), looking
> for lockups or data corruption. [...]
> 9.5 : runtime ~1380 seconds
> 9.6 : runtime ~1380 seconds (but lockups and data corruption)
> 9.6+(1) : runtime ~1380 seconds
> 9.6+(2) : runtime ~1290 seconds
> So both patches seem to do the trick, but (2) is faster. Not sure if this
> is expected. (BTW all the results are without asserts enabled).
Do you know what the size of the pending list was at the end of each test?
I think the last one may be faster because it left a large mess behind
that someone needs to clean up later.
Also, do you have the final size of the indexes in each case?
Cheers,
Jeff
On 12/21/2015 07:41 PM, Jeff Janes wrote:
On Sat, Dec 19, 2015 at 3:19 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
...
>> So both patches seem to do the trick, but (2) is faster. Not sure
>> if this is expected. (BTW all the results are without asserts enabled).
> Do you know what the size of the pending list was at the end of each
> test? I think the last one may be faster because it left a large mess
> behind that someone needs to clean up later.
No. How do I measure it?
> Also, do you have the final size of the indexes in each case?
No, I hadn't realized the patches affect that, so I haven't measured it.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Dec 21, 2015 at 11:51 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>> Do you know what the size of the pending list was at the end of each
>> test?
> No. How do I measure it?
pageinspect's gin_metapage_info, or pgstattuple's pgstatginindex
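Something along these lines should work (an untested sketch;
'messages_body_gin' is just a placeholder for the actual index name):

-- pgstattuple: summary statistics, including pending-list pages/tuples
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstatginindex('messages_body_gin');

-- pageinspect: raw contents of the GIN metapage (block 0)
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT * FROM gin_metapage_info(get_raw_page('messages_body_gin', 0));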
>> Also, do you have the final size of the indexes in each case?
> No, I hadn't realized the patches affect that, so I haven't measured it.
There shouldn't be a difference between the two approaches (although I
guess there could be if one left a larger pending list than the other,
as the pending list is very space inefficient), but since you included
9.5 in your test I thought it would be interesting to see how either
patched version of 9.6 compares to 9.5.
Cheers,
Jeff
Hi,
On 12/23/2015 09:33 PM, Jeff Janes wrote:
>> No. How do I measure it?
> pageinspect's gin_metapage_info, or pgstattuple's pgstatginindex
Hmmm, so this turns out not to be very useful, because at the end the
data I get from gin_metapage_info is almost exactly the same for both
patches (more details below).
>>> Also, do you have the final size of the indexes in each case?
>> No, I hadn't realized the patches affect that, so I haven't measured it.
> There shouldn't be a difference between the two approaches (although I
> guess there could be if one left a larger pending list than the other,
> as the pending list is very space inefficient), but since you included
> 9.5 in your test I thought it would be interesting to see how either
> patched version of 9.6 compares to 9.5.
Well, turns out there's a quite significant difference, actually. The
index sizes I get (quite stable after multiple runs):
9.5 : 2428 MB
9.6 + alone cleanup : 730 MB
9.6 + pending lock : 488 MB
So that's quite a significant difference, I guess. The load durations for
each version look like this:
9.5 : 1415 seconds
9.6 + alone cleanup : 1310 seconds
9.6 + pending lock : 1380 seconds
I'd say I'm happy with sacrificing ~5% of time in exchange for ~35%
reduction of index size.
The size of the index on 9.5 after VACUUM FULL (so pretty much the
smallest index possible) is 440MB, which suggests the "pending lock"
patch does quite a good job.
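A simple way to check the sizes, for anyone reproducing this (the index
name is again just a placeholder):

SELECT pg_size_pretty(pg_relation_size('messages_body_gin'));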
The gin_metapage_info at the end of one of the runs (pretty much all the
runs look exactly the same) looks like this:
pending lock alone cleanup 9.5
--------------------------------------------------------
pending_head 2 2 310460
pending_tail 338 345 310806
tail_free_size 812 812 812
n_pending_pages 330 339 347
n_pending_tuples 1003 1037 1059
n_total_pages 2 2 2
n_entry_pages 1 1 1
n_data_pages 0 0 0
n_entries 0 0 0
version 2 2 2
So almost no difference, except for the pending_* attributes, and even
there the values only differ for the 9.5 branch. Not sure what
conclusion to draw from this - maybe it's necessary to sample the
function's output while the load is running (but that'd be tricky to
process, I guess).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 01/05/2016 10:38 AM, Tomas Vondra wrote:
Hi,
...
> Well, turns out there's a quite significant difference, actually. The
> index sizes I get (quite stable after multiple runs):
> 9.5 : 2428 MB
> 9.6 + alone cleanup : 730 MB
> 9.6 + pending lock : 488 MB
> [rest of quote trimmed]
Are we going to do anything about this? While the bug is present in 9.5
(and possibly other versions), fixing it before 9.6 gets out seems
important because reproducing it there is rather trivial (while I've
been unable to reproduce it on 9.5).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 24, 2016 at 9:17 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Well, turns out there's a quite significant difference, actually. The
> index sizes I get (quite stable after multiple runs): [...]
> Are we going to do anything about this? While the bug is present in 9.5
> (and possibly other versions), fixing it before 9.6 gets out seems
> important because reproducing it there is rather trivial (while I've
> been unable to reproduce it on 9.5).
I'm not going to promise to commit anything here, because GIN is not
usually my area, but could you provide a link to the email that
contains the patch you think should be committed?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/24/2016 06:56 AM, Robert Haas wrote:
On Wed, Feb 24, 2016 at 9:17 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
...
>> Are we going to do anything about this? [...]
> I'm not going to promise to commit anything here, because GIN is not
> usually my area, but could you provide a link to the email that
> contains the patch you think should be committed?
Sure. There are actually two candidate patches in two separate threads;
I'm not sure which one is better. Based on the testing both seem to fix
the issue, and the "pending lock" patch produces much smaller indexes (at
least in my benchmark):
[1]: /messages/by-id/56041B26.2040902@sigaev.ru
[2]: /messages/by-id/CAMkU=1w7Uu1GZ8N0bxMykRLgTh-uPH+GPHfhMNeryZPCv7fqdA@mail.gmail.com
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you for remembering this problem, at least for me.
> Well, turns out there's a quite significant difference, actually. The
> index sizes I get (quite stable after multiple runs):
> 9.5 : 2428 MB
> 9.6 + alone cleanup : 730 MB
> 9.6 + pending lock : 488 MB
Interesting, I don't see why alone_cleanup and pending_lock are so
different. I'd like to understand that; does somebody have a good theory?
The single sticking point in the pending_lock patch is a suspicious
exception in ProcSleep; this exception may cause problems in the future.
> So that's quite a significant difference, I guess. The load durations
> for each version look like this:
> 9.5 : 1415 seconds
> 9.6 + alone cleanup : 1310 seconds
> 9.6 + pending lock : 1380 seconds
> I'd say I'm happy with sacrificing ~5% of time in exchange for ~35%
> reduction of index size.
I think the alone_cleanup patch is faster because a regular insert can
break off its cleanup process if autovacuum is waiting to begin work on
the cleanup, so the insert process can return earlier from the
pending-list cleanup.
Attached is just a rebased v2 of the alone_cleanup patch.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachment: gin_alone_cleanup-3.patch
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2ddf568..01db34c 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -27,7 +27,9 @@
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/acl.h"
+#include "postmaster/autovacuum.h"
#include "storage/indexfsm.h"
+#include "storage/lmgr.h"
/* GUC parameter */
int gin_pending_list_limit = 0;
@@ -502,11 +504,8 @@ ginHeapTupleFastCollect(GinState *ginstate,
* If newHead == InvalidBlockNumber then function drops the whole list.
*
* metapage is pinned and exclusive-locked throughout this function.
- *
- * Returns true if another cleanup process is running concurrently
- * (if so, we can just abandon our own efforts)
*/
-static bool
+static void
shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
bool fill_fsm, IndexBulkDeleteResult *stats)
{
@@ -537,14 +536,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
data.ndeleted++;
- if (GinPageIsDeleted(page))
- {
- /* concurrent cleanup process is detected */
- for (i = 0; i < data.ndeleted; i++)
- UnlockReleaseBuffer(buffers[i]);
-
- return true;
- }
+ Assert(!GinPageIsDeleted(page));
nDeletedHeapTuples += GinPageGetOpaque(page)->maxoff;
blknoToDelete = GinPageGetOpaque(page)->rightlink;
@@ -620,8 +612,6 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
RecordFreeIndexPage(index, freespace[i]);
} while (blknoToDelete != newHead);
-
- return false;
}
/* Initialize empty KeyArray */
@@ -722,18 +712,10 @@ processPendingPage(BuildAccumulator *accum, KeyArray *ka,
/*
* Move tuples from pending pages into regular GIN structure.
*
- * This can be called concurrently by multiple backends, so it must cope.
- * On first glance it looks completely not concurrent-safe and not crash-safe
- * either. The reason it's okay is that multiple insertion of the same entry
- * is detected and treated as a no-op by gininsert.c. If we crash after
- * posting entries to the main index and before removing them from the
+ * On first glance it looks completely not crash-safe. But if we crash
+ * after posting entries to the main index and before removing them from the
* pending list, it's okay because when we redo the posting later on, nothing
- * bad will happen. Likewise, if two backends simultaneously try to post
- * a pending entry into the main index, one will succeed and one will do
- * nothing. We try to notice when someone else is a little bit ahead of
- * us in the process, but that's just to avoid wasting cycles. Only the
- * action of removing a page from the pending list really needs exclusive
- * lock.
+ * bad will happen.
*
* fill_fsm indicates that ginInsertCleanup should add deleted pages
* to FSM otherwise caller is responsible to put deleted pages into
@@ -757,6 +739,39 @@ ginInsertCleanup(GinState *ginstate,
KeyArray datums;
BlockNumber blkno;
bool fsm_vac = false;
+ Size workMemory;
+ bool inVacuum = (stats == NULL);
+
+ /*
+ * We would like to prevent concurrent cleanup process. For
+ * that we will lock metapage in exclusive mode using LockPage()
+ * call. Nobody other will use that lock for metapage, so
+ * we keep possibility of concurrent insertion into pending list
+ */
+
+ if (inVacuum)
+ {
+ /*
+ * We are called from [auto]vacuum/analyze or
+ * gin_clean_pending_list() and we would like to wait
+ * concurrent cleanup to finish.
+ */
+ LockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
+ workMemory =
+ (IsAutoVacuumWorkerProcess() && autovacuum_work_mem != -1) ?
+ autovacuum_work_mem : maintenance_work_mem;
+ }
+ else
+ {
+ /*
+ * We are called from regular insert and if we see
+ * concurrent cleanup just exit in hope that concurrent
+ * process will clean up pending list.
+ */
+ if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
+ return;
+ workMemory = work_mem;
+ }
metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
LockBuffer(metabuffer, GIN_SHARE);
@@ -767,6 +782,7 @@ ginInsertCleanup(GinState *ginstate,
{
/* Nothing to do */
UnlockReleaseBuffer(metabuffer);
+ UnlockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
return;
}
@@ -802,13 +818,7 @@ ginInsertCleanup(GinState *ginstate,
*/
for (;;)
{
- if (GinPageIsDeleted(page))
- {
- /* another cleanup process is running concurrently */
- UnlockReleaseBuffer(buffer);
- fsm_vac = false;
- break;
- }
+ Assert(!GinPageIsDeleted(page));
/*
* read page's datums into accum
@@ -827,7 +837,7 @@ ginInsertCleanup(GinState *ginstate,
*/
if (GinPageGetOpaque(page)->rightlink == InvalidBlockNumber ||
(GinPageHasFullRow(page) &&
- (accum.allocatedMemory >= (Size)maintenance_work_mem * 1024L)))
+ (accum.allocatedMemory >= workMemory * 1024L)))
{
ItemPointerData *list;
uint32 nlist;
@@ -864,14 +874,7 @@ ginInsertCleanup(GinState *ginstate,
LockBuffer(metabuffer, GIN_EXCLUSIVE);
LockBuffer(buffer, GIN_SHARE);
- if (GinPageIsDeleted(page))
- {
- /* another cleanup process is running concurrently */
- UnlockReleaseBuffer(buffer);
- LockBuffer(metabuffer, GIN_UNLOCK);
- fsm_vac = false;
- break;
- }
+ Assert(!GinPageIsDeleted(page));
/*
* While we left the page unlocked, more stuff might have gotten
@@ -904,13 +907,7 @@ ginInsertCleanup(GinState *ginstate,
* remove read pages from pending list, at this point all
* content of read pages is in regular structure
*/
- if (shiftList(index, metabuffer, blkno, fill_fsm, stats))
- {
- /* another cleanup process is running concurrently */
- LockBuffer(metabuffer, GIN_UNLOCK);
- fsm_vac = false;
- break;
- }
+ shiftList(index, metabuffer, blkno, fill_fsm, stats);
/* At this point, some pending pages have been freed up */
fsm_vac = true;
@@ -922,7 +919,17 @@ ginInsertCleanup(GinState *ginstate,
* if we removed the whole pending list just exit
*/
if (blkno == InvalidBlockNumber)
+ {
+ UnlockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
break;
+ }
+
+ /*
+ * If we are called from regular insert, give a chance to
+ * [auto]vacuum/analyze continues our work
+ */
+ if (!inVacuum)
+ UnlockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
/*
* release memory used so far and reinit state
@@ -930,6 +937,21 @@ ginInsertCleanup(GinState *ginstate,
MemoryContextReset(opCtx);
initKeyArray(&datums, datums.maxvalues);
ginInitBA(&accum);
+
+ if (!inVacuum &&
+ !ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
+ {
+ /*
+ * We are called from regular insert, but concurrent
+ * clean catches a lock while we keep metapage unlocked.
+ * We believe, that it's a [auto]vacuum/analyze, not a
+ * cleanup called from regular insert, because
+ * cleanup called from vacuum waits a a lock while
+ * regular one just try once with conditional lock
+ */
+ break;
+ }
+
}
else
{
> Well, turns out there's a quite significant difference, actually. The
> index sizes I get (quite stable after multiple runs):
> 9.5 : 2428 MB
> 9.6 + alone cleanup : 730 MB
> 9.6 + pending lock : 488 MB
Attached is a modified alone_cleanup patch which, like the pending_lock
patch, doesn't interrupt the cleanup process, but unlike that patch it
doesn't touch the lock management. Tomas, if you can, please repeat the
test with this patch. If not, I will try to do it, but later.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachment: gin_alone_cleanup-3.1.patch
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2ddf568..b2ff767 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -27,7 +27,9 @@
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/acl.h"
+#include "postmaster/autovacuum.h"
#include "storage/indexfsm.h"
+#include "storage/lmgr.h"
/* GUC parameter */
int gin_pending_list_limit = 0;
@@ -502,11 +504,8 @@ ginHeapTupleFastCollect(GinState *ginstate,
* If newHead == InvalidBlockNumber then function drops the whole list.
*
* metapage is pinned and exclusive-locked throughout this function.
- *
- * Returns true if another cleanup process is running concurrently
- * (if so, we can just abandon our own efforts)
*/
-static bool
+static void
shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
bool fill_fsm, IndexBulkDeleteResult *stats)
{
@@ -537,14 +536,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
data.ndeleted++;
- if (GinPageIsDeleted(page))
- {
- /* concurrent cleanup process is detected */
- for (i = 0; i < data.ndeleted; i++)
- UnlockReleaseBuffer(buffers[i]);
-
- return true;
- }
+ Assert(!GinPageIsDeleted(page));
nDeletedHeapTuples += GinPageGetOpaque(page)->maxoff;
blknoToDelete = GinPageGetOpaque(page)->rightlink;
@@ -620,8 +612,6 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
RecordFreeIndexPage(index, freespace[i]);
} while (blknoToDelete != newHead);
-
- return false;
}
/* Initialize empty KeyArray */
@@ -722,18 +712,10 @@ processPendingPage(BuildAccumulator *accum, KeyArray *ka,
/*
* Move tuples from pending pages into regular GIN structure.
*
- * This can be called concurrently by multiple backends, so it must cope.
- * On first glance it looks completely not concurrent-safe and not crash-safe
- * either. The reason it's okay is that multiple insertion of the same entry
- * is detected and treated as a no-op by gininsert.c. If we crash after
- * posting entries to the main index and before removing them from the
+ * On first glance it looks completely not crash-safe. But if we crash
+ * after posting entries to the main index and before removing them from the
* pending list, it's okay because when we redo the posting later on, nothing
- * bad will happen. Likewise, if two backends simultaneously try to post
- * a pending entry into the main index, one will succeed and one will do
- * nothing. We try to notice when someone else is a little bit ahead of
- * us in the process, but that's just to avoid wasting cycles. Only the
- * action of removing a page from the pending list really needs exclusive
- * lock.
+ * bad will happen.
*
* fill_fsm indicates that ginInsertCleanup should add deleted pages
* to FSM otherwise caller is responsible to put deleted pages into
@@ -757,6 +739,39 @@ ginInsertCleanup(GinState *ginstate,
KeyArray datums;
BlockNumber blkno;
bool fsm_vac = false;
+ Size workMemory;
+ bool inVacuum = (stats == NULL);
+
+ /*
+ * We would like to prevent concurrent cleanup process. For
+ * that we will lock metapage in exclusive mode using LockPage()
+ * call. Nobody other will use that lock for metapage, so
+ * we keep possibility of concurrent insertion into pending list
+ */
+
+ if (inVacuum)
+ {
+ /*
+ * We are called from [auto]vacuum/analyze or
+ * gin_clean_pending_list() and we would like to wait
+ * concurrent cleanup to finish.
+ */
+ LockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
+ workMemory =
+ (IsAutoVacuumWorkerProcess() && autovacuum_work_mem != -1) ?
+ autovacuum_work_mem : maintenance_work_mem;
+ }
+ else
+ {
+ /*
+ * We are called from regular insert and if we see
+ * concurrent cleanup just exit in hope that concurrent
+ * process will clean up pending list.
+ */
+ if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
+ return;
+ workMemory = work_mem;
+ }
metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
LockBuffer(metabuffer, GIN_SHARE);
@@ -767,6 +782,7 @@ ginInsertCleanup(GinState *ginstate,
{
/* Nothing to do */
UnlockReleaseBuffer(metabuffer);
+ UnlockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
return;
}
@@ -802,13 +818,7 @@ ginInsertCleanup(GinState *ginstate,
*/
for (;;)
{
- if (GinPageIsDeleted(page))
- {
- /* another cleanup process is running concurrently */
- UnlockReleaseBuffer(buffer);
- fsm_vac = false;
- break;
- }
+ Assert(!GinPageIsDeleted(page));
/*
* read page's datums into accum
@@ -827,7 +837,7 @@ ginInsertCleanup(GinState *ginstate,
*/
if (GinPageGetOpaque(page)->rightlink == InvalidBlockNumber ||
(GinPageHasFullRow(page) &&
- (accum.allocatedMemory >= (Size)maintenance_work_mem * 1024L)))
+ (accum.allocatedMemory >= workMemory * 1024L)))
{
ItemPointerData *list;
uint32 nlist;
@@ -864,14 +874,7 @@ ginInsertCleanup(GinState *ginstate,
LockBuffer(metabuffer, GIN_EXCLUSIVE);
LockBuffer(buffer, GIN_SHARE);
- if (GinPageIsDeleted(page))
- {
- /* another cleanup process is running concurrently */
- UnlockReleaseBuffer(buffer);
- LockBuffer(metabuffer, GIN_UNLOCK);
- fsm_vac = false;
- break;
- }
+ Assert(!GinPageIsDeleted(page));
/*
* While we left the page unlocked, more stuff might have gotten
@@ -904,13 +907,7 @@ ginInsertCleanup(GinState *ginstate,
* remove read pages from pending list, at this point all
* content of read pages is in regular structure
*/
- if (shiftList(index, metabuffer, blkno, fill_fsm, stats))
- {
- /* another cleanup process is running concurrently */
- LockBuffer(metabuffer, GIN_UNLOCK);
- fsm_vac = false;
- break;
- }
+ shiftList(index, metabuffer, blkno, fill_fsm, stats);
/* At this point, some pending pages have been freed up */
fsm_vac = true;
@@ -946,6 +943,7 @@ ginInsertCleanup(GinState *ginstate,
page = BufferGetPage(buffer);
}
+ UnlockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
ReleaseBuffer(metabuffer);
/*
Hi,
On 02/25/2016 05:32 PM, Teodor Sigaev wrote:
> Attached is a modified alone_cleanup patch which, like the pending_lock
> patch, doesn't interrupt the cleanup process, but unlike that patch it
> doesn't touch the lock management. Tomas, if you can, please repeat the
> test with this patch. If not, I will try to do it, but later.
I'll do that once the system I used for that gets available - right now
it's running other benchmarks.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 24, 2016 at 8:51 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thank you for remembering this problem, at least for me.
>> 9.5 : 2428 MB
>> 9.6 + alone cleanup : 730 MB
>> 9.6 + pending lock : 488 MB
> Interesting, I don't see why alone_cleanup and pending_lock are so
> different. I'd like to understand that; does somebody have a good theory?
Under my patch, anyone who wanted to do a clean up and detected
someone else was doing one would wait for the concurrent one to end.
(This is more consistent with the existing behavior, I just made it so
they don't do any damage while they wait.)
Under your patch, if a backend wants to do a clean up and detects
someone else is already doing one, it would just skip the clean up and
proceed on with whatever it was doing. This allows one process
(hopefully a vacuum, but maybe a user backend) to get pinned down
indefinitely, as other processes keep putting stuff onto the end of
the pending_list with no throttle.
Since the freespace recycling only takes place once the list is
completely cleaned, allowing some processes to add to the end while
one poor process is trying to clean can lead to less effective
recycling.
That is my theory, anyway.
Cheers,
Jeff
On Thu, Feb 25, 2016 at 11:19:20AM -0800, Jeff Janes wrote:
On Wed, Feb 24, 2016 at 8:51 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Under my patch, anyone who wanted to do a clean up and detected
> someone else was doing one would wait for the concurrent one to end. [...]
> Since the freespace recycling only takes place once the list is
> completely cleaned, allowing some processes to add to the end while
> one poor process is trying to clean can lead to less effective
> recycling. That is my theory, anyway.
[This is a generic notification.]
The above-described topic is currently a PostgreSQL 9.6 open item. Teodor,
since you committed the patch believed to have created it, you own this open
item. If that responsibility lies elsewhere, please let us know whose
responsibility it is to fix this. Since new open items may be discovered at
any time and I want to plan to have them all fixed well in advance of the ship
date, I will appreciate your efforts toward speedy resolution. Please
present, within 72 hours, a plan to fix the defect within seven days of this
message. Thanks.
> The above-described topic is currently a PostgreSQL 9.6 open item.
> Teodor, since you committed the patch believed to have created it, you
> own this open item. [...] Please present, within 72 hours, a plan to fix
> the defect within seven days of this message.
I'm waiting on Tomas's testing. Suppose it isn't possible now? Then I
will do it myself, and after that I will publish the results.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On 04/04/2016 02:06 PM, Teodor Sigaev wrote:
> I'm waiting on Tomas's testing. Suppose it isn't possible now? Then I
> will do it myself, and after that I will publish the results.
Ah, damn. This completely slipped from my TODO list. I'll rerun the
tests either today or tomorrow, and report the results here.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 04/04/2016 02:25 PM, Tomas Vondra wrote:
> Ah, damn. This completely slipped from my TODO list. I'll rerun the
> tests either today or tomorrow, and report the results here.
So, I've done some testing on the patch, and in short it seems fine to
me. It fixes the bug (no data corruption observed), and the performance
seems fine too.
I've tested the v2, v3 and v3.1 of the patch, to see if there are any
differences. The v2 no longer applies, so I tested it on ee943004. The
following table shows the total duration of the data load, and also
sizes of the two GIN indexes.
duration (sec) subject body
---------------------------------------------------------------
v2 1290 23 MB 684 MB
v3 1360 24 MB 487 MB
v3.1 1360 24 MB 488 MB
So, looks good to me.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> I've tested the v2, v3 and v3.1 of the patch, to see if there are any
> differences. The v2 no longer applies, so I tested it on ee943004. The
> following table shows the total duration of the data load, and also
> sizes of the two GIN indexes.
> duration (sec) subject body
> ---------------------------------------------------------------
> v2 1290 23 MB 684 MB
> v3 1360 24 MB 487 MB
> v3.1 1360 24 MB 488 MB
Thank you very much.
Hmm. v3 is just a rebased version of v2; v3.1 doesn't have the
unlock/lock cycle during cleanup, just to become similar to Jeff's
pending_lock patch. In theory, v2 and v3 should be very close, and v3.1
should be close to pending_lock.
I'm inclined to push v3.1 as one of the two winners by size/performance;
unlike the pending_lock patch, it doesn't change the internal logic of
the lock machinery.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Wed, Apr 6, 2016 at 9:52 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> I'm inclined to push v3.1 as one of the two winners by size/performance;
> unlike the pending_lock patch, it doesn't change the internal logic of
> the lock machinery.
This restricts the memory used by ordinary backends when doing the
cleanup to be work_mem. Shouldn't we let them use
maintenance_work_mem? Only one backend can be doing this clean up of a
given index at any given time, so we don't need to worry about many
concurrent allocations of maintenance_work_mem. This seems very
similar in spirit to index creation, where a backend is allowed to use
maintenance_work_mem.
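For reference, this is the same cleanup path that's reachable explicitly
through the gin_clean_pending_list() function referenced in the patch,
e.g. (placeholder index name):

-- flush the pending list into the main GIN structure from a backend
SELECT gin_clean_pending_list('messages_body_gin');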
Also, do we plan on backpatching this? While there are no known
instances of anyone hitting this bug in any released code, still it is
a bug. There are plenty of reports of corrupted indexes which never
get properly diagnosed, and for all we know some of them might be due
to this. On the other hand, it is a substantial behavior change,
which is scary to backpatch. (But all other proposed solutions also
have behavior changes of one kind or another.)
To summarize the bugs in the already-released code:
If a vacuum's dead list includes a tuple which is still in the pending
list, and the vacuum skips the pending list cleanup because it detects
someone else is already working on it, then the vacuum could remove
that tuple from the table, even though there is still a reference to
it in the index.
A page could be deleted from the pending list, put on the free space
map, then reused. When the linked list of pending pages is followed
onto this reused page, it will fail to recognize it as being deleted
and recycled, and bad things will happen. This is quite likely to
happen in 9.6 because of the new aggressive recycling of pages. But
I've seen no argument as to why it is impossible to hit this bug in
released code, it seems like it would be possible to hit it but just
on a very rare basis.
To summarize the behavior change:
In the released code, an inserting backend that violates the pending
list limit will try to clean the list, even if it is already being
cleaned. It won't accomplish anything useful, but will go through the
motions until eventually it runs into a page the lead cleaner has
deleted, at which point it realizes there is another cleaner and it
stops. This acts as a natural throttle to how fast insertions can
take place into an over-sized pending list.
The proposed change removes that throttle, so that inserters will
immediately see there is already a cleaner and just go back about
their business. Due to that, unthrottled backends could add to the
pending list faster than the cleaner can clean it, leading to
unbounded growth in the pending list, and could cause a user backend to
become apparently unresponsive to the user, indefinitely. That is
scary to backpatch.
Cheers,
Jeff
Jeff Janes wrote:
> The proposed change removes that throttle, so that inserters will
> immediately see there is already a cleaner and just go back about
> their business. Due to that, unthrottled backends could add to the
> pending list faster than the cleaner can clean it, leading to
> unbounded growth in the pending list, and could cause a user backend to
> become apparently unresponsive to the user, indefinitely. That is
> scary to backpatch.
Can we install a protection against that? For instance, the cleaner
only cleans up the entries that were present when it started, and if
other items are inserted later into the pending list, they are left for
a future cleaner.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Jeff Janes <jeff.janes@gmail.com> writes:
> To summarize the behavior change:
> In the released code, an inserting backend that violates the pending
> list limit will try to clean the list, even if it is already being
> cleaned. It won't accomplish anything useful, but will go through the
> motions until eventually it runs into a page the lead cleaner has
> deleted, at which point it realizes there is another cleaner and it
> stops. This acts as a natural throttle to how fast insertions can
> take place into an over-sized pending list.
Right.
> The proposed change removes that throttle, so that inserters will
> immediately see there is already a cleaner and just go back about
> their business. Due to that, unthrottled backends could add to the
> pending list faster than the cleaner can clean it, leading to
> unbounded growth in the pending list, and could cause a user backend to
> become apparently unresponsive to the user, indefinitely. That is
> scary to backpatch.
It's scary to put into HEAD, either. What if we simply don't take
that specific behavioral change? It doesn't seem like this is an
essential part of fixing the bug as you described it. (Though I've
not read the patch, so maybe I'm just missing the connection.)
regards, tom lane
On Thu, Apr 7, 2016 at 4:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's scary to put into HEAD, either. What if we simply don't take
> that specific behavioral change? It doesn't seem like this is an
> essential part of fixing the bug as you described it. (Though I've
> not read the patch, so maybe I'm just missing the connection.)
There are only 3 fundamental options I see, the cleaner can wait,
"help", or move on.
"Helping" is what it does now and is dangerous.
Moving on gives the above-discussed unthrottling problem.
Waiting has two problems. The act of waiting will cause autovacuums
to be canceled, unless ugly hacks are deployed to prevent that. If
we deploy those ugly hacks, then we have the problem that a user
backend will end up waiting on an autovacuum to finish the cleaning,
and the autovacuum is taking its sweet time due to
autovacuum_vacuum_cost_delay. (The "helping" model avoids this
problem because the user backend can just catch up with and pass the
io-throttled autovac process)
For completeness sake, a fourth option would be to move on, but only
after inserting the tuple directly into the main index structure
(rather than the pending list), as would be done with fastupdate off,
once the pending list is already oversized. This is my favorite, but
there is no chance of it going into 9.6, much less being backpatched.
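(For comparison, that per-index trade-off is already selectable today via
the existing storage parameters; placeholder index name:)

-- route new entries straight into the main structure, no pending list
ALTER INDEX messages_body_gin SET (fastupdate = off);
-- or cap the pending list for this index, in kB (9.5+)
ALTER INDEX messages_body_gin SET (gin_pending_list_limit = 512);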
Alvaro's recommendation, to let the cleaner off the hook once it
passes the page which was the tail page at the time it started, would
prevent any process from getting pinned down indefinitely, but would
not prevent the size of the list from increasing without bound. I
think that would probably be good enough, because the current
throttling behavior is purely accidental and doesn't *guarantee* a
limit on the size of the pending list.
Cheers,
Jeff
On Thu, Apr 07, 2016 at 05:53:54PM -0700, Jeff Janes wrote:
On Thu, Apr 7, 2016 at 4:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
To summarize the behavior change:
In the released code, an inserting backend that violates the pending
list limit will try to clean the list, even if it is already being
cleaned. It won't accomplish anything useful, but will go through the
motions until eventually it runs into a page the lead cleaner has
deleted, at which point it realizes there is another cleaner and it
stops. This acts as a natural throttle to how fast insertions can
take place into an over-sized pending list.
Right.
The proposed change removes that throttle, so that inserters will
immediately see there is already a cleaner and just go about
their business. Due to that, unthrottled backends could add to the
pending list faster than the cleaner can clean it, leading to
unbounded growth in the pending list, which could cause a user backend
to become apparently unresponsive to the user, indefinitely. That is
scary to backpatch.
It's scary to put into HEAD, too. What if we simply don't take
that specific behavioral change? It doesn't seem like this is an
essential part of fixing the bug as you described it. (Though I've
not read the patch, so maybe I'm just missing the connection.)
There are only 3 fundamental options I see: the cleaner can wait,
"help", or move on.
"Helping" is what it does now and is dangerous.
Moving on gives the above-discussed unthrottling problem.
Waiting has two problems. The act of waiting will cause autovacuums
to be canceled, unless ugly hacks are deployed to prevent that. If
we deploy those ugly hacks, then we have the problem that a user
backend will end up waiting on an autovacuum to finish the cleaning,
and the autovacuum is taking its sweet time due to
autovacuum_vacuum_cost_delay.
Teodor, this thread has been quiet for four days, and the deadline to fix this
open item expired 23 hours ago. Do you have a new plan for fixing it?
This restricts the memory used by ordinary backends when doing the
cleanup to be work_mem. Shouldn't we let them use
maintenance_work_mem? Only one backend can be doing this clean up of a
given index at any given time, so we don't need to worry about many
concurrent allocations of maintenance_work_mem. This seems very
similar in spirit to index creation, where a backend is allowed to use
maintenance_work_mem.
Because there could be several indexes in one pg instance, and each cleaner
could eat maintenance_work_mem.
Also, do we plan on backpatching this? While there are no known
Yes
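For reference, the v4 patch attached downthread settles the memory question
with a per-caller budget; schematically (names as in that patch, with
stats == NULL distinguishing the callers):

    Size    workMemory;
    bool    inVacuum = (stats == NULL);

    if (inVacuum)
        workMemory = (IsAutoVacuumWorkerProcess() && autovacuum_work_mem != -1) ?
            autovacuum_work_mem : maintenance_work_mem;
    else
        workMemory = work_mem;

The accumulator is then flushed once accum.allocatedMemory exceeds
workMemory * 1024L.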
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
There are only 3 fundamental options I see: the cleaner can wait,
"help", or move on.
"Helping" is what it does now and is dangerous.
Moving on gives the above-discussed unthrottling problem.
Waiting has two problems. The act of waiting will cause autovacuums
to be canceled, unless ugly hacks are deployed to prevent that. If
we deploy those ugly hacks, then we have the problem that a user
backend will end up waiting on an autovacuum to finish the cleaning,
and the autovacuum is taking its sweet time due to
autovacuum_vacuum_cost_delay. (The "helping" model avoids this
problem because the user backend can just catch up with and pass the
io-throttled autovac process.)
With the pending cleanup patch a backend will try to get a lock on the
metapage with ConditionalLockPage. Will it interrupt the autovacuum worker?
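The relevant bit of the patch, for context: a regular insert gives up
immediately rather than sleeping on the lock, so there is no lock wait that
could trigger the autovacuum-cancel machinery:

    if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
        return;     /* a concurrent cleaner holds it; let that one finish */
    workMemory = work_mem;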
For completeness' sake, a fourth option would be to move on, but only
after inserting the tuple directly into the main index structure
(rather than the pending list), as would be done with fastupdate off,
once the pending list is already oversized. This is my favorite, but
there is no chance of it going into 9.6, much less being backpatched.
Agreed, will reimplement for 9.7.
Alvaro's recommendation, to let the cleaner off the hook once it
passes the page which was the tail page at the time it started, would
prevent any process from getting pinned down indefinitely, but would
not prevent the size of the list from increasing without bound. I
think that would probably be good enough, because the current
throttling behavior is purely accidental and doesn't *guarantee* a
limit on the size of the pending list.
Added, see attached patch (based on v3.1)
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachment: gin_alone_cleanup-4.patch
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 08ec16f..ad2c4ca 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -27,7 +27,9 @@
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/acl.h"
+#include "postmaster/autovacuum.h"
#include "storage/indexfsm.h"
+#include "storage/lmgr.h"
/* GUC parameter */
int gin_pending_list_limit = 0;
@@ -502,11 +504,8 @@ ginHeapTupleFastCollect(GinState *ginstate,
* If newHead == InvalidBlockNumber then function drops the whole list.
*
* metapage is pinned and exclusive-locked throughout this function.
- *
- * Returns true if another cleanup process is running concurrently
- * (if so, we can just abandon our own efforts)
*/
-static bool
+static void
shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
bool fill_fsm, IndexBulkDeleteResult *stats)
{
@@ -537,14 +536,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
data.ndeleted++;
- if (GinPageIsDeleted(page))
- {
- /* concurrent cleanup process is detected */
- for (i = 0; i < data.ndeleted; i++)
- UnlockReleaseBuffer(buffers[i]);
-
- return true;
- }
+ Assert(!GinPageIsDeleted(page));
nDeletedHeapTuples += GinPageGetOpaque(page)->maxoff;
blknoToDelete = GinPageGetOpaque(page)->rightlink;
@@ -620,8 +612,6 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
RecordFreeIndexPage(index, freespace[i]);
} while (blknoToDelete != newHead);
-
- return false;
}
/* Initialize empty KeyArray */
@@ -722,18 +712,10 @@ processPendingPage(BuildAccumulator *accum, KeyArray *ka,
/*
* Move tuples from pending pages into regular GIN structure.
*
- * This can be called concurrently by multiple backends, so it must cope.
- * On first glance it looks completely not concurrent-safe and not crash-safe
- * either. The reason it's okay is that multiple insertion of the same entry
- * is detected and treated as a no-op by gininsert.c. If we crash after
- * posting entries to the main index and before removing them from the
+ * At first glance it looks completely not crash-safe. But if we crash
+ * after posting entries to the main index and before removing them from the
* pending list, it's okay because when we redo the posting later on, nothing
- * bad will happen. Likewise, if two backends simultaneously try to post
- * a pending entry into the main index, one will succeed and one will do
- * nothing. We try to notice when someone else is a little bit ahead of
- * us in the process, but that's just to avoid wasting cycles. Only the
- * action of removing a page from the pending list really needs exclusive
- * lock.
+ * bad will happen.
*
* fill_fsm indicates that ginInsertCleanup should add deleted pages
* to FSM otherwise caller is responsible to put deleted pages into
@@ -755,8 +737,43 @@ ginInsertCleanup(GinState *ginstate,
oldCtx;
BuildAccumulator accum;
KeyArray datums;
- BlockNumber blkno;
+ BlockNumber blkno,
+ blknoFinish;
+ bool cleanupFinish = false;
bool fsm_vac = false;
+ Size workMemory;
+ bool inVacuum = (stats == NULL);
+
+ /*
+ * We would like to prevent a concurrent cleanup process. For
+ * that we will lock the metapage in exclusive mode using a
+ * LockPage() call. Nobody else will use that lock on the
+ * metapage, so we keep the possibility of concurrent insertion
+ * into the pending list.
+ */
+
+ if (inVacuum)
+ {
+ /*
+ * We are called from [auto]vacuum/analyze or
+ * gin_clean_pending_list() and we would like to wait for
+ * a concurrent cleanup to finish.
+ */
+ LockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
+ workMemory =
+ (IsAutoVacuumWorkerProcess() && autovacuum_work_mem != -1) ?
+ autovacuum_work_mem : maintenance_work_mem;
+ }
+ else
+ {
+ /*
+ * We are called from a regular insert and if we see a
+ * concurrent cleanup just exit in the hope that the
+ * concurrent process will clean up the pending list.
+ */
+ if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
+ return;
+ workMemory = work_mem;
+ }
metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
LockBuffer(metabuffer, GIN_SHARE);
@@ -767,10 +784,17 @@ ginInsertCleanup(GinState *ginstate,
{
/* Nothing to do */
UnlockReleaseBuffer(metabuffer);
+ UnlockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
return;
}
/*
+ * Remember the tail page to prevent infinite cleanup if other backends
+ * add new tuples faster than we can clean up.
+ */
+ blknoFinish = metadata->tail;
+
+ /*
* Read and lock head of pending list
*/
blkno = metadata->head;
@@ -802,13 +826,14 @@ ginInsertCleanup(GinState *ginstate,
*/
for (;;)
{
- if (GinPageIsDeleted(page))
- {
- /* another cleanup process is running concurrently */
- UnlockReleaseBuffer(buffer);
- fsm_vac = false;
- break;
- }
+ Assert(!GinPageIsDeleted(page));
+
+ /*
+ * Are we walking through the page which, as we remember, was the
+ * tail when we started our cleanup?
+ */
+ if (blkno == blknoFinish)
+ cleanupFinish = true;
/*
* read page's datums into accum
@@ -821,13 +846,10 @@ ginInsertCleanup(GinState *ginstate,
* Is it time to flush memory to disk? Flush if we are at the end of
* the pending list, or if we have a full row and memory is getting
* full.
- *
- * XXX using up maintenance_work_mem here is probably unreasonably
- * much, since vacuum might already be using that much.
*/
if (GinPageGetOpaque(page)->rightlink == InvalidBlockNumber ||
(GinPageHasFullRow(page) &&
- (accum.allocatedMemory >= (Size)maintenance_work_mem * 1024L)))
+ (accum.allocatedMemory >= workMemory * 1024L)))
{
ItemPointerData *list;
uint32 nlist;
@@ -864,14 +886,7 @@ ginInsertCleanup(GinState *ginstate,
LockBuffer(metabuffer, GIN_EXCLUSIVE);
LockBuffer(buffer, GIN_SHARE);
- if (GinPageIsDeleted(page))
- {
- /* another cleanup process is running concurrently */
- UnlockReleaseBuffer(buffer);
- LockBuffer(metabuffer, GIN_UNLOCK);
- fsm_vac = false;
- break;
- }
+ Assert(!GinPageIsDeleted(page));
/*
* While we left the page unlocked, more stuff might have gotten
@@ -904,13 +919,7 @@ ginInsertCleanup(GinState *ginstate,
* remove read pages from pending list, at this point all
* content of read pages is in regular structure
*/
- if (shiftList(index, metabuffer, blkno, fill_fsm, stats))
- {
- /* another cleanup process is running concurrently */
- LockBuffer(metabuffer, GIN_UNLOCK);
- fsm_vac = false;
- break;
- }
+ shiftList(index, metabuffer, blkno, fill_fsm, stats);
/* At this point, some pending pages have been freed up */
fsm_vac = true;
@@ -919,9 +928,10 @@ ginInsertCleanup(GinState *ginstate,
LockBuffer(metabuffer, GIN_UNLOCK);
/*
- * if we removed the whole pending list just exit
+ * if we removed the whole pending list, or we cleaned up the tail
+ * (which we remembered at the start of our cleanup), then just exit
*/
- if (blkno == InvalidBlockNumber)
+ if (blkno == InvalidBlockNumber || cleanupFinish)
break;
/*
@@ -946,6 +956,7 @@ ginInsertCleanup(GinState *ginstate,
page = BufferGetPage(buffer, NULL, NULL, BGP_NO_SNAPSHOT_TEST);
}
+ UnlockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
ReleaseBuffer(metabuffer);
/*
Alvaro's recommendation, to let the cleaner off the hook once it
passes the page which was the tail page at the time it started, would
prevent any process from getting pinned down indefinitely, but would
not prevent the size of the list from increasing without bound.
Added, see attached patch (based on v3.1)
If there are no objections I will apply it on Monday and backpatch to
all supported versions.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Tue, Apr 12, 2016 at 9:53 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
With the pending cleanup patch a backend will try to get a lock on the
metapage with ConditionalLockPage. Will it interrupt the autovacuum worker?
Correct, ConditionalLockPage should not interrupt the autovacuum worker.
Alvaro's recommendation, to let the cleaner off the hook once it
passes the page which was the tail page at the time it started, would
prevent any process from getting pinned down indefinitely, but would
not prevent the size of the list from increasing without bound. I
think that would probably be good enough, because the current
throttling behavior is purely accidental and doesn't *guarantee* a
limit on the size of the pending list.
Added, see attached patch (based on v3.1)
With this applied, I am getting a couple errors I have not seen before
after extensive crash recovery testing:
ERROR: attempted to delete invisible tuple
ERROR: unexpected chunk number 1 (expected 2) for toast value
100338365 in pg_toast_16425
I've restarted the test harness with intentional crashes turned off,
to see if the problems are related to crash recovery or are more
generic than that.
I've never seen these particular problems before, so don't have much
insight into what might be going on or how to debug it.
Cheers,
Jeff
Added, see attached patch (based on v3.1)
With this applied, I am getting a couple errors I have not seen before
after extensive crash recovery testing:
ERROR: attempted to delete invisible tuple
ERROR: unexpected chunk number 1 (expected 2) for toast value
100338365 in pg_toast_16425
Huh, it seems it's not related to GIN at all... Indexes don't play with the toast
machinery. The single place where this error can occur is heap_delete() -
deleting an already-deleted tuple.
I've restarted the test harness with intentional crashes turned off,
to see if the problems are related to crash recovery or are more
generic than that.
I've never seen these particular problems before, so don't have much
insight into what might be going on or how to debug it.
Check my reasoning: in version 4 I added remembering of the tail of the
pending list in the blknoFinish variable. When we read the page which was
the tail at cleanup start, we set the cleanupFinish variable, and after
cleaning that page we stop further cleanup. Any insert that happens during
cleanup will be placed after blknoFinish (corner case: in that page), so
vacuum should not miss tuples marked as deleted.
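Schematically, the v4 cleanup loop (condensed from the patch, with locking
and flushing elided):

    blknoFinish = metadata->tail;           /* tail as of cleanup start */
    blkno = metadata->head;
    for (;;)
    {
        if (blkno == blknoFinish)
            cleanupFinish = true;           /* reached the remembered tail */

        /* ... read the page's datums into accum, flush to the main index
         * when memory fills up, then shiftList() the processed pages off
         * the pending list ... */

        if (blkno == InvalidBlockNumber || cleanupFinish)
            break;                          /* list empty, or old tail done */

        blkno = GinPageGetOpaque(page)->rightlink;  /* next pending page */
    }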
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Mon, Apr 18, 2016 at 05:48:17PM +0300, Teodor Sigaev wrote:
Added, see attached patch (based on v3.1)
With this applied, I am getting a couple errors I have not seen before
after extensive crash recovery testing:
ERROR: attempted to delete invisible tuple
ERROR: unexpected chunk number 1 (expected 2) for toast value
100338365 in pg_toast_16425
Huh, it seems it's not related to GIN at all... Indexes don't play with the toast
machinery. The single place where this error can occur is heap_delete() -
deleting an already-deleted tuple.
Like you, I would not expect gin_alone_cleanup-4.patch to cause such an error.
I get the impression Jeff has a test case that he had run in many iterations
against the unpatched baseline. I also get the impression that a similar or
smaller number of its iterations against gin_alone_cleanup-4.patch triggered
these two errors (once apiece, or multiple times?). Jeff, is that right? If
so, until we determine the cause, we should assume the cause arrived in
gin_alone_cleanup-4.patch. An error in pointer arithmetic or locking might
corrupt an unrelated buffer, leading to this symptom.
I've restarted the test harness with intentional crashes turned off,
to see if the problems are related to crash recovery or are more
generic than that.
I've never seen these particular problems before, so don't have much
insight into what might be going on or how to debug it.
Could you describe the test case in sufficient detail for Teodor to reproduce
your results?
Check my reasoning: in version 4 I added remembering of the tail of the
pending list in the blknoFinish variable. When we read the page which was
the tail at cleanup start, we set the cleanupFinish variable, and after
cleaning that page we stop further cleanup. Any insert that happens during
cleanup will be placed after blknoFinish (corner case: in that page), so
vacuum should not miss tuples marked as deleted.
Would any hacker volunteer to review Teodor's reasoning here?
Thanks,
nm
On Thu, Nov 5, 2015 at 2:44 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
The bug theoretically exists in 9.5, but it wasn't until 9.6 (commit
e95680832854cf300e64c) that free pages were recycled aggressively
enough that it actually became likely to be hit.
In other words: The bug could be in 9.5, but that hasn't been
conclusively demonstrated. Fair?
I'm not an expert on GIN at all; I know far more about B-Tree. But,
commit e95680832854cf300e64c seems a bit odd to me. I don't see any
argument for why it's okay that the recycling of pages can happen
immediately for the pending list, rather than requiring it to happen
at some time later with a safe interlock (something like B-Tree's use of
RecentGlobalXmin). The GIN README discusses a few such issues, but it
wasn't updated by the commit I mentioned, which I would have expected.
OTOH, after all of 10 minutes I can't see what's special about
ginvacuumcleanup() that makes its long established
RecordFreeIndexPage() call fundamentally safer, which if true would be
a surprisingly obvious defect to go unfixed for all these years. This
is what you yourself said about it, I think. I need to look at it
again with fresh eyes, but offhand having no safe interlock for the
well established RecordFreeIndexPage() call for GIN seems like an
implausibly obvious bug.
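For comparison, the B-Tree interlock mentioned above looks roughly like
this in nbtpage.c (comments paraphrased): a deleted page is stamped with
the XID of the deleting transaction, and it is only considered recyclable
once that XID precedes RecentGlobalXmin, i.e. once no running scan can
still hold a link to it:

    bool
    _bt_page_recyclable(Page page)
    {
        BTPageOpaque opaque;

        /* An all-zeroes page, e.g. left behind by a relation extension
         * that crashed before logging anything, is fine to reuse */
        if (PageIsNew(page))
            return true;

        /* Otherwise, recycle only if deleted and too old for any scan */
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        if (P_ISDELETED(opaque) &&
            TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
            return true;
        return false;
    }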
--
Peter Geoghegan
On Mon, Apr 18, 2016 at 7:48 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
Added, see attached patch (based on v3.1)
With this applied, I am getting a couple errors I have not seen before
after extensive crash recovery testing:
ERROR: attempted to delete invisible tuple
ERROR: unexpected chunk number 1 (expected 2) for toast value
100338365 in pg_toast_16425
Huh, it seems it's not related to GIN at all... Indexes don't play with the toast
machinery. The single place where this error can occur is heap_delete() -
deleting an already-deleted tuple.
Those are two independent errors. The delete invisible tuple error
doesn't occur on toast tables.
The actual statement triggering the error is an update statement.
Since it is showing up in the delete path, I assume it must be an
update where the new tuple goes to a different page. But, if the
soon-to-be-old tuple is not visible, why is the update trying to
update it in the first place? It seems like the different parts of
the code disagree on what is visible.
update foo set count=count+1,text_array=$1 where text_array @> $2
I agree it might not have anything to do with gin indexes, but I
didn't see it in testing anything else. It might be a wrap-around
problem which for some reason only the gin test is efficient at
evoking. What I've done now is apply your v4 patch directly to
e95680832854cf300e64c1 and I am trying to see if it also has the
problem. If that is clean, then it is probably an independently
introduced bug which is just getting exercised by the gin index stress
test. If that is the case I'll try to git bisect forward, but that
could take weeks given the runtimes involved. If that is dirty, then
maybe the FSM vacuuming patch introduced/uncovered more than one bug,
and should be reverted.
I've restarted the test harness with intentional crashes turned off,
to see if the problems are related to crash recovery or are more
generic than that.
I do not see the problem when there is no crash-recovery cycling involved.
I also do not see the problem when compiled under --enable-cassert,
but that could just be because compiling that way makes it too slow to
get in sufficient testing to hit the bug before I gave up.
I've never seen these particular problems before, so don't have much
insight into what might be going on or how to debug it.
Check my reasoning: in version 4 I added remembering of the tail of the
pending list in the blknoFinish variable. When we read the page which was
the tail at cleanup start, we set the cleanupFinish variable, and after
cleaning that page we stop further cleanup. Any insert that happens during
cleanup will be placed after blknoFinish (corner case: in that page), so
vacuum should not miss tuples marked as deleted.
Yes, I agree with the correctness of v4. But I do wonder if we should
use that early stopping for vacuum and gin_clean_pending_list, rather
than just using it for user backends. While I think correctness
allows it to stop early, since these routines are explicitly about
cleaning things up it seems like they should volunteer to clean the
whole thing.
Cheers,
Jeff
On Fri, Apr 22, 2016 at 2:20 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Check my reasoning: in version 4 I added remembering of the tail of the
pending list in the blknoFinish variable. When we read the page which was
the tail at cleanup start, we set the cleanupFinish variable, and after
cleaning that page we stop further cleanup. Any insert that happens during
cleanup will be placed after blknoFinish (corner case: in that page), so
vacuum should not miss tuples marked as deleted.
Yes, I agree with the correctness of v4. But I do wonder if we should
use that early stopping for vacuum and gin_clean_pending_list, rather
than just using it for user backends. While I think correctness
allows it to stop early, since these routines are explicitly about
cleaning things up it seems like they should volunteer to clean the
whole thing.
Not to hold anyone's feet to the fire, but we are running out of time
before beta1, and this seems to be one of the more serious outstanding
issues. I'd rather not release beta1 with known data corruption bugs,
but that means either fixing whatever is broken here or reverting the
commit that caused or revealed the problem pretty soon.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Apr 21, 2016 at 11:00 PM, Noah Misch <noah@leadboat.com> wrote:
On Mon, Apr 18, 2016 at 05:48:17PM +0300, Teodor Sigaev wrote:
Added, see attached patch (based on v3.1)
With this applied, I am getting a couple errors I have not seen before
after extensive crash recovery testing:
ERROR: attempted to delete invisible tuple
ERROR: unexpected chunk number 1 (expected 2) for toast value
100338365 in pg_toast_16425
Huh, it seems it's not related to GIN at all... Indexes don't play with the toast
machinery. The single place where this error can occur is heap_delete() -
deleting an already-deleted tuple.
Like you, I would not expect gin_alone_cleanup-4.patch to cause such an error.
I get the impression Jeff has a test case that he had run in many iterations
against the unpatched baseline. I also get the impression that a similar or
smaller number of its iterations against gin_alone_cleanup-4.patch triggered
these two errors (once apiece, or multiple times?). Jeff, is that right?
Because the unpatched baseline suffers from the bug which was the
original topic of this thread, I haven't been able to test against the
original baseline. It would fail from that other bug before it ran
long enough to hit these ones. Both errors start within a few minutes
of each other, but do not appear to be related other than that. Once
they start happening, they occur repeatedly.
Could you describe the test case in sufficient detail for Teodor to reproduce
your results?
I spawn a swarm of processes to update a counter in a randomly chosen
row, selected via the gin index. They do this as fast as possible
until the server intentionally crashes. When it recovers, they
compare notes and see if the results are consistent.
But in this case, the problem is not inconsistent results, but rather
errors during the updating stage.
The patch introduces a mechanism to crash the server, a mechanism to
fast-forward the XID, and some additional logging that I sometimes
find useful (which I haven't been using in this case, but don't want
to rip it out).
The perl script implements the core of the test harness.
The shell script sets up the server (using hard coded paths for the
data and the binaries, so will need to be changed), and then calls the
perl script in a loop.
Running on an 8 core system, I've never seen it have a problem in less
than 9 hours of run time.
This produces copious amounts of logging to stdout. This is how I
look through the log files to find this particular problem:
sh do.sh >& do.err &
tail -f do.err | fgrep ERROR
The more I think about it, the more I think gin is just an innocent
bystander, for which I just happen to have a particularly demanding
test. I think something about snapshots and wrap-around may be
broken.
Cheers,
Jeff
Attachment: crash_REL9_6.patch
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
new file mode 100644
index 2f7e645..3d7e255
*** a/src/backend/access/transam/varsup.c
--- b/src/backend/access/transam/varsup.c
***************
*** 33,38 ****
--- 33,41 ----
/* pointer to "variable cache" in shared memory (set up by shmem.c) */
VariableCache ShmemVariableCache = NULL;
+ int JJ_xid=0;
+ extern int JJ_vac;
+
/*
* Allocate the next XID for a new transaction or subtransaction.
*************** GetNewTransactionId(bool isSubXact)
*** 168,173 ****
--- 171,181 ----
*
* Extend pg_subtrans and pg_commit_ts too.
*/
+ {
+ int incr;
+ for (incr=0; incr <=JJ_xid; incr++)
+ {
+ xid = ShmemVariableCache->nextXid;
ExtendCLOG(xid);
ExtendCommitTs(xid);
ExtendSUBTRANS(xid);
*************** GetNewTransactionId(bool isSubXact)
*** 179,184 ****
--- 187,194 ----
* more XIDs until there is CLOG space for them.
*/
TransactionIdAdvance(ShmemVariableCache->nextXid);
+ }
+ }
/*
* We must store the new XID into the shared ProcArray before releasing
*************** SetTransactionIdLimit(TransactionId olde
*** 342,349 ****
LWLockRelease(XidGenLock);
/* Log the info */
! ereport(DEBUG1,
! (errmsg("transaction ID wrap limit is %u, limited by database with OID %u",
xidWrapLimit, oldest_datoid)));
/*
--- 352,359 ----
LWLockRelease(XidGenLock);
/* Log the info */
! ereport(LOG,
! (errmsg("JJ transaction ID wrap limit is %u, limited by database with OID %u",
xidWrapLimit, oldest_datoid)));
/*
*************** ForceTransactionIdLimitUpdate(void)
*** 420,425 ****
--- 430,446 ----
oldestXidDB = ShmemVariableCache->oldestXidDB;
LWLockRelease(XidGenLock);
+ if (JJ_vac) {
+ elog(LOG,"JJ ForceTransactionIdLimitUpdate in %d: !normal %d, !valid %d, follows %d (%u, %u), !exists %d", MyDatabaseId,
+ !TransactionIdIsNormal(oldestXid),
+ !TransactionIdIsValid(xidVacLimit),
+ TransactionIdFollowsOrEquals(nextXid, xidVacLimit),
+ nextXid, xidVacLimit,
+ !SearchSysCacheExists1(DATABASEOID, ObjectIdGetDatum(oldestXidDB))
+ );
+ };
+
+
if (!TransactionIdIsNormal(oldestXid))
return true; /* shouldn't happen, but just in case */
if (!TransactionIdIsValid(xidVacLimit))
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
new file mode 100644
index f9644db..c14e2af
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** BootStrapXLOG(void)
*** 4823,4828 ****
--- 4823,4829 ----
ShmemVariableCache->nextOid = checkPoint.nextOid;
ShmemVariableCache->oidCount = 0;
MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
+ //elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
*************** StartupXLOG(void)
*** 6341,6346 ****
--- 6342,6348 ----
ShmemVariableCache->nextOid = checkPoint.nextOid;
ShmemVariableCache->oidCount = 0;
MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
+ //elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
SetCommitTsLimit(checkPoint.oldestCommitTsXid,
*************** xlog_redo(XLogReaderState *record)
*** 9249,9254 ****
--- 9251,9257 ----
MultiXactAdvanceOldest(checkPoint.oldestMulti,
checkPoint.oldestMultiDB);
+ //elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
/*
*************** xlog_redo(XLogReaderState *record)
*** 9346,9351 ****
--- 9349,9355 ----
*/
MultiXactAdvanceOldest(checkPoint.oldestMulti,
checkPoint.oldestMultiDB);
+ //elog(LOG,"JJ maybe SetTransactionIDLimit %d", checkPoint.oldestXid);
if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
checkPoint.oldestXid))
SetTransactionIdLimit(checkPoint.oldestXid,
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
new file mode 100644
index 93361a0..eabedae
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
*************** int vacuum_freeze_min_age;
*** 58,63 ****
--- 58,65 ----
int vacuum_freeze_table_age;
int vacuum_multixact_freeze_min_age;
int vacuum_multixact_freeze_table_age;
+ int JJ_vac=0;
+
/* A few variables that don't seem worth passing around as parameters */
*************** vacuum_set_xid_limits(Relation rel,
*** 531,536 ****
--- 533,539 ----
}
*freezeLimit = limit;
+ if (JJ_vac) elog(LOG,"JJ freezeLimit %d", *freezeLimit);
/*
* Compute the multixact age for which freezing is urgent. This is
*************** vacuum_set_xid_limits(Relation rel,
*** 585,590 ****
--- 588,595 ----
* VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
* before anti-wraparound autovacuum is launched.
*/
+ if (JJ_vac) elog(LOG,"JJ freeze_min_age %d vacuum_freeze_table_age %d freeze_table_age %d ReadNew %d", freeze_min_age,
+ vacuum_freeze_table_age, freeze_table_age,ReadNewTransactionId());
freezetable = freeze_table_age;
if (freezetable < 0)
freezetable = vacuum_freeze_table_age;
*************** vac_update_datfrozenxid(void)
*** 1022,1027 ****
--- 1027,1033 ----
* truncate pg_clog and/or pg_multixact. Also do it if the shared
* XID-wrap-limit info is stale, since this action will update that too.
*/
+ if (JJ_vac && dirty) elog(LOG,"JJ updating in %d without call to ForceTransactionIdLimitUpdate", MyDatabaseId);
if (dirty || ForceTransactionIdLimitUpdate())
vac_truncate_clog(newFrozenXid, newMinMulti,
lastSaneFrozenXid, lastSaneMinMulti);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
new file mode 100644
index 426e756..a20af02
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 64,69 ****
--- 64,70 ----
#include "utils/tqual.h"
+ extern int JJ_vac;
/*
* Space/time tradeoff parameters: do these need to be user-tunable?
*
*************** lazy_vacuum_rel(Relation onerel, int opt
*** 233,238 ****
--- 234,241 ----
aggressive |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
mxactFullScanLimit);
+ if (JJ_vac) elog(LOG,"JJ aggressive %d, relfrozenxid %d", aggressive, onerel->rd_rel->relfrozenxid);
+
vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
new file mode 100644
index e2859df..db79486
*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
*************** int autovacuum_vac_cost_delay;
*** 122,127 ****
--- 122,128 ----
int autovacuum_vac_cost_limit;
int Log_autovacuum_min_duration = -1;
+ extern int JJ_vac;
/* how long to keep pgstat data in the launcher, in milliseconds */
#define STATS_READ_DELAY 1000
*************** AutoVacWorkerMain(int argc, char *argv[]
*** 1635,1642 ****
InitPostgres(NULL, dbid, NULL, InvalidOid, dbname);
SetProcessingMode(NormalProcessing);
set_ps_display(dbname, false);
- ereport(DEBUG1,
- (errmsg("autovacuum: processing database \"%s\"", dbname)));
if (PostAuthDelay)
pg_usleep(PostAuthDelay * 1000000L);
--- 1636,1641 ----
*************** AutoVacWorkerMain(int argc, char *argv[]
*** 1644,1650 ****
--- 1643,1653 ----
/* And do an appropriate amount of work */
recentXid = ReadNewTransactionId();
recentMulti = ReadNextMultiXactId();
+ if (JJ_vac) ereport(LOG,
+ (errmsg("autovacuum: processing database \"%s\" at recent Xid of %u recent mxid of %u", dbname,recentXid,recentMulti)));
do_autovacuum();
+ if (JJ_vac) ereport(LOG,
+ (errmsg("autovacuum: done processing database \"%s\" at recent Xid of %u recent mxid of %u", dbname,ReadNewTransactionId(),ReadNextMultiXactId())));
}
/*
*************** relation_needs_vacanalyze(Oid relid,
*** 2761,2773 ****
* reset, because if that happens, the last vacuum and analyze counts
* will be reset too.
*/
- elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f)",
- NameStr(classForm->relname),
- vactuples, vacthresh, anltuples, anlthresh);
/* Determine if this table needs vacuum or analyze. */
*dovacuum = force_vacuum || (vactuples > vacthresh);
*doanalyze = (anltuples > anlthresh);
}
else
{
--- 2764,2777 ----
* reset, because if that happens, the last vacuum and analyze counts
* will be reset too.
*/
/* Determine if this table needs vacuum or analyze. */
*dovacuum = force_vacuum || (vactuples > vacthresh);
*doanalyze = (anltuples > anlthresh);
+
+ if (JJ_vac) elog(LOG, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f) wraparound %d dovaccum %d doanalyze %d",
+ NameStr(classForm->relname),
+ vactuples, vacthresh, anltuples, anlthresh, *wraparound, *dovacuum, *doanalyze);
}
else
{
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
new file mode 100644
index 578276d..612606c
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 67,72 ****
--- 67,74 ----
#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
#endif
+ int JJ_torn_page=0;
+
/*
* The magnetic disk storage manager keeps track of open file
* descriptors in its own descriptor pool. This is done to make it
*************** mdwrite(SMgrRelation reln, ForkNumber fo
*** 794,799 ****
--- 796,802 ----
off_t seekpos;
int nbytes;
MdfdVec *v;
+ static int counter=0;
/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
*************** mdwrite(SMgrRelation reln, ForkNumber fo
*** 818,824 ****
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
--- 821,838 ----
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
! if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) {
! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
! ereport(FATAL,
! (errcode(ERRCODE_DISK_FULL),
! errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
! blocknum,
! relpath(reln->smgr_rnode, forknum),
! nbytes, BLCKSZ),
! errhint("JJ is screwing with the database.")));
! } else {
! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
! }
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index fb091bc..887245d
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 106,111 ****
--- 106,114 ----
/* XXX these should appear in other modules' header files */
extern bool Log_disconnections;
extern int CommitDelay;
+ int JJ_torn_page;
+ extern int JJ_xid;
+ extern int JJ_vac;
extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
*************** static struct config_int ConfigureNamesI
*** 2352,2357 ****
--- 2355,2387 ----
},
{
+ {"JJ_torn_page", PGC_USERSET, WAL_SETTINGS,
+ gettext_noop("Simulate a torn-page crash after this number of page writes (0 to turn off)"),
+ NULL
+ },
+ &JJ_torn_page,
+ 0, 0, 100000, NULL, NULL
+ },
+
+ {
+ {"JJ_xid", PGC_USERSET, WAL_SETTINGS,
+ gettext_noop("Skip this many xid every time we acquire one"),
+ NULL
+ },
+ &JJ_xid,
+ 0, 0, 1000000, NULL, NULL
+ },
+
+ {
+ {"JJ_vac", PGC_USERSET, WAL_SETTINGS,
+ gettext_noop("turn on verbose logging"),
+ NULL
+ },
+ &JJ_vac,
+ 0, 0, 1000000, NULL, NULL
+ },
+
+ {
{"commit_siblings", PGC_USERSET, WAL_SETTINGS,
gettext_noop("Sets the minimum concurrent open transactions before performing "
"commit_delay."),
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
new file mode 100644
index 8d5a378..cb4610b
*** a/src/include/pg_config_manual.h
--- b/src/include/pg_config_manual.h
***************
*** 276,282 ****
/*
* Enable debugging print statements for lock-related operations.
*/
! /* #define LOCK_DEBUG */
/*
* Enable debugging print statements for WAL-related operations; see
--- 276,282 ----
/*
* Enable debugging print statements for lock-related operations.
*/
! #define LOCK_DEBUG 1
/*
* Enable debugging print statements for WAL-related operations; see
On Fri, Apr 22, 2016 at 02:03:01PM -0700, Jeff Janes wrote:
On Thu, Apr 21, 2016 at 11:00 PM, Noah Misch <noah@leadboat.com> wrote:
Could you describe the test case in sufficient detail for Teodor to reproduce
your results?
[detailed description and attachments]
Thanks.
The more I think about it, the more I think gin is just an innocent
bystander, for which I just happen to have a particularly demanding
test. I think something about snapshots and wrap-around may be
broken.
Based on your review, I think the next step on $subject is for Teodor to
either commit gin_alone_cleanup-4.patch or tell us what he'd like to see or
do before committing that patch.
While the failures during your crash recovery testing may not even be 9.6
regressions, they suggest the presence of serious bugs. At a minimum, the
community should confirm they are not 9.6 regressions before releasing 9.6. I
have added an open item to that effect.
Check my reasoning: in version 4 I added remembering of the tail of the
pending list in the blknoFinish variable. When we read the page which was
the tail at cleanup start, we set the cleanupFinish variable, and after
cleaning that page we stop further cleanup. Any insert that happens during
cleanup will be placed after blknoFinish (corner case: in that page), so
vacuum should not miss tuples marked as deleted.
Yes, I agree with the correctness of v4. But I do wonder if we should
use that early stopping for vacuum and gin_clean_pending_list, rather
than just using it for user backends.
Interesting, I've missed this possible option.
While I think correctness
allows it to stop early, since these routines are explicitly about
cleaning things up it seems like they should volunteer to clean the
whole thing.
I believe that autovacuum should not require guaranteed full cleanup; only
vacuum and gin_clean_pending_list() should do that. In all other cases it
should stop early to prevent possible infinite cleanup. Patch attached.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachment: gin_alone_cleanup-5.patch
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2ddf568..727bb43 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -27,7 +27,9 @@
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/acl.h"
+#include "postmaster/autovacuum.h"
#include "storage/indexfsm.h"
+#include "storage/lmgr.h"
/* GUC parameter */
int gin_pending_list_limit = 0;
@@ -437,7 +439,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
END_CRIT_SECTION();
if (needCleanup)
- ginInsertCleanup(ginstate, true, NULL);
+ ginInsertCleanup(ginstate, false, true, NULL);
}
/*
@@ -502,11 +504,8 @@ ginHeapTupleFastCollect(GinState *ginstate,
* If newHead == InvalidBlockNumber then function drops the whole list.
*
* metapage is pinned and exclusive-locked throughout this function.
- *
- * Returns true if another cleanup process is running concurrently
- * (if so, we can just abandon our own efforts)
*/
-static bool
+static void
shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
bool fill_fsm, IndexBulkDeleteResult *stats)
{
@@ -537,14 +536,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
data.ndeleted++;
- if (GinPageIsDeleted(page))
- {
- /* concurrent cleanup process is detected */
- for (i = 0; i < data.ndeleted; i++)
- UnlockReleaseBuffer(buffers[i]);
-
- return true;
- }
+ Assert(!GinPageIsDeleted(page));
nDeletedHeapTuples += GinPageGetOpaque(page)->maxoff;
blknoToDelete = GinPageGetOpaque(page)->rightlink;
@@ -620,8 +612,6 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
RecordFreeIndexPage(index, freespace[i]);
} while (blknoToDelete != newHead);
-
- return false;
}
/* Initialize empty KeyArray */
@@ -722,18 +712,10 @@ processPendingPage(BuildAccumulator *accum, KeyArray *ka,
/*
* Move tuples from pending pages into regular GIN structure.
*
- * This can be called concurrently by multiple backends, so it must cope.
- * On first glance it looks completely not concurrent-safe and not crash-safe
- * either. The reason it's okay is that multiple insertion of the same entry
- * is detected and treated as a no-op by gininsert.c. If we crash after
- * posting entries to the main index and before removing them from the
+ * At first glance it looks completely not crash-safe. But if we crash
+ * after posting entries to the main index and before removing them from the
* pending list, it's okay because when we redo the posting later on, nothing
- * bad will happen. Likewise, if two backends simultaneously try to post
- * a pending entry into the main index, one will succeed and one will do
- * nothing. We try to notice when someone else is a little bit ahead of
- * us in the process, but that's just to avoid wasting cycles. Only the
- * action of removing a page from the pending list really needs exclusive
- * lock.
+ * bad will happen.
*
* fill_fsm indicates that ginInsertCleanup should add deleted pages
* to FSM otherwise caller is responsible to put deleted pages into
@@ -742,7 +724,7 @@ processPendingPage(BuildAccumulator *accum, KeyArray *ka,
* If stats isn't null, we count deleted pending pages into the counts.
*/
void
-ginInsertCleanup(GinState *ginstate,
+ginInsertCleanup(GinState *ginstate, bool full_clean,
bool fill_fsm, IndexBulkDeleteResult *stats)
{
Relation index = ginstate->index;
@@ -755,8 +737,43 @@ ginInsertCleanup(GinState *ginstate,
oldCtx;
BuildAccumulator accum;
KeyArray datums;
- BlockNumber blkno;
+ BlockNumber blkno,
+ blknoFinish;
+ bool cleanupFinish = false;
bool fsm_vac = false;
+ Size workMemory;
+ bool inVacuum = (stats == NULL);
+
+ /*
+ * We would like to prevent a concurrent cleanup process. For
+ * that we will lock the metapage in exclusive mode using a
+ * LockPage() call. Nobody else will use that lock on the
+ * metapage, so we keep the possibility of concurrent insertion
+ * into the pending list.
+ */
+
+ if (inVacuum)
+ {
+ /*
+ * We are called from [auto]vacuum/analyze or
+ * gin_clean_pending_list() and we would like to wait for
+ * a concurrent cleanup to finish.
+ */
+ LockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
+ workMemory =
+ (IsAutoVacuumWorkerProcess() && autovacuum_work_mem != -1) ?
+ autovacuum_work_mem : maintenance_work_mem;
+ }
+ else
+ {
+ /*
+ * We are called from a regular insert and if we see a
+ * concurrent cleanup just exit in the hope that the
+ * concurrent process will clean up the pending list.
+ */
+ if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
+ return;
+ workMemory = work_mem;
+ }
metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
LockBuffer(metabuffer, GIN_SHARE);
@@ -767,10 +784,17 @@ ginInsertCleanup(GinState *ginstate,
{
/* Nothing to do */
UnlockReleaseBuffer(metabuffer);
+ UnlockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
return;
}
/*
+ * Remember the tail page to prevent infinite cleanup if other backends
+ * add new tuples faster than we can clean up.
+ */
+ blknoFinish = metadata->tail;
+
+ /*
* Read and lock head of pending list
*/
blkno = metadata->head;
@@ -802,13 +826,15 @@ ginInsertCleanup(GinState *ginstate,
*/
for (;;)
{
- if (GinPageIsDeleted(page))
- {
- /* another cleanup process is running concurrently */
- UnlockReleaseBuffer(buffer);
- fsm_vac = false;
- break;
- }
+ Assert(!GinPageIsDeleted(page));
+
+ /*
+ * Are we walking through the page which, as we remember, was the
+ * tail when we started our cleanup? But if the caller asks us to
+ * clean up the whole pending list then ignore the old tail; we
+ * will work until the list becomes empty.
+ */
+ if (blkno == blknoFinish && full_clean == false)
+ cleanupFinish = true;
/*
* read page's datums into accum
@@ -821,13 +847,10 @@ ginInsertCleanup(GinState *ginstate,
* Is it time to flush memory to disk? Flush if we are at the end of
* the pending list, or if we have a full row and memory is getting
* full.
- *
- * XXX using up maintenance_work_mem here is probably unreasonably
- * much, since vacuum might already be using that much.
*/
if (GinPageGetOpaque(page)->rightlink == InvalidBlockNumber ||
(GinPageHasFullRow(page) &&
- (accum.allocatedMemory >= (Size)maintenance_work_mem * 1024L)))
+ (accum.allocatedMemory >= workMemory * 1024L)))
{
ItemPointerData *list;
uint32 nlist;
@@ -864,14 +887,7 @@ ginInsertCleanup(GinState *ginstate,
LockBuffer(metabuffer, GIN_EXCLUSIVE);
LockBuffer(buffer, GIN_SHARE);
- if (GinPageIsDeleted(page))
- {
- /* another cleanup process is running concurrently */
- UnlockReleaseBuffer(buffer);
- LockBuffer(metabuffer, GIN_UNLOCK);
- fsm_vac = false;
- break;
- }
+ Assert(!GinPageIsDeleted(page));
/*
* While we left the page unlocked, more stuff might have gotten
@@ -904,13 +920,7 @@ ginInsertCleanup(GinState *ginstate,
* remove read pages from pending list, at this point all
* content of read pages is in regular structure
*/
- if (shiftList(index, metabuffer, blkno, fill_fsm, stats))
- {
- /* another cleanup process is running concurrently */
- LockBuffer(metabuffer, GIN_UNLOCK);
- fsm_vac = false;
- break;
- }
+ shiftList(index, metabuffer, blkno, fill_fsm, stats);
/* At this point, some pending pages have been freed up */
fsm_vac = true;
@@ -919,9 +929,10 @@ ginInsertCleanup(GinState *ginstate,
LockBuffer(metabuffer, GIN_UNLOCK);
/*
- * if we removed the whole pending list just exit
+ * if we removed the whole pending list, or we cleaned up the tail
+ * (which we remembered at the start of our cleanup), then just exit
*/
- if (blkno == InvalidBlockNumber)
+ if (blkno == InvalidBlockNumber || cleanupFinish)
break;
/*
@@ -946,6 +957,7 @@ ginInsertCleanup(GinState *ginstate,
page = BufferGetPage(buffer);
}
+ UnlockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
ReleaseBuffer(metabuffer);
/*
@@ -1004,7 +1016,7 @@ gin_clean_pending_list(PG_FUNCTION_ARGS)
memset(&stats, 0, sizeof(stats));
initGinState(&ginstate, indexRel);
- ginInsertCleanup(&ginstate, true, &stats);
+ ginInsertCleanup(&ginstate, true, true, &stats);
index_close(indexRel, AccessShareLock);
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 6a4b98a..1934c37 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -540,8 +540,10 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Yes, so initialize stats to zeroes */
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
- /* and cleanup any pending inserts */
- ginInsertCleanup(&gvs.ginstate, false, stats);
+ /* and cleanup any pending inserts */
+ ginInsertCleanup(&gvs.ginstate, !IsAutoVacuumWorkerProcess(),
+ false, stats);
}
/* we'll re-count the tuples each time */
@@ -654,7 +656,7 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
if (IsAutoVacuumWorkerProcess())
{
initGinState(&ginstate, index);
- ginInsertCleanup(&ginstate, true, stats);
+ ginInsertCleanup(&ginstate, false, true, stats);
}
return stats;
}
@@ -667,7 +669,8 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
initGinState(&ginstate, index);
- ginInsertCleanup(&ginstate, false, stats);
+ ginInsertCleanup(&ginstate, !IsAutoVacuumWorkerProcess(),
+ false, stats);
}
memset(&idxStat, 0, sizeof(idxStat));
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index f1ad04b..68cfe0c 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -947,7 +947,7 @@ extern void ginHeapTupleFastCollect(GinState *ginstate,
GinTupleCollector *collector,
OffsetNumber attnum, Datum value, bool isNull,
ItemPointer ht_ctid);
-extern void ginInsertCleanup(GinState *ginstate,
+extern void ginInsertCleanup(GinState *ginstate, bool full_clean,
bool fill_fsm, IndexBulkDeleteResult *stats);
/* ginpostinglist.c */
On Tue, Apr 26, 2016 at 08:22:03PM +0300, Teodor Sigaev wrote:
Check my reasoning: in version 4 I added remembering of the tail of the
pending list in the blknoFinish variable. When we read the page which was
the tail at cleanup start, we set the cleanupFinish variable, and after
cleaning that page we stop further cleanup. Any insert that happens during
cleanup will be placed after blknoFinish (corner case: in that page), so
vacuum should not miss tuples marked as deleted.
Yes, I agree with the correctness of v4. But I do wonder if we should
use that early stopping for vacuum and gin_clean_pending_list, rather
than just using it for user backends.
Interesting, I've missed this possible option.
While I think correctness
allows it to stop early, since these routines are explicitly about
cleaning things up it seems like they should volunteer to clean the
whole thing.
I believe that autovacuum should not require guaranteed full cleanup; only
vacuum and gin_clean_pending_list() should do that. In all other cases it
should stop early to prevent possible infinite cleanup. Patch attached.
Teodor, this item remains open after twenty-three days of you owning it. At
this pace, 9.6beta1 will have an unreliable GIN implementation. What things
must happen so you can close this item, and when will you do those things?
The more I think about it, the more I think gin is just an innocent
bystander, for which I just happen to have a particularly demanding
test. I think something about snapshots and wrap-around may be
broken.
After 10 hours of running I've got:
1587 XX000 2016-04-28 05:57:09.964 MSK:ERROR: unexpected chunk number 0
(expected 1) for toast value 10192986 in pg_toast_16424
1587 XX000 2016-04-28 05:57:09.964 MSK:CONTEXT: automatic analyze of table
"teodor.public.foo"
1587 00000 2016-04-28 05:57:09.974 MSK:LOG: JJ transaction ID wrap limit is
2147484514, limited by database with OID 16384
I've pushed v5 of the gin-alone-cleanup patch; many thanks to all participants.
Will work on the backpatch.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/