Write Ahead Logging for Hash Indexes
$SUBJECT will make hash indexes reliable and usable on standby.
As far as I understand, hash indexes are currently not recommended for
use in production mainly because they are not crash-safe. With this
patch, I hope we can address that limitation and recommend them for
production use.
This patch is built on my earlier patch [1] for making hash indexes
concurrent. The main reason for doing so is that the earlier patch
allows the split operation to be completed and uses light-weight
locking, due to which operations can be logged at a granular level.
WAL for different operations:
This has been explained in the README as well, but I am writing it
here again for convenience.
=================================
Multiple WAL records are written for the create index operation: first
one for initializing the metapage, followed by one for each new bucket
created during the operation, followed by one for initializing the
bitmap page. If the system crashes partway through, the whole
operation is rolled back. I considered writing a single WAL record
for the whole operation, but for that we would need to limit the
number of initial buckets that can be created during the operation. As
only a fixed number of pages, XLR_MAX_BLOCK_ID (32), can be logged
with the current XLog machinery, it is better to write multiple WAL
records for this operation. The downside of restricting the number of
buckets is that we would need to perform split operations if there are
more tuples than the initial set of buckets can accommodate, and it is
not unusual to have a large number of tuples during a create index
operation.
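To see why restricting the number of buckets would hurt, consider how fast the initial bucket count grows with the tuple estimate. The following is an illustrative sketch only (the helper names and the fixed fill factor are assumptions, loosely modeled on how a hash index build sizes its initial buckets as a power of two), showing that even a modest table needs far more than XLR_MAX_BLOCK_ID pages:

```c
#include <stdint.h>

#define XLR_MAX_BLOCK_ID 32     /* max pages loggable in one WAL record */

/* Smallest power of two >= n (for n >= 1). */
static uint32_t
next_power_of_two(uint32_t n)
{
    uint32_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Rough estimate of initial buckets for a hash index build, assuming
 * each bucket holds about "ffactor" tuples.  Hypothetical helper, not
 * the actual create-index computation.
 */
static uint32_t
estimate_initial_buckets(uint32_t ntuples, uint32_t ffactor)
{
    uint32_t dnumbuckets = ntuples / ffactor;

    return next_power_of_two(dnumbuckets < 2 ? 2 : dnumbuckets);
}
```

With an assumed 75 tuples per bucket, a million-tuple table would already need 16384 initial buckets, far beyond the 32 pages a single WAL record can reference.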
Ordinary item insertions (that don't force a page split or need a new
overflow page) are single WAL entries. They touch a single bucket
page and the metapage; the metapage is updated during replay just as
it is during the original operation.
An insertion that causes the addition of an overflow page is logged as
a single WAL entry, preceded by a WAL entry for the new overflow page
required to insert the tuple. There is a corner case where, by the
time we try to use the newly allocated overflow page, it has already
been used by concurrent insertions; in that case, a new overflow page
is allocated and a separate WAL entry is written for it.
An insertion that causes a bucket split is logged as a single WAL
entry, followed by a WAL entry for allocating a new bucket, followed
by a WAL entry for each overflow bucket page in the new bucket to
which tuples are moved from the old bucket, followed by a WAL entry
indicating that the split is complete for both the old and new buckets.
A split operation that requires overflow pages to complete will need
to write a WAL record for each new allocation of an overflow page. As
splitting involves multiple atomic actions, it is possible for the
system to crash in the middle of moving tuples from the old bucket's
pages to the new bucket. After recovery, both the old and new buckets
will be marked with an incomplete-split flag. The reader algorithm
still works correctly, as it will scan both the old and new buckets.
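The reason readers tolerate an incomplete split is the power-of-two masking that maps hash keys to buckets. The sketch below is loosely modeled on the hashkey-to-bucket logic (function names here are illustrative, not the actual PostgreSQL routines): any key that maps to a not-yet-populated new bucket also maps, under the old mask, to the corresponding old bucket, so scanning both finds every matching tuple no matter how far the tuple move progressed:

```c
#include <stdint.h>

/*
 * Map a hash value to a bucket, given the current split state:
 * maxbucket is the highest bucket number in use; highmask/lowmask are
 * 2^k - 1 for the current and previous power-of-two table sizes.
 * Loosely modeled on the hashkey-to-bucket mapping in the hash AM.
 */
static uint32_t
hashkey_to_bucket(uint32_t hashkey, uint32_t maxbucket,
                  uint32_t highmask, uint32_t lowmask)
{
    uint32_t bucket = hashkey & highmask;

    if (bucket > maxbucket)
        bucket = bucket & lowmask;  /* bucket not created yet: use old one */
    return bucket;
}

/*
 * During a split, the tuples moving into new bucket N come from the
 * old bucket N & lowmask; a reader that finds either bucket flagged
 * as mid-split simply scans both.
 */
static uint32_t
old_bucket_for(uint32_t new_bucket, uint32_t lowmask)
{
    return new_bucket & lowmask;
}
```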
We finish the split at the next insert or split operation on the old
bucket. It could be done during searches, too, but it seems best not
to put any extra updates into what would otherwise be a read-only
operation (updating is not possible in hot standby mode anyway). It
would seem natural to complete the split in VACUUM, but since
splitting a bucket might require allocating a new page, it could fail
if you run out of disk space. That would be bad during VACUUM: the
reason for running VACUUM in the first place might be that you have
run out of disk space, and now VACUUM can't finish because you're out
of disk space. In contrast, an insertion can require enlarging the
physical file anyway.
Deletion of tuples from a bucket is performed for two reasons: to
remove dead tuples, and to remove tuples that were moved by a split.
A WAL entry is written for each bucket page from which tuples are
removed, followed by a WAL entry to clear the garbage flag if the
tuples moved by the split were removed. Another separate WAL entry is
written for updating the metapage if the deletion removes dead tuples
during vacuum.
As deletion involves multiple atomic operations, it is quite possible
for the system to crash (a) after removing tuples from some of the
bucket pages, (b) before clearing the garbage flag, or (c) before
updating the metapage. If the system crashes before completing (b),
the bucket will be cleaned again during the next vacuum or insert
after recovery, which can have some performance impact but will work
correctly. If the system crashes before completing (c), after
recovery there could be some additional splits until the next vacuum
updates the metapage, but the other operations (insert, delete, and
scan) will work correctly. We could fix this by actually updating the
metapage based on the delete operation during replay, but I am not
sure it is worth the complication.
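To make the effect of a stale metapage concrete, here is an illustrative model of the split-trigger test (an assumption for illustration, not the exact check in the code): a split is attempted when the metapage's tuple count exceeds the target fill for the buckets currently allocated, so a count left inflated by an unreplayed metapage update merely makes this fire a few extra times until vacuum corrects it:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative split trigger: split when the metapage's tuple count
 * exceeds the target fill (ffactor tuples per bucket) for the buckets
 * currently allocated.  If replay misses a metapage tuple-count
 * decrement after a crash, ntuples stays inflated and this fires a
 * little early, causing extra but harmless splits until vacuum
 * refreshes the count.
 */
static bool
needs_split(double ntuples, uint32_t maxbucket, uint32_t ffactor)
{
    return ntuples > (double) ffactor * (maxbucket + 1);
}
```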
The squeeze operation moves tuples from a bucket page later in the
overflow chain to one earlier in the chain, and writes a WAL record
when either the page to which it is writing tuples becomes full or the
page from which it is removing tuples becomes empty.
As the squeeze operation involves multiple atomic operations, it is
quite possible for the system to crash before the operation completes
on the entire bucket. After recovery, the operations will work
correctly, but the index will remain bloated, which can impact the
performance of read and insert operations until the next vacuum
squeezes the bucket completely.
=====================================
One of the challenges in writing this patch was that the current code
was not written with the mindset that we would need to write WAL for
different operations. A typical example is _hash_addovflpage(), where
pages are modified across different function calls and all
modifications need to be done atomically, so I had to refactor some
code so that the operations could be logged sensibly.
Thanks to Ashutosh Sharma, who helped me complete the patch by
writing WAL for the create index and delete operations and by doing
detailed testing of the patch using the pg_filedump tool. I think it
is better if he himself explains the testing he has done to ensure the
correctness of the patch.
Thoughts?
Note - To use this patch, first apply the latest version of the
concurrent hash index patch [2].
[1]: https://commitfest.postgresql.org/10/647/
[2]: /messages/by-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment: wal_hash_index_v1.patch
On Tue, Aug 23, 2016 at 8:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
$SUBJECT will make hash indexes reliable and usable on standby.
...
This patch does not yet handle OldSnapshot. Previously, we did not
perform TestForOldSnapshot() checks in hash indexes as they were not
logged, but now, with this patch, it makes sense to perform such
checks.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi All,
Following are the steps that I have followed to verify the WAL logging
of hash indexes:
1. I used Mithun's patch to improve coverage of the hash index code
[1] to verify the WAL logging of hash indexes. First, I confirmed
whether all the XLOG records associated with hash indexes are being
covered using this patch. Where an XLOG record for a hash index
operation was not covered, I added a test case for it. I found that
one XLOG record, XLOG_HASH_MOVE_PAGE_CONTENTS, was not being covered,
and added a small test case for it. The patch for this is available
at [2].
2. I executed the regression test suite and found all the hash indexes
that are created as part of it using the query below.
SELECT t.relname index_name, t.oid FROM pg_class t JOIN pg_am idx
ON idx.oid = t.relam WHERE idx.amname = 'hash';
3. Thirdly, I calculated the number of pages associated with each
hash index and compared every page of the hash index on the master and
standby servers using the pg_filedump tool. For example, if the number
of pages associated with 'con_hash_index' is 10, here is what I did:
On master:
-----------------
select pg_relation_filepath('con_hash_index');
pg_relation_filepath
----------------------
base/16408/16433
(1 row)
./pg_filedump -if -R 0 9
/home/edb/git-clone-postgresql/postgresql/TMP/postgres/master/base/16408/16433
/tmp/file1
On Slave:
---------------
select pg_relation_filepath('con_hash_index');
pg_relation_filepath
----------------------
base/16408/16433
(1 row)
./pg_filedump -if -R 0 9
/home/edb/git-clone-postgresql/postgresql/TMP/postgres/standby/base/16408/16433
/tmp/file2
Then I compared file1 and file2 using a diff tool.
Following are the hash indexes that were created inside the
regression database when the regression test suite was executed on a
master server.
hash_i4_index
hash_name_index
hash_txt_index
hash_f8_index
con_hash_index
hash_idx
In short, this is all I did, and I found no issues during testing.
Please let me know if you need any further details.
I would like to thank Amit for his support and guidance during the
testing phase.
[1]: /messages/by-id/CAA4eK1JOBX=YU33631Qh-XivYXtPSALh514+jR8XeD7v+K3r_Q@mail.gmail.com
[2]: /messages/by-id/CAE9k0PkNjryhSiG53mjnKFhi+MipJMjSa=YkH-UeW3bfr1HPJQ@mail.gmail.com
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Hi Amit,
Thanks for working on this.
When building with --enable-cassert, I get compiler warning:
hash.c: In function 'hashbucketcleanup':
hash.c:722: warning: 'new_bucket' may be used uninitialized in this function
After an intentionally created crash, I get an Assert triggering:
TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
(1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)
freep[0] is zero and bitmapbit is 16.
With this backtrace:
(gdb) bt
#0 0x0000003838c325e5 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003838c33dc5 in abort () at abort.c:92
#2 0x000000000081a8fd in ExceptionalCondition (conditionName=<value
optimized out>, errorType=<value optimized out>, fileName=<value optimized
out>,
lineNumber=<value optimized out>) at assert.c:54
#3 0x00000000004a4199 in _hash_freeovflpage (rel=0x7f3f745d86b8,
bucketbuf=198, ovflbuf=199, wbuf=198, itups=0x7ffc258fa090,
itup_offsets=0x126e8a8,
tups_size=0x7ffc258f93d0, nitups=70, bstrategy=0x12ba320) at
hashovfl.c:553
#4 0x00000000004a4c32 in _hash_squeezebucket (rel=<value optimized out>,
bucket=38, bucket_blkno=56, bucket_buf=198, bstrategy=0x12ba320)
at hashovfl.c:1010
#5 0x00000000004a042a in hashbucketcleanup (rel=0x7f3f745d86b8,
bucket_buf=198, bucket_blkno=56, bstrategy=0x12ba320, maxbucket=96,
highmask=127,
lowmask=63, tuples_removed=0x7ffc258fc1c8,
num_index_tuples=0x7ffc258fc1c0, bucket_has_garbage=0 '\000', delay=1
'\001',
callback=0x5e9bd0 <lazy_tid_reaped>, callback_state=0x126e248) at
hash.c:937
#6 0x00000000004a07e7 in hashbulkdelete (info=0x7ffc258fc2b0, stats=0x0,
callback=0x5e9bd0 <lazy_tid_reaped>, callback_state=0x126e248) at hash.c:580
#7 0x00000000005e98c5 in lazy_vacuum_index (indrel=0x7f3f745d86b8,
stats=0x126ecc0, vacrelstats=0x126e248) at vacuumlazy.c:1599
#8 0x00000000005ea7f9 in lazy_scan_heap (onerel=<value optimized out>,
options=<value optimized out>, params=0x12ba290, bstrategy=<value optimized
out>)
at vacuumlazy.c:1291
#9 lazy_vacuum_rel (onerel=<value optimized out>, options=<value optimized
out>, params=0x12ba290, bstrategy=<value optimized out>) at vacuumlazy.c:255
#10 0x00000000005e8939 in vacuum_rel (relid=17329, relation=0x7ffc258fcbd0,
options=99, params=0x12ba290) at vacuum.c:1399
#11 0x00000000005e8d01 in vacuum (options=99, relation=0x7ffc258fcbd0,
relid=<value optimized out>, params=0x12ba290, va_cols=0x0,
bstrategy=<value optimized out>, isTopLevel=1 '\001') at vacuum.c:307
#12 0x00000000006a07f1 in autovacuum_do_vac_analyze () at autovacuum.c:2823
#13 do_autovacuum () at autovacuum.c:2341
#14 0x00000000006a0f9c in AutoVacWorkerMain (argc=<value optimized out>,
argv=<value optimized out>) at autovacuum.c:1656
#15 0x00000000006a1116 in StartAutoVacWorker () at autovacuum.c:1461
#16 0x00000000006afb00 in StartAutovacuumWorker (postgres_signal_arg=<value
optimized out>) at postmaster.c:5323
#17 sigusr1_handler (postgres_signal_arg=<value optimized out>) at
postmaster.c:5009
#18 <signal handler called>
#19 0x0000003838ce1503 in __select_nocancel () at
../sysdeps/unix/syscall-template.S:82
#20 0x00000000006b0ec0 in ServerLoop (argc=<value optimized out>,
argv=<value optimized out>) at postmaster.c:1657
#21 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>)
at postmaster.c:1301
#22 0x0000000000632e88 in main (argc=4, argv=0x11f4d50) at main.c:228
Cheers,
Jeff
On 23/08/16 15:24, Amit Kapila wrote:
Thoughts?
Note - To use this patch, first apply latest version of concurrent
hash index patch [2].
[1] - https://commitfest.postgresql.org/10/647/
[2] - /messages/by-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com
Firstly - really nice! Patches applied easily etc to latest version 10
checkout.
I thought I'd test by initializing pgbench schema, adding a standby,
then adding some hash indexes and running pgbench:
Looking on the standby:
bench=# \d pgbench_accounts
Table "public.pgbench_accounts"
Column | Type | Modifiers
----------+---------------+-----------
aid | integer | not null
bid | integer |
abalance | integer |
filler | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid) <====
bench=# \d pgbench_history
Table "public.pgbench_history"
Column | Type | Modifiers
--------+-----------------------------+-----------
tid | integer |
bid | integer |
aid | integer |
delta | integer |
mtime | timestamp without time zone |
filler | character(22) |
Indexes:
"pgbench_history_bid" hash (bid) <=====
they have been replicated there ok.
However I'm seeing a hang on the master after a while:
bench=# SELECT datname,application_name,state,now()-xact_start AS
wait,query FROM pg_stat_activity ORDER BY wait DESC;
datname | application_name | state | wait | query
---------+------------------+--------+-----------------+----------------------------------------------------------------------------------------------------------------
| walreceiver | idle | |
bench | pgbench | active | 00:31:38.367467 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973,
-3868, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:38.215378 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, 3954814,
2091, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.991056 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, 8355237,
3438, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.619798 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, 4839994,
-2443, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.544196 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119,
4053, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.334504 | UPDATE
pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33;
bench | pgbench | active | 00:31:35.234112 | UPDATE
pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38;
bench | pgbench | active | 00:31:34.434676 | UPDATE
pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33;
bench | psql | active | 00:00:00 | SELECT
datname,application_name,state,now()-xact_start AS wait,query FROM
pg_stat_activity ORDER BY wait DESC;
(10 rows)
but no errors in the logs, any thoughts?
Cheers
Mark
On 24/08/16 12:09, Mark Kirkwood wrote:
On 23/08/16 15:24, Amit Kapila wrote:
...
but no errors in the logs, any thoughts?
FWIW, retesting with the WAL logging patch removed (i.e. leaving the
concurrent hash one) works fine.
On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
On 24/08/16 12:09, Mark Kirkwood wrote:
...
but no errors in the logs, any thoughts?
Can you get the call stacks?
FWIW, retesting with the WAL logging patch removed (i.e. leaving the
concurrent hash one) works fine.
Okay, information noted.
Thanks for testing and showing interest in the patch.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 24/08/16 15:36, Amit Kapila wrote:
On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
...
Can you get the call stacks?
For every stuck backend? (just double checking)...
On Wed, Aug 24, 2016 at 9:53 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
On 24/08/16 15:36, Amit Kapila wrote:
On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
Can you get the call stacks?
For every stuck backend? (just double checking)...
Yeah.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 24/08/16 16:33, Amit Kapila wrote:
On Wed, Aug 24, 2016 at 9:53 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
On 24/08/16 15:36, Amit Kapila wrote:
On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
Can you get the call stacks?
For every stuck backend? (just double checking)...
Yeah.
Cool,
I managed to reproduce with a reduced workload of 4 backends, then
noticed that the traces for 3 of 'em were all the same. So I've attached
the 2 unique ones, plus noted what pg_stat_activity thought the wait
event was, in case that is useful.
Cheers
Mark
Attachment: bt.out
On 24/08/16 16:52, Mark Kirkwood wrote:
...
I managed to reproduce with a reduced workload of 4 backends, then
noticed that the traces for 3 of 'em were all the same.
...actually I was wrong there, only 2 of them were the same. So I've
attached a new log of bt's of them all.
Attachment: bt.out
On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
Hi Amit,
Thanks for working on this.
When building with --enable-cassert, I get compiler warning:
hash.c: In function 'hashbucketcleanup':
hash.c:722: warning: 'new_bucket' may be used uninitialized in this function
This warning is from the concurrent index patch. I will fix it and
post the patch on that thread.
After an intentionally created crash, I get an Assert triggering:
TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
(1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)

freep[0] is zero and bitmapbit is 16.
Here what is happening is that when it tries to clear the bitmap bit,
it expects the bit to be set. Now, I think the reason it didn't find
the bit set could be that after the new overflow page was added and
the bit corresponding to it was set, you crashed the system, and
replay did not set the bit. Then, while freeing the overflow page, it
can hit the Assert as you mentioned. I think the problem here could
be that I am using REGBUF_STANDARD to log the bitmap page updates,
which seems to be causing the issue. As the bitmap page doesn't
follow the standard page layout, the actual contents would have been
omitted when taking the full page image, and then during replay the
bit would not have been set, because the page doesn't need REDO. I
think the fix here is to use REGBUF_NO_IMAGE, as we use for vm
buffers.
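For reference, the state Jeff reported reduces to a bit that was never set during replay. A minimal model of the bitmap-bit manipulation (hypothetical helper names, mirroring the freep expression in the TRAP message) is:

```c
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_WORD 32

/*
 * Test the in-use bit for an overflow page in the bitmap page's word
 * array, mirroring the expression in the failed assertion:
 * freep[bit/32] & (1 << (bit % 32)).
 */
static bool
bit_is_set(const uint32_t *freep, uint32_t bit)
{
    return (freep[bit / BITS_PER_WORD] &
            (1u << (bit % BITS_PER_WORD))) != 0;
}

/* Set the bit, as allocating the overflow page should have done. */
static void
set_bit(uint32_t *freep, uint32_t bit)
{
    freep[bit / BITS_PER_WORD] |= 1u << (bit % BITS_PER_WORD);
}
```

With freep[0] == 0 and bitmapbit == 16, the tested bit is clear, which is exactly the state the assertion rejects: the allocation that should have set it was never replayed.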
If you can send me the detailed steps for how you produced the
problem, then after fixing it I can verify whether you are seeing the
same problem or something else.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 24/08/16 17:01, Mark Kirkwood wrote:
...actually I was wrong there, only 2 of them were the same. So I've
attached a new log of bt's of them all.
And I can reproduce it with only 1 session; I figured that might be a
helpful piece of the puzzle (trace attached).
Attachment: bt.out
On Wed, Aug 24, 2016 at 11:44 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
On 24/08/16 17:01, Mark Kirkwood wrote:
...actually I was wrong there, only 2 of them were the same. So I've
attached a new log of bt's of them all.

And I can reproduce with only 1 session, figured that might be a
helpful piece of the puzzle (trace attached).
Thanks.
I think I know the problem here. Basically, _hash_freeovflpage() is
trying to take a lock on the buffer previous to the overflow page in
order to update the links, and it is quite possible that the same
buffer is already locked for moving tuples while squeezing the
bucket. I am working on a fix for it.
Coincidentally, Ashutosh Sharma, a colleague of mine who was also
testing this patch, found the same issue with the attached sql script.
So we might be able to incorporate a test case into the regression
suite as well after the fix.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 23, 2016 at 10:05 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
After an intentionally created crash, I get an Assert triggering:
TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
(1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)
freep[0] is zero and bitmapbit is 16.
Here what is happening is that when it tries to clear the bitmapbit,
it expects it to be set. Now, I think the reason for why it didn't
find the bit as set could be that after the new overflow page is added
and the bit corresponding to it is set, you might have crashed the
system and the replay would not have set the bit. Then while freeing
the overflow page it can hit the Assert as mentioned by you. I think
the problem here could be that I am using REGBUF_STANDARD to log the
bitmap page updates which seems to be causing the issue. As bitmap
page doesn't follow the standard page layout, it would have omitted
the actual contents while taking full page image and then during
replay, it would not have set the bit, because page doesn't need REDO.
I think here the fix is to use REGBUF_NO_IMAGE as we use for vm
buffers.
If you can send me the detailed steps for how you have produced the
problem, then I can verify after fixing whether you are seeing the
same problem or something else.
The test is rather awkward; it might be easier to just have me test
it. But I've attached it.
There is a patch that needs to be applied and compiled (alongside your
patches, of course) to inject the crashes, a perl script which creates
the schema and does the updates, and a shell script which sets up the
cluster with the appropriate parameters and then calls the perl script
in a loop.
The top of the shell script has some hard-coded paths to the binaries
and to the test data directory (which is automatically deleted).
I run it like "sh do.sh >& do.err &"
It gives two different types of assertion failures:
$ fgrep TRAP: do.err |sort|uniq -c
21 TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
(1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)
32 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c",
Line: 2506)
The second one is related to the intentional crashes, and so is not
relevant to you.
Cheers,
Jeff
Amit Kapila wrote:
$SUBJECT will make hash indexes reliable and usable on standby.
Nice work.
Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
instead of cramming it in hash.h? Removing the existing #include
"xlogreader.h" from hash.h would be nice. I volunteer for pushing any
preliminary header cleanup commits.
I think it's silly that access/hash.h is the go-to header for hash
usage, including hash_any(). Not this patch's fault, of course, just
venting.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Amit Kapila wrote:
$SUBJECT will make hash indexes reliable and usable on standby.
Nice work.
Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
instead of cramming it in hash.h? Removing the existing #include
"xlogreader.h" from hash.h would be nice. I volunteer for pushing any
preliminary header cleanup commits.
So, what you are expecting here is that we create hash_xlog.h and move
the necessary functions like hash_redo(), hash_desc() and
hash_identify() to it, then do whatever else is required to build it
successfully. Once that is done, I can build my patches on top of it.
Is that right? If yes, I can work on it.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila wrote:
On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
instead of cramming it in hash.h? Removing the existing #include
"xlogreader.h" from hash.h would be nice. I volunteer for pushing any
preliminary header cleanup commits.
So, what you are expecting here is that we create hash_xlog.h and move
necessary functions like hash_redo(), hash_desc() and hash_identify()
to it. Then do whatever else is required to build it successfully.
Once that is done, I can build my patches on top of it. Is that
right? If yes, I can work on it.
Yes, thanks.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 25, 2016 at 6:54 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Amit Kapila wrote:
On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
instead of cramming it in hash.h? Removing the existing #include
"xlogreader.h" from hash.h would be nice. I volunteer for pushing any
preliminary header cleanup commits.
So, what you are expecting here is that we create hash_xlog.h and move
necessary functions like hash_redo(), hash_desc() and hash_identify()
to it. Then do whatever else is required to build it successfully.
Once that is done, I can build my patches on top of it. Is that
right? If yes, I can work on it.
Yes, thanks.
How about the attached? If you want, I think we can go one step
further and move hash_redo to a new file, hash_xlog.c, which is
required for the main patch, but we can leave it for later as well.
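[Editor's note: as a rough sketch, the split header might look
something like the following. The contents are illustrative only, an
assumption about the shape of the cleanup rather than the committed
file; the exact declarations and includes would follow whatever the
cleanup patch ends up moving.]

```c
/* hash_xlog.h -- WAL-related declarations moved out of access/hash.h (sketch) */
#ifndef HASH_XLOG_H
#define HASH_XLOG_H

#include "access/xlogreader.h"
#include "lib/stringinfo.h"

/* redo/desc/identify routines formerly declared in access/hash.h */
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);

#endif   /* HASH_XLOG_H */
```

With this in place, hash.h no longer needs its #include "xlogreader.h",
and only the WAL replay and rmgr description code pulls in hash_xlog.h.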
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
add_hash_xlog-v1.patch
Amit Kapila wrote:
How about attached?
That works; pushed. (I removed a few #includes from the new file.)
If you want, I think we can go one step further and move hash_redo to
a new file hash_xlog.c, which is required for the main patch, but we
can leave it for later as well.
I think that can be a part of the main patch.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services