Parallel Bitmap scans a bit broken

Started by David Rowleyabout 9 years ago25 messageshackers
Jump to latest
#1David Rowley
dgrowleyml@gmail.com

I was just doing some testing on [1]/messages/by-id/attachment/50164/brin-correlation-v3.patch when I noticed that there's a problem
with parallel bitmap index scans scans.

Test case:

patch with [1]/messages/by-id/attachment/50164/brin-correlation-v3.patch

=# create table r1(value int);
CREATE TABLE
=# insert into r1 select (random()*1000)::int from
generate_Series(1,1000000);
INSERT 0 1000000
=# create index on r1 using brin(value);
CREATE INDEX
=# set enable_seqscan=0;
SET
=# explain select * from r1 where value=555;
QUERY PLAN

-----------------------------------------------------------------------------------------
Gather (cost=3623.52..11267.45 rows=5000 width=4)
Workers Planned: 2
-> Parallel Bitmap Heap Scan on r1 (cost=2623.52..9767.45 rows=2083
width=4)
Recheck Cond: (value = 555)
-> Bitmap Index Scan on r1_value_idx (cost=0.00..2622.27
rows=522036 width=0)
Index Cond: (value = 555)
(6 rows)

=# explain analyze select * from r1 where value=555;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

The crash occurs in tbm_shared_iterate() at:

PagetableEntry *page = &ptbase[idxpages[istate->spageptr]];

I see in tbm_prepare_shared_iterate() tbm->npages is zero. I'm unsure if
bringetbitmap() does something different with npages than btgetbitmap()
around setting npages?

But anyway, due to the npages being 0 the tbm->ptpages is not allocated
in tbm_prepare_shared_iterate()

if (tbm->npages)
{
tbm->ptpages = dsa_allocate(tbm->dsa, sizeof(PTIterationArray) +
tbm->npages * sizeof(int));

so when tbm_shared_iterate runs this code;

/*
* If both chunk and per-page data remain, must output the numerically
* earlier page.
*/
if (istate->schunkptr < istate->nchunks)
{
PagetableEntry *chunk = &ptbase[idxchunks[istate->schunkptr]];
PagetableEntry *page = &ptbase[idxpages[istate->spageptr]];
BlockNumber chunk_blockno;

chunk_blockno = chunk->blockno + istate->schunkbit;

if (istate->spageptr >= istate->npages ||
chunk_blockno < page->blockno)
{
/* Return a lossy page indicator from the chunk */
output->blockno = chunk_blockno;
output->ntuples = -1;
output->recheck = true;
istate->schunkbit++;

LWLockRelease(&istate->lock);
return output;
}
}

it fails, due to idxpages pointing to random memory

Probably this is a simple fix for the authors, so passing it along. I'm a
bit unable to see how the part above is meant to work.

[1]: /messages/by-id/attachment/50164/brin-correlation-v3.patch
/messages/by-id/attachment/50164/brin-correlation-v3.patch

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#2Dilip Kumar
dilipbalaut@gmail.com
In reply to: David Rowley (#1)
Re: Parallel Bitmap scans a bit broken

On Thu, Mar 9, 2017 at 9:17 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

patch with [1]

=# create table r1(value int);
CREATE TABLE
=# insert into r1 select (random()*1000)::int from
generate_Series(1,1000000);
INSERT 0 1000000
=# create index on r1 using brin(value);
CREATE INDEX
=# set enable_seqscan=0;
SET
=# explain select * from r1 where value=555;

I am looking into the issue, I have already reproduced it. I will
update on this soon.

Thanks for reporting.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#2)
Re: Parallel Bitmap scans a bit broken

On Thu, Mar 9, 2017 at 9:37 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

=# create table r1(value int);
CREATE TABLE
=# insert into r1 select (random()*1000)::int from
generate_Series(1,1000000);
INSERT 0 1000000
=# create index on r1 using brin(value);
CREATE INDEX
=# set enable_seqscan=0;
SET
=# explain select * from r1 where value=555;

I am looking into the issue, I have already reproduced it. I will
update on this soon.

Thanks for reporting.

I slightly modified your query to reproduce this issue.

explain analyze select * from r1 where value<555;

Patch is attached to fix the problem.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_bitmap_fix.patchapplication/octet-stream; name=parallel_bitmap_fix.patchDownload+1-2
#4Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#3)
Re: Parallel Bitmap scans a bit broken

On Thu, Mar 9, 2017 at 10:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I slightly modified your query to reproduce this issue.

explain analyze select * from r1 where value<555;

Patch is attached to fix the problem.

I forgot to mention the cause of the problem.

if (istate->schunkptr < istate->nchunks)
{
PagetableEntry *chunk = &ptbase[idxchunks[istate->schunkptr]];
PagetableEntry *page = &ptbase[idxpages[istate->spageptr]];
BlockNumber chunk_blockno;

In above if condition we have only checked istate->schunkptr <
istate->nchunks that means we have some chunk left so we are safe to
access idxchunks, But just after that we are accessing
ptbase[idxpages[istate->spageptr]] without checking that accessing
idxpages is safe or not.

tbm_iterator already handling this case, I broke it in tbm_shared_iterator.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#4)
Re: Parallel Bitmap scans a bit broken

On Thu, Mar 9, 2017 at 11:50 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Mar 9, 2017 at 10:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I slightly modified your query to reproduce this issue.

explain analyze select * from r1 where value<555;

Patch is attached to fix the problem.

I forgot to mention the cause of the problem.

if (istate->schunkptr < istate->nchunks)
{
PagetableEntry *chunk = &ptbase[idxchunks[istate->schunkptr]];
PagetableEntry *page = &ptbase[idxpages[istate->spageptr]];
BlockNumber chunk_blockno;

In above if condition we have only checked istate->schunkptr <
istate->nchunks that means we have some chunk left so we are safe to
access idxchunks, But just after that we are accessing
ptbase[idxpages[istate->spageptr]] without checking that accessing
idxpages is safe or not.

tbm_iterator already handling this case, I broke it in tbm_shared_iterator.

I don't know if this is the only problem -- it would be good if David
could retest -- but it's certainly *a* problem, so committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6David Rowley
dgrowleyml@gmail.com
In reply to: Robert Haas (#5)
Re: Parallel Bitmap scans a bit broken

On 10 March 2017 at 06:17, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 9, 2017 at 11:50 AM, Dilip Kumar <dilipbalaut@gmail.com>
wrote:

On Thu, Mar 9, 2017 at 10:02 PM, Dilip Kumar <dilipbalaut@gmail.com>

wrote:

I slightly modified your query to reproduce this issue.

explain analyze select * from r1 where value<555;

Patch is attached to fix the problem.

I forgot to mention the cause of the problem.

if (istate->schunkptr < istate->nchunks)
{
PagetableEntry *chunk = &ptbase[idxchunks[istate->schunkptr]];
PagetableEntry *page = &ptbase[idxpages[istate->spageptr]];
BlockNumber chunk_blockno;

In above if condition we have only checked istate->schunkptr <
istate->nchunks that means we have some chunk left so we are safe to
access idxchunks, But just after that we are accessing
ptbase[idxpages[istate->spageptr]] without checking that accessing
idxpages is safe or not.

tbm_iterator already handling this case, I broke it in

tbm_shared_iterator.

I don't know if this is the only problem -- it would be good if David
could retest -- but it's certainly *a* problem, so committed.

Thanks for committing, and generally parallelising more stuff.

I confirm that my test case is now working again.

I'll be in this general area today, so will mention if I stumble over
anything that looks broken.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7Emre Hasegeli
emre@hasegeli.com
In reply to: Robert Haas (#5)
Re: Parallel Bitmap scans a bit broken

I don't know if this is the only problem

I'll be in this general area today, so will mention if I stumble over
anything that looks broken.

I was testing the same patch with a large dataset and got a different segfault:

hasegeli=# explain select * from only mp_notification_20170225 where server_id = 7;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Gather (cost=26682.94..476995.88 rows=1 width=215)
Workers Planned: 2
-> Parallel Bitmap Heap Scan on mp_notification_20170225 (cost=25682.94..475995.78 rows=1 width=215)
Recheck Cond: (server_id = 7)
-> Bitmap Index Scan on mp_notification_block_idx (cost=0.00..25682.94 rows=4557665 width=0)
Index Cond: (server_id = 7)
(6 rows)

hasegeli=# select * from only mp_notification_20170225 where server_id = 7;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

* thread #1: tid = 0x5045a8f, 0x000000010ae44558 postgres`brin_deform_tuple(brdesc=0x00007fea3c86a3a8, tuple=0x00007fea3c891040) + 40 at brin_tuple.c:414, queue = 'com.apple.main-thread', stop reason = signal SIGUSR1
* frame #0: 0x000000010ae44558 postgres`brin_deform_tuple(brdesc=0x00007fea3c86a3a8, tuple=0x00007fea3c891040) + 40 at brin_tuple.c:414 [opt]
frame #1: 0x000000010ae4000c postgres`bringetbitmap(scan=0x00007fea3c875c20, tbm=<unavailable>) + 428 at brin.c:398 [opt]
frame #2: 0x000000010ae9b451 postgres`index_getbitmap(scan=0x00007fea3c875c20, bitmap=<unavailable>) + 65 at indexam.c:726 [opt]
frame #3: 0x000000010b0035a9 postgres`MultiExecBitmapIndexScan(node=<unavailable>) + 233 at nodeBitmapIndexscan.c:91 [opt]
frame #4: 0x000000010b002840 postgres`BitmapHeapNext(node=<unavailable>) + 400 at nodeBitmapHeapscan.c:143 [opt]
frame #5: 0x000000010afef5d0 postgres`ExecProcNode(node=0x00007fea3c873948) + 224 at execProcnode.c:459 [opt]
frame #6: 0x000000010b004cc9 postgres`ExecGather [inlined] gather_getnext(gatherstate=<unavailable>) + 520 at nodeGather.c:276 [opt]
frame #7: 0x000000010b004ac1 postgres`ExecGather(node=<unavailable>) + 497 at nodeGather.c:212 [opt]
frame #8: 0x000000010afef6b2 postgres`ExecProcNode(node=0x00007fea3c872f58) + 450 at execProcnode.c:541 [opt]
frame #9: 0x000000010afeaf90 postgres`standard_ExecutorRun [inlined] ExecutePlan(estate=<unavailable>, planstate=<unavailable>, use_parallel_mode=<unavailable>, operation=<unavailable>, numberTuples=0, direction=<unavailable>, dest=<unavailable>) + 29 at execMain.c:1616 [opt]
frame #10: 0x000000010afeaf73 postgres`standard_ExecutorRun(queryDesc=<unavailable>, direction=<unavailable>, count=0) + 291 at execMain.c:348 [opt]
frame #11: 0x000000010af8b108 postgres`ExplainOnePlan(plannedstmt=0x00007fea3c871040, into=0x0000000000000000, es=0x00007fea3c805360, queryString=<unavailable>, params=<unavailable>, planduration=<unavailable>) + 328 at explain.c:533 [opt]
frame #12: 0x000000010af8ab98 postgres`ExplainOneQuery(query=0x00007fea3c805890, cursorOptions=<unavailable>, into=0x0000000000000000, es=0x00007fea3c805360, queryString=<unavailable>,params=0x0000000000000000) + 280 at explain.c:369 [opt]
frame #13: 0x000000010af8a773 postgres`ExplainQuery(pstate=<unavailable>, stmt=0x00007fea3d005450, queryString="explain analyze select * from only mp_notification_20170225 where server_id > 6;",params=0x0000000000000000, dest=0x00007fea3c8052c8) + 819 at explain.c:254 [opt]
frame #14: 0x000000010b13b660 postgres`standard_ProcessUtility(pstmt=0x00007fea3d005fa8, queryString="explain analyze select * from only mp_notification_20170225 where server_id > 6;",context=PROCESS_UTILITY_TOPLEVEL, params=0x0000000000000000, dest=0x00007fea3c8052c8, completionTag=<unavailable>) + 1104 at utility.c:675 [opt]
frame #15: 0x000000010b13ad2a postgres`PortalRunUtility(portal=0x00007fea3c837640, pstmt=0x00007fea3d005fa8, isTopLevel='\x01', setHoldSnapshot=<unavailable>, dest=0x00007fea3c8052c8, completionTag=<unavailable>) + 90 at pquery.c:1165 [opt]
frame #16: 0x000000010b139f56 postgres`FillPortalStore(portal=0x00007fea3c837640, isTopLevel='\x01') + 182 at pquery.c:1025 [opt]
frame #17: 0x000000010b139c22 postgres`PortalRun(portal=0x00007fea3c837640, count=<unavailable>, isTopLevel='\x01', dest=<unavailable>, altdest=<unavailable>, completionTag=<unavailable>) + 402 at pquery.c:757 [opt]
frame #18: 0x000000010b13789b postgres`PostgresMain + 44 at postgres.c:1101 [opt]
frame #19: 0x000000010b13786f postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, dbname=<unavailable>, username=<unavailable>) + 8927 at postgres.c:4066 [opt]
frame #20: 0x000000010b0ba113 postgres`PostmasterMain [inlined] BackendRun + 7587 at postmaster.c:4317 [opt]
frame #21: 0x000000010b0ba0e8 postgres`PostmasterMain [inlined] BackendStartup at postmaster.c:3989 [opt]
frame #22: 0x000000010b0ba0e8 postgres`PostmasterMain at postmaster.c:1729 [opt]
frame #23: 0x000000010b0ba0e8 postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) + 7544 at postmaster.c:1337 [opt]
frame #24: 0x000000010b0332af postgres`main(argc=<unavailable>, argv=<unavailable>) + 1567 at main.c:228 [opt]
frame #25: 0x00007fffb4e28255 libdyld.dylib`start + 1

I can try to provide a test case, if that wouldn't be enough to spot
the problem.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Dilip Kumar
dilipbalaut@gmail.com
In reply to: Emre Hasegeli (#7)
Re: Parallel Bitmap scans a bit broken

On Wed, Mar 15, 2017 at 8:11 PM, Emre Hasegeli <emre@hasegeli.com> wrote:

I can try to provide a test case, if that wouldn't be enough to spot
the problem.

Thanks for reporting, I am looking into this. Meanwhile, if you can
provide the reproducible test case then locating the issue will be
faster.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#8)
Re: Parallel Bitmap scans a bit broken

On Wed, Mar 15, 2017 at 8:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I can try to provide a test case, if that wouldn't be enough to spot
the problem.

Thanks for reporting, I am looking into this. Meanwhile, if you can
provide the reproducible test case then locating the issue will be
faster.

After trying multiple attempts with different datasets I am unable to
reproduce the issue.

I tried with below test case:
create table t(a int, b varchar);
insert into t values(generate_series(1,10000000), repeat('x', 100));
insert into t values(generate_series(1,100000000), repeat('x', 100));
create index idx on t using brin(a);
postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where a>6;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=580794.52..3059826.52 rows=110414922 width=105) (actual
time=92.324..91853.716 rows=110425971 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Bitmap Heap Scan on t (cost=579794.52..3058826.52
rows=46006218 width=105) (actual time=65.651..62023.020 rows=36808657
loops=3)
Recheck Cond: (a > 6)
Rows Removed by Index Recheck: 4
Heap Blocks: lossy=204401
-> Bitmap Index Scan on idx (cost=0.00..552190.79
rows=110425920 width=0) (actual time=88.215..88.215 rows=19040000
loops=1)
Index Cond: (a > 6)
Planning time: 1.116 ms
Execution time: 96176.881 ms
(11 rows)

Is it possible for you to provide a reproducible test case? I also
applied the patch given up thread[1]/messages/by-id/attachment/50164/brin-correlation-v3.patch but still could not reproduce.

[1]: /messages/by-id/attachment/50164/brin-correlation-v3.patch

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Dilip Kumar
dilipbalaut@gmail.com
In reply to: Emre Hasegeli (#7)
Re: Parallel Bitmap scans a bit broken

On Wed, Mar 15, 2017 at 8:11 PM, Emre Hasegeli <emre@hasegeli.com> wrote:

* thread #1: tid = 0x5045a8f, 0x000000010ae44558 postgres`brin_deform_tuple(brdesc=0x00007fea3c86a3a8, tuple=0x00007fea3c891040) + 40 at brin_tuple.c:414, queue = 'com.apple.main-thread', stop reason = signal SIGUSR1
* frame #0: 0x000000010ae44558 postgres`brin_deform_tuple(brdesc=0x00007fea3c86a3a8, tuple=0x00007fea3c891040) + 40 at brin_tuple.c:414 [opt]
frame #1: 0x000000010ae4000c postgres`bringetbitmap(scan=0x00007fea3c875c20, tbm=<unavailable>) + 428 at brin.c:398 [opt]
frame #2: 0x000000010ae9b451 postgres`index_getbitmap(scan=0x00007fea3c875c20, bitmap=<unavailable>) + 65 at indexam.c:726 [opt]
frame #3: 0x000000010b0035a9 postgres`MultiExecBitmapIndexScan(node=<unavailable>) + 233 at nodeBitmapIndexscan.c:91 [opt]
frame #4: 0x000000010b002840 postgres`BitmapHeapNext(node=<unavailable>) + 400 at nodeBitmapHeapscan.c:143 [opt]

Further analyzing the call stack, seems like this is not exact call
stack where it crashed. Because, if you notice the code in the
brin_deform_tuple (line 414)

brin_deform_tuple(BrinDesc *brdesc, BrinTuple *tuple)

{
dtup = brin_new_memtuple(brdesc);

if (BrinTupleIsPlaceholder(tuple))
dtup->bt_placeholder = true;
dtup->bt_blkno = tuple->bt_blkno; --> line 414

This can crash at line:414, if either tuple is invalid memory(but I
think it's not because we have already accessed this memory in above
if check) or dtup is invalid (this is also not possible because
brin_new_memtuple has already accessed this).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Emre Hasegeli
emre@hasegeli.com
In reply to: Dilip Kumar (#10)
Re: Parallel Bitmap scans a bit broken

This can crash at line:414, if either tuple is invalid memory(but I
think it's not because we have already accessed this memory in above
if check) or dtup is invalid (this is also not possible because
brin_new_memtuple has already accessed this).

I was testing with the brin correlation patch [1]/messages/by-id/attachment/50164/brin-correlation-v3.patch applied. I cannot
crash it without the patch either. I am sorry for not testing it
before. The patch make BRIN selectivity estimation function access
more information.

[1]: /messages/by-id/attachment/50164/brin-correlation-v3.patch

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Dilip Kumar
dilipbalaut@gmail.com
In reply to: Emre Hasegeli (#11)
Re: Parallel Bitmap scans a bit broken

On Wed, Mar 15, 2017 at 10:02 PM, Emre Hasegeli <emre@hasegeli.com> wrote:

I was testing with the brin correlation patch [1] applied. I cannot
crash it without the patch either. I am sorry for not testing it
before. The patch make BRIN selectivity estimation function access
more information.

[1] /messages/by-id/attachment/50164/brin-correlation-v3.patch

With my test case, I could not crash even with this patch applied.
Can you provide your test case?
(table, index, data, query)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Emre Hasegeli
emre@hasegeli.com
In reply to: Dilip Kumar (#12)
Re: Parallel Bitmap scans a bit broken

With my test case, I could not crash even with this patch applied.
Can you provide your test case?

Yes:

hasegeli=# create table r2 as select (random() * 3)::int as i from generate_series(1, 1000000);
SELECT 1000000
hasegeli=# create index on r2 using brin (i);
CREATE INDEX
hasegeli=# analyze r2;
ANALYZE
hasegeli=# explain select * from only r2 where i = 10;
QUERY PLAN
-------------------------------------------------------------------------------------
Gather (cost=2867.50..9225.32 rows=1 width=4)
Workers Planned: 2
-> Parallel Bitmap Heap Scan on r2 (cost=1867.50..8225.22 rows=1 width=4)
Recheck Cond: (i = 10)
-> Bitmap Index Scan on r2_i_idx (cost=0.00..1867.50 rows=371082 width=0)
Index Cond: (i = 10)
(6 rows)

hasegeli=# select * from only r2 where i = 10;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Dilip Kumar
dilipbalaut@gmail.com
In reply to: Emre Hasegeli (#13)
Re: Parallel Bitmap scans a bit broken

On Wed, Mar 15, 2017 at 10:21 PM, Emre Hasegeli <emre@hasegeli.com> wrote:

hasegeli=# create table r2 as select (random() * 3)::int as i from generate_series(1, 1000000);
SELECT 1000000
hasegeli=# create index on r2 using brin (i);
CREATE INDEX
hasegeli=# analyze r2;
ANALYZE
hasegeli=# explain select * from only r2 where i = 10;
QUERY PLAN
-------------------------------------------------------------------------------------
Gather (cost=2867.50..9225.32 rows=1 width=4)
Workers Planned: 2
-> Parallel Bitmap Heap Scan on r2 (cost=1867.50..8225.22 rows=1 width=4)
Recheck Cond: (i = 10)
-> Bitmap Index Scan on r2_i_idx (cost=0.00..1867.50 rows=371082 width=0)
Index Cond: (i = 10)
(6 rows)

hasegeli=# select * from only r2 where i = 10;

I am able to reproduce the bug, and attached patch fixes the same.
Problem is that I am not handling TBM_EMPTY state properly. I
remember that while reviewing the patch Robert mentioned that we might
need to handle the TBM_EMPTY and I told that since we are not handling
in non-parallel mode so we don't need to handle here as well. But, I
was wrong. So the problem is that if state is not TBM_HASH then it's
directly assuming TBM_ONE_PAGE which is completely wrong. I have
fixed that and also fixed in other similar locations.

Please verify the fix.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_tbm_empty.patchapplication/octet-stream; name=fix_tbm_empty.patchDownload+12-8
#15Emre Hasegeli
emre@hasegeli.com
In reply to: Dilip Kumar (#14)
Re: Parallel Bitmap scans a bit broken

Please verify the fix.

The same test with both of the patches applied still crashes for me.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Dilip Kumar
dilipbalaut@gmail.com
In reply to: Emre Hasegeli (#15)
Re: Parallel Bitmap scans a bit broken

On Thu, Mar 16, 2017 at 12:56 AM, Emre Hasegeli <emre@hasegeli.com> wrote:

Please verify the fix.

The same test with both of the patches applied still crashes for me.

After above fix, I am not able to reproduce. Can you give me the
backtrace of the crash location or the dump?

I am trying on the below commit

commit c5832346625af4193b1242e57e7d13e66a220b38
Author: Stephen Frost <sfrost@snowman.net>
Date: Wed Mar 15 11:19:39 2017 -0400

+ https://www.postgresql.org/message-id/attachment/50164/brin-correlation-v3.patch
+ fix_tbm_empty.patch

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#16)
Re: Parallel Bitmap scans a bit broken

On Thu, Mar 16, 2017 at 5:02 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

After above fix, I am not able to reproduce. Can you give me the
backtrace of the crash location or the dump?

I am trying on the below commit

commit c5832346625af4193b1242e57e7d13e66a220b38
Author: Stephen Frost <sfrost@snowman.net>
Date: Wed Mar 15 11:19:39 2017 -0400

+ https://www.postgresql.org/message-id/attachment/50164/brin-correlation-v3.patch
+ fix_tbm_empty.patch

Forgot to mention after fix I am seeing this output.

postgres=# explain analyze select * from only r2 where i = 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Gather (cost=2880.56..9251.98 rows=1 width=4) (actual
time=3.857..3.857 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Bitmap Heap Scan on r2 (cost=1880.56..8251.88 rows=1
width=4) (actual time=0.043..0.043 rows=0 loops=3)
Recheck Cond: (i = 10)
-> Bitmap Index Scan on r2_i_idx (cost=0.00..1880.56
rows=373694 width=0) (actual time=0.052..0.052 rows=0 loops=1)
Index Cond: (i = 10)
Planning time: 0.111 ms
Execution time: 4.449 ms
(9 rows)

postgres=# select * from only r2 where i = 10;
i
---
(0 rows)

Are you getting the crash with the same test case?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Emre Hasegeli
emre@hasegeli.com
In reply to: Dilip Kumar (#17)
Re: Parallel Bitmap scans a bit broken

Are you getting the crash with the same test case?

Yes. Here is the new backtrace:

* thread #1: tid = 0x51828fd, 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_write_u32_impl(val=0) at generic.h:57, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
* frame #0: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_write_u32_impl(val=0) at generic.h:57 [opt]
frame #1: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32_impl(val_=0) at generic.h:163 [opt]
frame #2: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32(val=0) + 17 at atomics.h:237 [opt]
frame #3: 0x0000000100caf303 postgres`tbm_prepare_shared_iterate(tbm=<unavailable>) + 723 at tidbitmap.c:875 [opt]
frame #4: 0x0000000100c74844 postgres`BitmapHeapNext(node=<unavailable>) + 436 at nodeBitmapHeapscan.c:154 [opt]
frame #5: 0x0000000100c615b0 postgres`ExecProcNode(node=0x00007fdabf8189f0) + 224 at execProcnode.c:459 [opt]
frame #6: 0x0000000100c76ca9 postgres`ExecGather [inlined] gather_getnext(gatherstate=<unavailable>) + 520 at nodeGather.c:276 [opt]
frame #7: 0x0000000100c76aa1 postgres`ExecGather(node=<unavailable>) + 497 at nodeGather.c:212 [opt]
frame #8: 0x0000000100c61692 postgres`ExecProcNode(node=0x00007fdabf818558) + 450 at execProcnode.c:541 [opt]
frame #9: 0x0000000100c5cf70 postgres`standard_ExecutorRun [inlined] ExecutePlan(estate=<unavailable>, planstate=<unavailable>, use_parallel_mode=<unavailable>, operation=<unavailable>, numberTuples=0, direction=<unavailable>, dest=<unavailable>) + 29 at execMain.c:1616 [opt]
frame #10: 0x0000000100c5cf53 postgres`standard_ExecutorRun(queryDesc=<unavailable>, direction=<unavailable>, count=0) + 291 at execMain.c:348 [opt]
frame #11: 0x0000000100dac0df postgres`PortalRunSelect(portal=0x00007fdac000b240, forward=<unavailable>, count=0, dest=<unavailable>) + 255 at pquery.c:921 [opt]
frame #12: 0x0000000100dabc84 postgres`PortalRun(portal=0x00007fdac000b240, count=<unavailable>, isTopLevel='\x01', dest=<unavailable>, altdest=<unavailable>, completionTag=<unavailable>) + 500 at pquery.c:762 [opt]
frame #13: 0x0000000100da989b postgres`PostgresMain + 44 at postgres.c:1101 [opt]
frame #14: 0x0000000100da986f postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, dbname=<unavailable>, username=<unavailable>) + 8927 at postgres.c:4066 [opt]
frame #15: 0x0000000100d2c113 postgres`PostmasterMain [inlined] BackendRun + 7587 at postmaster.c:4317 [opt]
frame #16: 0x0000000100d2c0e8 postgres`PostmasterMain [inlined] BackendStartup at postmaster.c:3989 [opt]
frame #17: 0x0000000100d2c0e8 postgres`PostmasterMain at postmaster.c:1729 [opt]
frame #18: 0x0000000100d2c0e8 postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) + 7544 at postmaster.c:1337 [opt]
frame #19: 0x0000000100ca528f postgres`main(argc=<unavailable>, argv=<unavailable>) + 1567 at main.c:228 [opt]
frame #20: 0x00007fffb4e28255 libdyld.dylib`start + 1
frame #21: 0x00007fffb4e28255 libdyld.dylib`start + 1

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Dilip Kumar
dilipbalaut@gmail.com
In reply to: Emre Hasegeli (#18)
Re: Parallel Bitmap scans a bit broken

On Thu, Mar 16, 2017 at 3:52 PM, Emre Hasegeli <emre@hasegeli.com> wrote:

* thread #1: tid = 0x51828fd, 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_write_u32_impl(val=0) at generic.h:57, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
* frame #0: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_write_u32_impl(val=0) at generic.h:57 [opt]
frame #1: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32_impl(val_=0) at generic.h:163 [opt]
frame #2: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32(val=0) + 17 at atomics.h:237 [opt]

By looking at the call stack I got the problem location. I am
reviewing other parts of the code if there are the similar mistake at
other places. Soon I will post the patch. Thanks for the help.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#19)
Re: Parallel Bitmap scans a bit broken

On Thu, Mar 16, 2017 at 5:14 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

pg_atomic_write_u32_impl(val=0) at generic.h:57, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)

* frame #0: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_write_u32_impl(val=0) at generic.h:57 [opt]
frame #1: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32_impl(val_=0) at generic.h:163 [opt]
frame #2: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32(val=0) + 17 at atomics.h:237 [opt]

By looking at the call stack I got the problem location. I am
reviewing other parts of the code if there are the similar mistake at
other places. Soon I will post the patch. Thanks for the help.

Based on the call stack I have tried to fix the issue. The problem is
there was some uninitialized pointer access (in some special cases
i.e. TBM_EMPTY when pagetable is not created at all).

fix_tbm_empty.patch have fixed some of them but induced one which you
are seeing in your call stack.

Hopefully, this time I got it correct. Since I am unable to reproduce
the issue so I will again need your help in verifying the fix.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_tbm_empty_v2.patchapplication/octet-stream; name=fix_tbm_empty_v2.patchDownload+29-14
#21Emre Hasegeli
emre@hasegeli.com
In reply to: Dilip Kumar (#20)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Emre Hasegeli (#21)
#23Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#22)
#24Dilip Kumar
dilipbalaut@gmail.com
In reply to: Emre Hasegeli (#21)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#23)