Fix for parallel BTree initialization bug

Started by Jameson, Hunter 'James', over 5 years ago, 16 messages
#1 Jameson, Hunter 'James'
hunjmes@amazon.com
1 attachment(s)

Hi, I ran across a small (but annoying) bug in initializing parallel BTree scans, which causes the parallel-scan state machine to get confused. The fix is one line; the description is a bit longer—

Before, function _bt_first() would exit immediately if the specified scan keys could never be satisfied--without notifying other parallel workers, if any, that the scan key was done. This moved that particular worker to a scan key beyond what was in the shared parallel-query state, so that it would later try to read in "InvalidBlockNumber", without recognizing it as a special sentinel value.

The basic bug is that the BTree parallel query state machine assumes that a worker process is working on a key <= the global key--a worker process can be behind (i.e., hasn't finished its work on a previous key), but never ahead. By allowing the first worker to move on to the next scan key, in this one case, without notifying other workers, the global key ends up < the first worker's local key.

Symptoms of the bug are: on R/O, we get an error saying we can't extend the index relation, while on an R/W we just extend the index relation by 1 block.

To reproduce, you need a query that:

1. Executes parallel BTree index scan;
2. Has an IN-list of size > 1;
3. Has an additional index filter that makes it impossible to satisfy the
first IN-list condition.

(We encountered such a query, and therefore the bug, on a production instance.)

Thanks,
James

--
James Hunter, Amazon Web Services (AWS)

Attachments:

0001-Fix-initialization-of-parallel-BTree-scan.patch (application/octet-stream)
From b2cc0e805c4b41ece8042432276e456d1e958956 Mon Sep 17 00:00:00 2001
From: James Hunter <hunjmes@amazon.com>
Date: Tue, 8 Sep 2020 15:59:19 +0000
Subject: [PATCH] Fix initialization of parallel BTree scan

Before, function _bt_first() would exit immediately if the specified scan
keys could never be satisfied--without notifying other parallel workers, if
any, that the scan key was done. This moved that particular worker to a
scan key beyond what was in the shared parallel-query state, so that it
would later try to read in "InvalidBlockNumber", without recognizing it as
a special sentinel value.

The basic bug is that the BTree parallel query state machine assumes that a
worker process is working on a key <= the global key--a worker process can
be behind (i.e., hasn't finished its work on a previous key), but never
ahead. By allowing the first worker to move on to the next scan key, in
this one case, without notifying other workers, the global key ends up <
the first worker's local key.

---
 src/backend/access/nbtree/nbtsearch.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 1e628a33d7..8f6575fdf1 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -880,7 +880,11 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * never be satisfied (eg, x == 1 AND x > 2).
 	 */
 	if (!so->qual_ok)
+	{
+		/* Notify any other workers that we're done with this scan key. */
+		_bt_parallel_done(scan);
 		return false;
+	}
 
 	/*
 	 * For parallel scans, get the starting page from shared state. If the
-- 
2.15.3.AMZN

#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Jameson, Hunter 'James' (#1)
Re: Fix for parallel BTree initialization bug

On Tue, Sep 8, 2020 at 11:55 PM Jameson, Hunter 'James'
<hunjmes@amazon.com> wrote:

Hi, I ran across a small (but annoying) bug in initializing parallel BTree scans, which causes the parallel-scan state machine to get confused.

To reproduce, you need a query that:

1. Executes parallel BTree index scan;

2. Has an IN-list of size > 1;

3. Has an additional index filter that makes it impossible to satisfy the first IN-list condition.

(We encountered such a query, and therefore the bug, on a production instance.)

I think I can understand what you are pointing out here but it would
be great if you can have a reproducible test case because that will
make it apparent and we might want to include that in the regression
tests if possible.

--
With Regards,
Amit Kapila.

#3 Jameson, Hunter 'James'
hunjmes@amazon.com
In reply to: Amit Kapila (#2)
Re: Fix for parallel BTree initialization bug

Hi, I spent some time trying to create a repro (other than testing it on the production instance where we encountered the bug), but was unable to create one within a reasonable time.

The tricky part is that the bug symptoms are run-time symptoms -- so not only do you need, first, to satisfy conditions (1), (2), and (3), without the query optimizer optimizing them away! -- but you also need, second, a query that runs long enough for one or more of the parallel workers' state machines to get confused. (This wasn't a problem on the production instance where we encountered the bug and I tested the fix.)

Also, third: passing InvalidBlockNumber to ReadBuffer() generally just appends a new block to the relation, so the bug doesn't even result in an error condition on an R/W instance. (The production instance was R/O...) So the bug, although very small, is annoying!
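
(If you're wondering why that just extends the relation: as far as I can tell, bufmgr.h defines P_NEW as InvalidBlockNumber, so ReadBuffer() treats the bogus block number as a request to add a new page. Roughly, paraphrased from memory -- check your branch for the exact text:)

/* src/include/storage/bufmgr.h (paraphrased) */
#define P_NEW	InvalidBlockNumber	/* grow the file to get a new page */

That's why an R/W instance silently grows the index by one block, while an R/O instance, which can't extend the relation, reports an error instead.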

James

On 9/9/20, 6:14 AM, "Amit Kapila" <amit.kapila16@gmail.com> wrote:

On Tue, Sep 8, 2020 at 11:55 PM Jameson, Hunter 'James'
<hunjmes@amazon.com> wrote:

Hi, I ran across a small (but annoying) bug in initializing parallel BTree scans, which causes the parallel-scan state machine to get confused.

To reproduce, you need a query that:

1. Executes parallel BTree index scan;

2. Has an IN-list of size > 1;

3. Has an additional index filter that makes it impossible to satisfy the first IN-list condition.

(We encountered such a query, and therefore the bug, on a production instance.)

I think I can understand what you are pointing out here but it would
be great if you can have a reproducible test case because that will
make it apparent and we might want to include that in the regression
tests if possible.

--
With Regards,
Amit Kapila.

#4 Jameson, Hunter 'James'
hunjmes@amazon.com
In reply to: Jameson, Hunter 'James' (#3)
Re: Fix for parallel BTree initialization bug

Also, the behavior (=line of code) added by the bug fix is the same as existing code in the same function, _bt_first(), at lines 898, 1096, 1132, 1367. And the calls to _bt_parallel_readpage(), line 903, and _bt_steppage(), line 1416, will also ultimately call _bt_parallel_done(). So the bug seems to be a pretty simple oversight: in 6 out of 7 cases in _bt_first(), we call _bt_parallel_done() before returning "false"; but in the 7th case (fixed in this bug fix), we do not. The fix is to make case #7 the same as the other 6.
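
(For anyone not reading along in the source: _bt_parallel_done() is what publishes "this set of scan keys is finished" to the shared parallel-scan state. Roughly -- abridged and paraphrased from nbtree.c, so treat this as a sketch rather than the exact code:)

void
_bt_parallel_done(IndexScanDesc scan)
{
	BTScanOpaque so = (BTScanOpaque) scan->opaque;
	BTParallelScanDesc btscan;

	/* Nothing to do for non-parallel scans */
	if (scan->parallel_scan == NULL)
		return;

	btscan = ...;	/* locate shared state via scan->parallel_scan */

	/* Mark this set of scan keys as done, unless someone already did */
	SpinLockAcquire(&btscan->btps_mutex);
	if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
		btscan->btps_pageStatus != BTPARALLEL_DONE)
		btscan->btps_pageStatus = BTPARALLEL_DONE;
	SpinLockRelease(&btscan->btps_mutex);

	/* Wake up workers waiting in _bt_parallel_seize() */
	ConditionVariableBroadcast(&btscan->btps_cv);
}

Skipping this call in the !so->qual_ok case is what leaves the shared state short of BTPARALLEL_DONE for that set of keys.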

James

On 9/9/20, 7:11 AM, "Jameson, Hunter 'James'" <hunjmes@amazon.com> wrote:

Hi, I spent some time trying to create a repro (other than testing it on the production instance where we encountered the bug), but was unable to create one within a reasonable time.

The tricky part is that the bug symptoms are run-time symptoms -- so not only do you need, first, to satisfy conditions (1), (2), and (3), without the query optimizer optimizing them away! -- but you also need, second, a query that runs long enough for one or more of the parallel workers' state machines to get confused. (This wasn't a problem on the production instance where we encountered the bug and I tested the fix.)

Also, third: passing InvalidBlockNumber to ReadBuffer() generally just appends a new block to the relation, so the bug doesn't even result in an error condition on an R/W instance. (The production instance was R/O...) So the bug, although very small, is annoying!

James

On 9/9/20, 6:14 AM, "Amit Kapila" <amit.kapila16@gmail.com> wrote:

On Tue, Sep 8, 2020 at 11:55 PM Jameson, Hunter 'James'
<hunjmes@amazon.com> wrote:

Hi, I ran across a small (but annoying) bug in initializing parallel BTree scans, which causes the parallel-scan state machine to get confused.

To reproduce, you need a query that:

1. Executes parallel BTree index scan;

2. Has an IN-list of size > 1;

3. Has an additional index filter that makes it impossible to satisfy the first IN-list condition.

(We encountered such a query, and therefore the bug, on a production instance.)

I think I can understand what you are pointing out here but it would
be great if you can have a reproducible test case because that will
make it apparent and we might want to include that in the regression
tests if possible.

--
With Regards,
Amit Kapila.

#5 Justin Pryzby
pryzby@telsasoft.com
In reply to: Jameson, Hunter 'James' (#1)
Re: Fix for parallel BTree initialization bug

On Tue, Sep 08, 2020 at 06:25:03PM +0000, Jameson, Hunter 'James' wrote:

Hi, I ran across a small (but annoying) bug in initializing parallel BTree scans, which causes the parallel-scan state machine to get confused. The fix is one line; the description is a bit longer—

What postgres version was this ?

Before, function _bt_first() would exit immediately if the specified scan keys could never be satisfied--without notifying other parallel workers, if any, that the scan key was done. This moved that particular worker to a scan key beyond what was in the shared parallel-query state, so that it would later try to read in "InvalidBlockNumber", without recognizing it as a special sentinel value.

The basic bug is that the BTree parallel query state machine assumes that a worker process is working on a key <= the global key--a worker process can be behind (i.e., hasn't finished its work on a previous key), but never ahead. By allowing the first worker to move on to the next scan key, in this one case, without notifying other workers, the global key ends up < the first worker's local key.

Symptoms of the bug are: on R/O, we get an error saying we can't extend the index relation, while on an R/W we just extend the index relation by 1 block.

What's the exact error ? Are you able to provide a backtrace ?

To reproduce, you need a query that:

1. Executes parallel BTree index scan;
2. Has an IN-list of size > 1;

Do you mean you have an index on col1 and a query condition like: col1 IN (a,b,c...) ?

3. Has an additional index filter that makes it impossible to satisfy the
first IN-list condition.

.. AND col1::text||'foo' = '';
I think you mean that the "impossible" condition makes it so that a btree
worker exits early.

(We encountered such a query, and therefore the bug, on a production instance.)

Could you send the "shape" of the query or its plan, obfuscated and redacted as
need be ?

--
Justin

#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Jameson, Hunter 'James' (#1)
Re: Fix for parallel BTree initialization bug

On Tue, Sep 8, 2020 at 11:55 PM Jameson, Hunter 'James'
<hunjmes@amazon.com> wrote:

Hi, I ran across a small (but annoying) bug in initializing parallel BTree scans, which causes the parallel-scan state machine to get confused. The fix is one line; the description is a bit longer—

Before, function _bt_first() would exit immediately if the specified scan keys could never be satisfied--without notifying other parallel workers, if any, that the scan key was done.

The first question that comes to mind is: how is it possible that for
one of the workers the specified scan keys can't be satisfied while
for the others they can? I think it is possible when the other workers
are still working on the previous scan key and this worker has moved
to the next scan key. If not, then what is the other case?

This moved that particular worker to a scan key beyond what was in the shared parallel-query state, so that it would later try to read in "InvalidBlockNumber", without recognizing it as a special sentinel value.

Now, if it happens as I mentioned, then the other workers should not
try to advance their scan, because their local scan key will be less
than the shared key. Basically, they should return from the below
condition:
_bt_parallel_seize()
{
..
if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
{
/* Parallel scan has already advanced to a new set of scankeys. */
status = false;
}
..
}

After this, those workers will also update their scan key and move
forward from there. So, I am not seeing how this could create a
problem.

--
With Regards,
Amit Kapila.

#7 Jameson, Hunter 'James'
hunjmes@amazon.com
In reply to: Amit Kapila (#6)
Re: Fix for parallel BTree initialization bug

Answers inline below, sorry for the formatting-- am still trying to get corporate email to work nicely with this mailing list, thanks.

On 9/9/20, 9:22 PM, "Justin Pryzby" <pryzby@telsasoft.com> wrote:

On Tue, Sep 08, 2020 at 06:25:03PM +0000, Jameson, Hunter 'James' wrote:

Hi, I ran across a small (but annoying) bug in initializing parallel BTree scans, which causes the parallel-scan state machine to get confused. The fix is one line; the description is a bit longer—

What postgres version was this ?

We have observed this bug on PostgreSQL versions 11.x and 10.x. I don't believe it occurs in PostgreSQL versions 9.x, because 9.x does not have parallel BTree scan.

Before, function _bt_first() would exit immediately if the specified scan keys could never be satisfied--without notifying other parallel workers, if any, that the scan key was done. This moved that particular worker to a scan key beyond what was in the shared parallel-query state, so that it would later try to read in "InvalidBlockNumber", without recognizing it as a special sentinel value.

The basic bug is that the BTree parallel query state machine assumes that a worker process is working on a key <= the global key--a worker process can be behind (i.e., hasn't finished its work on a previous key), but never ahead. By allowing the first worker to move on to the next scan key, in this one case, without notifying other workers, the global key ends up < the first worker's local key.

Symptoms of the bug are: on R/O, we get an error saying we can't extend the index relation, while on an R/W we just extend the index relation by 1 block.

What's the exact error ? Are you able to provide a backtrace ?

I am not able to provide a full backtrace, unfortunately, but the relevant part appears to be:

ReadBuffer (... blockNum=blockNum@entry=4294967295)
_bt_getbuf (... blkno=4294967295 ...)
_bt_readnextpage (... blkno=4294967295 ... )
_bt_steppage (...)
_bt_next (...)
btgettuple (...)
index_getnext_tid (...)
index_getnext (...)
IndexNext (...)

Notice that _bt_steppage() is passing InvalidBlockNumber to ReadBuffer(). That is the bug.

To reproduce, you need a query that:

1. Executes parallel BTree index scan;
2. Has an IN-list of size > 1;

Do you mean you have an index on col1 and a query condition like: col1 IN (a,b,c...) ?

Something like that, yes,

3. Has an additional index filter that makes it impossible to satisfy the
first IN-list condition.

.. AND col1::text||'foo' = '';
I think you mean that the "impossible" condition makes it so that a btree
worker exits early.

Specifically, on that worker, _bt_first() sees !so->qual_ok and just returns "false". That is the bug. The fix is that the worker must also call _bt_parallel_done(scan), as is done everywhere else in _bt_first() where it returns "false".

(We encountered such a query, and therefore the bug, on a production instance.)

Could you send the "shape" of the query or its plan, obfuscated and redacted as
need be ?

Plan is something like:

Finalize GroupAggregate ... (... loops=1)
  Group Key: (...)
  ->  Gather Merge ... (... loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial GroupAggregate ... (... loops=3)
              Group Key: (...)
              ->  Sort ... (... loops=3)
                    Sort Key: (...)
                    Sort Method: quicksort ...
                    ->  Nested Loop ... (... loops=3)
                          ->  Parallel Index Scan using ... (... loops=3)
                                Index Cond: (((f ->> 't') >= ... ) AND ((f ->> 't') < ...) AND (((f -> 'c') ->> 't') = ANY (...)) AND (((f -> 'c') ->> 't') = ...))
                                Filter: (CASE WHEN ... END IS NOT NULL)
                                Rows Removed by Filter: ...
                          ->  Index Only Scan using ... (... rows=1 loops=...)
                                Index Cond: (a = b)
                                Heap Fetches: ...

--
Justin

James
--
James Hunter, Amazon Web Services (AWS)

#8 Jameson, Hunter 'James'
hunjmes@amazon.com
In reply to: Jameson, Hunter 'James' (#7)
Re: Fix for parallel BTree initialization bug

Answers inline below:

On 9/10/20, 4:58 AM, "Amit Kapila" <amit.kapila16@gmail.com> wrote:

On Tue, Sep 8, 2020 at 11:55 PM Jameson, Hunter 'James'
<hunjmes@amazon.com> wrote:

Hi, I ran across a small (but annoying) bug in initializing parallel BTree scans, which causes the parallel-scan state machine to get confused. The fix is one line; the description is a bit longer—

Before, function _bt_first() would exit immediately if the specified scan keys could never be satisfied--without notifying other parallel workers, if any, that the scan key was done.

The first question that comes to mind is: how is it possible that for
one of the workers the specified scan keys can't be satisfied while
for the others they can? I think it is possible when the other workers
are still working on the previous scan key and this worker has moved
to the next scan key. If not, then what is the other case?

I think that's right. If I remember correctly, the first to move to the next IN-list condition exits early and *locally* moves on to the next-next IN-list condition, but doesn't properly advance the global scan key. At that point, "By allowing the first worker to move on to the next scan key, in this one case, without notifying other workers, the global key ends up < the first worker's local key." So the first worker now has a local scan key > the global scan key, because it didn't call _bt_parallel_done().

This moved that particular worker to a scan key beyond what was in the shared parallel-query state, so that it would later try to read in "InvalidBlockNumber", without recognizing it as a special sentinel value.

Now, if it happens as I mentioned, then the other workers should not
try to advance their scan, because their local scan key will be less
than the shared key. Basically, they should return from the below
condition:
_bt_parallel_seize()
{
..
if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
{
/* Parallel scan has already advanced to a new set of scankeys. */
status = false;
}
..
}

After this, those workers will also update their scan key and move
forward from there. So, I am not seeing how this could create a
problem.

I think, if I understand my notes on the bug, that the problem is with the first worker, not the other workers. So it doesn't matter if the other workers aren't confused, because the first worker confuses itself. The first worker has moved on, without telling anyone else, basically.
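
If I'm reading nbtree.c correctly, the shared key count only advances once the shared state has actually reached BTPARALLEL_DONE -- which is exactly the notification the first worker skipped. Roughly (abridged and paraphrased; see _bt_parallel_advance_array_keys() for the real thing):

	so->arrayKeyCount++;			/* the local count always advances */

	SpinLockAcquire(&btscan->btps_mutex);
	if (btscan->btps_pageStatus == BTPARALLEL_DONE)
	{
		/* Shared count advances only after someone called _bt_parallel_done() */
		btscan->btps_scanPage = InvalidBlockNumber;
		btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
		btscan->btps_arrayKeyCount++;
	}
	SpinLockRelease(&btscan->btps_mutex);

So if nobody marks the key set done, the shared count stays put while the first worker's local count moves ahead of it.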

--
With Regards,
Amit Kapila.

Thanks,
James
--
James Hunter, Amazon Web Services (AWS)

#9 Justin Pryzby
pryzby@telsasoft.com
In reply to: Jameson, Hunter 'James' (#1)
Re: Fix for parallel BTree initialization bug

Against all odds, I was able to reproduce this.

begin;
CREATE TABLE t AS SELECT generate_series(1,999999)i;
ALTER TABLE t SET (parallel_workers=2, autovacuum_enabled=off);
CREATE INDEX ON t(i);
commit;

SET parallel_leader_participation=off; SET min_parallel_table_scan_size=0; SET enable_bitmapscan=off; SET enable_indexonlyscan=off; SET enable_seqscan=off;
explain(analyze , verbose on) SELECT COUNT(1) FROM t a WHERE a.i>555 AND i IN ( 333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666 ) ORDER BY 1;

Which gives a plan like:
Sort  (cost=5543.71..5543.72 rows=1 width=8)
  Sort Key: (count(1))
  ->  Finalize Aggregate  (cost=5543.69..5543.70 rows=1 width=8)
        ->  Gather  (cost=5543.48..5543.69 rows=2 width=8)
              Workers Planned: 2
              ->  Partial Aggregate  (cost=4543.48..4543.49 rows=1 width=8)
                    ->  Parallel Index Scan using t_i_idx on t a  (cost=0.42..4204.92 rows=135423 width=0)

I don't get an error on a read-only hot standby. I do get inconsistent
results, including on the primary server.

count | 222
count | 214

This appears to be a bug in commit 569174f1b btree: Support parallel index scans.

I've added your patch here:
https://commitfest.postgresql.org/30/2729/

In the course of reproducing this, I also added:
@@ -1972,2 +1975,3 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
        rel = scan->indexRelation;
+       Assert(BlockNumberIsValid(blkno));

--
Justin

#10 Hunter, James
hunjmes@amazon.com
In reply to: Justin Pryzby (#9)
Re: Fix for parallel BTree initialization bug

Nice repro, thanks!
--
James Hunter, Amazon Web Services (AWS)

#11 Amit Kapila
amit.kapila16@gmail.com
In reply to: Justin Pryzby (#9)
1 attachment(s)
Re: Fix for parallel BTree initialization bug

On Fri, Sep 11, 2020 at 8:07 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

Against all odds, I was able to reproduce this.

Thanks, this helps me to understand the problem. So what's going on
here is that once one of the workers has moved to the next set of scan
keys without incrementing the parallel shared key count, the other
workers can try to join the ongoing scan with a different set of keys,
which can lead to the unpredictable behavior seen by both you and
James. In your case, the blocks were scanned twice for the same set of
scan keys, so you got more rows than the scan should actually return;
in James's case, one of the workers changed its scan block to
InvalidBlockNumber (basically the start of the scan) during the scan,
which led to the problem.
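
For reference, the shared state involved here looks roughly like this (abridged from BTParallelScanDescData in nbtree.c; see the source for the exact definition):

typedef struct BTParallelScanDescData
{
	BlockNumber	btps_scanPage;		/* latest or next page to be scanned */
	BTPS_State	btps_pageStatus;	/* NOT_INITIALIZED / ADVANCING / IDLE / DONE */
	int			btps_arrayKeyCount;	/* compared against each worker's local
									 * so->arrayKeyCount */
	slock_t		btps_mutex;			/* protects the fields above */
	ConditionVariable	btps_cv;	/* used to wake up other workers */
} BTParallelScanDescData;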

So the fix provided by James is correct. I have slightly adjusted the
commit message in the attached. It needs to be backpatched till 10
where this feature was introduced.

I have tested this on HEAD. It would be great if you can verify in
back branches as well. I'll also do it before commit.

--
With Regards,
Amit Kapila.

Attachments:

v2-0001-Update-parallel-BTree-scan-state-when-the-scan-ke.patch (application/octet-stream)
From 56c0663e231c46fb16ab31b604e9d33265aeb6b9 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 11 Sep 2020 16:15:03 +0530
Subject: [PATCH v2] Update parallel BTree scan state when the scan keys can't
 be satisfied.

For parallel btree scan to work for array of scan keys, it should reach
BTPARALLEL_DONE state once for every distinct combination of array keys.
This is required to ensure that the parallel workers don't try to seize
blocks at the same time for different scan keys. We missed to update this
state when we discovered that the scan keys can't be satisfied.

Author: James Hunter
Reviewed-by: Amit Kapila
Tested-by: Justin Pryzby
Backpatch-through: 10, where it was introduced
Discussion: https://postgr.es/m/4248CABC-25E3-4809-B4D0-128E1BAABC3C@amazon.com
---
 src/backend/access/nbtree/nbtsearch.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 1e628a33d7..8f6575fdf1 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -880,7 +880,11 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * never be satisfied (eg, x == 1 AND x > 2).
 	 */
 	if (!so->qual_ok)
+	{
+		/* Notify any other workers that we're done with this scan key. */
+		_bt_parallel_done(scan);
 		return false;
+	}
 
 	/*
 	 * For parallel scans, get the starting page from shared state. If the
-- 
2.28.0.windows.1

#12 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#11)
Re: Fix for parallel BTree initialization bug

On Fri, Sep 11, 2020 at 4:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 11, 2020 at 8:07 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

I have tested this on HEAD. It would be great if you can verify in
back branches as well. I'll also do it before commit.

I am planning to push this tomorrow after doing testing on
back-branches. Let me know if you have any comments.

--
With Regards,
Amit Kapila.

#13 Peter Geoghegan
pg@bowt.ie
In reply to: Amit Kapila (#12)
Re: Fix for parallel BTree initialization bug

On Mon, Sep 14, 2020 at 5:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I am planning to push this tomorrow after doing testing on
back-branches. Let me know if you have any comments.

The fix seems sensible to me.

--
Peter Geoghegan

#14 Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Geoghegan (#13)
Re: Fix for parallel BTree initialization bug

On Mon, Sep 14, 2020 at 11:26 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Sep 14, 2020 at 5:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I am planning to push this tomorrow after doing testing on
back-branches. Let me know if you have any comments.

The fix seems sensible to me.

Thanks, I think it is better to wait for a day or two, as we stamped 13
only yesterday and we need to backpatch this.

--
With Regards,
Amit Kapila.

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#14)
Re: Fix for parallel BTree initialization bug

Amit Kapila <amit.kapila16@gmail.com> writes:

On Mon, Sep 14, 2020 at 11:26 PM Peter Geoghegan <pg@bowt.ie> wrote:

The fix seems sensible to me.

Thanks, I think it is better to wait for a day or two, as we stamped 13
only yesterday and we need to backpatch this.

Right, please avoid pushing anything non-critical to REL_13_STABLE
until you see the git tag appear. I doubt we will need to re-wrap
the tarballs, but you never know.

regards, tom lane

#16 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#15)
Re: Fix for parallel BTree initialization bug

On Tue, Sep 15, 2020 at 8:54 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

On Mon, Sep 14, 2020 at 11:26 PM Peter Geoghegan <pg@bowt.ie> wrote:

The fix seems sensible to me.

Thanks, I think it is better to wait for a day or two, as we stamped 13
only yesterday and we need to backpatch this.

Right, please avoid pushing anything non-critical to REL_13_STABLE
until you see the git tag appear. I doubt we will need to re-wrap
the tarballs, but you never know.

Pushed now. Sorry Peter, I forgot to give you reviewer credit in the
commit message but I really appreciate your nod for this patch.

--
With Regards,
Amit Kapila.