pg_stat_progress_create_index vs. parallel index builds
Hi,
While experimenting with parallel index builds, I've noticed a somewhat
strange behavior of pg_stat_progress_create_index when a btree index is
built with parallel workers - some of the phases seem to be missing.
In serial (no parallelism) mode, the progress is roughly this (it's
always the first/last timestamp of each phase):
| command | phase
-------------+--------------+----------------------------------------
12:56:01 AM | CREATE INDEX | building index: scanning table
...
01:06:22 AM | CREATE INDEX | building index: scanning table
01:06:23 AM | CREATE INDEX | building index: sorting live tuples
...
01:13:10 AM | CREATE INDEX | building index: sorting live tuples
01:13:11 AM | CREATE INDEX | building index: loading tuples in tree
...
01:24:02 AM | CREATE INDEX | building index: loading tuples in tree
So it goes through three phases:
1) scanning tuples
2) sorting live tuples
3) loading tuples in tree
But with parallel build index build, it changes to:
| command | phase
-------------+--------------+----------------------------------------
11:40:48 AM | CREATE INDEX | building index: scanning table
...
11:47:24 AM | CREATE INDEX | building index: scanning table (scan
complete)
11:56:22 AM | CREATE INDEX | building index: scanning table
11:56:23 AM | CREATE INDEX | building index: loading tuples in tree
...
12:05:33 PM | CREATE INDEX | building index: loading tuples in tree
That is, the "sorting live tuples" phase disappeared, and instead it
seems to be counted in the "scanning table" one, as if there was an
update of the phase missing.
I've only tried this on master, but I assume it behaves like this in the
older releases too. I wonder if this is intentional - it sure is a bit
misleading.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 2 Jun 2021 at 13:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,
While experimenting with parallel index builds, I've noticed a somewhat
strange behavior of pg_stat_progress_create_index when a btree index is
built with parallel workers - some of the phases seem to be missing.In serial (no parallelism) mode, the progress is roughly this (it's
always the first/last timestamp of each phase):| command | phase
-------------+--------------+----------------------------------------
12:56:01 AM | CREATE INDEX | building index: scanning table
...
01:06:22 AM | CREATE INDEX | building index: scanning table
01:06:23 AM | CREATE INDEX | building index: sorting live tuples
...
01:13:10 AM | CREATE INDEX | building index: sorting live tuples
01:13:11 AM | CREATE INDEX | building index: loading tuples in tree
...
01:24:02 AM | CREATE INDEX | building index: loading tuples in treeSo it goes through three phases:
1) scanning tuples
2) sorting live tuples
3) loading tuples in treeBut with parallel build index build, it changes to:
| command | phase
-------------+--------------+----------------------------------------
11:40:48 AM | CREATE INDEX | building index: scanning table
...
11:47:24 AM | CREATE INDEX | building index: scanning table (scan
complete)
11:56:22 AM | CREATE INDEX | building index: scanning table
11:56:23 AM | CREATE INDEX | building index: loading tuples in tree
...
12:05:33 PM | CREATE INDEX | building index: loading tuples in treeThat is, the "sorting live tuples" phase disappeared, and instead it
seems to be counted in the "scanning table" one, as if there was an
update of the phase missing.
I've only tried this on master, but I assume it behaves like this in the
older releases too. I wonder if this is intentional - it sure is a bit
misleading.
This was a suprise to me as well. According to documentation in
sortsupport.h (line 125-129) the parallel workers produce pre-sorted
segments during the scanning phase, which are subsequently merged by
the leader. This might mean that the 'sorting' phase is already
finished during the 'scanning' phase by waiting for the parallel
workers; I haven't looked further if this is the case and whether it
could be changed to also produce the sorting metrics, but seeing as it
is part of the parallel workers API of tuplesort, I think fixing it in
current releases is going to be difficult.
With regards,
Matthias van de Meent
On 6/2/21 3:03 PM, Matthias van de Meent wrote:
On Wed, 2 Jun 2021 at 13:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,
While experimenting with parallel index builds, I've noticed a somewhat
strange behavior of pg_stat_progress_create_index when a btree index is
built with parallel workers - some of the phases seem to be missing.In serial (no parallelism) mode, the progress is roughly this (it's
always the first/last timestamp of each phase):| command | phase
-------------+--------------+----------------------------------------
12:56:01 AM | CREATE INDEX | building index: scanning table
...
01:06:22 AM | CREATE INDEX | building index: scanning table
01:06:23 AM | CREATE INDEX | building index: sorting live tuples
...
01:13:10 AM | CREATE INDEX | building index: sorting live tuples
01:13:11 AM | CREATE INDEX | building index: loading tuples in tree
...
01:24:02 AM | CREATE INDEX | building index: loading tuples in treeSo it goes through three phases:
1) scanning tuples
2) sorting live tuples
3) loading tuples in treeBut with parallel build index build, it changes to:
| command | phase
-------------+--------------+----------------------------------------
11:40:48 AM | CREATE INDEX | building index: scanning table
...
11:47:24 AM | CREATE INDEX | building index: scanning table (scan
complete)
11:56:22 AM | CREATE INDEX | building index: scanning table
11:56:23 AM | CREATE INDEX | building index: loading tuples in tree
...
12:05:33 PM | CREATE INDEX | building index: loading tuples in treeThat is, the "sorting live tuples" phase disappeared, and instead it
seems to be counted in the "scanning table" one, as if there was an
update of the phase missing.I've only tried this on master, but I assume it behaves like this in the
older releases too. I wonder if this is intentional - it sure is a bit
misleading.This was a suprise to me as well. According to documentation in
sortsupport.h (line 125-129) the parallel workers produce pre-sorted
segments during the scanning phase, which are subsequently merged by
the leader. This might mean that the 'sorting' phase is already
finished during the 'scanning' phase by waiting for the parallel
workers; I haven't looked further if this is the case and whether it
could be changed to also produce the sorting metrics, but seeing as it
is part of the parallel workers API of tuplesort, I think fixing it in
current releases is going to be difficult.
Maybe. Perhaps it's more complicated to decide when to switch between
phases with parallel workers. Still, the table scan is done after ~8
minutes (based on blocks_total vs. blocks_done), yet we keep that phase
for another ~9 minutes. It seems this is where the workers do the sort,
so "sorting live tuples" seems like a more natural phase for this.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 2 Jun 2021 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 6/2/21 3:03 PM, Matthias van de Meent wrote:
On Wed, 2 Jun 2021 at 13:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,
While experimenting with parallel index builds, I've noticed a somewhat
strange behavior of pg_stat_progress_create_index when a btree index is
built with parallel workers - some of the phases seem to be missing.In serial (no parallelism) mode, the progress is roughly this (it's
always the first/last timestamp of each phase):| command | phase
-------------+--------------+----------------------------------------
12:56:01 AM | CREATE INDEX | building index: scanning table
...
01:06:22 AM | CREATE INDEX | building index: scanning table
01:06:23 AM | CREATE INDEX | building index: sorting live tuples
...
01:13:10 AM | CREATE INDEX | building index: sorting live tuples
01:13:11 AM | CREATE INDEX | building index: loading tuples in tree
...
01:24:02 AM | CREATE INDEX | building index: loading tuples in treeSo it goes through three phases:
1) scanning tuples
2) sorting live tuples
3) loading tuples in treeBut with parallel build index build, it changes to:
| command | phase
-------------+--------------+----------------------------------------
11:40:48 AM | CREATE INDEX | building index: scanning table
...
11:47:24 AM | CREATE INDEX | building index: scanning table (scan
complete)
11:56:22 AM | CREATE INDEX | building index: scanning table
11:56:23 AM | CREATE INDEX | building index: loading tuples in tree
...
12:05:33 PM | CREATE INDEX | building index: loading tuples in treeThat is, the "sorting live tuples" phase disappeared, and instead it
seems to be counted in the "scanning table" one, as if there was an
update of the phase missing.I've only tried this on master, but I assume it behaves like this in the
older releases too. I wonder if this is intentional - it sure is a bit
misleading.This was a suprise to me as well. According to documentation in
sortsupport.h (line 125-129) the parallel workers produce pre-sorted
segments during the scanning phase, which are subsequently merged by
the leader. This might mean that the 'sorting' phase is already
finished during the 'scanning' phase by waiting for the parallel
workers; I haven't looked further if this is the case and whether it
could be changed to also produce the sorting metrics, but seeing as it
is part of the parallel workers API of tuplesort, I think fixing it in
current releases is going to be difficult.Maybe. Perhaps it's more complicated to decide when to switch between
phases with parallel workers. Still, the table scan is done after ~8
minutes (based on blocks_total vs. blocks_done), yet we keep that phase
for another ~9 minutes. It seems this is where the workers do the sort,
so "sorting live tuples" seems like a more natural phase for this.
After looking at it a bit more, it seems like a solution was actually
easier than I'd expected. PFA a prototype (unvalidated, but
check-world -ed) patch that would add these subphases of progress
reporting, which can be backpatched down to 12.
Do note that this is a partial fix, as it only fixes it when the
leader participates; but I don't think that limitation is too much of
a problem because only on builds which explicitly define the
non-standard DISABLE_LEADER_PARTICIPATION this will happen, and in
such cases the progress reporting for the loading phase will fail as
well.
With regards,
Matthias van de Meent
Attachments:
v1-0001-Report-phase-progress-for-the-sort-phase-in-paral.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Report-phase-progress-for-the-sort-phase-in-paral.patchDownload
From 4a7960cf19a6480a18119570a13f19835904f06f Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm@gmail.com>
Date: Wed, 2 Jun 2021 16:07:24 +0200
Subject: [PATCH v1] Report phase progress for the sort phase in parallel btree
construction.
---
src/backend/access/nbtree/nbtsort.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2c4d7f6e25..92f7ca66d8 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1978,9 +1978,18 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
* tuplesort_performsort() for spool2 if it ends up containing no dead
* tuples (this is disallowed for workers by tuplesort).
*/
+
+ if (progress)
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
+ PROGRESS_BTREE_PHASE_PERFORMSORT_1);
tuplesort_performsort(btspool->sortstate);
if (btspool2)
+ {
+ if (progress)
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
+ PROGRESS_BTREE_PHASE_PERFORMSORT_2);
tuplesort_performsort(btspool2->sortstate);
+ }
/*
* Done. Record ambuild statistics, and whether we encountered a broken
--
2.20.1
On 6/2/21 4:54 PM, Matthias van de Meent wrote:
On Wed, 2 Jun 2021 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
...
After looking at it a bit more, it seems like a solution was actually
easier than I'd expected. PFA a prototype (unvalidated, but
check-world -ed) patch that would add these subphases of progress
reporting, which can be backpatched down to 12.
Nice. I gave it a try on the database I'm experimenting with, and it
seems to be working fine. Please add it to the next CF.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 2 Jun 2021 at 17:42, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 6/2/21 4:54 PM, Matthias van de Meent wrote:
On Wed, 2 Jun 2021 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
...
After looking at it a bit more, it seems like a solution was actually
easier than I'd expected. PFA a prototype (unvalidated, but
check-world -ed) patch that would add these subphases of progress
reporting, which can be backpatched down to 12.Nice. I gave it a try on the database I'm experimenting with, and it
seems to be working fine. Please add it to the next CF.
Thanks, cf available here: https://commitfest.postgresql.org/33/3149/
With regards,
Matthias van de Meent
On 2021-Jun-02, Tomas Vondra wrote:
Hi,
While experimenting with parallel index builds, I've noticed a somewhat
strange behavior of pg_stat_progress_create_index when a btree index is
built with parallel workers - some of the phases seem to be missing.
Hmm, that's odd. I distinctly recall testing the behavior with parallel
workers, and it is mentioned by Rahila in the original thread, and I
think we tried to ensure that it was sane. I am surprised to learn that
there's such a large gap.
I'll go have a deeper look at the provided patch and try to get it
backpatched.
I think it would be valuable to have some kind of test mode where the
progress reporting APIs would make some noise (perhaps with a bespoke
GUC option) so that we can test things in some automated manner ...
--
�lvaro Herrera 39�49'30"S 73�17'W
"In fact, the basic problem with Perl 5's subroutines is that they're not
crufty enough, so the cruft leaks out into user-defined code instead, by
the Conservation of Cruft Principle." (Larry Wall, Apocalypse 6)
On 6/2/21 6:38 PM, Alvaro Herrera wrote:
On 2021-Jun-02, Tomas Vondra wrote:
Hi,
While experimenting with parallel index builds, I've noticed a somewhat
strange behavior of pg_stat_progress_create_index when a btree index is
built with parallel workers - some of the phases seem to be missing.Hmm, that's odd. I distinctly recall testing the behavior with parallel
workers, and it is mentioned by Rahila in the original thread, and I
think we tried to ensure that it was sane. I am surprised to learn that
there's such a large gap.
Yeah, I quickly skimmed [1]/messages/by-id/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql which I think is the thread you're referring
to, and there is some discussion about parallel workers. I haven't read
it in detail, though.
[1]: /messages/by-id/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql
/messages/by-id/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql
I'll go have a deeper look at the provided patch and try to get it
backpatched.I think it would be valuable to have some kind of test mode where the
progress reporting APIs would make some noise (perhaps with a bespoke
GUC option) so that we can test things in some automated manner ...
True, but how would that GUC work? Would it add something into the
system view, or just log something?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2021-Jun-02, Tomas Vondra wrote:
On 6/2/21 6:38 PM, Alvaro Herrera wrote:
Hmm, that's odd. I distinctly recall testing the behavior with parallel
workers, and it is mentioned by Rahila in the original thread, and I
think we tried to ensure that it was sane. I am surprised to learn that
there's such a large gap.Yeah, I quickly skimmed [1] which I think is the thread you're referring
to, and there is some discussion about parallel workers. I haven't read
it in detail, though.[1]
/messages/by-id/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql
Well, it is quite possible that we found *some* problems with parallel
workers but not all of them :-)
I think it would be valuable to have some kind of test mode where the
progress reporting APIs would make some noise (perhaps with a bespoke
GUC option) so that we can test things in some automated manner ...True, but how would that GUC work? Would it add something into the
system view, or just log something?
With the GUC turned on, emit some sort of message (maybe at INFO level)
whenever some subset of the progress parameters changes. This makes it
easy to compare the progress of any command with the expected set of
messages.
However, it's not very clear which parameters are observed
for changes (you can't do it for all params, because you'd get one for
each block in some cases, and that's unworkable). The way have #defined
the parameters makes it difficult to annotate parameters with flag bits;
we could have something
#ifdef USE_ASSERT_CHECKING
#define PROGRESS_LOG_CHANGES 0x70000000
#else
#define PROGRESS_LOG_CHANGES 0x0
#endif
#define PROGRESS_CLUSTER_PHASE (1 | PROGRESS_LOG_CHANGES)
and the progress-reporting knows to mask-out the LOG_CHANGES bit before
storing the value in memory, but also knows to emit the log output if
that's enabled and the LOG_CHANGES bit is present. (The assertion flag
would be tested at compile time to avoid a performance hit in production
builds.)
--
�lvaro Herrera Valdivia, Chile
"I can't go to a restaurant and order food because I keep looking at the
fonts on the menu. Five minutes later I realize that it's also talking
about food" (Donald Knuth)
On Thu, Jun 3, 2021 at 1:49 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Wed, 2 Jun 2021 at 17:42, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Nice. I gave it a try on the database I'm experimenting with, and it
seems to be working fine. Please add it to the next CF.Thanks, cf available here: https://commitfest.postgresql.org/33/3149/
The patch looks OK to me. It seems apparent that the lines added by
the patch are missing from the current source in the parallel case.
I tested with and without the patch, using the latest PG14 source as
of today, and can confirm that without the patch applied, the "sorting
live tuples" phase is not reported in the parallel-case, but with the
patch applied it then does get reported in that case. I also confirmed
that, as you said, the patch only addresses the usual case where the
parallel leader participates in the parallel operation.
What is slightly puzzling to me (and perhaps digging deeper will
reveal it) is why this "sorting live tuples" phase seems so short in
the serial case compared to the parallel case?
For example, in my test I created an index on a column of a table
having 10 million records, and it took about 40 seconds, during which
the "sorting live tuples" phase seemed to take about 8 seconds. Yet
for the serial case, index creation took about 75 seconds, during
which the "sorting live tuples" phase seemed to take about 1 second.
Regards,
Greg Nancarrow
Fujitsu Australia
On Fri, Jun 4, 2021 at 5:25 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
What is slightly puzzling to me (and perhaps digging deeper will
reveal it) is why this "sorting live tuples" phase seems so short in
the serial case compared to the parallel case?
For example, in my test I created an index on a column of a table
having 10 million records, and it took about 40 seconds, during which
the "sorting live tuples" phase seemed to take about 8 seconds. Yet
for the serial case, index creation took about 75 seconds, during
which the "sorting live tuples" phase seemed to take about 1 second.
Seems to be because in the serial case, the sort occurs after the scan
is complete (obviously) but in the parallel case, the scan and sort
are combined, so (after patch application) a portion of the then
reported "sorting live tuples" phase is actually "scanning table".
Regards,
Greg Nancarrow
Fujitsu Australia
On 2021-Jun-04, Greg Nancarrow wrote:
I tested with and without the patch, using the latest PG14 source as
of today, and can confirm that without the patch applied, the "sorting
live tuples" phase is not reported in the parallel-case, but with the
patch applied it then does get reported in that case. I also confirmed
that, as you said, the patch only addresses the usual case where the
parallel leader participates in the parallel operation.
What is slightly puzzling to me (and perhaps digging deeper will
reveal it) is why this "sorting live tuples" phase seems so short in
the serial case compared to the parallel case?
For example, in my test I created an index on a column of a table
having 10 million records, and it took about 40 seconds, during which
the "sorting live tuples" phase seemed to take about 8 seconds. Yet
for the serial case, index creation took about 75 seconds, during
which the "sorting live tuples" phase seemed to take about 1 second.
I think the reason is that scanning the table is not just scanning the
table -- it is also feeding tuples to tuplesort, which internally is
already sorting them as it receives them. So by the time you're done
scanning the relation, some (large) fraction of the sorting work is
already done, which is why the "sorting" phase is so short.
Tracing sort is not easy. we discussed this earlier; see
/messages/by-id/20181218210159.xtkltzm7flrwsm55@alvherre.pgsql
for example.
--
�lvaro Herrera Valdivia, Chile
Maybe there's lots of data loss but the records of data loss are also lost.
(Lincoln Yeoh)
On 2021-Jun-04, Greg Nancarrow wrote:
On Thu, Jun 3, 2021 at 1:49 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:On Wed, 2 Jun 2021 at 17:42, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Nice. I gave it a try on the database I'm experimenting with, and it
seems to be working fine. Please add it to the next CF.Thanks, cf available here: https://commitfest.postgresql.org/33/3149/
The patch looks OK to me. It seems apparent that the lines added by
the patch are missing from the current source in the parallel case.I tested with and without the patch, using the latest PG14 source as
of today, and can confirm that without the patch applied, the "sorting
live tuples" phase is not reported in the parallel-case, but with the
patch applied it then does get reported in that case. I also confirmed
that, as you said, the patch only addresses the usual case where the
parallel leader participates in the parallel operation.
So, with Matthias' patch applied and some instrumentation to log (some)
parameter updates, this is what I get on building an index in parallel.
The "subphase" is parameter 10:
2021-06-09 17:04:30.692 -04 19194 WARNING: updating param 0 to 1
2021-06-09 17:04:30.692 -04 19194 WARNING: updating param 6 to 0
2021-06-09 17:04:30.692 -04 19194 WARNING: updating param 8 to 403
2021-06-09 17:04:30.696 -04 19194 WARNING: updating param 9 to 2
2021-06-09 17:04:30.696 -04 19194 WARNING: updating param 10 to 1
2021-06-09 17:04:30.696 -04 19194 WARNING: updating param 11 to 0
2021-06-09 17:04:30.696 -04 19194 WARNING: updating param 15 to 0
2021-06-09 17:04:30.696 -04 19194 WARNING: updating param 10 to 2
2021-06-09 17:04:30.696 -04 19194 WARNING: updating param 15 to 486726
2021-06-09 17:04:37.418 -04 19194 WARNING: updating param 10 to 3 <-- this one is new
2021-06-09 17:04:42.215 -04 19194 WARNING: updating param 11 to 110000000
2021-06-09 17:04:42.215 -04 19194 WARNING: updating param 15 to 0
2021-06-09 17:04:42.215 -04 19194 WARNING: updating param 10 to 3
2021-06-09 17:04:42.237 -04 19194 WARNING: updating param 10 to 5
The thing to note is that we set subphase to 3 twice. The first of
those is added by the patch to _bt_parallel_scan_and_sort. The second
is in _bt_leafbuild, just before setting the subphase to LEAF_LOAD. So
the change is that we set the subphase to "sorting live tuples" five
seconds ahead of what we were doing previously. Seems ok. (We could
alternatively skip the progress update call in _bt_leafbuild; but those
calls are so cheap that adding a conditional jump is almost as
expensive.)
(The other potential problem might be to pointlessly invoke the progress
update calls when in a worker. But that's already covered because only
the leader passes progress=true to _bt_parallel_scan_and_sort.)
I'll push now.
--
�lvaro Herrera Valdivia, Chile