Unify parallel worker handling for index builds and instrumentation
Hi,
Whilst working on the stack-based instrumentation patch, I noticed
that there is a lot of duplication of parallel worker code, and in
particular parallel index build code, when it comes to how it sets up
shared memory, and how metadata information is fed back to/from the
leader.
My initial focus was around instrumentation, but I think there is a
bigger duplication issue that needs a refactoring in regards to
parallel index builds in particular. See attached a draft patch
series, that does the following:
0001: Use the new Instrumentation struct to handle WAL and Buffer usage together
0002: New helpers to estimate/store Instrumentation struct in shared memory
0003: New helpers to handle query text for parallel workers
0004: Introduces shared parallel index build helper functions
Of these, 0004 is where I have the most uncertainty personally (what's
a good abstraction here?), but wanted to share this to spark a
conversation on how we could improve this.
I could imagine an alternate design where we refactor more generally
how parallel workers get handled for any maintenance commands (i.e.
also including parallel vacuum), but that would reduce how much code
duplication we can avoid for index builds.
Thoughts?
Thanks,
Lukas
--
Lukas Fittl
Attachments:
v1-0001-instrumentation-Use-Instrumentation-struct-for-pa.patchapplication/octet-stream; name=v1-0001-instrumentation-Use-Instrumentation-struct-for-pa.patchDownload+114-180
v1-0003-Unify-parallel-worker-handling-for-query-text.patchapplication/octet-stream; name=v1-0003-Unify-parallel-worker-handling-for-query-text.patchDownload+60-96
v1-0002-Unify-parallel-worker-handling-for-instrumentatio.patchapplication/octet-stream; name=v1-0002-Unify-parallel-worker-handling-for-instrumentatio.patchDownload+58-94
v1-0004-Unify-parallel-index-build-infrastructure.patchapplication/octet-stream; name=v1-0004-Unify-parallel-index-build-infrastructure.patchDownload+465-475
Hi,
Whilst working on the stack-based instrumentation patch, I noticed
that there is a lot of duplication of parallel worker code, and in
particular parallel index build code, when it comes to how it sets up
shared memory, and how metadata information is fed back to/from the
leader.
+1, Thanks for the patches!
My initial focus was around instrumentation, but I think there is a
bigger duplication issue that needs a refactoring in regards to
parallel index builds in particular.
Yes, looks like it.
I reviewed you proposed patch set and here are my findings.
0001: Use the new Instrumentation struct to handle WAL and Buffer usage together
This one is a straightforward refactor. Only comment is to remove the bare block
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
-
&wal_usage[ParallelWorkerNumber]);
+ {
+ Instrumentation *worker_instr;
+
+ worker_instr = shm_toc_lookup(toc,
PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(&worker_instr[ParallelWorkerNumber]);
+ }
0002: New helpers to estimate/store Instrumentation struct in shared memory
Another straightforward refactor.
This comment
- * If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
- */
was removed from the several duplicate code paths that estimated space
for parallel instrumentation, but I don't think it added value anyhow,
so that is fine.
0003: New helpers to handle query text for parallel workers
+/*
+ * Restore the query text in a worker: set debug_query_string and report it as
+ * the current activity. The key is looked up with missing_ok, so this is a
+ * no-op (leaving debug_query_string NULL) when the leader stored no text.
+ */
+void
+RestoreParallelQueryText(shm_toc *toc, uint64 key)
+{
+ debug_query_string = shm_toc_lookup(toc, key, true);
+ pgstat_report_activity(STATE_RUNNING, debug_query_string);
+}
I went back-forth on this since it combines 2 different ideas, setting
a debug_query_string and reporting activity. I think this is OK after
thinking about it a bit, and the whole point is to deduplicate code.
"no-op (leaving debug_query_string NULL) " The comment here is
not accurate. I would drop the "no-op" part since it is setting a NULL
query string. right?
0004: Introduces shared parallel index build helper functions
Of these, 0004 is where I have the most uncertainty personally (what's
a good abstraction here?), but wanted to share this to spark a
conversation on how we could improve this.
I think this is a nice deduplication as well. So I am also +1 to this.
A few things:
1/
+ * Mutable state that is maintained for exact index AMs (e.g.
nbtree), and
+ * unused for lossy AMs.
Not sure if the "loss AMs" comment is accurate. From what I can tell, GIN
does not use the havedead and brokenhotchain fields and it's not lossy.
right?
2/
+ * workers are launched; the mutable fields are maintained by the workers
nit: extra space before "launched"
3/
Per the commit message:
This also fixes an oversight in parallel GIN index builds that forgot
to pass the queryid to the parallel workers, like other AMs do.
I think this should be an independent patch, and first in the series.
I could imagine an alternate design where we refactor more generally
how parallel workers get handled for any maintenance commands (i.e.
also including parallel vacuum), but that would reduce how much code
duplication we can avoid for index builds.
Yeah, I am not sure we need to go that far in the deduplication. Parallel
Vacuum and Parallel index builds are way different in what they do.
--
Sami Imseih
Amazon Web Services (AWS)
On Mon, Jun 01, 2026 at 01:31:00PM -0500, Sami Imseih wrote:
0001: Use the new Instrumentation struct to handle WAL and Buffer usage together
This one is a straightforward refactor. Only comment is to remove the bare block
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber], - &wal_usage[ParallelWorkerNumber]); + { + Instrumentation *worker_instr; + + worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false); + InstrEndParallelQuery(&worker_instr[ParallelWorkerNumber]); + }
Instrumentation is a structure larger than WalUsage and BufferUsage
combined, and we don't care about all these other fields for the
parallel workers. Sure, this cuts code, but this also increases the
DSM and memory footprint.
0002 relies on 0001, so same mixed feeling regarding it.
0003: New helpers to handle query text for parallel workers
+/* + * Restore the query text in a worker: set debug_query_string and report it as + * the current activity. The key is looked up with missing_ok, so this is a + * no-op (leaving debug_query_string NULL) when the leader stored no text. + */ +void +RestoreParallelQueryText(shm_toc *toc, uint64 key) +{ + debug_query_string = shm_toc_lookup(toc, key, true); + pgstat_report_activity(STATE_RUNNING, debug_query_string); +}I went back-forth on this since it combines 2 different ideas, setting
a debug_query_string and reporting activity. I think this is OK after
thinking about it a bit, and the whole point is to deduplicate code."no-op (leaving debug_query_string NULL) " The comment here is
not accurate. I would drop the "no-op" part since it is setting a NULL
query string. right?
Hmm. This one looks like it has nice advantages. For one, less
specific logic related to debug_query_string in all the main callbacks
of these parallel workers, copied in three different places.
0004: Introduces shared parallel index build helper functions
Of these, 0004 is where I have the most uncertainty personally (what's
a good abstraction here?), but wanted to share this to spark a
conversation on how we could improve this.I think this is a nice deduplication as well. So I am also +1 to this.
Less duplication with the levels of locks assigned and less
duplication regarding the levels of the snapshots retrieved makes this
approach quite beneficial in the long-term. That reads nice.
--
Michael
0001: Use the new Instrumentation struct to handle WAL and Buffer usage together
This one is a straightforward refactor. Only comment is to remove the bare block
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber], - &wal_usage[ParallelWorkerNumber]); + { + Instrumentation *worker_instr; + + worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false); + InstrEndParallelQuery(&worker_instr[ParallelWorkerNumber]); + }Instrumentation is a structure larger than WalUsage and BufferUsage
combined, and we don't care about all these other fields for the
parallel workers. Sure, this cuts code, but this also increases the
DSM and memory footprint.
The difference is 192 bytes per worker; 360 bytes for Instrumentation
- 168 bytes
for removing BufferUsage and WalUsage. Even with 32 parallel workers ( probably
the extreme case ) that is ~6 KB, which is negligible compared to the code
cleanup.
gdb -batch -nx $PGHOME/bin/postgres \
-ex "print (int)sizeof(Instrumentation)" \
-ex "print (int)sizeof(BufferUsage)" \
-ex "print (int)sizeof(WalUsage)" \
-ex "ptype/o Instrumentation" \
-ex "ptype/o BufferUsage" \
-ex "ptype/o WalUsage"
$1 = 360
$2 = 128
$3 = 40
--
Sami Imseih
Amazon Web Services (AWS)
On Tue, Jun 2, 2026 at 4:29 PM Sami Imseih <samimseih@gmail.com> wrote:
0001: Use the new Instrumentation struct to handle WAL and Buffer usage together
This one is a straightforward refactor. Only comment is to remove the bare block
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber], - &wal_usage[ParallelWorkerNumber]); + { + Instrumentation *worker_instr; + + worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false); + InstrEndParallelQuery(&worker_instr[ParallelWorkerNumber]); + }Instrumentation is a structure larger than WalUsage and BufferUsage
combined, and we don't care about all these other fields for the
parallel workers. Sure, this cuts code, but this also increases the
DSM and memory footprint.The difference is 192 bytes per worker; 360 bytes for Instrumentation
- 168 bytes
for removing BufferUsage and WalUsage. Even with 32 parallel workers ( probably
the extreme case ) that is ~6 KB, which is negligible compared to the code
cleanup.
Hmm, thanks for running the numbers!
I agree its not substantial, but its also not nothing - so I see
Michael's point. I think the largest difference here is the fact that
WalUsage and BufferUsage are repeated - that is something that would
no longer be necessary once/if we go with the stack-based
instrumentation approach. When I had this patch in the other patch
series, it was after a refactoring that made the overhead much
smaller.
Maybe for now I can re-order this patch series to lead with the other
patches, and we could defer this part a bit until we have clarity what
we're doing re: stack-based instrumentation.
I'll respond on the other comments separately later this week. Thanks
to you both for the quick feedback!
Thanks,
Lukas
--
Lukas Fittl
On Tue, Jun 02, 2026 at 05:38:09PM -0700, Lukas Fittl wrote:
I agree its not substantial, but its also not nothing - so I see
Michael's point. I think the largest difference here is the fact that
WalUsage and BufferUsage are repeated - that is something that would
no longer be necessary once/if we go with the stack-based
instrumentation approach. When I had this patch in the other patch
series, it was after a refactoring that made the overhead much
smaller.Maybe for now I can re-order this patch series to lead with the other
patches, and we could defer this part a bit until we have clarity what
we're doing re: stack-based instrumentation.
Right. That hints also to a patch ordering problem to me, where the
final picture (if I got correctly) would be that we don't have the
_start fields anymore at the end in the Instrumentation data sent.
It feels confusing that we push from the leader to its workers a set
of fields that the workers have no use for.
--
Michael