Parallel CREATE INDEX for BRIN indexes

Started by Tomas Vondra · almost 3 years ago · 45 messages · pgsql-hackers
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

Here's a WIP patch allowing parallel CREATE INDEX for BRIN indexes. The
infrastructure (starting workers etc.) is "inspired" by the BTREE code
(i.e. copied from that and massaged a bit to call brin stuff).

_bt_begin_parallel -> _brin_begin_parallel
_bt_end_parallel -> _brin_end_parallel
_bt_parallel_estimate_shared -> _brin_parallel_estimate_shared
_bt_leader_participate_as_worker -> _brin_leader_participate_as_worker
_bt_parallel_scan_and_sort -> _brin_parallel_scan_and_build

This is mostly mechanical stuff - setting up the parallel workers,
starting the scan etc.

The tricky part is how to divide the work among the workers and how to
combine the partial results. For BTREE we simply let each worker read a
subset of the table (using a parallel scan), sort it, and then do a
merge sort on the partial results.

For BRIN it's a bit different, because the index essentially splits the
table into smaller ranges and treats them independently. So the easiest
way is to organize the table scan so that each range gets processed by
exactly one worker. Each worker writes its index tuples into a temporary
file, and when all workers are done, the leader reads the tuples back
and writes them into the index.

The problem is that a parallel scan assigns a mostly random subset of
the table to each worker - there is no guarantee that a given BRIN page
range is processed by a single worker.

0001 does this in a somewhat crude way - instead of doing a single large
scan, each worker runs a sequence of TID range scans (see
_brin_parallel_scan_and_build), and BrinShared has fields used to track
which ranges have already been assigned to workers. A bit cumbersome,
but it works pretty well.

0002 replaces the sequence of TID range scans with a single parallel
scan, modified to assign "chunks" in multiples of pagesPerRange.

In both cases _brin_end_parallel then reads the summaries from worker
files, and adds them into the index. In 0001 this is fairly simple,
although we could do one more improvement and sort the ranges by range
start to make the index nicer (and possibly a bit more efficient). This
should be simple, because the per-worker results are already sorted like
that (so a merge sort in _brin_end_parallel would be enough).

For 0002 it's a bit more complicated, because with a single parallel
scan brinbuildCallbackParallel can't decide if a range is assigned to a
different worker or empty. And we want to generate summaries for empty
ranges in the index. We could either skip such ranges during index build,
and then add empty summaries in _brin_end_parallel (if needed), or add
them and then merge them using "union".

I just realized there's a third option to do this - we could just do
regular parallel scan (with no particular regard to pagesPerRange), and
then do "union" when merging results from workers. It doesn't require
the sequence of TID scans, and the union would also handle the empty
ranges. The per-worker results might be much larger, though, because
each worker might produce up to the "full" BRIN index.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-parallel-CREATE-INDEX-for-BRIN-20230608.patch (text/x-patch) [+719 -6]
0002-switch-CREATE-INDEX-for-BRIN-to-parallel-sc-20230608.patch (text/x-patch) [+73 -79]
#2 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#1)
Re: Parallel CREATE INDEX for BRIN indexes

On Thu, 8 Jun 2023 at 14:55, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

Here's a WIP patch allowing parallel CREATE INDEX for BRIN indexes. The
infrastructure (starting workers etc.) is "inspired" by the BTREE code
(i.e. copied from that and massaged a bit to call brin stuff).

Nice work.

In both cases _brin_end_parallel then reads the summaries from worker
files, and adds them into the index. In 0001 this is fairly simple,
although we could do one more improvement and sort the ranges by range
start to make the index nicer (and possibly a bit more efficient). This
should be simple, because the per-worker results are already sorted like
that (so a merge sort in _brin_end_parallel would be enough).

I see that you manually built the passing and sorting of tuples
between workers, but can't we use the parallel tuplesort
infrastructure for that? It already has similar features in place and
improves code commonality.

For 0002 it's a bit more complicated, because with a single parallel
scan brinbuildCallbackParallel can't decide if a range is assigned to a
different worker or empty. And we want to generate summaries for empty
ranges in the index. We could either skip such ranges during index build,
and then add empty summaries in _brin_end_parallel (if needed), or add
them and then merge them using "union".

I just realized there's a third option to do this - we could just do
regular parallel scan (with no particular regard to pagesPerRange), and
then do "union" when merging results from workers. It doesn't require
the sequence of TID scans, and the union would also handle the empty
ranges. The per-worker results might be much larger, though, because
each worker might produce up to the "full" BRIN index.

Would it be too much effort to add a 'min_chunk_size' argument to
table_beginscan_parallel (or ParallelTableScanDesc) that defines the
minimum granularity of block ranges to be assigned to each process? I
think that would be the most elegant solution that would require
relatively little effort: table_block_parallelscan_nextpage already
does parallel management of multiple chunk sizes, and I think this
modification would fit quite well in that code.

Kind regards,

Matthias van de Meent

#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#2)
Re: Parallel CREATE INDEX for BRIN indexes

On 7/4/23 23:53, Matthias van de Meent wrote:

On Thu, 8 Jun 2023 at 14:55, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

Here's a WIP patch allowing parallel CREATE INDEX for BRIN indexes. The
infrastructure (starting workers etc.) is "inspired" by the BTREE code
(i.e. copied from that and massaged a bit to call brin stuff).

Nice work.

In both cases _brin_end_parallel then reads the summaries from worker
files, and adds them into the index. In 0001 this is fairly simple,
although we could do one more improvement and sort the ranges by range
start to make the index nicer (and possibly a bit more efficient). This
should be simple, because the per-worker results are already sorted like
that (so a merge sort in _brin_end_parallel would be enough).

I see that you manually built the passing and sorting of tuples
between workers, but can't we use the parallel tuplesort
infrastructure for that? It already has similar features in place and
improves code commonality.

Maybe. I wasn't that familiar with what parallel tuplesort can and can't
do, and the little I knew I managed to forget since I wrote this patch.
Which similar features do you have in mind?

The workers are producing the results in "start_block" order, so if they
pass that to the leader, it probably can do the usual merge sort.

For 0002 it's a bit more complicated, because with a single parallel
scan brinbuildCallbackParallel can't decide if a range is assigned to a
different worker or empty. And we want to generate summaries for empty
ranges in the index. We could either skip such ranges during index build,
and then add empty summaries in _brin_end_parallel (if needed), or add
them and then merge them using "union".

I just realized there's a third option to do this - we could just do
regular parallel scan (with no particular regard to pagesPerRange), and
then do "union" when merging results from workers. It doesn't require
the sequence of TID scans, and the union would also handle the empty
ranges. The per-worker results might be much larger, though, because
each worker might produce up to the "full" BRIN index.

Would it be too much effort to add a 'min_chunk_size' argument to
table_beginscan_parallel (or ParallelTableScanDesc) that defines the
minimum granularity of block ranges to be assigned to each process? I
think that would be the most elegant solution that would require
relatively little effort: table_block_parallelscan_nextpage already
does parallel management of multiple chunk sizes, and I think this
modification would fit quite well in that code.

I'm confused. Isn't that pretty much exactly what 0002 does? I mean,
that passes pagesPerRange to table_parallelscan_initialize(), so that
each pagesPerRange is assigned to a single worker.

The trouble I described above is that the scan returns tuples, and the
consumer has no idea what the chunk size was or how many other workers
there are. Imagine you get a tuple from block 1, and then a tuple from
block 1000. Does that mean that the blocks in between are empty or that
they were processed by some other worker?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#3)
Re: Parallel CREATE INDEX for BRIN indexes

On Wed, 5 Jul 2023 at 00:08, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 7/4/23 23:53, Matthias van de Meent wrote:

On Thu, 8 Jun 2023 at 14:55, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

Here's a WIP patch allowing parallel CREATE INDEX for BRIN indexes. The
infrastructure (starting workers etc.) is "inspired" by the BTREE code
(i.e. copied from that and massaged a bit to call brin stuff).

Nice work.

In both cases _brin_end_parallel then reads the summaries from worker
files, and adds them into the index. In 0001 this is fairly simple,
although we could do one more improvement and sort the ranges by range
start to make the index nicer (and possibly a bit more efficient). This
should be simple, because the per-worker results are already sorted like
that (so a merge sort in _brin_end_parallel would be enough).

I see that you manually built the passing and sorting of tuples
between workers, but can't we use the parallel tuplesort
infrastructure for that? It already has similar features in place and
improves code commonality.

Maybe. I wasn't that familiar with what parallel tuplesort can and can't
do, and the little I knew I managed to forget since I wrote this patch.
Which similar features do you have in mind?

The workers are producing the results in "start_block" order, so if they
pass that to the leader, it probably can do the usual merge sort.

For 0002 it's a bit more complicated, because with a single parallel
scan brinbuildCallbackParallel can't decide if a range is assigned to a
different worker or empty. And we want to generate summaries for empty
ranges in the index. We could either skip such ranges during index build,
and then add empty summaries in _brin_end_parallel (if needed), or add
them and then merge them using "union".

I just realized there's a third option to do this - we could just do
regular parallel scan (with no particular regard to pagesPerRange), and
then do "union" when merging results from workers. It doesn't require
the sequence of TID scans, and the union would also handle the empty
ranges. The per-worker results might be much larger, though, because
each worker might produce up to the "full" BRIN index.

Would it be too much effort to add a 'min_chunk_size' argument to
table_beginscan_parallel (or ParallelTableScanDesc) that defines the
minimum granularity of block ranges to be assigned to each process? I
think that would be the most elegant solution that would require
relatively little effort: table_block_parallelscan_nextpage already
does parallel management of multiple chunk sizes, and I think this
modification would fit quite well in that code.

I'm confused. Isn't that pretty much exactly what 0002 does? I mean,
that passes pagesPerRange to table_parallelscan_initialize(), so that
each pagesPerRange is assigned to a single worker.

Huh, I overlooked that one... Sorry for that.

The trouble I described above is that the scan returns tuples, and the
consumer has no idea what the chunk size was or how many other workers
there are. Imagine you get a tuple from block 1, and then a tuple from
block 1000. Does that mean that the blocks in between are empty or that
they were processed by some other worker?

If the unit of work for parallel table scans is the index's
pages_per_range, then I think we can just fill in expected-but-missing
ranges as 'empty' in the parallel leader during index loading, like
the first of the two solutions you proposed.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)

#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#4)
Re: Parallel CREATE INDEX for BRIN indexes

On 7/5/23 10:44, Matthias van de Meent wrote:

On Wed, 5 Jul 2023 at 00:08, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 7/4/23 23:53, Matthias van de Meent wrote:

On Thu, 8 Jun 2023 at 14:55, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

Here's a WIP patch allowing parallel CREATE INDEX for BRIN indexes. The
infrastructure (starting workers etc.) is "inspired" by the BTREE code
(i.e. copied from that and massaged a bit to call brin stuff).

Nice work.

In both cases _brin_end_parallel then reads the summaries from worker
files, and adds them into the index. In 0001 this is fairly simple,
although we could do one more improvement and sort the ranges by range
start to make the index nicer (and possibly a bit more efficient). This
should be simple, because the per-worker results are already sorted like
that (so a merge sort in _brin_end_parallel would be enough).

I see that you manually built the passing and sorting of tuples
between workers, but can't we use the parallel tuplesort
infrastructure for that? It already has similar features in place and
improves code commonality.

Maybe. I wasn't that familiar with what parallel tuplesort can and can't
do, and the little I knew I managed to forget since I wrote this patch.
Which similar features do you have in mind?

The workers are producing the results in "start_block" order, so if they
pass that to the leader, it probably can do the usual merge sort.

For 0002 it's a bit more complicated, because with a single parallel
scan brinbuildCallbackParallel can't decide if a range is assigned to a
different worker or empty. And we want to generate summaries for empty
ranges in the index. We could either skip such ranges during index build,
and then add empty summaries in _brin_end_parallel (if needed), or add
them and then merge them using "union".

I just realized there's a third option to do this - we could just do
regular parallel scan (with no particular regard to pagesPerRange), and
then do "union" when merging results from workers. It doesn't require
the sequence of TID scans, and the union would also handle the empty
ranges. The per-worker results might be much larger, though, because
each worker might produce up to the "full" BRIN index.

Would it be too much effort to add a 'min_chunk_size' argument to
table_beginscan_parallel (or ParallelTableScanDesc) that defines the
minimum granularity of block ranges to be assigned to each process? I
think that would be the most elegant solution that would require
relatively little effort: table_block_parallelscan_nextpage already
does parallel management of multiple chunk sizes, and I think this
modification would fit quite well in that code.

I'm confused. Isn't that pretty much exactly what 0002 does? I mean,
that passes pagesPerRange to table_parallelscan_initialize(), so that
each pagesPerRange is assigned to a single worker.

Huh, I overlooked that one... Sorry for that.

The trouble I described above is that the scan returns tuples, and the
consumer has no idea what the chunk size was or how many other workers
there are. Imagine you get a tuple from block 1, and then a tuple from
block 1000. Does that mean that the blocks in between are empty or that
they were processed by some other worker?

If the unit of work for parallel table scans is the index's
pages_per_range, then I think we can just fill in expected-but-missing
ranges as 'empty' in the parallel leader during index loading, like
the first of the two solutions you proposed.

Right, I think that's the right solution.

Or rather the only solution, because the other idea (generating the
empty ranges in the workers) relies on the workers knowing when to
generate them. But I don't think the workers have the necessary
information.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#3)
Re: Parallel CREATE INDEX for BRIN indexes

On Wed, 5 Jul 2023 at 00:08, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 7/4/23 23:53, Matthias van de Meent wrote:

On Thu, 8 Jun 2023 at 14:55, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

Here's a WIP patch allowing parallel CREATE INDEX for BRIN indexes. The
infrastructure (starting workers etc.) is "inspired" by the BTREE code
(i.e. copied from that and massaged a bit to call brin stuff).

Nice work.

In both cases _brin_end_parallel then reads the summaries from worker
files, and adds them into the index. In 0001 this is fairly simple,
although we could do one more improvement and sort the ranges by range
start to make the index nicer (and possibly a bit more efficient). This
should be simple, because the per-worker results are already sorted like
that (so a merge sort in _brin_end_parallel would be enough).

I see that you manually built the passing and sorting of tuples
between workers, but can't we use the parallel tuplesort
infrastructure for that? It already has similar features in place and
improves code commonality.

Maybe. I wasn't that familiar with what parallel tuplesort can and can't
do, and the little I knew I managed to forget since I wrote this patch.
Which similar features do you have in mind?

I was referring to the feature that is "emitting a single sorted run
of tuples at the leader backend based on data gathered in parallel
worker backends". It manages the sort state, on-disk runs etc. so that
you don't have to manage that yourself.

Adding a new storage format for what is effectively a logical tape
(logtape.{c,h}) and manually merging it seems like a lot of changes if
that functionality is readily available, standardized and optimized in
sortsupport; and adds an additional place to manually go through for
disk-related changes like TDE.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)

#7 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#6)
Re: Parallel CREATE INDEX for BRIN indexes

On 7/5/23 16:33, Matthias van de Meent wrote:

...

Maybe. I wasn't that familiar with what parallel tuplesort can and can't
do, and the little I knew I managed to forget since I wrote this patch.
Which similar features do you have in mind?

I was referring to the feature that is "emitting a single sorted run
of tuples at the leader backend based on data gathered in parallel
worker backends". It manages the sort state, on-disk runs etc. so that
you don't have to manage that yourself.

Adding a new storage format for what is effectively a logical tape
(logtape.{c,h}) and manually merging it seems like a lot of changes if
that functionality is readily available, standardized and optimized in
sortsupport; and adds an additional place to manually go through for
disk-related changes like TDE.

Here's a new version of the patch, with three main changes:

1) Adoption of the parallel scan approach, instead of the homegrown
solution with a sequence of TID scans. This is mostly what the 0002
patch did, except for fixing a bug - parallel scan has a "rampdown"
close to the end, and this needs to consider the chunk size too.

2) Switches to the parallel tuplesort, as proposed. This turned out to
be easier than I expected - most of the work was in adding methods to
tuplesortvariants.c to allow reading/writing BrinTuple items. The main
limitation is that we need to pass around the length of the tuple
(AFAICS it's not in the BrinTuple itself). I'm not entirely sure about
the memory management aspect of this, and maybe there's a more elegant
solution.

Overall it seems to work - the brin.c code is heavily based on how
nbtsort.c does parallel builds for btree, so hopefully it's fine. At
some point I got a bit confused about which spool to create/use, but it
seems to work.

3) Handling of empty ranges - I ended up ignoring empty ranges in
workers (i.e. those are not written to the tuplesort), and instead the
leader fills them in when reading data from the shared tuplesort.

One thing I was wondering about is whether it might be better to allow
the workers to process overlapping ranges, and then let the leader
merge the summaries. That would mean we might not need the tableam.c
changes at all, but the leader would need to do more work (although the
BRIN indexes tend to be fairly small). The main reason that got me
thinking about this is that we have pretty much no tests for the union
procedures, because triggering that is really difficult. But for
parallel index builds that'd be much more common.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-parallel-CREATE-INDEX-for-BRIN-20230706.patch (text/x-patch) [+1065 -16]
#8 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#7)
Re: Parallel CREATE INDEX for BRIN indexes

On Thu, 6 Jul 2023 at 16:13, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 7/5/23 16:33, Matthias van de Meent wrote:

...

Maybe. I wasn't that familiar with what parallel tuplesort can and can't
do, and the little I knew I managed to forget since I wrote this patch.
Which similar features do you have in mind?

I was referring to the feature that is "emitting a single sorted run
of tuples at the leader backend based on data gathered in parallel
worker backends". It manages the sort state, on-disk runs etc. so that
you don't have to manage that yourself.

Adding a new storage format for what is effectively a logical tape
(logtape.{c,h}) and manually merging it seems like a lot of changes if
that functionality is readily available, standardized and optimized in
sortsupport; and adds an additional place to manually go through for
disk-related changes like TDE.

Here's a new version of the patch, with three main changes:

Thanks! I've done a review of the patch, and most of it looks good. Some
places need cleanup and polish, others need more documentation, and
there are some other issues, but so far it's looking OK.

One thing I was wondering about is whether it might be better to allow
the workers to process overlapping ranges, and then let the leader
merge the summaries. That would mean we might not need the tableam.c
changes at all, but the leader would need to do more work (although the
BRIN indexes tend to be fairly small). The main reason that got me
thinking about this is that we have pretty much no tests for the union
procedures, because triggering that is really difficult. But for
parallel index builds that'd be much more common.

Hmm, that's a good point. I don't mind either way, but it would add
overhead in the leader to do all of that merging - especially when you
configure pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE as we'd
need to merge up to parallel_workers tuples. That could be a
significant overhead.

... thinks a bit.

Hmm, but with the current P_S_M_C_S of 8192 blocks that's quite
unlikely to be a serious problem - the per-backend IO saved of such
large ranges on a single backend has presumably much more impact than
the merging of n_parallel_tasks max-sized brin tuples. So, seems fine
with me.

Review follows below.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)

-----------

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c

+ BrinShared *brinshared;

Needs some indentation fixes.

+    int            bs_reltuples;
[...]
+    state->bs_reltuples += reltuples;

My IDE warns me that reltuples is a double. Looking deeper into the
value, it contains the number of live tuples in the table, so this
conversion may not result in a meaningful value for tables with >=2^31
live tuples. Tables > 56GB could begin to get affected by this.

+ int bs_worker_id;

This variable seems to be unused.

+ BrinSpool *bs_spool;
+ BrinSpool *bs_spool_out;

Are both used? If so, could you add comments why we have two spools
here, instead of only one?

+/*
+ * A version of the callback, used by parallel index builds. The main difference
+ * is that instead of writing the BRIN tuples into the index, we write them into
+ * a shared tuplestore file, and leave the insertion up to the leader (which may

+ ... shared tuplesort, and ...

brinbuildCallbackParallel(...)
+ while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)

shouldn't this be an 'if' now?

+        while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)
+            state->bs_currRangeStart += state->bs_pagesPerRange;

Is there a reason why you went with iterative addition instead of a
single divide-and-multiply like the following?:

+ state->bs_currRangeStart += state->bs_pagesPerRange *
((thisblock - state->bs_currRangeStart) / state->bs_pagesPerRange);

diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
[...]
-table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
+table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan, BlockNumber chunk_factor)
[...]
-    /* compare phs_syncscan initialization to similar logic in initscan */
+    bpscan->phs_chunk_factor = chunk_factor;
+    /* compare phs_syncscan initialization to similar logic in initscan
+     *
+     * Disable sync scans if the chunk factor is set (valid block number).
+     */

I think this needs some pgindent or other style work, both on comment
style and line lengths

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
[...]
+            Assert(false); (x3)

I think these can be cleaned up, right?

diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
[...]
+ * Computing BrinTuple size with only the tuple is difficult, so we want to track
+ * the length for the tuple referenced by SortTuple. That's what BrinSortTuple is meant
+ * to do - it's essentially a BrinTuple prefixed by length. We only write the
+ * BrinTuple to the logtapes, though.

Why don't we write the full BrinSortTuple to disk? Doesn't that make more sense?

+    tuplesort_puttuple_common(state, &stup,
+                              base->sortKeys &&
+                              base->sortKeys->abbrev_converter &&
+                              !stup.isnull1);

Can't this last argument just be inlined, based on knowledge that we
don't use sortKeys in brin?

+comparetup_index_brin(const SortTuple *a, const SortTuple *b,
+                      Tuplesortstate *state)
+{
+    BrinTuple  *tuple1;
[...]
+    tuple1 = &((BrinSortTuple *) a)->tuple;
[...]

I'm fairly sure that this cast (and its neighbour) is incorrect and
should be the following instead:

+ tuple1 = &((BrinSortTuple *) (a->tuple))->tuple;

Additionally, I think the following would be a better approach here,
as we wouldn't need to do pointer-chasing:

+ static int
+ comparetup_index_brin(const SortTuple *a, const SortTuple *b,
+                      Tuplesortstate *state)
+ {
+    Assert(TuplesortstateGetPublic(state)->haveDatum1);
+
+    if (DatumGetUInt32(a->datum1) > DatumGetUInt32(b->datum1))
+        return 1;
+    if (DatumGetUInt32(a->datum1) < DatumGetUInt32(b->datum1))
+        return -1;
+     /* silence compilers */
+    return 0;
+ }

---

Thanks for working on this!

#9 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#8)
Re: Parallel CREATE INDEX for BRIN indexes

On 7/11/23 23:11, Matthias van de Meent wrote:

On Thu, 6 Jul 2023 at 16:13, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 7/5/23 16:33, Matthias van de Meent wrote:

...

Maybe. I wasn't that familiar with what parallel tuplesort can and can't
do, and the little I knew I managed to forget since I wrote this patch.
Which similar features do you have in mind?

I was referring to the feature that is "emitting a single sorted run
of tuples at the leader backend based on data gathered in parallel
worker backends". It manages the sort state, on-disk runs etc. so that
you don't have to manage that yourself.

Adding a new storage format for what is effectively a logical tape
(logtape.{c,h}) and manually merging it seems like a lot of changes if
that functionality is readily available, standardized and optimized in
sortsupport; and adds an additional place to manually go through for
disk-related changes like TDE.

Here's a new version of the patch, with three main changes:

Thanks! I've done a review of the patch, and most of it looks good. Some
places need cleanup and polish, others need more documentation, and
there are some other issues, but so far it's looking OK.

One thing I was wondering about is whether it might be better to allow
the workers to process overlapping ranges, and then let the leader
merge the summaries. That would mean we might not need the tableam.c
changes at all, but the leader would need to do more work (although the
BRIN indexes tend to be fairly small). The main reason that got me
thinking about this is that we have pretty much no tests for the union
procedures, because triggering that is really difficult. But for
parallel index builds that'd be much more common.

Hmm, that's a good point. I don't mind either way, but it would add
overhead in the leader to do all of that merging - especially when you
configure pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE as we'd
need to merge up to parallel_workers tuples. That could be a
significant overhead.

... thinks a bit.

Hmm, but with the current P_S_M_C_S of 8192 blocks that's quite
unlikely to be a serious problem - the per-backend IO saved of such
large ranges on a single backend has presumably much more impact than
the merging of n_parallel_tasks max-sized brin tuples. So, seems fine
with me.

As for PARALLEL_SEQSCAN_MAX_CHUNK_SIZE, the last patch actually
considers the chunk_factor (i.e. pages_per_range) *after* doing

pbscanwork->phsw_chunk_size = Min(pbscanwork->phsw_chunk_size,
PARALLEL_SEQSCAN_MAX_CHUNK_SIZE);

so even with (pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE) it
would not need to merge anything.

Now, that might have been a bad idea and PARALLEL_SEQSCAN_MAX_CHUNK_SIZE
should be considered. In which case this *has* to do the union, even if
only for the rare corner case.

But I don't think that's a major issue - it's pretty clear summarizing
the tuples is way more expensive than merging the summaries. And that's
what matters for Amdahl's law ...

Review follows below.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)

-----------

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c

+ BrinShared *brinshared;

Needs some indentation fixes.

+    int            bs_reltuples;
[...]
+    state->bs_reltuples += reltuples;

My IDE warns me that reltuples is a double. Looking deeper into the
value, it contains the number of live tuples in the table, so this
conversion may not result in a meaningful value for tables with >=2^31
live tuples. Tables > 56GB could begin to get affected by this.
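To make the truncation risk concrete, here is a minimal standalone sketch
(hypothetical helper names, not PostgreSQL code): reltuples is a double, so
accumulating it into an int-typed field goes wrong once the running total
exceeds INT_MAX (~2.1e9 tuples), while a double keeps integer counts exact
far beyond that.

```c
#include <limits.h>
#include <stdbool.h>

/* Accumulating reltuples into a double stays exact up to 2^53. */
static double
accumulate_reltuples(double total, double reltuples)
{
    return total + reltuples;
}

/* An int-typed counter cannot represent totals beyond INT_MAX. */
static bool
exceeds_int_range(double total)
{
    return total > (double) INT_MAX;
}
```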

+ int bs_worker_id;

This variable seems to be unused.

+ BrinSpool *bs_spool;
+ BrinSpool *bs_spool_out;

Are both used? If so, could you add comments why we have two spools
here, instead of only one?

OK, I admit I'm not sure both are actually necessary. I was struggling
getting it working with just one spool, because when the leader
participates as a worker, it needs to both summarize some of the chunks
(and put the tuples somewhere). And then it also needs to consume the
final output.

Maybe it's just a case of cargo cult programming - I was mostly copying
stuff from the btree index build, trying to make it work, and then with
two spools it started working.

+/*
+ * A version of the callback, used by parallel index builds. The main difference
+ * is that instead of writing the BRIN tuples into the index, we write them into
+ * a shared tuplestore file, and leave the insertion up to the leader (which may

+ ... shared tuplesort, and ...

brinbuildCallbackParallel(...)
+ while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)

shouldn't this be an 'if' now?

Hmmm, probably ... that way we'd skip the empty ranges.

+        while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)
+            state->bs_currRangeStart += state->bs_pagesPerRange;

Is there a reason why you went with iterative addition instead of a
single divide-and-multiply like the following?:

+ state->bs_currRangeStart += state->bs_pagesPerRange *
((thisblock - state->bs_currRangeStart) / state->bs_pagesPerRange);

Probably laziness ... You're right the divide-multiply seems simpler.
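For illustration, the single-step advance can be sketched standalone
(BlockNumber here is just a stand-in typedef, not the PostgreSQL header
definition):

```c
typedef unsigned int BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

/*
 * Advance rangeStart to the start of the page range containing thisblock
 * in one step, instead of looping one range at a time. Assumes
 * thisblock >= rangeStart and pagesPerRange > 0.
 */
static BlockNumber
advance_range_start(BlockNumber rangeStart, BlockNumber thisblock,
                    BlockNumber pagesPerRange)
{
    return rangeStart +
        pagesPerRange * ((thisblock - rangeStart) / pagesPerRange);
}
```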

diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
[...]
-table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
+table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan, BlockNumber chunk_factor)
[...]
-    /* compare phs_syncscan initialization to similar logic in initscan */
+    bpscan->phs_chunk_factor = chunk_factor;
+    /* compare phs_syncscan initialization to similar logic in initscan
+     *
+     * Disable sync scans if the chunk factor is set (valid block number).
+     */

I think this needs some pgindent or other style work, both on comment
style and line lengths

Right.

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
[...]
+            Assert(false); (x3)

I think these can be cleaned up, right?

Duh! Absolutely, this shouldn't have been in the patch at all. I only
added those to quickly identify places that got the tuplesort into
an unexpected state (much easier with a coredump and a backtrace).

diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
[...]
+ * Computing BrinTuple size with only the tuple is difficult, so we want to track
+ * the length of the BrinTuple referenced by SortTuple. That's what BrinSortTuple
+ * is meant to do - it's essentially a BrinTuple prefixed by its length. We only
+ * write the BrinTuple to the logtapes, though.

Why don't we write the full BrinSortTuple to disk? Doesn't that make more sense?

Not sure I understand. We ultimately do, because we write

(length + BrinTuple)

and BrinSortTuple is exactly that. But if we write BrinSortTuple, we
would end up writing length for that too, no?

Or maybe I just don't understand how that would make things simpler.
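To make the layout being discussed concrete, here is a hedged standalone
sketch (the Stub types stand in for the real structs): in memory the sort
works with a length-prefixed tuple, while only the length and the
BrinTuple itself are written to the tape, so the prefix is never stored
twice.

```c
#include <stddef.h>

typedef struct BrinTupleStub        /* stand-in for the real BrinTuple */
{
    unsigned int bt_blkno;          /* block number the range starts at */
    /* ... variable-length summary data follows in the real struct ... */
} BrinTupleStub;

/*
 * In-memory representation used by the sort: the length prefix lets us
 * know the BrinTuple's size without recomputing it from the tuple alone.
 */
typedef struct BrinSortTupleStub
{
    size_t        tuplen;           /* length of the BrinTuple below */
    BrinTupleStub tuple;            /* variable-length in practice */
} BrinSortTupleStub;
```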

+    tuplesort_puttuple_common(state, &stup,
+                              base->sortKeys &&
+                              base->sortKeys->abbrev_converter &&
+                              !stup.isnull1);

Can't this last argument just be inlined, based on knowledge that we
don't use sortKeys in brin?

What does "inlined" mean for an argument? But yeah, I guess it might
just be set to false. And we should probably have an assert that there are
no sortKeys.

+comparetup_index_brin(const SortTuple *a, const SortTuple *b,
+                      Tuplesortstate *state)
+{
+    BrinTuple  *tuple1;
[...]
+    tuple1 = &((BrinSortTuple *) a)->tuple;
[...]

I'm fairly sure that this cast (and its neighbour) is incorrect and
should be the following instead:

+ tuple1 = &((BrinSortTuple *) (a->tuple))->tuple;

Additionally, I think the following would be a better approach here,
as we wouldn't need to do pointer-chasing:

Uh, right. This only works because 'tuple' happens to be the first field
in SortTuple.

+ static int
+ comparetup_index_brin(const SortTuple *a, const SortTuple *b,
+                      Tuplesortstate *state)
+ {
+    Assert(TuplesortstateGetPublic(state)->haveDatum1);
+
+    if (DatumGetUInt32(a->datum1) > DatumGetUInt32(b->datum1))
+        return 1;
+    if (DatumGetUInt32(a->datum1) < DatumGetUInt32(b->datum1))
+        return -1;
+     /* silence compilers */
+    return 0;
+ }

Good idea! I forgot we're guaranteed to have datum1.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#9)
Re: Parallel CREATE INDEX for BRIN indexes

On Fri, 14 Jul 2023 at 15:57, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 7/11/23 23:11, Matthias van de Meent wrote:

On Thu, 6 Jul 2023 at 16:13, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

One thing I was wondering about is whether it might be better to allow
the workers to process overlapping ranges, and then let the leader to
merge the summaries. That would mean we might not need the tableam.c
changes at all, but the leader would need to do more work (although the
BRIN indexes tend to be fairly small). The main reason that got me
thinking about this is that we have pretty much no tests for the union
procedures, because triggering that is really difficult. But for
parallel index builds that'd be much more common.

Hmm, that's a good point. I don't mind either way, but it would add
overhead in the leader to do all of that merging - especially when you
configure pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE as we'd
need to merge up to parallel_workers tuples. That could be a
significant overhead.

... thinks a bit.

Hmm, but with the current P_S_M_C_S of 8192 blocks that's quite
unlikely to be a serious problem - the per-backend IO saved by keeping
such large ranges on a single backend presumably has much more impact
than merging n_parallel_tasks max-sized brin tuples. So, seems fine
with me.

As for PARALLEL_SEQSCAN_MAX_CHUNK_SIZE, the last patch actually
considers the chunk_factor (i.e. pages_per_range) *after* doing

pbscanwork->phsw_chunk_size = Min(pbscanwork->phsw_chunk_size,
PARALLEL_SEQSCAN_MAX_CHUNK_SIZE);

so even with (pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE) it
would not need to merge anything.

Now, that might have been a bad idea and PARALLEL_SEQSCAN_MAX_CHUNK_SIZE
should be considered. In which case this *has* to do the union, even if
only for the rare corner case.

But I don't think that's a major issue - it's pretty clear that
summarizing the tuples is way more expensive than merging the
summaries, which is what matters for Amdahl's law ...

Agreed.

+ BrinSpool *bs_spool;
+ BrinSpool *bs_spool_out;

Are both used? If so, could you add comments why we have two spools
here, instead of only one?

OK, I admit I'm not sure both are actually necessary. I was struggling
getting it working with just one spool, because when the leader
participates as a worker, it needs to both summarize some of the chunks
(and put the tuples somewhere). And then it also needs to consume the
final output.

Maybe it's just a case of cargo cult programming - I was mostly copying
stuff from the btree index build, trying to make it work, and then with
two spools it started working.

Two spools seem to be necessary in a participating leader, but both
spools have non-overlapping lifetimes. In the btree code, two pairs of
spools are actually used (in unique indexes): you can see the
pairs being allocated in both _bt_leader_participate_as_worker (called
from _bt_begin_parallel, from _bt_spools_heapscan) and in
_bt_spools_heapscan.

diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
[...]
+ * Computing BrinTuple size with only the tuple is difficult, so we want to track
+ * the length of the BrinTuple referenced by SortTuple. That's what BrinSortTuple
+ * is meant to do - it's essentially a BrinTuple prefixed by its length. We only
+ * write the BrinTuple to the logtapes, though.

Why don't we write the full BrinSortTuple to disk? Doesn't that make more sense?

Not sure I understand. We ultimately do, because we write

(length + BrinTuple)

and BrinSortTuple is exactly that. But if we write BrinSortTuple, we
would end up writing length for that too, no?

Or maybe I just don't understand how that would make things simpler.

I don't quite understand the intricacies of the tape storage format
yet (specifically, I keep getting confused by the len -= sizeof(int)),
so you might well be correct.

My comment was written based on just the comment's contents, which
claims "we can't easily recompute the length, so we store it with the
tuple in memory. However, we don't store the length when we write it
to the tape", which seems self-contradictory.

+    tuplesort_puttuple_common(state, &stup,
+                              base->sortKeys &&
+                              base->sortKeys->abbrev_converter &&
+                              !stup.isnull1);

Can't this last argument just be inlined, based on knowledge that we
don't use sortKeys in brin?

What does "inlined" mean for an argument? But yeah, I guess it might
just be set to false. And we should probably have an assert that there are
no sortKeys.

"inlined", "precomputed", "const-ified"? I'm not super good at
vocabulary. But, indeed, thanks.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#10)
Re: Parallel CREATE INDEX for BRIN indexes

Hi,

here's an updated patch, addressing the review comments, and reworking
how the work is divided between the workers & leader etc.

0001 is just v2, rebased to current master

0002 and 0003 address most of the issues, in particular it

- removes the unnecessary spool
- fixes bs_reltuples type to double
- a couple comments are reworded to be clearer
- changes loop/condition in brinbuildCallbackParallel
- removes asserts added for debugging
- fixes cast in comparetup_index_brin
- 0003 then simplifies comparetup_index_brin
- I haven't inlined the tuplesort_puttuple_common parameter
(didn't seem worth it)

0004 Reworks how the work is divided between workers and combined by the
leader. It undoes the tableam.c changes that attempted to divide the
relation into chunks matching the BRIN ranges, and instead merges the
results in the leader (using the BRIN "union" function).

I haven't done any indentation fixes yet.

I did fairly extensive testing, using pageinspect to compare indexes
built with/without parallelism. More testing is needed, but it seems to
work fine (with other opclasses and so on).

In general I'm quite happy with the current state, and I believe it's
fairly close to be committable.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v3-0001-parallel-CREATE-INDEX-for-BRIN-v2.patch (+1065, -16)
v3-0002-fix-review-comments.patch (+23, -36)
v3-0003-simplify-comparetup_index_brin.patch (+4, -8)
v3-0004-remove-tableam-changes.patch (+115, -65)
#12Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#11)
Re: Parallel CREATE INDEX for BRIN indexes

On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

here's an updated patch, addressing the review comments, and reworking
how the work is divided between the workers & leader etc.

Thanks!

In general I'm quite happy with the current state, and I believe it's
fairly close to be committable.

Are you planning on committing the patches separately, or squashed? I
won't have much time this week for reviewing the patch, and it seems
like these patches are incremental, so some guidance on what you want
to be reviewed would be appreciated.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#13Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#12)
Re: Parallel CREATE INDEX for BRIN indexes

On 11/12/23 10:38, Matthias van de Meent wrote:

On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

here's an updated patch, addressing the review comments, and reworking
how the work is divided between the workers & leader etc.

Thanks!

In general I'm quite happy with the current state, and I believe it's
fairly close to be committable.

Are you planning on committing the patches separately, or squashed? I
won't have much time this week for reviewing the patch, and it seems
like these patches are incremental, so some guidance on what you want
to be reviewed would be appreciated.

Definitely squashed. I only kept them separate to make it more obvious
what the changes are.

If you need more time for a review, I can certainly wait. No rush.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#11)
Re: Parallel CREATE INDEX for BRIN indexes

On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

here's an updated patch, addressing the review comments, and reworking
how the work is divided between the workers & leader etc.

0001 is just v2, rebased to current master

0002 and 0003 address most of the issues, in particular it

- removes the unnecessary spool
- fixes bs_reltuples type to double
- a couple comments are reworded to be clearer
- changes loop/condition in brinbuildCallbackParallel
- removes asserts added for debugging
- fixes cast in comparetup_index_brin
- 0003 then simplifies comparetup_index_brin
- I haven't inlined the tuplesort_puttuple_common parameter
(didn't seem worth it)

OK, thanks

0004 Reworks how the work is divided between workers and combined by the
leader. It undoes the tableam.c changes that attempted to divide the
relation into chunks matching the BRIN ranges, and instead merges the
results in the leader (using the BRIN "union" function).

That's OK.

I haven't done any indentation fixes yet.

I did fairly extensive testing, using pageinspect to compare indexes
built with/without parallelism. More testing is needed, but it seems to
work fine (with other opclasses and so on).

After code-only review, here are some comments:

+++ b/src/backend/access/brin/brin.c
[...]
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_BRIN_SHARED        UINT64CONST(0xA000000000000001)
+#define PARALLEL_KEY_TUPLESORT            UINT64CONST(0xA000000000000002)

These shm keys use the same constants also in use in
access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
operations, I'd prefer if we didn't actively introduce conflicting
identifiers when we still have significant amounts of unused values
remaining.

+#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xA000000000000003)

This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the
others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
more than brin's), backend/executor/execParallel.c
(0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
I've not checked that their uses are exactly the same, I'd expect at
least btree to match mostly, if not fully, 1:1).
I think we could probably benefit from a less ad-hoc sharing of query
texts. I don't think that needs to happen specifically in this patch,
but I think it's something to keep in mind in future efforts.

+_brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
[...]
+    BrinSpool  *spool = state->bs_spool;
[...]
+    if (!state)
+        return;

I think the assignment to spool should be moved to below this
condition, as _brin_begin_parallel calls this with state=NULL when it
can't launch parallel workers, which will cause issues here.

+ state->bs_numtuples = brinshared->indtuples;

My IDE complains about bs_numtuples being an integer. This is a
pre-existing issue, but still valid: we can hit an overflow on tables
with pages_per_range=1 and relsize >= 2^31 pages. Extremely unlikely,
but problematic nonetheless.

+     * FIXME This probably needs some memory management fixes - we're reading
+     * tuples from the tuplesort, we're allocating an empty tuple, and so on.
+     * Probably better to release this memory.

This should probably be resolved.

I also noticed that this is likely to execute `union_tuples` many
times when pages_per_range is coprime with the parallel table scan's
block stride (or when we for other reasons have many tuples returned
for each range); and this `union_tuples` internally allocates and
deletes its own memory context for its deserialization of the 'b'
tuple. I think we should just pass a scratch context instead, so that
we don't have the overhead of continuously creating then deleting the
same memory context.

+++ b/src/backend/catalog/index.c
[...]
-        indexRelation->rd_rel->relam == BTREE_AM_OID)
+        (indexRelation->rd_rel->relam == BTREE_AM_OID ||
+         indexRelation->rd_rel->relam == BRIN_AM_OID))

I think this needs some more effort. I imagine a new
IndexAmRoutine->amcanbuildparallel is more appropriate than this
hard-coded list - external indexes may want to utilize the parallel
index creation planner facility, too.
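A hypothetical sketch of that suggestion, standalone for illustration
(the Stub names are illustrative, not the actual PostgreSQL definitions):
rather than hard-coding a list of AM OIDs in index.c, each index AM
declares whether it supports parallel builds via a flag in its routine
struct, and the planner checks that flag.

```c
#include <stdbool.h>

typedef struct IndexAmRoutineStub
{
    bool amcanbuildparallel;    /* does this AM support parallel builds? */
    /* ... the real IndexAmRoutine has many more fields ... */
} IndexAmRoutineStub;

/* The planner facility consults the AM flag instead of a hard-coded
 * OID list, so external AMs can opt in too. */
static bool
can_plan_parallel_build(const IndexAmRoutineStub *amroutine)
{
    return amroutine->amcanbuildparallel;
}

/* Small helper constructing a routine with the given flag. */
static bool
stub_allows_parallel(bool flag)
{
    IndexAmRoutineStub routine = { flag };

    return can_plan_parallel_build(&routine);
}
```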

Some notes:
As a project PostgreSQL seems to be trying to move away from
hardcoding heap into everything in favour of the more AM-agnostic
'table'. I suggest replacing all mentions of "heap" in the arguments
with "table", to reduce the work future maintainers need to do to fix
this. Even when this AM is mostly targetted towards the heap AM, other
AMs (such as those I've heard of that were developed internally at
EDB) use the same block-addressing that heap does, and should thus be
compatible with BRIN. Thus, "heap" is not a useful name here.

There are 2 new mentions of "tuplestore" in the patch, while the
structure used is tuplesort: one on form_and_spill_tuple, and one on
brinbuildCallbackParallel. Please update those comments.

That's it for code review. I'll do some performance comparisons and
testing soon, too.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#14)
Re: Parallel CREATE INDEX for BRIN indexes

On 11/20/23 20:48, Matthias van de Meent wrote:

On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

here's an updated patch, addressing the review comments, and reworking
how the work is divided between the workers & leader etc.

0001 is just v2, rebased to current master

0002 and 0003 address most of the issues, in particular it

- removes the unnecessary spool
- fixes bs_reltuples type to double
- a couple comments are reworded to be clearer
- changes loop/condition in brinbuildCallbackParallel
- removes asserts added for debugging
- fixes cast in comparetup_index_brin
- 0003 then simplifies comparetup_index_brin
- I haven't inlined the tuplesort_puttuple_common parameter
(didn't seem worth it)

OK, thanks

0004 Reworks how the work is divided between workers and combined by the
leader. It undoes the tableam.c changes that attempted to divide the
relation into chunks matching the BRIN ranges, and instead merges the
results in the leader (using the BRIN "union" function).

That's OK.

I haven't done any indentation fixes yet.

I did fairly extensive testing, using pageinspect to compare indexes
built with/without parallelism. More testing is needed, but it seems to
work fine (with other opclasses and so on).

After code-only review, here are some comments:

+++ b/src/backend/access/brin/brin.c
[...]
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_BRIN_SHARED        UINT64CONST(0xA000000000000001)
+#define PARALLEL_KEY_TUPLESORT            UINT64CONST(0xA000000000000002)

These shm keys use the same constants also in use in
access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
operations, I'd prefer if we didn't actively introduce conflicting
identifiers when we still have significant amounts of unused values
remaining.

Hmmm. Is there some rule of thumb how to pick these key values? I see
nbtsort.c uses 0xA prefix, execParallel.c uses 0xE, while parallel.c
ended up using 0xFFFFFFFFFFFF as prefix. I've used 0xB, simply because
BRIN also starts with B.

+#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xA000000000000003)

This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the
others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
more than brin's), backend/executor/execParallel.c
(0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
I've not checked that their uses are exactly the same, I'd expect at
least btree to match mostly, if not fully, 1:1).
I think we could probably benefit from a less ad-hoc sharing of query
texts. I don't think that needs to happen specifically in this patch,
but I think it's something to keep in mind in future efforts.

I'm afraid I don't quite get what you mean by "ad hoc sharing of query
texts". Are you saying we shouldn't propagate the query text to the
parallel workers? Why? Or what's the proper solution?

+_brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
[...]
+    BrinSpool  *spool = state->bs_spool;
[...]
+    if (!state)
+        return;

I think the assignment to spool should be moved to below this
condition, as _brin_begin_parallel calls this with state=NULL when it
can't launch parallel workers, which will cause issues here.

Good catch! I wonder if we have tests that might trigger this, say by
setting max_parallel_maintenance_workers > 0 while no workers allowed.

+ state->bs_numtuples = brinshared->indtuples;

My IDE complains about bs_numtuples being an integer. This is a
pre-existing issue, but still valid: we can hit an overflow on tables
with pages_per_range=1 and relsize >= 2^31 pages. Extremely unlikely,
but problematic nonetheless.

True. I think I've been hesitant to make this a double because it seems
a bit weird to do +1 with a double, and at some point (d == d+1). But
this seems safe, we're guaranteed to be far away from that threshold.

+     * FIXME This probably needs some memory management fixes - we're reading
+     * tuples from the tuplesort, we're allocating an empty tuple, and so on.
+     * Probably better to release this memory.

This should probably be resolved.

AFAICS that comment is actually inaccurate/stale, sorry about that. The
code actually allocates (and resets) a single memtuple, and also
emptyTuple. So I think that's OK, I've removed the comment.

I also noticed that this is likely to execute `union_tuples` many
times when pages_per_range is coprime with the parallel table scan's
block stride (or when we for other reasons have many tuples returned
for each range); and this `union_tuples` internally allocates and
deletes its own memory context for its deserialization of the 'b'
tuple. I think we should just pass a scratch context instead, so that
we don't have the overhead of continuously creating then deleting the
same memory context.

Perhaps. Looking at the code, isn't it a bit strange how union_tuples
uses the context? It creates the context, calls brin_deform_tuple in
that context, but then the rest of the function (including datumCopy and
similar stuff) happens in the caller's context ...

However, I don't think the number of union_tuples calls is likely to be
very high, especially for large tables. Because we split the table into
2048 chunks, and then cap the chunk size by 8192. For large tables
(where this matters) we're likely close to 8192.
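For reference, the chunk-size heuristic being described can be sketched
standalone; this is a simplification of what tableam.c does (the real code
also rounds the size up to a power of two and ramps chunks down near the
end of the scan):

```c
#define PARALLEL_SEQSCAN_NCHUNKS        2048
#define PARALLEL_SEQSCAN_MAX_CHUNK_SIZE 8192

typedef unsigned int BlockNumber;   /* stand-in typedef */

/* Split the table into ~2048 chunks, but never hand out a chunk larger
 * than 8192 blocks, nor smaller than 1 block. */
static BlockNumber
parallel_chunk_size(BlockNumber nblocks)
{
    BlockNumber size = nblocks / PARALLEL_SEQSCAN_NCHUNKS;

    if (size < 1)
        size = 1;                               /* tiny tables */
    if (size > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE)
        size = PARALLEL_SEQSCAN_MAX_CHUNK_SIZE; /* large tables */
    return size;
}
```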

+++ b/src/backend/catalog/index.c
[...]
-        indexRelation->rd_rel->relam == BTREE_AM_OID)
+        (indexRelation->rd_rel->relam == BTREE_AM_OID ||
+         indexRelation->rd_rel->relam == BRIN_AM_OID))

I think this needs some more effort. I imagine a new
IndexAmRoutine->amcanbuildparallel is more appropriate than this
hard-coded list - external indexes may want to utilize the parallel
index creation planner facility, too.

Good idea. I added the IndexAmRoutine flag and used it here.

Some notes:
As a project PostgreSQL seems to be trying to move away from
hardcoding heap into everything in favour of the more AM-agnostic
'table'. I suggest replacing all mentions of "heap" in the arguments
with "table", to reduce the work future maintainers need to do to fix
this. Even when this AM is mostly targetted towards the heap AM, other
AMs (such as those I've heard of that were developed internally at
EDB) use the same block-addressing that heap does, and should thus be
compatible with BRIN. Thus, "heap" is not a useful name here.

I'm not against doing that, but I'd prefer to do that in a separate
patch. There's a bunch of preexisting heap references, and I don't
want to introduce inconsistency (patch using table, old code heap),
nor do I want to tweak unrelated code.

There are 2 new mentions of "tuplestore" in the patch, while the
structure used is tuplesort: one on form_and_spill_tuple, and one on
brinbuildCallbackParallel. Please update those comments.

That's it for code review. I'll do some performance comparisons and
testing soon, too.

Thanks! Attached is a patch squashing the previous version into a single
v3 commit, with fixes for your review in a separate commit.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v4-0001-parallel-CREATE-INDEX-for-BRIN-v3.patch (+1091, -6)
v4-0002-review-fixes.patch (+22, -19)
#16Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#15)
Re: Parallel CREATE INDEX for BRIN indexes

Hi,

On Wed, 22 Nov 2023 at 20:16, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 11/20/23 20:48, Matthias van de Meent wrote:

On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

here's an updated patch, addressing the review comments, and reworking
how the work is divided between the workers & leader etc.

After code-only review, here are some comments:

+++ b/src/backend/access/brin/brin.c
[...]
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_BRIN_SHARED        UINT64CONST(0xA000000000000001)
+#define PARALLEL_KEY_TUPLESORT            UINT64CONST(0xA000000000000002)

These shm keys use the same constants also in use in
access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
operations, I'd prefer if we didn't actively introduce conflicting
identifiers when we still have significant amounts of unused values
remaining.

Hmmm. Is there some rule of thumb how to pick these key values?

None that I know of.
There is a warning in the various places that define these constants:
they must take care not to conflict with plan nodes' node_id, because
parallel plan execution uses plain plan node IDs as keys. As node_id
is int-sized, any manually created key value below 2^32 must be
guaranteed never to collide with a node ID in a parallel query.
But apart from that one case, I can't find a convention, no.

+#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xA000000000000003)

This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the
others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
more than brin's), backend/executor/execParallel.c
(0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
I've not checked that their uses are exactly the same, I'd expect at
least btree to match mostly, if not fully, 1:1).
I think we could probably benefit from a less ad-hoc sharing of query
texts. I don't think that needs to happen specifically in this patch,
but I think it's something to keep in mind in future efforts.

I'm afraid I don't quite get what you mean by "ad hoc sharing of query
texts". Are you saying we shouldn't propagate the query text to the
parallel workers? Why? Or what's the proper solution?

What I mean is that we have several different keys that all look like
they contain the debug query string, and always for the same debugging
purposes. For debugging, I think it'd be useful to use one well-known
key, rather than N well-known keys in each of the N parallel
subsystems.

But as mentioned, it doesn't need to happen in this patch, as that'd
increase scope beyond brin/index ams.

+ state->bs_numtuples = brinshared->indtuples;

My IDE complains about bs_numtuples being an integer. This is a
pre-existing issue, but still valid: we can hit an overflow on tables
with pages_per_range=1 and relsize >= 2^31 pages. Extremely unlikely,
but problematic nonetheless.

True. I think I've been hesitant to make this a double because it seems
a bit weird to do +1 with a double, and at some point (d == d+1). But
this seems safe, we're guaranteed to be far away from that threshold.

Yes, ignoring practical constraints like page space, we "only" have
bitspace for 2^48 tuples in each (non-partitioned) relation, so
double's 53 significant bits should be more than enough to count
tuples.
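That claim is easy to check standalone: an IEEE 754 double represents
every integer exactly up to 2^53, comfortably above the ~2^48 tuple
addressing limit, but loses exact increments beyond that.

```c
#include <stdbool.h>

/* Returns true if incrementing v by one still produces a distinct,
 * exactly-representable double; false once we're past the significand's
 * range of exact integers. */
static bool
double_counts_exactly_at(double v)
{
    return (v + 1.0) - v == 1.0;
}
```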

I also noticed that this is likely to execute `union_tuples` many
times when pages_per_range is coprime with the parallel table scan's
block stride (or when we for other reasons have many tuples returned
for each range); and this `union_tuples` internally allocates and
deletes its own memory context for its deserialization of the 'b'
tuple. I think we should just pass a scratch context instead, so that
we don't have the overhead of continuously creating then deleting the
same memory context.

Perhaps. Looking at the code, isn't it a bit strange how union_tuples
uses the context? It creates the context, calls brin_deform_tuple in
that context, but then the rest of the function (including datumCopy and
similar stuff) happens in the caller's context ...

The union operator may leak (lots of) memory, so I think it makes
sense to keep a context around that can be reset after we've extracted
the merge result.

However, I don't think the number of union_tuples calls is likely to be
very high, especially for large tables. Because we split the table into
2048 chunks, and then cap the chunk size by 8192. For large tables
(where this matters) we're likely close to 8192.

I agree that the merging part of the index creation is the last part,
and usually has no high impact on the total performance of the reindex
operation, but in memory-constrained environments releasing and then
requesting the same chunk of memory over and over again just isn't
great.
Also note that parallel scan chunk sizes decrease when we're about to
hit the end of the table, and that a different AM may have different
ideas about scanning a table in parallel; it could very well decide to
use striped assignments exclusively, as opposed to on-demand chunk
allocations; both increasing the chance that brin's page ranges are
processed by more than one backend.

As a project PostgreSQL seems to be trying to move away from
hardcoding heap into everything in favour of the more AM-agnostic
'table'. I suggest replacing all mentions of "heap" in the arguments
with "table", to reduce the work future maintainers need to do to fix
this.

I'm not against doing that, but I'd prefer to do that in a separate
patch. There's a bunch of preexisting heap references, and I don't
want to introduce inconsistency (patch using table, old code heap), nor
do I want to tweak unrelated code.

Sure.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#17Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#16)
Re: Parallel CREATE INDEX for BRIN indexes

On 11/23/23 13:33, Matthias van de Meent wrote:

Hi,

On Wed, 22 Nov 2023 at 20:16, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 11/20/23 20:48, Matthias van de Meent wrote:

On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

here's an updated patch, addressing the review comments, and reworking
how the work is divided between the workers & leader etc.

After code-only review, here are some comments:

+++ b/src/backend/access/brin/brin.c
[...]
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_BRIN_SHARED        UINT64CONST(0xA000000000000001)
+#define PARALLEL_KEY_TUPLESORT            UINT64CONST(0xA000000000000002)

These shm keys use the same constants also in use in
access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
operations, I'd prefer if we didn't actively introduce conflicting
identifiers when we still have significant amounts of unused values
remaining.

Hmmm. Is there some rule of thumb for how to pick these key values?

None that I know of.
There is a warning in various places that define these constants that
they take care to not conflict with plan node's node_id: parallel plan
execution uses plain plan node IDs as keys, and as node_id is
int-sized, any other key value that's created manually of value < 2^32
should be sure that it can't be executed in a parallel backend.
But apart from that one case, I can't find a convention, no.

OK, in that case 0xB is fine.

+#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xA000000000000003)

This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the
others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
more than brin's), backend/executor/execParallel.c
(0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
I've not checked that their uses are exactly the same, I'd expect at
least btree to match mostly, if not fully, 1:1).
I think we could probably benefit from a less ad-hoc sharing of query
texts. I don't think that needs to happen specifically in this patch,
but I think it's something to keep in mind in future efforts.

I'm afraid I don't quite get what you mean by "ad hoc sharing of query
texts". Are you saying we shouldn't propagate the query text to the
parallel workers? Why? Or what's the proper solution?

What I mean is that we have several different keys that all look like
they contain the debug query string, and always for the same debugging
purposes. For debugging, I think it'd be useful to use one well-known
key, rather than N well-known keys in each of the N parallel
subsystems.

But as mentioned, it doesn't need to happen in this patch, as that'd
increase scope beyond brin/index ams.

Agreed.

I also noticed that this is likely to execute `union_tuples` many
times when pages_per_range is coprime with the parallel table scan's
block stride (or when we for other reasons have many tuples returned
for each range); and this `union_tuples` internally allocates and
deletes its own memory context for its deserialization of the 'b'
tuple. I think we should just pass a scratch context instead, so that
we don't have the overhead of continuously creating and then deleting
the same memory context.

Perhaps. Looking at the code, isn't it a bit strange how union_tuples
uses the context? It creates the context, calls brin_deform_tuple in
that context, but then the rest of the function (including datumCopy and
similar stuff) happens in the caller's context ...

The union operator may leak (lots of) memory, so I think it makes
sense to keep a context around that can be reset after we've extracted
the merge result.

But does the current code actually achieve that? It does create a "brin
union" context, but then it only does this:

/* Use our own memory context to avoid retail pfree */
cxt = AllocSetContextCreate(CurrentMemoryContext,
"brin union",
ALLOCSET_DEFAULT_SIZES);
oldcxt = MemoryContextSwitchTo(cxt);
db = brin_deform_tuple(bdesc, b, NULL);
MemoryContextSwitchTo(oldcxt);

Surely that does not limit the amount of memory used by the actual union
functions in any way?

However, I don't think the number of union_tuples calls is likely to be
very high, especially for large tables. Because we split the table into
2048 chunks, and then cap the chunk size by 8192. For large tables
(where this matters) we're likely close to 8192.

I agree that the merging part of the index creation is the last part,
and usually has no high impact on the total performance of the reindex
operation, but in memory-constrained environments releasing and then
requesting the same chunk of memory over and over again just isn't
great.

OK, I'll take a look at the scratch context you suggested.

My point however was we won't actually do that very often, because on
large tables the BRIN ranges are likely smaller than the parallel scan
chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
ranges are large, there'll be few of them.

Also note that parallel scan chunk sizes decrease when we're about to
hit the end of the table, and that a different AM may have different
ideas about scanning a table in parallel; it could very well decide to
use striped assignments exclusively, as opposed to on-demand chunk
allocations; both increasing the chance that brin's page ranges are
processed by more than one backend.

Yeah, but the ramp-up and ramp-down should have negligible impact, IMO.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#17)
Re: Parallel CREATE INDEX for BRIN indexes

On Thu, 23 Nov 2023 at 14:35, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 11/23/23 13:33, Matthias van de Meent wrote:

The union operator may leak (lots of) memory, so I think it makes
sense to keep a context around that can be reset after we've extracted
the merge result.

But does the current code actually achieve that? It does create a "brin
union" context, but then it only does this:

/* Use our own memory context to avoid retail pfree */
cxt = AllocSetContextCreate(CurrentMemoryContext,
"brin union",
ALLOCSET_DEFAULT_SIZES);
oldcxt = MemoryContextSwitchTo(cxt);
db = brin_deform_tuple(bdesc, b, NULL);
MemoryContextSwitchTo(oldcxt);

Surely that does not limit the amount of memory used by the actual union
functions in any way?

Oh, yes, of course. For some reason I thought that covered the calls
to the union operator function too, but it indeed only covers
deserialization. I do think it is still worthwhile to not do the
create/delete cycle, but won't hold the patch back for that.

However, I don't think the number of union_tuples calls is likely to be
very high, especially for large tables. Because we split the table into
2048 chunks, and then cap the chunk size by 8192. For large tables
(where this matters) we're likely close to 8192.

I agree that the merging part of the index creation is the last part,
and usually has no high impact on the total performance of the reindex
operation, but in memory-constrained environments releasing and then
requesting the same chunk of memory over and over again just isn't
great.

OK, I'll take a look at the scratch context you suggested.

My point however was we won't actually do that very often, because on
large tables the BRIN ranges are likely smaller than the parallel scan
chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
ranges are large, there'll be few of them.

That's true, so maybe I'm concerned about something that amounts to
only marginal gains.

I noticed that the v4 patch doesn't yet update the documentation in
indexam.sgml with am->amcanbuildparallel.
Once that is included and reviewed I think this will be ready, unless
you want to address any of my comments upthread (that I marked with
'not in this patch') in this patch.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#19Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#18)
Re: Parallel CREATE INDEX for BRIN indexes

On 11/28/23 16:39, Matthias van de Meent wrote:

On Thu, 23 Nov 2023 at 14:35, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 11/23/23 13:33, Matthias van de Meent wrote:

The union operator may leak (lots of) memory, so I think it makes
sense to keep a context around that can be reset after we've extracted
the merge result.

But does the current code actually achieve that? It does create a "brin
union" context, but then it only does this:

/* Use our own memory context to avoid retail pfree */
cxt = AllocSetContextCreate(CurrentMemoryContext,
"brin union",
ALLOCSET_DEFAULT_SIZES);
oldcxt = MemoryContextSwitchTo(cxt);
db = brin_deform_tuple(bdesc, b, NULL);
MemoryContextSwitchTo(oldcxt);

Surely that does not limit the amount of memory used by the actual union
functions in any way?

Oh, yes, of course. For some reason I thought that covered the calls
to the union operator function too, but it indeed only covers
deserialization. I do think it is still worthwhile to not do the
create/delete cycle, but won't hold the patch back for that.

I think the union_tuples() changes are better left for a separate patch.

However, I don't think the number of union_tuples calls is likely to be
very high, especially for large tables. Because we split the table into
2048 chunks, and then cap the chunk size by 8192. For large tables
(where this matters) we're likely close to 8192.

I agree that the merging part of the index creation is the last part,
and usually has no high impact on the total performance of the reindex
operation, but in memory-constrained environments releasing and then
requesting the same chunk of memory over and over again just isn't
great.

OK, I'll take a look at the scratch context you suggested.

My point however was we won't actually do that very often, because on
large tables the BRIN ranges are likely smaller than the parallel scan
chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
ranges are large, there'll be few of them.

That's true, so maybe I'm concerned about something that amounts to
only marginal gains.

However, after thinking about this a bit more, I think we actually do
need to do something about the memory management when merging tuples.
AFAIK the general assumption was that union_tuple() only runs for a
single range, and then the whole context gets freed. But the way the
merging was implemented, it all runs in a single context. And while a
single union_tuple() may not need a lot memory, in total it may be
annoying. I just added a palloc(1MB) into union_tuples and ended up with
~160MB allocated in the PortalContext on just a 2GB table. In practice the
memory will grow more slowly, but not great :-/

The attached 0003 patch adds a memory context that's reset after
producing a merged BRIN tuple for each page range.
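The effect of that per-range reset can be illustrated with a toy model.
The Arena below is a stand-in for a PostgreSQL MemoryContext (with
arena_reset() playing the role of MemoryContextReset()), not actual
backend code: without the reset, peak memory grows with the number of
merged ranges, matching the observation above; with it, peak memory
stays bounded by a single range's allocations.

```c
#include <assert.h>
#include <stddef.h>

/* toy stand-in for a MemoryContext: tracks current and peak usage */
typedef struct Arena
{
    size_t      used;
    size_t      peak;
} Arena;

static void
arena_alloc(Arena *a, size_t sz)
{
    a->used += sz;
    if (a->used > a->peak)
        a->peak = a->used;
}

static void
arena_reset(Arena *a)
{
    a->used = 0;
}

/* merge nranges ranges, each allocating per_range bytes while merging */
static size_t
peak_memory(int nranges, size_t per_range, int reset_per_range)
{
    Arena       a = {0, 0};

    for (int i = 0; i < nranges; i++)
    {
        arena_alloc(&a, per_range);     /* work done while merging */
        if (reset_per_range)
            arena_reset(&a);            /* what 0003 adds, per range */
    }
    return a.peak;
}
```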

I noticed that the v4 patch doesn't yet update the documentation in
indexam.sgml with am->amcanbuildparallel.

Should be fixed by 0002. I decided to add a simple note to ambuild(),
not sure if something more is needed.

Once that is included and reviewed I think this will be ready, unless
you want to address any of my comments upthread (that I marked with
'not in this patch') in this patch.

Thanks. I believe the attached version addresses it. There's also 0004
with some indentation tweaks per pgindent.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v20231128-0001-parallel-CREATE-INDEX-for-BRIN-v3.patch (+1096/-7)
v20231128-0002-add-docs-for-amcanbuildparallel.patch (+7/-1)
v20231128-0003-use-per-range-memory-context-for-merging-i.patch (+30/-5)
v20231128-0004-pgindent.patch (+62/-56)
#20Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#19)
Re: Parallel CREATE INDEX for BRIN indexes

On Tue, 28 Nov 2023 at 18:59, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 11/28/23 16:39, Matthias van de Meent wrote:

On Thu, 23 Nov 2023 at 14:35, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 11/23/23 13:33, Matthias van de Meent wrote:

The union operator may leak (lots of) memory, so I think it makes
sense to keep a context around that can be reset after we've extracted
the merge result.

But does the current code actually achieve that? It does create a "brin
union" context, but then it only does this:

/* Use our own memory context to avoid retail pfree */
cxt = AllocSetContextCreate(CurrentMemoryContext,
"brin union",
ALLOCSET_DEFAULT_SIZES);
oldcxt = MemoryContextSwitchTo(cxt);
db = brin_deform_tuple(bdesc, b, NULL);
MemoryContextSwitchTo(oldcxt);

Surely that does not limit the amount of memory used by the actual union
functions in any way?

Oh, yes, of course. For some reason I thought that covered the calls
to the union operator function too, but it indeed only covers
deserialization. I do think it is still worthwhile to not do the
create/delete cycle, but won't hold the patch back for that.

I think the union_tuples() changes are better left for a separate patch.

However, I don't think the number of union_tuples calls is likely to be
very high, especially for large tables. Because we split the table into
2048 chunks, and then cap the chunk size by 8192. For large tables
(where this matters) we're likely close to 8192.

I agree that the merging part of the index creation is the last part,
and usually has no high impact on the total performance of the reindex
operation, but in memory-constrained environments releasing and then
requesting the same chunk of memory over and over again just isn't
great.

OK, I'll take a look at the scratch context you suggested.

My point however was we won't actually do that very often, because on
large tables the BRIN ranges are likely smaller than the parallel scan
chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
ranges are large, there'll be few of them.

That's true, so maybe I'm concerned about something that amounts to
only marginal gains.

However, after thinking about this a bit more, I think we actually do
need to do something about the memory management when merging tuples.
AFAIK the general assumption was that union_tuple() only runs for a
single range, and then the whole context gets freed.

Correct, but it also is (or should be) assumed that union_tuple
will be called several times in the same context to handle repeated
concurrent updates. Presumably, that only happens rarely, but it's
something that should be kept in mind regardless.

But the way the
merging was implemented, it all runs in a single context. And while a
single union_tuple() may not need a lot memory, in total it may be
annoying. I just added a palloc(1MB) into union_tuples and ended up with
~160MB allocated in the PortalContext on just a 2GB table. In practice the
memory will grow more slowly, but not great :-/

The attached 0003 patch adds a memory context that's reset after
producing a merged BRIN tuple for each page range.

Looks good.

This also made me think a bit more about how we're working with the
tuples. With your latest patch, we always deserialize and re-serialize
the sorted brin tuples, just in case the next tuple will also be a
BRIN tuple of the same page range. Could we save some of that
deserialization time by optimistically expecting that we're not going
to need to merge the tuple, and only store a local copy of it?
See attached 0002; this saves some cycles in common cases.
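The saving can be modeled with a small sketch (illustrative names, not
the patch code): with sorted input, a tuple only needs deforming when
an adjacent tuple covers the same page range, which is rare for large
tables, so lazy deforming skips the work in the common case.

```c
#include <assert.h>

/*
 * Model of the 0002 idea: count deform operations needed when draining
 * a sorted stream of BRIN tuples. Eager mode deforms every tuple just
 * in case the next one merges; lazy mode keeps the serialized copy and
 * deforms only when a neighbor shares the same page range.
 */
static int
deforms_needed(const int *ranges, int n, int lazy)
{
    int         deforms = 0;

    for (int i = 0; i < n; i++)
    {
        int         same_as_prev = (i > 0) && (ranges[i - 1] == ranges[i]);
        int         same_as_next = (i + 1 < n) && (ranges[i + 1] == ranges[i]);

        if (!lazy)
            deforms++;          /* eager: deform every tuple */
        else if (same_as_prev || same_as_next)
            deforms++;          /* lazy: deform only when merging */
    }
    return deforms;
}
```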

The v20231128 version of the patchset (as squashed, attached v5-0001)
looks good to me.

Kind regards,

Matthias van de Meent
Neon (http://neon.tech)

Attachments:

v5-0002-Reduce-de-forming-of-BRIN-tuples-in-parallel-BRIN.patch (+44/-15)
v5-0001-Allow-BRIN-to-build-its-index-in-parallel.patch (+1136/-7)
#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#20)
#22Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#21)
#23Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#22)
#24Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#23)
#25Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#24)
#26Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#25)
#27Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#26)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#27)
#29Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#28)
#30Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#29)
#31Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#30)
#32Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#31)
#33Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#32)
#34Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#33)
#35Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#34)
#36Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#35)
#37Alexander Lakhin
exclusion@gmail.com
In reply to: Tomas Vondra (#36)
#38Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#23)
#39Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#38)
#40Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#39)
#41Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#40)
#42Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#39)
#43Peter Eisentraut
peter_e@gmx.net
In reply to: Tomas Vondra (#35)
#44Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Eisentraut (#43)
#45Peter Eisentraut
peter_e@gmx.net
In reply to: Tomas Vondra (#44)