pgsql: Support parallel btree index builds.
Support parallel btree index builds.
To make this work, tuplesort.c and logtape.c must also support
parallelism, so this patch adds that infrastructure and then applies
it to the particular case of parallel btree index builds. Testing
to date shows that this can often be 2-3x faster than a serial
index build.
The model for deciding how many workers to use is fairly primitive
at present, but it's better than not having the feature. We can
refine it as we get more experience.
Peter Geoghegan with some help from Rushabh Lathia. While Heikki
Linnakangas is not an author of this patch, he wrote other patches
without which this feature would not have been possible, and
therefore the release notes should possibly credit him as an author
of this feature. Reviewed by Claudio Freire, Heikki Linnakangas,
Thomas Munro, Tels, Amit Kapila, me.
Discussion: /messages/by-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
Discussion: /messages/by-id/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com
Branch
------
master
Details
-------
https://git.postgresql.org/pg/commitdiff/9da0cc35284bdbe8d442d732963303ff0e0a40bc
Modified Files
--------------
contrib/bloom/blinsert.c | 3 +-
doc/src/sgml/config.sgml | 44 +-
doc/src/sgml/monitoring.sgml | 12 +-
doc/src/sgml/ref/create_index.sgml | 58 ++
doc/src/sgml/ref/create_table.sgml | 4 +-
src/backend/access/brin/brin.c | 4 +-
src/backend/access/gin/gininsert.c | 2 +-
src/backend/access/gist/gistbuild.c | 2 +-
src/backend/access/hash/hash.c | 2 +-
src/backend/access/hash/hashsort.c | 1 +
src/backend/access/heap/heapam.c | 28 +-
src/backend/access/nbtree/nbtree.c | 134 +---
src/backend/access/nbtree/nbtsort.c | 878 +++++++++++++++++++++++++-
src/backend/access/spgist/spginsert.c | 3 +-
src/backend/access/transam/parallel.c | 12 +-
src/backend/bootstrap/bootstrap.c | 2 +-
src/backend/catalog/heap.c | 2 +-
src/backend/catalog/index.c | 123 +++-
src/backend/catalog/toasting.c | 1 +
src/backend/commands/cluster.c | 3 +-
src/backend/commands/indexcmds.c | 7 +-
src/backend/executor/execParallel.c | 2 +-
src/backend/executor/nodeAgg.c | 6 +-
src/backend/executor/nodeSort.c | 2 +-
src/backend/optimizer/path/allpaths.c | 18 +-
src/backend/optimizer/path/costsize.c | 4 +-
src/backend/optimizer/plan/planner.c | 136 ++++
src/backend/postmaster/pgstat.c | 3 +
src/backend/storage/file/buffile.c | 61 +-
src/backend/storage/file/fd.c | 10 +
src/backend/utils/adt/orderedsetaggs.c | 2 +
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc.c | 10 +
src/backend/utils/misc/postgresql.conf.sample | 3 +-
src/backend/utils/probes.d | 2 +-
src/backend/utils/sort/logtape.c | 199 +++++-
src/backend/utils/sort/tuplesort.c | 595 ++++++++++++++---
src/include/access/nbtree.h | 14 +-
src/include/access/parallel.h | 4 +-
src/include/access/relscan.h | 1 +
src/include/catalog/index.h | 9 +-
src/include/miscadmin.h | 1 +
src/include/nodes/execnodes.h | 6 +-
src/include/optimizer/paths.h | 2 +-
src/include/optimizer/planner.h | 1 +
src/include/pgstat.h | 1 +
src/include/storage/buffile.h | 2 +
src/include/storage/fd.h | 1 +
src/include/utils/logtape.h | 39 +-
src/include/utils/tuplesort.h | 132 +++-
src/tools/pgindent/typedefs.list | 6 +
51 files changed, 2237 insertions(+), 361 deletions(-)
Hi,
On 2018-02-02 18:37:11 +0000, Robert Haas wrote:
Support parallel btree index builds.
Wheee! Congrats Peter, Rushash, and everyone else involved!
- Andres
On Sun, Feb 4, 2018 at 9:42 AM, Andres Freund <andres@anarazel.de> wrote:
Wheee! Congrats Peter, Rushash, and everyone else involved!
Thanks!
--
Peter Geoghegan
Andres Freund <andres@anarazel.de> writes:
On 2018-02-02 18:37:11 +0000, Robert Haas wrote:
Support parallel btree index builds.
Wheee! Congrats Peter, Rushash, and everyone else involved!
I'll be happier about it when the valgrind buildfarm animals are
happy.
regards, tom lane
On Sun, Feb 4, 2018 at 10:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'll be happier about it when the valgrind buildfarm animals are
happy.
I don't know if you noticed, but I did post a patch for that on Friday.
--
Peter Geoghegan
On Sun, Feb 4, 2018 at 1:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'll be happier about it when the valgrind buildfarm animals are
happy.
Me too, but it's not clear what the right fix is. One thing that
would help is if you put in an appearance on the thread where this is
being discussed and cast a vote. (Ditto to Andres.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Sun, Feb 4, 2018 at 1:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'll be happier about it when the valgrind buildfarm animals are
happy.
Me too, but it's not clear what the right fix is. One thing that
would help is if you put in an appearance on the thread where this is
being discussed and cast a vote. (Ditto to Andres.)
If you mean do I like fixing this by adding a valgrind suppression,
no I do not. Valgrind suppressions are last-resort band-aids IMO,
to be applied only when it's clearly understood what behavior we're
masking and why it's more reasonable to mask it than make it better
defined. I, at least, don't have that understanding from looking
at the thread. For one thing, Peter has not explained why this issue
appears now with parallel index build when it did not before; it's
not like logtape.c isn't old enough to vote.
Even granting that a suppression is the way to fix it, the proposed
suppression seems pretty darn broad, and hence likely to mask things
we'd wish it hadn't.
regards, tom lane
On Tue, Feb 6, 2018 at 10:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Sun, Feb 4, 2018 at 1:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'll be happier about it when the valgrind buildfarm animals are
happy.Me too, but it's not clear what the right fix is. One thing that
would help is if you put in an appearance on the thread where this is
being discussed and cast a vote. (Ditto to Andres.)If you mean do I like fixing this by adding a valgrind suppression,
no I do not. Valgrind suppressions are last-resort band-aids IMO,
to be applied only when it's clearly understood what behavior we're
masking and why it's more reasonable to mask it than make it better
defined. I, at least, don't have that understanding from looking
at the thread. For one thing, Peter has not explained why this issue
appears now with parallel index build when it did not before; it's
not like logtape.c isn't old enough to vote.
Yeah, he has actually. In other cases, the buffer is guaranteed to
have been filled at least once (and thus, from valgrind's point of
view, is initialized) because if that weren't going to happen then we
would have not have switched to a tape-sort in the first place. You
can't set work_mem smaller than 8kB. But in the parallel case each
worker must always produce a tape, so it can happen if a worker is
unlucky enough to get only a very small slice of the data (because the
other participants gobble it all up before that process really gets
going).
Even granting that a suppression is the way to fix it, the proposed
suppression seems pretty darn broad, and hence likely to mask things
we'd wish it hadn't.
Well, he talked about that at some length too. I don't know how
you're not seeing it on the thread. But what I really need here is
some input on an option you do like, not just a list of things you
don't like.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Feb 6, 2018 at 10:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
... I, at least, don't have that understanding from looking
at the thread. For one thing, Peter has not explained why this issue
appears now with parallel index build when it did not before; it's
not like logtape.c isn't old enough to vote.
Yeah, he has actually. In other cases, the buffer is guaranteed to
have been filled at least once (and thus, from valgrind's point of
view, is initialized) because if that weren't going to happen then we
would have not have switched to a tape-sort in the first place. You
can't set work_mem smaller than 8kB. But in the parallel case each
worker must always produce a tape, so it can happen if a worker is
unlucky enough to get only a very small slice of the data (because the
other participants gobble it all up before that process really gets
going).
Ah, I see. So this is really a problem that's been latent all along,
but was never exposed in any previous use-case for logtape.c.
But what I really need here is
some input on an option you do like, not just a list of things you
don't like.
I like the option of doing VALGRIND_MAKE_MEM_DEFINED on the tail
portion of the buffer before writing it. That seems pretty tightly
tied to the behavior we're decreeing valid, whereas the suppression
is not.
regards, tom lane
On Tue, Feb 6, 2018 at 8:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I like the option of doing VALGRIND_MAKE_MEM_DEFINED on the tail
portion of the buffer before writing it. That seems pretty tightly
tied to the behavior we're decreeing valid, whereas the suppression
is not.
I think that the suppression is actually slightly better scoped than
that, since, for example, that won't just affect writes of
uninitialized bytes from the buffer. But I'll do it that way.
--
Peter Geoghegan
On 2018-Feb-02, Robert Haas wrote:
Support parallel btree index builds.
While looking at a complaint related to progress report of parallel
index builds[1]/messages/by-id/CAEze2Wgm-NnZe3rOnwjYTVriS8xsVhzzVBCGj34h06cDNuaTig@mail.gmail.com, I noticed this comment
+ /*
+ * Execute this worker's part of the sort.
+ *
+ * Unlike leader and serial cases, we cannot avoid calling
+ * tuplesort_performsort() for spool2 if it ends up containing no dead
+ * tuples (this is disallowed for workers by tuplesort).
+ */
+ tuplesort_performsort(btspool->sortstate);
+ if (btspool2)
+ tuplesort_performsort(btspool2->sortstate);
I've been trying to understand why this says "Unlike leader and serial
cases, ...". I understand the "serial" part -- it refers to
_bt_leafbuild. So I'm to understand that that one works differently;
see below. But why does it say "the leader case"? As far as I can see,
the leader executes exactly the same code, so what is the comment
talking about?
Now, if you do look at _bt_leafbuild(), it can be seen that nothing is
done differently there either; we're not actually skipping any calls to
tuplesort_performsort(). Any differentiation between serial/leader/
worker cases seems to be done inside that routine. So the comment is
not very useful there either.
I am wondering if these comments are leftovers from early development
versions of this patch. Maybe we could remove them -- or rewrite them
to indicate not that we avoid calling tuplesort_performsort(), but
instead to say that that function behaves differently.
[1]: /messages/by-id/CAEze2Wgm-NnZe3rOnwjYTVriS8xsVhzzVBCGj34h06cDNuaTig@mail.gmail.com
--
�lvaro Herrera 39�49'30"S 73�17'W
"Puedes vivir s�lo una vez, pero si lo haces bien, una vez es suficiente"
On Mon, Jun 7, 2021 at 4:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Now, if you do look at _bt_leafbuild(), it can be seen that nothing is
done differently there either; we're not actually skipping any calls to
tuplesort_performsort(). Any differentiation between serial/leader/
worker cases seems to be done inside that routine. So the comment is
not very useful there either.I am wondering if these comments are leftovers from early development
versions of this patch. Maybe we could remove them -- or rewrite them
to indicate not that we avoid calling tuplesort_performsort(), but
instead to say that that function behaves differently.
It's talking about something described in the tuplesort.h contract. It
applies to a tuplesort state, not a process -- the leader always has two
tuplesort states (the leader tuplesort state, plus its own worker
tuplesort state).
The leader tuplesort is very much like a serial tuplesort. In
particular, as step 8 in tuplesort.h points out, the leader doesn't
have to call tuplesort_performsort() for the leader tuplesort state if
it already knows that there is no input to sort.
This matters less than it might in a world where we had a user of
parallel tuplesort that doesn't always simply make the leader
participate as a worker. There is a build-time testing option in
nbtsort.c that does this for parallel CREATE INDEX, actually -- see
DISABLE_LEADER_PARTICIPATION.
You kind of have a point about this being something that made more
sense in revisions of the patch from before commit, though. There was
a question about the cost model and the role of the leader that was
ultimately resolved by inventing the current simple behavior. So
feel free to change the wording now.
--
Peter Geoghegan