Add parallelism and glibc dependent only options to reindexdb
Hi,
With glibc 2.28 coming, all users will have to reindex almost
every index after a glibc upgrade to guarantee the absence of
corruption. Unfortunately, reindexdb is not ideal for that, as it
processes everything using a single connection and isn't able to
discard indexes that don't depend on a glibc collation.
PFA a patchset to add parallelism to reindexdb (reusing the
infrastructure in vacuumdb with some additions) and an option to
discard indexes that don't depend on glibc (without any specific
collation filtering or glibc version detection), with updated
regression tests. Note that this should be applied on top of the
existing reindexdb cleanup & refactoring patch
(https://commitfest.postgresql.org/23/2115/).
This was sponsored by VMware, and has been discussed internally with
Kevin and Michael, in Cc.
Attachments:
- 0001-Export-vacuumdb-s-parallel-infrastructure.patch
- 0003-Add-parallel-processing-to-reindexdb.patch
- 0004-Add-a-glibc-dependent-option.patch
- 0002-Add-a-SimplePtrList-implementation.patch
On Sun, Jun 30, 2019 at 11:45:47AM +0200, Julien Rouhaud wrote:
With glibc 2.28 coming, all users will have to reindex almost
every index after a glibc upgrade to guarantee the absence of
corruption. Unfortunately, reindexdb is not ideal for that, as it
processes everything using a single connection and isn't able to
discard indexes that don't depend on a glibc collation.
We have seen that, with a database of up to 100GB and a schema we
work on, we ended up cutting the reindex time from 30 minutes down
to a couple of minutes. Julien, what were the actual numbers?
PFA a patchset to add parallelism to reindexdb (reusing the
infrastructure in vacuumdb with some additions) and an option to
discard indexes that don't depend on glibc (without any specific
collation filtering or glibc version detection), with updated
regression tests. Note that this should be applied on top of the
existing reindexdb cleanup & refactoring patch
(https://commitfest.postgresql.org/23/2115/).
Please note that patch 0003 does not seem to apply correctly on HEAD
as of c74d49d4. Here is also a small description of each patch:
- 0001 refactors the connection slot facility from vacuumdb.c into a
new, separate file called parallel.c in src/bin/scripts/. This is not
really fancy, as some code only moves around.
- 0002 adds an extra option for simple lists to be able to use
pointers, with an interface to append elements in it.
- 0003 begins to be the actual fancy thing with the addition of a
--jobs option to reindexdb. The main issue here, which should be
discussed, is that when it comes to reindexing tables, you basically
are not going to have any conflicts between the objects manipulated.
However, if you wish to do a reindex on a set of indexes, then things
get more tricky, as it is necessary to list items per table so that
multiple connections do not conflict with each other when attempting to
work on multiple indexes of the same table. What this patch does is
to select the set of indexes which need to be worked on (see the
addition of cell in ParallelSlot), and then do a kind of
pre-planning of each item into the connection slots, so that each
connection knows from the beginning which items it needs to process.
This is quite different from vacuumdb, where a new item is distributed
only to a free connection from a unique list. I'd personally prefer
that we keep the facility in parallel.c execution-dependent only,
with no pre-planning. This would require keeping within
reindexdb.c an array of lists, with one list corresponding to one
connection instead, which feels more natural.
- 0004 is the part where the concurrent additions really matter, as
this consists in applying an extra filter to the indexes selected so
that only the glibc-sensitive indexes are chosen for processing.
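The per-table conflict handling and slot pre-planning described for 0003 can be sketched roughly as follows (a Python illustration with hypothetical names; the actual patch is C code inside reindexdb):

```python
# Sketch of the pre-planning idea: indexes are grouped by parent table,
# since two connections reindexing indexes of the same table would
# conflict on the table lock, then each table group is assigned up
# front to the least-loaded connection slot, biggest group first.
from collections import defaultdict

def preplan(indexes, njobs):
    """indexes: list of (index_name, table_name, size) tuples."""
    # Group indexes by their parent table so that all indexes of one
    # table end up on the same connection.
    by_table = defaultdict(list)
    for name, table, size in indexes:
        by_table[table].append((name, size))
    # Assign table groups to slots, largest total size first, always
    # to the currently least-loaded slot (greedy makespan heuristic).
    slots = [{"load": 0, "items": []} for _ in range(njobs)]
    groups = sorted(by_table.items(),
                    key=lambda kv: sum(s for _, s in kv[1]),
                    reverse=True)
    for table, idxs in groups:
        slot = min(slots, key=lambda s: s["load"])
        slot["items"].extend(n for n, _ in idxs)
        slot["load"] += sum(s for _, s in idxs)
    return [s["items"] for s in slots]

indexes = [("i_a1", "a", 100), ("i_a2", "a", 50),
           ("i_b1", "b", 120), ("i_c1", "c", 40)]
plan = preplan(indexes, 2)
```

Each returned list is what one connection would process, known before any connection is opened, which is the "pre-planning" contrast with vacuumdb's dynamic distribution.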
--
Michael
On Mon, Jul 1, 2019 at 10:55 AM Michael Paquier <michael@paquier.xyz> wrote:
On Sun, Jun 30, 2019 at 11:45:47AM +0200, Julien Rouhaud wrote:
With glibc 2.28 coming, all users will have to reindex almost
every index after a glibc upgrade to guarantee the absence of
corruption. Unfortunately, reindexdb is not ideal for that, as it
processes everything using a single connection and isn't able to
discard indexes that don't depend on a glibc collation.

We have seen that, with a database of up to 100GB and a schema we
work on, we ended up cutting the reindex time from 30 minutes down
to a couple of minutes. Julien, what were the actual numbers?
I did my benchmarking using a quite ideal database, having a large
number of tables and various sets of indexes, for a 75 GB total size.
This was done on my laptop, which has 6 multithreaded cores (and crappy
I/O), keeping the default max_parallel_maintenance_workers = 2.
A naive reindexdb took approximately 33 minutes. Filtering the list
of indexes took that down to slightly less than 15 minutes, but of course
each database will have a different ratio there.
Then, keeping --glibc-dependent and using different levels of parallelism:
-j1: ~ 14:50
-j3: ~ 7:30
-j6: ~ 5:23
-j8: ~ 4:45
That's pretty much the kind of results I was expecting given the
hardware I used.
PFA a patchset to add parallelism to reindexdb (reusing the
infrastructure in vacuumdb with some additions) and an option to
discard indexes that don't depend on glibc (without any specific
collation filtering or glibc version detection), with updated
regression tests. Note that this should be applied on top of the
existing reindexdb cleanup & refactoring patch
(https://commitfest.postgresql.org/23/2115/).

Please note that patch 0003 does not seem to apply correctly on HEAD
as of c74d49d4.
Yes, this is because this patchset has to be applied on top of the
reindexdb refactoring patch mentioned. It's sad that we don't have a
good way to deal with that kind of dependency, as it's also breaking
Thomas' cfbot :(
- 0003 begins to be the actual fancy thing with the addition of a
--jobs option to reindexdb. The main issue here, which should be
discussed, is that when it comes to reindexing tables, you basically
are not going to have any conflicts between the objects manipulated.
However, if you wish to do a reindex on a set of indexes, then things
get more tricky, as it is necessary to list items per table so that
multiple connections do not conflict with each other when attempting to
work on multiple indexes of the same table. What this patch does is
to select the set of indexes which need to be worked on (see the
addition of cell in ParallelSlot), and then do a kind of
pre-planning of each item into the connection slots, so that each
connection knows from the beginning which items it needs to process.
This is quite different from vacuumdb, where a new item is distributed
only to a free connection from a unique list. I'd personally prefer
that we keep the facility in parallel.c execution-dependent only,
with no pre-planning. This would require keeping within
reindexdb.c an array of lists, with one list corresponding to one
connection instead, which feels more natural.
My fear here is that this approach would add some extra complexity,
especially requiring us to deal with free-connection handling both in
GetIdleSlot() and the main reindexdb loop. Also, the pre-planning
allows us to start processing the biggest tables first, which
optimises the overall runtime.
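For contrast, the vacuumdb-style dynamic distribution being discussed can be simulated like this (a Python sketch with made-up item costs, not code from either tool): items are handed out one at a time to whichever connection frees up first.

```python
# Simulate dynamic dispatch: each item goes to the connection that
# becomes free first, mimicking a GetIdleSlot()-style loop.
import heapq

def dynamic_dispatch(items, njobs):
    """items: list of (name, cost). Returns the per-slot assignment
    and the simulated total runtime (makespan)."""
    # Priority queue of (time_slot_becomes_free, slot_id).
    free_at = [(0, j) for j in range(njobs)]
    heapq.heapify(free_at)
    assignment = [[] for _ in range(njobs)]
    for name, cost in items:
        t, j = heapq.heappop(free_at)      # wait for an idle slot
        assignment[j].append(name)
        heapq.heappush(free_at, (t + cost, j))
    return assignment, max(t for t, _ in free_at)

items = [("t_big", 30), ("t_mid", 20), ("t_small", 10)]
assignment, runtime = dynamic_dispatch(items, 2)
```

Both strategies balance load reasonably well; the difference is where the bookkeeping lives: pre-planning does it once up front and can sort biggest-first, while dynamic dispatch needs free-slot tracking inside the main loop, which is the complexity concern raised above.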
Now that we have REINDEX CONCURRENTLY, I think reindexdb is going to
gain more popularity.
Please don't reuse a file name as generic as "parallel.c" -- it's
annoying when navigating source. Maybe conn_parallel.c multiconn.c
connscripts.c admconnection.c ...?
If your server crashes or is stopped midway during the reindex, you
would have to start again from scratch, and it's tedious (if it's
possible at all) to determine which indexes were missed. I think it
would be useful to have a two-phase mode: in the initial phase reindexdb
computes the list of indexes to be reindexed and saves them into a work
table somewhere. In the second phase, it reads indexes from that table
and processes them, marking them as done in the work table. If the
second phase crashes or is stopped, it can be restarted and consults the
work table. I would keep the work table, as it provides a bit of an
audit trail. It may be important to be able to run even if unable to
create such a work table (because of the <ironic>numerous</> users that
DROP DATABASE postgres).
Maybe we'd have two flags in the work table for each index:
"reindex requested", "reindex done".
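The restartable two-phase mode with a two-flag work table could look roughly like this (a toy Python model using SQLite in place of a Postgres table; the schema and names are hypothetical):

```python
# Toy model of the proposed work table: phase 1 records the indexes to
# process, phase 2 processes only what is not yet marked done, so a
# crashed or stopped run can simply be restarted.
import sqlite3

def init_work_table(conn, indexes):
    # Phase 1: record every index to process; nothing marked done yet.
    conn.execute("""CREATE TABLE IF NOT EXISTS reindex_work (
                        idx TEXT PRIMARY KEY,
                        requested INTEGER NOT NULL DEFAULT 1,
                        done INTEGER NOT NULL DEFAULT 0)""")
    conn.executemany(
        "INSERT OR IGNORE INTO reindex_work (idx) VALUES (?)",
        [(i,) for i in indexes])

def run_phase2(conn, reindex_fn):
    # Phase 2: process only what is not done, marking each index as it
    # completes; re-running after a crash picks up the remainder.
    todo = [r[0] for r in conn.execute(
        "SELECT idx FROM reindex_work "
        "WHERE requested AND NOT done ORDER BY idx")]
    for idx in todo:
        reindex_fn(idx)  # would issue REINDEX INDEX CONCURRENTLY here
        conn.execute("UPDATE reindex_work SET done = 1 WHERE idx = ?",
                     (idx,))
    return todo

conn = sqlite3.connect(":memory:")
init_work_table(conn, ["i1", "i2", "i3"])
first = run_phase2(conn, lambda idx: None)
second = run_phase2(conn, lambda idx: None)  # restart: nothing left
```

The completed rows stay in the table afterwards, which is the audit-trail aspect mentioned above.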
The "glibc filter" thing (which I take to mean "indexes that depend on
collations") would apply to the first phase: it just skips adding other
indexes to the work table. I suppose ICU collations are not affected,
so the filter would be for glibc collations only? The --glibc-dependent
switch seems too ad-hoc. Maybe "--exclude-rule=glibc"? That way we can
add other rules later. (Not "--exclude=foo" because we'll want to add
the possibility to ignore specific indexes by name.)
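The switch shape being proposed could be sketched like this (--exclude-rule and --exclude are proposals from this thread, not existing reindexdb options; this is only an argparse illustration of the interface):

```python
# Sketch of a repeatable --exclude-rule switch, kept separate from a
# future --exclude that would name specific indexes.
import argparse

parser = argparse.ArgumentParser(prog="reindexdb")
parser.add_argument("--exclude-rule", action="append", default=[],
                    choices=["glibc"],  # more rules could be added later
                    help="skip indexes matching a predefined rule")
parser.add_argument("--exclude", action="append", default=[],
                    metavar="INDEX",
                    help="reserved for skipping specific indexes by name")

args = parser.parse_args(["--exclude-rule", "glibc"])
```

Using `action="append"` keeps both switches repeatable, so several rules and several index names could be combined on one command line.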
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Michael Paquier <michael@paquier.xyz> writes:
- 0003 begins to be the actual fancy thing with the addition of a
--jobs option to reindexdb. The main issue here, which should be
discussed, is that when it comes to reindexing tables, you basically
are not going to have any conflicts between the objects manipulated.
However, if you wish to do a reindex on a set of indexes, then things
get more tricky, as it is necessary to list items per table so that
multiple connections do not conflict with each other when attempting to
work on multiple indexes of the same table. What this patch does is
to select the set of indexes which need to be worked on (see the
addition of cell in ParallelSlot), and then do a kind of
pre-planning of each item into the connection slots, so that each
connection knows from the beginning which items it needs to process.
This is quite different from vacuumdb, where a new item is distributed
only to a free connection from a unique list. I'd personally prefer
that we keep the facility in parallel.c execution-dependent only,
with no pre-planning. This would require keeping within
reindexdb.c an array of lists, with one list corresponding to one
connection instead, which feels more natural.
Couldn't we make this enormously simpler and less bug-prone by just
dictating that --jobs applies only to reindex-table operations?
- 0004 is the part where the concurrent additions really matter, as
this consists in applying an extra filter to the indexes selected so
that only the glibc-sensitive indexes are chosen for processing.
I think you'd be better off to define and document this as "reindex
only collation-sensitive indexes", without any particular reference
to a reason why somebody might want to do that.
regards, tom lane
On Mon, Jul 1, 2019 at 3:51 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Please don't reuse a file name as generic as "parallel.c" -- it's
annoying when navigating source. Maybe conn_parallel.c multiconn.c
connscripts.c admconnection.c ...?
I could use scripts_parallel.[ch] as I've already used it in the #define part?
If your server crashes or is stopped midway during the reindex, you
would have to start again from scratch, and it's tedious (if it's
possible at all) to determine which indexes were missed. I think it
would be useful to have a two-phase mode: in the initial phase reindexdb
computes the list of indexes to be reindexed and saves them into a work
table somewhere. In the second phase, it reads indexes from that table
and processes them, marking them as done in the work table. If the
second phase crashes or is stopped, it can be restarted and consults the
work table. I would keep the work table, as it provides a bit of an
audit trail. It may be important to be able to run even if unable to
create such a work table (because of the <ironic>numerous</> users that
DROP DATABASE postgres).
Or we could create a table locally in each database; that would fix
this problem and probably make the code simpler?
It also raises some additional concerns about data expiration. I
guess that someone could launch the tool by mistake, kill reindexdb,
and run it again 2 months later while a lot of new objects have been
added, for instance.
The "glibc filter" thing (which I take to mean "indexes that depend on
collations") would apply to the first phase: it just skips adding other
indexes to the work table. I suppose ICU collations are not affected,
so the filter would be for glibc collations only?
Indeed, ICU shouldn't need such a filter. xxx_pattern_ops-based
indexes are also excluded.
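The filtering rule described here can be sketched in Python over made-up catalog rows (the real check would query pg_depend, pg_collation, and pg_opclass; the field names below are hypothetical):

```python
# Keep only indexes whose ordering actually depends on glibc: drop
# ICU-provided collations and indexes using only pattern_ops opclasses,
# which compare byte-wise and ignore the collation order.
def is_glibc_dependent(index):
    """index: dict with 'provider' ('libc', 'icu', or None) and
    'opclasses' (list of operator class names)."""
    if index["provider"] != "libc":
        return False  # ICU (or no collation at all) is unaffected
    if all(oc.endswith("_pattern_ops") for oc in index["opclasses"]):
        return False  # pattern_ops ignore the collation's sort order
    return True

indexes = [
    {"name": "i_text", "provider": "libc", "opclasses": ["text_ops"]},
    {"name": "i_icu", "provider": "icu", "opclasses": ["text_ops"]},
    {"name": "i_pat", "provider": "libc",
     "opclasses": ["text_pattern_ops"]},
]
targets = [i["name"] for i in indexes if is_glibc_dependent(i)]
```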
The --glibc-dependent
switch seems too ad-hoc. Maybe "--exclude-rule=glibc"? That way we can
add other rules later. (Not "--exclude=foo" because we'll want to add
the possibility to ignore specific indexes by name.)
That's a good point, I like the --exclude-rule switch.
On Mon, Jul 1, 2019 at 4:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael@paquier.xyz> writes:
- 0003 begins to be the actual fancy thing with the addition of a
--jobs option to reindexdb. The main issue here, which should be
discussed, is that when it comes to reindexing tables, you basically
are not going to have any conflicts between the objects manipulated.
However, if you wish to do a reindex on a set of indexes, then things
get more tricky, as it is necessary to list items per table so that
multiple connections do not conflict with each other when attempting to
work on multiple indexes of the same table. What this patch does is
to select the set of indexes which need to be worked on (see the
addition of cell in ParallelSlot), and then do a kind of
pre-planning of each item into the connection slots, so that each
connection knows from the beginning which items it needs to process.
This is quite different from vacuumdb, where a new item is distributed
only to a free connection from a unique list. I'd personally prefer
that we keep the facility in parallel.c execution-dependent only,
with no pre-planning. This would require keeping within
reindexdb.c an array of lists, with one list corresponding to one
connection instead, which feels more natural.

Couldn't we make this enormously simpler and less bug-prone by just
dictating that --jobs applies only to reindex-table operations?
That would also mean that we'll have to fall back to doing the reindex at
table level, even if we only want to reindex indexes that depend on
glibc. I'm afraid that this will often add a huge penalty.
- 0004 is the part where the concurrent additions really matter, as
this consists in applying an extra filter to the indexes selected so
that only the glibc-sensitive indexes are chosen for processing.

I think you'd be better off to define and document this as "reindex
only collation-sensitive indexes", without any particular reference
to a reason why somebody might want to do that.
We should still document that indexes based on ICU would be excluded?
I also realize that I totally forgot to update reindexdb.sgml. Sorry
about that, I'll fix with the next versions.
Julien Rouhaud wrote:
I think you'd be better off to define and document this as "reindex
only collation-sensitive indexes", without any particular reference
to a reason why somebody might want to do that.

We should still document that indexes based on ICU would be excluded?
But why exclude them?
As a data point, in the last 5 years, the en_US collation in ICU
had 7 different versions (across 11 major versions of ICU):
ICU Unicode en_US
54.1 7.0 137.56
55.1 7.0 153.56
56.1 8.0 153.64
57.1 8.0 153.64
58.2 9.0 153.72
59.1 9.0 153.72
60.2 10.0 153.80
61.1 10.0 153.80
62.1 11.0 153.88
63.2 11.0 153.88
64.2 12.1 153.97
The rightmost column corresponds to pg_collation.collversion
in Postgres.
Each time there's a new Unicode version, it seems
all collation versions are bumped. And there's a new Unicode
version pretty much every year these days.
Based on this, most ICU upgrades in practice would require reindexing
affected indexes.
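The check implied by Daniel's table can be sketched directly: an index needs a rebuild whenever the recorded pg_collation.collversion no longer matches what the linked ICU reports (data taken from the table above; the function name is illustrative):

```python
# en_US collation versions per ICU release, per the table above.
en_us_versions = {
    "54.1": "137.56", "55.1": "153.56", "56.1": "153.64",
    "57.1": "153.64", "58.2": "153.72", "59.1": "153.72",
    "60.2": "153.80", "61.1": "153.80", "62.1": "153.88",
    "63.2": "153.88", "64.2": "153.97",
}

def needs_reindex(recorded_collversion, icu_release):
    # Rebuild if the collation version the index was built under no
    # longer matches the one the current ICU release provides.
    return en_us_versions[icu_release] != recorded_collversion

# Upgrading ICU 55.1 -> 57.1 bumps en_US from 153.56 to 153.64:
upgrade_needs = needs_reindex("153.56", "57.1")
# 56.1 -> 57.1 keeps the same collation version, so no rebuild:
minor_needs = needs_reindex("153.64", "57.1")
```

This illustrates Daniel's point: only a few of these release-to-release steps keep the collation version stable, so most ICU upgrades would trigger a rebuild.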
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On 2019-Jul-01, Daniel Verite wrote:
But why exclude them?
As a data point, in the last 5 years, the en_US collation in ICU
had 7 different versions (across 11 major versions of ICU):
So we need a switch --include-rule=icu-collations?
(I mentioned "--exclude-rule=glibc" elsewhere in the thread, but I think
it should be --include-rule=glibc-collations instead, no?)
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 1, 2019 at 10:13 PM Daniel Verite <daniel@manitou-mail.org> wrote:
I think you'd be better off to define and document this as "reindex
only collation-sensitive indexes", without any particular reference
to a reason why somebody might want to do that.

We should still document that indexes based on ICU would be excluded?
But why exclude them?
As a data point, in the last 5 years, the en_US collation in ICU
had 7 different versions (across 11 major versions of ICU):

ICU Unicode en_US
54.1 7.0 137.56
55.1 7.0 153.56
56.1 8.0 153.64
57.1 8.0 153.64
58.2 9.0 153.72
59.1 9.0 153.72
60.2 10.0 153.80
61.1 10.0 153.80
62.1 11.0 153.88
63.2 11.0 153.88
64.2 12.1 153.97

The rightmost column corresponds to pg_collation.collversion
in Postgres.
Each time there's a new Unicode version, it seems
all collation versions are bumped. And there's a new Unicode
version pretty much every year these days.
Based on this, most ICU upgrades in practice would require reindexing
affected indexes.
I have a vague recollection that ICU was providing some backward
compatibility so that even if you upgrade your lib you can still get
the sort order that was active when you built your indexes, though
maybe for a limited number of versions.
Even if that's just me being delusional, I'd still prefer Alvaro's
approach to have distinct switches for each collation system.
On Mon, Jul 1, 2019 at 1:34 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
I have a vague recollection that ICU was providing some backward
compatibility so that even if you upgrade your lib you can still get
the sort order that was active when you built your indexes, though
maybe for a limited number of versions.
That isn't built in. Another database system that uses ICU handles
this by linking to multiple versions of ICU, each with its own UCA
version and associated collations. I don't think that we want to go
there, so it makes sense to make an upgrade that crosses ICU or glibc
versions as painless as possible.
Note that ICU does at least provide a standard way to use multiple
versions at once; the symbol names have the ICU version baked in.
You're actually calling the functions using the versioned symbol names
without realizing it, because there is macro trickery involved.
--
Peter Geoghegan
On Tue, Jul 2, 2019 at 8:34 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
Even if that's just me being delusional, I'd still prefer Alvaro's
approach to have distinct switches for each collation system.
Hi Julien,
Makes sense. But why use the name "glibc" in the code and user
interface? The name of the collation provider in PostgreSQL is "libc"
(for example in the CREATE COLLATION command), and the problem applies
no matter who makes your libc.
--
Thomas Munro
https://enterprisedb.com
On 2019-Jul-02, Thomas Munro wrote:
On Tue, Jul 2, 2019 at 8:34 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
Even if that's just me being delusional, I'd still prefer Alvaro's
approach to have distinct switches for each collation system.

Hi Julien,
Makes sense. But why use the name "glibc" in the code and user
interface? The name of the collation provider in PostgreSQL is "libc"
(for example in the CREATE COLLATION command), and the problem applies
no matter who makes your libc.
Makes sense. "If your libc is glibc and you go across an upgrade over
version X, please use --include-rule=libc-collation"
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 01, 2019 at 06:28:13PM +0200, Julien Rouhaud wrote:
On Mon, Jul 1, 2019 at 4:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Couldn't we make this enormously simpler and less bug-prone by just
dictating that --jobs applies only to reindex-table operations?
I had the same argument about the first patch sets actually, but... :)
That would also mean that we'll have to fall back to doing the reindex at
table level, even if we only want to reindex indexes that depend on
glibc. I'm afraid that this will often add a huge penalty.
Yes, I would expect that most of the time glibc-sensitive indexes are
also mixed with other ones which we don't care about here. One
advantage of Tom's argument, though, is that it makes it possible to
introduce --jobs in minimal steps:
1) Refactor the code for connection slots, without the cell addition
2) Introduce --jobs without INDEX support.
In short, the conflict business between indexes is something which
could be tackled afterwards in a separate patch. Parallel
reindexing at table level has value in itself, particularly with
CONCURRENTLY coming into the picture.
--
Michael
On Mon, Jul 01, 2019 at 06:14:20PM +0200, Julien Rouhaud wrote:
On Mon, Jul 1, 2019 at 3:51 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Please don't reuse a file name as generic as "parallel.c" -- it's
annoying when navigating source. Maybe conn_parallel.c multiconn.c
connscripts.c admconnection.c ...?

I could use scripts_parallel.[ch] as I've already used it in the
#define part?
multiconn.c sounds rather good, but I have a poor ear for any kind of
naming..
If your server crashes or is stopped midway during the reindex, you
would have to start again from scratch, and it's tedious (if it's
possible at all) to determine which indexes were missed. I think it
would be useful to have a two-phase mode: in the initial phase reindexdb
computes the list of indexes to be reindexed and saves them into a work
table somewhere. In the second phase, it reads indexes from that table
and processes them, marking them as done in the work table. If the
second phase crashes or is stopped, it can be restarted and consults the
work table. I would keep the work table, as it provides a bit of an
audit trail. It may be important to be able to run even if unable to
create such a work table (because of the <ironic>numerous</> users that
DROP DATABASE postgres).

Or we could create a table locally in each database; that would fix
this problem and probably make the code simpler?

It also raises some additional concerns about data expiration. I
guess that someone could launch the tool by mistake, kill reindexdb,
and run it again 2 months later while a lot of new objects have been
added, for instance.
These look like fancy additions, but that's not the core of the
problem, no? If you begin to play in this area you would need more
control options: basically a "continue" mode to be able to restart a
previously failed attempt, a "reinit" mode able to restart the
operation completely from scratch, and perhaps even a "reset" mode
which cleans up any data already present. Not really complex,
but this has to be maintained at the database level.
The --glibc-dependent
switch seems too ad-hoc. Maybe "--exclude-rule=glibc"? That way we can
add other rules later. (Not "--exclude=foo" because we'll want to add
the possibility to ignore specific indexes by name.)

That's a good point, I like the --exclude-rule switch.
Sounds kind of nice.
--
Michael
On 2019-07-01 22:46, Alvaro Herrera wrote:
On 2019-Jul-02, Thomas Munro wrote:
On Tue, Jul 2, 2019 at 8:34 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
Even if that's just me being delusional, I'd still prefer Alvaro's
approach to have distinct switches for each collation system.

Makes sense. But why use the name "glibc" in the code and user
interface? The name of the collation provider in PostgreSQL is "libc"
(for example in the CREATE COLLATION command), and the problem applies
no matter who makes your libc.

Makes sense. "If your libc is glibc and you go across an upgrade over
version X, please use --include-rule=libc-collation"
I think it might be better to put the logic of what indexes are
collation affected etc. into the backend REINDEX command. We are likely
to enhance the collation version and dependency tracking over time,
possibly soon, possibly multiple times, and it would be very cumbersome
to have to keep updating reindexdb with this. Moreover, since for
performance you likely want to reindex by table, implementing a logic of
"reindex all collation-affected indexes on this table" would be much
easier to do in the backend.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 1, 2019 at 11:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Jul 1, 2019 at 1:34 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
I have a vague recollection that ICU was providing some backward
compatibility so that even if you upgrade your lib you can still get
the sort order that was active when you built your indexes, though
maybe for a limited number of versions.

That isn't built in. Another database system that uses ICU handles
this by linking to multiple versions of ICU, each with its own UCA
version and associated collations. I don't think that we want to go
there, so it makes sense to make an upgrade that crosses ICU or glibc
versions as painless as possible.

Note that ICU does at least provide a standard way to use multiple
versions at once; the symbol names have the ICU version baked in.
You're actually calling the functions using the versioned symbol names
without realizing it, because there is macro trickery involved.
Ah, thanks for the clarification!
On Sun, Jun 30, 2019 at 11:45:47AM +0200, Julien Rouhaud wrote:
Hi,
With glibc 2.28 coming, all users will have to reindex almost
every index after a glibc upgrade to guarantee the absence of
corruption. Unfortunately, reindexdb is not ideal for that, as it
processes everything using a single connection and isn't able to
discard indexes that don't depend on a glibc collation.

PFA a patchset to add parallelism to reindexdb (reusing the
infrastructure in vacuumdb with some additions) and an option to
discard indexes that don't depend on glibc (without any specific
collation filtering or glibc version detection), with updated
regression tests. Note that this should be applied on top of the
existing reindexdb cleanup & refactoring patch
(https://commitfest.postgresql.org/23/2115/).

This was sponsored by VMware, and has been discussed internally with
Kevin and Michael, in Cc.
I wonder why this is necessary:

pg_log_error("cannot reindex glibc dependent objects and a subset of objects");

What's the reasoning behind that? It seems like a valid use case to me -
imagine you have a big database, but only a couple of tables are used by
the application regularly (the rest may be archive tables, for example).
Why not allow rebuilding glibc-dependent indexes on the used tables, so
that the database can be opened for users sooner?
BTW now that we allow rebuilding only some of the indexes, it'd be great
to have a dry-run mode, where we just print which indexes will be rebuilt
without actually rebuilding them.
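The dry-run idea can be sketched quickly (a Python illustration; --dry-run is a proposal here, not an existing reindexdb flag):

```python
# With dry_run=True the tool only reports the statements it would run;
# nothing is sent to the server.
def reindex_indexes(indexes, dry_run=False, concurrently=True):
    executed = []
    for idx in indexes:
        stmt = "REINDEX INDEX {}{}".format(
            "CONCURRENTLY " if concurrently else "", idx)
        if dry_run:
            print(stmt)            # report only, touch nothing
        else:
            executed.append(stmt)  # would be sent over libpq here
    return executed

ran = reindex_indexes(["i_text", "i_cat"], dry_run=True)
```

Combined with the index filtering discussed above, this would let users audit exactly which indexes a filtered run is going to rebuild before committing hours of I/O to it.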
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jul 2, 2019 at 9:19 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
On 2019-07-01 22:46, Alvaro Herrera wrote:
On 2019-Jul-02, Thomas Munro wrote:
On Tue, Jul 2, 2019 at 8:34 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
Even if that's just me being delusional, I'd still prefer Alvaro's
approach to have distinct switches for each collation system.

Makes sense. But why use the name "glibc" in the code and user
interface? The name of the collation provider in PostgreSQL is "libc"
(for example in the CREATE COLLATION command), and the problem applies
no matter who makes your libc.

Makes sense. "If your libc is glibc and you go across an upgrade over
version X, please use --include-rule=libc-collation"

I think it might be better to put the logic of what indexes are
collation affected etc. into the backend REINDEX command. We are likely
to enhance the collation version and dependency tracking over time,
possibly soon, possibly multiple times, and it would be very cumbersome
to have to keep updating reindexdb with this. Moreover, since for
performance you likely want to reindex by table, implementing a logic of
"reindex all collation-affected indexes on this table" would be much
easier to do in the backend.
That's a great idea, and would make the parallelism in reindexdb much
simpler. There's however a downside, as users won't have a way to
benefit from index filtering until they upgrade to this version. OTOH
glibc 2.28 is already there, and a hypothetical fancy reindexdb is far
from being released.
On Tue, Jul 2, 2019 at 10:28 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I wonder why this is necessary:

pg_log_error("cannot reindex glibc dependent objects and a subset of objects");

What's the reasoning behind that? It seems like a valid use case to me -
imagine you have a big database, but only a couple of tables are used by
the application regularly (the rest may be archive tables, for example).
Why not allow rebuilding glibc-dependent indexes on the used tables, so
that the database can be opened for users sooner?
It just seemed wrong to me to allow partial processing for something
that's aimed at preventing corruption. I'd think that if users are
knowledgeable enough to only reindex a subset of indexes/tables in
such cases, they can also discard indexes that don't get affected by a
collation lib upgrade. I'm not strongly opposed to supporting it,
though, as there can indeed be valid use cases.
BTW now that we allow rebuilding only some of the indexes, it'd be great
to have a dry-run mode, where we just print which indexes will be rebuilt
without actually rebuilding them.
+1. If we end up doing the filtering in the backend, we'd have to add
such an option to the REINDEX command, and actually issue all the orders
to retrieve the list.