pg_dump versus hash partitioning

Started by Tom Lane · about 3 years ago · 49 messages · pgsql-hackers
#1 Tom Lane
tgl@sss.pgh.pa.us

Over at [1] we have a complaint that dump-and-restore fails for
hash-partitioned tables if a partitioning column is an enum,
because the enum values are unlikely to receive the same OIDs
in the destination database as they had in the source, and the
hash codes are dependent on those OIDs. So restore tries to
load rows into the wrong leaf tables, and it's all a mess.
The patch approach proposed at [1] doesn't really work, but
what does work is to use pg_dump's --load-via-partition-root
option, so that the tuple routing decisions are all re-made.
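
The failure mode is easy to model. The sketch below is purely illustrative Python, not PostgreSQL code: `route` and `hash_oid` are invented stand-ins for the server's tuple-routing and hashenum() machinery, and the OID values are made up. It shows why a hash computed from an enum value's OID sends the same label to different leaves once the OIDs change:

```python
# Hash-partition routing keyed on an internal surrogate (the enum
# value's OID) rather than on the label itself.

def hash_oid(oid: int) -> int:
    # Stand-in for hashenum(); any function of the OID shows the problem.
    return oid

def route(label: str, oid_map: dict, num_partitions: int) -> int:
    # The server hashes the OID assigned to the enum label, not the label.
    return hash_oid(oid_map[label]) % num_partitions

# OIDs assigned when the enum type was created in the source database...
source_oids = {"red": 16385, "green": 16386, "blue": 16387}
# ...versus the OIDs the destination happens to assign on restore.
dest_oids = {"red": 24578, "green": 24579, "blue": 24580}

for label in source_oids:
    print(label, route(label, source_oids, 4), route(label, dest_oids, 4))
# "red" hashed to partition 1 in the source but routes to partition 2 in
# the destination, so the per-leaf COPY emitted by pg_dump lands in a leaf
# whose implicit partition constraint the row no longer satisfies.
```

Restoring with --load-via-partition-root sidesteps this by inserting through the parent table, so the destination server re-makes every routing decision with its own OIDs.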

I'd initially proposed that we force --load-via-partition-root
if we notice that we have hash partitioning on an enum column.
But the more I thought about this, the more comparable problems
came to mind:

1. Hash partitioning on text columns will likely fail if the
destination uses a different encoding.

2. Hash partitioning on float columns will fail if you use
--extra-float-digits to round off the values. And then
there's the fact that the behavior of strtod() might vary
across platforms.

3. Hash partitioning on floats is also endian-dependent,
and the same is likely true for some other types.

4. Anybody want to bet that complex types such as jsonb
are entirely free of similar hazards? (Yes, somebody
thought it'd be a good idea to provide jsonb_hash.)

In general, we've never thought that hash values are
required to be consistent across platforms.
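
Point 3 is mechanical: a hash computed over a float's in-memory bytes (as PostgreSQL's float hashing effectively does, feeding the datum's bytes to hash_any()) depends on the machine's byte order. A small sketch, using SHA-256 merely as a stand-in byte-oriented hash:

```python
import hashlib
import struct

def hash_double_bytes(x: float, byte_order: str) -> bytes:
    # Pack the IEEE 754 double as it would sit in memory on a machine of
    # the given endianness ("<" little, ">" big), then hash those bytes.
    raw = struct.pack(byte_order + "d", x)
    return hashlib.sha256(raw).digest()

x = 3.14
little = struct.pack("<d", x)
big = struct.pack(">d", x)
print(little.hex(), big.hex())
# The two byte sequences are each other's reverse, so any byte-oriented
# hash of them differs, and the same value routes to different hash
# partitions on little- vs big-endian hosts.
```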

That was leading me to think that we should force
--load-via-partition-root for any hash-partitioned table,
just to pre-emptively avoid these problems. But then
I remembered that

5. Range partitioning on text columns will likely fail if the
destination uses a different collation.
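
The range case is the same kind of hazard: routing is a comparator-driven binary search over the bound keys, so changing the collation changes which side of a bound a string falls on. A toy model (Python; the sort-key transforms are invented to mimic, say, a collation that sorts Czech "ch" after "h" versus plain codepoint order):

```python
import bisect

# Range-partitioned table with one bound: partition 0 holds keys < "h",
# partition 1 holds keys >= "h".
BOUNDS = ["h"]

def route(value: str, sort_key) -> int:
    # Routing is a binary search using the collation's ordering,
    # modeled here as a sort-key transform.
    keys = [sort_key(b) for b in BOUNDS]
    return bisect.bisect_right(keys, sort_key(value))

plain = lambda s: s                      # codepoint order: "ch" < "h"
czech = lambda s: s.replace("ch", "h~")  # crude model: "ch" sorts after "h"

print(route("ch", plain))  # partition 0 under codepoint order
print(route("ch", czech))  # partition 1 under the Czech-like ordering
# A row dumped from partition 0 then fails partition 0's range
# constraint when restored under the other collation.
```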

This is looking like a very large-caliber foot-gun, isn't it?
And remember that --load-via-partition-root acts at pg_dump
time, not restore. If all you have is a dump file with no
opportunity to go back and get a new one, and it won't load
into your new server, you have got a nasty problem.

I don't think this is an acceptable degree of risk, considering
that the primary use-cases for pg_dump involve target systems
that aren't 100.00% identical to the source.

So here's what I think we should actually do: make
--load-via-partition-root the default. We can provide a
switch to turn it off, for those who are brave or foolish
enough to risk that in the name of saving a few cycles,
but it ought to be the default.

Furthermore, I think we should make this happen in time for
next week's releases. I can write the patch easily enough,
but we need a consensus PDQ that this is what to do.

Anyone want to bikeshed on the spelling of the new switch?
I'm contemplating "--load-via-partition-leaf" or perhaps
"--no-load-via-partition-root".

regards, tom lane

[1]: /messages/by-id/765e5968-6c39-470f-95bf-7b14e6b9a1c0@app.fastmail.com

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#1)
Re: pg_dump versus hash partitioning

On Wed, Feb 1, 2023 at 11:18 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Over at [1] we have a complaint that dump-and-restore fails for
hash-partitioned tables if a partitioning column is an enum,
because the enum values are unlikely to receive the same OIDs
in the destination database as they had in the source, and the
hash codes are dependent on those OIDs.

It seems to me that this is the root of the problem. We can't expect
to hash on something that's not present in the dump file and have
anything work.

So here's what I think we should actually do: make
--load-via-partition-root the default. We can provide a
switch to turn it off, for those who are brave or foolish
enough to risk that in the name of saving a few cycles,
but it ought to be the default.

Furthermore, I think we should make this happen in time for
next week's releases. I can write the patch easily enough,
but we need a consensus PDQ that this is what to do.

This seems extremely precipitous to me and I'm against it. I like the
fact that we have --load-via-partition-root, but it is a bit of a
hack. You don't get a single copy into the partition root, you get one
per child table -- and those COPY statements are listed as data for
the partitions where the data lives now, not for the parent table. I
am not completely sure whether there is a scenario where that's an
issue, but it's certainly an oddity. Also, and I think pretty
significantly, using --load-via-partition-root forces you to pay the
overhead of rerouting every tuple to the target partition whether you
need it or not, which is potentially a large unnecessary expense. I
don't think we should just foist that kind of overhead onto everyone
in every situation for every data type because somebody had a problem
in a certain case.

And even if we do decide to do that at some point, I don't think it is
right at all to rush such a change out on a short time scale, with
little time to mull over consequences and alternative fixes. I think
that could easily hurt more people than it helps.

I think that not all of the cases that you list are of the same type.
Loading a dump under a different encoding or on a different endianness
are surely corner cases. They might come up for some people
occasionally, but they're not typical. In the case of endianness,
that's because little-endian has pretty much taken over the world; in
the case of encoding, that's because converting data between encodings
is a real pain, and combining that with a database dump and restore is
likely to be very little fun. It's hard to argue that collation
changes fall into the same category: we know that they get changed all
the time, often silently. But none of us database geeks think that's a
good thing: just that it's a thing that we have to deal with.

The enum case seems completely different to me. That's not the result
of trying to migrate your data to another architecture or of the glibc
maintainers not believing in sorting working the same way on Tuesday
that it did on Monday. That's the result of the PostgreSQL project
hashing data in a way that does not make any real sense for the
application at hand. Any hash function that we use for partitioning
has to work based on data that is preserved by the dump-and-restore
process. I would argue that the float case is not of the same kind:
yes, if you round your data off, then the values are going to hash
differently, but if you truncate your strings, those will hash
differently too. Duh. Intentionally changing the value is supposed to
change the hash code, that's kind of the point of hashing.

So I think we should be asking ourselves what we could do about the
enum case specifically, rather than resorting to a bazooka.

--
Robert Haas
EDB: http://www.enterprisedb.com

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#2)
Re: pg_dump versus hash partitioning

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Feb 1, 2023 at 11:18 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Over at [1] we have a complaint that dump-and-restore fails for
hash-partitioned tables if a partitioning column is an enum,
because the enum values are unlikely to receive the same OIDs
in the destination database as they had in the source, and the
hash codes are dependent on those OIDs.

It seems to me that this is the root of the problem.

Well, that was what I thought too to start with, but I now think that
it is far too narrow-minded a view of the problem. The real issue
is something I said that you trimmed:

In general, we've never thought that hash values are
required to be consistent across platforms.

hashenum() is doing something that is perfectly fine in the context
of hash indexes, which have traditionally been our standard of how
reproducible hash values need to be. Hash partitioning has moved
those goalposts, and if there was any discussion of the consequences
of that, I didn't see it.

Furthermore, I think we should make this happen in time for
next week's releases. I can write the patch easily enough,
but we need a consensus PDQ that this is what to do.

This seems extremely precipitous to me and I'm against it.

Yeah, it's been this way for awhile, so if there's debate then
we can let it go awhile longer.

So I think we should be asking ourselves what we could do about the
enum case specifically, rather than resorting to a bazooka.

My idea of solving this with a bazooka would be to deprecate hash
partitioning. Quite frankly that's where I think we will end up
someday, but I'm not going to try to make that argument today.

In the meantime, I think we need to recognize that hash values are
not very portable. I do not think we do our users a service by
letting them discover the corner cases the hard way.

I'd be willing to compromise on the intermediate idea of forcing
--load-via-partition-root just for hashed partitioning.

regards, tom lane

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#2)
Re: pg_dump versus hash partitioning

Robert Haas <robertmhaas@gmail.com> writes:

... I like the
fact that we have --load-via-partition-root, but it is a bit of a
hack. You don't get a single copy into the partition root, you get one
per child table -- and those COPY statements are listed as data for
the partitions where the data lives now, not for the parent table. I
am not completely sure whether there is a scenario where that's an
issue, but it's certainly an oddity.

I spent a bit more time thinking about that, and while I agree that
it's an oddity, I don't see that it matters in the case of hash
partitioning. You would notice an issue if you tried to do a selective
restore of just one partition --- but under what circumstance would
that be a useful thing to do? By definition, under hash partitioning
there is no user-meaningful difference between different partitions.
Moreover, in the case at hand you would get constraint failures without
--load-via-partition-root, or tuple routing failures with it,
so what's the difference? (Unless you'd created all the partitions
to start with and were doing a selective restore of just one partition's
data, in which case the outcome is "fails" or "works" respectively.)

Also, and I think pretty
significantly, using --load-via-partition-root forces you to pay the
overhead of rerouting every tuple to the target partition whether you
need it or not, which is potentially a large unnecessary expense.

Oddly, I always thought that we prioritize correctness over speed.
I don't mind having an option that allows people to select a less-safe
way of doing this, but I do think it's unwise for less-safe to be the
default, especially when it's something you can't fix after the fact.

What do you think of "--load-via-partition-root=on/off/auto", where
auto means "not with hash partitions" or the like? I'm still
uncomfortable about the collation aspect, but I'm willing to concede
that range partitioning is less likely to fail in this way than hash.

regards, tom lane

#5 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#3)
Re: pg_dump versus hash partitioning

On Wed, Feb 1, 2023 at 1:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, that was what I thought too to start with, but I now think that
it is far too narrow-minded a view of the problem. The real issue
is something I said that you trimmed:

In general, we've never thought that hash values are
required to be consistent across platforms.

hashenum() is doing something that is perfectly fine in the context
of hash indexes, which have traditionally been our standard of how
reproducible hash values need to be. Hash partitioning has moved
those goalposts, and if there was any discussion of the consequences
of that, I didn't see it.

Right, I don't think it was ever discussed, and I think that was a
mistake on my part.

In the meantime, I think we need to recognize that hash values are
not very portable. I do not think we do our users a service by
letting them discover the corner cases the hard way.

I think you're not really engaging with the argument that "not
completely portable" and "totally broken" are two different things,
and I still think that's an important point here. One idea that I had
is to add a flag somewhere to indicate whether a particular opclass or
opfamily is suitable for hash partitioning, or perhaps better, an
alternative to opcdefault that sets the default for partitioning,
which could be different from the default for indexing. Then we could
either prohibit this case, or make it work. Of course we would have to
define what "suitable for hash partitioning" means, but "would be
likely to survive a dump and reload on the same machine without any
software changes" is probably a reasonable minimum standard.

I don't think the fact that our *traditional* standard for how stable
a hash function needs to be has been XYZ carries any water. Needs
change over time, and we adapt the code to meet the new needs. Since
we have no system for type properties in PostgreSQL -- a design
decision I find questionable -- we tie all such properties to operator
classes. That's why, for example, we have HASHEXTENDED_PROC, even
though hash indexes don't need 64-bit hash values or a seed. We added
that for hash partitioning, and it's now used in other places too,
because 32-bits aren't enough for everything just because they're
enough for hash indexes, and seeds are handy. That's also why we have
BTINRANGE_PROC, which doesn't exist to support btree indexes, but
rather window frames. The fact that a certain feature requires us to
graft some additional stuff into the operator class/family mechanism,
or that it doesn't quite work with everything that's already part of
that mechanism, isn't an argument against the feature. That's just how
we do things around here. Indeed, if somebody, instead of implementing
hash partitioning by tying it into hash opfamilies, were to make up
some completely new hashing infrastructure that had exactly the
properties they wanted for partitioning, that would be *totally*
unacceptable and surely a reason for rejecting such a feature
outright. The fact that it tries to make use of the existing
infrastructure is a good thing about that feature, not a bad thing,
even though it is turning out that there are some problems.

On the question of whether hash partitioning is a good feature in
general, I can only say that I disagree with what seems to be your
position, which as best as I can tell is "it sucks and we should kill
it with fire". I do think that partitioning in general leaves a lot to
be desired in PostgreSQL in general, and perhaps the issues are even
worse for hash partitioning than they are elsewhere. However, I think
that the solution to that is for people to keep putting more work into
making it better, not to give up and say "ah, partitioning (or hash
partitioning specifically) is a stupid feature that nobody wants". To
think that, you have to be living in a bubble. It's unfortunate that
with all the work that so many people have put into this area we don't
have better results to show for it, but AFAICS there's no help for
that but to keep hacking. Amit Langote's work on locking before
pruning, for example, will be hugely impactful for some kinds of
queries if it gets committed, and it's been a long time coming, partly
because so many other problems needed to be sorted out first. But you
can't run the simplest workload with any kind of partitioning, range,
hash, whatever, and not run into that problem immediately.

--
Robert Haas
EDB: http://www.enterprisedb.com

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#5)
Re: pg_dump versus hash partitioning

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Feb 1, 2023 at 1:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

In the meantime, I think we need to recognize that hash values are
not very portable. I do not think we do our users a service by
letting them discover the corner cases the hard way.

I think you're not really engaging with the argument that "not
completely portable" and "totally broken" are two different things,
and I still think that's an important point here.

I don't buy that argument. From the user's perspective, it's broken
if her use-case fails. "It works for most people" is cold comfort,
most especially so if there's no convenient path to fixing it after
a failure.

I don't think the fact that our *traditional* standard for how stable
a hash function needs to be has been XYZ carries any water.

Well, it wouldn't need to if we had a practical way of changing the
behavior of an existing hash function, but guess what: we don't.
Andrew's original proposal for fixing this was exactly to change the
behavior of hashenum(). There were some problems with the idea of
depending on enumsortorder instead of enum OID, but the really
fundamental issue is that you can't change hashing behavior without
breaking pg_upgrade completely. Not only will your hash indexes be
corrupt, but your hash-partitioned tables will be broken, in exactly
the same way that we're trying to solve for dump/reload cases (which
of course will *also* be broken by redefining the hash function, if
you didn't use --load-via-partition-root). Moreover, while we can
always advise people to reindex, there's no similarly easy way to fix
broken partitioning.

That being the case, I don't think moving the goalposts for hash
function stability is going to lead to a workable solution.

On the question of whether hash partitioning is a good feature in
general, I can only say that I disagree with what seems to be your
position, which as best as I can tell is "it sucks and we should kill
it with fire".

As I said, I'm not prepared to litigate that case today ... but
I do have a sneaking suspicion that we will eventually reach that
conclusion. In any case, if we don't want to reach that conclusion,
we need some practical solution to these dump/reload problems.
Have you got a better idea than --load-via-partition-root?

regards, tom lane

#7 Peter Geoghegan
pg@bowt.ie

In reply to: Tom Lane (#4)
Re: pg_dump versus hash partitioning

On Wed, Feb 1, 2023 at 12:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Also, and I think pretty
significantly, using --load-via-partition-root forces you to pay the
overhead of rerouting every tuple to the target partition whether you
need it or not, which is potentially a large unnecessary expense.

Oddly, I always thought that we prioritize correctness over speed.
I don't mind having an option that allows people to select a less-safe
way of doing this, but I do think it's unwise for less-safe to be the
default, especially when it's something you can't fix after the fact.

+1

As you pointed out already, pg_dump's primary use-cases all involve
target systems that aren't identical to the source system. Anybody
using pg_dump is unlikely to be particularly concerned about
performance.

What do you think of "--load-via-partition-root=on/off/auto", where
auto means "not with hash partitions" or the like? I'm still
uncomfortable about the collation aspect, but I'm willing to concede
that range partitioning is less likely to fail in this way than hash.

Currently, pg_dump ignores collation versions entirely, except when
run by pg_upgrade. So pg_dump already understands that sometimes it's
important that the collation behavior be completely identical when the
database is restored, and sometimes it's desirable to produce a dump
with any available "logically equivalent" collation. This is about the
high level requirements, which makes sense to me.

ISTM that range partitioning is missing a good high level model that
builds on that. What's really needed is a fully worked out abstraction
that recognizes how collations can be equivalent for some purposes,
but not other purposes. The indirection between "logical and physical
collations" is underdeveloped. There isn't even an official name for
that idea.

--
Peter Geoghegan

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#4)
Re: pg_dump versus hash partitioning

On Wed, Feb 1, 2023 at 3:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I spent a bit more time thinking about that, and while I agree that
it's an oddity, I don't see that it matters in the case of hash
partitioning. You would notice an issue if you tried to do a selective
restore of just one partition --- but under what circumstance would
that be a useful thing to do? By definition, under hash partitioning
there is no user-meaningful difference between different partitions.
Moreover, in the case at hand you would get constraint failures without
--load-via-partition-root, or tuple routing failures with it,
so what's the difference? (Unless you'd created all the partitions
to start with and were doing a selective restore of just one partition's
data, in which case the outcome is "fails" or "works" respectively.)

I guess I was worried that pg_dump's dependency ordering stuff might
do something weird in some case that I'm not smart enough to think up.

Also, and I think pretty
significantly, using --load-via-partition-root forces you to pay the
overhead of rerouting every tuple to the target partition whether you
need it or not, which is potentially a large unnecessary expense.

Oddly, I always thought that we prioritize correctness over speed.

That's a bit rich.

It seems to me that the job of pg_dump is to produce a dump that, when
reloaded on another system, recreates the same database state. That
means that we end up with all of the same objects, each defined in the
same way, and that all of the tables end up with all the same contents
that they had before. Here, you'd like to argue that it's perfectly
fine if we instead insert some of the rows into different tables than
where they were on the original system. Under normal circumstances, of
course, we wouldn't consider any such thing, because then we would not
be faithfully replicating the database state, which would be
incorrect. But here you want to argue that it's OK to create a
different database state because trying to recreate the same one would
produce an error and the user might not like getting an error so let's
just do something else instead and not even bother telling them.

As you have quite rightly pointed out, the --load-via-partition-root
behavior is useful for working around a variety of unfortunate things
that can happen. If a user is willing to say that getting a row into
one partition of some table is just as good as getting it into another
partition of that same table and that you don't mind paying the cost
associated with that, then that is something that we can do for that
user. But just as we normally prioritize correctness over speed, so
also do we normally throw errors when things aren't right instead of
silently accepting bad input. The partitions in this scenario are
tables that have constraints. If a dump contains a row that doesn't
satisfy some constraint on the table into which it is being loaded,
that's an error. Keep in mind that there's no rule that a user can't
query a partition directly.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9 Peter Geoghegan
pg@bowt.ie

In reply to: Robert Haas (#8)
Re: pg_dump versus hash partitioning

On Wed, Feb 1, 2023 at 1:14 PM Robert Haas <robertmhaas@gmail.com> wrote:

It seems to me that the job of pg_dump is to produce a dump that, when
reloaded on another system, recreates the same database state. That
means that we end up with all of the same objects, each defined in the
same way, and that all of the tables end up with all the same contents
that they had before. Here, you'd like to argue that it's perfectly
fine if we instead insert some of the rows into different tables than
where they were on the original system. Under normal circumstances, of
course, we wouldn't consider any such thing, because then we would not
be faithfully replicating the database state, which would be
incorrect. But here you want to argue that it's OK to create a
different database state because trying to recreate the same one would
produce an error and the user might not like getting an error so let's
just do something else instead and not even bother telling them.

This is a misrepresentation of Tom's words. It isn't actually
self-evident what "we end up with all of the same objects, each
defined in the same way, and that all of the tables end up with all
the same contents that they had before" actually means here, in
general. Tom's main concern seems to be just that -- the ambiguity
itself.

If there was a fully worked out idea of what that would mean, then I
suspect it would be quite subtle and complicated -- it's an inherently
tricky area. You seem to be saying that the way that this stuff
currently works is correct by definition, except when it isn't.

--
Peter Geoghegan

#10 Peter Geoghegan
pg@bowt.ie

In reply to: Robert Haas (#5)
Re: pg_dump versus hash partitioning

On Wed, Feb 1, 2023 at 12:39 PM Robert Haas <robertmhaas@gmail.com> wrote:

I don't think the fact that our *traditional* standard for how stable
a hash function needs to be has been XYZ carries any water. Needs
change over time, and we adapt the code to meet the new needs. Since
we have no system for type properties in PostgreSQL -- a design
decision I find questionable -- we tie all such properties to operator
classes.

Are you familiar with B-Tree opclass support function 4, equalimage?
It's used to determine whether a B-Tree index can use deduplication at
CREATE INDEX time. ISTM that the requirements are rather similar here
-- perhaps even identical.

See: https://www.postgresql.org/docs/devel/btree-support-funcs.html

--
Peter Geoghegan

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: pg_dump versus hash partitioning

On Wed, Feb 1, 2023 at 4:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I don't think the fact that our *traditional* standard for how stable
a hash function needs to be has been XYZ carries any water.

Well, it wouldn't need to if we had a practical way of changing the
behavior of an existing hash function, but guess what: we don't.
Andrew's original proposal for fixing this was exactly to change the
behavior of hashenum(). There were some problems with the idea of
depending on enumsortorder instead of enum OID, but the really
fundamental issue is that you can't change hashing behavior without
breaking pg_upgrade completely. Not only will your hash indexes be
corrupt, but your hash-partitioned tables will be broken, in exactly
the same way that we're trying to solve for dump/reload cases (which
of course will *also* be broken by redefining the hash function, if
you didn't use --load-via-partition-root). Moreover, while we can
always advise people to reindex, there's no similarly easy way to fix
broken partitioning.

That being the case, I don't think moving the goalposts for hash
function stability is going to lead to a workable solution.

I don't see that there is any easy, clean way to solve this in
released branches. The idea that I proposed could be implemented in
master, and I think it is the right kind of fix, but it is not
back-patchable. However, I think your argument rests on the premise
that making --load-via-partition-root the default behavior in some or
all cases will not break anything for anyone, and I'm skeptical. I
think that's a significant behavior change and that some people will
notice, and some will find it an improvement while others will find it
worse than the current behavior. I also think that there must be a lot
more people using partitioning in general, and even hash partitioning
specifically, than there are people using hash partitioning on an enum
column.

Personally, I would rather disallow this case in the back-branches --
i.e. make pg_dump barf if it is encountered and block CREATE TABLE
from setting up any new situations of this type -- than foist
--load-via-partition-root on many people who aren't affected by the
issue. I'm not saying that's a great answer, but we have to pick from
the choices we have.

I also don't accept that if someone has hit this issue they are just
hosed and there's no way out. Yeah, it's not a lot of fun: you
probably have to use "CREATE TABLE unpartitioned AS SELECT * FROM
borked; DROP TABLE borked;" or so to rescue your data. But what would
we do if we discovered that the btree opclass sorts 1 before 0, or
something? Surely we wouldn't refuse to fix the opclass just because
some users have existing indexes on disk that would be out of order
with the new opclass definition. We'd just change it and people would
have to deal. People with indexes would need to reindex. People with
partitioning boundaries between 0 and 1 would need to repartition.
This case isn't the same because hashenum() isn't broken in general,
just for this particular purpose. But I think you're trying to argue
that we should fix this by changing something other than the thing
that is broken, and I don't agree with that.

--
Robert Haas
EDB: http://www.enterprisedb.com

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#8)
Re: pg_dump versus hash partitioning

Robert Haas <robertmhaas@gmail.com> writes:

It seems to me that the job of pg_dump is to produce a dump that, when
reloaded on another system, recreates the same database state. That
means that we end up with all of the same objects, each defined in the
same way, and that all of the tables end up with all the same contents
that they had before.

No, its job is to produce *logically* the same state. We don't expect
it to produce, say, the same CTID value that each row had before ---
if you want that, you use pg_upgrade or physical replication, not
pg_dump. There are fairly strong constraints on how identical we
can make the results given that we're not replicating physical state.
And that's not even touching the point that frequently what the user
is after is not an identical copy anyway.

Here, you'd like to argue that it's perfectly
fine if we instead insert some of the rows into different tables than
where they were on the original system.

I can agree with that argument for range or list partitioning, where
the partitions have some semantic meaning to the user. I don't buy it
for hash partitioning. It was an implementation artifact to begin
with that a given row ended up in partition 3 not partition 11, so why
would users care which partition it ends up in after a dump/reload?
If they think there is a difference between the partitions, they need
education.

... But just as we normally prioritize correctness over speed, so
also do we normally throw errors when things aren't right instead of
silently accepting bad input. The partitions in this scenario are
tables that have constraints.

Again, for hash partitioning those constraints are implementation
artifacts not something that users should have any concern with.

Keep in mind that there's no rule that a user can't
query a partition directly.

Sure, and in the case of a hash partition, what he is going to
get is an implementation-dependent subset of the rows. I use
"implementation-dependent" here in the same sense that the SQL
standard does, which is "there is a rule but the implementation
doesn't have to tell you what it is". In particular, we are not
bound to make the subset be the same on different installations.
We already didn't promise that, because of issues like endianness.

regards, tom lane

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#9)
Re: pg_dump versus hash partitioning

On Wed, Feb 1, 2023 at 4:44 PM Peter Geoghegan <pg@bowt.ie> wrote:

This is a misrepresentation of Tom's words. It isn't actually
self-evident what "we end up with all of the same objects, each
defined in the same way, and that all of the tables end up with all
the same contents that they had before" actually means here, in
general. Tom's main concern seems to be just that -- the ambiguity
itself.

As far as I can see, Tom didn't admit that there was any ambiguity. He
just said that I was advocating for wrong behavior for the sake of
performance. I don't think that is what I was doing. I also don't
really think Tom thinks that that is what I was doing. But it is what
he said I was doing.

I do agree with you that the ambiguity is the root of the issue. I
mean, if we can't put the rows back into the same partitions where
they were before, does the user care about that, or do they only care
that the rows end up in some partition of the toplevel partitioned
table? I think that there's no single definition of correctness that
is the only defensible one here, and we can't know which one the user
wants a priori. I also think that they can care which one they're
getting, and thus I think that changing the default in a minor release
is a bad plan. Tom, as I understand it, is arguing that the
--load-via-partition-root behavior has negligible downsides and is
almost categorically better than the current default behavior, and
thus making that the new default in some or all situations in a minor
release is totally fine.

--
Robert Haas
EDB: http://www.enterprisedb.com

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
Re: pg_dump versus hash partitioning

Robert Haas <robertmhaas@gmail.com> writes:

Tom, as I understand it, is arguing that the
--load-via-partition-root behavior has negligible downsides and is
almost categorically better than the current default behavior, and
thus making that the new default in some or all situations in a minor
release is totally fine.

I think it's categorically better than a failed restore. I wouldn't
be proposing this if there were no such problem; but there is,
and I don't buy your apparent position that we should leave affected
users to cope as best they can. Yes, it's clearly only a minority
of users that are affected, else we'd have heard complaints before.
But it could be absolutely catastrophic for an affected user,
if they're trying to restore their only backup. I'd rather impose
an across-the-board cost on all users of hash partitioning than
risk such outcomes for a few.

Also, you've really offered no evidence for your apparent position
that --load-via-partition-root has unacceptable overhead. We've
done enough work on partition routing over the last few years that
whatever measurements might've originally justified that idea
don't necessarily apply anymore. Admittedly, I've not measured
it either. But we don't tell people to avoid partitioning because
INSERT is unduly expensive. Partition routing is just the cost of
doing business in that space.

regards, tom lane
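The behavioral difference between the default leaf-table restore and --load-via-partition-root can be sketched with a toy routing model. The `dest_route()` function, the modulus, and the sample keys here are invented for illustration; PostgreSQL's real tuple routing is considerably more involved.

```python
# A minimal model of the two restore strategies. dest_route() stands in
# for the destination's own tuple-routing logic; the partition numbers
# recorded in the dump come from the source system and may not match.

def dest_route(key: int, modulus: int) -> int:
    return key % modulus  # stand-in for the destination's hash routing

dump = [(3, 101), (1, 205), (2, 302)]  # (source partition, key) pairs

# Default restore targets the leaf tables directly: it fails whenever
# the source's partition choice violates the destination's constraint.
failures = [key for part, key in dump if dest_route(key, 4) != part]

# --load-via-partition-root: ignore the recorded partition and let the
# destination re-route every row, so each insert succeeds by
# construction -- at the cost of routing every row on the way in.
rerouted = [(dest_route(key, 4), key) for _, key in dump]
```

The trade-off Tom and Robert are debating is visible in the last two lines: `failures` is the failed-restore scenario, `rerouted` is the across-the-board routing cost paid to avoid it.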

#15Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#13)
Re: pg_dump versus hash partitioning

On Wed, Feb 1, 2023 at 2:12 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Feb 1, 2023 at 4:44 PM Peter Geoghegan <pg@bowt.ie> wrote:

This is a misrepresentation of Tom's words. It isn't actually
self-evident what "we end up with all of the same objects, each
defined in the same way, and that all of the tables end up with all
the same contents that they had before" actually means here, in
general. Tom's main concern seems to be just that -- the ambiguity
itself.

As far as I can see, Tom didn't admit that there was any ambiguity.

Tom said:

"I spent a bit more time thinking about that, and while I agree that
it's an oddity, I don't see that it matters in the case of hash
partitioning. You would notice an issue if you tried to do a
selective restore of just one partition --- but under what
circumstance would that be a useful thing to do?"

While the word ambiguity may not have actually been used, Tom very
clearly admitted some ambiguity. But even if he didn't, so what? It's
perfectly obvious that that's the major underlying issue, and that
this is a high level problem rather than a low level problem.

He just said that I was advocating for wrong behavior for the sake of
performance. I don't think that is what I was doing. I also don't
really think Tom thinks that that is what I was doing. But it is what
he said I was doing.

And I think that you're making a mountain out of a molehill.

I do agree with you that the ambiguity is the root of the issue. I
mean, if we can't put the rows back into the same partitions where
they were before, does the user care about that, or do they only care
that the rows end up in some partition of the toplevel partitioned
table?

That was precisely the question that Tom posed to you, in the same
email as the one that you found objectionable.

Tom, as I understand it, is arguing that the
--load-via-partition-root behavior has negligible downsides and is
almost categorically better than the current default behavior, and
thus making that the new default in some or all situations in a minor
release is totally fine.

I don't know why you seem to think that it was such an absolutist
position as that.

You mentioned "minor releases" here. Who said anything about that?

--
Peter Geoghegan

#16Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#12)
Re: pg_dump versus hash partitioning

On Wed, Feb 1, 2023 at 5:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Here, you'd like to argue that it's perfectly
fine if we instead insert some of the rows into different tables than
where they were on the original system.

I can agree with that argument for range or list partitioning, where
the partitions have some semantic meaning to the user. I don't buy it
for hash partitioning. It was an implementation artifact to begin
with that a given row ended up in partition 3 not partition 11, so why
would users care which partition it ends up in after a dump/reload?
If they think there is a difference between the partitions, they need
education.

I see your point. I think it's somewhat valid. However, I also think
it muddies the definition of what pg_dump is allowed to do in a way
that I do not like. I think there's a difference between the CTID or
XMAX of a tuple changing and it ending up in a totally different
partition. It feels like it makes the definition of correctness
subjective: we do think that people care about range and list
partitions as individual entities, so we'll put the rows back where
they were and complain if we can't, but we don't think they think
about hash partitions that way, so we will err on the side of making
the dump restoration succeed. That's a level of guessing what the user
meant that I think is uncomfortable. I feel like when somebody around
here discovers that sort of reasoning in some other software's code,
or in a proposed patch, it pretty often gets blasted on this mailing
list with considerable vigor.

I think you can construct plausible cases where it's not just
academic. For instance, suppose I intend to use some kind of logical
replication system, not necessarily the one built into PostgreSQL, to
replicate data between two systems. Before engaging that system, I
need to make the initial database contents match. The system is
oblivious to partitioning, and just replicates each table to a table
with a matching name. Well, if the partitions don't actually match up
1:1, I kind of need to know about that. In this use case, the rows
silently moving around doesn't meet my needs. Or, suppose I dump and
restore two databases. It works perfectly. I then run a comparison
tool of some sort that compares the two databases. EDB has such a
tool! I don't know whether it would perform the comparison via the
partition root or not, because I don't know how it works. But I find
it pretty plausible that some such tool would show differences between
the source and target databases. Now, if I had done the data migration
using --load-via-partition-root, I would expect that. I might even be
looking for it, to see what got moved around. But otherwise, it might
be unexpected.

Another subtle problem with this whole situation is: suppose that on
host A, I set up a table hash-partitioned by an enum column and make a
bunch of hash partitions. Then, on host B, I set up the same table
with a bunch of foreign table partitions, each corresponding to the
matching partition on the other node. I guess that just doesn't work,
whereas if the column were of any other data type, it would work. If
it were a collatable data type, it would need the collations and
collation definitions to match, too.

I know these things are subtle, and maybe these specific things will
never happen to anyone, or nobody will care. I don't know. I just have
a really hard time accepting a categorical statement that this
just can't ever matter to anyone.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#15)
Re: pg_dump versus hash partitioning

Peter Geoghegan <pg@bowt.ie> writes:

You mentioned "minor releases" here. Who said anything about that?

I did: I'd like to back-patch the fix if possible. I think changing
the default --load-via-partition-root choice could be back-patchable.

If Robert is resistant to that but would accept it in master,
I'd settle for that in preference to having no fix.

regards, tom lane

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#16)
Re: pg_dump versus hash partitioning

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Feb 1, 2023 at 5:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I can agree with that argument for range or list partitioning, where
the partitions have some semantic meaning to the user. I don't buy it
for hash partitioning. It was an implementation artifact to begin
with that a given row ended up in partition 3 not partition 11, so why
would users care which partition it ends up in after a dump/reload?
If they think there is a difference between the partitions, they need
education.

I see your point. I think it's somewhat valid. However, I also think
it muddies the definition of what pg_dump is allowed to do in a way
that I do not like. I think there's a difference between the CTID or
XMAX of a tuple changing and it ending up in a totally different
partition. It feels like it makes the definition of correctness
subjective: we do think that people care about range and list
partitions as individual entities, so we'll put the rows back where
they were and complain if we can't, but we don't think they think
about hash partitions that way, so we will err on the side of making
the dump restoration succeed. That's a level of guessing what the user
meant that I think is uncomfortable.

I see your point too, and to me it's evidence for the position that
we should never have done hash partitioning in the first place.
It's precisely because you want to analyze it in the same terms
as range/list partitioning that we have these issues. Or we could
have built it on some other infrastructure than hash index opclasses
... but we didn't do that, and now we have a mess. I don't see a way
out other than relaxing the guarantees about how hash partitioning
works compared to the other kinds.

regards, tom lane

#19David G. Johnston
david.g.johnston@gmail.com
In reply to: Tom Lane (#17)
Re: pg_dump versus hash partitioning

On Wed, Feb 1, 2023 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@bowt.ie> writes:

You mentioned "minor releases" here. Who said anything about that?

I did: I'd like to back-patch the fix if possible. I think changing
the default --load-via-partition-root choice could be back-patchable.

If Robert is resistant to that but would accept it in master,
I'd settle for that in preference to having no fix.

I'm for accepting --load-via-partition-root as the solution to this problem.
I think it is better than doing nothing and, at the moment, I don't see any
alternatives to pick from.

As evidenced from the current influx of collation problems related to
indexes, we would be foolish to discount the collation issues with just
plain text, so limiting this only to the enum case (which is a must-have)
doesn't seem wise.

pg_dump should be conservative in what it produces - which in this
situation means having minimal environmental dependencies and internal
volatility.

In the interest of compatibility, having an escape hatch to do things as
they are done today is something we should provide. We got this one wrong
and that is going to cause some pain. Though at least with the escape
hatch we shouldn't be dealing with as many unresolvable complaints as when
we back-patched removing the public schema from search_path.

In the worst case, being conservative, the user can always at minimum
restore their dump file into a local database and get access to their data
in a usable format with minimal hassle. The few that would like "same
table name or bust" behavior because of external or application-specific
requirements can still get that behavior.

David J.

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#11)
Re: pg_dump versus hash partitioning

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Feb 1, 2023 at 4:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That being the case, I don't think moving the goalposts for hash
function stability is going to lead to a workable solution.

I don't see that there is any easy, clean way to solve this in
released branches. The idea that I proposed could be implemented in
master, and I think it is the right kind of fix, but it is not
back-patchable.

You waved your arms about inventing some new hashing infrastructure,
but it was phrased in such a way that it wasn't clear to me if that
was actually a serious proposal or not. But if it is: how will you
get around the fact that any change to hashing behavior will break
pg_upgrade of existing hash-partitioned tables? New infrastructure
avails nothing if it has to be bug-compatible with the old. So I'm
not sure how restricting the fix to master helps us.

regards, tom lane
