Something is wrong with wal_compression

Started by Tom Lane · about 3 years ago · 29 messages · hackers
#1 Tom Lane
tgl@sss.pgh.pa.us

The symptom being exhibited by Michael's new BF animal tanager
is perfectly reproducible elsewhere.

$ cat /home/postgres/tmp/temp_config
#default_toast_compression = lz4
wal_compression = lz4
$ export TEMP_CONFIG=/home/postgres/tmp/temp_config
$ cd ~/pgsql/src/test/recovery
$ make check PROVE_TESTS=t/011_crash_recovery.pl
...
+++ tap check in src/test/recovery +++
t/011_crash_recovery.pl .. 1/? 
#   Failed test 'new xid after restart is greater'
#   at t/011_crash_recovery.pl line 53.
#     '729'
#         >
#     '729'

# Failed test 'xid is aborted after crash'
# at t/011_crash_recovery.pl line 57.
# got: 'committed'
# expected: 'aborted'
# Looks like you failed 2 tests of 3.

Maybe this is somehow the test script's fault, but I don't see how.

It fails the same way with 'wal_compression = pglz', so I think it's
generic to that whole feature rather than specific to LZ4.

regards, tom lane

#2 Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#1)
Re: Something is wrong with wal_compression

On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:

The symptom being exhibited by Michael's new BF animal tanager
is perfectly reproducible elsewhere.

I think these tests have always failed with wal_compression?

/messages/by-id/20210308.173242.463790587797836129.horikyota.ntt@gmail.com
/messages/by-id/20210313012820.GJ29463@telsasoft.com
/messages/by-id/20220222231948.GJ9008@telsasoft.com

/messages/by-id/YNqWd2GSMrnqWIfx@paquier.xyz
|My buildfarm machine has been changed to use wal_compression = lz4,
|while on it for HEAD runs.

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#2)
Re: Something is wrong with wal_compression

Justin Pryzby <pryzby@telsasoft.com> writes:

On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:

The symptom being exhibited by Michael's new BF animal tanager
is perfectly reproducible elsewhere.

I think these tests have always failed with wal_compression?

If that's a known problem, and we've done nothing about it,
that is pretty horrid. That test case is demonstrating fundamental
database corruption after a crash.

regards, tom lane

#4 Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#2)
Re: Something is wrong with wal_compression

On Thu, Jan 26, 2023 at 02:08:27PM -0600, Justin Pryzby wrote:

On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:

The symptom being exhibited by Michael's new BF animal tanager
is perfectly reproducible elsewhere.

I think these tests have always failed with wal_compression?

/messages/by-id/20210308.173242.463790587797836129.horikyota.ntt@gmail.com
/messages/by-id/20210313012820.GJ29463@telsasoft.com
/messages/by-id/20220222231948.GJ9008@telsasoft.com

+ /messages/by-id/c86ce84f-dd38-9951-102f-13a931210f52@dunslane.net

#5 Andrey Borodin
amborodin@acm.org
In reply to: Tom Lane (#3)
Re: Something is wrong with wal_compression

On Thu, Jan 26, 2023 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That test case is demonstrating fundamental
database corruption after a crash.

Not exactly corruption. The XID was not persisted and the buffer data did
not hit disk. The database is in a correct state.

It was discussed long before WAL compression here [0]. The thing is, it
is easier to reproduce with compression, but compression has nothing
to do with it, as far as I understand.

The proposed fix is here [1], but I think it's better to fix the test. It
should not verify the XID, but rather the side effects of "CREATE TABLE
mine(x integer);".

Best regards, Andrey Borodin.

[0]: /messages/by-id/565FB155-C6B0-41E2-8C44-7B514DC25132@yandex-team.ru
[1]: /messages/by-id/20210313012820.GJ29463@telsasoft.com

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrey Borodin (#5)
Re: Something is wrong with wal_compression

Andrey Borodin <amborodin86@gmail.com> writes:

On Thu, Jan 26, 2023 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That test case is demonstrating fundamental
database corruption after a crash.

Not exactly corruption. The XID was not persisted and the buffer data did
not hit disk. The database is in a correct state.

Really? I don't see how this part is even a little bit okay:

[00:40:50.744](0.046s) not ok 3 - xid is aborted after crash
[00:40:50.745](0.001s)
[00:40:50.745](0.000s) # Failed test 'xid is aborted after crash'
# at t/011_crash_recovery.pl line 57.
[00:40:50.746](0.001s) # got: 'committed'
# expected: 'aborted'

If any tuples made by that transaction had reached disk,
we'd have a problem.

regards, tom lane

#7 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#6)
Re: Something is wrong with wal_compression

On Fri, Jan 27, 2023 at 11:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrey Borodin <amborodin86@gmail.com> writes:

On Thu, Jan 26, 2023 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That test case is demonstrating fundamental
database corruption after a crash.

Not exactly corruption. The XID was not persisted and the buffer data did
not hit disk. The database is in a correct state.

Really? I don't see how this part is even a little bit okay:

[00:40:50.744](0.046s) not ok 3 - xid is aborted after crash
[00:40:50.745](0.001s)
[00:40:50.745](0.000s) # Failed test 'xid is aborted after crash'
# at t/011_crash_recovery.pl line 57.
[00:40:50.746](0.001s) # got: 'committed'
# expected: 'aborted'

If any tuples made by that transaction had reached disk,
we'd have a problem.

The problem is that the WAL wasn't flushed, allowing the same xid to
be allocated again after crash recovery. But for any data pages to
hit the disk, we'd have to flush WAL first, so then it couldn't
happen, no? FWIW I also re-complained about the dangers of anyone
relying on pg_xact_status() for its stated purpose after seeing
tanager's failure[1].

[1]: /messages/by-id/CA+hUKGJ9p2JPPMA4eYAKq=r9d_4_8vziet_tS1LEBbiny5-ypA@mail.gmail.com
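
The failure mechanism described here can be illustrated with a toy model (plain Python, not PostgreSQL source; the xid 729 is taken from the test output above, and all class and method names are illustrative):

```python
# Toy model of why pg_xact_status() can lie after a crash: an xid is durable
# only once the WAL recording it has been flushed, so an unflushed xid can be
# handed out again during crash recovery.

class ToyCluster:
    def __init__(self):
        self.next_xid = 729          # next xid to hand out (from the failing test)
        self.flushed_next_xid = 729  # xid counter recovered from flushed WAL
        self.clog = {}               # xid -> 'committed'

    def pg_current_xact_id(self):
        xid = self.next_xid          # allocate an xid without flushing WAL
        self.next_xid += 1
        return xid

    def commit(self, xid):
        # COMMIT flushes WAL, making everything up to this point durable.
        self.clog[xid] = 'committed'
        self.flushed_next_xid = self.next_xid

    def crash_and_recover(self):
        # Unflushed WAL is lost: recovery resumes from the flushed counter.
        self.next_xid = self.flushed_next_xid

    def pg_xact_status(self, xid):
        if xid in self.clog:
            return self.clog[xid]
        return 'aborted' if xid < self.next_xid else 'in progress'

cluster = ToyCluster()
lost_xid = cluster.pg_current_xact_id()    # 729, never reaches disk
cluster.crash_and_recover()
reused_xid = cluster.pg_current_xact_id()  # 729 again!
cluster.commit(reused_xid)
print(cluster.pg_xact_status(lost_xid))    # 'committed', for the wrong transaction
```

This reproduces the test output: after recovery the "new" xid is not greater than the old one, and the crashed transaction's xid reports committed.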

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#7)
Re: Something is wrong with wal_compression

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Jan 27, 2023 at 11:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

If any tuples made by that transaction had reached disk,
we'd have a problem.

The problem is that the WAL wasn't flushed, allowing the same xid to
be allocated again after crash recovery. But for any data pages to
hit the disk, we'd have to flush WAL first, so then it couldn't
happen, no?

Ah, now I get the point: the "committed xact" seen after restart
isn't the same one as we saw before the crash, but a new one that
was given the same XID because nothing about the old one had made
it to disk yet.

FWIW I also re-complained about the dangers of anyone
relying on pg_xact_status() for its stated purpose after seeing
tanager's failure[1].

Indeed, it seems like this behavior makes pg_xact_status() basically
useless as things stand.

regards, tom lane

#9 Andrey Borodin
amborodin@acm.org
In reply to: Tom Lane (#8)
Re: Something is wrong with wal_compression

On Thu, Jan 26, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Indeed, it seems like this behavior makes pg_xact_status() basically
useless as things stand.

If we agree that xid allocation is not something persistent, let's fix
the test? We can replace a check with select * from pg_class or,
maybe, add an amcheck run.
As far as I recollect, this test was introduced to test this new
function in 857ee8e391f.

Best regards, Andrey Borodin.

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrey Borodin (#9)
Re: Something is wrong with wal_compression

Andrey Borodin <amborodin86@gmail.com> writes:

On Thu, Jan 26, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Indeed, it seems like this behavior makes pg_xact_status() basically
useless as things stand.

If we agree that xid allocation is not something persistent, let's fix
the test?

If we're not going to fix this behavior, we need to fix the docs
to disclaim that pg_xact_status() is of use for what it's said
to be good for.

regards, tom lane

#11 Michael Paquier
michael@paquier.xyz
In reply to: Andrey Borodin (#9)
Re: Something is wrong with wal_compression

On Thu, Jan 26, 2023 at 04:14:57PM -0800, Andrey Borodin wrote:

If we agree that xid allocation is not something persistent, let's fix
the test? We can replace a check with select * from pg_class or,
maybe, add an amcheck run.
As far as I recollect, this test was introduced to test this new
function in 857ee8e391f.

My opinion would be to make this function more reliable, FWIW, even if
that involves a performance impact when called in a tight loop by
forcing more WAL flushes to ensure the durability and consistency of
its reports. As things stand, this is basically unreliable, and we
document it as something applications can *use*. Adding a note in the
docs to say that this function can be unstable in some edge cases
does not make much sense to me, either. Commit 857ee8e itself says
that we can use it if a database connection is lost, which could
happen on a crash.
--
Michael

#12 Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#11)
Re: Something is wrong with wal_compression

On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Jan 26, 2023 at 04:14:57PM -0800, Andrey Borodin wrote:

If we agree that xid allocation is not something persistent, let's fix
the test? We can replace a check with select * from pg_class or,
maybe, add an amcheck run.
As far as I recollect, this test was introduced to test this new
function in 857ee8e391f.

My opinion would be to make this function more reliable, FWIW, even if
that involves a performance impact when called in a close loop by
forcing more WAL flushes to ensure its report durability and
consistency. As things stand, this is basically unreliable, and we
document it as something applications can *use*. Adding a note in the
docs to say that this function can be unstable for some edge cases
does not make much sense to me, either. Commit 857ee8e itself says
that we can use it if a database connection is lost, which could
happen on a crash..

Yeah, the other thread has a patch for that. But it would hurt some
workloads. A better patch would do some kind of amortisation
(reserving N xids at a time or some such scheme, while being careful
to make sure the right CLOG pages etc exist if you crash and skip a
bunch of xids on recovery) but be more complicated. For the record,
back before release 13 added the 64 bit xid allocator, these functions
(or rather their txid_XXX ancestors) were broken in a different way:
they didn't track epochs reliably, the discovery of which led to the
new xid8-based functions, so that might provide a natural
back-patching range, if a back-patchable solution can be agreed on.
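
The amortisation idea can be sketched as a toy model (an illustration under stated assumptions, not a patch; the batch size and starting xid are made up):

```python
# Sketch of amortised xid reservation: one WAL flush durably reserves a batch
# of N xids, so allocation pays roughly 1/N of the flush cost. After a crash,
# recovery skips to the end of the last reserved batch, so no reserved xid can
# ever be handed out twice.

class BatchedXidAllocator:
    def __init__(self, batch=64):
        self.batch = batch
        self.next_xid = 1000       # illustrative starting xid
        self.reserved_upto = 1000  # xids below this are durably reserved in WAL

    def allocate(self, n):
        """Allocate n xids; return how many WAL flushes were needed."""
        flushes = 0
        for _ in range(n):
            if self.next_xid >= self.reserved_upto:
                # One flush reserves the next batch (and would have to make
                # sure the matching CLOG pages exist, per the message above).
                self.reserved_upto = self.next_xid + self.batch
                flushes += 1
            self.next_xid += 1
        return flushes

alloc = BatchedXidAllocator(batch=64)
print(alloc.allocate(1000))  # 16 flushes instead of 1000
```

The cost per allocation drops by roughly the batch size, at the price of "burning" up to one batch of xids on every crash.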

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#12)
Re: Something is wrong with wal_compression

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier <michael@paquier.xyz> wrote:

My opinion would be to make this function more reliable, FWIW, even if
that involves a performance impact when called in a close loop by
forcing more WAL flushes to ensure its report durability and
consistency.

Yeah, the other thread has a patch for that. But it would hurt some
workloads.

I think we need to get the thing correct first and worry about
performance later. What's wrong with simply making pg_xact_status
write and flush a record of the XID's existence before returning it?
Yeah, it will cost you if you use that function, but not if you don't.
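
Continuing the toy model above (again plain Python, not PostgreSQL code), the simple flush-before-return approach would behave like this:

```python
# Sketch of the simple fix: durably record an xid's existence (one WAL flush)
# before handing it out, so crash recovery can never reassign it.

class FlushingCluster:
    def __init__(self):
        self.next_xid = 729
        self.flushed_next_xid = 729  # xid counter recovered from flushed WAL
        self.clog = {}               # xid -> 'committed'

    def pg_current_xact_id(self):
        xid = self.next_xid
        self.next_xid += 1
        self.flushed_next_xid = self.next_xid  # WAL flushed before returning
        return xid

    def crash_and_recover(self):
        self.next_xid = self.flushed_next_xid

    def pg_xact_status(self, xid):
        if xid in self.clog:
            return self.clog[xid]
        return 'aborted' if xid < self.next_xid else 'in progress'

c = FlushingCluster()
lost_xid = c.pg_current_xact_id()  # 729, allocation itself is durable
c.crash_and_recover()
new_xid = c.pg_current_xact_id()   # 730: the crashed xid is not reused
print(c.pg_xact_status(lost_xid))  # 'aborted', as the test expects
```

The extra flush happens once per allocated xid, which is the performance concern debated in the rest of the thread.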

A better patch would do some kind of amortisation
(reserving N xids at a time or some such scheme, while being careful
to make sure the right CLOG pages etc exist if you crash and skip a
bunch of xids on recovery) but be more complicated.

Maybe that would be appropriate for HEAD, but I'd be wary of adding
anything complicated to the back branches. This is clearly a very
under-tested area.

regards, tom lane

#14 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#13)
Re: Something is wrong with wal_compression

On Fri, Jan 27, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier <michael@paquier.xyz> wrote:

My opinion would be to make this function more reliable, FWIW, even if
that involves a performance impact when called in a close loop by
forcing more WAL flushes to ensure its report durability and
consistency.

Yeah, the other thread has a patch for that. But it would hurt some
workloads.

I think we need to get the thing correct first and worry about
performance later. What's wrong with simply making pg_xact_status
write and flush a record of the XID's existence before returning it?
Yeah, it will cost you if you use that function, but not if you don't.

It would be pg_current_xact_id() that would have to pay the cost of
the WAL flush, not pg_xact_status() itself, but yeah that's what the
patch does (with some optimisations). I guess one question is whether
there are any other reasonable real world uses of
pg_current_xact_id(), other than the original goal[1]. If not, then
at least you are penalising the right users, even though they probably
only actually call pg_xact_status() in extremely rare circumstances
(if COMMIT hangs up). But I wouldn't be surprised if people have
found other reasons to be interested in xid observability, related to
distributed transactions and snapshots and suchlike. There is no
doubt that the current situation is unacceptable, though, so maybe we
really should just do it and make a faster one later. Anyone else
want to vote on this?

[1]: /messages/by-id/CAMsr+YHQiWNEi0daCTboS40T+V5s_+dst3PYv_8v2wNVH+Xx4g@mail.gmail.com

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#14)
Re: Something is wrong with wal_compression

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Jan 27, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think we need to get the thing correct first and worry about
performance later. What's wrong with simply making pg_xact_status
write and flush a record of the XID's existence before returning it?
Yeah, it will cost you if you use that function, but not if you don't.

It would be pg_current_xact_id() that would have to pay the cost of
the WAL flush, not pg_xact_status() itself,

Right, typo on my part.

regards, tom lane

#16 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Thomas Munro (#14)
Re: Something is wrong with wal_compression

On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote:

On Fri, Jan 27, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier <michael@paquier.xyz> wrote:

My opinion would be to make this function more reliable, FWIW, even if
that involves a performance impact when called in a close loop by
forcing more WAL flushes to ensure its report durability and
consistency.

Yeah, the other thread has a patch for that.  But it would hurt some
workloads.

I think we need to get the thing correct first and worry about
performance later.  What's wrong with simply making pg_xact_status
write and flush a record of the XID's existence before returning it?
Yeah, it will cost you if you use that function, but not if you don't.

There is no
doubt that the current situation is unacceptable, though, so maybe we
really should just do it and make a faster one later.  Anyone else
want to vote on this?

I wasn't aware of the existence of pg_xact_status, so I suspect that it
is not a widely known and used feature. After reading the documentation,
I'd say that anybody who uses it will want it to give a reliable answer.
So I'd agree that it is better to make it more expensive, but live up to
its promise.

Yours,
Laurenz Albe

#17 Michael Paquier
michael@paquier.xyz
In reply to: Laurenz Albe (#16)
Re: Something is wrong with wal_compression

On Fri, Jan 27, 2023 at 06:06:05AM +0100, Laurenz Albe wrote:

On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote:

There is no
doubt that the current situation is unacceptable, though, so maybe we
really should just do it and make a faster one later.  Anyone else
want to vote on this?

I wasn't aware of the existence of pg_xact_status, so I suspect that it
is not a widely known and used feature. After reading the documentation,
I'd say that anybody who uses it will want it to give a reliable answer.
So I'd agree that it is better to make it more expensive, but live up to
its promise.

A code search within the Debian packages (codesearch.debian.net) and
GitHub does not show that it is actually used; pg_xact_status() only
shows up in copies of the Postgres code and its regression tests.

FWIW, my vote goes for a more expensive but reliable function even in
stable branches. Even 857ee8e mentions that this could be used on a
lost connection, so we don't even satisfy the use case of the original
commit as things stand (right?), because lost connection could just be
a result of a crash, and if crash recovery reassigns the XID, then the
client gets it wrong.
--
Michael

#18 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#14)
Re: Something is wrong with wal_compression

Hi,

On 2023-01-27 16:15:08 +1300, Thomas Munro wrote:

It would be pg_current_xact_id() that would have to pay the cost of
the WAL flush, not pg_xact_status() itself, but yeah that's what the
patch does (with some optimisations). I guess one question is whether
there are any other reasonable real world uses of
pg_current_xact_id(), other than the original goal[1].

txid_current() is a lot older than pg_current_xact_id(), and they're backed by
the same code afaict. 8.4 I think.

Unfortunately txid_current() is used in plenty of monitoring setups IME.

I don't think it's a good idea to make a function that was quite cheap for
15 years suddenly become several orders of magnitude more expensive...

Greetings,

Andres Freund

#19 Maciek Sakrejda
m.sakrejda@gmail.com
In reply to: Andres Freund (#18)
Re: Something is wrong with wal_compression

On Fri, Jan 27, 2023, 18:58 Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-01-27 16:15:08 +1300, Thomas Munro wrote:

It would be pg_current_xact_id() that would have to pay the cost of
the WAL flush, not pg_xact_status() itself, but yeah that's what the
patch does (with some optimisations). I guess one question is whether
there are any other reasonable real world uses of
pg_current_xact_id(), other than the original goal[1].

txid_current() is a lot older than pg_current_xact_id(), and they're
backed by the same code afaict. 8.4 I think.

Unfortunately txid_current() is used in plenty of monitoring setups IME.

I don't think it's a good idea to make a function that was quite cheap for
15 years suddenly become several orders of magnitude more expensive...

As someone working on a monitoring tool that uses it (well, both), +1. We'd
have to rethink a few things if this becomes a performance concern.

Thanks,
Maciek

#20 Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#17)
Re: Something is wrong with wal_compression

Hi,

On 2023-01-28 11:38:50 +0900, Michael Paquier wrote:

On Fri, Jan 27, 2023 at 06:06:05AM +0100, Laurenz Albe wrote:

On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote:

There is no
doubt that the current situation is unacceptable, though, so maybe we
really should just do it and make a faster one later. Anyone else
want to vote on this?

I wasn't aware of the existence of pg_xact_status, so I suspect that it
is not a widely known and used feature. After reading the documentation,
I'd say that anybody who uses it will want it to give a reliable answer.
So I'd agree that it is better to make it more expensive, but live up to
its promise.

A code search within the Debian packages (codesearch.debian.net) and
github does not show that it is not actually used, pg_xact_status() is
reported as parts of copies of the Postgres code in the regression
tests.

Not finding a user at codesearch.debian.net provides useful information for C
APIs, but a negative result for an SQL exposed function doesn't provide any
information. Those callers will largely be in application code, which largely
won't be in debian.

And as noted two messages up, we wouldn't need to flush in pg_xact_status(),
we'd need to flush in pg_current_xact_id()/txid_current().

FWIW, my vote goes for a more expensive but reliable function even in
stable branches.

I very strenuously object. If we make txid_current() (by way of
pg_current_xact_id()) flush WAL, we'll cause outages.

Greetings,

Andres Freund

#21 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#20)

#22 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#21)

#23 Andrey Borodin
amborodin@acm.org
In reply to: Tom Lane (#21)

#24 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#22)

#25 Andres Freund
andres@anarazel.de
In reply to: Andrey Borodin (#23)

#26 Thomas Munro
thomas.munro@gmail.com
In reply to: Andrey Borodin (#23)

#27 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#26)

#28 Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#27)

#29 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#28)