Something is wrong with wal_compression
The symptom being exhibited by Michael's new BF animal tanager
is perfectly reproducible elsewhere.
$ cat /home/postgres/tmp/temp_config
#default_toast_compression = lz4
wal_compression = lz4
$ export TEMP_CONFIG=/home/postgres/tmp/temp_config
$ cd ~/pgsql/src/test/recovery
$ make check PROVE_TESTS=t/011_crash_recovery.pl
...
+++ tap check in src/test/recovery +++
t/011_crash_recovery.pl .. 1/?
# Failed test 'new xid after restart is greater'
# at t/011_crash_recovery.pl line 53.
# '729'
# >
# '729'
# Failed test 'xid is aborted after crash'
# at t/011_crash_recovery.pl line 57.
# got: 'committed'
# expected: 'aborted'
# Looks like you failed 2 tests of 3.
Maybe this is somehow the test script's fault, but I don't see how.
It fails the same way with 'wal_compression = pglz', so I think it's
generic to that whole feature rather than specific to LZ4.
regards, tom lane
On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:
The symptom being exhibited by Michael's new BF animal tanager
is perfectly reproducible elsewhere.
I think these tests have always failed with wal_compression ?
/messages/by-id/20210308.173242.463790587797836129.horikyota.ntt@gmail.com
/messages/by-id/20210313012820.GJ29463@telsasoft.com
/messages/by-id/20220222231948.GJ9008@telsasoft.com
/messages/by-id/YNqWd2GSMrnqWIfx@paquier.xyz
|My buildfarm machine has been changed to use wal_compression = lz4,
|while on it for HEAD runs.
Justin Pryzby <pryzby@telsasoft.com> writes:
On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:
The symptom being exhibited by Michael's new BF animal tanager
is perfectly reproducible elsewhere.
I think these tests have always failed with wal_compression ?
If that's a known problem, and we've done nothing about it,
that is pretty horrid. That test case is demonstrating fundamental
database corruption after a crash.
regards, tom lane
On Thu, Jan 26, 2023 at 02:08:27PM -0600, Justin Pryzby wrote:
On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:
The symptom being exhibited by Michael's new BF animal tanager
is perfectly reproducible elsewhere.
I think these tests have always failed with wal_compression?
/messages/by-id/20210308.173242.463790587797836129.horikyota.ntt@gmail.com
/messages/by-id/20210313012820.GJ29463@telsasoft.com
/messages/by-id/20220222231948.GJ9008@telsasoft.com
+ /messages/by-id/c86ce84f-dd38-9951-102f-13a931210f52@dunslane.net
On Thu, Jan 26, 2023 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
That test case is demonstrating fundamental
database corruption after a crash.
Not exactly corruption. The XID was not persisted and buffer data did not
hit disk. The database is in the correct state.
It was discussed long before WAL compression here [0]. The thing is that
it is easier to reproduce with compression, but compression has nothing
to do with it, as far as I understand.
A proposed fix is here [1], but I think it's better to fix the test. It
should not verify the XID, but rather the side effects of "CREATE TABLE
mine(x integer);".
Best regards, Andrey Borodin.
[0]: /messages/by-id/565FB155-C6B0-41E2-8C44-7B514DC25132@yandex-team.ru
[1]: /messages/by-id/20210313012820.GJ29463@telsasoft.com
Andrey Borodin <amborodin86@gmail.com> writes:
On Thu, Jan 26, 2023 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
That test case is demonstrating fundamental
database corruption after a crash.
Not exactly corruption. The XID was not persisted and buffer data did not
hit disk. The database is in the correct state.
Really? I don't see how this part is even a little bit okay:
[00:40:50.744](0.046s) not ok 3 - xid is aborted after crash
[00:40:50.745](0.001s)
[00:40:50.745](0.000s) # Failed test 'xid is aborted after crash'
# at t/011_crash_recovery.pl line 57.
[00:40:50.746](0.001s) # got: 'committed'
# expected: 'aborted'
If any tuples made by that transaction had reached disk,
we'd have a problem.
regards, tom lane
On Fri, Jan 27, 2023 at 11:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrey Borodin <amborodin86@gmail.com> writes:
On Thu, Jan 26, 2023 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
That test case is demonstrating fundamental
database corruption after a crash.
Not exactly corruption. The XID was not persisted and buffer data did not
hit disk. The database is in the correct state.
Really? I don't see how this part is even a little bit okay:
[00:40:50.744](0.046s) not ok 3 - xid is aborted after crash
[00:40:50.745](0.001s)
[00:40:50.745](0.000s) # Failed test 'xid is aborted after crash'
# at t/011_crash_recovery.pl line 57.
[00:40:50.746](0.001s) # got: 'committed'
# expected: 'aborted'
If any tuples made by that transaction had reached disk,
we'd have a problem.
The problem is that the WAL wasn't flushed, allowing the same xid to
be allocated again after crash recovery. But for any data pages to
hit the disk, we'd have to flush WAL first, so then it couldn't
happen, no? FWIW I also re-complained about the dangers of anyone
relying on pg_xact_status() for its stated purpose after seeing
tanager's failure [1].
[1]: /messages/by-id/CA+hUKGJ9p2JPPMA4eYAKq=r9d_4_8vziet_tS1LEBbiny5-ypA@mail.gmail.com
Thomas Munro <thomas.munro@gmail.com> writes:
On Fri, Jan 27, 2023 at 11:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
If any tuples made by that transaction had reached disk,
we'd have a problem.
The problem is that the WAL wasn't flushed, allowing the same xid to
be allocated again after crash recovery. But for any data pages to
hit the disk, we'd have to flush WAL first, so then it couldn't
happen, no?
Ah, now I get the point: the "committed xact" seen after restart
isn't the same one as we saw before the crash, but a new one that
was given the same XID because nothing about the old one had made
it to disk yet.
FWIW I also re-complained about the dangers of anyone
relying on pg_xact_status() for its stated purpose after seeing
tanager's failure[1].
Indeed, it seems like this behavior makes pg_xact_status() basically
useless as things stand.
regards, tom lane
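The failure mode Tom describes can be sketched with a toy simulation (invented names and structure, not PostgreSQL internals): the WAL record that consumed the XID never reaches disk, so recovery restarts allocation from the checkpoint's value, an unrelated post-crash transaction receives the same XID and commits, and a status probe for the old XID now reports 'committed':

```python
# Toy model: an XID allocation is durable only if the WAL record that
# consumed it was flushed before the crash. All names are invented.

class ToyServer:
    def __init__(self, next_xid=729, durable_wal=()):
        self.next_xid = next_xid
        self.wal_buffer = []              # WAL records not yet flushed to "disk"
        self.durable_wal = list(durable_wal)
        self.clog = {}                    # xid -> 'committed'

    def begin(self):
        xid = self.next_xid
        self.next_xid += 1
        self.wal_buffer.append(('assign', xid))   # lives only in memory so far
        return xid

    def commit(self, xid):
        # Commit forces a WAL flush, making everything buffered so far durable.
        self.wal_buffer.append(('commit', xid))
        self.durable_wal += self.wal_buffer
        self.wal_buffer = []
        self.clog[xid] = 'committed'

    def crash_and_recover(self):
        # The unflushed WAL buffer is lost. Restart from the checkpoint's
        # next-xid (729 here) and replay only the durable records.
        recovered = ToyServer(next_xid=729)
        for kind, xid in self.durable_wal:
            recovered.next_xid = max(recovered.next_xid, xid + 1)
            if kind == 'commit':
                recovered.clog[xid] = 'committed'
        return recovered

    def xact_status(self, xid):
        # An xid with no commit record reads as aborted.
        return self.clog.get(xid, 'aborted')

s = ToyServer()
old_xid = s.begin()          # 729, recorded only in the in-memory WAL buffer
s = s.crash_and_recover()    # allocation lost: recovery restarts at 729

new_xid = s.begin()          # the *same* xid, handed to an unrelated session
s.commit(new_xid)

# A client that saved old_xid before the crash now gets a wrong answer:
print(old_xid, new_xid, s.xact_status(old_xid))   # 729 729 committed
```

This mirrors the test output above exactly: the "new xid after restart is greater" check fails with 729 vs 729, and the aborted-after-crash check sees 'committed'.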
On Thu, Jan 26, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Indeed, it seems like this behavior makes pg_xact_status() basically
useless as things stand.
If we agree that xid allocation is not something persistent, let's fix
the test? We can replace a check with select * from pg_class or,
maybe, add an amcheck run.
As far as I recollect, this test was introduced to test this new
function in 857ee8e391f.
Best regards, Andrey Borodin.
Andrey Borodin <amborodin86@gmail.com> writes:
On Thu, Jan 26, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Indeed, it seems like this behavior makes pg_xact_status() basically
useless as things stand.
If we agree that xid allocation is not something persistent, let's fix
the test?
If we're not going to fix this behavior, we need to fix the docs
to disclaim that pg_xact_status() is of use for what it's said
to be good for.
regards, tom lane
On Thu, Jan 26, 2023 at 04:14:57PM -0800, Andrey Borodin wrote:
If we agree that xid allocation is not something persistent, let's fix
the test? We can replace a check with select * from pg_class or,
maybe, add an amcheck run.
As far as I recollect, this test was introduced to test this new
function in 857ee8e391f.
My opinion would be to make this function more reliable, FWIW, even if
that involves a performance impact when called in a close loop by
forcing more WAL flushes to ensure its report durability and
consistency. As things stand, this is basically unreliable, and we
document it as something applications can *use*. Adding a note in the
docs to say that this function can be unstable for some edge cases
does not make much sense to me, either. Commit 857ee8e itself says
that we can use it if a database connection is lost, which could
happen on a crash..
--
Michael
On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Jan 26, 2023 at 04:14:57PM -0800, Andrey Borodin wrote:
If we agree that xid allocation is not something persistent, let's fix
the test? We can replace a check with select * from pg_class or,
maybe, add an amcheck run.
As far as I recollect, this test was introduced to test this new
function in 857ee8e391f.
My opinion would be to make this function more reliable, FWIW, even if
that involves a performance impact when called in a close loop by
forcing more WAL flushes to ensure its report durability and
consistency. As things stand, this is basically unreliable, and we
document it as something applications can *use*. Adding a note in the
docs to say that this function can be unstable for some edge cases
does not make much sense to me, either. Commit 857ee8e itself says
that we can use it if a database connection is lost, which could
happen on a crash..
Yeah, the other thread has a patch for that. But it would hurt some
workloads. A better patch would do some kind of amortisation
(reserving N xids at a time or some such scheme, while being careful
to make sure the right CLOG pages etc exist if you crash and skip a
bunch of xids on recovery) but be more complicated. For the record,
back before release 13 added the 64 bit xid allocator, these functions
(or rather their txid_XXX ancestors) were broken in a different way:
they didn't track epochs reliably, the discovery of which led to the
new xid8-based functions, so that might provide a natural
back-patching range, if a back-patchable solution can be agreed on.
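The amortisation scheme Thomas sketches could look roughly like this (a toy model with invented names, ignoring CLOG page initialisation and concurrency): durably log a reservation of N XIDs at a time, and on recovery resume allocation past the whole reserved range, so a crash may waste XIDs but can never reuse one:

```python
# Sketch of amortised XID reservation; illustrative only, not PostgreSQL code.

RESERVE_N = 64   # reservation size; a tuning knob in this sketch

class XidAllocator:
    def __init__(self, durable_reserved_upto=0):
        self.reserved_upto = durable_reserved_upto  # durably logged high mark
        self.next_xid = durable_reserved_upto       # recovery skips the range
        self.flushes = 0

    def allocate(self):
        if self.next_xid >= self.reserved_upto:
            # Durably log the new reservation *before* handing out any xid
            # inside it; the counter stands in for an immediate WAL flush.
            self.reserved_upto = self.next_xid + RESERVE_N
            self.flushes += 1
        xid = self.next_xid
        self.next_xid += 1
        return xid

    def crash_and_recover(self):
        # In-memory state is lost, but the reservation record survived, so
        # we resume past every xid that might have been handed out.
        return XidAllocator(durable_reserved_upto=self.reserved_upto)

a = XidAllocator()
first_run = [a.allocate() for _ in range(10)]   # xids 0..9, one flush total
flushes_before_crash = a.flushes                # 1, not 10
a = a.crash_and_recover()
after_crash = a.allocate()                      # 64: skips 10..63, reuses none
```

The flush cost is paid once per RESERVE_N allocations instead of once per allocation, at the price of a gap in the XID sequence after a crash.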
Thomas Munro <thomas.munro@gmail.com> writes:
On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier <michael@paquier.xyz> wrote:
My opinion would be to make this function more reliable, FWIW, even if
that involves a performance impact when called in a close loop by
forcing more WAL flushes to ensure its report durability and
consistency.
Yeah, the other thread has a patch for that. But it would hurt some
workloads.
I think we need to get the thing correct first and worry about
performance later. What's wrong with simply making pg_xact_status
write and flush a record of the XID's existence before returning it?
Yeah, it will cost you if you use that function, but not if you don't.
A better patch would do some kind of amortisation
(reserving N xids at a time or some such scheme, while being careful
to make sure the right CLOG pages etc exist if you crash and skip a
bunch of xids on recovery) but be more complicated.
Maybe that would be appropriate for HEAD, but I'd be wary of adding
anything complicated to the back branches. This is clearly a very
under-tested area.
regards, tom lane
On Fri, Jan 27, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier <michael@paquier.xyz> wrote:
My opinion would be to make this function more reliable, FWIW, even if
that involves a performance impact when called in a close loop by
forcing more WAL flushes to ensure its report durability and
consistency.
Yeah, the other thread has a patch for that. But it would hurt some
workloads.
I think we need to get the thing correct first and worry about
performance later. What's wrong with simply making pg_xact_status
write and flush a record of the XID's existence before returning it?
Yeah, it will cost you if you use that function, but not if you don't.
It would be pg_current_xact_id() that would have to pay the cost of
the WAL flush, not pg_xact_status() itself, but yeah that's what the
patch does (with some optimisations). I guess one question is whether
there are any other reasonable real world uses of
pg_current_xact_id(), other than the original goal [1]. If not, then
at least you are penalising the right users, even though they probably
only actually call pg_xact_status() in extremely rare circumstances
(if COMMIT hangs up). But I wouldn't be surprised if people have
found other reasons to be interested in xid observability, related to
distributed transactions and snapshots and suchlike. There is no
doubt that the current situation is unacceptable, though, so maybe we
really should just do it and make a faster one later. Anyone else
want to vote on this?
[1]: /messages/by-id/CAMsr+YHQiWNEi0daCTboS40T+V5s_+dst3PYv_8v2wNVH+Xx4g@mail.gmail.com
Thomas Munro <thomas.munro@gmail.com> writes:
On Fri, Jan 27, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think we need to get the thing correct first and worry about
performance later. What's wrong with simply making pg_xact_status
write and flush a record of the XID's existence before returning it?
Yeah, it will cost you if you use that function, but not if you don't.
It would be pg_current_xact_id() that would have to pay the cost of
the WAL flush, not pg_xact_status() itself,
Right, typo on my part.
regards, tom lane
On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote:
On Fri, Jan 27, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier <michael@paquier.xyz> wrote:
My opinion would be to make this function more reliable, FWIW, even if
that involves a performance impact when called in a close loop by
forcing more WAL flushes to ensure its report durability and
consistency.
Yeah, the other thread has a patch for that. But it would hurt some
workloads.
I think we need to get the thing correct first and worry about
performance later. What's wrong with simply making pg_xact_status
write and flush a record of the XID's existence before returning it?
Yeah, it will cost you if you use that function, but not if you don't.
There is no
doubt that the current situation is unacceptable, though, so maybe we
really should just do it and make a faster one later. Anyone else
want to vote on this?
I wasn't aware of the existence of pg_xact_status, so I suspect that it
is not a widely known and used feature. After reading the documentation,
I'd say that anybody who uses it will want it to give a reliable answer.
So I'd agree that it is better to make it more expensive, but live up to
its promise.
Yours,
Laurenz Albe
On Fri, Jan 27, 2023 at 06:06:05AM +0100, Laurenz Albe wrote:
On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote:
There is no
doubt that the current situation is unacceptable, though, so maybe we
really should just do it and make a faster one later. Anyone else
want to vote on this?
I wasn't aware of the existence of pg_xact_status, so I suspect that it
is not a widely known and used feature. After reading the documentation,
I'd say that anybody who uses it will want it to give a reliable answer.
So I'd agree that it is better to make it more expensive, but live up to
its promise.
A code search within the Debian packages (codesearch.debian.net) and
GitHub does not show that it is actually used; pg_xact_status() only
shows up in copies of the Postgres code in the regression tests.
FWIW, my vote goes for a more expensive but reliable function even in
stable branches. Even 857ee8e mentions that this could be used on a
lost connection, so we don't even satisfy the use case of the original
commit as things stand (right?), because a lost connection could just be
the result of a crash, and if crash recovery reassigns the XID, then the
client gets it wrong.
--
Michael
Hi,
On 2023-01-27 16:15:08 +1300, Thomas Munro wrote:
It would be pg_current_xact_id() that would have to pay the cost of
the WAL flush, not pg_xact_status() itself, but yeah that's what the
patch does (with some optimisations). I guess one question is whether
there are any other reasonable real world uses of
pg_current_xact_id(), other than the original goal[1].
txid_current() is a lot older than pg_current_xact_id(), and they're backed by
the same code afaict. 8.4 I think.
Unfortunately, txid_current() is used in plenty of monitoring setups IME.
I don't think it's a good idea to make a function that was quite cheap for 15
years, suddenly be several orders of magnitude more expensive...
Greetings,
Andres Freund
On Fri, Jan 27, 2023, 18:58 Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-01-27 16:15:08 +1300, Thomas Munro wrote:
It would be pg_current_xact_id() that would have to pay the cost of
the WAL flush, not pg_xact_status() itself, but yeah that's what the
patch does (with some optimisations). I guess one question is whether
there are any other reasonable real world uses of
pg_current_xact_id(), other than the original goal[1].
txid_current() is a lot older than pg_current_xact_id(), and they're backed by
the same code afaict. 8.4 I think.
Unfortunately, txid_current() is used in plenty of monitoring setups IME.
I don't think it's a good idea to make a function that was quite cheap for 15
years, suddenly be several orders of magnitude more expensive...
As someone working on a monitoring tool that uses it (well, both), +1. We'd
have to rethink a few things if this becomes a performance concern.
Thanks,
Maciek
Hi,
On 2023-01-28 11:38:50 +0900, Michael Paquier wrote:
On Fri, Jan 27, 2023 at 06:06:05AM +0100, Laurenz Albe wrote:
On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote:
There is no
doubt that the current situation is unacceptable, though, so maybe we
really should just do it and make a faster one later. Anyone else
want to vote on this?
I wasn't aware of the existence of pg_xact_status, so I suspect that it
is not a widely known and used feature. After reading the documentation,
I'd say that anybody who uses it will want it to give a reliable answer.
So I'd agree that it is better to make it more expensive, but live up to
its promise.
A code search within the Debian packages (codesearch.debian.net) and
GitHub does not show that it is actually used; pg_xact_status() only
shows up in copies of the Postgres code in the regression tests.
Not finding a user at codesearch.debian.net provides useful information for C
APIs, but a negative result for an SQL exposed function doesn't provide any
information. Those callers will largely be in application code, which largely
won't be in debian.
And as noted two messages up, we wouldn't need to flush in pg_xact_status(),
we'd need to flush in pg_current_xact_id()/txid_current().
FWIW, my vote goes for a more expensive but reliable function even in
stable branches.
I very strenuously object. If we make txid_current() (by way of
pg_current_xact_id()) flush WAL, we'll cause outages.
Greetings,
Andres Freund