PANIC: could not fsync file "pg_multixact/..." since commit dee663f7843

Started by Tomas Vondraabout 5 years ago7 messages
#1Tomas Vondra
Tomas Vondra
tomas.vondra@2ndquadrant.com
1 attachment(s)

Hi,

While running some multixact-oriented stress tests, I noticed that
commit dee663f7843:

Defer flushing of SLRU files.

Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery. Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba. This can cause a significant improvement to
recovery performance, especially when it's otherwise CPU-bound.

...

seems to trigger this issue:

[17820]: PANIC: could not fsync file "pg_multixact/offsets/06E0": No such file or directory
[17820]: PANIC: could not fsync file "pg_multixact/offsets/06E0": No such file or directory
[17818]: LOG: database system is shut down
[17818]: LOG: database system is shut down

which is then followed by this during recovery:

[18599]: PANIC: could not fsync file "pg_multixact/offsets/06E0": No such file or directory
[18599]: PANIC: could not fsync file "pg_multixact/offsets/06E0": No such file or directory
[18599]: PANIC: could not fsync file "pg_multixact/offsets/06E0": No such file or directory
[18599]: PANIC: could not fsync file "pg_multixact/offsets/06E0": No such file or directory
[18599]: PANIC: could not fsync file "pg_multixact/offsets/06E0": No such file or directory
[18599]: PANIC: could not fsync file "pg_multixact/offsets/06E0": No such file or directory
[18599]: PANIC: could not fsync file "pg_multixact/offsets/06E0": No such file or directory
[18599]: PANIC: could not fsync file "pg_multixact/offsets/06E0": No such file or directory
[17818]: LOG: database system is shut down
[17818]: LOG: database system is shut down
[17818]: LOG: database system is shut down

at which point the cluster is kaput, of course.

It's clearly the fault of dee663f7843 - 4 failures out of 4 attempts on
that commit, and after switching to ca7f8e2b86 it goes away.

Reproducing it is pretty simple, but it takes a bit of time. Essentially
do this:

create table t (a int primary key);
insert into t select i from generate_series(1,1000) s(i);

and then run

SELECT * FROM t FOR KEY SHARE;

from pgbench with many concurrent clients. I do this:

pgbench -n -c 32 -j 8 -f select.sql -T 86400 test

After a while (~1h on my machine) the pg_multixact gets over 10GB, which
triggers a more aggressive cleanup (per MultiXactMemberFreezeThreshold).
My guess is that this discards some of the files, but checkpointer is
not aware of that, or something like that. Not sure.

Attached are backtraces from the two crashes - regular and during
recovery. Not sure how interesting / helpful that is, it probably does
not say much about how we got there.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

backtraces.txttext/plain; charset=us-ascii
#2Thomas Munro
Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#1)
Re: PANIC: could not fsync file "pg_multixact/..." since commit dee663f7843

On Wed, Nov 4, 2020 at 2:32 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

After a while (~1h on my machine) the pg_multixact gets over 10GB, which
triggers a more aggressive cleanup (per MultiXactMemberFreezeThreshold).
My guess is that this discards some of the files, but checkpointer is
not aware of that, or something like that. Not sure.

Urgh. Thanks. Looks like perhaps the problem is that I have
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true) in one codepath
that unlinks files, but not another. Looking.

#3Tomas Vondra
Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#2)
Re: PANIC: could not fsync file "pg_multixact/..." since commit dee663f7843

On Wed, Nov 04, 2020 at 02:49:24PM +1300, Thomas Munro wrote:

On Wed, Nov 4, 2020 at 2:32 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

After a while (~1h on my machine) the pg_multixact gets over 10GB, which
triggers a more aggressive cleanup (per MultiXactMemberFreezeThreshold).
My guess is that this discards some of the files, but checkpointer is
not aware of that, or something like that. Not sure.

Urgh. Thanks. Looks like perhaps the problem is that I have
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true) in one codepath
that unlinks files, but not another. Looking.

Maybe. I didn't have time to investigate this more deeply, and it takes
quite a bit of time to reproduce. I can try again with extra logging or
test some proposed fixes, if you give me a patch.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Thomas Munro
Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#3)
1 attachment(s)
Re: PANIC: could not fsync file "pg_multixact/..." since commit dee663f7843

On Wed, Nov 4, 2020 at 2:57 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Nov 04, 2020 at 02:49:24PM +1300, Thomas Munro wrote:

On Wed, Nov 4, 2020 at 2:32 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

After a while (~1h on my machine) the pg_multixact gets over 10GB, which
triggers a more aggressive cleanup (per MultiXactMemberFreezeThreshold).
My guess is that this discards some of the files, but checkpointer is
not aware of that, or something like that. Not sure.

Urgh. Thanks. Looks like perhaps the problem is that I have
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true) in one codepath
that unlinks files, but not another. Looking.

Maybe. I didn't have time to investigate this more deeply, and it takes
quite a bit of time to reproduce. I can try again with extra logging or
test some proposed fixes, if you give me a patch.

I think this should be fixed by doing all unlinking through a common
code path. Does this pass your test?

Attachments:

0001-Fix-unlinking-of-SLRU-segments.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-unlinking-of-SLRU-segments.patch
#5Tomas Vondra
Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#4)
Re: PANIC: could not fsync file "pg_multixact/..." since commit dee663f7843

On Wed, Nov 04, 2020 at 05:36:46PM +1300, Thomas Munro wrote:

On Wed, Nov 4, 2020 at 2:57 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Nov 04, 2020 at 02:49:24PM +1300, Thomas Munro wrote:

On Wed, Nov 4, 2020 at 2:32 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

After a while (~1h on my machine) the pg_multixact gets over 10GB, which
triggers a more aggressive cleanup (per MultiXactMemberFreezeThreshold).
My guess is that this discards some of the files, but checkpointer is
not aware of that, or something like that. Not sure.

Urgh. Thanks. Looks like perhaps the problem is that I have
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true) in one codepath
that unlinks files, but not another. Looking.

Maybe. I didn't have time to investigate this more deeply, and it takes
quite a bit of time to reproduce. I can try again with extra logging or
test some proposed fixes, if you give me a patch.

I think this should be fixed by doing all unlinking through a common
code path. Does this pass your test?

Seems to be working - without the patch it failed after ~1h, now it's
running for more than 2h without a crash. I'll let it run for a few more
hours (on both machines).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6Tomas Vondra
Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#5)
Re: PANIC: could not fsync file "pg_multixact/..." since commit dee663f7843

On 11/4/20 2:50 PM, Tomas Vondra wrote:

On Wed, Nov 04, 2020 at 05:36:46PM +1300, Thomas Munro wrote:

On Wed, Nov 4, 2020 at 2:57 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Nov 04, 2020 at 02:49:24PM +1300, Thomas Munro wrote:

On Wed, Nov 4, 2020 at 2:32 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

After a while (~1h on my machine) the pg_multixact gets over 10GB,

which

triggers a more aggressive cleanup (per

MultiXactMemberFreezeThreshold).

My guess is that this discards some of the files, but checkpointer is
not aware of that, or something like that. Not sure.

Urgh.  Thanks.  Looks like perhaps the problem is that I have
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true) in one codepath
that unlinks files, but not another.  Looking.

Maybe. I didn't have time to investigate this more deeply, and it takes
quite a bit of time to reproduce. I can try again with extra logging or
test some proposed fixes, if you give me a patch.

I think this should be fixed by doing all unlinking through a common
code path.  Does this pass your test?

Seems to be working - without the patch it failed after ~1h, now it's
running for more than 2h without a crash. I'll let it run for a few more
hours (on both machines).

It's been running for hours on both machines, without any crashes etc.
While that's not a definitive proof the fix is correct, it certainly
behaves differently.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Thomas Munro
Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#6)
Re: PANIC: could not fsync file "pg_multixact/..." since commit dee663f7843

On Thu, Nov 5, 2020 at 12:07 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

It's been running for hours on both machines, without any crashes etc.
While that's not a definitive proof the fix is correct, it certainly
behaves differently.

Thanks! Embarrassed to have missed that. Pushed.