standby server crashes hard on out-of-disk-space in HEAD

Started by Tom Laneover 8 years ago5 messages
#1Tom Lane
tgl@sss.pgh.pa.us

logfile from a standby server:

2017-06-12 11:43:46.450 EDT [13605] LOG: started streaming WAL from primary at 3/E6000000 on timeline 1
2017-06-12 11:47:46.992 EDT [11261] FATAL: could not extend file "base/47578/54806": No space left on device
2017-06-12 11:47:46.992 EDT [11261] HINT: Check free disk space.
2017-06-12 11:47:46.992 EDT [11261] CONTEXT: WAL redo at 8/EC7E0CF8 for XLOG/FPI:
2017-06-12 11:47:46.992 EDT [11261] WARNING: buffer refcount leak: [1243] (rel=base/47578/54806, blockNum=5249, flags=0x8a000000, refcount=1 1)
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523)
2017-06-12 11:47:47.567 EDT [11259] LOG: startup process (PID 11261) was terminated by signal 6: Aborted
2017-06-12 11:47:47.567 EDT [11259] LOG: terminating any other active server processes
2017-06-12 11:47:47.584 EDT [11259] LOG: database system is shut down

The FATAL is fine, but we shouldn't have that WARNING I think, and
certainly not the assertion failure.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#1)
Re: standby server crashes hard on out-of-disk-space in HEAD

(On Mon, Jun 12, 2017 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

logfile from a standby server:

2017-06-12 11:43:46.450 EDT [13605] LOG: started streaming WAL from primary at 3/E6000000 on timeline 1
2017-06-12 11:47:46.992 EDT [11261] FATAL: could not extend file "base/47578/54806": No space left on device
2017-06-12 11:47:46.992 EDT [11261] HINT: Check free disk space.
2017-06-12 11:47:46.992 EDT [11261] CONTEXT: WAL redo at 8/EC7E0CF8 for XLOG/FPI:
2017-06-12 11:47:46.992 EDT [11261] WARNING: buffer refcount leak: [1243] (rel=base/47578/54806, blockNum=5249, flags=0x8a000000, refcount=1 1)
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523)
2017-06-12 11:47:47.567 EDT [11259] LOG: startup process (PID 11261) was terminated by signal 6: Aborted
2017-06-12 11:47:47.567 EDT [11259] LOG: terminating any other active server processes
2017-06-12 11:47:47.584 EDT [11259] LOG: database system is shut down

The FATAL is fine, but we shouldn't have that WARNING I think, and
certainly not the assertion failure.

Commit 4b4b680c3d6d8485155d4d4bf0a92d3a874b7a65 (Make backend local
tracking of buffer pins memory efficient., vintage 2014) seems like a
likely culprit here, but I haven't tested.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#2)
Re: standby server crashes hard on out-of-disk-space in HEAD

On 2017-06-12 15:12:23 -0400, Robert Haas wrote:

(On Mon, Jun 12, 2017 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

logfile from a standby server:

2017-06-12 11:43:46.450 EDT [13605] LOG: started streaming WAL from primary at 3/E6000000 on timeline 1
2017-06-12 11:47:46.992 EDT [11261] FATAL: could not extend file "base/47578/54806": No space left on device
2017-06-12 11:47:46.992 EDT [11261] HINT: Check free disk space.
2017-06-12 11:47:46.992 EDT [11261] CONTEXT: WAL redo at 8/EC7E0CF8 for XLOG/FPI:
2017-06-12 11:47:46.992 EDT [11261] WARNING: buffer refcount leak: [1243] (rel=base/47578/54806, blockNum=5249, flags=0x8a000000, refcount=1 1)
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523)
2017-06-12 11:47:47.567 EDT [11259] LOG: startup process (PID 11261) was terminated by signal 6: Aborted
2017-06-12 11:47:47.567 EDT [11259] LOG: terminating any other active server processes
2017-06-12 11:47:47.584 EDT [11259] LOG: database system is shut down

The FATAL is fine, but we shouldn't have that WARNING I think, and
certainly not the assertion failure.

Just for clarification: It's a WARNING so we print all missed leaks,
rather than erroring/asserting at the first leak. We've for a long
while Asserted there's not a single pin failure (in earlier releases we
asserted out at the first leak).

Commit 4b4b680c3d6d8485155d4d4bf0a92d3a874b7a65 (Make backend local
tracking of buffer pins memory efficient., vintage 2014) seems like a
likely culprit here, but I haven't tested.

I'm not that sure. As written above, the Assert isn't new, and given
this hasn't been reported before, I'm a bit doubtful that it's a general
refcount tracking bug. The FPI code has been whacked around more
heavily, so it could well be a bug in it somewhere.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#3)
Re: standby server crashes hard on out-of-disk-space in HEAD

On Tue, Jun 13, 2017 at 4:21 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-06-12 15:12:23 -0400, Robert Haas wrote:

Commit 4b4b680c3d6d8485155d4d4bf0a92d3a874b7a65 (Make backend local
tracking of buffer pins memory efficient., vintage 2014) seems like a
likely culprit here, but I haven't tested.

I'm not that sure. As written above, the Assert isn't new, and given
this hasn't been reported before, I'm a bit doubtful that it's a general
refcount tracking bug. The FPI code has been whacked around more
heavily, so it could well be a bug in it somewhere.

Something doing a bisect could just use a VM that puts the standby on
a tiny partition. I remember seeing this assertion failure some time
ago on a test deployment, and that was really surprising. I think that
this may be hiding something, so we should really try to investigate
more what's wrong here.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#4)
1 attachment(s)
Re: standby server crashes hard on out-of-disk-space in HEAD

On Mon, Aug 28, 2017 at 9:06 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Jun 13, 2017 at 4:21 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-06-12 15:12:23 -0400, Robert Haas wrote:

Commit 4b4b680c3d6d8485155d4d4bf0a92d3a874b7a65 (Make backend local
tracking of buffer pins memory efficient., vintage 2014) seems like a
likely culprit here, but I haven't tested.

I'm not that sure. As written above, the Assert isn't new, and given
this hasn't been reported before, I'm a bit doubtful that it's a general
refcount tracking bug. The FPI code has been whacked around more
heavily, so it could well be a bug in it somewhere.

Something doing a bisect could just use a VM that puts the standby on
a tiny partition. I remember seeing this assertion failure some time
ago on a test deployment, and that was really surprising. I think that
this may be hiding something, so we should really try to investigate
more what's wrong here.

I have been playing a bit with the builds and the attached script
triggering out-of-space errors on a standby (adapt to your
environment), and while looking for a good commit, I have found that
this thing is a bit older than the 2014 vintage... Down to the
merge-base of REL9_4_STABLE and REL9_3_STABLE, the assertion failure
is still the same.

The failure is older than even 9.2, for example by testing at the
merge-base of 9.2 and 9.3:
CONTEXT: xlog redo insert(init): rel 1663/16384/16385; tid 181441/1
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c",
Line: 1788)
But well this assertion got changed in dcafdbcd.
--
Michael

Attachments:

crash_standby.bashapplication/octet-stream; name=crash_standby.bashDownload