pgsql: Remove the restriction that the relmap must be 512 bytes.
Remove the restriction that the relmap must be 512 bytes.
Instead of relying on the ability to atomically overwrite the
entire relmap file in one shot, write a new one and durably
rename it into place. Removing the struct padding and the
calculation showing why the map is exactly 512 bytes, and change
the maximum number of entries to a nearby round number.
Patch by me, reviewed by Andres Freund and Dilip Kumar.
Discussion: /messages/by-id/CA+TgmoZq5=LWDK7kHaUbmWXxcaTuw_QwafgG9dr-BaPym_U8WQ@mail.gmail.com
Discussion: /messages/by-id/CAFiTN-ttOXLX75k_WzRo9ar=VvxFhrHi+rJxns997F+yvkm==A@mail.gmail.com
Branch
------
master
Details
-------
https://git.postgresql.org/pg/commitdiff/d8cd0c6c95c0120168df93aae095df4e0682a08a
Modified Files
--------------
doc/src/sgml/monitoring.sgml | 4 +-
src/backend/utils/activity/wait_event.c | 4 +-
src/backend/utils/cache/relmapper.c | 94 +++++++++++++++++++--------------
src/include/utils/wait_event.h | 2 +-
4 files changed, 58 insertions(+), 46 deletions(-)
Hi Robert,
On Tue, Jul 26, 2022 at 07:10:22PM +0000, Robert Haas wrote:
Remove the restriction that the relmap must be 512 bytes.
Instead of relying on the ability to atomically overwrite the
entire relmap file in one shot, write a new one and durably
rename it into place. Removing the struct padding and the
calculation showing why the map is exactly 512 bytes, and change
the maximum number of entries to a nearby round number.Patch by me, reviewed by Andres Freund and Dilip Kumar.
Discussion: /messages/by-id/CA+TgmoZq5=LWDK7kHaUbmWXxcaTuw_QwafgG9dr-BaPym_U8WQ@mail.gmail.com
Discussion: /messages/by-id/CAFiTN-ttOXLX75k_WzRo9ar=VvxFhrHi+rJxns997F+yvkm==A@mail.gmail.com
The CI on Windows is blowing up here and there after something that
looks to come from this commit, as of this backtrace:
00000000`007fe300 00000001`405c62dd postgres!errfinish(
char * filename = 0x00000001`40bf1513 "fd.c",
int lineno = 0n756,
char * funcname = 0x00000001`40bf14e0 "durable_rename")+0x41b
[c:\cirrus\src\backend\utils\error\elog.c @ 683]
00000000`007fe360 00000001`4081647b postgres!durable_rename(
char * oldfile = 0x00000000`007fe430 "base/16384/pg_filenode.map.tmp",
char * newfile = 0x00000000`007fe830 "base/16384/pg_filenode.map",
int elevel = 0n21)+0x22d [c:\cirrus\src\backend\storage\file\fd.c @
753]
00000000`007fe3b0 00000001`408166c9 postgres!write_relmap_file(
struct RelMapFile * newmap = 0x00000000`007fecb0,
bool write_wal = true,
bool send_sinval = true,
bool preserve_files = true,
unsigned int dbid = 0x4000,
unsigned int tsid = 0x67f,
char * dbpath = 0x00000000`0090b1c0 "base/16384")+0x38b
[c:\cirrus\src\backend\utils\cache\relmapper.c @ 971]
Here is one of them, kicked by the CF bot, but I have seen similar
crashes with some of my own things (see the txt file in crashlog, in a
manual VACUUM):
https://cirrus-ci.com/task/5240408958566400
Thanks,
--
Michael
On Wed, Jul 27, 2022 at 4:35 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Jul 26, 2022 at 07:10:22PM +0000, Robert Haas wrote:
Remove the restriction that the relmap must be 512 bytes.
The CI on Windows is blowing up here and there after something that
looks to come from this commit, as of this backtrace:
00000000`007fe300 00000001`405c62dd postgres!errfinish(
char * filename = 0x00000001`40bf1513 "fd.c",
int lineno = 0n756,
char * funcname = 0x00000001`40bf14e0 "durable_rename")+0x41b
[c:\cirrus\src\backend\utils\error\elog.c @ 683]
And here's what the error looks like:
2022-07-26 19:38:04.321 GMT [8020][client backend]
[pg_regress/vacuum][8/349:4527] PANIC: could not rename file
"global/pg_filenode.map.tmp" to "global/pg_filenode.map": Permission
denied
Someone else still has the old file open, so we can't rename the new
one to its name? On Windows that should have gone through pgrename()
in dirmod.c, which would retry 100 times with a 100ms sleep between.
Since every backend reads that file (I added an elog() and saw it 2289
times during make check), I guess you can run out of luck.
/me thinks
On Wed, Jul 27, 2022 at 5:01 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Jul 27, 2022 at 4:35 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Jul 26, 2022 at 07:10:22PM +0000, Robert Haas wrote:
Remove the restriction that the relmap must be 512 bytes.
The CI on Windows is blowing up here and there after something that
looks to come from this commit, as of this backtrace:
00000000`007fe300 00000001`405c62dd postgres!errfinish(
char * filename = 0x00000001`40bf1513 "fd.c",
int lineno = 0n756,
char * funcname = 0x00000001`40bf14e0 "durable_rename")+0x41b
[c:\cirrus\src\backend\utils\error\elog.c @ 683]And here's what the error looks like:
2022-07-26 19:38:04.321 GMT [8020][client backend]
[pg_regress/vacuum][8/349:4527] PANIC: could not rename file
"global/pg_filenode.map.tmp" to "global/pg_filenode.map": Permission
deniedSomeone else still has the old file open, so we can't rename the new
one to its name? On Windows that should have gone through pgrename()
in dirmod.c, which would retry 100 times with a 100ms sleep between.
Since every backend reads that file (I added an elog() and saw it 2289
times during make check), I guess you can run out of luck./me thinks
Maybe we just have to rearrange the locking slightly? Something like
the attached.
Attachments:
0001-Fix-read_relmap_file-concurrency-on-Windows.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-read_relmap_file-concurrency-on-Windows.patchDownload+10-11
On Wed, Jul 27, 2022 at 6:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Jul 27, 2022 at 5:01 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Someone else still has the old file open, so we can't rename the new
one to its name? On Windows that should have gone through pgrename()
in dirmod.c, which would retry 100 times with a 100ms sleep between.
Since every backend reads that file (I added an elog() and saw it 2289
times during make check), I guess you can run out of luck./me thinks
Maybe we just have to rearrange the locking slightly? Something like
the attached.
Erm, let me try that again, this time with the CloseTransientFile()
also under the lock, so that we never have a file handle without a
lock.
Attachments:
v2-0001-Fix-read_relmap_file-concurrency-on-Windows.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Fix-read_relmap_file-concurrency-on-Windows.patchDownload+13-14
On Wed, Jul 27, 2022 at 06:21:07PM +1200, Thomas Munro wrote:
Erm, let me try that again, this time with the CloseTransientFile()
also under the lock, so that we never have a file handle without a
lock.
Right. The whole write_relmap_file() already happens while taking
RelationMappingLock, so that seems like a good idea for consistency at
the end (even if I remember that there is a patch floating around to
improve the concurrency of pgrename, which may become an easier move
now that we require Windows 10).
I have tested three runs and that was working here even if the
issue is sporadic, so more runs may be better to have more
confidence.
--
Michael
On Wed, Jul 27, 2022 at 6:25 AM Michael Paquier <michael@paquier.xyz> wrote:
Right. The whole write_relmap_file() already happens while taking
RelationMappingLock, so that seems like a good idea for consistency at
the end (even if I remember that there is a patch floating around to
improve the concurrency of pgrename, which may become an easier move
now that we require Windows 10).I have tested three runs and that was working here even if the
issue is sporadic, so more runs may be better to have more
confidence.
OK, I committed Thomas's patch, after taking the liberty of adding an
explanatory comment.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2022-Jul-26, Robert Haas wrote:
Remove the restriction that the relmap must be 512 bytes.
Instead of relying on the ability to atomically overwrite the
entire relmap file in one shot, write a new one and durably
rename it into place. Removing the struct padding and the
calculation showing why the map is exactly 512 bytes, and change
the maximum number of entries to a nearby round number.
Another thing that seems to have happened here is that catversion ought
to have been touched and wasn't. Trying to start a cluster that was
initdb'd with the previous code enters an infinite loop that dies each
time with
2022-07-27 19:17:27.589 CEST [2516547] LOG: database system is ready to accept connections
2022-07-27 19:17:27.589 CEST [2516730] FATAL: could not read file "global/pg_filenode.map": read 512 of 524
2022-07-27 19:17:27.589 CEST [2516731] FATAL: could not read file "global/pg_filenode.map": read 512 of 524
2022-07-27 19:17:27.589 CEST [2516547] LOG: autovacuum launcher process (PID 2516730) exited with exit code 1
2022-07-27 19:17:27.589 CEST [2516547] LOG: terminating any other active server processes
Perhaps we should still do a catversion bump now, since one hasn't
happened since the commit.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"El número de instalaciones de UNIX se ha elevado a 10,
y se espera que este número aumente" (UPM, 1972)
On Wed, Jul 27, 2022 at 1:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Another thing that seems to have happened here is that catversion ought
to have been touched and wasn't. Trying to start a cluster that was
initdb'd with the previous code enters an infinite loop that dies each
time with2022-07-27 19:17:27.589 CEST [2516547] LOG: database system is ready to accept connections
2022-07-27 19:17:27.589 CEST [2516730] FATAL: could not read file "global/pg_filenode.map": read 512 of 524
2022-07-27 19:17:27.589 CEST [2516731] FATAL: could not read file "global/pg_filenode.map": read 512 of 524
2022-07-27 19:17:27.589 CEST [2516547] LOG: autovacuum launcher process (PID 2516730) exited with exit code 1
2022-07-27 19:17:27.589 CEST [2516547] LOG: terminating any other active server processesPerhaps we should still do a catversion bump now, since one hasn't
happened since the commit.
Hmm, interesting. I didn't think about bumping catversion because I
didn't change anything in the catalogs. I did think about changing the
magic number for the file at one point, but unlike some of our other
constants, there's no indication that this one is intended to be used
as a version number. But in retrospect it would have been good to
change something somewhere. If you want me to bump catversion now, I
can. If you or someone else wants to do it, that's also fine.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Jul 27, 2022 at 1:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Another thing that seems to have happened here is that catversion ought
to have been touched and wasn't.
Hmm, interesting. I didn't think about bumping catversion because I
didn't change anything in the catalogs. I did think about changing the
magic number for the file at one point, but unlike some of our other
constants, there's no indication that this one is intended to be used
as a version number. But in retrospect it would have been good to
change something somewhere. If you want me to bump catversion now, I
can. If you or someone else wants to do it, that's also fine.
If there's a magic number, then I'd (a) change that and (b) adjust
whatever comments led you to think you shouldn't. Bumping catversion
is a good fallback choice when there's not any more-proximate version
indicator, but here there is.
regards, tom lane
On Wed, Jul 27, 2022 at 1:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
If there's a magic number, then I'd (a) change that and (b) adjust
whatever comments led you to think you shouldn't. Bumping catversion
is a good fallback choice when there's not any more-proximate version
indicator, but here there is.
Maybe I just got cold feet because it doesn't ever have seem to have
been changed before, because the definition says:
#define RELMAPPER_FILEMAGIC 0x592717 /* version ID value */
And the fact that "version" is in there sure makes it seem like a
version number, now that I look again.
I'll add 1 to the value.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Jul 27, 2022 at 2:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jul 27, 2022 at 1:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
If there's a magic number, then I'd (a) change that and (b) adjust
whatever comments led you to think you shouldn't. Bumping catversion
is a good fallback choice when there's not any more-proximate version
indicator, but here there is.Maybe I just got cold feet because it doesn't ever have seem to have
been changed before, because the definition says:#define RELMAPPER_FILEMAGIC 0x592717 /* version ID value */
And the fact that "version" is in there sure makes it seem like a
version number, now that I look again.I'll add 1 to the value.
Hmm, but that doesn't actually produce nice behavior. It just goes
into an infinite loop, like this:
2022-07-27 14:21:12.826 EDT [32849] LOG: database system was
interrupted; last known up at 2022-07-27 14:21:12 EDT
2022-07-27 14:21:12.860 EDT [32849] LOG: database system was not
properly shut down; automatic recovery in progress
2022-07-27 14:21:12.861 EDT [32849] LOG: invalid record length at
0/14B3BB8: wanted 24, got 0
2022-07-27 14:21:12.861 EDT [32849] LOG: redo is not required
2022-07-27 14:21:12.864 EDT [32850] LOG: checkpoint starting:
end-of-recovery immediate wait
2022-07-27 14:21:12.865 EDT [32850] LOG: checkpoint complete: wrote 3
buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
write=0.001 s, sync=0.001 s, total=0.002 s; sync files=2,
longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB;
lsn=0/14B3BB8, redo lsn=0/14B3BB8
2022-07-27 14:21:12.868 EDT [31930] LOG: database system is ready to
accept connections
2022-07-27 14:21:12.869 EDT [32853] FATAL: relation mapping file
"global/pg_filenode.map" contains invalid data
2022-07-27 14:21:12.869 EDT [32854] FATAL: relation mapping file
"global/pg_filenode.map" contains invalid data
2022-07-27 14:21:12.870 EDT [31930] LOG: autovacuum launcher process
(PID 32853) exited with exit code 1
2022-07-27 14:21:12.870 EDT [31930] LOG: terminating any other active
server processes
2022-07-27 14:21:12.870 EDT [31930] LOG: background worker "logical
replication launcher" (PID 32854) exited with exit code 1
2022-07-27 14:21:12.871 EDT [31930] LOG: all server processes
terminated; reinitializing
While I agree that changing a version identifier that is more closely
related to what got changed is better than changing a generic one, I
think people won't like an infinite loop that spews messages into the
log at top speed as a way of telling them about the problem.
So now I'm back to being unsure what to do here.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2022-Jul-27, Robert Haas wrote:
Hmm, but that doesn't actually produce nice behavior. It just goes
into an infinite loop, like this:
2022-07-27 14:21:12.869 EDT [32853] FATAL: relation mapping file
"global/pg_filenode.map" contains invalid data
This seems almost identical what happens without the version number
change, so I wouldn't call it much of an improvement.
While I agree that changing a version identifier that is more closely
related to what got changed is better than changing a generic one, I
think people won't like an infinite loop that spews messages into the
log at top speed as a way of telling them about the problem.So now I'm back to being unsure what to do here.
I vote to go for the catversion bump for now.
Maybe it's possible to patch the relmapper code afterwards, so that if a
version mismatch is detected the server is stopped with a reasonable
message. An hypothetical improvement would be to keep the code to read
the old version and upgrade the file automatically, but given the number
of times that this file has changed, it's likely pointless effort.
Therefore, my proposal is to add a comment next to the struct definition
suggesting to bump catversion and call it a day.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Hmm, but that doesn't actually produce nice behavior. It just goes
into an infinite loop, like this:
So now I'm back to being unsure what to do here.
I vote to go for the catversion bump for now.
What this is showing us is that any form of corruption in the relmapper
file causes very unpleasant behavior. We probably had better do something
about that, independently of this issue.
In the meantime, I still think bumping the file magic number is a better
answer. It won't make the behavior any worse for un-updated code than
it is already.
regards, tom lane
On Wed, Jul 27, 2022 at 3:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Hmm, but that doesn't actually produce nice behavior. It just goes
into an infinite loop, like this:
So now I'm back to being unsure what to do here.I vote to go for the catversion bump for now.
What this is showing us is that any form of corruption in the relmapper
file causes very unpleasant behavior. We probably had better do something
about that, independently of this issue.
I'm not sure how important that is, but it certainly wouldn't hurt.
In the meantime, I still think bumping the file magic number is a better
answer. It won't make the behavior any worse for un-updated code than
it is already.
But it also won't make it any better, so why even bother? The goal of
catversion bumps is to replace crashes or unpredictable misbehavior
with a nice error message that tells you exactly what the problem is.
Here we'd just be replacing an infinite series of crashes with an
infinite series of crashes with a slightly different error message.
It's probably worth comparing those error messages:
FATAL: could not read file "global/pg_filenode.map": read 512 of 524
FATAL: relation mapping file "global/pg_filenode.map" contains invalid data
The first message is what you get now. The second message is what you
get with the proposed change to the magic number. I would argue that
the second message is actually worse than the first one, because the
first one actually gives you some hint what the problem is, whereas
the second one really just says that an unspecified bad thing
happened.
In short, I think Alvaro's idea is unprincipled but solves the problem
of making it clear that a new initdb is required. Your idea is
principled but does not solve any problem.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
In short, I think Alvaro's idea is unprincipled but solves the problem
of making it clear that a new initdb is required. Your idea is
principled but does not solve any problem.
[ shrug... ] Fair enough.
regards, tom lane