wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

Started by Andres Freund · about 4 years ago · 39 messages · hackers
#1 Andres Freund
andres@anarazel.de

Hi,

I was working on rebasing the AIO branch. Tests started to fail afterwards, but
it turns out that the problem exists independently of AIO.

The problem starts with

commit aa01051418f10afbdfa781b8dc109615ca785ff9
Author: Robert Haas <rhaas@postgresql.org>
Date: 2022-01-24 14:23:15 -0500

pg_upgrade: Preserve database OIDs.

Commit 9a974cbcba005256a19991203583a94b4f9a21a9 arranged to preserve
relfilenodes and tablespace OIDs. For similar reasons, also arrange
to preserve database OIDs.

After this commit, a database that exists before pg_upgrade runs and the
database pg_upgrade (re-)creates can have the same OID. Consequently the
RelFileNode of files in the "pre-existing" database and the re-created
database will be the same.

That's a problem: there is no reliable invalidation mechanism forcing all
processes to clear SMgrRelationHash when a database is dropped or created,
which means that backends can end up using file descriptors for the old,
dropped database when accessing contents in the new database. This is
most likely to happen to bgwriter, but could be any backend (backends in other
databases can end up opening files in another database to write out dirty
victim buffers during buffer replacement).

That's of course dangerous, because we can end up reading the wrong data (by
reading from the old file), or losing writes (by writing to the old file).

Now, this isn't a problem that's entirely new - but previously it required
relfilenodes to be recycled due to wraparound or such. The timeframes for that
are long enough that it's very likely that *something* triggers smgr
invalidation processing. E.g. the bgwriter has

	if (FirstCallSinceLastCheckpoint())
	{
		/*
		 * After any checkpoint, close all smgr files. This is so we
		 * won't hang onto smgr references to deleted files indefinitely.
		 */
		smgrcloseall();
	}

in its main loop. Previously, for the problem to occur there would have to be
"oid wraparound relfilenode recycling" within a single checkpoint window -
quite unlikely. But now a bgwriter cycle just needs to span a drop/create
database, which is much easier to hit.

I think the most realistic way to address this is to use the mechanism from
/messages/by-id/CA+hUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg@mail.gmail.com

If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.

We could perhaps avoid all relfilenode recycling issues on the fd level by
adding a barrier whenever relfilenode allocation wraps around. But I'm not
sure how easy that is to detect.

I think we might actually be able to remove the checkpoint from dropdb() by
using barriers?

I think we need to add some debugging code to detect "fd to deleted file of
same name" style problems. Especially during regression tests they're quite
hard to notice, because most of the contents read/written are identical across
recycling.

On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
== 0.

b) might also work on a few other operating systems, but I have not verified
that. a) has the advantage of providing a filename, which makes it a lot
easier to understand the problem. It'd probably make sense to use b) on all
the operating systems it works on, and then a) on linux for a more useful
error message.

The checks for this would need to be in FileRead/FileWrite/... and direct
calls to pread etc. I think that's an acceptable cost. I have a patch for
this, but it's currently based on the AIO tree. If others agree it'd be useful
to have, I'll adapt it to work on master.

Comments?

Greetings,

Andres Freund

#2 Justin Pryzby
pryzby@telsasoft.com
In reply to: Andres Freund (#1)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

On Wed, Feb 09, 2022 at 02:00:04PM -0800, Andres Freund wrote:

On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
== 0.

You could also stat() the file in /proc/self/fd/N and compare st_ino. It
"looks" like a symlink (in which case that wouldn't work), but it's actually a
Very Special File. You can even recover deleted, still-opened files that way.

PS. I didn't know pg_upgrade knew about Reply-To ;)

#3 Andres Freund
andres@anarazel.de
In reply to: Justin Pryzby (#2)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

Hi,

On 2022-02-09 16:42:30 -0600, Justin Pryzby wrote:

On Wed, Feb 09, 2022 at 02:00:04PM -0800, Andres Freund wrote:

On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
== 0.

You could also stat() the file in /proc/self/fd/N and compare st_ino. It
"looks" like a symlink (in which case that wouldn't work), but it's actually a
Very Special File. You can even recover deleted, still-opened files that way.

Yea, the readlink() thing above relies on /proc/self/fd/$fd being a
"Very Special File".

In most places we'd not have convenient access to an inode / filename to
compare it to. I don't think there's any additional information we could gain
anyway, compared to looking at st_nlink == 0 and then doing a readlink() to
get the filename?

PS. I didn't know pg_upgrade knew about Reply-To ;)

Ugh, formatting fail...

Greetings,

Andres Freund

#4 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#1)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

On Wed, Feb 9, 2022 at 5:00 PM Andres Freund <andres@anarazel.de> wrote:

The problem starts with

commit aa01051418f10afbdfa781b8dc109615ca785ff9
Author: Robert Haas <rhaas@postgresql.org>
Date: 2022-01-24 14:23:15 -0500

pg_upgrade: Preserve database OIDs.

Well, that's sad.

I think the most realistic way to address this is to use the mechanism from
/messages/by-id/CA+hUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg@mail.gmail.com

I agree. While I feel sort of bad about missing this issue in review,
I also feel like it's pretty surprising that there isn't something
plugging this hole already. It feels unexpected that our FD management
layer might hand you an FD that references the previous file that had
the same name instead of the one that exists right now, and I don't
think it's too much of a stretch of the imagination to suppose that
this won't be the last problem we have if we don't fix it. I also
agree that the mechanism proposed in that thread is the best way to
fix the problem. The sinval mechanism won't work, since there's
nothing to ensure that the invalidation messages are processed in a
sufficiently timely fashion, and there's no obvious way to get around
that problem. But the ProcSignalBarrier mechanism is not intertwined
with heavyweight locking in the way that sinval is, so it should be a
good fit for this problem.

The main question in my mind is who is going to actually make that
happen. It was your idea (I think), Thomas coded it, and my commit
made it a live problem. So who's going to get something committed
here?

If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.

What about movedb()?

We could perhaps avoid all relfilenode recycling issues on the fd level by
adding a barrier whenever relfilenode allocation wraps around. But I'm not
sure how easy that is to detect.

I definitely think it would be nice to cover this issue not only for
database OIDs and tablespace OIDs but also for relfilenode OIDs.
Otherwise we're right on the cusp of having the same bug in that case.
pg_upgrade doesn't drop and recreate tables the way it does databases,
but if somebody made it do that in the future, would they think about
this? I bet not.

Dilip's been working on your idea of making relfilenode into a 56-bit
quantity. That would make the problem of detecting wraparound go away.
In the meantime, I suppose we could think of trying to do something
here:

	/* wraparound, or first post-initdb assignment, in normal mode */
	ShmemVariableCache->nextOid = FirstNormalObjectId;
	ShmemVariableCache->oidCount = 0;

That's a bit sketchy, because this is the OID counter, which isn't only
used for relfilenodes. I'm not sure whether there would be problems
waiting for a barrier here - I think we might be holding some lwlocks
in some code paths.

I think we might actually be able to remove the checkpoint from dropdb() by
using barriers?

That looks like it might work. We'd need to be sure that the
checkpointer would both close any open FDs and also absorb fsync
requests before acknowledging the barrier.

I think we need to add some debugging code to detect "fd to deleted file of
same name" style problems. Especially during regression tests they're quite
hard to notice, because most of the contents read/written are identical across
recycling.

Maybe. If we just plug the holes here, I am not sure we need permanent
instrumentation. But I'm also not violently opposed to it.

--
Robert Haas
EDB: http://www.enterprisedb.com

#5 Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#4)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

On Fri, Feb 11, 2022 at 7:50 AM Robert Haas <robertmhaas@gmail.com> wrote:

The main question in my mind is who is going to actually make that
happen. It was your idea (I think), Thomas coded it, and my commit
made it a live problem. So who's going to get something committed
here?

I was about to commit that, because the original Windows problem it
solved is showing up occasionally in CI failures (that is, it already
solves a live problem, albeit a different and non-data-corrupting
one):

/messages/by-id/CA+hUKGJp-m8uAD_wS7+dkTgif013SNBSoJujWxvRUzZ1nkoUyA@mail.gmail.com

It seems like I should go ahead and do that today, and we can study
further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#5)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

On Thu, Feb 10, 2022 at 3:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Feb 11, 2022 at 7:50 AM Robert Haas <robertmhaas@gmail.com> wrote:

The main question in my mind is who is going to actually make that
happen. It was your idea (I think), Thomas coded it, and my commit
made it a live problem. So who's going to get something committed
here?

I was about to commit that, because the original Windows problem it
solved is showing up occasionally in CI failures (that is, it already
solves a live problem, albeit a different and non-data-corrupting
one):

/messages/by-id/CA+hUKGJp-m8uAD_wS7+dkTgif013SNBSoJujWxvRUzZ1nkoUyA@mail.gmail.com

It seems like I should go ahead and do that today, and we can study
further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?

A bird in the hand is worth two in the bush. Unless it's a vicious,
hand-biting bird.

In other words, if you think what you've got is better than what we
have now and won't break anything worse than it is today, +1 for
committing, and more improvements can come when we get enough round
tuits.

--
Robert Haas
EDB: http://www.enterprisedb.com

#7 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#4)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

Hi,

On 2022-02-10 13:49:50 -0500, Robert Haas wrote:

I agree. While I feel sort of bad about missing this issue in review,
I also feel like it's pretty surprising that there isn't something
plugging this hole already. It feels unexpected that our FD management
layer might hand you an FD that references the previous file that had
the same name instead of the one that exists right now, and I don't
think it's too much of a stretch of the imagination to suppose that
this won't be the last problem we have if we don't fix it.

Arguably it's not really the FD layer that's the issue - it's on the smgr
level. But that's semantics...

The main question in my mind is who is going to actually make that
happen. It was your idea (I think), Thomas coded it, and my commit
made it a live problem. So who's going to get something committed
here?

I think it'd make sense for Thomas to commit the current patch for the
tablespace issues first. And then we can go from there?

If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.

What about movedb()?

My first reaction was that it should be fine, because RelFileNode.spcNode
would differ. But of course that doesn't protect against moving a database
back and forth. So yes, you're right, it's needed there too.

Are there other commands that could be problematic? ALTER TABLE SET TABLESPACE
should be fine, that allocates a new relfilenode.

I definitely think it would be nice to cover this issue not only for
database OIDs and tablespace OIDs but also for relfilenode OIDs.
Otherwise we're right on the cusp of having the same bug in that case.
pg_upgrade doesn't drop and recreate tables the way it does databases,
but if somebody made it do that in the future, would they think about
this? I bet not.

And even just plugging the wraparound aspect would be great. It's not actually
*that* hard to wrap oids around, because of toast.

Dilip's been working on your idea of making relfilenode into a 56-bit
quantity. That would make the problem of detecting wraparound go away.
In the meantime, I suppose we could do think of trying to do something
here:

	/* wraparound, or first post-initdb assignment, in normal mode */
	ShmemVariableCache->nextOid = FirstNormalObjectId;
	ShmemVariableCache->oidCount = 0;

That's a bit sketchy, because this is the OID counter, which isn't only
used for relfilenodes.

I can see a few ways to deal with that:

1) Split nextOid into two, nextRelFileNode and nextOid. Have the wraparound
behaviour only in the nextRelFileNode case. That'd be a nice improvement,
but not entirely trivial, because we currently use GetNewRelFileNode() for
oid assignment as well (c.f. heap_create_with_catalog()).

2) Keep track of the last assigned relfilenode in ShmemVariableCache, and wait
for a barrier in GetNewRelFileNode() if the assigned oid is wrapped
around. While there would be a chance of "missing" a ->nextOid wraparound,
e.g. because of >2**32 toast oids being allocated, I think that could only
happen if there were no "wrapped around relfilenodes" allocated, so that
should be ok.

3) Any form of oid wraparound might cause problems, not just relfilenode
issues. So having a place to emit barriers on oid wraparound might be a
good idea.

One thing I wonder is how to make sure the path would be tested. Perhaps we
could emit the anti-wraparound barriers more frequently when assertions are
enabled?

I'm not sure whether there would be problems waiting for a barrier here - I
think we might be holding some lwlocks in some code paths.

It seems like a bad idea to call GetNewObjectId() with lwlocks held. From what
I can see the two callers of GetNewObjectId() both have loops that can go on
for a long time after a wraparound.

I think we need to add some debugging code to detect "fd to deleted file of
same name" style problems. Especially during regression tests they're quite
hard to notice, because most of the contents read/written are identical across
recycling.

Maybe. If we just plug the holes here, I am not sure we need permanent
instrumentation. But I'm also not violently opposed to it.

The question is how we can find all the places we need to add barrier emit &
wait calls to. E.g. I didn't initially realize that there's wraparound danger
in WAL replay as well.

Greetings,

Andres Freund

#8 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#5)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

Hi,

On 2022-02-11 09:10:38 +1300, Thomas Munro wrote:

I was about to commit that, because the original Windows problem it
solved is showing up occasionally in CI failures (that is, it already
solves a live problem, albeit a different and non-data-corrupting
one):

+1

It seems like I should go ahead and do that today, and we can study
further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?

Yes.

I wonder whether we really should make the barriers be conditional on
defined(WIN32) || defined(USE_ASSERT_CHECKING) as done in that patch, even
leaving wraparound dangers aside. On !windows we still have the issues of the
space for the dropped / moved files not being released if there are processes
having them open. That can be a lot of space if there's long-lived connections
in a cluster that doesn't fit into s_b (because processes will have random fds
open for writing back dirty buffers). And we don't truncate the files before
unlinking when done as part of a DROP DATABASE...

But that's something we can fine-tune later as well...

Greetings,

Andres Freund

#9 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#8)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

Hi,

On 2022-02-10 14:26:59 -0800, Andres Freund wrote:

On 2022-02-11 09:10:38 +1300, Thomas Munro wrote:

It seems like I should go ahead and do that today, and we can study
further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?

Yes.

I wrote a test to show the problem. While looking at the code I unfortunately
found that CREATE DATABASE ... OID isn't the only problem. Two consecutive
ALTER DATABASE ... SET TABLESPACE also can cause corruption.

The test doesn't work on < 15 (due to PostgresNode renaming and use of
allow_in_place_tablespaces). But manually it's unfortunately quite
reproducible :(

Start a server with

shared_buffers = 1MB
autovacuum=off
bgwriter_lru_maxpages=1
bgwriter_delay=10000

these are just to give more time / to require smaller tables.

Start two psql sessions

s1: \c postgres
s1: CREATE DATABASE conflict_db;
s1: CREATE TABLESPACE test_tablespace LOCATION '/tmp/blarg';
s1: CREATE EXTENSION pg_prewarm;

s2: \c conflict_db
s2: CREATE TABLE large(id serial primary key, dataa text, datab text);
s2: INSERT INTO large(dataa, datab) SELECT g.i::text, 0 FROM generate_series(1, 4000) g(i);
s2: UPDATE large SET datab = 1;

-- prevent AtEOXact_SMgr
s1: BEGIN;

-- Fill shared_buffers with other contents. This needs to write out the prior dirty contents
-- leading to opening smgr rels / file descriptors
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

s2: \c postgres
-- unlinks the files
s2: ALTER DATABASE conflict_db SET TABLESPACE test_tablespace;
-- now new files with the same relfilenode exist
s2: ALTER DATABASE conflict_db SET TABLESPACE pg_default;
s2: \c conflict_db
-- dirty buffers again
s2: UPDATE large SET datab = 2;

-- this again writes out the dirty buffers. But because nothing forced the smgr handles to be closed,
-- fds pointing to the *old* file contents are used. I.e. we'll write to the wrong file
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

-- verify that everything is [not] ok
s2: SELECT datab, count(*) FROM large GROUP BY 1 ORDER BY 1 LIMIT 10;

┌───────┬───────┐
│ datab │ count │
├───────┼───────┤
│     1 │  2117 │
│     2 │    67 │
└───────┴───────┘
(2 rows)

oops.

I don't yet have a good idea how to tackle this in the backbranches.

I've started to work on a few debugging aids to find problems like
these. Attached are two WIP patches:

1) use fstat(...).st_nlink == 0 to detect dangerous actions on fds with
deleted files. That seems to work (almost) everywhere. Optionally, on
linux, use readlink(/proc/$pid/fd/$fd) to get the filename.
2) On linux, iterate over PGPROC to get a list of pids. Then iterate over
/proc/$pid/fd/ to check for deleted files. This only can be done after
we've just made sure there's no fds open to deleted files.

Greetings,

Andres Freund

#10 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#9)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

Hi,

On 2022-02-22 01:11:21 -0800, Andres Freund wrote:

I've started to work on a few debugging aids to find problems like these.
these. Attached are two WIP patches:

Forgot to attach. Also importantly includes a TAP test for several of these
issues.

Greetings,

Andres Freund

Attachments:

- v1-0001-WIP-test-for-file-reuse-dangers-around-database-a.patch (text/x-diff; +233 −1)
- v1-0002-WIP-AssertFileNotDeleted-fd.patch (text/x-diff; +90 −1)
- v1-0003-WIP-fix-old-fd-issues-using-global-barriers-every.patch (text/x-diff; +34 −18)
- v1-0004-WIP-Add-linux-only-AssertNoDeletedFilesOpenPid.patch (text/x-diff; +83 −1)
#11 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#10)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

On Tue, Feb 22, 2022 at 4:40 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-02-22 01:11:21 -0800, Andres Freund wrote:

I've started to work on a few debugging aids to find problems like
these. Attached are two WIP patches:

Forgot to attach. Also importantly includes a tap test for several of these
issues

Hi,

Just a few very preliminary comments:

- I am having some trouble understanding clearly what 0001 is doing.
I'll try to study it further.

- 0002 seems like it's generally a good idea. I haven't yet dug into
why the call sites for AssertFileNotDeleted() are where they are, or
whether that's a complete set of places to check.

- In general, 0003 makes a lot of sense to me.

+               /*
+                * Finally tell the kernel to write the data to storage. Don't smgr if
+                * previously closed, otherwise we could end up evading fd-reuse
+                * protection.
+                */

- I think the above hunk is missing a word, because it uses smgr as a
verb. I also think that it's easy to imagine this explanation being
insufficient for some future hacker to understand the issue.

- While 0004 seems useful for testing, it's an awfully big hammer. I'm
not sure we should be thinking about committing something like that,
or at least not as a default part of the build. But ... maybe?

--
Robert Haas
EDB: http://www.enterprisedb.com

#12 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#11)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

Hi,

On 2022-03-02 14:52:01 -0500, Robert Haas wrote:

- I am having some trouble understanding clearly what 0001 is doing.
I'll try to study it further.

It tests for the various scenarios I could think of that could lead to FD
reuse, to state the obvious ;). Anything particularly unclear?

- In general, 0003 makes a lot of sense to me.

+               /*
+                * Finally tell the kernel to write the data to storage. Don't smgr if
+                * previously closed, otherwise we could end up evading fd-reuse
+                * protection.
+                */

- I think the above hunk is missing a word, because it uses smgr as a
verb. I also think that it's easy to imagine this explanation being
insufficient for some future hacker to understand the issue.

Yea, it's definitely not sufficient or grammatically correct. I think I wanted
to send something out, but was very tired by that point...

- While 0004 seems useful for testing, it's an awfully big hammer. I'm
not sure we should be thinking about committing something like that,
or at least not as a default part of the build. But ... maybe?

Agreed. I think it's racy anyway, because further files could get deleted
(e.g. segments > 0) after the barrier has been processed.

What I am stuck on is what we can do for the released branches. Data
corruption after two consecutive ALTER DATABASE SET TABLESPACEs seems like
something we need to address.

Greetings,

Andres Freund

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#12)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

On Wed, Mar 2, 2022 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:

What I am stuck on is what we can do for the released branches. Data
corruption after two consecutive ALTER DATABASE SET TABLESPACEs seems like
something we need to address.

I think we should consider back-porting the ProcSignalBarrier stuff
eventually. I realize that it's not really battle-tested yet and I am
not saying we should do it right now, but I think that if we get these
changes into v15, back-porting it in let's say May of next year could
be a reasonable thing to do. Sure, there is some risk there, but on
the other hand, coming up with completely different fixes for the
back-branches is not risk-free either, nor is it clear that there is
any alternative fix that is nearly as good. In the long run, I am
fairly convinced that ProcSignalBarrier is the way forward not only
for this purpose but for other things as well, and everybody's got to
get on the train or be left behind.

Also, I am aware of multiple instances where the project waited a
long, long time to fix bugs because we didn't have a back-patchable
fix. I disagree with that on principle. A master-only fix now is
better than a back-patchable fix two or three years from now. Of
course a back-patchable fix now is better still, but we have to pick
from the options we have, not the ones we'd like to have.

<digressing a bit>

It seems to me that if we were going to try to construct an
alternative fix for the back-branches, it would have to be something
that didn't involve a new invalidation mechanism -- because the
ProcSignalBarrier stuff is an invalidation mechanism in effect, and I
feel that it can't be better to invent two new invalidation mechanisms
rather than one. And the only idea I have is trying to detect a
dangerous sequence of operations and just outright block it. We have
some cases sort of like that already - e.g. you can't prepare a
transaction if it's done certain things. But, the existing precedents
that occur to me are, I think, all cases where all of the related
actions are being performed in the same backend. It doesn't sound
crazy to me to have some rule like "you can't ALTER TABLESPACE on the
same tablespace in the same backend twice in a row without an
intervening checkpoint", or whatever, and install the book-keeping to
enforce that. But I don't think anything like that can work, both
because the two ALTER TABLESPACE commands could be performed in
different sessions, and also because an intervening checkpoint is no
guarantee of safety anyway, IIUC. So I'm just not really seeing a
reasonable strategy that isn't basically the barrier stuff.

--
Robert Haas
EDB: http://www.enterprisedb.com

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#12)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

On Wed, Mar 2, 2022 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-03-02 14:52:01 -0500, Robert Haas wrote:

- I am having some trouble understanding clearly what 0001 is doing.
I'll try to study it further.

It tests for the various scenarios I could think of that could lead to FD
reuse, to state the obvious ;). Anything particularly unclear.

As I understand it, the basic idea of this test is:

1. create a database called 'conflict_db' containing a table called 'large'
2. also create a table called 'replace_sb' in the postgres db.
3. have a long running transaction open in 'postgres' database that
repeatedly accesses the 'replace_sb' table to evict the 'conflict_db'
table, sometimes causing write-outs
4. try to get it to write out the data to a stale file descriptor by
fiddling with various things, and then detect the corruption that
results
5. use both a primary and a standby not because they are intrinsically
necessary but because you want to test that both cases work

As to (1) and (2), I note that 'large' is not a hugely informative
table name, especially when it's the smaller of the two test tables.
And also, the names are all over the place. The table names don't have
much to do with each other (large, replace_sb) and they're completely
unrelated to the database names (postgres, conflict_db). And then you
have Perl functions whose names also don't make it obvious what we're
talking about. I can't remember that verify() is the one that accesses
conflict_db.large while cause_eviction() is the one that accesses
postgres.replace_sb for more than like 15 seconds. It'd make it a lot
easier if the table names, database names, and Perl function names had
some common elements.

As to (5), the identifiers are just primary and standby, without
mentioning the database name, which adds to the confusion, and there
are no comments explaining why we have two nodes.

Also, some of the comments seem like they might be in the wrong place:

+# Create template database with a table that we'll update, to trigger dirty
+# rows. Using a template database + preexisting rows makes it a bit easier to
+# reproduce, because there's no cache invalidations generated.

Right after this comment, we create a table and then a template
database just after, so I think we're not avoiding any invalidations
at this point. We're avoiding them at later points where the comments
aren't talking about this.

I think it would be helpful to find a way to divide the test case up
into sections that are separated from each other visually, and explain
the purpose of each test a little more in a comment. For example I'm
struggling to understand the exact purpose of this bit:

+$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
+
+verify($node_primary, $node_standby, 3,
+          "restored contents as expected");

I'm all for having test coverage of VACUUM FULL, but it isn't clear to
me what this does that is related to FD reuse.

Similarly for the later ones. I generally grasp that you are trying to
make sure that things are OK with respect to FD reuse in various
scenarios, but I couldn't explain to you what we'd be losing in terms
of test coverage if we removed this line:

+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");

I am not sure how much all of this potential cleanup matters because
I'm pretty skeptical that this would be stable in the buildfarm. It
relies on a lot of assumptions about timing and row sizes and details
of when invalidations are sent that feel like they might not be
entirely predictable at the level you would need to have this
consistently pass everywhere. Perhaps I am too pessimistic?

--
Robert Haas
EDB: http://www.enterprisedb.com

#15 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#14)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

Hi,

On 2022-03-03 13:11:17 -0500, Robert Haas wrote:

On Wed, Mar 2, 2022 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-03-02 14:52:01 -0500, Robert Haas wrote:

- I am having some trouble understanding clearly what 0001 is doing.
I'll try to study it further.

It tests for the various scenarios I could think of that could lead to FD
reuse, to state the obvious ;). Anything particularly unclear.

As I understand it, the basic idea of this test is:

1. create a database called 'conflict_db' containing a table called 'large'
2. also create a table called 'replace_sb' in the postgres db.
3. have a long running transaction open in 'postgres' database that
repeatedly accesses the 'replace_sb' table to evict the 'conflict_db'
table, sometimes causing write-outs
4. try to get it to write out the data to a stale file descriptor by
fiddling with various things, and then detect the corruption that
results
5. use both a primary and a standby not because they are intrinsically
necessary but because you want to test that both cases work

Right.

I wasn't proposing to commit the test as-is, to be clear. It was meant as a
demonstration of the types of problems I can see, and what's needed to fix
them...

I can't remember that verify() is the one that accesses conflict.db large
while cause_eviction() is the one that accesses postgres.replace_sb for more
than like 15 seconds.

For more than 15 seconds? The whole test runs in a few seconds for me...

As to (5), the identifiers are just primary and standby, without
mentioning the database name, which adds to the confusion, and there
are no comments explaining why we have two nodes.

I don't follow this one - physical rep doesn't do anything below the database
level?

I think it would be helpful to find a way to divide the test case up
into sections that are separated from each other visually, and explain
the purpose of each test a little more in a comment. For example I'm
struggling to understand the exact purpose of this bit:

+$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
+
+verify($node_primary, $node_standby, 3,
+          "restored contents as expected");

I'm all for having test coverage of VACUUM FULL, but it isn't clear to
me what this does that is related to FD reuse.

It's just trying to clean up prior damage, so the test continues to later
ones. Definitely not needed once there's a fix. But it's useful while trying
to make the test actually detect various corruptions.

Similarly for the later ones. I generally grasp that you are trying to
make sure that things are OK with respect to FD reuse in various
scenarios, but I couldn't explain to you what we'd be losing in terms
of test coverage if we removed this line:

+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");

I am not sure how much all of this potential cleanup matters because
I'm pretty skeptical that this would be stable in the buildfarm.

Which "potential cleanup" are you referring to?

It relies on a lot of assumptions about timing and row sizes and details of
when invalidations are sent that feel like they might not be entirely
predictable at the level you would need to have this consistently pass
everywhere. Perhaps I am too pessimistic?

I don't think the test passing should be dependent on row size, invalidations
etc to a significant degree (unless the tables are too small to reach s_b
etc)? The exact symptoms of failures are highly unstable, but obviously we'd
fix them in-tree before committing a test. But maybe I'm missing a dependency?

FWIW, the test did pass on freebsd, linux, macos and windows with the
fix. After a few iterations of improving the fix ;)

Greetings,

Andres Freund

#16Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#15)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

On Thu, Mar 3, 2022 at 1:28 PM Andres Freund <andres@anarazel.de> wrote:

I can't remember that verify() is the one that accesses conflict.db large
while cause_eviction() is the one that accesses postgres.replace_sb for more
than like 15 seconds.

For more than 15 seconds? The whole test runs in a few seconds for me...

I'm not talking about running the test. I'm talking about reading it
and trying to keep straight what is happening. The details of which
function is accessing which database keep going out of my head.
Quickly. Maybe because I'm dumb, but I think some better naming could
help, just in case other people are dumb, too.

As to (5), the identifiers are just primary and standby, without
mentioning the database name, which adds to the confusion, and there
are no comments explaining why we have two nodes.

I don't follow this one - physical rep doesn't do anything below the database
level?

Right but ... those handles are connected to a particular DB on that
node, not just the node in general.

I think it would be helpful to find a way to divide the test case up
into sections that are separated from each other visually, and explain
the purpose of each test a little more in a comment. For example I'm
struggling to understand the exact purpose of this bit:

+$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
+
+verify($node_primary, $node_standby, 3,
+          "restored contents as expected");

I'm all for having test coverage of VACUUM FULL, but it isn't clear to
me what this does that is related to FD reuse.

It's just trying to clean up prior damage, so the test continues to later
ones. Definitely not needed once there's a fix. But it's useful while trying
to make the test actually detect various corruptions.

Ah, OK, I definitely did not understand that before.

Similarly for the later ones. I generally grasp that you are trying to
make sure that things are OK with respect to FD reuse in various
scenarios, but I couldn't explain to you what we'd be losing in terms
of test coverage if we removed this line:

+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");

I am not sure how much all of this potential cleanup matters because
I'm pretty skeptical that this would be stable in the buildfarm.

Which "potential cleanup" are you referring to?

I mean, whatever improvements you might consider making based on my comments.

It relies on a lot of assumptions about timing and row sizes and details of
when invalidations are sent that feel like they might not be entirely
predictable at the level you would need to have this consistently pass
everywhere. Perhaps I am too pessimistic?

I don't think the test passing should be dependent on row size, invalidations
etc to a significant degree (unless the tables are too small to reach s_b
etc)? The exact symptoms of failures are highly unstable, but obviously we'd
fix them in-tree before committing a test. But maybe I'm missing a dependency?

FWIW, the test did pass on freebsd, linux, macos and windows with the
fix. After a few iterations of improving the fix ;)

Hmm. Well, I admit that's already more than I would have expected....

--
Robert Haas
EDB: http://www.enterprisedb.com

#17Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#16)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

Some thoughts:

The v1-0003 patch introduced smgropen_cond() to avoid the problem of
IssuePendingWritebacks(), which does desynchronised smgropen() calls
and could open files after the barrier but just before they are
unlinked. Makes sense, but...

1. For that to actually work, we'd better call smgrcloseall() when
PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls
closeAllVfds(). Here's a patch for that. But...

2. The reason I used closeAllVfds() in 4eb21763 is because I didn't
want random CHECK_FOR_INTERRUPTS() calls to break code like this, from
RelationCopyStorage():

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        /* If we got a cancel signal during the copy of the data, quit */
        CHECK_FOR_INTERRUPTS();

        smgrread(src, forkNum, blkno, buf.data);

Perhaps we could change RelationCopyStorage() to take Relation, and
use CreateFakeRelCacheEntry() in various places to satisfy that, and
also extend that mechanism to work with temporary tables. Here's an
initial sketch of that idea, to see what problems it crashes into...
Fake Relations are rather unpleasant though; I wonder if there is an
SMgrRelationHandle concept that could make this all nicer. There is
possibly also some cleanup missing to avoid an SMgrRelation that
points to a defunct fake Relation on non-local exit (?).
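To make point 2 concrete: the danger is that a barrier processed inside
CHECK_FOR_INTERRUPTS() could close smgr state out from under a loop that
cached a handle before the loop started. The following is only a toy,
self-contained mock (all names invented, nothing from the actual patches):
it models one pattern that tolerates such a barrier, namely re-looking-up
the handle after every interrupt check instead of trusting a pointer whose
internal state the barrier may have closed.

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 4

typedef struct MockSMgr
{
    int  relfilenode;
    bool opened;            /* are the handle's "fds" currently open? */
} MockSMgr;

static MockSMgr handle = { .relfilenode = 42 };
static bool barrier_pending;
static int  reopen_count;

/* A barrier processed here closes the handle's internal state. */
static void
check_for_interrupts(void)
{
    if (barrier_pending)
    {
        handle.opened = false;
        barrier_pending = false;
    }
}

/* Lookup revalidates (reopens) rather than assuming old fds survived. */
static MockSMgr *
mock_smgropen(int relfilenode)
{
    assert(relfilenode == handle.relfilenode);
    if (!handle.opened)
    {
        handle.opened = true;
        reopen_count++;
    }
    return &handle;
}

/*
 * A copy loop in the spirit of RelationCopyStorage(), but re-fetching
 * the handle after each interrupt check so a barrier cannot leave it
 * holding closed state.
 */
static int
mock_copy_storage(int relfilenode)
{
    int copied = 0;

    for (int blkno = 0; blkno < NBLOCKS; blkno++)
    {
        check_for_interrupts();
        MockSMgr *src = mock_smgropen(relfilenode);

        assert(src->opened);    /* safe to "read" this block */
        copied++;
    }
    return copied;
}
```

This is just an illustration of the hazard and one conceivable shape of a
fix; the fake-Relation approach sketched above is a different answer to the
same problem.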

Attachments:

0001-WIP-Rethink-PROCSIGNAL_BARRIER_SMGRRELEASE.patch (text/x-patch, +2 -16)
0002-WIP-Handle-invalidation-in-RelationCopyStorage.patch (text/x-patch, +39 -49)
#18Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#17)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

On Fri, Apr 1, 2022 at 2:04 AM Thomas Munro <thomas.munro@gmail.com> wrote:

The v1-0003 patch introduced smgropen_cond() to avoid the problem of
IssuePendingWritebacks(), which does desynchronised smgropen() calls
and could open files after the barrier but just before they are
unlinked. Makes sense, but...

1. For that to actually work, we'd better call smgrcloseall() when
PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls
closeAllVfds(). Here's a patch for that. But...

What if we instead changed our approach to fixing
IssuePendingWritebacks()? Design sketch:

1. We add a new flag, bool smgr_zonk, to SmgrRelationData. Initially it's false.
2. Receipt of PROCSIGNAL_BARRIER_SMGRRELEASE sets smgr_zonk to true.
3. If you want, you can set it back to false any time you access the
smgr while holding a relation lock.
4. IssuePendingWritebacks() uses smgropen_cond(), as today, but that
function now refuses either to create a new smgr or to return one
where smgr_zonk is true.

I think this would be sufficient to guarantee that we never try to
write back to a relfilenode if we haven't had a lock on it since the
last PROCSIGNAL_BARRIER_SMGRRELEASE, which I think is the invariant we
need here?

My thinking here is that it's a lot easier to get away with marking
the content of a data structure invalid than it is to get away with
invalidating pointers to that data structure. If we can tinker with
the design so that the SMgrRelationData doesn't actually go away but
just gets a little bit more thoroughly invalidated, we could avoid
having to audit the entire code base for code that keeps smgr handles
around longer than would be possible with the design you demonstrate
here.
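The four-step sketch above can be modelled in miniature. This is a
standalone mock, not PostgreSQL code: MockSMgrRelation, smgropen_locked()
and the table layout are all invented for illustration, and only
smgropen_cond()/smgr_zonk come from the design being proposed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SMGR_TABLE_SIZE 8

typedef struct MockSMgrRelation
{
    int  relfilenode;   /* stands in for RelFileNodeBackend */
    bool in_use;
    bool smgr_zonk;     /* set by the barrier, cleared under a relation lock */
} MockSMgrRelation;

static MockSMgrRelation smgr_table[SMGR_TABLE_SIZE];

/* Step 2: the barrier handler marks every cached handle suspect. */
static void
barrier_smgr_release(void)
{
    for (int i = 0; i < SMGR_TABLE_SIZE; i++)
        if (smgr_table[i].in_use)
            smgr_table[i].smgr_zonk = true;
}

/* Step 3: an open made while holding a relation lock may revalidate. */
static MockSMgrRelation *
smgropen_locked(int relfilenode)
{
    for (int i = 0; i < SMGR_TABLE_SIZE; i++)
        if (smgr_table[i].in_use && smgr_table[i].relfilenode == relfilenode)
        {
            smgr_table[i].smgr_zonk = false;   /* safe: caller holds the lock */
            return &smgr_table[i];
        }
    for (int i = 0; i < SMGR_TABLE_SIZE; i++)
        if (!smgr_table[i].in_use)
        {
            smgr_table[i].in_use = true;
            smgr_table[i].relfilenode = relfilenode;
            smgr_table[i].smgr_zonk = false;
            return &smgr_table[i];
        }
    return NULL;    /* table full; cannot happen in this toy */
}

/* Step 4: writeback neither creates a handle nor returns a zonked one. */
static MockSMgrRelation *
smgropen_cond(int relfilenode)
{
    for (int i = 0; i < SMGR_TABLE_SIZE; i++)
        if (smgr_table[i].in_use &&
            smgr_table[i].relfilenode == relfilenode &&
            !smgr_table[i].smgr_zonk)
            return &smgr_table[i];
    return NULL;    /* caller must skip this writeback */
}
```

The invariant falls out of the mock: once the barrier runs, the writeback
path can no longer reach a handle until some lock-holding access has
vouched for it again.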

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#18)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

On Sat, Apr 2, 2022 at 2:52 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Apr 1, 2022 at 2:04 AM Thomas Munro <thomas.munro@gmail.com> wrote:

The v1-0003 patch introduced smgropen_cond() to avoid the problem of
IssuePendingWritebacks(), which does desynchronised smgropen() calls
and could open files after the barrier but just before they are
unlinked. Makes sense, but...

1. For that to actually work, we'd better call smgrcloseall() when
PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls
closeAllVfds(). Here's a patch for that. But...

What if we instead changed our approach to fixing
IssuePendingWritebacks()? Design sketch:

1. We add a new flag, bool smgr_zonk, to SmgrRelationData. Initially it's false.
2. Receipt of PROCSIGNAL_BARRIER_SMGRRELEASE sets smgr_zonk to true.
3. If you want, you can set it back to false any time you access the
smgr while holding a relation lock.
4. IssuePendingWritebacks() uses smgropen_cond(), as today, but that
function now refuses either to create a new smgr or to return one
where smgr_zonk is true.

I think this would be sufficient to guarantee that we never try to
write back to a relfilenode if we haven't had a lock on it since the
last PROCSIGNAL_BARRIER_SMGRRELEASE, which I think is the invariant we
need here?

My thinking here is that it's a lot easier to get away with marking
the content of a data structure invalid than it is to get away with
invalidating pointers to that data structure. If we can tinker with
the design so that the SMgrRelationData doesn't actually go away but
just gets a little bit more thoroughly invalidated, we could avoid
having to audit the entire code base for code that keeps smgr handles
around longer than would be possible with the design you demonstrate
here.

Another idea would be to call a new function DropPendingWritebacks(),
and also tell all the SMgrRelation objects to close all their internal
state (ie the fds + per-segment objects) but not free the main
SMgrRelationData object, and for good measure also invalidate the
small amount of cached state (which hasn't been mentioned in this
thread, but it kinda bothers me that that state currently survives, so
it was one unspoken reason I liked the smgrcloseall() idea). Since
DropPendingWritebacks() is in a rather different module, perhaps if we
were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
something else, because it's not really a SMGR-only thing anymore.
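In mock form, the idea reads roughly like this. Everything here is a toy
standing in for the real structures: the fields, mock_smgr_close_internals()
and mock_barrier_handler() are invented, and DropPendingWritebacks() is the
hypothetical function named in the message, not an existing API.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct MockSMgrRelation
{
    int  fd;                /* stands in for the per-segment fds in md.c */
    int  cached_nblocks;    /* stands in for cached per-relation state */
    bool have_cached;
} MockSMgrRelation;

static MockSMgrRelation rel = { .fd = -1 };
static int npending_writebacks;

/* Hypothetical: forget any writeback requests queued before the barrier. */
static void
DropPendingWritebacks(void)
{
    npending_writebacks = 0;
}

/*
 * Close fds and invalidate cached state, but keep the struct itself
 * alive, so pointers held elsewhere (e.g. in pending-writeback entries
 * that raced with the barrier) never dangle.
 */
static void
mock_smgr_close_internals(MockSMgrRelation *r)
{
    r->fd = -1;
    r->have_cached = false;
}

/* What the (perhaps renamed) barrier handler would do under this scheme. */
static void
mock_barrier_handler(void)
{
    DropPendingWritebacks();
    mock_smgr_close_internals(&rel);
}
```

The point of the sketch is only the ownership rule: the barrier empties the
object rather than freeing it, so later users simply find it closed and
reopen.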

#20Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#19)
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

On Sat, Apr 2, 2022 at 10:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Another idea would be to call a new function DropPendingWritebacks(),
and also tell all the SMgrRelation objects to close all their internal
state (ie the fds + per-segment objects) but not free the main
SMgrRelationData object, and for good measure also invalidate the
small amount of cached state (which hasn't been mentioned in this
thread, but it kinda bothers me that that state currently survives, so
it was one unspoken reason I liked the smgrcloseall() idea). Since
DropPendingWritebacks() is in a rather different module, perhaps if we
were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
something else, because it's not really a SMGR-only thing anymore.

Here's a sketch of that idea.

Attachments:

0001-WIP-Rethink-PROCSIGNAL_BARRIER_SMGRRELEASE.patch (text/x-patch, +70 -16)
0002-WIP-fix-old-fd-issues-using-global-barriers-everywhe.patch (text/x-patch, +5 -15)
#21Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#19)
#22Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#21)
#23Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#22)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#23)
#25Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#24)
#26Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#25)
#27Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#26)
#28Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#27)
#29Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#28)
#30Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#29)
#31Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#30)
#32Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#30)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#32)
#34Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#33)
#35Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#31)
#36Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#35)
#37Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#36)
#38Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#37)
#39Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#38)