Make relfile tombstone files conditional on WAL level

Started by Thomas Munroabout 5 years ago184 messageshackers

thomas.munro@gmail.com

about 5 years ago

Hi,

I'm starting a new thread for this patch that originated as a
side-discussion in [1]/messages/by-id/CA+hUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg@mail.gmail.com, to give it its own CF entry in the next cycle.
This is a WIP with an open question to research: what could actually
break if we did this?

[1]: /messages/by-id/CA+hUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg@mail.gmail.com

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

almost 5 years ago

In reply to: Thomas Munro (#1)

Re: Make relfile tombstone files conditional on WAL level

On 05/03/2021 00:02, Thomas Munro wrote:

Hi,

I'm starting a new thread for this patch that originated as a
side-discussion in [1], to give it its own CF entry in the next cycle.
This is a WIP with an open question to research: what could actually
break if we did this?

I don't see a problem.

It would indeed be nice to have some other mechanism to prevent the
issue with wal_level=minimal, the tombstone files feel hacky and
complicated. Maybe a new shared memory hash table to track the
relfilenodes of dropped tables.

- Heikki

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: Heikki Linnakangas (#2)

Re: Make relfile tombstone files conditional on WAL level

On Thu, Jun 10, 2021 at 6:47 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

It would indeed be nice to have some other mechanism to prevent the
issue with wal_level=minimal, the tombstone files feel hacky and
complicated. Maybe a new shared memory hash table to track the
relfilenodes of dropped tables.

Just to summarize the issue here as I understand it, if a relfilenode
is used for two unrelated relations during the same checkpoint cycle
with wal_level=minimal, and if the WAL-skipping optimization is
applied to the second of those but not to the first, then crash
recovery will lose our only copy of the new relation's data, because
we'll replay the removal of the old relfilenode but will not have
logged the new data. Furthermore, we've wondered about writing an
end-of-recovery record in all cases rather than sometimes writing an
end-of-recovery record and sometimes a checkpoint record. That would
allow another version of this same problem, since a single checkpoint
cycle could now span multiple server lifetimes. At present, we dodge
all this by keeping the first segment of the main fork around as a
zero-length file for the rest of the checkpoint cycle, which I think
prevents the problem in both cases. Now, apparently that caused some
problem with the AIO patch set so Thomas is curious about getting rid
of it, and Heikki concurs that it's a hack.

I guess my concern about this patch is that it just seems to be
reducing the number of cases where that hack is used without actually
getting rid of it. Rarely-taken code paths are more likely to have
undiscovered bugs, and that seems particularly likely in this case,
because this is a low-probability scenario to begin with. A lot of
clusters probably never have an OID counter wraparound ever, and even
in those that do, getting an OID collision with just the right timing
followed by a crash before a checkpoint can intervene has got to be
super-unlikely. Even as things are today, if this mechanism has subtle
bugs, it seems entirely possible that they could have escaped notice
up until now.

So I spent some time thinking about the question of getting rid of
tombstone files altogether. I don't think that Heikki's idea of a
shared memory hash table to track dropped relfilenodes can work. The
hash table will have to be of some fixed size N, and whatever the
value of N, the approach will break down if N+1 relfilenodes are
dropped in the same checkpoint cycle.

The two most principled solutions to this problem that I can see are
(1) remove wal_level=minimal and (2) use 64-bit relfilenodes. I have
been reluctant to support #1 because it's hard for me to believe that
there aren't cases where being able to skip a whole lot of WAL-logging
doesn't work out to a nice performance win, but I realize opinions on
that topic vary. And I'm pretty sure that Andres, at least, will hate
#2 because he's unhappy with the width of buffer tags already. So I
don't really have a good idea. I agree this tombstone system is a bit
of a wart, but I'm not sure that this patch really makes anything any
better, and I'm not really seeing another idea that seems better
either.

Maybe I am missing something...

--
Robert Haas
EDB: http://www.enterprisedb.com

Andres Freund

andres@anarazel.de

over 4 years ago

In reply to: Robert Haas (#3)

Re: Make relfile tombstone files conditional on WAL level

Hi,

On 2021-08-02 16:03:31 -0400, Robert Haas wrote:

The two most principled solutions to this problem that I can see are
(1) remove wal_level=minimal and

I'm personally not opposed to this. It's not practically relevant and makes a
lot of stuff more complicated. We imo should rather focus on optimizing the
things wal_level=minimal accelerates a lot than adding complications for
wal_level=minimal. Such optimizations would have practical relevance, and
there's plenty low hanging fruits.

(2) use 64-bit relfilenodes. I have
been reluctant to support #1 because it's hard for me to believe that
there aren't cases where being able to skip a whole lot of WAL-logging
doesn't work out to a nice performance win, but I realize opinions on
that topic vary. And I'm pretty sure that Andres, at least, will hate
#2 because he's unhappy with the width of buffer tags already.

Yep :/

I guess there's a somewhat hacky way to get somewhere without actually
increasing the size. We could take 3 bytes from the fork number and use that
to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
everyone.

It's not like we can use those bytes in a useful way, due to alignment
requirements. Declaring that the high 7 bytes are for the relNode portion and
the low byte for the fork would still allow efficient comparisons and doesn't
seem too ugly.

So I don't really have a good idea. I agree this tombstone system is a
bit of a wart, but I'm not sure that this patch really makes anything
any better, and I'm not really seeing another idea that seems better
either.

Maybe I am missing something...

What I proposed in the past was to have a new shared table that tracks
relfilenodes. I still think that's a decent solution for just the problem at
hand. But it'd also potentially be the way to redesign relation forks and even
slim down buffer tags:

Right now a buffer tag is:
- 4 byte tablespace oid
- 4 byte database oid
- 4 byte "relfilenode oid" (don't think we have a good name for this)
- 4 byte fork number
- 4 byte block number

If we had such a shared table we could put at least tablespace, fork number
into that table mapping them to an 8 byte "new relfilenode". That'd only make
the "new relfilenode" unique within a database, but that'd be sufficient for
our purposes. It'd give use a buffertag consisting out of the following:
- 4 byte database oid
- 8 byte "relfilenode"
- 4 byte block number

Of course, it'd add some complexity too, because a buffertag alone wouldn't be
sufficient to read data (as you'd need the tablespace oid from elsewhere). But
that's probably ok, I think all relevant places would have that information.

It's probably possible to remove the database oid from the tag as well, but
it'd make CREATE DATABASE tricker - we'd need to change the filenames of
tables as we copy, to adjust them to the differing oid.

Greetings,

Andres Freund

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: Andres Freund (#4)

Re: Make relfile tombstone files conditional on WAL level

On Mon, Aug 2, 2021 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:

What I proposed in the past was to have a new shared table that tracks
relfilenodes. I still think that's a decent solution for just the problem at
hand.

It's not really clear to me what problem is at hand. The problems that
the tombstone system created for the async I/O stuff weren't really
explained properly, IMHO. And I don't think the current system is all
that ugly. it's not the most beautiful thing in the world but we have
lots of way worse hacks. And, it's easy to understand, requires very
little code, and has few moving parts that can fail. As hacks go it's
a quality hack, I would say.

But it'd also potentially be the way to redesign relation forks and even
slim down buffer tags:

Right now a buffer tag is:
- 4 byte tablespace oid
- 4 byte database oid
- 4 byte "relfilenode oid" (don't think we have a good name for this)
- 4 byte fork number
- 4 byte block number

If we had such a shared table we could put at least tablespace, fork number
into that table mapping them to an 8 byte "new relfilenode". That'd only make
the "new relfilenode" unique within a database, but that'd be sufficient for
our purposes. It'd give use a buffertag consisting out of the following:
- 4 byte database oid
- 8 byte "relfilenode"
- 4 byte block number

Yep. I think this is a good direction.

Of course, it'd add some complexity too, because a buffertag alone wouldn't be
sufficient to read data (as you'd need the tablespace oid from elsewhere). But
that's probably ok, I think all relevant places would have that information.

I think the thing to look at would be the places that call
relpathperm() or relpathbackend(). I imagine this can be worked out,
but it might require some adjustment.

It's probably possible to remove the database oid from the tag as well, but
it'd make CREATE DATABASE tricker - we'd need to change the filenames of
tables as we copy, to adjust them to the differing oid.

Yeah, I'm not really sure that works out to a win. I tend to think
that we should be trying to make databases within the same cluster
more rather than less independent of each other. If we switch to using
a radix tree for the buffer mapping table as you have previously
proposed, then presumably each backend can cache a pointer to the
second level, after the database OID has been resolved. Then you have
no need to compare database OIDs for every lookup. That might turn out
to be better for performance than shoving everything into the buffer
tag anyway, because then backends in different databases would be
accessing distinct parts of the buffer mapping data structure instead
of contending with one another.

--
Robert Haas
EDB: http://www.enterprisedb.com

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Robert Haas (#5)

Re: Make relfile tombstone files conditional on WAL level

On Fri, Mar 5, 2021 at 11:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:

This is a WIP with an open question to research: what could actually
break if we did this?

I thought this part of bgwriter.c might be a candidate:

if (FirstCallSinceLastCheckpoint())
{
/*
* After any checkpoint, close all smgr files. This is so we
* won't hang onto smgr references to deleted files indefinitely.
*/
smgrcloseall();
}

Hmm, on closer inspection, isn't the lack of real interlocking with
checkpoints a bit suspect already? What stops bgwriter from writing
to the previous relfilenode generation's fd, if a relfilenode is
recycled while BgBufferSync() is running? Not sinval, and not the
above code that only runs between BgBufferSync() invocations.

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Robert Haas (#5)

Re: Make relfile tombstone files conditional on WAL level

On Wed, Aug 4, 2021 at 3:22 AM Robert Haas <robertmhaas@gmail.com> wrote:

It's not really clear to me what problem is at hand. The problems that
the tombstone system created for the async I/O stuff weren't really
explained properly, IMHO. And I don't think the current system is all
that ugly. it's not the most beautiful thing in the world but we have
lots of way worse hacks. And, it's easy to understand, requires very
little code, and has few moving parts that can fail. As hacks go it's
a quality hack, I would say.

It's not really an AIO problem. It's just that while testing the AIO
stuff across a lot of operating systems, we had tests failing on
Windows because the extra worker processes you get if you use
io_method=worker were holding cached descriptors and causing stuff
like DROP TABLESPACE to fail. AFAIK every problem we discovered in
that vein is a current live bug in all versions of PostgreSQL for
Windows (it just takes other backends or the bgwriter to hold an fd at
the wrong moment). The solution I'm proposing to that general class
of problem is https://commitfest.postgresql.org/34/2962/ .

In the course of thinking about that, it seemed natural to look into
the possibility of getting rid of the tombstones, so that at least
Unix systems don't find themselves having to suffer through a
CHECKPOINT just to drop a tablespace that happens to contain a
tombstone.

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Thomas Munro (#6)

Re: Make relfile tombstone files conditional on WAL level

On Wed, Sep 29, 2021 at 4:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Hmm, on closer inspection, isn't the lack of real interlocking with
checkpoints a bit suspect already? What stops bgwriter from writing
to the previous relfilenode generation's fd, if a relfilenode is
recycled while BgBufferSync() is running? Not sinval, and not the
above code that only runs between BgBufferSync() invocations.

I managed to produce a case where live data is written to an unlinked
file and lost, with a couple of tweaks to get the right timing and
simulate OID wraparound. See attached. If you run the following
commands repeatedly with shared_buffers=256kB and
bgwriter_lru_multiplier=10, you should see a number lower than 10,000
from the last query in some runs, depending on timing.

create extension if not exists chaos;
create extension if not exists pg_prewarm;

drop table if exists t1, t2;
checkpoint;
vacuum pg_class;

select clobber_next_oid(200000);
create table t1 as select 42 i from generate_series(1, 10000);
select pg_prewarm('t1'); -- fill buffer pool with t1
update t1 set i = i; -- dirty t1 buffers so bgwriter writes some
select pg_sleep(2); -- give bgwriter some time

drop table t1;
checkpoint;
vacuum pg_class;

select clobber_next_oid(200000);
create table t2 as select 0 i from generate_series(1, 10000);
select pg_prewarm('t2'); -- fill buffer pool with t2
update t2 set i = 1 where i = 0; -- dirty t2 buffers so bgwriter writes some
select pg_sleep(2); -- give bgwriter some time

select pg_prewarm('pg_attribute'); -- evict all clean t2 buffers
select sum(i) as t2_sum_should_be_10000 from t2; -- have any updates been lost?

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Thomas Munro (#8)

Re: Make relfile tombstone files conditional on WAL level

On Thu, Sep 30, 2021 at 11:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I managed to produce a case where live data is written to an unlinked
file and lost

I guess this must have been broken since release 9.2 moved checkpoints
out of here[1]/messages/by-id/CA+U5nMLv2ah-HNHaQ=2rxhp_hDJ9jcf-LL2kW3sE4msfnUw9gA@mail.gmail.com. The connection between checkpoints, tombstone files
and file descriptor cache invalidation in auxiliary (non-sinval)
backends was not documented as far as I can see (or at least not
anywhere near the load-bearing parts).

How could it be fixed, simply and backpatchably? If BgSyncBuffer()
did if-FirstCallSinceLastCheckpoint()-then-smgrcloseall() after
locking each individual buffer and before flushing, then I think it
might logically have the correct interlocking against relfilenode
wraparound, but that sounds a tad expensive :-( I guess it could be
made cheaper by using atomics for the checkpoint counter instead of
spinlocks. Better ideas?

[1]: /messages/by-id/CA+U5nMLv2ah-HNHaQ=2rxhp_hDJ9jcf-LL2kW3sE4msfnUw9gA@mail.gmail.com

#10

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Thomas Munro (#9)

Re: Make relfile tombstone files conditional on WAL level

On Tue, Oct 5, 2021 at 4:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Sep 30, 2021 at 11:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I managed to produce a case where live data is written to an unlinked
file and lost

In conclusion, there *is* something else that would break, so I'm
withdrawing this CF entry (#3030) for now. Also, that something else
is already subtly broken, so I'll try to come up with a fix for that
separately.

#11

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: Andres Freund (#4)

Re: Make relfile tombstone files conditional on WAL level

On Mon, Aug 2, 2021 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:

I guess there's a somewhat hacky way to get somewhere without actually
increasing the size. We could take 3 bytes from the fork number and use that
to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
everyone.

It's not like we can use those bytes in a useful way, due to alignment
requirements. Declaring that the high 7 bytes are for the relNode portion and
the low byte for the fork would still allow efficient comparisons and doesn't
seem too ugly.

I think this idea is worth more consideration. It seems like 2^56
relfilenodes ought to be enough for anyone, recalling that you can
only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
bunch of code that is there to guard against relfilenodes being
reused. In particular, we can remove the code that leaves a 0-length
tombstone file around until the next checkpoint to guard against
relfilenode reuse. On Windows, we still need
https://commitfest.postgresql.org/36/2962/ because of the problem that
Windows won't remove files from the directory listing until they are
both unlinked and closed. But in general this seems like it would lead
to cleaner code. For example, GetNewRelFileNode() needn't loop. If it
allocate the smallest unsigned integer that the cluster (or database)
has never previously assigned, the file should definitely not exist on
disk, and if it does, an ERROR is appropriate, as the database is
corrupted. This does assume that allocations from this new 56-bit
relfilenode counter are properly WAL-logged.

I think this would also solve a problem Dilip mentioned to me today:
suppose you make ALTER DATABASE SET TABLESPACE WAL-logged, as he's
been trying to do. Then suppose you do "ALTER DATABASE foo SET
TABLESPACE used_recently_but_not_any_more". You might get an error
complaining that “some relations of database \“%s\” are already in
tablespace \“%s\“” because there could be tombstone files in that
database. With this combination of changes, you could just use the
barrier mechanism from https://commitfest.postgresql.org/36/2962/ to
wait for those files to disappear, because they've got to be
previously-unliked files that Windows is still returning because
they're still opening -- or else they could be a sign of a corrupted
database, but there are no other possibilities.

I think, anyway.

Make relfile tombstone files conditional on WAL level

Attachments:

Attachments:

Attachments:

Attachments: