Make relfile tombstone files conditional on WAL level

Started by Thomas Munro · about 5 years ago · 184 messages · hackers
#1 Thomas Munro
thomas.munro@gmail.com

Hi,

I'm starting a new thread for this patch that originated as a
side-discussion in [1], to give it its own CF entry in the next cycle.
This is a WIP with an open question to research: what could actually
break if we did this?

[1]: /messages/by-id/CA+hUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg@mail.gmail.com

Attachments:

0001-Make-relfile-tombstone-files-conditional-on-WAL-leve.patch (text/x-patch, +23 -8)
#2 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#1)
Re: Make relfile tombstone files conditional on WAL level

On 05/03/2021 00:02, Thomas Munro wrote:

Hi,

I'm starting a new thread for this patch that originated as a
side-discussion in [1], to give it its own CF entry in the next cycle.
This is a WIP with an open question to research: what could actually
break if we did this?

I don't see a problem.

It would indeed be nice to have some other mechanism to prevent the
issue with wal_level=minimal, the tombstone files feel hacky and
complicated. Maybe a new shared memory hash table to track the
relfilenodes of dropped tables.

- Heikki

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: Make relfile tombstone files conditional on WAL level

On Thu, Jun 10, 2021 at 6:47 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

It would indeed be nice to have some other mechanism to prevent the
issue with wal_level=minimal, the tombstone files feel hacky and
complicated. Maybe a new shared memory hash table to track the
relfilenodes of dropped tables.

Just to summarize the issue here as I understand it, if a relfilenode
is used for two unrelated relations during the same checkpoint cycle
with wal_level=minimal, and if the WAL-skipping optimization is
applied to the second of those but not to the first, then crash
recovery will lose our only copy of the new relation's data, because
we'll replay the removal of the old relfilenode but will not have
logged the new data. Furthermore, we've wondered about writing an
end-of-recovery record in all cases rather than sometimes writing an
end-of-recovery record and sometimes a checkpoint record. That would
allow another version of this same problem, since a single checkpoint
cycle could now span multiple server lifetimes. At present, we dodge
all this by keeping the first segment of the main fork around as a
zero-length file for the rest of the checkpoint cycle, which I think
prevents the problem in both cases. Now, apparently that caused some
problem with the AIO patch set so Thomas is curious about getting rid
of it, and Heikki concurs that it's a hack.

I guess my concern about this patch is that it just seems to be
reducing the number of cases where that hack is used without actually
getting rid of it. Rarely-taken code paths are more likely to have
undiscovered bugs, and that seems particularly likely in this case,
because this is a low-probability scenario to begin with. A lot of
clusters probably never have an OID counter wraparound ever, and even
in those that do, getting an OID collision with just the right timing
followed by a crash before a checkpoint can intervene has got to be
super-unlikely. Even as things are today, if this mechanism has subtle
bugs, it seems entirely possible that they could have escaped notice
up until now.

So I spent some time thinking about the question of getting rid of
tombstone files altogether. I don't think that Heikki's idea of a
shared memory hash table to track dropped relfilenodes can work. The
hash table will have to be of some fixed size N, and whatever the
value of N, the approach will break down if N+1 relfilenodes are
dropped in the same checkpoint cycle.

The two most principled solutions to this problem that I can see are
(1) remove wal_level=minimal and (2) use 64-bit relfilenodes. I have
been reluctant to support #1 because it's hard for me to believe that
there aren't cases where being able to skip a whole lot of WAL-logging
doesn't work out to a nice performance win, but I realize opinions on
that topic vary. And I'm pretty sure that Andres, at least, will hate
#2 because he's unhappy with the width of buffer tags already. So I
don't really have a good idea. I agree this tombstone system is a bit
of a wart, but I'm not sure that this patch really makes anything any
better, and I'm not really seeing another idea that seems better
either.

Maybe I am missing something...

--
Robert Haas
EDB: http://www.enterprisedb.com

#4 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#3)
Re: Make relfile tombstone files conditional on WAL level

Hi,

On 2021-08-02 16:03:31 -0400, Robert Haas wrote:

The two most principled solutions to this problem that I can see are
(1) remove wal_level=minimal and

I'm personally not opposed to this. It's not practically relevant and makes a
lot of stuff more complicated. We imo should rather focus on optimizing the
things wal_level=minimal accelerates a lot than adding complications for
wal_level=minimal. Such optimizations would have practical relevance, and
there are plenty of low-hanging fruits.

(2) use 64-bit relfilenodes. I have
been reluctant to support #1 because it's hard for me to believe that
there aren't cases where being able to skip a whole lot of WAL-logging
doesn't work out to a nice performance win, but I realize opinions on
that topic vary. And I'm pretty sure that Andres, at least, will hate
#2 because he's unhappy with the width of buffer tags already.

Yep :/

I guess there's a somewhat hacky way to get somewhere without actually
increasing the size. We could take 3 bytes from the fork number and use that
to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
everyone.

It's not like we can use those bytes in a useful way, due to alignment
requirements. Declaring that the high 7 bytes are for the relNode portion and
the low byte for the fork would still allow efficient comparisons and doesn't
seem too ugly.
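A minimal sketch of the packing described above, assuming the 7-byte/1-byte split lands in a single uint64; the names (pack_relnode and friends) are invented for illustration and are not from the PostgreSQL tree:

```c
#include <stdint.h>
#include <assert.h>

/*
 * Hypothetical packing of a 56-bit relNode and an 8-bit fork number into
 * one uint64: the high 7 bytes hold the relNode, the low byte the fork.
 */
typedef uint64_t PackedRelNode;

static inline PackedRelNode
pack_relnode(uint64_t relNode, uint8_t forkNum)
{
    /* caller must ensure relNode fits in 56 bits */
    return (relNode << 8) | forkNum;
}

static inline uint64_t
packed_get_relnode(PackedRelNode p)
{
    return p >> 8;              /* high 7 bytes */
}

static inline uint8_t
packed_get_fork(PackedRelNode p)
{
    return (uint8_t) (p & 0xFF);        /* low byte */
}
```

Because the relNode occupies the high bytes, comparing two packed values orders by relNode first and fork second, which is what keeps single-word comparisons efficient.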

So I don't really have a good idea. I agree this tombstone system is a
bit of a wart, but I'm not sure that this patch really makes anything
any better, and I'm not really seeing another idea that seems better
either.

Maybe I am missing something...

What I proposed in the past was to have a new shared table that tracks
relfilenodes. I still think that's a decent solution for just the problem at
hand. But it'd also potentially be the way to redesign relation forks and even
slim down buffer tags:

Right now a buffer tag is:
- 4 byte tablespace oid
- 4 byte database oid
- 4 byte "relfilenode oid" (don't think we have a good name for this)
- 4 byte fork number
- 4 byte block number

If we had such a shared table we could put at least tablespace, fork number
into that table mapping them to an 8 byte "new relfilenode". That'd only make
the "new relfilenode" unique within a database, but that'd be sufficient for
our purposes. It'd give us a buffertag consisting of the following:
- 4 byte database oid
- 8 byte "relfilenode"
- 4 byte block number
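To make the size difference concrete, here is a sketch of the two layouts; the struct names are invented, and the real BufferTag in the tree differs in member naming. The 8-byte relfilenode is shown split into two uint32 halves so the struct keeps 4-byte alignment and no padding, the same trick BlockIdData uses:

```c
#include <stdint.h>
#include <assert.h>

typedef uint32_t Oid;
typedef uint32_t BlockNumber;

/* The buffer tag as it exists today (fields as listed above): 20 bytes. */
typedef struct CurrentBufferTag
{
    Oid         spcNode;        /* tablespace oid */
    Oid         dbNode;         /* database oid */
    Oid         relNode;        /* "relfilenode oid" */
    uint32_t    forkNum;        /* fork number */
    BlockNumber blockNum;       /* block number */
} CurrentBufferTag;

/*
 * The slimmed-down tag sketched above: 16 bytes.  Storing the 8-byte
 * "new relfilenode" as two uint32 halves avoids forcing 8-byte alignment
 * (and thus padding) on the struct.
 */
typedef struct SlimBufferTag
{
    Oid         dbNode;         /* database oid */
    uint32_t    relNode_hi;     /* high half of 8-byte relfilenode */
    uint32_t    relNode_lo;     /* low half */
    BlockNumber blockNum;       /* block number */
} SlimBufferTag;
```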

Of course, it'd add some complexity too, because a buffertag alone wouldn't be
sufficient to read data (as you'd need the tablespace oid from elsewhere). But
that's probably ok, I think all relevant places would have that information.

It's probably possible to remove the database oid from the tag as well, but
it'd make CREATE DATABASE trickier - we'd need to change the filenames of
tables as we copy, to adjust them to the differing oid.

Greetings,

Andres Freund

#5 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#4)
Re: Make relfile tombstone files conditional on WAL level

On Mon, Aug 2, 2021 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:

What I proposed in the past was to have a new shared table that tracks
relfilenodes. I still think that's a decent solution for just the problem at
hand.

It's not really clear to me what problem is at hand. The problems that
the tombstone system created for the async I/O stuff weren't really
explained properly, IMHO. And I don't think the current system is all
that ugly. It's not the most beautiful thing in the world but we have
lots of way worse hacks. And, it's easy to understand, requires very
little code, and has few moving parts that can fail. As hacks go it's
a quality hack, I would say.

But it'd also potentially be the way to redesign relation forks and even
slim down buffer tags:

Right now a buffer tag is:
- 4 byte tablespace oid
- 4 byte database oid
- 4 byte "relfilenode oid" (don't think we have a good name for this)
- 4 byte fork number
- 4 byte block number

If we had such a shared table we could put at least tablespace, fork number
into that table mapping them to an 8 byte "new relfilenode". That'd only make
the "new relfilenode" unique within a database, but that'd be sufficient for
our purposes. It'd give us a buffertag consisting of the following:
- 4 byte database oid
- 8 byte "relfilenode"
- 4 byte block number

Yep. I think this is a good direction.

Of course, it'd add some complexity too, because a buffertag alone wouldn't be
sufficient to read data (as you'd need the tablespace oid from elsewhere). But
that's probably ok, I think all relevant places would have that information.

I think the thing to look at would be the places that call
relpathperm() or relpathbackend(). I imagine this can be worked out,
but it might require some adjustment.

It's probably possible to remove the database oid from the tag as well, but
it'd make CREATE DATABASE trickier - we'd need to change the filenames of
tables as we copy, to adjust them to the differing oid.

Yeah, I'm not really sure that works out to a win. I tend to think
that we should be trying to make databases within the same cluster
more rather than less independent of each other. If we switch to using
a radix tree for the buffer mapping table as you have previously
proposed, then presumably each backend can cache a pointer to the
second level, after the database OID has been resolved. Then you have
no need to compare database OIDs for every lookup. That might turn out
to be better for performance than shoving everything into the buffer
tag anyway, because then backends in different databases would be
accessing distinct parts of the buffer mapping data structure instead
of contending with one another.

--
Robert Haas
EDB: http://www.enterprisedb.com

#6 Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#5)
Re: Make relfile tombstone files conditional on WAL level

On Fri, Mar 5, 2021 at 11:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:

This is a WIP with an open question to research: what could actually
break if we did this?

I thought this part of bgwriter.c might be a candidate:

    if (FirstCallSinceLastCheckpoint())
    {
        /*
         * After any checkpoint, close all smgr files.  This is so we
         * won't hang onto smgr references to deleted files indefinitely.
         */
        smgrcloseall();
    }

Hmm, on closer inspection, isn't the lack of real interlocking with
checkpoints a bit suspect already? What stops bgwriter from writing
to the previous relfilenode generation's fd, if a relfilenode is
recycled while BgBufferSync() is running? Not sinval, and not the
above code that only runs between BgBufferSync() invocations.

#7 Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#5)
Re: Make relfile tombstone files conditional on WAL level

On Wed, Aug 4, 2021 at 3:22 AM Robert Haas <robertmhaas@gmail.com> wrote:

It's not really clear to me what problem is at hand. The problems that
the tombstone system created for the async I/O stuff weren't really
explained properly, IMHO. And I don't think the current system is all
that ugly. It's not the most beautiful thing in the world but we have
lots of way worse hacks. And, it's easy to understand, requires very
little code, and has few moving parts that can fail. As hacks go it's
a quality hack, I would say.

It's not really an AIO problem. It's just that while testing the AIO
stuff across a lot of operating systems, we had tests failing on
Windows because the extra worker processes you get if you use
io_method=worker were holding cached descriptors and causing stuff
like DROP TABLESPACE to fail. AFAIK every problem we discovered in
that vein is a current live bug in all versions of PostgreSQL for
Windows (it just takes other backends or the bgwriter to hold an fd at
the wrong moment). The solution I'm proposing to that general class
of problem is https://commitfest.postgresql.org/34/2962/ .

In the course of thinking about that, it seemed natural to look into
the possibility of getting rid of the tombstones, so that at least
Unix systems don't find themselves having to suffer through a
CHECKPOINT just to drop a tablespace that happens to contain a
tombstone.

#8 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#6)
Re: Make relfile tombstone files conditional on WAL level

On Wed, Sep 29, 2021 at 4:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Hmm, on closer inspection, isn't the lack of real interlocking with
checkpoints a bit suspect already? What stops bgwriter from writing
to the previous relfilenode generation's fd, if a relfilenode is
recycled while BgBufferSync() is running? Not sinval, and not the
above code that only runs between BgBufferSync() invocations.

I managed to produce a case where live data is written to an unlinked
file and lost, with a couple of tweaks to get the right timing and
simulate OID wraparound. See attached. If you run the following
commands repeatedly with shared_buffers=256kB and
bgwriter_lru_multiplier=10, you should see a number lower than 10,000
from the last query in some runs, depending on timing.

create extension if not exists chaos;
create extension if not exists pg_prewarm;

drop table if exists t1, t2;
checkpoint;
vacuum pg_class;

select clobber_next_oid(200000);
create table t1 as select 42 i from generate_series(1, 10000);
select pg_prewarm('t1'); -- fill buffer pool with t1
update t1 set i = i; -- dirty t1 buffers so bgwriter writes some
select pg_sleep(2); -- give bgwriter some time

drop table t1;
checkpoint;
vacuum pg_class;

select clobber_next_oid(200000);
create table t2 as select 0 i from generate_series(1, 10000);
select pg_prewarm('t2'); -- fill buffer pool with t2
update t2 set i = 1 where i = 0; -- dirty t2 buffers so bgwriter writes some
select pg_sleep(2); -- give bgwriter some time

select pg_prewarm('pg_attribute'); -- evict all clean t2 buffers
select sum(i) as t2_sum_should_be_10000 from t2; -- have any updates been lost?

Attachments:

0001-HACK-A-function-to-control-the-OID-allocator.patch (text/x-patch, +59 -1)
0002-HACK-Slow-the-bgwriter-down-a-bit.patch (text/x-patch, +4 -1)
#9 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#8)
Re: Make relfile tombstone files conditional on WAL level

On Thu, Sep 30, 2021 at 11:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I managed to produce a case where live data is written to an unlinked
file and lost

I guess this must have been broken since release 9.2 moved checkpoints
out of here [1]. The connection between checkpoints, tombstone files
and file descriptor cache invalidation in auxiliary (non-sinval)
backends was not documented as far as I can see (or at least not
anywhere near the load-bearing parts).

How could it be fixed, simply and backpatchably? If BgBufferSync()
did if-FirstCallSinceLastCheckpoint()-then-smgrcloseall() after
locking each individual buffer and before flushing, then I think it
might logically have the correct interlocking against relfilenode
wraparound, but that sounds a tad expensive :-( I guess it could be
made cheaper by using atomics for the checkpoint counter instead of
spinlocks. Better ideas?

[1]: /messages/by-id/CA+U5nMLv2ah-HNHaQ=2rxhp_hDJ9jcf-LL2kW3sE4msfnUw9gA@mail.gmail.com

#10 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#9)
Re: Make relfile tombstone files conditional on WAL level

On Tue, Oct 5, 2021 at 4:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Sep 30, 2021 at 11:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I managed to produce a case where live data is written to an unlinked
file and lost

In conclusion, there *is* something else that would break, so I'm
withdrawing this CF entry (#3030) for now. Also, that something else
is already subtly broken, so I'll try to come up with a fix for that
separately.

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#4)
Re: Make relfile tombstone files conditional on WAL level

On Mon, Aug 2, 2021 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:

I guess there's a somewhat hacky way to get somewhere without actually
increasing the size. We could take 3 bytes from the fork number and use that
to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
everyone.

It's not like we can use those bytes in a useful way, due to alignment
requirements. Declaring that the high 7 bytes are for the relNode portion and
the low byte for the fork would still allow efficient comparisons and doesn't
seem too ugly.

I think this idea is worth more consideration. It seems like 2^56
relfilenodes ought to be enough for anyone, recalling that you can
only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
bunch of code that is there to guard against relfilenodes being
reused. In particular, we can remove the code that leaves a 0-length
tombstone file around until the next checkpoint to guard against
relfilenode reuse. On Windows, we still need
https://commitfest.postgresql.org/36/2962/ because of the problem that
Windows won't remove files from the directory listing until they are
both unlinked and closed. But in general this seems like it would lead
to cleaner code. For example, GetNewRelFileNode() needn't loop. If it
allocates the smallest unsigned integer that the cluster (or database)
has never previously assigned, the file should definitely not exist on
disk, and if it does, an ERROR is appropriate, as the database is
corrupted. This does assume that allocations from this new 56-bit
relfilenode counter are properly WAL-logged.
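A toy illustration (not PostgreSQL code) of the loop-free allocation described above: with a monotonically increasing 56-bit counter, a newly allocated relfilenode can never have been used before, so a colliding file indicates corruption rather than a reason to retry. Here file_exists() and the local counter are stubs standing in for the real WAL-logged shared-memory machinery:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Would live in shared memory and be WAL-logged in the real design. */
static uint64_t next_relfilenode = 100000;

static bool
file_exists(uint64_t relfilenode)
{
    (void) relfilenode;
    return false;               /* stub for the illustration */
}

static uint64_t
get_new_relfilenode(void)
{
    uint64_t    rfn = next_relfilenode++;   /* no retry loop needed */

    if (file_exists(rfn))
    {
        /* with a never-reused counter, a collision implies corruption */
        fprintf(stderr, "unexpected file for relfilenode %llu\n",
                (unsigned long long) rfn);
    }
    return rfn;
}
```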

I think this would also solve a problem Dilip mentioned to me today:
suppose you make ALTER DATABASE SET TABLESPACE WAL-logged, as he's
been trying to do. Then suppose you do "ALTER DATABASE foo SET
TABLESPACE used_recently_but_not_any_more". You might get an error
complaining that "some relations of database \"%s\" are already in
tablespace \"%s\"" because there could be tombstone files in that
database. With this combination of changes, you could just use the
barrier mechanism from https://commitfest.postgresql.org/36/2962/ to
wait for those files to disappear, because they've got to be
previously-unlinked files that Windows is still returning because
they're still open -- or else they could be a sign of a corrupted
database, but there are no other possibilities.

I think, anyway.

--
Robert Haas
EDB: http://www.enterprisedb.com

#12 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#11)
Re: Make relfile tombstone files conditional on WAL level

On Thu, Jan 6, 2022 at 3:07 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Aug 2, 2021 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:

I guess there's a somewhat hacky way to get somewhere without actually
increasing the size. We could take 3 bytes from the fork number and use that
to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
everyone.

It's not like we can use those bytes in a useful way, due to alignment
requirements. Declaring that the high 7 bytes are for the relNode portion and
the low byte for the fork would still allow efficient comparisons and doesn't
seem too ugly.

I think this idea is worth more consideration. It seems like 2^56
relfilenodes ought to be enough for anyone, recalling that you can
only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
bunch of code that is there to guard against relfilenodes being
reused. In particular, we can remove the code that leaves a 0-length
tombstone file around until the next checkpoint to guard against
relfilenode reuse.

+1

I think this would also solve a problem Dilip mentioned to me today:
suppose you make ALTER DATABASE SET TABLESPACE WAL-logged, as he's
been trying to do. Then suppose you do "ALTER DATABASE foo SET
TABLESPACE used_recently_but_not_any_more". You might get an error
complaining that "some relations of database \"%s\" are already in
tablespace \"%s\"" because there could be tombstone files in that
database. With this combination of changes, you could just use the
barrier mechanism from https://commitfest.postgresql.org/36/2962/ to
wait for those files to disappear, because they've got to be
previously-unlinked files that Windows is still returning because
they're still open -- or else they could be a sign of a corrupted
database, but there are no other possibilities.

Yes, this approach will solve the problem for the WAL-logged ALTER
DATABASE we are facing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#13 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#12)
Re: Make relfile tombstone files conditional on WAL level

On Thu, Jan 6, 2022 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I think this idea is worth more consideration. It seems like 2^56
relfilenodes ought to be enough for anyone, recalling that you can
only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
bunch of code that is there to guard against relfilenodes being
reused. In particular, we can remove the code that leaves a 0-length
tombstone file around until the next checkpoint to guard against
relfilenode reuse.

+1

IMHO, a few top-level points for implementing this idea would be as
listed here:

1) The "relfilenode" member inside RelFileNode will now be 64 bits
wide, with the "forkNum" removed altogether from the BufferTag. So
whenever we want the relfilenode or fork number we can apply the
respective mask and fetch the value.
2) GetNewRelFileNode() will not loop for checking the file existence
and retry with other relfilenode.
3) Modify mdunlinkfork() so that we immediately perform the unlink
request, make sure to register_forget_request() before unlink.
4) In checkpointer, now we don't need any handling for pendingUnlinks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#14 Thomas Munro
thomas.munro@gmail.com
In reply to: Dilip Kumar (#13)
Re: Make relfile tombstone files conditional on WAL level

On Thu, Jan 6, 2022 at 9:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 6, 2022 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I think this idea is worth more consideration. It seems like 2^56
relfilenodes ought to be enough for anyone, recalling that you can
only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
bunch of code that is there to guard against relfilenodes being
reused. In particular, we can remove the code that leaves a 0-length
tombstone file around until the next checkpoint to guard against
relfilenode reuse.

+1

+1

IMHO, a few top-level points for implementing this idea would be as
listed here:

1) The "relfilenode" member inside RelFileNode will now be 64 bits
wide, with the "forkNum" removed altogether from the BufferTag. So
whenever we want the relfilenode or fork number we can apply the
respective mask and fetch the value.
2) GetNewRelFileNode() will not loop for checking the file existence
and retry with other relfilenode.
3) Modify mdunlinkfork() so that we immediately perform the unlink
request, make sure to register_forget_request() before unlink.
4) In checkpointer, now we don't need any handling for pendingUnlinks.

Another problem is that relfilenodes are normally allocated with
GetNewOidWithIndex(), and initially match a relation's OID. We'd need
a new allocator, and they won't be able to match the OID in general
(while we have 32 bit OIDs at least).

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#14)
Re: Make relfile tombstone files conditional on WAL level

On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Another problem is that relfilenodes are normally allocated with
GetNewOidWithIndex(), and initially match a relation's OID. We'd need
a new allocator, and they won't be able to match the OID in general
(while we have 32 bit OIDs at least).

Personally I'm not sad about that. Values that are the same in simple
cases but diverge in more complex cases are kind of a trap for the
unwary. There's no real reason to have them ever match. Yeah, in
theory, it makes it easier to tell which file matches which relation,
but in practice, you always have to double-check in case the table has
ever been rewritten. It doesn't seem worth continuing to contort the
code for a property we can't guarantee anyway.

--
Robert Haas
EDB: http://www.enterprisedb.com

#16 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#15)
Re: Make relfile tombstone files conditional on WAL level

On 2022-01-06 08:52:01 -0500, Robert Haas wrote:

On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Another problem is that relfilenodes are normally allocated with
GetNewOidWithIndex(), and initially match a relation's OID. We'd need
a new allocator, and they won't be able to match the OID in general
(while we have 32 bit OIDs at least).

Personally I'm not sad about that. Values that are the same in simple
cases but diverge in more complex cases are kind of a trap for the
unwary.

+1

#17 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#15)
Re: Make relfile tombstone files conditional on WAL level

On Thu, Jan 6, 2022 at 7:22 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.munro@gmail.com>
wrote:

Another problem is that relfilenodes are normally allocated with
GetNewOidWithIndex(), and initially match a relation's OID. We'd need
a new allocator, and they won't be able to match the OID in general
(while we have 32 bit OIDs at least).

Personally I'm not sad about that. Values that are the same in simple
cases but diverge in more complex cases are kind of a trap for the
unwary. There's no real reason to have them ever match. Yeah, in
theory, it makes it easier to tell which file matches which relation,
but in practice, you always have to double-check in case the table has
ever been rewritten. It doesn't seem worth continuing to contort the
code for a property we can't guarantee anyway.

Makes sense. I have started working on this idea; I will try to post
the first version by early next week.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#18 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#17)
Re: Make relfile tombstone files conditional on WAL level

On Wed, Jan 19, 2022 at 10:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 6, 2022 at 7:22 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Another problem is that relfilenodes are normally allocated with
GetNewOidWithIndex(), and initially match a relation's OID. We'd need
a new allocator, and they won't be able to match the OID in general
(while we have 32 bit OIDs at least).

Personally I'm not sad about that. Values that are the same in simple
cases but diverge in more complex cases are kind of a trap for the
unwary. There's no real reason to have them ever match. Yeah, in
theory, it makes it easier to tell which file matches which relation,
but in practice, you always have to double-check in case the table has
ever been rewritten. It doesn't seem worth continuing to contort the
code for a property we can't guarantee anyway.

Makes sense. I have started working on this idea; I will try to post the first version by early next week.

Here is the first working patch; with it we no longer need to maintain
the tombstone file until the next checkpoint. This is still a WIP
patch, but with it I can see that my problem with WAL-logged ALTER
DATABASE SET TABLESPACE, which Robert reported a couple of mails above
in this thread, is solved.

General idea of the patch:
- Change the RelFileNode.relNode to be 64bit wide, out of which 8 bits
for fork number and 56 bits for the relNode as shown below. [1]/* * RelNodeId: * * this is a storage type for RelNode. The reasoning behind using this is same * as using the BlockId so refer comment atop BlockId. */ typedef struct RelNodeId { uint32 rn_hi; uint32 rn_lo; } RelNodeId; typedef struct RelFileNode { Oid spcNode; /* tablespace */ Oid dbNode; /* database */ RelNodeId relNode; /* relation */ } RelFileNode;
- GetNewRelFileNode() will just generate a new unique relfilenode and
check for file existence; if the file already exists it throws an
error, so there is no loop. We also need logic for preserving
nextRelNode across restarts and for WAL-logging it, but that is
similar to how nextOid is preserved.
- mdunlinkfork() will directly forget and unlink the relfilenode, so
we get rid of the deferred unlink code.
- Now we don't need any post-checkpoint unlinking activity.

[1]:
/*
 * RelNodeId:
 *
 * this is a storage type for RelNode.  The reasoning behind using this
 * is the same as for BlockId, so refer to the comment atop BlockId.
 */
typedef struct RelNodeId
{
    uint32      rn_hi;
    uint32      rn_lo;
} RelNodeId;

typedef struct RelFileNode
{
    Oid         spcNode;        /* tablespace */
    Oid         dbNode;         /* database */
    RelNodeId   relNode;        /* relation */
} RelFileNode;
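For illustration, accessors in the style of BlockIdGetBlockNumber could recombine the two uint32 halves into the 64-bit value; the names below are guesses, not taken from the patch (the RelNodeId struct is repeated so the sketch is self-contained):

```c
#include <stdint.h>

typedef struct RelNodeId
{
    uint32_t    rn_hi;
    uint32_t    rn_lo;
} RelNodeId;

/* Recombine the two halves into the full 64-bit relNode value. */
static inline uint64_t
RelNodeIdGetRelNode(RelNodeId id)
{
    return ((uint64_t) id.rn_hi << 32) | id.rn_lo;
}

/* Split a 64-bit relNode value into the two-halves representation. */
static inline RelNodeId
RelNodeIdSetRelNode(uint64_t relNode)
{
    RelNodeId   id;

    id.rn_hi = (uint32_t) (relNode >> 32);
    id.rn_lo = (uint32_t) relNode;
    return id;
}
```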

TODO:

There are a couple of TODOs and FIXMEs which I am planning to improve
by next week. I am also planning to test the case where the
relfilenode consumes more than 32 bits; maybe for that we can set
FirstNormalRelfileNode to a higher value for testing purposes. And
improve comments.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-WIP-0001-Don-t-wait-for-next-checkpoint-to-remove-unwanted.patch (text/x-patch, +405 -375)
#19 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#18)
Re: Make relfile tombstone files conditional on WAL level

On Fri, Jan 28, 2022 at 8:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jan 19, 2022 at 10:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

TODO:

There are a couple of TODOs and FIXMEs which I am planning to improve
by next week. I am also planning to do the testing where relfilenode
consumes more than 32 bits, maybe for that we can set the
FirstNormalRelfileNode to higher value for the testing purpose. And,
Improve comments.

I have fixed most of the TODOs and FIXMEs, but there are still a few I
could not decide on. The main one: currently we do not have a uint8
data type, only int8, so I have used int8 for storing relfilenode +
fork number. This is sufficient, because I don't think we will ever
get more than 128 fork numbers. But my question is: should we think
about adding uint8 as a new data type, or in fact make RelNode itself
a new data type, like we have Oid?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-WIP-0001-Don-t-delay-removing-Tombstone-file-until-next-ch.patch (text/x-patch, +545/-464)
#20Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#19)
Re: Make relfile tombstone files conditional on WAL level

On Mon, Jan 31, 2022 at 12:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

the main one currently we do not have uint8 data
type only int8 is there so I have used int8 for storing relfilenode +
forknumber.

I'm confused. We use int8 in tons of places, so I feel like it must exist.

--
Robert Haas
EDB: http://www.enterprisedb.com
