Make relfile tombstone files conditional on WAL level
Hi,
I'm starting a new thread for this patch that originated as a
side-discussion in [1]/messages/by-id/CA+hUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg@mail.gmail.com, to give it its own CF entry in the next cycle.
This is a WIP with an open question to research: what could actually
break if we did this?
[1]: /messages/by-id/CA+hUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg@mail.gmail.com
Attachments:
0001-Make-relfile-tombstone-files-conditional-on-WAL-leve.patchtext/x-patch; charset=US-ASCII; name=0001-Make-relfile-tombstone-files-conditional-on-WAL-leve.patchDownload+23-8
On 05/03/2021 00:02, Thomas Munro wrote:
Hi,
I'm starting a new thread for this patch that originated as a
side-discussion in [1], to give it its own CF entry in the next cycle.
This is a WIP with an open question to research: what could actually
break if we did this?
I don't see a problem.
It would indeed be nice to have some other mechanism to prevent the
issue with wal_level=minimal, the tombstone files feel hacky and
complicated. Maybe a new shared memory hash table to track the
relfilenodes of dropped tables.
- Heikki
On Thu, Jun 10, 2021 at 6:47 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
It would indeed be nice to have some other mechanism to prevent the
issue with wal_level=minimal, the tombstone files feel hacky and
complicated. Maybe a new shared memory hash table to track the
relfilenodes of dropped tables.
Just to summarize the issue here as I understand it, if a relfilenode
is used for two unrelated relations during the same checkpoint cycle
with wal_level=minimal, and if the WAL-skipping optimization is
applied to the second of those but not to the first, then crash
recovery will lose our only copy of the new relation's data, because
we'll replay the removal of the old relfilenode but will not have
logged the new data. Furthermore, we've wondered about writing an
end-of-recovery record in all cases rather than sometimes writing an
end-of-recovery record and sometimes a checkpoint record. That would
allow another version of this same problem, since a single checkpoint
cycle could now span multiple server lifetimes. At present, we dodge
all this by keeping the first segment of the main fork around as a
zero-length file for the rest of the checkpoint cycle, which I think
prevents the problem in both cases. Now, apparently that caused some
problem with the AIO patch set so Thomas is curious about getting rid
of it, and Heikki concurs that it's a hack.
I guess my concern about this patch is that it just seems to be
reducing the number of cases where that hack is used without actually
getting rid of it. Rarely-taken code paths are more likely to have
undiscovered bugs, and that seems particularly likely in this case,
because this is a low-probability scenario to begin with. A lot of
clusters probably never have an OID counter wraparound ever, and even
in those that do, getting an OID collision with just the right timing
followed by a crash before a checkpoint can intervene has got to be
super-unlikely. Even as things are today, if this mechanism has subtle
bugs, it seems entirely possible that they could have escaped notice
up until now.
So I spent some time thinking about the question of getting rid of
tombstone files altogether. I don't think that Heikki's idea of a
shared memory hash table to track dropped relfilenodes can work. The
hash table will have to be of some fixed size N, and whatever the
value of N, the approach will break down if N+1 relfilenodes are
dropped in the same checkpoint cycle.
The two most principled solutions to this problem that I can see are
(1) remove wal_level=minimal and (2) use 64-bit relfilenodes. I have
been reluctant to support #1 because it's hard for me to believe that
there aren't cases where being able to skip a whole lot of WAL-logging
doesn't work out to a nice performance win, but I realize opinions on
that topic vary. And I'm pretty sure that Andres, at least, will hate
#2 because he's unhappy with the width of buffer tags already. So I
don't really have a good idea. I agree this tombstone system is a bit
of a wart, but I'm not sure that this patch really makes anything any
better, and I'm not really seeing another idea that seems better
either.
Maybe I am missing something...
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2021-08-02 16:03:31 -0400, Robert Haas wrote:
The two most principled solutions to this problem that I can see are
(1) remove wal_level=minimal and
I'm personally not opposed to this. It's not practically relevant and makes a
lot of stuff more complicated. We imo should rather focus on optimizing the
things wal_level=minimal accelerates a lot than adding complications for
wal_level=minimal. Such optimizations would have practical relevance, and
there's plenty low hanging fruits.
(2) use 64-bit relfilenodes. I have
been reluctant to support #1 because it's hard for me to believe that
there aren't cases where being able to skip a whole lot of WAL-logging
doesn't work out to a nice performance win, but I realize opinions on
that topic vary. And I'm pretty sure that Andres, at least, will hate
#2 because he's unhappy with the width of buffer tags already.
Yep :/
I guess there's a somewhat hacky way to get somewhere without actually
increasing the size. We could take 3 bytes from the fork number and use that
to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
everyone.
It's not like we can use those bytes in a useful way, due to alignment
requirements. Declaring that the high 7 bytes are for the relNode portion and
the low byte for the fork would still allow efficient comparisons and doesn't
seem too ugly.
So I don't really have a good idea. I agree this tombstone system is a
bit of a wart, but I'm not sure that this patch really makes anything
any better, and I'm not really seeing another idea that seems better
either.
Maybe I am missing something...
What I proposed in the past was to have a new shared table that tracks
relfilenodes. I still think that's a decent solution for just the problem at
hand. But it'd also potentially be the way to redesign relation forks and even
slim down buffer tags:
Right now a buffer tag is:
- 4 byte tablespace oid
- 4 byte database oid
- 4 byte "relfilenode oid" (don't think we have a good name for this)
- 4 byte fork number
- 4 byte block number
If we had such a shared table we could put at least tablespace, fork number
into that table mapping them to an 8 byte "new relfilenode". That'd only make
the "new relfilenode" unique within a database, but that'd be sufficient for
our purposes. It'd give use a buffertag consisting out of the following:
- 4 byte database oid
- 8 byte "relfilenode"
- 4 byte block number
Of course, it'd add some complexity too, because a buffertag alone wouldn't be
sufficient to read data (as you'd need the tablespace oid from elsewhere). But
that's probably ok, I think all relevant places would have that information.
It's probably possible to remove the database oid from the tag as well, but
it'd make CREATE DATABASE tricker - we'd need to change the filenames of
tables as we copy, to adjust them to the differing oid.
Greetings,
Andres Freund
On Mon, Aug 2, 2021 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:
What I proposed in the past was to have a new shared table that tracks
relfilenodes. I still think that's a decent solution for just the problem at
hand.
It's not really clear to me what problem is at hand. The problems that
the tombstone system created for the async I/O stuff weren't really
explained properly, IMHO. And I don't think the current system is all
that ugly. it's not the most beautiful thing in the world but we have
lots of way worse hacks. And, it's easy to understand, requires very
little code, and has few moving parts that can fail. As hacks go it's
a quality hack, I would say.
But it'd also potentially be the way to redesign relation forks and even
slim down buffer tags:Right now a buffer tag is:
- 4 byte tablespace oid
- 4 byte database oid
- 4 byte "relfilenode oid" (don't think we have a good name for this)
- 4 byte fork number
- 4 byte block numberIf we had such a shared table we could put at least tablespace, fork number
into that table mapping them to an 8 byte "new relfilenode". That'd only make
the "new relfilenode" unique within a database, but that'd be sufficient for
our purposes. It'd give use a buffertag consisting out of the following:
- 4 byte database oid
- 8 byte "relfilenode"
- 4 byte block number
Yep. I think this is a good direction.
Of course, it'd add some complexity too, because a buffertag alone wouldn't be
sufficient to read data (as you'd need the tablespace oid from elsewhere). But
that's probably ok, I think all relevant places would have that information.
I think the thing to look at would be the places that call
relpathperm() or relpathbackend(). I imagine this can be worked out,
but it might require some adjustment.
It's probably possible to remove the database oid from the tag as well, but
it'd make CREATE DATABASE tricker - we'd need to change the filenames of
tables as we copy, to adjust them to the differing oid.
Yeah, I'm not really sure that works out to a win. I tend to think
that we should be trying to make databases within the same cluster
more rather than less independent of each other. If we switch to using
a radix tree for the buffer mapping table as you have previously
proposed, then presumably each backend can cache a pointer to the
second level, after the database OID has been resolved. Then you have
no need to compare database OIDs for every lookup. That might turn out
to be better for performance than shoving everything into the buffer
tag anyway, because then backends in different databases would be
accessing distinct parts of the buffer mapping data structure instead
of contending with one another.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Mar 5, 2021 at 11:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:
This is a WIP with an open question to research: what could actually
break if we did this?
I thought this part of bgwriter.c might be a candidate:
if (FirstCallSinceLastCheckpoint())
{
/*
* After any checkpoint, close all smgr files. This is so we
* won't hang onto smgr references to deleted files indefinitely.
*/
smgrcloseall();
}
Hmm, on closer inspection, isn't the lack of real interlocking with
checkpoints a bit suspect already? What stops bgwriter from writing
to the previous relfilenode generation's fd, if a relfilenode is
recycled while BgBufferSync() is running? Not sinval, and not the
above code that only runs between BgBufferSync() invocations.
On Wed, Aug 4, 2021 at 3:22 AM Robert Haas <robertmhaas@gmail.com> wrote:
It's not really clear to me what problem is at hand. The problems that
the tombstone system created for the async I/O stuff weren't really
explained properly, IMHO. And I don't think the current system is all
that ugly. it's not the most beautiful thing in the world but we have
lots of way worse hacks. And, it's easy to understand, requires very
little code, and has few moving parts that can fail. As hacks go it's
a quality hack, I would say.
It's not really an AIO problem. It's just that while testing the AIO
stuff across a lot of operating systems, we had tests failing on
Windows because the extra worker processes you get if you use
io_method=worker were holding cached descriptors and causing stuff
like DROP TABLESPACE to fail. AFAIK every problem we discovered in
that vein is a current live bug in all versions of PostgreSQL for
Windows (it just takes other backends or the bgwriter to hold an fd at
the wrong moment). The solution I'm proposing to that general class
of problem is https://commitfest.postgresql.org/34/2962/ .
In the course of thinking about that, it seemed natural to look into
the possibility of getting rid of the tombstones, so that at least
Unix systems don't find themselves having to suffer through a
CHECKPOINT just to drop a tablespace that happens to contain a
tombstone.
On Wed, Sep 29, 2021 at 4:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Hmm, on closer inspection, isn't the lack of real interlocking with
checkpoints a bit suspect already? What stops bgwriter from writing
to the previous relfilenode generation's fd, if a relfilenode is
recycled while BgBufferSync() is running? Not sinval, and not the
above code that only runs between BgBufferSync() invocations.
I managed to produce a case where live data is written to an unlinked
file and lost, with a couple of tweaks to get the right timing and
simulate OID wraparound. See attached. If you run the following
commands repeatedly with shared_buffers=256kB and
bgwriter_lru_multiplier=10, you should see a number lower than 10,000
from the last query in some runs, depending on timing.
create extension if not exists chaos;
create extension if not exists pg_prewarm;
drop table if exists t1, t2;
checkpoint;
vacuum pg_class;
select clobber_next_oid(200000);
create table t1 as select 42 i from generate_series(1, 10000);
select pg_prewarm('t1'); -- fill buffer pool with t1
update t1 set i = i; -- dirty t1 buffers so bgwriter writes some
select pg_sleep(2); -- give bgwriter some time
drop table t1;
checkpoint;
vacuum pg_class;
select clobber_next_oid(200000);
create table t2 as select 0 i from generate_series(1, 10000);
select pg_prewarm('t2'); -- fill buffer pool with t2
update t2 set i = 1 where i = 0; -- dirty t2 buffers so bgwriter writes some
select pg_sleep(2); -- give bgwriter some time
select pg_prewarm('pg_attribute'); -- evict all clean t2 buffers
select sum(i) as t2_sum_should_be_10000 from t2; -- have any updates been lost?
Attachments:
0001-HACK-A-function-to-control-the-OID-allocator.patchtext/x-patch; charset=US-ASCII; name=0001-HACK-A-function-to-control-the-OID-allocator.patchDownload+59-1
0002-HACK-Slow-the-bgwriter-down-a-bit.patchtext/x-patch; charset=US-ASCII; name=0002-HACK-Slow-the-bgwriter-down-a-bit.patchDownload+4-1
On Thu, Sep 30, 2021 at 11:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I managed to produce a case where live data is written to an unlinked
file and lost
I guess this must have been broken since release 9.2 moved checkpoints
out of here[1]/messages/by-id/CA+U5nMLv2ah-HNHaQ=2rxhp_hDJ9jcf-LL2kW3sE4msfnUw9gA@mail.gmail.com. The connection between checkpoints, tombstone files
and file descriptor cache invalidation in auxiliary (non-sinval)
backends was not documented as far as I can see (or at least not
anywhere near the load-bearing parts).
How could it be fixed, simply and backpatchably? If BgSyncBuffer()
did if-FirstCallSinceLastCheckpoint()-then-smgrcloseall() after
locking each individual buffer and before flushing, then I think it
might logically have the correct interlocking against relfilenode
wraparound, but that sounds a tad expensive :-( I guess it could be
made cheaper by using atomics for the checkpoint counter instead of
spinlocks. Better ideas?
[1]: /messages/by-id/CA+U5nMLv2ah-HNHaQ=2rxhp_hDJ9jcf-LL2kW3sE4msfnUw9gA@mail.gmail.com
On Tue, Oct 5, 2021 at 4:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Sep 30, 2021 at 11:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I managed to produce a case where live data is written to an unlinked
file and lost
In conclusion, there *is* something else that would break, so I'm
withdrawing this CF entry (#3030) for now. Also, that something else
is already subtly broken, so I'll try to come up with a fix for that
separately.
On Mon, Aug 2, 2021 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:
I guess there's a somewhat hacky way to get somewhere without actually
increasing the size. We could take 3 bytes from the fork number and use that
to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
everyone.It's not like we can use those bytes in a useful way, due to alignment
requirements. Declaring that the high 7 bytes are for the relNode portion and
the low byte for the fork would still allow efficient comparisons and doesn't
seem too ugly.
I think this idea is worth more consideration. It seems like 2^56
relfilenodes ought to be enough for anyone, recalling that you can
only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
bunch of code that is there to guard against relfilenodes being
reused. In particular, we can remove the code that leaves a 0-length
tombstone file around until the next checkpoint to guard against
relfilenode reuse. On Windows, we still need
https://commitfest.postgresql.org/36/2962/ because of the problem that
Windows won't remove files from the directory listing until they are
both unlinked and closed. But in general this seems like it would lead
to cleaner code. For example, GetNewRelFileNode() needn't loop. If it
allocate the smallest unsigned integer that the cluster (or database)
has never previously assigned, the file should definitely not exist on
disk, and if it does, an ERROR is appropriate, as the database is
corrupted. This does assume that allocations from this new 56-bit
relfilenode counter are properly WAL-logged.
I think this would also solve a problem Dilip mentioned to me today:
suppose you make ALTER DATABASE SET TABLESPACE WAL-logged, as he's
been trying to do. Then suppose you do "ALTER DATABASE foo SET
TABLESPACE used_recently_but_not_any_more". You might get an error
complaining that “some relations of database \“%s\” are already in
tablespace \“%s\“” because there could be tombstone files in that
database. With this combination of changes, you could just use the
barrier mechanism from https://commitfest.postgresql.org/36/2962/ to
wait for those files to disappear, because they've got to be
previously-unliked files that Windows is still returning because
they're still opening -- or else they could be a sign of a corrupted
database, but there are no other possibilities.
I think, anyway.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jan 6, 2022 at 3:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Aug 2, 2021 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:
I guess there's a somewhat hacky way to get somewhere without actually
increasing the size. We could take 3 bytes from the fork number and use that
to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
everyone.It's not like we can use those bytes in a useful way, due to alignment
requirements. Declaring that the high 7 bytes are for the relNode portion and
the low byte for the fork would still allow efficient comparisons and doesn't
seem too ugly.I think this idea is worth more consideration. It seems like 2^56
relfilenodes ought to be enough for anyone, recalling that you can
only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
bunch of code that is there to guard against relfilenodes being
reused. In particular, we can remove the code that leaves a 0-length
tombstone file around until the next checkpoint to guard against
relfilenode reuse.
+1
I think this would also solve a problem Dilip mentioned to me today:
suppose you make ALTER DATABASE SET TABLESPACE WAL-logged, as he's
been trying to do. Then suppose you do "ALTER DATABASE foo SET
TABLESPACE used_recently_but_not_any_more". You might get an error
complaining that “some relations of database \“%s\” are already in
tablespace \“%s\“” because there could be tombstone files in that
database. With this combination of changes, you could just use the
barrier mechanism from https://commitfest.postgresql.org/36/2962/ to
wait for those files to disappear, because they've got to be
previously-unliked files that Windows is still returning because
they're still opening -- or else they could be a sign of a corrupted
database, but there are no other possibilities.
Yes, this approach will solve the problem for the WAL-logged ALTER
DATABASE we are facing.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 6, 2022 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I think this idea is worth more consideration. It seems like 2^56
relfilenodes ought to be enough for anyone, recalling that you can
only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
bunch of code that is there to guard against relfilenodes being
reused. In particular, we can remove the code that leaves a 0-length
tombstone file around until the next checkpoint to guard against
relfilenode reuse.+1
I IMHO a few top level point for implementing this idea would be as listed here,
1) the "relfilenode" member inside the RelFileNode will be now 64
bytes and remove the "forkNum" all together from the BufferTag. So
now whenever we want to use the relfilenode or fork number we can use
the respective mask and fetch their values.
2) GetNewRelFileNode() will not loop for checking the file existence
and retry with other relfilenode.
3) Modify mdunlinkfork() so that we immediately perform the unlink
request, make sure to register_forget_request() before unlink.
4) In checkpointer, now we don't need any handling for pendingUnlinks.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 6, 2022 at 9:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Jan 6, 2022 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I think this idea is worth more consideration. It seems like 2^56
relfilenodes ought to be enough for anyone, recalling that you can
only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
bunch of code that is there to guard against relfilenodes being
reused. In particular, we can remove the code that leaves a 0-length
tombstone file around until the next checkpoint to guard against
relfilenode reuse.+1
+1
I IMHO a few top level point for implementing this idea would be as listed here,
1) the "relfilenode" member inside the RelFileNode will be now 64
bytes and remove the "forkNum" all together from the BufferTag. So
now whenever we want to use the relfilenode or fork number we can use
the respective mask and fetch their values.
2) GetNewRelFileNode() will not loop for checking the file existence
and retry with other relfilenode.
3) Modify mdunlinkfork() so that we immediately perform the unlink
request, make sure to register_forget_request() before unlink.
4) In checkpointer, now we don't need any handling for pendingUnlinks.
Another problem is that relfilenodes are normally allocated with
GetNewOidWithIndex(), and initially match a relation's OID. We'd need
a new allocator, and they won't be able to match the OID in general
(while we have 32 bit OIDs at least).
On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Another problem is that relfilenodes are normally allocated with
GetNewOidWithIndex(), and initially match a relation's OID. We'd need
a new allocator, and they won't be able to match the OID in general
(while we have 32 bit OIDs at least).
Personally I'm not sad about that. Values that are the same in simple
cases but diverge in more complex cases are kind of a trap for the
unwary. There's no real reason to have them ever match. Yeah, in
theory, it makes it easier to tell which file matches which relation,
but in practice, you always have to double-check in case the table has
ever been rewritten. It doesn't seem worth continuing to contort the
code for a property we can't guarantee anyway.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2022-01-06 08:52:01 -0500, Robert Haas wrote:
On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Another problem is that relfilenodes are normally allocated with
GetNewOidWithIndex(), and initially match a relation's OID. We'd need
a new allocator, and they won't be able to match the OID in general
(while we have 32 bit OIDs at least).Personally I'm not sad about that. Values that are the same in simple
cases but diverge in more complex cases are kind of a trap for the
unwary.
+1
On Thu, Jan 6, 2022 at 7:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.munro@gmail.com>
wrote:Another problem is that relfilenodes are normally allocated with
GetNewOidWithIndex(), and initially match a relation's OID. We'd need
a new allocator, and they won't be able to match the OID in general
(while we have 32 bit OIDs at least).Personally I'm not sad about that. Values that are the same in simple
cases but diverge in more complex cases are kind of a trap for the
unwary. There's no real reason to have them ever match. Yeah, in
theory, it makes it easier to tell which file matches which relation,
but in practice, you always have to double-check in case the table has
ever been rewritten. It doesn't seem worth continuing to contort the
code for a property we can't guarantee anyway.
Make sense, I have started working on this idea, I will try to post the
first version by early next week.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 19, 2022 at 10:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Jan 6, 2022 at 7:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Another problem is that relfilenodes are normally allocated with
GetNewOidWithIndex(), and initially match a relation's OID. We'd need
a new allocator, and they won't be able to match the OID in general
(while we have 32 bit OIDs at least).Personally I'm not sad about that. Values that are the same in simple
cases but diverge in more complex cases are kind of a trap for the
unwary. There's no real reason to have them ever match. Yeah, in
theory, it makes it easier to tell which file matches which relation,
but in practice, you always have to double-check in case the table has
ever been rewritten. It doesn't seem worth continuing to contort the
code for a property we can't guarantee anyway.Make sense, I have started working on this idea, I will try to post the first version by early next week.
Here is the first working patch, with that now we don't need to
maintain the TombStone file until the next checkpoint. This is still
a WIP patch with this I can see my problem related to ALTER DATABASE
SET TABLESPACE WAL-logged problem is solved which Robert reported a
couple of mails above in the same thread.
General idea of the patch:
- Change the RelFileNode.relNode to be 64bit wide, out of which 8 bits
for fork number and 56 bits for the relNode as shown below. [1]/* * RelNodeId: * * this is a storage type for RelNode. The reasoning behind using this is same * as using the BlockId so refer comment atop BlockId. */ typedef struct RelNodeId { uint32 rn_hi; uint32 rn_lo; } RelNodeId; typedef struct RelFileNode { Oid spcNode; /* tablespace */ Oid dbNode; /* database */ RelNodeId relNode; /* relation */ } RelFileNode;
- GetNewRelFileNode() will just generate a new unique relfilenode and
check the file existence and if it already exists then throw an error,
so no loop. We also need to add the logic for preserving the
nextRelNode across restart and also WAL logging it but that is similar
to the preserving nextOid.
- mdunlinkfork, will directly forget the relfilenode, so we get rid of
all unlinking code from the code.
- Now, we don't need any post checkpoint unlinking activity.
[1]: /* * RelNodeId: * * this is a storage type for RelNode. The reasoning behind using this is same * as using the BlockId so refer comment atop BlockId. */ typedef struct RelNodeId { uint32 rn_hi; uint32 rn_lo; } RelNodeId; typedef struct RelFileNode { Oid spcNode; /* tablespace */ Oid dbNode; /* database */ RelNodeId relNode; /* relation */ } RelFileNode;
/*
* RelNodeId:
*
* this is a storage type for RelNode. The reasoning behind using this is same
* as using the BlockId so refer comment atop BlockId.
*/
typedef struct RelNodeId
{
uint32 rn_hi;
uint32 rn_lo;
} RelNodeId;
typedef struct RelFileNode
{
Oid spcNode; /* tablespace */
Oid dbNode; /* database */
RelNodeId relNode; /* relation */
} RelFileNode;
TODO:
There are a couple of TODOs and FIXMEs which I am planning to improve
by next week. I am also planning to do the testing where relfilenode
consumes more than 32 bits, maybe for that we can set the
FirstNormalRelfileNode to higher value for the testing purpose. And,
Improve comments.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v1-WIP-0001-Don-t-wait-for-next-checkpoint-to-remove-unwanted.patchtext/x-patch; charset=US-ASCII; name=v1-WIP-0001-Don-t-wait-for-next-checkpoint-to-remove-unwanted.patchDownload+405-375
On Fri, Jan 28, 2022 at 8:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Jan 19, 2022 at 10:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
TODO:
There are a couple of TODOs and FIXMEs which I am planning to improve
by next week. I am also planning to do the testing where relfilenode
consumes more than 32 bits, maybe for that we can set the
FirstNormalRelfileNode to higher value for the testing purpose. And,
Improve comments.
I have fixed most of TODO and FIXMEs but there are still a few which I
could not decide, the main one currently we do not have uint8 data
type only int8 is there so I have used int8 for storing relfilenode +
forknumber. Although this is sufficient because I don't think we will
ever get more than 128 fork numbers. But my question is should we
think for adding uint8 as new data type or infect make RelNode itself
as new data type like we have Oid.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v2-WIP-0001-Don-t-delay-removing-Tombstone-file-until-next-ch.patchtext/x-patch; charset=US-ASCII; name=v2-WIP-0001-Don-t-delay-removing-Tombstone-file-until-next-ch.patchDownload+545-464
On Mon, Jan 31, 2022 at 12:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
the main one currently we do not have uint8 data
type only int8 is there so I have used int8 for storing relfilenode +
forknumber.
I'm confused. We use int8 in tons of places, so I feel like it must exist.
--
Robert Haas
EDB: http://www.enterprisedb.com