In-placre persistance change of a relation
Hello. This is a thread for an alternative solution to wal_level=none
[*1] for bulk data loading.
*1: /messages/by-id/TYAPR01MB29901EBE5A3ACCE55BA99186FE320@TYAPR01MB2990.jpnprd01.prod.outlook.com
At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost <sfrost@snowman.net> wrote in
Greetings,
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
For fuel(?) of the discussion, I tried a very-quick PoC for in-place
ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
trials of several ways, I drifted to the following way after poking
several ways.1. Flip BM_PERMANENT of active buffers
2. adding/removing init fork
3. sync files,
4. Flip pg_class.relpersistence.It always skips table copy in the SET UNLOGGED case, and only when
wal_level=minimal in the SET LOGGED case. Crash recovery seems
working by some brief testing by hand.Somehow missed that this patch more-or-less does what I was referring to
down-thread, but I did want to mention that it looks like it's missing a
necessary FlushRelationBuffers() call before the sync, otherwise there
could be dirty buffers for the relation that's being set to LOGGED (with
wal_level=minimal), which wouldn't be good. See the comments above
smgrimmedsync().
Right. Thanks. However, since SetRelFileNodeBuffersPersistence()
called just above scans shared buffers so I don't want to just call
FlushRelationBuffers() separately. Instead, I added buffer-flush to
SetRelFileNodeBuffersPersistence().
FWIW this is a revised version of the PoC, which has some known
problems.
- Flipping of Buffer persistence is not WAL-logged nor even be able to
be safely roll-backed. (It might be better to drop buffers).
- This version handles indexes but not yet handle toast relatins.
- tableAMs are supposed to support this feature. (but I'm not sure
it's worth allowing them not to do so).
Of course, I haven't performed intensive test on it.
Reading through the thread, it didn't seem very clear, but we should
definitely make sure that it does the right thing on replicas when going
between unlogged and logged (and between logged and unlogged too), of
course.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
PoC_in-place_set_persistence_v2.patchtext/x-patch; charset=us-asciiDownload+459-46
Greetings,
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost <sfrost@snowman.net> wrote in
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
For fuel(?) of the discussion, I tried a very-quick PoC for in-place
ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
trials of several ways, I drifted to the following way after poking
several ways.1. Flip BM_PERMANENT of active buffers
2. adding/removing init fork
3. sync files,
4. Flip pg_class.relpersistence.It always skips table copy in the SET UNLOGGED case, and only when
wal_level=minimal in the SET LOGGED case. Crash recovery seems
working by some brief testing by hand.Somehow missed that this patch more-or-less does what I was referring to
down-thread, but I did want to mention that it looks like it's missing a
necessary FlushRelationBuffers() call before the sync, otherwise there
could be dirty buffers for the relation that's being set to LOGGED (with
wal_level=minimal), which wouldn't be good. See the comments above
smgrimmedsync().Right. Thanks. However, since SetRelFileNodeBuffersPersistence()
called just above scans shared buffers so I don't want to just call
FlushRelationBuffers() separately. Instead, I added buffer-flush to
SetRelFileNodeBuffersPersistence().
Maybe I'm missing something, but it sure looks like in the patch that
SetRelFileNodeBuffersPersistence() is being called after the
smgrimmedsync() call, and I don't think you get to just switch the order
of those- the sync is telling the kernel to make sure it's written to
disk, while the FlushBuffer() is just writing it into the kernel but
doesn't provide any guarantee that the data has actually made it to
disk. We have to FlushBuffer() first, and then call smgrimmedsync().
Perhaps there's a way to avoid having to go through shared buffers
twice, and I generally agreed it'd be good if we could avoid doing so,
but this approach doesn't look like it actually works.
FWIW this is a revised version of the PoC, which has some known
problems.- Flipping of Buffer persistence is not WAL-logged nor even be able to
be safely roll-backed. (It might be better to drop buffers).
Not sure if it'd be better to drop buffers or not, but figuring out how
to deal with rollback seems pretty important. How is the persistence
change in the catalog not WAL-logged though..?
- This version handles indexes but not yet handle toast relatins.
Would need to be fixed, of course.
- tableAMs are supposed to support this feature. (but I'm not sure
it's worth allowing them not to do so).
Seems like they should.
Thanks,
Stephen
Hi,
I suggest outlining what you are trying to achieve here. Starting a new
thread and expecting people to dig through another thread to infer what
you are actually trying to achive isn't great.
FWIW, I'm *extremely* doubtful it's worth adding features that depend on
a PGC_POSTMASTER wal_level=minimal being used. Which this does, a far as
I understand. If somebody added support for dynamically adapting
wal_level (e.g. wal_level=auto, that increases wal_level to
replica/logical depending on the presence of replication slots), it'd
perhaps be different.
On 2020-11-11 17:33:17 +0900, Kyotaro Horiguchi wrote:
FWIW this is a revised version of the PoC, which has some known
problems.- Flipping of Buffer persistence is not WAL-logged nor even be able to
be safely roll-backed. (It might be better to drop buffers).
That's obviously a no-go. I think you might be able to address this if
you accept that the command cannot be run in a transaction (like
CONCURRENTLY). Then you can first do the catalog changes, change the
persistence level, and commit.
Greetings,
Andres Freund
At Wed, 11 Nov 2020 14:18:04 -0800, Andres Freund <andres@anarazel.de> wrote in
Hi,
I suggest outlining what you are trying to achieve here. Starting a new
thread and expecting people to dig through another thread to infer what
you are actually trying to achive isn't great.
Agreed. I'll post that. Thanks.
FWIW, I'm *extremely* doubtful it's worth adding features that depend on
a PGC_POSTMASTER wal_level=minimal being used. Which this does, a far as
I understand. If somebody added support for dynamically adapting
wal_level (e.g. wal_level=auto, that increases wal_level to
replica/logical depending on the presence of replication slots), it'd
perhaps be different.
Yes, this depends on wal_level=minimal for switching from UNLOGGED to
LOGGED, that's similar to COPY/INSERT-to-intransaction-created-tables
optimization for wal_level=minimal. And it expands that optimization
to COPY/INSERT-to-existent-tables, which seems worth doing.
Switching to LOGGED needs to emit the initial state to WAL... Hmm.. I
came to think that even in that case skipping table copy reduces I/O
significantly, even though FPI-WAL is emitted.
On 2020-11-11 17:33:17 +0900, Kyotaro Horiguchi wrote:
FWIW this is a revised version of the PoC, which has some known
problems.- Flipping of Buffer persistence is not WAL-logged nor even be able to
be safely roll-backed. (It might be better to drop buffers).That's obviously a no-go. I think you might be able to address this if
you accept that the command cannot be run in a transaction (like
CONCURRENTLY). Then you can first do the catalog changes, change the
persistence level, and commit.
Of course. The next version reverts persistence change at abort.
Thanks!
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 11 Nov 2020 09:56:44 -0500, Stephen Frost <sfrost@snowman.net> wrote in
Greetings,
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost <sfrost@snowman.net> wrote in
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
For fuel(?) of the discussion, I tried a very-quick PoC for in-place
ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
trials of several ways, I drifted to the following way after poking
several ways.1. Flip BM_PERMANENT of active buffers
2. adding/removing init fork
3. sync files,
4. Flip pg_class.relpersistence.It always skips table copy in the SET UNLOGGED case, and only when
wal_level=minimal in the SET LOGGED case. Crash recovery seems
working by some brief testing by hand.Somehow missed that this patch more-or-less does what I was referring to
down-thread, but I did want to mention that it looks like it's missing a
necessary FlushRelationBuffers() call before the sync, otherwise there
could be dirty buffers for the relation that's being set to LOGGED (with
wal_level=minimal), which wouldn't be good. See the comments above
smgrimmedsync().Right. Thanks. However, since SetRelFileNodeBuffersPersistence()
called just above scans shared buffers so I don't want to just call
FlushRelationBuffers() separately. Instead, I added buffer-flush to
SetRelFileNodeBuffersPersistence().Maybe I'm missing something, but it sure looks like in the patch that
SetRelFileNodeBuffersPersistence() is being called after the
smgrimmedsync() call, and I don't think you get to just switch the order
of those- the sync is telling the kernel to make sure it's written to
disk, while the FlushBuffer() is just writing it into the kernel but
doesn't provide any guarantee that the data has actually made it to
disk. We have to FlushBuffer() first, and then call smgrimmedsync().
Perhaps there's a way to avoid having to go through shared buffers
twice, and I generally agreed it'd be good if we could avoid doing so,
but this approach doesn't look like it actually works.
Yeah, sorry for the rare-baked version.. I was confused about the
order at the time. The next version works like this:
LOGGED->UNLOGGED
<collect reloids to process>
for each relations:
<set buffer persistence to !BM_PERMANENT (wal-logged if walleve > minimal>
<create init fork>
if it is index call ambuildempty() (which syncs the init fork)
else WAL-log smgr_create then sync the init file.
<update catalog>
...
commit time:
<do nogthing>
abort time:
<unlink init fork>
<revert buffer persistence>
UNLOGGED->LOGGED
<collect reloids to process>
for each relations:
<set buffer persistence to !BM_PERMANENT (wal-logged if walleve > minimal>
<record drop-init-fork to pending-deletes>
<sync storage files>
<update catalog>
...
commit time:
<log smgrunlink>
<smgrunlink init fork>
abort time:
<revert buffer persistence>
FWIW this is a revised version of the PoC, which has some known
problems.- Flipping of Buffer persistence is not WAL-logged nor even be able to
be safely roll-backed. (It might be better to drop buffers).Not sure if it'd be better to drop buffers or not, but figuring out how
to deal with rollback seems pretty important. How is the persistence
change in the catalog not WAL-logged though..?
Rollback works as the above. Buffer persistence change is registered
in pending-deletes. Persistence change in catalog is rolled back in
the ordinary way (or automatically).
If wal_level > minimal, persistence change of buffers is propagated to
standbys by WAL. However I'm not sure we need wal-logging otherwise,
the next version emits WAL since SMGR_CREATE is always logged by
existing code.
- This version handles indexes but not yet handle toast relatins.
Would need to be fixed, of course.
Fixed.
- tableAMs are supposed to support this feature. (but I'm not sure
it's worth allowing them not to do so).Seems like they should.
Init fork of index relations needs a call to ambuildempty() instead of
"log_smgrcreate-smgrimmedsync" after smgrcreate. Instead of adding
similar interface in indexAm, I reverted changes of tableam and make
RelationCreate/DropInitFork() directly do that. That introduces new
include of amapi.h to storage.c, which is a bit uneasy.
The previous version give up the in-place persistence change in the
case where wal_level > minimal and SET LOGGED since that needs WAL to
be emitted. However, in-place change still has the advantage of not
running a table copy. So the next verson always runs persistence
change in-place.
As suggested by Andres, I'll send a summary of this patch. The patch
will be attached to the coming mail.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. Before posting the next version, I'd like to explain what this
patch is.
1. The Issue
Bulk data loading is a long-time taking, I/O consuming task. Many
DBAs want that task is faster, even at the cost of increasing risk of
data-loss. wal_level=minimal is an answer to such a
request. Data-loading onto a table that is created in the current
transaction omits WAL-logging and synced at commit.
However, the optimization doesn't benefit the case where the
data-loading is performed onto existing tables. There are quite a few
cases where data is loaded into tables that already contains a lot of
data. Those cases don't take benefit of the optimization.
Another possible solution for bulk data-loading is UNLOGGED
tables. But when we switch LOGGED/UNLOGGED of a table, all the table
content is copied to a newly created heap, which is costly.
2. Proposed Solutions.
There are two proposed solutions are discussed on this mailing
list. One is wal_level = none (*1), which omits WAL-logging almost at
all. Another is extending the existing optimization to the ALTER TABLE
SET LOGGED/UNLOGGED cases, which is to be discussed in this new
thread.
3. In-place Persistence Change
So the attached is a PoC patch of the "another" solution. When we
want to change table persistence in-place, basically we need to do the
following steps.
(the talbe is exclusively locked)
(1) Flip BM_PERMANENT flag of all shared buffer blocks for the heap.
(2) Create or delete the init fork for existing heap.
(3) Flush all buffers of the relation to file system.
(4) Sync heap files.
(5) Make catalog changes.
4. Transactionality
The 1, 2 and 5 above need to be abort-able. 5 is rolled back by
existing infrastructure, and rolling-back of 1 and 2 are achieved by
piggybacking on the pendingDeletes mechanism.
5. Replication
Furthermore, that changes ought to be replicable to standbys. Catalog
changes are replicated as usual.
On-the-fly creation of the init fork leads to recovery mess. Even
though it is removed at abort, if the server crashed before
transaction end, the file is left alone and corrupts database in the
next recovery. I sought a way to create the init fork in
smgrPendingDelete but that needs relcache and relcache is not
available at that late of commit. Finally, I introduced the fifth fork
kind "INITTMP"(_itmp) only to signal that the init file is not
committed. I don't like that way but it seems working fine...
6. SQL Command
The second file in the patchset adds a syntax that changes persistence
of all tables in a tablespace.
ALTER TABLE ALL IN TABLESPACE <tsp> SET LOGGED/UNLOGGED [ NOWAIT ];
7. Testing
I tried to write TAP test for this, but IPC::Run::harness (or
interactive_psql) doesn't seem to work for me. I'm not sure what
exactly is happening but pty redirection doesn't work.
$in = "ls\n"; $out = ""; run ["/usr/bin/bash"], \$in, \$out; print $out;
works but
$in = "ls\n"; $out = ""; run ["/usr/bin/bash"], '<pty<', \$in, '>pty>', \$out; print $out;
doesn't respond.
The patch is attached.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi Horiguchi-san,
Thank you for making a patch so quickly. I've started looking at it.
What makes you think this is a PoC? Documentation and test cases? If there's something you think that doesn't work or are concerned about, can you share it?
Do you know the reason why data copy was done before? And, it may be odd for me to ask this, but I think I saw someone referred to the past discussion that eliminating data copy is difficult due to some processing at commit. I can't find it.
(1)
@@ -168,6 +168,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
*/
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
+struct SmgrRelationData;
This declaration is already in the file:
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
/* forward declared, to avoid including smgr.h here */
struct SMgrRelationData;
Regards
Takayuki Tsunakawa
Hello, Tsunakawa-San
Do you know the reason why data copy was done before? And, it may be
odd for me to ask this, but I think I saw someone referred to the past
discussion that eliminating data copy is difficult due to some processing at
commit. I can't find it.
I can share 2 sources why to eliminate the data copy is difficult in hackers thread.
Tom's remark and the context to copy relation's data.
/messages/by-id/31724.1394163360@sss.pgh.pa.us
Amit-San quoted this thread and mentioned that point in another thread.
/messages/by-id/CAA4eK1+HDqS+1fhs5Jf9o4ZujQT=XBZ6sU0kOuEh2hqQAC+t=w@mail.gmail.com
Best,
Takamichi Osumi
At Fri, 13 Nov 2020 06:43:13 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
Hi Horiguchi-san,
Thank you for making a patch so quickly. I've started looking at it.
What makes you think this is a PoC? Documentation and test cases? If there's something you think that doesn't work or are concerned about, can you share it?
The latest version is heavily revised and is given much comment so it
might have exited from PoC state. The necessity of documentation is
doubtful since this patch doesn't user-facing behavior other than
speed. Some tests are required especialy about recovery and
replication perspective but I haven't been able to make it. (One of
the tests needs to cause crash while a transaction is running.)
Do you know the reason why data copy was done before? And, it may be odd for me to ask this, but I think I saw someone referred to the past discussion that eliminating data copy is difficult due to some processing at commit. I can't find it.
To imagine that, just because it is simpler considering rollback and
code sharing, and maybe no one have been complained that SET
LOGGED/UNLOGGED looks taking a long time than required/expected.
The current implement is simple. It's enough to just discard old or
new relfilenode according to the current transaction ends with commit
or abort. Tweaking of relfilenode under use leads-in some skews in
some places. I used pendingDelete mechanism a bit complexified way
and a violated an abstraction (I think, calling AM-routines from
storage.c is not good.) and even introduce a new fork kind only to
mark a init fork as "not committed yet". There might be better way,
but I haven't find it.
(The patch scans all shared buffer blocks for each relation).
(1)
@@ -168,6 +168,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
*/
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))+struct SmgrRelationData;
This declaration is already in the file:
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;/* forward declared, to avoid including smgr.h here */
struct SMgrRelationData;
Hmmm. Nice chatch. And will fix in the next version.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 13 Nov 2020 07:15:41 +0000, "osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com> wrote in
Hello, Tsunakawa-San
Thanks for sharing it!
Do you know the reason why data copy was done before? And, it may be
odd for me to ask this, but I think I saw someone referred to the past
discussion that eliminating data copy is difficult due to some processing at
commit. I can't find it.I can share 2 sources why to eliminate the data copy is difficult in hackers thread.
Tom's remark and the context to copy relation's data.
/messages/by-id/31724.1394163360@sss.pgh.pa.us
/messages/by-id/CA+Tgmob44LNwwU73N1aJsGQyzQ61SdhKJRC_89wCm0+aLg=x2Q@mail.gmail.com
No, not really. The issue is more around what happens if we crash
part way through. At crash recovery time, the system catalogs are not
available, because the database isn't consistent yet and, anyway, the
startup process can't be bound to a database, let alone every database
that might contain unlogged tables. So the sentinel that's used to
decide whether to flush the contents of a table or index is the
presence or absence of an _init fork, which the startup process
obviously can see just fine. The _init fork also tells us what to
stick in the relation when we reset it; for a table, we can just reset
to an empty file, but that's not legal for indexes, so the _init fork
contains a pre-initialized empty index that we can just copy over.Now, to make an unlogged table logged, you've got to at some stage
remove those _init forks. But this is not a transactional operation.
If you remove the _init forks and then the transaction rolls back,
you've left the system an inconsistent state. If you postpone the
removal until commit time, then you have a problem if it fails,
It's true. That are the cause of headache.
particularly if it works for the first file but fails for the second.
And if you crash at any point before you've fsync'd the containing
directory, you have no idea which files will still be on disk after a
hard reboot.
This is not an issue in this patch *except* the case where init fork
is failed to removed but the following removal of inittmp fork
succeeds. Another idea is adding a "not-yet-committed" property to a
fork. I added a new fork type for easiness of the patch but I could
go that way if that is an issue.
Amit-San quoted this thread and mentioned that point in another thread.
/messages/by-id/CAA4eK1+HDqS+1fhs5Jf9o4ZujQT=XBZ6sU0kOuEh2hqQAC+t=w@mail.gmail.com
This sounds like a bit differrent discussion. Making part-of-a-table
UNLOGGED looks far difficult to me.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
No, not really. The issue is more around what happens if we crash
part way through. At crash recovery time, the system catalogs are not
available, because the database isn't consistent yet and, anyway, the
startup process can't be bound to a database, let alone every database
that might contain unlogged tables. So the sentinel that's used to
decide whether to flush the contents of a table or index is the
presence or absence of an _init fork, which the startup process
obviously can see just fine. The _init fork also tells us what to
stick in the relation when we reset it; for a table, we can just reset
to an empty file, but that's not legal for indexes, so the _init fork
contains a pre-initialized empty index that we can just copy over.Now, to make an unlogged table logged, you've got to at some stage
remove those _init forks. But this is not a transactional operation.
If you remove the _init forks and then the transaction rolls back,
you've left the system an inconsistent state. If you postpone the
removal until commit time, then you have a problem if it fails,It's true. That are the cause of headache.
...
The current implement is simple. It's enough to just discard old or
new relfilenode according to the current transaction ends with commit
or abort. Tweaking of relfilenode under use leads-in some skews in
some places. I used pendingDelete mechanism a bit complexified way
and a violated an abstraction (I think, calling AM-routines from
storage.c is not good.) and even introduce a new fork kind only to
mark a init fork as "not committed yet". There might be better way,
but I haven't find it.
I have no alternative idea yet, too. I agree that we want to avoid them, especially introducing inittmp fork... Anyway, below are the rest of my review comments for 0001. I want to review 0002 when we have decided to go with 0001.
(2)
XLOG_SMGR_UNLINK seems to necessitate modification of the following comments:
[src/include/catalog/storage_xlog.h]
/*
* Declarations for smgr-related XLOG records
*
* Note: we log file creation and truncation here, but logging of deletion
* actions is handled by xact.c, because it is part of transaction commit.
*/
[src/backend/access/transam/README]
3. Deleting a table, which requires an unlink() that could fail.
Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives. Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway. (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)
(3)
+/* This is bit-map, not ordianal numbers */
There seems to be no comments using "bit-map". "Flags for ..." can be seen here and there.
(4)
Some wrong spellings:
+ /* we flush this buffer when swithing to PERMANENT */
swithing -> switching
+ * alredy flushed out by RelationCreate(Drop)InitFork called just
alredy -> already
+ * relation content to be WAL-logged to recovery the table.
recovery -> recover
+ * The inittmp fork works as the sentinel to identify that situaton.
situaton -> situation
(5)
+ table_close(classRel, NoLock);
+
+
+
+
}
These empty lines can be deleted.
(6)
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
...
+ * Make an XLOG entry reporting the file unlink.
Not unlink but buffer persistence?
(7)
+ /*
+ * index-init fork needs further initialization. ambuildempty shoud do
+ * WAL-log and file sync by itself but otherwise we do that by myself.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /*
+ * We have created the init fork. If server crashes before the current
+ * transaction ends the init fork left alone corrupts data while recovery.
+ * The inittmp fork works as the sentinel to identify that situaton.
+ */
+ smgrcreate(srel, INITTMP_FORKNUM, false);
+ log_smgrcreate(&rnode, INITTMP_FORKNUM);
+ smgrimmedsync(srel, INITTMP_FORKNUM);
If the server crashes between these two processings, only the init fork exists. Is it correct to create the inittmp fork first?
(8)
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+ smgrclose(srel);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ log_smgrunlink(&rnode, INITTMP_FORKNUM);
+ smgrunlink(srel, INITTMP_FORKNUM, false);
+ return;
+ }
smgrclose() should be called just before return.
Isn't it necessary here to revert buffer persistence state change?
(9)
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
Maybe it's better to restore smgrdounlinkfork() that was removed in the older release. That function includes dropping shared buffers, which can clean up the shared buffers that may be cached by this transaction.
(10)
[RelationDropInitFork]
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
bufpersistence should be true.
(11)
+ BlockNumber block = 0;
...
+ DropRelFileNodeBuffers(rbnode, &pending->unlink_forknum, 1,
+ &block);
"block" is unnecessary and 0 can be passed directly.
(12)
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
nrels++;
It's better to put && at the beginning of the line to follow the existing code here.
(13)
+ table_close(rel, lockmode);
lockmode should be NoLock to retain the lock until transaction completion.
(14)
+ ctl.keysize = sizeof(unlogged_relation_entry);
+ ctl.entrysize = sizeof(unlogged_relation_entry);
+ hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
...
+ memset(key.oid, 0, sizeof(key.oid));
+ memcpy(key.oid, de->d_name, oidchars);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
keysize should be the oid member of the struct.
Regards
Takayuki Tsunakawa
Thanks for the comment! Sorry for the late reply.
At Fri, 4 Dec 2020 07:49:22 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
No, not really. The issue is more around what happens if we crash
part way through. At crash recovery time, the system catalogs are not
available, because the database isn't consistent yet and, anyway, the
startup process can't be bound to a database, let alone every database
that might contain unlogged tables. So the sentinel that's used to
decide whether to flush the contents of a table or index is the
presence or absence of an _init fork, which the startup process
obviously can see just fine. The _init fork also tells us what to
stick in the relation when we reset it; for a table, we can just reset
to an empty file, but that's not legal for indexes, so the _init fork
contains a pre-initialized empty index that we can just copy over.Now, to make an unlogged table logged, you've got to at some stage
remove those _init forks. But this is not a transactional operation.
If you remove the _init forks and then the transaction rolls back,
you've left the system an inconsistent state. If you postpone the
removal until commit time, then you have a problem if it fails,It's true. That are the cause of headache.
...
The current implement is simple. It's enough to just discard old or
new relfilenode according to the current transaction ends with commit
or abort. Tweaking of relfilenode under use leads-in some skews in
some places. I used pendingDelete mechanism a bit complexified way
and a violated an abstraction (I think, calling AM-routines from
storage.c is not good.) and even introduce a new fork kind only to
mark a init fork as "not committed yet". There might be better way,
but I haven't find it.I have no alternative idea yet, too. I agree that we want to avoid them, especially introducing inittmp fork... Anyway, below are the rest of my review comments for 0001. I want to review 0002 when we have decided to go with 0001.
(2)
XLOG_SMGR_UNLINK seems to necessitate modification of the following comments:[src/include/catalog/storage_xlog.h]
/*
* Declarations for smgr-related XLOG records
*
* Note: we log file creation and truncation here, but logging of deletion
* actions is handled by xact.c, because it is part of transaction commit.
*/
Sure. Rewrote it.
[src/backend/access/transam/README]
3. Deleting a table, which requires an unlink() that could fail.Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives. Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway. (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)
Mmm. I didn't touched theDROP TABLE (RelationDropStorage) path, but I
added a brief description about INITTMP fork to the file.
====
The INITTMP fork file
--------------------------------
An INITTMP fork is created when new relation file is created to mark
the relfilenode needs to be cleaned up at recovery time. The file is
removed at transaction end but is left when the process crashes before
the transaction ends. In contrast to 4 above, failure to remove an
INITTMP file will lead to data loss, in which case the server will
shut down.
====
(3)
+/* This is bit-map, not ordianal numbers */There seems to be no comments using "bit-map". "Flags for ..." can be seen here and there.
I revmoed the comment and use (1 << n) notation to show the fact
instead.
(4)
Some wrong spellings:swithing -> switching
alredy -> already
recovery -> recover
situaton -> situation
Oops! Fixed them.
(5) + table_close(classRel, NoLock); + + + + }These empty lines can be deleted.
s/can/should/ :p. Fixed.
(6) +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) ... + * Make an XLOG entry reporting the file unlink.Not unlink but buffer persistence?
Silly copy-pasto. Fixed.
(7) + /* + * index-init fork needs further initialization. ambuildempty shoud do + * WAL-log and file sync by itself but otherwise we do that by myself. + */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* + * We have created the init fork. If server crashes before the current + * transaction ends the init fork left alone corrupts data while recovery. + * The inittmp fork works as the sentinel to identify that situaton. + */ + smgrcreate(srel, INITTMP_FORKNUM, false); + log_smgrcreate(&rnode, INITTMP_FORKNUM); + smgrimmedsync(srel, INITTMP_FORKNUM);If the server crashes between these two processings, only the init fork exists. Is it correct to create the inittmp fork first?
Right. I change it that way, and did the same with the new code added
to RelationCreateStorage.
(8) + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + smgrclose(srel); + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + log_smgrunlink(&rnode, INITTMP_FORKNUM); + smgrunlink(srel, INITTMP_FORKNUM, false); + return; + }smgrclose() should be called just before return.
Isn't it necessary here to revert buffer persistence state change?
Mmm. it's a thinko. I was confused with the case of
close/unlink. Fixed all instacnes of the same.
(9) +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +}Maybe it's better to restore smgrdounlinkfork() that was removed in the older release. That function includes dropping shared buffers, which can clean up the shared buffers that may be cached by this transaction.
INITFORK/INITTMP forks cannot be loaded to shared buffer so it's no
use to drop buffers. I added a comment like that.
| /*
| * INIT/INITTMP forks never be loaded to shared buffer so no point in
| * dropping buffers for these files.
| */
| log_smgrunlink(&rnode, INIT_FORKNUM);
I removed DropRelFileNodeBuffers from PDOP_UNLINK_FORK branch in
smgrDoPendingDeletes and added an assertion and a comment instead.
| /* other forks needs to drop buffers */
| Assert(pending->unlink_forknum == INIT_FORKNUM ||
| pending->unlink_forknum == INITTMP_FORKNUM);
|
| log_smgrunlink(&pending->relnode, pending->unlink_forknum);
| smgrunlink(srel, pending->unlink_forknum, false);
(10) [RelationDropInitFork] + /* revert buffer-persistence changes at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_SET_PERSISTENCE; + pending->bufpersistence = false; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +}bufpersistence should be true.
RelationDropInitFork() chnages the relation persisitence to
"persistent" so it shoud be reverted to "non-persistent (= false)" at
abort. (I agree that the function name is somewhat confusing...)
(11) + BlockNumber block = 0; ... + DropRelFileNodeBuffers(rbnode, &pending->unlink_forknum, 1, + &block);"block" is unnecessary and 0 can be passed directly.
I removed the entire function call.
But, I don't think you're right here.
| DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
| int nforks, BlockNumber *firstDelBlock)
Doesn't just passing 0 lead to SEGV?
(12) - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId && + pending->op == PDOP_DELETE) nrels++;It's better to put && at the beginning of the line to follow the existing code here.
It's terrible.. Fixed.
(13)
+ table_close(rel, lockmode);lockmode should be NoLock to retain the lock until transaction completion.
I tried to recall the reason for that, but didn't come up with
anything. Fixed.
(14) + ctl.keysize = sizeof(unlogged_relation_entry); + ctl.entrysize = sizeof(unlogged_relation_entry); + hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM); ... + memset(key.oid, 0, sizeof(key.oid)); + memcpy(key.oid, de->d_name, oidchars); + ent = hash_search(hash, &key, HASH_FIND, NULL);keysize should be the oid member of the struct.
It's not a problem since the first member is the oid and perhaps it
seems that I thougth to do someting more on that. Now that I don't
recall what is it and in the first place the key should be just Oid in
the context above. Fixed.
The patch is attached to the next message.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
At Thu, 24 Dec 2020 17:02:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
The patch is attached to the next message.
The reason for separating this message is that I modified this so that
it could solve another issue.
There's a complain about orphan files after crash. [1]
1: /messages/by-id/16771-cbef7d97ba93f4b9@postgresql.org
That is, the case where a relation file is left alone after a server
crash that happened before the end of the transaction that has created
a relation. As I read this, I noticed this feature can solve the
issue with a small change.
This version gets changes in RelationCreateStorage and
smgrDoPendingDeletes.
Previously inittmp fork is created only along with an init fork. This
version creates one always when a relation storage file is created. As
the result ResetUnloggedRelationsInDbspaceDir removes all forks if the
inttmp fork of a logged relations is found. Now that pendingDeletes
can contain multiple entries for the same relation, it has been
modified not to close the same smgr multiple times.
- It might be better to split 0001 into two peaces.
- The function name ResetUnloggedRelationsInDbspaceDir is no longer
represents the function correctly.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 25 Dec 2020 09:12:52 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Hello.
At Thu, 24 Dec 2020 17:02:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
The patch is attached to the next message.
The reason for separating this message is that I modified this so that
it could solve another issue.There's a complain about orphan files after crash. [1]
1: /messages/by-id/16771-cbef7d97ba93f4b9@postgresql.org
That is, the case where a relation file is left alone after a server
crash that happened before the end of the transaction that has created
a relation. As I read this, I noticed this feature can solve the
issue with a small change.This version gets changes in RelationCreateStorage and
smgrDoPendingDeletes.Previously inittmp fork is created only along with an init fork. This
version creates one always when a relation storage file is created. As
the result ResetUnloggedRelationsInDbspaceDir removes all forks if the
inttmp fork of a logged relations is found. Now that pendingDeletes
can contain multiple entries for the same relation, it has been
modified not to close the same smgr multiple times.- It might be better to split 0001 into two peaces.
- The function name ResetUnloggedRelationsInDbspaceDir is no longer
represents the function correctly.
As pointed by Robert in another thread [1], persisntence of (at least)
GiST index cannot be flipped in-place due to incompatibility of fake
LSNs with real ones.
This version RelationChangePersistence() is changed not to choose
in-place method for indexes other than btree. It seems to be usable
with all kind of indexes other than Gist, but at the mement it applies
only to btrees.
1: /messages/by-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
This version RelationChangePersistence() is changed not to choose
in-place method for indexes other than btree. It seems to be usable
with all kind of indexes other than Gist, but at the mement it applies
only to btrees.1: /messages/by-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com
Hmm. This is not wroking correctly. I'll repost after fixint that.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 08 Jan 2021 17:52:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
This version RelationChangePersistence() is changed not to choose
in-place method for indexes other than btree. It seems to be usable
with all kind of indexes other than Gist, but at the mement it applies
only to btrees.1: /messages/by-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com
Hmm. This is not wroking correctly. I'll repost after fixint that.
I think I fixed the misbehavior. ResetUnloggedRelationsInDbspaceDir()
handles file operations in the wrong order and with the wrong logic.
It also needed to drop buffers and forget fsync requests.
I thought that the two cases that this patch is expected to fix
(orphan relation files and uncommited init files) can share the same
"cleanup" fork but that is wrong. I had to add one more additional
fork to differentiate the cases of SET UNLOGGED and of creation of
UNLOGGED tables...
The attached is a new version, that seems working correctly but looks
somewhat messy. I'll continue working.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 12 Jan 2021 18:58:08 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Fri, 08 Jan 2021 17:52:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
This version RelationChangePersistence() is changed not to choose
in-place method for indexes other than btree. It seems to be usable
with all kind of indexes other than Gist, but at the mement it applies
only to btrees.1: /messages/by-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com
Hmm. This is not wroking correctly. I'll repost after fixint that.
I think I fixed the misbehavior. ResetUnloggedRelationsInDbspaceDir()
handles file operations in the wrong order and with the wrong logic.
It also needed to drop buffers and forget fsync requests.I thought that the two cases that this patch is expected to fix
(orphan relation files and uncommited init files) can share the same
"cleanup" fork but that is wrong. I had to add one more additional
fork to differentiate the cases of SET UNLOGGED and of creation of
UNLOGGED tables...The attached is a new version, that seems working correctly but looks
somewhat messy. I'll continue working.
Commit bea449c635 conflicts with this on the change of the definition
of DropRelFileNodeBuffers. The change simplified this patch by a bit:p
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
(I'm not sure when the subject was broken..)
At Thu, 14 Jan 2021 17:32:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Commit bea449c635 conflicts with this on the change of the definition
of DropRelFileNodeBuffers. The change simplified this patch by a bit:p
In this version, I got rid of the "CLEANUP FORK"s, and added a new
system "Smgr marks". The mark files have the name of the
corresponding fork file followed by ".u" (which means Uncommitted.).
"Uncommited"-marked main fork means the same as the CLEANUP2_FORKNUM
and uncommitted-marked init fork means the same as the CLEANUP_FORKNUM
in the previous version.x
I noticed that the previous version of the patch still leaves an
orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
revmoed at rollback. In this version the responsibility to remove the
mark files is moved to SyncPostCheckpoint, where main fork files are
actually removed.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 25 Mar 2021 14:08:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
(I'm not sure when the subject was broken..)
At Thu, 14 Jan 2021 17:32:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Commit bea449c635 conflicts with this on the change of the definition
of DropRelFileNodeBuffers. The change simplified this patch by a bit:pIn this version, I got rid of the "CLEANUP FORK"s, and added a new
system "Smgr marks". The mark files have the name of the
corresponding fork file followed by ".u" (which means Uncommitted.).
"Uncommited"-marked main fork means the same as the CLEANUP2_FORKNUM
and uncommitted-marked init fork means the same as the CLEANUP_FORKNUM
in the previous version.xI noticed that the previous version of the patch still leaves an
orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
revmoed at rollback. In this version the responsibility to remove the
mark files is moved to SyncPostCheckpoint, where main fork files are
actually removed.
For the record, I noticed that basebackup could be confused by the
mark files but I haven't looked that yet.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro wrote:
In this version, I got rid of the "CLEANUP FORK"s, and added a new
system "Smgr marks". The mark files have the name of the
corresponding fork file followed by ".u" (which means Uncommitted.).
"Uncommited"-marked main fork means the same as theCLEANUP2_FORKNUM
and uncommitted-marked init fork means the same as the
CLEANUP_FORKNUM
in the previous version.x
I noticed that the previous version of the patch still leaves an
orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
revmoed at rollback. In this version the responsibility to remove the
mark files is moved to SyncPostCheckpoint, where main fork files are
actually removed.For the record, I noticed that basebackup could be confused by the mark files
but I haven't looked that yet.
Good morning Kyotaro,
the patch didn't apply clean (it's from March; some hunks were failing), so I've fixed it and the combined git format-patch is attached. It did conflict with the following:
b0483263dda - Add support for SET ACCESS METHOD in ALTER TABLE
7b565843a94 - Add call to object access hook at the end of table rewrite in ALTER TABLE
9ce346eabf3 - Report progress of startup operations that take a long time.
f10f0ae420 - Replace RelationOpenSmgr() with RelationGetSmgr().
I'm especially worried if I didn't screw up something/forgot something related to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests passed".
Basic demonstration of this patch (with wal_level=minimal):
create unlogged table t6 (id bigint, t text);
-- produces 110GB table, takes ~5mins
insert into t6 select nextval('s1'), repeat('A', 1000) from generate_series(1, 100000000);
alter table t6 set logged;
on baseline SET LOGGED takes: ~7min10s
on patched SET LOGGED takes: 25s
So basically one can - thanks to this patch - use his application (performing classic INSERTs/UPDATEs/DELETEs, so without the need to rewrite to use COPY) and perform literally batch upload and then just switch the tables to LOGGED.
Some more intensive testing also looks good, assuming table prepared to put pressure on WAL:
create unlogged table t_unlogged (id bigint, t text) partition by hash (id);
create unlogged table t_unlogged_h0 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 0);
[..]
create unlogged table t_unlogged_h3 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 3);
Workload would still be pretty heavy on LWLock/BufferContent,WALInsert and Lock/extend .
t_logged.sql = insert into t_logged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes generates ~1MB of WAL
t_unlogged.sql = insert into t_unlogged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes generates ~3kB of WAL
so using: pgbench -f <tabletypetest>.sql -T 30 -P 1 -c 32 -j 3 t
with synchronous_commit =ON(default):
with t_logged.sql: tps = 229 (lat avg = 138ms)
with t_unlogged.sql tps = 283 (lat avg = 112ms) # almost all on LWLock/WALWrite
with synchronous_commit =OFF:
with t_logged.sql: tps = 413 (lat avg = 77ms)
with t_unloged.sql: tps = 782 (lat avg = 40ms)
Afterwards switching the unlogged ~16GB partitions takes 5s per partition.
As the thread didn't get a lot of traction, I've registered it into current commitfest https://commitfest.postgresql.org/36/3461/ with You as the author and in 'Ready for review' state.
I think it behaves as almost finished one and apparently after reading all those discussions that go back over 10years+ time span about this feature, and lot of failed effort towards wal_level=noWAL I think it would be nice to finally start getting some of that of it into the core.
-Jakub Wartak.