In-placre persistance change of a relation

Started by Kyotaro Horiguchiover 5 years ago94 messageshackers
Jump to latest
#1Kyotaro Horiguchi
horikyota.ntt@gmail.com

Hello. This is a thread for an alternative solution to wal_level=none
[*1] for bulk data loading.

*1: /messages/by-id/TYAPR01MB29901EBE5A3ACCE55BA99186FE320@TYAPR01MB2990.jpnprd01.prod.outlook.com

At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost <sfrost@snowman.net> wrote in

Greetings,

* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:

For fuel(?) of the discussion, I tried a very-quick PoC for in-place
ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
trials of several ways, I drifted to the following way after poking
several ways.

1. Flip BM_PERMANENT of active buffers
2. adding/removing init fork
3. sync files,
4. Flip pg_class.relpersistence.

It always skips table copy in the SET UNLOGGED case, and only when
wal_level=minimal in the SET LOGGED case. Crash recovery seems
working by some brief testing by hand.

Somehow missed that this patch more-or-less does what I was referring to
down-thread, but I did want to mention that it looks like it's missing a
necessary FlushRelationBuffers() call before the sync, otherwise there
could be dirty buffers for the relation that's being set to LOGGED (with
wal_level=minimal), which wouldn't be good. See the comments above
smgrimmedsync().

Right. Thanks. However, since SetRelFileNodeBuffersPersistence()
called just above scans shared buffers so I don't want to just call
FlushRelationBuffers() separately. Instead, I added buffer-flush to
SetRelFileNodeBuffersPersistence().

FWIW this is a revised version of the PoC, which has some known
problems.

- Flipping of Buffer persistence is not WAL-logged nor even be able to
be safely roll-backed. (It might be better to drop buffers).

- This version handles indexes but not yet handle toast relatins.

- tableAMs are supposed to support this feature. (but I'm not sure
it's worth allowing them not to do so).

Of course, I haven't performed intensive test on it.

Reading through the thread, it didn't seem very clear, but we should
definitely make sure that it does the right thing on replicas when going
between unlogged and logged (and between logged and unlogged too), of
course.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

PoC_in-place_set_persistence_v2.patchtext/x-patch; charset=us-asciiDownload+459-46
#2Stephen Frost
sfrost@snowman.net
In reply to: Kyotaro Horiguchi (#1)
Re: In-place persistance change of a relation

Greetings,

* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:

At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost <sfrost@snowman.net> wrote in

* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:

For fuel(?) of the discussion, I tried a very-quick PoC for in-place
ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
trials of several ways, I drifted to the following way after poking
several ways.

1. Flip BM_PERMANENT of active buffers
2. adding/removing init fork
3. sync files,
4. Flip pg_class.relpersistence.

It always skips table copy in the SET UNLOGGED case, and only when
wal_level=minimal in the SET LOGGED case. Crash recovery seems
working by some brief testing by hand.

Somehow missed that this patch more-or-less does what I was referring to
down-thread, but I did want to mention that it looks like it's missing a
necessary FlushRelationBuffers() call before the sync, otherwise there
could be dirty buffers for the relation that's being set to LOGGED (with
wal_level=minimal), which wouldn't be good. See the comments above
smgrimmedsync().

Right. Thanks. However, since SetRelFileNodeBuffersPersistence()
called just above scans shared buffers so I don't want to just call
FlushRelationBuffers() separately. Instead, I added buffer-flush to
SetRelFileNodeBuffersPersistence().

Maybe I'm missing something, but it sure looks like in the patch that
SetRelFileNodeBuffersPersistence() is being called after the
smgrimmedsync() call, and I don't think you get to just switch the order
of those- the sync is telling the kernel to make sure it's written to
disk, while the FlushBuffer() is just writing it into the kernel but
doesn't provide any guarantee that the data has actually made it to
disk. We have to FlushBuffer() first, and then call smgrimmedsync().
Perhaps there's a way to avoid having to go through shared buffers
twice, and I generally agreed it'd be good if we could avoid doing so,
but this approach doesn't look like it actually works.

FWIW this is a revised version of the PoC, which has some known
problems.

- Flipping of Buffer persistence is not WAL-logged nor even be able to
be safely roll-backed. (It might be better to drop buffers).

Not sure if it'd be better to drop buffers or not, but figuring out how
to deal with rollback seems pretty important. How is the persistence
change in the catalog not WAL-logged though..?

- This version handles indexes but not yet handle toast relatins.

Would need to be fixed, of course.

- tableAMs are supposed to support this feature. (but I'm not sure
it's worth allowing them not to do so).

Seems like they should.

Thanks,

Stephen

#3Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#1)
Re: In-placre persistance change of a relation

Hi,

I suggest outlining what you are trying to achieve here. Starting a new
thread and expecting people to dig through another thread to infer what
you are actually trying to achive isn't great.

FWIW, I'm *extremely* doubtful it's worth adding features that depend on
a PGC_POSTMASTER wal_level=minimal being used. Which this does, a far as
I understand. If somebody added support for dynamically adapting
wal_level (e.g. wal_level=auto, that increases wal_level to
replica/logical depending on the presence of replication slots), it'd
perhaps be different.

On 2020-11-11 17:33:17 +0900, Kyotaro Horiguchi wrote:

FWIW this is a revised version of the PoC, which has some known
problems.

- Flipping of Buffer persistence is not WAL-logged nor even be able to
be safely roll-backed. (It might be better to drop buffers).

That's obviously a no-go. I think you might be able to address this if
you accept that the command cannot be run in a transaction (like
CONCURRENTLY). Then you can first do the catalog changes, change the
persistence level, and commit.

Greetings,

Andres Freund

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#3)
Re: In-placre persistance change of a relation

At Wed, 11 Nov 2020 14:18:04 -0800, Andres Freund <andres@anarazel.de> wrote in

Hi,

I suggest outlining what you are trying to achieve here. Starting a new
thread and expecting people to dig through another thread to infer what
you are actually trying to achive isn't great.

Agreed. I'll post that. Thanks.

FWIW, I'm *extremely* doubtful it's worth adding features that depend on
a PGC_POSTMASTER wal_level=minimal being used. Which this does, a far as
I understand. If somebody added support for dynamically adapting
wal_level (e.g. wal_level=auto, that increases wal_level to
replica/logical depending on the presence of replication slots), it'd
perhaps be different.

Yes, this depends on wal_level=minimal for switching from UNLOGGED to
LOGGED, that's similar to COPY/INSERT-to-intransaction-created-tables
optimization for wal_level=minimal. And it expands that optimization
to COPY/INSERT-to-existent-tables, which seems worth doing.

Switching to LOGGED needs to emit the initial state to WAL... Hmm.. I
came to think that even in that case skipping table copy reduces I/O
significantly, even though FPI-WAL is emitted.

On 2020-11-11 17:33:17 +0900, Kyotaro Horiguchi wrote:

FWIW this is a revised version of the PoC, which has some known
problems.

- Flipping of Buffer persistence is not WAL-logged nor even be able to
be safely roll-backed. (It might be better to drop buffers).

That's obviously a no-go. I think you might be able to address this if
you accept that the command cannot be run in a transaction (like
CONCURRENTLY). Then you can first do the catalog changes, change the
persistence level, and commit.

Of course. The next version reverts persistence change at abort.

Thanks!

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Stephen Frost (#2)
Re: In-place persistance change of a relation

At Wed, 11 Nov 2020 09:56:44 -0500, Stephen Frost <sfrost@snowman.net> wrote in

Greetings,

* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:

At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost <sfrost@snowman.net> wrote in

* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:

For fuel(?) of the discussion, I tried a very-quick PoC for in-place
ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
trials of several ways, I drifted to the following way after poking
several ways.

1. Flip BM_PERMANENT of active buffers
2. adding/removing init fork
3. sync files,
4. Flip pg_class.relpersistence.

It always skips table copy in the SET UNLOGGED case, and only when
wal_level=minimal in the SET LOGGED case. Crash recovery seems
working by some brief testing by hand.

Somehow missed that this patch more-or-less does what I was referring to
down-thread, but I did want to mention that it looks like it's missing a
necessary FlushRelationBuffers() call before the sync, otherwise there
could be dirty buffers for the relation that's being set to LOGGED (with
wal_level=minimal), which wouldn't be good. See the comments above
smgrimmedsync().

Right. Thanks. However, since SetRelFileNodeBuffersPersistence()
called just above scans shared buffers so I don't want to just call
FlushRelationBuffers() separately. Instead, I added buffer-flush to
SetRelFileNodeBuffersPersistence().

Maybe I'm missing something, but it sure looks like in the patch that
SetRelFileNodeBuffersPersistence() is being called after the
smgrimmedsync() call, and I don't think you get to just switch the order
of those- the sync is telling the kernel to make sure it's written to
disk, while the FlushBuffer() is just writing it into the kernel but
doesn't provide any guarantee that the data has actually made it to
disk. We have to FlushBuffer() first, and then call smgrimmedsync().
Perhaps there's a way to avoid having to go through shared buffers
twice, and I generally agreed it'd be good if we could avoid doing so,
but this approach doesn't look like it actually works.

Yeah, sorry for the rare-baked version.. I was confused about the
order at the time. The next version works like this:

LOGGED->UNLOGGED
<collect reloids to process>

for each relations:
<set buffer persistence to !BM_PERMANENT (wal-logged if walleve > minimal>
<create init fork>
if it is index call ambuildempty() (which syncs the init fork)
else WAL-log smgr_create then sync the init file.
<update catalog>
...
commit time:
<do nogthing>
abort time:
<unlink init fork>
<revert buffer persistence>

UNLOGGED->LOGGED
<collect reloids to process>

for each relations:
<set buffer persistence to !BM_PERMANENT (wal-logged if walleve > minimal>
<record drop-init-fork to pending-deletes>
<sync storage files>
<update catalog>
...
commit time:
<log smgrunlink>
<smgrunlink init fork>
abort time:
<revert buffer persistence>

FWIW this is a revised version of the PoC, which has some known
problems.

- Flipping of Buffer persistence is not WAL-logged nor even be able to
be safely roll-backed. (It might be better to drop buffers).

Not sure if it'd be better to drop buffers or not, but figuring out how
to deal with rollback seems pretty important. How is the persistence
change in the catalog not WAL-logged though..?

Rollback works as the above. Buffer persistence change is registered
in pending-deletes. Persistence change in catalog is rolled back in
the ordinary way (or automatically).

If wal_level > minimal, persistence change of buffers is propagated to
standbys by WAL. However I'm not sure we need wal-logging otherwise,
the next version emits WAL since SMGR_CREATE is always logged by
existing code.

- This version handles indexes but not yet handle toast relatins.

Would need to be fixed, of course.

Fixed.

- tableAMs are supposed to support this feature. (but I'm not sure
it's worth allowing them not to do so).

Seems like they should.

Init fork of index relations needs a call to ambuildempty() instead of
"log_smgrcreate-smgrimmedsync" after smgrcreate. Instead of adding
similar interface in indexAm, I reverted changes of tableam and make
RelationCreate/DropInitFork() directly do that. That introduces new
include of amapi.h to storage.c, which is a bit uneasy.

The previous version give up the in-place persistence change in the
case where wal_level > minimal and SET LOGGED since that needs WAL to
be emitted. However, in-place change still has the advantage of not
running a table copy. So the next verson always runs persistence
change in-place.

As suggested by Andres, I'll send a summary of this patch. The patch
will be attached to the coming mail.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#6Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#1)
Re: In-placre persistance change of a relation

Hello. Before posting the next version, I'd like to explain what this
patch is.

1. The Issue

Bulk data loading is a long-time taking, I/O consuming task. Many
DBAs want that task is faster, even at the cost of increasing risk of
data-loss. wal_level=minimal is an answer to such a
request. Data-loading onto a table that is created in the current
transaction omits WAL-logging and synced at commit.

However, the optimization doesn't benefit the case where the
data-loading is performed onto existing tables. There are quite a few
cases where data is loaded into tables that already contains a lot of
data. Those cases don't take benefit of the optimization.

Another possible solution for bulk data-loading is UNLOGGED
tables. But when we switch LOGGED/UNLOGGED of a table, all the table
content is copied to a newly created heap, which is costly.

2. Proposed Solutions.

There are two proposed solutions are discussed on this mailing
list. One is wal_level = none (*1), which omits WAL-logging almost at
all. Another is extending the existing optimization to the ALTER TABLE
SET LOGGED/UNLOGGED cases, which is to be discussed in this new
thread.

3. In-place Persistence Change

So the attached is a PoC patch of the "another" solution. When we
want to change table persistence in-place, basically we need to do the
following steps.

(the talbe is exclusively locked)

(1) Flip BM_PERMANENT flag of all shared buffer blocks for the heap.

(2) Create or delete the init fork for existing heap.

(3) Flush all buffers of the relation to file system.

(4) Sync heap files.

(5) Make catalog changes.

4. Transactionality

The 1, 2 and 5 above need to be abort-able. 5 is rolled back by
existing infrastructure, and rolling-back of 1 and 2 are achieved by
piggybacking on the pendingDeletes mechanism.

5. Replication

Furthermore, that changes ought to be replicable to standbys. Catalog
changes are replicated as usual.

On-the-fly creation of the init fork leads to recovery mess. Even
though it is removed at abort, if the server crashed before
transaction end, the file is left alone and corrupts database in the
next recovery. I sought a way to create the init fork in
smgrPendingDelete but that needs relcache and relcache is not
available at that late of commit. Finally, I introduced the fifth fork
kind "INITTMP"(_itmp) only to signal that the init file is not
committed. I don't like that way but it seems working fine...

6. SQL Command

The second file in the patchset adds a syntax that changes persistence
of all tables in a tablespace.

ALTER TABLE ALL IN TABLESPACE <tsp> SET LOGGED/UNLOGGED [ NOWAIT ];

7. Testing

I tried to write TAP test for this, but IPC::Run::harness (or
interactive_psql) doesn't seem to work for me. I'm not sure what
exactly is happening but pty redirection doesn't work.

$in = "ls\n"; $out = ""; run ["/usr/bin/bash"], \$in, \$out; print $out;

works but

$in = "ls\n"; $out = ""; run ["/usr/bin/bash"], '<pty<', \$in, '>pty>', \$out; print $out;

doesn't respond.

The patch is attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v3-0001-In-place-table-persistence-change.patchtext/x-patch; charset=us-asciiDownload+784-143
v3-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LOG.patchtext/x-patch; charset=us-asciiDownload+214-1
#7tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Kyotaro Horiguchi (#6)
RE: In-placre persistance change of a relation

Hi Horiguchi-san,

Thank you for making a patch so quickly. I've started looking at it.

What makes you think this is a PoC? Documentation and test cases? If there's something you think that doesn't work or are concerned about, can you share it?

Do you know the reason why data copy was done before? And, it may be odd for me to ask this, but I think I saw someone referred to the past discussion that eliminating data copy is difficult due to some processing at commit. I can't find it.

(1)
@@ -168,6 +168,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
*/
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))

+struct SmgrRelationData;

This declaration is already in the file:

/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;

/* forward declared, to avoid including smgr.h here */
struct SMgrRelationData;

Regards
Takayuki Tsunakawa

#8osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: tsunakawa.takay@fujitsu.com (#7)
RE: In-placre persistance change of a relation

Hello, Tsunakawa-San

Do you know the reason why data copy was done before? And, it may be
odd for me to ask this, but I think I saw someone referred to the past
discussion that eliminating data copy is difficult due to some processing at
commit. I can't find it.

I can share 2 sources why to eliminate the data copy is difficult in hackers thread.

Tom's remark and the context to copy relation's data.
/messages/by-id/31724.1394163360@sss.pgh.pa.us

Amit-San quoted this thread and mentioned that point in another thread.
/messages/by-id/CAA4eK1+HDqS+1fhs5Jf9o4ZujQT=XBZ6sU0kOuEh2hqQAC+t=w@mail.gmail.com

Best,
Takamichi Osumi

#9Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: tsunakawa.takay@fujitsu.com (#7)
Re: In-placre persistance change of a relation

At Fri, 13 Nov 2020 06:43:13 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in

Hi Horiguchi-san,

Thank you for making a patch so quickly. I've started looking at it.

What makes you think this is a PoC? Documentation and test cases? If there's something you think that doesn't work or are concerned about, can you share it?

The latest version is heavily revised and is given much comment so it
might have exited from PoC state. The necessity of documentation is
doubtful since this patch doesn't user-facing behavior other than
speed. Some tests are required especialy about recovery and
replication perspective but I haven't been able to make it. (One of
the tests needs to cause crash while a transaction is running.)

Do you know the reason why data copy was done before? And, it may be odd for me to ask this, but I think I saw someone referred to the past discussion that eliminating data copy is difficult due to some processing at commit. I can't find it.

To imagine that, just because it is simpler considering rollback and
code sharing, and maybe no one have been complained that SET
LOGGED/UNLOGGED looks taking a long time than required/expected.

The current implement is simple. It's enough to just discard old or
new relfilenode according to the current transaction ends with commit
or abort. Tweaking of relfilenode under use leads-in some skews in
some places. I used pendingDelete mechanism a bit complexified way
and a violated an abstraction (I think, calling AM-routines from
storage.c is not good.) and even introduce a new fork kind only to
mark a init fork as "not committed yet". There might be better way,
but I haven't find it.

(The patch scans all shared buffer blocks for each relation).

(1)
@@ -168,6 +168,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
*/
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))

+struct SmgrRelationData;

This declaration is already in the file:

/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;

/* forward declared, to avoid including smgr.h here */
struct SMgrRelationData;

Hmmm. Nice chatch. And will fix in the next version.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#10Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#8)
Re: In-placre persistance change of a relation

At Fri, 13 Nov 2020 07:15:41 +0000, "osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com> wrote in

Hello, Tsunakawa-San

Thanks for sharing it!

Do you know the reason why data copy was done before? And, it may be
odd for me to ask this, but I think I saw someone referred to the past
discussion that eliminating data copy is difficult due to some processing at
commit. I can't find it.

I can share 2 sources why to eliminate the data copy is difficult in hackers thread.

Tom's remark and the context to copy relation's data.
/messages/by-id/31724.1394163360@sss.pgh.pa.us

/messages/by-id/CA+Tgmob44LNwwU73N1aJsGQyzQ61SdhKJRC_89wCm0+aLg=x2Q@mail.gmail.com

No, not really. The issue is more around what happens if we crash
part way through. At crash recovery time, the system catalogs are not
available, because the database isn't consistent yet and, anyway, the
startup process can't be bound to a database, let alone every database
that might contain unlogged tables. So the sentinel that's used to
decide whether to flush the contents of a table or index is the
presence or absence of an _init fork, which the startup process
obviously can see just fine. The _init fork also tells us what to
stick in the relation when we reset it; for a table, we can just reset
to an empty file, but that's not legal for indexes, so the _init fork
contains a pre-initialized empty index that we can just copy over.

Now, to make an unlogged table logged, you've got to at some stage
remove those _init forks. But this is not a transactional operation.
If you remove the _init forks and then the transaction rolls back,
you've left the system an inconsistent state. If you postpone the
removal until commit time, then you have a problem if it fails,

It's true. That are the cause of headache.

particularly if it works for the first file but fails for the second.
And if you crash at any point before you've fsync'd the containing
directory, you have no idea which files will still be on disk after a
hard reboot.

This is not an issue in this patch *except* the case where init fork
is failed to removed but the following removal of inittmp fork
succeeds. Another idea is adding a "not-yet-committed" property to a
fork. I added a new fork type for easiness of the patch but I could
go that way if that is an issue.

Amit-San quoted this thread and mentioned that point in another thread.
/messages/by-id/CAA4eK1+HDqS+1fhs5Jf9o4ZujQT=XBZ6sU0kOuEh2hqQAC+t=w@mail.gmail.com

This sounds like a bit differrent discussion. Making part-of-a-table
UNLOGGED looks far difficult to me.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#11tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Kyotaro Horiguchi (#10)
RE: In-placre persistance change of a relation

From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>

No, not really. The issue is more around what happens if we crash
part way through. At crash recovery time, the system catalogs are not
available, because the database isn't consistent yet and, anyway, the
startup process can't be bound to a database, let alone every database
that might contain unlogged tables. So the sentinel that's used to
decide whether to flush the contents of a table or index is the
presence or absence of an _init fork, which the startup process
obviously can see just fine. The _init fork also tells us what to
stick in the relation when we reset it; for a table, we can just reset
to an empty file, but that's not legal for indexes, so the _init fork
contains a pre-initialized empty index that we can just copy over.

Now, to make an unlogged table logged, you've got to at some stage
remove those _init forks. But this is not a transactional operation.
If you remove the _init forks and then the transaction rolls back,
you've left the system an inconsistent state. If you postpone the
removal until commit time, then you have a problem if it fails,

It's true. That are the cause of headache.

...

The current implement is simple. It's enough to just discard old or
new relfilenode according to the current transaction ends with commit
or abort. Tweaking of relfilenode under use leads-in some skews in
some places. I used pendingDelete mechanism a bit complexified way
and a violated an abstraction (I think, calling AM-routines from
storage.c is not good.) and even introduce a new fork kind only to
mark a init fork as "not committed yet". There might be better way,
but I haven't find it.

I have no alternative idea yet, too. I agree that we want to avoid them, especially introducing inittmp fork... Anyway, below are the rest of my review comments for 0001. I want to review 0002 when we have decided to go with 0001.

(2)
XLOG_SMGR_UNLINK seems to necessitate modification of the following comments:

[src/include/catalog/storage_xlog.h]
/*
* Declarations for smgr-related XLOG records
*
* Note: we log file creation and truncation here, but logging of deletion
* actions is handled by xact.c, because it is part of transaction commit.
*/

[src/backend/access/transam/README]
3. Deleting a table, which requires an unlink() that could fail.

Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives. Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway. (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)

(3)
+/* This is bit-map, not ordianal numbers */

There seems to be no comments using "bit-map". "Flags for ..." can be seen here and there.

(4)
Some wrong spellings:

+ /* we flush this buffer when swithing to PERMANENT */

swithing -> switching

+ * alredy flushed out by RelationCreate(Drop)InitFork called just

alredy -> already

+ * relation content to be WAL-logged to recovery the table.

recovery -> recover

+ * The inittmp fork works as the sentinel to identify that situaton.

situaton -> situation

(5)
+	table_close(classRel, NoLock);
+
+
+
+
 }

These empty lines can be deleted.

(6)
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
...
+	 * Make an XLOG entry reporting the file unlink.

Not unlink but buffer persistence?

(7)
+	/*
+	 * index-init fork needs further initialization. ambuildempty shoud do
+	 * WAL-log and file sync by itself but otherwise we do that by myself.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_INDEX)
+		rel->rd_indam->ambuildempty(rel);
+	else
+	{
+		log_smgrcreate(&rnode, INIT_FORKNUM);
+		smgrimmedsync(srel, INIT_FORKNUM);
+	}
+
+	/*
+	 * We have created the init fork. If server crashes before the current
+	 * transaction ends the init fork left alone corrupts data while recovery.
+	 * The inittmp fork works as the sentinel to identify that situaton.
+	 */
+	smgrcreate(srel, INITTMP_FORKNUM, false);
+	log_smgrcreate(&rnode, INITTMP_FORKNUM);
+	smgrimmedsync(srel, INITTMP_FORKNUM);

If the server crashes between these two processings, only the init fork exists. Is it correct to create the inittmp fork first?

(8)
+	if (inxact_created)
+	{
+		SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+		smgrclose(srel);
+		log_smgrunlink(&rnode, INIT_FORKNUM);
+		smgrunlink(srel, INIT_FORKNUM, false);
	+		log_smgrunlink(&rnode, INITTMP_FORKNUM);
+		smgrunlink(srel, INITTMP_FORKNUM, false);
+		return;
+	}

smgrclose() should be called just before return.
Isn't it necessary here to revert buffer persistence state change?

(9)
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+	smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}

Maybe it's better to restore smgrdounlinkfork() that was removed in the older release. That function includes dropping shared buffers, which can clean up the shared buffers that may be cached by this transaction.

(10)
[RelationDropInitFork]
+	/* revert buffer-persistence changes at abort */
+	pending = (PendingRelDelete *)
+		MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+	pending->relnode = rnode;
+	pending->op = PDOP_SET_PERSISTENCE;
+	pending->bufpersistence = false;
+	pending->backend = InvalidBackendId;
+	pending->atCommit = true;
+	pending->nestLevel = GetCurrentTransactionNestLevel();
+	pending->next = pendingDeletes;
+	pendingDeletes = pending;
+}

bufpersistence should be true.

(11)
+				BlockNumber block = 0;
...
+				DropRelFileNodeBuffers(rbnode, &pending->unlink_forknum, 1,
+									   &block);

"block" is unnecessary and 0 can be passed directly.

(12)
-			&& pending->backend == InvalidBackendId)
+			&& pending->backend == InvalidBackendId &&
+			pending->op == PDOP_DELETE)
 			nrels++;

It's better to put && at the beginning of the line to follow the existing code here.

(13)
+ table_close(rel, lockmode);

lockmode should be NoLock to retain the lock until transaction completion.

(14)
+	ctl.keysize = sizeof(unlogged_relation_entry);
+	ctl.entrysize = sizeof(unlogged_relation_entry);
+	hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
...
+			memset(key.oid, 0, sizeof(key.oid));
+			memcpy(key.oid, de->d_name, oidchars);
+			ent = hash_search(hash, &key, HASH_FIND, NULL);

keysize should be the oid member of the struct.

Regards
Takayuki Tsunakawa

#12Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: tsunakawa.takay@fujitsu.com (#11)
Re: In-placre persistance change of a relation

Thanks for the comment! Sorry for the late reply.

At Fri, 4 Dec 2020 07:49:22 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in

From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>

No, not really. The issue is more around what happens if we crash
part way through. At crash recovery time, the system catalogs are not
available, because the database isn't consistent yet and, anyway, the
startup process can't be bound to a database, let alone every database
that might contain unlogged tables. So the sentinel that's used to
decide whether to flush the contents of a table or index is the
presence or absence of an _init fork, which the startup process
obviously can see just fine. The _init fork also tells us what to
stick in the relation when we reset it; for a table, we can just reset
to an empty file, but that's not legal for indexes, so the _init fork
contains a pre-initialized empty index that we can just copy over.

Now, to make an unlogged table logged, you've got to at some stage
remove those _init forks. But this is not a transactional operation.
If you remove the _init forks and then the transaction rolls back,
you've left the system an inconsistent state. If you postpone the
removal until commit time, then you have a problem if it fails,

It's true. That are the cause of headache.

...

The current implement is simple. It's enough to just discard old or
new relfilenode according to the current transaction ends with commit
or abort. Tweaking of relfilenode under use leads-in some skews in
some places. I used pendingDelete mechanism a bit complexified way
and a violated an abstraction (I think, calling AM-routines from
storage.c is not good.) and even introduce a new fork kind only to
mark a init fork as "not committed yet". There might be better way,
but I haven't find it.

I have no alternative idea yet, too. I agree that we want to avoid them, especially introducing inittmp fork... Anyway, below are the rest of my review comments for 0001. I want to review 0002 when we have decided to go with 0001.

(2)
XLOG_SMGR_UNLINK seems to necessitate modification of the following comments:

[src/include/catalog/storage_xlog.h]
/*
* Declarations for smgr-related XLOG records
*
* Note: we log file creation and truncation here, but logging of deletion
* actions is handled by xact.c, because it is part of transaction commit.
*/

Sure. Rewrote it.

[src/backend/access/transam/README]
3. Deleting a table, which requires an unlink() that could fail.

Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives. Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway. (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)

Mmm. I didn't touched theDROP TABLE (RelationDropStorage) path, but I
added a brief description about INITTMP fork to the file.

====
The INITTMP fork file
--------------------------------

An INITTMP fork is created when new relation file is created to mark
the relfilenode needs to be cleaned up at recovery time. The file is
removed at transaction end but is left when the process crashes before
the transaction ends. In contrast to 4 above, failure to remove an
INITTMP file will lead to data loss, in which case the server will
shut down.
====

(3)
+/* This is bit-map, not ordianal numbers */

There seems to be no comments using "bit-map". "Flags for ..." can be seen here and there.

I revmoed the comment and use (1 << n) notation to show the fact
instead.

(4)
Some wrong spellings:

swithing -> switching
alredy -> already
recovery -> recover
situaton -> situation

Oops! Fixed them.

(5)
+	table_close(classRel, NoLock);
+
+
+
+
}

These empty lines can be deleted.

s/can/should/ :p. Fixed.

(6)
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
...
+	 * Make an XLOG entry reporting the file unlink.

Not unlink but buffer persistence?

Silly copy-pasto. Fixed.

(7)
+	/*
+	 * index-init fork needs further initialization. ambuildempty shoud do
+	 * WAL-log and file sync by itself but otherwise we do that by myself.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_INDEX)
+		rel->rd_indam->ambuildempty(rel);
+	else
+	{
+		log_smgrcreate(&rnode, INIT_FORKNUM);
+		smgrimmedsync(srel, INIT_FORKNUM);
+	}
+
+	/*
+	 * We have created the init fork. If server crashes before the current
+	 * transaction ends the init fork left alone corrupts data while recovery.
+	 * The inittmp fork works as the sentinel to identify that situaton.
+	 */
+	smgrcreate(srel, INITTMP_FORKNUM, false);
+	log_smgrcreate(&rnode, INITTMP_FORKNUM);
+	smgrimmedsync(srel, INITTMP_FORKNUM);

If the server crashes between these two processings, only the init fork exists. Is it correct to create the inittmp fork first?

Right. I change it that way, and did the same with the new code added
to RelationCreateStorage.

(8)
+	if (inxact_created)
+	{
+		SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+		smgrclose(srel);
+		log_smgrunlink(&rnode, INIT_FORKNUM);
+		smgrunlink(srel, INIT_FORKNUM, false);
+		log_smgrunlink(&rnode, INITTMP_FORKNUM);
+		smgrunlink(srel, INITTMP_FORKNUM, false);
+		return;
+	}

smgrclose() should be called just before return.
Isn't it necessary here to revert buffer persistence state change?

Mmm. it's a thinko. I was confused with the case of
close/unlink. Fixed all instacnes of the same.

(9)
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+	smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}

Maybe it's better to restore smgrdounlinkfork() that was removed in the older release. That function includes dropping shared buffers, which can clean up the shared buffers that may be cached by this transaction.

INITFORK/INITTMP forks cannot be loaded to shared buffer so it's no
use to drop buffers. I added a comment like that.

| /*
| * INIT/INITTMP forks never be loaded to shared buffer so no point in
| * dropping buffers for these files.
| */
| log_smgrunlink(&rnode, INIT_FORKNUM);

I removed DropRelFileNodeBuffers from PDOP_UNLINK_FORK branch in
smgrDoPendingDeletes and added an assertion and a comment instead.

| /* other forks needs to drop buffers */
| Assert(pending->unlink_forknum == INIT_FORKNUM ||
| pending->unlink_forknum == INITTMP_FORKNUM);
|
| log_smgrunlink(&pending->relnode, pending->unlink_forknum);
| smgrunlink(srel, pending->unlink_forknum, false);

(10)
[RelationDropInitFork]
+	/* revert buffer-persistence changes at abort */
+	pending = (PendingRelDelete *)
+		MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+	pending->relnode = rnode;
+	pending->op = PDOP_SET_PERSISTENCE;
+	pending->bufpersistence = false;
+	pending->backend = InvalidBackendId;
+	pending->atCommit = true;
+	pending->nestLevel = GetCurrentTransactionNestLevel();
+	pending->next = pendingDeletes;
+	pendingDeletes = pending;
+}

bufpersistence should be true.

RelationDropInitFork() chnages the relation persisitence to
"persistent" so it shoud be reverted to "non-persistent (= false)" at
abort. (I agree that the function name is somewhat confusing...)

(11)
+				BlockNumber block = 0;
...
+				DropRelFileNodeBuffers(rbnode, &pending->unlink_forknum, 1,
+									   &block);

"block" is unnecessary and 0 can be passed directly.

I removed the entire function call.

But, I don't think you're right here.

| DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
| int nforks, BlockNumber *firstDelBlock)

Doesn't just passing 0 lead to SEGV?

(12)
-			&& pending->backend == InvalidBackendId)
+			&& pending->backend == InvalidBackendId &&
+			pending->op == PDOP_DELETE)
nrels++;

It's better to put && at the beginning of the line to follow the existing code here.

It's terrible.. Fixed.

(13)
+ table_close(rel, lockmode);

lockmode should be NoLock to retain the lock until transaction completion.

I tried to recall the reason for that, but didn't come up with
anything. Fixed.

(14)
+	ctl.keysize = sizeof(unlogged_relation_entry);
+	ctl.entrysize = sizeof(unlogged_relation_entry);
+	hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
...
+			memset(key.oid, 0, sizeof(key.oid));
+			memcpy(key.oid, de->d_name, oidchars);
+			ent = hash_search(hash, &key, HASH_FIND, NULL);

keysize should be the oid member of the struct.

It's not a problem since the first member is the oid and perhaps it
seems that I thougth to do someting more on that. Now that I don't
recall what is it and in the first place the key should be just Oid in
the context above. Fixed.

The patch is attached to the next message.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#12)
Re: In-placre persistance change of a relation

Hello.

At Thu, 24 Dec 2020 17:02:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

The patch is attached to the next message.

The reason for separating this message is that I modified this so that
it could solve another issue.

There's a complain about orphan files after crash. [1]

1: /messages/by-id/16771-cbef7d97ba93f4b9@postgresql.org

That is, the case where a relation file is left alone after a server
crash that happened before the end of the transaction that has created
a relation. As I read this, I noticed this feature can solve the
issue with a small change.

This version gets changes in RelationCreateStorage and
smgrDoPendingDeletes.

Previously inittmp fork is created only along with an init fork. This
version creates one always when a relation storage file is created. As
the result ResetUnloggedRelationsInDbspaceDir removes all forks if the
inttmp fork of a logged relations is found. Now that pendingDeletes
can contain multiple entries for the same relation, it has been
modified not to close the same smgr multiple times.

- It might be better to split 0001 into two peaces.

- The function name ResetUnloggedRelationsInDbspaceDir is no longer
represents the function correctly.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v2-0001-In-place-table-persistence-change.patchtext/x-patch; charset=us-asciiDownload+800-138
v2-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LOG.patchtext/x-patch; charset=us-asciiDownload+214-1
#14Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#13)
Re: In-placre persistance change of a relation

At Fri, 25 Dec 2020 09:12:52 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Hello.

At Thu, 24 Dec 2020 17:02:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

The patch is attached to the next message.

The reason for separating this message is that I modified this so that
it could solve another issue.

There's a complain about orphan files after crash. [1]

1: /messages/by-id/16771-cbef7d97ba93f4b9@postgresql.org

That is, the case where a relation file is left alone after a server
crash that happened before the end of the transaction that has created
a relation. As I read this, I noticed this feature can solve the
issue with a small change.

This version gets changes in RelationCreateStorage and
smgrDoPendingDeletes.

Previously inittmp fork is created only along with an init fork. This
version creates one always when a relation storage file is created. As
the result ResetUnloggedRelationsInDbspaceDir removes all forks if the
inttmp fork of a logged relations is found. Now that pendingDeletes
can contain multiple entries for the same relation, it has been
modified not to close the same smgr multiple times.

- It might be better to split 0001 into two peaces.

- The function name ResetUnloggedRelationsInDbspaceDir is no longer
represents the function correctly.

As pointed by Robert in another thread [1], persisntence of (at least)
GiST index cannot be flipped in-place due to incompatibility of fake
LSNs with real ones.

This version RelationChangePersistence() is changed not to choose
in-place method for indexes other than btree. It seems to be usable
with all kind of indexes other than Gist, but at the mement it applies
only to btrees.

1: /messages/by-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v3-0001-In-place-table-persistence-change.patchtext/x-patch; charset=us-asciiDownload+854-141
v3-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LOG.patchtext/x-patch; charset=us-asciiDownload+214-1
#15Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#14)
Re: In-placre persistance change of a relation

At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

This version RelationChangePersistence() is changed not to choose
in-place method for indexes other than btree. It seems to be usable
with all kind of indexes other than Gist, but at the mement it applies
only to btrees.

1: /messages/by-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com

Hmm. This is not wroking correctly. I'll repost after fixint that.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#16Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#15)
Re: In-placre persistance change of a relation

At Fri, 08 Jan 2021 17:52:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

This version RelationChangePersistence() is changed not to choose
in-place method for indexes other than btree. It seems to be usable
with all kind of indexes other than Gist, but at the mement it applies
only to btrees.

1: /messages/by-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com

Hmm. This is not wroking correctly. I'll repost after fixint that.

I think I fixed the misbehavior. ResetUnloggedRelationsInDbspaceDir()
handles file operations in the wrong order and with the wrong logic.
It also needed to drop buffers and forget fsync requests.

I thought that the two cases that this patch is expected to fix
(orphan relation files and uncommited init files) can share the same
"cleanup" fork but that is wrong. I had to add one more additional
fork to differentiate the cases of SET UNLOGGED and of creation of
UNLOGGED tables...

The attached is a new version, that seems working correctly but looks
somewhat messy. I'll continue working.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v4-0001-In-place-table-persistence-change.patchtext/x-patch; charset=us-asciiDownload+1034-168
v4-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LOG.patchtext/x-patch; charset=us-asciiDownload+214-1
#17Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#16)
Re: In-placre persistance change of a relation

At Tue, 12 Jan 2021 18:58:08 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Fri, 08 Jan 2021 17:52:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

This version RelationChangePersistence() is changed not to choose
in-place method for indexes other than btree. It seems to be usable
with all kind of indexes other than Gist, but at the mement it applies
only to btrees.

1: /messages/by-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com

Hmm. This is not wroking correctly. I'll repost after fixint that.

I think I fixed the misbehavior. ResetUnloggedRelationsInDbspaceDir()
handles file operations in the wrong order and with the wrong logic.
It also needed to drop buffers and forget fsync requests.

I thought that the two cases that this patch is expected to fix
(orphan relation files and uncommited init files) can share the same
"cleanup" fork but that is wrong. I had to add one more additional
fork to differentiate the cases of SET UNLOGGED and of creation of
UNLOGGED tables...

The attached is a new version, that seems working correctly but looks
somewhat messy. I'll continue working.

Commit bea449c635 conflicts with this on the change of the definition
of DropRelFileNodeBuffers. The change simplified this patch by a bit:p

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v5-0001-In-place-table-persistence-change.patchtext/x-patch; charset=us-asciiDownload+1028-168
v5-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LOG.patchtext/x-patch; charset=us-asciiDownload+214-1
#18Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#17)
Re: In-placre persistance change of a relation

(I'm not sure when the subject was broken..)

At Thu, 14 Jan 2021 17:32:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Commit bea449c635 conflicts with this on the change of the definition
of DropRelFileNodeBuffers. The change simplified this patch by a bit:p

In this version, I got rid of the "CLEANUP FORK"s, and added a new
system "Smgr marks". The mark files have the name of the
corresponding fork file followed by ".u" (which means Uncommitted.).
"Uncommited"-marked main fork means the same as the CLEANUP2_FORKNUM
and uncommitted-marked init fork means the same as the CLEANUP_FORKNUM
in the previous version.x

I noticed that the previous version of the patch still leaves an
orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
revmoed at rollback. In this version the responsibility to remove the
mark files is moved to SyncPostCheckpoint, where main fork files are
actually removed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v6-0001-In-place-table-persistence-change.patchtext/x-patch; charset=us-asciiDownload+1384-207
v6-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LOG.patchtext/x-patch; charset=us-asciiDownload+214-1
#19Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#18)
Re: In-placre persistance change of a relation

At Thu, 25 Mar 2021 14:08:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

(I'm not sure when the subject was broken..)

At Thu, 14 Jan 2021 17:32:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Commit bea449c635 conflicts with this on the change of the definition
of DropRelFileNodeBuffers. The change simplified this patch by a bit:p

In this version, I got rid of the "CLEANUP FORK"s, and added a new
system "Smgr marks". The mark files have the name of the
corresponding fork file followed by ".u" (which means Uncommitted.).
"Uncommited"-marked main fork means the same as the CLEANUP2_FORKNUM
and uncommitted-marked init fork means the same as the CLEANUP_FORKNUM
in the previous version.x

I noticed that the previous version of the patch still leaves an
orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
revmoed at rollback. In this version the responsibility to remove the
mark files is moved to SyncPostCheckpoint, where main fork files are
actually removed.

For the record, I noticed that basebackup could be confused by the
mark files but I haven't looked that yet.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#20Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Kyotaro Horiguchi (#19)
RE: In-placre persistance change of a relation

Kyotaro wrote:

In this version, I got rid of the "CLEANUP FORK"s, and added a new
system "Smgr marks". The mark files have the name of the
corresponding fork file followed by ".u" (which means Uncommitted.).
"Uncommited"-marked main fork means the same as the

CLEANUP2_FORKNUM

and uncommitted-marked init fork means the same as the

CLEANUP_FORKNUM

in the previous version.x

I noticed that the previous version of the patch still leaves an
orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
revmoed at rollback. In this version the responsibility to remove the
mark files is moved to SyncPostCheckpoint, where main fork files are
actually removed.

For the record, I noticed that basebackup could be confused by the mark files
but I haven't looked that yet.

Good morning Kyotaro,

the patch didn't apply clean (it's from March; some hunks were failing), so I've fixed it and the combined git format-patch is attached. It did conflict with the following:
b0483263dda - Add support for SET ACCESS METHOD in ALTER TABLE
7b565843a94 - Add call to object access hook at the end of table rewrite in ALTER TABLE
9ce346eabf3 - Report progress of startup operations that take a long time.
f10f0ae420 - Replace RelationOpenSmgr() with RelationGetSmgr().

I'm especially worried if I didn't screw up something/forgot something related to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests passed".

Basic demonstration of this patch (with wal_level=minimal):
create unlogged table t6 (id bigint, t text);
-- produces 110GB table, takes ~5mins
insert into t6 select nextval('s1'), repeat('A', 1000) from generate_series(1, 100000000);
alter table t6 set logged;
on baseline SET LOGGED takes: ~7min10s
on patched SET LOGGED takes: 25s

So basically one can - thanks to this patch - use his application (performing classic INSERTs/UPDATEs/DELETEs, so without the need to rewrite to use COPY) and perform literally batch upload and then just switch the tables to LOGGED.

Some more intensive testing also looks good, assuming table prepared to put pressure on WAL:
create unlogged table t_unlogged (id bigint, t text) partition by hash (id);
create unlogged table t_unlogged_h0 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 0);
[..]
create unlogged table t_unlogged_h3 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 3);

Workload would still be pretty heavy on LWLock/BufferContent,WALInsert and Lock/extend .
t_logged.sql = insert into t_logged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes generates ~1MB of WAL
t_unlogged.sql = insert into t_unlogged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes generates ~3kB of WAL

so using: pgbench -f <tabletypetest>.sql -T 30 -P 1 -c 32 -j 3 t
with synchronous_commit =ON(default):
with t_logged.sql: tps = 229 (lat avg = 138ms)
with t_unlogged.sql tps = 283 (lat avg = 112ms) # almost all on LWLock/WALWrite
with synchronous_commit =OFF:
with t_logged.sql: tps = 413 (lat avg = 77ms)
with t_unloged.sql: tps = 782 (lat avg = 40ms)
Afterwards switching the unlogged ~16GB partitions takes 5s per partition.

As the thread didn't get a lot of traction, I've registered it into current commitfest https://commitfest.postgresql.org/36/3461/ with You as the author and in 'Ready for review' state.
I think it behaves as almost finished one and apparently after reading all those discussions that go back over 10years+ time span about this feature, and lot of failed effort towards wal_level=noWAL I think it would be nice to finally start getting some of that of it into the core.

-Jakub Wartak.

Attachments:

v7-0001-In-place-table-persistence-change-with-new-comman.patchapplication/octet-stream; name=v7-0001-In-place-table-persistence-change-with-new-comman.patchDownload+1583-180
#21Justin Pryzby
pryzby@telsasoft.com
In reply to: Jakub Wartak (#20)
#22Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Justin Pryzby (#21)
#23Justin Pryzby
pryzby@telsasoft.com
In reply to: Jakub Wartak (#22)
#24Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jakub Wartak (#20)
#25Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#24)
#26Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Kyotaro Horiguchi (#24)
#27Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jakub Wartak (#26)
#28Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#27)
#29Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Kyotaro Horiguchi (#28)
#30Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jakub Wartak (#29)
#31Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#30)
#32Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Kyotaro Horiguchi (#31)
#33Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jakub Wartak (#32)
#34Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Kyotaro Horiguchi (#33)
#35Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jakub Wartak (#34)
#36Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#35)
#37Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#36)
#38Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#37)
#39Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#38)
#40Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#39)
#41Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#40)
#42Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Kyotaro Horiguchi (#35)
#43Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jakub Wartak (#42)
#44Julien Rouhaud
rjuju123@gmail.com
In reply to: Kyotaro Horiguchi (#43)
#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Julien Rouhaud (#44)
#46Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#45)
#47Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#46)
#48Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#47)
#49Justin Pryzby
pryzby@telsasoft.com
In reply to: Kyotaro Horiguchi (#47)
#50Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Justin Pryzby (#49)
#51Justin Pryzby
pryzby@telsasoft.com
In reply to: Kyotaro Horiguchi (#50)
#52Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Justin Pryzby (#51)
#53Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#52)
#54Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Kyotaro Horiguchi (#53)
#55Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jacob Champion (#54)
#56Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#55)
#57Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#56)
#58Ian Lawrence Barwick
barwick@gmail.com
In reply to: Kyotaro Horiguchi (#57)
#59Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Ian Lawrence Barwick (#58)
#60Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#59)
#61Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro Horiguchi (#60)
#62Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#61)
#63Gregory Stark (as CFM)
stark.cfm@gmail.com
In reply to: Kyotaro Horiguchi (#62)
#64Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Gregory Stark (as CFM) (#63)
#65Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#64)
#66Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#65)
#67Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Kyotaro Horiguchi (#66)
#68Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jakub Wartak (#67)
#69Nathan Bossart
nathandbossart@gmail.com
In reply to: Kyotaro Horiguchi (#68)
#70Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Nathan Bossart (#69)
#71Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#70)
#72vignesh C
vignesh21@gmail.com
In reply to: Kyotaro Horiguchi (#71)
#73Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: vignesh C (#72)
#74Peter Smith
smithpb2250@gmail.com
In reply to: Kyotaro Horiguchi (#73)
#75Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Peter Smith (#74)
#76Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#75)
#77Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Kyotaro Horiguchi (#76)
#78Michael Paquier
michael@paquier.xyz
In reply to: Jelte Fennema-Nio (#77)
#79Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#78)
#80Nathan Bossart
nathandbossart@gmail.com
In reply to: Kyotaro Horiguchi (#79)
#81Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#80)
#82Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#81)
#83Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro Horiguchi (#82)
#84Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#83)
#85Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#83)
#86Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#84)
#87Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro Horiguchi (#82)
#88Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#83)
#89Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro Horiguchi (#88)
#90Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#83)
#91Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#90)
#92Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#90)
#93Thom Brown
thom@linux.com
In reply to: Kyotaro Horiguchi (#92)
#94Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thom Brown (#93)