using file cloning in create database / initdb

Started by Andres Freund; 3 messages; list: hackers
#1 Andres Freund
andres@anarazel.de

Hi,

This thread started at /messages/by-id/20220213021746.GM31460@telsasoft.com
but is mostly independent, so I split it off.

On 2022-02-12 20:17:46 -0600, Justin Pryzby wrote:

> On Sat, Feb 12, 2022 at 06:00:44PM -0800, Andres Freund wrote:
> > I bet using COW file copies would speed up our own regression tests noticeably
> > - on slower systems we spend a fair bit of time and space creating template0
> > and postgres, with the bulk of the data never changing.
> >
> > Template databases are also fairly commonly used by application developers to
> > avoid the cost of rerunning all the setup DDL & initial data loading for
> > different tests. Making that measurably cheaper would be a significant win.
>
> +1
>
> I ran into this last week and was still thinking about proposing it.
>
> Would this help CI

It could theoretically help linux - but currently I think the filesystem for
CI is ext4, which doesn't support FICLONE. I assume it'd help macos, but I
don't know the performance characteristics of copyfile(). I don't think any of
the other OSs have working reflink / file clone support.

You could prototype it for CI on macos by using the "template initdb" patch
and passing -c to cp.

On linux it might be worth using copy_file_range(), if supported, if not file
cloning. But that's kind of an even more separate topic...
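
For illustration only (this is not code from the thread or from any patch): a minimal C sketch of the "clone if the filesystem can, otherwise copy" idea on Linux. The function names clone_or_copy_file() and copy_bytes() are hypothetical. It tries ioctl(FICLONE) first and, where that fails (as it would on ext4), falls back to a plain read()/write() loop. A copy_file_range() step could sit between the two; it is left out here to keep the sketch portable.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#ifdef __linux__
#include <linux/fs.h>           /* FICLONE */
#endif

/* Fallback path: ordinary byte copy. Returns 0, or -1 with errno set. */
static int
copy_bytes(int src_fd, int dst_fd)
{
    char        buf[65536];

    for (;;)
    {
        ssize_t     nread = read(src_fd, buf, sizeof(buf));
        ssize_t     off = 0;

        if (nread < 0)
        {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (nread == 0)
            return 0;           /* end of file */

        /* write() may be partial, so loop until the chunk is out */
        while (off < nread)
        {
            ssize_t     nwritten = write(dst_fd, buf + off, nread - off);

            if (nwritten < 0)
            {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += nwritten;
        }
    }
}

/*
 * Copy src to a new file dst (hypothetical helper).
 * Returns 1 if the file was cloned, 0 if it was copied byte by byte,
 * and -1 on failure with errno set.
 */
static int
clone_or_copy_file(const char *src, const char *dst)
{
    int         src_fd;
    int         dst_fd;
    int         result = -1;
    int         save_errno;

    if ((src_fd = open(src, O_RDONLY)) < 0)
        return -1;
    if ((dst_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0)
    {
        save_errno = errno;
        close(src_fd);
        errno = save_errno;
        return -1;
    }

#ifdef FICLONE
    /* Fast path: share blocks with the source (e.g. btrfs, reflink XFS). */
    if (ioctl(dst_fd, FICLONE, src_fd) == 0)
        result = 1;
#endif
    /* Cloning unavailable or unsupported: do a real copy instead. */
    if (result < 0)
        result = copy_bytes(src_fd, dst_fd);

    save_errno = errno;
    close(src_fd);
    if (close(dst_fd) < 0 && result >= 0)
    {
        result = -1;
        save_errno = errno;
    }
    if (result < 0)
        unlink(dst);            /* do not leave a partial file behind */
    errno = save_errno;
    return result;
}
```

A caller can tell from the return value which path was taken, which is handy when checking whether a given test filesystem actually supports cloning.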

> or any significant fraction of buildfarm ?

Not sure how many are on new enough linux / mac to benefit and use a suitable
filesystem. There are a few animals with slow-ish storage but running fairly
new linux. Don't think we can see the FS. Those would likely benefit the most.

> Or just tests run locally on supporting filesystems.

Probably depends on your storage subsystem. If not that fast, and running
tests concurrently, it'd likely help.

On my workstation, with lots of cores and very fast storage, using the initdb
caching patch modified to do cp --reflink=never / always yields the following
time for concurrent check-world (-j40 PROVE_FLAGS=-j4):

cp --reflink=never:

96.64user 61.74system 1:04.69elapsed 244%CPU (0avgtext+0avgdata 97544maxresident)k
0inputs+34124296outputs (2584major+7247038minor)pagefaults 0swaps
pcheck-world-success

cp --reflink=always:

91.79user 56.16system 1:04.21elapsed 230%CPU (0avgtext+0avgdata 97716maxresident)k
189328inputs+16361720outputs (2674major+7229696minor)pagefaults 0swaps
pcheck-world-success

Seems roughly stable across three runs.

Just comparing the time for cp -r of a fresh initdb'd cluster:
cp -a --reflink=never
real 0m0.043s
user 0m0.000s
sys 0m0.043s
cp -a --reflink=always
real 0m0.021s
user 0m0.004s
sys 0m0.018s

so that's a pretty nice win.

> Note that pg_upgrade already supports copy/link/clone. (Obviously, link
> wouldn't do anything desirable for CREATE DATABASE).

Yea. We'd likely have to move relevant code into src/port.

Greetings,

Andres Freund

#2 Justin Pryzby
pryzby@telsasoft.com
In reply to: Andres Freund (#1)
Re: using file cloning in create database / initdb

On Sat, Feb 12, 2022 at 07:37:30PM -0800, Andres Freund wrote:

> On 2022-02-12 20:17:46 -0600, Justin Pryzby wrote:
> > On Sat, Feb 12, 2022 at 06:00:44PM -0800, Andres Freund wrote:
> > > I bet using COW file copies would speed up our own regression tests noticeably
> > > - on slower systems we spend a fair bit of time and space creating template0
> > > and postgres, with the bulk of the data never changing.
> > >
> > > Template databases are also fairly commonly used by application developers to
> > > avoid the cost of rerunning all the setup DDL & initial data loading for
> > > different tests. Making that measurably cheaper would be a significant win.
> >
> > +1
> >
> > I ran into this last week and was still thinking about proposing it.
> >
> > Would this help CI
>
> It could theoretically help linux - but currently I think the filesystem for
> CI is ext4, which doesn't support FICLONE. I assume it'd help macos, but I
> don't know the performance characteristics of copyfile(). I don't think any of
> the other OSs have working reflink / file clone support.
>
> You could prototype it for CI on macos by using the "template initdb" patch
> and passing -c to cp.

Yes, copyfile() in CREATE DATABASE seems to help cirrus/darwin a bit.
https://cirrus-ci.com/task/5277139049119744

On xfs:
postgres=# CREATE DATABASE new3 TEMPLATE postgres STRATEGY FILE_COPY ;
2022-07-31 00:21:28.350 CDT [2347] LOG: checkpoint starting: immediate force wait flush-all
...
CREATE DATABASE
Time: 1296.243 ms (00:01.296)

postgres=# CREATE DATABASE new4 TEMPLATE postgres STRATEGY FILE_CLONE;
2022-07-31 00:21:38.697 CDT [2347] LOG: checkpoint starting: immediate force wait flush-all
...
CREATE DATABASE
Time: 167.152 ms

--
Justin

Attachments:

0001-WIP-support-file-cloning-in-CREATE-DATABASE.patch (text/x-diff; charset=us-ascii; +125 -46)

#3 Thomas Munro
thomas.munro@gmail.com
In reply to: Justin Pryzby (#2)
Re: using file cloning in create database / initdb

On Tue, Aug 2, 2022 at 6:15 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

> On Sat, Feb 12, 2022 at 07:37:30PM -0800, Andres Freund wrote:
> > It could theoretically help linux - but currently I think the filesystem for
> > CI is ext4, which doesn't support FICLONE. I assume it'd help macos, but I
> > don't know the performance characteristics of copyfile(). I don't think any of
> > the other OSs have working reflink / file clone support.

Just BTW, I think Solaris (on its closed source ZFS) can also do
this[1], but I doubt anyone will be along soon to write the patch for
that.

More interestingly to me at least, it looks like OpenZFS is getting
ready to ship its reflink feature, called BRT (block reference
tracking). I guess it'll just work on Linux and macOS, and for
FreeBSD there may be a new syscall to patch into this code...

> +    if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
> +        return 1;

Why would you return 1, and not -1 (system call style), for failure?

Hmm, I don't think we can do plain open() from a regular backend
without worrying about EMFILE/ENFILE (a problem that pg_upgrade
doesn't have to deal with). Perhaps copydir() should double-wrap the
call to clone_file() in ReserveExternalFD()/ReleaseExternalFD()?
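
A rough C sketch of the two points above, for illustration only; it is not taken from the WIP patch. It shows a clone helper that reports failure system-call style (-1 with errno preserved) and a caller that accounts for the helper's descriptors around the call. PostgreSQL's ReserveExternalFD() and ReleaseExternalFD() live in the backend and cannot be used standalone, so demo_reserve_external_fd() and demo_release_external_fd() below are stand-in stubs that only keep a counter. Reading "double-wrap" as one reservation per descriptor (source and destination) is an assumption, and clone_file_demo() and clone_file_reserved() are hypothetical names.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Stand-ins for the backend's external-FD accounting. These stubs only
 * count; they do not free up descriptors the way the real backend
 * functions are meant to.
 */
static int  demo_external_fds = 0;

static void
demo_reserve_external_fd(void)
{
    demo_external_fds++;
}

static void
demo_release_external_fd(void)
{
    demo_external_fds--;
}

/*
 * Hypothetical clone helper: opens both files and returns 0 on success,
 * -1 on failure with errno left as set by the failing call.
 */
static int
clone_file_demo(const char *src, const char *dst)
{
    int         src_fd;
    int         dst_fd;
    int         save_errno;

    if ((src_fd = open(src, O_RDONLY)) < 0)
        return -1;              /* system call style, errno set by open() */

    if ((dst_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0)
    {
        save_errno = errno;
        close(src_fd);
        errno = save_errno;     /* close() must not clobber the real cause */
        return -1;
    }

    /* A real implementation would clone src_fd into dst_fd here. */

    close(src_fd);
    close(dst_fd);
    return 0;
}

/* Caller side: reserve one slot per descriptor the helper will open. */
static int
clone_file_reserved(const char *src, const char *dst)
{
    int         rc;
    int         save_errno;

    demo_reserve_external_fd();     /* for src_fd */
    demo_reserve_external_fd();     /* for dst_fd */

    rc = clone_file_demo(src, dst);
    save_errno = errno;

    demo_release_external_fd();
    demo_release_external_fd();

    errno = save_errno;
    return rc;
}
```

Releasing on both the success and failure paths keeps the accounting balanced, and saving errno around the cleanup lets the caller report the original error.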

[1]: https://blogs.oracle.com/solaris/post/reflink3c-what-is-it-why-do-i-care-and-how-can-i-use-it