[PoC] Non-volatile WAL buffer

Started by Takashi Menjo · about 6 years ago · 71 messages · pgsql-hackers
#1 Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp

Dear hackers,

I propose "non-volatile WAL buffer," a proof-of-concept new feature. It
makes WAL records durable without writing them out to WAL segment files,
by placing the WAL buffer on persistent memory (PMEM) instead of DRAM. It
improves database performance by reducing copies of WAL records and by
shortening the time taken by write transactions.

I attach the first patchset, which can be applied to PostgreSQL 12.0
(refs/tags/REL_12_0). Please see README.nvwal (added by patch 0003) to use
the new feature.

PMEM [1] is fast, non-volatile, and byte-addressable memory installed into
DIMM slots. Such products are already available. For example, an NVDIMM-N
is a type of PMEM module that contains both DRAM and NAND flash. It can be
accessed like regular DRAM, but on power loss it saves its contents into
its flash area. On power restore, it performs the reverse: the contents
are copied back into DRAM. PMEM is also already supported by major
operating systems such as Linux and Windows, and by new open-source
libraries such as the Persistent Memory Development Kit (PMDK) [2].
Furthermore, several DBMSes have started to support PMEM.

It's time for PostgreSQL. PMEM is faster than a solid state disk and can
naively be used as block storage. However, we cannot gain much performance
that way, because PMEM is so fast that the overhead of traditional
software stacks, such as user buffers, filesystems, and block layers,
becomes unignorable. Non-volatile WAL buffer is an effort to make
PostgreSQL PMEM-aware, that is, to access PMEM directly as RAM, bypassing
such overhead and achieving the maximum possible benefit. I believe WAL is
one of the most important modules to redesign for PMEM, because it has
been designed around slow disks such as HDDs and SSDs, which PMEM is not.

This work is inspired by "Non-volatile Memory Logging," presented at PGCon
2016 [3], and aims to gain more benefit from PMEM than my and Yoshimi's
previous work [4][5] did. I submitted a talk proposal for this year's
PGCon, and have measured and analyzed the performance of my PostgreSQL
with non-volatile WAL buffer, compared with the original one that uses
PMEM as "a faster-than-SSD storage." I will talk about the results if
accepted.

Best regards,
Takashi Menjo

[1]: Persistent Memory (SNIA)
     https://www.snia.org/PM
[2]: Persistent Memory Development Kit (pmem.io)
     https://pmem.io/pmdk/
[3]: Non-volatile Memory Logging (PGCon 2016)
     https://www.pgcon.org/2016/schedule/track/Performance/945.en.html
[4]: Introducing PMDK into PostgreSQL (PGCon 2018)
     https://www.pgcon.org/2018/schedule/events/1154.en.html
[5]: Applying PMDK to WAL operations for persistent memory (pgsql-hackers)
     /messages/by-id/C20D38E97BCB33DAD59E3A1@lab.ntt.co.jp

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

Attachments:

0001-Support-GUCs-for-external-WAL-buffer.patch (+560 -23)
0002-Non-volatile-WAL-buffer.patch (+948 -104)
0003-README-for-non-volatile-WAL-buffer.patch (+184 -1)
#2 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Takashi Menjo (#1)
Re: [PoC] Non-volatile WAL buffer

Hello,

+1 on the idea.

By quickly looking at the patch, I notice that there are no tests.

Is it possible to emulate something without the actual hardware, at least
for testing purposes?

--
Fabien.

#3 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Takashi Menjo (#1)
Re: [PoC] Non-volatile WAL buffer

On 24/01/2020 10:06, Takashi Menjo wrote:

> I propose "non-volatile WAL buffer," a proof-of-concept new feature. It
> makes WAL records durable without writing them out to WAL segment files,
> by placing the WAL buffer on persistent memory (PMEM) instead of DRAM. It
> improves database performance by reducing copies of WAL records and by
> shortening the time taken by write transactions.
>
> I attach the first patchset, which can be applied to PostgreSQL 12.0
> (refs/tags/REL_12_0). Please see README.nvwal (added by patch 0003) to use
> the new feature.

I have the same comments on this that I had on the previous patch, see:

/messages/by-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8@iki.fi

- Heikki

#4 Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Fabien COELHO (#2)
RE: [PoC] Non-volatile WAL buffer

Hello Fabien,

Thank you for your +1 :)

> Is it possible to emulate something without the actual hardware, at least
> for testing purposes?

Yes, you can emulate PMEM using DRAM on Linux, via the "memmap=nnG!ssG"
kernel parameter. Please see [1] and [2] for emulation details. If your
emulation does not work well, please check whether the kernel
configuration options for PMEM and DAX (like CONFIG_FOOBAR; see [1] and
[3]) are set up properly.
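
For example (the values are illustrative; adjust them to your system's
memory map):

    # Kernel parameter: use 4GiB of DRAM starting at offset 12GiB as PMEM
    memmap=4G!12G

    # After reboot, /dev/pmem0 should appear; format and mount it with DAX
    $ sudo mkfs.ext4 /dev/pmem0
    $ sudo mount -o dax /dev/pmem0 /mnt/pmem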

Best regards,
Takashi

[1]: How to Emulate Persistent Memory Using Dynamic Random-access Memory (DRAM)
     https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
[2]: how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
     https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
[3]: Persistent Memory Wiki
     https://nvdimm.wiki.kernel.org/

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

#5 Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Heikki Linnakangas (#3)
RE: [PoC] Non-volatile WAL buffer

Hello Heikki,

> I have the same comments on this that I had on the previous patch, see:
>
> /messages/by-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8@iki.fi

Thanks. I re-read your messages [1][2]. What you meant, AFAIU, is to use
memory-mapped WAL segment files as WAL buffers, and to sync inserted WAL
records by switching between CPU cache-flush instructions and msync(),
depending on whether the segment files are on PMEM or not.
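
That is, something like this sketch (the function and flag names here are
mine, not from any patch; the PMEM case relies on libpmem):

    /* Sketch: make [start, start + len) of a mapped WAL segment durable. */
    static void
    flush_wal_range(char *start, size_t len, bool map_is_pmem)
    {
        if (map_is_pmem)
            pmem_persist(start, len);          /* cache flush + fence */
        else if (pmem_msync(start, len) != 0)  /* page-aligns, then msync() */
            elog(PANIC, "could not msync WAL segment: %m");
    }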

It sounds reasonable, but I'm sorry that I haven't tested such a program
yet. I'll try it and compare it with my non-volatile WAL buffer. For now,
I'm a little worried about the overhead of mmap()/munmap() for each WAL
segment file.

You also mentioned a SIGBUS problem with memory-mapped I/O. I think it's
true for reading from bad memory blocks, as you mentioned, and also true
for writing to such blocks [3]. Handling SIGBUS properly, or working
around it, is future work.

Best regards,
Takashi

[1]: /messages/by-id/83eafbfd-d9c5-6623-2423-7cab1be3888c@iki.fi
[2]: /messages/by-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8@iki.fi
[3]: https://pmem.io/2018/11/26/bad-blocks.htm

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Takashi Menjo (#5)
Re: [PoC] Non-volatile WAL buffer

On Mon, Jan 27, 2020 at 2:01 AM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:

> It sounds reasonable, but I'm sorry that I haven't tested such a program
> yet. I'll try it and compare it with my non-volatile WAL buffer. For now,
> I'm a little worried about the overhead of mmap()/munmap() for each WAL
> segment file.

I guess the question here is how the cost of one mmap() and munmap()
pair per WAL segment (normally 16MB) compares to the cost of one
write() per block (normally 8kB). It could be that mmap() is a more
expensive call than write(), but by a small enough margin that the
vastly reduced number of system calls makes it a winner. But that's
just speculation, because I don't know how heavy mmap() actually is.

I have a different concern. I think that, right now, when we reuse a
WAL segment, we write entire blocks at a time, so the old contents of
the WAL segment are overwritten without ever being read. But that
behavior might not be maintained when using mmap(). It might be that
as soon as we write the first byte to a mapped page, the old contents
have to be faulted into memory. Indeed, it's unclear how it could be
otherwise, since the VM page must be made read-write at that point and
the system cannot know that we will overwrite the whole page. But
reading in the old contents of a recycled WAL file just to overwrite
them seems like it would be disastrously expensive.
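
To spell the scenario out (a hypothetical sketch, not code from any
patch):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* Open a recycled WAL segment (the file name is illustrative). */
        int     fd = open("000000010000000000000001", O_RDWR);
        size_t  segsize = 16 * 1024 * 1024;
        char   *seg;

        if (fd < 0)
            return 1;
        seg = mmap(NULL, segsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (seg == MAP_FAILED)
            return 1;

        /*
         * With write(), an aligned full-block write can replace the old
         * contents without reading them.  With a shared mapping, the first
         * store into each page typically faults the old page in from disk
         * before the new bytes land.
         */
        memset(seg, 0, 8192);

        munmap(seg, segsize);
        close(fd);
        return 0;
    }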

A related, but more minor, concern is whether there are any
differences in the write-back behavior when modifying a mapped
region vs. using write(). Either way, the same pages of the same file
will get dirtied, but the kernel might not have the same idea in
either case about when the changed pages should be written back down
to disk, and that could make a big difference to performance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Robert Haas (#6)
RE: [PoC] Non-volatile WAL buffer

Hello Robert,

I think our concerns are roughly classified into two:

(1) Performance
(2) Consistency

And your "different concern" falls rather under (2), I think.

I'm also worried about it, but I have no good answer for now. I suppose
mmap() with MAP_SHARED, called by multiple backend processes for the same
file, works consistently for both PMEM and non-PMEM devices. However, I
have not found any evidence, such as specification documents, yet.

I also made a tiny program calling memcpy() and msync() in parallel on the
same mmap()-ed file but on mutually distinct address ranges, and found no
corrupted data. However, that result does not ensure the consistency I'm
worried about. I could give it up if there *were* corrupted data...

So I will address (1) first. I will test the approach Heikki suggested, to
answer whether the cost of mmap() and munmap() per WAL segment, etc., is
reasonable or not. If it really is, then I will move on to (2).
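
That tiny program was along these lines (a simplified sketch, not the
exact code): parent and child memcpy() and msync() mutually distinct,
page-aligned halves of one shared mapping.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define FILESIZE 8192           /* two 4kB pages */
    #define HALF     (FILESIZE / 2)

    int
    main(void)
    {
        char    src[HALF];
        int     fd = open("testfile", O_RDWR | O_CREAT, 0600);
        char   *p;

        if (fd < 0 || ftruncate(fd, FILESIZE) != 0)
            return 1;
        p = mmap(NULL, FILESIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        if (fork() == 0)
        {
            /* child: write and sync the first half */
            memset(src, 'A', HALF);
            memcpy(p, src, HALF);
            msync(p, HALF, MS_SYNC);
            _exit(0);
        }
        /* parent: write and sync the second, disjoint half */
        memset(src, 'B', HALF);
        memcpy(p + HALF, src, HALF);
        msync(p + HALF, HALF, MS_SYNC);
        wait(NULL);
        return 0;
    }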

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Takashi Menjo (#7)
Re: [PoC] Non-volatile WAL buffer

On Tue, Jan 28, 2020 at 3:28 AM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:

> I think our concerns are roughly classified into two:
>
> (1) Performance
> (2) Consistency
>
> And your "different concern" falls rather under (2), I think.

Actually, I think it was mostly a performance concern (writes
triggering lots of reading) but there might be a consistency issue as
well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#6)
Re: [PoC] Non-volatile WAL buffer

Hi,

On 2020-01-27 13:54:38 -0500, Robert Haas wrote:

> On Mon, Jan 27, 2020 at 2:01 AM Takashi Menjo
> <takashi.menjou.vg@hco.ntt.co.jp> wrote:
>> It sounds reasonable, but I'm sorry that I haven't tested such a program
>> yet. I'll try it and compare it with my non-volatile WAL buffer. For now,
>> I'm a little worried about the overhead of mmap()/munmap() for each WAL
>> segment file.
>
> I guess the question here is how the cost of one mmap() and munmap()
> pair per WAL segment (normally 16MB) compares to the cost of one
> write() per block (normally 8kB). It could be that mmap() is a more
> expensive call than write(), but by a small enough margin that the
> vastly reduced number of system calls makes it a winner. But that's
> just speculation, because I don't know how heavy mmap() actually is.

mmap()/munmap() on a regular basis does have pretty bad scalability
impacts. I don't think they'd fully hit us, however, because we're not in
a threaded world.

My issue with the proposal to go towards mmap()/munmap() is that I think
doing so forecloses a lot of improvements. Even today, on fast storage,
using open_datasync is faster (at least when somehow hitting the
O_DIRECT path, which isn't that easy these days) - and that's despite it
being really unoptimized. I think our WAL scalability is a serious
issue. There's a fair bit that we can improve just by fixing things,
without really changing the way we do IO:

- Split WALWriteLock into one lock for writing and one for flushing the
  WAL. Right now we prevent other sessions from writing out WAL - even
  to other segments - when one session is doing a WAL flush. But there's
  absolutely no need for that.
- Stop increasing the size of the flush request to the max when flushing
  WAL (cf "try to write/flush later additions to XLOG as well" in
  XLogFlush()) - that currently reduces throughput in OLTP workloads
  quite noticeably. It made some sense in the spinning-disk times, but I
  don't think it does for a halfway decent SSD. By writing the maximum
  ready to write, we hold the lock for longer, increasing latency for
  the committing transaction *and* preventing more WAL from being written.
- We should immediately ask the OS to start writing back full XLOG pages
  to disk. Right now the IO for that will never be started before the
  commit comes around in an OLTP workload, which means that we just
  waste the time between XLogWrite() and the commit.

That'll gain us 2-3x, I think. But after that I think we're going to
have to actually change more fundamentally how we do IO for WAL
writes. Using async IO I can do something like 18k individual durable 8kB
writes (using O_DSYNC) a second, at a queue depth of 32. On my laptop. If
I make them 4kB writes, it's 22k.

That's not directly comparable with postgres WAL flushes, of course, as
those are all separate blocks, whereas WAL will often end up overwriting
the last block. But it doesn't account for group commits either, which
we *constantly* end up doing.

Postgres manages somewhere between ~450 (multiple users) and ~800 (single
user) individually durable WAL writes per second on the same hardware.
Yes, that's more than an order of magnitude less. Of course some of that
is just that postgres does more than just IO - but that effect is not on
the order of a magnitude.
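
By "individual durable writes at a queue depth of 32" I mean something
along the lines of this liburing sketch (an illustration, not my actual
benchmark code):

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdlib.h>
    #include <string.h>

    #define QD       32
    #define BLOCKSZ  8192
    #define NWRITES  100000

    int
    main(void)
    {
        struct io_uring ring;
        /* O_DSYNC: each completed write is durable on its own */
        int     fd = open("testfile", O_RDWR | O_CREAT | O_DSYNC, 0600);
        char   *buf;
        int     i, inflight = 0;

        if (fd < 0 || io_uring_queue_init(QD, &ring, 0) != 0)
            return 1;
        if (posix_memalign((void **) &buf, 4096, BLOCKSZ) != 0)
            return 1;
        memset(buf, 'x', BLOCKSZ);

        for (i = 0; i < NWRITES; i++)
        {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            struct io_uring_cqe *cqe;

            io_uring_prep_write(sqe, fd, buf, BLOCKSZ, (off_t) i * BLOCKSZ);
            io_uring_submit(&ring);
            if (++inflight == QD)   /* keep up to QD writes in flight */
            {
                io_uring_wait_cqe(&ring, &cqe);
                io_uring_cqe_seen(&ring, cqe);
                inflight--;
            }
        }
        /* (remaining in-flight completions left undrained for brevity) */
        io_uring_queue_exit(&ring);
        return 0;
    }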

So, why am I bringing this up in this thread? Only because I do not see
a way to actually utilize non-pmem hardware to a much higher degree than
we are doing now by using mmap(). Doing so requires using direct IO,
which is fundamentally incompatible with using mmap().

> I have a different concern. I think that, right now, when we reuse a
> WAL segment, we write entire blocks at a time, so the old contents of
> the WAL segment are overwritten without ever being read. But that
> behavior might not be maintained when using mmap(). It might be that
> as soon as we write the first byte to a mapped page, the old contents
> have to be faulted into memory. Indeed, it's unclear how it could be
> otherwise, since the VM page must be made read-write at that point and
> the system cannot know that we will overwrite the whole page. But
> reading in the old contents of a recycled WAL file just to overwrite
> them seems like it would be disastrously expensive.

Yea, that's a serious concern.

> A related, but more minor, concern is whether there are any
> differences in the write-back behavior when modifying a mapped
> region vs. using write(). Either way, the same pages of the same file
> will get dirtied, but the kernel might not have the same idea in
> either case about when the changed pages should be written back down
> to disk, and that could make a big difference to performance.

I don't think there's a significant difference in the case of Linux - no
idea about others. And either way we probably should force the kernel's
hand to start flushing much sooner.

Greetings,

Andres Freund

#10 Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Robert Haas (#8)
RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I made another WIP patchset to mmap() WAL segments as WAL buffers. Note
that this is not the non-volatile WAL buffer patchset but its competitor.
I am measuring and analyzing the performance of this patchset to compare
it with my non-volatile WAL buffer.

Please wait several more days for the result report...

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

Attachments:

0001-Preallocate-more-WAL-segments.patch (+10 -18)
0002-Use-WAL-segments-as-WAL-buffers.patch (+243 -591)
0003-Lazy-unmap-WAL-segments.patch (+26 -3)
0004-Speculative-map-WAL-segments.patch (+19 -1)
0005-Allocate-WAL-segments-to-utilize-hugepage.patch (+15 -3)
#11 Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Takashi Menjo (#1)
RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I applied my patchset that mmap()-s WAL segments as WAL buffers to
refs/tags/REL_12_0, and measured and analyzed its performance with
pgbench. Roughly speaking, when I used *SSD and ext4* to store WAL, it was
"obviously worse" than the original REL_12_0. VTune told me that the CPU
time of memcpy() called by CopyXLogRecordToWAL() got larger than before.
When I used *NVDIMM-N and ext4 with filesystem DAX* to store WAL, however,
it achieved "not bad" performance compared with our previous patchset and
non-volatile WAL buffer. The CPU time of each of XLogInsert() and
XLogFlush() was reduced, as with non-volatile WAL buffer.

So I think mmap()-ing WAL segments as WAL buffers is not such a bad idea,
as long as we use PMEM, at least NVDIMM-N.

Excuse me, but for now I'd rather not talk about how much the performance
was, because the mmap()-ing patchset is WIP, so there might be bugs that
wrongfully "improve" or "degrade" performance. Also, we need to understand
persistent memory programming and related features, such as filesystem
DAX, huge page faults, and WAL persistence with cache flush and memory
barrier instructions, to explain why the performance improved. I'd talk
about all the details at the appropriate time and place. (The conference,
or here later...)

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

#12 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Takashi Menjo (#11)
Re: [PoC] Non-volatile WAL buffer

Menjo-san,

On Mon, Feb 17, 2020 at 1:13 PM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:

> I applied my patchset that mmap()-s WAL segments as WAL buffers to
> refs/tags/REL_12_0, and measured and analyzed its performance with
> pgbench. Roughly speaking, when I used *SSD and ext4* to store WAL, it
> was "obviously worse" than the original REL_12_0.

I apologize for not having any opinion on the patches themselves, but
let me point out that it's better to base these patches on HEAD
(master branch) than REL_12_0, because all new code is committed to
the master branch, whereas stable branches such as REL_12_0 only
receive bug fixes. Do you have any specific reason to be working on
REL_12_0?

Thanks,
Amit

#13 Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Amit Langote (#12)
RE: [PoC] Non-volatile WAL buffer

Hello Amit,

> I apologize for not having any opinion on the patches themselves, but
> let me point out that it's better to base these patches on HEAD (master
> branch) than REL_12_0, because all new code is committed to the master
> branch, whereas stable branches such as REL_12_0 only receive bug fixes.
> Do you have any specific reason to be working on REL_12_0?

Yes, because I think it is human-friendly for reproducing and discussing
performance measurements. Of course, I know all newly accepted patches are
merged into master's HEAD, not into stable branches, let alone release
tags, so I'm aware I should rebase my patchset onto master sooner or
later. However, if someone, including me, says that they applied my
patchset to "master" and measured its performance, we have to pay
attention to which commit "master" really points to. Although we have
SHA-1 hashes to specify a commit, we would have to check whether that
specific commit on master contains patches affecting performance, because
master's HEAD gets new patches day by day. A release tag, on the other
hand, clearly points to a commit we all probably know. We can also check
its features and improvements more easily by using release notes and user
manuals.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

#14 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Takashi Menjo (#13)
Re: [PoC] Non-volatile WAL buffer

Hello,

On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:

> Hello Amit,
>
>> I apologize for not having any opinion on the patches themselves, but
>> let me point out that it's better to base these patches on HEAD (master
>> branch) than REL_12_0, because all new code is committed to the master
>> branch, whereas stable branches such as REL_12_0 only receive bug fixes.
>> Do you have any specific reason to be working on REL_12_0?
>
> Yes, because I think it is human-friendly for reproducing and discussing
> performance measurements. Of course, I know all newly accepted patches
> are merged into master's HEAD, not into stable branches, let alone
> release tags, so I'm aware I should rebase my patchset onto master sooner
> or later. However, if someone, including me, says that they applied my
> patchset to "master" and measured its performance, we have to pay
> attention to which commit "master" really points to. Although we have
> SHA-1 hashes to specify a commit, we would have to check whether that
> specific commit on master contains patches affecting performance, because
> master's HEAD gets new patches day by day. A release tag, on the other
> hand, clearly points to a commit we all probably know. We can also check
> its features and improvements more easily by using release notes and user
> manuals.

Thanks for clarifying. I see where you're coming from.

While I do sometimes see people reporting numbers against the latest
stable release's branch, that's normally just one of the baselines.
The more important baseline for ongoing development is the master
branch's HEAD, which is also what people volunteering to test your
patches would use. Anyone who reports would have to give at least two
numbers -- performance with a branch's HEAD without the patch applied
and with it applied -- which can be enough in most cases to see
the difference the patch makes. Sure, the numbers might change with
each report, but I'd think that's fine. If you continue to develop
against the stable branch, you might fail to notice the impact of
relevant developments in the master branch, even developments that
possibly require rethinking the architecture of your own changes,
although maybe that rarely occurs.

Thanks,
Amit

#15 Andres Freund
andres@anarazel.de
In reply to: Takashi Menjo (#11)
Re: [PoC] Non-volatile WAL buffer

Hi,

On 2020-02-17 13:12:37 +0900, Takashi Menjo wrote:

> I applied my patchset that mmap()-s WAL segments as WAL buffers to
> refs/tags/REL_12_0, and measured and analyzed its performance with
> pgbench. Roughly speaking, when I used *SSD and ext4* to store WAL,
> it was "obviously worse" than the original REL_12_0. VTune told me
> that the CPU time of memcpy() called by CopyXLogRecordToWAL() got
> larger than before.

FWIW, this might largely be because of page faults. In contrast to
before, we wouldn't reuse the same pages (because they've been
munmap()ed and mmap()ed again), so the first time they're touched, we'll
incur page faults. Did you try mmap()ing with MAP_POPULATE? It's probably
also worthwhile to try MAP_HUGETLB.
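
I.e. something like this sketch:

    /* Sketch: pre-fault all pages of a segment at mmap() time. */
    static char *
    map_segment(int fd, size_t segsize)
    {
        char *seg = mmap(NULL, segsize, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_POPULATE, fd, 0);

        /*
         * MAP_HUGETLB would additionally reduce TLB pressure, but
         * file-backed MAP_HUGETLB requires hugetlbfs, so it may not apply
         * to WAL segment files on a regular filesystem as-is.
         */
        return (seg == MAP_FAILED) ? NULL : seg;
    }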

Still doubtful it's the right direction, but I'd rather have good
numbers to back me up :)

Greetings,

Andres Freund

#16 Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Amit Langote (#14)
RE: [PoC] Non-volatile WAL buffer

Dear Amit,

Thank you for your advice. Exactly - so to speak, "when in pgsql, do as
the hackers do"...

I'm rebasing my branch onto master. I'll submit an updated patchset and performance report later.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

#17 Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Takashi Menjo (#1)
RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I rebased my non-volatile WAL buffer's patchset onto master. A new v2 patchset is attached to this mail.

I also measured performance before and after the patchset, varying the
-c/--client and -j/--jobs options of pgbench, for each scaling factor s =
50 or 1000. The results are presented in the following tables and the
attached charts. Conditions, steps, and other details are shown below.

Results (s=50)
==============
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 35.7 37.1 (+3.9%) 0.224 0.216 (-3.6%)
(18,18) 70.9 74.7 (+5.3%) 0.254 0.241 (-5.1%)
(36,18) 76.0 80.8 (+6.3%) 0.473 0.446 (-5.7%)
(54,18) 75.5 81.8 (+8.3%) 0.715 0.660 (-7.7%)

Results (s=1000)
================
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 37.4 40.1 (+7.3%) 0.214 0.199 (-7.0%)
(18,18) 79.3 86.7 (+9.3%) 0.227 0.208 (-8.4%)
(36,18) 87.2 95.5 (+9.5%) 0.413 0.377 (-8.7%)
(54,18) 86.8 94.8 (+9.3%) 0.622 0.569 (-8.5%)

Both throughput and average latency are improved for each scaling factor.
Throughput seems to almost reach its upper limit at (c,j)=(36,18).

The percentages in the s=1000 case look larger than in the s=50 case. I
think a larger scaling factor leads to less contention on the same tables
and/or indexes, that is, fewer lock and unlock operations. In such a
situation, write-ahead logging appears to be more significant for
performance.

Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
- Both are installed on the server-side node, that is, node 0
- Both are formatted with ext4
- NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
- Use the attached postgresql.conf
- Two new items nvwal_path and nvwal_size are used only after patch

Steps
=====
For each (c,j) pair, I did the following steps three times, then took the
median of the three as the final result shown in the tables above.

(1) Run initdb with proper -D and -X options; and also give --nvwal-path and --nvwal-size options after patch
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench during 30 minutes
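
An invocation of step (1) after patch would look something like this (the
paths and the size value are illustrative; see README.nvwal for the exact
option syntax):

    $ initdb -D /mnt/nvme/pgdata -X /mnt/nvme/pg_wal \
        --nvwal-path=/mnt/pmem/nvwal --nvwal-size=81920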

pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option, so pgbench used the built-in "TPC-B (sort-of)"
script.

Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)

Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

Attachments:

v2-0001-Support-GUCs-for-external-WAL-buffer.patch (+748 -23)
v2-0002-Non-volatile-WAL-buffer.patch (+973 -113)
v2-0003-README-for-non-volatile-WAL-buffer.patch (+184 -1)
nvwal-performance-s50.png
nvwal-performance-s1000.png
postgresql.conf
#18 Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Andres Freund (#15)
RE: [PoC] Non-volatile WAL buffer

Dear Andres,

Thank you for your advice about the MAP_POPULATE flag. I rebased my msync
patchset onto master and added a commit to pass that flag when mmap()-ing.
A new v2 patchset is attached to this mail. Note that this patchset is NOT
the non-volatile WAL buffer's one.

I also measured performance of the following three versions, varying -c/--client and -j/--jobs options of pgbench, for each scaling
factor s = 50 or 1000.

- Before patchset (say "before")
- After patchset except patch 0005 not to use MAP_POPULATE ("after (no populate)")
- After full patchset to use MAP_POPULATE ("after (populate)")

The results are presented in the following tables and the attached charts.
Conditions, steps, and other details are shown below. Note that, unlike
the measurement of non-volatile WAL buffer I sent recently [1], I used an
NVMe SSD for pg_wal to evaluate this patchset with traditional mmap-ed
files, that is, with no direct access (DAX) support and with page caches
involved.

Results (s=50)
==============
Throughput [10^3 TPS]
( c, j) before after after
(no populate) (populate)
------- -------------------------------------
( 8, 8) 30.9 28.1 (- 9.2%) 28.3 (- 8.6%)
(18,18) 61.5 46.1 (-25.0%) 47.7 (-22.3%)
(36,18) 67.0 45.9 (-31.5%) 48.4 (-27.8%)
(54,18) 68.3 47.0 (-31.3%) 49.6 (-27.5%)

Average Latency [ms]
( c, j) before after after
(no populate) (populate)
------- --------------------------------------
( 8, 8) 0.259 0.285 (+10.0%) 0.283 (+ 9.3%)
(18,18) 0.293 0.391 (+33.4%) 0.377 (+28.7%)
(36,18) 0.537 0.784 (+46.0%) 0.744 (+38.5%)
(54,18) 0.790 1.149 (+45.4%) 1.090 (+38.0%)

Results (s=1000)
================
Throughput [10^3 TPS]
( c, j) before after after
(no populate) (populate)
------- ------------------------------------
( 8, 8) 32.0 29.6 (- 7.6%) 29.1 (- 9.0%)
(18,18) 66.1 49.2 (-25.6%) 50.4 (-23.7%)
(36,18) 76.4 51.0 (-33.3%) 53.4 (-30.1%)
(54,18) 80.1 54.3 (-32.2%) 57.2 (-28.6%)

Average latency [ms]
( c, j) before after after
(no populate) (populate)
------- --------------------------------------
( 8, 8) 0.250 0.271 (+ 8.4%) 0.275 (+10.0%)
(18,18) 0.272 0.366 (+34.6%) 0.357 (+31.3%)
(36,18) 0.471 0.706 (+49.9%) 0.674 (+43.1%)
(54,18) 0.674 0.995 (+47.6%) 0.944 (+40.1%)

I'd say MAP_POPULATE made performance a little better in the large-client
cases, comparing "populate" with "no populate". However, comparing "after"
with "before", I found both throughput and average latency degraded. VTune
told me that "after (populate)" still spent more CPU time memcpy-ing WAL
records into mmap-ed segments than "before".

I also made a microbenchmark to see the behavior of mmap and msync. I
found that:

- A major fault occurred at mmap with MAP_POPULATE, instead of at first
  access to the mmap-ed space.
- Some minor faults also occurred at mmap with MAP_POPULATE, and no
  additional fault occurred when I loaded from the mmap-ed space. But once
  I stored to that space, a minor fault occurred.
- When I stored to a page that had been msync-ed, a minor fault occurred.

So I think one of the remaining causes of the performance degradation is
minor faults when mmap-ed pages get dirtied. And it seems not to be solved
by MAP_POPULATE alone, as far as I can see.
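
The microbenchmark was along the following lines (a simplified
reconstruction, not the exact program), watching the fault counters
reported by getrusage():

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <unistd.h>

    static void
    report(const char *label)
    {
        struct rusage ru;

        getrusage(RUSAGE_SELF, &ru);
        printf("%-18s minflt=%ld majflt=%ld\n",
               label, ru.ru_minflt, ru.ru_majflt);
    }

    int
    main(void)
    {
        int            fd = open("testfile", O_RDWR | O_CREAT, 0600);
        volatile char *p;

        if (fd < 0 || ftruncate(fd, 4096) != 0)
            return 1;
        report("before mmap");
        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_POPULATE, fd, 0);
        if (p == MAP_FAILED)
            return 1;
        report("after mmap");      /* faults paid here via MAP_POPULATE */
        (void) p[0];               /* load: no additional fault expected */
        report("after load");
        p[0] = 1;                  /* store: minor fault to map writable */
        report("after store");
        msync((void *) p, 4096, MS_SYNC);
        p[0] = 2;                  /* store after msync: minor fault again */
        report("after msync+store");
        return 0;
    }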

Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use two NVMe SSDs; one for PGDATA, another for pg_wal
- Both are installed on the server-side node, that is, node 0
- Both are formatted with ext4
- Use the attached postgresql.conf

Steps
=====
For each (c,j) pair, I did the following steps three times then I found the median of the three as a final result shown in the
tables above.

(1) Run initdb with proper -D and -X options
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench during 30 minutes

pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option, so pgbench used the built-in "TPC-B (sort-of)"
script.

Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)

Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA x2

Best regards,
Takashi

[1]: /messages/by-id/002701d5fd03$6e1d97a0$4a58c6e0$@hco.ntt.co.jp_1

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

Attachments:

v2-0001-Preallocate-more-WAL-segments.patch (+10 -18)
v2-0002-Use-WAL-segments-as-WAL-buffers.patch (+366 -602)
v2-0003-Lazy-unmap-WAL-segments.patch (+26 -3)
v2-0004-Speculative-map-WAL-segments.patch (+19 -1)
v2-0005-Map-WAL-segments-with-MAP_POPULATE-if-non-DAX.patch (+1 -2)
msync-performance-s50.png
msync-performance-s1000.png
postgresql.conf
#19 Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Takashi Menjo (#1)
RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I have updated my non-volatile WAL buffer patchset to v3. Now we can use
it in streaming replication mode.

Updates from v2:

- walreceiver supports non-volatile WAL buffer
  Now walreceiver stores received records directly into the non-volatile
  WAL buffer if applicable.

- pg_basebackup supports non-volatile WAL buffer
  Now pg_basebackup copies received WAL segments onto the non-volatile WAL
  buffer if you run it in "nvwal" mode (-Fn).
  You should specify a new NVWAL path with the --nvwal-path option. The
  path will be written to postgresql.auto.conf or recovery.conf. The size
  of the new NVWAL is the same as the master's.
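
An illustrative invocation (the host and paths are placeholders):

    $ pg_basebackup -h primary.example.com -D /mnt/nvme/backup \
        -Fn --nvwal-path=/mnt/pmem/nvwal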

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

Attachments:

v3-0001-Support-GUCs-for-external-WAL-buffer.patch (+747 -22)
v3-0002-Non-volatile-WAL-buffer.patch (+1097 -138)
v3-0003-walreceiver-supports-non-volatile-WAL-buffer.patch (+85 -4)
v3-0004-pg_basebackup-supports-non-volatile-WAL-buffer.patch (+407 -17)
v3-0005-README-for-non-volatile-WAL-buffer.patch (+184 -1)
#20 Takashi Menjo
takashi.menjo@gmail.com
In reply to: Takashi Menjo (#19)
Re: [PoC] Non-volatile WAL buffer

Rebased.

2020年6月24日(水) 16:44 Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>:

Dear hackers,

I update my non-volatile WAL buffer's patchset to v3. Now we can use it
in streaming replication mode.

Updates from v2:

- walreceiver supports non-volatile WAL buffer
Now walreceiver stores received records directly to non-volatile WAL
buffer if applicable.

- pg_basebackup supports non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto non-volatile WAL
buffer if you run it with "nvwal" mode (-Fn).
You should specify a new NVWAL path with --nvwal-path option. The path
will be written to postgresql.auto.conf or recovery.conf. The size of the
new NVWAL is same as the master's one.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Wednesday, March 18, 2020 5:59 PM
To: 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <

hlinnaka@iki.fi>; 'Amit Langote'

<amitlangote09@gmail.com>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I rebased my non-volatile WAL buffer's patchset onto master. A new v2

patchset is attached to this mail.

I also measured performance before and after patchset, varying

-c/--client and -j/--jobs options of pgbench, for

each scaling factor s = 50 or 1000. The results are presented in the

following tables and the attached charts.

Conditions, steps, and other details will be shown later.

Results (s=50)
==============
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 35.7 37.1 (+3.9%) 0.224 0.216 (-3.6%)
(18,18) 70.9 74.7 (+5.3%) 0.254 0.241 (-5.1%)
(36,18) 76.0 80.8 (+6.3%) 0.473 0.446 (-5.7%)
(54,18) 75.5 81.8 (+8.3%) 0.715 0.660 (-7.7%)

Results (s=1000)
================
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 37.4 40.1 (+7.3%) 0.214 0.199 (-7.0%)
(18,18) 79.3 86.7 (+9.3%) 0.227 0.208 (-8.4%)
(36,18) 87.2 95.5 (+9.5%) 0.413 0.377 (-8.7%)
(54,18) 86.8 94.8 (+9.3%) 0.622 0.569 (-8.5%)

Both throughput and average latency improved for each scaling
factor. Throughput seemed to almost reach its upper limit at
(c,j)=(36,18). (The parenthesized percentages are relative changes;
for example, the ( 8, 8) row at s=50 gives (37.1 - 35.7) / 35.7 ≈
+3.9% in throughput and (0.216 - 0.224) / 0.224 ≈ -3.6% in average
latency.)

The improvement percentages look larger in the s=1000 case than in
the s=50 case. I think a larger scaling factor leads to less
contention on the same tables and/or indexes, that is, fewer lock
and unlock operations. In such a situation, write-ahead logging
appears to be more significant for performance.

Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set
  for pg_wal
- Both are installed on the server-side node, that is, node 0
- Both are formatted with ext4
- NVDIMM-N is mounted with the "-o dax" option to enable Direct
  Access (DAX)
- Use the attached postgresql.conf
- Two new items, nvwal_path and nvwal_size, are used only after the
  patch is applied (an example setup follows this list)
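
For reference, a minimal sketch of such a setup; the device name,
mount point, and values below are assumptions for illustration, not
the exact ones used in this measurement:

Mount the NVDIMM-N with DAX:
# mount -t ext4 -o dax /dev/pmem0 /mnt/pmem

Set the two new GUCs in postgresql.conf (patched build only):
nvwal_path = '/mnt/pmem/pg_wal/nvwal'
nvwal_size = 20GB

Pin the server to node 0 and pgbench to node 1, for example:
$ numactl --cpunodebind=0 --membind=0 pg_ctl -D $PGDATA start
$ numactl --cpunodebind=1 --membind=1 pgbench ...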

Steps
=====
For each (c,j) pair, I did the following steps three times, then
took the median of the three runs as the final result shown in the
tables above.

(1) Run initdb with proper -D and -X options; after the patch, also
    give the --nvwal-path and --nvwal-size options (see the example
    commands after this list)
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute the pg_prewarm extension for all four pgbench tables
(6) Run pgbench for 30 minutes
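
For example, steps (1) and (5) might look as follows; the
directories and size are hypothetical, and the --nvwal-* options
exist only after the patch:

$ initdb -D /nvme/pgdata -X /mnt/pmem/pg_wal \
    --nvwal-path=/mnt/pmem/pg_wal/nvwal --nvwal-size=20GB

postgres=# CREATE EXTENSION pg_prewarm;
postgres=# SELECT pg_prewarm('pgbench_accounts');
postgres=# SELECT pg_prewarm('pgbench_branches');
postgres=# SELECT pg_prewarm('pgbench_tellers');
postgres=# SELECT pg_prewarm('pgbench_history');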

pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option, so the built-in "TPC-B (sort-of)" script was
used.

Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)

Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2 sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2 sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2 sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Thursday, February 20, 2020 6:30 PM
To: 'Amit Langote' <amitlangote09@gmail.com>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear Amit,

Thank you for your advice. Exactly, it is, so to speak, "do as the
hackers do when in pgsql"...

I'm rebasing my branch onto master. I'll submit an updated patchset
and a performance report later.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Amit Langote <amitlangote09@gmail.com>
Sent: Monday, February 17, 2020 5:21 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: Robert Haas <robertmhaas@gmail.com>; Heikki Linnakangas
<hlinnaka@iki.fi>; PostgreSQL-development
<pgsql-hackers@postgresql.org>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello,

On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:

Hello Amit,

I apologize for not having any opinion on the patches themselves,
but let me point out that it's better to base these patches on HEAD
(master branch) than REL_12_0, because all new code is committed to
the master branch, whereas stable branches such as REL_12_0 only
receive bug fixes. Do you have any specific reason to be working on
REL_12_0?

Yes, because I think it makes performance measurements easier to
reproduce and discuss. Of course I know that all newly accepted
patches are merged into master's HEAD, not into stable branches and
not even into release tags, so I'm aware that I should rebase my
patchset onto master sooner or later. However, if someone, including
me, says that they applied my patchset to "master" and measured its
performance, we have to pay attention to which commit that "master"
really points to. Although we have SHA-1 hashes to specify a commit,
we would have to check whether that specific commit on master
carries patches affecting performance, because master's HEAD gets
new patches day by day. On the other hand, a release tag clearly
points to a commit that everyone knows. We can also check its
features and improvements more easily by using release notes and
user manuals.
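
To illustrate the distinction being drawn here, compare how each
kind of baseline would be pinned (d677550 is the master commit cited
in the v2 report above):

$ git checkout REL_12_0   # release tag: a commit everyone can identify
$ git checkout d677550    # a point on master: reproducible, but one must
                          # check it for other performance-relevant changes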

Thanks for clarifying. I see where you're coming from.

While I do sometimes see people reporting numbers with the latest
stable release's branch, that's normally just one of the baselines.
The more important baseline for ongoing development is the master
branch's HEAD, which is also what people volunteering to test your
patches would use. Anyone who reports would have to give at least
two numbers -- performance with a branch's HEAD without the patch
applied and with it applied -- which is enough in most cases to see
the difference the patch makes. Sure, the numbers might change on
each report, but I'd think that's fine. If you continue to develop
against the stable branch, you might fail to notice the impact of
relevant developments in the master branch, even developments that
could require rethinking the architecture of your own changes,
although that may rarely occur.

Thanks,
Amit

--
Takashi Menjo <takashi.menjo@gmail.com>

Attachments:

v4-0001-Support-GUCs-for-external-WAL-buffer.patch (+747, -22)
v4-0002-Non-volatile-WAL-buffer.patch (+1097, -138)
v4-0003-walreceiver-supports-non-volatile-WAL-buffer.patch (+85, -4)
v4-0004-pg_basebackup-supports-non-volatile-WAL-buffer.patch (+407, -17)
v4-0005-README-for-non-volatile-WAL-buffer.patch (+184, -1)
#21Deng, Gang
gang.deng@intel.com
In reply to: Takashi Menjo (#20)
#22Takashi Menjo
takashi.menjo@gmail.com
In reply to: Deng, Gang (#21)
#23Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Takashi Menjo (#22)
#24Deng, Gang
gang.deng@intel.com
In reply to: Takashi Menjo (#23)
#25Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Deng, Gang (#24)
#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Takashi Menjo (#25)
#27Takashi Menjo
takashi.menjo@gmail.com
In reply to: Heikki Linnakangas (#26)
#28Takashi Menjo
takashi.menjo@gmail.com
In reply to: Takashi Menjo (#25)
#29Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Takashi Menjo (#20)
#30Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Takashi Menjo (#27)
#31Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#30)
#32tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Tomas Vondra (#31)
#33Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: tsunakawa.takay@fujitsu.com (#32)
#34tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Tomas Vondra (#33)
#35Ashwin Agrawal
aagrawal@pivotal.io
In reply to: Tomas Vondra (#29)
#36Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: tsunakawa.takay@fujitsu.com (#34)
#37Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Ashwin Agrawal (#35)
#38Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#36)
#39Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tomas Vondra (#38)
#40Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Heikki Linnakangas (#39)
#41Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#40)
#42Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#41)
#43Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#42)
#44Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#43)
#45Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Masahiko Sawada (#44)
#46Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#43)
#47Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Tomas Vondra (#45)
#48Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#45)
#49Takashi Menjo
takashi.menjo@gmail.com
In reply to: Masahiko Sawada (#48)
#50Takashi Menjo
takashi.menjo@gmail.com
In reply to: Takashi Menjo (#49)
#51Takashi Menjo
takashi.menjo@gmail.com
In reply to: Takashi Menjo (#50)
#52Takashi Menjo
takashi.menjo@gmail.com
In reply to: Takashi Menjo (#51)
#53Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Masahiko Sawada (#48)
#54tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Tomas Vondra (#53)
#55Takashi Menjo
takashi.menjo@gmail.com
In reply to: tsunakawa.takay@fujitsu.com (#54)
#56Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#53)
#57tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Masahiko Sawada (#56)
#58Takashi Menjo
takashi.menjo@gmail.com
In reply to: tsunakawa.takay@fujitsu.com (#57)
#59tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Takashi Menjo (#58)
#60Takashi Menjo
takashi.menjo@gmail.com
In reply to: tsunakawa.takay@fujitsu.com (#59)
#61Takashi Menjo
takashi.menjo@gmail.com
In reply to: Takashi Menjo (#60)
#62Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Konstantin Knizhnik (#47)
#63Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Tomas Vondra (#62)
#64Takashi Menjo
takashi.menjo@gmail.com
In reply to: Konstantin Knizhnik (#63)
#65Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#56)
#66Takashi Menjo
takashi.menjo@gmail.com
In reply to: Masahiko Sawada (#65)
#67Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Takashi Menjo (#66)
#68Takashi Menjo
takashi.menjo@gmail.com
In reply to: Tomas Vondra (#67)
#69Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Takashi Menjo (#68)
#70Takashi Menjo
takashi.menjo@gmail.com
In reply to: Tomas Vondra (#69)
#71tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Takashi Menjo (#70)