Refactoring the checkpointer's fsync request queue
Hello hackers,
Currently, md.c and checkpointer.c interact in a way that breaks
smgr.c's modularity. That doesn't matter much if md.c is the only
storage manager implementation, but currently there are two proposals
to provide new kinds of block storage accessed via the buffer manager:
UNDO and SLRU.
Here is a patch that rips the fsync stuff out of md.c, generalises it
and puts it into a new translation unit smgrsync.c. It can deal with
fsync()ing any files you want at checkpoint time, as long as they can
be described by a SmgrFileTag (a struct type we can extend as needed).
A pathname would work too, but I wanted something small and fixed in
size. It's just a tag that can be converted to a path in case it
needs to be reopened (eg on Windows), but otherwise is used as a hash
table key to merge requests.
There is one major fly in the ointment: fsyncgate[1]. Originally I
planned to propose a patch on top of that one, but it's difficult --
both patches move a lot of the same stuff around. Personally, I don't
think it would be a very good idea to back-patch that anyway. It'd be
riskier than the problem it aims to solve, in terms of bugs and
hard-to-foresee portability problems IMHO. I think we should consider
back-patching some variant of Craig Ringer's PANIC patch, and consider
this redesigned approach for future releases.
So, please find attached the WIP patch that I would like to propose
for PostgreSQL 12, under a separate Commitfest entry. It incorporates
the fsyncgate work by Andres Freund (original file descriptor transfer
POC) and me (many bug fixes and improvements), and the refactoring
work as described above.
It can be compiled in two modes: with the macro
CHECKPOINTER_TRANSFER_FILES defined, it sends fds to the checkpointer,
but if you comment out that macro definition for testing, or build on
Windows, it reverts to a mode that reopens files in the checkpointer.
I'm hoping to find a Windows-savvy collaborator to help finish the
Windows support. Right now it passes make check on AppVeyor, but it
needs to be reviewed and tested on a real system with a small
shared_buffers (installcheck, pgbench, other attempts to break it).
Other than that, there are a couple of remaining XXX notes for small
known details, but I wanted to post this version now.
[1]: /messages/by-id/20180427222842.in2e4mibx45zdth5@alap3.anarazel.de
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Refactor-the-checkpointer-request-queue.patch
(Adding Dmitry to CC list.)
On Tue, Oct 16, 2018 at 12:02 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
There is one major fly in the ointment: fsyncgate[1]. Originally I
planned to propose a patch on top of that one, but it's difficult --
both patches move a lot of the same stuff around. Personally, I don't
think it would be a very good idea to back-patch that anyway. It'd be
riskier than the problem it aims to solve, in terms of bugs and
hard-to-foresee portability problems IMHO. I think we should consider
back-patching some variant of Craig Ringer's PANIC patch, and consider
this redesigned approach for future releases.
So, please find attached the WIP patch that I would like to propose
for PostgreSQL 12, under a separate Commitfest entry. It incorporates
the fsyncgate work by Andres Freund (original file descriptor transfer
POC) and me (many bug fixes and improvements), and the refactoring
work as described above.
Here is a rebased version of the patch, post pread()/pwrite(). I have
also rewritten the commit message to try to explain the rationale
concisely, instead of requiring the reader to consult multiple
discussions that jump between lengthy email threads to understand the
key points.
There is one major problem with this patch: BufferSync(), run in the
checkpointer, can deadlock against a backend that holds a buffer lock
and is blocked in SendFsyncRequest(). To break this deadlock, we need
a way out of it on either the sending or receiving side. Here are three
ideas:
1. Go back to the current pressure-valve strategy: make the sending
side perform the fsync(), if it can't immediately write to the pipe.
2. Offload the BufferSync() work to bgwriter, so the checkpointer can
keep draining the pipe. Communication between checkpointer and
bgwriter can be fairly easily multiplexed with the pipe draining work.
3. Multiplex the checkpointer's work: Use LWLockConditionalAcquire()
when locking buffers, and if that fails, try to drain the pipe, and
then fall back to a LWLockTimedAcquire(), drain pipe, repeat loop. I
can hear you groan already; that doesn't seem particularly elegant,
and there are portability problems implementing LWLockTimedAcquire():
semtimedop() and sem_timedwait() are not available on all platforms
(eg macOS). Maybe pthread_cond_timedwait() could help (!).
I'm not actually sure if idea 1 is correct, and I also don't like that
behaviour under pressure, and I think pressure is more likely than in
current master (since we gave up sender-side queue compaction, and it
seems quite easy to fill up the pipe's buffer). Number 2 appeals to
me the most right now, but I haven't looked into the details or tried
it yet. Number 3 is a straw man that perhaps helps illustrate the
problem but involves taking on unnecessary new problems and seems like
a non-starter. So, is there any reason the bgwriter process shouldn't
do that work on the checkpointer's behalf? Is there another way?
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Refactor-the-checkpointer-s-data-sync-request-que-v2.patch
On Sun, Nov 11, 2018 at 9:59 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
There is one major problem with this patch
If there's only one, you're doing great! Although admittedly this
seems like a big one...
1. Go back to the current pressure-valve strategy: make the sending
side perform the fsync(), if it can't immediately write to the pipe.
As you say, this will happen significantly more often with
deduplication. That deduplication logic got added in response to a
real need. Before that, you could cause an individual backend to
start doing its own fsyncs() with something as simple as a bulk load.
The queue would absorb most of them, but not all, and the performance
ramifications were noticeable.
2. Offload the BufferSync() work to bgwriter, so the checkpointer can
keep draining the pipe. Communication between checkpointer and
bgwriter can be fairly easily multiplexed with the pipe draining work.
That sounds a little like you are proposing to go back to the way
things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
the division of labor wouldn't be quite the same.
3. Multiplex the checkpointer's work: Use LWLockConditionalAcquire()
when locking buffers, and if that fails, try to drain the pipe, and
then fall back to a LWLockTimedAcquire(), drain pipe, repeat loop. I
can hear you groan already; that doesn't seem particularly elegant,
and there are portability problems implementing LWLockTimedAcquire():
semtimedop() and sem_timedwait() are not available on all platforms
(eg macOS). Maybe pthread_cond_timedwait() could help (!).
You don't really need to invent LWLockTimedAcquire(). You could just
keep retrying LWLockConditionalAcquire() in a delay loop. I agree
that doesn't seem particularly elegant, though.
I still feel like this whole pass-the-fds-to-the-checkpointer thing is
a bit of a fool's errand, though. I mean, there's no guarantee that
the first FD that gets passed to the checkpointer is the first one
opened, or even the first one written, is there? It seems like if you
wanted to make this work reliably, you'd need to do it the other way
around: have the checkpointer (or some other background process) open
all the FDs, and anybody else who wants to have one open get it from
the checkpointer.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2018-11-12 15:58:41 +1300, Thomas Munro wrote:
There is one major problem with this patch: BufferSync(), run in the
checkpointer, can deadlock against a backend that holds a buffer lock
and is blocked in SendFsyncRequest(). To break this deadlock, we need
a way out of it on either the sending or receiving side. Here are three
ideas:
That's the deadlock I'd mentioned in Pune (?) btw.
1. Go back to the current pressure-valve strategy: make the sending
side perform the fsync(), if it can't immediately write to the pipe.
I don't think that's correct / safe? I've previously wondered whether
there's any way we could delay the write to a point where the buffer is
not locked anymore - as far as I can tell it's actually not required for
correctness that we send the fsync request before unlocking. It's
architecturally a bit dicey tho :(
Greetings,
Andres Freund
Hi,
On 2018-11-13 12:04:23 -0500, Robert Haas wrote:
I still feel like this whole pass-the-fds-to-the-checkpointer thing is
a bit of a fool's errand, though. I mean, there's no guarantee that
the first FD that gets passed to the checkpointer is the first one
opened, or even the first one written, is there?
I'm not sure I understand the danger you're seeing here. It doesn't have
to be the first fd opened, it has to be an fd that's older than all the
writes that we need to ensure made it to disk. And that ought to be
guaranteed by the logic? Between the FileWrite() and the
register_dirty_segment() (and other relevant paths) the FD cannot be
closed.
It seems like if you wanted to make this work reliably, you'd need to
do it the other way around: have the checkpointer (or some other
background process) open all the FDs, and anybody else who wants to
have one open get it from the checkpointer.
That'd require a process context switch for each FD opened, which seems
clearly like a no-go?
Greetings,
Andres Freund
On Tue, Nov 13, 2018 at 1:07 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-11-13 12:04:23 -0500, Robert Haas wrote:
I still feel like this whole pass-the-fds-to-the-checkpointer thing is
a bit of a fool's errand, though. I mean, there's no guarantee that
the first FD that gets passed to the checkpointer is the first one
opened, or even the first one written, is there?
I'm not sure I understand the danger you're seeing here. It doesn't have
to be the first fd opened, it has to be an fd that's older than all the
writes that we need to ensure made it to disk. And that ought to be
guaranteed by the logic? Between the FileWrite() and the
register_dirty_segment() (and other relevant paths) the FD cannot be
closed.
Suppose backend A and backend B open a segment around the same time.
Is it possible that backend A does a write before backend B, but
backend B's copy of the fd reaches the checkpointer before backend A's
copy? If you send the FD to the checkpointer before writing anything
then I think it's fine, but if you write first and then send the FD to
the checkpointer I don't see what guarantees the ordering.
It seems like if you wanted to make this work reliably, you'd need to
do it the other way around: have the checkpointer (or some other
background process) open all the FDs, and anybody else who wants to
have one open get it from the checkpointer.
That'd require a process context switch for each FD opened, which seems
clearly like a no-go?
I don't know how bad that would be. But hey, no cost is too great to
pay as a workaround for insane kernel semantics, right?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
(Replies to a couple of different messages below)
On Wed, Nov 14, 2018 at 6:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Nov 11, 2018 at 9:59 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
There is one major problem with this patch
If there's only one, you're doing great! Although admittedly this
seems like a big one...
Make that two.
2. Offload the BufferSync() work to bgwriter, so the checkpointer can
keep draining the pipe. Communication between checkpointer and
bgwriter can be fairly easily multiplexed with the pipe draining work.
That sounds a little like you are proposing to go back to the way
things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
the division of labor wouldn't be quite the same.
But is there an argument against it? The checkpointer would still be
creating checkpoints including running fsync, but the background
writer would be, erm, writing, erm, in the background.
Admittedly it adds a whole extra rabbit hole to this rabbit hole,
which was itself a diversion from my goal of refactoring the syncing
machinery to support undo logs. But the other two ideas seem to suck
and/or have correctness issues.
On Wed, Nov 14, 2018 at 7:43 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 13, 2018 at 1:07 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-11-13 12:04:23 -0500, Robert Haas wrote:
I still feel like this whole pass-the-fds-to-the-checkpointer thing is
a bit of a fool's errand, though. I mean, there's no guarantee that
the first FD that gets passed to the checkpointer is the first one
opened, or even the first one written, is there?
I'm not sure I understand the danger you're seeing here. It doesn't have
to be the first fd opened, it has to be an fd that's older than all the
writes that we need to ensure made it to disk. And that ought to be
guaranteed by the logic? Between the FileWrite() and the
register_dirty_segment() (and other relevant paths) the FD cannot be
closed.
Suppose backend A and backend B open a segment around the same time.
Is it possible that backend A does a write before backend B, but
backend B's copy of the fd reaches the checkpointer before backend A's
copy? If you send the FD to the checkpointer before writing anything
then I think it's fine, but if you write first and then send the FD to
the checkpointer I don't see what guarantees the ordering.
I'm not sure if it matters whether we send the fd before or after the
write, but we still need some kind of global ordering of fds that can
order a given fd with respect to writes in other processes, so the
patch introduces a global shared counter captured immediately after
open() (including when reopened in the vfd machinery).
In your example, both fds arrive in the checkpointer in some order,
and it will keep the one with the older sequence number and close the
other one. This sorting of all interesting fds will be forced before
the checkpoint completes by AbsorbFsyncRequests(), which drains all
messages from the pipe until it sees a message for the next checkpoint
cycle.
Hmm, I think there is a flaw in the plan here though. Messages for
different checkpoint cycles race to enter the pipe around the time the
cycle counter is bumped, so you could have a message for n hiding
behind a message for n + 1 and not drain enough; I'm not sure and need
to look at something else today, but I see a couple of potential
solutions to that which I will mull over, based on either a new shared
counter increment or a special pipe message written after BufferSync()
by the bgwriter (if we go for idea #2; Andres had something similar in
the original prototype but it could self-deadlock). I need to figure
out if that is a suitable barrier due to buffer interlocking.
It seems like if you wanted to make this work reliably, you'd need to
do it the other way around: have the checkpointer (or some other
background process) open all the FDs, and anybody else who wants to
have one open get it from the checkpointer.
That'd require a process context switch for each FD opened, which seems
clearly like a no-go?
I don't know how bad that would be. But hey, no cost is too great to
pay as a workaround for insane kernel semantics, right?
Yeah, seems extremely expensive and unnecessary. It seems sufficient
to track the global opening order... or at least a proxy that
identifies the fd that performed the oldest write. Which I believe
this patch is doing.
--
Thomas Munro
http://www.enterprisedb.com
On Wed, 14 Nov 2018 at 00:44, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
Here is a rebased version of the patch, post pread()/pwrite(). I have
also rewritten the commit message to try to explain the rationale
concisely, instead of requiring the reader to consult multiple
discussions that jump between lengthy email threads to understand the
key points.
Thank you for working on this patch!
There is one major problem with this patch: BufferSync(), run in the
checkpointer, can deadlock against a backend that holds a buffer lock
and is blocked in SendFsyncRequest(). To break this deadlock, we need
a way out of it on either the sending or receiving side.
Or introduce a third side, but I'm not sure how appropriate it is here.
2. Offload the BufferSync() work to bgwriter, so the checkpointer can
keep draining the pipe. Communication between checkpointer and
bgwriter can be fairly easily multiplexed with the pipe draining work.
I also think it sounds better than other options (although probably it's
partially because these options were formulated while already having some bias
towards one of the solutions).
2. Offload the BufferSync() work to bgwriter, so the checkpointer can
keep draining the pipe. Communication between checkpointer and
bgwriter can be fairly easily multiplexed with the pipe draining work.That sounds a little like you are proposing to go back to the way
things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
the division of labor wouldn't be quite the same.
I had the same first thought, but then after reading the corresponding mailing
thread I've got an impression that the purpose of this change was rather
technical (to split work between different processes because of performance
reasons) and not exactly relevant to the division of labor - am I wrong here?
While testing this patch with frequent checkpoints I've stumbled upon an
interesting error, that happened already after I finished one test:
TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574)
2018-11-13 22:06:29.773 CET [7886] LOG: checkpointer process (PID
7934) was terminated by signal 6: Aborted
2018-11-13 22:06:29.773 CET [7886] LOG: terminating any other active
server processes
2018-11-13 22:06:29.773 CET [7937] WARNING: terminating connection
because of crash of another server process
2018-11-13 22:06:29.773 CET [7937] DETAIL: The postmaster has
commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2018-11-13 22:06:29.773 CET [7937] HINT: In a moment you should be
able to reconnect to the database and repeat your command.
2018-11-13 22:06:29.778 CET [7886] LOG: all server processes
terminated; reinitializing
I assume it shouldn't be like that? I haven't investigated deeply yet, but
backtrace looks like:
bt
#0 0x00007f7ee7a3af00 in raise () from /lib64/libc.so.6
#1 0x00007f7ee7a3ca57 in abort () from /lib64/libc.so.6
#2 0x0000560e89d1858e in ExceptionalCondition
(conditionName=conditionName@entry=0x560e89eca333 "!(rc > 0)",
errorType=errorType@entry=0x560e89d6cec8 "FailedAssertion",
fileName=fileName@entry=0x560e89eca2c9 "checkpointer.c",
lineNumber=lineNumber@entry=574) at assert.c:54
#3 0x0000560e89b5e3ff in CheckpointerMain () at checkpointer.c:574
#4 0x0000560e8995ef9e in AuxiliaryProcessMain (argc=argc@entry=2,
argv=argv@entry=0x7ffe05c32f60) at bootstrap.c:460
#5 0x0000560e89b69c55 in StartChildProcess
(type=type@entry=CheckpointerProcess) at postmaster.c:5369
#6 0x0000560e89b6af15 in reaper (postgres_signal_arg=<optimized out>)
at postmaster.c:2916
#7 <signal handler called>
#8 0x00007f7ee7afe00b in select () from /lib64/libc.so.6
#9 0x0000560e89b6bd20 in ServerLoop () at postmaster.c:1679
#10 0x0000560e89b6d1bc in PostmasterMain (argc=3, argv=<optimized
out>) at postmaster.c:1388
#11 0x0000560e89acadc6 in main (argc=3, argv=0x560e8ad42c00) at main.c:228
On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
That sounds a little like you are proposing to go back to the way
things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
the division of labor wouldn't be quite the same.
But is there an argument against it? The checkpointer would still be
creating checkpoints including running fsync, but the background
writer would be, erm, writing, erm, in the background.
I don't know. I guess the fact that the checkpointer is still
performing the fsyncs is probably a key point. I mean, in the old
division of labor, fsyncs could interrupt the background writing that
was supposed to be happening.
I'm not sure if it matters whether we send the fd before or after the
write, but we still need some kind of global ordering of fds that can
order a given fd with respect to writes in other processes, so the
patch introduces a global shared counter captured immediately after
open() (including when reopened in the vfd machinery).
But how do you make reading that counter atomic with the open() itself?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-11-14 16:36:49 -0500, Robert Haas wrote:
On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I'm not sure if it matters whether we send the fd before or after the
write, but we still need some kind of global ordering of fds that can
order a given fd with respect to writes in other processes, so the
patch introduces a global shared counter captured immediately after
open() (including when reopened in the vfd machinery).
But how do you make reading that counter atomic with the open() itself?
I don't see why it has to be. As long as the "fd generation" assignment
happens before fsync (and writes secondarily), there ought not to be any
further need for synchronicity?
Greetings,
Andres Freund
On Wed, Nov 14, 2018 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-11-14 16:36:49 -0500, Robert Haas wrote:
But how do you make reading that counter atomic with the open() itself?
I don't see why it has to be. As long as the "fd generation" assignment
happens before fsync (and writes secondarily), there ought not to be any
further need for synchronicity?
If the goal is to have the FD that is opened first end up in the
checkpointer's table, grabbing a counter backwards does not achieve
it, because there's a race.
S1: open FD
S2: open FD
S2: local_counter = shared_counter++
S1: local_counter = shared_counter++
Now S1 was opened first but has a higher shared counter value than S2
which was opened later. Does that matter? Beats me! I just work
here...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 15, 2018 at 5:09 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
While testing this patch with frequent checkpoints I've stumbled upon an
interesting error, that happened already after I finished one test:
TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574)
Thanks for testing! Yeah, that's:
+ rc = WaitEventSetWait(wes, cur_timeout * 1000, &event, 1, 0);
+ Assert(rc > 0);
I got confused about the API. If there is a timeout, you get rc == 0,
but I think I was expecting rc == 1, event.event == WL_TIMEOUT. Oops.
I will fix that when I post a new experimental version that uses the
bgworker as discussed, and we can try to figure out if that design
will fly.
--
Thomas Munro
http://www.enterprisedb.com
On Sat, Nov 17, 2018 at 4:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Nov 14, 2018 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-11-14 16:36:49 -0500, Robert Haas wrote:
But how do you make reading that counter atomic with the open() itself?
I don't see why it has to be. As long as the "fd generation" assignment
happens before fsync (and writes secondarily), there ought not to be any
further need for synchronicity?
If the goal is to have the FD that is opened first end up in the
checkpointer's table, grabbing a counter backwards does not achieve
it, because there's a race.
S1: open FD
S2: open FD
S2: local_counter = shared_counter++
S1: local_counter = shared_counter++
Now S1 was opened first but has a higher shared counter value than S2
which was opened later. Does that matter? Beats me! I just work
here...
It's not important for the sequence numbers to match the opening order
exactly (that'd work too but be expensive to orchestrate). It's
important for the sequence numbers to be assigned before each backend
does its first pwrite(). That gives us the following interleavings to
worry about:
S1: local_counter = shared_counter++
S2: local_counter = shared_counter++
S1: pwrite()
S2: pwrite()
S1: local_counter = shared_counter++
S2: local_counter = shared_counter++
S2: pwrite()
S1: pwrite()
S1: local_counter = shared_counter++
S1: pwrite()
S2: local_counter = shared_counter++
S2: pwrite()
... plus the same interleavings with S1 and S2 labels swapped. In all
6 orderings, the fd that has the lowest sequence number can see errors
relating to write-back of kernel buffers dirtied by both pwrite()
calls.
Or to put it another way, you can't be given a lower sequence number
than another process that has already written, because that other
process must have been given a sequence number before it wrote.
--
Thomas Munro
http://www.enterprisedb.com
On Fri, Nov 16, 2018 at 5:38 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Or to put it another way, you can't be given a lower sequence number
than another process that has already written, because that other
process must have been given a sequence number before it wrote.
OK, that makes sense.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Nov 17, 2018 at 10:53 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Thu, Nov 15, 2018 at 5:09 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
While testing this patch with frequent checkpoints I've stumbled upon an
interesting error, that happened already after I finished one test:
TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574)
Fixed in the 0001 patch (and a similar problem in the WIN32 branch).
On Thu, Nov 15, 2018 at 10:37 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
That sounds a little like you are proposing to go back to the way
things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
the division of labor wouldn't be quite the same.
But is there an argument against it? The checkpointer would still be
creating checkpoints including running fsync, but the background
writer would be, erm, writing, erm, in the background.
I don't know. I guess the fact that the checkpointer is still
performing the fsyncs is probably a key point. I mean, in the old
division of labor, fsyncs could interrupt the background writing that
was supposed to be happening.
Robert explained off-list that BgBufferSync() and BufferSync() have
rather different goals, and performing them in the same loop without
major reengineering to merge their logic would probably not work out
well. So I'm abandoning that plan for now (though it could perhaps be
interesting if done right).
I do have a new plan though...
On Wed, Nov 14, 2018 at 7:01 AM Andres Freund <andres@anarazel.de> wrote:
... I've previously wondered whether
there's any way we could delay the write to a point where the buffer is
not locked anymore - as far as I can tell it's actually not required for
correctness that we send the fsync request before unlocking. It's
architecturally a bit dicey tho :(
... and it's basically what Andres said there ^^^.
The specific hazard I wondered about is when a checkpoint begins after
BufferAlloc() calls pwrite() but before it calls sendto(), so that we
fail to fsync() a file that was modified before the checkpoint LSN.
But, AFAICS, assuming we call sendto() before we update the buffer
header, there are only two possibilities from the point of view of
BufferAlloc():
1. The checkpointer's BufferSync() loop arrives before we update the
buffer header, so it sees the buffer as dirty, writes it out (again),
remembers that the segment is dirty, and then when we eventually get
the buffer header lock we see that it's not dirty anymore and we just
skip the buffer.
2. The checkpointer's BufferSync() loop arrives after we updated the
buffer header, so it sees it as invalid (or some later state), which
means that we have already called sendto() (before we updated the
header).
Either way, the checkpointer finishes up calling fsync() before the
checkpoint completes, as it should, and the worst that can happen due
to bad timing is a harmless double pwrite().
I noticed a subtle problem though. Suppose we have case 2 above.
After BufferSync() returns in the checkpointer, our backend has called
sendto() to register a dirty file. In v2 the checkpointer runs
AbsorbAllFsyncRequests() to drain the pipe until it sees a message for
the current cycle (that is, it absorbs messages for the previous
cycle). That's obviously not good enough, since backends race to call
sendto() and a message for cycle n - 1 might be hiding behind a
message for cycle n. So I propose to drain the pipe until it is empty
or we see a message for cycle n + 1 (where n is the current cycle
before we start draining, meaning that we ran out of fds and forced a
new cycle in FlushFsyncRequestQueueIfNecessary()). I think that
works, because although we can't be sure that we'll receive all
messages for n - 1 before we receive a message for n due to races on
the insert side, we *can* be certain that we'll receive all messages
for n - 1 before we receive a message for n + 1, because we know that
they were already in the pipe before we began. In the happy case, our
system never runs out of fds so the pipe will eventually be empty,
since backends send at most one message per cycle per file and the
number of files is finite, and in the sad case, there are too many
dirty files per cycle, so we keep creating new cycles while absorbing,
but again the loop is bounded because we know that seeing n + 1 is
enough for our purpose (which is to fsync all files that were already
mentioned in messages sent to the pipe before we started our loop).
That's implemented in the 0002 patch, separated for ease of review.
The above theories cover BufferAlloc()'s interaction with a
checkpoint, and that seems to be the main story. I'm not entirely
sure about the other callers of FlushBuffer() or FlushOneBuffer() (eg
special case init fork stuff), but I've run out of brain power for now
and wanted to post an update.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Refactor-the-checkpointer-s-data-sync-request-que-v3.patch
0002-Fix-deadlock-by-sending-without-content-lock-but--v3.patch
On Fri, Nov 23, 2018 at 5:45 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I do have a new plan though...
Ugh. The plan in my previous email doesn't work, I was confused about
the timing of the buffer header update. Back to the drawing board.
--
Thomas Munro
http://www.enterprisedb.com
On Mon, Nov 26, 2018 at 11:47 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Fri, Nov 23, 2018 at 5:45 PM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > I do have a new plan though...
>
> Ugh. The plan in my previous email doesn't work, I was confused about
> the timing of the buffer header update. Back to the drawing board.

Any chance to share the drawing board with the ideas? :)
On a serious note, I assume you have plans to work on this during the next
CF, right?
On Sun, Dec 2, 2018 at 1:46 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Mon, Nov 26, 2018 at 11:47 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > On Fri, Nov 23, 2018 at 5:45 PM Thomas Munro
> > <thomas.munro@enterprisedb.com> wrote:
> > > I do have a new plan though...
> >
> > Ugh. The plan in my previous email doesn't work, I was confused about
> > the timing of the buffer header update. Back to the drawing board.
>
> Any chance to share the drawing board with the ideas? :)
> On a serious note, I assume you have plans to work on this during the next
> CF, right?
Indeed I am. Unfortunately, the solution to that deadlock eludes me still.
So, I have split this work into multiple patches. 0001 is a draft
version of some new infrastructure I'd like to propose, 0002 is the
thing originally described by the first two paragraphs in the first
email in this thread, and the rest I'll have to defer for now (the fd
passing stuff).
To restate the purpose of this work: I want to make it possible for
other patches to teach the checkpointer to fsync new kinds of files
that are accessed through the buffer pool. Specifically, undo segment
files (for zheap) and SLRU files (see Shawn Debnath's plan to put clog
et al into the standard buffer pool). The main changes are:
1. A bunch of stuff moved out of md.c into smgrsync.c, where the same
pendingOpTable machinery can be shared by any block storage
implementation.
2. The actual fsync'ing now happens by going through smgrsyncimmed().
3. You can now tell the checkpointer to forget individual segments
(undo and slru both need to be able to do that when they truncate data
from the 'front').
4. The protocol for forgetting relations etc. is slightly different:
if a file is found to be missing, we call AbsorbFsyncRequests() and then
probe to see whether the segment number has disappeared from the set
(instead of using cancel flags), though I still need to test this case.
5. Requests (ie segment numbers) are now stored in a sorted vector,
because it doesn't make sense to store large and potentially sparse
integers in bitmapsets. See patch 0001 for new machinery to support
that.
The interfaces in 0001 are perhaps a bit wordy and verbose (and hard
to fit in 80 columns). Maybe I need something better for memory
contexts. Speaking of which, it wasn't possible to do a
guaranteed-no-alloc merge (like the one done for zero-anchored
bitmapset in commit 1556cb2fc), so I had to add a second vector for
'in progress' segments. I merge them with the main set on the next
attempt, if it's found to be non-empty. Very open to better ideas on
how to do any of this.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Add-parameterized-vectors-and-sorting-searching-s-v4.patch (+625 -1)
0002-Refactor-the-fsync-machinery-to-support-future-SM-v4.patch (+992 -888)
On Tue, Jan 1, 2019 at 10:41 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> So, I have split this work into multiple patches. 0001 is a draft
> version of some new infrastructure I'd like to propose, 0002 is the
> thing originally described by the first two paragraphs in the first
> email in this thread, and the rest I'll have to defer for now (the fd
> passing stuff).
Apologies, there was a header missing from 0002, and a small change
needed to a contrib file that I missed. Here is a new version.
For the 0001 patch, I'll probably want to reconsider the naming a bit
("simple" -> "specialized", "generic", ...?), refine (ability to turn
off the small vector optimisation? optional MemoryContext? ability
to extend without copying or zero-initialising at the same time?
comparators with a user data parameter? two-value comparators vs
three-value comparators? qsort with inline comparator? etc etc), and
remove some gratuitous C++ cargo cultisms, and maybe also instantiate
the thing centrally for some common types (I mean, perhaps 0002 should
use a common uint32_vector rather than defining its own
segnum_vector?).
I suppose an alternative would be to use simplehash for the set of
segment numbers here, but it seems like overkill and would waste a ton
of memory in the common case of holding a single number. I wondered
also about some kind of tree (basically, C++ set) but it seems much
more complicated and would still be much larger for common cases.
Sorted vectors seem to work pretty well here (but would lose in
theoretical cases where you insert low values in large sets, but not
in practice here AFAICS).
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Add-parameterized-vectors-and-sorting-searching-s-v5.patch (+621 -1)
0002-Refactor-the-fsync-machinery-to-support-future-SM-v5.patch (+1037 -889)
On Wed, Jan 2, 2019 at 11:40 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> For the 0001 patch, I'll probably want to reconsider the naming a bit
> ("simple" -> "specialized", "generic", ...?), refine (ability to turn
> off the small vector optimisation? optional MemoryContext? ability
> to extend without copying or zero-initialising at the same time?
> comparators with a user data parameter? two-value comparators vs
> three-value comparators? qsort with inline comparator? etc etc), and
> remove some gratuitous C++ cargo cultisms, and maybe also instantiate
> the thing centrally for some common types (I mean, perhaps 0002 should
> use a common uint32_vector rather than defining its own
> segnum_vector?).
Here's a new version that fixes a couple of stupid bugs (mainly a
broken XXX_lower_bound(), which I replaced with the standard algorithm
I see in many sources).
I couldn't resist the urge to try porting pg_qsort() to this style.
It seems to be about twice as fast as the original at sorting integers
on my machine with -O2. I suppose people aren't going to be too
enthusiastic about yet another copy of qsort in the tree, but maybe
this approach (with a bit more work) could replace the Perl code-gen
for tuple sorting. Then the net number of copies wouldn't go up, but
this could be used for more things too, and it fits with the style of
simplehash.h and simplevector.h. Thoughts?
--
Thomas Munro
http://www.enterprisedb.com