CommitFest 2010-07: Yay, Kevin! Thanks, reviewers!
Kevin didn't send out an official gavel-banging announcement of the
end of CommitFest 2010-07 (possibly because I neglected until today to
give him privileges to actually change it in the web application), but
I'd just like to take a minute to thank him publicly for his efforts.
We started this CommitFest with something like 60 patches, which is
definitely on the larger side for a CommitFest, and Kevin did a great
job staying on top of what was going on with all of them and, I felt,
really helped keep us on track. At the same time, I felt he did this
with a very light touch that made the whole thing go very smoothly.
So -- thanks, Kevin!
I also appreciate the efforts of all those who reviewed. Good reviews
are really critical to keep the burden from building up on committers,
and I appreciate the efforts of everyone who contributed, in many
cases probably on their own time. I'm particularly grateful to the
people who were vigilant about spelling, grammar, coding style,
whitespace, and other nitpicky little issues that are not much fun,
but which at least for me are a major time sink if they're still
lingering when it comes time to do the actual commit.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> wrote:
I'd just like to take a minute to thank him publicly for his
efforts. We started this CommitFest with something like 60
patches, which is definitely on the larger side for a CommitFest,
and Kevin did a great job staying on top of what was going on with
all of them and, I felt, really helped keep us on track. At the
same time, I felt he did this with a very light touch that made
the whole thing go very smoothly. So -- thanks, Kevin!
You're welcome. It was educational for me. I don't think I want to
try to handle two in a row, and I think your style is better suited
than mine to the final CF for a release, but I might be able to take
on the 2010-11 CF if people want that.
My hand was not always so light behind the scenes, though -- I sent
or received about 100 off-list emails to try to keep things moving.
Hopefully nobody was too offended by my nagging. :-)
Oh, and thanks for putting together the CF web application. Without
that, I couldn't have done half as well as I did.
I also appreciate the efforts of all those who reviewed.
Yes, I'll second that! I've always been impressed with the
PostgreSQL community, and managing this CF gave me new insights and
appreciation for the intelligence, professionalism, and community
spirit of its members -- authors, reviewers, and committers.
-Kevin
Kevin Grittner wrote:
I don't think I want to try to handle two in a row, and I think your style is better suited
than mine to the final CF for a release, but I might be able to take on the 2010-11 CF if people want that.
Ha, you just put yourself right back on the hook with that comment, and
Robert does seem like the right guy for CF-4 @ 2011-01. Leaving the
question of what's going to happen with CF-2 next month.
I think the crucial thing with the 2010-09 CF is that we have to get
serious progress made sorting out all the sync rep ideas before/during
that one. The review Yeb did and subsequent discussion was really
helpful, but the scope on that needs to actually get nailed down to
*something* concrete if it's going to get built early enough in the 9.1
release to be properly reviewed and tested for more than one round.
Parts of the design and scope still feel like they're expanding to me,
and I think having someone heavily involved in the next CF who is
willing to push on nailing down that particular area is pretty
important. Will volunteer myself if I can stay on schedule to make it
past the major time commitment sink I've had so far this year by then.
--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
On Wed, Aug 18, 2010 at 7:46 PM, Greg Smith <greg@2ndquadrant.com> wrote:
Kevin Grittner wrote:
I don't think I want to try to handle two in a row, and I think your style
is better suited
than mine to the final CF for a release, but I might be able to take on
the 2010-11 CF if people want that.
Ha, you just put yourself right back on the hook with that comment, and
Robert does seem like the right guy for CF-4 @ 2011-01. Leaving the
question of what's going to happen with CF-2 next month.
My reputation precedes me, apparently. Although I appreciate everyone
so far being willing to avoid mentioning exactly what that reputation
might be. :-)
I think the crucial thing with the 2010-09 CF is that we have to get serious
progress made sorting out all the sync rep ideas before/during that one.
The review Yeb did and subsequent discussion was really helpful, but the
scope on that needs to actually get nailed down to *something* concrete if
it's going to get built early enough in the 9.1 release to be properly
reviewed and tested for more than one round. Parts of the design and scope
still feel like they're expanding to me, and I think having someone heavily
involved in the next CF who is willing to push on nailing down that
particular area is pretty important. Will volunteer myself if I can stay on
schedule to make it past the major time commitment sink I've had so far this
year by then.
Sitting on Sync Rep is a job and a half by itself, without adding all
the other CF work on top of it. Maybe we should try to find two
vi^Holunteers: a CommitFest Manager (CFM) and a Major Feature
Babysitter (MBS). At any rate, we should definitely NOT wait another
month to start thinking about Sync Rep again. I haven't actually
looked at any of the Sync Rep code AT ALL but IIRC Heikki expressed
the view that the biggest thing standing in the way of a halfway
decent Sync Rep implementation was a number of polling loops that
needed to be replaced with something that wouldn't introduce
up-to-100ms delays. And so far we haven't seen a patch for that.
Somebody write one. And then let's get it reviewed and committed RSN.
It may seem like we're early in the release cycle yet, but for a
feature of this magnitude we are not. We committed way too much big
stuff at the very end of the last release cycle; Hot Standby was still
being cleaned up in May after commit in November. We'll be lucky to
commit sync rep that early.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On 19/08/10 04:46, Robert Haas wrote:
At any rate, we should definitely NOT wait another
month to start thinking about Sync Rep again.
Agreed. EnterpriseDB is interested in having that feature, so I'm on the
hook to spend time on it regardless of commitfests.
I haven't actually
looked at any of the Sync Rep code AT ALL but IIRC Heikki expressed
the view that the biggest thing standing in the way of a halfway
decent Sync Rep implementation was a number of polling loops that
needed to be replaced with something that wouldn't introduce
up-to-100ms delays.
Well, that's the only uncontroversial thing about it that doesn't
require any fighting over the UI or desired behavior. That's why I've
focused on that first, and also because it's useful regardless of
synchronous replication. But once that's done, we'll have to nail down
how synchronous replication is supposed to behave, and how to configure it.
And so far we haven't seen a patch for that.
Somebody write one. And then let's get it reviewed and committed RSN.
Fujii is on vacation, but I've started working on it. The two issues
with Fujii's latest patch are that it would not respond promptly on
platforms where signals don't interrupt sleep, and it suffers the
classic race condition that pselect() was invented for. I'm going to
replace pg_usleep() with select(), and use the so called "self-pipe
trick" to get over the race condition. I have that written up but I want
to do some testing and cleanup before posting the patch.
It may seem like we're early in the release cycle yet, but for a
feature of this magnitude we are not. We committed way too much big
stuff at the very end of the last release cycle; Hot Standby was still
being cleaned up in May after commit in November. We'll be lucky to
commit sync rep that early.
Agreed. We need to decide the scope and minimum set of features real
soon to get something concrete finished.
BTW, on what platforms signals don't interrupt sleep? Although that
issue has been discussed many times before, I couldn't find any
reference to a real platform in the archives.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
BTW, on what platforms signals don't interrupt sleep? Although that
issue has been discussed many times before, I couldn't find any
reference to a real platform in the archives.
I've got one in captivity (my old HPUX box). Happy to test whatever you
come up with.
Considering that pg_usleep is implemented with select, I'm not following
what you mean by "replace pg_usleep() with select()"?
regards, tom lane
Excerpts from Heikki Linnakangas's message of Thu Aug 19 02:02:34 -0400 2010:
On 19/08/10 04:46, Robert Haas wrote:
And so far we haven't seen a patch for that.
Somebody write one. And then let's get it reviewed and committed RSN.
Fujii is on vacation, but I've started working on it. The two issues
with Fujii's latest patch are that it would not respond promptly on
platforms where signals don't interrupt sleep, and it suffers the
classic race condition that pselect() was invented for. I'm going to
replace pg_usleep() with select(), and use the so called "self-pipe
trick" to get over the race condition. I have that written up but I want
to do some testing and cleanup before posting the patch.
Hmm, IIRC the self-pipe trick doesn't work on Windows, mainly because
select() doesn't handle pipes, only sockets. You may need some extra
hack to make it work there.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Thu, Aug 19, 2010 at 17:08, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Excerpts from Heikki Linnakangas's message of Thu Aug 19 02:02:34 -0400 2010:
On 19/08/10 04:46, Robert Haas wrote:
And so far we haven't seen a patch for that.
Somebody write one. And then let's get it reviewed and committed RSN.
Fujii is on vacation, but I've started working on it. The two issues
with Fujii's latest patch are that it would not respond promptly on
platforms where signals don't interrupt sleep, and it suffers the
classic race condition that pselect() was invented for. I'm going to
replace pg_usleep() with select(), and use the so called "self-pipe
trick" to get over the race condition. I have that written up but I want
to do some testing and cleanup before posting the patch.
Hmm, IIRC the self-pipe trick doesn't work on Windows, mainly because
select() doesn't handle pipes, only sockets. You may need some extra
hack to make it work there.
We have a pipe implementation using sockets that is used on Windows
for just this reason, IIRC.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On 19/08/10 18:08, Alvaro Herrera wrote:
Excerpts from Heikki Linnakangas's message of Thu Aug 19 02:02:34 -0400 2010:
On 19/08/10 04:46, Robert Haas wrote:
And so far we haven't seen a patch for that.
Somebody write one. And then let's get it reviewed and committed RSN.
Fujii is on vacation, but I've started working on it. The two issues
with Fujii's latest patch are that it would not respond promptly on
platforms where signals don't interrupt sleep, and it suffers the
classic race condition that pselect() was invented for. I'm going to
replace pg_usleep() with select(), and use the so called "self-pipe
trick" to get over the race condition. I have that written up but I want
to do some testing and cleanup before posting the patch.
Hmm, IIRC the self-pipe trick doesn't work on Windows, mainly because
select() doesn't handle pipes, only sockets. You may need some extra
hack to make it work there.
That's fine, we can do the naive set-a-flag implementation on Windows
because on Windows our "signals" are only delivered at
CHECK_FOR_INTERRUPTS(), so we don't have the race condition to begin with.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 19/08/10 16:38, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
BTW, on what platforms signals don't interrupt sleep? Although that
issue has been discussed many times before, I couldn't find any
reference to a real platform in the archives.
I've got one in captivity (my old HPUX box). Happy to test whatever you
come up with.
Considering that pg_usleep is implemented with select, I'm not following
what you mean by "replace pg_usleep() with select()"?
Instead of using pg_usleep(), call select() directly, waiting not only
for the timeout, but also for data to arrive on the "self-pipe". The
signal handler writes a byte to the self-pipe, waking up the select().
That way the select() is interrupted by the signal arriving, even if
signals per se don't interrupt it. And it closes the race condition
involved with setting a flag in the signal handler and checking that in
the main loop.
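For illustration, the mechanism Heikki describes can be sketched in C roughly as follows. This is a hypothetical, simplified sketch (SIGUSR1 and the function names are placeholders, not the actual patch): the handler writes a byte to a non-blocking pipe, and the sleep is a select() that also watches the pipe's read end, so a signal arriving at any point makes (or leaves) the pipe readable and the sleep returns promptly.

```c
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/select.h>
#include <unistd.h>

static int selfpipe[2];

/* Signal handler: wake the sleeper. write() is async-signal-safe. */
static void wakeup_handler(int sig)
{
    char c = 0;
    (void) write(selfpipe[1], &c, 1);
}

static void selfpipe_init(void)
{
    if (pipe(selfpipe) < 0)
        _exit(1);
    (void) fcntl(selfpipe[0], F_SETFL, O_NONBLOCK);
    (void) fcntl(selfpipe[1], F_SETFL, O_NONBLOCK);
    (void) signal(SIGUSR1, wakeup_handler);
}

/* Sleep up to usec microseconds; return 1 if woken by the signal,
 * 0 on timeout.  A signal delivered before this call still wakes us,
 * because the byte is already sitting in the pipe. */
static int interruptible_sleep(long usec)
{
    fd_set rfds;
    struct timeval tv;

    FD_ZERO(&rfds);
    FD_SET(selfpipe[0], &rfds);
    tv.tv_sec = usec / 1000000L;
    tv.tv_usec = usec % 1000000L;
    if (select(selfpipe[0] + 1, &rfds, NULL, NULL, &tv) > 0)
    {
        char buf[16];
        /* Drain the pipe after waking up, before doing the work. */
        while (read(selfpipe[0], buf, sizeof(buf)) > 0)
            ;
        return 1;
    }
    return 0;
}
```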
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 19/08/10 16:38, Tom Lane wrote:
Considering that pg_usleep is implemented with select, I'm not following
what you mean by "replace pg_usleep() with select()"?
Instead of using pg_usleep(), call select() directly, waiting not only
for the timeout, but also for data to arrive on the "self-pipe". The
signal handler writes a byte to the self-pipe, waking up the select().
That way the select() is interrupted by the signal arriving, even if
signals per se don't interrupt it. And it closes the race condition
involved with setting a flag in the signal handler and checking that in
the main loop.
Hmm, but couldn't you still do that inside pg_usleep? Signal handlers
that do that couldn't know if they were interrupting a sleep per se,
so this would have to be a backend-wide convention.
regards, tom lane
On 19/08/10 19:57, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
On 19/08/10 16:38, Tom Lane wrote:
Considering that pg_usleep is implemented with select, I'm not following
what you mean by "replace pg_usleep() with select()"?
Instead of using pg_usleep(), call select() directly, waiting not only
for the timeout, but also for data to arrive on the "self-pipe". The
signal handler writes a byte to the self-pipe, waking up the select().
That way the select() is interrupted by the signal arriving, even if
signals per se don't interrupt it. And it closes the race condition
involved with setting a flag in the signal handler and checking that in
the main loop.
Hmm, but couldn't you still do that inside pg_usleep? Signal handlers
that do that couldn't know if they were interrupting a sleep per se,
so this would have to be a backend-wide convention.
I don't understand, do what inside pg_usleep()?
We only need to respond quickly to one signal, the one that tells
walsender "there's some new WAL that you should send". We can rely on
polling for all the other signals like SIGHUP for config reload or
shutdown request, like we do today.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 19/08/10 19:57, Tom Lane wrote:
Hmm, but couldn't you still do that inside pg_usleep? Signal handlers
that do that couldn't know if they were interrupting a sleep per se,
so this would have to be a backend-wide convention.
I don't understand, do what inside pg_usleep()?
I was envisioning still using pg_usleep, and having this interaction
between signal handlers and the delay be private to pg_usleep, rather
than putting such ugly code into forty-nine different places. There
are a *lot* of places where we have loops that break down delays
into at-most-one-second pg_usleep calls, and if we're going to have a
hack like this we should fix them all, not just two or three that SR
cares about.
But I'm still not seeing how this self-pipe hack avoids a race
condition. If the signal handler is sending a byte whenever it
executes, then you could have bytes already stacked up in the pipe
from previous interrupts that didn't happen to come while inside
pg_usleep. If you clear those before sleeping, you have a race
condition, and if you don't, then you fail to sleep the intended
amount of time even though there was no interrupt this time.
regards, tom lane
On 19/08/10 20:18, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
On 19/08/10 19:57, Tom Lane wrote:
Hmm, but couldn't you still do that inside pg_usleep? Signal handlers
that do that couldn't know if they were interrupting a sleep per se,
so this would have to be a backend-wide convention.
I don't understand, do what inside pg_usleep()?
I was envisioning still using pg_usleep, and having this interaction
between signal handlers and the delay be private to pg_usleep, rather
than putting such ugly code into forty-nine different places. There
are a *lot* of places where we have loops that break down delays
into at-most-one-second pg_usleep calls, and if we're going to have a
hack like this we should fix them all, not just two or three that SR
cares about.
Hmm, will need to think about a suitable API for that. The nice thing
would be that we could implement it using pselect() where available.
(And reliable - the Linux select() man page says that glibc's pselect()
is emulated using select(), and suffers from the very same race
condition pselect() was invented to solve. How awful is that!?)
But I'm still not seeing how this self-pipe hack avoids a race
condition. If the signal handler is sending a byte whenever it
executes, then you could have bytes already stacked up in the pipe
from previous interrupts that didn't happen to come while inside
pg_usleep. If you clear those before sleeping, you have a race
condition, and if you don't, then you fail to sleep the intended
amount of time even though there was no interrupt this time.
You clear the pipe after waking up. Before sending all the pending WAL
and deciding that you're fully caught up again:
for(;;)
{
1. select()
2. clear pipe
3. send any pending WAL
}
There's more information on the self-pipe trick at e.g.
http://lwn.net/Articles/177897/
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 19/08/10 20:18, Tom Lane wrote:
But I'm still not seeing how this self-pipe hack avoids a race
condition. If the signal handler is sending a byte whenever it
executes, then you could have bytes already stacked up in the pipe
from previous interrupts that didn't happen to come while inside
pg_usleep. If you clear those before sleeping, you have a race
condition, and if you don't, then you fail to sleep the intended
amount of time even though there was no interrupt this time.
You clear the pipe after waking up.
Hmm ... that doesn't answer my second objection about failing to sleep
the expected amount of time, but on reflection I guess that can't be
done: we have to be able to cope with interrupts occurring just before
the sleep actually begins, and there's no way to define "just before"
except by reference to the calling code having done whatever it might
need to do before/after sleeping.
Hmm, will need to think about a suitable API for that.
Offhand I'd suggest something like
SetSleepInterrupt() -- called by signal handlers, writes pipe
ClearSleepInterrupt() -- called by sleep-and-do-something loops, clears pipe
pg_usleep() itself remains the same, but it is now guaranteed to return
immediately if SetSleepInterrupt is called, or has been called since the
last ClearSleepInterrupt.
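A minimal flag-only sketch of the semantics being proposed here (the function names come from the suggestion above; everything else is hypothetical). Note that a real implementation would have to drive the sleep off the self-pipe rather than the short polling loop used here, precisely to close the race between checking the flag and going to sleep:

```c
#include <assert.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t sleep_interrupt_pending = 0;

/* Called by signal handlers (would also write the self-pipe). */
static void SetSleepInterrupt(void)
{
    sleep_interrupt_pending = 1;
}

/* Called by sleep-and-do-something loops before doing the work. */
static void ClearSleepInterrupt(void)
{
    sleep_interrupt_pending = 0;
}

/* Returns immediately (1) if an interrupt has been set since the last
 * ClearSleepInterrupt; otherwise sleeps up to usec and returns 0 on
 * timeout.  The 1 ms polling here is only to keep the sketch short. */
static int pg_usleep_interruptible(long usec)
{
    long remaining = usec;

    while (remaining > 0)
    {
        if (sleep_interrupt_pending)
            return 1;
        usleep(remaining < 1000 ? remaining : 1000);
        remaining -= 1000;
    }
    return sleep_interrupt_pending ? 1 : 0;
}
```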
The nice thing
would be that we could implement it using pselect() where available.
(And reliable - the Linux select() man page says that glibc's pselect()
is emulated using select(), and suffers from the very same race
condition pselect() was invented to solve. How awful is that!?)
Ick. So how would you tell if it's trustworthy?
regards, tom lane
On Thu, Aug 19, 2010 at 08:36:09PM +0300, Heikki Linnakangas wrote:
[...]
Hmm, will need to think about a suitable API for that. The nice thing would
be that we could implement it using pselect() where available. (And
reliable - the Linux select() man page says that glibc's pselect() is
emulated using select(), and suffers from the very same race condition
pselect() was invented to solve. How awful is that!?)
It is indeed. It seems, though, that from Linux kernel 2.6.16 and glibc
2.4 on, things look better [1]. As a reference, Debian stable (not known
to adventure too far into the present ;-) is libc 2.7 on kernel 2.6.26.
Of course, "enterprise" GNU/Linux distros are said to be even more
conservative...
[1]: <http://lwn.net/Articles/176911/>
Regards
-- tomás
On 19/08/10 20:59, Tom Lane wrote:
Offhand I'd suggest something like
SetSleepInterrupt() -- called by signal handlers, writes pipe
ClearSleepInterrupt() -- called by sleep-and-do-something loops, clears pipe
pg_usleep() itself remains the same, but it is now guaranteed to return
immediately if SetSleepInterrupt is called, or has been called since the
last ClearSleepInterrupt.
Hmm, we have pg_usleep() calls in some fairly low-level functions, like
mdunlink() and s_lock(). If someone has called SetSleepInterrupt(), we
don't want those pg_usleep()s to return immediately. And pg_usleep() is
used in some client code too. I think we need a separate sleep function
for this.
Another idea is to not use unix signals at all, but ProcSendSignal() and
ProcWaitForSignal(). We would not need the signal handler at all.
Walsender would use ProcWaitForSignal() instead of pg_usleep() and
backends that want to wake it up would use ProcSendSignal(). The problem
is that there is currently no way to specify a timeout, but I presume
the underlying semaphore operations have that capability, and we could
expose it.
Actually ProcSendSignal()/ProcWaitForSignal() won't work as is, because
walsender doesn't have a PGPROC entry, but you could easily build a
similar mechanism, using PGSemaphoreLock/Unlock like
ProcSendSignal()/ProcWaitForSignal() does.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Hmm, we have pg_usleep() calls in some fairly low-level functions, like
mdunlink() and s_lock(). If someone has called SetSleepInterrupt(), we
don't want those pg_usleep()s to return immediately. And pg_usleep() is
used in some client code too. I think we need a separate sleep function
for this.
Well, we'd need some careful thought about which sleeps need what, but I
don't necessarily have an objection to a separate interruptable sleep
function.
Another idea is to not use unix signals at all, but ProcSendSignal() and
ProcWaitForSignal(). We would not need the signal handler at all.
Walsender would use ProcWaitForSignal() instead of pg_usleep() and
backends that want to wake it up would use ProcSendSignal().
You keep on proposing solutions that only work for walsender :-(.
Most of the other places where we have pg_usleep actually do want
a timed sleep, I believe. It's also unclear that we can always expect
ProcSendSignal to be usable --- for example, stuff like SIGHUP would
be sent by processes that might not be connected to shared memory.
The problem
is that there is currently no way to specify a timeout, but I presume
the underlying semaphore operations have that capability, and we could
expose it.
They don't, or at least the semop-based implementation doesn't.
regards, tom lane
On 20/08/10 16:24, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
Hmm, we have pg_usleep() calls in some fairly low-level functions, like
mdunlink() and s_lock(). If someone has called SetSleepInterrupt(), we
don't want those pg_usleep()s to return immediately. And pg_usleep() is
used in some client code too. I think we need a separate sleep function
for this.
Well, we'd need some careful thought about which sleeps need what, but I
don't necessarily have an objection to a separate interruptable sleep
function.
If we have to, we could also support multiple interrupts with multiple
self-pipes, so that you can choose at pg_usleep() which ones to wake up on.
Another idea is to not use unix signals at all, but ProcSendSignal() and
ProcWaitForSignal(). We would not need the signal handler at all.
Walsender would use ProcWaitForSignal() instead of pg_usleep() and
backends that want to wake it up would use ProcSendSignal().
You keep on proposing solutions that only work for walsender :-(.
Well yes, the other places where we use pg_usleep() are not really a
problem as is. If as a side-effect we can make them respond more quickly
to signals, with small changes, that's good, but walsender is the only
one that's performance critical.
That said, a select() based solution is my current favorite.
Most of the other places where we have pg_usleep actually do want
a timed sleep, I believe. It's also unclear that we can always expect
ProcSendSignal to be usable --- for example, stuff like SIGHUP would
be sent by processes that might not be connected to shared memory.
The problem
is that there is currently no way to specify a timeout, but I presume
the underlying semaphore operations have that capability, and we could
expose it.
They don't, or at least the semop-based implementation doesn't.
There's semtimedop(). I don't know how portable it is, but it seems to
exist at least on Linux, Solaris, HPUX and AIX. On what platforms do we
use sysv semaphores?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
[ It's way past time to change the thread title ]
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 20/08/10 16:24, Tom Lane wrote:
You keep on proposing solutions that only work for walsender :-(.
Well yes, the other places where we use pg_usleep() are not really a
problem as is.
Well, yes they are. They cause unnecessary process wakeups and thereby
consume cycles even when the database is idle. See for example a
longstanding complaint here:
https://bugzilla.redhat.com/show_bug.cgi?id=252129
If we're going to go to the trouble of having a mechanism like this,
I'd like it to fix that problem so I can close out that bug.
There's semtimedop(). I don't know how portable it is, but it seems to
exist at least on Linux, Solaris, HPUX and AIX.
It's not on my HPUX, and I don't see it in the Single Unix Spec.
On what platforms do we use sysv semaphores?
AFAIK, everything except Windows and extremely old versions of OS X.
regards, tom lane
On 20/08/10 17:28, Tom Lane wrote:
[ It's way past time to change the thread title ]
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
On 20/08/10 16:24, Tom Lane wrote:
You keep on proposing solutions that only work for walsender :-(.
Well yes, the other places where we use pg_usleep() are not really a
problem as is.Well, yes they are. They cause unnecessary process wakeups and thereby
consume cycles even when the database is idle. See for example a
longstanding complaint here:
https://bugzilla.redhat.com/show_bug.cgi?id=252129
If we're going to go to the trouble of having a mechanism like this,
I'd like it to fix that problem so I can close out that bug.
Hmm, if you want to put bgwriter and walwriter to deep sleep, then
someone will need to wake them up when they have work to do. Currently
they poll. Maybe they should just sleep longer, like 10 seconds, if
there hasn't been any work to do in the last X wakeups.
We've been designing the new sleep facility so that the event that wakes
up the sleep is sent from the signal handler in the same process, but it
seems that all the potential users would actually want to be woken up
from *another* process, so the signal handler seems like an unnecessary
middleman. Particularly on Windows where signals are simulated with
pipes and threads, while you could just send a Windows event directly
from one process to another.
A common feature that all the users of this facility want is that once
the event is sent, re-sending it is a fast no-op until re-enabled by the
receiver. For example, if we need backends to wake up bgwriter after
dirtying a buffer, you don't want to waste many cycles determining that
bgwriter is already active and doesn't need to be woken up.
Let's call these "latches". I'm thinking of something like this:
/* Similar to LWLockId */
typedef enum
{
BgwriterLatch,
WalwriterLatch,
/* plus one for each walsender */
} LatchId;
/*
* Wait for given latch to be set. Only one process can wait
* for a given latch at a time.
*/
WaitLatch(LatchId latch, long timeout);
/*
* Sets latch. Returns quickly if the latch is set already.
*/
SetLatch(LatchId latch);
/*
* Clear the latch. Calling WaitLatch after this will sleep, unless
* the latch is set again before the WaitLatch call.
*/
ResetLatch(LatchId latch);
There would be a boolean for each latch in shared memory, to indicate if
the latch is "armed", allowing quick return from SetLatch if the latch
is already set. Plus a signal to wake up the waiting process (maybe use
procsignal.c), and the self-pipe trick within the receiving process to
make it race condition free. On Windows, the signal and the self-pipe
trick are replaced with Windows events.
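The proposed semantics can be illustrated with a simplified, single-process sketch (hypothetical; in the real facility the flag would live in shared memory and SetLatch would wake the waiter via a signal plus the self-pipe, or a Windows event, instead of the polling WaitLatch shown here):

```c
#include <assert.h>
#include <signal.h>
#include <unistd.h>

typedef struct
{
    volatile sig_atomic_t is_set;   /* is the latch "armed"? */
} Latch;

/* Set the latch; cheap no-op if it is already set. */
static void SetLatch(Latch *latch)
{
    if (latch->is_set)
        return;
    latch->is_set = 1;
    /* real version: signal the waiting process here */
}

/* Clear the latch so a later WaitLatch blocks again. */
static void ResetLatch(Latch *latch)
{
    latch->is_set = 0;
}

/* Wait up to timeout_usec for the latch to be set.
 * Returns 1 if the latch was set, 0 on timeout. */
static int WaitLatch(Latch *latch, long timeout_usec)
{
    long waited = 0;

    while (!latch->is_set)
    {
        if (waited >= timeout_usec)
            return 0;
        usleep(1000);               /* real version: sleep on self-pipe */
        waited += 1000;
    }
    return 1;
}
```

A waiter would loop on WaitLatch, ResetLatch, do the pending work, and go back to waiting; a setter just calls SetLatch, which is nearly free if the waiter hasn't gotten around to resetting yet.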
I'll try out this approach tomorrow..
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 20/08/10 17:28, Tom Lane wrote:
Well, yes they are. They cause unnecessary process wakeups and thereby
consume cycles even when the database is idle. See for example a
longstanding complaint here:
https://bugzilla.redhat.com/show_bug.cgi?id=252129
If we're going to go to the trouble of having a mechanism like this,
I'd like it to fix that problem so I can close out that bug.
Hmm, if you want to put bgwriter and walwriter to deep sleep, then
someone will need to wake them up when they have work to do. Currently
they poll. Maybe they should just sleep longer, like 10 seconds, if
there hasn't been any work to do in the last X wakeups.
I should probably clarify that I don't expect this patch to solve that
problem all by itself. But ISTM that a guaranteed-interruptible version
of pg_usleep is a necessary prerequisite before we can even think about
dialing down the database's CPU consumption at idle.
We've been designing the new sleep facility so that the event that wakes
up the sleep is sent from the signal handler in the same process, but it
seems that all the potential users would actually want to be woken up
from *another* process, so the signal handler seems like an unnecessary
middleman.
No, because there are also lots of cases where the signal is arriving
from a non-Postgres process; SIGTERM arriving from init being the killer
case that you simply don't get to propose an alternative design for.
More generally, I'm uninterested in decoupling cases like SIGHUP for
postgresql.conf change from the existing signal mechanism, because it's
just too useful to be able to trigger that externally.
Particularly on Windows where signals are simulated with
pipes and threads, while you could just send a Windows event directly
from one process to another.
Windows of course has different constraints, and in particular it lacks
the constraint that we must be able to respond to system-defined
signals. So I wouldn't object to the underlying implementation being
different for Windows.
[ "latch" proposal ]
This seems reasonably clean as far as signal conditions generated
internally to Postgres go, but I remain unclear on how it helps for
response to actual signals.
regards, tom lane
Excerpts from Tom Lane's message of Mon Aug 23 19:44:02 -0400 2010:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
[ "latch" proposal ]
This seems reasonably clean as far as signal conditions generated
internally to Postgres go, but I remain unclear on how it helps for
response to actual signals.
This could probably replace the signalling between postmaster and
autovac launcher, as well.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 24/08/10 04:08, Alvaro Herrera wrote:
Excerpts from Tom Lane's message of Mon Aug 23 19:44:02 -0400 2010:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
[ "latch" proposal ]
This seems reasonably clean as far as signal conditions generated
internally to Postgres go, but I remain unclear on how it helps for
response to actual signals.
This could probably replace the signalling between postmaster and
autovac launcher, as well.
Hmm, postmaster needs to stay out of shared memory..
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 24/08/10 02:44, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
[ "latch" proposal ]
This seems reasonably clean as far as signal conditions generated
internally to Postgres go, but I remain unclear on how it helps for
response to actual signals.
You can set the latch in the signal handler.
Here's a first attempt at implementing that. To demonstrate how it
works, I modified walsender to use the new latch facility, also to
respond quickly to SIGHUP and SIGTERM.
There are two kinds of latches, local and global. Local latches can only
be set from the same process - allowing you to replace pg_usleep() with
something that is always interruptible by signals (by setting the latch
in the signal handler). The global latches work the same, and indeed the
implementation is the same, but the latch resides in shared memory, and
can be set by any process attached to shared memory. On Unix, when you
set a latch waited for by another process, the setter sends SIGUSR1 to
the waiting process, and the signal handler sends the byte to the
self-pipe to wake up the select().
On Windows, we use WaitEvent to wait on a latch, and SetEvent to wake
up. The difference between global and local latches is that for global
latches, the Event object needs to be created upfront at postmaster
startup so that it's inherited by all child processes, and stored in
shared memory. A local Event object can be created only in the process
that needs it.
I put the code in src/backend/storage/ipc/latch.c now, but it probably
ought to go in src/backend/portability instead, with a separate
win32_latch.c file for Windows.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
latches-1.patchtext/x-diff; name=latches-1.patchDownload
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 615a7fa..094d0c9 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -55,6 +55,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
@@ -1025,6 +1026,13 @@ EndPrepare(GlobalTransaction gxact)
/* If we crash now, we have prepared: WAL replay will fix things */
+ /*
+ * Wake up all walsenders to send WAL up to the PREPARE record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
/* write correct CRC and close file */
if ((write(fd, &statefile_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
{
@@ -2005,6 +2013,13 @@ RecordTransactionCommitPrepared(TransactionId xid,
/* Flush XLOG to disk */
XLogFlush(recptr);
+ /*
+ * Wake up all walsenders to send WAL up to the COMMIT PREPARED record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
/* Mark the transaction committed in pg_clog */
TransactionIdCommitTree(xid, nchildren, children);
@@ -2078,6 +2093,13 @@ RecordTransactionAbortPrepared(TransactionId xid,
XLogFlush(recptr);
/*
+ * Wake up all walsenders to send WAL up to the ABORT PREPARED record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
+ /*
* Mark the transaction aborted in clog. This is not absolutely necessary
* but we may as well do it while we are here.
*/
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6bcc55c..942d5c2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -36,6 +36,7 @@
#include "libpq/be-fsstubs.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -1068,6 +1069,13 @@ RecordTransactionCommit(void)
XLogFlush(XactLastRecEnd);
/*
+ * Wake up all walsenders to send WAL up to the COMMIT record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
+ /*
* Now we may update the CLOG, if we wrote a COMMIT record above
*/
if (markXidCommitted)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 53c2581..9f5d1af 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -66,9 +66,6 @@ bool am_walsender = false; /* Am I a walsender process ? */
int max_wal_senders = 0; /* the maximum number of concurrent walsenders */
int WalSndDelay = 200; /* max sleep time between some actions */
-#define NAPTIME_PER_CYCLE 100000L /* max sleep time between cycles
- * (100ms) */
-
/*
* These variables are used similarly to openLogFile/Id/Seg/Off,
* but for walsender to read the XLOG.
@@ -93,6 +90,7 @@ static volatile sig_atomic_t ready_to_stop = false;
static void WalSndSigHupHandler(SIGNAL_ARGS);
static void WalSndShutdownHandler(SIGNAL_ARGS);
static void WalSndQuickDieHandler(SIGNAL_ARGS);
+static void WalSndXLogSendHandler(SIGNAL_ARGS);
static void WalSndLastCycleHandler(SIGNAL_ARGS);
/* Prototypes for private functions */
@@ -144,6 +142,16 @@ WalSenderMain(void)
/* Handle handshake messages before streaming */
WalSndHandshake();
+ /* Initialize shared memory status */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ walsnd->sentPtr = sentPtr;
+ SpinLockRelease(&walsnd->mutex);
+ }
+
/* Main loop of walsender */
return WalSndLoop();
}
@@ -380,8 +388,6 @@ WalSndLoop(void)
/* Loop forever, unless we get an error */
for (;;)
{
- long remain; /* remaining time (us) */
-
/*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
@@ -421,32 +427,41 @@ WalSndLoop(void)
/*
* If we had sent all accumulated WAL in last round, nap for the
* configured time before retrying.
- *
- * On some platforms, signals won't interrupt the sleep. To ensure we
- * respond reasonably promptly when someone signals us, break down the
- * sleep into NAPTIME_PER_CYCLE increments, and check for interrupts
- * after each nap.
*/
if (caughtup)
{
- remain = WalSndDelay * 1000L;
- while (remain > 0)
- {
- /* Check for interrupts */
- if (got_SIGHUP || shutdown_requested || ready_to_stop)
- break;
+ /*
+ * Even if we wrote all the WAL that was available when we started
+ * sending, more might have arrived while we were sending this
+ * batch. We had the latch set while sending, so we have not
+ * received any signals from that time. Let's arm the latch
+ * again, and after that check that we're still up-to-date.
+ */
+ ResetLatch(&MyWalSnd->latch);
- /* Sleep and check that the connection is still alive */
- pg_usleep(remain > NAPTIME_PER_CYCLE ? NAPTIME_PER_CYCLE : remain);
- CheckClosedConnection();
+ if (!XLogSend(output_message, &caughtup))
+ break;
+ if (caughtup && !got_SIGHUP && !ready_to_stop && !shutdown_requested)
+ {
+ /*
+ * XXX: Should we invent an API to wait for data coming from the
+ * client connection too? It's not critical, but we could then
+ * eliminate the timeout altogether and go to sleep for good.
+ */
- remain -= NAPTIME_PER_CYCLE;
+ /* Sleep */
+ WaitLatch(&MyWalSnd->latch, WalSndDelay * 1000);
}
- }
- /* Attempt to send the log once every loop */
- if (!XLogSend(output_message, &caughtup))
- break;
+ /* Check if the connection was closed */
+ CheckClosedConnection();
+ }
+ else
+ {
+ /* Attempt to send the log once every loop */
+ if (!XLogSend(output_message, &caughtup))
+ break;
+ }
}
/*
@@ -493,10 +508,15 @@ InitWalSnd(void)
}
else
{
- /* found */
- MyWalSnd = (WalSnd *) walsnd;
+ /*
+ * Found a free slot. Take ownership of the latch and initialize
+ * the other fields.
+ */
+ InitLatch((Latch *) &walsnd->latch);
walsnd->pid = MyProcPid;
- MemSet(&MyWalSnd->sentPtr, 0, sizeof(XLogRecPtr));
+ MemSet(&walsnd->sentPtr, 0, sizeof(XLogRecPtr));
+ /* Set MyWalSnd only after it's fully initialized. */
+ MyWalSnd = (WalSnd *) walsnd;
SpinLockRelease(&walsnd->mutex);
break;
}
@@ -523,6 +543,7 @@ WalSndKill(int code, Datum arg)
* for this.
*/
MyWalSnd->pid = 0;
+ ReleaseLatch(&MyWalSnd->latch);
/* WalSnd struct isn't mine anymore */
MyWalSnd = NULL;
@@ -787,6 +808,8 @@ static void
WalSndSigHupHandler(SIGNAL_ARGS)
{
got_SIGHUP = true;
+ if (MyWalSnd)
+ SetLatch(&MyWalSnd->latch);
}
/* SIGTERM: set flag to shut down */
@@ -794,6 +817,8 @@ static void
WalSndShutdownHandler(SIGNAL_ARGS)
{
shutdown_requested = true;
+ if (MyWalSnd)
+ SetLatch(&MyWalSnd->latch);
}
/*
@@ -828,11 +853,20 @@ WalSndQuickDieHandler(SIGNAL_ARGS)
exit(2);
}
+/* SIGUSR1: set flag to send WAL records */
+static void
+WalSndXLogSendHandler(SIGNAL_ARGS)
+{
+ latch_sigusr1_handler();
+}
+
/* SIGUSR2: set flag to do a last cycle and shut down afterwards */
static void
WalSndLastCycleHandler(SIGNAL_ARGS)
{
ready_to_stop = true;
+ if (MyWalSnd)
+ SetLatch(&MyWalSnd->latch);
}
/* Set up signal handlers */
@@ -847,7 +881,7 @@ WalSndSignals(void)
pqsignal(SIGQUIT, WalSndQuickDieHandler); /* hard crash time */
pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
- pqsignal(SIGUSR1, SIG_IGN); /* not used */
+ pqsignal(SIGUSR1, WalSndXLogSendHandler); /* request WAL sending */
pqsignal(SIGUSR2, WalSndLastCycleHandler); /* request a last cycle and
* shutdown */
@@ -895,6 +929,16 @@ WalSndShmemInit(void)
}
}
+/* Wake up all walsenders */
+void
+WalSndWakeup(void)
+{
+ int i;
+
+ for (i = 0; i < max_wal_senders; i++)
+ SetLatch(&WalSndCtl->walsnds[i].latch);
+}
+
/*
* This isn't currently used for anything. Monitoring tools might be
* interested in the future, and we'll need something like this in the
diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index a5da0ec..5cbc515 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -15,7 +15,7 @@ override CFLAGS+= -fno-inline
endif
endif
-OBJS = ipc.o ipci.o pmsignal.o procarray.o procsignal.o shmem.o shmqueue.o \
- sinval.o sinvaladt.o standby.o
+OBJS = ipc.o ipci.o latch.o pmsignal.o procarray.o procsignal.o shmem.o \
+ shmqueue.o sinval.o sinvaladt.o standby.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 492ac9a..0083513 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/latch.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
#include "storage/procarray.h"
@@ -117,6 +118,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, SInvalShmemSize());
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
+ size = add_size(size, LatchShmemSize());
size = add_size(size, BgWriterShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, WalSndShmemSize());
@@ -217,6 +219,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
*/
PMSignalShmemInit();
ProcSignalShmemInit();
+ LatchShmemInit();
BgWriterShmemInit();
AutoVacuumShmemInit();
WalSndShmemInit();
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
new file mode 100644
index 0000000..1becf7a
--- /dev/null
+++ b/src/backend/storage/ipc/latch.c
@@ -0,0 +1,306 @@
+/*-------------------------------------------------------------------------
+ *
+ * latch.c
+ * Routines for interprocess latches
+ *
+ * A latch allows you to wait until another process, or a signal handler
+ * within the same process, wakes you up. There are three basic operations
+ * on a latch:
+ *
+ * SetLatch - Sets the latch
+ * ResetLatch - Clears the latch, allowing it to be set again
+ * WaitLatch - waits for the latch to become set
+ *
+ * These can be used to wait for an event, without the race conditions
+ * involved with e.g. plain Unix signals and select(). pselect() was
+ * invented to solve the same problem, but it is not portable enough.
+ *
+ * The implementation is such that setting a latch that's already set
+ * is quick.
+ *
+ * The pattern to wait on an event is:
+ *
+ * for (;;)
+ * {
+ * WaitLatch();
+ * ResetLatch();
+ *
+ * if (work to do)
+ * Do Stuff();
+ * }
+ *
+ * It's important to reset the latch *before* checking if there's work to
+ * do. Otherwise, if someone sets the latch between the check and the
+ * ResetLatch call, you will miss it and Wait will block.
+ *
+ * Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <signal.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/shmem.h"
+
+/* Read and write end of the self-pipe */
+static int selfpipe_readfd;
+static int selfpipe_writefd;
+
+/* Are we currently in WaitLatch()? The signal handler would like to know. */
+static volatile sig_atomic_t waiting = false;
+
+/* private function prototypes */
+static void drainSelfPipe(void);
+static void sendSelfPipeByte(void);
+
+/*
+ * Initialize a backend-local latch.
+ */
+void
+InitLatch(Latch *latch)
+{
+ Assert(latch->owner_pid == 0);
+ latch->owner_pid = MyProcPid;
+ latch->is_set = false;
+}
+
+/*
+ * Initialize an inter-process latch. Like InitLatch(), but the latch can
+ * be triggered from another process. A process that needs to wait on
+ * an inter-process latch also needs to ensure that latch_sigusr1_handler()
+ * is called from the SIGUSR1 signal handler.
+ */
+void
+InitSharedLatch(Latch *latch)
+{
+ /*
+ * This is the same as InitLatch() in this implementation. The
+ * Windows implementation will likely differ.
+ */
+ InitLatch(latch);
+}
+
+/*
+ * Release a latch previously allocated with InitLatch() or InitSharedLatch().
+ */
+void
+ReleaseLatch(Latch *latch)
+{
+ Assert(latch->owner_pid == MyProcPid);
+ latch->owner_pid = 0;
+}
+
+/*
+ * Wait for given latch to be set, or until 'timeout' microseconds pass.
+ * If 'timeout' is 0, wait forever. If the latch is already set, returns
+ * immediately.
+ *
+ * The latch must have been previously initialized by the current process.
+ */
+void
+WaitLatch(Latch *latch, long timeout)
+{
+ struct timeval tv;
+ fd_set input_mask;
+ int rc;
+
+ if (latch->owner_pid != MyProcPid)
+ elog(ERROR, "cannot wait on a latch not owned by the current process");
+
+ waiting = true;
+ for (;;)
+ {
+ /*
+ * Clear the pipe, and check if the latch is set already. If someone
+ * sets the latch between this and the select() below, the setter
+ * will write a byte to the pipe (or signal us and the signal handler
+ * will do that), and the select() will return immediately.
+ */
+ drainSelfPipe();
+ if (latch->is_set)
+ break;
+
+ /* Sleep */
+ if (timeout > 0)
+ {
+ tv.tv_sec = timeout / 1000000L;
+ tv.tv_usec = timeout % 1000000L;
+ }
+
+ FD_ZERO(&input_mask);
+ FD_SET(selfpipe_readfd, &input_mask);
+
+ rc = select(selfpipe_readfd + 1, &input_mask, NULL, NULL,
+ (timeout > 0) ? &tv : NULL);
+ if (rc < 0)
+ {
+ if (errno != EINTR)
+ ereport(ERROR,
+ (errcode_for_socket_access(),
+ errmsg("select() failed: %m")));
+ }
+ if (rc == 0)
+ break; /* timeout exceeded */
+ }
+ waiting = false;
+}
+
+/*
+ * Sets a latch and wakes up anyone waiting on it. Returns quickly if the
+ * latch is set already.
+ */
+void
+SetLatch(Latch *latch)
+{
+ pid_t owner_pid;
+
+ /* Quick exit if already set */
+ if (latch->is_set)
+ return;
+
+ latch->is_set = true;
+
+ /*
+ * See if anyone's waiting for the latch. It can be the current process
+ * if we're in a signal handler. We use the self-pipe to wake up the
+ * select() in that case. If it's another process, send a signal.
+ */
+ owner_pid = latch->owner_pid;
+ if (owner_pid == 0)
+ return;
+ else if (owner_pid == MyProcPid)
+ sendSelfPipeByte();
+ else
+ kill(owner_pid, SIGUSR1);
+}
+
+/*
+ * Clear the latch. Calling WaitLatch after this will sleep, unless
+ * the latch is set again before the WaitLatch call.
+ */
+void
+ResetLatch(Latch *latch)
+{
+ /* Only the owner should reset the latch */
+ Assert(latch->owner_pid == MyProcPid);
+
+ if (!latch->is_set)
+ return;
+
+ latch->is_set = false;
+}
+
+/*
+ * Unused in this implementation, but Windows will need this.
+ */
+static int
+NumSharedLatches(void)
+{
+ int numLatches = 0;
+
+ /* Each walsender needs one latch */
+ numLatches += max_wal_senders;
+
+ return numLatches;
+}
+
+/*
+ * LatchShmemSize
+ * Compute space needed for latch's shared memory
+ */
+Size
+LatchShmemSize(void)
+{
+ return 0;
+}
+
+/*
+ * LatchShmemInit
+ * Allocate and initialize shared memory needed for latches
+ */
+void
+LatchShmemInit(void)
+{
+ int pipefd[2];
+
+ /*
+ * Set up a pipe that allows a signal handler to wake up the select()
+ * in WaitLatch(). Make the write-end non-blocking, so that SetLatch()
+ * won't block if the event has already been set many times filling
+ * the kernel buffer. Make the read-end non-blocking too, so that we
+ * can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK.
+ */
+ if (pipe(pipefd) < 0)
+ elog(FATAL, "pipe() failed: %m");
+ if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) < 0)
+ elog(FATAL, "fcntl() failed on read-end of self-pipe: %m");
+ if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0)
+ elog(FATAL, "fcntl() failed on write-end of self-pipe: %m");
+
+ selfpipe_readfd = pipefd[0];
+ selfpipe_writefd = pipefd[1];
+}
+
+void
+latch_sigusr1_handler(void)
+{
+ if (waiting)
+ sendSelfPipeByte();
+}
+
+/* Send one byte to the self-pipe, to wake up WaitLatch() */
+static void
+sendSelfPipeByte(void)
+{
+ int rc;
+ char dummy = 0;
+
+retry:
+ rc = write(selfpipe_writefd, &dummy, 1);
+ if (rc < 0)
+ {
+ if (errno != EAGAIN && errno != EWOULDBLOCK)
+ {
+ /*
+ * XXX: Is it safe to elog(ERROR) in a signal handler?
+ */
+ elog(ERROR, "write() on self-pipe failed: %m");
+ }
+ if (errno == EINTR)
+ goto retry;
+ }
+}
+
+/* Read all available data from the self-pipe */
+static void
+drainSelfPipe(void)
+{
+ int rc;
+ char buf;
+
+ for (;;)
+ {
+ rc = read(selfpipe_readfd, &buf, 1);
+ if (rc < 0)
+ {
+ if (errno == EINTR)
+ continue;
+ if (errno == EAGAIN || errno == EWOULDBLOCK)
+ break; /* the pipe is empty */
+
+ elog(ERROR, "read() on self-pipe failed: %m");
+ }
+ else if (rc == 0)
+ elog(ERROR, "unexpected EOF on self-pipe");
+ }
+}
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 2340236..2ad2ad9 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -278,5 +278,7 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+ latch_sigusr1_handler();
+
errno = save_errno;
}
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 874959e..3a93820 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -12,7 +12,12 @@
#ifndef _WALSENDER_H
#define _WALSENDER_H
+#include "postgres.h"
+
+#include <signal.h>
+
#include "access/xlog.h"
+#include "storage/latch.h"
#include "storage/spin.h"
/*
@@ -23,6 +28,16 @@ typedef struct WalSnd
pid_t pid; /* this walsender's process id, or 0 */
XLogRecPtr sentPtr; /* WAL has been sent up to this point */
+ /* XXX
+ * When a walsender process is signaled, to wake it up to send any
+ * pending WAL, the sender of the signal should set busy to true.
+ * A walsender marked as busy should not be signaled again, to avoid
+ * redundant signaling which would slow down the walsender and the
+ * system as a whole. Walsender will clear the flag when it is
+ * finished sending all pending WAL again.
+ */
+ Latch latch;
+
slock_t mutex; /* locks shared variables shown above */
} WalSnd;
@@ -45,5 +60,6 @@ extern int WalSenderMain(void);
extern void WalSndSignals(void);
extern Size WalSndShmemSize(void);
extern void WalSndShmemInit(void);
+extern void WalSndWakeup(void);
#endif /* _WALSENDER_H */
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
new file mode 100644
index 0000000..28989e5
--- /dev/null
+++ b/src/include/storage/latch.h
@@ -0,0 +1,38 @@
+/*-------------------------------------------------------------------------
+ *
+ * latch.h
+ * Routines for interprocess latches
+ *
+ *
+ * Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LATCH_H
+#define LATCH_H
+
+typedef struct
+{
+ volatile sig_atomic_t is_set;
+ volatile sig_atomic_t owner_pid;
+} Latch;
+
+/*
+ * prototypes for functions in latch.c
+ */
+extern void InitLatch(Latch *latch);
+extern void InitSharedLatch(Latch *latch);
+extern void ReleaseLatch(Latch *latch);
+extern void WaitLatch(Latch *latch, long timeout);
+extern void SetLatch(Latch *latch);
+extern void ResetLatch(Latch *latch);
+
+extern Size LatchShmemSize(void);
+extern void LatchShmemInit(void);
+
+extern void latch_sigusr1_handler(void);
+
+#endif /* LATCH_H */
On Thu, Aug 26, 2010 at 7:40 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Here's a first attempt at implementing that. To demonstrate how it works, I
modified walsender to use the new latch facility, also to respond quickly to
SIGHUP and SIGTERM.
Great!
There are two kinds of latches, local and global. Local latches can only be
set from the same process - allowing you to replace pg_usleep() with
something that is always interruptible by signals (by setting the latch in
the signal handler). The global latches work the same, and indeed the
implementation is the same, but the latch resides in shared memory, and can
be set by any process attached to shared memory. On Unix, when you set a
latch waited for by another process, the setter sends SIGUSR1 to the waiting
process, and the signal handler sends the byte to the self-pipe to wake up
the select().
According to this explanation, the latch which walsender uses seems to be
a global one. If so, shouldn't walsender call InitSharedLatch() rather than
InitLatch()?
/*
* XXX: Should we invent an API to wait for data coming from the
* client connection too? It's not critical, but we could then
* eliminate the timeout altogether and go to sleep for good.
*/
Yes, it would be very helpful when walsender waits for the ACK from
the standby in upcoming synchronous replication.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 27/08/10 10:39, Fujii Masao wrote:
On Thu, Aug 26, 2010 at 7:40 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
There are two kinds of latches, local and global. Local latches can only be
set from the same process - allowing you to replace pg_usleep() with
something that is always interruptible by signals (by setting the latch in
the signal handler). The global latches work the same, and indeed the
implementation is the same, but the latch resides in shared memory, and can
be set by any process attached to shared memory. On Unix, when you set a
latch waited for by another process, the setter sends SIGUSR1 to the waiting
process, and the signal handler sends the byte to the self-pipe to wake up
the select().
According to this explanation, the latch which walsender uses seems to be
a global one. If so, shouldn't walsender call InitSharedLatch() rather than
InitLatch()?
Yes, it should.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote:
Well, yes they are. They cause unnecessary process wakeups and thereby
consume cycles even when the database is idle. See for example a
longstanding complaint here:
https://bugzilla.redhat.com/show_bug.cgi?id=252129
If we're going to go to the trouble of having a mechanism like this,
I'd like it to fix that problem so I can close out that bug.
The way the background writer wakes up periodically to absorb fsync
requests is already way too infrequent on a busy system. That component
of the bug report reeks as "not a bug" to me. Don't like the default?
Increase bgwriter_delay; move on with your life. We're not going to
increase the default and screw over regular users just to
prioritize people who prefer low power consumption. There's a knob for
it, you can tune the other way if you want. And based on this pile of
data I'm sorting through the last few weeks, my guess is that if the
default is going anywhere in 9.1 it's to make the BGW run *more* often,
not less. What those asking for the default change don't realize is
that if the BGW stops doing useful work, backends will start doing more
disk writes with their own fsync calls, and now you've just messed with
how often the disks have to be powered up. I could probably
construct a test case that uses more power with the behavior they think
they want than the current one does. The only clear case where this is
always a win is when the system is totally idle.
The complaint that there's no similar way to detune the logger for lower
power use, something you can't really tweak on your own, is a much more
reasonable demand.
I have a patch that adds a new column to pg_stat_bgwriter to count how
often backend fsync calls happen directly, because the background writer
couldn't be bothered to absorb them. If this latching idea goes
somewhere, that should be a reasonable way to measure if the new
implementation is getting in the way of that particular processing
requirement. It's important to realize that the fsync absorb function
of the background writer has become absolutely critical on modern
systems that hit high transaction rates, so any refactoring of its basic
design needs to stay responsive to those incoming backend requests. I
don't see any high-level issues with the latch design approach Heikki
has proposed in regards to that. I'll take a look at some of the other
test cases I have here to see if they can help quantify its impact on
this aspect of BGW behavior.
--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
On Fri, Aug 27, 2010 at 5:54 PM, Greg Smith <greg@2ndquadrant.com> wrote:
Tom Lane wrote:
Well, yes they are. They cause unnecessary process wakeups and thereby
consume cycles even when the database is idle. See for example a
longstanding complaint here:
https://bugzilla.redhat.com/show_bug.cgi?id=252129If we're going to go to the trouble of having a mechanism like this,
I'd like it to fix that problem so I can close out that bug.
The way the background writer wakes up periodically to absorb fsync requests
is already way too infrequent on a busy system.
Maybe instead of a fixed-duration sleep we could wake it up when it
needs to do something.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Aug 27, 2010 at 5:54 PM, Greg Smith <greg@2ndquadrant.com> wrote:
The way the background writer wakes up periodically to absorb fsync requests
is already way too infrequent on a busy system.
Maybe instead of a fixed-duration sleep we could wake it up when it
needs to do something.
*Any* fixed delay is going to be too long for some people and not long
enough for others; and the very same system might fall into both
categories at different times of day. I don't think "make
bgwriter_delay customizable" is an adequate answer. We've put up with
that so far because it wasn't possible to do better given the
infrastructure we had for waiting; but if we're going to try to improve
the infrastructure, we should have the ambition of getting rid of this
type of problem.
regards, tom lane
Greg Smith <greg@2ndquadrant.com> writes:
Tom Lane wrote:
Well, yes they are. They cause unnecessary process wakeups and thereby
consume cycles even when the database is idle. See for example a
longstanding complaint here:
https://bugzilla.redhat.com/show_bug.cgi?id=252129
... The only clear case where this is
always a win is when the system is totally idle.
If you'll climb down off that horse for a moment: yeah, the idle case is
*exactly* what they're complaining about. In particular, the complaint
is that it's unreasonable to have Postgres running on a machine at all
unless it's actively being used, because it forces significant CPU power
drain anyway. That gets in the way of our plan for world domination,
no? If you can't have a PG sitting unobtrusively in the background,
waiting for you to have a need for it, it won't get installed in the
first place. People will pick mysql, or something else with a smaller
footprint, to put on their laptops, and then we lose some more mindshare.
I see this as just another facet of the argument about whether it's okay
to have default settings that try to take over the entire machine.
regards, tom lane
On Sat, Aug 28, 2010 at 2:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
That gets in the way of our plan for world domination,
no? If you can't have a PG sitting unobtrusively in the background,
waiting for you to have a need for it, it won't get installed in the
first place. People will pick mysql, or something else with a smaller
footprint, to put on their laptops, and then we lose some more mindshare.
People are always complaining about hosting companies not supporting
Postgres, but put yourself in the shoes of a hosting company that wants
to have a few hundred hosting clients all on the same box, either as a
shared box or more likely these days as VMs. That's perfectly
reasonable if they're all idle nearly all the time but if they're all
waking up every few milliseconds that's going to consume a substantial
amount of cpu before you even start handling requests.
--
greg
Tom Lane wrote:
Greg Smith <greg@2ndquadrant.com> writes:
... The only clear case where this is
always a win is when the system is totally idle.
If you'll climb down off that horse for a moment: yeah, the idle case is
*exactly* what they're complaining about.
I wasn't on a horse here--I saw the specific case they must be most
annoyed with. If it's possible to turn this around into a "push" model
instead of how it works now, without tanking so that performance (and
maybe even power use!) suffers for the worst or even average case, that
refactoring could end up outperforming the current model. I'd like the
background writer to never go to sleep at all, really, if enough work
comes in to keep it always busy, and I think that might fall out nicely
as a result of aiming for the other end--making it sleep deeper when
it's not needed.
Just pointing out the very specific place where I suspect that will go
wrong if it's not done carefully: the fsync absorption, because it's
already broken enough that we're reworking it here. PostgreSQL already
ships a bad enough default configuration without deciding that yet
another spot should be yielded to somebody else's priorities, unless
that actually meets performance goals. I think I can quantify what
those should be with a test case for this part.
I see this as just another facet of the argument about whether it's okay
to have default settings that try to take over the entire machine.
Mostly agreed here. So long as the result is automatic enough to not
introduce yet another GUC set to the wrong thing for busy systems by
default, this could turn out great. The list of things you must change
to get the database to work right for serious work is already far too
long, and I dread the thought of putting yet another on there just for
the sake of lower power use. Another part of the plan for world
domination is that Oracle DBAs install the database for a test and say
"wow, that wasn't nearly as complicated as I'm used to and it performs
well, that was nice". That those people matter too is all I'm saying.
I could easily file a bug from their perspective saying "Background
writer is a lazy SOB in default config" that would be no less valid than
the one being discussed here.
--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
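The "sleep deeper when it's not needed" idea Greg describes can be illustrated with a tiny backoff policy. This is a hypothetical sketch, not anything in the patch under discussion; the names `next_nap`, `MIN_NAP_USEC`, and `MAX_NAP_USEC` are invented for illustration. The writer naps briefly while work keeps arriving, and doubles the nap up to a cap as the system stays idle:

```c
#include <assert.h>

#define MIN_NAP_USEC   10000L    /* 10 ms while work is arriving */
#define MAX_NAP_USEC 1000000L    /* cap at 1 s when fully idle */

/*
 * Hypothetical policy: return the next nap length, given the previous
 * one and whether the last cycle found any work to do.
 */
static long
next_nap(long prev_nap, int found_work)
{
    if (found_work)
        return MIN_NAP_USEC;            /* busy: snap back to responsive */
    if (prev_nap * 2 > MAX_NAP_USEC)    /* idle: sleep deeper each cycle */
        return MAX_NAP_USEC;
    return prev_nap * 2;
}
```

Any pending-work signal (a latch, in terms of this thread) would cut the nap short, so the deep-sleep cap only costs latency when nothing is waking the process anyway.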
Here's a 2nd version of the "latch" patch. Now with a Windows
implementation. Comments welcome.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
latch-2.patch (text/x-diff)
diff --git a/configure b/configure
index bd9b347..432cd58 100755
--- a/configure
+++ b/configure
@@ -27773,6 +27773,13 @@ _ACEOF
SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
fi
+# Select latch implementation type.
+if test "$PORTNAME" != "win32"; then
+ LATCH_IMPLEMENTATION="src/backend/port/unix_latch.c"
+else
+ LATCH_IMPLEMENTATION="src/backend/port/win32_latch.c"
+fi
+
# If not set in template file, set bytes to use libc memset()
if test x"$MEMSET_LOOP_LIMIT" = x"" ; then
MEMSET_LOOP_LIMIT=1024
@@ -29098,7 +29105,7 @@ fi
ac_config_files="$ac_config_files GNUmakefile src/Makefile.global"
-ac_config_links="$ac_config_links src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION} src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION} src/include/dynloader.h:src/backend/port/dynloader/${template}.h src/include/pg_config_os.h:src/include/port/${template}.h src/Makefile.port:src/makefiles/Makefile.${template}"
+ac_config_links="$ac_config_links src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION} src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION} src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION} src/include/dynloader.h:src/backend/port/dynloader/${template}.h src/include/pg_config_os.h:src/include/port/${template}.h src/Makefile.port:src/makefiles/Makefile.${template}"
if test "$PORTNAME" = "win32"; then
@@ -29722,6 +29729,7 @@ do
"src/backend/port/dynloader.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c" ;;
"src/backend/port/pg_sema.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION}" ;;
"src/backend/port/pg_shmem.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION}" ;;
+ "src/backend/port/pg_latch.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION}" ;;
"src/include/dynloader.h") CONFIG_LINKS="$CONFIG_LINKS src/include/dynloader.h:src/backend/port/dynloader/${template}.h" ;;
"src/include/pg_config_os.h") CONFIG_LINKS="$CONFIG_LINKS src/include/pg_config_os.h:src/include/port/${template}.h" ;;
"src/Makefile.port") CONFIG_LINKS="$CONFIG_LINKS src/Makefile.port:src/makefiles/Makefile.${template}" ;;
diff --git a/configure.in b/configure.in
index 7b09986..7f84cea 100644
--- a/configure.in
+++ b/configure.in
@@ -1700,6 +1700,13 @@ else
SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
fi
+# Select latch implementation type.
+if test "$PORTNAME" != "win32"; then
+ LATCH_IMPLEMENTATION="src/backend/port/unix_latch.c"
+else
+ LATCH_IMPLEMENTATION="src/backend/port/win32_latch.c"
+fi
+
# If not set in template file, set bytes to use libc memset()
if test x"$MEMSET_LOOP_LIMIT" = x"" ; then
MEMSET_LOOP_LIMIT=1024
@@ -1841,6 +1848,7 @@ AC_CONFIG_LINKS([
src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c
src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION}
src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION}
+ src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION}
src/include/dynloader.h:src/backend/port/dynloader/${template}.h
src/include/pg_config_os.h:src/include/port/${template}.h
src/Makefile.port:src/makefiles/Makefile.${template}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 615a7fa..094d0c9 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -55,6 +55,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
@@ -1025,6 +1026,13 @@ EndPrepare(GlobalTransaction gxact)
/* If we crash now, we have prepared: WAL replay will fix things */
+ /*
+ * Wake up all walsenders to send WAL up to the PREPARE record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
/* write correct CRC and close file */
if ((write(fd, &statefile_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
{
@@ -2005,6 +2013,13 @@ RecordTransactionCommitPrepared(TransactionId xid,
/* Flush XLOG to disk */
XLogFlush(recptr);
+ /*
+ * Wake up all walsenders to send WAL up to the COMMIT PREPARED record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
/* Mark the transaction committed in pg_clog */
TransactionIdCommitTree(xid, nchildren, children);
@@ -2078,6 +2093,13 @@ RecordTransactionAbortPrepared(TransactionId xid,
XLogFlush(recptr);
/*
+ * Wake up all walsenders to send WAL up to the ABORT PREPARED record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
+ /*
* Mark the transaction aborted in clog. This is not absolutely necessary
* but we may as well do it while we are here.
*/
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6bcc55c..942d5c2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -36,6 +36,7 @@
#include "libpq/be-fsstubs.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -1068,6 +1069,13 @@ RecordTransactionCommit(void)
XLogFlush(XactLastRecEnd);
/*
+ * Wake up all walsenders to send WAL up to the COMMIT record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
+ /*
* Now we may update the CLOG, if we wrote a COMMIT record above
*/
if (markXidCommitted)
diff --git a/src/backend/port/Makefile b/src/backend/port/Makefile
index db0c2af..f50cff8 100644
--- a/src/backend/port/Makefile
+++ b/src/backend/port/Makefile
@@ -21,7 +21,7 @@ subdir = src/backend/port
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = dynloader.o pg_sema.o pg_shmem.o $(TAS)
+OBJS = dynloader.o pg_sema.o pg_shmem.o pg_latch.o $(TAS)
ifeq ($(PORTNAME), darwin)
SUBDIRS += darwin
diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
new file mode 100644
index 0000000..f28b682
--- /dev/null
+++ b/src/backend/port/unix_latch.c
@@ -0,0 +1,315 @@
+/*-------------------------------------------------------------------------
+ *
+ * unix_latch.c
+ * Routines for interprocess latches
+ *
+ * A latch allows you to wait until another process, or a signal handler
+ * within the same process, wakes you up. There are three basic operations
+ * on a latch:
+ *
+ * SetLatch - Sets the latch
+ * ResetLatch - Clears the latch, allowing it to be set again
+ * WaitLatch - waits for the latch to become set
+ *
+ * These can be used to wait for an event, without the race conditions
+ * involved with plain Unix signals and select(). pselect() was invented
+ * to solve the same problem, but it is not portable enough. Also,
+ * select() is not interrupted by signals on some platforms.
+ *
+ * The implementation is such that setting a latch that's already set
+ * is quick.
+ *
+ * The pattern to wait on an event is:
+ *
+ * for (;;)
+ * {
+ * WaitLatch();
+ * ResetLatch();
+ *
+ * if (work to do)
+ * Do Stuff();
+ * }
+ *
+ * It's important to reset the latch *before* checking if there's work to
+ * do. Otherwise, if someone sets the latch between the check and the
+ * ResetLatch call, you will miss it and Wait will block.
+ *
+ * Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <signal.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/shmem.h"
+
+/* Are we currently in WaitLatch()? The signal handler would like to know. */
+static volatile sig_atomic_t waiting = false;
+
+/* Read and write end of the self-pipe */
+static int selfpipe_readfd = -1;
+static int selfpipe_writefd = -1;
+
+/* private function prototypes */
+static void initSelfPipe(void);
+static void drainSelfPipe(void);
+static void sendSelfPipeByte(void);
+
+/*
+ * Initialize a backend-local latch.
+ */
+void
+InitLatch(Latch *latch)
+{
+ Assert(latch->owner_pid == 0);
+
+ /* Initialize the self pipe if this is our first latch in the process */
+ if (selfpipe_readfd == -1)
+ initSelfPipe();
+
+ latch->owner_pid = MyProcPid;
+ latch->is_set = false;
+}
+
+/*
+ * Initialize an inter-process latch. Like InitLatch(), but the latch can
+ * be set from another process. A process that needs to wait on an
+ * inter-process latch also needs to ensure that latch_sigusr1_handler()
+ * is called from the SIGUSR1 signal handler.
+ *
+ * NB: You must increase the shared latch count in NumSharedLatches() in
+ * win32_latch.c if you introduce a new shared latch!
+ */
+void
+InitSharedLatch(Latch *latch)
+{
+ /*
+ * This is the same as InitLatch() in this implementation. The
+ * Windows implementation will likely differ.
+ */
+ InitLatch(latch);
+}
+
+/*
+ * Release a latch previously allocated with InitLatch() or InitSharedLatch().
+ */
+void
+ReleaseLatch(Latch *latch)
+{
+ Assert(latch->owner_pid == MyProcPid);
+ latch->owner_pid = 0;
+}
+
+/*
+ * Wait for the given latch to be set, or until 'timeout' microseconds passes.
+ * If 'timeout' is 0, wait forever. If the latch is already set, returns
+ * immediately.
+ *
+ * The latch must have been previously initialized by the current process.
+ */
+void
+WaitLatch(Latch *latch, long timeout)
+{
+ struct timeval tv;
+ fd_set input_mask;
+ int rc;
+
+ if (latch->owner_pid != MyProcPid)
+ elog(ERROR, "cannot wait on a latch owned by another process");
+
+ waiting = true;
+ for (;;)
+ {
+ /*
+ * Clear the pipe, and check if the latch is set already. If someone
+ * sets the latch between this and the select() below, the setter
+ * will write a byte to the pipe (or signal us and the signal handler
+ * will do that), and the select() will return immediately.
+ */
+ drainSelfPipe();
+ if (latch->is_set)
+ break;
+
+ /* Sleep */
+ if (timeout > 0)
+ {
+ tv.tv_sec = timeout / 1000000L;
+ tv.tv_usec = timeout % 1000000L;
+ }
+
+ FD_ZERO(&input_mask);
+ FD_SET(selfpipe_readfd, &input_mask);
+
+ rc = select(selfpipe_readfd + 1, &input_mask, NULL, NULL,
+ (timeout > 0) ? &tv : NULL);
+ if (rc < 0)
+ {
+ if (errno != EINTR)
+ ereport(ERROR,
+ (errcode_for_socket_access(),
+ errmsg("select() failed: %m")));
+ }
+ if (rc == 0)
+ break; /* timeout exceeded */
+ }
+ waiting = false;
+}
+
+/*
+ * Sets a latch and wakes up anyone waiting on it. Returns quickly if the
+ * latch is already set.
+ */
+void
+SetLatch(Latch *latch)
+{
+ pid_t owner_pid;
+
+ /* Quick exit if already set */
+ if (latch->is_set)
+ return;
+
+ latch->is_set = true;
+
+ /*
+ * See if anyone's waiting for the latch. It can be the current process
+ * if we're in a signal handler. We use the self-pipe to wake up the
+ * select() in that case. If it's another process, send a signal.
+ */
+ owner_pid = latch->owner_pid;
+ if (owner_pid == 0)
+ return;
+ else if (owner_pid == MyProcPid)
+ sendSelfPipeByte();
+ else
+ kill(owner_pid, SIGUSR1);
+}
+
+/*
+ * Clear the latch. Calling WaitLatch after this will sleep, unless
+ * the latch is set again before the WaitLatch call.
+ */
+void
+ResetLatch(Latch *latch)
+{
+ /* Only the owner should reset the latch */
+ Assert(latch->owner_pid == MyProcPid);
+
+ if (!latch->is_set)
+ return;
+
+ latch->is_set = false;
+}
+
+/*
+ * LatchShmemSize
+ * Compute space needed for latch's shared memory
+ *
+ * Not needed for Unix implementation.
+ */
+Size
+LatchShmemSize(void)
+{
+ return 0;
+}
+
+/*
+ * LatchShmemInit
+ * Allocate and initialize shared memory needed for latches
+ *
+ * Not needed for Unix implementation.
+ */
+void
+LatchShmemInit(void)
+{
+}
+
+/*
+ * SetLatch uses SIGUSR1 to wake up the process waiting on the latch.
+ */
+void
+latch_sigusr1_handler(void)
+{
+ if (waiting)
+ sendSelfPipeByte();
+}
+
+/* initialize the self-pipe */
+static void
+initSelfPipe(void)
+{
+ int pipefd[2];
+
+ /*
+ * Set up a pipe that allows a signal handler to wake up the select()
+ * in WaitLatch(). Make the write-end non-blocking, so that SetLatch()
+ * won't block if the event has already been set many times filling
+ * the kernel buffer. Make the read-end non-blocking too, so that we
+ * can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK.
+ */
+ if (pipe(pipefd) < 0)
+ elog(FATAL, "pipe() failed: %m");
+ if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) < 0)
+ elog(FATAL, "fcntl() failed on read-end of self-pipe: %m");
+ if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0)
+ elog(FATAL, "fcntl() failed on write-end of self-pipe: %m");
+
+ selfpipe_readfd = pipefd[0];
+ selfpipe_writefd = pipefd[1];
+}
+
+/* Send one byte to the self-pipe, to wake up WaitLatch() */
+static void
+sendSelfPipeByte(void)
+{
+ int rc;
+ char dummy = 0;
+
+retry:
+ rc = write(selfpipe_writefd, &dummy, 1);
+ if (rc < 0)
+ {
+ if (errno != EAGAIN && errno != EWOULDBLOCK)
+ {
+ /*
+ * XXX: Is it safe to elog(ERROR) in a signal handler?
+ */
+ elog(ERROR, "write() on self-pipe failed: %m");
+ }
+ if (errno == EINTR)
+ goto retry;
+ }
+}
+
+/* Read all available data from the self-pipe */
+static void
+drainSelfPipe(void)
+{
+ int rc;
+ char buf;
+
+ for (;;)
+ {
+ rc = read(selfpipe_readfd, &buf, 1);
+ if (rc < 0)
+ {
+ if (errno == EINTR)
+ continue;
+ if (errno == EAGAIN || errno == EWOULDBLOCK)
+ break; /* the pipe is empty */
+
+ elog(ERROR, "read() on self-pipe failed: %m");
+ }
+ else if (rc == 0)
+ elog(ERROR, "unexpected EOF on self-pipe");
+ }
+}
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
new file mode 100644
index 0000000..d4408f8
--- /dev/null
+++ b/src/backend/port/win32_latch.c
@@ -0,0 +1,263 @@
+/*-------------------------------------------------------------------------
+ *
+ * win32_latch.c
+ * Routines for interprocess latches
+ *
+ * Windows implementation of latches, using Windows events. See
+ * unix_latch.c for information on usage.
+ *
+ * Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <signal.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+
+typedef struct
+{
+ volatile slock_t mutex;
+ int nfreehandles;
+ int maxhandles;
+ HANDLE handles[1]; /* variable length */
+} SharedEventHandles;
+
+SharedEventHandles *sharedHandles;
+
+/* Are we currently in WaitLatch()? The signal handler would like to know. */
+static volatile HANDLE waitingEvent = false;
+
+/*
+ * Initialize a backend-local latch.
+ */
+void
+InitLatch(Latch *latch)
+{
+ Assert(latch->event == NULL);
+ latch->isshared = false;
+ latch->event = CreateEvent(NULL, TRUE, FALSE, NULL);
+ latch->is_set = false;
+}
+
+/*
+ * Initialize an inter-process latch. Like InitLatch(), but the latch can
+ * be triggered from another process. A process that needs to wait on
+ * an inter-process latch also needs to ensure that latch_sigusr1_handler()
+ * is called from the SIGUSR1 signal handler.
+ */
+void
+InitSharedLatch(Latch *latch)
+{
+ Assert(latch->event == NULL);
+ SpinLockAcquire(&sharedHandles->mutex);
+ if (sharedHandles->nfreehandles <= 0)
+ {
+ SpinLockRelease(&sharedHandles->mutex);
+ elog(ERROR, "out of shared event objects");
+ }
+ sharedHandles->nfreehandles--;
+ latch->event = sharedHandles->handles[sharedHandles->nfreehandles];
+ SpinLockRelease(&sharedHandles->mutex);
+
+ latch->isshared = true;
+ latch->is_set = false;
+ ResetEvent(latch->event);
+}
+
+/*
+ * Release a latch previously allocated with InitLatch() or InitSharedLatch().
+ */
+void
+ReleaseLatch(Latch *latch)
+{
+ Assert(latch->event);
+ if (latch->isshared)
+ {
+ /* Put the event handle back to the pool */
+ SpinLockAcquire(&sharedHandles->mutex);
+ if (sharedHandles->nfreehandles >= sharedHandles->maxhandles)
+ {
+ SpinLockRelease(&sharedHandles->mutex);
+ elog(PANIC, "too many free event handles");
+ }
+ sharedHandles->handles[sharedHandles->nfreehandles] = latch->event;
+ sharedHandles->nfreehandles++;
+ SpinLockRelease(&sharedHandles->mutex);
+ }
+ else
+ {
+ /* XXX: is this safe from compiler rearrangement? */
+ HANDLE handle = latch->event;
+ latch->event = NULL;
+ CloseHandle(handle);
+ }
+}
+
+/*
+ * Wait for the given latch to be set, or until 'timeout' microseconds passes.
+ * If 'timeout' is 0, wait forever. If the latch is already set, returns
+ * immediately.
+ *
+ * The latch must have been previously initialized by the current process.
+ */
+void
+WaitLatch(Latch *latch, long timeout)
+{
+ DWORD rc;
+
+ waitingEvent = latch->event;
+ for (;;)
+ {
+ /*
+ * Reset the event, and check if the latch is set already. If someone
+ * sets the latch between this and the WaitForSingleObject() call
+ * below, the setter will set the event and WaitForSingleObject() will
+ * return immediately.
+ */
+ if (!ResetEvent(latch->event))
+ {
+ waitingEvent = NULL;
+ elog(ERROR, "ResetEvent failed: error code %d", (int) GetLastError());
+ }
+ if (latch->is_set)
+ break;
+
+ rc = WaitForSingleObject(latch->event, timeout / 1000);
+ if (rc == WAIT_FAILED)
+ {
+ ereport(ERROR,
+ (errcode_for_socket_access(),
+ errmsg("WaitForSingleObject() failed: error code %d", (int) GetLastError())));
+ }
+ if (rc == WAIT_TIMEOUT)
+ break; /* timeout exceeded */
+ }
+ waitingEvent = NULL;
+}
+
+/*
+ * Sets a latch and wakes up anyone waiting on it. Returns quickly if the
+ * latch is set already.
+ */
+void
+SetLatch(Latch *latch)
+{
+ HANDLE handle;
+
+ /* Quick exit if already set */
+ if (latch->is_set)
+ return;
+
+ latch->is_set = true;
+
+ /*
+ * See if anyone's waiting for the latch. It can be the current process
+ * if we're in a signal handler. Use a local variable here in case the
+ * latch is just released between the test and the SetEvent call.
+ */
+ handle = latch->event;
+ if (handle)
+ {
+ if (!SetEvent(handle))
+ elog(LOG, "SetEvent failed: error code %d", (int) GetLastError());
+ }
+}
+
+/*
+ * Clear the latch. Calling WaitLatch after this will sleep, unless
+ * the latch is set again before the WaitLatch call.
+ */
+void
+ResetLatch(Latch *latch)
+{
+ if (!latch->is_set)
+ return;
+
+ latch->is_set = false;
+}
+
+/*
+ * Number of shared latches, used to allocate the right number of shared
+ * Event handles at postmaster startup. You must update this if you
+ * introduce a new shared latch!
+ */
+static int
+NumSharedLatches(void)
+{
+ int numLatches = 0;
+
+ /* Each walsender needs one latch */
+ numLatches += max_wal_senders;
+
+ return numLatches;
+}
+
+/*
+ * LatchShmemSize
+ * Compute space needed for latch's shared memory
+ */
+Size
+LatchShmemSize(void)
+{
+ return offsetof(SharedEventHandles, handles) +
+ NumSharedLatches() * sizeof(HANDLE);
+}
+
+/*
+ * LatchShmemInit
+ * Allocate and initialize shared memory needed for latches
+ */
+void
+LatchShmemInit(void)
+{
+ Size size = LatchShmemSize();
+ bool found;
+
+ sharedHandles = ShmemInitStruct("SharedEventHandles", size, &found);
+
+ /* If we're first, initialize the struct and allocate handles */
+ if (!found)
+ {
+ int i;
+ SECURITY_ATTRIBUTES sa;
+
+ /*
+ * Set up security attributes to specify that the events are
+ * inherited.
+ */
+ ZeroMemory(&sa, sizeof(sa));
+ sa.nLength = sizeof(sa);
+ sa.bInheritHandle = TRUE;
+
+ SpinLockInit(&sharedHandles->mutex);
+ sharedHandles->maxhandles = NumSharedLatches();
+ sharedHandles->nfreehandles = sharedHandles->maxhandles;
+ for (i = 0; i < sharedHandles->maxhandles; i++)
+ {
+ sharedHandles->handles[i] = CreateEvent(&sa, TRUE, FALSE, NULL);
+ if (sharedHandles->handles[i] == NULL)
+ elog(ERROR, "CreateEvent failed: error code %d", (int) GetLastError());
+ }
+ }
+}
+
+/* I'm not sure if this is needed on Windows... */
+void
+latch_sigusr1_handler(void)
+{
+ HANDLE handle = waitingEvent;
+ if (handle)
+ SetEvent(handle);
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 53c2581..3ef3c4f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -34,6 +34,7 @@
*/
#include "postgres.h"
+#include <signal.h>
#include <unistd.h>
#include "access/xlog_internal.h"
@@ -66,9 +67,6 @@ bool am_walsender = false; /* Am I a walsender process ? */
int max_wal_senders = 0; /* the maximum number of concurrent walsenders */
int WalSndDelay = 200; /* max sleep time between some actions */
-#define NAPTIME_PER_CYCLE 100000L /* max sleep time between cycles
- * (100ms) */
-
/*
* These variables are used similarly to openLogFile/Id/Seg/Off,
* but for walsender to read the XLOG.
@@ -93,6 +91,7 @@ static volatile sig_atomic_t ready_to_stop = false;
static void WalSndSigHupHandler(SIGNAL_ARGS);
static void WalSndShutdownHandler(SIGNAL_ARGS);
static void WalSndQuickDieHandler(SIGNAL_ARGS);
+static void WalSndXLogSendHandler(SIGNAL_ARGS);
static void WalSndLastCycleHandler(SIGNAL_ARGS);
/* Prototypes for private functions */
@@ -144,6 +143,16 @@ WalSenderMain(void)
/* Handle handshake messages before streaming */
WalSndHandshake();
+ /* Initialize shared memory status */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ walsnd->sentPtr = sentPtr;
+ SpinLockRelease(&walsnd->mutex);
+ }
+
/* Main loop of walsender */
return WalSndLoop();
}
@@ -380,8 +389,6 @@ WalSndLoop(void)
/* Loop forever, unless we get an error */
for (;;)
{
- long remain; /* remaining time (us) */
-
/*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
@@ -421,32 +428,41 @@ WalSndLoop(void)
/*
* If we had sent all accumulated WAL in last round, nap for the
* configured time before retrying.
- *
- * On some platforms, signals won't interrupt the sleep. To ensure we
- * respond reasonably promptly when someone signals us, break down the
- * sleep into NAPTIME_PER_CYCLE increments, and check for interrupts
- * after each nap.
*/
if (caughtup)
{
- remain = WalSndDelay * 1000L;
- while (remain > 0)
- {
- /* Check for interrupts */
- if (got_SIGHUP || shutdown_requested || ready_to_stop)
- break;
+ /*
+ * Even if we wrote all the WAL that was available when we started
+ * sending, more might have arrived while we were sending this
+ * batch. We had the latch set while sending, so we have not
+ * received any signals from that time. Let's arm the latch
+ * again, and after that check that we're still up-to-date.
+ */
+ ResetLatch(&MyWalSnd->latch);
- /* Sleep and check that the connection is still alive */
- pg_usleep(remain > NAPTIME_PER_CYCLE ? NAPTIME_PER_CYCLE : remain);
- CheckClosedConnection();
+ if (!XLogSend(output_message, &caughtup))
+ break;
+ if (caughtup && !got_SIGHUP && !ready_to_stop && !shutdown_requested)
+ {
+ /*
+ * XXX: Should we invent an API to wait for data coming from the
+ * client connection too? It's not critical, but we could then
+ * eliminate the timeout altogether and go to sleep for good.
+ */
- remain -= NAPTIME_PER_CYCLE;
+ /* Sleep */
+ WaitLatch(&MyWalSnd->latch, WalSndDelay * 1000);
}
- }
- /* Attempt to send the log once every loop */
- if (!XLogSend(output_message, &caughtup))
- break;
+ /* Check if the connection was closed */
+ CheckClosedConnection();
+ }
+ else
+ {
+ /* Attempt to send the log once every loop */
+ if (!XLogSend(output_message, &caughtup))
+ break;
+ }
}
/*
@@ -493,10 +509,15 @@ InitWalSnd(void)
}
else
{
- /* found */
- MyWalSnd = (WalSnd *) walsnd;
+ /*
+ * Found a free slot. Take ownership of the latch and initialize
+ * the other fields.
+ */
+ InitSharedLatch((Latch *) &walsnd->latch);
walsnd->pid = MyProcPid;
- MemSet(&MyWalSnd->sentPtr, 0, sizeof(XLogRecPtr));
+ MemSet(&walsnd->sentPtr, 0, sizeof(XLogRecPtr));
+ /* Set MyWalSnd only after it's fully initialized. */
+ MyWalSnd = (WalSnd *) walsnd;
SpinLockRelease(&walsnd->mutex);
break;
}
@@ -523,6 +544,7 @@ WalSndKill(int code, Datum arg)
* for this.
*/
MyWalSnd->pid = 0;
+ ReleaseLatch(&MyWalSnd->latch);
/* WalSnd struct isn't mine anymore */
MyWalSnd = NULL;
@@ -787,6 +809,8 @@ static void
WalSndSigHupHandler(SIGNAL_ARGS)
{
got_SIGHUP = true;
+ if (MyWalSnd)
+ SetLatch(&MyWalSnd->latch);
}
/* SIGTERM: set flag to shut down */
@@ -794,6 +818,8 @@ static void
WalSndShutdownHandler(SIGNAL_ARGS)
{
shutdown_requested = true;
+ if (MyWalSnd)
+ SetLatch(&MyWalSnd->latch);
}
/*
@@ -828,11 +854,20 @@ WalSndQuickDieHandler(SIGNAL_ARGS)
exit(2);
}
+/* SIGUSR1: set flag to send WAL records */
+static void
+WalSndXLogSendHandler(SIGNAL_ARGS)
+{
+ latch_sigusr1_handler();
+}
+
/* SIGUSR2: set flag to do a last cycle and shut down afterwards */
static void
WalSndLastCycleHandler(SIGNAL_ARGS)
{
ready_to_stop = true;
+ if (MyWalSnd)
+ SetLatch(&MyWalSnd->latch);
}
/* Set up signal handlers */
@@ -847,7 +882,7 @@ WalSndSignals(void)
pqsignal(SIGQUIT, WalSndQuickDieHandler); /* hard crash time */
pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
- pqsignal(SIGUSR1, SIG_IGN); /* not used */
+ pqsignal(SIGUSR1, WalSndXLogSendHandler); /* request WAL sending */
pqsignal(SIGUSR2, WalSndLastCycleHandler); /* request a last cycle and
* shutdown */
@@ -895,6 +930,16 @@ WalSndShmemInit(void)
}
}
+/* Wake up all walsenders */
+void
+WalSndWakeup(void)
+{
+ int i;
+
+ for (i = 0; i < max_wal_senders; i++)
+ SetLatch(&WalSndCtl->walsnds[i].latch);
+}
+
/*
* This isn't currently used for anything. Monitoring tools might be
* interested in the future, and we'll need something like this in the
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 492ac9a..0083513 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/latch.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
#include "storage/procarray.h"
@@ -117,6 +118,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, SInvalShmemSize());
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
+ size = add_size(size, LatchShmemSize());
size = add_size(size, BgWriterShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, WalSndShmemSize());
@@ -217,6 +219,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
*/
PMSignalShmemInit();
ProcSignalShmemInit();
+ LatchShmemInit();
BgWriterShmemInit();
AutoVacuumShmemInit();
WalSndShmemInit();
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index e892e2d..3dd25f1 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -21,6 +21,7 @@
#include "commands/async.h"
#include "miscadmin.h"
#include "storage/ipc.h"
+#include "storage/latch.h"
#include "storage/procsignal.h"
#include "storage/shmem.h"
#include "storage/sinval.h"
@@ -278,5 +279,7 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+ latch_sigusr1_handler();
+
errno = save_errno;
}
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 874959e..73c5904 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -13,6 +13,7 @@
#define _WALSENDER_H
#include "access/xlog.h"
+#include "storage/latch.h"
#include "storage/spin.h"
/*
@@ -24,6 +25,12 @@ typedef struct WalSnd
XLogRecPtr sentPtr; /* WAL has been sent up to this point */
slock_t mutex; /* locks shared variables shown above */
+
+ /*
+ * Latch used by backends to wake up this walsender when it has work
+ * to do.
+ */
+ Latch latch;
} WalSnd;
/* There is one WalSndCtl struct for the whole database cluster */
@@ -45,5 +52,6 @@ extern int WalSenderMain(void);
extern void WalSndSignals(void);
extern Size WalSndShmemSize(void);
extern void WalSndShmemInit(void);
+extern void WalSndWakeup(void);
#endif /* _WALSENDER_H */
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
new file mode 100644
index 0000000..e81e77f
--- /dev/null
+++ b/src/include/storage/latch.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * latch.h
+ * Routines for interprocess latches
+ *
+ *
+ * Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LATCH_H
+#define LATCH_H
+
+#include <signal.h>
+
+#ifndef WIN32
+typedef struct
+{
+ volatile sig_atomic_t is_set;
+ volatile sig_atomic_t owner_pid;
+} Latch;
+#else
+typedef struct
+{
+ volatile sig_atomic_t is_set;
+ bool isshared;
+ HANDLE event;
+} Latch;
+#endif
+
+/*
+ * prototypes for functions in latch.c
+ */
+extern void InitLatch(Latch *latch);
+extern void InitSharedLatch(Latch *latch);
+extern void ReleaseLatch(Latch *latch);
+extern void WaitLatch(Latch *latch, long timeout);
+extern void SetLatch(Latch *latch);
+extern void ResetLatch(Latch *latch);
+
+extern Size LatchShmemSize(void);
+extern void LatchShmemInit(void);
+
+extern void latch_sigusr1_handler(void);
+
+#endif /* LATCH_H */
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index cf7c3ee..bf0e904 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -64,6 +64,7 @@ sub mkvcbuild
$postgres->ReplaceFile('src\backend\port\dynloader.c','src\backend\port\dynloader\win32.c');
$postgres->ReplaceFile('src\backend\port\pg_sema.c','src\backend\port\win32_sema.c');
$postgres->ReplaceFile('src\backend\port\pg_shmem.c','src\backend\port\win32_shmem.c');
+ $postgres->ReplaceFile('src\backend\port\pg_latch.c','src\backend\port\win32_latch.c');
$postgres->AddFiles('src\port',@pgportfiles);
$postgres->AddDir('src\timezone');
$postgres->AddFiles('src\backend\parser','scan.l','gram.y');
On Tue, Aug 31, 2010 at 4:06 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Here's a 2nd version of the "latch" patch. Now with a Windows
implementation. Comments welcome.
Seems good.
Two minor comments:
rc = WaitForSingleObject(latch->event, timeout / 1000);
if (rc == WAIT_FAILED)
{
ereport(ERROR,
(errcode_for_socket_access(),
errmsg("WaitForSingleObject() failed: error code %d", (int) GetLastError())));
}
if (rc == WAIT_TIMEOUT)
break; /* timeout exceeded */
Shouldn't we also check "rc == WAIT_OBJECT_0"?
static volatile HANDLE waitingEvent = false;
s/false/NULL?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Aug 27, 2010 at 4:39 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
/*
* XXX: Should we invent an API to wait for data coming from the
* client connection too? It's not critical, but we could then
* eliminate the timeout altogether and go to sleep for good.
*/
Yes, it would be very helpful when walsender waits for the ACK from
the standby in upcoming synchronous replication.
I propose to change WaitLatch() so that it accepts the socket
descriptor as an argument, to wait for data coming from the
client connection. WaitLatch() monitors not only the self-pipe
but also the given socket. If -1 is supplied, it checks only
the self-pipe.
The socket should be monitored by using poll() if the platform
has it, since poll() is usually more efficient.
So I'd like to change Unix implementation of WaitLatch() as
follows. Thought?
-------------------
define WaitLatch(latch, timeout) WaitLatchAndSocket(latch, -1, timeout)
void WaitLatchAndSocket(Latch *latch, int sock, long timeout);
{
...
FD_SET(selfpipe_readfd, &input_mask);
if (sock != -1)
FD_SET(sock, &input_mask);
#ifdef HAVE_POLL
poll(...)
#else
select(...)
#endif /* HAVE_POLL */
...
}
-------------------
Windows implementation of WaitLatchAndSocket() seems not to be
so simple. We would need to wait for both the latch event and
the packet from the socket by using WaitForMultipleObjectsEx().
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Tom Lane wrote:
Greg Smith <greg@2ndquadrant.com> writes:
Tom Lane wrote:
Well, yes they are. They cause unnecessary process wakeups and thereby
consume cycles even when the database is idle. See for example a
longstanding complaint here:
https://bugzilla.redhat.com/show_bug.cgi?id=252129
... The only clear case where this is
always a win is when the system is totally idle.
If you'll climb down off that horse for a moment: yeah, the idle case is
*exactly* what they're complaining about. In particular, the complaint
is that it's unreasonable to have Postgres running on a machine at all
unless it's actively being used, because it forces significant CPU power
drain anyway. That gets in the way of our plan for world domination,
no? If you can't have a PG sitting unobtrusively in the background,
waiting for you to have a need for it, it won't get installed in the
first place. People will pick mysql, or something else with a smaller
footprint, to put on their laptops, and then we lose some more mindshare.
I see this as just another facet of the argument about whether it's okay
to have default settings that try to take over the entire machine.
FYI, last week I was running PG 8.4.X with default settings on a T43
laptop running XP with 1 Gig of RAM and it caused a racing game I was
playing to run jerkily. When I shut down the Postgres server, the
problem was fixed. I was kind of surprised that an idle PG server could
cause that.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On 31/08/10 15:47, Fujii Masao wrote:
On Fri, Aug 27, 2010 at 4:39 PM, Fujii Masao<masao.fujii@gmail.com> wrote:
/*
* XXX: Should we invent an API to wait for data coming from the
* client connection too? It's not critical, but we could then
* eliminate the timeout altogether and go to sleep for good.
*/
Yes, it would be very helpful when walsender waits for the ACK from
the standby in upcoming synchronous replication.
I propose to change WaitLatch() so that it accepts the socket
descriptor as an argument, to wait for data coming from the
client connection.
Yeah, we probably should do that now.
WaitLatch() monitors not only the self-pipe
but also the given socket. If -1 is supplied, it checks only
the self-pipe.
The obvious next question is how to wait for multiple sockets and a
latch at the same time? Perhaps we should have a select()-like interface
where you can pass multiple file descriptors. Then again, looking at the
current callers of select() in the backend, apart from postmaster they
all wait for only one fd.
The socket should be monitored by using poll() if the platform
has it, since poll() is usually more efficient.
Yeah, I guess.
So I'd like to change Unix implementation of WaitLatch() as
follows. Thought?
-------------------
define WaitLatch(latch, timeout) WaitLatchAndSocket(latch, -1, timeout)
void WaitLatchAndSocket(Latch *latch, int sock, long timeout);
{
...
FD_SET(selfpipe_readfd, &input_mask);
if (sock != -1)
FD_SET(sock, &input_mask);
#ifdef HAVE_POLL
poll(...)
#else
select(...)
#endif /* HAVE_POLL */
...
}
-------------------
Yep.
Windows implementation of WaitLatchAndSocket() seems not to be
so simple. We would need to wait for both the latch event and
the packet from the socket by using WaitForMultipleObjectsEx().
Well, we already use WaitForMultipleObjectsEx() to implement select() on
Windows, so it should be straightforward to copy that. I'll look into that.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, Sep 1, 2010 at 4:11 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
The obvious next question is how to wait for multiple sockets and a latch at
the same time? Perhaps we should have a select()-like interface where you
can pass multiple file descriptors. Then again, looking at the current
callers of select() in the backend, apart from postmaster they all wait for
only one fd.
Currently backends have not waited for multiple sockets, so I don't think that
interface is required for now. Similarly, we don't need to wait for the socket
to be ready to *write* because there is no use case for now.
Windows implementation of WaitLatchAndSocket() seems not to be
so simple. We would need to wait for both the latch event and
the packet from the socket by using WaitForMultipleObjectsEx().
Well, we already use WaitForMultipleObjectsEx() to implement select() on
Windows, so it should be straightforward to copy that. I'll look into that.
Agreed.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 02/09/10 06:46, Fujii Masao wrote:
On Wed, Sep 1, 2010 at 4:11 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
The obvious next question is how to wait for multiple sockets and a latch at
the same time? Perhaps we should have a select()-like interface where you
can pass multiple file descriptors. Then again, looking at the current
callers of select() in the backend, apart from postmaster they all wait for
only one fd.
Currently backends have not waited for multiple sockets, so I don't think that
interface is required for now. Similarly, we don't need to wait for the socket
to be ready to *write* because there is no use case for now.
Ok, here's an updated patch with WaitLatchOrSocket that lets you do that.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
latch-3.patch (text/x-diff)
diff --git a/configure b/configure
index bd9b347..432cd58 100755
--- a/configure
+++ b/configure
@@ -27773,6 +27773,13 @@ _ACEOF
SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
fi
+# Select latch implementation type.
+if test "$PORTNAME" != "win32"; then
+ LATCH_IMPLEMENTATION="src/backend/port/unix_latch.c"
+else
+ LATCH_IMPLEMENTATION="src/backend/port/win32_latch.c"
+fi
+
# If not set in template file, set bytes to use libc memset()
if test x"$MEMSET_LOOP_LIMIT" = x"" ; then
MEMSET_LOOP_LIMIT=1024
@@ -29098,7 +29105,7 @@ fi
ac_config_files="$ac_config_files GNUmakefile src/Makefile.global"
-ac_config_links="$ac_config_links src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION} src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION} src/include/dynloader.h:src/backend/port/dynloader/${template}.h src/include/pg_config_os.h:src/include/port/${template}.h src/Makefile.port:src/makefiles/Makefile.${template}"
+ac_config_links="$ac_config_links src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION} src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION} src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION} src/include/dynloader.h:src/backend/port/dynloader/${template}.h src/include/pg_config_os.h:src/include/port/${template}.h src/Makefile.port:src/makefiles/Makefile.${template}"
if test "$PORTNAME" = "win32"; then
@@ -29722,6 +29729,7 @@ do
"src/backend/port/dynloader.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c" ;;
"src/backend/port/pg_sema.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION}" ;;
"src/backend/port/pg_shmem.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION}" ;;
+ "src/backend/port/pg_latch.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION}" ;;
"src/include/dynloader.h") CONFIG_LINKS="$CONFIG_LINKS src/include/dynloader.h:src/backend/port/dynloader/${template}.h" ;;
"src/include/pg_config_os.h") CONFIG_LINKS="$CONFIG_LINKS src/include/pg_config_os.h:src/include/port/${template}.h" ;;
"src/Makefile.port") CONFIG_LINKS="$CONFIG_LINKS src/Makefile.port:src/makefiles/Makefile.${template}" ;;
diff --git a/configure.in b/configure.in
index 7b09986..7f84cea 100644
--- a/configure.in
+++ b/configure.in
@@ -1700,6 +1700,13 @@ else
SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
fi
+# Select latch implementation type.
+if test "$PORTNAME" != "win32"; then
+ LATCH_IMPLEMENTATION="src/backend/port/unix_latch.c"
+else
+ LATCH_IMPLEMENTATION="src/backend/port/win32_latch.c"
+fi
+
# If not set in template file, set bytes to use libc memset()
if test x"$MEMSET_LOOP_LIMIT" = x"" ; then
MEMSET_LOOP_LIMIT=1024
@@ -1841,6 +1848,7 @@ AC_CONFIG_LINKS([
src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c
src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION}
src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION}
+ src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION}
src/include/dynloader.h:src/backend/port/dynloader/${template}.h
src/include/pg_config_os.h:src/include/port/${template}.h
src/Makefile.port:src/makefiles/Makefile.${template}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 615a7fa..094d0c9 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -55,6 +55,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
@@ -1025,6 +1026,13 @@ EndPrepare(GlobalTransaction gxact)
/* If we crash now, we have prepared: WAL replay will fix things */
+ /*
+ * Wake up all walsenders to send WAL up to the PREPARE record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
/* write correct CRC and close file */
if ((write(fd, &statefile_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
{
@@ -2005,6 +2013,13 @@ RecordTransactionCommitPrepared(TransactionId xid,
/* Flush XLOG to disk */
XLogFlush(recptr);
+ /*
+ * Wake up all walsenders to send WAL up to the COMMIT PREPARED record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
/* Mark the transaction committed in pg_clog */
TransactionIdCommitTree(xid, nchildren, children);
@@ -2078,6 +2093,13 @@ RecordTransactionAbortPrepared(TransactionId xid,
XLogFlush(recptr);
/*
+ * Wake up all walsenders to send WAL up to the ABORT PREPARED record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
+ /*
* Mark the transaction aborted in clog. This is not absolutely necessary
* but we may as well do it while we are here.
*/
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6bcc55c..942d5c2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -36,6 +36,7 @@
#include "libpq/be-fsstubs.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -1068,6 +1069,13 @@ RecordTransactionCommit(void)
XLogFlush(XactLastRecEnd);
/*
+ * Wake up all walsenders to send WAL up to the COMMIT record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
+ /*
* Now we may update the CLOG, if we wrote a COMMIT record above
*/
if (markXidCommitted)
diff --git a/src/backend/port/Makefile b/src/backend/port/Makefile
index db0c2af..f50cff8 100644
--- a/src/backend/port/Makefile
+++ b/src/backend/port/Makefile
@@ -21,7 +21,7 @@ subdir = src/backend/port
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = dynloader.o pg_sema.o pg_shmem.o $(TAS)
+OBJS = dynloader.o pg_sema.o pg_shmem.o pg_latch.o $(TAS)
ifeq ($(PORTNAME), darwin)
SUBDIRS += darwin
diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
new file mode 100644
index 0000000..1632529
--- /dev/null
+++ b/src/backend/port/unix_latch.c
@@ -0,0 +1,336 @@
+/*-------------------------------------------------------------------------
+ *
+ * unix_latch.c
+ * Routines for interprocess latches
+ *
+ * A latch allows you to wait until another process, or a signal handler
+ * within the same process, wakes you up. There are three basic operations
+ * on a latch:
+ *
+ * SetLatch - Sets the latch
+ * ResetLatch - Clears the latch, allowing it to be set again
+ * WaitLatch - waits for the latch to become set
+ *
+ * These can be used to wait for an event, without the race conditions
+ * involved with plain Unix signals and select(). pselect() was invented
+ * to solve the same problem, but it is not portable enough. Also,
+ * select() is not interrupted by signals on some platforms.
+ *
+ * The implementation is such that setting a latch that's already set
+ * is quick.
+ *
+ * The correct pattern to wait for an event is:
+ *
+ * for (;;)
+ * {
+ * ResetLatch();
+ * if (work to do)
+ * Do Stuff();
+ *
+ * WaitLatch();
+ * }
+ *
+ * It's important to reset the latch *before* checking if there's work to
+ * do. Otherwise, if someone sets the latch between the check and the
+ * ResetLatch call, you will miss it and WaitLatch will block.
+ *
+ * Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <signal.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/shmem.h"
+
+/* Are we currently in WaitLatch()? The signal handler would like to know. */
+static volatile sig_atomic_t waiting = false;
+
+/* Read and write end of the self-pipe */
+static int selfpipe_readfd = -1;
+static int selfpipe_writefd = -1;
+
+/* private function prototypes */
+static void initSelfPipe(void);
+static void drainSelfPipe(void);
+static void sendSelfPipeByte(void);
+
+/*
+ * Initialize a backend-local latch.
+ */
+void
+InitLatch(Latch *latch)
+{
+ Assert(latch->owner_pid == 0);
+
+ /* Initialize the self pipe if this is our first latch in the process */
+ if (selfpipe_readfd == -1)
+ initSelfPipe();
+
+ latch->owner_pid = MyProcPid;
+ latch->is_set = false;
+}
+
+/*
+ * Initialize an inter-process latch. Like InitLatch(), but the latch can
+ * be set from another process. A process that needs to wait on an
+ * inter-process latch also needs to ensure that latch_sigusr1_handler()
+ * is called from the SIGUSR1 signal handler.
+ *
+ * NB: You must increase the shared latch count in NumSharedLatches() in
+ * win32_latch.c if you introduce a new shared latch!
+ */
+void
+InitSharedLatch(Latch *latch)
+{
+ /*
+ * This is the same as InitLatch() in this implementation. The
+ * Windows implementation will likely differ.
+ */
+ InitLatch(latch);
+}
+
+/*
+ * Release a latch previously allocated with InitLatch() or InitSharedLatch().
+ */
+void
+ReleaseLatch(Latch *latch)
+{
+ Assert(latch->owner_pid == MyProcPid);
+ latch->owner_pid = 0;
+}
+
+/*
+ * Wait for given latch to be set, or until 'timeout' microseconds passes.
+ * If 'timeout' is 0, wait forever. If the latch is already set, returns
+ * immediately.
+ *
+ * The latch must have been previously initialized by the current process.
+ */
+void
+WaitLatch(Latch *latch, long timeout)
+{
+ WaitLatchOrSocket(latch, PGINVALID_SOCKET, timeout);
+}
+
+/*
+ * Like WaitLatch, but will also return when there's data available in
+ * 'sock' for reading.
+ */
+void
+WaitLatchOrSocket(Latch *latch, pgsocket sock, long timeout)
+{
+ struct timeval tv;
+ fd_set input_mask;
+ int rc;
+
+ if (latch->owner_pid != MyProcPid)
+ elog(ERROR, "cannot wait on a latch owned by another process");
+
+ waiting = true;
+ for (;;)
+ {
+ int hifd;
+
+ /*
+ * Clear the pipe, and check if the latch is set already. If someone
+ * sets the latch between this and the select() below, the setter
+ * will write a byte to the pipe (or signal us and the signal handler
+ * will do that), and the select() will return immediately.
+ */
+ drainSelfPipe();
+ if (latch->is_set)
+ break;
+
+ /* Sleep */
+ if (timeout > 0)
+ {
+ tv.tv_sec = timeout / 1000000L;
+ tv.tv_usec = timeout % 1000000L;
+ }
+
+ FD_ZERO(&input_mask);
+ FD_SET(selfpipe_readfd, &input_mask);
+ hifd = selfpipe_readfd;
+ if (sock != PGINVALID_SOCKET)
+ {
+ FD_SET(sock, &input_mask);
+ if (sock > hifd)
+ hifd = sock;
+ }
+
+ rc = select(hifd + 1, &input_mask, NULL, NULL,
+ (timeout > 0) ? &tv : NULL);
+ if (rc < 0)
+ {
+ if (errno != EINTR)
+ ereport(ERROR,
+ (errcode_for_socket_access(),
+ errmsg("select() failed: %m")));
+ }
+ if (rc == 0)
+ break; /* timeout exceeded */
+ if (sock != PGINVALID_SOCKET && FD_ISSET(sock, &input_mask))
+ break; /* data available in socket */
+ }
+ waiting = false;
+}
+
+/*
+ * Sets a latch and wakes up anyone waiting on it. Returns quickly if the
+ * latch is already set.
+ */
+void
+SetLatch(Latch *latch)
+{
+ pid_t owner_pid;
+
+ /* Quick exit if already set */
+ if (latch->is_set)
+ return;
+
+ latch->is_set = true;
+
+ /*
+ * See if anyone's waiting for the latch. It can be the current process
+ * if we're in a signal handler. We use the self-pipe to wake up the
+ * select() in that case. If it's another process, send a signal.
+ */
+ owner_pid = latch->owner_pid;
+ if (owner_pid == 0)
+ return;
+ else if (owner_pid == MyProcPid)
+ sendSelfPipeByte();
+ else
+ kill(owner_pid, SIGUSR1);
+}
+
+/*
+ * Clear the latch. Calling WaitLatch after this will sleep, unless
+ * the latch is set again before the WaitLatch call.
+ */
+void
+ResetLatch(Latch *latch)
+{
+ /* Only the owner should reset the latch */
+ Assert(latch->owner_pid == MyProcPid);
+
+ if (!latch->is_set)
+ return;
+
+ latch->is_set = false;
+}
+
+/*
+ * LatchShmemSize
+ * Compute space needed for latch's shared memory
+ *
+ * Not needed for Unix implementation.
+ */
+Size
+LatchShmemSize(void)
+{
+ return 0;
+}
+
+/*
+ * LatchShmemInit
+ * Allocate and initialize shared memory needed for latches
+ *
+ * Not needed for Unix implementation.
+ */
+void
+LatchShmemInit(void)
+{
+}
+
+/*
+ * SetLatch uses SIGUSR1 to wake up the process waiting on the latch.
+ */
+void
+latch_sigusr1_handler(void)
+{
+ if (waiting)
+ sendSelfPipeByte();
+}
+
+/* initialize the self-pipe */
+static void
+initSelfPipe(void)
+{
+ int pipefd[2];
+
+ /*
+ * Set up a pipe that allows a signal handler to wake up the select()
+ * in WaitLatch(). Make the write-end non-blocking, so that SetLatch()
+ * won't block if the event has already been set many times filling
+ * the kernel buffer. Make the read-end non-blocking too, so that we
+ * can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK.
+ */
+ if (pipe(pipefd) < 0)
+ elog(FATAL, "pipe() failed: %m");
+ if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) < 0)
+ elog(FATAL, "fcntl() failed on read-end of self-pipe: %m");
+ if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0)
+ elog(FATAL, "fcntl() failed on write-end of self-pipe: %m");
+
+ selfpipe_readfd = pipefd[0];
+ selfpipe_writefd = pipefd[1];
+}
+
+/* Send one byte to the self-pipe, to wake up WaitLatch() */
+static void
+sendSelfPipeByte(void)
+{
+ int rc;
+ char dummy = 0;
+
+retry:
+ rc = write(selfpipe_writefd, &dummy, 1);
+ if (rc < 0)
+ {
+ if (errno != EAGAIN && errno != EWOULDBLOCK)
+ {
+ /*
+ * XXX: Is it safe to elog(ERROR) in a signal handler?
+ */
+ elog(ERROR, "write() on self-pipe failed: %m");
+ }
+ if (errno == EINTR)
+ goto retry;
+ }
+}
+
+/* Read all available data from the self-pipe */
+static void
+drainSelfPipe(void)
+{
+ int rc;
+ char buf;
+
+ for (;;)
+ {
+ rc = read(selfpipe_readfd, &buf, 1);
+ if (rc < 0)
+ {
+ if (errno == EINTR)
+ continue;
+ if (errno == EAGAIN || errno == EWOULDBLOCK)
+ break; /* the pipe is empty */
+
+ elog(ERROR, "read() on self-pipe failed: %m");
+ }
+ else if (rc == 0)
+ elog(ERROR, "unexpected EOF on self-pipe");
+ }
+}
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
new file mode 100644
index 0000000..219af0f
--- /dev/null
+++ b/src/backend/port/win32_latch.c
@@ -0,0 +1,302 @@
+/*-------------------------------------------------------------------------
+ *
+ * win32_latch.c
+ * Routines for interprocess latches
+ *
+ * Windows implementation of latches, using Windows events. See
+ * unix_latch.c for information on usage.
+ *
+ * Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <signal.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+
+typedef struct
+{
+ volatile slock_t mutex;
+ int nfreehandles;
+ int maxhandles;
+ HANDLE handles[1]; /* variable length */
+} SharedEventHandles;
+
+SharedEventHandles *sharedHandles;
+
+/* Are we currently in WaitLatch()? The signal handler would like to know. */
+static volatile HANDLE waitingEvent = false;
+
+/*
+ * Initialize a backend-local latch.
+ */
+void
+InitLatch(Latch *latch)
+{
+ Assert(latch->event == NULL);
+ latch->isshared = false;
+ latch->event = CreateEvent(NULL, TRUE, FALSE, NULL);
+ latch->is_set = false;
+}
+
+/*
+ * Initialize an inter-process latch. Like InitLatch(), but the latch can
+ * be triggered from another process. A process that needs to wait on
+ * an inter-process latch also needs to ensure that latch_sigusr1_handler()
+ * is called from the SIGUSR1 signal handler.
+ */
+void
+InitSharedLatch(Latch *latch)
+{
+ Assert(latch->event == NULL);
+ SpinLockAcquire(&sharedHandles->mutex);
+ if (sharedHandles->nfreehandles <= 0)
+ {
+ SpinLockRelease(&sharedHandles->mutex);
+ elog(ERROR, "out of shared event objects");
+ }
+ sharedHandles->nfreehandles--;
+ latch->event = sharedHandles->handles[sharedHandles->nfreehandles];
+ SpinLockRelease(&sharedHandles->mutex);
+
+ latch->isshared = true;
+ latch->is_set = false;
+ ResetEvent(latch->event);
+}
+
+/*
+ * Release a latch previously allocated with InitLatch() or InitSharedLatch().
+ */
+void
+ReleaseLatch(Latch *latch)
+{
+ Assert(latch->event);
+ if (latch->isshared)
+ {
+ /* Put the event handle back to the pool */
+ SpinLockAcquire(&sharedHandles->mutex);
+ if (sharedHandles->nfreehandles >= sharedHandles->maxhandles)
+ {
+ SpinLockRelease(&sharedHandles->mutex);
+ elog(PANIC, "too many free event handles");
+ }
+ sharedHandles->handles[sharedHandles->nfreehandles] = latch->event;
+ sharedHandles->nfreehandles++;
+ SpinLockRelease(&sharedHandles->mutex);
+ latch->event = NULL;
+ }
+ else
+ {
+ /* XXX: is this safe from compiler rearrangement? */
+ HANDLE handle = latch->event;
+ latch->event = NULL;
+ CloseHandle(handle);
+ }
+}
+
+/*
+ * Wait for given latch to be set, or until 'timeout' microseconds passes.
+ * If 'timeout' is 0, wait forever. If the latch is already set, returns
+ * immediately.
+ *
+ * The latch must have been previously initialized by the current process.
+ */
+void
+WaitLatch(Latch *latch, long timeout)
+{
+ WaitLatchOrSocket(latch, PGINVALID_SOCKET, timeout);
+}
+
+/*
+ * Like WaitLatch, but will also return when there's data available in
+ * 'sock' for reading.
+ */
+void
+WaitLatchOrSocket(Latch *latch, SOCKET sock, long timeout)
+
+{
+ DWORD rc;
+ HANDLE events[3];
+ HANDLE sockevent;
+ int numevents;
+
+ events[0] = latch->event;
+ events[1] = pgwin32_signal_event;
+ numevents = 2;
+ if (sock != PGINVALID_SOCKET)
+ {
+ sockevent = WSACreateEvent();
+ WSAEventSelect(sock, sockevent, FD_READ);
+ events[numevents++] = sockevent;
+ }
+
+ waitingEvent = latch->event;
+ for (;;)
+ {
+ /*
+ * Reset the event, and check if the latch is set already. If someone
+ * sets the latch between this and the WaitForMultipleObjects() call
+ * below, the setter will set the event and WaitForMultipleObjects()
+ * will return immediately.
+ */
+ if (!ResetEvent(latch->event))
+ {
+ waitingEvent = NULL;
+ elog(ERROR, "ResetEvent failed: error code %d", (int) GetLastError());
+ }
+ if (latch->is_set)
+ break;
+
+ rc = WaitForMultipleObjects(numevents, events, FALSE, (timeout > 0) ? (timeout / 1000) : INFINITE);
+ if (rc == WAIT_FAILED)
+ {
+ ereport(ERROR,
+ (errcode_for_socket_access(),
+ errmsg("WaitForMultipleObjects() failed: error code %d", (int) GetLastError())));
+ }
+ if (rc == WAIT_TIMEOUT)
+ break; /* timeout exceeded */
+ if (rc == WAIT_OBJECT_0 + 1)
+ pgwin32_dispatch_queued_signals();
+ if (rc == WAIT_OBJECT_0 + 2)
+ {
+ Assert(sock != PGINVALID_SOCKET);
+ break;
+ }
+ }
+ waitingEvent = NULL;
+
+ /* Clean up the handle we created for the socket */
+ if (sock != PGINVALID_SOCKET)
+ {
+ WSAEventSelect(sock, sockevent, 0);
+ WSACloseEvent(sockevent);
+ }
+}
+
+/*
+ * Sets a latch and wakes up anyone waiting on it. Returns quickly if the
+ * latch is set already.
+ */
+void
+SetLatch(Latch *latch)
+{
+ HANDLE handle;
+
+ /* Quick exit if already set */
+ if (latch->is_set)
+ return;
+
+ latch->is_set = true;
+
+ /*
+ * See if anyone's waiting for the latch. It can be the current process
+ * if we're in a signal handler. Use a local variable here in case the
+ * latch is just released between the test and the SetEvent call.
+ */
+ handle = latch->event;
+ if (handle)
+ {
+ if (!SetEvent(handle))
+ elog(LOG, "SetEvent failed: error code %d", (int) GetLastError());
+ }
+}
+
+/*
+ * Clear the latch. Calling WaitLatch after this will sleep, unless
+ * the latch is set again before the WaitLatch call.
+ */
+void
+ResetLatch(Latch *latch)
+{
+ if (!latch->is_set)
+ return;
+
+ latch->is_set = false;
+}
+
+/*
+ * Number of shared latches, used to allocate the right number of shared
+ * Event handles at postmaster startup. You must update this if you
+ * introduce a new shared latch!
+ */
+static int
+NumSharedLatches(void)
+{
+ int numLatches = 0;
+
+ /* Each walsender needs one latch */
+ numLatches += max_wal_senders;
+
+ return numLatches;
+}
+
+/*
+ * LatchShmemSize
+ * Compute space needed for latch's shared memory
+ */
+Size
+LatchShmemSize(void)
+{
+ return offsetof(SharedEventHandles, handles) +
+ NumSharedLatches() * sizeof(HANDLE);
+}
+
+/*
+ * LatchShmemInit
+ * Allocate and initialize shared memory needed for latches
+ */
+void
+LatchShmemInit(void)
+{
+ Size size = LatchShmemSize();
+ bool found;
+
+ sharedHandles = ShmemInitStruct("SharedEventHandles", size, &found);
+
+ /* If we're first, initialize the struct and allocate handles */
+ if (!found)
+ {
+ int i;
+ SECURITY_ATTRIBUTES sa;
+
+ /*
+ * Set up security attributes to specify that the events are
+ * inherited.
+ */
+ ZeroMemory(&sa, sizeof(sa));
+ sa.nLength = sizeof(sa);
+ sa.bInheritHandle = TRUE;
+
+ SpinLockInit(&sharedHandles->mutex);
+ sharedHandles->maxhandles = NumSharedLatches();
+ sharedHandles->nfreehandles = sharedHandles->maxhandles;
+ for (i = 0; i < sharedHandles->maxhandles; i++)
+ {
+ sharedHandles->handles[i] = CreateEvent(&sa, TRUE, FALSE, NULL);
+ if (sharedHandles->handles[i] == NULL)
+ elog(ERROR, "CreateEvent failed: error code %d", (int) GetLastError());
+ }
+ }
+}
+
+/* I'm not sure if this is needed on Windows... */
+void
+latch_sigusr1_handler(void)
+{
+ HANDLE handle = waitingEvent;
+ if (handle)
+ SetEvent(handle);
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 53c2581..cda191d 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -34,6 +34,7 @@
*/
#include "postgres.h"
+#include <signal.h>
#include <unistd.h>
#include "access/xlog_internal.h"
@@ -66,9 +67,6 @@ bool am_walsender = false; /* Am I a walsender process ? */
int max_wal_senders = 0; /* the maximum number of concurrent walsenders */
int WalSndDelay = 200; /* max sleep time between some actions */
-#define NAPTIME_PER_CYCLE 100000L /* max sleep time between cycles
- * (100ms) */
-
/*
* These variables are used similarly to openLogFile/Id/Seg/Off,
* but for walsender to read the XLOG.
@@ -93,6 +91,7 @@ static volatile sig_atomic_t ready_to_stop = false;
static void WalSndSigHupHandler(SIGNAL_ARGS);
static void WalSndShutdownHandler(SIGNAL_ARGS);
static void WalSndQuickDieHandler(SIGNAL_ARGS);
+static void WalSndXLogSendHandler(SIGNAL_ARGS);
static void WalSndLastCycleHandler(SIGNAL_ARGS);
/* Prototypes for private functions */
@@ -144,6 +143,16 @@ WalSenderMain(void)
/* Handle handshake messages before streaming */
WalSndHandshake();
+ /* Initialize shared memory status */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ walsnd->sentPtr = sentPtr;
+ SpinLockRelease(&walsnd->mutex);
+ }
+
/* Main loop of walsender */
return WalSndLoop();
}
@@ -380,8 +389,6 @@ WalSndLoop(void)
/* Loop forever, unless we get an error */
for (;;)
{
- long remain; /* remaining time (us) */
-
/*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
@@ -421,32 +428,42 @@ WalSndLoop(void)
/*
* If we had sent all accumulated WAL in last round, nap for the
* configured time before retrying.
- *
- * On some platforms, signals won't interrupt the sleep. To ensure we
- * respond reasonably promptly when someone signals us, break down the
- * sleep into NAPTIME_PER_CYCLE increments, and check for interrupts
- * after each nap.
*/
if (caughtup)
{
- remain = WalSndDelay * 1000L;
- while (remain > 0)
- {
- /* Check for interrupts */
- if (got_SIGHUP || shutdown_requested || ready_to_stop)
- break;
+ /*
+ * Even if we wrote all the WAL that was available when we started
+ * sending, more might have arrived while we were sending this
+ * batch. We had the latch set while sending, so we have not
+ * received any signals from that time. Let's arm the latch
+ * again, and after that check that we're still up-to-date.
+ */
+ ResetLatch(&MyWalSnd->latch);
- /* Sleep and check that the connection is still alive */
- pg_usleep(remain > NAPTIME_PER_CYCLE ? NAPTIME_PER_CYCLE : remain);
- CheckClosedConnection();
+ if (!XLogSend(output_message, &caughtup))
+ break;
+ if (caughtup && !got_SIGHUP && !ready_to_stop && !shutdown_requested)
+ {
+ /*
+ * XXX: We don't really need the periodic wakeups anymore,
+ * WaitLatchOrSocket should reliably wake up as soon as
+ * something interesting happens.
+ */
- remain -= NAPTIME_PER_CYCLE;
+ /* Sleep */
+ WaitLatchOrSocket(&MyWalSnd->latch, MyProcPort->sock, 0);
+ elog(LOG, "woke up");
}
- }
- /* Attempt to send the log once every loop */
- if (!XLogSend(output_message, &caughtup))
- break;
+ /* Check if the connection was closed */
+ CheckClosedConnection();
+ }
+ else
+ {
+ /* Attempt to send the log once every loop */
+ if (!XLogSend(output_message, &caughtup))
+ break;
+ }
}
/*
@@ -493,10 +510,15 @@ InitWalSnd(void)
}
else
{
- /* found */
- MyWalSnd = (WalSnd *) walsnd;
+ /*
+ * Found a free slot. Take ownership of the latch and initialize
+ * the other fields.
+ */
+ InitSharedLatch((Latch *) &walsnd->latch);
walsnd->pid = MyProcPid;
- MemSet(&MyWalSnd->sentPtr, 0, sizeof(XLogRecPtr));
+ MemSet(&walsnd->sentPtr, 0, sizeof(XLogRecPtr));
+ /* Set MyWalSnd only after it's fully initialized. */
+ MyWalSnd = (WalSnd *) walsnd;
SpinLockRelease(&walsnd->mutex);
break;
}
@@ -523,6 +545,7 @@ WalSndKill(int code, Datum arg)
* for this.
*/
MyWalSnd->pid = 0;
+ ReleaseLatch(&MyWalSnd->latch);
/* WalSnd struct isn't mine anymore */
MyWalSnd = NULL;
@@ -787,6 +810,8 @@ static void
WalSndSigHupHandler(SIGNAL_ARGS)
{
got_SIGHUP = true;
+ if (MyWalSnd)
+ SetLatch(&MyWalSnd->latch);
}
/* SIGTERM: set flag to shut down */
@@ -794,6 +819,8 @@ static void
WalSndShutdownHandler(SIGNAL_ARGS)
{
shutdown_requested = true;
+ if (MyWalSnd)
+ SetLatch(&MyWalSnd->latch);
}
/*
@@ -828,11 +855,20 @@ WalSndQuickDieHandler(SIGNAL_ARGS)
exit(2);
}
+/* SIGUSR1: set flag to send WAL records */
+static void
+WalSndXLogSendHandler(SIGNAL_ARGS)
+{
+ latch_sigusr1_handler();
+}
+
/* SIGUSR2: set flag to do a last cycle and shut down afterwards */
static void
WalSndLastCycleHandler(SIGNAL_ARGS)
{
ready_to_stop = true;
+ if (MyWalSnd)
+ SetLatch(&MyWalSnd->latch);
}
/* Set up signal handlers */
@@ -847,7 +883,7 @@ WalSndSignals(void)
pqsignal(SIGQUIT, WalSndQuickDieHandler); /* hard crash time */
pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
- pqsignal(SIGUSR1, SIG_IGN); /* not used */
+ pqsignal(SIGUSR1, WalSndXLogSendHandler); /* request WAL sending */
pqsignal(SIGUSR2, WalSndLastCycleHandler); /* request a last cycle and
* shutdown */
@@ -895,6 +931,16 @@ WalSndShmemInit(void)
}
}
+/* Wake up all walsenders */
+void
+WalSndWakeup(void)
+{
+ int i;
+
+ for (i = 0; i < max_wal_senders; i++)
+ SetLatch(&WalSndCtl->walsnds[i].latch);
+}
+
/*
* This isn't currently used for anything. Monitoring tools might be
* interested in the future, and we'll need something like this in the
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 492ac9a..0083513 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/latch.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
#include "storage/procarray.h"
@@ -117,6 +118,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, SInvalShmemSize());
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
+ size = add_size(size, LatchShmemSize());
size = add_size(size, BgWriterShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, WalSndShmemSize());
@@ -217,6 +219,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
*/
PMSignalShmemInit();
ProcSignalShmemInit();
+ LatchShmemInit();
BgWriterShmemInit();
AutoVacuumShmemInit();
WalSndShmemInit();
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index e892e2d..3dd25f1 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -21,6 +21,7 @@
#include "commands/async.h"
#include "miscadmin.h"
#include "storage/ipc.h"
+#include "storage/latch.h"
#include "storage/procsignal.h"
#include "storage/shmem.h"
#include "storage/sinval.h"
@@ -278,5 +279,7 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+ latch_sigusr1_handler();
+
errno = save_errno;
}
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 874959e..73c5904 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -13,6 +13,7 @@
#define _WALSENDER_H
#include "access/xlog.h"
+#include "storage/latch.h"
#include "storage/spin.h"
/*
@@ -24,6 +25,12 @@ typedef struct WalSnd
XLogRecPtr sentPtr; /* WAL has been sent up to this point */
slock_t mutex; /* locks shared variables shown above */
+
+ /*
+ * Latch used by backends to wake up this walsender when it has work
+ * to do.
+ */
+ Latch latch;
} WalSnd;
/* There is one WalSndCtl struct for the whole database cluster */
@@ -45,5 +52,6 @@ extern int WalSenderMain(void);
extern void WalSndSignals(void);
extern Size WalSndShmemSize(void);
extern void WalSndShmemInit(void);
+extern void WalSndWakeup(void);
#endif /* _WALSENDER_H */
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
new file mode 100644
index 0000000..677e723
--- /dev/null
+++ b/src/include/storage/latch.h
@@ -0,0 +1,50 @@
+/*-------------------------------------------------------------------------
+ *
+ * latch.h
+ * Routines for interprocess latches
+ *
+ *
+ * Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LATCH_H
+#define LATCH_H
+
+#include <signal.h>
+
+#ifndef WIN32
+typedef struct
+{
+ volatile sig_atomic_t is_set;
+ volatile sig_atomic_t owner_pid;
+} Latch;
+#else
+typedef struct
+{
+ volatile sig_atomic_t is_set;
+ bool isshared;
+ HANDLE event;
+} Latch;
+#endif
+
+/*
+ * prototypes for functions in latch.c
+ */
+extern void InitLatch(Latch *latch);
+extern void InitSharedLatch(Latch *latch);
+extern void ReleaseLatch(Latch *latch);
+extern void WaitLatch(Latch *latch, long timeout);
+extern void WaitLatchOrSocket(Latch *latch, pgsocket sock, long timeout);
+extern void SetLatch(Latch *latch);
+extern void ResetLatch(Latch *latch);
+
+extern Size LatchShmemSize(void);
+extern void LatchShmemInit(void);
+
+extern void latch_sigusr1_handler(void);
+
+#endif /* LATCH_H */
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index cf7c3ee..bf0e904 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -64,6 +64,7 @@ sub mkvcbuild
$postgres->ReplaceFile('src\backend\port\dynloader.c','src\backend\port\dynloader\win32.c');
$postgres->ReplaceFile('src\backend\port\pg_sema.c','src\backend\port\win32_sema.c');
$postgres->ReplaceFile('src\backend\port\pg_shmem.c','src\backend\port\win32_shmem.c');
+ $postgres->ReplaceFile('src\backend\port\pg_latch.c','src\backend\port\win32_latch.c');
$postgres->AddFiles('src\port',@pgportfiles);
$postgres->AddDir('src\timezone');
$postgres->AddFiles('src\backend\parser','scan.l','gram.y');
diff --git a/src/tools/msvc/build.pl b/src/tools/msvc/build.pl
index 24db58c..bf8f356 100644
--- a/src/tools/msvc/build.pl
+++ b/src/tools/msvc/build.pl
@@ -56,7 +56,7 @@ if ($buildwhat)
}
else
{
- system("msbuild pgsql.sln /verbosity:detailed /p:Configuration=$bconf");
+ system("msbuild pgsql.sln /verbosity:normal /p:Configuration=$bconf");
}
# report status
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Ok, here's an updated patch with WaitLatchOrSocket that lets you do that.
Minor code review items:
s/There is three/There are three/ in unix_latch.c header comment
The header comment points out the correct usage of ResetLatch, but I
think it should also emphasize that the correct usage of SetLatch is
to set whatever flags indicate work-to-do before SetLatch.
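[Editorial note: the flag-then-SetLatch ordering described above, together with the Reset-then-recheck rule on the waiting side, can be sketched with a toy single-process model. All names and types here are illustrative, not the patch's actual API.]

```c
#include <assert.h>
#include <signal.h>

/* Toy model of the latch protocol; the real Latch also records an
 * owner pid and signals it.  Names here are illustrative only. */
typedef struct { volatile sig_atomic_t is_set; } ToyLatch;

static volatile sig_atomic_t work_pending = 0;

/* Setter side: publish the work-to-do flag BEFORE setting the latch,
 * so a waiter that wakes up is guaranteed to see the flag. */
static void request_work(ToyLatch *latch)
{
    work_pending = 1;       /* work flag first... */
    latch->is_set = 1;      /* ...then SetLatch */
}

/* Waiter side: reset the latch BEFORE checking the flag, so a set
 * that races with the reset is caught on the next iteration. */
static int check_for_work(ToyLatch *latch)
{
    latch->is_set = 0;      /* ResetLatch first... */
    if (work_pending)       /* ...then re-check the work flag */
    {
        work_pending = 0;
        return 1;
    }
    return 0;
}
```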
I don't care for the Asserts in InitLatch and InitSharedLatch.
Initialization functions ought not assume that the struct they are
told to initialize contains anything but garbage. And *especially*
not assume that without documentation.
s/inter-proess/inter-process/, at least 2 places
Does ReleaseLatch() have any actual use-case, and if so what would it be?
I think we could do without it.
The WaitLatch timeout API could use a bit of refinement. I'd suggest
defining negative timeout as meaning wait forever, so that timeout = 0
can be used for "check but don't wait". Also, it seems like the
function shouldn't just return void but should return a bool to show
whether it saw the latch set or timed out. (Yeah, I realize the caller
could look into the latch to find that out, but callers really ought to
treat latches as opaque structs.)
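[Editorial note: one way to read the proposed timeout semantics -- negative timeout blocks forever, zero polls without waiting, and the result reports whether the latch was seen set -- is the following toy sketch. It stubs out the actual blocking and is not the patch's implementation.]

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

typedef struct { volatile sig_atomic_t is_set; } ToyLatch;

/* Proposed contract: return true if the latch was seen set, false on
 * timeout.  timeout_us < 0 means wait forever; timeout_us == 0 means
 * "check but don't wait".  Real blocking (select etc.) is stubbed. */
static bool toy_wait_latch(volatile ToyLatch *latch, long timeout_us)
{
    for (;;)
    {
        if (latch->is_set)
            return true;        /* latch set: report it */
        if (timeout_us == 0)
            return false;       /* poll only: give up at once */
        if (timeout_us > 0)
            return false;       /* pretend the full timeout elapsed */
        /* timeout_us < 0: a real implementation would block here in
         * select(); for this sketch we assume the latch gets set. */
        latch->is_set = 1;
    }
}
```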
I don't think you have the select-failed logic right in
WaitLatchOrSocket; on EINTR it will suppose that FD_ISSET is a valid
test to make, which I think ain't the case. Just "continue" around
the loop.
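[Editorial note: the fix suggested above is to treat EINTR as "retry the loop" without consulting the fd sets at all, since their contents are undefined after a failed select(). A generic sketch, not the patch code:]

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/* Wait for sock to become readable.  Returns 1 if readable, 0 on
 * timeout, -1 on error.  After select() fails with EINTR the fd sets
 * are in an undefined state, so we must NOT test FD_ISSET -- just
 * "continue" around the loop and retry. */
static int wait_readable(int sock, struct timeval *timeout)
{
    for (;;)
    {
        fd_set  input_mask;
        int     rc;

        FD_ZERO(&input_mask);
        FD_SET(sock, &input_mask);

        rc = select(sock + 1, &input_mask, NULL, NULL, timeout);
        if (rc < 0)
        {
            if (errno == EINTR)
                continue;       /* signal: retry, don't touch fd sets */
            return -1;          /* real error */
        }
        if (rc == 0)
            return 0;           /* timed out */
        return FD_ISSET(sock, &input_mask) ? 1 : 0;
    }
}
```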
Comment for unix_latch's latch_sigusr1_handler refers to SetEvent,
which is a Windows-ism.
+ * XXX: Is it safe to elog(ERROR) in a signal handler?
No, it isn't.
It seems like both implementations are #include'ing more than they
ought to --- why replication/walsender.h, in particular?
I don't think unix_latch needs spin.h either.
+typedef struct
+{
+ volatile sig_atomic_t is_set;
+ volatile sig_atomic_t owner_pid;
+} Latch;
I don't believe it is either sane or portable to declare struct fields
as volatile. You need to attach the volatile qualifier to the function
arguments instead, eg
extern WaitLatch(volatile Latch *latch, ...)
Also, using sig_atomic_t for owner_pid is entirely not sane.
On many platforms sig_atomic_t is only a byte, and besides
which you have no need for that field to be settable by a
signal handler.
regards, tom lane
On Fri, Sep 3, 2010 at 5:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Does ReleaseLatch() have any actual use-case, and if so what would it be?
I think we could do without it.
In Unix, probably we can live without that. But in Windows, we need to
free SharedEventHandles slot for upcoming process using a latch when
ending.
The WaitLatch timeout API could use a bit of refinement. I'd suggest
defining negative timeout as meaning wait forever, so that timeout = 0
can be used for "check but don't wait". Also, it seems like the
function shouldn't just return void but should return a bool to show
whether it saw the latch set or timed out. (Yeah, I realize the caller
could look into the latch to find that out, but callers really ought to
treat latches as opaque structs.)
Agreed.
I don't think you have the select-failed logic right in
WaitLatchOrSocket; on EINTR it will suppose that FD_ISSET is a valid
test to make, which I think ain't the case. Just "continue" around
the loop.
EINTR already makes us go back to the top of the loop since FD_ISSET(pipe)
is not checked. Then we would clear the pipe and break out of the loop
because of "latch->is_set == true".
+ * XXX: Is it safe to elog(ERROR) in a signal handler?
No, it isn't.
We should use elog(FATAL) or check proc_exit_inprogress, instead?
+ if (errno != EAGAIN && errno != EWOULDBLOCK)
+ {
+ /*
+ * XXX: Is it safe to elog(ERROR) in a signal handler?
+ */
+ elog(ERROR, "write() on self-pipe failed: %m");
+ }
+ if (errno == EINTR)
+ goto retry;
"errno == EINTR)" seems to be never checked.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes:
+ * XXX: Is it safe to elog(ERROR) in a signal handler?
No, it isn't.
We should use elog(FATAL) or check proc_exit_inprogress, instead?
elog(FATAL) is *certainly* not a better idea. I think there's really
nothing that can be done, you just have to silently ignore the error.
BTW, if we retry, there had probably better be a limit on how many times
to retry ...
+ if (errno != EAGAIN && errno != EWOULDBLOCK)
+ {
+ /*
+ * XXX: Is it safe to elog(ERROR) in a signal handler?
+ */
+ elog(ERROR, "write() on self-pipe failed: %m");
+ }
+ if (errno == EINTR)
+ goto retry;
"errno == EINTR)" seems to be never checked.
Another issue with coding like that is that it supposes elog() won't
change errno.
regards, tom lane
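[Editorial note: putting the points from this subthread together -- check EINTR before the error branch, never elog() from the handler, and bound the number of retries -- the self-pipe write might look roughly like this. The function name and retry bound are illustrative, not the committed code.]

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Write one wakeup byte to the self-pipe.  Signal-handler safe: no
 * elog(), and the retry loop is bounded as suggested above.  Returns
 * 0 on success or if the pipe is already full (EAGAIN: a wakeup is
 * pending anyway), -1 on any other failure, which a handler can only
 * silently ignore. */
static int sendSelfPipeByte(int pipe_write_fd)
{
    int  retries = 10;          /* arbitrary bound on retries */
    char dummy = 0;

    while (retries-- > 0)
    {
        if (write(pipe_write_fd, &dummy, 1) == 1)
            return 0;           /* wakeup byte written */
        if (errno == EINTR)
            continue;           /* interrupted: retry (bounded) */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;           /* pipe full: reader will wake anyway */
        return -1;              /* real error: nothing safe to do */
    }
    return -1;                  /* retry budget exhausted */
}
```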
On Fri, Sep 3, 2010 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Fujii Masao <masao.fujii@gmail.com> writes:
+ * XXX: Is it safe to elog(ERROR) in a signal handler?
No, it isn't.
We should use elog(FATAL) or check proc_exit_inprogress, instead?
elog(FATAL) is *certainly* not a better idea. I think there's really
nothing that can be done, you just have to silently ignore the error.
Hmm.. some functions called by a signal handler use elog(FATAL), e.g.,
RecoveryConflictInterrupt() does that when an unknown conflict mode is given
as an argument. Are these calls unsafe, too?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 02/09/10 23:13, Tom Lane wrote:
The WaitLatch timeout API could use a bit of refinement. I'd suggest
defining negative timeout as meaning wait forever, so that timeout = 0
can be used for "check but don't wait". Also, it seems like the
function shouldn't just return void but should return a bool to show
whether it saw the latch set or timed out.
In case of WaitLatchOrSocket, the caller might want to know if a latch
was set, the socket became readable, or it timed out. So we need three
different return values.
(Yeah, I realize the caller
could look into the latch to find that out, but callers really ought to
treat latches as opaque structs.)
Hmm, maybe we need a TestLatch function to check if a latch is set.
I don't think you have the select-failed logic right in
WaitLatchOrSocket; on EINTR it will suppose that FD_ISSET is a valid
test to make, which I think ain't the case. Just "continue" around
the loop.
Yep.
I also realized that the timeout handling is a bit surprising with
interrupts. After EINTR we call select() again with the same timeout, so
a signal effectively restarts the timer. We seem to have similar
behavior in a couple of other places, in pgstat.c and auth.c. So maybe
that's OK and just needs to be documented, but I thought I'd bring it up.
It seems like both implementations are #include'ing more than they
ought to --- why replication/walsender.h, in particular?
Windows implementation needs it for the max_wal_senders variable, to
allocate enough shared Event objects in LatchShmemInit. In unix_latch.c
it's not needed.
Also, using sig_atomic_t for owner_pid is entirely not sane.
On many platforms sig_atomic_t is only a byte, and besides
which you have no need for that field to be settable by a
signal handler.
Hmm, true, it doesn't need to be set from signal handler, but is there
an atomicity problem if one process calls ReleaseLatch while another
process is in SetLatch? ReleaseLatch sets owner_pid to 0, while SetLatch
reads it and calls kill() on it. Can we assume that pid_t is atomic, or
do we need a spinlock to protect it? (Windows implementation has a
similar issue with HANDLE instead of pid_t)
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Fujii Masao <masao.fujii@gmail.com> writes:
On Fri, Sep 3, 2010 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
elog(FATAL) is *certainly* not a better idea. I think there's really
nothing that can be done, you just have to silently ignore the error.
Hmm.. some functions called by a signal handler use elog(FATAL), e.g.,
RecoveryConflictInterrupt() does that when an unknown conflict mode is given
as an argument. Are these calls unsafe, too?
[ shrug... ] I stated before that the Hot Standby patch is doing
utterly unsafe things in signal handlers. Simon rejected that.
I am waiting for irrefutable evidence to emerge from the field
(and am very confident that it will be forthcoming...) before
I argue with him further. Meanwhile, I'm not going to accept anything
unsafe in a core facility like this patch is going to be.
regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 02/09/10 23:13, Tom Lane wrote:
(Yeah, I realize the caller
could look into the latch to find that out, but callers really ought to
treat latches as opaque structs.)
Hmm, maybe we need a TestLatch function to check if a latch is set.
+1. It could be a macro for now. I wish that we could declare Latch as
an opaque struct, but we probably need to let callers embed it in a
larger struct, so we'll just have to rely on callers to code cleanly.
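[Editorial note: the TestLatch discussed here could indeed be a one-line macro over the is_set field, with the field access hidden so callers keep treating Latch as opaque. A hypothetical sketch:]

```c
#include <assert.h>
#include <signal.h>

/* Toy latch; the real struct has more fields. */
typedef struct { volatile sig_atomic_t is_set; } ToyLatch;

/* Hypothetical TestLatch: report whether the latch is currently set,
 * without resetting it (non-destructive, unlike ResetLatch). */
#define TestLatch(latch) ((latch)->is_set != 0)
```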
I also realized that the timeout handling is a bit surprising with
interrupts. After EINTR we call select() again with the same timeout, so
a signal effectively restarts the timer.
Actually it's platform-dependent. On some machines (I think
BSD-derived) the value of the timeout struct gets reduced by the elapsed
time before returning, so that if you just loop around you don't get
more than the originally specified delay time in total.
We seem to have similar
behavior in a couple of other places, in pgstat.c and auth.c. So maybe
that's OK and just needs to be documented, but I thought I'd bring it up.
Yeah. I am hoping that this facility will greatly reduce the need for
callers to care about the exact timeout delay, so I think that what we
should do for now is just document that the timeout might be several
times as long as specified, and see how it goes from there.
We could fix the problem if we had to, by adding some gettimeofday()
calls and computing the remaining delay time each time through the
loop. But let's avoid doing that until it's clear we need it.
Also, using sig_atomic_t for owner_pid is entirely not sane.
Hmm, true, it doesn't need to be set from signal handler, but is there
an atomicity problem if one process calls ReleaseLatch while another
process is in SetLatch?
If there is *any* possibility of that happening then you have far worse
problems than whether the field is atomically readable or not: the
behavior will be unpredictable at just slightly larger timescales.
This is the reason why I think it'd be better if ReleaseLatch simply
didn't exist. That'd discourage people from designing dynamic latch
structures, which are fundamentally going to be subject to race
conditions.
regards, tom lane
On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Fujii Masao <masao.fujii@gmail.com> writes:
On Fri, Sep 3, 2010 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
elog(FATAL) is *certainly* not a better idea. I think there's really
nothing that can be done, you just have to silently ignore the error.
Hmm.. some functions called by a signal handler use elog(FATAL), e.g.,
RecoveryConflictInterrupt() does that when an unknown conflict mode is given
as an argument. Are these calls unsafe, too?
[ shrug... ] I stated before that the Hot Standby patch is doing
utterly unsafe things in signal handlers. Simon rejected that.
I am waiting for irrefutable evidence to emerge from the field
(and am very confident that it will be forthcoming...) before
I argue with him further. Meanwhile, I'm not going to accept anything
unsafe in a core facility like this patch is going to be.
Oh. I thought you had ignored his objections and fixed it. Why are
we releasing 9.0 with this problem again? Surely this is nuts.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[ shrug... ] I stated before that the Hot Standby patch is doing
utterly unsafe things in signal handlers. Simon rejected that.
I am waiting for irrefutable evidence to emerge from the field
(and am very confident that it will be forthcoming...) before
I argue with him further. Meanwhile, I'm not going to accept anything
unsafe in a core facility like this patch is going to be.
Oh. I thought you had ignored his objections and fixed it. Why are
we releasing 9.0 with this problem again? Surely this is nuts.
My original review of hot standby found about half a dozen things
I thought were broken:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg00178.php
After a *very* long-drawn-out fight I fixed one of them
(max_standby_delay), largely still over Simon's objections. I don't
have the energy to repeat that another half-dozen times, so I'm going
to wait for the suspected problems to be proven by field experience.
regards, tom lane
On Fri, Sep 3, 2010 at 12:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[ shrug... ] I stated before that the Hot Standby patch is doing
utterly unsafe things in signal handlers. Simon rejected that.
I am waiting for irrefutable evidence to emerge from the field
(and am very confident that it will be forthcoming...) before
I argue with him further. Meanwhile, I'm not going to accept anything
unsafe in a core facility like this patch is going to be.
Oh. I thought you had ignored his objections and fixed it. Why are
we releasing 9.0 with this problem again? Surely this is nuts.
My original review of hot standby found about half a dozen things
I thought were broken:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg00178.php
After a *very* long-drawn-out fight I fixed one of them
(max_standby_delay), largely still over Simon's objections. I don't
have the energy to repeat that another half-dozen times, so I'm going
to wait for the suspected problems to be proven by field experience.
Bummer. Allow me to cast a vote for doing something about the fact
that handle_standby_sig_alarm() thinks it can safely acquire an LWLock
in a signal handler. I think we should be making our decisions on
what to change in the code based on what is technically sound, rather
than based on how much the author complains about changing it. Of
course there may be cases where there is a legitimate difference of
opinion concerning the best way forward, but I don't believe this is
one of them.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On 03/09/10 17:51, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
On 02/09/10 23:13, Tom Lane wrote:
Also, using sig_atomic_t for owner_pid is entirely not sane.
Hmm, true, it doesn't need to be set from signal handler, but is there
an atomicity problem if one process calls ReleaseLatch while another
process is in SetLatch?
If there is *any* possibility of that happening then you have far worse
problems than whether the field is atomically readable or not: the
behavior will be unpredictable at just slightly larger timescales.
This is the reason why I think it'd be better if ReleaseLatch simply
didn't exist. That'd discourage people from designing dynamic latch
structures, which are fundamentally going to be subject to race
conditions.
Each Walsender needs a latch, and walsenders come and go. The version I
first experimented with had no ReleaseLatch function; instead any process
could call WaitLatch on any shared latch, as long as only one process
waits on a given latch at a time. But it had exactly the same problem,
WaitLatch had to set the pid on the Latch struct to allow other
processes to send the signal. Another process could call SetLatch and
read the pid field, while WaitLatch is just setting it. I think we'll
have to put a spinlock there, if we can't assume that assignment of
pid_t is atomic. It's not the end of the world..
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 03/09/10 17:51, Tom Lane wrote:
If there is *any* possibility of that happening then you have far worse
problems than whether the field is atomically readable or not: the
behavior will be unpredictable at just slightly larger timescales.
Each Walsender needs a latch, and walsenders come and go.
Well, then we need to think extremely hard about the circumstances in
which we need to send a cross-process latch signal to walsenders and
what the behavior needs to be in the race conditions.
WaitLatch had to set the pid on the Latch struct to allow other
processes to send the signal. Another process could call SetLatch and
read the pid field, while WaitLatch is just setting it. I think we'll
have to put a spinlock there, if we can't assume that assignment of
pid_t is atomic. It's not the end of the world..
Yes it is. Signal handlers can't take spinlocks (what if they interrupt
while the mainline is holding the lock?).
It's probably not too unreasonable to assume that pid_t assignment is
atomic. But I'm still thinking that we have bigger problems than that
if there are really cases where SetLatch can execute at approximately
the same time as a latch owner is coming or going.
regards, tom lane
On 03/09/10 21:16, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
WaitLatch had to set the pid on the Latch struct to allow other
processes to send the signal. Another process could call SetLatch and
read the pid field, while WaitLatch is just setting it. I think we'll
have to put a spinlock there, if we can't assume that assignment of
pid_t is atomic. It's not the end of the world..
Yes it is. Signal handlers can't take spinlocks (what if they interrupt
while the mainline is holding the lock?).
Ok, I see.
It's probably not too unreasonable to assume that pid_t assignment is
atomic. But I'm still thinking that we have bigger problems than that
if there are really cases where SetLatch can execute at approximately
the same time as a latch owner is coming or going.
I don't see how to avoid it. A walsender, or any process really, can
exit at any time. It can make the latch inaccessible to others before it
exits to minimize the window, but it's always going to be possible that
another process is just about to call SetLatch when you exit.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 03/09/10 21:16, Tom Lane wrote:
It's probably not too unreasonable to assume that pid_t assignment is
atomic. But I'm still thinking that we have bigger problems than that
if there are really cases where SetLatch can execute at approximately
the same time as a latch owner is coming or going.
I don't see how to avoid it. A walsender, or any process really, can
exit at any time. It can make the latch inaccessible to others before it
exits to minimize the window, but it's always going to be possible that
another process is just about to call SetLatch when you exit.
Well, in that case what we need to do is presume that the latch object
has a continuing existence but the owner/receiver can come and go.
I would suggest that InitLatch needs to initialize the object into a
valid but unowned state; there is *no* deinitialize operation; and
there are AcquireLatch and ReleaseLatch operations to become owner
or stop being owner. We also need to define the semantics of SetLatch
on an unowned latch --- does this set a signal condition that will be
available to the next owner?
This amount of complexity might be overkill for local latches, but I
think we need it for shared ones.
regards, tom lane
Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[ shrug... ] I stated before that the Hot Standby patch is doing
utterly unsafe things in signal handlers. Simon rejected that.
I am waiting for irrefutable evidence to emerge from the field
(and am very confident that it will be forthcoming...) [...]
[...] Why are
we releasing 9.0 with this problem again? Surely this is nuts.
Will the docs give enough info so that release note readers
will know when they're giving well-informed consent to volunteer
to produce such field evidence?
On Fri, Sep 3, 2010 at 3:11 PM, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:
Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[ shrug... ] I stated before that the Hot Standby patch is doing
utterly unsafe things in signal handlers. Simon rejected that.
I am waiting for irrefutable evidence to emerge from the field
(and am very confident that it will be forthcoming...) [...]
[...] Why are
we releasing 9.0 with this problem again? Surely this is nuts.Will the docs give enough info so that release note readers
will know when they're giving well-informed consent to volunteer
to produce such field evidence?
Yeah, exactly. Good news: you can now run queries on the standby.
Bad news: we've abandoned our policy of not releasing with known bugs.
Have fun and enjoy using PostgreSQL, the world's most advanced open
source databSegmentation fault
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On 03/09/10 19:38, Robert Haas wrote:
On Fri, Sep 3, 2010 at 12:10 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
Robert Haas<robertmhaas@gmail.com> writes:
On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
[ shrug... ] I stated before that the Hot Standby patch is doing
utterly unsafe things in signal handlers. Simon rejected that.
I am waiting for irrefutable evidence to emerge from the field
(and am very confident that it will be forthcoming...) before
I argue with him further. Meanwhile, I'm not going to accept anything
unsafe in a core facility like this patch is going to be.
Oh. I thought you had ignored his objections and fixed it. Why are
we releasing 9.0 with this problem again? Surely this is nuts.
My original review of hot standby found about half a dozen things
I thought were broken:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg00178.php
After a *very* long-drawn-out fight I fixed one of them
(max_standby_delay), largely still over Simon's objections. I don't
have the energy to repeat that another half-dozen times, so I'm going
to wait for the suspected problems to be proven by field experience.
Bummer. Allow me to cast a vote for doing something about the fact
that handle_standby_sig_alarm() thinks it can safely acquire an LWLock
in a signal handler. I think we should be making our decisions on
what to change in the code based on what is technically sound, rather
than based on how much the author complains about changing it. Of
course there may be cases where there is a legitimate difference of
opinion concerning the best way forward, but I don't believe this is
one of them.
Hmm, just to make the risk more concrete, here's one scenario that could
happen:
1. Startup process tries to acquire cleanup lock on a page. It's pinned,
so it has to wait, and calls ResolveRecoveryConflictWithBufferPin().
2. ResolveRecoveryConflictWithBufferPin enables the standby SIGALRM
handler by calling enable_standby_sig_alarm(), and calls
ProcWaitForSignal().
3. ProcWaitForSignal() calls semop() (assuming sysv semaphores here) to
wait for the process semaphore
4. Max standby delay is reached and SIGALRM fired. CheckStandbyTimeout()
is called in signal handler. CheckStandbyTimeout() calls
SendRecoveryConflictWithBufferPin(), which calls CancelDBBackends()
5. CancelDBBackends() tries to acquire ProcArrayLock in exclusive mode.
It's being held by another process, so we have to sleep
6. To sleep, LWLockAcquire calls PGSemaphoreLock, which calls semop() to
wait on the process semaphore.
So we now have the same process nested twice inside a semop() call.
Looking at the Linux signal(7) man page, it is undefined what happens
if semop() is re-entered in a signal handler while another semop() call
is happening in main line of execution. Assuming it somehow works, which
semop() call is going to return when the semaphore is incremented?
Maybe that's OK: if I'm reading the deadlock checker code correctly, it
also calls semop() to increment another process' semaphore, and the
deadlock checker can be invoked from a signal handler while in semop()
to wait on our own process' semaphore. BTW, sem_post(), which is the Posix
function for incrementing a semaphore, is listed as a safe function to
call in a signal handler. But it's certainly fishy.
A safer approach would be to just PGSemaphoreUnlock() in the signal
handler, and do all the other processing outside it. You'd still call
semop() within semop(), but at least it would be closer to the semop()
within semop() we already do in the deadlock checker. And there would be
less randomness from timing and lock contention involved, making it
easier to test the behavior on various platforms.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, Sep 3, 2010 at 4:20 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Maybe that's ok, if I'm reading the deadlock checker code correctly, it also
calls semop() to increment another process' semaphore, and the deadlock
checker can be invoked from a signal handler while in semop() to wait on our
process' semaphore. BTW, sem_post(), which is the Posix function for
incrementing a semaphore, is listed as a safe function to call in a signal
handler. But it's certainly fishy.
Color me confused; I may need to backpedal rapidly here. I had
thought that what Tom was complaining about was the fact that the
signal handler was taking LWLocks, which I would have thought to be
totally unsafe. But it seems the deadlock detector does the same
thing, more or less because the signal handlers are set up so that
they don't do anything unless we're within a limited section of code
where nothing too interesting can happen. I'm not too sure why we
think that it's safe to invoke the deadlock detector that way, but
it's also not too clear to me why this case is any worse.
It furthermore appears that Simon's reply to Tom's complaint about
this function was: "This was modelled very closely on
handle_sig_alarm() and was reviewed by other hackers. I'm not great on
that, as you know, so if you can explain what it is I can't do, and
how that differs from handle_sig_alarm running the deadlock detector
in the same way, then I'll work on it some more."
I guess I need the same explanation.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
A safer approach would be to just PGSemaphoreUnlock() in the signal
handler, and do all the other processing outside it.
I don't see any particularly good reason to assume that
PGSemaphoreUnlock is safe either: you're still talking about nested
semop operations.
The pre-existing SIGALRM handler uses a self-signal (kill(MyProcPid,
SIGINT)) to kick the process off any wait it might be doing. I'd rather
do something like that.
Or maybe the work you're doing on latches would help ...
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
Color me confused; I may need to backpedal rapidly here. I had
thought that what Tom was complaining about was the fact that the
signal handler was taking LWLocks, which I would have thought to be
totally unsafe.
Well, it's unsafe if the signal could interrupt mainline code that is
trying to take an LWLock or already holds the same LWLock --- or, if you
consider deadlock risks against other processes, already holding other
LWLocks might be problematic too. And trying to use facilities like
elog or palloc is also pretty unsafe if the mainline is using them too.
The reason the deadlock checker is okay is that there is only a *very*
narrow range of mainline code that it could possibly be interrupting,
basically nothing but the PGSemaphoreLock() call in ProcSleep.
(There's a lot of other code inside the loop there, but notice that it's
protected by tests that ensure that it won't run unless the deadlock
checker already did.)
One salient thing about ProcSleep is that you shouldn't call it while
holding any LWLocks, and another is that if you're sleeping while
holding regular heavyweight locks, the deadlock checker is exactly what
will get you out of trouble if there's a deadlock.
Now the HS case likewise appears to be set up so that the signal can
only directly interrupt ProcWaitForSignal, so I think the core issue is
whether any deadlock situations are possible. Given that this gets
called from a low-level place like LockBufferForCleanup, I don't feel
too comfortable about that. I certainly haven't seen any analysis or
documentation of what locks can safely be held at that point.
The deadlock checker only tries to take the LockMgr LWLocks, so
extrapolating from whether it is safe to whether touching the
ProcArrayLock is safe seems entirely unfounded.
It might be worth pointing out here that LockBufferForCleanup is already
known to be a risk factor for undetected deadlocks, even without HS in
the picture, because of the possibility of deadlocks involving a chain
of both heavyweight locks and LWLocks. Whether HS makes it materially
worse may be something that we need field experience to determine.
regards, tom lane
On Fri, Sep 3, 2010 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Now the HS case likewise appears to be set up so that the signal can
only directly interrupt ProcWaitForSignal, so I think the core issue is
whether any deadlock situations are possible. Given that this gets
called from a low-level place like LockBufferForCleanup, I don't feel
too comfortable about that. I certainly haven't seen any analysis or
documentation of what locks can safely be held at that point.
The deadlock checker only tries to take the LockMgr LWLocks, so
extrapolating from whether it is safe to whether touching the
ProcArrayLock is safe seems entirely unfounded.It might be worth pointing out here that LockBufferForCleanup is already
known to be a risk factor for undetected deadlocks, even without HS in
the picture, because of the possibility of deadlocks involving a chain
of both heavyweight locks and LWLocks. Whether HS makes it materially
worse may be something that we need field experience to determine.
OK, well, that does make it more clear what you're worried about, but
it seems that moving the work out of the signal handler isn't going to
solve this problem. The core issue appears to be whether the caller
might be holding a lock which another process might attempt to take
while already holding ProcArrayLock, or (in the case of a three or
more way deadlock) whether the caller might be holding a lock that
some other process could try to acquire after seizing ProcArrayLock.
As far as I can tell from looking through the code, the only locks we
ever acquire while holding ProcArrayLock are XidGenLock and
known_assigned_xids_lck (which is a spin lock). The latter is never
held more than VERY briefly. XidGenLock is generally also held only
briefly, not acquiring any other locks in the meantime, but
GetNewTransactionId() is not obviously (to me) safe.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Fri, Sep 3, 2010 at 9:20 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
So we now have the same process nested twice inside a semop() call. Looking
at the Linux signal (7) man page, it is undefined what happens if semop() is
re-entered in a signal handler while another semop() call is happening in
main line of execution. Assuming it somehow works, which semop() call is
going to return when the semaphore is incremented?
Fwiw I wouldn't be too worried about semop specifically. It's a
syscall and will always return with EINTR. What makes functions
async-unsafe is that they might do extra processing manipulating data
structures in user space such as mallocing memory. POSIX seems to be
giving semop license to do that but realistically I can't imagine any
implementation doing so.
What I might wonder about is what happens if the signal arrives just
after the semop completes or just before it starts.
And someone mentioned calling elog from within the signal handler --
doesn't elog call palloc() and sprintf? That can't be async-safe.
--
greg
On Fri, 2010-09-03 at 18:24 -0400, Tom Lane wrote:
Now the HS case likewise appears to be set up so that the signal can
only directly interrupt ProcWaitForSignal, so I think the core issue
is
whether any deadlock situations are possible. Given that this gets
called from a low-level place like LockBufferForCleanup, I don't feel
too comfortable about that.
LockBufferForCleanup is only ever called during recovery by
heap_xlog_clean() or btree_xlog_vacuum().
The actions taken to replay a WAL record are independent of all other
WAL records from a locking perspective, so replay of every WAL record
starts with no LWLocks held by the startup process. LockBufferForCleanup is
taken early on in replaying a heap or btree cleanup record, and so we can
easily check that no other LWlocks are held while it is called.
I certainly haven't seen any analysis or
documentation of what locks can safely be held at that point.
The deadlock checker only tries to take the LockMgr LWLocks, so
extrapolating from whether it is safe to whether touching the
ProcArrayLock is safe seems entirely unfounded.
So the startup process takes one LWLock, ProcArrayLock, and is not
holding any other LWLock when it does. The deadlock checker attempts to
get and hold all of the other lock partition locks. So deadlock checker
already does the thing you're saying might be dangerous and the startup
process doesn't.
The ProcArrayLock is only taken as a way of signaling other backends. If
that is particularly unsafe we could redesign that aspect.
It might be worth pointing out here that LockBufferForCleanup is
already
known to be a risk factor for undetected deadlocks, even without HS in
the picture, because of the possibility of deadlocks involving a chain
of both heavyweight locks and LWLocks. Whether HS makes it materially
worse may be something that we need field experience to determine.
You may be right that it will be a problem.
The deadlock risk we're protecting against is a deadlock involving both
normal locks and buffer pins. We're safer having this code than not,
IMHO.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On 03/09/10 21:50, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
On 03/09/10 21:16, Tom Lane wrote:
It's probably not too unreasonable to assume that pid_t assignment is
atomic. But I'm still thinking that we have bigger problems than that
if there are really cases where SetLatch can execute at approximately
the same time as a latch owner is coming or going.
I don't see how to avoid it. A walsender, or any process really, can
exit at any time. It can make the latch inaccessible to others before it
exits to minimize the window, but it's always going to be possible that
another process is just about to call SetLatch when you exit.
Well, in that case what we need to do is presume that the latch object
has a continuing existence but the owner/receiver can come and go.
I would suggest that InitLatch needs to initialize the object into a
valid but unowned state; there is *no* deinitialize operation; and
there are AcquireLatch and ReleaseLatch operations to become owner
or stop being owner.
I think we have just a terminology issue. What you're describing is
exactly how it works now, if you just s/InitLatch/AcquireLatch. At the
moment there's no need for an initialization function other than the
InitLatch/AcquireLatch that associates the latch with the current
process. I can add one for the sake of future-proofing, and to have
better-defined behavior for setting a latch that has not been owned by
anyone yet, but it's not strictly necessary.
We also need to define the semantics of SetLatch
on an unowned latch --- does this set a signal condition that will be
available to the next owner?
At the moment, no. Perhaps that would be useful; separating the Init and
Acquire operations is needed to make that sane.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 03/09/10 21:50, Tom Lane wrote:
Well, in that case what we need to do is presume that the latch object
has a continuing existence but the owner/receiver can come and go.
I would suggest that InitLatch needs to initialize the object into a
valid but unowned state; there is *no* deinitialize operation; and
there are AcquireLatch and ReleaseLatch operations to become owner
or stop being owner.
I think we have just a terminology issue. What you're describing is
exactly how it works now, if you just s/InitLatch/AcquireLatch.
No, it isn't. What I'm suggesting requires breaking InitLatch into two
operations.
We also need to define the semantics of SetLatch
on an unowned latch --- does this set a signal condition that will be
available to the next owner?
At the moment, no. Perhaps that would be useful; separating the Init and
Acquire operations is needed to make that sane.
Exactly. I'm not totally sure either if it would be useful, but the
current design makes it impossible to allow that.
BTW, on reflection the AcquireLatch/ReleaseLatch terminology seems a bit
ill chosen: ReleaseLatch sounds way too much like something that would
just unlock or clear the latch. Perhaps OwnLatch/DisownLatch, or
something along that line.
regards, tom lane
On 06/09/10 17:18, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
I think we have just a terminology issue. What you're describing is
exactly how it works now, if you just s/InitLatch/AcquireLatch.
No, it isn't. What I'm suggesting requires breaking InitLatch into two
operations.
We also need to define the semantics of SetLatch
on an unowned latch --- does this set a signal condition that will be
available to the next owner?
At the moment, no. Perhaps that would be useful; separating the Init and
Acquire operations is needed to make that sane.
Exactly. I'm not totally sure either if it would be useful, but the
current design makes it impossible to allow that.
Ok, I've split the Init and Acquire steps into two.
BTW, on reflection the AcquireLatch/ReleaseLatch terminology seems a bit
ill chosen: ReleaseLatch sounds way too much like something that would
just unlock or clear the latch. Perhaps OwnLatch/DisownLatch, or
something along that line.
Yeah, I see what you mean. Although, maybe it's just me but Own/Disown
looks ugly. Anyone have a better suggestion?
Here's an updated patch, with all the issues reported this far fixed,
except for that naming issue, and Fujii's suggestion to use poll()
instead of select() where available. I've also polished it quite a bit,
improving comments etc. Magnus, can you take a look at the Windows
implementation to check that it's sane? At least it seems to work.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
latch-4.patch (text/x-diff)
diff --git a/configure b/configure
index bd9b347..432cd58 100755
--- a/configure
+++ b/configure
@@ -27773,6 +27773,13 @@ _ACEOF
SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
fi
+# Select latch implementation type.
+if test "$PORTNAME" != "win32"; then
+ LATCH_IMPLEMENTATION="src/backend/port/unix_latch.c"
+else
+ LATCH_IMPLEMENTATION="src/backend/port/win32_latch.c"
+fi
+
# If not set in template file, set bytes to use libc memset()
if test x"$MEMSET_LOOP_LIMIT" = x"" ; then
MEMSET_LOOP_LIMIT=1024
@@ -29098,7 +29105,7 @@ fi
ac_config_files="$ac_config_files GNUmakefile src/Makefile.global"
-ac_config_links="$ac_config_links src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION} src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION} src/include/dynloader.h:src/backend/port/dynloader/${template}.h src/include/pg_config_os.h:src/include/port/${template}.h src/Makefile.port:src/makefiles/Makefile.${template}"
+ac_config_links="$ac_config_links src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION} src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION} src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION} src/include/dynloader.h:src/backend/port/dynloader/${template}.h src/include/pg_config_os.h:src/include/port/${template}.h src/Makefile.port:src/makefiles/Makefile.${template}"
if test "$PORTNAME" = "win32"; then
@@ -29722,6 +29729,7 @@ do
"src/backend/port/dynloader.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c" ;;
"src/backend/port/pg_sema.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION}" ;;
"src/backend/port/pg_shmem.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION}" ;;
+ "src/backend/port/pg_latch.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION}" ;;
"src/include/dynloader.h") CONFIG_LINKS="$CONFIG_LINKS src/include/dynloader.h:src/backend/port/dynloader/${template}.h" ;;
"src/include/pg_config_os.h") CONFIG_LINKS="$CONFIG_LINKS src/include/pg_config_os.h:src/include/port/${template}.h" ;;
"src/Makefile.port") CONFIG_LINKS="$CONFIG_LINKS src/Makefile.port:src/makefiles/Makefile.${template}" ;;
diff --git a/configure.in b/configure.in
index 7b09986..7f84cea 100644
--- a/configure.in
+++ b/configure.in
@@ -1700,6 +1700,13 @@ else
SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
fi
+# Select latch implementation type.
+if test "$PORTNAME" != "win32"; then
+ LATCH_IMPLEMENTATION="src/backend/port/unix_latch.c"
+else
+ LATCH_IMPLEMENTATION="src/backend/port/win32_latch.c"
+fi
+
# If not set in template file, set bytes to use libc memset()
if test x"$MEMSET_LOOP_LIMIT" = x"" ; then
MEMSET_LOOP_LIMIT=1024
@@ -1841,6 +1848,7 @@ AC_CONFIG_LINKS([
src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c
src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION}
src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION}
+ src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION}
src/include/dynloader.h:src/backend/port/dynloader/${template}.h
src/include/pg_config_os.h:src/include/port/${template}.h
src/Makefile.port:src/makefiles/Makefile.${template}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 615a7fa..094d0c9 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -55,6 +55,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
@@ -1025,6 +1026,13 @@ EndPrepare(GlobalTransaction gxact)
/* If we crash now, we have prepared: WAL replay will fix things */
+ /*
+ * Wake up all walsenders to send WAL up to the PREPARE record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
/* write correct CRC and close file */
if ((write(fd, &statefile_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
{
@@ -2005,6 +2013,13 @@ RecordTransactionCommitPrepared(TransactionId xid,
/* Flush XLOG to disk */
XLogFlush(recptr);
+ /*
+ * Wake up all walsenders to send WAL up to the COMMIT PREPARED record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
/* Mark the transaction committed in pg_clog */
TransactionIdCommitTree(xid, nchildren, children);
@@ -2078,6 +2093,13 @@ RecordTransactionAbortPrepared(TransactionId xid,
XLogFlush(recptr);
/*
+ * Wake up all walsenders to send WAL up to the ABORT PREPARED record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
+ /*
* Mark the transaction aborted in clog. This is not absolutely necessary
* but we may as well do it while we are here.
*/
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6bcc55c..942d5c2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -36,6 +36,7 @@
#include "libpq/be-fsstubs.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -1068,6 +1069,13 @@ RecordTransactionCommit(void)
XLogFlush(XactLastRecEnd);
/*
+ * Wake up all walsenders to send WAL up to the COMMIT record
+ * immediately if replication is enabled
+ */
+ if (max_wal_senders > 0)
+ WalSndWakeup();
+
+ /*
* Now we may update the CLOG, if we wrote a COMMIT record above
*/
if (markXidCommitted)
diff --git a/src/backend/port/Makefile b/src/backend/port/Makefile
index db0c2af..f50cff8 100644
--- a/src/backend/port/Makefile
+++ b/src/backend/port/Makefile
@@ -21,7 +21,7 @@ subdir = src/backend/port
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = dynloader.o pg_sema.o pg_shmem.o $(TAS)
+OBJS = dynloader.o pg_sema.o pg_shmem.o pg_latch.o $(TAS)
ifeq ($(PORTNAME), darwin)
SUBDIRS += darwin
diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
new file mode 100644
index 0000000..cc8c8b4
--- /dev/null
+++ b/src/backend/port/unix_latch.c
@@ -0,0 +1,431 @@
+/*-------------------------------------------------------------------------
+ *
+ * unix_latch.c
+ * Routines for inter-process latches
+ *
+ * A latch is a simple boolean variable, with operations that let you
+ * sleep until it is set. A latch can be set from another process, or a
+ * signal handler within the same process.
+ *
+ * The latch interface is a reliable replacement for the common pattern of
+ * using pg_usleep() or select() to wait until a signal arrives, where the
+ * signal handler sets a global variable. Because on some platforms, an
+ * incoming signal doesn't interrupt sleep, and even on platforms where it
+ * does there is a race condition if the signal arrives just before
+ * entering the sleep, the common pattern must periodically wake up and
+ * poll the global variable. The pselect() system call was invented to solve
+ * the problem, but it is not portable enough. Latches are designed to
+ * overcome these limitations, allowing you to sleep without polling and
+ * ensuring a quick response to signals from other processes.
+ *
+ * There are two kinds of latches: local and shared. A local latch is
+ * initialized by InitLatch, and can only be set from the same process.
+ * A local latch can be used to wait for a signal to arrive, by calling
+ * SetLatch in the signal handler. A shared latch resides in shared memory,
+ * and must be initialized at postmaster startup by InitSharedLatch. Before
+ * a shared latch can be waited on, it must be associated with a process
+ * with AcquireLatch. Only the process owning the latch can wait on it, but
+ * any process can set it.
+ *
+ * There are three basic operations on a latch:
+ *
+ * SetLatch - Sets the latch
+ * ResetLatch - Clears the latch, allowing it to be set again
+ * WaitLatch - Waits for the latch to become set
+ *
+ * The correct pattern to wait for an event is:
+ *
+ * for (;;)
+ * {
+ * ResetLatch();
+ * if (work to do)
+ * Do Stuff();
+ *
+ * WaitLatch();
+ * }
+ *
+ * It's important to reset the latch *before* checking if there's work to
+ * do. Otherwise, if someone sets the latch between the check and the
+ * ResetLatch call, you will miss it and Wait will block.
+ *
+ * To wake up the waiter, you must first set a global flag or something
+ * else that the main loop tests in the "if (work to do)" part, and call
+ * SetLatch *after* that. SetLatch is designed to return quickly if the
+ * latch is already set.
+ *
+ *
+ * Implementation
+ * --------------
+ *
+ * The Unix implementation uses the so-called self-pipe trick to overcome
+ * the race condition involved with select() and setting a global flag
+ * in the signal handler. When a latch is set and the current process
+ * is waiting for it, the signal handler wakes up the select() in
+ * WaitLatch by writing a byte to a pipe. A signal by itself doesn't
+ * interrupt select() on all platforms, and even on platforms where it
+ * does, a signal that arrives just before the select() call does not
+ * prevent the select() from entering sleep. An incoming byte on a pipe
+ * however reliably interrupts the sleep, and makes select() return
+ * immediately if the signal arrives just before select() begins.
+ *
+ * When SetLatch is called from the same process that owns the latch,
+ * SetLatch writes the byte directly to the pipe. If it's owned by another
+ * process, SIGUSR1 is sent and the signal handler in the waiting process
+ * writes the byte to the pipe on behalf of the signaling process.
+ *
+ * Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <signal.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "storage/latch.h"
+#include "storage/shmem.h"
+
+/* Are we currently in WaitLatch? The signal handler would like to know. */
+static volatile sig_atomic_t waiting = false;
+
+/* Read and write end of the self-pipe */
+static int selfpipe_readfd = -1;
+static int selfpipe_writefd = -1;
+
+/* private function prototypes */
+static void initSelfPipe(void);
+static void drainSelfPipe(void);
+static void sendSelfPipeByte(void);
+
+
+/*
+ * Initialize a backend-local latch.
+ */
+void
+InitLatch(volatile Latch *latch)
+{
+ /* Initialize the self pipe if this is our first latch in the process */
+ if (selfpipe_readfd == -1)
+ initSelfPipe();
+
+ latch->is_set = false;
+ latch->owner_pid = MyProcPid;
+ latch->is_shared = false;
+}
+
+/*
+ * Initialize a shared latch that can be set from other processes. The latch
+ * is initially owned by no-one; use AcquireLatch to associate it with the
+ * current process.
+ *
+ * NB: When you introduce a new shared latch, you must increase the shared
+ * latch count in NumSharedLatches in win32_latch.c!
+ */
+void
+InitSharedLatch(volatile Latch *latch)
+{
+ latch->is_set = false;
+ latch->owner_pid = 0;
+ latch->is_shared = true;
+}
+
+/*
+ * Associate a shared latch with the current process, allowing it to
+ * wait on it.
+ *
+ * Make sure that latch_sigusr1_handler() is called from the SIGUSR1 signal
+ * handler, as shared latches use SIGUSR1 for inter-process communication.
+ */
+void
+AcquireLatch(volatile Latch *latch)
+{
+ Assert(latch->is_shared);
+
+ /* Initialize the self pipe if this is our first latch in the process */
+ if (selfpipe_readfd == -1)
+ initSelfPipe();
+
+ if (latch->owner_pid != 0)
+ elog(ERROR, "latch already owned");
+ latch->owner_pid = MyProcPid;
+}
+
+/*
+ * Release a latch previously acquired with AcquireLatch.
+ */
+void
+ReleaseLatch(volatile Latch *latch)
+{
+ Assert(latch->is_shared);
+ Assert(latch->owner_pid == MyProcPid);
+ latch->owner_pid = 0;
+}
+
+/*
+ * Wait for given latch to be set or until timeout is exceeded.
+ * If the latch is already set, the function returns immediately.
+ *
+ * The 'timeout' is given in microseconds, and -1 means wait forever.
+ * On some platforms, signals cause the timeout to be restarted, so beware
+ * that the function can sleep for several times longer than the specified
+ * timeout.
+ *
+ * The latch must be owned by the current process, ie. it must be a
+ * backend-local latch initialized with InitLatch, or a shared latch
+ * previously acquired with AcquireLatch.
+ *
+ * Returns 'true' if the latch was set, or 'false' if timeout was reached.
+ */
+bool
+WaitLatch(volatile Latch *latch, long timeout)
+{
+ return WaitLatchOrSocket(latch, PGINVALID_SOCKET, timeout) > 0;
+}
+
+/*
+ * Like WaitLatch, but will also return when there's data available in
+ * 'sock' for reading. Returns 0 if timeout was reached, 1 if the latch
+ * was set, or 2 if the socket became readable.
+ */
+int
+WaitLatchOrSocket(volatile Latch *latch, pgsocket sock, long timeout)
+{
+ struct timeval tv, *tvp = NULL;
+ fd_set input_mask;
+ int rc;
+ int result = 0;
+
+ if (latch->owner_pid != MyProcPid)
+ elog(ERROR, "cannot wait on a latch owned by another process");
+
+ /* Initialize timeout */
+ if (timeout >= 0)
+ {
+ tv.tv_sec = timeout / 1000000L;
+ tv.tv_usec = timeout % 1000000L;
+ tvp = &tv;
+ }
+
+ waiting = true;
+ for (;;)
+ {
+ int hifd;
+
+ /*
+ * Clear the pipe, and check if the latch is set already. If someone
+ * sets the latch between this and the select() below, the setter
+ * will write a byte to the pipe (or signal us and the signal handler
+ * will do that), and the select() will return immediately.
+ */
+ drainSelfPipe();
+ if (latch->is_set)
+ {
+ result = 1;
+ break;
+ }
+
+ FD_ZERO(&input_mask);
+ FD_SET(selfpipe_readfd, &input_mask);
+ hifd = selfpipe_readfd;
+ if (sock != PGINVALID_SOCKET)
+ {
+ FD_SET(sock, &input_mask);
+ if (sock > hifd)
+ hifd = sock;
+ }
+
+ rc = select(hifd + 1, &input_mask, NULL, NULL, tvp);
+ if (rc < 0)
+ {
+ if (errno == EINTR)
+ continue;
+ ereport(ERROR,
+ (errcode_for_socket_access(),
+ errmsg("select() failed: %m")));
+ }
+ if (rc == 0)
+ {
+ /* timeout exceeded */
+ result = 0;
+ break;
+ }
+ if (sock != PGINVALID_SOCKET && FD_ISSET(sock, &input_mask))
+ {
+ result = 2;
+ break; /* data available in socket */
+ }
+ }
+ waiting = false;
+
+ return result;
+}
+
+/*
+ * Sets a latch and wakes up anyone waiting on it. Returns quickly if the
+ * latch is already set.
+ */
+void
+SetLatch(volatile Latch *latch)
+{
+ pid_t owner_pid;
+
+ /* Quick exit if already set */
+ if (latch->is_set)
+ return;
+
+ latch->is_set = true;
+
+ /*
+ * See if anyone's waiting for the latch. It can be the current process
+ * if we're in a signal handler. We use the self-pipe to wake up the
+ * select() in that case. If it's another process, send a signal.
+ *
+ * Fetch owner_pid only once, in case the owner simultaneously releases
+ * the latch and clears owner_pid. XXX: This assumes that pid_t is
+ * atomic, which isn't guaranteed to be true! In practice, the effective
+ * range of pid_t fits in a 32 bit integer, and so should be atomic. In
+ * the worst case, we might end up signaling the wrong process if the right
+ * one releases the latch just as we fetch owner_pid. Even then, you're
+ * very unlucky if a process with that bogus pid exists.
+ */
+ owner_pid = latch->owner_pid;
+ if (owner_pid == 0)
+ return;
+ else if (owner_pid == MyProcPid)
+ sendSelfPipeByte();
+ else
+ kill(owner_pid, SIGUSR1);
+}
+
+/*
+ * Clear the latch. Calling WaitLatch after this will sleep, unless
+ * the latch is set again before the WaitLatch call.
+ */
+void
+ResetLatch(volatile Latch *latch)
+{
+ /* Only the owner should reset the latch */
+ Assert(latch->owner_pid == MyProcPid);
+
+ latch->is_set = false;
+}
+
+/*
+ * LatchShmemSize
+ * Compute space needed for latch's shared memory
+ *
+ * Not needed for Unix implementation.
+ */
+Size
+LatchShmemSize(void)
+{
+ return 0;
+}
+
+/*
+ * LatchShmemInit
+ * Allocate and initialize shared memory needed for latches
+ *
+ * Not needed for Unix implementation.
+ */
+void
+LatchShmemInit(void)
+{
+}
+
+/*
+ * SetLatch uses SIGUSR1 to wake up the process waiting on the latch. Wake
+ * up WaitLatch.
+ */
+void
+latch_sigusr1_handler(void)
+{
+ if (waiting)
+ sendSelfPipeByte();
+}
+
+/* initialize the self-pipe */
+static void
+initSelfPipe(void)
+{
+ int pipefd[2];
+
+ /*
+ * Set up the self-pipe that allows a signal handler to wake up the
+ * select() in WaitLatch. Make the write-end non-blocking, so that
+ * SetLatch won't block if the event has already been set many times,
+ * filling the kernel buffer. Make the read-end non-blocking too, so
+ * that we can easily clear the pipe by reading until EAGAIN or
+ * EWOULDBLOCK.
+ */
+ if (pipe(pipefd) < 0)
+ elog(FATAL, "pipe() failed: %m");
+ if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) < 0)
+ elog(FATAL, "fcntl() failed on read-end of self-pipe: %m");
+ if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0)
+ elog(FATAL, "fcntl() failed on write-end of self-pipe: %m");
+
+ selfpipe_readfd = pipefd[0];
+ selfpipe_writefd = pipefd[1];
+}
+
+/* Send one byte to the self-pipe, to wake up WaitLatch */
+static void
+sendSelfPipeByte(void)
+{
+ int rc;
+ char dummy = 0;
+
+retry:
+ rc = write(selfpipe_writefd, &dummy, 1);
+ if (rc < 0)
+ {
+ /* If interrupted by signal, just retry */
+ if (errno == EINTR)
+ goto retry;
+
+ /*
+ * If the pipe is full, we don't need to retry, the data that's
+ * there already is enough to wake up WaitLatch.
+ */
+ if (errno == EAGAIN || errno == EWOULDBLOCK)
+ return;
+
+ /*
+ * Oops, the write() failed for some other reason. We might be in
+ * a signal handler, so it's not safe to elog(). We have no choice
+ * but to silently ignore the error.
+ */
+ return;
+ }
+}
+
+/* Read all available data from the self-pipe */
+static void
+drainSelfPipe(void)
+{
+ int rc;
+ char buf;
+
+ for (;;)
+ {
+ rc = read(selfpipe_readfd, &buf, 1);
+ if (rc < 0)
+ {
+ if (errno == EAGAIN || errno == EWOULDBLOCK)
+ break; /* the pipe is empty */
+ else if (errno == EINTR)
+ continue; /* retry */
+ else
+ elog(ERROR, "read() on self-pipe failed: %m");
+ }
+ else if (rc == 0)
+ elog(ERROR, "unexpected EOF on self-pipe");
+ }
+}
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
new file mode 100644
index 0000000..3e3a30d
--- /dev/null
+++ b/src/backend/port/win32_latch.c
@@ -0,0 +1,284 @@
+/*-------------------------------------------------------------------------
+ *
+ * win32_latch.c
+ * Windows implementation of latches.
+ *
+ * The Windows implementation uses Windows events. See unix_latch.c for
+ * information on usage.
+ *
+ * Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <signal.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+
+/*
+ * Shared latches are implemented with Windows events that are shared by
+ * all postmaster child processes. At postmaster startup we create enough
+ * Event objects, and mark them as inheritable so that they are accessible
+ * in child processes. The handles are stored in sharedHandles.
+ */
+typedef struct
+{
+ slock_t mutex; /* protects all the other fields */
+
+ int maxhandles; /* number of shared handles created */
+ int nfreehandles; /* number of free handles in array */
+ HANDLE handles[1]; /* free handles, variable length */
+} SharedEventHandles;
+
+static SharedEventHandles *sharedHandles;
+
+void
+InitLatch(volatile Latch *latch)
+{
+ latch->event = CreateEvent(NULL, TRUE, FALSE, NULL);
+ latch->is_shared = false;
+ latch->is_set = false;
+}
+
+void
+InitSharedLatch(volatile Latch *latch)
+{
+ latch->is_shared = true;
+ latch->is_set = false;
+ latch->event = NULL;
+}
+
+void
+AcquireLatch(volatile Latch *latch)
+{
+ HANDLE event;
+
+ /* Sanity checks */
+ Assert(latch->is_shared);
+ if (latch->event != 0)
+ elog(ERROR, "latch already owned");
+
+ /* Reserve an event handle from the shared handles array */
+ SpinLockAcquire(&sharedHandles->mutex);
+ if (sharedHandles->nfreehandles <= 0)
+ {
+ SpinLockRelease(&sharedHandles->mutex);
+ elog(ERROR, "out of shared event objects");
+ }
+ sharedHandles->nfreehandles--;
+ event = sharedHandles->handles[sharedHandles->nfreehandles];
+ SpinLockRelease(&sharedHandles->mutex);
+
+ latch->event = event;
+}
+
+void
+ReleaseLatch(volatile Latch *latch)
+{
+ Assert(latch->is_shared);
+ Assert(latch->event != NULL);
+
+ /* Put the event handle back to the pool */
+ SpinLockAcquire(&sharedHandles->mutex);
+ if (sharedHandles->nfreehandles >= sharedHandles->maxhandles)
+ {
+ SpinLockRelease(&sharedHandles->mutex);
+ elog(PANIC, "too many free event handles");
+ }
+ sharedHandles->handles[sharedHandles->nfreehandles] = latch->event;
+ sharedHandles->nfreehandles++;
+ SpinLockRelease(&sharedHandles->mutex);
+
+ latch->event = NULL;
+}
+
+bool
+WaitLatch(volatile Latch *latch, long timeout)
+{
+ return WaitLatchOrSocket(latch, PGINVALID_SOCKET, timeout) > 0;
+}
+
+int
+WaitLatchOrSocket(volatile Latch *latch, SOCKET sock, long timeout)
+{
+ DWORD rc;
+ HANDLE events[3];
+ HANDLE latchevent;
+ HANDLE sockevent;
+ int numevents;
+ int result = 0;
+
+ latchevent = latch->event;
+
+ events[0] = latchevent;
+ events[1] = pgwin32_signal_event;
+ numevents = 2;
+ if (sock != PGINVALID_SOCKET)
+ {
+ sockevent = WSACreateEvent();
+ WSAEventSelect(sock, sockevent, FD_READ);
+ events[numevents++] = sockevent;
+ }
+
+ for (;;)
+ {
+ /*
+ * Reset the event, and check if the latch is set already. If someone
+ * sets the latch between this and the WaitForMultipleObjects() call
+ * below, the setter will set the event and WaitForMultipleObjects()
+ * will return immediately.
+ */
+ if (!ResetEvent(latchevent))
+ elog(ERROR, "ResetEvent failed: error code %d", (int) GetLastError());
+ if (latch->is_set)
+ {
+ result = 1;
+ break;
+ }
+
+ rc = WaitForMultipleObjects(numevents, events, FALSE,
+ (timeout >= 0) ? (timeout / 1000) : INFINITE);
+ if (rc == WAIT_FAILED)
+ elog(ERROR, "WaitForMultipleObjects() failed: error code %d", (int) GetLastError());
+ else if (rc == WAIT_TIMEOUT)
+ {
+ result = 0;
+ break;
+ }
+ else if (rc == WAIT_OBJECT_0 + 1)
+ pgwin32_dispatch_queued_signals();
+ else if (rc == WAIT_OBJECT_0 + 2)
+ {
+ Assert(sock != PGINVALID_SOCKET);
+ result = 2;
+ break;
+ }
+ else if (rc != WAIT_OBJECT_0)
+ elog(ERROR, "unexpected return code from WaitForMultipleObjects(): %d", rc);
+ }
+
+ /* Clean up the handle we created for the socket */
+ if (sock != PGINVALID_SOCKET)
+ {
+ WSAEventSelect(sock, sockevent, 0);
+ WSACloseEvent(sockevent);
+ }
+
+ return result;
+}
+
+void
+SetLatch(volatile Latch *latch)
+{
+ HANDLE handle;
+
+ /* Quick exit if already set */
+ if (latch->is_set)
+ return;
+
+ latch->is_set = true;
+
+ /*
+ * See if anyone's waiting for the latch. It can be the current process
+ * if we're in a signal handler. Use a local variable here in case the
+ * latch is just released between the test and the SetEvent call, and
+ * event field set to NULL.
+ *
+ * Fetch handle field only once, in case the owner simultaneously
+ * releases the latch and clears handle. This assumes that HANDLE is
+ * atomic, which isn't guaranteed to be true! In practice, it should be,
+ * and in the worst case we end up calling SetEvent with a bogus handle,
+ * and SetEvent will return an error with no harm done.
+ */
+ handle = latch->event;
+ if (handle)
+ {
+ SetEvent(handle);
+ /*
+ * Note that we silently ignore any errors. We might be in a signal
+ * handler or other critical path where it's not safe to call elog().
+ */
+ }
+}
+
+void
+ResetLatch(volatile Latch *latch)
+{
+ latch->is_set = false;
+}
+
+/*
+ * Number of shared latches, used to allocate the right number of shared
+ * Event handles at postmaster startup. You must update this if you
+ * introduce a new shared latch!
+ */
+static int
+NumSharedLatches(void)
+{
+ int numLatches = 0;
+
+ /* Each walsender needs one latch */
+ numLatches += max_wal_senders;
+
+ return numLatches;
+}
+
+/*
+ * LatchShmemSize
+ * Compute space needed for latch's shared memory
+ */
+Size
+LatchShmemSize(void)
+{
+ return offsetof(SharedEventHandles, handles) +
+ NumSharedLatches() * sizeof(HANDLE);
+}
+
+/*
+ * LatchShmemInit
+ * Allocate and initialize shared memory needed for latches
+ */
+void
+LatchShmemInit(void)
+{
+ Size size = LatchShmemSize();
+ bool found;
+
+ sharedHandles = ShmemInitStruct("SharedEventHandles", size, &found);
+
+ /* If we're first, initialize the struct and allocate handles */
+ if (!found)
+ {
+ int i;
+ SECURITY_ATTRIBUTES sa;
+
+ /*
+ * Set up security attributes to specify that the events are
+ * inherited.
+ */
+ ZeroMemory(&sa, sizeof(sa));
+ sa.nLength = sizeof(sa);
+ sa.bInheritHandle = TRUE;
+
+ SpinLockInit(&sharedHandles->mutex);
+ sharedHandles->maxhandles = NumSharedLatches();
+ sharedHandles->nfreehandles = sharedHandles->maxhandles;
+ for (i = 0; i < sharedHandles->maxhandles; i++)
+ {
+ sharedHandles->handles[i] = CreateEvent(&sa, TRUE, FALSE, NULL);
+ if (sharedHandles->handles[i] == NULL)
+ elog(ERROR, "CreateEvent failed: error code %d", (int) GetLastError());
+ }
+ }
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 53c2581..3ceec24 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -34,6 +34,7 @@
*/
#include "postgres.h"
+#include <signal.h>
#include <unistd.h>
#include "access/xlog_internal.h"
@@ -66,9 +67,6 @@ bool am_walsender = false; /* Am I a walsender process ? */
int max_wal_senders = 0; /* the maximum number of concurrent walsenders */
int WalSndDelay = 200; /* max sleep time between some actions */
-#define NAPTIME_PER_CYCLE 100000L /* max sleep time between cycles
- * (100ms) */
-
/*
* These variables are used similarly to openLogFile/Id/Seg/Off,
* but for walsender to read the XLOG.
@@ -93,6 +91,7 @@ static volatile sig_atomic_t ready_to_stop = false;
static void WalSndSigHupHandler(SIGNAL_ARGS);
static void WalSndShutdownHandler(SIGNAL_ARGS);
static void WalSndQuickDieHandler(SIGNAL_ARGS);
+static void WalSndXLogSendHandler(SIGNAL_ARGS);
static void WalSndLastCycleHandler(SIGNAL_ARGS);
/* Prototypes for private functions */
@@ -144,6 +143,16 @@ WalSenderMain(void)
/* Handle handshake messages before streaming */
WalSndHandshake();
+ /* Initialize shared memory status */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ walsnd->sentPtr = sentPtr;
+ SpinLockRelease(&walsnd->mutex);
+ }
+
/* Main loop of walsender */
return WalSndLoop();
}
@@ -380,8 +389,6 @@ WalSndLoop(void)
/* Loop forever, unless we get an error */
for (;;)
{
- long remain; /* remaining time (us) */
-
/*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
@@ -421,32 +428,42 @@ WalSndLoop(void)
/*
* If we had sent all accumulated WAL in last round, nap for the
* configured time before retrying.
- *
- * On some platforms, signals won't interrupt the sleep. To ensure we
- * respond reasonably promptly when someone signals us, break down the
- * sleep into NAPTIME_PER_CYCLE increments, and check for interrupts
- * after each nap.
*/
if (caughtup)
{
- remain = WalSndDelay * 1000L;
- while (remain > 0)
- {
- /* Check for interrupts */
- if (got_SIGHUP || shutdown_requested || ready_to_stop)
- break;
+ /*
+ * Even if we wrote all the WAL that was available when we started
+ * sending, more might have arrived while we were sending this
+ * batch. We had the latch set while sending, so we have not
+ * received any signals from that time. Let's arm the latch
+ * again, and after that check that we're still up-to-date.
+ */
+ ResetLatch(&MyWalSnd->latch);
- /* Sleep and check that the connection is still alive */
- pg_usleep(remain > NAPTIME_PER_CYCLE ? NAPTIME_PER_CYCLE : remain);
- CheckClosedConnection();
+ if (!XLogSend(output_message, &caughtup))
+ break;
+ if (caughtup && !got_SIGHUP && !ready_to_stop && !shutdown_requested)
+ {
+ /*
+ * XXX: We don't really need the periodic wakeups anymore,
+ * WaitLatchOrSocket should reliably wake up as soon as
+ * something interesting happens.
+ */
- remain -= NAPTIME_PER_CYCLE;
+ /* Sleep */
+ WaitLatchOrSocket(&MyWalSnd->latch, MyProcPort->sock, -1);
+ elog(LOG, "walsender woke up");
}
- }
- /* Attempt to send the log once every loop */
- if (!XLogSend(output_message, &caughtup))
- break;
+ /* Check if the connection was closed */
+ CheckClosedConnection();
+ }
+ else
+ {
+ /* Attempt to send the log once every loop */
+ if (!XLogSend(output_message, &caughtup))
+ break;
+ }
}
/*
@@ -493,10 +510,15 @@ InitWalSnd(void)
}
else
{
- /* found */
- MyWalSnd = (WalSnd *) walsnd;
+ /*
+ * Found a free slot. Take ownership of the latch and initialize
+ * the other fields.
+ */
+ AcquireLatch((Latch *) &walsnd->latch);
walsnd->pid = MyProcPid;
- MemSet(&MyWalSnd->sentPtr, 0, sizeof(XLogRecPtr));
+ MemSet(&walsnd->sentPtr, 0, sizeof(XLogRecPtr));
+ /* Set MyWalSnd only after it's fully initialized. */
+ MyWalSnd = (WalSnd *) walsnd;
SpinLockRelease(&walsnd->mutex);
break;
}
@@ -523,6 +545,7 @@ WalSndKill(int code, Datum arg)
* for this.
*/
MyWalSnd->pid = 0;
+ ReleaseLatch(&MyWalSnd->latch);
/* WalSnd struct isn't mine anymore */
MyWalSnd = NULL;
@@ -787,6 +810,8 @@ static void
WalSndSigHupHandler(SIGNAL_ARGS)
{
got_SIGHUP = true;
+ if (MyWalSnd)
+ SetLatch(&MyWalSnd->latch);
}
/* SIGTERM: set flag to shut down */
@@ -794,6 +819,8 @@ static void
WalSndShutdownHandler(SIGNAL_ARGS)
{
shutdown_requested = true;
+ if (MyWalSnd)
+ SetLatch(&MyWalSnd->latch);
}
/*
@@ -828,11 +855,20 @@ WalSndQuickDieHandler(SIGNAL_ARGS)
exit(2);
}
+/* SIGUSR1: set flag to send WAL records */
+static void
+WalSndXLogSendHandler(SIGNAL_ARGS)
+{
+ latch_sigusr1_handler();
+}
+
/* SIGUSR2: set flag to do a last cycle and shut down afterwards */
static void
WalSndLastCycleHandler(SIGNAL_ARGS)
{
ready_to_stop = true;
+ if (MyWalSnd)
+ SetLatch(&MyWalSnd->latch);
}
/* Set up signal handlers */
@@ -847,7 +883,7 @@ WalSndSignals(void)
pqsignal(SIGQUIT, WalSndQuickDieHandler); /* hard crash time */
pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
- pqsignal(SIGUSR1, SIG_IGN); /* not used */
+ pqsignal(SIGUSR1, WalSndXLogSendHandler); /* request WAL sending */
pqsignal(SIGUSR2, WalSndLastCycleHandler); /* request a last cycle and
* shutdown */
@@ -891,10 +927,21 @@ WalSndShmemInit(void)
WalSnd *walsnd = &WalSndCtl->walsnds[i];
SpinLockInit(&walsnd->mutex);
+ InitSharedLatch(&walsnd->latch);
}
}
}
+/* Wake up all walsenders */
+void
+WalSndWakeup(void)
+{
+ int i;
+
+ for (i = 0; i < max_wal_senders; i++)
+ SetLatch(&WalSndCtl->walsnds[i].latch);
+}
+
/*
* This isn't currently used for anything. Monitoring tools might be
* interested in the future, and we'll need something like this in the
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 492ac9a..0083513 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/latch.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
#include "storage/procarray.h"
@@ -117,6 +118,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, SInvalShmemSize());
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
+ size = add_size(size, LatchShmemSize());
size = add_size(size, BgWriterShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, WalSndShmemSize());
@@ -217,6 +219,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
*/
PMSignalShmemInit();
ProcSignalShmemInit();
+ LatchShmemInit();
BgWriterShmemInit();
AutoVacuumShmemInit();
WalSndShmemInit();
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index e892e2d..3dd25f1 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -21,6 +21,7 @@
#include "commands/async.h"
#include "miscadmin.h"
#include "storage/ipc.h"
+#include "storage/latch.h"
#include "storage/procsignal.h"
#include "storage/shmem.h"
#include "storage/sinval.h"
@@ -278,5 +279,7 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+ latch_sigusr1_handler();
+
errno = save_errno;
}
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 874959e..73c5904 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -13,6 +13,7 @@
#define _WALSENDER_H
#include "access/xlog.h"
+#include "storage/latch.h"
#include "storage/spin.h"
/*
@@ -24,6 +25,12 @@ typedef struct WalSnd
XLogRecPtr sentPtr; /* WAL has been sent up to this point */
slock_t mutex; /* locks shared variables shown above */
+
+ /*
+ * Latch used by backends to wake up this walsender when it has work
+ * to do.
+ */
+ Latch latch;
} WalSnd;
/* There is one WalSndCtl struct for the whole database cluster */
@@ -45,5 +52,6 @@ extern int WalSenderMain(void);
extern void WalSndSignals(void);
extern Size WalSndShmemSize(void);
extern void WalSndShmemInit(void);
+extern void WalSndWakeup(void);
#endif /* _WALSENDER_H */
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
new file mode 100644
index 0000000..b972586
--- /dev/null
+++ b/src/include/storage/latch.h
@@ -0,0 +1,62 @@
+/*-------------------------------------------------------------------------
+ *
+ * latch.h
+ * Routines for interprocess latches
+ *
+ *
+ * Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LATCH_H
+#define LATCH_H
+
+#include <signal.h>
+
+/*
+ * Latch structure should be treated as opaque and only accessed through
+ * the public functions. It is defined here to allow embedding Latches as
+ * part of bigger structs.
+ */
+typedef struct
+{
+ sig_atomic_t is_set;
+ bool is_shared;
+#ifndef WIN32
+ int owner_pid;
+#else
+ HANDLE event;
+#endif
+} Latch;
+
+/*
+ * prototypes for functions in latch.c
+ */
+extern void InitLatch(volatile Latch *latch);
+extern void InitSharedLatch(volatile Latch *latch);
+extern void AcquireLatch(volatile Latch *latch);
+extern void ReleaseLatch(volatile Latch *latch);
+extern bool WaitLatch(volatile Latch *latch, long timeout);
+extern int WaitLatchOrSocket(volatile Latch *latch, pgsocket sock,
+ long timeout);
+extern void SetLatch(volatile Latch *latch);
+extern void ResetLatch(volatile Latch *latch);
+#define TestLatch(latch) (((volatile Latch *) latch)->is_set)
+
+extern Size LatchShmemSize(void);
+extern void LatchShmemInit(void);
+
+/*
+ * Unix implementation uses SIGUSR1 for inter-process signaling, Win32 doesn't
+ * need this.
+ */
+#ifndef WIN32
+extern void latch_sigusr1_handler(void);
+#else
+#define latch_sigusr1_handler()
+#endif
+
+#endif /* LATCH_H */
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index cf7c3ee..bf0e904 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -64,6 +64,7 @@ sub mkvcbuild
$postgres->ReplaceFile('src\backend\port\dynloader.c','src\backend\port\dynloader\win32.c');
$postgres->ReplaceFile('src\backend\port\pg_sema.c','src\backend\port\win32_sema.c');
$postgres->ReplaceFile('src\backend\port\pg_shmem.c','src\backend\port\win32_shmem.c');
+ $postgres->ReplaceFile('src\backend\port\pg_latch.c','src\backend\port\win32_latch.c');
$postgres->AddFiles('src\port',@pgportfiles);
$postgres->AddDir('src\timezone');
$postgres->AddFiles('src\backend\parser','scan.l','gram.y');
Hi,
On 09/06/2010 06:27 PM, Heikki Linnakangas wrote:
Here's an updated patch, with all the issues reported this far fixed,
except for that naming issue, and Fujii's suggestion to use poll()
instead of select() where available. I've also polished it quite a bit,
improving comments etc. Magnus, can you take a look at the Windows
implementation to check that it's sane? At least it seems to work.
Is pselect() really as unportable as stated in the patch? What platforms
have problems with pselect()?
Using the self-pipe trick, don't we risk running into the open file
handles limitation? Or is it just two handles per process?
Do I understand correctly that the purpose of this patch is to work
around the brokenness of select() on very few platforms? Or is there any
additional feature that plain signals don't give us?
+ * It's important to reset the latch *before* checking if there's work to
+ * do. Otherwise, if someone sets the latch between the check and the
+ * ResetLatch call, you will miss it and Wait will block.
Why doesn't WaitLatch() clear it? What's the use case for waiting for a
latch and *not* wanting to reset it?
Regards
Markus
Markus Wanner <markus@bluegap.ch> writes:
Is pselect() really as unportable as stated in the patch? What platforms
have problems with pselect()?
Well, it's not defined in the Single Unix Spec, which is our customary
reference for portability. Also, it's alleged that some platforms have
it but in a form that's not actually any safer than select(). For
example, I read in the Darwin man page for it
IMPLEMENTATION NOTES
The pselect() function is implemented in the C library as a wrapper
around select().
and that man page appears to be borrowed verbatim from FreeBSD.
Using the self-pipe trick, don't we risk running into the open file
handles limitation? Or is it just two handles per process?
It's just two handles per process.
regards, tom lane
On 09/06/2010 08:46 PM, Tom Lane wrote:
Well, it's not defined in the Single Unix Spec, which is our customary
reference for portability.
FWIW, I bet the self-pipe trick isn't mentioned there, either... any
good precedent that it actually works as expected on all of the target
platforms? Existing users of the self-pipe trick?
(You are certainly aware that pselect() is defined in newer versions of
POSIX).
Also, it's alleged that some platforms have
it but in a form that's not actually any safer than select(). For
example, I read in the Darwin man page for it
IMPLEMENTATION NOTES
The pselect() function is implemented in the C library as a wrapper
around select().
Ouch. Indeed, quick googling reveals the following source code for Darwin:
http://www.opensource.apple.com/source/Libc/Libc-391.5.22/gen/FreeBSD/pselect.c
Now that you are mentioning it, I seem to recall that even glibc had a
user-space "implementation" of pselect. Fortunately, that's quite some
years ago.
and that man page appears to be borrowed verbatim from FreeBSD.
At least FreeBSD seems to have fixed this about 8 months ago:
http://svn.freebsd.org/viewvc/base?view=revision&revision=200725
Maybe Darwin catches up eventually?
AFAICT the custom select() implementation we are using for Windows could
easily be changed to mimic pselect() instead. Thus most reasonably
up-to-date Linux distributions plus Windows certainly provide a workable
pselect() syscall. Would it be worth using pselect() for those (and
maybe others that support pselect() appropriately)?
It's just two handles per process.
Good. How about syscall overhead? One more write operation to the
self-pipe per signal from within the signal handler and one read to
actually clear the 'ready' state of the pipe from the waiter portion of
the code, right?
Do we plan to replace all (or most) existing internal signals with these
latches to circumvent the interruption problem? Or just the ones we need
to wait for using pg_usleep()?
For Postgres-R, I'd probably have to extend it to call select() not only
on the self-pipe, but on at least one other socket as well (to talk to
the GCS). As long as that's possible, it looks like a more portable
replacement for the pselect() variant that's currently in place there.
Regards
Markus
Markus Wanner <markus@bluegap.ch> writes:
AFAICT the custom select() implementation we are using for Windows could
easily be changed to mimic pselect() instead. Thus most reasonably
up-to-date Linux distributions plus Windows certainly provide a workable
pselect() syscall. Would it be worth using pselect() for those (and
maybe others that support pselect() appropriately)?
I don't entirely see the point of opening ourselves up to the risk of
using a pselect that's not safe under the hood. In any case, on most
modern platforms poll() is preferable to any variant of select().
regards, tom lane
On 06/09/10 23:10, Markus Wanner wrote:
Good. How about syscall overhead? One more write operation to the
self-pipe per signal from within the signal handler and one read to
actually clear the 'ready' state of the pipe from the waiter portion of
the code, right?
Right.
Do we plan to replace all (or most) existing internal signals with these
latches to circumvent the interruption problem? Or just the ones we need
to wait for using pg_usleep()?
At least the poll loops in bgwriter and walwriter need to be replaced if
we want to fix the issue Tom mentioned earlier that the server currently
wakes up periodically, waking up the CPU which wastes electricity.
There's no hurry to replace other code.
For Postgres-R, I'd probably have to extend it to call select() not only
on the self-pipe, but on at least one other socket as well (to talk to
the GCS). As long as that's possible, it looks like a more portable
replacement for the pselect() variant that's currently in place there.
Yeah, that would be a straightforward extension.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 06/09/10 20:24, Markus Wanner wrote:
On 09/06/2010 06:27 PM, Heikki Linnakangas wrote:
+ * It's important to reset the latch *before* checking if there's work to
+ * do. Otherwise, if someone sets the latch between the check and the
+ * ResetLatch call, you will miss it and Wait will block.
Why doesn't WaitLatch() clear it? What's the use case for waiting for a
latch and *not* wanting to reset it?
Setting a latch that's already set is very fast, so you want to keep it
set until the last moment. See the coding in walsender for example, it
goes to some lengths to avoid clearing the latch until it's very sure
there's no more work for it to do. That helps to keep the overhead in
backends committing transactions low. (no-one has tried to measure that
yet, though)
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 09/07/2010 09:06 AM, Heikki Linnakangas wrote:
Setting a latch that's already set is very fast, so you want to keep it
set until the last moment. See the coding in walsender for example, it
goes to some lengths to avoid clearing the latch until it's very sure
there's no more work for it to do. That helps to keep the overhead in
backends committing transactions low. (no-one has tried to measure that
yet, though)
Understood, thanks.
Markus
Hi,
On 09/06/2010 11:03 PM, Tom Lane wrote:
I don't entirely see the point of opening ourselves up to the risk of
using a pselect that's not safe under the hood.
It should be possible to reliably determine the platforms that provide
an atomic pselect(). For those, I'm hesitant to use a "trick", where
pselect() clearly provides a simpler and more "official" alternative.
Especially considering that those platforms form the vast majority for
running Postgres on.
What I'm most concerned about is the write() syscall within the signal
handler. If that fails for another reason than those covered, we miss
the signal. As Heikki points out in the comment, it's hard to deal with
such a failure.
Regarding the exact implementation, the positioning of drainSelfPipe in
Heikki's implementation seems strange to me. Most descriptions of the
self-pipe trick [1][2][4] put the drainSelfPipe() just after the
select(), where you can be sure there actually is something to read.
(Except [3], which recommends putting it inside the signal handler,
which I find even more frightening.)
Maybe you can read more than one byte at a time in drainSelfPipe(), to
save some syscalls?
Talking about the trick itself again: I found a lot of descriptions and
mentions of the self-pipe trick, but so far I only found an unknown
window manager [5] and the custom inetd that's mentioned in the LWN
article [4] which really use that trick. Digging deeper revealed that
there's a sigsafe library [6] as well as the bglibs [7], which both seem
to use the self-pipe trick as well (of which the latter doesn't even
care about the write()'s return value in the signal handler). Neither of
these two libraries seems to be used in any project of relevance.
Overall I got the impression that people like to describe the trick,
because it sounds so nifty and clever. However, I'd feel more
comfortable if I knew there were some medium to large projects that
actually use that trick. But AFAICT not even Bernstein's qmail does.
In any case, on most
modern platforms poll() is preferable to any variant of select().
Only Linux provides a ppoll() variant. And poll() itself doesn't replace
pselect().
Overall, I'm glad this gets addressed. Note that this is a long-standing
issue for Postgres-R and it's covered with a lengthy comment in its TODO
file [8].
Regards
Markus Wanner
[1]: D. J. Bernstein, The self-pipe trick
http://cr.yp.to/docs/selfpipe.html
[2]: Emile van Bergen, Avoiding races with Unix signals and select()
http://www.xs4all.nl/~evbergen/unix-signals.html
[3]: Alex Pennace, Safe UNIX Signal Handling Tips
http://osiris.978.org/~alex/safesignalhandling.html
[4]: LWN Article: The new pselect() system call, mentions the self-pipe
trick in a comment
http://lwn.net/Articles/176911/
[5]: Karmen: a window manager
http://freshmeat.net/projects/karmen
[6]: sigsafe library
http://www.slamb.org/projects/sigsafe/
[7]: Bruce Guenter, one stop library package
http://untroubled.org/bglibs/
[8]: Postgres-R TODO entry
http://git.postgres-r.org/?p=Postgres-R;a=blob;f=src/backend/replication/TODO;h=7bfc37ee9629943b9ff052d571b9d933ab38a0a8;hb=HEAD#l12
On 08/09/10 20:36, Markus Wanner wrote:
On 09/06/2010 11:03 PM, Tom Lane wrote:
I don't entirely see the point of opening ourselves up to the risk of
using a pselect that's not safe under the hood.
It should be possible to reliably determine the platforms that provide
an atomic pselect(). For those, I'm hesitant to use a "trick", where
pselect() clearly provides a simpler and more "official" alternative.
Especially considering that those platforms form the vast majority for
running Postgres on.
Perhaps, but I'm equally concerned that having different implementations
for different platforms means that all implementations get less testing
than if we use only one. Because of that I'm actually reluctant to even
use poll() where available instead of select(). At least in the first
phase, until someone demonstrates that there's a measurable difference
in performance. We only call poll/select when we're about to sleep, so
it's not really that performance critical anyway.
What I'm most concerned about is the write() syscall within the signal
handler. If that fails for another reason than those covered, we miss
the signal. As Heikki points out in the comment, it's hard to deal with
such a failure.
Yeah, there isn't much you can do about it. Perhaps you could set a
"mayday flag" (a global boolean variable) if it fails, and check that in
the main loop, elogging a warning there instead. But I don't think we
need to go to such lengths, realistically the write() will never fail or
you have bigger problems.
Maybe you can read more than one byte at a time in drainSelfPipe(), to
save some syscalls?
Perhaps, although it should be very rare to have more than one byte in
the pipe. SetLatch doesn't write another byte if the latch is already
set, so you only get multiple bytes in the pipe if many processes set
the latch at the same instant.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 08/09/10 20:36, Markus Wanner wrote:
It should be possible to reliably determine the platforms that provide
an atomic pselect(). For those, I'm hesitant to use a "trick", where
pselect() clearly provides a simpler and more "official" alternative.
Especially considering that those platforms form the vast majority for
running Postgres on.
Perhaps, but I'm equally concerned that having different implementations
for different platforms means that all implementations get less testing
than if we use only one.
There's that, and there's also that Markus' premise is full of holes.
Exactly how will you determine that pselect is safe at compile time?
Even if you correctly determine that, how can you be sure that the
finished executable will only be run against a version of libc that has
a safe implementation? Considering that we know that major platforms
such as FreeBSD have changed their implementations *very* recently,
it seems foolish to assume that an executable built on a machine with
corrected pselect could not be run on one with an older implementation.
Because of that I'm actually reluctant to even
use poll() where available instead of select(). At least in the first
phase, until someone demonstrates that there's a measurable difference
in performance.
select() is demonstrably a loser whenever the process has a lot of open
files. Also, we have plenty of experience with substituting poll() for
select(), so I'm not too worried about copy-and-pasting such code.
regards, tom lane
On Mon, Sep 06, 2010 at 07:24:59PM +0200, Markus Wanner wrote:
Do I understand correctly that the purpose of this patch is to work
around the brokenness of select() on very few platforms? Or is there any
additional feature that plain signals don't give us?
If the issue is just that select() doesn't get interrupted and we don't
care about a couple of syscalls, would it not be better to simply use
sigaction to turn on SA_RESTART just prior to the select() and turn it
off just after. Or are these systems so broken that select() won't be
interrupted, even if the signal handler is explicitly configured to do
so?
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patriotism is when love of your own people comes first; nationalism,
when hate for people other than your own comes first.
- Charles de Gaulle
Martijn van Oosterhout <kleptog@svana.org> writes:
If the issue is just that select() doesn't get interrupted and we don't
care about a couple of syscalls, would it not be better to simply use
sigaction to turn on SA_RESTART just prior to the select() and turn it
off just after. Or are these systems so broken that select() won't be
interrupted, even if the signal handler is explicitly configured to do
so?
I think you mean turn *off* SA_RESTART. We'd have to do that for each
signal that we were concerned about allowing to interrupt the select(),
so it's more than just two added calls. Another small problem is that
the latch code doesn't/shouldn't know what handlers are active, and
AFAICS you can't use sigaction() to flip that flag without setting the
handler address too. So while maybe we could do it that way, it'd be
pretty dang messy.
In my mind the main value of the Latch code will be to have a clean
platform-independent API for waiting. Why all the angst about whether
the implementation underneath is clean or not? It's more important that
it *works* and we don't have to worry about whether it will break on
platform XYZ.
regards, tom lane
On 08/09/10 23:07, Martijn van Oosterhout wrote:
On Mon, Sep 06, 2010 at 07:24:59PM +0200, Markus Wanner wrote:
Do I understand correctly that the purpose of this patch is to work
around the brokenness of select() on very few platforms? Or is there any
additional feature that plain signals don't give us?
If the issue is just that select() doesn't get interrupted and we don't
care about a couple of syscalls, would it not be better to simply use
sigaction to turn on SA_RESTART just prior to the select() and turn it
off just after. Or are these systems so broken that select() won't be
interrupted, even if the signal handler is explicitly configured to do
so?
I don't know if SA_RESTART is portable. But in any case, that will do
nothing about the race condition where the signal arrives just *before*
the select() call.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 09/08/2010 08:18 PM, Tom Lane wrote:
Considering that we know that major platforms
such as FreeBSD have changed their implementations *very* recently,
it seems foolish to assume that an executable built on a machine with
corrected pselect could not be run on one with an older implementation.
FWIW testing a recent development (i.e. 9.0-devel) version of FreeBSD
still failed to properly support pselect().
Also, we have plenty of experience with substituting poll() for
select(), so I'm not too worried about copy-and-pasting such code.
It's certainly a simpler change than the self-pipe trick vs. pselect(), yes.
I'm happy to go with the self-pipe trick. A quick micro-benchmark didn't
even show any significant difference compared to pselect(), so from that
perspective, it's not a big deal.
And we'd then have a major project using the self-pipe trick. (I would
still like to know what others exist).
Regards
Markus Wanner
On 09/08/2010 08:01 PM, Heikki Linnakangas wrote:
Yeah, there isn't much you can do about it. Perhaps you could set a
"mayday flag" (a global boolean variable) if it fails, and check that in
the main loop, elogging a warning there instead. But I don't think we
need to go to such lengths, realistically the write() will never fail or
you have bigger problems.
Hm.. I think I'd like to see such a mayday flag. Just so we at least
have a chance of finding out that something has gone wrong - whether or
not there's a bigger problem.
Perhaps, although it should be very rare to have more than one byte in
the pipe. SetLatch doesn't write another byte if the latch is already
set, so you only get multiple bytes in the pipe if many processes set
the latch at the same instant.
Depending on what you use these latches for, it might not be that rare
anymore. Trying to read more than one byte at a time doesn't cost anything.
Regards
Markus
On 06/09/10 19:27, Heikki Linnakangas wrote:
On 06/09/10 17:18, Tom Lane wrote:
BTW, on reflection the AcquireLatch/ReleaseLatch terminology seems a bit
ill chosen: ReleaseLatch sounds way too much like something that would
just unlock or clear the latch. Perhaps OwnLatch/DisownLatch, or
something along that line.
Yeah, I see what you mean. Although, maybe it's just me, but Own/Disown
looks ugly. Anyone have a better suggestion?
Magnus suggested AssociateLatch, given that the description of the
function is that it associates the latch with the current process. I
went with Own/Disown after all, it feels more precise, and having made
the changes it doesn't look that ugly to me anymore.
Here's an updated patch, with all the issues reported this far fixed,
except for that naming issue, and Fujii's suggestion to use poll()
instead of select() where available. I've also polished it quite a bit,
improving comments etc. Magnus, can you take a look at the Windows
implementation to check that it's sane? At least it seems to work.
We discussed the patch over IM, there's one minor point I'd like
to get into the archives:
It seems that NumSharedLatches() is entirely wrongly placed if it's in
the win32 specific code! That needs to be somewhere shared, so people find it.
Yeah. There's a notice of that in OwnLatch(), but it would be nice if we
could make it even more prominent. One idea is to put in latch.h as:
#define NumSharedLatches() (max_wal_senders /* + something else in the
future */ )
When it's a #define, we don't need to put #include "walsender.h" in
latch.h, it's enough to put it in win32_latch.c. It's a bit weird to
have a #define in one header file that doesn't work unless you #include
another header file in where you use it, but it would work. Any opinions
on whether that's better than having NumSharedLatches() defined in the
Win32-specific win32_latch.c file? I'm inclined to leave it as it is, in
win32_latch.c, but I'm not sure.
Barring any last-minute objections, I'll commit this in the next few
days. This patch doesn't affect walreceiver yet; I think the next step
is to use the latches to eliminate the polling loop in walreceiver too,
so that as soon as a piece of WAL is fsync'd to disk in the standby,
it's applied.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 06/09/10 19:27, Heikki Linnakangas wrote:
It seems that NumSharedLatches() is entirely wrongly placed if it's in
the win32 specific code! That needs to be somewhere shared, so people find it,
Yeah. There's a notice of that in OwnLatch(), but it would be nice if we
could make it even more prominent. One idea is to put in latch.h as:
#define NumSharedLatches() (max_wal_senders /* + something else in the
future */ )
When it's a #define, we don't need to put #include "walsender.h" in
latch.h, it's enough to put it in win32_latch.c. It's a bit weird to
have a #define in one header file that doesn't work unless you #include
another header file in where you use it, but it would work.
We have other precedents for that. But having said that ...
Any opinions
on whether that's better than having NumSharedLatches() defined in the
Win32-specific win32_latch.c file? I'm inclined to leave it as it is, in
win32_latch.c, but I'm not sure.
I'd leave it alone. I see no very good reason to expose
NumSharedLatches globally.
Barring any last-minute objections, I'll commit this in the next few
days. This patch doesn't affect walreceiver yet; I think the next step
is to use the latches to eliminate the polling loop in walreceiver too,
so that as soon as a piece of WAL is fsync'd to disk in the standby,
it's applied.
I will do some work as well once it's in. Since I was the one
complaining about extra wakeups in the other processes, I'm willing
to go fix 'em.
regards, tom lane
On 11/09/10 18:02, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
Barring any last-minute objections, I'll commit this in the next few
days. This patch doesn't affect walreceiver yet; I think the next step
is to use the latches to eliminate the polling loop in walreceiver too,
so that as soon as a piece of WAL is fsync'd to disk in the standby,
it's applied.
I will do some work as well once it's in. Since I was the one
complaining about extra wakeups in the other processes, I'm willing
to go fix 'em.
Committed. I'll take a look at making walreceiver respond quickly when
WAL arrives in the standby, using latches, but that shouldn't interfere
with what you're doing.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sat, 2010-09-11 at 19:15 +0300, Heikki Linnakangas wrote:
Committed. I'll take a look at making walreceiver respond quickly when
WAL arrives in the standby, using latches, but that shouldn't interfere
with what you're doing.
I glanced at the code, and I see (in OwnLatch()):
+ if (latch->owner_pid != 0)
+ elog(ERROR, "latch already owned");
+ latch->owner_pid = MyProcPid;
But it looks like there may be a race there. I assume the callers are
supposed to have a lock at this point (and it looks like the current
caller is safe, but I didn't look in detail). Something still seems
strange to me though -- why throw an ERROR there if it can't happen (or
indicates an inconsistent state when it does happen)? And it doesn't
look safe to _not_ throw an error (due to a race) if it does happen.
It seems like OwnLatch is only supposed to be used when you already
know that you own it, because there's no error code returned or blocking
behavior, it just throws an ERROR. But if that's the case, what's the
point of OwnLatch()?
Perhaps add a few comments to describe to other users of the API how to
properly own a shared latch?
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
I glanced at the code, and I see (in OwnLatch()):
+ if (latch->owner_pid != 0)
+ elog(ERROR, "latch already owned");
+ latch->owner_pid = MyProcPid;
But it looks like there may be a race there.
Yeah, that error check is only intended to catch gross logic errors,
not to guard against race conditions. I don't think we really could
prevent a race there without adding a spinlock, which seems like
overkill.
... why throw an ERROR there if it can't happen (or
indicates an inconsistent state when it does happen)?
Are you suggesting that an Assert would be sufficient?
regards, tom lane
On Sun, 2010-09-12 at 12:29 -0400, Tom Lane wrote:
... why throw an ERROR there if it can't happen (or
indicates an inconsistent state when it does happen)?
Are you suggesting that an Assert would be sufficient?
I'm not too picky about whether it's Assert, ERROR, or PANIC (Asserts
aren't available in production systems, which I assume is why elog was
used); but we should be consistent and document that:
(a) it shouldn't happen
(b) that it's just a sanity check and we're ignoring the race
However, that also means that the whole concept of OwnLatch/DisownLatch
is entirely redundant, and only there for asserts because it doesn't do
anything else. That seems a little strange to me, as well, so (at
minimum) it should be documented that the functions really have no
effect on execution and are required only to support debugging.
Regards,
Jeff Davis
On Sun, 2010-09-12 at 10:13 -0700, Jeff Davis wrote:
I'm not too picky about whether it's Assert, ERROR, or PANIC (Asserts
aren't available in production systems, which I assume is why elog was
used); but we should be consistent and document that:
(a) it shouldn't happen
(b) that it's just a sanity check and we're ignoring the race
I should add that Assert seems to imply both of those things (at least
to me), so that would solve one of the confusing aspects of the API.
The other confusing part that I think still needs comments is that
OwnLatch/DisownLatch don't really do anything; they just carry
information for sanity checks later.
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
However, that also means that the whole concept of OwnLatch/DisownLatch
is entirely redundant, and only there for asserts because it doesn't do
anything else. That seems a little strange to me, as well, so (at
minimum) it should be documented that the functions really have no
effect on execution and are required only to support debugging.
Uh, this is nonsense. You have to have something like these functions
to support transferring ownership of a latch from one process to
another, which is required at least for the walreceiver usage.
It's correct that the latch code itself isn't trying very hard to avoid
a race condition in acquiring ownership, but that doesn't make the whole
thing useless, it just means that we're assuming that will be avoided
by logic elsewhere. If there is a bug elsewhere that allows two
different processes to try to take ownership of the same latch, the
current coding will expose that bug soon enough.
regards, tom lane
On Sun, 2010-09-12 at 14:12 -0400, Tom Lane wrote:
Uh, this is nonsense. You have to have something like these functions
to support transferring ownership of a latch from one process to
another, which is required at least for the walreceiver usage.
Oh, I see. It's needed to know where to send the signal, of course.
Regards,
Jeff Davis
On 12/09/10 20:13, Jeff Davis wrote:
On Sun, 2010-09-12 at 12:29 -0400, Tom Lane wrote:
... why throw an ERROR there if it can't happen (or
indicates an inconsistent state when it does happen)?
Are you suggesting that an Assert would be sufficient?
I'm not too picky about whether it's Assert, ERROR, or PANIC (Asserts
aren't available in production systems, which I assume is why elog was
used);
Right, OwnLatch is not performance-critical, so it's better to elog()
IMHO.
but we should be consistent and document that:
(a) it shouldn't happen
(b) that it's just a sanity check and we're ignoring the race
Would this be sufficient?
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -156,6 +156,7 @@ OwnLatch(volatile Latch *latch)
if (selfpipe_readfd == -1)
initSelfPipe();
+ /* sanity check */
if (latch->owner_pid != 0)
elog(ERROR, "latch already owned");
latch->owner_pid = MyProcPid;
Or you want to suggest something better?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, 2010-09-13 at 09:10 +0300, Heikki Linnakangas wrote:
but we should be consistent and document that:
(a) it shouldn't happen
(b) that it's just a sanity check and we're ignoring the race
Would this be sufficient?
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -156,6 +156,7 @@ OwnLatch(volatile Latch *latch)
 	if (selfpipe_readfd == -1)
 		initSelfPipe();
+	/* sanity check */
 	if (latch->owner_pid != 0)
 		elog(ERROR, "latch already owned");
 	latch->owner_pid = MyProcPid;
Or you want to suggest something better?
Perfect. I was just slightly confused reading it the first time, and I
think that would have cleared it up.
Thanks,
Jeff Davis
On 13/09/10 20:43, Jeff Davis wrote:
On Mon, 2010-09-13 at 09:10 +0300, Heikki Linnakangas wrote:
but we should be consistent and document that:
(a) it shouldn't happen
(b) that it's just a sanity check and we're ignoring the race
Would this be sufficient?
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -156,6 +156,7 @@ OwnLatch(volatile Latch *latch)
 	if (selfpipe_readfd == -1)
 		initSelfPipe();
+	/* sanity check */
 	if (latch->owner_pid != 0)
 		elog(ERROR, "latch already owned");
 	latch->owner_pid = MyProcPid;
Or you want to suggest something better?
Perfect. I was just slightly confused reading it the first time, and I
think that would have cleared it up.
Ok, added that.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com