Group commit, revised
Attached is a patch that Simon Riggs and I collaborated on. I
took the group commit patch that Simon posted to the list back in
November, and partially rewrote it. Here is that original thread:
http://archives.postgresql.org/pgsql-hackers/2011-11/msg00802.php
I've also attached the results of a pgbench-tools driven benchmark,
which are quite striking (just the most relevant image is attached -
e-mail me privately if you'd like a copy of the full report; as a
courtesy to others, I don't want to send a large PDF file to the list).
Apart from
the obvious improvement in throughput, there is also a considerable
improvement in average and worst latency at all client counts.
To recap, the initial impetus to pursue this idea came from the
observation that with sync rep, we could get massive improvements in
the transactions-per-second throughput by simply adding clients. Greg
Smith performed a benchmark while in Amsterdam for the PgConf.EU
conference, which was discussed in a talk there. Over an
inter-continental connection from Amsterdam to his office in Baltimore
on the U.S. east coast, he saw TPS as reported by pgbench on what I
suppose was either an insert or update workload grow from a mere 10
TPS for a single connection to over 2000 TPS for about 300
connections. That was with the large, inherent latency imposed by
those sorts of distances (3822 miles / 6150 km, about a 100ms ping time
on a decent connection). Quite obviously, as clients were added, the
server was able to batch increasing numbers of commits in each
confirmation message, resulting in this effect.
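(As a sanity check on those numbers: at ~100ms of round-trip latency,
one confirmation per commit caps a single connection at roughly 1/0.1s
= 10 TPS, matching the single-client figure; if each confirmation can
cover every commit that arrived during the round trip, the ceiling
scales toward 300 x 10 = 3000 TPS at 300 connections, so ~2000 TPS is
consistent with near-ideal batching.)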
The main way that I've added value here is by refactoring and fixing
bugs. There were some tricky race conditions that caused the
regression tests to fail for that early draft patch, but there were a
few other small bugs too. There is an unprecedented latch pattern
introduced by this patch: Two processes (the WAL Writer and any given
connection backend) have a mutual dependency. Only one may sleep, in
which case the other is responsible for waking it. Much of the
difficulty in debugging this patch, and much of the complexity that
I've added, came from preventing both from simultaneously sleeping,
even in the face of various known failure modes like postmaster death.
If this happens, it does of course represent a denial-of-service, so
that's something that reviewers are going to want to heavily
scrutinise. I suppose that sync rep should be fine here, as it waits
on walsenders, but I haven't actually comprehensively tested the
patch, so there may be undiscovered unpleasant interactions with other
xlog related features. I can report that it passes the regression
tests successfully, and on an apparently consistent basis - I
battled with intermittent failures for a time.
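To sketch the shape of the invariant in pseudo-C (illustrative names
only - these are not the patch's actual identifiers):

    /*
     * At most one of the two parties may sleep at any instant; whichever
     * is awake owns the duty of waking the other. Backend side:
     */
    SpinLockAcquire(&shared->mutex);
    shared->backend_waiting = true;
    SpinLockRelease(&shared->mutex);
    SetLatch(&shared->walwriter_latch); /* ensure the flusher is awake */
    for (;;)
    {
        ResetLatch(&MyProc->procLatch);
        if (shared->flushed_up_to >= my_commit_lsn)
            break;                      /* our commit record is on disk */
        /* WL_POSTMASTER_DEATH covers the failure mode mentioned above */
        WaitLatch(&MyProc->procLatch,
                  WL_LATCH_SET | WL_POSTMASTER_DEATH, -1L);
    }

The WAL Writer does the mirror image: after each flush it sets the
latches of satisfied backends, and it only sleeps once it has confirmed
that no backend still needs it awake.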
Simon's original patch largely copied the syncrep.c code as an
expedient to prove the concept. Obviously this design was never
intended to get committed, and I've done some commonality and
variability analysis, refactoring to considerably slim down the new
groupcommit.c file by exposing some previously module-private
functions from syncrep.c.
I encourage others to reproduce my benchmark here. I attach a
pgbench-tools config. You can get the latest version of the tool at:
https://github.com/gregs1104/pgbench-tools
I also attach hdparm information for the disk that was used during
these benchmarks. Note that I have not disabled the write cache. It's
a Linux box, with ext4, running a recent kernel.
The benefits (and, admittedly, the controversies) of this patch go
beyond mere batching of commits: it also largely, though not entirely,
obviates the need for user backends to directly write/flush WAL, and
the WAL Writer may never sleep if it continually finds work to do -
wal_writer_delay is obsoleted, as are commit_siblings and
commit_delay. I suspect that they will not be missed. Of course, it
does all this to facilitate group commit itself. The group commit
feature does not have a GUC to control it, as this seems like
something that would be fairly pointless to turn off. FWIW, this is
currently the case for the recently introduced MariaDB group commit
implementation.
Auxiliary processes cannot use group commit. The changes made prevent
them from availing of commit_siblings/commit_delay parallelism,
because it doesn't exist anymore.
Group commit is sometimes throttled, which seems appropriate - if a
backend requests that the WAL Writer flush an LSN deemed too far from
the known flushed point, that request is rejected and the backend goes
through another path, where XLogWrite() is called. Currently the group
commit infrastructure decides that on the sole basis of there being a
volume of WAL between the two points equivalent to checkpoint_segments'
worth of segments. This is probably a fairly horrible heuristic,
not least since it overloads checkpoint_segments, but is of course
only a first-pass effort. Bright ideas are always welcome.
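Concretely, the check amounts to no more than this (paraphrasing the
patch; delta is the byte distance between the requested LSN and the
known flushed point):

    /* First-pass heuristic: refuse requests too far past the flushed point */
    if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
        !ProcGlobal->groupCommitAvailable)
        return false;       /* caller falls back to a direct XLogWrite() */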
Thoughts?
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
group-commit-results.png (image/png)
On 16.01.2012 00:42, Peter Geoghegan wrote:
I've also attached the results of a pgbench-tools driven benchmark,
which are quite striking (just the most relevant image is attached -
e-mail me privately if you'd like a copy of the full report; as a
courtesy to others, I don't want to send a large PDF file to the list).
Apart from
the obvious improvement in throughput, there is also a considerable
improvement in average and worst latency at all client counts.
Impressive results. How about uploading the PDF to the community wiki?
The main way that I've added value here is by refactoring and fixing
bugs. There were some tricky race conditions that caused the
regression tests to fail for that early draft patch, but there were a
few other small bugs too. There is an unprecedented latch pattern
introduced by this patch: Two processes (the WAL Writer and any given
connection backend) have a mutual dependency. Only one may sleep, in
which case the other is responsible for waking it. Much of the
difficulty in debugging this patch, and much of the complexity that
I've added, came from preventing both from simultaneously sleeping,
even in the face of various known failure modes like postmaster death.
If this happens, it does of course represent a denial-of-service, so
that's something that reviewers are going to want to heavily
scrutinise. I suppose that sync rep should be fine here, as it waits
on walsenders, but I haven't actually comprehensively tested the
patch, so there may be undiscovered unpleasant interactions with other
xlog related features. I can report that it passes the regression
tests successfully, and on an apparently consistent basis - I
battled with intermittent failures for a time.
I think it might be simpler if it wasn't the background writer that's
responsible for "driving" the group commit queue, but the backends
themselves. When a flush request comes in, you join the queue, and if
someone else is already doing the flush, sleep until the driver wakes
you up. If no-one is doing the flush yet (ie. the queue is empty), start
doing it yourself. You'll need a state variable to keep track of who's
driving the queue, but otherwise I think it would be simpler as there
would be no dependency on WAL writer.
I tend to think of the group commit facility as a bus. Passengers can hop
on board at any time, and they take turns on who drives the bus. When
the first passenger hops in, there is no driver so he takes the driver
seat. When the next passenger hops in, he sees that someone is driving
the bus already, so he sits down, and places a big sign on his forehead
stating the bus stop where he wants to get off, and goes to sleep. When
the driver has reached his own bus stop, he wakes up all the passengers
who wanted to get off at the same stop or any of the earlier stops [1].
He also wakes up the passenger whose bus stop is the farthest from the
current stop, and gets off the bus. The woken-up passengers who have
already reached their stops can immediately get off the bus, and the one
who has not notices that no-one is driving the bus anymore, so he takes
the driver seat.
[1]: in a real bus, a passenger would not be happy if he's woken up too
late and finds himself at the next stop instead of the one where he
wanted to go, but for group commit, that is fine.
In this arrangement, you could use the per-process semaphore for the
sleep/wakeups, instead of latches. I'm not sure if there's any
difference, but semaphores are more tried and tested, at least.
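In rough pseudo-C, with hypothetical names (this is only a sketch of the
idea, not code from any patch):

    /* First to arrive at an empty queue becomes the driver */
    SpinLockAcquire(&gc->mutex);
    if (gc->driver == NULL)
    {
        gc->driver = MyProc;                 /* take the driver seat */
        SpinLockRelease(&gc->mutex);
        /* ... write and fsync WAL, then dequeue and wake the others ... */
    }
    else
    {
        enqueue(&gc->queue, MyProc, my_lsn); /* sign on the forehead */
        SpinLockRelease(&gc->mutex);
        PGSemaphoreLock(&MyProc->sem, true); /* sleep until the driver
                                              * reaches our stop */
    }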
The benefits (and, admittedly, the controversies) of this patch go
beyond mere batching of commits: it also largely, though not entirely,
obviates the need for user backends to directly write/flush WAL, and
the WAL Writer may never sleep if it continually finds work to do -
wal_writer_delay is obsoleted, as are commit_siblings and
commit_delay.
wal_writer_delay is still needed for controlling how often asynchronous
commits are flushed to disk.
Auxiliary processes cannot use group commit. The changes made prevent
them from availing of commit_siblings/commit_delay parallelism,
because it doesn't exist anymore.
Auxiliary processes have PGPROC entries too. Why can't they participate?
Group commit is sometimes throttled, which seems appropriate - if a
backend requests that the WAL Writer flush an LSN deemed too far from
the known flushed point, that request is rejected and the backend goes
through another path, where XLogWrite() is called.
Hmm, if the backend doing the big flush gets the WALWriteLock before a
bunch of group committers, the group committers will have to wait until
the big flush is finished, anyway. I presume the idea of the throttling
is to avoid the situation where a bunch of small commits need to wait
for a huge flush to finish.
Perhaps the big flusher should always join the queue, but use some
heuristic to first flush up to the previous commit request, to wake up
others quickly, and do another flush to flush its own request after that.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 16 January 2012 08:11, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Impressive results. How about uploading the PDF to the community wiki?
Sure. http://wiki.postgresql.org/wiki/Group_commit.
I think it might be simpler if it wasn't the background writer that's
responsible for "driving" the group commit queue, but the backends
themselves. When a flush request comes in, you join the queue, and if
someone else is already doing the flush, sleep until the driver wakes you
up. If no-one is doing the flush yet (ie. the queue is empty), start doing
it yourself. You'll need a state variable to keep track of who's driving the
queue, but otherwise I think it would be simpler as there would be no
dependency on WAL writer.
I think this replaces one problem with another. You've now effectively
elevated a nominated backend to the status of an auxiliary process -
do you intend to have the postmaster look after it, as with any other
auxiliary process? I'm not sure that that is a more difficult problem
to solve, but I suspect so. At least my proposal can have any one of
the backends - both those currently participating in group commit and
those yet to -
wake up the WAL Writer.
I tend to think of the group commit facility as a bus. Passengers can hop on
board at any time, and they take turns on who drives the bus. When the first
passenger hops in, there is no driver so he takes the driver seat. When the
next passenger hops in, he sees that someone is driving the bus already, so
he sits down, and places a big sign on his forehead stating the bus stop
where he wants to get off, and goes to sleep. When the driver has reached
his own bus stop, he wakes up all the passengers who wanted to get off at
the same stop or any of the earlier stops [1]. He also wakes up the
passenger whose bus stop is the farthest from the current stop, and gets off
the bus. The woken-up passengers who have already reached their stops can
immediately get off the bus, and the one who has not notices that no-one is
driving the bus anymore, so he takes the driver seat.
[1] in a real bus, a passenger would not be happy if he's woken up too late
and finds himself at the next stop instead of the one where he wanted to go,
but for group commit, that is fine.
In this arrangement, you could use the per-process semaphore for the
sleep/wakeups, instead of latches. I'm not sure if there's any difference,
but semaphores are more tried and tested, at least.
Yes, and I expect that this won't be the last time someone uses a bus
analogy in relation to this!
The proposed patch is heavily based on sync rep, which I'd have
imagined was more tried and tested than any proposed completely
alternative implementation, as it is basically a generalisation of
exactly the same principle, WAL Writer changes notwithstanding. I
would have imagined that that aspect would be particularly approved
of.
wal_writer_delay is still needed for controlling how often asynchronous
commits are flushed to disk.
That had occurred to me of course, but has anyone ever actually
tweaked wal_writer_delay with adjusting the behaviour of asynchronous
commits in mind? I'm pretty sure that the answer is no. I have a
slight preference for obsoleting it as a consequence of introducing
group commit, but I don't think that it matters that much.
Auxiliary processes cannot use group commit. The changes made prevent
them from availing of commit_siblings/commit_delay parallelism,
because it doesn't exist anymore.
Auxiliary processes have PGPROC entries too. Why can't they participate?
It was deemed to be a poor design decision to effectively create a
dependency on the WAL Writer among other auxiliary processes, as to do
so would perhaps compromise the way in which the postmaster notices
and corrects isolated failures. Maybe I'll revisit that assessment,
but I am not convinced that it's worth the very careful analysis of
the implications of such an unprecedented dependency, without there
being some obvious advantage. If it's a question of their being
deprived of commit_siblings "group commit", well, we know from
experience that people didn't tend to touch it a whole lot anyway.
Group commit is sometimes throttled, which seems appropriate - if a
backend requests that the WAL Writer flush an LSN deemed too far from
the known flushed point, that request is rejected and the backend goes
through another path, where XLogWrite() is called.
Hmm, if the backend doing the big flush gets the WALWriteLock before a bunch
of group committers, the group committers will have to wait until the big
flush is finished, anyway. I presume the idea of the throttling is to avoid
the situation where a bunch of small commits need to wait for a huge flush
to finish.
Exactly. Of course, you're never going to see that situation with
pgbench. I don't have much data to inform exactly what the right
trade-off is here, or some generic approximation of it across
platforms and hardware - other people will know more about this than I
do. While I have a general sense that flushing a single page of data
costs about the same as flushing a much larger amount, I can't claim
much understanding of where that stops holding for still larger
amounts, where the question of modelling some trade-off between
throughput and latency arises, especially with all the baggage that the
implementation carries, such as whether or not we're using
full_page_writes, the hardware, and so on.
Something simple will probably work well.
Perhaps the big flusher should always join the queue, but use some heuristic
to first flush up to the previous commit request, to wake up others quickly,
and do another flush to flush its own request after that.
Maybe, but we should decide what a big flusher looks like first. That
way, if we can't figure out a way to do what you describe with it in
time for 9.2, we can at least do what I'm already doing.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 17.01.2012 16:35, Peter Geoghegan wrote:
On 16 January 2012 08:11, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
I think it might be simpler if it wasn't the background writer that's
responsible for "driving" the group commit queue, but the backends
themselves. When a flush request comes in, you join the queue, and if
someone else is already doing the flush, sleep until the driver wakes you
up. If no-one is doing the flush yet (ie. the queue is empty), start doing
it yourself. You'll need a state variable to keep track of who's driving the
queue, but otherwise I think it would be simpler as there would be no
dependency on WAL writer.
I think this replaces one problem with another. You've now effectively
elevated a nominated backend to the status of an auxiliary process -
do you intend to have the postmaster look after it, as with any other
auxiliary process?
The GroupCommitWaitForLSN() call happens within a critical section. If
the process dies, you're screwed no matter what. It's not very different
from the current situation where if one backend flushes the WAL, another
backend will notice that once it gets the WALWriteLock, and returns quickly.
wal_writer_delay is still needed for controlling how often asynchronous
commits are flushed to disk.
That had occurred to me of course, but has anyone ever actually
tweaked wal_writer_delay with adjusting the behaviour of asynchronous
commits in mind?
I found it very helpful to reduce wal_writer_delay in pgbench tests,
when running with synchronous_commit=off. The reason is that hint bits
don't get set until the commit record is flushed to disk, so making the
flushes more frequent reduces the contention on the clog. However, Simon
made async commits nudge WAL writer if the WAL page fills up, so I'm not
sure how relevant that experience is anymore.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 17 January 2012 17:37, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
I found it very helpful to reduce wal_writer_delay in pgbench tests, when
running with synchronous_commit=off. The reason is that hint bits don't get
set until the commit record is flushed to disk, so making the flushes more
frequent reduces the contention on the clog. However, Simon made async
commits nudge WAL writer if the WAL page fills up, so I'm not sure how
relevant that experience is anymore.
It's quite possible that the WAL Writer will spin continuously, if
given enough work to do, with this patch.
Although it's hard to tell from the graph I sent, there is a modest
improvement in TPS for even a single client. See the tables in the
PDF.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Jan 15, 2012, at 4:42 PM, Peter Geoghegan wrote:
Attached is a patch that myself and Simon Riggs collaborated on. I
took the group commit patch that Simon posted to the list back in
November, and partially rewrote it.
Forgive me if this is a dumb question, but I noticed a few places doing things like:
if (...)
    Blah();
else
{
    ...
}
Is that allowed PG formatting? I thought that if braces were required on one branch of an if they were supposed to go on both sides.
Also, I didn't see any README changes in the patch... perhaps this is big enough to warrant them?
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
Excerpts from Jim Nasby's message of Tue Jan 17 21:21:57 -0300 2012:
On Jan 15, 2012, at 4:42 PM, Peter Geoghegan wrote:
Attached is a patch that myself and Simon Riggs collaborated on. I
took the group commit patch that Simon posted to the list back in
November, and partially rewrote it.Forgive me if this is a dumb question, but I noticed a few places doing things like:
if (...)
    Blah();
else
{
    ...
}
Is that allowed PG formatting? I thought that if braces were required on one branch of an if they were supposed to go on both sides.
Yeah, we even used to have pg_indent remove braces around single
statements, so if you had one statement in the "if" branch and more than
one in the other branch, pg_indent would have left things like that anyway.
(This particular pg_indent behavior got removed because it messed up
PG_TRY blocks.)
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jan 17, 2012 at 12:37 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
I found it very helpful to reduce wal_writer_delay in pgbench tests, when
running with synchronous_commit=off. The reason is that hint bits don't get
set until the commit record is flushed to disk, so making the flushes more
frequent reduces the contention on the clog. However, Simon made async
commits nudge WAL writer if the WAL page fills up, so I'm not sure how
relevant that experience is anymore.
There's still a small but measurable effect there in master. I think
we might be able to make it fully auto-tuning, but I don't think we're
fully there yet (not sure how much this patch changes that equation).
I suggested a design similar to the one you just proposed to Simon
when he originally suggested this feature. It seems that if the WAL
writer is the only one doing WAL flushes, then there must be some IPC
overhead - and context switching - involved whenever WAL is flushed.
But clearly we're saving something somewhere else, on the basis of
Peter's results, so maybe it's not worth worrying about. It does seem
pretty odd to have all the regular backends going through the WAL
writer and the auxiliary processes using a different mechanism,
though. If we got rid of that, maybe WAL writer wouldn't even require
a lock, if there's only one process that can be doing it at a time.
What happens in standalone mode?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jan 18, 2012 at 1:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jan 17, 2012 at 12:37 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
I found it very helpful to reduce wal_writer_delay in pgbench tests, when
running with synchronous_commit=off. The reason is that hint bits don't get
set until the commit record is flushed to disk, so making the flushes more
frequent reduces the contention on the clog. However, Simon made async
commits nudge WAL writer if the WAL page fills up, so I'm not sure how
relevant that experience is anymore.
Still completely relevant and orthogonal to this discussion. The patch
retains multi-modal behaviour.
There's still a small but measurable effect there in master. I think
we might be able to make it fully auto-tuning, but I don't think we're
fully there yet (not sure how much this patch changes that equation).
I suggested a design similar to the one you just proposed to Simon
when he originally suggested this feature. It seems that if the WAL
writer is the only one doing WAL flushes, then there must be some IPC
overhead - and context switching - involved whenever WAL is flushed.
But clearly we're saving something somewhere else, on the basis of
Peter's results, so maybe it's not worth worrying about. It does seem
pretty odd to have all the regular backends going through the WAL
writer and the auxiliary processes using a different mechanism,
though. If we got rid of that, maybe WAL writer wouldn't even require
a lock, if there's only one process that can be doing it at a time.
When we did sync rep it made sense to have the WALSender do the work
and for others to just wait. It would be quite strange to require a
different design for essentially the same thing for normal commits and
WAL flushes to local disk. I should mention the original proposal for
streaming replication had each backend send data to standby
independently and that was recognised as a bad idea after some time.
Same for sync rep also.
The gain is that previously there was measurable contention for the
WALWriteLock, now there is none. Plus the gang effect continues to
work even when the database gets busy, which isn't true of piggyback
commits as we use now.
Not sure why it's odd to have backends do one thing and auxiliaries do
another. The whole point of auxiliary processes is that they do a
specific thing different to normal backends. Anyway, the important
thing is to have auxiliary processes be independent of each other as
much as possible, which simplifies error handling and state logic in
the postmaster.
With regard to context switching, we're making a kernel call to fsync,
so we'll get a context switch anyway. The whole process is similar to
the way lwlock wake up works.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 18, 2012 at 5:38 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Wed, Jan 18, 2012 at 1:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jan 17, 2012 at 12:37 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
I found it very helpful to reduce wal_writer_delay in pgbench tests, when
running with synchronous_commit=off. The reason is that hint bits don't get
set until the commit record is flushed to disk, so making the flushes more
frequent reduces the contention on the clog. However, Simon made async
commits nudge WAL writer if the WAL page fills up, so I'm not sure how
relevant that experience is anymore.
Still completely relevant and orthogonal to this discussion. The patch
retains multi-modal behaviour.
I don't know what you mean by this. I think removing wal_writer_delay
is premature, because I think it still may have some utility, and the
patch removes it. That's a separate change that should be factored
out of this patch and discussed separately.
There's still a small but measurable effect there in master. I think
we might be able to make it fully auto-tuning, but I don't think we're
fully there yet (not sure how much this patch changes that equation).
I suggested a design similar to the one you just proposed to Simon
when he originally suggested this feature. It seems that if the WAL
writer is the only one doing WAL flushes, then there must be some IPC
overhead - and context switching - involved whenever WAL is flushed.
But clearly we're saving something somewhere else, on the basis of
Peter's results, so maybe it's not worth worrying about. It does seem
pretty odd to have all the regular backends going through the WAL
writer and the auxiliary processes using a different mechanism,
though. If we got rid of that, maybe WAL writer wouldn't even require
a lock, if there's only one process that can be doing it at a time.
When we did sync rep it made sense to have the WALSender do the work
and for others to just wait. It would be quite strange to require a
different design for essentially the same thing for normal commits and
WAL flushes to local disk. I should mention the original proposal for
streaming replication had each backend send data to standby
independently and that was recognised as a bad idea after some time.
Same for sync rep also.
I don't think those cases are directly comparable. SR is talking to
another machine, and I can't imagine that there is a terribly
convenient or portable way for every backend that needs one to get a
hold of the file descriptor for the socket. Even if it could, the
data is sent as a stream, so if multiple backends sent to the same
file descriptor you'd have to have some locking to prevent messages
from getting interleaved. Or else you could have multiple
connections, or use UDP, but that gets rather complex. Anyway, none
of this is an issue for file I/O: anybody can open the file, and if
two backends write data at different offsets at the same time - or the
same data at the same offset at the same time - there's no problem.
So the fact that it wasn't a good idea for SR doesn't convince me that
it's also a bad idea here.
On the other hand, I'm not saying we *should* do it that way, either -
i.e. I am not trying to "require a different design" just because it's
fun to make people change things. Rather, I am trying to figure out
whether the design you've chosen is in fact the best one, and part of
that involves reasoning about why it might or might not be. There are
obvious reasons to think that having process A kick process B and go
to sleep, then have process B do some work and wake up process A might
be less efficient than having process A just do the work itself, in
the uncontended case. The fact that that isn't happening seems
awfully strange to me; it's hard to understand why 3 system calls are
faster than one. That suggests that either the current system is
badly broken in some way that this patch fixes (in which case, it
would be nice to know what the broken thing is) or that there's an
opportunity for further optimization of the new patch (either now or
later, depending on how much work we're talking about). Just to be
clear, it makes perfect sense that the new system is faster in the
contended case, and the benchmark numbers are very impressive. It's a
lot harder to understand why it's not slower in the uncontended case.
Not sure why it's odd to have backends do one thing and auxiliaries do
another. The whole point of auxiliary processes is that they do a
specific thing different to normal backends. Anyway, the important
thing is to have auxiliary processes be independent of each other as
much as possible, which simplifies error handling and state logic in
the postmaster.
Yeah, I guess the shutdown sequence could get a bit complex if we try
to make everyone go through the WAL writer all the time. But I wonder
if we could rejigger things somehow so that everything goes through
WAL writer if its dead.
+ * Wait for group commit, and then return true, if group commit serviced the
+ * request (not necessarily successfully). Otherwise, return false and fastpath
+ * out of here, allowing the backend to make alternative arrangements to flush
+ * its WAL in a more granular fashion. This can happen because the record that
+ * the backend requests to have flushed in so far into the future that to group
+ * commit it would
This trails off in mid-sentence.
+ if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
+ !ProcGlobal->groupCommitAvailable)
That seems like a gigantic value. I would think that we'd want to
forget about group commit any time we're behind by more than one
segment (maybe less).
+ if (ProcDiePending || QueryCancelPending)
+ {
+ GroupCommitCancelWait();
+
+ /*
+ * Let out-of-line interrupt handler take it
from here. Cannot raise
+ * an error here because we're in an enclosing
critical block.
+ */
+ break;
+ }
Presumably in this case we need to return false, not true. TBH, I
feel like this error handling is quite fragile, a fragility it
inherits from sync rep, on which it is based. I am not sure I have a
better idea, though.
+ /*
+ * Backends may still be waiting to have their transactions committed in
+ * batch, but calling this function prevents any backend from using group
+ * commit, and thus from having a dependency on the WAL Writer.
+ *
+ * When the WAL Writer finishes servicing those remaining backends, it
+ * will not have any additional work to do, and may shutdown.
+ */
+ void
+ GroupCommitDisable(void)
+ {
+ ProcGlobal->groupCommitAvailable = false;
+ }
I can't imagine that this is safe against memory ordering issues.
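For illustration, a sketch using the barrier primitives from the
September commit - I am not asserting that these placements are
sufficient; that is exactly the analysis the flag needs:

    /* writer side: force the store out before acting on it */
    ProcGlobal->groupCommitAvailable = false;
    pg_memory_barrier();

    /* reader side: pair a barrier before re-checking the flag */
    pg_read_barrier();
    if (!ProcGlobal->groupCommitAvailable)
        ;   /* make alternative flush arrangements */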
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 19 January 2012 17:40, Robert Haas <robertmhaas@gmail.com> wrote:
I don't know what you mean by this. I think removing wal_writer_delay
is premature, because I think it still may have some utility, and the
patch removes it. That's a separate change that should be factored
out of this patch and discussed separately.
FWIW, I don't really care too much if we keep wal_writer_delay,
provided it is only used in place of the patch's
WALWRITER_NORMAL_TIMEOUT constant. I will note in passing that I doubt
that the effect with asynchronous commits and hint bits is pronounced
enough to have ever transferred through to someone making a practical
recommendation to reduce wal_writer_delay to ameliorate clog
contention.
When we did sync rep it made sense to have the WALSender do the work
and for others to just wait. It would be quite strange to require a
different design for essentially the same thing for normal commits and
WAL flushes to local disk. I should mention the original proposal for
streaming replication had each backend send data to standby
independently and that was recognised as a bad idea after some time.
Same for sync rep also.
I don't think those cases are directly comparable. SR is talking to
another machine, and I can't imagine that there is a terribly
convenient or portable way for every backend that needs one to get a
hold of the file descriptor for the socket. Even if it could, the
data is sent as a stream, so if multiple backends sent to the same
file descriptor you'd have to have some locking to prevent messages
from getting interleaved. Or else you could have multiple
connections, or use UDP, but that gets rather complex. Anyway, none
of this is an issue for file I/O: anybody can open the file, and if
two backends write data at different offsets at the same time - or the
same data at the same offset at the same time - there's no problem.
So the fact that it wasn't a good idea for SR doesn't convince me that
it's also a bad idea here.
On the other hand, I'm not saying we *should* do it that way, either -
i.e. I am not trying to "require a different design" just because it's
fun to make people change things. Rather, I am trying to figure out
whether the design you've chosen is in fact the best one, and part of
that involves reasoning about why it might or might not be. There are
obvious reasons to think that having process A kick process B and go
to sleep, then have process B do some work and wake up process A might
be less efficient than having process A just do the work itself, in
the uncontended case. The fact that that isn't happening seems
awfully strange to me; it's hard to understand why 3 system calls are
faster than one. That suggests that either the current system is
badly broken in some way that this patch fixes (in which case, it
would be nice to know what the broken thing is) or that there's an
opportunity for further optimization of the new patch (either now or
later, depending on how much work we're talking about). Just to be
clear, it makes perfect sense that the new system is faster in the
contended case, and the benchmark numbers are very impressive. It's a
lot harder to understand why it's not slower in the uncontended case.
Not sure why it's odd to have backends do one thing and auxiliaries do
another. The whole point of auxiliary processes is that they do a
specific thing different to normal backends. Anyway, the important
thing is to have auxiliary processes be independent of each other as
much as possible, which simplifies error handling and state logic in
the postmaster.
Yeah, I guess the shutdown sequence could get a bit complex if we try
to make everyone go through the WAL writer all the time. But I wonder
if we could rejigger things somehow so that everything goes through
WAL writer if its dead.
I'm not sure what you mean by this last bit. It sounds a bit hazardous.
My problem with nominating a backend to the status of an auxiliary is
that no matter what way you cut it, it increases the failure surface
area, so to speak.
I'm not sure why Heikki thinks that it follows that having a
dependency on some backend is simpler than having one on an auxiliary
process. As to the question of IPC and context switch overhead, I'd
speculate that protecting access to a data structure with bookkeeping
information regarding which backend is currently the driver and so on
might imply considerably more overhead than IPC and context switching.
It might also be that having WAL Writer almost solely responsible for
this might facilitate more effective use of CPU cache.
On most modern architectures, system calls don't actually cause a full
context switch. The kernel can just enter a "mode switch" (go from
user mode to kernel mode, and then back to user mode). You can observe
this effect with vmstat. That's how 3 system calls might not look much
more expensive than 1.
+ if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
+ !ProcGlobal->groupCommitAvailable)
That seems like a gigantic value. I would think that we'd want to
forget about group commit any time we're behind by more than one
segment (maybe less).
I'm sure that you're right - I myself described it as horrible in my
original mail. I think that whatever value we set this to ought to be
well reasoned. Right now, your magic number doesn't seem much better
than my bogus heuristic (which only ever served as a placeholder
implementation that hinted at a well-principled formula). What's your
basis for suggesting that that limit would always be close to optimal?
Once again, I ask the question: what does a big flusher look like?
+ if (ProcDiePending || QueryCancelPending)
+ {
+ GroupCommitCancelWait();
+
+ /*
+ * Let out-of-line interrupt handler take it from here. Cannot raise
+ * an error here because we're in an enclosing critical block.
+ */
+ break;
+ }
Presumably in this case we need to return false, not true.
No, that is not an error. As described in the comment above that
function, the return value indicates if group commit serviced the
request, not necessarily successfully. If we were to return false,
we'd be acting as if group commit just didn't want to flush that LSN,
necessitating making alternative arrangements (calling XLogWrite()).
However, we should just hurry up and get to the interrupt handler to
error and report failure to the client (though not before executing
the code immediately before "return true" at the end of the function).
Maybe you could argue that it would be better to do that anyway, but I
think that it could mask problems and unnecessarily complicate
matters.
TBH, I feel like this error handling is quite fragile, a fragility it
inherits from sync rep, on which it is based. I am not sure I have a
better idea, though.
I agree with this analysis. However, since no better alternative
suggests itself, we may find that an XXX comment will go a long way.
+ void
+ GroupCommitDisable(void)
+ {
+ ProcGlobal->groupCommitAvailable = false;
+ }
I can't imagine that this is safe against memory ordering issues.
That had occurred to me. Strictly speaking, it doesn't necessarily
have to be, provided that some architecture with weak memory ordering
is not inclined to take an inordinate amount of time, or even forever,
to notice that the variable has been set to false. WAL Writer calls
this to "turn the tap off" to allow it to finish its group commit
backlog (allowing those backends to shutdown) and shutdown itself, but
it doesn't actually assume that turning the tap off will be
immediately effective.
However, now that I think about it, even if my assumption about the
behaviour of all machines regarding memory ordering there doesn't fall
down, it does seem possible that GroupCommitShutdownReady() could
return true, but that status could immediately change because some
backend on Alpha or something didn't see that the flag was set, so
your point is well taken.
By the way, how goes the introduction of memory barriers to Postgres?
I'm aware that you committed an implementation back in September, but
am not sure what the plan is as regards using them in 9.2.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Jan 19, 2012 at 10:46 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
On 19 January 2012 17:40, Robert Haas <robertmhaas@gmail.com> wrote:
I don't know what you mean by this. I think removing wal_writer_delay
is premature, because I think it still may have some utility, and the
patch removes it. That's a separate change that should be factored
out of this patch and discussed separately.
FWIW, I don't really care too much if we keep wal_writer_delay,
provided it is only used in place of the patch's
WALWRITER_NORMAL_TIMEOUT constant. I will note in passing that I doubt
that the effect with asynchronous commits and hint bits is pronounced
enough to have ever transferred through to someone making a practical
recommendation to reduce wal_writer_delay to ameliorate clog
contention.
It was very visible in some benchmarking Heikki did, and I was able to
reproduce it.
Yeah, I guess the shutdown sequence could get a bit complex if we try
to make everyone go through the WAL writer all the time. But I wonder
if we could rejigger things somehow so that everything goes through
WAL writer if its dead.
I'm not sure what you mean by this last bit. It sounds a bit hazardous.
That last "if" was supposed to say "unless", which may contribute to
the confusion.
My problem with nominating a backend to the status of an auxiliary is
that no matter what way you cut it, it increases the failure surface
area, so to speak.
I think that's the wrong way of thinking about it. Imagine this: we
maintain a queue of people who are waiting on the current WAL flush,
the current-flush-to LSN, and a queue of people who are waiting on the
next WAL flush, and a leader. All this data is protected by a
spinlock. When you want to flush WAL, you grab the spinlock. If the
current-flush-to LSN is greater than the LSN you need, you add
yourself to the waiting-on-current-flush queue, release the spinlock,
and go to sleep. Otherwise, if there's no leader, you become the
leader, enter your flush LSN as the current-flush-to-LSN, and release
the spinlock. If there is a leader, you add yourself to the
waiting-on-next-flush queue, release the spinlock, and sleep.
If you become the leader, you perform the flush. Then you retake the
spinlock, dequeue anyone waiting on the current flush, move all of the
next flush waiters (or those within a certain LSN distance) to the
current flush list, remember who is at the head of that new queue, and
release the spinlock. You then set a flag in the PGPROC of the
backend now at the head of the next-flush queue and wake him up; he
checks that flag on waking to know whether he is the leader or whether
he's just been woken because his flush is done. After waking him so
the next flush can get started, you wake all the people who were
waiting on the flush you already did.
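A sketch of that protocol, with hypothetical structure and helper names
(and treating XLogRecPtr as directly comparable for brevity):

    typedef struct
    {
        slock_t     mutex;
        XLogRecPtr  current_flush_to;   /* LSN the in-progress flush covers */
        PGPROC     *leader;             /* NULL when no flush in progress */
        PGPROC     *cur_waiters;        /* waiting on the current flush */
        PGPROC     *next_waiters;       /* waiting on the flush after that */
    } FlushQueue;

    bool        am_leader = false;

    SpinLockAcquire(&fq->mutex);
    if (fq->leader != NULL && fq->current_flush_to >= my_lsn)
        push(&fq->cur_waiters, MyProc);      /* current flush covers us */
    else if (fq->leader == NULL)
    {
        fq->leader = MyProc;                 /* become the leader */
        fq->current_flush_to = my_lsn;
        am_leader = true;
    }
    else
        push(&fq->next_waiters, MyProc);     /* wait for the next round */
    SpinLockRelease(&fq->mutex);

    if (am_leader)
    {
        /*
         * Flush; then retake the mutex, promote next_waiters (or those
         * within some LSN distance) to cur_waiters, flag the new head as
         * leader and wake him, then wake everyone whose flush is done.
         */
    }
    else
        PGSemaphoreLock(&MyProc->sem, true); /* on waking, check whether
                                              * we were made the leader */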
This may or may not be a good design, but I don't think it has any
more failure surface area than what you have here. In particular,
whether or not the WAL writer is running doesn't matter; the system
can run just fine without it, and can even still do group commit. To
look at it another way, it's not a question of whether we're treating
a regular backend as an auxiliary process; there's no rule anywhere
that backends can't be dependent on the proper operation of other
backends for proper functioning - there are MANY places that have that
property, including LWLockAcquire() and LWLockRelease(). Nor is there
any rule that background processes are more reliable than foreground
processes, nor do I believe they are. Much of the existing patch's
failure surface seems to me to come from the coordination it requires
between ordinary backends and the background writer, and possible race
conditions appertaining thereto: WAL writer dies, backend dies,
postmaster dies, postmaster and WAL writer die together, etc.
+ if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
+ !ProcGlobal->groupCommitAvailable)
That seems like a gigantic value. I would think that we'd want to
forget about group commit any time we're behind by more than one
segment (maybe less).
I'm sure that you're right - I myself described it as horrible in my
original mail. I think that whatever value we set this to ought to be
well reasoned. Right now, your magic number doesn't seem much better
than my bogus heuristic (which only ever served as a placeholder
implementation that hinted at a well-principled formula). What's your
basis for suggesting that that limit would always be close to optimal?
It's probably not - I suspect even a single WAL segment is still too
large. I'm pretty sure that it would be safe to always flush up to
the next block boundary, because we always write the whole block
anyway even if it's only partially filled. So there's no possible
regression there. Anything larger than "the end of the current 8kB
block" is going to be a trade-off between latency and throughput, and
it seems to me that there's probably only one way to figure out what's
reasonable: test a bunch of different values and see which ones
perform well in practice. Maybe we could answer part of the question
by writing a test program which does repeated sequential writes of
increasingly-large chunks of data. We might find out for example that
64kB is basically the same as 8kB because most of the overhead is in
the system call anyway, but beyond that the curve kinks. Or it may be
that 16kB is already twice as slow as 8kB, or that you can go up to
1MB without a problem. I don't see a way to know that without testing
it on a couple different pieces of hardware and seeing what happens.
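For example, something like this throwaway program (not intended for the
tree) would give the shape of the curve on a given disk:

    /*
     * Time repeated sequential write+fdatasync cycles at several chunk
     * sizes, to see where the cost curve kinks.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    int
    main(void)
    {
        static const size_t sizes[] = {8192, 16384, 65536, 1048576};
        int         i;

        for (i = 0; i < 4; i++)
        {
            size_t      sz = sizes[i];
            int         iters = 100,
                        fd,
                        j;
            char       *buf = malloc(sz);
            struct timeval t0,
                        t1;
            double      secs;

            memset(buf, 'x', sz);
            fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0600);
            gettimeofday(&t0, NULL);
            for (j = 0; j < iters; j++)
            {
                if (write(fd, buf, sz) != (ssize_t) sz || fdatasync(fd) != 0)
                {
                    perror("write/fdatasync");
                    return 1;
                }
            }
            gettimeofday(&t1, NULL);
            secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
            printf("%7zu bytes: %.3f ms per write+sync\n",
                   sz, secs * 1000.0 / iters);
            close(fd);
            free(buf);
        }
        unlink("testfile");
        return 0;
    }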
+ if (ProcDiePending || QueryCancelPending)
+ {
+ GroupCommitCancelWait();
+
+ /*
+ * Let out-of-line interrupt handler take it from here. Cannot raise
+ * an error here because we're in an enclosing critical block.
+ */
+ break;
+ }
Presumably in this case we need to return false, not true.
No, that is not an error. As described in the comment above that
function, the return value indicates if group commit serviced the
request, not necessarily successfully. If we were to return false,
we'd be acting as if group commit just didn't want to flush that LSN,
necessitating making alternative arrangements (calling XLogWrite()).
However, we should just hurry up and get to the interrupt handler to
error and report failure to the client (though not before executing
the code immediately before "return true" at the end of the function).
Maybe you could argue that it would be better to do that anyway, but I
think that it could mask problems and unnecessarily complicate
matters.
Ugh. Our current system doesn't even require this code to be
interruptible, I think: you can't receive cancel or die interrupts
either while waiting for an LWLock or while holding it.
By the way, how goes the introduction of memory barriers to Postgres?
I'm aware that you committed an implementation back in September, but
am not sure what the plan is as regards using them in 9.2.
Well, I was expecting a few patches to go in that needed them, but so
far that hasn't happened. I think it would be a fine thing if we had
a good reason to commit some code that uses them, so that we can watch
the buildfarm turn pretty colors.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I spent some time cleaning this up. Details below, but here are the
highlights:
* Reverted the removal of wal_writer_delay
* Doesn't rely on WAL writer. Any process can act as the "leader" now.
* Removed heuristic on big flushes
* Uses PGSemaphoreLock/Unlock instead of latches
On 20.01.2012 17:30, Robert Haas wrote:
On Thu, Jan 19, 2012 at 10:46 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
+ if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
+ !ProcGlobal->groupCommitAvailable)
That seems like a gigantic value. I would think that we'd want to
forget about group commit any time we're behind by more than one
segment (maybe less).
I'm sure that you're right - I myself described it as horrible in my
original mail. I think that whatever value we set this to ought to be
well reasoned. Right now, your magic number doesn't seem much better
than my bogus heuristic (which only ever served as a placeholder
implementation that hinted at a well-principled formula). What's your
basis for suggesting that that limit would always be close to optimal?
It's probably not - I suspect even a single WAL segment is still too
large. I'm pretty sure that it would be safe to always flush up to
the next block boundary, because we always write the whole block
anyway even if it's only partially filled. So there's no possible
regression there. Anything larger than "the end of the current 8kB
block" is going to be a trade-off between latency and throughput, and
it seems to me that there's probably only one way to figure out what's
reasonable: test a bunch of different values and see which ones
perform well in practice. Maybe we could answer part of the question
by writing a test program which does repeated sequential writes of
increasingly-large chunks of data. We might find out for example that
64kB is basically the same as 8kB because most of the overhead is in
the system call anyway, but beyond that the curve kinks. Or it may be
that 16kB is already twice as slow as 8kB, or that you can go up to
1MB without a problem. I don't see a way to know that without testing
it on a couple different pieces of hardware and seeing what happens.
I ripped away that heuristic for a flush that's "too large". After
pondering it for a while, I came to the conclusion that as implemented
in the patch, it was pointless. The thing is, if the big flush doesn't
go through the group commit machinery, it's going to grab the
WALWriteLock straight away. Before any smaller commits can proceed, they
will need to grab that lock anyway, so the effect is the same as if the
big flush had just joined the queue.
Maybe we should have a heuristic to split a large flush into smaller
chunks. The WAL segment boundary would be a quite natural split point,
for example, because when crossing the file boundary you have to issue
separate fsync()s for the files anyway. But I'm happy with leaving that
out for now, it's not any worse than it used to be without group commit
anyway.
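(If we ever want it, the shape is simple enough - a sketch, with
hypothetical helpers and LSNs treated as ordinals for brevity:)

    /*
     * Split one large flush at WAL segment boundaries, waking any waiters
     * that are already satisfied after each segment's fsync.
     */
    while (flushed < target)
    {
        XLogRecPtr  upto = Min(target, NextSegBoundary(flushed));

        FlushOneChunk(upto);        /* write + fsync this file's portion */
        WakeWaitersUpTo(upto);      /* release anyone already satisfied */
        flushed = upto;
    }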
Ugh. Our current system doesn't even require this code to be
interruptible, I think: you can't receive cancel or die interrupts
either while waiting for an LWLock or while holding it.
Right. Or within HOLD/RESUME_INTERRUPTS blocks.
The patch added some CHECK_FOR_INTERRUPTS() calls into various places in
the commit/abort codepaths, but that's all dead code; they're all within
HOLD/RESUME_INTERRUPTS blocks.
I replaced the usage of latches with the more traditional
PGSemaphoreLock/Unlock. The semaphore model works just fine in this case,
where we have a lwlock to guard the wait queue, and when a process is
waiting we know it will be woken up, or something is messed up at a pretty
low level. We don't need a timeout or to wake up at other signals while
waiting. Furthermore, the WAL writer didn't have a latch before this
patch; it's not many lines of code to initialize the latch and set up
the signal handler for it, but it already has a semaphore that's ready
to use.
I wonder if we should rename the file into "xlogflush.c" or something
like that, to make it clear that this works with any XLOG flushes, not
just commits? Group commit is the usual term for this feature, so we
should definitely mention that in the comments, but it might be better
to rename the files/functions to emphasize that this is about WAL
flushing in general.
This probably needs some further cleanup to fix things I've broken, and
I haven't done any performance testing, but it's progress. Do you have a
shell script or something that you used for the performance tests that I
could run? Or would you like to re-run the tests you did earlier with
this patch?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
group_commit_2012_01_21.patch (text/x-diff)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1760,1809 **** SET ENABLE_SEQSCAN TO OFF;
</listitem>
</varlistentry>
- <varlistentry id="guc-commit-delay" xreflabel="commit_delay">
- <term><varname>commit_delay</varname> (<type>integer</type>)</term>
- <indexterm>
- <primary><varname>commit_delay</> configuration parameter</primary>
- </indexterm>
- <listitem>
- <para>
- When the commit data for a transaction is flushed to disk, any
- additional commits ready at that time are also flushed out.
- <varname>commit_delay</varname> adds a time delay, set in
- microseconds, before a transaction attempts to
- flush the WAL buffer out to disk. A nonzero delay can allow more
- transactions to be committed with only one flush operation, if
- system load is high enough that additional transactions become
- ready to commit within the given interval. But the delay is
- just wasted if no other transactions become ready to
- commit. Therefore, the delay is only performed if at least
- <varname>commit_siblings</varname> other transactions are
- active at the instant that a server process has written its
- commit record.
- The default <varname>commit_delay</> is zero (no delay).
- Since all pending commit data will be written at every flush
- regardless of this setting, it is rare that adding delay
- by increasing this parameter will actually improve performance.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="guc-commit-siblings" xreflabel="commit_siblings">
- <term><varname>commit_siblings</varname> (<type>integer</type>)</term>
- <indexterm>
- <primary><varname>commit_siblings</> configuration parameter</primary>
- </indexterm>
- <listitem>
- <para>
- Minimum number of concurrent open transactions to require
- before performing the <varname>commit_delay</> delay. A larger
- value makes it more probable that at least one other
- transaction will become ready to commit during the delay
- interval. The default is five transactions.
- </para>
- </listitem>
- </varlistentry>
-
</variablelist>
</sect2>
<sect2 id="runtime-config-wal-checkpoints">
--- 1760,1765 ----
*** a/doc/src/sgml/wal.sgml
--- b/doc/src/sgml/wal.sgml
***************
*** 367,386 ****
of data corruption.
</para>
- <para>
- <xref linkend="guc-commit-delay"> also sounds very similar to
- asynchronous commit, but it is actually a synchronous commit method
- (in fact, <varname>commit_delay</varname> is ignored during an
- asynchronous commit). <varname>commit_delay</varname> causes a delay
- just before a synchronous commit attempts to flush
- <acronym>WAL</acronym> to disk, in the hope that a single flush
- executed by one such transaction can also serve other transactions
- committing at about the same time. Setting <varname>commit_delay</varname>
- can only help when there are many concurrently committing transactions,
- and it is difficult to tune it to a value that actually helps rather
- than hurt throughput.
- </para>
-
</sect1>
<sect1 id="wal-configuration">
--- 367,372 ----
***************
*** 556,579 ****
</para>
<para>
- The <xref linkend="guc-commit-delay"> parameter defines for how many
- microseconds the server process will sleep after writing a commit
- record to the log with <function>LogInsert</function> but before
- performing a <function>LogFlush</function>. This delay allows other
- server processes to add their commit records to the log so as to have all
- of them flushed with a single log sync. No sleep will occur if
- <xref linkend="guc-fsync">
- is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
- other sessions are currently in active transactions; this avoids
- sleeping when it's unlikely that any other session will commit soon.
- Note that on most platforms, the resolution of a sleep request is
- ten milliseconds, so that any nonzero <varname>commit_delay</varname>
- setting between 1 and 10000 microseconds would have the same effect.
- Good values for these parameters are not yet clear; experimentation
- is encouraged.
- </para>
-
- <para>
The <xref linkend="guc-wal-sync-method"> parameter determines how
<productname>PostgreSQL</productname> will ask the kernel to force
<acronym>WAL</acronym> updates out to disk.
--- 542,547 ----
*** a/src/backend/access/transam/Makefile
--- b/src/backend/access/transam/Makefile
***************
*** 13,19 **** top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
! twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogutils.o
include $(top_srcdir)/src/backend/common.mk
--- 13,19 ----
include $(top_builddir)/src/Makefile.global
OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
! twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogutils.o groupcommit.o
include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/transam/groupcommit.c
***************
*** 0 ****
--- 1,346 ----
+ /*-------------------------------------------------------------------------
+ *
+ * groupcommit.c
+ *
+ * Group Commit is new as of PostgreSQL 9.2. It is a performance optimization
+ * where the fsyncing of WAL is batched across many more-or-less concurrently
+ * committing transactions, thereby amortizing the cost of that notoriously
+ * expensive operation, and potentially greatly increasing transaction
+ * throughput.
+ *
+ * Transactions requesting XlogFlush() wait until the WALWriter has performed
+ * the write for them, then will be woken to continue. This module contains the
+ * code for waiting and release of backends.
+ *
+ * Manage waiting backends by having a single ordered queue of waiting backends,
+ * so that we can avoid searching through all waiters each time the WAL Writer
+ * writes/fsyncs.
+ *
+ * Copyright (c) 2010-2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/groupcommit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+ #include "postgres.h"
+
+ #include <unistd.h>
+
+ #include "access/groupcommit.h"
+ #include "access/xact.h"
+ #include "access/xlog.h"
+ #include "access/xlog_internal.h"
+ #include "miscadmin.h"
+ #include "storage/pmsignal.h"
+ #include "storage/proc.h"
+ #include "tcop/tcopprot.h"
+ #include "utils/builtins.h"
+
+ /*
+ * Shared memory structure
+ */
+ typedef struct
+ {
+ SHM_QUEUE queue; /* list of processes waiting to be woken up */
+ PGPROC *driver; /* process currently flushing the WAL */
+ } GroupCommitData;
+
+ static GroupCommitData *gcShared;
+
+
+ static bool amdriving = false;
+
+ /* Shmem functions */
+ Size
+ GroupCommitShmemSize(void)
+ {
+ return sizeof(GroupCommitData);
+ }
+
+ void
+ GroupCommitShmemInit(void)
+ {
+ bool found;
+
+ /*
+ * Create or attach to the GroupCommitShared structure.
+ *
+ */
+ gcShared = (GroupCommitData *)
+ ShmemInitStruct("Group Commit", GroupCommitShmemSize(), &found);
+
+ if (!found)
+ {
+ /* First time through, so initialize it */
+ SHMQueueInit(&gcShared->queue);
+ gcShared->driver = NULL;
+ }
+ }
+
+ /*
+ * ================================================
+ * Group Commit functions for normal user backends
+ * ================================================
+ */
+
+ /*
+ * Wait for group commit to service our request to flush WAL up to the
+ * given LSN (not necessarily successfully). If no-one else is currently
+ * driving a flush, we take the driver's seat ourselves and flush on
+ * behalf of everyone queued behind us.
+ *
+ * Backends start in state GROUP_COMMIT_NOT_WAITING. A backend sets its
+ * state to GROUP_COMMIT_WAITING before adding itself to the wait queue.
+ * During GroupCommitReleaseWaiters(), the driver changes the state to
+ * GROUP_COMMIT_WAIT_COMPLETE once write/fsync is confirmed.
+ *
+ * The backend then resets its state to GROUP_COMMIT_NOT_WAITING.
+ */
+ void
+ GroupCommitWaitForLSN(XLogRecPtr XactCommitLSN)
+ {
+ Assert(SHMQueueIsDetached(&(MyProc->waitLSNLinks)));
+ Assert(MyProc->grpCommitState == GROUP_COMMIT_NOT_WAITING);
+ Assert(!LWLockHeldByMe(SyncRepLock));
+
+ Assert(!amdriving);
+
+ LWLockAcquire(GroupCommitLock, LW_EXCLUSIVE);
+
+ /* See if we have to drive */
+ if (gcShared->driver == NULL)
+ {
+ amdriving = true;
+ gcShared->driver = MyProc;
+ LWLockRelease(GroupCommitLock);
+
+ DoXLogFlush(XactCommitLSN);
+ return;
+ }
+ else
+ {
+ /*
+ * Set our waitLSN, add ourselves to the queue, and wait for the
+ * driver to wake us up.
+ */
+ MyProc->waitLSN = XactCommitLSN;
+ MyProc->grpCommitState = GROUP_COMMIT_WAITING;
+ BatchCommitQueueInsert(&gcShared->queue);
+ Assert(BatchCommitQueueIsOrderedByLSN(&gcShared->queue));
+
+ LWLockRelease(GroupCommitLock);
+
+ #ifdef GROUP_COMMIT_DEBUG
+ elog(LOG, "waiting for LSN %X/%X",
+ XactCommitLSN.xlogid,
+ XactCommitLSN.xrecoff);
+ #endif
+
+ /*
+ * Wait for specified LSN to be confirmed.
+ *
+ * Each proc has its own wait latch, so we perform a normal latch
+ * check/wait loop here.
+ */
+ for (;;)
+ {
+ GroupCommitState CommitState;
+
+ /*
+ * Wait until the driver services our request to flush WAL (or
+ * wakes us up to become the next driver).
+ */
+ PGSemaphoreLock(&MyProc->sem, true);
+
+ /*
+ * Try checking the state without the lock first. There's no
+ * guarantee that we'll read the most up-to-date value, so if it
+ * looks like we're still waiting, recheck while holding the lock.
+ * But if it looks like we're done, we must really be done, because
+ * once the driver updates our state to GROUP_COMMIT_WAIT_COMPLETE,
+ * it will never update it again, so we can't be seeing a stale
+ * value in that case. Control will never reach here if the state
+ * is still GROUP_COMMIT_NOT_WAITING, as it is for auxiliary
+ * processes.
+ */
+ CommitState = ((volatile PGPROC *) MyProc)->grpCommitState;
+ Assert(CommitState != GROUP_COMMIT_NOT_WAITING);
+ if (CommitState == GROUP_COMMIT_WAIT_COMPLETE)
+ break;
+ if (CommitState == GROUP_COMMIT_WAITING)
+ {
+ LWLockAcquire(GroupCommitLock, LW_EXCLUSIVE);
+ CommitState = MyProc->grpCommitState;
+
+ /*
+ * If we turn out to be finished after all, break out. Otherwise,
+ * if no-one is driving, take the driver's seat.
+ */
+ if (CommitState == GROUP_COMMIT_WAIT_COMPLETE)
+ {
+ LWLockRelease(GroupCommitLock);
+ break;
+ }
+ if (CommitState == GROUP_COMMIT_WAITING && gcShared->driver == NULL)
+ {
+ /* Remove ourself from the queue */
+ SHMQueueDelete(&(MyProc->waitLSNLinks));
+ MyProc->grpCommitState = GROUP_COMMIT_NOT_WAITING;
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+
+ amdriving = true;
+ gcShared->driver = MyProc;
+ LWLockRelease(GroupCommitLock);
+
+ DoXLogFlush(XactCommitLSN);
+ return;
+ }
+
+ LWLockRelease(GroupCommitLock);
+ }
+ }
+
+ /*
+ * The WAL has been flushed up to our LSN and we have been removed from
+ * the queue. Clean up state and leave. It's OK to reset these shared
+ * memory fields without holding GroupCommitLock, because the WAL
+ * writer will ignore us anyway when we're not on the queue.
+ */
+ Assert(SHMQueueIsDetached(&(MyProc->waitLSNLinks)));
+ MyProc->grpCommitState = GROUP_COMMIT_NOT_WAITING;
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+ }
+ }
+
+ /*
+ * Walk queue from head. Set the state of any backends that need to be woken
+ * due to waiting on an LSN up to flushLSN, then remove them from the queue and
+ * wake them up.
+ *
+ * Caller must hold GroupCommitLock in exclusive mode.
+ */
+ void
+ GroupCommitReleaseWaiters(XLogRecPtr flushLSN)
+ {
+ PGPROC *proc = NULL;
+ PGPROC *thisproc = NULL;
+ int numprocs = 0;
+
+ LWLockAcquire(GroupCommitLock, LW_EXCLUSIVE);
+
+ if (gcShared->driver == MyProc)
+ gcShared->driver = NULL;
+ amdriving = false;
+
+ Assert(BatchCommitQueueIsOrderedByLSN(&gcShared->queue));
+ Assert(!LWLockHeldByMe(SyncRepLock));
+
+ proc = (PGPROC *) SHMQueueNext(&gcShared->queue,
+ &gcShared->queue,
+ offsetof(PGPROC, waitLSNLinks));
+
+ while (proc)
+ {
+ /*
+ * Assume the queue is ordered by LSN (assumption is verified elsewhere
+ * by assertions).
+ */
+ if (XLByteLT(flushLSN, proc->waitLSN))
+ break;
+
+ /*
+ * Move to next proc, so we can delete thisproc from the queue.
+ * thisproc is valid, proc may be NULL after this.
+ */
+ thisproc = proc;
+ proc = (PGPROC *) SHMQueueNext(&gcShared->queue,
+ &(proc->waitLSNLinks),
+ offsetof(PGPROC, waitLSNLinks));
+
+ /*
+ * Set state to complete; see GroupCommitWaitForLSN() for discussion of
+ * the various states.
+ */
+ thisproc->grpCommitState = GROUP_COMMIT_WAIT_COMPLETE;
+ /*
+ * Remove thisproc from queue.
+ */
+ SHMQueueDelete(&(thisproc->waitLSNLinks));
+
+ /*
+ * Wake only when we have set state and removed from queue.
+ */
+ PGSemaphoreUnlock(&thisproc->sem);
+
+ numprocs++;
+ }
+
+ /*
+ * Finally, wake up the process that's *last* in the queue, with the
+ * highest LSN. He's not done yet, but will become the next driver.
+ */
+ if (gcShared->driver == NULL)
+ {
+ proc = (PGPROC *) SHMQueuePrev(&gcShared->queue,
+ &gcShared->queue,
+ offsetof(PGPROC, waitLSNLinks));
+ if (proc)
+ PGSemaphoreUnlock(&proc->sem);
+ }
+
+ LWLockRelease(GroupCommitLock);
+
+ #ifdef GROUP_COMMIT_DEBUG
+ elog(LOG, "released %d procs up to %X/%X",
+ numprocs,
+ flushLSN.xlogid,
+ flushLSN.xrecoff);
+ #endif
+ }
+
+ void
+ GroupCommitCleanupAtProcExit(void)
+ {
+ PGPROC *proc;
+
+ /* XXX: check if we're driving, and clear if so */
+ if (!SHMQueueIsDetached(&(MyProc->waitLSNLinks)))
+ {
+ Assert(MyProc->grpCommitState != GROUP_COMMIT_NOT_WAITING);
+ LWLockAcquire(GroupCommitLock, LW_EXCLUSIVE);
+ SHMQueueDelete(&(MyProc->waitLSNLinks));
+ LWLockRelease(GroupCommitLock);
+ }
+ if (amdriving)
+ {
+ amdriving = false;
+ LWLockAcquire(GroupCommitLock, LW_EXCLUSIVE);
+ /*
+ * Zap the old driver. Presumably it was myself, but since we
+ * shouldn't get here in the first place, maybe we're all confused.
+ * In any case, it's always safe to clear the driver field, as long
+ * as we wake up someone from the wait queue to become the new driver.
+ * It's ok if two backends think they're driving.
+ */
+ gcShared->driver = NULL;
+
+ /*
+ * If the queue is not empty, wake up someone to become the new
+ * driver.
+ */
+ proc = (PGPROC *) SHMQueueNext(&gcShared->queue,
+ &gcShared->queue,
+ offsetof(PGPROC, waitLSNLinks));
+ if (proc)
+ PGSemaphoreUnlock(&proc->sem);
+
+ LWLockRelease(GroupCommitLock);
+ }
+ }
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 67,75 **** bool XactDeferrable;
int synchronous_commit = SYNCHRONOUS_COMMIT_ON;
- int CommitDelay = 0; /* precommit delay in microseconds */
- int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
-
/*
* MyXactAccessedTempRel is set when a temporary relation is accessed.
* We don't allow PREPARE TRANSACTION in that case. (This is global
--- 67,72 ----
***************
*** 1094,1115 **** RecordTransactionCommit(void)
if ((wrote_xlog && synchronous_commit > SYNCHRONOUS_COMMIT_OFF) ||
forceSyncCommit || nrels > 0)
{
- /*
- * Synchronous commit case:
- *
- * Sleep before flush! So we can flush more than one commit records
- * per single fsync. (The idea is some other backend may do the
- * XLogFlush while we're sleeping. This needs work still, because on
- * most Unixen, the minimum select() delay is 10msec or more, which is
- * way too long.)
- *
- * We do not sleep if enableFsync is not turned on, nor if there are
- * fewer than CommitSiblings other backends with active transactions.
- */
- if (CommitDelay > 0 && enableFsync &&
- MinimumActiveBackends(CommitSiblings))
- pg_usleep(CommitDelay);
-
XLogFlush(XactLastRecEnd);
/*
--- 1091,1096 ----
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 24,29 ****
--- 24,30 ----
#include <unistd.h>
#include "access/clog.h"
+ #include "access/groupcommit.h"
#include "access/multixact.h"
#include "access/subtrans.h"
#include "access/transam.h"
***************
*** 2020,2033 **** UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
* Ensure that all XLOG data through the given position is flushed to disk.
*
* NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
! * already held, and we try to avoid acquiring it if possible.
*/
void
XLogFlush(XLogRecPtr record)
{
- XLogRecPtr WriteRqstPtr;
- XLogwrtRqst WriteRqst;
-
/*
* During REDO, we are reading not writing WAL. Therefore, instead of
* trying to flush the WAL, we should update minRecoveryPoint instead. We
--- 2021,2036 ----
* Ensure that all XLOG data through the given position is flushed to disk.
*
* NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
! * already held, and we try to avoid acquiring it if possible. Moreover,
! * XLogFlush is the main and only entry point for group commit.
! *
! * We'll actually call XLogWrite from here in the event of not being able to
! * participate in a group commit, due to a backend requesting the flushing of
! * an LSN that is very far ahead of the flush pointer's current position.
*/
void
XLogFlush(XLogRecPtr record)
{
/*
* During REDO, we are reading not writing WAL. Therefore, instead of
* trying to flush the WAL, we should update minRecoveryPoint instead. We
***************
*** 2059,2079 **** XLogFlush(XLogRecPtr record)
* Since fsync is usually a horribly expensive operation, we try to
* piggyback as much data as we can on each fsync: if we see any more data
* entered into the xlog buffer, we'll write and fsync that too, so that
! * the final value of LogwrtResult.Flush is as large as possible. This
! * gives us some chance of avoiding another fsync immediately after.
*/
- /* initialize to given target; may increase below */
- WriteRqstPtr = record;
-
/* read LogwrtResult and update local state */
{
/* use volatile pointer to prevent code rearrangement */
volatile XLogCtlData *xlogctl = XLogCtl;
SpinLockAcquire(&xlogctl->info_lck);
! if (XLByteLT(WriteRqstPtr, xlogctl->LogwrtRqst.Write))
! WriteRqstPtr = xlogctl->LogwrtRqst.Write;
LogwrtResult = xlogctl->LogwrtResult;
SpinLockRelease(&xlogctl->info_lck);
}
--- 2062,2083 ----
* Since fsync is usually a horribly expensive operation, we try to
* piggyback as much data as we can on each fsync: if we see any more data
* entered into the xlog buffer, we'll write and fsync that too, so that
! * the final value of LogwrtResult.Flush is as large as is reasonable; in
! * the group commit case (i.e. ordinary multiuser backends that a group
! * commit would be useful for), that will be as many backends as are queued
! * up.
*/
/* read LogwrtResult and update local state */
{
/* use volatile pointer to prevent code rearrangement */
volatile XLogCtlData *xlogctl = XLogCtl;
SpinLockAcquire(&xlogctl->info_lck);
! if (XLByteLT(xlogctl->LogwrtRqst.Write, record))
! xlogctl->LogwrtRqst.Write = record;
! if (XLByteLT(xlogctl->LogwrtRqst.Flush, record))
! xlogctl->LogwrtRqst.Flush = record;
LogwrtResult = xlogctl->LogwrtResult;
SpinLockRelease(&xlogctl->info_lck);
}
***************
*** 2081,2116 **** XLogFlush(XLogRecPtr record)
/* done already? */
if (!XLByteLE(record, LogwrtResult.Flush))
{
! /* now wait for the write lock */
! LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
! LogwrtResult = XLogCtl->Write.LogwrtResult;
! if (!XLByteLE(record, LogwrtResult.Flush))
! {
! /* try to write/flush later additions to XLOG as well */
! if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! {
! XLogCtlInsert *Insert = &XLogCtl->Insert;
! uint32 freespace = INSERT_FREESPACE(Insert);
! if (freespace < SizeOfXLogRecord) /* buffer is full */
! WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! else
! {
! WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! WriteRqstPtr.xrecoff -= freespace;
! }
! LWLockRelease(WALInsertLock);
! WriteRqst.Write = WriteRqstPtr;
! WriteRqst.Flush = WriteRqstPtr;
! }
! else
! {
! WriteRqst.Write = WriteRqstPtr;
! WriteRqst.Flush = record;
! }
! XLogWrite(WriteRqst, false, false);
! }
! LWLockRelease(WALWriteLock);
}
END_CRIT_SECTION();
--- 2085,2100 ----
/* done already? */
if (!XLByteLE(record, LogwrtResult.Flush))
{
! /* use volatile pointer to prevent code rearrangement */
! volatile XLogCtlData *xlogctl = XLogCtl;
! /* group commit */
! GroupCommitWaitForLSN(record);
!
! /* update our result */
! SpinLockAcquire(&xlogctl->info_lck);
! LogwrtResult = xlogctl->LogwrtResult;
! SpinLockRelease(&xlogctl->info_lck);
}
END_CRIT_SECTION();
***************
*** 2144,2149 **** XLogFlush(XLogRecPtr record)
--- 2128,2180 ----
}
/*
+ * A lower-level version of XLogFlush(). This is used from group commit code
+ * to actually perform a flush, while XLogFlush() is the facade for the
+ * group commit code. Got it?
+ */
+ XLogRecPtr
+ DoXLogFlush(XLogRecPtr record)
+ {
+ XLogRecPtr WriteRqstPtr;
+ XLogwrtRqst WriteRqst;
+
+ WriteRqstPtr = record;
+
+ /* read LogwrtResult and update local state */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ if (XLByteLT(xlogctl->LogwrtRqst.Write, record))
+ WriteRqstPtr = xlogctl->LogwrtRqst.Write;
+ LogwrtResult = xlogctl->LogwrtResult;
+ SpinLockRelease(&xlogctl->info_lck);
+ }
+
+ /* done already? */
+ if (!XLByteLE(record, LogwrtResult.Flush))
+ {
+ /* now wait for the write lock */
+ LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+
+ LogwrtResult = XLogCtl->Write.LogwrtResult;
+ if (!XLByteLE(record, LogwrtResult.Flush))
+ {
+ WriteRqst.Write = WriteRqstPtr;
+ WriteRqst.Flush = WriteRqstPtr;
+ XLogWrite(WriteRqst, false, false);
+ }
+ LWLockRelease(WALWriteLock);
+ }
+
+ /* Wake up everyone else that we flushed as a side-effect */
+ GroupCommitReleaseWaiters(LogwrtResult.Flush);
+
+ return LogwrtResult.Flush;
+ }
+
+ /*
* Flush xlog, but without specifying exactly where to flush to.
*
* We normally flush only completed blocks; but if there is nothing to do on
***************
*** 2235,2240 **** XLogBackgroundFlush(void)
--- 2266,2281 ----
LWLockRelease(WALWriteLock);
END_CRIT_SECTION();
+
+ /*
+ * Wake up everyone else that we flushed as a side-effect. If there are
+ * active commits happening, the driver will get the WALWriteLock as
+ * soon as we release it, and if it no longer has any work to do it will
+ * wake up everyone. But if it does, ie. if our flush didn't cover the
+ * current driver, it's good that we wake up everyone that we did flush,
+ * or they will have to wait for another flush cycle.
+ */
+ GroupCommitReleaseWaiters(LogwrtResult.Flush);
}
/*
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 34,39 ****
--- 34,45 ----
* take some time. Once caught up, the current highest priority standby
* will release waiters from the queue.
*
+ * As of Postgres 9.2, some of the functions that were previously private to
+ * this module are generalized and exposed as generic "batch commit"
+ * infrastructure, for use by the group commit feature, which similarly batches
+ * commits, though it does so merely as a local performance optimization for
+ * user backends.
+ *
* Portions Copyright (c) 2010-2012, PostgreSQL Global Development Group
*
* IDENTIFICATION
***************
*** 65,81 **** char *SyncRepStandbyNames;
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
static bool announce_next_takeover = true;
- static void SyncRepQueueInsert(void);
static void SyncRepCancelWait(void);
static int SyncRepGetStandbyPriority(void);
- #ifdef USE_ASSERT_CHECKING
- static bool SyncRepQueueIsOrderedByLSN(void);
- #endif
-
/*
* ===========================================================
* Synchronous Replication functions for normal user backends
--- 71,86 ----
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+ /* Convenience macros for SyncRep use of batch commit infrastructure */
+ #define SyncRepQueueInsert() BatchCommitQueueInsert(&(WalSndCtl->SyncRepQueue))
+ #define SyncRepQueueIsOrderedByLSN() BatchCommitQueueIsOrderedByLSN(&(WalSndCtl->SyncRepQueue))
+
static bool announce_next_takeover = true;
static void SyncRepCancelWait(void);
static int SyncRepGetStandbyPriority(void);
/*
* ===========================================================
* Synchronous Replication functions for normal user backends
***************
*** 105,112 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
if (!SyncRepRequested() || !SyncStandbysDefined())
return;
! Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
--- 110,118 ----
if (!SyncRepRequested() || !SyncStandbysDefined())
return;
! Assert(SHMQueueIsDetached(&(MyProc->waitLSNLinks)));
Assert(WalSndCtl != NULL);
+ Assert(!LWLockHeldByMe(GroupCommitLock));
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
***************
*** 253,259 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* holding SyncRepLock, because any walsenders will ignore us anyway when
* we're not on the queue.
*/
! Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
MyProc->syncRepState = SYNC_REP_NOT_WAITING;
MyProc->waitLSN.xlogid = 0;
MyProc->waitLSN.xrecoff = 0;
--- 259,265 ----
* holding SyncRepLock, because any walsenders will ignore us anyway when
* we're not on the queue.
*/
! Assert(SHMQueueIsDetached(&(MyProc->waitLSNLinks)));
MyProc->syncRepState = SYNC_REP_NOT_WAITING;
MyProc->waitLSN.xlogid = 0;
MyProc->waitLSN.xrecoff = 0;
***************
*** 267,285 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
}
/*
! * Insert MyProc into SyncRepQueue, maintaining sorted invariant.
*
* Usually we will go at tail of queue, though it's possible that we arrive
* here out of order, so start at tail and work back to insertion point.
*/
! static void
! SyncRepQueueInsert(void)
{
PGPROC *proc;
! proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue),
! &(WalSndCtl->SyncRepQueue),
! offsetof(PGPROC, syncRepLinks));
while (proc)
{
--- 273,293 ----
}
/*
! * Insert MyProc into a queue (usually SyncRepQueue), maintaining sorted
! * invariant.
*
* Usually we will go at tail of queue, though it's possible that we arrive
* here out of order, so start at tail and work back to insertion point.
*/
! void
! BatchCommitQueueInsert(SHM_QUEUE *waitQueue)
{
PGPROC *proc;
!
! proc = (PGPROC *) SHMQueuePrev(waitQueue,
! waitQueue,
! offsetof(PGPROC, waitLSNLinks));
while (proc)
{
***************
*** 290,304 **** SyncRepQueueInsert(void)
if (XLByteLT(proc->waitLSN, MyProc->waitLSN))
break;
! proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue),
! &(proc->syncRepLinks),
! offsetof(PGPROC, syncRepLinks));
}
if (proc)
! SHMQueueInsertAfter(&(proc->syncRepLinks), &(MyProc->syncRepLinks));
else
! SHMQueueInsertAfter(&(WalSndCtl->SyncRepQueue), &(MyProc->syncRepLinks));
}
/*
--- 298,312 ----
if (XLByteLT(proc->waitLSN, MyProc->waitLSN))
break;
! proc = (PGPROC *) SHMQueuePrev(waitQueue,
! &(proc->waitLSNLinks),
! offsetof(PGPROC, waitLSNLinks));
}
if (proc)
! SHMQueueInsertAfter(&(proc->waitLSNLinks), &(MyProc->waitLSNLinks));
else
! SHMQueueInsertAfter(waitQueue, &(MyProc->waitLSNLinks));
}
/*
***************
*** 307,315 **** SyncRepQueueInsert(void)
static void
SyncRepCancelWait(void)
{
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
! if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
! SHMQueueDelete(&(MyProc->syncRepLinks));
MyProc->syncRepState = SYNC_REP_NOT_WAITING;
LWLockRelease(SyncRepLock);
}
--- 315,324 ----
static void
SyncRepCancelWait(void)
{
+ Assert(!LWLockHeldByMe(GroupCommitLock));
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
! if (!SHMQueueIsDetached(&(MyProc->waitLSNLinks)))
! SHMQueueDelete(&(MyProc->waitLSNLinks));
MyProc->syncRepState = SYNC_REP_NOT_WAITING;
LWLockRelease(SyncRepLock);
}
***************
*** 317,326 **** SyncRepCancelWait(void)
void
SyncRepCleanupAtProcExit(void)
{
! if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
{
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
! SHMQueueDelete(&(MyProc->syncRepLinks));
LWLockRelease(SyncRepLock);
}
}
--- 326,335 ----
void
SyncRepCleanupAtProcExit(void)
{
! if (!SHMQueueIsDetached(&(MyProc->waitLSNLinks)))
{
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
! SHMQueueDelete(&(MyProc->waitLSNLinks));
LWLockRelease(SyncRepLock);
}
}
***************
*** 388,393 **** SyncRepReleaseWaiters(void)
--- 397,403 ----
* change pg_stat_get_wal_senders().
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ Assert(!LWLockHeldByMe(GroupCommitLock));
for (i = 0; i < max_wal_senders; i++)
{
***************
*** 522,531 **** SyncRepWakeQueue(bool all)
int numprocs = 0;
Assert(SyncRepQueueIsOrderedByLSN());
proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
&(WalSndCtl->SyncRepQueue),
! offsetof(PGPROC, syncRepLinks));
while (proc)
{
--- 532,542 ----
int numprocs = 0;
Assert(SyncRepQueueIsOrderedByLSN());
+ Assert(!LWLockHeldByMe(GroupCommitLock));
proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
&(WalSndCtl->SyncRepQueue),
! offsetof(PGPROC, waitLSNLinks));
while (proc)
{
***************
*** 541,548 **** SyncRepWakeQueue(bool all)
*/
thisproc = proc;
proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! &(proc->syncRepLinks),
! offsetof(PGPROC, syncRepLinks));
/*
* Set state to complete; see SyncRepWaitForLSN() for discussion of
--- 552,559 ----
*/
thisproc = proc;
proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! &(proc->waitLSNLinks),
! offsetof(PGPROC, waitLSNLinks));
/*
* Set state to complete; see SyncRepWaitForLSN() for discussion of
***************
*** 553,559 **** SyncRepWakeQueue(bool all)
/*
* Remove thisproc from queue.
*/
! SHMQueueDelete(&(thisproc->syncRepLinks));
/*
* Wake only when we have set state and removed from queue.
--- 564,570 ----
/*
* Remove thisproc from queue.
*/
! SHMQueueDelete(&(thisproc->waitLSNLinks));
/*
* Wake only when we have set state and removed from queue.
***************
*** 604,611 **** SyncRepUpdateSyncStandbysDefined(void)
}
#ifdef USE_ASSERT_CHECKING
! static bool
! SyncRepQueueIsOrderedByLSN(void)
{
PGPROC *proc = NULL;
XLogRecPtr lastLSN;
--- 615,622 ----
}
#ifdef USE_ASSERT_CHECKING
! bool
! BatchCommitQueueIsOrderedByLSN(SHM_QUEUE *waitQueue)
{
PGPROC *proc = NULL;
XLogRecPtr lastLSN;
***************
*** 613,621 **** SyncRepQueueIsOrderedByLSN(void)
lastLSN.xlogid = 0;
lastLSN.xrecoff = 0;
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! &(WalSndCtl->SyncRepQueue),
! offsetof(PGPROC, syncRepLinks));
while (proc)
{
--- 624,632 ----
lastLSN.xlogid = 0;
lastLSN.xrecoff = 0;
! proc = (PGPROC *) SHMQueueNext(waitQueue,
! waitQueue,
! offsetof(PGPROC, waitLSNLinks));
while (proc)
{
***************
*** 628,636 **** SyncRepQueueIsOrderedByLSN(void)
lastLSN = proc->waitLSN;
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! &(proc->syncRepLinks),
! offsetof(PGPROC, syncRepLinks));
}
return true;
--- 639,647 ----
lastLSN = proc->waitLSN;
! proc = (PGPROC *) SHMQueueNext(waitQueue,
! &(proc->waitLSNLinks),
! offsetof(PGPROC, waitLSNLinks));
}
return true;
*** a/src/backend/storage/ipc/ipci.c
--- b/src/backend/storage/ipc/ipci.c
***************
*** 126,131 **** CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
--- 126,132 ----
size = add_size(size, BTreeShmemSize());
size = add_size(size, SyncScanShmemSize());
size = add_size(size, AsyncShmemSize());
+ size = add_size(size, GroupCommitShmemSize());
#ifdef EXEC_BACKEND
size = add_size(size, ShmemBackendArraySize());
#endif
***************
*** 190,195 **** CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
--- 191,197 ----
* Set up xlog, clog, and buffers
*/
XLOGShmemInit();
+ GroupCommitShmemInit();
CLOGShmemInit();
SUBTRANSShmemInit();
MultiXactShmemInit();
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 35,40 ****
--- 35,41 ----
#include <unistd.h>
#include <sys/time.h>
+ #include "access/groupcommit.h"
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xact.h"
***************
*** 377,387 **** InitProcess(void)
#endif
MyProc->recoveryConflictPending = false;
! /* Initialize fields for sync rep */
MyProc->waitLSN.xlogid = 0;
MyProc->waitLSN.xrecoff = 0;
MyProc->syncRepState = SYNC_REP_NOT_WAITING;
! SHMQueueElemInit(&(MyProc->syncRepLinks));
/*
* Acquire ownership of the PGPROC's latch, so that we can use WaitLatch.
--- 378,389 ----
#endif
MyProc->recoveryConflictPending = false;
! /* Initialize fields for waiting on WAL LSNs */
MyProc->waitLSN.xlogid = 0;
MyProc->waitLSN.xrecoff = 0;
MyProc->syncRepState = SYNC_REP_NOT_WAITING;
! MyProc->grpCommitState = GROUP_COMMIT_NOT_WAITING;
! SHMQueueElemInit(&(MyProc->waitLSNLinks));
/*
* Acquire ownership of the PGPROC's latch, so that we can use WaitLatch.
***************
*** 739,746 **** ProcKill(int code, Datum arg)
Assert(MyProc != NULL);
! /* Make sure we're out of the sync rep lists */
SyncRepCleanupAtProcExit();
#ifdef USE_ASSERT_CHECKING
if (assert_enabled)
--- 741,749 ----
Assert(MyProc != NULL);
! /* Make sure we're out of the wait for LSN lists */
SyncRepCleanupAtProcExit();
+ GroupCommitCleanupAtProcExit();
#ifdef USE_ASSERT_CHECKING
if (assert_enabled)
***************
*** 776,781 **** ProcKill(int code, Datum arg)
--- 779,786 ----
MyProc->links.next = (SHM_QUEUE *) procglobal->freeProcs;
procglobal->freeProcs = MyProc;
}
+ /* Invalidate group commit state */
+ MyProc->grpCommitState = GROUP_COMMIT_NOT_WAITING;
/* PGPROC struct isn't mine anymore */
MyProc = NULL;
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 125,132 ****
/* XXX these should appear in other modules' header files */
extern bool Log_disconnections;
- extern int CommitDelay;
- extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
--- 125,130 ----
***************
*** 2019,2046 **** static struct config_int ConfigureNamesInt[] =
},
{
- {"commit_delay", PGC_USERSET, WAL_SETTINGS,
- gettext_noop("Sets the delay in microseconds between transaction commit and "
- "flushing WAL to disk."),
- NULL
- },
- &CommitDelay,
- 0, 0, 100000,
- NULL, NULL, NULL
- },
-
- {
- {"commit_siblings", PGC_USERSET, WAL_SETTINGS,
- gettext_noop("Sets the minimum concurrent open transactions before performing "
- "commit_delay."),
- NULL
- },
- &CommitSiblings,
- 5, 0, 1000,
- NULL, NULL, NULL
- },
-
- {
{"extra_float_digits", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets the number of digits displayed for floating-point values."),
gettext_noop("This affects real, double precision, and geometric data types. "
--- 2017,2022 ----
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 171,179 ****
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
- #commit_delay = 0 # range 0-100000, in microseconds
- #commit_siblings = 5 # range 1-1000
-
# - Checkpoints -
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
--- 171,176 ----
*** /dev/null
--- b/src/include/access/groupcommit.h
***************
*** 0 ****
--- 1,42 ----
+ /*-------------------------------------------------------------------------
+ *
+ * groupcommit.h
+ * Group commit interface
+ *
+ * Exports from access/transam/groupcommit.c.
+ *
+ * Copyright (c) 2010-2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/groupcommit.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef _GROUPCOMMIT_H
+ #define _GROUPCOMMIT_H
+
+ #include "access/xlogdefs.h"
+ #include "replication/syncrep.h"
+
+ typedef enum GroupCommitState
+ {
+ GROUP_COMMIT_NOT_WAITING,
+ GROUP_COMMIT_WAITING,
+ GROUP_COMMIT_WAIT_COMPLETE
+ } GroupCommitState;
+
+ extern PGDLLIMPORT bool UseGroupCommit;
+
+ /* function prototypes */
+ extern Size GroupCommitShmemSize(void);
+ extern void GroupCommitShmemInit(void);
+
+ /* called by user backends */
+ extern void GroupCommitWaitForLSN(XLogRecPtr XactCommitLSN);
+ /* called at backend exit */
+ extern void GroupCommitCleanupAtProcExit(void);
+
+ /* called by the driver */
+ extern void GroupCommitReleaseWaiters(XLogRecPtr flushLSN);
+
+ #endif /* _GROUPCOMMIT_H */
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 265,270 **** extern CheckpointStatsData CheckpointStats;
--- 265,271 ----
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
extern void XLogFlush(XLogRecPtr RecPtr);
+ extern XLogRecPtr DoXLogFlush(XLogRecPtr RecPtr);
extern void XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
extern int XLogFileInit(uint32 log, uint32 seg,
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 14,19 ****
--- 14,20 ----
#define _SYNCREP_H
#include "utils/guc.h"
+ #include "storage/shmem.h"
/* syncRepState */
#define SYNC_REP_NOT_WAITING 0
***************
*** 41,44 **** extern int SyncRepWakeQueue(bool all);
--- 42,52 ----
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+ /* Batch commit infrastructure, used by sync rep and group commit */
+ extern void BatchCommitQueueInsert(SHM_QUEUE *waitQueue);
+
+ #ifdef USE_ASSERT_CHECKING
+ extern bool BatchCommitQueueIsOrderedByLSN(SHM_QUEUE *waitQueue);
+ #endif
+
#endif /* _SYNCREP_H */
*** a/src/include/storage/latch.h
--- b/src/include/storage/latch.h
***************
*** 25,31 ****
* and must be initialized at postmaster startup by InitSharedLatch. Before
* a shared latch can be waited on, it must be associated with a process
* with OwnLatch. Only the process owning the latch can wait on it, but any
! * process can set it.
*
* There are three basic operations on a latch:
*
--- 25,36 ----
* and must be initialized at postmaster startup by InitSharedLatch. Before
* a shared latch can be waited on, it must be associated with a process
* with OwnLatch. Only the process owning the latch can wait on it, but any
! * process can set it. Note that the use of the process latch (which is a field
! * in PGPROC) is generally preferred to using an ad-hoc shared latch, as generic
! * signal handlers will call SetLatch on the process latch. Signals have
! * the potential to invalidate the Latch timeout on certain platforms, resulting
! * in a denial-of-service, so it is important to verify that all signal handlers
! * within the process call SetLatch().
*
* There are three basic operations on a latch:
*
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 79,84 **** typedef enum LWLockId
--- 79,85 ----
SerializablePredicateLockListLock,
OldSerXidLock,
SyncRepLock,
+ GroupCommitLock,
/* Individual lock IDs end here */
FirstBufMappingLock,
FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
***************
*** 119,126 **** struct PGPROC
* syncRepLinks used only while holding SyncRepLock.
*/
XLogRecPtr waitLSN; /* waiting for this LSN or higher */
int syncRepState; /* wait state for sync rep */
! SHM_QUEUE syncRepLinks; /* list link if process is in syncrep queue */
/*
* All PROCLOCK objects for locks held or awaited by this backend are
--- 119,127 ----
* syncRepLinks used only while holding SyncRepLock.
*/
XLogRecPtr waitLSN; /* waiting for this LSN or higher */
+ int grpCommitState; /* wait state for group commit */
int syncRepState; /* wait state for sync rep */
! SHM_QUEUE waitLSNLinks; /* list link if process is in WAL queue */
/*
* All PROCLOCK objects for locks held or awaited by this backend are
On 20 January 2012 22:30, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Maybe we should have a heuristic to split a large flush into smaller chunks.
The WAL segment boundary would be a quite natural split point, for example,
because when crossing the file boundary you have to issue separate fsync()s
for the files anyway. But I'm happy with leaving that out for now, it's not
any worse than it used to be without group commit anyway.
Let's quantify how much of a problem that is first.
The patch added some CHECK_FOR_INTERRUPTS() calls into various places in the
commit/abort codepaths, but that's all dead code; they're all within a
HOLD/RESUME_INTERRUPTS blocks.
Fair enough, but do you think it's acceptable to say "well, we can't
have errors within critical blocks anyway, and nor can the driver, so
just assume that the driver will successfully service the request"?
Furthermore, the WAL writer didn't have a latch before this patch; it's not
many lines of code to initialize the latch and set up the signal handler for
it, but it already has a semaphore that's ready to use.
Uh, yes it did - WalWriterDelay is passed to WaitLatch(). It didn't
use the process latch as I did (which is initialised anyway), though I
believe it should have (on general principle, to avoid invalidation
issues when generic handlers are registered, and because the process
latch is already initialised and available), which is why I changed
it. Whatever you do with group commit, you're going to want to look at
the changes made to the WAL Writer in my original patch outside of the
main loop. There are one or two fixes for it included there
(registering a usr1 signal handler, and saving errno in existing
handlers), and we need an alternative way of power saving if you're
not going to include the mechanism originally proposed - maybe
something similar to what has been done for the BGWriter in my patch
for that. At 5 wake-ups per second by default, the WAL Writer is by a
wide margin the biggest culprit (except the BGWriter, which is also at
5 by default, but that is addressed by my other patch that you're
reviewing). I want to fix that problem too, and possibly investigate
whether there's something to be done about the checkpointer (though
that only has a 5 second timeout, so it's not a major concern). In any
case, we should encourage the idea that auxiliary processes will use
the proc latch, unless perhaps they only use a local latch like the
avlauncher does, imho.
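For reference, the handler fixes amount to the usual pattern - save and
restore errno around the call that sets the latch (a sketch of the shape
I mean, not the exact hunk from my patch):

	static void
	walwriter_sigusr1_handler(SIGNAL_ARGS)
	{
		int			save_errno = errno;

		latch_sigusr1_handler();	/* wake any WaitLatch() in progress */

		errno = save_errno;
	}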
Why did you remove the new assertions in unix_latch.c/win32_latch.c? I
think you should keep them, as well as my additional comments on latch
timeout invalidation issues in latch.h which are untouched in your
revision (though this looks to be a rough revision, so I shouldn't
read anything into that either way, I suppose). In general, we should
try to use the process latch whenever we can.
I wonder if we should rename the file into "xlogflush.c" or something like
that, to make it clear that this works with any XLOG flushes, not just
commits? Group commit is the usual term for this feature, so we should
definitely mention that in the comments, but it might be better to rename
the files/functions to emphasize that this is about WAL flushing in general.
Okay.
This probably needs some further cleanup to fix things I've broken, and I
haven't done any performance testing, but it's progress. Do you have a shell
script or something that you used for the performance tests that I could
run? Or would you like to re-run the tests you did earlier with this patch?
No, I'm using pgbench-tools, and there's no reason to think that you
couldn't get similar results on ordinary hardware, which is all I used
- obviously you'll want to make sure that you're using a file system
that supports granular fsyncs, like ext4. All of the details,
including the config for pgbench-tools, are in my original e-mail. I
have taken the time to re-run the benchmark and update the wiki with
that new information - I'd call it a draw.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 21 January 2012 03:13, Peter Geoghegan <peter@2ndquadrant.com> wrote:
I have taken the time to re-run the benchmark and update the wiki with
that new information - I'd call it a draw.
On second thought, the occasional latency spikes that we see with my
patch (which uses the poll() based latch in the run that is
benchmarked) might be significant - difficult to say. The averages are
about the same though.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Jan 20, 2012 at 10:30 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
I spent some time cleaning this up. Details below, but here are the
highlights:
* Reverted the removal of wal_writer_delay
* Removed heuristic on big flushes
No contested viewpoints on anything there.
* Doesn't rely on WAL writer. Any process can act as the "leader" now.
* Uses PGSemaphoreLock/Unlock instead of latches
Thanks for producing an alternate version, it will allow us to comment
on various approaches.
There is much yet to discuss so please don't think about committing
anything yet.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
I've been thinking, what exactly is the important part of this group
commit patch that gives the benefit? Keeping the queue sorted isn't all
that important - XLogFlush() requests for commits will come in almost
the correct order anyway.
I also don't much like the division of labour between groupcommit.c and
xlog.c. XLogFlush() calls GroupCommitWaitForLSN(), which calls back
DoXLogFlush(), which is a bit hard to follow. (that's in my latest
version; the original patch had similar division but it also cut across
processes, with the wal writer actually doing the flush)
It occurs to me that the WALWriteLock already provides much of the same
machinery we're trying to build here. It provides FIFO-style queuing,
with the capability to wake up the next process or processes in the
queue. Perhaps all we need is some new primitive to LWLock, to make it
do exactly what we need.
Attached is a patch to do that. It adds a new mode to
LWLockConditionalAcquire(), LW_EXCLUSIVE_BUT_WAIT. If the lock is free,
it is acquired and the function returns immediately. However, unlike
normal LWLockConditionalAcquire(), if the lock is not free it goes to
sleep until it is released. But unlike normal LWLockAcquire(), when the
lock is released, the function returns *without* acquiring the lock.
I modified XLogFlush() to use that new function for WALWriteLock. It
tries to get WALWriteLock, but if it's not immediately free, it waits
until it is released. Then, before trying to acquire the lock again, it
rechecks LogwrtResult to see if the record was already flushed by the
process that released the lock.
This is a much smaller patch than the group commit patch, and in my
pgbench-tools tests on my laptop, it has the same effect on performance.
The downside is that it touches lwlock.c, which is a change at a lower
level. Robert's flexlocks patch probably would've been useful here.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
groupcommit-with-special-lwlockmode-1.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ce659ec..d726a98 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2066,23 +2066,42 @@ XLogFlush(XLogRecPtr record)
/* initialize to given target; may increase below */
WriteRqstPtr = record;
- /* read LogwrtResult and update local state */
+ /*
+ * Now wait until we get the write lock, or someone else does the
+ * flush for us.
+ */
+ for (;;)
{
/* use volatile pointer to prevent code rearrangement */
volatile XLogCtlData *xlogctl = XLogCtl;
+ /* read LogwrtResult and update local state */
SpinLockAcquire(&xlogctl->info_lck);
if (XLByteLT(WriteRqstPtr, xlogctl->LogwrtRqst.Write))
WriteRqstPtr = xlogctl->LogwrtRqst.Write;
LogwrtResult = xlogctl->LogwrtResult;
SpinLockRelease(&xlogctl->info_lck);
- }
- /* done already? */
- if (!XLByteLE(record, LogwrtResult.Flush))
- {
- /* now wait for the write lock */
- LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+ /* done already? */
+ if (XLByteLE(record, LogwrtResult.Flush))
+ break;
+
+ /*
+ * Try to get the write lock. If we can't get it immediately, wait
+ * until it's released, but before actually acquiring it, loop back
+ * to check if the backend holding the lock did the flush for us
+ * already. This helps to maintain a good rate of group committing
+ * when the system is bottlenecked by the speed of fsyncing.
+ */
+ if (!LWLockConditionalAcquire(WALWriteLock, LW_EXCLUSIVE_BUT_WAIT))
+ {
+ /*
+ * Didn't get the lock straight away, but we might be done
+ * already
+ */
+ continue;
+ }
+ /* Got the lock */
LogwrtResult = XLogCtl->Write.LogwrtResult;
if (!XLByteLE(record, LogwrtResult.Flush))
{
@@ -2111,6 +2130,8 @@ XLogFlush(XLogRecPtr record)
XLogWrite(WriteRqst, false, false);
}
LWLockRelease(WALWriteLock);
+ /* done */
+ break;
}
END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index cc41568..488f573 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -430,6 +430,7 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
elog(PANIC, "cannot wait without a PGPROC structure");
proc->lwWaiting = true;
+ proc->lwWaitOnly = false;
proc->lwExclusive = (mode == LW_EXCLUSIVE);
proc->lwWaitLink = NULL;
if (lock->head == NULL)
@@ -496,7 +497,9 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
/*
* LWLockConditionalAcquire - acquire a lightweight lock in the specified mode
*
- * If the lock is not available, return FALSE with no side-effects.
+ * If the lock is not available, return FALSE with no side-effects. In
+ * LW_EXCLUSIVE_BUT_WAIT mode, if the lock is not available, waits until it
+ * is available, but then returns false without acquiring it.
*
* If successful, cancel/die interrupts are held off until lock release.
*/
@@ -504,7 +507,9 @@ bool
LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
{
volatile LWLock *lock = &(LWLockArray[lockid].lock);
+ PGPROC *proc = MyProc;
bool mustwait;
+ int extraWaits = 0;
PRINT_LWDEBUG("LWLockConditionalAcquire", lockid, lock);
@@ -523,7 +528,7 @@ LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
SpinLockAcquire(&lock->mutex);
/* If I can get the lock, do so quickly. */
- if (mode == LW_EXCLUSIVE)
+ if (mode == LW_EXCLUSIVE || mode == LW_EXCLUSIVE_BUT_WAIT)
{
if (lock->exclusive == 0 && lock->shared == 0)
{
@@ -544,8 +549,77 @@ LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
mustwait = true;
}
- /* We are done updating shared state of the lock itself. */
- SpinLockRelease(&lock->mutex);
+ if (mustwait && mode == LW_EXCLUSIVE_BUT_WAIT)
+ {
+ /*
+ * Add myself to wait queue.
+ *
+ * If we don't have a PGPROC structure, there's no way to wait. This
+ * should never occur, since MyProc should only be null during shared
+ * memory initialization.
+ */
+ if (proc == NULL)
+ elog(PANIC, "cannot wait without a PGPROC structure");
+
+ proc->lwWaiting = true;
+ proc->lwWaitOnly = true;
+ proc->lwExclusive = (mode == LW_EXCLUSIVE);
+ proc->lwWaitLink = NULL;
+ if (lock->head == NULL)
+ lock->head = proc;
+ else
+ lock->tail->lwWaitLink = proc;
+ lock->tail = proc;
+
+ /* Can release the mutex now */
+ SpinLockRelease(&lock->mutex);
+
+ /*
+ * Wait until awakened.
+ *
+ * Since we share the process wait semaphore with the regular lock
+ * manager and ProcWaitForSignal, and we may need to acquire an LWLock
+ * while one of those is pending, it is possible that we get awakened
+ * for a reason other than being signaled by LWLockRelease. If so,
+ * loop back and wait again. Once we've gotten the LWLock,
+ * re-increment the sema by the number of additional signals received,
+ * so that the lock manager or signal manager will see the received
+ * signal when it next waits.
+ */
+ LOG_LWDEBUG("LWLockAcquire", lockid, "waiting");
+
+#ifdef LWLOCK_STATS
+ block_counts[lockid]++;
+#endif
+
+ TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
+
+ for (;;)
+ {
+ /* "false" means cannot accept cancel/die interrupt here. */
+ PGSemaphoreLock(&proc->sem, false);
+ if (!proc->lwWaiting)
+ break;
+ extraWaits++;
+ }
+
+ TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(lockid, mode);
+
+ LOG_LWDEBUG("LWLockAcquire", lockid, "awakened");
+ }
+ else
+ {
+ /* We are done updating shared state of the lock itself. */
+ SpinLockRelease(&lock->mutex);
+ }
+
+ TRACE_POSTGRESQL_LWLOCK_ACQUIRE(lockid, mode);
+
+ /*
+ * Fix the process wait semaphore's count for any absorbed wakeups.
+ */
+ while (extraWaits-- > 0)
+ PGSemaphoreUnlock(&proc->sem);
if (mustwait)
{
@@ -619,19 +693,29 @@ LWLockRelease(LWLockId lockid)
* Remove the to-be-awakened PGPROCs from the queue. If the front
* waiter wants exclusive lock, awaken him only. Otherwise awaken
* as many waiters as want shared access.
+ *
+ * Everyone marked with lwWaitOnly are woken up, but they don't
+ * affect the setting of releaseOK. They're woken up, but they
+ * might decide to not acquire the lock after all.
*/
proc = head;
- if (!proc->lwExclusive)
+ if (!proc->lwExclusive || proc->lwWaitOnly)
{
while (proc->lwWaitLink != NULL &&
- !proc->lwWaitLink->lwExclusive)
+ (proc->lwWaitLink->lwWaitOnly ||
+ !proc->lwWaitLink->lwExclusive))
+ {
proc = proc->lwWaitLink;
+ if (!proc->lwWaitOnly)
+ lock->releaseOK = false;
+ }
}
/* proc is now the last PGPROC to be released */
lock->head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
/* prevent additional wakeups until retryer gets to run */
- lock->releaseOK = false;
+ if (!proc->lwWaitOnly)
+ lock->releaseOK = false;
}
else
{
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index df3df29..85c9a5c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -94,7 +94,8 @@ typedef enum LWLockId
typedef enum LWLockMode
{
LW_EXCLUSIVE,
- LW_SHARED
+ LW_SHARED,
+ LW_EXCLUSIVE_BUT_WAIT
} LWLockMode;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 358d1a4..027a793 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -102,6 +102,7 @@ struct PGPROC
/* Info about LWLock the process is currently waiting for, if any. */
bool lwWaiting; /* true if waiting for an LW lock */
bool lwExclusive; /* true if waiting for exclusive access */
+ bool lwWaitOnly; /* true if just want to wake up on release */
struct PGPROC *lwWaitLink; /* next waiter for same LW lock */
/* Info about lock the process is currently waiting for, if any. */
On Wed, Jan 25, 2012 at 3:11 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
I've been thinking, what exactly is the important part of this group commit
patch that gives the benefit? Keeping the queue sorted isn't all that
important - XLogFlush() requests for commits will come in almost the correct
order anyway.
I also don't much like the division of labour between groupcommit.c and
xlog.c. XLogFlush() calls GroupCommitWaitForLSN(), which calls back
DoXLogFlush(), which is a bit hard to follow. (that's in my latest version;
the original patch had similar division but it also cut across processes,
with the wal writer actually doing the flush)
It occurs to me that the WALWriteLock already provides much of the same
machinery we're trying to build here. It provides FIFO-style queuing, with
the capability to wake up the next process or processes in the queue.
Perhaps all we need is some new primitive to LWLock, to make it do exactly
what we need.
Attached is a patch to do that. It adds a new mode to
LWLockConditionalAcquire(), LW_EXCLUSIVE_BUT_WAIT. If the lock is free, it
is acquired and the function returns immediately. However, unlike normal
LWLockConditionalAcquire(), if the lock is not free it goes to sleep until
it is released. But unlike normal LWLockAcquire(), when the lock is
released, the function returns *without* acquiring the lock.
I modified XLogFlush() to use that new function for WALWriteLock. It tries
to get WALWriteLock, but if it's not immediately free, it waits until it is
released. Then, before trying to acquire the lock again, it rechecks
LogwrtResult to see if the record was already flushed by the process that
released the lock.
This is a much smaller patch than the group commit patch, and in my
pgbench-tools tests on my laptop, it has the same effect on performance. The
downside is that it touches lwlock.c, which is a change at a lower level.
Robert's flexlocks patch probably would've been useful here.
I think you should break this off into a new function,
LWLockWaitUntilFree(), rather than treating it as a new LWLockMode.
Also, instead of adding lwWaitOnly, I would suggest that we generalize
lwWaiting and lwExclusive into a field lwWaitRequest, which can be set
to 1 for exclusive, 2 for shared, 3 for wait-for-free, etc. That way
we don't have to add another boolean every time someone invents a new
type of wait - not that that should hopefully happen very often. A
side benefit of this is that you can simplify the test in
LWLockRelease(): keep releasing waiters until you come to an exclusive
waiter, then stop. This possibly saves one shared memory fetch and
subsequent test instruction, which might not be trivial - all of this
code is extremely hot. We probably want to benchmark this carefully
to make sure that it doesn't cause a performance regression even just
from this:
+ if (!proc->lwWaitOnly)
+ lock->releaseOK = false;
I know it sounds crazy, but I'm not 100% sure that that additional
test there is cheap enough not to matter. Assuming it is, though, I
kind of like the concept: I think there must be other places where
these semantics are useful.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
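To make the suggested release policy concrete, here is a small,
self-contained sketch (hypothetical names and a toy queue, not the actual
lwlock.c code): wake the waiter at the head of the queue, and if it does
not want exclusive access, keep waking until the first exclusive waiter is
reached.

#include <stdio.h>

typedef enum { LW_EXCLUSIVE, LW_SHARED, LW_WAIT_UNTIL_FREE } LWLockMode;

typedef struct Waiter
{
    LWLockMode     lwWaitMode;      /* generalized field, replacing lwExclusive */
    struct Waiter *next;
} Waiter;

/*
 * Wake waiters from the head of the queue.  If the first waiter wants
 * exclusive access, wake it alone; otherwise keep waking until the first
 * exclusive waiter is reached.  Returns the new head of the queue.
 */
static Waiter *
wake_waiters(Waiter *head)
{
    Waiter *last = head;

    if (head == NULL)
        return NULL;

    if (head->lwWaitMode != LW_EXCLUSIVE)
    {
        while (last->next != NULL && last->next->lwWaitMode != LW_EXCLUSIVE)
            last = last->next;
    }

    Waiter *newhead = last->next;   /* later waiters stay queued */

    last->next = NULL;
    for (Waiter *w = head; w != NULL; w = w->next)
        printf("waking waiter (mode %d)\n", w->lwWaitMode);
    return newhead;
}

int
main(void)
{
    /* Queue: shared, wait-until-free, exclusive, shared. */
    Waiter w4 = {LW_SHARED, NULL};
    Waiter w3 = {LW_EXCLUSIVE, &w4};
    Waiter w2 = {LW_WAIT_UNTIL_FREE, &w3};
    Waiter w1 = {LW_SHARED, &w2};

    Waiter *rest = wake_waiters(&w1);   /* wakes w1 and w2; w3 and w4 remain */

    printf("next queued waiter mode: %d\n", rest->lwWaitMode);
    return 0;
}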
On 26.01.2012 04:10, Robert Haas wrote:
On Wed, Jan 25, 2012 at 3:11 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Attached is a patch to do that. It adds a new mode to
LWLockConditionalAcquire(), LW_EXCLUSIVE_BUT_WAIT. If the lock is free, it
is acquired and the function returns immediately. However, unlike normal
LWLockConditionalAcquire(), if the lock is not free it goes to sleep until
it is released. But unlike normal LWLockAcquire(), when the lock is
released, the function returns *without* acquiring the lock.
...I think you should break this off into a new function,
LWLockWaitUntilFree(), rather than treating it as a new LWLockMode.
Also, instead of adding lwWaitOnly, I would suggest that we generalize
lwWaiting and lwExclusive into a field lwWaitRequest, which can be set
to 1 for exclusive, 2 for shared, 3 for wait-for-free, etc. That way
we don't have to add another boolean every time someone invents a new
type of wait - not that that should happen very often, hopefully. A
side benefit of this is that you can simplify the test in
LWLockRelease(): keep releasing waiters until you come to an exclusive
waiter, then stop. This possibly saves one shared memory fetch and
subsequent test instruction, which might not be trivial - all of this
code is extremely hot.
Makes sense. Attached is a new version, doing exactly that.
We probably want to benchmark this carefully
to make sure that it doesn't cause a performance regression even just
from this:
+ if (!proc->lwWaitOnly)
+ lock->releaseOK = false;
I know it sounds crazy, but I'm not 100% sure that that additional
test there is cheap enough not to matter. Assuming it is, though, I
kind of like the concept: I think there must be other places where
these semantics are useful.
Yeah, we have to be careful with any overhead in there; it can be a hot
spot. I wouldn't expect any measurable difference from the above,
though. Could I ask you to rerun the pgbench tests you did recently with
this patch? Or can you think of a better test for this?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
groupcommit-with-lwlockwaituntilfree-2.patch (text/x-diff)
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 327,333 **** MarkAsPreparing(TransactionId xid, const char *gid,
proc->databaseId = databaseid;
proc->roleId = owner;
proc->lwWaiting = false;
! proc->lwExclusive = false;
proc->lwWaitLink = NULL;
proc->waitLock = NULL;
proc->waitProcLock = NULL;
--- 327,333 ----
proc->databaseId = databaseid;
proc->roleId = owner;
proc->lwWaiting = false;
! proc->lwWaitMode = 0;
proc->lwWaitLink = NULL;
proc->waitLock = NULL;
proc->waitProcLock = NULL;
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 2118,2140 **** XLogFlush(XLogRecPtr record)
/* initialize to given target; may increase below */
WriteRqstPtr = record;
! /* read LogwrtResult and update local state */
{
/* use volatile pointer to prevent code rearrangement */
volatile XLogCtlData *xlogctl = XLogCtl;
SpinLockAcquire(&xlogctl->info_lck);
if (XLByteLT(WriteRqstPtr, xlogctl->LogwrtRqst.Write))
WriteRqstPtr = xlogctl->LogwrtRqst.Write;
LogwrtResult = xlogctl->LogwrtResult;
SpinLockRelease(&xlogctl->info_lck);
- }
! /* done already? */
! if (!XLByteLE(record, LogwrtResult.Flush))
! {
! /* now wait for the write lock */
! LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
LogwrtResult = XLogCtl->Write.LogwrtResult;
if (!XLByteLE(record, LogwrtResult.Flush))
{
--- 2118,2160 ----
/* initialize to given target; may increase below */
WriteRqstPtr = record;
! /*
! * Now wait until we get the write lock, or someone else does the
! * flush for us.
! */
! for (;;)
{
/* use volatile pointer to prevent code rearrangement */
volatile XLogCtlData *xlogctl = XLogCtl;
+ /* read LogwrtResult and update local state */
SpinLockAcquire(&xlogctl->info_lck);
if (XLByteLT(WriteRqstPtr, xlogctl->LogwrtRqst.Write))
WriteRqstPtr = xlogctl->LogwrtRqst.Write;
LogwrtResult = xlogctl->LogwrtResult;
SpinLockRelease(&xlogctl->info_lck);
! /* done already? */
! if (XLByteLE(record, LogwrtResult.Flush))
! break;
!
! /*
! * Try to get the write lock. If we can't get it immediately, wait
! * until it's released, and recheck if we still need to do the flush
! * or if the backend that held the lock did it for us already. This
! * helps to maintain a good rate of group committing when the system
! * is bottlenecked by the speed of fsyncing.
! */
! if (!LWLockWaitUntilFree(WALWriteLock, LW_EXCLUSIVE))
! {
! /*
! * The lock is now free, but we didn't acquire it yet. Before we
! * do, loop back to check if someone else flushed the record for
! * us already.
! */
! continue;
! }
! /* Got the lock */
LogwrtResult = XLogCtl->Write.LogwrtResult;
if (!XLByteLE(record, LogwrtResult.Flush))
{
***************
*** 2163,2168 **** XLogFlush(XLogRecPtr record)
--- 2183,2190 ----
XLogWrite(WriteRqst, false, false);
}
LWLockRelease(WALWriteLock);
+ /* done */
+ break;
}
END_CRIT_SECTION();
*** a/src/backend/storage/lmgr/lwlock.c
--- b/src/backend/storage/lmgr/lwlock.c
***************
*** 430,436 **** LWLockAcquire(LWLockId lockid, LWLockMode mode)
elog(PANIC, "cannot wait without a PGPROC structure");
proc->lwWaiting = true;
! proc->lwExclusive = (mode == LW_EXCLUSIVE);
proc->lwWaitLink = NULL;
if (lock->head == NULL)
lock->head = proc;
--- 430,436 ----
elog(PANIC, "cannot wait without a PGPROC structure");
proc->lwWaiting = true;
! proc->lwWaitMode = mode;
proc->lwWaitLink = NULL;
if (lock->head == NULL)
lock->head = proc;
***************
*** 565,570 **** LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
--- 565,708 ----
}
/*
+ * LWLockWaitUntilFree - Wait until a lock is free
+ *
+ * The semantics of this function are a bit funky. If the lock is currently
+ * free, it is acquired in the given mode, and the function returns true. If
+ * the lock isn't immediately free, the function waits until it is released
+ * and returns false, but does not acquire the lock.
+ *
+ * This is currently used for WALWriteLock: when a backend flushes the WAL,
+ * holding WALWriteLock, it can flush the commit records of many other
+ * backends as a side-effect. Those other backends need to wait until the
+ * flush finishes, but don't need to acquire the lock anymore. They can just
+ * wake up, observe that their records have already been flushed, and return.
+ */
+ bool
+ LWLockWaitUntilFree(LWLockId lockid, LWLockMode mode)
+ {
+ volatile LWLock *lock = &(LWLockArray[lockid].lock);
+ PGPROC *proc = MyProc;
+ bool mustwait;
+ int extraWaits = 0;
+
+ PRINT_LWDEBUG("LWLockWaitUntilFree", lockid, lock);
+
+ /* Ensure we will have room to remember the lock */
+ if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
+ elog(ERROR, "too many LWLocks taken");
+
+ /*
+ * Lock out cancel/die interrupts until we exit the code section protected
+ * by the LWLock. This ensures that interrupts will not interfere with
+ * manipulations of data structures in shared memory.
+ */
+ HOLD_INTERRUPTS();
+
+ /* Acquire mutex. Time spent holding mutex should be short! */
+ SpinLockAcquire(&lock->mutex);
+
+ /* If I can get the lock, do so quickly. */
+ if (mode == LW_EXCLUSIVE)
+ {
+ if (lock->exclusive == 0 && lock->shared == 0)
+ {
+ lock->exclusive++;
+ mustwait = false;
+ }
+ else
+ mustwait = true;
+ }
+ else
+ {
+ if (lock->exclusive == 0)
+ {
+ lock->shared++;
+ mustwait = false;
+ }
+ else
+ mustwait = true;
+ }
+
+ if (mustwait)
+ {
+ /*
+ * Add myself to wait queue.
+ *
+ * If we don't have a PGPROC structure, there's no way to wait. This
+ * should never occur, since MyProc should only be null during shared
+ * memory initialization.
+ */
+ if (proc == NULL)
+ elog(PANIC, "cannot wait without a PGPROC structure");
+
+ proc->lwWaiting = true;
+ proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
+ proc->lwWaitLink = NULL;
+ if (lock->head == NULL)
+ lock->head = proc;
+ else
+ lock->tail->lwWaitLink = proc;
+ lock->tail = proc;
+
+ /* Can release the mutex now */
+ SpinLockRelease(&lock->mutex);
+
+ /*
+ * Wait until awakened. Like in LWLockAcquire, be prepared for bogus
+ * wakups, because we share the semaphore with ProcWaitForSignal.
+ */
+ LOG_LWDEBUG("LWLockWaitUntilFree", lockid, "waiting");
+
+ #ifdef LWLOCK_STATS
+ block_counts[lockid]++;
+ #endif
+
+ TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
+
+ for (;;)
+ {
+ /* "false" means cannot accept cancel/die interrupt here. */
+ PGSemaphoreLock(&proc->sem, false);
+ if (!proc->lwWaiting)
+ break;
+ extraWaits++;
+ }
+
+ TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(lockid, mode);
+
+ LOG_LWDEBUG("LWLockWaitUntilFree", lockid, "awakened");
+ }
+ else
+ {
+ /* We are done updating shared state of the lock itself. */
+ SpinLockRelease(&lock->mutex);
+ }
+
+ /*
+ * Fix the process wait semaphore's count for any absorbed wakeups.
+ */
+ while (extraWaits-- > 0)
+ PGSemaphoreUnlock(&proc->sem);
+
+ if (mustwait)
+ {
+ /* Failed to get lock, so release interrupt holdoff */
+ RESUME_INTERRUPTS();
+ LOG_LWDEBUG("LWLockWaitUntilFree", lockid, "failed");
+ TRACE_POSTGRESQL_LWLOCK_WAIT_UNTIL_FREE_FAIL(lockid, mode);
+ }
+ else
+ {
+ /* Add lock to list of locks held by this backend */
+ held_lwlocks[num_held_lwlocks++] = lockid;
+ TRACE_POSTGRESQL_LWLOCK_WAIT_UNTIL_FREE(lockid, mode);
+ }
+
+ return !mustwait;
+ }
+
+ /*
* LWLockRelease - release a previously acquired lock
*/
void
***************
*** 618,637 **** LWLockRelease(LWLockId lockid)
/*
* Remove the to-be-awakened PGPROCs from the queue. If the front
* waiter wants exclusive lock, awaken him only. Otherwise awaken
! * as many waiters as want shared access.
*/
proc = head;
! if (!proc->lwExclusive)
{
while (proc->lwWaitLink != NULL &&
! !proc->lwWaitLink->lwExclusive)
proc = proc->lwWaitLink;
}
/* proc is now the last PGPROC to be released */
lock->head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
! /* prevent additional wakeups until retryer gets to run */
! lock->releaseOK = false;
}
else
{
--- 756,791 ----
/*
* Remove the to-be-awakened PGPROCs from the queue. If the front
* waiter wants exclusive lock, awaken him only. Otherwise awaken
! * as many waiters as want shared access (or just want to be
! * woken up when the lock becomes free without acquiring it,
! * ie. LWLockWaitUntilFree).
*/
+ bool releaseOK = true;
+
proc = head;
! if (proc->lwWaitMode != LW_EXCLUSIVE)
{
while (proc->lwWaitLink != NULL &&
! proc->lwWaitLink->lwWaitMode != LW_EXCLUSIVE)
! {
proc = proc->lwWaitLink;
+ if (proc->lwWaitMode != LW_WAIT_UNTIL_FREE)
+ releaseOK = false;
+ }
}
/* proc is now the last PGPROC to be released */
lock->head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
! /*
! * Prevent additional wakeups until retryer gets to run. Backends
! * that are just waiting for the lock to become free don't prevent
! * wakeups, because they might decide that they don't want the
! * lock, after all.
! */
! if (proc->lwWaitMode != LW_WAIT_UNTIL_FREE)
! releaseOK = false;
!
! lock->releaseOK = releaseOK;
}
else
{
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 361,367 **** InitProcess(void)
if (IsAutoVacuumWorkerProcess())
MyPgXact->vacuumFlags |= PROC_IS_AUTOVACUUM;
MyProc->lwWaiting = false;
! MyProc->lwExclusive = false;
MyProc->lwWaitLink = NULL;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
--- 361,367 ----
if (IsAutoVacuumWorkerProcess())
MyPgXact->vacuumFlags |= PROC_IS_AUTOVACUUM;
MyProc->lwWaiting = false;
! MyProc->lwWaitMode = 0;
MyProc->lwWaitLink = NULL;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
***************
*** 516,522 **** InitAuxiliaryProcess(void)
MyPgXact->inCommit = false;
MyPgXact->vacuumFlags = 0;
MyProc->lwWaiting = false;
! MyProc->lwExclusive = false;
MyProc->lwWaitLink = NULL;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
--- 516,522 ----
MyPgXact->inCommit = false;
MyPgXact->vacuumFlags = 0;
MyProc->lwWaiting = false;
! MyProc->lwWaitMode = 0;
MyProc->lwWaitLink = NULL;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
*** a/src/backend/utils/probes.d
--- b/src/backend/utils/probes.d
***************
*** 35,40 **** provider postgresql {
--- 35,42 ----
probe lwlock__wait__done(LWLockId, LWLockMode);
probe lwlock__condacquire(LWLockId, LWLockMode);
probe lwlock__condacquire__fail(LWLockId, LWLockMode);
+ probe lwlock__wait__until__free(LWLockId, LWLockMode);
+ probe lwlock__wait__until__free__fail(LWLockId, LWLockMode);
probe lock__wait__start(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE);
probe lock__wait__done(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE);
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 94,100 **** typedef enum LWLockId
typedef enum LWLockMode
{
LW_EXCLUSIVE,
! LW_SHARED
} LWLockMode;
--- 94,103 ----
typedef enum LWLockMode
{
LW_EXCLUSIVE,
! LW_SHARED,
! LW_WAIT_UNTIL_FREE /* A special mode used in PGPROC->lwWaitMode, when
! * waiting for lock to become free. Not to be used
! * as LWLockAcquire argument */
} LWLockMode;
***************
*** 105,110 **** extern bool Trace_lwlocks;
--- 108,114 ----
extern LWLockId LWLockAssign(void);
extern void LWLockAcquire(LWLockId lockid, LWLockMode mode);
extern bool LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode);
+ extern bool LWLockWaitUntilFree(LWLockId lockid, LWLockMode mode);
extern void LWLockRelease(LWLockId lockid);
extern void LWLockReleaseAll(void);
extern bool LWLockHeldByMe(LWLockId lockid);
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
***************
*** 101,107 **** struct PGPROC
/* Info about LWLock the process is currently waiting for, if any. */
bool lwWaiting; /* true if waiting for an LW lock */
! bool lwExclusive; /* true if waiting for exclusive access */
struct PGPROC *lwWaitLink; /* next waiter for same LW lock */
/* Info about lock the process is currently waiting for, if any. */
--- 101,107 ----
/* Info about LWLock the process is currently waiting for, if any. */
bool lwWaiting; /* true if waiting for an LW lock */
! uint8 lwWaitMode; /* lwlock mode being waited for */
struct PGPROC *lwWaitLink; /* next waiter for same LW lock */
/* Info about lock the process is currently waiting for, if any. */
On Fri, Jan 27, 2012 at 8:35 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Yeah, we have to be careful with any overhead in there; it can be a hot
spot. I wouldn't expect any measurable difference from the above, though.
Could I ask you to rerun the pgbench tests you did recently with this patch?
Or can you think of a better test for this?
I can't do so immediately, because I'm waiting for Nate Boley to tell
me I can go ahead and start testing on that machine again. But I will
do it once I get the word.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 27, 2012 at 5:35 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 26.01.2012 04:10, Robert Haas wrote:
I think you should break this off into a new function,
LWLockWaitUntilFree(), rather than treating it as a new LWLockMode.
Also, instead of adding lwWaitOnly, I would suggest that we generalize
lwWaiting and lwExclusive into a field lwWaitRequest, which can be set
to 1 for exclusive, 2 for shared, 3 for wait-for-free, etc. That way
we don't have to add another boolean every time someone invents a new
type of wait - not that that should happen very often, hopefully. A
side benefit of this is that you can simplify the test in
LWLockRelease(): keep releasing waiters until you come to an exclusive
waiter, then stop. This possibly saves one shared memory fetch and
subsequent test instruction, which might not be trivial - all of this
code is extremely hot.
Makes sense. Attached is a new version, doing exactly that.
Others are going to test this out on high-end systems. I wanted to
try it out on the other end of the scale. I've used a Pentium 4, 3.2GHz,
with 2GB of RAM and with a single IDE drive running ext4. ext4 is
amazingly bad on IDE, giving about 25 fsyncs per second (and it lies
about fdatasync, but apparently not about fsync).
I ran three modes: head, head with commit_delay, and the group_commit patch.
shared_buffers = 600MB
wal_sync_method=fsync
optionally with:
commit_delay=5
commit_siblings=1
pgbench -i -s40
for clients in 1 5 10 15 20 25 30
pgbench -T 30 -M prepared -c $clients -j $clients
ran 5 times each, taking maximum tps from the repeat runs.
The results are impressive.
clients   head   head_commit_delay   group_commit
1         23.9   23.0                 23.9
5         31.0   51.3                 59.9
10        35.0   56.5                 95.7
15        35.6   64.9                136.4
20        34.3   68.7                159.3
25        36.5   64.1                168.8
30        37.2   83.8                 71.5
I haven't inspected that deep fall off at 30 clients for the patch.
By way of reference, if I turn off synchronous commit, I get
tps=1245.8, which is 100% CPU limited. This sets a theoretical upper
bound on what could be achieved by the best possible group committing
method.
If the group_commit patch goes in, would we then rip out commit_delay
and commit_siblings?
Cheers,
Jeff
Attachments:
graph.png (image/png)
On 2012-01-29 01:48, Jeff Janes wrote:
I ran three modes: head, head with commit_delay, and the group_commit patch.
shared_buffers = 600MB
wal_sync_method=fsync
optionally with:
commit_delay=5
commit_siblings=1
pgbench -i -s40
for clients in 1 5 10 15 20 25 30
pgbench -T 30 -M prepared -c $clients -j $clients
ran 5 times each, taking maximum tps from the repeat runs.
The results are impressive.
clients   head   head_commit_delay   group_commit
1         23.9   23.0                 23.9
5         31.0   51.3                 59.9
10        35.0   56.5                 95.7
15        35.6   64.9                136.4
20        34.3   68.7                159.3
25        36.5   64.1                168.8
30        37.2   83.8                 71.5
I haven't inspected that deep fall off at 30 clients for the patch.
By way of reference, if I turn off synchronous commit, I get
tps=1245.8, which is 100% CPU limited. This sets a theoretical upper
bound on what could be achieved by the best possible group committing
method.
If the group_commit patch goes in, would we then rip out commit_delay
and commit_siblings?
Adding to the list of tests that aren't exactly real-world systems, I
decided to repeat Jeff's tests on an Intel(R) Core(TM)2 Duo CPU E7500 @
2.93GHz with 4GB of memory and an Intel X25-M 160GB SSD drive underneath.
clients   Baseline   Commitdelay   Group commit
1         1168.67    1233.33       1212.67
5         2611.33    3022.00       2647.67
10        3044.67    3333.33       3296.33
15        3153.33    3177.00       3456.67
20        3087.33    3126.33       3618.67
25        2715.00    2359.00       3309.33
30        2736.33    2831.67       2737.67
Numbers are averages over 3 runs.
I have set checkpoint_segments to 30; otherwise the same configuration as
Jeff.
Attached is a graph.
The nice conclusion is that group commit outperforms baseline in all runs
(on this system).
My purpose was actually more to quantify the difference between a single
SSD and a single rotating disk.
--
Jesper
Attachments:
pgbench.png (image/png)