Scaling shared buffer eviction

Started by Amit Kapila · almost 12 years ago · 132 messages · pgsql-hackers
#1 Amit Kapila
amit.kapila16@gmail.com

As mentioned previously, given my interest in improving shared
buffer eviction, especially by reducing contention around
BufFreelistLock, I would like to share my progress on the
same.

The test used for this work is mainly the case where all the
data doesn't fit in shared buffers, but does fit in memory.
It is mainly based on a previous comparison done by Robert
for a similar workload:
http://rhaas.blogspot.in/2012/03/performance-and-scalability-on-ibm.html

To start with, I have taken an LWLOCK_STATS report to confirm
the contention around BufFreelistLock; the data for HEAD
is as follows:

M/c details
IBM POWER-7 16 cores, 64 hardware threads
RAM - 64GB
Test
scale factor = 3000
shared_buffers = 8GB
number_of_threads = 64
duration = 5mins
./pgbench -c 64 -j 64 -T 300 -S postgres

LWLOCK_STATS data for BufFreeListLock
PID 11762 lwlock main 0: shacq 0 exacq 253988 blk 29023

Here the high *blk* count at scale factor 3000 clearly shows
that when the data doesn't fit in shared buffers, backends have
to wait to find a usable buffer.

To solve this issue, I have implemented a patch which makes
sure that there are always enough buffers on the freelist so
that backends rarely need to run the clock sweep. The
implementation idea is more or less the same as discussed
previously in the below thread, so I will explain it at the end of this mail.
/messages/by-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

LWLOCK_STATS data after Patch (test used is same as
used for HEAD):

BufFreeListLock
PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0

Here the low *exacq* and *blk* counts show that backends'
need to run the clock sweep has reduced significantly.

Performance Data
-------------------------------
shared_buffers= 8GB
number of threads - 64
sc - scale factor

        sc      tps
Head   3000    45569
Patch  3000    46457
Head   1000    93037
Patch  1000    92711

The above data shows that there is no significant change in
performance or scalability even though contention around
BufFreelistLock is reduced significantly.

I have analyzed the patch with both perf record and
LWLOCK_STATS; both indicate that there is high
contention around BufMappingLocks.

Data With perf record -a -g
-----------------------------------------

+  10.14%  swapper   [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
+   7.77%  postgres  [kernel.kallsyms]  [k] ._raw_spin_lock
+   6.88%  postgres  [kernel.kallsyms]  [k] .function_trace_call
+   4.15%  pgbench   [kernel.kallsyms]  [k] .try_to_wake_up
+   3.20%  swapper   [kernel.kallsyms]  [k] .function_trace_call
+   2.99%  pgbench   [kernel.kallsyms]  [k] .function_trace_call
+   2.41%  postgres  postgres           [.] AllocSetAlloc
+   2.38%  postgres  [kernel.kallsyms]  [k] .try_to_wake_up
+   2.27%  pgbench   [kernel.kallsyms]  [k] ._raw_spin_lock
+   1.49%  postgres  [kernel.kallsyms]  [k] ._raw_spin_lock_irq
+   1.36%  postgres  postgres           [.] AllocSetFreeIndex
+   1.09%  swapper   [kernel.kallsyms]  [k] ._raw_spin_lock
+   0.91%  postgres  postgres           [.] GetSnapshotData
+   0.90%  postgres  postgres           [.] MemoryContextAllocZeroAligned

Expanded graph
------------------------------

- 10.14% swapper [kernel.kallsyms] [k] .pseries_dedicated_idle_sleep
- .pseries_dedicated_idle_sleep
- 10.13% .pseries_dedicated_idle_sleep
- 10.13% .cpu_idle
- 10.00% .start_secondary
.start_secondary_prolog
- 7.77% postgres [kernel.kallsyms] [k] ._raw_spin_lock
- ._raw_spin_lock
- 6.63% ._raw_spin_lock
- 5.95% .double_rq_lock
- .load_balance
- 5.95% .__schedule
- .schedule
- 3.27% .SyS_semtimedop
.SyS_ipc
syscall_exit
semop
PGSemaphoreLock
LWLockAcquireCommon
- LWLockAcquire
- 3.27% BufferAlloc
ReadBuffer_common
- ReadBufferExtended
- 3.27% ReadBuffer
- 2.73% ReleaseAndReadBuffer
- 1.70% _bt_relandgetbuf
_bt_search
_bt_first
btgettuple

This shows BufferAlloc->LWLockAcquire as a top contributor, and we use
BufMappingLocks in BufferAlloc. I have checked the other expanded
calls as well; StrategyGetBuffer is not present among the top contributors.

Data with LWLOCK_STATS
----------------------------------------------
BufMappingLocks

PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274 spindelay 101
PID 7310 lwlock main 39: shacq 40257 exacq 34219 blk 25886 spindelay 72
PID 7308 lwlock main 40: shacq 41024 exacq 34794 blk 20780 spindelay 54
PID 7314 lwlock main 40: shacq 41195 exacq 34848 blk 20638 spindelay 60
PID 7288 lwlock main 41: shacq 84398 exacq 34750 blk 29591 spindelay 128
PID 7208 lwlock main 42: shacq 63107 exacq 34737 blk 20133 spindelay 81
PID 7245 lwlock main 43: shacq 278001 exacq 34601 blk 53473 spindelay 503
PID 7307 lwlock main 44: shacq 85155 exacq 34440 blk 19062 spindelay 71
PID 7301 lwlock main 45: shacq 61999 exacq 34757 blk 13184 spindelay 46
PID 7235 lwlock main 46: shacq 41199 exacq 34622 blk 9031 spindelay 30
PID 7324 lwlock main 46: shacq 40906 exacq 34692 blk 8799 spindelay 14
PID 7292 lwlock main 47: shacq 41180 exacq 34604 blk 8241 spindelay 25
PID 7303 lwlock main 48: shacq 40727 exacq 34651 blk 7567 spindelay 30
PID 7230 lwlock main 49: shacq 60416 exacq 34544 blk 9007 spindelay 28
PID 7300 lwlock main 50: shacq 44591 exacq 34763 blk 6687 spindelay 25
PID 7317 lwlock main 50: shacq 44349 exacq 34583 blk 6861 spindelay 22
PID 7305 lwlock main 51: shacq 62626 exacq 34671 blk 7864 spindelay 29
PID 7301 lwlock main 52: shacq 60646 exacq 34512 blk 7093 spindelay 36
PID 7324 lwlock main 53: shacq 39756 exacq 34359 blk 5138 spindelay 22

This data shows that after the patch there is no contention
on BufFreeListLock; rather, there is huge contention around
BufMappingLocks. I have checked that HEAD also has contention
around BufMappingLocks.

As per my analysis so far, I think reducing contention around
BufFreelistLock is not sufficient to improve scalability; we need
to work on reducing contention around BufMappingLocks as well.

Details of patch
------------------------
1. Changed bgwriter to move buffers (having usage_count zero)
to the freelist based on a threshold (high_watermark), and to
decrement the usage count if it is greater than zero.
2. StrategyGetBuffer() will wake bgwriter when the number of
buffers on the freelist drops below low_watermark.
Currently I am using hard-coded values; we can choose to make
them configurable later on if required.
3. The work to get a buffer from the freelist is done under a spinlock,
and the clock sweep still runs under BufFreelistLock.

This is still a WIP patch and some of the changes are just a
prototype to check the idea; for example, I have hacked the bgwriter
code so that it continuously fills the freelist until it is able to
put enough buffers there to reach high_watermark, and commented
out some of the previous code.

Thoughts?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

scalable_buffer_eviction_v1.patch (application/octet-stream, +199/-57)
#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#1)
Re: Scaling shared buffer eviction

On Thu, May 15, 2014 at 11:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Data with LWLOCK_STATS
----------------------------------------------
BufMappingLocks

PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274 spindelay 101
PID 7310 lwlock main 39: shacq 40257 exacq 34219 blk 25886 spindelay 72
PID 7308 lwlock main 40: shacq 41024 exacq 34794 blk 20780 spindelay 54
PID 7314 lwlock main 40: shacq 41195 exacq 34848 blk 20638 spindelay 60
PID 7288 lwlock main 41: shacq 84398 exacq 34750 blk 29591 spindelay 128
PID 7208 lwlock main 42: shacq 63107 exacq 34737 blk 20133 spindelay 81
PID 7245 lwlock main 43: shacq 278001 exacq 34601 blk 53473 spindelay 503
PID 7307 lwlock main 44: shacq 85155 exacq 34440 blk 19062 spindelay 71
PID 7301 lwlock main 45: shacq 61999 exacq 34757 blk 13184 spindelay 46
PID 7235 lwlock main 46: shacq 41199 exacq 34622 blk 9031 spindelay 30
PID 7324 lwlock main 46: shacq 40906 exacq 34692 blk 8799 spindelay 14
PID 7292 lwlock main 47: shacq 41180 exacq 34604 blk 8241 spindelay 25
PID 7303 lwlock main 48: shacq 40727 exacq 34651 blk 7567 spindelay 30
PID 7230 lwlock main 49: shacq 60416 exacq 34544 blk 9007 spindelay 28
PID 7300 lwlock main 50: shacq 44591 exacq 34763 blk 6687 spindelay 25
PID 7317 lwlock main 50: shacq 44349 exacq 34583 blk 6861 spindelay 22
PID 7305 lwlock main 51: shacq 62626 exacq 34671 blk 7864 spindelay 29
PID 7301 lwlock main 52: shacq 60646 exacq 34512 blk 7093 spindelay 36
PID 7324 lwlock main 53: shacq 39756 exacq 34359 blk 5138 spindelay 22

This data shows that after the patch there is no contention
on BufFreeListLock; rather, there is huge contention around
BufMappingLocks. I have checked that HEAD also has contention
around BufMappingLocks.

As per my analysis so far, I think reducing contention around
BufFreelistLock is not sufficient to improve scalability; we need
to work on reducing contention around BufMappingLocks as well.

To reduce the contention around BufMappingLocks, I have tried a patch
that just increases the number of buffer partitions, and it actually shows
a really significant increase in scalability, due to reduced contention
around both BufFreeListLock and BufMappingLocks. The real effect of reducing
contention around BufFreeListLock was hidden because the whole contention
had shifted to BufMappingLocks. I have taken performance data for both
HEAD+increase_buf_part and Patch+increase_buf_part to clearly show the
benefit of reducing contention around BufFreeListLock. This data has been
taken using a pgbench read-only load (SELECT).

Performance Data
-------------------------------
HEAD + 64 = HEAD + (NUM_BUFFER_PARTITIONS(64) +
LOG2_NUM_LOCK_PARTITIONS(6))
V1 + 64 = Patch + (NUM_BUFFER_PARTITIONS(64) +
LOG2_NUM_LOCK_PARTITIONS(6))
Similarly, 128 means 128 buffer partitions.

shared_buffers= 8GB
scale factor = 3000
RAM - 64GB

             Thrds (64)   Thrds (128)
HEAD              45562         17128
HEAD + 64         57904         32810
V1 + 64          105557         81011
HEAD + 128        58383         32997
V1 + 128         110705        114544

shared_buffers= 8GB
scale factor = 1000
RAM - 64GB

             Thrds (64)   Thrds (128)
HEAD              92142         31050
HEAD + 64        108120         86367
V1 + 64          117454        123429
HEAD + 128       107762         86902
V1 + 128         123641        124822

Observations
-------------------------
1. There is an increase of up to 5 times in performance for data that
fits in memory but not in shared buffers.
2. Though there is an increase in performance just from increasing the
number of buffer partitions, it doesn't scale well (especially see the
case where partitions increase from 64 to 128).

I have verified that contention around BufMappingLocks has reduced
by running the patch with LWLOCK_STATS.

BufFreeListLock
PID 17894 lwlock main 0: shacq 0 exacq 171 blk 27 spindelay 1

BufMappingLocks

PID 17902 lwlock main 38: shacq 12770 exacq 10104 blk 282 spindelay 0
PID 17924 lwlock main 39: shacq 11409 exacq 10257 blk 243 spindelay 0
PID 17929 lwlock main 40: shacq 13120 exacq 10739 blk 239 spindelay 0
PID 17940 lwlock main 41: shacq 11865 exacq 10373 blk 262 spindelay 0
..
..
PID 17831 lwlock main 162: shacq 12706 exacq 10267 blk 199 spindelay 0
PID 17826 lwlock main 163: shacq 11081 exacq 10256 blk 168 spindelay 0
PID 17903 lwlock main 164: shacq 11494 exacq 10375 blk 176 spindelay 0
PID 17899 lwlock main 165: shacq 12043 exacq 10485 blk 216 spindelay 0

We can clearly see that the *blk* numbers have reduced significantly,
which shows that contention has reduced.

The patch is still only in shape to prove the merit of the idea, and I
have just changed the number of partitions so that anyone who wants to
verify the performance for a similar load can do so by just applying
the patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

scalable_buffer_eviction_v2.patch (application/octet-stream, +201/-59)
#3 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#2)
Re: Scaling shared buffer eviction

On Fri, May 16, 2014 at 10:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

             Thrds (64)   Thrds (128)
HEAD              45562         17128
HEAD + 64         57904         32810
V1 + 64          105557         81011
HEAD + 128        58383         32997
V1 + 128         110705        114544

I haven't actually reviewed the code, but this sort of thing seems like
good evidence that we need your patch, or something like it. The fact that
the patch produces little performance improvement on its own (though it
does produce some) shouldn't be held against it - the fact that the
contention shifts elsewhere when the first bottleneck is removed is not
your patch's fault.

In terms of ameliorating contention on the buffer mapping locks, I think it
would be better to replace the whole buffer mapping table with something
different. I started working on that almost 2 years ago, building a
hash-table that can be read without requiring any locks and written with,
well, less locking than what we have right now:

http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash

I never got quite as far as trying to hook that up to the buffer mapping
machinery, but maybe that would be worth doing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Peter Geoghegan
pg@heroku.com
In reply to: Amit Kapila (#2)
Re: Scaling shared buffer eviction

On Fri, May 16, 2014 at 7:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

shared_buffers= 8GB
scale factor = 3000
RAM - 64GB

             Thrds (64)   Thrds (128)
HEAD              45562         17128
HEAD + 64         57904         32810
V1 + 64          105557         81011
HEAD + 128        58383         32997
V1 + 128         110705        114544

shared_buffers= 8GB
scale factor = 1000
RAM - 64GB

             Thrds (64)   Thrds (128)
HEAD              92142         31050
HEAD + 64        108120         86367
V1 + 64          117454        123429
HEAD + 128       107762         86902
V1 + 128         123641        124822

I'm having a little trouble following this. These figures are transactions
per second for a 300 second pgbench tpc-b run? What does "Thrds" denote?

--
Peter Geoghegan

#5 Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Geoghegan (#4)
Re: Scaling shared buffer eviction

On Sat, May 17, 2014 at 6:29 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, May 16, 2014 at 7:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

shared_buffers= 8GB
scale factor = 3000
RAM - 64GB

I'm having a little trouble following this. These figures are transactions
per second for a 300 second pgbench tpc-b run?

Yes, the figures are tps for a 300 second run.
It is for select-only transactions.

What does "Thrds" denote?

It denotes the number of threads (-j in the pgbench run).

I have used below statements to take data
./pgbench -c 64 -j 64 -T 300 -S postgres
./pgbench -c 128 -j 128 -T 300 -S postgres

The reason for posting the numbers for 64/128 threads is that the
concurrency bottleneck mainly appears when the number of connections is
higher than the number of CPU cores, and I am using a 16-core,
64-hardware-thread m/c.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#3)
Re: Scaling shared buffer eviction

On Sat, May 17, 2014 at 6:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I haven't actually reviewed the code, but this sort of thing seems like
good evidence that we need your patch, or something like it. The fact that
the patch produces little performance improvement on its own (though it
does produce some) shouldn't be held against it - the fact that the
contention shifts elsewhere when the first bottleneck is removed is not
your patch's fault.

In terms of ameliorating contention on the buffer mapping locks, I think it
would be better to replace the whole buffer mapping table with something
different.

Is there anything bad, except perhaps an increase in LWLocks, about
scaling the number of hash partitions w.r.t. shared buffers, either by
auto-tuning or by having a configuration knob? I understand that it
would be a bit difficult for users to estimate the correct value of such
a parameter; we could provide info about its usage in the docs, such
that if the user increases shared buffers to 'X' times (say 20 times)
the default value (128MB), then they should consider increasing such
partitions (always keeping it a power of 2), or do something similar to
the above internally in the code.

I agree that even with a reasonably good estimate of the number of
partitions w.r.t. shared buffers, we might not be able to eliminate the
contention around BufMappingLocks, but I think the scalability we get
by doing that is not bad either.

I started working on that almost 2 years ago, building a hash-table that
can be read without requiring any locks and written with, well, less
locking than what we have right now:

I have still not read the complete code, but just from going through the
initial file header, it seems to me that it will be much better than the
current implementation in terms of concurrency. By the way, could such
an implementation be extended to improve scalability for hash indexes
as well?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#7 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#3)
Re: Scaling shared buffer eviction

On Sat, May 17, 2014 at 6:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, May 16, 2014 at 10:51 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

             Thrds (64)   Thrds (128)
HEAD              45562         17128
HEAD + 64         57904         32810
V1 + 64          105557         81011
HEAD + 128        58383         32997
V1 + 128         110705        114544

I haven't actually reviewed the code, but this sort of thing seems like
good evidence that we need your patch, or something like it. The fact that
the patch produces little performance improvement on its own (though it
does produce some) shouldn't be held against it - the fact that the
contention shifts elsewhere when the first bottleneck is removed is not
your patch's fault.

I have improved the patch by making the following changes:
a. Improved the bgwriter logic to log the xl_running_xacts info and
removed the hibernate logic, as bgwriter will now work only when
there is a scarcity of buffers in the free list. The basic idea is:
when the number of buffers on the freelist drops below the low
threshold, the allocating backend sets the latch, and bgwriter wakes
up and begins adding buffers to the freelist until it reaches the
high threshold, and then goes back to sleep.

b. New stats for the number of buffers on the freelist have been added;
some old ones like maxwritten_clean can be removed, as the new logic
for syncing buffers and moving them to the freelist doesn't use them.
However, I think it's better to remove them once the new logic is
accepted. Added some new logs for info related to the freelist under
BGW_DEBUG.

c. Used the already existing bgwriterLatch in BufferStrategyControl to
wake bgwriter when the number of buffers on the freelist drops below
the threshold.

d. Auto-tuned the low and high thresholds for the freelist for various
configurations. Generally, if we keep a small number (200~2000) of
buffers always available on the freelist, then even for large
shared_buffers settings like 15GB it appears to be sufficient.
However, when the value of shared_buffers is smaller, we need a much
smaller number. I think we can provide these as config knobs for the
user as well, but for now, based on LWLOCK_STATS results, I have
chosen some hard-coded values for the low and high freelist
thresholds.
The values have been decided based on the total number of shared
buffers; basically I have divided them into 5 categories (16~100,
100~1000, 1000~10000, 10000~100000, 100000 and above) and then ran
tests (read-only pgbench) for various configurations falling under
these categories. The reason for keeping fewer categories for larger
shared_buffers is that a small number (200~2000) of buffers available
on the freelist seems to be sufficient for quite high loads; however,
as the total number of shared buffers decreases, we need to be more
careful, because if we keep the number too low it will lead to more
clock sweeps by backends (which means freelist lock contention), and
if we keep the number too high bgwriter will evict many useful
buffers. Results based on LWLOCK_STATS are at the end of the mail.

e. One reason why I think the number of buffer partitions is hard-coded
to 16 is that the minimum number of shared buffers allowed is 16
(128kB). However, there is handling in the code (in function
init_htab()) which ensures that even if the number of partitions is
greater than the number of shared buffers, it is handled safely.

I have checked bgwriter's CPU usage with and without the patch
for various configurations, and the observation is that for most
loads, bgwriter's CPU usage after the patch is between 8~20%, while
on HEAD it is 0~2%. This shows that with the patch, when shared
buffers are under use by backends, bgwriter is constantly doing work
to ease the work of backends. Detailed data is provided later in the
mail.

Performance Data:
-------------------------------

Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale = C
checkpoint_segments = 256
checkpoint_timeout = 15min
shared_buffers = 8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

Client Count/patch_ver (tps)
             8      16     32      64     128
Head     26220   48686  70779   45232   17310
Patch    26402   50726  75574  111468  114521

Data was taken using the attached script (perf_buff_mgmt.sh).
This is read-only pgbench data with different numbers of client
connections. All the numbers are in tps; each is the median of 3
5-min pgbench read-only runs. Please find the detailed data for the 3
runs in the attached OpenOffice document (perf_read_scalability_data_v3.ods).

This data clearly shows that the patch improves performance by
up to 5~6 times.

Results of BGwriter CPU usage:
--------------------------------------------------

Here sc is scale factor and sb is shared buffers and the data is
for read-only pgbench runs.

./pgbench -c 64 -j 64 -S -T 300 postgres
sc - 3000, sb - 8GB
HEAD
CPU usage - 0~2.3%
Patch v_3
CPU usage - 8.6%

sc - 100, sb - 128MB
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
CPU Usage - 1~2%
tps- 36199.047132
Patch v_3
CPU usage - 12~13%
tps = 109182.681827

sc - 50, sb - 75MB
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
CPU Usage - 0.7~2%
tps- 37760.575128
Patch v_3
CPU usage - 20~22%
tps = 106310.744198

./pgbench -c 16 -j 16 -S -T 300 postgres
sc - 100, sb - 128kb
--need to change pgbench for this.
HEAD
CPU Usage - 0~0.3%
tps- 40979.529254
Patch v_3
CPU usage - 35~40%
tps = 42956.785618

Results of LWLOCK_STATS based on low-high threshold values of freelist:
--------------------------------------------------------------------------------------------------------------

In the results, the values of exacq and blk show the contention on the
freelist lock. sc is scale factor and sb is the number of shared
buffers. The results below show that for all but one configuration
(1MB), the contention around BufFreelistLock is reduced significantly.
For the 1MB case it has also reduced the exacq count, which shows that
the clock sweep was performed a smaller number of times.

sc - 3000, sb - 15GB --(sb > 100000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 4406 lwlock main 0: shacq 0 exacq 84482 blk 5139 spindelay 62
Patch v_3
PID 4864 lwlock main 0: shacq 0 exacq 34 blk 1 spindelay 0

sc - 3000, sb - 8GB --(sb > 100000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 24124 lwlock main 0: shacq 0 exacq 285155 blk 33910 spindelay 548
Patch v_3
PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0

sc - 100, sb - 768MB --(sb > 10000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 9144 lwlock main 0: shacq 0 exacq 284636 blk 34091 spindelay 555
Patch v-3 (lw=100,hg=1000)
PID 9428 lwlock main 0: shacq 0 exacq 306 blk 59 spindelay 0

sc - 100, sb - 128MB --(sb > 10000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 5405 lwlock main 0: shacq 0 exacq 285449 blk 32345 spindelay 714
Patch v-3
PID 8625 lwlock main 0: shacq 0 exacq 740 blk 178 spindelay 0

sc - 50, sb - 75MB --(sb > 1000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 12681 lwlock main 0: shacq 0 exacq 289347 blk 34064 spindelay 773
Patch v3
PID 12800 lwlock main 0: shacq 0 exacq 76287 blk 15183 spindelay 28

sc - 50, sb - 10MB --(sb > 1000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 10283 lwlock main 0: shacq 0 exacq 287500 blk 32177 spindelay 864
Patch v3 (for > 1000, lw = 50 hg =200)
PID 11629 lwlock main 0: shacq 0 exacq 60139 blk 12978 spindelay 40

sc - 1, sb - 7MB --(sb > 100)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 47127 lwlock main 0: shacq 0 exacq 289462 blk 37057 spindelay 119
Patch v3
PID 47283 lwlock main 0: shacq 0 exacq 9507 blk 1656 spindelay 0

sc - 1, sb - 1MB --(sb > 100)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 43215 lwlock main 0: shacq 0 exacq 301384 blk 36740 spindelay 902
Patch v3
PID 46542 lwlock main 0: shacq 0 exacq 197231 blk 37532 spindelay 294

sc - 100, sb - 128kB --(sb > 16)
./pgbench -c 16 -j 16 -S -T 300 postgres (for this, I needed to reduce
the value of naccounts to 2500, else it always gave "no unpinned
buffers available")
HEAD
PID 49751 lwlock main 0: shacq 0 exacq 1821276 blk 130119 spindelay 7
Patch v3
PID 50768 lwlock main 0: shacq 0 exacq 382610 blk 46543 spindelay 1

More Datapoints and work:
a. I have yet to take data after merging with the scalable LWLock patch
from Andres (https://commitfest.postgresql.org/action/patch_view?id=1313).
There are many conflicts with that patch, so I am waiting for an
updated version.
b. Read-only data for more configurations.
c. Data for write workloads (pgbench tpc-b, bulk insert (COPY)).
d. Update docs and remove unused code.

Suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

scalable_buffer_eviction_v3.patch (application/octet-stream, +292/-111)
perf_buff_mgmt.sh (application/x-sh)
perf_read_scalability_data_v3.ods (application/vnd.oasis.opendocument.spreadsheet)
#8 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Amit Kapila (#7)
Re: Scaling shared buffer eviction

Amit Kapila <amit.kapila16@gmail.com> wrote:

I have improved the patch by making following changes:
a. Improved the bgwriter logic to log for xl_running_xacts info and
    removed the hibernate logic as bgwriter will now work only when
    there is scarcity of buffers in free list. Basic idea is when the
    number of buffers on freelist drops below the low threshold, the
    allocating backend sets the latch and bgwriter wakes up and begins
    adding buffers to freelist until it reaches high threshold and then
    again goes back to sleep.

The numbers from your benchmarks are very exciting, but the above
concerns me.  My tuning of the bgwriter in production has generally
*not* been aimed at keeping pages on the freelist, but toward
preventing shared_buffers from accumulating a lot of dirty pages,
which were leading to cascades of writes between caches and thus to
write stalls.  By pushing dirty pages into the (*much* larger) OS
cache, and letting write combining happen there, where the OS could
pace based on the total number of dirty pages instead of having
some hidden and appearing rather suddenly, latency spikes were
avoided while not causing any noticeable increase in the number of
OS writes to the RAID controller's cache.

Essentially I was able to tune the bgwriter so that a dirty page
was always pushed out to the OS cache within three seconds, which led
to a healthy balance of writes between the checkpoint process and
the bgwriter. Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#9 Amit Kapila
amit.kapila16@gmail.com
In reply to: Kevin Grittner (#8)
Re: Scaling shared buffer eviction

On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Amit Kapila <amit.kapila16@gmail.com> wrote:

I have improved the patch by making following changes:
a. Improved the bgwriter logic to log for xl_running_xacts info and
removed the hibernate logic as bgwriter will now work only when
there is scarcity of buffer's in free list. Basic idea is when the
number of buffers on freelist drops below the low threshold, the
allocating backend sets the latch and bgwriter wakes up and begins
adding buffers to freelist until it reaches high threshold and then
again goes back to sleep.

The numbers from your benchmarks are very exciting, but the above
concerns me. My tuning of the bgwriter in production has generally
*not* been aimed at keeping pages on the freelist, but toward
preventing shared_buffers from accumulating a lot of dirty pages,
which were leading to cascades of writes between caches and thus to
write stalls. By pushing dirty pages into the (*much* larger) OS
cache, and letting write combining happen there, where the OS could
pace based on the total number of dirty pages instead of having
some hidden and appearing rather suddenly, latency spikes were
avoided while not causing any noticeable increase in the number of
OS writes to the RAID controller's cache.

Essentially I was able to tune the bgwriter so that a dirty page
was always push out to the OS cache within three seconds, which led
to a healthy balance of writes between the checkpoint process and
the bgwriter.

I think it would be better if bgwriter did its writes based on the
number of buffers that get dirtied, to achieve the balance of
writes.

Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.

I agree that for some cases, as explained by you, the current bgwriter
logic does satisfy the need; however there are other cases as well
where it doesn't help much. One such case I am trying to improve
(easing backend buffer allocations); another may be when there is
constant write activity, for which I am not sure how much it really
helps. Part of the reason for trying to make bgwriter respond mainly
to ease backend allocations is the previous discussion of the same;
refer to the link below:
/messages/by-id/CA+TgmoZ7dvhC4h-ffJmZCff6VWyNfOEAPZ021VxW61uH46R3QA@mail.gmail.com

However if we want to retain current property of bgwriter, we can do
the same by one of below ways:
a. Have separate processes for writing dirty buffers and moving buffers
to freelist.
b. In the current bgwriter, separate the two tasks based on the need.
The need can be decided based on whether bgwriter has been woken
due to a shortage of buffers on the free list or due to
BgWriterDelay.

Now as populating freelist and balance writes by writing dirty buffers
are two separate responsibilities, so not sure if doing that by one
process is a good idea.

I am planning to take some more performance data, part of which will
be with write loads as well, but I am not sure if that can show the
need you mention.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#8)
Re: Scaling shared buffer eviction

On Sun, Jun 8, 2014 at 9:51 AM, Kevin Grittner <kgrittn@ymail.com> wrote:

Amit Kapila <amit.kapila16@gmail.com> wrote:

I have improved the patch by making following changes:
a. Improved the bgwriter logic to log for xl_running_xacts info and
removed the hibernate logic as bgwriter will now work only when
there is scarcity of buffers in the free list. Basic idea is when the
number of buffers on freelist drops below the low threshold, the
allocating backend sets the latch and bgwriter wakes up and begins
adding buffers to freelist until it reaches high threshold and then
again goes back to sleep.

The numbers from your benchmarks are very exciting, but the above
concerns me. My tuning of the bgwriter in production has generally
*not* been aimed at keeping pages on the freelist,

Just to be clear, prior to this patch, the bgwriter has never been in
the business of putting pages on the freelist in the first place, so
it wouldn't have been possible for you to tune for that.

Essentially I was able to tune the bgwriter so that a dirty page
was always pushed out to the OS cache within three seconds, which led
to a healthy balance of writes between the checkpoint process and
the bgwriter. Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.

I think, as Amit says downthread, that the crucial design question
here is whether we need two processes, one to populate the freelist so
that regular backends don't need to run the clock sweep, and a second
to flush dirty buffers, or whether a single process can serve both
needs. In favor of a single process, many people have commented that
the background writer doesn't seem to do much right now. If the
process is mostly sitting around idle, then giving it more
responsibilities might be OK. In favor of having a second process,
I'm a little concerned that if the background writer gets busy writing
a page, it might then be unavailable to populate the freelist until it
finishes, which might be a very long time relative to the buffer
allocation needs of other backends. I'm not sure what the right
answer is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#9)
Re: Scaling shared buffer eviction

On Mon, Jun 9, 2014 at 9:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.

I am planning to take some more performance data, part of which will
be for write loads as well, but I am not sure whether that will show
the need you mentioned.

After taking performance data for a write load using tpc-b with the
patch, I found that there is a regression. So I went ahead and
tried to figure out the reason for the same, and found that with the
patch, the bgwriter started flushing buffers which were required by
backends. The reason was that *nextVictimBuffer* was not getting
updated properly while running the clock-sweep-like logic
(decrementing the usage count when the number of buffers on the
freelist falls below the low threshold value) in the bgwriter. In
HEAD, I noticed that at default settings the bgwriter was not
flushing any buffers at all, which is at least better than what my
patch was doing (flushing buffers required by backends).

So I tried to fix the issue by updating *nextVictimBuffer* in the new
bgwriter logic, and the results are positive.

sbe - scalable buffer eviction

Select-only Data (TPS)

Client count      64        128
Un-patched     45232      17310
sbe_v3        111468     114521
sbe_v4        153137     160752

TPC-B (TPS)

Client count      64        128
Un-patched       825        784
sbe_v4           814        845

For select-only data, I am quite confident that it will improve if we
introduce nextVictimBuffer increments in the bgwriter, and it scales
much better with that change; however for TPC-B I am getting
fluctuation in the data, so I am not sure it has eliminated the
problem. The main difference is that in HEAD, the bgwriter never
increments nextVictimBuffer while syncing buffers; it just notes down
the current position before starting and then proceeds sequentially.

I think it will be good if we can have a new process for moving
buffers to the freelist, for the reasons below:

a. While trying to move buffers to the freelist, it should not block
due to intervening write activity.
b. The bgwriter then need not increment nextVictimBuffer, and can
maintain its current logic.

One significant change in this version of patch is to use a separate
spin lock to protect nextVictimBuffer rather than using BufFreelistLock.
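For illustration, guarding only the clock hand with its own narrow lock could look like the standalone sketch below. The names victimbuf_lck and nextVictimBuffer follow the patch, but this is a simplified model, not patch code: a pthread mutex stands in for the spinlock and NBUFFERS is hard-coded.

```c
#include <pthread.h>

/* Standalone model: in PostgreSQL, NBuffers comes from shared_buffers
 * and victimbuf_lck would be a slock_t, not a pthread mutex. */
#define NBUFFERS 16384

typedef struct
{
    pthread_mutex_t victimbuf_lck;   /* protects nextVictimBuffer only */
    int             nextVictimBuffer;
} StrategyControlSketch;

/*
 * Advance the clock hand by one buffer and return its previous
 * position.  The lock is held just long enough to read and bump a
 * single integer, which is why a dedicated lock here contends far
 * less than a lock covering the whole freelist would.
 */
static int
ClockSweepTick(StrategyControlSketch *sc)
{
    int victim;

    pthread_mutex_lock(&sc->victimbuf_lck);
    victim = sc->nextVictimBuffer;
    if (++sc->nextVictimBuffer >= NBUFFERS)
        sc->nextVictimBuffer = 0;    /* wrap around the buffer pool */
    pthread_mutex_unlock(&sc->victimbuf_lck);

    return victim;
}
```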

Suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

scalable_buffer_eviction_v4.patch (+335 -114)
#12Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#7)
Re: Scaling shared buffer eviction

On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have improved the patch by making following changes:
a. Improved the bgwriter logic to log for xl_running_xacts info and
removed the hibernate logic as bgwriter will now work only when
there is scarcity of buffers in the free list. Basic idea is when the
number of buffers on freelist drops below the low threshold, the
allocating backend sets the latch and bgwriter wakes up and begins
adding buffers to freelist until it reaches high threshold and then
again goes back to sleep.

This essentially removes BgWriterDelay, but it's still mentioned in
BgBufferSync(). Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer called
from anywhere. Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what you've
changed. I realize you probably left it that way for testing purposes, but
you need to clean such things up before submitting. Likewise, if you've
rendered GUCs or statistics counters unused, you need to rip them out, so
that the scope of the changes you've made is clear to reviewers.

A comparison of BgBufferSync() with BgBufferSyncAndMoveBuffersToFreelist()
reveals that you've removed at least one behavior that some people (at
least, me) will care about, which is the guarantee that the background
writer will scan the entire buffer pool at least every couple of minutes.
This is important because it guarantees that dirty data doesn't sit in
memory forever. When the system becomes busy again after a long idle
period, users will expect the system to have used the idle time to flush
dirty buffers to disk. This also improves data recovery prospects if, for
example, somebody loses their pg_xlog directory - there may be dirty
buffers whose contents are lost, of course, but they won't be months old.

b. New stats for the number of buffers on the freelist have been
added; some old ones like maxwritten_clean can be removed as the new
logic for syncing buffers and moving them to the free list doesn't
use them. However I think it's better to remove them once the new
logic is accepted. Added some new logs for info related to the free
list under BGW_DEBUG.

If I'm reading this right, the new statistic is an incrementing counter
where, every time you update it, you add the number of buffers currently on
the freelist. That makes no sense. I think what you should be counting is
the number of allocations that are being satisfied from the free-list.
Then, by comparing the rate at which that value is incrementing to the rate
at which buffers_alloc is incrementing, somebody can figure out what
percentage of allocations are requiring a clock-sweep run. Actually, I
think it's better to flip it around: count the number of allocations that
require an individual backend to run the clock sweep (vs. being satisfied
from the free-list); call it, say, buffers_backend_clocksweep. We can then
try to tune the patch to make that number as small as possible under
varying workloads.
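A sketch of how such a counter would be consumed: buffers_backend_clocksweep is the name proposed in this mail, while the snapshot struct and function below are purely illustrative.

```c
/* Two snapshots of the (hypothetical) pg_stat_bgwriter counters. */
typedef struct
{
    long buffers_alloc;              /* total buffer allocations */
    long buffers_backend_clocksweep; /* allocations needing a clock sweep */
} BgwriterStatsSnapshot;

/*
 * Fraction of allocations between two snapshots that could NOT be
 * satisfied from the freelist.  Tuning aims to drive this toward 0.
 */
static double
clocksweep_fraction(const BgwriterStatsSnapshot *before,
                    const BgwriterStatsSnapshot *after)
{
    long allocs = after->buffers_alloc - before->buffers_alloc;
    long sweeps = after->buffers_backend_clocksweep -
                  before->buffers_backend_clocksweep;

    if (allocs <= 0)
        return 0.0;
    return (double) sweeps / (double) allocs;
}
```

Because both fields are plain incrementing counters, any two readings taken at arbitrary times yield a meaningful rate, with no need for consecutive samples.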

c. Used the already existing bgwriterLatch in BufferStrategyControl to

wake bgwriter when number of buffers in freelist drops below
threshold.

Seems like a good idea.

d. Autotune the low and high threshold for freelist for various
configurations. Generally, if we keep a small number (200~2000) of
buffers always available on the freelist, then even for high
shared_buffers settings like 15GB it appears to be sufficient.
However, when the value of shared_buffers is less, we need a much
smaller number. I think we can provide these as config knobs for the
user as well, but for now, based on LWLOCK_STATS results, I have
chosen some hard-coded values for the low and high threshold values
for the freelist.
Values for the low and high thresholds have been decided based on the
total number of shared buffers; basically I have divided them into 5
categories (16~100, 100~1000, 1000~10000, 10000~100000,
100000 and above) and then ran tests (read-only pgbench) for various
configurations falling under these categories. The reason for keeping
fewer categories for larger shared_buffers is that if there is a
small number (200~2000) of buffers available on the free list, it
seems to be sufficient for quite high loads; however, as the total
number of shared buffers decreases we need to be more careful,
because if we keep the number too low then it will lead to more clock
sweeps by backends (which means freelist lock contention), and if we
keep the number higher the bgwriter will evict many useful buffers.
Results based on LWLOCK_STATS are at the end of the mail.

I think we need to come up with some kind of formula here rather than just
a list of hard-coded constants. And it definitely needs some comments
explaining the logic behind the choices.

Aside from those specific remarks, I think the elephant in the room is the
question of whether it really makes sense to have one process which is
responsible both for populating the free list and for writing buffers to
disk. One problem, which I alluded to above under point (1), is that we
might sometimes want to ensure that dirty buffers are written out to disk
without decrementing usage counts or adding anything to the free list.
This is a potentially solvable problem, though, because we can figure out
the number of buffers that we need to scan for freelist population and the
number that we need to scan for minimum buffer pool cleaning (one cycle
every 2 minutes). Once we've met the first goal, any further buffers we
run into under the second goal get cleaned if appropriate but their usage
counts don't get pushed down nor do they get added to the freelist. Once
we meet the second goal, we can go back to sleep.
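The two-goal scan described above could be modeled roughly like this; the names and the exact division of work are illustrative assumptions, not patch code.

```c
#include <stdbool.h>

typedef enum { GOAL_FREELIST, GOAL_CLEANING_ONLY } ScanGoal;

typedef struct
{
    bool dirty;
    bool pinned;
    int  usage_count;
} ScanBuf;

/*
 * Process one buffer according to the current goal.  While the
 * freelist goal is unmet, behave like a reclaim pass: decrement
 * non-zero usage counts and report unpinned, zero-usage buffers as
 * freelist candidates.  Once only the cleaning goal remains, write
 * dirty buffers but leave usage counts (and the freelist) untouched.
 * Returns true if the buffer should be pushed onto the freelist.
 */
static bool
scan_one_buffer(ScanBuf *buf, ScanGoal goal, int *writes)
{
    if (buf->dirty)
    {
        buf->dirty = false;     /* stand-in for flushing to disk */
        (*writes)++;
    }

    if (goal != GOAL_FREELIST)
        return false;           /* cleaning only: don't touch usage counts */

    if (buf->pinned)
        return false;
    if (buf->usage_count > 0)
    {
        buf->usage_count--;     /* give the buffer another chance */
        return false;
    }
    return true;                /* unpinned, zero usage: freelist candidate */
}
```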

But the other problem, which I think is likely unsolvable, is that writing
a dirty page can take a long time on a busy system (multiple seconds) and
the freelist can be emptied much, much quicker than that (milliseconds).
Although your benchmark results show great speed-ups on read-only
workloads, we're not really going to get the benefit consistently on
read-write workloads -- unless of course the background writer fails to
actually write anything, which should be viewed as a bug, not a feature --
because the freelist will often be empty while the background writer is
blocked on I/O.

I'm wondering if it would be a whole lot simpler and better to introduce a
new background process, maybe with a name like bgreclaim. That process
wouldn't write dirty buffers. Instead, it would just run the clock sweep
(i.e. the last loop inside StrategyGetBuffer) and put the buffers onto the
free list. Then, we could leave the bgwriter logic more or less intact.
It certainly needs improvement, but that could be another patch.

Incidentally, while I generally think your changes to the locking regimen
in StrategyGetBuffer() are going in the right direction, they need
significant cleanup. Your patch adds two new spinlocks, freelist_lck and
victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and
you've now got StrategyGetBuffer() running with no lock at all when
accessing some things that used to be protected by BufFreelistLock;
specifically, you're doing StrategyControl->numBufferAllocs++ and
SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not
OK. I think you should get rid of BufFreelistLock completely and just
decide that freelist_lck will protect the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.

Then, in StrategyGetBuffer, acquire the freelist_lck at the point where the
LWLock is acquired today. Increment StrategyControl->numBufferAllocs; save
the values of StrategyControl->bgwriterLatch; pop a buffer off the freelist
if there is one, saving its identity. Release the spinlock. Then, set the
bgwriterLatch if needed. In the first loop, first check whether the buffer
we previously popped from the freelist is pinned or has a non-zero usage
count and return it if not, holding the buffer header lock. Otherwise,
reacquire the spinlock just long enough to pop a new potential victim and
then loop around.
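A condensed model of that sequence is sketched below. A pthread mutex stands in for the spinlock, pinning and the buffer header lock are reduced to plain fields, and the names follow the mail, but the code is illustrative, not the actual implementation.

```c
#include <pthread.h>
#include <stdbool.h>

#define FREELIST_EMPTY (-1)

typedef struct
{
    int  freeNext;      /* next free buffer, or FREELIST_EMPTY */
    int  usage_count;
    bool pinned;
} BufSketch;

typedef struct
{
    pthread_mutex_t freelist_lck;  /* protects firstFreeBuffer, the
                                    * freeNext links, numBufferAllocs,
                                    * and the latch flag */
    int   firstFreeBuffer;
    long  numBufferAllocs;
    bool  bgwriterLatchSet;        /* stand-in for the latch pointer */
} StrategySketch;

/*
 * Pop one candidate off the freelist under freelist_lck, bumping the
 * allocation counter and noting whether the bgwriter latch must be
 * set -- the latch itself would be set only after the lock is
 * released.  Returns the buffer id, or FREELIST_EMPTY.
 */
static int
pop_freelist(StrategySketch *sc, BufSketch *bufs, bool *set_latch)
{
    int buf;

    pthread_mutex_lock(&sc->freelist_lck);
    sc->numBufferAllocs++;
    *set_latch = sc->bgwriterLatchSet;
    buf = sc->firstFreeBuffer;
    if (buf != FREELIST_EMPTY)
        sc->firstFreeBuffer = bufs[buf].freeNext;
    pthread_mutex_unlock(&sc->freelist_lck);

    return buf;
}

/* A popped buffer is usable only if it is unpinned with usage_count 0;
 * otherwise the caller pops another candidate and loops around. */
static bool
buffer_is_usable(const BufSketch *buf)
{
    return !buf->pinned && buf->usage_count == 0;
}
```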

Under this locking strategy, StrategyNotifyBgWriter would use
freelist_lck. Right now, the patch removes the only caller, and should
therefore remove the function as well, but if we go with the new-process
idea listed above that part would get reverted, and then you'd need to make
it use the correct spinlock. You should also go through this patch and
remove all the commented-out bits and pieces that you haven't cleaned up;
those are distracting and unhelpful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#12)
Re: Scaling shared buffer eviction

On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

This essentially removes BgWriterDelay, but it's still mentioned in

BgBufferSync(). Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer called
from anywhere. Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what you've
changed. I realize you probably left it that way for testing purposes, but
you need to clean such things up before submitting. Likewise, if you've
rendered GUCs or statistics counters unused, you need to rip them out, so
that the scope of the changes you've made is clear to reviewers.

I have kept it just for the reason that if the basic approach
sounds reasonable and is accepted, then I will clean it up. Sorry for
the inconvenience; I didn't realize that it can be annoying for a
reviewer. I will remove all such code from the patch in the next
version.

A comparison of BgBufferSync() with

BgBufferSyncAndMoveBuffersToFreelist() reveals that you've removed at least
one behavior that some people (at least, me) will care about, which is the
guarantee that the background writer will scan the entire buffer pool at
least every couple of minutes.

Okay, I will take care of this based on the conclusion of
the other points in this mail.

This is important because it guarantees that dirty data doesn't sit in

memory forever. When the system becomes busy again after a long idle
period, users will expect the system to have used the idle time to flush
dirty buffers to disk. This also improves data recovery prospects if, for
example, somebody loses their pg_xlog directory - there may be dirty
buffers whose contents are lost, of course, but they won't be months old.

b. New stats for the number of buffers on the freelist have been
added; some old ones like maxwritten_clean can be removed as the new
logic for syncing buffers and moving them to the free list doesn't
use them. However I think it's better to remove them once the new
logic is accepted. Added some new logs for info related to the free
list under BGW_DEBUG.

If I'm reading this right, the new statistic is an incrementing counter

where, every time you update it, you add the number of buffers currently on
the freelist. That makes no sense.

I think using 'number of buffers currently on the freelist' and
'number of recently allocated buffers' for consecutive cycles,
we can figure out approximately how many buffer allocations
need a clock sweep, assuming the low and high threshold
watermarks are fixed. However there can be cases where it is not
easy to estimate that number.

I think what you should be counting is the number of allocations that are

being satisfied from the free-list. Then, by comparing the rate at which
that value is incrementing to the rate at which buffers_alloc is
incrementing, somebody can figure out what percentage of allocations are
requiring a clock-sweep run. Actually, I think it's better to flip it
around: count the number of allocations that require an individual backend
to run the clock sweep (vs. being satisfied from the free-list); call it,
say, buffers_backend_clocksweep. We can then try to tune the patch to make
that number as small as possible under varying workloads.

This can give us a clear idea for tuning the patch; however, we need
to maintain 3 counters for it in code (recent_alloc, needed for the
current bgwriter logic, and the other 2 suggested by you). Do you
want to retain such counters in the code, or are they only a kind of
debug info for the patch?

d. Autotune the low and high threshold for freelist for various
configurations.

I think we need to come up with some kind of formula here rather than

just a list of hard-coded constants.

That was my initial intention as well, and I have tried approaches
based on the number of shared buffers, like keeping the threshold
values as a percentage of shared buffers, but nothing could satisfy
different kinds of workloads. The current values I have chosen are
based on experiments for various workloads at different thresholds. I
have shown the lwlock_stats data for various loads based on the
current thresholds upthread. Another way could be to make them config
knobs: use the values given by the user if provided, else go with the
fixed values.

There are other instances in the code as well (one of them I remember
offhand is in pglz_compress) where we use fixed values based on
different sizes.

And it definitely needs some comments explaining the logic behind the

choices.

Agreed, I shall improve them in next version of patch.

Aside from those specific remarks, I think the elephant in the room is

the question of whether it really makes sense to have one process which is
responsible both for populating the free list and for writing buffers to
disk. One problem, which I alluded to above under point (1), is that we
might sometimes want to ensure that dirty buffers are written out to disk
without decrementing usage counts or adding anything to the free list.
This is a potentially solvable problem, though, because we can figure out
the number of buffers that we need to scan for freelist population and the
number that we need to scan for minimum buffer pool cleaning (one cycle
every 2 minutes). Once we've met the first goal, any further buffers we
run into under the second goal get cleaned if appropriate but their usage
counts don't get pushed down nor do they get added to the freelist. Once
we meet the second goal, we can go back to sleep.

But the other problem, which I think is likely unsolvable, is that

writing a dirty page can take a long time on a busy system (multiple
seconds) and the freelist can be emptied much, much quicker than that
(milliseconds). Although your benchmark results show great speed-ups on
read-only workloads, we're not really going to get the benefit consistently
on read-write workloads -- unless of course the background writer fails to
actually write anything, which should be viewed as a bug, not a feature --
because the freelist will often be empty while the background writer is
blocked on I/O.

I'm wondering if it would be a whole lot simpler and better to introduce

a new background process, maybe with a name like bgreclaim.

That will certainly help in retaining the current behaviour of
bgwriter and make the idea cleaner. I will modify the patch
to have a new background process unless somebody thinks
otherwise.

That process wouldn't write dirty buffers.

If we go with this approach, one thing which we need to decide
is what to do in case a buffer which has usage_count zero is *dirty*,
as I don't think it is a good idea to put it on the freelist. A few
options to handle such a case are:

a. Skip such a buffer; the downside is that if we have to skip a lot
of buffers for this reason, then having a separate process
such as bgreclaim will be less advantageous.
b. Skip the buffer and notify bgwriter to flush buffers; this
notification can be sent either as soon as we encounter one
such buffer or after a few such buffers (in the latter case, we need
to decide some useful number). In this option, there is a chance that
bgwriter decides not to flush the buffer(s), which ideally should not
happen, because I think bgwriter considers the number of recent
allocations when performing the scan to flush dirty buffers.
c. Have some mechanism where bgreclaim can notify bgwriter
to flush some specific buffers. I think if we have such a mechanism,
it can later even be used by backends if required.
d. Keep the logic as per the current patch and improve it such that it
retains the behaviour of one cycle per two minutes as suggested above
by you, on the basis that in any case it is better than the current
code.

I don't think option (d) is the best way to handle this scenario;
however, I kept it in case nothing else sounds reasonable. Option (c)
might involve a lot of work, which I am not sure is justifiable for
handling the current scenario, though it could be useful for other
things. Option (a) should be okay for most cases, but I think option
(b) would be better.

Instead, it would just run the clock sweep (i.e. the last loop inside

StrategyGetBuffer) and put the buffers onto the free list.

Don't we need to do more than just the last loop inside
StrategyGetBuffer()? The clock sweep in StrategyGetBuffer() is
responsible for getting one buffer with usage_count = 0, whereas we
need to run the loop till it finds and moves enough such buffers to
populate the freelist with a number of buffers equal to the
freelist's high watermark.
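That loop might look roughly like the following standalone sketch. It is illustrative only: the real code would take victimbuf_lck for each tick, add each buffer to the freelist individually under freelist_lck, and bound the scan so it cannot spin forever if every buffer is pinned.

```c
#include <stdbool.h>

#define NBUFFERS 1024

typedef struct
{
    int  usage_count;
    bool pinned;
    bool on_freelist;
} ReclaimBuf;

/*
 * Simplified model of the proposed bgreclaim loop: keep running the
 * clock sweep -- decrementing non-zero usage counts as it passes --
 * and mark each unpinned, zero-usage-count buffer as a freelist
 * member until the freelist reaches the high watermark.  Returns the
 * resulting freelist length.
 */
static int
reclaim_until_high_watermark(ReclaimBuf *bufs, int *clock_hand,
                             int freelist_len, int high_watermark)
{
    while (freelist_len < high_watermark)
    {
        ReclaimBuf *buf = &bufs[*clock_hand];

        *clock_hand = (*clock_hand + 1) % NBUFFERS;

        if (buf->pinned || buf->on_freelist)
            continue;
        if (buf->usage_count > 0)
        {
            buf->usage_count--;  /* give the buffer another chance */
            continue;
        }
        buf->on_freelist = true; /* victim found: add to freelist */
        freelist_len++;
    }
    return freelist_len;
}
```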

Then, we could leave the bgwriter logic more or less intact. It certainly

needs improvement, but that could be another patch.

Incidentally, while I generally think your changes to the locking regimen

in StrategyGetBuffer() are going in the right direction, they need
significant cleanup. Your patch adds two new spinlocks, freelist_lck and
victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and
you've now got StrategyGetBuffer() running with no lock at all when
accessing some things that used to be protected by BufFreelistLock;
specifically, you're doing StrategyControl->numBufferAllocs++ and
SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not
OK.

I have kept them outside the spinlock because, as per the patch, the
only call site setting StrategyControl->bgwriterLatch is
StrategyGetBuffer(), and StrategyControl->numBufferAllocs is used
just for statistics purposes (which I thought might be okay even if
it is not accurate), whereas without the patch it is used by bgwriter
for purposes other than stats as well. However, it certainly needs to
be protected for the separate-bgreclaim-process idea, or for
retaining the current bgwriter behaviour.

I think you should get rid of BufFreelistLock completely and just decide

that freelist_lck will protect the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.

Then, in StrategyGetBuffer, acquire the freelist_lck at the point where

the LWLock is acquired today. Increment StrategyControl->numBufferAllocs;
save the values of StrategyControl->bgwriterLatch; pop a buffer off the
freelist if there is one, saving its identity. Release the spinlock.
Then, set the bgwriterLatch if needed. In the first loop, first check
whether the buffer we previously popped from the freelist is pinned or has
a non-zero usage count and return it if not, holding the buffer header
lock. Otherwise, reacquire the spinlock just long enough to pop a new
potential victim and then loop around.

I shall take care of doing this way in next version of patch.

Under this locking strategy, StrategyNotifyBgWriter would use

freelist_lck. Right now, the patch removes the only caller, and should
therefore remove the function as well, but if we go with the new-process
idea listed above that part would get reverted, and then you'd need to make
it use the correct spinlock. You should also go through this patch and
remove all the commented-out bits and pieces that you haven't cleaned up;
those are distracting and unhelpful.

Sure.

Thank you for review.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#14Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#13)
Re: Scaling shared buffer eviction

On Wed, Aug 6, 2014 at 6:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

If I'm reading this right, the new statistic is an incrementing counter
where, every time you update it, you add the number of buffers currently on
the freelist. That makes no sense.

I think using 'number of buffers currently on the freelist' and
'number of recently allocated buffers' for consecutive cycles,
we can figure out approximately how many buffer allocations
need a clock sweep, assuming the low and high threshold
watermarks are fixed. However there can be cases where it is not
easy to estimate that number.

Counters should be designed in such a way that you can read one, and
then read it again later, and make sense of it - you should not need
to read the counter on *consecutive* cycles to interpret it.

I think what you should be counting is the number of allocations that are
being satisfied from the free-list. Then, by comparing the rate at which
that value is incrementing to the rate at which buffers_alloc is
incrementing, somebody can figure out what percentage of allocations are
requiring a clock-sweep run. Actually, I think it's better to flip it
around: count the number of allocations that require an individual backend
to run the clock sweep (vs. being satisfied from the free-list); call it,
say, buffers_backend_clocksweep. We can then try to tune the patch to make
that number as small as possible under varying workloads.

This can give us a clear idea for tuning the patch; however, we need
to maintain 3 counters for it in code (recent_alloc, needed for the
current bgwriter logic, and the other 2 suggested by you). Do you
want to retain such counters in the code, or are they only a kind of
debug info for the patch?

I only mean to propose one new counter, and I'd imagine including that
in the final patch. We already have a counter of total buffer
allocations; that's buffers_alloc. I'm proposing to add an additional
counter for the number of those allocations not satisfied from the
free list, with a name like buffers_alloc_clocksweep (I said
buffers_backend_clocksweep above, but that's probably not best, as the
existing buffers_backend counts buffer *writes*, not allocations). I
think we would definitely want to retain this counter in the final
patch, as an additional column in pg_stat_bgwriter.

d. Autotune the low and high threshold for freelist for various
configurations.

I think we need to come up with some kind of formula here rather than just
a list of hard-coded constants.

That was my initial intention as well, and I have tried approaches
based on the number of shared buffers, like keeping the threshold
values as a percentage of shared buffers, but nothing could satisfy
different kinds of workloads. The current values I have chosen are
based on experiments for various workloads at different thresholds. I
have shown the lwlock_stats data for various loads based on the
current thresholds upthread. Another way could be to make them config
knobs: use the values given by the user if provided, else go with the
fixed values.

How did you go about determining the optimal value for a particular workload?

When the list is kept short, it's less likely that a value on the list
will be referenced or dirtied again before the page is actually
recycled. That's clearly good. But when the list is long, it's less
likely to become completely empty and thereby force individual
backends to run the clock-sweep. My suspicion is that, when the
number of buffers is small, the impact of the list being too short
isn't likely to be very significant, because running the clock-sweep
isn't all that expensive anyway - even if you have to scan through the
entire buffer pool multiple times, there aren't that many buffers.
But when the number of buffers is large, those repeated scans can
cause a major performance hit, so having an adequate pool of free
buffers becomes much more important.

I think your list of high-watermarks is far too generous for low
buffer counts. With more than 100k shared buffers, you've got a
high-watermark of 2k buffers, which means that 2% or less of the
buffers will be on the freelist, which seems a little on the high side
to me, but probably in the ballpark of what we should be aiming for.
But at 10001 shared buffers, you can have 1000 of them on the
freelist, which is 10% of the buffer pool; that seems high. At 101
shared buffers, 75% of the buffers in the system can be on the
freelist; that seems ridiculous. The chances of a buffer still being
unused by the time it reaches the head of the freelist seem very
small.

Based on your existing list of thresholds, and taking the above into
account, I'd suggest something like this: let the high-watermark for
the freelist be 0.5% of the total number of buffers, with a maximum of
2000 and a minimum of 5. Let the low-watermark be 20% of the
high-watermark. That might not be best, but I think some kind of
formula like that can likely be made to work. I would suggest
focusing your testing on configurations with *large* settings for
shared_buffers, say 1-64GB, rather than small configurations. Anyone
who cares greatly about performance isn't going to be running with
only 8MB of shared_buffers anyway. Arguably we shouldn't even run the
reclaim process on very small configurations; I think there should
probably a GUC (PGC_SIGHUP) to control whether it gets launched.
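The suggested rule is simple enough to sketch directly; note that the 0.5% figure, the [5, 2000] clamp, and the 20% low watermark are the proposal from this mail, not settled values.

```c
/*
 * Compute freelist watermarks from the proposed rule of thumb:
 * high watermark = 0.5% of shared buffers, clamped to [5, 2000];
 * low watermark  = 20% of the high watermark.
 */
static void
freelist_watermarks(int nbuffers, int *low, int *high)
{
    int hi = nbuffers / 200;   /* 0.5% of the buffer pool */

    if (hi > 2000)
        hi = 2000;
    if (hi < 5)
        hi = 5;

    *high = hi;
    *low = hi / 5;             /* 20% of the high watermark */
}
```

For example, with 8kB pages, 1GB of shared_buffers (131072 buffers) gives a high watermark of 655, while anything from 3GB upward hits the 2000-buffer cap.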

I think it would be a good idea to analyze how frequently the reclaim
process gets woken up. In the worst case, this happens once per (high
watermark - low watermark) allocations; that is, the system reaches
the low watermark and then does no further allocations until the
reclaim process brings the freelist back up to the high watermark.
But if more allocations occur between the time the reclaim process is
woken and the time it reaches the high watermark, then it should run
for longer, until the high watermark is reached. At least for
debugging purposes, I think it would be useful to have a counter of
reclaim wakeups. I'm not sure whether that's worth including in the
final patch, but it might be.

That will certainly help in retaining the current behaviour of
bgwriter and make the idea cleaner. I will modify the patch
to have a new background process unless somebody thinks
otherwise.

If we go with this approach, one thing which we need to decide
is what to do in case a buffer which has usage_count zero is *dirty*,
as I don't think it is a good idea to put it on the freelist.

I thought a bit about this yesterday. I think the problem is that we
might be in a situation where buffers are being dirtied faster than
they can be cleaned. In that case, if we only put clean buffers on the
freelist, then every backend in the system will be fighting over the
ever-dwindling supply of clean buffers until, in the worst case,
there's maybe only 1 clean buffer which is getting evicted repeatedly
at top speed - or maybe even no clean buffers, and the reclaim process
just spins in an infinite loop looking for clean buffers that aren't
there.

To put that another way, the rate at which buffers are being dirtied
can't exceed the rate at which they are being cleaned forever.
Eventually, somebody is going to have to wait. Having the backends
wait by being forced to write some dirty buffers does not seem like a
bad way to accomplish that. So I favor just putting the buffers on
freelist without regard to whether they are clean or dirty. If this
turns out not to work well we can look at other options (probably some
variant of (b) from your list).

Instead, it would just run the clock sweep (i.e. the last loop inside
StrategyGetBuffer) and put the buffers onto the free list.

Don't we need to do more than just the last loop inside StrategyGetBuffer()?
The clock sweep in StrategyGetBuffer() is responsible for getting one
buffer with usage_count = 0, whereas we need to run the loop until it
finds and moves enough such buffers to populate the freelist up to
its high watermark.

Yeah, that's what I meant. Of course, it should add each buffer to
the freelist individually, not batch them up and add them all at once.
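As a toy model of that loop (self-contained and purely illustrative; real code would also skip pinned buffers, take the buffer-header locks, and avoid re-adding buffers already on the list):

```c
#define NBUFFERS 16

/* Toy clock sweep feeding a freelist one buffer at a time. */
static int usage_count[NBUFFERS];       /* all zero initially */
static int freelist[NBUFFERS];
static int freelist_len;
static int next_victim;

static void
reclaim_to_high_watermark(int high_watermark)
{
    while (freelist_len < high_watermark)
    {
        int buf = next_victim;

        next_victim = (next_victim + 1) % NBUFFERS;
        if (usage_count[buf] == 0)
            freelist[freelist_len++] = buf; /* add each buffer individually */
        else
            usage_count[buf]--;             /* clock-sweep decrement */
    }
}
```

Because each pass decrements nonzero usage counts, the loop always makes progress in this model.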

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#13)
Re: Scaling shared buffer eviction

On 2014-08-06 15:42:08 +0530, Amit Kapila wrote:

On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

This essentially removes BgWriterDelay, but it's still mentioned in
BgBufferSync(). Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer called
from anywhere. Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what you've
changed. I realize you probably left it that way for testing purposes, but
you need to clean such things up before submitting. Likewise, if you've
rendered GUCs or statistics counters unused, you need to rip them out, so
that the scope of the changes you've made is clear to reviewers.

FWIW, I found this email almost unreadable because it misses quoting
signs after linebreaks in quoted content.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#15)
Re: Scaling shared buffer eviction

On Wed, Aug 13, 2014 at 2:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-08-06 15:42:08 +0530, Amit Kapila wrote:

On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

This essentially removes BgWriterDelay, but it's still mentioned in
BgBufferSync(). Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer called
from anywhere. Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what you've
changed. I realize you probably left it that way for testing purposes, but
you need to clean such things up before submitting. Likewise, if you've
rendered GUCs or statistics counters unused, you need to rip them out, so
that the scope of the changes you've made is clear to reviewers.

FWIW, I found this email almost unreadable because it misses quoting
signs after linebreaks in quoted content.

I think I have done something wrong while replying to Robert's
mail. The main point in that mail was to see if there is any
major problem in case we have a separate process (bgreclaim) to
populate the freelist. One thing which I thought could be problematic
is to put a buffer in the freelist which has usage_count zero but is *dirty*.
Please do let me know if you want clarification on anything in
particular.

Overall, the main changes required in patch as per above feedback
are:
1. add an additional counter for the number of those
allocations not satisfied from the free list, with a
name like buffers_alloc_clocksweep.
2. Autotune the low and high threshold values for buffers
in freelist. In the patch, I have kept them as hard-coded
values.
3. For populating freelist, have a separate process (bgreclaim)
instead of doing it by bgwriter.

There are other things also which I need to take care as per
feedback like some change in locking strategy and code.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#17Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#16)
Re: Scaling shared buffer eviction

On 2014-08-13 09:51:58 +0530, Amit Kapila wrote:

Overall, the main changes required in patch as per above feedback
are:
1. add an additional counter for the number of those
allocations not satisfied from the free list, with a
name like buffers_alloc_clocksweep.
2. Autotune the low and high threshold values for buffers
in freelist. In the patch, I have kept them as hard-coded
values.
3. For populating freelist, have a separate process (bgreclaim)
instead of doing it by bgwriter.

I'm not convinced that 3) is the right way to go to be honest. Seems
like a huge bandaid to me.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#17)
Re: Scaling shared buffer eviction

On Wed, Aug 13, 2014 at 4:25 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-08-13 09:51:58 +0530, Amit Kapila wrote:

Overall, the main changes required in patch as per above feedback
are:
1. add an additional counter for the number of those
allocations not satisfied from the free list, with a
name like buffers_alloc_clocksweep.
2. Autotune the low and high threshold values for buffers
in freelist. In the patch, I have kept them as hard-coded
values.
3. For populating freelist, have a separate process (bgreclaim)
instead of doing it by bgwriter.

I'm not convinced that 3) is the right way to go to be honest. Seems
like a huge bandaid to me.

Doing both jobs (populating the freelist and flushing dirty buffers) in
bgwriter isn't the best way either, because it might not be able to keep
up with both as needed.
One example: it can take much longer to flush a dirty buffer than to
move a buffer to the freelist, so if there are even a few buffers that
need flushing, the task of maintaining the freelist will suffer even
though there are non-dirty buffers that could be moved to it.
Another is maintaining the current behaviour of bgwriter, which is to
scan the entire buffer pool every few minutes (assuming the default
configuration). We can attempt to solve this problem as Robert suggested
upthread, but I am not completely sure that can guarantee the current
behaviour will be retained as-is.

I am not telling that having a separate process won't have any issues,
but I think we can tackle them without changing or complicating current
bgwriter behaviour.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#12)
Re: Scaling shared buffer eviction

On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Incidentally, while I generally think your changes to the locking regimen
in StrategyGetBuffer() are going in the right direction, they need
significant cleanup. Your patch adds two new spinlocks, freelist_lck and
victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and
you've now got StrategyGetBuffer() running with no lock at all when
accessing some things that used to be protected by BufFreelistLock;
specifically, you're doing StrategyControl->numBufferAllocs++ and
SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not
OK. I think you should get rid of BufFreelistLock completely and just
decide that freelist_lck will protect the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.

Then, in StrategyGetBuffer, acquire the freelist_lck at the point where
the LWLock is acquired today. Increment StrategyControl->numBufferAllocs;
save the value of StrategyControl->bgwriterLatch; pop a buffer off the
freelist if there is one, saving its identity. Release the spinlock.
Then, set the bgwriterLatch if needed. In the first loop, first check
whether the buffer we previously popped from the freelist is pinned or has
a non-zero usage count and return it if not, holding the buffer header
lock. Otherwise, reacquire the spinlock just long enough to pop a new
potential victim and then loop around.
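In outline, the first part of that protocol might look like the following toy model (pthread mutexes stand in for spinlocks; the struct, functions, and sample freelist are illustrative, not patch code):

```c
#include <pthread.h>

/*
 * Toy model of the proposed locking split: freelist_lck guards the
 * freelist head and the allocation counter; victimbuf_lck guards only
 * the clock hand.  Illustrative sketch only.
 */
typedef struct
{
    pthread_mutex_t freelist_lck;
    pthread_mutex_t victimbuf_lck;
    int     firstFreeBuffer;    /* -1 when the freelist is empty */
    long    numBufferAllocs;
    int     nextVictimBuffer;   /* guarded by victimbuf_lck (not exercised here) */
} StrategyControlModel;

static StrategyControlModel sc;
static int free_next[8];        /* per-buffer freelist links */

static void
model_init(void)
{
    pthread_mutex_init(&sc.freelist_lck, NULL);
    pthread_mutex_init(&sc.victimbuf_lck, NULL);
    sc.firstFreeBuffer = 3;     /* sample freelist: 3 -> 5 */
    free_next[3] = 5;
    free_next[5] = -1;
}

/*
 * Pop one buffer under freelist_lck, bumping the allocation counter,
 * as in the sequence described above; -1 means "run the clock sweep".
 */
static int
strategy_pop_freelist(void)
{
    int buf;

    pthread_mutex_lock(&sc.freelist_lck);
    sc.numBufferAllocs++;
    buf = sc.firstFreeBuffer;
    if (buf >= 0)
        sc.firstFreeBuffer = free_next[buf];
    pthread_mutex_unlock(&sc.freelist_lck);
    return buf;
}
```

The pin and usage_count checks on the popped buffer would then happen outside freelist_lck, under the buffer header lock, as described above.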

Today, while working on updating the patch to improve the locking,
I found that since we are now going to have a new process, we need
a separate latch in StrategyControl to wake that process up.
Another point: I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we update it we will already be holding
victimbuf_lck, and it doesn't make much sense to release
victimbuf_lck and reacquire freelist_lck just to update it.

I thought it better to mention the above points now so that if you
have any different thoughts about them, we can discuss them before
I take performance data with this locking protocol.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#19)
Re: Scaling shared buffer eviction

Amit Kapila <amit.kapila16@gmail.com> writes:

On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think you should get rid of BufFreelistLock completely and just
decide that freelist_lck will protect the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.

Another point is I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.

I'm rather concerned by this cavalier assumption that we can protect
fields a,b,c with one lock and fields x,y,z in the same struct with some
other lock.

A minimum requirement for that to work safely at all is that the fields
are of atomically fetchable/storable widths; which might be okay here
but it's a restriction that bears thinking about (and documenting).

But quite aside from safety, the fields are almost certainly going to
be in the same cache line which means contention between processes that
are trying to fetch or store them concurrently. For a patch whose sole
excuse for existence is to improve performance, that should be a very
scary concern.

(And yes, I realize these issues already affect the freelist. Perhaps
that's part of the reason we have performance issues with it.)

regards, tom lane

