Speed up Clog Access by increasing CLOG buffers
After reducing ProcArrayLock contention in commit
0e141c0fbb211bdd23783afa731e3eef95c9ad7a, the other lock that appears
contentious in read-write transactions is CLogControlLock. In my
investigation, I found that the contention is mainly due to two reasons.
First, while writing the transaction status in CLOG
(TransactionIdSetPageStatus()), we acquire CLogControlLock in Exclusive
mode, which contends with every other transaction that accesses the CLOG
to check transaction status; a patch [1] to reduce that has already been
proposed by Simon. Second, when the required CLOG page is not found in
the CLOG buffers, we need to acquire CLogControlLock in Exclusive mode,
which again contends with shared lockers trying to read transaction
status.
Increasing the number of CLOG buffers to 64 helps in reducing the
contention from the second cause. Experiments revealed that increasing
CLOG buffers only helps once the contention around ProcArrayLock is
reduced.
Performance Data
-----------------------------
RAM - 500GB
8 sockets, 64 cores (hyperthreaded, 128 threads in total)
Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB
pgbench setup
------------------------
scale factor - 300
Data is on magnetic disk and WAL on ssd.
pgbench -M prepared tpc-b
HEAD - commit 0e141c0f
Patch-1 - increase_clog_bufs_v1
Client Count/Patch_ver    1     8    16     32     64    128    256
HEAD                    911  5695  9886  18028  27851  28654  25714
Patch-1                 954  5568  9898  18450  29313  31108  28213
This data shows that there is an increase of ~5% at 64 client-count
and 8-10% at higher client counts, without degradation at lower client
counts. In the above data, there is some fluctuation seen at the
8-client count, but I attribute that to run-to-run variation; however,
if anybody has doubts I can re-verify the data at lower client counts.
Now if we try to further increase the number of CLOG buffers to 128,
no improvement is seen.
I have also verified that this improvement can be seen only after the
contention around ProcArrayLock is reduced. Below is the data with the
commit before the ProcArrayLock reduction patch. Setup and test are the
same as mentioned for the previous test.
HEAD - commit 253de7e1
Patch-1 - increase_clog_bufs_v1
Client Count/Patch_ver    128    256
HEAD                    16657  10512
Patch-1                 16694  10477
I think the benefit of this patch would be more significant along with
the other patch to reduce CLogControlLock contention [1] (I have not
tested both patches together as there are still a few issues left in the
other patch); however, it has its own independent value, so it can be
considered separately.
Thoughts?
[1]: /messages/by-id/CANP8+j+imQfHxkChFyfnXDyi6k-arAzRV+ZG-V_OFxEtJjOL2Q@mail.gmail.com
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
increase_clog_bufs_v1.patch (application/octet-stream)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 3a58f1e..d5c4043 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -417,30 +417,34 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
/*
* Number of shared CLOG buffers.
*
- * Testing during the PostgreSQL 9.2 development cycle revealed that on a
+ * Testing during the PostgreSQL 9.6 development cycle revealed that on a
* large multi-processor system, it was possible to have more CLOG page
- * requests in flight at one time than the number of CLOG buffers which existed
- * at that time, which was hardcoded to 8. Further testing revealed that
- * performance dropped off with more than 32 CLOG buffers, possibly because
- * the linear buffer search algorithm doesn't scale well.
+ * requests in flight at one time than the number of CLOG buffers which
+ * existed at that time, which was 32 assuming there are enough shared_buffers.
+ * Further testing revealed that either performance stayed same or dropped off
+ * with more than 64 CLOG buffers, possibly because the linear buffer search
+ * algorithm doesn't scale well or some other locking bottlenecks in the
+ * system mask the improvement.
*
- * Unconditionally increasing the number of CLOG buffers to 32 did not seem
+ * Unconditionally increasing the number of CLOG buffers to 64 did not seem
* like a good idea, because it would increase the minimum amount of shared
* memory required to start, which could be a problem for people running very
* small configurations. The following formula seems to represent a reasonable
* compromise: people with very low values for shared_buffers will get fewer
- * CLOG buffers as well, and everyone else will get 32.
+ * CLOG buffers as well, and everyone else will get 64.
*
* It is likely that some further work will be needed here in future releases;
* for example, on a 64-core server, the maximum number of CLOG requests that
* can be simultaneously in flight will be even larger. But that will
* apparently require more than just changing the formula, so for now we take
- * the easy way out.
+ * the easy way out. It could also happen that after removing other locking
+ * bottlenecks, further increase in CLOG buffers can help, but that's not the
+ * case now.
*/
Size
CLOGShmemBuffers(void)
{
- return Min(32, Max(4, NBuffers / 512));
+ return Min(64, Max(4, NBuffers / 512));
}
/*
On 2015-09-01 10:19:19 +0530, Amit Kapila wrote:
pgbench setup
------------------------
scale factor - 300
Data is on magnetic disk and WAL on ssd.
pgbench -M prepared tpc-b

HEAD - commit 0e141c0f
Patch-1 - increase_clog_bufs_v1

Client Count/Patch_ver    1     8    16     32     64    128    256
HEAD                    911  5695  9886  18028  27851  28654  25714
Patch-1                 954  5568  9898  18450  29313  31108  28213

This data shows that there is an increase of ~5% at 64 client-count
and 8-10% at higher client counts, without degradation at lower client
counts. In the above data, there is some fluctuation seen at the
8-client count, but I attribute that to run-to-run variation; however,
if anybody has doubts I can re-verify the data at lower client counts.
Now if we try to further increase the number of CLOG buffers to 128,
no improvement is seen.

I have also verified that this improvement can be seen only after the
contention around ProcArrayLock is reduced. Below is the data with the
commit before the ProcArrayLock reduction patch. Setup and test are the
same as mentioned for the previous test.
The buffer replacement algorithm for clog is rather stupid - I do wonder
where the cutoff is that it hurts.
Could you perhaps try to create a testcase where xids are accessed that
are so far apart on average that they're unlikely to be in memory? And
then test that across a number of client counts?
There's two reasons that I'd like to see that: first, I'd like to avoid
a regression; second, I'd like to avoid having to bump the maximum number
of buffers in small increments after every hardware generation...
 /*
  * Number of shared CLOG buffers.
  *
- * Testing during the PostgreSQL 9.2 development cycle revealed that on a
+ * Testing during the PostgreSQL 9.6 development cycle revealed that on a
  * large multi-processor system, it was possible to have more CLOG page
- * requests in flight at one time than the number of CLOG buffers which existed
- * at that time, which was hardcoded to 8. Further testing revealed that
- * performance dropped off with more than 32 CLOG buffers, possibly because
- * the linear buffer search algorithm doesn't scale well.
+ * requests in flight at one time than the number of CLOG buffers which
+ * existed at that time, which was 32 assuming there are enough shared_buffers.
+ * Further testing revealed that either performance stayed same or dropped off
+ * with more than 64 CLOG buffers, possibly because the linear buffer search
+ * algorithm doesn't scale well or some other locking bottlenecks in the
+ * system mask the improvement.
  *
- * Unconditionally increasing the number of CLOG buffers to 32 did not seem
+ * Unconditionally increasing the number of CLOG buffers to 64 did not seem
  * like a good idea, because it would increase the minimum amount of shared
  * memory required to start, which could be a problem for people running very
  * small configurations. The following formula seems to represent a reasonable
  * compromise: people with very low values for shared_buffers will get fewer
- * CLOG buffers as well, and everyone else will get 32.
+ * CLOG buffers as well, and everyone else will get 64.
  *
  * It is likely that some further work will be needed here in future releases;
  * for example, on a 64-core server, the maximum number of CLOG requests that
  * can be simultaneously in flight will be even larger. But that will
  * apparently require more than just changing the formula, so for now we take
- * the easy way out.
+ * the easy way out. It could also happen that after removing other locking
+ * bottlenecks, further increase in CLOG buffers can help, but that's not the
+ * case now.
  */
I think the comment should be more drastically rephrased to not
reference individual versions and numbers.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund wrote:
The buffer replacement algorithm for clog is rather stupid - I do wonder
where the cutoff is that it hurts.

Could you perhaps try to create a testcase where xids are accessed that
are so far apart on average that they're unlikely to be in memory? And
then test that across a number of client counts?

There's two reasons that I'd like to see that: first, I'd like to avoid
a regression; second, I'd like to avoid having to bump the maximum number
of buffers in small increments after every hardware generation...
I wonder if it would make sense to explore an idea that has been floated
for years now -- to have pg_clog pages be allocated as part of shared
buffers rather than have their own separate pool. That way, no separate
hardcoded allocation limit is needed. It's probably pretty tricky to
implement, though :-(
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2015-09-07 10:34:10 -0300, Alvaro Herrera wrote:
I wonder if it would make sense to explore an idea that has been floated
for years now -- to have pg_clog pages be allocated as part of shared
buffers rather than have their own separate pool. That way, no separate
hardcoded allocation limit is needed. It's probably pretty tricky to
implement, though :-(
I still think that'd be a good plan, especially as it'd also let us use
a lot of related infrastructure. I doubt we could just use the standard
cache replacement mechanism though - it's not particularly efficient
either... I also have my doubts that a hash table lookup at every clog
lookup is going to be ok performancewise.
The biggest problem will probably be that the buffer manager is pretty
directly tied to relations and breaking up that bond won't be all that
easy. My guess is that the best bet here is that the easiest way to at
least explore this is to define pg_clog/... as their own tablespaces
(akin to pg_global) and treat the files therein as plain relations.
Greetings,
Andres Freund
Andres Freund wrote:
On 2015-09-07 10:34:10 -0300, Alvaro Herrera wrote:
I wonder if it would make sense to explore an idea that has been floated
for years now -- to have pg_clog pages be allocated as part of shared
buffers rather than have their own separate pool. That way, no separate
hardcoded allocation limit is needed. It's probably pretty tricky to
implement, though :-(

I still think that'd be a good plan, especially as it'd also let us use
a lot of related infrastructure. I doubt we could just use the standard
cache replacement mechanism though - it's not particularly efficient
either... I also have my doubts that a hash table lookup at every clog
lookup is going to be ok performancewise.
Yeah. I guess we'd have to mark buffers as unusable for regular pages
("somehow"), and have a separate lookup mechanism. As I said, it is
likely to be tricky.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Sep 7, 2015 at 7:04 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
Andres Freund wrote:
The buffer replacement algorithm for clog is rather stupid - I do wonder
where the cutoff is that it hurts.

Could you perhaps try to create a testcase where xids are accessed that
are so far apart on average that they're unlikely to be in memory?
Yes, I am working on it. What I have in mind is to create a table with
a large number of rows (say 50000000) and have each row with a different
transaction id. Then each transaction should try to update rows that
are at least 1048576 (the number of transactions whose status can be
held in 32 CLOG buffers) apart; that way, each update will try to access
a CLOG page that is not in memory. Let me know if you can think of any
better or simpler way.
There's two reasons that I'd like to see that: first, I'd like to avoid
a regression; second, I'd like to avoid having to bump the maximum number
of buffers in small increments after every hardware generation...

I wonder if it would make sense to explore an idea that has been floated
for years now -- to have pg_clog pages be allocated as part of shared
buffers rather than have their own separate pool.
There could be some benefits from it, but I think we would still have to
acquire an Exclusive lock while committing a transaction or while
extending the CLOG, which are also major sources of contention in this
area. The benefits of moving it to shared_buffers could be that the
upper limit on the number of pages that can be retained in memory could
be increased, and even when we have to replace a page, the
responsibility to flush it could be delegated to checkpoint. So yes,
there could be benefits with this idea, but I am not sure they are worth
investigating. One thing we could try, if you think it is beneficial, is
to just skip the fsync during the write of CLOG pages; if that proves
beneficial, then we can think of pushing it to checkpoint (something
similar to what Andres has mentioned on a nearby thread).
Yet another way could be to have a configuration variable for the number
of CLOG buffers (clog_buffers).
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 7, 2015 at 9:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Andres Freund wrote:
The buffer replacement algorithm for clog is rather stupid - I do wonder
where the cutoff is that it hurts.

Could you perhaps try to create a testcase where xids are accessed that
are so far apart on average that they're unlikely to be in memory? And
then test that across a number of client counts?

There's two reasons that I'd like to see that: first, I'd like to avoid
a regression; second, I'd like to avoid having to bump the maximum number
of buffers in small increments after every hardware generation...

I wonder if it would make sense to explore an idea that has been floated
for years now -- to have pg_clog pages be allocated as part of shared
buffers rather than have their own separate pool. That way, no separate
hardcoded allocation limit is needed. It's probably pretty tricky to
implement, though :-(
Yeah, I looked at that once and threw my hands up in despair pretty
quickly. I also considered another idea that looked simpler: instead
of giving every SLRU its own pool of pages, have one pool of pages for
all of them, separate from shared buffers but common to all SLRUs.
That looked easier, but still not easy.
I've also considered trying to replace the entire SLRU system with new
code and throwing away what exists today. The locking mode is just
really strange compared to what we do elsewhere. That, too, does not
look all that easy. :-(
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 3, 2015 at 5:11 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-01 10:19:19 +0530, Amit Kapila wrote:
pgbench setup
------------------------
scale factor - 300
Data is on magnetic disk and WAL on ssd.
pgbench -M prepared tpc-b
HEAD - commit 0e141c0f
Patch-1 - increase_clog_bufs_v1

The buffer replacement algorithm for clog is rather stupid - I do wonder
where the cutoff is that it hurts.

Could you perhaps try to create a testcase where xids are accessed that
are so far apart on average that they're unlikely to be in memory? And
then test that across a number of client counts?
Okay, I have tried one such test, but what I could come up with is that
on average every 100th access is a disk access; I then tested it with
different numbers of clog buffers and client counts. Below is the result:
Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=32GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB
autovacuum=off
HEAD - commit 49124613
Patch-1 - Clog Buffers - 64
Patch-2 - Clog Buffers - 128
Client Count/Patch_ver     1     8     64    128
HEAD                    1395  8336  37866  34463
Patch-1                 1615  8180  37799  35315
Patch-2                 1409  8219  37068  34729
So there is not much difference in test results with different values
for clog buffers, probably because the I/O has dominated the test. It
shows that increasing the clog buffers won't regress the current
behaviour even though there are a lot more accesses for transaction
status outside the CLOG buffers.
Now about the test: create a table with a large number of rows (say
11617457; I tried to create larger, but it was taking too much time,
more than a day) and have each row with a different transaction id. Now
each transaction should update rows that are at least 1048576 (the
number of transactions whose status can be held in 32 CLOG buffers)
apart; that way, ideally each update will try to access a CLOG page that
is not in memory. However, as the value to update is selected randomly,
that leads to every 100th access being a disk access.
Test
-------
1. The attached file clog_prep.sh should create and populate the
required table and create the function used to access the CLOG pages.
You might want to update no_of_rows based on the rows you want to
create in the table.
2. The attached file access_clog_disk.sql is used to execute the
function with random values. You might want to update the nrows
variable based on the rows created in the previous step.
3. Use pgbench as follows with different client counts:
./pgbench -c 4 -j 4 -n -M prepared -f "access_clog_disk.sql" -T 300 postgres
4. To ensure that the clog access function always accesses the same
data during each run, the test copies the data_directory created by
step 1 before each run.
I have checked by adding some instrumentation that approximately
every 100th access is disk access, attached patch clog_info-v1.patch
adds the necessary instrumentation in code.
As an example, pgbench test yields below results:
./pgbench -c 4 -j 4 -n -M prepared -f "access_clog_disk.sql" -T 180 postgres
LOG: trans_status(3169396)
LOG: trans_status_disk(29546)
LOG: trans_status(3054952)
LOG: trans_status_disk(28291)
LOG: trans_status(3131242)
LOG: trans_status_disk(28989)
LOG: trans_status(3155449)
LOG: trans_status_disk(29347)
Here 'trans_status' is the number of times the process went for accessing
the CLOG status and 'trans_status_disk' is the number of times it went
to disk for accessing CLOG page.
/*
 * Number of shared CLOG buffers.
 *

I think the comment should be more drastically rephrased to not
reference individual versions and numbers.
Updated comments and the patch (increase_clog_bufs_v2.patch)
containing the same is attached.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
increase_clog_bufs_v2.patch (application/octet-stream)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 3a58f1e..1ee8309 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -417,30 +417,23 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
/*
* Number of shared CLOG buffers.
*
- * Testing during the PostgreSQL 9.2 development cycle revealed that on a
- * large multi-processor system, it was possible to have more CLOG page
- * requests in flight at one time than the number of CLOG buffers which existed
- * at that time, which was hardcoded to 8. Further testing revealed that
- * performance dropped off with more than 32 CLOG buffers, possibly because
- * the linear buffer search algorithm doesn't scale well.
+ * On larger multi-processor systems, it is possible to have many CLOG page
+ * requests in flight at one time which could lead to disk access for CLOG
+ * page if the required page is not found in memory. Testing revealed that
+ * we can get the best performance by having 64 CLOG buffers, more than that
+ * it doesn't improve performance.
*
- * Unconditionally increasing the number of CLOG buffers to 32 did not seem
+ * Unconditionally keeping the number of CLOG buffers to 64 did not seem
* like a good idea, because it would increase the minimum amount of shared
* memory required to start, which could be a problem for people running very
* small configurations. The following formula seems to represent a reasonable
* compromise: people with very low values for shared_buffers will get fewer
- * CLOG buffers as well, and everyone else will get 32.
- *
- * It is likely that some further work will be needed here in future releases;
- * for example, on a 64-core server, the maximum number of CLOG requests that
- * can be simultaneously in flight will be even larger. But that will
- * apparently require more than just changing the formula, so for now we take
- * the easy way out.
+ * CLOG buffers as well, and everyone else will get 64.
*/
Size
CLOGShmemBuffers(void)
{
- return Min(32, Max(4, NBuffers / 512));
+ return Min(64, Max(4, NBuffers / 512));
}
/*
clog_info-v1.patch (application/octet-stream)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 3a58f1e..a729101 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -399,6 +399,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
char *byteptr;
XidStatus status;
+ trans_status++;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid);
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 5fcea11..c61fe36 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -671,6 +671,8 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
return false;
}
+ trans_status_disk++;
+
if (CloseTransientFile(fd))
{
slru_errcause = SLRU_CLOSE_FAILED;
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 6bc0b06..97c6ad1 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -39,6 +39,8 @@
*/
bool proc_exit_inprogress = false;
+int trans_status = 0;
+int trans_status_disk = 0;
/*
* This flag tracks whether we've called atexit() in the current process
* (or in the parent postmaster).
@@ -137,7 +139,8 @@ proc_exit(int code)
chdir(gprofDirName);
}
#endif
-
+ elog(LOG, "trans_status(%d)", trans_status);
+ elog(LOG, "trans_status_disk(%d)", trans_status_disk);
elog(DEBUG3, "exit(%d)", code);
exit(code);
diff --git a/src/include/postgres.h b/src/include/postgres.h
index ccf1605..e352524 100644
--- a/src/include/postgres.h
+++ b/src/include/postgres.h
@@ -117,6 +117,9 @@ typedef enum vartag_external
VARTAG_ONDISK = 18
} vartag_external;
+extern int trans_status;
+extern int trans_status_disk;
+
/* this test relies on the specific tag values above */
#define VARTAG_IS_EXPANDED(tag) \
(((tag) & ~1) == VARTAG_EXPANDED_RO)
On Fri, Sep 11, 2015 at 10:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Could you perhaps try to create a testcase where xids are accessed that
are so far apart on average that they're unlikely to be in memory? And
then test that across a number of client counts?

Now about the test, create a table with large number of rows (say 11617457,
I have tried to create larger, but it was taking too much time (more than
a day)) and have each row with different transaction id. Now each
transaction should update rows that are at least 1048576 (number of
transactions whose status can be held in 32 CLog buffers) distance apart,
that way ideally for each update it will try to access Clog page that is
not in-memory, however as the value to update is getting selected randomly
and that leads to every 100th access as disk access.
What about just running a regular pgbench test, but hacking the
XID-assignment code so that we increment the XID counter by 100 each
time instead of 1?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 11, 2015 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Sep 11, 2015 at 10:31 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
Could you perhaps try to create a testcase where xids are accessed that
are so far apart on average that they're unlikely to be in memory? And
then test that across a number of client counts?

Now about the test, create a table with large number of rows (say 11617457,
I have tried to create larger, but it was taking too much time (more than
a day)) and have each row with different transaction id. Now each
transaction should update rows that are at least 1048576 (number of
transactions whose status can be held in 32 CLog buffers) distance apart,
that way ideally for each update it will try to access Clog page that is
not in-memory, however as the value to update is getting selected randomly
and that leads to every 100th access as disk access.

What about just running a regular pgbench test, but hacking the
XID-assignment code so that we increment the XID counter by 100 each
time instead of 1?
If I am not wrong, we need a difference of 1048576 transactions for each
record to make each CLOG access a disk access, so if we increment the
XID counter by 100, then probably every 10000th (or multiple of 10000)
transaction would go for disk access.
The number 1048576 is derived by the calculation below:
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
Transaction difference required for each transaction to go for disk access:
CLOG_XACTS_PER_PAGE * num_clog_buffers.
I think reducing to every 100th access for transaction status being a
disk access is sufficient to prove that there is no regression with the
patch for the scenario asked by Andres, or do you think it is not?
Now another possibility here could be that we try commenting out fsync
in the CLOG path to see how much it impacts the performance of this test
and then the pgbench test. I am not sure there will be any impact,
because even if every 100th transaction goes for disk access, that is
still less compared to the WAL fsync which we have to perform for each
transaction.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Sep 11, 2015 at 11:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
If I am not wrong we need 1048576 number of transactions difference
for each record to make each CLOG access a disk access, so if we
increment XID counter by 100, then probably every 10000th (or multiple
of 10000) transaction would go for disk access.

The number 1048576 is derived by the calculation below:
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

Transaction difference required for each transaction to go for disk access:
CLOG_XACTS_PER_PAGE * num_clog_buffers.

I think reducing to every 100th access for transaction status as disk
access is sufficient to prove that there is no regression with the patch
for the scenario asked by Andres, or do you think it is not?
I have no idea. I was just suggesting that hacking the server somehow
might be an easier way of creating the scenario Andres was interested
in than the process you described. But feel free to ignore me, I
haven't taken much time to think about this.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 09/11/2015 10:31 AM, Amit Kapila wrote:
Updated comments and the patch (increase_clog_bufs_v2.patch)
containing the same is attached.
I have done various runs on an Intel Xeon 28C/56T w/ 256Gb mem and 2 x
RAID10 SSD (data + xlog) with Min(64,).
Kept the shared_buffers=64GB and effective_cache_size=160GB settings
across all runs, but did runs with both synchronous_commit on and off
and different scale factors for pgbench.
The results are in flux for all client numbers within -2 to +2%
depending on the latency average.
So no real conclusion from here, other than that the patch doesn't
help/hurt performance on this setup; seeing a real benefit likely
depends on further CLogControlLock-related changes.
Best regards,
Jesper
On Fri, Sep 18, 2015 at 11:08 PM, Jesper Pedersen <
jesper.pedersen@redhat.com> wrote:
On 09/11/2015 10:31 AM, Amit Kapila wrote:
Updated comments and the patch (increase_clog_bufs_v2.patch)
containing the same is attached.

I have done various runs on an Intel Xeon 28C/56T w/ 256Gb mem and 2 x
RAID10 SSD (data + xlog) with Min(64,).
The benefit with this patch could be seen at somewhat higher
client-counts, as you can see in my initial mail; can you please
try once with client count > 64?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Aug 31, 2015 at 9:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Increasing CLOG buffers to 64 helps in reducing the contention due to second
reason. Experiments revealed that increasing CLOG buffers only helps
once the contention around ProcArrayLock is reduced.
There has been a lot of research on bitmap compression, more or less
for the benefit of bitmap index access methods.
Simple techniques like run-length encoding are effective for some
things. If the need to map the bitmap into memory to access the status
of transactions is a concern, there has been work done on that, too.
Byte-aligned bitmap compression is a technique that might offer a good
trade-off between compressing clog and decompression overhead -- I
think that there basically is no decompression overhead, because set
operations can be performed on the "compressed" representation
directly. There are other techniques, too.
Something to consider. There could be multiple benefits to compressing
clog, even beyond simply avoiding managing clog buffers.
--
Peter Geoghegan
On 09/18/2015 11:11 PM, Amit Kapila wrote:
I have done various runs on an Intel Xeon 28C/56T w/ 256Gb mem and 2 x
RAID10 SSD (data + xlog) with Min(64,).

The benefit with this patch could be seen at somewhat higher
client-count as you can see in my initial mail, can you please
once try with client count > 64?
Client count were from 1 to 80.
I did do one run with Min(128,) like you, but didn't see any difference
in the result compared to Min(64,), so focused instead on the
sync_commit on/off testing case.
Best regards,
Jesper
On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Fri, Sep 11, 2015 at 9:21 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Fri, Sep 11, 2015 at 10:31 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
Could you perhaps try to create a testcase where xids are accessed that
are so far apart on average that they're unlikely to be in memory? And
then test that across a number of client counts?

Now about the test: create a table with a large number of rows (say
11617457; I have tried to create a larger one, but it was taking too much
time (more than a day)) and have each row with a different transaction id.
Now each transaction should update rows that are at least 1048576 (the
number of transactions whose status can be held in 32 CLog buffers) apart;
that way, ideally each update will try to access a Clog page that is not
in-memory. However, as the value to update is selected randomly, that
leads to every 100th access being a disk access.
What about just running a regular pgbench test, but hacking the
XID-assignment code so that we increment the XID counter by 100 each
time instead of 1?

If I am not wrong, we need a difference of 1048576 transactions for each
record to make each CLOG access a disk access, so if we increment the XID
counter by 100, then probably every 10000th (or multiple of 10000)
transaction would go for disk access.

The number 1048576 is derived by the below calc:
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
Transaction difference required for each transaction to go for disk access:
CLOG_XACTS_PER_PAGE * num_clog_buffers.
That guarantees that every xid occupies its own 32-contiguous-pages chunk
of clog.
But clog pages are not pulled in and out in 32-page chunks, but in one-page
chunks. So you would only need a difference of 32,768 to get every real
transaction to live on its own clog page, which means every look up of a
different real transaction would have to do a page replacement. (I think
your references to disk access here are misleading. Isn't the issue here
the contention on the lock that controls the page replacement, not the
actual IO?)
I've attached a patch that allows you to set the guc "JJ_xid", which makes it
burn the given number of xids every time a new one is asked for. (The
patch introduces lots of other stuff as well, but I didn't feel like
ripping the irrelevant parts out--if you don't set any of the other gucs it
introduces from their defaults, they shouldn't cause you trouble.) I think
there are other tools around that do the same thing, but this is the one I
know about. It is easy to drive the system into wrap-around shutdown with
this, so lowering autovacuum_vacuum_cost_delay is a good idea.
Actually I haven't attached it, because then the commitfest app will list
it as the patch needing review, instead I've put it here
https://drive.google.com/file/d/0Bzqrh1SO9FcERV9EUThtT3pacmM/view?usp=sharing
I think reducing to every 100th access for transaction status being a disk
access is sufficient to prove that there is no regression with the patch for
the scenario asked by Andres, or do you think it is not?

Now another possibility here could be that we try commenting out fsync
in the CLOG path, to see how much it impacts the performance of this test
and then the pgbench test. I am not sure there will be any impact, because
even if every 100th transaction goes for disk access, that is still less
than the WAL fsync which we have to perform for each transaction.
You mentioned that your clog is not on ssd, but surely at this scale of
hardware, the hdd the clog is on has a bbu in front of it, no?
But I thought Andres' concern was not about fsync, but about the fact that
the SLRU does linear scans (repeatedly) of the buffers while holding the
control lock? At some point, scanning more and more buffers under the lock
is going to cause more contention than scanning fewer buffers and just
evicting a page will.
Cheers,
Jeff
On Mon, Oct 5, 2015 at 6:34 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

If I am not wrong we need 1048576 number of transactions difference
for each record to make each CLOG access a disk access, so if we
increment XID counter by 100, then probably every 10000th (or multiplier
of 10000) transaction would go for disk access.

The number 1048576 is derived by below calc:
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

Transaction difference required for each transaction to go for disk
access:
CLOG_XACTS_PER_PAGE * num_clog_buffers.

That guarantees that every xid occupies its own 32-contiguous-pages chunk
of clog.

But clog pages are not pulled in and out in 32-page chunks, but one page
chunks. So you would only need 32,768 differences to get every real
transaction to live on its own clog page, which means every look up of a
different real transaction would have to do a page replacement.
Agreed, but that doesn't affect the test result with the test done above.
(I think your references to disk access here are misleading. Isn't the
issue here the contention on the lock that controls the page replacement,
not the actual IO?)
The point is that if there is no I/O needed, then all the read-access for
transaction status will just use Shared locks, however if there is an I/O,
then it would need an Exclusive lock.
I've attached a patch that allows you set the guc "JJ_xid",which makes it
burn the given number of xids every time one new one is asked for. (The
patch introduces lots of other stuff as well, but I didn't feel like
ripping the irrelevant parts out--if you don't set any of the other gucs it
introduces from their defaults, they shouldn't cause you trouble.) I think
there are other tools around that do the same thing, but this is the one I
know about. It is easy to drive the system into wrap-around shutdown with
this, so lowering autovacuum_vacuum_cost_delay is a good idea.

Actually I haven't attached it, because then the commitfest app will list
it as the patch needing review, instead I've put it here
https://drive.google.com/file/d/0Bzqrh1SO9FcERV9EUThtT3pacmM/view?usp=sharing
Thanks, I think probably this could also be used for testing.
I think reducing to every 100th access for transaction status being a disk
access is sufficient to prove that there is no regression with the patch for
the scenario asked by Andres, or do you think it is not?

Now another possibility here could be that we try commenting out fsync
in the CLOG path, to see how much it impacts the performance of this test
and then the pgbench test. I am not sure there will be any impact, because
even if every 100th transaction goes for disk access, that is still less
than the WAL fsync which we have to perform for each transaction.

You mentioned that your clog is not on ssd, but surely at this scale of
hardware, the hdd the clog is on has a bbu in front of it, no?
Yes.
But I thought Andres' concern was not about fsync, but about the fact that
the SLRU does linear scans (repeatedly) of the buffers while holding the
control lock? At some point, scanning more and more buffers under the lock
is going to cause more contention than scanning fewer buffers and just
evicting a page will.
Yes, at some point that could matter, but I could not see the impact
at 64 or 128 Clog buffers.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 21, 2015 at 6:25 PM, Jesper Pedersen <jesper.pedersen@redhat.com>
wrote:
On 09/18/2015 11:11 PM, Amit Kapila wrote:
I have done various runs on an Intel Xeon 28C/56T w/ 256Gb mem and 2 x
RAID10 SSD (data + xlog) with Min(64,).
The benefit with this patch could be seen at somewhat higher
client-count as you can see in my initial mail, can you please
once try with client count > 64?

Client count were from 1 to 80.
I did do one run with Min(128,) like you, but didn't see any difference in
the result compared to Min(64,), so focused instead on the sync_commit
on/off testing case.
I think the main focus for tests in this area would be at higher client
counts. At what scale factors have you taken the data, and what are
the other non-default settings you have used? By the way, have you
tried dropping and recreating the database and restarting the server
after each run? Can you share the exact steps you have used to perform
the tests? I am not sure why it is not showing the benefit in your testing;
maybe the benefit shows only on a somewhat higher-end m/c, or it could be
that some of the settings used for the test are not the same as mine, or
that the way to test the read-write workload of pgbench is different.
In any case, I went ahead and tried further reducing the CLogControlLock
contention by grouping the transaction status updates. The basic idea
is the same as is used to reduce the ProcArrayLock contention [1], which is
to allow one of the procs to become the leader and update the transaction
status for other active transactions in the system. This has helped to
reduce the contention around CLOGControlLock. Attached patch
group_update_clog_v1.patch implements this idea.
I have taken performance data with this patch to see the impact at
various scale-factors. All the data is for cases when data fits in shared
buffers and is taken against commit - 5c90a2ff on server with below
configuration and non-default postgresql.conf settings.
Performance Data
-----------------------------
RAM - 500GB
8 sockets, 64 cores (hyperthreaded, 128 threads total)
Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout = 35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB
Refer attached files for performance data.
sc_300_perf.png - This data indicates that at scale_factor 300, there is a
gain of ~15% at higher client counts, without degradation at lower client
count.
different_sc_perf.png - At various scale factors, there is a gain from
~15% to 41% at higher client counts, and in some cases we see a gain
of ~5% at a somewhat moderate client count (64) as well.
perf_write_clogcontrollock_data_v1.ods - Detailed performance data at
various client counts and scale factors.
Feel free to ask for more details if the data in attached files is not
clear.
Below is the LWLock_Stats information with and without patch:
Stats Data
---------
A. scale_factor = 300; shared_buffers=32GB; client_connections - 128
HEAD - 5c90a2ff
----------------
CLogControlLock Data
------------------------
PID 94100 lwlock main 11: shacq 678672 exacq 326477 blk 204427 spindelay
8532 dequeue self 93192
PID 94129 lwlock main 11: shacq 757047 exacq 363176 blk 207840 spindelay
8866 dequeue self 96601
PID 94115 lwlock main 11: shacq 721632 exacq 345967 blk 207665 spindelay
8595 dequeue self 96185
PID 94011 lwlock main 11: shacq 501900 exacq 241346 blk 173295 spindelay
7882 dequeue self 78134
PID 94087 lwlock main 11: shacq 653701 exacq 314311 blk 201733 spindelay
8419 dequeue self 92190
After Patch group_update_clog_v1
----------------
CLogControlLock Data
------------------------
PID 100205 lwlock main 11: shacq 836897 exacq 176007 blk 116328 spindelay
1206 dequeue self 54485
PID 100034 lwlock main 11: shacq 437610 exacq 91419 blk 77523 spindelay 994
dequeue self 35419
PID 100175 lwlock main 11: shacq 748948 exacq 158970 blk 114027 spindelay
1277 dequeue self 53486
PID 100162 lwlock main 11: shacq 717262 exacq 152807 blk 115268 spindelay
1227 dequeue self 51643
PID 100214 lwlock main 11: shacq 856044 exacq 180422 blk 113695 spindelay
1202 dequeue self 54435
The above data indicates that contention due to CLogControlLock is
reduced by around 50% with this patch.
The reasons for remaining contention could be:
1. Readers of clog data (checking transaction status) can take the
CLogControlLock in Exclusive mode when reading a page from disk; this can
contend with other readers (shared lockers of CLogControlLock) and with the
exclusive locker which updates transaction status. One of the ways to
mitigate this contention is to increase the number of CLOG buffers, for
which a patch has already been posted on this thread.
2. Readers of clog data (checking transaction status) take the
CLogControlLock in Shared mode, which can contend with the exclusive locker
(the group leader) which updates transaction status. I have tried to reduce
the amount of work done by the group leader, by allowing the leader to read
the Clog page just once for all the transactions in the group that updated
the same CLOG page (an idea similar to what we currently use for updating
the status of transactions having a sub-transaction tree), but that hasn't
given any further performance boost, so I left it.
I think we can use some other ways as well to reduce the contention around
CLogControlLock, by doing somewhat major surgery around SLRU like using
buffer pools similar to shared buffers, but this idea gives us a moderate
improvement without much impact on the existing mechanism.
Thoughts?
[1]: /messages/by-id/CAA4eK1JbX4FzPHigNt0JSaz30a85BPJV+ewhk+wg_o-T6xufEA@mail.gmail.com
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
group_update_clog_v1.patch (application/octet-stream)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index ea83655..007317a 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -40,6 +40,7 @@
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "storage/proc.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -91,6 +92,10 @@ static void TransactionIdSetStatusBit(TransactionId xid, XidStatus status,
XLogRecPtr lsn, int slotno);
static void set_status_by_pages(int nsubxids, TransactionId *subxids,
XidStatus status, XLogRecPtr lsn);
+static void TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
+ XLogRecPtr lsn, int pageno);
+static void TransactionIdSetPageStatusInternal(TransactionId xid, XidStatus status,
+ XLogRecPtr lsn, int pageno);
/*
@@ -248,6 +253,14 @@ set_status_by_pages(int nsubxids, TransactionId *subxids,
* Record the final state of transaction entries in the commit log for
* all entries on a single page. Atomic only on this page.
*
+ * Group the status updation for transactions that don't have
+ * subtransactions. This improves the efficiency of the transaction
+ * status updation by reducing the number of lock acquirations required
+ * for it. To achieve the group transaction status updation, we need to
+ * populate the transaction status related information in shared memory
+ * and doing it for sub-transactions would need a big chunk of shared
+ * memory, so we are not doing this optimization for such cases.
+ *
* Otherwise API is same as TransactionIdSetTreeStatus()
*/
static void
@@ -262,7 +275,92 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
status == TRANSACTION_STATUS_ABORTED ||
(status == TRANSACTION_STATUS_SUB_COMMITTED && !TransactionIdIsValid(xid)));
- LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
+ if (nsubxids > 0)
+ {
+ LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
+
+ /*
+ * If we're doing an async commit (ie, lsn is valid), then we must
+ * wait for any active write on the page slot to complete. Otherwise
+ * our update could reach disk in that write, which will not do since
+ * we mustn't let it reach disk until we've done the appropriate WAL
+ * flush. But when lsn is invalid, it's OK to scribble on a page while
+ * it is write-busy, since we don't care if the update reaches disk
+ * sooner than we think.
+ */
+ slotno = SimpleLruReadPage(ClogCtl, pageno, XLogRecPtrIsInvalid(lsn), xid);
+
+ /*
+ * Set the main transaction id, if any.
+ *
+ * If we update more than one xid on this page while it is being
+ * written out, we might find that some of the bits go to disk and
+ * others don't. If we are updating commits on the page with the
+ * top-level xid that could break atomicity, so we subcommit the
+ * subxids first before we mark the top-level commit.
+ */
+ if (TransactionIdIsValid(xid))
+ {
+ /* Subtransactions first, if needed ... */
+ if (status == TRANSACTION_STATUS_COMMITTED)
+ {
+ for (i = 0; i < nsubxids; i++)
+ {
+ Assert(ClogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+ TransactionIdSetStatusBit(subxids[i],
+ TRANSACTION_STATUS_SUB_COMMITTED,
+ lsn, slotno);
+ }
+ }
+
+ /* ... then the main transaction */
+ TransactionIdSetStatusBit(xid, status, lsn, slotno);
+ }
+
+ /* Set the subtransactions */
+ for (i = 0; i < nsubxids; i++)
+ {
+ Assert(ClogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+ TransactionIdSetStatusBit(subxids[i], status, lsn, slotno);
+ }
+
+ ClogCtl->shared->page_dirty[slotno] = true;
+
+ LWLockRelease(CLogControlLock);
+ }
+ else
+ {
+ /*
+ * If we can immediately acquire CLogControlLock, we update the status
+ * of our own XID and release the lock. If not, use group XID status
+ * updation to improve efficiency.
+ */
+ if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
+ {
+ TransactionIdSetPageStatusInternal(xid, status, lsn, pageno);
+ LWLockRelease(CLogControlLock);
+ }
+ else
+ TransactionGroupUpdateXidStatus(xid, status, lsn, pageno);
+ }
+}
+
+/*
+ * Record the final state of transaction entry in the commit log
+ *
+ * We don't do any locking here; caller must handle that.
+ */
+static void
+TransactionIdSetPageStatusInternal(TransactionId xid, XidStatus status,
+ XLogRecPtr lsn, int pageno)
+{
+ int slotno;
+
+ /* We should definitely have an XID whose status needs to be updated. */
+ Assert(TransactionIdIsValid(xid));
+
+ Assert(status == TRANSACTION_STATUS_COMMITTED ||
+ status == TRANSACTION_STATUS_ABORTED);
/*
* If we're doing an async commit (ie, lsn is valid), then we must wait
@@ -276,42 +374,141 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
slotno = SimpleLruReadPage(ClogCtl, pageno, XLogRecPtrIsInvalid(lsn), xid);
/*
- * Set the main transaction id, if any.
- *
- * If we update more than one xid on this page while it is being written
- * out, we might find that some of the bits go to disk and others don't.
- * If we are updating commits on the page with the top-level xid that
- * could break atomicity, so we subcommit the subxids first before we mark
- * the top-level commit.
+ * Update the status of transaction in clog.
*/
- if (TransactionIdIsValid(xid))
+ TransactionIdSetStatusBit(xid, status, lsn, slotno);
+
+ ClogCtl->shared->page_dirty[slotno] = true;
+}
+
+/*
+ * When we cannot immediately acquire CLogControlLock in exclusive mode at
+ * commit time, add ourselves to a list of processes that need their XIDs
+ * status updation. The first process to add itself to the list will acquire
+ * CLogControlLock in exclusive mode and perform TransactionIdSetPageStatusInternal
+ * on behalf of all group members. This avoids a great deal of contention
+ * around CLogControlLock when many processes are trying to commit at once,
+ * since the lock need not be repeatedly handed off from one committing
+ * process to the next.
+ */
+static void
+TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
+ XLogRecPtr lsn, int pageno)
+{
+ volatile PROC_HDR *procglobal = ProcGlobal;
+ PGPROC *proc = MyProc;
+ uint32 nextidx;
+ uint32 wakeidx;
+ int extraWaits = -1;
+
+ /* We should definitely have an XID whose status needs to be updated. */
+ Assert(TransactionIdIsValid(xid));
+
+ /*
+ * Add ourselves to the list of processes needing a group XID status
+ * updation.
+ */
+ proc->updateXidStatus = true;
+ proc->memberXid = xid;
+ proc->memberXidstatus = status;
+ proc->clogPage = pageno;
+ proc->asyncCommitLsn = lsn;
+ while (true)
{
- /* Subtransactions first, if needed ... */
- if (status == TRANSACTION_STATUS_COMMITTED)
+ nextidx = pg_atomic_read_u32(&procglobal->firstupdateXidStatusElem);
+ pg_atomic_write_u32(&proc->nextupdateXidStatusElem, nextidx);
+
+ if (pg_atomic_compare_exchange_u32(&procglobal->firstupdateXidStatusElem,
+ &nextidx,
+ (uint32) proc->pgprocno))
+ break;
+ }
+
+ /*
+ * If the list was not empty, the leader will update the status of our
+ * XID. It is impossible to have followers without a leader because the
+ * first process that has added itself to the list will always have
+ * nextidx as INVALID_PGPROCNO.
+ */
+ if (nextidx != INVALID_PGPROCNO)
+ {
+ /* Sleep until the leader updates our XID status. */
+ for (;;)
{
- for (i = 0; i < nsubxids; i++)
- {
- Assert(ClogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
- TransactionIdSetStatusBit(subxids[i],
- TRANSACTION_STATUS_SUB_COMMITTED,
- lsn, slotno);
- }
+ /* acts as a read barrier */
+ PGSemaphoreLock(&proc->sem);
+ if (!proc->updateXidStatus)
+ break;
+ extraWaits++;
}
- /* ... then the main transaction */
- TransactionIdSetStatusBit(xid, status, lsn, slotno);
+ Assert(pg_atomic_read_u32(&proc->nextupdateXidStatusElem) == INVALID_PGPROCNO);
+
+ /* Fix semaphore count for any absorbed wakeups */
+ while (extraWaits-- > 0)
+ PGSemaphoreUnlock(&proc->sem);
+ return;
}
- /* Set the subtransactions */
- for (i = 0; i < nsubxids; i++)
+ /* We are the leader. Acquire the lock on behalf of everyone. */
+ LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
+
+ /*
+ * Now that we've got the lock, clear the list of processes waiting for
+ * group XID status updation, saving a pointer to the head of the list.
+ * Trying to pop elements one at a time could lead to an ABA problem.
+ */
+ while (true)
{
- Assert(ClogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
- TransactionIdSetStatusBit(subxids[i], status, lsn, slotno);
+ nextidx = pg_atomic_read_u32(&procglobal->firstupdateXidStatusElem);
+ if (pg_atomic_compare_exchange_u32(&procglobal->firstupdateXidStatusElem,
+ &nextidx,
+ INVALID_PGPROCNO))
+ break;
}
- ClogCtl->shared->page_dirty[slotno] = true;
+ /* Remember head of list so we can perform wakeups after dropping lock. */
+ wakeidx = nextidx;
+
+ /* Walk the list and update the status of all XIDs. */
+ while (nextidx != INVALID_PGPROCNO)
+ {
+ PGPROC *proc = &ProcGlobal->allProcs[nextidx];
+
+ TransactionIdSetPageStatusInternal(proc->memberXid,
+ proc->memberXidstatus,
+ proc->asyncCommitLsn,
+ proc->clogPage);
+ /* Move to next proc in list. */
+ nextidx = pg_atomic_read_u32(&proc->nextupdateXidStatusElem);
+ }
+
+ /* We're done with the lock now. */
LWLockRelease(CLogControlLock);
+
+ /*
+ * Now that we've released the lock, go back and wake everybody up. We
+ * don't do this under the lock so as to keep lock hold times to a
+ * minimum. The system calls we need to perform to wake other processes
+ * up are probably much slower than the simple memory writes we did while
+ * holding the lock.
+ */
+ while (wakeidx != INVALID_PGPROCNO)
+ {
+ PGPROC *proc = &ProcGlobal->allProcs[wakeidx];
+
+ wakeidx = pg_atomic_read_u32(&proc->nextupdateXidStatusElem);
+ pg_atomic_write_u32(&proc->nextupdateXidStatusElem, INVALID_PGPROCNO);
+
+ /* ensure all previous writes are visible before follower continues. */
+ pg_write_barrier();
+
+ proc->updateXidStatus = false;
+
+ if (proc != MyProc)
+ PGSemaphoreUnlock(&proc->sem);
+ }
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index bb10c1b..e1c71a6 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -182,6 +182,7 @@ InitProcGlobal(void)
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->firstClearXidElem, INVALID_PGPROCNO);
+ pg_atomic_init_u32(&ProcGlobal->firstupdateXidStatusElem, INVALID_PGPROCNO);
/*
* Create and initialize all the PGPROC structures we'll need. There are
@@ -397,6 +398,14 @@ InitProcess(void)
MyProc->backendLatestXid = InvalidTransactionId;
pg_atomic_init_u32(&MyProc->nextClearXidElem, INVALID_PGPROCNO);
+ /* Initialize fields for group transaction status updation. */
+ MyProc->updateXidStatus = false;
+ MyProc->memberXid = InvalidTransactionId;
+ MyProc->memberXidstatus = TRANSACTION_STATUS_IN_PROGRESS;
+ MyProc->clogPage = -1;
+ MyProc->asyncCommitLsn = InvalidXLogRecPtr;
+ pg_atomic_init_u32(&MyProc->nextupdateXidStatusElem, INVALID_PGPROCNO);
+
/*
* Acquire ownership of the PGPROC's latch, so that we can use WaitLatch
* on it. That allows us to repoint the process latch, which so far
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 3d68017..2eddfe5 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -14,6 +14,7 @@
#ifndef _PROC_H_
#define _PROC_H_
+#include "access/clog.h"
#include "access/xlogdefs.h"
#include "lib/ilist.h"
#include "storage/latch.h"
@@ -146,6 +147,14 @@ struct PGPROC
pg_atomic_uint32 nextClearXidElem;
TransactionId backendLatestXid;
+ /* Support for group transaction status updation. */
+ bool updateXidStatus;
+ pg_atomic_uint32 nextupdateXidStatusElem;
+ TransactionId memberXid;
+ XidStatus memberXidstatus;
+ int clogPage;
+ XLogRecPtr asyncCommitLsn;
+
/* Per-backend LWLock. Protects fields below. */
LWLock *backendLock; /* protects the fields below */
@@ -209,6 +218,8 @@ typedef struct PROC_HDR
PGPROC *bgworkerFreeProcs;
/* First pgproc waiting for group XID clear */
pg_atomic_uint32 firstClearXidElem;
+ /* First pgproc waiting for group transaction status update */
+ pg_atomic_uint32 firstupdateXidStatusElem;
/* WALWriter process's latch */
Latch *walwriterLatch;
/* Checkpointer process's latch */
perf_write_clogcontrollock_data_v1.ods (application/vnd.oasis.opendocument.spreadsheet)
���i��.~���_QF����e����P��c����/�[<U�R7���-bQ���&��)��"���X`y�e��6�b���\��E�j32��1��_bQx '�y���*
������I����a��C��d����\�U�p�����w�@�n~��$cQ�b]Df�7a�n�m�?pme|;��
jf�*�{��x{#E|3�_�7� x�8�34�z����2f;�:T�aD����Y?��/��� (�I/�uvY�� �8�T}�r���s���QV)�Q(g��!����9��FI��+�������-��b���k����
�t����9��"
*�kX$�C�1>�:��V�O��m��f�I n������\�D(=X���_o��3{[��b�,#��xe!��'���(���PhYk�'��������[�U}�^<nK�C&��;o����$����X��.0f���j���V�e��.�y�A�o��������eq�s�!=g���iZxxk*E.d����a�0�9��i�������!����&��v��������K����)��Pq|�!���i��w�_!P��H:������h������(
����P>b���J(���=C�UM�������Q� ����� [pA��e7����6b�y�<|,a>�|��&�f�5��<�f���IS�iM�t����G��� ���r6�M��t���(��P<��� ������)7!�a���,p�-#Q��,jB;�d��u�� D�NaF����9v<����^����M�4Z[�jo��&�[�
��������1��Qls�,�$��&�mwxq��9`����~��%��M��sU�`�����9C�QLy���pJ��� �k|��� \)��F�������qC*F��B���U.Q�-_�M�%`8&h�������q���<��J�]�z>����B;���=.h
G�vZsy���G���/,�$�������������� �x:C���1������>u��|{2=wj�|O ��h5D���?H_�u�}Y��|�2��!!sa��b�����������=�=
]*%��[�����p�%=�p[�k ��kK�V����+9��so�f�p������@9�j��e�� 5 )%�z� 7R�{���P-�c���A����=q�rP1�i� �Xs���(�V���G���L�������y��q<�^����� k��*^��#i �2�KX��}%�����q������<