Speedup twophase transactions
Hello.
While working with cluster stuff (DTM, tsDTM) we noted that Postgres 2PC transactions are approximately two times slower than an ordinary commit on a workload with fast transactions - a few single-row updates and COMMIT or PREPARE/COMMIT. Perf top showed that a lot of time is spent in the kernel on fopen/fclose, so it's worth trying to reduce file operations for 2PC transactions.
Currently, 2PC in Postgres works as follows:
* on prepare, 2PC data (subxacts, commitrels, abortrels, invalmsgs) is saved to the xlog and to a file, but the file is not fsynced
* on commit, the backend reads the data from the file
* if a checkpoint occurs before the commit, the files are fsynced during the checkpoint
* in case of a crash, replay will move the data from the xlog to files
In this patch I've changed these procedures to the following:
* on prepare, the backend writes the data only to the xlog and stores a pointer to the start of the xlog record
* if the commit occurs before a checkpoint, the backend reads the data from the xlog by this pointer
* on checkpoint, the 2PC data is copied to files and fsynced
* if the commit happens after a checkpoint, the backend reads the files
* in case of a crash, replay will move the data from the xlog to files (as before the patch)
Most of these ideas were already mentioned in a 2009 thread by Michael Paquier /messages/by-id/c64c5f8b0908062031k3ff48428j824a9a46f28180ac@mail.gmail.com where he suggested storing 2PC data in shared memory.
At that time the patch was declined because no significant speedup was observed. Now I see a performance improvement of about 60% from my patch. Probably the overall tps in the old benchmark was lower, so it was harder to hit filesystem fopen/fclose limits.
The benchmark results are as follows (dual 6-core Xeon server):
Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Current master with 2PC: ~36 ktps
The benchmark was done with the following script:
\set naccounts 100000 * :scale
\setrandom from_aid 1 :naccounts
\setrandom to_aid 1 :naccounts
\setrandom delta 1 100
\set scale :scale+1
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = :from_aid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid;
PREPARE TRANSACTION ':client_id.:scale';
COMMIT PREPARED ':client_id.:scale';
Attachments:
2pc_xlog.diff (application/octet-stream, +78 -194)
On Wed, Dec 9, 2015 at 12:44 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
Currently, 2PC in Postgres works as follows:
* on prepare, 2PC data (subxacts, commitrels, abortrels, invalmsgs) is saved to the xlog and to a file, but the file is not fsynced
* on commit, the backend reads the data from the file
* if a checkpoint occurs before the commit, the files are fsynced during the checkpoint
* in case of a crash, replay will move the data from the xlog to files
In this patch I've changed these procedures to the following:
* on prepare, the backend writes the data only to the xlog and stores a pointer to the start of the xlog record
* if the commit occurs before a checkpoint, the backend reads the data from the xlog by this pointer
* on checkpoint, the 2PC data is copied to files and fsynced
* if the commit happens after a checkpoint, the backend reads the files
* in case of a crash, replay will move the data from the xlog to files (as before the patch)
That sounds like a very good plan to me.
Now results of benchmark are following (dual 6-core xeon server):
Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Current master with 2PC: ~36 ktps
I assume that last one should have been "Patched master with 2PC"?
Please add this to the January CommitFest.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thanks, Kevin.
I assume that last one should have been "Patched master with 2PC"?
Yes, this list should look like this:
Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Patched master with 2PC: ~36 ktps
And created CommitFest entry for this patch.
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Dec 10, 2015 at 3:44 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
Most of these ideas were already mentioned in a 2009 thread by Michael Paquier /messages/by-id/c64c5f8b0908062031k3ff48428j824a9a46f28180ac@mail.gmail.com where he suggested storing 2PC data in shared memory.
At that time the patch was declined because no significant speedup was observed. Now I see a performance improvement of about 60% from my patch. Probably the overall tps in the old benchmark was lower, so it was harder to hit filesystem fopen/fclose limits.
Glad to see this patch is given a second life 6 years later.
Now results of benchmark are following (dual 6-core xeon server):
Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Current master with 2PC: ~36 ktps
That's nice.
+ XLogRecPtr prepare_xlogptr; /* XLOG offset of prepare record start
+ * or NULL if twophase data moved to file
+ * after checkpoint.
+ */
This had better be InvalidXLogRecPtr if unused.
+ if (gxact->prepare_lsn)
+ {
+ XlogReadTwoPhaseData(gxact->prepare_xlogptr, &buf, NULL);
+ }
Perhaps you mean prepare_xlogptr here?
--
Michael
On Wed, Dec 9, 2015 at 10:44 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
Hello.
While working with cluster stuff (DTM, tsDTM) we noted that Postgres 2PC transactions are approximately two times slower than an ordinary commit on a workload with fast transactions - a few single-row updates and COMMIT or PREPARE/COMMIT. Perf top showed that a lot of time is spent in the kernel on fopen/fclose, so it's worth trying to reduce file operations for 2PC transactions.
I've tested this through my testing harness which forces the database
to go through endless runs of crash recovery and checks for
consistency, and so far it has survived perfectly.
...
Now results of benchmark are following (dual 6-core xeon server):
Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Current master with 2PC: ~36 ktps
Can you give the full command line? -j, -c, etc.
Benchmark done with following script:
\set naccounts 100000 * :scale
\setrandom from_aid 1 :naccounts
\setrandom to_aid 1 :naccounts
\setrandom delta 1 100
\set scale :scale+1
Why are you incrementing :scale ?
I very rapidly reach a point where most of the updates are against
tuples that don't exist, and then get integer overflow problems.
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = :from_aid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid;
PREPARE TRANSACTION ':client_id.:scale';
COMMIT PREPARED ':client_id.:scale';
Cheers,
Jeff
Michael, Jeff, thanks for reviewing and testing.
On 10 Dec 2015, at 02:16, Michael Paquier <michael.paquier@gmail.com> wrote:
This has better be InvalidXLogRecPtr if unused.
Yes, that’s better. Changed.
On 10 Dec 2015, at 02:16, Michael Paquier <michael.paquier@gmail.com> wrote:
+ if (gxact->prepare_lsn)
+ {
+     XlogReadTwoPhaseData(gxact->prepare_xlogptr, &buf, NULL);
+ }
Perhaps you mean prepare_xlogptr here?
Yes, my bad. But funnily enough, I made this error an even number of times: the code in CheckPointTwoPhase also uses prepare_lsn instead of xlogptr, so overall this was working well; that's why it survived my own tests and probably Jeff's tests.
I think that's bad variable naming anyway, because, for example, lsn in pg_xlogdump points to the start of the record, whereas here the start was called xlogptr and the end was called lsn.
So I changed both variables to prepare_start_lsn and prepare_end_lsn.
On 10 Dec 2015, at 09:48, Jeff Janes <jeff.janes@gmail.com> wrote:
I've tested this through my testing harness which forces the database
to go through endless runs of crash recovery and checks for
consistency, and so far it has survived perfectly.
Cool! I think the patch is most vulnerable to the following type of workload: prepare a transaction, do a lot of work in the database to force checkpoints (or even recovery cycles), and then commit it.
On 10 Dec 2015, at 09:48, Jeff Janes <jeff.janes@gmail.com> wrote:
Can you give the full command line? -j, -c, etc.
pgbench -h testhost -i && pgbench -h testhost -f 2pc.pgb -T 300 -P 1 -c 64 -j 16 -r
where 2pc.pgb is as in the previous message.
Also, all this applies to hosts with uniform memory. I tried running patched Postgres on a NUMA machine with 60 physical cores and the patch didn't change anything. Perf top shows that the main bottleneck there is access to gxact, but on an ordinary host with one or two CPUs that access isn't even among the top ten heaviest routines.
On 10 Dec 2015, at 09:48, Jeff Janes <jeff.janes@gmail.com> wrote:
Why are you incrementing :scale ?
That's the funny part: overall 2PC speed depends on how you name your prepared transactions. Concretely, I tried using random numbers for GIDs and it was slower than a constantly incrementing GID. That probably happens due to the linear search by GID in the gxact array on commit. So I used :scale just as a counter, because it is initialised on pgbench start, and a line like "\set scale :scale+1" works well. (Maybe there is a different way to do this in pgbench.)
I very rapidly reach a point where most of the updates are against
tuples that don't exist, and then get integer overflow problems.
Hmm, that's strange. Probably you set scale to a big value, so that 100000*:scale is bigger than an int4? But I thought pgbench would change the aid columns to bigint if scale is more than 20000.
Attachments:
2pc_xlog.v2.diff (application/octet-stream, +86 -205)
On 9 December 2015 at 18:44, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
In this patch I've changed these procedures to the following:
* on prepare, the backend writes the data only to the xlog and stores a pointer to the start of the xlog record
* if the commit occurs before a checkpoint, the backend reads the data from the xlog by this pointer
* on checkpoint, the 2PC data is copied to files and fsynced
* if the commit happens after a checkpoint, the backend reads the files
* in case of a crash, replay will move the data from the xlog to files (as before the patch)
This looks sound to me.
I think we could do better still, but this looks like the easiest 80% and
actually removes code.
The lack of substantial comments on the patch is a problem though - the
details above should go in the patch. I'll have a go at reworking this for
you, this time.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Simon Riggs wrote:
I think we could do better still, but this looks like the easiest 80% and actually removes code.
The lack of substantial comments on the patch is a problem though - the details above should go in the patch. I'll have a go at reworking this for you, this time.
Is someone submitting an updated patch here?
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi.
I've updated the patch and wrote a description of the things that happen to 2PC state data at the beginning of the file. I think this patch is now well documented, but if somebody points me to places that require a more detailed description, I'm ready to extend it.
Attachments:
2pc_xlog.diff (application/octet-stream, +101 -207)
On 9 January 2016 at 12:26, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
I've updated the patch and wrote a description of the things that happen to 2PC state data at the beginning of the file. I think this patch is now well documented, but if somebody points me to places that require a more detailed description, I'm ready to extend it.
Hmm, I was just preparing this for commit.
Please have a look at my mild edits and extended comments.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
2pc_optimize.v2.patch (application/octet-stream, +120 -214)
On 9 January 2016 at 12:26, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
I've updated the patch and wrote a description of the things that happen to 2PC state data at the beginning of the file. I think this patch is now well documented, but if somebody points me to places that require a more detailed description, I'm ready to extend it.
Your comments say
"In case of crash replay will move data from xlog to files, if
that hasn't happened before."
but I don't see that in code. Can you show me where that happens?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thanks a lot for your edits; the patch is much cleaner now.
Your comments say
"In case of crash replay will move data from xlog to files, if that hasn't happened before."
but I don't see that in code. Can you show me where that happens?
xact.c calls RecreateTwoPhaseFile in xact_redo() function (xact.c:5596)
On 09 Jan 2016, at 18:29, Simon Riggs <simon@2ndquadrant.com> wrote:
Hmm, I was just preparing this for commit.
Please have a look at my mild edits and extended comments.
One concern that came to my mind while reading the updated patch is about adding an extra bool field to the GlobalTransactionData structure. While this improves readability, it also increases the size of that structure, and that size has an impact on performance on systems with many cores (say 60-80). Probably one byte will not make a measurable difference, but I think it is a good idea to keep GXact as small as possible. As far as I understand, the same logic was behind the split of PGPROC into PGPROC+PGXACT in 9.2 (comment in proc.h:166).
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 9 January 2016 at 20:28, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
Thanks a lot for your edits, now that patch is much more cleaner.
Your comments say
"In case of crash replay will move data from xlog to files, if that hasn't happened before."
but I don't see that in code. Can you show me where that happens?
xact.c calls RecreateTwoPhaseFile in xact_redo() function (xact.c:5596)
So we've only optimized half the usage? We're still going to cause
replication delays.
Sounds like we should be fixing both.
We can either
1) Skip fsyncing the RecreateTwoPhaseFile and then fsync during restartpoints
2) Copy the contents to shmem and then write them at restartpoint as we do for checkpoint (preferred)
On 09 Jan 2016, at 18:29, Simon Riggs <simon@2ndquadrant.com> wrote:
Hmm, I was just preparing this for commit.
Please have a look at my mild edits and extended comments.
One concern that came to my mind while reading the updated patch is about adding an extra bool field to the GlobalTransactionData structure. While this improves readability, it also increases the size of that structure, and that size has an impact on performance on systems with many cores (say 60-80). Probably one byte will not make a measurable difference, but I think it is a good idea to keep GXact as small as possible. As far as I understand, the same logic was behind the split of PGPROC into PGPROC+PGXACT in 9.2 (comment in proc.h:166).
I think padding will negate the effects of the additional bool.
If we want to reduce the size of the array, GIDSIZE is currently 200, but XA says maximum 128 bytes.
Anybody know why that is set to 200?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 10 Jan 2016, at 12:15, Simon Riggs <simon@2ndquadrant.com> wrote:
So we've only optimized half the usage? We're still going to cause replication delays.
Yes, the replica will go through the old procedure of moving data to and from files.
We can either
1) Skip fsyncing the RecreateTwoPhaseFile and then fsync during restartpoints
From what I've seen with the old 2PC code, the main performance bottleneck was caused by frequent creation of files. So it's better to avoid files if possible.
2) Copy the contents to shmem and then write them at restartpoint as we do for checkpoint
(preferred)
The problem with shared memory is that we can't really predict the size of the state data, and anyway it isn't faster than reading the data from WAL (I tested that while preparing the original patch).
We can just apply the same logic on the replica as on the master: do not do anything special on prepare, and just read the data from WAL. If a checkpoint occurs during recovery/replay, the existing code will probably handle moving the data to files.
I will update the patch to address this issue.
I think padding will negate the effects of the additional bool.
If we want to reduce the size of the array GIDSIZE is currently 200, but XA says maximum 128 bytes.
Anybody know why that is set to 200?
Good catch about GID size.
If we talk about further optimisations, I see two ways:
1) Optimising access to GXACT. Here we can try to shrink it; introduce more granular locks, e.g. move GIDs out of GXACT and lock the GIDs array only once while checking a new GID's uniqueness; try to lock only part of GXACT by hash; etc.
2) Be optimistic about a subsequent COMMIT PREPARED. In a normal workload the next command after PREPARE will be COMMIT/ROLLBACK, so we can save the transaction context and release it only if the next command isn't our designated COMMIT/ROLLBACK. But that is a big amount of work and requires changes to the whole transaction pipeline in Postgres.
Anyway, I suggest we consider that as a separate task.
---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On 01/09/2016 10:29 AM, Simon Riggs wrote:
On 9 January 2016 at 12:26, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
I've updated the patch and wrote a description of the things that happen to 2PC state data at the beginning of the file. I think this patch is now well documented, but if somebody points me to places that require a more detailed description, I'm ready to extend it.
Hmm, I was just preparing this for commit.
Please have a look at my mild edits and extended comments.
I have done a run with the patch and it looks really great.
Attached is the TPS graph - with a 1pc run too - and the perf profile as
a flame graph (28C/56T w/ 256Gb mem, 2 x RAID10 SSD).
Maybe
+static void
+XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
to
+static void
+ReadTwoPhaseDataFromXlog(XLogRecPtr lsn, char **buf, int *len)
Best regards,
Jesper
On 11 January 2016 at 12:58, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 10 Jan 2016, at 12:15, Simon Riggs <simon@2ndquadrant.com> wrote:
So we've only optimized half the usage? We're still going to cause replication delays.
Yes, the replica will go through the old procedure of moving data to and from files.
We can either
1) Skip fsyncing the RecreateTwoPhaseFile and then fsync during restartpoints
From what I've seen with the old 2PC code, the main performance bottleneck was caused by frequent creation of files. So it's better to avoid files if possible.
2) Copy the contents to shmem and then write them at restartpoint as we do for checkpoint (preferred)
The problem with shared memory is that we can't really predict the size of the state data, and anyway it isn't faster than reading the data from WAL (I tested that while preparing the original patch).
We can just apply the same logic on the replica as on the master: do not do anything special on prepare, and just read the data from WAL. If a checkpoint occurs during recovery/replay, the existing code will probably handle moving the data to files.
I will update the patch to address this issue.
I'm looking to commit what we have now, so lets do that as a separate but
necessary patch please.
I think padding will negate the effects of the additional bool.
If we want to reduce the size of the array, GIDSIZE is currently 200, but XA says maximum 128 bytes.
Anybody know why that is set to 200?
Good catch about GID size.
I'll apply that as a separate patch also.
If we talk about further optimisations, I see two ways:
1) Optimising access to GXACT. Here we can try to shrink it; introduce more granular locks, e.g. move GIDs out of GXACT and lock the GIDs array only once while checking a new GID's uniqueness; try to lock only part of GXACT by hash; etc.
Have you measured lwlocking as a problem?
2) Be optimistic about a subsequent COMMIT PREPARED. In a normal workload the next command after PREPARE will be COMMIT/ROLLBACK, so we can save the transaction context and release it only if the next command isn't our designated COMMIT/ROLLBACK. But that is a big amount of work and requires changes to the whole transaction pipeline in Postgres.
We'd need some way to force session pools to use that correctly, but yes,
agreed.
Anyway I suggest that we should consider that as a separate task.
Definitely. From the numbers, I can see there is still considerable
performance gain to be had.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 01/10/2016 04:15 AM, Simon Riggs wrote:
One concern that came to my mind while reading the updated patch is about adding an extra bool field to the GlobalTransactionData structure. While this improves readability, it also increases the size of that structure, and that size has an impact on performance on systems with many cores (say 60-80). Probably one byte will not make a measurable difference, but I think it is a good idea to keep GXact as small as possible. As far as I understand, the same logic was behind the split of PGPROC into PGPROC+PGXACT in 9.2 (comment in proc.h:166).
I think padding will negate the effects of the additional bool.
If we want to reduce the size of the array, GIDSIZE is currently 200, but XA says maximum 128 bytes.
Anybody know why that is set to 200?
Even though GlobalTransactionId and BranchQualifier have a maximum of 64 bytes each, external clients may choose to encode the information and thereby need more space,
http://docs.oracle.com/javaee/7/api/javax/transaction/xa/Xid.html
which in this case adds up to a maximum of 189 characters.
Best regards,
Jesper
Hi,
On 2016-01-09 15:29:11 +0000, Simon Riggs wrote:
Hmm, I was just preparing this for commit.
Just read downthread that you want to commit this soon. Please hold off for a while; this doesn't really look ready to me. I don't have time for a real review right now, but I'll try to get to it asap.
+
+/*
+ * Reads 2PC data from xlog. During checkpoint this data will be moved to
+ * twophase files and ReadTwoPhaseFile should be used instead.
+ */
+static void
+XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
+{
+    XLogRecord *record;
+    XLogReaderState *xlogreader;
+    char       *errormsg;
+
+    xlogreader = XLogReaderAllocate(&logical_read_local_xlog_page, NULL);
logical_read_local_xlog_page isn't really suitable for use here. Besides the naming issue, at the very least it'll be wrong during WAL replay in the presence of promotions on an upstream node - it doesn't deal with timelines.
More generally, I'm doubtful that the approach of reading data from WAL
as proposed here is a very good idea. It seems better to "just" dump the
entire 2pc state into *one* file at checkpoint time.
Greetings,
Andres Freund
On 2016-01-11 20:03:18 +0100, Andres Freund wrote:
More generally, I'm doubtful that the approach of reading data from WAL
as proposed here is a very good idea. It seems better to "just" dump the
entire 2pc state into *one* file at checkpoint time.
Or better: After determining the checkpoint redo location, insert a WAL
record representing the entire 2PC state as of that moment. That way it
can easily restored during WAL replay and nothing special has to be done
on a standby. This way we'll need no extra wal flushes and fsyncs.
On 11 January 2016 at 19:03, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2016-01-09 15:29:11 +0000, Simon Riggs wrote:
Hmm, I was just preparing this for commit.
Just read downthread that you want to commit this soon. Please hold off for a while; this doesn't really look ready to me. I don't have time for a real review right now, but I'll try to get to it asap.
"A real review"? Huh.
+
+/*
+ * Reads 2PC data from xlog. During checkpoint this data will be moved to
+ * twophase files and ReadTwoPhaseFile should be used instead.
+ */
+static void
+XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
+{
+    XLogRecord *record;
+    XLogReaderState *xlogreader;
+    char       *errormsg;
+
+    xlogreader = XLogReaderAllocate(&logical_read_local_xlog_page, NULL);
logical_read_local_xlog_page isn't really suitable for use here. Besides the naming issue, at the very least it'll be wrong during WAL replay in the presence of promotions on an upstream node - it doesn't deal with timelines.
I'm aware of that, though note that it isn't being used in that way here.
More generally, I'm doubtful that the approach of reading data from WAL
as proposed here is a very good idea. It seems better to "just" dump the
entire 2pc state into *one* file at checkpoint time.
I think you misunderstand the proposed approach. This isn't just to do
with reading things back at checkpoint.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services