Speedup twophase transactions

Started by Stas Kelvich · over 10 years ago · 167 messages · hackers
#1 Stas Kelvich
s.kelvich@postgrespro.ru

Hello.

While working with cluster stuff (DTM, tsDTM) we noticed that Postgres 2PC transactions are approximately two times slower than an ordinary commit on a workload with fast transactions — a few single-row updates and COMMIT or PREPARE/COMMIT. Perf top showed that a lot of time is spent in the kernel on fopen/fclose, so it is worth trying to reduce the file operations done for 2PC transactions.

Now 2PC in Postgres does the following:
* on prepare, 2PC data (subxacts, commitrels, abortrels, invalmsgs) is saved to xlog and to a file, but the file is not fsynced
* on commit, the backend reads the data back from the file
* if a checkpoint occurs before the commit, the files are fsynced during the checkpoint
* in case of a crash, replay will move the data from xlog to files

In this patch I’ve changed these procedures to the following:
* on prepare, the backend writes the data only to xlog and stores a pointer to the start of the xlog record
* if the commit occurs before a checkpoint, the backend reads the data from xlog using this pointer
* on checkpoint, 2PC data is copied to files and fsynced
* if the commit happens after a checkpoint, the backend reads the files
* in case of a crash, replay will move the data from xlog to files (as it did before the patch)

Most of these ideas were already mentioned in a 2009 thread by Michael Paquier /messages/by-id/c64c5f8b0908062031k3ff48428j824a9a46f28180ac@mail.gmail.com where he suggested storing 2PC data in shared memory.
At that time the patch was declined because no significant speedup was observed. Now I see a performance improvement of about 60% from my patch. Probably the overall tps in the old benchmark was lower, so it was harder to hit the filesystem's fopen/fclose limits.

Now the benchmark results are as follows (dual 6-core Xeon server):

Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Current master with 2PC: ~36 ktps

The benchmark was done with the following script:

\set naccounts 100000 * :scale
\setrandom from_aid 1 :naccounts
\setrandom to_aid 1 :naccounts
\setrandom delta 1 100
\set scale :scale+1
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = :from_aid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid;
PREPARE TRANSACTION ':client_id.:scale';
COMMIT PREPARED ':client_id.:scale';

Attachments:

2pc_xlog.diff (application/octet-stream) +78 -194
#2 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Stas Kelvich (#1)
Re: Speedup twophase transactions

On Wed, Dec 9, 2015 at 12:44 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

Now 2PC in Postgres does the following:
* on prepare, 2PC data (subxacts, commitrels, abortrels, invalmsgs) is saved to xlog and to a file, but the file is not fsynced
* on commit, the backend reads the data back from the file
* if a checkpoint occurs before the commit, the files are fsynced during the checkpoint
* in case of a crash, replay will move the data from xlog to files

In this patch I’ve changed these procedures to the following:
* on prepare, the backend writes the data only to xlog and stores a pointer to the start of the xlog record
* if the commit occurs before a checkpoint, the backend reads the data from xlog using this pointer
* on checkpoint, 2PC data is copied to files and fsynced
* if the commit happens after a checkpoint, the backend reads the files
* in case of a crash, replay will move the data from xlog to files (as it did before the patch)

That sounds like a very good plan to me.

Now the benchmark results are as follows (dual 6-core Xeon server):

Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Current master with 2PC: ~36 ktps

I assume that last one should have been "Patched master with 2PC"?

Please add this to the January CommitFest.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Kevin Grittner (#2)
Re: Speedup twophase transactions

Thanks, Kevin.

I assume that last one should have been "Patched master with 2PC"?

Yes, this list should look like this:

Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Patched master with 2PC: ~36 ktps

And I’ve created a CommitFest entry for this patch.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#4 Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#1)
Re: Speedup twophase transactions

On Thu, Dec 10, 2015 at 3:44 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

Most of these ideas were already mentioned in a 2009 thread by Michael Paquier /messages/by-id/c64c5f8b0908062031k3ff48428j824a9a46f28180ac@mail.gmail.com where he suggested storing 2PC data in shared memory.
At that time the patch was declined because no significant speedup was observed. Now I see a performance improvement of about 60% from my patch. Probably the overall tps in the old benchmark was lower, so it was harder to hit the filesystem's fopen/fclose limits.

Glad to see this patch is given a second life 6 years later.

Now the benchmark results are as follows (dual 6-core Xeon server):

Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Current master with 2PC: ~36 ktps

That's nice.

+    XLogRecPtr    prepare_xlogptr;    /* XLOG offset of prepare record start
+                                     * or NULL if twophase data moved to file
+                                     * after checkpoint.
+                                     */
This had better be InvalidXLogRecPtr if unused.
+    if (gxact->prepare_lsn)
+    {
+        XlogReadTwoPhaseData(gxact->prepare_xlogptr, &buf, NULL);
+    }
Perhaps you mean prepare_xlogptr here?
-- 
Michael


#5 Jeff Janes
jeff.janes@gmail.com
In reply to: Stas Kelvich (#1)
Re: Speedup twophase transactions

On Wed, Dec 9, 2015 at 10:44 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

Hello.

While working with cluster stuff (DTM, tsDTM) we noticed that Postgres 2PC transactions are approximately two times slower than an ordinary commit on a workload with fast transactions — a few single-row updates and COMMIT or PREPARE/COMMIT. Perf top showed that a lot of time is spent in the kernel on fopen/fclose, so it is worth trying to reduce the file operations done for 2PC transactions.

I've tested this through my testing harness which forces the database
to go through endless runs of crash recovery and checks for
consistency, and so far it has survived perfectly.

...

Now the benchmark results are as follows (dual 6-core Xeon server):

Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Current master with 2PC: ~36 ktps

Can you give the full command line? -j, -c, etc.

Benchmark done with following script:

\set naccounts 100000 * :scale
\setrandom from_aid 1 :naccounts
\setrandom to_aid 1 :naccounts
\setrandom delta 1 100
\set scale :scale+1

Why are you incrementing :scale ?

I very rapidly reach a point where most of the updates are against
tuples that don't exist, and then get integer overflow problems.

BEGIN;
UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = :from_aid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid;
PREPARE TRANSACTION ':client_id.:scale';
COMMIT PREPARED ':client_id.:scale';

Cheers,

Jeff


#6 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#4)
Re: Speedup twophase transactions

Michael, Jeff thanks for reviewing and testing.

On 10 Dec 2015, at 02:16, Michael Paquier <michael.paquier@gmail.com> wrote:

This had better be InvalidXLogRecPtr if unused.

Yes, that’s better. Changed.

On 10 Dec 2015, at 02:16, Michael Paquier <michael.paquier@gmail.com> wrote:

+    if (gxact->prepare_lsn)
+    {
+        XlogReadTwoPhaseData(gxact->prepare_xlogptr, &buf, NULL);
+    }
Perhaps you mean prepare_xlogptr here?

Yes, my bad. But funnily enough, I made this error an even number of times: the code in CheckPointTwoPhase also uses prepare_lsn instead of prepare_xlogptr, so overall it was working well; that’s why it survived my own tests and probably Jeff’s tests.
I think that is just bad variable naming: for example, the lsn in pg_xlogdump points to the start of the record, but here the start is called xlogptr and the end is called lsn.
So I changed both variables to prepare_start_lsn and prepare_end_lsn.

On 10 Dec 2015, at 09:48, Jeff Janes <jeff.janes@gmail.com> wrote:
I've tested this through my testing harness which forces the database
to go through endless runs of crash recovery and checks for
consistency, and so far it has survived perfectly.

Cool! I think this patch is most vulnerable to the following type of workload: prepare a transaction, do a lot of work in the database to force checkpoints (or even recovery cycles), and then commit it.

On 10 Dec 2015, at 09:48, Jeff Janes <jeff.janes@gmail.com> wrote:

Can you give the full command line? -j, -c, etc.

pgbench -h testhost -i && pgbench -h testhost -f 2pc.pgb -T 300 -P 1 -c 64 -j 16 -r

where 2pc.pgb is as in the previous message.

Also, all of this applies to hosts with uniform memory. I tried running patched Postgres on a NUMA machine with 60 physical cores and the patch didn’t change anything. Perf top shows that the main bottleneck there is access to the gxact, but on an ordinary host with one or two CPUs that access is not even among the top ten heaviest routines.

On 10 Dec 2015, at 09:48, Jeff Janes <jeff.janes@gmail.com> wrote:

Why are you incrementing :scale ?

That’s the funny part: overall 2PC speed depends on how you name your prepared transactions. Specifically, I tried using random numbers for gids and it was slower than a constantly incrementing gid. That probably happens due to the linear search by gid in the gxact array on commit. So I used :scale just as a counter, because it is initialised on pgbench start and a line like “\set scale :scale+1” works well. (Maybe there is a different way to do this in pgbench.)

I very rapidly reach a point where most of the updates are against
tuples that don't exist, and then get integer overflow problems.

Hmm, that’s strange. Probably you set scale to a big value, so that 100000 * :scale is bigger than an int4? But I thought that pgbench would change the aid columns to bigint if scale is more than 20000.

Attachments:

2pc_xlog.v2.diff (application/octet-stream) +86 -205
#7 Simon Riggs
simon@2ndQuadrant.com
In reply to: Stas Kelvich (#1)
Re: Speedup twophase transactions

On 9 December 2015 at 18:44, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

In this patch I’ve changed these procedures to the following:
* on prepare, the backend writes the data only to xlog and stores a pointer to the start of the xlog record
* if the commit occurs before a checkpoint, the backend reads the data from xlog using this pointer
* on checkpoint, 2PC data is copied to files and fsynced
* if the commit happens after a checkpoint, the backend reads the files
* in case of a crash, replay will move the data from xlog to files (as it did before the patch)

This looks sound to me.

I think we could do better still, but this looks like the easiest 80% and
actually removes code.

The lack of substantial comments on the patch is a problem though - the
details above should go in the patch. I'll have a go at reworking this for
you, this time.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Simon Riggs (#7)
Re: Speedup twophase transactions

Simon Riggs wrote:

I think we could do better still, but this looks like the easiest 80% and
actually removes code.

The lack of substantial comments on the patch is a problem though - the
details above should go in the patch. I'll have a go at reworking this for
you, this time.

Is someone submitting an updated patch here?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#9 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Alvaro Herrera (#8)
Re: Speedup twophase transactions

Hi.

I’ve updated the patch and written a description of the things that happen with 2PC state data at the beginning of the file. I think this patch is now well documented, but if somebody points me to places that require a more detailed description, I’m ready to extend it.

Attachments:

2pc_xlog.diff (application/octet-stream) +101 -207
#10 Simon Riggs
simon@2ndQuadrant.com
In reply to: Stas Kelvich (#9)
Re: Speedup twophase transactions

On 9 January 2016 at 12:26, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

I’ve updated the patch and written a description of the things that happen with 2PC state data at the beginning of the file. I think this patch is now well documented, but if somebody points me to places that require a more detailed description, I’m ready to extend it.

Hmm, I was just preparing this for commit.

Please have a look at my mild edits and extended comments.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

2pc_optimize.v2.patch (application/octet-stream) +120 -214
#11 Simon Riggs
simon@2ndQuadrant.com
In reply to: Stas Kelvich (#9)
Re: Speedup twophase transactions

On 9 January 2016 at 12:26, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

I’ve updated the patch and written a description of the things that happen with 2PC state data at the beginning of the file. I think this patch is now well documented, but if somebody points me to places that require a more detailed description, I’m ready to extend it.

Your comments say

"In case of crash replay will move data from xlog to files, if
that hasn't happened before."

but I don't see that in code. Can you show me where that happens?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Simon Riggs (#10)
Re: Speedup twophase transactions

Thanks a lot for your edits; the patch is much cleaner now.

Your comments say

"In case of crash replay will move data from xlog to files, if that hasn't happened before."

but I don't see that in code. Can you show me where that happens?

xact.c calls RecreateTwoPhaseFile in the xact_redo() function (xact.c:5596)

On 09 Jan 2016, at 18:29, Simon Riggs <simon@2ndquadrant.com> wrote:

Hmm, I was just preparing this for commit.

Please have a look at my mild edits and extended comments.

One concern that came to my mind while reading the updated patch is about adding an extra bool field to the GlobalTransactionData structure. While this improves readability, it also increases the size of that structure, and that size has an impact on performance on systems with many cores (say 60-80). Probably one byte will not make a measurable difference, but I think it is a good idea to keep the GXact as small as possible. As far as I understand, the same logic was behind the split of PGPROC into PGPROC+PGXACT in 9.2 (comment in proc.h:166)

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#13 Simon Riggs
simon@2ndQuadrant.com
In reply to: Stas Kelvich (#12)
Re: Speedup twophase transactions

On 9 January 2016 at 20:28, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

Thanks a lot for your edits; the patch is much cleaner now.

Your comments say

"In case of crash replay will move data from xlog to files, if that

hasn't happened before."

but I don't see that in code. Can you show me where that happens?

xact.c calls RecreateTwoPhaseFile in xact_redo() function (xact.c:5596)

So we've only optimized half the usage? We're still going to cause
replication delays.

Sounds like we should be fixing both.

We can either

1) Skip fsyncing the RecreateTwoPhaseFile and then fsync during
restartpoints

2) Copy the contents to shmem and then write them at restartpoint as we do
for checkpoint
(preferred)

On 09 Jan 2016, at 18:29, Simon Riggs <simon@2ndquadrant.com> wrote:

Hmm, I was just preparing this for commit.

Please have a look at my mild edits and extended comments.

One concern that came to my mind while reading the updated patch is about adding an extra bool field to the GlobalTransactionData structure. While this improves readability, it also increases the size of that structure, and that size has an impact on performance on systems with many cores (say 60-80). Probably one byte will not make a measurable difference, but I think it is a good idea to keep the GXact as small as possible. As far as I understand, the same logic was behind the split of PGPROC into PGPROC+PGXACT in 9.2 (comment in proc.h:166)

I think padding will negate the effects of the additional bool.

If we want to reduce the size of the array: GIDSIZE is currently 200, but XA says the maximum is 128 bytes.

Anybody know why that is set to 200?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Simon Riggs (#13)
Re: Speedup twophase transactions

On 10 Jan 2016, at 12:15, Simon Riggs <simon@2ndquadrant.com> wrote:

So we've only optimized half the usage? We're still going to cause replication delays.

Yes, the replica will go through the old procedure of moving data to and from files.

We can either

1) Skip fsyncing the RecreateTwoPhaseFile and then fsync during restartpoints

From what I’ve seen with the old 2PC code, the main performance bottleneck was caused by frequent creation of files. So it is better to avoid files if possible.

2) Copy the contents to shmem and then write them at restartpoint as we do for checkpoint
(preferred)

The problem with shared memory is that we can’t really predict the size of the state data, and anyway it isn’t faster than reading the data from WAL
(I tested that while preparing the original patch).

We can just apply the same logic on the replica as on the master: do not do anything special on prepare, and just read the data from WAL.
If a checkpoint occurs during recovery/replay, the existing code will probably handle moving the data to files.

I will update patch to address this issue.

I think padding will negate the effects of the additional bool.

If we want to reduce the size of the array: GIDSIZE is currently 200, but XA says the maximum is 128 bytes.

Anybody know why that is set to 200?

Good catch about GID size.

If we talk about further optimisations I see two ways:

1) Optimising access to the GXact. Here we can try to shrink it; introduce more granular locks, e.g. move GIDs out of the GXact and lock the GID array only once while checking a new GID's uniqueness; try to lock only part of the GXact by hash; etc.

2) Being optimistic about the subsequent COMMIT PREPARED. In a normal workload the next command after PREPARE will be COMMIT/ROLLBACK, so we can save the transaction context and release it only if the next command isn’t our designated COMMIT/ROLLBACK. But that is a big amount of work and requires changes to the whole transaction pipeline in Postgres.

Anyway, I suggest that we consider that a separate task.

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15 Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Simon Riggs (#10)
Re: Speedup twophase transactions

On 01/09/2016 10:29 AM, Simon Riggs wrote:

On 9 January 2016 at 12:26, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

I’ve updated the patch and written a description of the things that happen with 2PC state data at the beginning of the file. I think this patch is now well documented, but if somebody points me to places that require a more detailed description, I’m ready to extend it.

Hmm, I was just preparing this for commit.

Please have a look at my mild edits and extended comments.

I have done a run with the patch and it looks really great.

Attached is the TPS graph - with a 1pc run too - and the perf profile as
a flame graph (28C/56T w/ 256Gb mem, 2 x RAID10 SSD).

Maybe

+static void
+XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)

to

+static void
+ReadTwoPhaseDataFromXlog(XLogRecPtr lsn, char **buf, int *len)

Best regards,
Jesper

Attachments:

xa.png (image/png)
xa.svg.gz (application/gzip) +1 -3
#16 Simon Riggs
simon@2ndQuadrant.com
In reply to: Stas Kelvich (#14)
Re: Speedup twophase transactions

On 11 January 2016 at 12:58, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

On 10 Jan 2016, at 12:15, Simon Riggs <simon@2ndquadrant.com> wrote:

So we've only optimized half the usage? We're still going to cause

replication delays.

Yes, the replica will go through the old procedure of moving data to and from files.

We can either

1) Skip fsyncing the RecreateTwoPhaseFile and then fsync during

restartpoints

From what I’ve seen with the old 2PC code, the main performance bottleneck was caused by frequent creation of files. So it is better to avoid files if possible.

2) Copy the contents to shmem and then write them at restartpoint as we

do for checkpoint

(preferred)

The problem with shared memory is that we can’t really predict the size of the state data, and anyway it isn’t faster than reading the data from WAL
(I tested that while preparing the original patch).

We can just apply the same logic on the replica as on the master: do not do anything special on prepare, and just read the data from WAL.
If a checkpoint occurs during recovery/replay, the existing code will probably handle moving the data to files.

I will update patch to address this issue.

I'm looking to commit what we have now, so lets do that as a separate but
necessary patch please.

I think padding will negate the effects of the additional bool.

If we want to reduce the size of the array: GIDSIZE is currently 200, but XA says the maximum is 128 bytes.

Anybody know why that is set to 200?

Good catch about GID size.

I'll apply that as a separate patch also.

If we talk about further optimisations I see two ways:

1) Optimising access to the GXact. Here we can try to shrink it; introduce more granular locks, e.g. move GIDs out of the GXact and lock the GID array only once while checking a new GID's uniqueness; try to lock only part of the GXact by hash; etc.

Have you measured lwlocking as a problem?

2) Being optimistic about the subsequent COMMIT PREPARED. In a normal workload the next command after PREPARE will be COMMIT/ROLLBACK, so we can save the transaction context and release it only if the next command isn’t our designated COMMIT/ROLLBACK. But that is a big amount of work and requires changes to the whole transaction pipeline in Postgres.

We'd need some way to force session pools to use that correctly, but yes,
agreed.

Anyway, I suggest that we consider that a separate task.

Definitely. From the numbers, I can see there is still considerable
performance gain to be had.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#17 Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Simon Riggs (#13)
Re: Speedup twophase transactions

On 01/10/2016 04:15 AM, Simon Riggs wrote:

One concern that came to my mind while reading the updated patch is about adding an extra bool field to the GlobalTransactionData structure. While this improves readability, it also increases the size of that structure, and that size has an impact on performance on systems with many cores (say 60-80). Probably one byte will not make a measurable difference, but I think it is a good idea to keep the GXact as small as possible. As far as I understand, the same logic was behind the split of PGPROC into PGPROC+PGXACT in 9.2 (comment in proc.h:166)

I think padding will negate the effects of the additional bool.

If we want to reduce the size of the array: GIDSIZE is currently 200, but XA says the maximum is 128 bytes.

Anybody know why that is set to 200?

Even though the GlobalTransactionId and BranchQualifier have a maximum of 64 bytes each, external clients may choose to encode the information, and thereby need more space:

https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/xa/RecoveredXid.java#L66-L70

http://docs.oracle.com/javaee/7/api/javax/transaction/xa/Xid.html

which in this case adds up to a maximum of 189 characters.

Best regards,
Jesper


#18 Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#10)
Re: Speedup twophase transactions

Hi,

On 2016-01-09 15:29:11 +0000, Simon Riggs wrote:

Hmm, I was just preparing this for commit.

Just read downthread that you want to commit this soon. Please hold off
for a while; this doesn't really look ready to me. I don't have time for
a real review right now, but I'll try to get to it asap.

+
+/*
+ * Reads 2PC data from xlog. During checkpoint this data will be moved to
+ * twophase files and ReadTwoPhaseFile should be used instead.
+ */
+static void
+XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
+{
+	XLogRecord *record;
+	XLogReaderState *xlogreader;
+	char	   *errormsg;
+
+	xlogreader = XLogReaderAllocate(&logical_read_local_xlog_page,
NULL);

logical_read_local_xlog_page isn't really suitable for the use
here. Besides the naming issue, at the very least it'll be wrong during
WAL replay in the presence of promotions on an upstream node - it
doesn't deal with timelines.

More generally, I'm doubtful that the approach of reading data from WAL
as proposed here is a very good idea. It seems better to "just" dump the
entire 2pc state into *one* file at checkpoint time.

Greetings,

Andres Freund


#19 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#18)
Re: Speedup twophase transactions

On 2016-01-11 20:03:18 +0100, Andres Freund wrote:

More generally, I'm doubtful that the approach of reading data from WAL
as proposed here is a very good idea. It seems better to "just" dump the
entire 2pc state into *one* file at checkpoint time.

Or better: After determining the checkpoint redo location, insert a WAL
record representing the entire 2PC state as of that moment. That way it
can easily restored during WAL replay and nothing special has to be done
on a standby. This way we'll need no extra wal flushes and fsyncs.


#20 Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#18)
Re: Speedup twophase transactions

On 11 January 2016 at 19:03, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2016-01-09 15:29:11 +0000, Simon Riggs wrote:

Hmm, I was just preparing this for commit.

Just read downthread that you want to commit this soon. Please hold off
for a while; this doesn't really look ready to me. I don't have time for
a real review right now, but I'll try to get to it asap.

"A real review"? Huh.

+
+/*
+ * Reads 2PC data from xlog. During checkpoint this data will be moved

to

+ * twophase files and ReadTwoPhaseFile should be used instead.
+ */
+static void
+XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
+{
+     XLogRecord *record;
+     XLogReaderState *xlogreader;
+     char       *errormsg;
+
+     xlogreader = XLogReaderAllocate(&logical_read_local_xlog_page,
NULL);

logical_read_local_xlog_page isn't really suitable for the use
here. Besides the naming issue, at the very least it'll be wrong during
WAL replay in the presence of promotions on an upstream node - it
doesn't deal with timelines.

I'm aware of that, though note that it isn't being used in that way here.

More generally, I'm doubtful that the approach of reading data from WAL
as proposed here is a very good idea. It seems better to "just" dump the
entire 2pc state into *one* file at checkpoint time.

I think you misunderstand the proposed approach. This isn't just to do
with reading things back at checkpoint.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#21 Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#19)
#22 Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#20)
#23 Simon Riggs
simon@2ndQuadrant.com
In reply to: Jesper Pedersen (#17)
#24 Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#22)
#25 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#16)
#26 Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#25)
#27 Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#26)
#28 Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#27)
#29 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Jesper Pedersen (#15)
#30 Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#28)
#31 Simon Riggs
simon@2ndQuadrant.com
In reply to: Stas Kelvich (#29)
#32 Michael Paquier
michael@paquier.xyz
In reply to: Simon Riggs (#25)
#33 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#32)
#34 Simon Riggs
simon@2ndQuadrant.com
In reply to: Michael Paquier (#32)
#35 Simon Riggs
simon@2ndQuadrant.com
In reply to: Michael Paquier (#33)
#36 Michael Paquier
michael@paquier.xyz
In reply to: Simon Riggs (#35)
#37 Michael Paquier
michael@paquier.xyz
In reply to: Simon Riggs (#34)
#38 Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Stas Kelvich (#29)
#39 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#37)
#40 Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#24)
#41 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Alvaro Herrera (#39)
#42 Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#40)
#43 Simon Riggs
simon@2ndQuadrant.com
In reply to: Michael Paquier (#37)
#44 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Simon Riggs (#42)
#45 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Stas Kelvich (#44)
#46 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Alvaro Herrera (#45)
#47 Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#46)
#48 Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Stas Kelvich (#44)
#49 David Steele
david@pgmasters.net
In reply to: Stas Kelvich (#46)
#50 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Jesper Pedersen (#48)
#51 Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Stas Kelvich (#50)
#52 Michael Paquier
michael@paquier.xyz
In reply to: Jesper Pedersen (#51)
#53 David Steele
david@pgmasters.net
In reply to: Michael Paquier (#52)
#54 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#52)
#55 Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Stas Kelvich (#54)
#56 Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#54)
#57 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#56)
#58 Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#57)
#59 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#58)
#60 Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#59)
#61 Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Michael Paquier (#60)
#62 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#60)
#63 Simon Riggs
simon@2ndQuadrant.com
In reply to: Michael Paquier (#60)
#64 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Simon Riggs (#63)
#65 Robert Haas
robertmhaas@gmail.com
In reply to: Jesper Pedersen (#61)
#66 Robert Haas
robertmhaas@gmail.com
In reply to: Stas Kelvich (#44)
#67 Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Robert Haas (#66)
#68 Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Robert Haas (#65)
#69Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Robert Haas (#66)
#70Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Stas Kelvich (#69)
#71Noah Misch
noah@leadboat.com
In reply to: Jesper Pedersen (#68)
#72Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Stas Kelvich (#62)
#73Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#72)
#74Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#73)
#75Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#74)
#76Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#75)
#77Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Stas Kelvich (#76)
#78Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jesper Pedersen (#77)
#79Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#78)
#80Simon Riggs
simon@2ndQuadrant.com
In reply to: Stas Kelvich (#76)
#81Michael Paquier
michael@paquier.xyz
In reply to: Simon Riggs (#80)
#82Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#81)
#83Simon Riggs
simon@2ndQuadrant.com
In reply to: Michael Paquier (#82)
#84Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#82)
#85Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#84)
#86Simon Riggs
simon@2ndQuadrant.com
In reply to: Stas Kelvich (#84)
#87Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#85)
#88Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#87)
#89Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#88)
#90Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Stas Kelvich (#89)
#91Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#90)
#92Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#91)
#93Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#92)
#94Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#93)
#95Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#94)
#96Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#95)
#97Bruce Momjian
bruce@momjian.us
In reply to: Stas Kelvich (#96)
#98Michael Paquier
michael@paquier.xyz
In reply to: Bruce Momjian (#97)
#99Bruce Momjian
bruce@momjian.us
In reply to: Michael Paquier (#98)
#100Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Bruce Momjian (#99)
#101David Fetter
david@fetter.org
In reply to: Bruce Momjian (#99)
#102Andres Freund
andres@anarazel.de
In reply to: David Fetter (#101)
#103Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#102)
#104Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#103)
#105Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#96)
#106Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#105)
#107Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#106)
#108Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#107)
#109Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#108)
#110Michael Paquier
michael@paquier.xyz
In reply to: Stas Kelvich (#109)
#111Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#110)
#112Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Stas Kelvich (#111)
#113Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#112)
#114Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#113)
#115Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#113)
#116Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Stas Kelvich (#115)
#117Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Nikhil Sontakke (#116)
#118Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#117)
#119Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#118)
#120Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#119)
#121Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#120)
#122Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#121)
#123Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Michael Paquier (#122)
#124Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#122)
#125Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#124)
#126Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#125)
#127Simon Riggs
simon@2ndQuadrant.com
In reply to: Nikhil Sontakke (#126)
#128Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Simon Riggs (#127)
#129Simon Riggs
simon@2ndQuadrant.com
In reply to: Nikhil Sontakke (#128)
#130Michael Paquier
michael@paquier.xyz
In reply to: Simon Riggs (#129)
#131Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#130)
#132Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#120)
#133Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#132)
#134Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#133)
#135Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#134)
#136Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#135)
#137Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#136)
#138Robert Haas
robertmhaas@gmail.com
In reply to: Nikhil Sontakke (#112)
#139Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Robert Haas (#138)
#140Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#137)
#141Robert Haas
robertmhaas@gmail.com
In reply to: Nikhil Sontakke (#139)
#142Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Nikhil Sontakke (#140)
#143Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#142)
#144Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#140)
#145Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#144)
#146David Steele
david@pgmasters.net
In reply to: Nikhil Sontakke (#145)
#147Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: David Steele (#146)
#148Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#147)
#149Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#148)
#150Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#149)
#151Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#150)
#152Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#151)
#153Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#152)
#154Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#153)
#155Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#152)
#156Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#155)
#157Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#156)
#158Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#157)
#159Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#158)
#160Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#159)
#161Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#160)
#162Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#161)
#163Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Nikhil Sontakke (#162)
#164Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#162)
#165Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Michael Paquier (#164)
#166Michael Paquier
michael@paquier.xyz
In reply to: Nikhil Sontakke (#165)
#167Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#166)