[PATCH] 2PC state files on shared memory
Hi all,
Based on an idea from Heikki Linnakangas, here is a patch that improves 2PC
by keeping the state files of prepared transactions in shared memory instead
of writing them to disk.
The XLOG flush cannot be avoided, but reducing the amount of data sent to
disk speeds up the 2PC process.
During a checkpoint, only the state files of transactions that are prepared
but not yet committed are flushed from shared memory to disk.
The shared memory for state files is sized with an additional parameter in
postgresql.conf called state_file_max_space. It is the space reserved for one
state file; the total is state_file_max_space * max_prepared_transactions.
Of course, if there are too many transactions and not enough room in shared
memory, state files are written to disk as before.
By default the allocated space is 0, since max_prepared_transactions defaults
to zero in 8.4.
For further results, please refer to the wiki page I wrote about this
2PC improvement:
http://wiki.postgresql.org/wiki/2PC_improvement:_state_files_in_shared_memory
This page explains the simulation method for the patch analysis and gathers
the main results.
Here are some of the performance results obtained by testing the code on a
disk array with a battery-backed cache and 8 disks in a RAID0 configuration.
The four tables below cover pgbench scale factors 1 and 100, with results
either normalized or not.
Normalized results are unitless; pure (raw) results are in TX/s.
Tests were run via pgbench using transactions whose state file sizes are
600B and 712B.
As can be seen, the patch improves transaction throughput by up to 15-18%,
which is not negligible.
1) Case scale factor 1, normalized results
Use of 2PC: Shmem = state file on Shmem, Disk = state file on Disk.
Conn and Trans are the pgbench settings.
              600 B state file                712 B state file
Conn  Trans   Shmem         Disk     No 2PC   Shmem         Disk     No 2PC
              (Tps1-2)      (Tps2-2) (Tps3-2) (Tps1-2)      (Tps2-2) (Tps3-2)
2     10000   0.078663793   0        1        0.079653      0        1
5     10000   0.105263158   0        1        0.08438061    0        1
10    10000   0.096105528   0        1        0.07166124    0        1
25    10000   0.106321839   0        1        0.12846154    0        1
35    10000   0.138996139   0        1        0.12106136    0        1
50    10000   0.130278527   0        1        0.14072693    0        1
60    10000   0.133937563   0        1        0.1517094     0        1
70    10000   0.17218543    0        1        0.14913295    0        1
80    10000   0.1775        0        1        0.17786561    0        1
90    10000   0.179806362   0        1        0.15232722    0        1
100   10000   0.182242991   0        1        0.15264798    0        1
2) Case scale factor 1, pure TX/s results
Use of 2PC: Shmem = state file on Shmem, Disk = state file on Disk.
Conn and Trans are the pgbench settings.
              600 B state file                712 B state file
Conn  Trans   Shmem         Disk     No 2PC   Shmem         Disk     No 2PC
              (Tps1-2)      (Tps2-2) (Tps3-2) (Tps1-2)      (Tps2-2) (Tps3-2)
2     10000   1163          1017     2873     1134          1033     2301
5     10000   1263          1077     2844     1213          1072     2743
10    10000   1265          1112     2704     1175          1065     2600
25    10000   1233          1085     2477     1205          1038     2338
35    10000   1220          1040     2335     1169          1023     2229
50    10000   1190          1045     2158     1143          992      2065
60    10000   1151          1018     2011     1111          969      1905
70    10000   1127          971      1877     1067          938      1803
80    10000   1091          949      1749     1021          886      1645
90    10000   1050          920      1643     939           831      1540
100   10000   1012          895      1537     889           791      1433
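The message does not say how the normalized figures were computed. The numbers in tables 1 and 2 are consistent with rescaling each row so that the on-disk result maps to 0 and the no-2PC result maps to 1. The sketch below shows that inferred formula; the function name is hypothetical and nothing here comes from the patch.

```c
#include <assert.h>

/*
 * Rescale a throughput so that the "state file on disk" result maps to 0
 * and the "no 2PC" result maps to 1.  The formula is inferred from the
 * tables in this message; it is an assumption, not taken from the patch.
 */
static double
normalize_tps(double shmem_tps, double disk_tps, double no2pc_tps)
{
    /* the rescaling is only defined when the two reference points differ */
    assert(no2pc_tps > disk_tps);
    return (shmem_tps - disk_tps) / (no2pc_tps - disk_tps);
}
```

For the first row of table 2 (600B, 2 connections), normalize_tps(1163, 1017, 2873) gives about 0.0787, which matches the first row of table 1.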
3) Case scale factor 100, normalized results
Use of 2PC: Shmem = state file on Shmem, Disk = state file on Disk.
Conn and Trans are the pgbench settings.
              600 B state file                712 B state file
Conn  Trans   Shmem         Disk     No 2PC   Shmem         Disk     No 2PC
              (Tps1-2)      (Tps2-2) (Tps3-2) (Tps1-2)      (Tps2-2) (Tps3-2)
2     10000   0.031791908   0        1        0.00426621    0        1
5     10000   0.018481848   0        1        0.03858731    0        1
10    10000   0.049115914   0        1        0.07661017    0        1
25    10000   0.06954612    0        1        0.06117247    0        1
35    10000   0.077677841   0        1        0.05846422    0        1
50    10000   0.059885932   0        1        0.08961303    0        1
60    10000   0.071888412   0        1        0.06997743    0        1
70    10000   0.094007051   0        1        0.03571429    0        1
80    10000   0.078838174   0        1        0.05635838    0        1
4) Case scale factor 100, pure results
Use of 2PC: Shmem = state file on Shmem, Disk = state file on Disk.
Conn and Trans are the pgbench settings.
              600 B state file                712 B state file
Conn  Trans   Shmem         Disk     No 2PC   Shmem         Disk     No 2PC
              (Tps1-2)      (Tps2-2) (Tps3-2) (Tps1-2)      (Tps2-2) (Tps3-2)
2     10000   1113          1058     2788     1147          1142     2314
5     10000   1240          1212     2727     1184          1125     2654
10    10000   1225          1150     2677     1203          1090     2565
25    10000   1218          1123     2489     1176          1104     2281
35    10000   1210          1115     2338     1151          1084     2230
50    10000   1153          1090     2142     1127          1039     2021
60    10000   1126          1059     1991     1083          1021     1907
70    10000   1087          1007     1858     1014          986      1770
80    10000   1046          989      1712     983           944      1636
Regards,
--
Michael Paquier
NTT OSSC
Attachment: postgresql-8.4.0-2PCshmem.patch
--- postgresql-8.4.0.orig/src/backend/access/transam/twophase.c 2009-06-26 04:05:52.000000000 +0900
+++ postgresql-8.4.0/src/backend/access/transam/twophase.c 2009-08-06 10:03:40.000000000 +0900
@@ -69,6 +69,7 @@
/* GUC variable, can't be changed after startup */
int max_prepared_xacts = 0;
+int state_file_max_space = 0;
/*
* This struct describes one global transaction that is in prepared state
@@ -115,6 +116,10 @@ typedef struct GlobalTransactionData
TransactionId locking_xid; /* top-level XID of backend working on xact */
bool valid; /* TRUE if fully prepared */
char gid[GIDSIZE]; /* The GID assigned to the prepared xact */
+ int stateFileLength; /* length of a state file */
+ bool in_cache; /* To determine if a state file is on shared mem or not*/
+ char *cache_entry; /* block entry */
+ int BlockId; /* identifier to find the block where state file is */
} GlobalTransactionData;
/*
@@ -138,6 +143,8 @@ typedef struct TwoPhaseStateData
static TwoPhaseStateData *TwoPhaseState;
+/* Static variable linked to state files on shared memory */
+static char *StateFileCacheFreeList = NULL;
static void RecordTransactionCommitPrepared(TransactionId xid,
int nchildren,
@@ -172,6 +179,14 @@ TwoPhaseShmemSize(void)
return size;
}
+Size
+StateFileShmemSize(void)
+{
+ Size StateFileSize;
+ StateFileSize = mul_size(max_prepared_xacts, state_file_max_space);
+ return StateFileSize;
+}
+
void
TwoPhaseShmemInit(void)
{
@@ -206,6 +221,18 @@ TwoPhaseShmemInit(void)
Assert(found);
}
+void
+StateFileShmemInit(void)
+{
+ if (state_file_max_space != 0)
+ {
+ StateFileCacheFreeList = (char *) ShmemAlloc(state_file_max_space*max_prepared_xacts);
+ }
+ else
+ {
+ StateFileCacheFreeList = NULL;
+ }
+}
/*
* MarkAsPreparing
@@ -865,7 +892,7 @@ EndPrepare(GlobalTransaction gxact)
XLogRecData *record;
pg_crc32 statefile_crc;
pg_crc32 bogus_crc;
- int fd;
+ int fd = 0;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -892,58 +919,133 @@ EndPrepare(GlobalTransaction gxact)
* the FD gets closed in any error exit path. Once we get into the
* critical section, though, it doesn't matter since any failure causes
* PANIC anyway.
+ *
+ * If the total length of records is higher than a block on shared mem,
+ * state file is written on disk instead
*/
- TwoPhaseFilePath(path, xid);
- fd = BasicOpenFile(path,
- O_CREAT | O_EXCL | O_WRONLY | PG_BINARY,
- S_IRUSR | S_IWUSR);
- if (fd < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not create two-phase state file \"%s\": %m",
- path)));
+ gxact->stateFileLength = records.total_len;
+ if (hdr->total_len < state_file_max_space
+ && StateFileCacheFreeList != NULL)
+ {
+ bool *BlockId=(bool *)palloc(max_prepared_xacts*sizeof(bool));
+ int i,count = 1;
+ bool found = false;
+ /*Initialize BlockId */
+ for (i=0;i<max_prepared_xacts;i++)
+ {
+ BlockId[i]=false;
+ }
+ LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+ /*check what are the blocks taken by prepared transactions
+ * update lively BlockId;
+ */
+ for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ {
+ GlobalTransaction gxactloc = TwoPhaseState->prepXacts[i];
+ if (gxactloc->BlockId > 0)
+ {
+ BlockId[gxactloc->BlockId-1]=true;
+ }
+ }
+ /* It is better to keep the lock a longer time
+ * as another transaction could take the same block
+ */
+ /* find the 1st block in the list not taken */
+ while(!found)
+ {
+ if(BlockId[count-1]==true
+ && count < max_prepared_xacts)/* block already taken */
+ {
+ count++;
+ }
+ else
+ {
+ found = true;
+ gxact->BlockId=count;
+ }
+ }
+ if (gxact->BlockId > 0)
+ {
+ gxact->cache_entry = StateFileCacheFreeList + state_file_max_space*(gxact->BlockId-1);
+ }
- /* Write data to file, and calculate CRC as we pass over it */
- INIT_CRC32(statefile_crc);
+ LWLockRelease(TwoPhaseStateLock);
- for (record = records.head; record != NULL; record = record->next)
+ /* allocation of a memory block
+ * The head block of the free list is taken and used for the TX in process
+ */
+ gxact->in_cache = true;
+ }
+ else
{
- COMP_CRC32(statefile_crc, record->data, record->len);
- if ((write(fd, record->data, record->len)) != record->len)
- {
- close(fd);
+ gxact->in_cache = false;
+ gxact->BlockId = 0;
+ TwoPhaseFilePath(path, xid);
+
+ fd = BasicOpenFile(path,
+ O_CREAT | O_EXCL | O_WRONLY | PG_BINARY,
+ S_IRUSR | S_IWUSR);
+ if (fd < 0)
ereport(ERROR,
(errcode_for_file_access(),
- errmsg("could not write two-phase state file: %m")));
- }
+ errmsg("could not create two-phase state file \"%s\": %m",
+ path)));
}
- FIN_CRC32(statefile_crc);
- /*
- * Write a deliberately bogus CRC to the state file; this is just paranoia
- * to catch the case where four more bytes will run us out of disk space.
- */
- bogus_crc = ~statefile_crc;
-
- if ((write(fd, &bogus_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
+ if (gxact->in_cache == true)
{
- close(fd);
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not write two-phase state file: %m")));
+ int locTotalLen = 0;
+ for (record = records.head; record != NULL; record = record->next)
+ {
+ memcpy(gxact->cache_entry+locTotalLen, record->data, record->len);
+ locTotalLen += record->len;
+ }
+ Assert(locTotalLen == gxact->stateFileLength);
}
-
- /* Back up to prepare for rewriting the CRC */
- if (lseek(fd, -((off_t) sizeof(pg_crc32)), SEEK_CUR) < 0)
+ else
{
- close(fd);
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek in two-phase state file: %m")));
- }
+ /* Write data to file, and calculate CRC as we pass over it */
+ INIT_CRC32(statefile_crc);
+ for (record = records.head; record != NULL; record = record->next)
+ {
+ COMP_CRC32(statefile_crc, record->data, record->len);
+ if ((write(fd, record->data, record->len)) != record->len)
+ {
+ close(fd);
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write two-phase state file: %m")));
+ }
+ }
+
+ FIN_CRC32(statefile_crc);
+
+ /*
+ * Write a deliberately bogus CRC to the state file; this is just paranoia
+ * to catch the case where four more bytes will run us out of disk space.
+ */
+ bogus_crc = ~statefile_crc;
+
+ if ((write(fd, &bogus_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
+ {
+ close(fd);
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write two-phase state file: %m")));
+ }
+
+ /* Back up to prepare for rewriting the CRC */
+ if (lseek(fd, -((off_t) sizeof(pg_crc32)), SEEK_CUR) < 0)
+ {
+ close(fd);
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not seek in two-phase state file: %m")));
+ }
+ }
/*
* The state file isn't valid yet, because we haven't written the correct
* CRC yet. Before we do that, insert entry in WAL and flush it to disk.
@@ -974,21 +1076,23 @@ EndPrepare(GlobalTransaction gxact)
XLogFlush(gxact->prepare_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
-
- /* write correct CRC and close file */
- if ((write(fd, &statefile_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
+ if (!gxact->in_cache)
{
- close(fd);
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not write two-phase state file: %m")));
- }
- if (close(fd) != 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not close two-phase state file: %m")));
+ /* write correct CRC and close file */
+ if ((write(fd, &statefile_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
+ {
+ close(fd);
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write two-phase state file: %m")));
+ }
+ if (close(fd) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not close two-phase state file: %m")));
+ }
/*
* Mark the prepared transaction as valid. As soon as xact.c marks MyProc
* as not running our XID (which it will do immediately after this
@@ -1165,7 +1269,16 @@ FinishPreparedTransaction(const char *gi
/*
* Read and validate the state file
*/
- buf = ReadTwoPhaseFile(xid);
+ if (gxact->in_cache)
+ {
+ /* read file in shmem */
+ buf = (char *) palloc(state_file_max_space);
+ memcpy(buf, gxact->cache_entry, state_file_max_space);
+ }
+ else
+ {
+ buf = ReadTwoPhaseFile(xid);
+ }
if (buf == NULL)
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
@@ -1258,7 +1371,30 @@ FinishPreparedTransaction(const char *gi
/*
* And now we can clean up our mess.
*/
- RemoveTwoPhaseFile(xid, true);
+ if (gxact->in_cache)
+ {
+ int i;
+ /* Clean up the zone where the last state file has been written
+ * by replacing it with zeros
+ */
+ for (i=0;i<gxact->stateFileLength;i++)
+ {
+ *(StateFileCacheFreeList+(gxact->BlockId-1)*state_file_max_space+i)='\0';
+ }
+
+ /* Remove statefile in shared memory by deleting the BlockId */
+ LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+ gxact->BlockId = 0;
+ LWLockRelease(TwoPhaseStateLock);
+ gxact->in_cache = false;
+ gxact->cache_entry = NULL;
+ gxact->stateFileLength = 0;
+ }
+ else
+ {
+ RemoveTwoPhaseFile(xid, true);
+ }
RemoveGXact(gxact);
@@ -1373,6 +1509,77 @@ RecreateTwoPhaseFile(TransactionId xid,
errmsg("could not close two-phase state file: %m")));
}
+/*
+ * Writes the state file for given xact from shared memory cache to disk
+ */
+static void
+FlushTwoPhaseStateFile(TransactionId xid)
+{
+ char *buffer = palloc(state_file_max_space);
+ int len;
+ int i;
+ bool found = false;
+
+ LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+ /* find the TX corresponding to the XID, and copy the state file contents
+ * from shared memory cache to local buffer
+ */
+ for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ {
+ GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+ if (gxact->proc.xid == xid)
+ {
+ /* If not in cache, nothing to do; leave the loop so the lock is released */
+ if (!gxact->in_cache)
+ break;
+
+ len = gxact->stateFileLength;
+ memcpy(buffer, gxact->cache_entry, len);
+ found = true;
+ break;
+ }
+ }
+ LWLockRelease(TwoPhaseStateLock);
+
+ if (!found)
+ return;
+
+ RecreateTwoPhaseFile(xid, buffer, len);
+
+ /* The data is now both in cache, and on disk. Remove it from cache */
+
+ LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+ found = false;
+ for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ {
+ GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+ if (gxact->proc.xid == xid)
+ {
+ /* clear the whole cache block */
+ memset(StateFileCacheFreeList + (gxact->BlockId - 1) * state_file_max_space,
+ 0,
+ state_file_max_space);
+ Assert(gxact->in_cache);
+ gxact->in_cache = false;
+ gxact->BlockId = 0;
+ /* add a state file block in Free list */
+ gxact->cache_entry = NULL;
+
+ found = true;
+ break;
+ }
+ }
+ LWLockRelease(TwoPhaseStateLock);
+
+ if (!found)
+ {
+ /* The transaction was finished while we were writing it to disk */
+ RemoveTwoPhaseFile(xid, true);
+ }
+}
+
/*
* CheckPointTwoPhase -- handle 2PC component of checkpointing.
*
@@ -1394,6 +1601,7 @@ void
CheckPointTwoPhase(XLogRecPtr redo_horizon)
{
TransactionId *xids;
+ bool *in_cache;
int nxids;
char path[MAXPGPATH];
int i;
@@ -1415,6 +1623,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horiz
TRACE_POSTGRESQL_TWOPHASE_CHECKPOINT_START();
xids = (TransactionId *) palloc(max_prepared_xacts * sizeof(TransactionId));
+ in_cache = (bool *) palloc(max_prepared_xacts * sizeof(bool));
nxids = 0;
LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
@@ -1425,7 +1634,11 @@ CheckPointTwoPhase(XLogRecPtr redo_horiz
if (gxact->valid &&
XLByteLE(gxact->prepare_lsn, redo_horizon))
- xids[nxids++] = gxact->proc.xid;
+ {
+ xids[nxids] = gxact->proc.xid;
+ in_cache[nxids] = gxact->in_cache;
+ nxids++;
+ }
}
LWLockRelease(TwoPhaseStateLock);
@@ -1435,6 +1648,9 @@ CheckPointTwoPhase(XLogRecPtr redo_horiz
TransactionId xid = xids[i];
int fd;
+ if (in_cache[i])
+ FlushTwoPhaseStateFile(xid);
+
TwoPhaseFilePath(path, xid);
fd = BasicOpenFile(path, O_RDWR | PG_BINARY, 0);
--- postgresql-8.4.0.orig/src/backend/utils/misc/guc.c 2009-06-11 23:49:06.000000000 +0900
+++ postgresql-8.4.0/src/backend/utils/misc/guc.c 2009-08-06 10:03:40.000000000 +0900
@@ -1506,6 +1506,15 @@ static struct config_int ConfigureNamesI
&max_prepared_xacts,
0, 0, INT_MAX / 4, NULL, NULL
},
+ {
+ {"state_file_max_space", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the maximum space usable by a state file in shared memory."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &state_file_max_space,
+ 0, 0, INT_MAX, NULL, NULL
+ },
#ifdef LOCK_DEBUG
{
--- postgresql-8.4.0.orig/src/include/access/twophase.h 2009-01-02 02:23:56.000000000 +0900
+++ postgresql-8.4.0/src/include/access/twophase.h 2009-08-06 10:03:41.000000000 +0900
@@ -26,10 +26,14 @@ typedef struct GlobalTransactionData *Gl
/* GUC variable */
extern int max_prepared_xacts;
+extern int state_file_max_space;
extern Size TwoPhaseShmemSize(void);
extern void TwoPhaseShmemInit(void);
+extern Size StateFileShmemSize(void);
+extern void StateFileShmemInit(void);
+
extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid);
extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
--- postgresql-8.4.0.orig/src/backend/storage/ipc/ipci.c 2009-05-06 04:59:00.000000000 +0900
+++ postgresql-8.4.0/src/backend/storage/ipc/ipci.c 2009-08-06 10:03:40.000000000 +0900
@@ -106,6 +106,7 @@ CreateSharedMemoryAndSemaphores(bool mak
size = add_size(size, CLOGShmemSize());
size = add_size(size, SUBTRANSShmemSize());
size = add_size(size, TwoPhaseShmemSize());
+ size = add_size(size, StateFileShmemSize());
size = add_size(size, MultiXactShmemSize());
size = add_size(size, LWLockShmemSize());
size = add_size(size, ProcArrayShmemSize());
@@ -183,6 +184,7 @@ CreateSharedMemoryAndSemaphores(bool mak
CLOGShmemInit();
SUBTRANSShmemInit();
TwoPhaseShmemInit();
+ StateFileShmemInit();
MultiXactShmemInit();
InitBufferPool();
--- postgresql-8.4.0.orig/src/backend/utils/misc/postgresql.conf.sample 2009-04-23 09:23:45.000000000 +0900
+++ postgresql-8.4.0/src/backend/utils/misc/postgresql.conf.sample 2009-08-06 10:03:40.000000000 +0900
@@ -112,6 +112,10 @@
# per transaction slot, plus lock space (see max_locks_per_transaction).
# It is not advisable to set max_prepared_transactions nonzero unless you
# actively intend to use prepared transactions.
+
+#state_file_max_space = 0 # maximum space reserved for one state file in shared memory
+ # 0 means all state files are written to disk
+ # default is 0; a typical value is 768
#work_mem = 1MB # min 64kB
#maintenance_work_mem = 16MB # min 1MB
#max_stack_depth = 2MB # min 100kB
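The patch gives each prepared transaction a 1-based block number inside one shared arena, with 0 meaning "not in cache". A minimal sketch of that slot arithmetic follows, assuming a plain array of "taken" flags. The helper names are hypothetical and, unlike the loop in the patch, the search reports explicitly when no block is free so that the caller can fall back to disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the 1-based id of the first free block, or 0 if every block is
 * taken.  taken[i] is true when block i + 1 already holds a state file.
 * Hypothetical helper, not part of the patch.
 */
static int
find_free_block(const bool *taken, int nblocks)
{
    int i;

    for (i = 0; i < nblocks; i++)
    {
        if (!taken[i])
            return i + 1;
    }
    return 0;
}

/* Address of a block inside the arena; block ids are 1-based. */
static char *
block_address(char *arena, int block_id, size_t block_size)
{
    assert(block_id > 0);
    return arena + (size_t) (block_id - 1) * block_size;
}
```

In the real backend the scan and the assignment of the block would have to happen under a lock that excludes other backends doing the same thing; the sketch leaves locking out.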
Michael Paquier <michael.paquier@gmail.com> writes:
> Based on an idea of Heikki Linnakangas, here is a patch in order to improve
> 2PC by sending the state files of prepared transactions to shared memory
> instead of disk.
I don't understand how this can possibly work. The entire point of
2PC is that the state file is guaranteed to be on disk so it will
survive a crash. What good is it if it's in shared memory?
Quite aside from that, the fixed size of shared memory makes this seem
pretty impractical.
regards, tom lane
Tom Lane wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Based on an idea of Heikki Linnakangas, here is a patch in order to improve
>> 2PC by sending the state files of prepared transactions to shared memory
>> instead of disk.
> I don't understand how this can possibly work. The entire point of
> 2PC is that the state file is guaranteed to be on disk so it will
> survive a crash. What good is it if it's in shared memory?
The state files are not fsync'd when they're written, but a copy is
written to WAL so that it can be replayed on crash. With this patch,
it's still written to WAL, but the write to a file on disk is skipped,
and it's stored in shared memory instead.
> Quite aside from that, the fixed size of shared memory makes this seem
> pretty impractical.
Most state files are small. If one doesn't fit in the area reserved for
this, it's written to disk as usual. It's just an optimization.
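The "written to disk as usual" fallback amounts to a single size test at PREPARE time. The sketch below mirrors the condition used in the patch (strictly smaller than the per-file limit, and a limit of 0 disables the cache); the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum StateDest
{
    STATE_IN_SHMEM,             /* keep the state file in the shared arena */
    STATE_ON_DISK               /* write it to a file, as before the patch */
} StateDest;

/*
 * Decide where a state file of file_len bytes goes, given the per-file
 * limit max_space.  A limit of 0 means the cache is disabled.
 */
static StateDest
choose_state_dest(size_t file_len, size_t max_space)
{
    /* a state file always carries at least a header, so it is never empty */
    assert(file_len > 0);

    if (max_space != 0 && file_len < max_space)
        return STATE_IN_SHMEM;
    return STATE_ON_DISK;
}
```

With a limit of 768, the 600B and 712B files from the benchmark stay in shared memory, while anything of 768 bytes or more goes to disk.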
I'm a bit disappointed by the performance gains. I would've expected
more, given a decent battery-backed-up cache to buffer the WAL fsyncs.
But it looks like they're still causing the most overhead, even with a
battery-backed-up cache.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sat, Aug 8, 2009 at 9:31 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Tom Lane wrote:
>> Michael Paquier <michael.paquier@gmail.com> writes:
>>> Based on an idea of Heikki Linnakangas, here is a patch in order to improve
>>> 2PC by sending the state files of prepared transactions to shared memory
>>> instead of disk.
>> I don't understand how this can possibly work. The entire point of
>> 2PC is that the state file is guaranteed to be on disk so it will
>> survive a crash. What good is it if it's in shared memory?
> The state files are not fsync'd when they're written, but a copy is
> written to WAL so that it can be replayed on crash. With this patch,
> it's still written to WAL, but the write to a file on disk is skipped,
> and it's stored in shared memory instead.
>> Quite aside from that, the fixed size of shared memory makes this seem
>> pretty impractical.
> Most state files are small. If one doesn't fit in the area reserved for
> this, it's written to disk as usual. It's just an optimization.
> I'm a bit disappointed by the performance gains. I would've expected
> more, given a decent battery-backed-up cache to buffer the WAL fsyncs.
> But it looks like they're still causing the most overhead, even with a
> battery-backed-up cache.
It doesn't seem that surprising to me that a write to shared memory
and a write to an un-fsync'd file would be about the same speed. The
file write will eventually generate some I/O when it goes to disk, but
at the time you make the system call it's basically just a memory
copy.
...Robert
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Quite aside from that, the fixed size of shared memory makes this seem
>> pretty impractical.
> Most state files are small. If one doesn't fit in the area reserved for
> this, it's written to disk as usual. It's just an optimization.
What evidence do you have for that assumption? And what's "small" anyway?
I think setting the size parameter for this would be a frightfully
difficult problem; the fact that average installations wouldn't use it
doesn't make that any better for those who would. After our bad
experiences with fixed-size FSM, I'm pretty wary of introducing new
fixed-size structures that the user is expected to figure out how to
size.
> I'm a bit disappointed by the performance gains. I would've expected
> more, given a decent battery-backed-up cache to buffer the WAL fsyncs.
> But it looks like they're still causing the most overhead, even with a
> battery-backed-up cache.
If you can't demonstrate order-of-magnitude speedups, I think we
shouldn't touch this.
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Aug 8, 2009 at 9:31 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> I'm a bit disappointed by the performance gains. I would've expected
>> more, given a decent battery-backed-up cache to buffer the WAL fsyncs.
> It doesn't seem that surprising to me that a write to shared memory
> and a write to an un-fsync'd file would be about the same speed.
I just had a second thought about this. The idea is to avoid writing
the separate 2PC state file until/unless it has to be checkpointed.
(And, per the comments for CheckPointTwoPhase, that is an uncommon
case --- especially now with our time-extended checkpoints.)
What if PREPARE simply didn't write the 2PC file at all, except into WAL?
Then, make CheckPointTwoPhase write the 2PC file for any still-live
GXACT, by means of reaching into the WAL and pulling the data out.
All it would need for that is the LSN of the WAL record, which I think
the GXACT has already. (It might have the end location rather than
the start, but in any case we could store both.) Similarly, COMMIT
PREPARED could be taught to pull the data from WAL instead of a 2PC
file, in the typical case where the file didn't exist yet. I think
there might be some synchronization issues against checkpoints --- you
couldn't recycle WAL until you were sure there was no COMMIT PREPARED
pulling from it. But it seems possibly workable, and there's no tuning
knob needed.
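The proposal above can be illustrated with a toy model: each prepared transaction remembers where its PREPARE record starts and ends in WAL, the state data is copied out of WAL on demand, and WAL may only be recycled up to the oldest record that some live transaction still needs. Everything below (the types, the in-memory "WAL", the function names) is a hypothetical sketch, not PostgreSQL code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef size_t ToyLsn;          /* byte offset into a toy in-memory WAL */

typedef struct ToyGxact
{
    int     live;               /* prepared and not yet committed/aborted */
    int     has_file;           /* a checkpoint already wrote its 2PC file */
    ToyLsn  prepare_start;      /* first byte of the PREPARE record */
    ToyLsn  prepare_end;        /* one past the last byte */
} ToyGxact;

/*
 * Copy a prepared transaction's state data out of the WAL.  Returns the
 * number of bytes copied, or 0 if the output buffer is too small.
 */
static size_t
read_state_from_wal(const char *wal, const ToyGxact *g,
                    char *out, size_t out_size)
{
    size_t len;

    assert(g->prepare_end >= g->prepare_start);
    len = g->prepare_end - g->prepare_start;
    if (len > out_size)
        return 0;
    memcpy(out, wal + g->prepare_start, len);
    return len;
}

/*
 * WAL before the returned position is not needed by any live transaction
 * that still lacks a 2PC file, so only that part could be recycled.
 */
static ToyLsn
oldest_needed_lsn(const ToyGxact *gxacts, int n, ToyLsn wal_end)
{
    ToyLsn  oldest = wal_end;
    int     i;

    for (i = 0; i < n; i++)
    {
        if (gxacts[i].live && !gxacts[i].has_file &&
            gxacts[i].prepare_start < oldest)
            oldest = gxacts[i].prepare_start;
    }
    return oldest;
}
```

The second function is one simple way to express the recycling constraint mentioned above; a real implementation would also have to guard against a COMMIT PREPARED that is reading the record while a checkpoint runs.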
regards, tom lane
Tom Lane wrote:
> What if PREPARE simply didn't write the 2PC file at all, except into WAL?
> Then, make CheckPointTwoPhase write the 2PC file for any still-live
> GXACT, by means of reaching into the WAL and pulling the data out.
> All it would need for that is the LSN of the WAL record, which I think
> the GXACT has already. (It might have the end location rather than
> the start, but in any case we could store both.) Similarly, COMMIT
> PREPARED could be taught to pull the data from WAL instead of a 2PC
> file, in the typical case where the file didn't exist yet. I think
> there might be some synchronization issues against checkpoints --- you
> couldn't recycle WAL until you were sure there was no COMMIT PREPARED
> pulling from it. But it seems possibly workable, and there's no tuning
> knob needed.
Interesting idea, might be worth performance testing. Peeking into the
WAL files during normal operation feels naughty, but it should work.
However, if the bottleneck is the WAL fsyncs, I doubt it's any faster
than Michael's current patch.
Actually, it would be interesting to performance test a stripped down
broken implementation that doesn't write the state files anywhere but
WAL, PREPARE releases all locks like regular COMMIT does, and COMMIT
PREPARED just writes the commit record and fsyncs. That would give an
upper bound on how much gain any of these patches can have. If that's
not much, we can throw in the towel.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> What if PREPARE simply didn't write the 2PC file at all, except into WAL?
> Interesting idea, might be worth performance testing. Peeking into the
> WAL files during normal operation feels naughty, but it should work.
> However, if the bottleneck is the WAL fsyncs, I doubt it's any faster
> than Michael's current patch.
This isn't about faster, it's about not requiring users to estimate
a suitable size for a shared-memory arena.
> Actually, it would be interesting to performance test a stripped down
> broken implementation that doesn't write the state files anywhere but
> WAL, PREPARE releases all locks like regular COMMIT does, and COMMIT
> PREPARED just writes the commit record and fsyncs. That would give an
> upper bound on how much gain any of these patches can have. If that's
> not much, we can throw in the towel.
Good idea --- although I would think that the performance of 2PC would
be pretty context-dependent anyway. What load would you test under?
regards, tom lane
After running many tests, I found that the state file size is usually no
more than 600B. In some cases it reached a maximum of 712B, and I used such
transactions in my tests.
> I think setting the size parameter for this would be a frightfully
> difficult problem; the fact that average installations wouldn't use it
> doesn't make that any better for those who would. After our bad
> experiences with fixed-size FSM, I'm pretty wary of introducing new
> fixed-size structures that the user is expected to figure out how to
> size.
The patch is designed so that if a state file is larger than the size chosen
by the user, it is written to disk instead of shared memory. So it poses no
danger to the stability of the system.
The case of too many prepared transactions is also covered by
max_prepared_transactions.
Regards,
--
Michael Paquier
NTT OSSC
Michael Paquier <michael.paquier@gmail.com> writes:
> After making a lot of tests, state file size is not more than 600B.
> In some cases, it reached a maximum of size of 712B and I used such
> transactions in my tests.
I can only say that that demonstrates you didn't test very many cases.
It is trivial to generate enormous state files --- try something with
a lot of subtransactions, for example, or a lot of files created or
deleted. I remain of the opinion that asking users to estimate the
amount of shared memory needed for this patch will cripple its
usability. We learned that lesson the hard way for FSM, I see no
reason we have to fail to learn from experience.
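To make the growth concrete: the state file stores one transaction ID per subtransaction and one relation file identifier per file to delete at commit or abort, so its size grows linearly with both counts. The sketch below is a rough estimate only. It assumes 4-byte transaction IDs and 12-byte relation file identifiers, ignores alignment padding, and treats everything else (header, 2PC records, CRC) as a caller-supplied base size; the function name and constants are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define XID_BYTES          4    /* assumed size of one transaction ID */
#define RELFILENODE_BYTES  12   /* assumed size: three 4-byte OIDs */

/*
 * Rough state file size: a base size measured for a transaction with no
 * subtransactions and no created/deleted files, plus the per-item arrays.
 * Alignment padding is ignored, so real files would be slightly larger.
 */
static size_t
estimate_state_file_size(size_t base_bytes, size_t nsubxacts, size_t nrels)
{
    /* the base covers at least the file header, so it cannot be zero */
    assert(base_bytes > 0);
    return base_bytes + nsubxacts * XID_BYTES + nrels * RELFILENODE_BYTES;
}
```

Taking the 600B measured in the pgbench tests as the base, 1000 subtransactions alone bring the estimate to 4600 bytes, well past a per-file limit of 768.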
regards, tom lane