Re: Performance Improvement by reducing WAL for Update Operation
On Friday, January 11, 2013 11:12 PM Simon Riggs wrote:
On 11 January 2013 17:30, Amit kapila <amit(dot)kapila(at)huawei(dot)com> wrote:
On Friday, January 11, 2013 7:59 PM Alvaro Herrera wrote:
Simon Riggs wrote:
On 28 December 2012 10:21, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
I was also worried about the high variance in the results. Those
averages look rather meaningless. Which would be okay, I think, because
it'd mean that performance-wise the patch is a wash,

For larger tuple sizes (> 1000 && < 1800), the performance gain will be good.
Please refer to the performance results from me and Kyotaro-san at the links below:
http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C383BEAAE32(at)szxeml509-mbx
http://archives.postgresql.org/message-id/20121228(dot)170748(dot)90887322(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp
AFAICS your tests are badly variable, but as Alvaro says, they aren't
accurate enough to tell there's a regression.
Running the performance scenario on the SUSE 11 machine, the readings do not vary much except at 8 threads, which I attribute to the machine having only 4 cores.
Performance readings are attached for the original pgbench schema and for record sizes of 256, 512, 1000 and 1800.
Conclusions from the readings:
1. With the original pgbench there is a maximum 9% WAL reduction with not much performance difference.
2. With a record size of 250 there is a maximum WAL reduction of 30% with not much performance difference.
3. With record sizes of 500 and above there is an improvement in both performance and WAL reduction.
As the record size increases, the performance gain grows and the WAL size is reduced further.
With Regards,
Amit Kapila.
Attachments:
On 11 January 2013 15:57, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
I've moved this to the next CF. I'm planning to review this one first.
Thank you.
Just reviewing the patch now, making more sense with comments added.
Making more sense, but not yet making complete sense.
I'd like you to revisit the patch comments since some of them are completely unreadable.
I have modified most of the comments in the code.
The changes in the attached patch are as below:
1. Introduced the term Encoded WAL Tuple (EWT) to refer to the delta-encoded tuple for the update operation (a rough sketch of the layout is given after this list).
It can be renamed to one of the below:
a. WAL Encoded Tuple (WET)
b. Delta Encoded WAL Tuple (DEWT)
c. Delta WAL Encoded Tuple (DWET)
d. any others?
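For reference, here is a rough sketch of what an EWT contains, following the format described in the transam/README changes of the attached patch. The layout below is only illustrative, not byte-exact; the holder struct is the one used by log_heap_update in the patch.

/*
 * Illustrative EWT stream for an update that modifies one column:
 *
 *   PGLZ_Header            raw (uncompressed) length plus encoded length
 *   control byte           one bit per following item (set = History Reference,
 *                          unset = New Data)
 *   [len][bitmap bytes]    New Data: null bitmap of the new tuple
 *   [T1][T2(,T3)]          History Reference: unchanged leading columns, copied
 *                          from the old tuple version by [offset, length]
 *   [len][column data]     New Data: the modified column value
 *   [T1][T2(,T3)]          History Reference: unchanged trailing columns
 */
struct
{
    PGLZ_Header pglzheader;            /* EWT header */
    char        buf[MaxHeapTupleSize]; /* encoded stream follows the header */
} ewt_buf;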
2. I have kept the wording related to compression in the modified docs, but I have tried to copy parts completely.
IMO this is required as there are some changes w.r.t. LZ compression, such as for New Data (a short sketch of the difference follows).
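To illustrate the difference, here is a minimal sketch only; it assumes the pglz_out_literal and pglz_out_add macros from utils/pg_lzcompress.h as arranged by this patch, mirrors the control-byte setup used in heap_delta_encode, and the 5-byte payload is made up. The original LZ format emits unmatched data as literals, one control bit plus one output byte per input byte, whereas the New Data item in an EWT emits one control bit, one length byte and then the raw data. The two encodings are shown one after the other purely to compare output sizes.

unsigned char ctrl_dummy = 0;
unsigned char *ctrlp = &ctrl_dummy;   /* position of the current control byte */
unsigned char ctrlb = 0;              /* value of the current control byte */
unsigned char ctrl = 0;               /* next control bit to use */
unsigned char out[64];
unsigned char *bp = out;              /* output position */
char          data[] = "ABCDE";       /* 5 bytes of modified column data */
char         *dp = data;
int           i;

/* LZ literals: 5 control bits + 5 data bytes in the output */
for (i = 0; i < 5; i++)
    pglz_out_literal(ctrlp, ctrlb, ctrl, bp, dp[i]);

/* EWT New Data: 1 control bit + 1 length byte + 5 data bytes in the output */
pglz_out_add(ctrlp, ctrlb, ctrl, bp, 5, dp);

*ctrlp = ctrlb;                       /* flush the last control byte, as heap_delta_encode does */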
3. There is a small coding change, as it had been overwritten by one of my previous patches.
The calculation of the approximate length for the encoded WAL tuple:
Previous Patch:
if ((bp + (2 * new_tup_bitmaplen)) - bstart >= result_max)
New Patch:
if ((bp + (2 + new_tup_bitmaplen)) - bstart >= result_max)
The previous calculation would only have been valid if we had used the LZ format exactly.
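To spell out the new estimate with a made-up number (the variable names are the ones used in heap_delta_encode): emitting the new tuple's null bitmap as a single New Data item costs at most 1 control byte + 1 length byte + new_tup_bitmaplen data bytes, so 2 + new_tup_bitmaplen is the worst case to reserve. The old bound of 2 * new_tup_bitmaplen corresponded to the worst case of the plain LZ output format, which no longer applies to the New Data encoding.

/* e.g. new_tup_bitmaplen = 8 (hypothetical value)                       */
/* old reservation: 2 * 8 = 16 bytes                                     */
/* new reservation: 2 + 8 = 10 bytes (control byte + length byte + data) */
if ((bp + (2 + new_tup_bitmaplen)) - bstart >= result_max)
    return false;   /* encoding would not meet wal_update_compression_ratio */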
With Regards,
Amit Kapila.
Attachments:
wal_update_changes_v8.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,66 ****
--- 60,69 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
+ /* GUC variable for EWT compression ratio */
+ int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
***************
*** 297,308 **** heap_attisnull(HeapTuple tup, int attnum)
}
/* ----------------
! * nocachegetattr
*
! * This only gets called from fastgetattr() macro, in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
--- 300,312 ----
}
/* ----------------
! * nocachegetattr_with_len
*
! * This only gets called in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor and
! * outputs the length of the attribute value.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
***************
*** 320,328 **** heap_attisnull(HeapTuple tup, int attnum)
* ----------------
*/
Datum
! nocachegetattr(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
--- 324,333 ----
* ----------------
*/
Datum
! nocachegetattr_with_len(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc,
! Size *len)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
***************
*** 381,386 **** nocachegetattr(HeapTuple tuple,
--- 386,394 ----
*/
if (att[attnum]->attcacheoff >= 0)
{
+ if (len)
+ *len = att_getlength(att[attnum]->attlen,
+ tp + att[attnum]->attcacheoff);
return fetchatt(att[attnum],
tp + att[attnum]->attcacheoff);
}
***************
*** 507,515 **** nocachegetattr(HeapTuple tuple,
--- 515,536 ----
}
}
+ if (len)
+ *len = att_getlength(att[attnum]->attlen, tp + off);
return fetchatt(att[attnum], tp + off);
}
+ /*
+ * nocachegetattr
+ */
+ Datum
+ nocachegetattr(HeapTuple tuple,
+ int attnum,
+ TupleDesc tupleDesc)
+ {
+ return nocachegetattr_with_len(tuple, attnum, tupleDesc, NULL);
+ }
+
/* ----------------
* heap_getsysattr
*
***************
*** 617,622 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 638,1061 ----
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+ /* ----------------
+ * heap_attr_get_length_and_check_equals
+ *
+ * returns the result of comparison of specified attribute's value for
+ * input tuples.
+ * outputs the length of specified attribute's value for
+ * input tuples.
+ * ----------------
+ */
+ bool
+ heap_attr_get_length_and_check_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len)
+ {
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ *tup1_attr_len = 0;
+ *tup2_attr_len = 0;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than OID and tableOID; we cannot expect these to be consistent in a HOT
+ * chain, or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != ObjectIdAttributeNumber &&
+ attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values and length of values. XXX this is
+ * pretty inefficient if there are many indexed columns. Should
+ * HeapSatisfiesHOTUpdate do a single heap_deform_tuple call on each
+ * tuple, instead? But that doesn't work for system columns ...
+ */
+ value1 = heap_getattr_with_len(tup1, attrnum, tupdesc, &isnull1, tup1_attr_len);
+ value2 = heap_getattr_with_len(tup2, attrnum, tupdesc, &isnull2, tup2_attr_len);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = tupdesc->attrs[attrnum - 1];
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ *
+ * Construct a delta Encoded WAL Tuple (EWT) by comparing old and new
+ * tuple versions w.r.t column boundaries.
+ *
+ * Encoded WAL Tuple Format:
+ * Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ *
+ * Encode Mechanism:
+ *
+ * Copy the bitmap data from new tuple to the EWT (Encoded WAL Tuple) and
+ * loop for all attributes to find any modifications in the attributes.
+ * The unmodified data is encoded as a History Reference in EWT and
+ * the modified data (if NOT NULL) is encoded as New Data in EWT.
+ *
+ * The offset values are calculated with respect to the tuple t_hoff
+ * value. For each column attribute old and new tuple offsets
+ * are recalculated based on padding in the tuples.
+ * Once the alignment difference is found between old and new tuple
+ * versions, then include alignment difference as New Data in EWT.
+ *
+ * The maximum encoded data length is 75% (default compression rate)
+ * of the original data. If the encoded output data length is greater
+ * than that, the original tuple (new tuple version) will be stored
+ * directly in the WAL.
+ *
+ *
+ * History Reference:
+ * If any column is modified then the unmodified columns data till the
+ * modified column needs to be copied to EWT as a Tag.
+ *
+ *
+ * New data (modified data):
+ * The first byte represents the length [0-255] of the modified data,
+ * followed by the modified data of corresponding length.
+ *
+ * For more details about Encoded WAL Tuple (EWT) representation,
+ * refer to transam/README
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ PGLZ_Header *encdata)
+ {
+ Form_pg_attribute *att = tupleDesc->attrs;
+ int numberOfAttributes;
+ int32 new_tup_off = 0,
+ old_tup_off = 0,
+ temp_off = 0,
+ match_off = 0,
+ change_off = 0;
+ int attnum;
+ int32 data_len,
+ old_tup_pad_len,
+ new_tup_pad_len;
+ Size old_tup_attr_len,
+ new_tup_attr_len;
+ bool is_attr_equals = true;
+ unsigned char *bp = (unsigned char *) encdata + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ char *dp = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ char *dstart = dp;
+ char *history;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ int32 len,
+ old_tup_bitmaplen,
+ new_tup_bitmaplen,
+ old_tup_len,
+ new_tup_len;
+ int32 result_size;
+ int32 result_max;
+
+ old_tup_len = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (old_tup_len >= PGLZ_HISTORY_SIZE)
+ return false;
+
+ history = (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ old_tup_bitmaplen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_bitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * If the lengths of the old and new tuple versions vary by more than 50%,
+ * include the new tuple as-is
+ */
+ if ((new_tup_len <= (old_tup_len >> 1))
+ || (old_tup_len <= (new_tup_len >> 1)))
+ return false;
+
+ /* Required compression ratio for EWT */
+ result_max = (new_tup_len * (100 - wal_update_compression_ratio)) / 100;
+ encdata->rawsize = new_tup_len;
+
+ /*
+ * Advance the EWT by adding the approximate length of the operation for
+ * new data as [1 control byte + 1 length byte + data_length] and validate
+ * it with result_max. The same length approximation is used in the
+ * function for New data.
+ */
+ if ((bp + (2 + new_tup_bitmaplen)) - bstart >= result_max)
+ return false;
+
+ /* Copy the bitmap data from new tuple to EWT */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_bitmaplen, dp);
+ dstart = dp;
+
+ /*
+ * Loop through all attributes. If an attribute is modified by the update
+ * operation, store the [Offset,Length] referring to the old tuple version
+ * up to the last unchanged column in the EWT as a History Reference;
+ * otherwise store the [Length,Data] from the new tuple version as New Data.
+ */
+ numberOfAttributes = HeapTupleHeaderGetNatts(newtup->t_data);
+ for (attnum = 1; attnum <= numberOfAttributes; attnum++)
+ {
+ if (!heap_attr_get_length_and_check_equals(tupleDesc, attnum, oldtup,
+ newtup, &old_tup_attr_len, &new_tup_attr_len))
+ {
+ is_attr_equals = false;
+ data_len = old_tup_off - match_off;
+
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ /*
+ * The match_off value is calculated w.r.t. the tuple t_hoff
+ * value; the bitmap length needs to be added to match_off to get
+ * the actual start offset in the old/history tuple.
+ */
+ match_off += old_tup_bitmaplen;
+
+ /*
+ * If any unchanged data is present in the old and new tuples, then
+ * encode it as a copy from the history tuple with the corresponding
+ * length and offset.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Recalculate the old and new tuple offsets based on padding in
+ * the tuples
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+ }
+
+ if (!HeapTupleHasNulls(newtup)
+ || !att_isnull((attnum - 1), newtup->t_data->t_bits))
+ {
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ match_off = old_tup_off;
+ }
+ else
+ {
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 + data_len)) - bstart >= result_max)
+ return false;
+
+ /* Add the modified column data to the EWT */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * Calculate the alignment for the old and new tuple versions for this
+ * attribute. If the alignment is the same, we continue with the next
+ * attribute; otherwise 1. store the [Offset,Length] referring to the old
+ * tuple version for the previous attribute (if the previous attribute is
+ * the same in the old and new tuple versions) in the EWT as a History
+ * Reference, and 2. add the [Length,Data] for the alignment from the new
+ * tuple as New Data in the EWT.
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ temp_off = old_tup_off;
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+
+ old_tup_pad_len = old_tup_off - temp_off;
+
+
+ temp_off = new_tup_off;
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ new_tup_pad_len = new_tup_off - temp_off;
+
+ if (old_tup_pad_len != new_tup_pad_len)
+ {
+ /*
+ * If an alignment difference is found between the old and
+ * new tuples and the previous attribute value of the old and
+ * new tuple versions is the same, then store everything up to
+ * the current match as a History Reference tag in the EWT.
+ */
+ if (is_attr_equals)
+ {
+ data_len = old_tup_off - old_tup_pad_len - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+ }
+
+ match_off = old_tup_off;
+
+ /* Alignment data */
+ if ((bp + (2 + new_tup_pad_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_pad_len, dp);
+ }
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ change_off = new_tup_off;
+
+ /*
+ * Recalculate the destination pointer with the new offset which
+ * is used while copying the modified data.
+ */
+ dp = dstart + new_tup_off;
+ is_attr_equals = true;
+ }
+ }
+
+ /* If any modified column data is present then add it to the EWT. */
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 + data_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * If any left out old tuple data is present then copy it as history
+ * reference
+ */
+ data_len = old_tup_off - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+
+ result_size = bp - bstart;
+ if (result_size >= result_max)
+ return false;
+
+ /* Fill in the actual length of the compressed datum */
+ SET_VARSIZE_COMPRESSED(encdata, result_size + sizeof(PGLZ_Header));
+ return true;
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version
+ *
+ * Encoded WAL Tuple Format:
+ * Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ *
+ *
+ * Decode Mechanism:
+ * Skip header and Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ * Check each control bit, if the bit is set then it is History Reference which
+ * means the next 2 - 3 byte tag provides the offset and length of history match.
+ * Use the offset and corresponding length to copy data from old tuple version
+ * to new tuple.
+ * If the control bit is unset, then it is New Data Reference which means
+ * first byte contains the length [0-255] of the modified data, followed
+ * by the modified data of corresponding length specified in the first byte.
+ *
+ * Tag in History Reference:
+ * 2-3 byte tag -
+ * 2 byte tag is used when length of History data (unchanged data from old tuple version) is less than 18.
+ * 3 byte tag is used when length of History data (unchanged data from old tuple version) is greater than
+ * or equal to 18.
+ * The maximum length that can be represented by one Tag is 273.
+ *
+ * For more details about the Encoded WAL Tuple (EWT) representation, refer to transam/README
+ *
+ * ----------------
+ */
+ void
+ heap_delta_decode(PGLZ_Header *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *) encdata,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 85,90 **** static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
--- 85,91 ----
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+ HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared);
static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
HeapTuple oldtup, HeapTuple newtup);
***************
*** 857,862 **** heapgettup_pagemode(HeapScanDesc scan,
--- 858,911 ----
* definition in access/htup.h is maintained.
*/
Datum
+ fastgetattr_with_len(HeapTuple tup, int attnum, TupleDesc tupleDesc,
+ bool *isnull, int32 *len)
+ {
+ return (
+ (attnum) > 0 ?
+ (
+ (*(isnull) = false),
+ HeapTupleNoNulls(tup) ?
+ (
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
+ (
+ (*(len) = att_getlength((tupleDesc)->attrs[(attnum - 1)]->attlen,
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)),
+ fetchatt((tupleDesc)->attrs[(attnum) - 1],
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len)))
+ )
+ :
+ (
+ att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
+ (
+ (*(isnull) = true),
+ (*(len) = 0),
+ (Datum) NULL
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
+ )
+ )
+ )
+ :
+ (
+ (Datum) NULL
+ )
+ );
+ }
+
+ /*
+ * This is formatted so oddly so that the correspondence to the macro
+ * definition in access/htup.h is maintained.
+ */
+ Datum
fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull)
{
***************
*** 873,879 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
--- 922,929 ----
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! (
! nocachegetattr((tup), (attnum), (tupleDesc)))
)
:
(
***************
*** 3229,3238 **** l2:
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
--- 3279,3290 ----
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr;
!
! recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup, &oldtup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
***************
*** 3299,3372 **** static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Datum value1,
! value2;
! bool isnull1,
! isnull2;
! Form_pg_attribute att;
!
! /*
! * If it's a whole-tuple reference, say "not equal". It's not really
! * worth supporting this case, since it could only succeed after a no-op
! * update, which is hardly a case worth optimizing for.
! */
! if (attrnum == 0)
! return false;
!
! /*
! * Likewise, automatically say "not equal" for any system attribute other
! * than OID and tableOID; we cannot expect these to be consistent in a HOT
! * chain, or even to be set correctly yet in the new tuple.
! */
! if (attrnum < 0)
! {
! if (attrnum != ObjectIdAttributeNumber &&
! attrnum != TableOidAttributeNumber)
! return false;
! }
!
! /*
! * Extract the corresponding values. XXX this is pretty inefficient if
! * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
! * single heap_deform_tuple call on each tuple, instead? But that doesn't
! * work for system columns ...
! */
! value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
! value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
!
! /*
! * If one value is NULL and other is not, then they are certainly not
! * equal
! */
! if (isnull1 != isnull2)
! return false;
!
! /*
! * If both are NULL, they can be considered equal.
! */
! if (isnull1)
! return true;
! /*
! * We do simple binary comparison of the two datums. This may be overly
! * strict because there can be multiple binary representations for the
! * same logical value. But we should be OK as long as there are no false
! * positives. Using a type-specific equality operator is messy because
! * there could be multiple notions of equality in different operator
! * classes; furthermore, we cannot safely invoke user-defined functions
! * while holding exclusive buffer lock.
! */
! if (attrnum <= 0)
! {
! /* The only allowed system columns are OIDs, so do this */
! return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
! }
! else
! {
! Assert(attrnum <= tupdesc->natts);
! att = tupdesc->attrs[attrnum - 1];
! return datumIsEqual(value1, value2, att->attbyval, att->attlen);
! }
}
/*
--- 3351,3361 ----
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Size tup1_attr_len,
! tup2_attr_len;
! return heap_attr_get_length_and_check_equals(tupdesc, attrnum, tup1, tup2,
! &tup1_attr_len, &tup2_attr_len);
}
/*
***************
*** 4464,4470 **** log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
--- 4453,4459 ----
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup, HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
***************
*** 4473,4478 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 4462,4477 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 4482,4492 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 4481,4522 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from a reduced WAL size.
+ */
+ if ((oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 4513,4521 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 4543,4554 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
! * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 5291,5297 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5324,5333 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 5306,5312 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5342,5348 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5366,5372 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
--- 5402,5408 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
***************
*** 5385,5391 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 5421,5427 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 5410,5416 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 5446,5452 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 5473,5482 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 5509,5540 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the record is EWT then decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /*
! * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
! * + New data (1 byte length + variable data)+ ...
! */
! PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode(encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 5491,5497 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 5549,5555 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/access/transam/README
--- b/src/backend/access/transam/README
***************
*** 665,670 **** then restart recovery. This is part of the reason for not writing a WAL
--- 665,778 ----
entry until we've successfully done the original action.
+ Encoded WAL Tuple (EWT)
+ -----------------------
+
+ A Delta Encoded WAL Tuple (EWT) eliminates the need to copy the entire tuple to WAL for an update operation.
+ The EWT is constructed by comparing the old and new versions of the tuple w.r.t. column boundaries. It contains the data
+ from the new tuple for modified columns and a reference [Offset,Length] into the old tuple version for unchanged columns.
+
+
+ EWT Format
+ ----------
+
+ Header + Control byte + History Reference (2 - 3)bytes
+ + New data (1 byte length + variable data) + ...
+
+
+ Header:
+
+ The header is the same as PGLZ_Header, which is used to store the compressed length and raw length.
+
+ Control byte:
+
+ The first byte after the header tells what to do the next 8 times. We call this the control byte.
+
+
+ History Reference:
+
+ A set bit in the control byte means, that a tag of 2-3 bytes follows. A tag contains information
+ to copy some bytes from old tuple version to the current location in the output.
+
+ Details about 2-3 byte Tag
+ 2 byte tag is used when length of History data (unchanged data from old tuple version) is less than 18.
+ 3 byte tag is used when length of History data (unchanged data from old tuple version) is greater than
+ or equal to 18.
+ The maximum length that can be represented by one Tag is 273.
+
+ Let's call the three tag bytes T1, T2 and T3. The position of the data to copy is coded as an offset
+ from the old tuple.
+
+ The offset is in the upper nibble of T1 and in T2.
+ The length is in the lower nibble of T1.
+
+ So the 16 bits of a 2 byte tag are coded as
+
+ 7---T1--0 7---T2--0
+ OOOO LLLL OOOO OOOO
+
+ This limits the offset to 1-4095 (12 bits) and the length to 3-18 (4 bits) because 3 is always added to it.
+
+ In the actual implementation, the 2 byte tag's length is limited to 3-17, because the value 0xF
+ in the length nibble has special meaning. It means, that the next following byte (T3) has to be
+ added to the length value of 18. That makes total limits of 1-4095 for offset and 3-273 for length.
+
+
+
+
+ New data:
+
+ An unset bit in the control byte represents modified data of new tuple version.
+ The first byte represents the length [0-255] of the modified data, followed by the
+ modified data of corresponding length.
+
+ 7---T1--0 7---T2--0 ...
+ LLLL LLLL DDDD DDDD ...
+
+ Data bytes repeat until the length of the new data.
+
+
+ L - Length
+ O - Offset
+ D - Data
+
+ This encoding is very similar to LZ Compression used in PostgreSQL (pg_lzcompress.c).
+
+
+ Encoding Mechanism for EWT
+ --------------------------
+ Copy the bitmap data from new tuple to the EWT (Encoded WAL Tuple) and loop for all attributes
+ to find any modifications in the attributes. The unmodified data is encoded as a
+ History Reference in EWT and the modified data (if NOT NULL) is encoded as New Data in EWT.
+
+ The offset values are calculated with respect to the tuple t_hoff value. For each column attribute
+ old and new tuple offsets are recalculated based on padding in the tuples.
+ Once the alignment difference is found between old and new tuple versions,
+ then include alignment difference as New Data in EWT.
+
+ The maximum encoded data length is 75% (default compression rate) of the original data; if the encoded output data
+ length is greater than that, the original tuple (new tuple version) will be directly stored in the WAL.
+
+
+ Decoding Mechanism for EWT
+ --------------------------
+ Skip header and Read one control byte and process the next 8 items (or as many as remain in the compressed input).
+ Check each control bit, if the bit is set then it is History Reference which means the next 2 - 3 byte tag
+ provides the offset and length of history match.
+ Use the offset and corresponding length to copy data from old tuple version to new tuple.
+ If the control bit is unset, then it is New Data Reference which means first byte contains the
+ length [0-255] of the modified data, followed by the modified data of corresponding length
+ specified in the first byte.
+
+
+ Constraints for EWT
+ --------------------
+ 1. Delta encoding is allowed only when the update goes to the same page and
+ the buffer does not need a backup block when full_page_writes is on.
+ 2. Only old tuples with length less than PGLZ_HISTORY_SIZE are allowed for encoding.
+ 3. Old and new tuple versions must not vary in length by more than 50% to be allowed for encoding.
+
+
Asynchronous Commit
-------------------
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 1204,1209 **** begin:;
--- 1204,1231 ----
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+ bool
+ XLogCheckBufferNeedsBackup(Buffer buffer)
+ {
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+ }
+
+ /*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 182,190 ****
*/
#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
- #define PGLZ_HISTORY_SIZE 4096
- #define PGLZ_MAX_MATCH 273
-
/* ----------
* PGLZ_HistEntry -
--- 182,187 ----
***************
*** 302,368 **** do { \
} \
} while (0)
-
- /* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
- #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
- do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
- } while (0)
-
-
- /* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
- } while (0)
-
-
- /* ----------
- * pglz_out_tag -
- *
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
- } while (0)
-
-
/* ----------
* pglz_find_match -
*
--- 299,304 ----
***************
*** 595,601 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
--- 531,537 ----
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off, dp);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 583,620 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 699,726 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
{
! /*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
}
/*
--- 658,726 ----
break;
}
! if (history)
! {
! /*
! * Now we copy the bytes specified by the tag from history
! * to OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
{
! /*
! * Now we copy the bytes specified by the tag from OUTPUT
! * to OUTPUT. It is dangerous and platform dependent to
! * use memcpy() here, because the copied areas could
! * overlap extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
{
! if (history)
! {
! /*
! * The byte at current offset in the source is the length
! * of this literal segment. See pglz_out_add for encoding
! * side.
! */
! int32 len;
!
! len = sp[0];
! sp += 1;
!
! if (dp + len > destend)
! {
! dp += len;
! break;
! }
!
! /*
! * Now we copy the bytes specified by the tag from Source
! * to OUTPUT.
! */
! memcpy(dp, sp, len);
! dp += len;
! sp += len;
! }
! else
! {
! /*
! * An unset control bit means LITERAL BYTE. So we just
! * copy one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
! }
}
/*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 123,128 **** extern int CommitSiblings;
--- 123,129 ----
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+ extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
***************
*** 2382,2387 **** static struct config_int ConfigureNamesInt[] =
--- 2383,2399 ----
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 1, 99,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,153 **** typedef struct xl_heap_update
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 142,161 ----
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! int flags; /* flag bits, see below */
!
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
!
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old page's
! all visible bit is cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new page's
! all visible bit is cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the update
! operation is delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(char))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 18,23 ****
--- 18,24 ----
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "storage/bufpage.h"
+ #include "utils/pg_lzcompress.h"
/*
* MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
***************
*** 528,533 **** struct MinimalTupleData
--- 529,535 ----
HeapTupleHeaderSetOid((tuple)->t_data, (oid))
+ #if !defined(DISABLE_COMPLEX_MACRO)
/* ----------------
* fastgetattr
*
***************
*** 542,550 **** struct MinimalTupleData
* lookups, and call nocachegetattr() for the rest.
* ----------------
*/
-
- #if !defined(DISABLE_COMPLEX_MACRO)
-
#define fastgetattr(tup, attnum, tupleDesc, isnull) \
( \
AssertMacro((attnum) > 0), \
--- 544,549 ----
***************
*** 572,585 **** struct MinimalTupleData
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
)
- #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
-
/* ----------------
* heap_getattr
*
--- 571,626 ----
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
+ ) \
+
+ /* ----------------
+ * fastgetattr_with_len
+ *
+ * Similar to fastgetattr and fetches the length of the given attribute
+ * also.
+ * ----------------
+ */
+ #define fastgetattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ AssertMacro((attnum) > 0), \
+ (*(isnull) = false), \
+ HeapTupleNoNulls(tup) ? \
+ ( \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
+ ( \
+ (*(len) = att_getlength( \
+ (tupleDesc)->attrs[(attnum)-1]->attlen, \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +\
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff)), \
+ fetchatt((tupleDesc)->attrs[(attnum)-1], \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff + \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff) \
+ ) \
+ : \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ : \
+ ( \
+ att_isnull((attnum)-1, (tup)->t_data->t_bits) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ ) \
)
+ #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
+ extern Datum fastgetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc tupleDesc, bool *isnull, int32 *len);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
/* ----------------
* heap_getattr
*
***************
*** 596,616 **** extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
( \
! ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
! ( \
! (*(isnull) = true), \
! (Datum)NULL \
! ) \
! : \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
) \
: \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
--- 637,679 ----
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
+ ( \
+ ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
( \
! (*(isnull) = true), \
! (Datum)NULL \
) \
: \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
! ) \
! : \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
+ /* ----------------
+ * heap_getattr_with_len
+ *
+ * Similar to heap_getattr and outputs the length of the given attribute.
+ * ----------------
+ */
+ #define heap_getattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ ((attnum) > 0) ? \
+ ( \
+ ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ fastgetattr_with_len((tup), (attnum), (tupleDesc), (isnull), (len)) \
+ ) \
+ : \
+ heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
+ )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
***************
*** 620,625 **** extern void heap_fill_tuple(TupleDesc tupleDesc,
--- 683,690 ----
char *data, Size data_size,
uint16 *infomask, bits8 *bit);
extern bool heap_attisnull(HeapTuple tup, int attnum);
+ extern Datum nocachegetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc att, Size *len);
extern Datum nocachegetattr(HeapTuple tup, int attnum,
TupleDesc att);
extern Datum heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
***************
*** 636,641 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 701,714 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_attr_get_length_and_check_equals(TupleDesc tupdesc,
+ int attrnum, HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, PGLZ_Header *encdata);
+ extern void heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/tupmacs.h
--- b/src/include/access/tupmacs.h
***************
*** 187,192 ****
--- 187,214 ----
)
/*
+ * att_getlength -
+ * Gets the length of the attribute.
+ */
+ #define att_getlength(attlen, attptr) \
+ ( \
+ ((attlen) > 0) ? \
+ ( \
+ (attlen) \
+ ) \
+ : (((attlen) == -1) ? \
+ ( \
+ VARSIZE_ANY(attptr) \
+ ) \
+ : \
+ ( \
+ AssertMacro((attlen) == -2), \
+ (strlen((char *) (attptr)) + 1) \
+ )) \
+ )
+
+
+ /*
* store_att_byval is a partial inverse of fetch_att: store a given Datum
* value into a tuple data area at the specified address. However, it only
* handles the byval case, because in typical usage the caller needs to
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 261,266 **** typedef struct CheckpointStatsData
--- 261,267 ----
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+ extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 23,28 **** typedef struct PGLZ_Header
--- 23,31 ----
int32 rawsize;
} PGLZ_Header;
+ /* LZ algorithm can hold only history offset in the range of 1 - 4095. */
+ #define PGLZ_HISTORY_SIZE 4096
+ #define PGLZ_MAX_MATCH 273
/* ----------
* PGLZ_MAX_OUTPUT -
***************
*** 86,91 **** typedef struct PGLZ_Strategy
--- 89,207 ----
int32 match_size_drop;
} PGLZ_Strategy;
+ /*
+ * calculate the approximate length required for history reference tag for the
+ * given length
+ */
+ #define PGLZ_GET_HIST_CTRL_BIT_LEN(_len) \
+ ( \
+ ((_len) < 17) ? (3) : (4 * (1 + ((_len) / PGLZ_MAX_MATCH))) \
+ )
+
+ /* ----------
+ * pglz_out_ctrl -
+ *
+ * Outputs the last and allocates a new control byte if needed.
+ * ----------
+ */
+ #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
+ do { \
+ if ((__ctrl & 0xff) == 0) \
+ { \
+ *(__ctrlp) = __ctrlb; \
+ __ctrlp = (__buf)++; \
+ __ctrlb = 0; \
+ __ctrl = 1; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_literal -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+ do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 1; \
+ } while (0)
+
+ /* ----------
+ * pglz_out_tag -
+ *
+ * Outputs a backward/history reference tag of 2-3 bytes (depending on
+ * offset and length) to the destination buffer including the
+ * appropriate control bit.
+ *
+ * The backward/history reference is split into multiple chunks if the
+ * given length is more than the maximum match, and the process repeats
+ * until the whole length is processed.
+ *
+ * If the matched history length is less than 3 bytes then it is added
+ * as New Data instead of a history reference. This occurs only while
+ * framing an EWT.
+ * ----------
+ */
+ #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_byte) \
+ do { \
+ int _mtaglen; \
+ int _tagtotal_len = (_len); \
+ while (_tagtotal_len > 0) \
+ { \
+ _mtaglen = _tagtotal_len > PGLZ_MAX_MATCH ? PGLZ_MAX_MATCH : _tagtotal_len; \
+ if (_mtaglen < 3) \
+ { \
+ char *_data = (char *)(_byte) + (_off); \
+ pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_mtaglen,_data); \
+ break; \
+ } \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_mtaglen > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_mtaglen) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_mtaglen) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+ _tagtotal_len -= _mtaglen; \
+ (_off) += _mtaglen; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_add -
+ *
+ * Outputs a reference tag of 1 byte with length and the new data
+ * to the destination buffer, including the appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+ do { \
+ int32 _maddlen; \
+ int32 _addtotal_len = (_len); \
+ while (_addtotal_len > 0) \
+ { \
+ _maddlen = _addtotal_len > 255 ? 255 : _addtotal_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrl <<= 1; \
+ (_buf)[0] = (unsigned char)(_maddlen); \
+ (_buf) += 1; \
+ memcpy((_buf), (_byte), _maddlen); \
+ (_buf) += _maddlen; \
+ (_byte) += _maddlen; \
+ _addtotal_len -= _maddlen; \
+ } \
+ } while (0)
+
/* ----------
* The standard strategies
***************
*** 108,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
!
#endif /* _PG_LZCOMPRESS_H_ */
--- 224,229 ----
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
! extern void pglz_decompress_with_history(const char *source, char *dest,
! uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
*** a/src/test/regress/expected/update.out
--- b/src/test/regress/expected/update.out
***************
*** 97,99 **** SELECT a, b, char_length(c) FROM update_test;
--- 97,169 ----
(2 rows)
DROP TABLE update_test;
+ --
+ -- Test to update continuous and non-continuous columns
+ --
+ DROP TABLE IF EXISTS update_test;
+ NOTICE: table "update_test" does not exist, skipping
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+ -- update 3 continuous columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+ -- update two non-continuous columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ DROP TABLE update_test;
*** a/src/test/regress/sql/update.sql
--- b/src/test/regress/sql/update.sql
***************
*** 59,61 **** UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
--- 59,128 ----
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+ --
+ -- Test to update continuous and non-continuous columns
+ --
+
+ DROP TABLE IF EXISTS update_test;
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+
+ SELECT * from update_test;
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+
+ -- update 3 continuous columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+ -- update two non-continuous columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+
+ SELECT * from update_test;
+ DROP TABLE update_test;
On Monday, January 21, 2013 9:32 PM Amit kapila wrote:
On 11 January 2013 15:57, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
I've moved this to the next CF. I'm planning to review this one first.
Thank you.
Just reviewing the patch now, making more sense with comments added.
Making more sense, but not yet making complete sense.
I'd like you to revisit the patch comments since some of them are
completely unreadable.
I have modified most of the comments in code.
The changes in attached patch are as below:
Rebased the patch as per HEAD.
With Regards,
Amit Kapila.
Attachments:
wal_update_changes_v9.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,66 ****
--- 60,69 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
+ /* GUC variable for EWT compression ratio */
+ int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
***************
*** 297,308 **** heap_attisnull(HeapTuple tup, int attnum)
}
/* ----------------
! * nocachegetattr
*
! * This only gets called from fastgetattr() macro, in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
--- 300,312 ----
}
/* ----------------
! * nocachegetattr_with_len
*
! * This only gets called in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor and
! * outputs the length of the attribute value.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
***************
*** 320,328 **** heap_attisnull(HeapTuple tup, int attnum)
* ----------------
*/
Datum
! nocachegetattr(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
--- 324,333 ----
* ----------------
*/
Datum
! nocachegetattr_with_len(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc,
! Size *len)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
***************
*** 381,386 **** nocachegetattr(HeapTuple tuple,
--- 386,394 ----
*/
if (att[attnum]->attcacheoff >= 0)
{
+ if (len)
+ *len = att_getlength(att[attnum]->attlen,
+ tp + att[attnum]->attcacheoff);
return fetchatt(att[attnum],
tp + att[attnum]->attcacheoff);
}
***************
*** 507,515 **** nocachegetattr(HeapTuple tuple,
--- 515,536 ----
}
}
+ if (len)
+ *len = att_getlength(att[attnum]->attlen, tp + off);
return fetchatt(att[attnum], tp + off);
}
+ /*
+ * nocachegetattr
+ */
+ Datum
+ nocachegetattr(HeapTuple tuple,
+ int attnum,
+ TupleDesc tupleDesc)
+ {
+ return nocachegetattr_with_len(tuple, attnum, tupleDesc, NULL);
+ }
+
/* ----------------
* heap_getsysattr
*
***************
*** 617,622 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 638,1061 ----
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+ /* ----------------
+ * heap_attr_get_length_and_check_equals
+ *
+ * Returns whether the specified attribute's values are equal in the two
+ * input tuples, and outputs the length of the attribute's value in each
+ * input tuple.
+ * ----------------
+ */
+ bool
+ heap_attr_get_length_and_check_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len)
+ {
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ *tup1_attr_len = 0;
+ *tup2_attr_len = 0;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than OID and tableOID; we cannot expect these to be consistent in a HOT
+ * chain, or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != ObjectIdAttributeNumber &&
+ attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values and length of values. XXX this is
+ * pretty inefficient if there are many indexed columns. Should
+ * HeapSatisfiesHOTUpdate do a single heap_deform_tuple call on each
+ * tuple, instead? But that doesn't work for system columns ...
+ */
+ value1 = heap_getattr_with_len(tup1, attrnum, tupdesc, &isnull1, tup1_attr_len);
+ value2 = heap_getattr_with_len(tup2, attrnum, tupdesc, &isnull2, tup2_attr_len);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = tupdesc->attrs[attrnum - 1];
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ *
+ * Construct a delta Encoded WAL Tuple (EWT) by comparing old and new
+ * tuple versions w.r.t column boundaries.
+ *
+ * Encoded WAL Tuple Format:
+ * Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ *
+ * Encode Mechanism:
+ *
+ * Copy the bitmap data from new tuple to the EWT (Encoded WAL Tuple) and
+ * loop for all attributes to find any modifications in the attributes.
+ * The unmodified data is encoded as a History Reference in EWT and
+ * the modified data (if NOT NULL) is encoded as New Data in EWT.
+ *
+ * The offset values are calculated with respect to the tuple t_hoff
+ * value. For each column attribute old and new tuple offsets
+ * are recalculated based on padding in the tuples.
+ * Once an alignment difference is found between the old and new tuple
+ * versions, the alignment difference is included as New Data in the EWT.
+ *
+ * The maximum encoded data length is 75% (default compression rate)
+ * of the original data. If the encoded output is longer than that,
+ * the original tuple (new tuple version) is stored directly in the
+ * WAL record.
+ *
+ *
+ * History Reference:
+ * If a column is modified, the data of the unmodified columns preceding it
+ * is copied to the EWT as a Tag.
+ *
+ *
+ * New data (modified data):
+ * The first byte represents the length [0-255] of the modified data,
+ * followed by the modified data of corresponding length.
+ *
+ * For more details about Encoded WAL Tuple (EWT) representation,
+ * refer to transam/README
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ PGLZ_Header *encdata)
+ {
+ Form_pg_attribute *att = tupleDesc->attrs;
+ int numberOfAttributes;
+ int32 new_tup_off = 0,
+ old_tup_off = 0,
+ temp_off = 0,
+ match_off = 0,
+ change_off = 0;
+ int attnum;
+ int32 data_len,
+ old_tup_pad_len,
+ new_tup_pad_len;
+ Size old_tup_attr_len,
+ new_tup_attr_len;
+ bool is_attr_equals = true;
+ unsigned char *bp = (unsigned char *) encdata + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ char *dp = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ char *dstart = dp;
+ char *history;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ int32 len,
+ old_tup_bitmaplen,
+ new_tup_bitmaplen,
+ old_tup_len,
+ new_tup_len;
+ int32 result_size;
+ int32 result_max;
+
+ old_tup_len = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * Tuples longer than PGLZ_HISTORY_SIZE are not allowed for delta
+ * encoding, as that is the maximum history offset.
+ */
+ if (old_tup_len >= PGLZ_HISTORY_SIZE)
+ return false;
+
+ history = (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ old_tup_bitmaplen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_bitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * If the lengths of the old and new tuple versions differ by more than
+ * 50%, include the new tuple as-is
+ */
+ if ((new_tup_len <= (old_tup_len >> 1))
+ || (old_tup_len <= (new_tup_len >> 1)))
+ return false;
+
+ /* Required compression ratio for EWT */
+ result_max = (new_tup_len * (100 - wal_update_compression_ratio)) / 100;
+ encdata->rawsize = new_tup_len;
+
+ /*
+ * Estimate the output length of the upcoming New Data operation as
+ * [1 control byte + 1 length byte + data_length] and validate it
+ * against result_max. The same length approximation is used for New
+ * Data throughout this function.
+ */
+ if ((bp + (2 + new_tup_bitmaplen)) - bstart >= result_max)
+ return false;
+
+ /* Copy the bitmap data from new tuple to EWT */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_bitmaplen, dp);
+ dstart = dp;
+
+ /*
+ * Loop through all attributes. If an attribute is modified by the update
+ * operation, store an [Offset,Length] History Reference into the old tuple
+ * version covering the data up to the last unchanged column; otherwise
+ * store the [Length,Data] from the new tuple version as New Data.
+ */
+ numberOfAttributes = HeapTupleHeaderGetNatts(newtup->t_data);
+ for (attnum = 1; attnum <= numberOfAttributes; attnum++)
+ {
+ if (!heap_attr_get_length_and_check_equals(tupleDesc, attnum, oldtup,
+ newtup, &old_tup_attr_len, &new_tup_attr_len))
+ {
+ is_attr_equals = false;
+ data_len = old_tup_off - match_off;
+
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ /*
+ * The match_off value is calculated w.r.t. the tuple t_hoff
+ * value; the bitmap length needs to be added to match_off to get
+ * the actual start offset in the old/history tuple.
+ */
+ match_off += old_tup_bitmaplen;
+
+ /*
+ * If any unchanged data is present in the old and new tuples, encode
+ * it as a history reference, i.e. a length and an offset into the
+ * history (old) tuple.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Recalculate the old and new tuple offsets based on padding in
+ * the tuples
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+ }
+
+ if (!HeapTupleHasNulls(newtup)
+ || !att_isnull((attnum - 1), newtup->t_data->t_bits))
+ {
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ match_off = old_tup_off;
+ }
+ else
+ {
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 + data_len)) - bstart >= result_max)
+ return false;
+
+ /* Add the modified column data to the EWT */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * Calculate the alignment padding of this attribute in the old and
+ * new tuple versions. If the padding is the same, continue with the
+ * next attribute; otherwise 1. store an [Offset,Length] History
+ * Reference into the old tuple version for the preceding attributes
+ * (if the previous attribute is the same in both versions), and
+ * 2. add the [Length,Data] for the alignment bytes from the new
+ * tuple as New Data in the EWT.
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ temp_off = old_tup_off;
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+
+ old_tup_pad_len = old_tup_off - temp_off;
+
+
+ temp_off = new_tup_off;
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ new_tup_pad_len = new_tup_off - temp_off;
+
+ if (old_tup_pad_len != new_tup_pad_len)
+ {
+ /*
+ * If an alignment difference is found between the old and
+ * new tuples and the previous attribute value is the same
+ * in both versions, store the data up to the current match
+ * as a History Reference tag in the EWT.
+ */
+ if (is_attr_equals)
+ {
+ data_len = old_tup_off - old_tup_pad_len - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+ }
+
+ match_off = old_tup_off;
+
+ /* Alignment data */
+ if ((bp + (2 + new_tup_pad_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_pad_len, dp);
+ }
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ change_off = new_tup_off;
+
+ /*
+ * Recalculate the destination pointer with the new offset which
+ * is used while copying the modified data.
+ */
+ dp = dstart + new_tup_off;
+ is_attr_equals = true;
+ }
+ }
+
+ /* If any modified column data remains, add it to the EWT. */
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 + data_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * If any remaining old tuple data is present, copy it as a history
+ * reference
+ */
+ data_len = old_tup_off - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+
+ result_size = bp - bstart;
+ if (result_size >= result_max)
+ return false;
+
+ /* Fill in the actual length of the compressed datum */
+ SET_VARSIZE_COMPRESSED(encdata, result_size + sizeof(PGLZ_Header));
+ return true;
+ }
+
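+ /*
+ * Illustrative example of the encoding above: for a table (a int4, b text)
+ * where an UPDATE changes only b, the EWT framed by heap_delta_encode would
+ * typically contain
+ * 1. the new tuple's null bitmap/padding copied as New Data,
+ * 2. a History Reference [Offset,Length] covering column a (plus any
+ *    alignment bytes) taken from the old tuple version, and
+ * 3. a New Data item [Length,Data] carrying the new value of b.
+ * heap_delta_decode below replays these items against the old tuple version
+ * to reconstruct the new tuple during recovery.
+ */
+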
+ /* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version
+ *
+ * Encoded WAL Tuple Format:
+ * Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ *
+ *
+ * Decode Mechanism:
+ * Skip the header, read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ * Check each control bit, if the bit is set then it is History Reference which
+ * means the next 2 - 3 byte tag provides the offset and length of history match.
+ * Use the offset and corresponding length to copy data from old tuple version
+ * to new tuple.
+ * If the control bit is unset, then it is New Data Reference which means
+ * first byte contains the length [0-255] of the modified data, followed
+ * by the modified data of corresponding length specified in the first byte.
+ *
+ * Tag in History Reference:
+ * 2-3 byte tag -
+ * 2 byte tag is used when length of History data (unchanged data from old tuple version) is less than 18.
+ * 3 byte tag is used when length of History data (unchanged data from old tuple version) is greater than
+ * or equal to 18.
+ * The maximum length that can be represented by one Tag is 273.
+ *
+ * For more details about Encoded WAL Tuple (EWT) representation, refer to transam/README
+ *
+ * ----------------
+ */
+ void
+ heap_delta_decode(PGLZ_Header *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *) encdata,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 950,955 **** heapgettup_pagemode(HeapScanDesc scan,
--- 950,1003 ----
* definition in access/htup.h is maintained.
*/
Datum
+ fastgetattr_with_len(HeapTuple tup, int attnum, TupleDesc tupleDesc,
+ bool *isnull, int32 *len)
+ {
+ return (
+ (attnum) > 0 ?
+ (
+ (*(isnull) = false),
+ HeapTupleNoNulls(tup) ?
+ (
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
+ (
+ (*(len) = att_getlength((tupleDesc)->attrs[(attnum - 1)]->attlen,
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)),
+ fetchatt((tupleDesc)->attrs[(attnum) - 1],
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len)))
+ )
+ :
+ (
+ att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
+ (
+ (*(isnull) = true),
+ (*(len) = 0),
+ (Datum) NULL
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
+ )
+ )
+ )
+ :
+ (
+ (Datum) NULL
+ )
+ );
+ }
+
+ /*
+ * This is formatted so oddly so that the correspondence to the macro
+ * definition in access/htup.h is maintained.
+ */
+ Datum
fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull)
{
***************
*** 966,972 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
--- 1014,1021 ----
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! (
! nocachegetattr((tup), (attnum), (tupleDesc)))
)
:
(
***************
*** 3609,3682 **** static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Datum value1,
! value2;
! bool isnull1,
! isnull2;
! Form_pg_attribute att;
!
! /*
! * If it's a whole-tuple reference, say "not equal". It's not really
! * worth supporting this case, since it could only succeed after a no-op
! * update, which is hardly a case worth optimizing for.
! */
! if (attrnum == 0)
! return false;
!
! /*
! * Likewise, automatically say "not equal" for any system attribute other
! * than OID and tableOID; we cannot expect these to be consistent in a HOT
! * chain, or even to be set correctly yet in the new tuple.
! */
! if (attrnum < 0)
! {
! if (attrnum != ObjectIdAttributeNumber &&
! attrnum != TableOidAttributeNumber)
! return false;
! }
!
! /*
! * Extract the corresponding values. XXX this is pretty inefficient if
! * there are many indexed columns. Should HeapSatisfiesHOTandKeyUpdate do a
! * single heap_deform_tuple call on each tuple, instead? But that doesn't
! * work for system columns ...
! */
! value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
! value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
! /*
! * If one value is NULL and other is not, then they are certainly not
! * equal
! */
! if (isnull1 != isnull2)
! return false;
!
! /*
! * If both are NULL, they can be considered equal.
! */
! if (isnull1)
! return true;
!
! /*
! * We do simple binary comparison of the two datums. This may be overly
! * strict because there can be multiple binary representations for the
! * same logical value. But we should be OK as long as there are no false
! * positives. Using a type-specific equality operator is messy because
! * there could be multiple notions of equality in different operator
! * classes; furthermore, we cannot safely invoke user-defined functions
! * while holding exclusive buffer lock.
! */
! if (attrnum <= 0)
! {
! /* The only allowed system columns are OIDs, so do this */
! return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
! }
! else
! {
! Assert(attrnum <= tupdesc->natts);
! att = tupdesc->attrs[attrnum - 1];
! return datumIsEqual(value1, value2, att->attbyval, att->attlen);
! }
}
/*
--- 3658,3668 ----
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Size tup1_attr_len,
! tup2_attr_len;
! return heap_attr_get_length_and_check_equals(tupdesc, attrnum, tup1, tup2,
! &tup1_attr_len, &tup2_attr_len);
}
/*
***************
*** 5765,5770 **** log_heap_update(Relation reln, Buffer oldbuf,
--- 5751,5766 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 5774,5788 **** log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 5770,5815 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * An EWT can be generated for any new tuple version created by an UPDATE
+ * operation. Currently we do it only when the old and new tuple versions
+ * are on the same page, so that if the page containing the old tuple is
+ * corrupt during recovery, the corruption cannot cascade to other pages.
+ * Under the general assumption that over long runs most updates create
+ * the new tuple version on the same page, this should not significantly
+ * reduce the WAL savings or the performance benefit.
+ *
+ * We should not generate an EWT when we need to back up the whole block
+ * in WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if ((oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 5809,5817 **** log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 5836,5847 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows,
! * OR PG93FORMAT (if encoded): LZ header + encoded data follows
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 6614,6620 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 6644,6653 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 6629,6635 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 6662,6668 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 6689,6695 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
--- 6722,6728 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
***************
*** 6707,6713 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 6740,6746 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 6732,6738 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 6765,6771 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 6795,6804 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 6828,6859 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the record is EWT then decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /*
! * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
! * + New data (1 byte length + variable data)+ ...
! */
! PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode(encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 6814,6820 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 6869,6875 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/access/transam/README
--- b/src/backend/access/transam/README
***************
*** 665,670 **** then restart recovery. This is part of the reason for not writing a WAL
--- 665,778 ----
entry until we've successfully done the original action.
+ Encoded WAL Tuple (EWT)
+ -----------------------
+
+ A delta Encoded WAL Tuple (EWT) eliminates the need to copy the entire tuple to WAL for an update operation.
+ The EWT is constructed by comparing the old and new tuple versions w.r.t. column boundaries. It contains the data
+ from the new tuple for modified columns and [Offset,Length] references into the old tuple version for unchanged columns.
+
+
+ EWT Format
+ ----------
+
+ Header + Control byte + History Reference (2 - 3)bytes
+ + New data (1 byte length + variable data) + ...
+
+
+ Header:
+
+ The header is the same as PGLZ_Header, which is used to store the compressed length and the raw length.
+
+ Control byte:
+
+ The first byte after the header tells what to do the next 8 times. We call this the control byte.
+
+
+ History Reference:
+
+ A set bit in the control byte means that a tag of 2-3 bytes follows. A tag contains the information
+ needed to copy some bytes from the old tuple version to the current location in the output.
+
+ Details about 2-3 byte Tag
+ 2 byte tag is used when length of History data (unchanged data from old tuple version) is less than 18.
+ 3 byte tag is used when length of History data (unchanged data from old tuple version) is greater than
+ or equal to 18.
+ The maximum length that can be represented by one Tag is 273.
+
+ Let's call the three tag bytes T1, T2 and T3. The position of the data to copy is coded as an offset
+ from the old tuple.
+
+ The offset is in the upper nibble of T1 and in T2.
+ The length is in the lower nibble of T1.
+
+ So the 16 bits of a 2 byte tag are coded as
+
+ 7---T1--0 7---T2--0
+ OOOO LLLL OOOO OOOO
+
+ This limits the offset to 1-4095 (12 bits) and the length to 3-18 (4 bits) because 3 is always added to it.
+
+ In the actual implementation, the 2 byte tag's length is limited to 3-17, because the value 0xF
+ in the length nibble has special meaning. It means, that the next following byte (T3) has to be
+ added to the length value of 18. That makes total limits of 1-4095 for offset and 3-273 for length.
+
+
+
+
+ New data:
+
+ An unset bit in the control byte represents modified data from the new tuple version.
+ The first byte represents the length [0-255] of the modified data, followed by the
+ modified data of the corresponding length.
+
+ 7---T1--0 7---T2--0 ...
+ LLLL LLLL DDDD DDDD ...
+
+ Data bytes repeat until the length of the new data.
+
+
+ L - Length
+ O - Offset
+ D - Data
+
+ This encoding is very similar to LZ Compression used in PostgreSQL (pg_lzcompress.c).
+
+
+ Encoding Mechanism for EWT
+ --------------------------
+ Copy the bitmap data from new tuple to the EWT (Encoded WAL Tuple) and loop for all attributes
+ to find any modifications in the attributes. The unmodified data is encoded as a
+ History Reference in EWT and the modified data (if NOT NULL) is encoded as New Data in EWT.
+
+ The offset values are calculated with respect to the tuple t_hoff value. For each column attribute
+ old and new tuple offsets are recalculated based on padding in the tuples.
+ Once an alignment difference is found between the old and new tuple versions,
+ the alignment difference is included as New Data in the EWT.
+
+ The maximum encoded data length is 75% (default compression rate) of the original data; if the encoded output
+ is longer than that, the original tuple (new tuple version) is stored directly in the WAL record.
+
+
+ Decoding Mechanism for EWT
+ --------------------------
+ Skip the header, read one control byte and process the next 8 items (or as many as remain in the compressed input).
+ Check each control bit: if the bit is set, it is a History Reference, which means the next 2-3 byte tag
+ provides the offset and length of the history match.
+ Use the offset and corresponding length to copy data from the old tuple version to the new tuple.
+ If the control bit is unset, it is a New Data reference, which means the first byte contains the
+ length [0-255] of the modified data, followed by the modified data of the length
+ specified in the first byte.
+
+
+ Constraints for EWT
+ --------------------
+ 1. Delta encoding is attempted only when the new tuple version goes to the same page and
+ the buffer does not need a backup block (full-page write).
+ 2. Only old tuples shorter than PGLZ_HISTORY_SIZE are allowed for encoding.
+ 3. The old and new tuple versions must not differ in length by more than 50%.
+
+
Asynchronous Commit
-------------------
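
To complement the decoding description in the README section above, here is a simplified standalone sketch (not part of the patch) of the loop that pglz_decompress_with_history implements for an EWT; it assumes the PGLZ_Header has already been consumed and omits the destination overrun checks the real code performs:

#include <string.h>

static void
decode_ewt(const unsigned char *sp, const unsigned char *srcend,
           unsigned char *dp, const unsigned char *history)
{
	while (sp < srcend)
	{
		unsigned char ctrl = *sp++;
		int			ctrlc;

		for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
		{
			if (ctrl & 1)
			{
				/* History Reference: 2-3 byte tag, copy from the old tuple */
				int			len = (sp[0] & 0x0f) + 3;
				int			off = ((sp[0] & 0xf0) << 4) | sp[1];

				sp += 2;
				if (len == 18)		/* 3-byte tag: extra length byte */
					len += *sp++;
				memcpy(dp, history + off, len);
				dp += len;
			}
			else
			{
				/* New Data: 1 length byte followed by that many data bytes */
				int			len = *sp++;

				memcpy(dp, sp, len);
				dp += len;
				sp += len;
			}
			ctrl >>= 1;
		}
	}
}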
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 1204,1209 **** begin:;
--- 1204,1231 ----
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+ bool
+ XLogCheckBufferNeedsBackup(Buffer buffer)
+ {
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+ }
+
+ /*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 182,190 ****
*/
#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
- #define PGLZ_HISTORY_SIZE 4096
- #define PGLZ_MAX_MATCH 273
-
/* ----------
* PGLZ_HistEntry -
--- 182,187 ----
***************
*** 302,368 **** do { \
} \
} while (0)
-
- /* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
- #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
- do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
- } while (0)
-
-
- /* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
- } while (0)
-
-
- /* ----------
- * pglz_out_tag -
- *
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
- } while (0)
-
-
/* ----------
* pglz_find_match -
*
--- 299,304 ----
***************
*** 595,601 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
--- 531,537 ----
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off, dp);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 583,620 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 699,726 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
{
! /*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
}
/*
--- 658,726 ----
break;
}
! if (history)
! {
! /*
! * Now we copy the bytes specified by the tag from history
! * to OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
{
! /*
! * Now we copy the bytes specified by the tag from OUTPUT
! * to OUTPUT. It is dangerous and platform dependent to
! * use memcpy() here, because the copied areas could
! * overlap extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
{
! if (history)
! {
! /*
! * The byte at current offset in the source is the length
! * of this literal segment. See pglz_out_add for encoding
! * side.
! */
! int32 len;
!
! len = sp[0];
! sp += 1;
!
! if (dp + len > destend)
! {
! dp += len;
! break;
! }
!
! /*
! * Now we copy the bytes specified by the tag from Source
! * to OUTPUT.
! */
! memcpy(dp, sp, len);
! dp += len;
! sp += len;
! }
! else
! {
! /*
! * An unset control bit means LITERAL BYTE. So we just
! * copy one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
! }
}
/*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 123,128 **** extern int CommitSiblings;
--- 123,129 ----
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+ extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
***************
*** 2382,2387 **** static struct config_int ConfigureNamesInt[] =
--- 2383,2399 ----
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of the delta record for WAL update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 1, 99,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
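
Since wal_update_compression_ratio (added above) drives the size budget that heap_delta_encode must stay within, a tiny standalone sketch of the arithmetic (illustrative numbers only, not part of the patch):

#include <stdio.h>

int
main(void)
{
	int			new_tup_len = 200;	/* new tuple data length, excluding header */
	int			ratio = 25;			/* wal_update_compression_ratio (default) */
	int			result_max = (new_tup_len * (100 - ratio)) / 100;

	/*
	 * heap_delta_encode gives up and falls back to logging the whole new
	 * tuple as soon as the encoded output would reach this budget.
	 */
	printf("EWT must stay below %d of %d bytes\n", result_max, new_tup_len);	/* 150 of 200 */
	return 0;
}

With the default ratio of 25, the encoded WAL tuple may use at most 75% of the new tuple's data length; otherwise the whole new tuple is logged as before.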
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 147,159 **** typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 147,168 ----
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! int flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
! * page's all-visible
! * bit is cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
! * page's all-visible
! * bit is cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
! * update operation is
! * delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(int))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 18,23 ****
--- 18,24 ----
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "storage/bufpage.h"
+ #include "utils/pg_lzcompress.h"
/*
* MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
***************
*** 579,584 **** struct MinimalTupleData
--- 580,586 ----
HeapTupleHeaderSetOid((tuple)->t_data, (oid))
+ #if !defined(DISABLE_COMPLEX_MACRO)
/* ----------------
* fastgetattr
*
***************
*** 593,601 **** struct MinimalTupleData
* lookups, and call nocachegetattr() for the rest.
* ----------------
*/
-
- #if !defined(DISABLE_COMPLEX_MACRO)
-
#define fastgetattr(tup, attnum, tupleDesc, isnull) \
( \
AssertMacro((attnum) > 0), \
--- 595,600 ----
***************
*** 623,636 **** struct MinimalTupleData
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
)
- #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
-
/* ----------------
* heap_getattr
*
--- 622,677 ----
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
+ ) \
+
+ /* ----------------
+ * fastgetattr_with_len
+ *
+ * Similar to fastgetattr and fetches the length of the given attribute
+ * also.
+ * ----------------
+ */
+ #define fastgetattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ AssertMacro((attnum) > 0), \
+ (*(isnull) = false), \
+ HeapTupleNoNulls(tup) ? \
+ ( \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
+ ( \
+ (*(len) = att_getlength( \
+ (tupleDesc)->attrs[(attnum)-1]->attlen, \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +\
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff)), \
+ fetchatt((tupleDesc)->attrs[(attnum)-1], \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff + \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff) \
+ ) \
+ : \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ : \
+ ( \
+ att_isnull((attnum)-1, (tup)->t_data->t_bits) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ ) \
)
+ #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
+ extern Datum fastgetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc tupleDesc, bool *isnull, int32 *len);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
/* ----------------
* heap_getattr
*
***************
*** 647,667 **** extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
( \
! ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
! ( \
! (*(isnull) = true), \
! (Datum)NULL \
! ) \
! : \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
) \
: \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
--- 688,730 ----
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
+ ( \
+ ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
( \
! (*(isnull) = true), \
! (Datum)NULL \
) \
: \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
! ) \
! : \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
+ /* ----------------
+ * heap_getattr_with_len
+ *
+ * Similar to heap_getattr and outputs the length of the given attribute.
+ * ----------------
+ */
+ #define heap_getattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ ((attnum) > 0) ? \
+ ( \
+ ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ fastgetattr_with_len((tup), (attnum), (tupleDesc), (isnull), (len)) \
+ ) \
+ : \
+ heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
+ )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
***************
*** 671,676 **** extern void heap_fill_tuple(TupleDesc tupleDesc,
--- 734,741 ----
char *data, Size data_size,
uint16 *infomask, bits8 *bit);
extern bool heap_attisnull(HeapTuple tup, int attnum);
+ extern Datum nocachegetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc att, Size *len);
extern Datum nocachegetattr(HeapTuple tup, int attnum,
TupleDesc att);
extern Datum heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
***************
*** 687,692 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 752,765 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_attr_get_length_and_check_equals(TupleDesc tupdesc,
+ int attrnum, HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, PGLZ_Header *encdata);
+ extern void heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/tupmacs.h
--- b/src/include/access/tupmacs.h
***************
*** 187,192 ****
--- 187,214 ----
)
/*
+ * att_getlength -
+ * Gets the length of the attribute.
+ */
+ #define att_getlength(attlen, attptr) \
+ ( \
+ ((attlen) > 0) ? \
+ ( \
+ (attlen) \
+ ) \
+ : (((attlen) == -1) ? \
+ ( \
+ VARSIZE_ANY(attptr) \
+ ) \
+ : \
+ ( \
+ AssertMacro((attlen) == -2), \
+ (strlen((char *) (attptr)) + 1) \
+ )) \
+ )
+
+
+ /*
* store_att_byval is a partial inverse of fetch_att: store a given Datum
* value into a tuple data area at the specified address. However, it only
* handles the byval case, because in typical usage the caller needs to
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 261,266 **** typedef struct CheckpointStatsData
--- 261,267 ----
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+ extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 23,28 **** typedef struct PGLZ_Header
--- 23,31 ----
int32 rawsize;
} PGLZ_Header;
+ /* The LZ algorithm can only represent history offsets in the range 1 - 4095. */
+ #define PGLZ_HISTORY_SIZE 4096
+ #define PGLZ_MAX_MATCH 273
/* ----------
* PGLZ_MAX_OUTPUT -
***************
*** 86,91 **** typedef struct PGLZ_Strategy
--- 89,207 ----
int32 match_size_drop;
} PGLZ_Strategy;
+ /*
+ * Calculate the approximate output length required for history reference tags
+ * covering the given length.
+ */
+ #define PGLZ_GET_HIST_CTRL_BIT_LEN(_len) \
+ ( \
+ ((_len) < 17) ? (3) : (4 * (1 + ((_len) / PGLZ_MAX_MATCH))) \
+ )
+
+ /* ----------
+ * pglz_out_ctrl -
+ *
+ * Outputs the last and allocates a new control byte if needed.
+ * ----------
+ */
+ #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
+ do { \
+ if ((__ctrl & 0xff) == 0) \
+ { \
+ *(__ctrlp) = __ctrlb; \
+ __ctrlp = (__buf)++; \
+ __ctrlb = 0; \
+ __ctrl = 1; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_literal -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+ do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 1; \
+ } while (0)
+
+ /* ----------
+ * pglz_out_tag -
+ *
+ * Outputs a backward/history reference tag of 2-3 bytes (depending on
+ * offset and length) to the destination buffer including the
+ * appropriate control bit.
+ *
+ * The backward/history reference is emitted as multiple chunks if the
+ * given length exceeds the maximum match length, repeating until the
+ * whole length has been processed.
+ *
+ * If the matched history length is less than 3 bytes, it is added as
+ * New Data during encoding instead of a history reference. This occurs
+ * only while framing an EWT.
+ * ----------
+ */
+ #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_byte) \
+ do { \
+ int _mtaglen; \
+ int _tagtotal_len = (_len); \
+ while (_tagtotal_len > 0) \
+ { \
+ _mtaglen = _tagtotal_len > PGLZ_MAX_MATCH ? PGLZ_MAX_MATCH : _tagtotal_len; \
+ if (_mtaglen < 3) \
+ { \
+ char *_data = (char *)(_byte) + (_off); \
+ pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_mtaglen,_data); \
+ break; \
+ } \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_mtaglen > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_mtaglen) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_mtaglen) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+ _tagtotal_len -= _mtaglen; \
+ (_off) += _mtaglen; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_add -
+ *
+ * Outputs a reference tag of 1 byte with length and the new data
+ * to the destination buffer, including the appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+ do { \
+ int32 _maddlen; \
+ int32 _addtotal_len = (_len); \
+ while (_addtotal_len > 0) \
+ { \
+ _maddlen = _addtotal_len > 255 ? 255 : _addtotal_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrl <<= 1; \
+ (_buf)[0] = (unsigned char)(_maddlen); \
+ (_buf) += 1; \
+ memcpy((_buf), (_byte), _maddlen); \
+ (_buf) += _maddlen; \
+ (_byte) += _maddlen; \
+ _addtotal_len -= _maddlen; \
+ } \
+ } while (0)
+
/* ----------
* The standard strategies
***************
*** 108,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
!
#endif /* _PG_LZCOMPRESS_H_ */
--- 224,229 ----
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
! extern void pglz_decompress_with_history(const char *source, char *dest,
! uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
*** a/src/test/regress/expected/update.out
--- b/src/test/regress/expected/update.out
***************
*** 97,99 **** SELECT a, b, char_length(c) FROM update_test;
--- 97,169 ----
(2 rows)
DROP TABLE update_test;
+ --
+ -- Test to update continuos and non continuos columns
+ --
+ DROP TABLE IF EXISTS update_test;
+ NOTICE: table "update_test" does not exist, skipping
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+ -- update 3 continuos columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+ -- update two non continuos columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ DROP TABLE update_test;
*** a/src/test/regress/sql/update.sql
--- b/src/test/regress/sql/update.sql
***************
*** 59,61 **** UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
--- 59,128 ----
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+ --
+ -- Test to update continuos and non continuos columns
+ --
+
+ DROP TABLE IF EXISTS update_test;
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+
+ SELECT * from update_test;
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+
+ -- update 3 continuos columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+ -- update two non continuos columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+
+ SELECT * from update_test;
+ DROP TABLE update_test;
On 28.01.2013 15:39, Amit Kapila wrote:
Rebased the patch as per HEAD.
I don't like the way heap_delta_encode has intimate knowledge of how the
lz compression works. It feels like a violent punch through the
abstraction layers.
Ideally, you would just pass the old and new tuple to pglz as char *,
and pglz code would find the common parts. But I guess that's too slow,
as that's what I originally suggested and you rejected that approach.
But even if that's not possible on performance grounds, we don't need to
completely blow up the abstraction. pglz can still do the encoding - the
caller just needs to pass it the attribute boundaries to consider for
matches, so that it doesn't need to scan them byte by byte.
I came up with the attached patch. I wrote it to demonstrate the API; I'm not 100% sure the result after decoding is correct.
- Heikki
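To make the decoding side concrete, here is a minimal standalone sketch (mine, for illustration only; not code from either patch) of what reconstructing the new tuple from an EWT amounts to during redo: the old tuple on the page acts as the history buffer, and the encoded record is a stream of "copy [offset,length] from history" and "literal new data" operations. The Op struct and the offsets-from-start-of-history convention are assumptions for readability; the real format packs these operations into control bits and 1-3 byte tags.

/*
 * Toy EWT decoder: rebuild the new tuple from the old tuple (history)
 * plus a list of copy/literal operations.  Illustrative only.
 */
#include <stdio.h>
#include <string.h>

typedef struct
{
    int         is_copy;    /* 1 = copy from history, 0 = literal bytes */
    int         off;        /* copy: offset into the history buffer */
    int         len;        /* number of bytes to copy or emit */
    const char *data;       /* literal: the new bytes */
} Op;

static int
toy_delta_decode(const char *history, const Op *ops, int nops, char *dest)
{
    int         dlen = 0;
    int         i;

    for (i = 0; i < nops; i++)
    {
        if (ops[i].is_copy)
            memcpy(dest + dlen, history + ops[i].off, ops[i].len);
        else
            memcpy(dest + dlen, ops[i].data, ops[i].len);
        dlen += ops[i].len;
    }
    return dlen;
}

int
main(void)
{
    /* old tuple = three 4-byte columns; only the middle column changed */
    const char  oldtup[] = "AAAABBBBCCCC";
    Op          ops[] = {
        {1, 0, 4, NULL},        /* first column: copy from history */
        {0, 0, 4, "XXXX"},      /* second column: literal new data */
        {1, 8, 4, NULL},        /* third column: copy from history */
    };
    char        newtup[16];
    int         len = toy_delta_decode(oldtup, ops, 3, newtup);

    printf("%.*s\n", len, newtup);      /* prints AAAAXXXXCCCC */
    return 0;
}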
Attachments:
wal_update_pglz_with_history-heikki.patch (text/x-diff)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..bbdee4f 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,119 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata)
+{
+ HeapTupleHeader tup = oldtup->t_data;
+ Form_pg_attribute *att = tupleDesc->attrs;
+ bool hasnulls = HeapTupleHasNulls(oldtup);
+ bits8 *bp = oldtup->t_data->t_bits; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ char *tp; /* ptr to tuple data */
+ long off; /* offset in tuple data */
+ int natts;
+ int32 *offsets;
+ int noffsets;
+ int attnum;
+ PGLZ_Strategy strategy;
+
+ /*
+ * Loop through all attributes; if an attribute is modified by the update
+ * operation, store the [Offset,Length] referring to the old tuple version,
+ * up to the last unchanged column, in the EWT as a History Reference;
+ * otherwise store the [Length,Data] from the new tuple version as New Data.
+ */
+ natts = HeapTupleHeaderGetNatts(oldtup->t_data);
+
+ offsets = palloc(natts * sizeof(int32));
+
+ noffsets = 0;
+
+ /* copied from heap_deform_tuple */
+ tp = (char *) tup + tup->t_hoff;
+ off = 0;
+ for (attnum = 0; attnum < natts; attnum++)
+ {
+ Form_pg_attribute thisatt = att[attnum];
+
+ if (hasnulls && att_isnull(attnum, bp))
+ {
+ slow = true; /* can't use attcacheoff anymore */
+ continue;
+ }
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ /*
+ * We can only cache the offset for a varlena attribute if the
+ * offset is already suitably aligned, so that there would be no
+ * pad bytes in any case: then the offset will be valid for either
+ * an aligned or unaligned value.
+ */
+ if (!slow &&
+ off == att_align_nominal(off, thisatt->attalign))
+ thisatt->attcacheoff = off;
+ else
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+
+ if (!slow)
+ thisatt->attcacheoff = off;
+ }
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+
+ offsets[noffsets++] = off;
+ }
+
+ strategy = *PGLZ_strategy_always;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_compress_with_history((char *) oldtup->t_data, oldtup->t_len,
+ (char *) newtup->t_data, newtup->t_len,
+ offsets, noffsets, (PGLZ_Header *) encdata,
+ &strategy);
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_decompress_with_history((char *) encdata,
+ newtup->t_data,
+ &newtup->t_len,
+ (char *) oldtup->t_data,
+ oldtup->t_len);
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57d47e8..789bbe2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,6 +70,7 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
@@ -5765,6 +5766,16 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5774,15 +5785,46 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if ((oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, (char *) &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5809,9 +5851,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6614,7 +6659,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6629,7 +6677,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6689,7 +6737,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6707,7 +6755,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6732,7 +6780,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6795,10 +6843,32 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
+
+ oldtup.t_data = oldtupdata;
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) encoded_data, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6814,7 +6884,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cf2f6e7..9cd6271 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1204,6 +1204,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..c6ba6af 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -373,6 +373,7 @@ do { \
*/
static inline int
pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
+ const char *historyend,
int *lenp, int *offp, int good_match, int good_drop)
{
PGLZ_HistEntry *hent;
@@ -393,7 +394,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
+ thisoff = (historyend ? historyend : ip) - hp;
if (thisoff >= 0x0fff)
break;
@@ -408,12 +409,12 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = 0;
if (len >= 16)
{
- if (memcmp(ip, hp, len) == 0)
+ if ((historyend == NULL || historyend - hp > len) && memcmp(ip, hp, len) == 0)
{
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH && (historyend == NULL || hp < historyend))
{
thislen++;
ip++;
@@ -423,7 +424,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH && (historyend == NULL || hp < historyend))
{
thislen++;
ip++;
@@ -588,7 +589,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
- if (pglz_find_match(hist_start, dp, dend, &match_len,
+ if (pglz_find_match(hist_start, dp, dend, NULL, &match_len,
&match_off, good_match, good_drop))
{
/*
@@ -637,6 +638,176 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Like pglz_compress, but performs delta encoding rather than compression.
+ * The back references are offsets from the end of history data, rather
+ * than current output position. 'hoffsets' is an array of offsets in the
+ * history to consider. We could scan the whole history string for possible
+ * matches, but if the caller has some information on which offsets are
+ * likely to be interesting (attribute boundaries, when encoding tuples, for
+ * example), this is a lot faster.
+ */
+bool
+pglz_compress_with_history(const char *source, int32 slen, const char *history,
+ int32 hlen,
+ int32 *hoffsets,
+ int32 nhoffsets,
+ PGLZ_Header *dest, const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ int hist_next = 0;
+ bool hist_recycle = false;
+ const char *dp = source;
+ const char *dend = source + slen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len;
+ int32 match_off;
+ int32 good_match;
+ int32 good_drop;
+ int32 result_size;
+ int32 result_max;
+ int i;
+ int32 need_rate;
+ const char *historyend = history + hlen;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ /*
+ * Save the original source size in the header.
+ */
+ dest->rawsize = slen;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, sizeof(hist_start));
+
+ /* Populate the history hash from the history string */
+ for (i = 0; i < nhoffsets; i++)
+ {
+ const char *hp = history + hoffsets[i];
+
+ /* Add this offset to history */
+ pglz_hist_add(hist_start, hist_entries,
+ hist_next, hist_recycle,
+ hp, historyend);
+ }
+
+ /*
+ * Compress the source directly into the output buffer.
+ */
+ dp = source;
+ while (dp < dend)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ if (pglz_find_match(hist_start, dp, dend, historyend, &match_len,
+ &match_off, good_match, good_drop))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ SET_VARSIZE_COMPRESSED(dest, result_size + sizeof(PGLZ_Header));
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -647,15 +818,39 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL, 0);
+}
+
+/* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history, int hlen)
+{
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ const char *historyend = history + hlen;
+
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
- srcend = ((const unsigned char *) source) + VARSIZE(source);
+ srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
- destend = dp + source->rawsize;
+ destend = dp + src.rawsize;
+
+ if (destlen)
+ {
+ *destlen = src.rawsize;
+ }
while (sp < srcend && dp < destend)
{
@@ -699,26 +894,38 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
- /*
- * Now we copy the bytes specified by the tag from OUTPUT to
- * OUTPUT. It is dangerous and platform dependent to use
- * memcpy() here, because the copied areas could overlap
- * extremely!
- */
- while (len--)
+ if (history)
+ {
+ /*
+ * Now we copy the bytes specified by the tag from history
+ * to OUTPUT.
+ */
+ memcpy(dp, historyend - off, len);
+ dp += len;
+ }
+ else
{
- *dp = dp[-off];
- dp++;
+ /*
+ * Now we copy the bytes specified by the tag from OUTPUT
+ * to OUTPUT. It is dangerous and platform dependent to
+ * use memcpy() here, because the copied areas could
+ * overlap extremely!
+ */
+ while (len--)
+ {
+ *dp = dp[-off];
+ dp++;
+ }
}
}
else
{
/*
- * An unset control bit means LITERAL BYTE. So we just copy
- * one from INPUT to OUTPUT.
+ * An unset control bit means LITERAL BYTE. So we just
+ * copy one from INPUT to OUTPUT.
*/
- if (dp >= destend) /* check for buffer overrun */
- break; /* do not clobber memory */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
*dp++ = *sp++;
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6128694..9a37b2d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -123,6 +123,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2382,6 +2383,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 1, 99,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 270924a..1825292 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ int flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates as old
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates as new
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates as the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(int))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..042c8b9 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata);
+extern void heap_delta_decode (char *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 72e3242..15f5d5d 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..7a32803 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,8 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_compress_with_history(const char *source, int32 slen, const char *history, int32 hlen, int32 *hoffsets, int32 noffsets, PGLZ_Header *dest, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen, const char *history, int hlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuos and non continuos columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuos columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non continuos columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuos and non continuos columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuos columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non continuos columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
On Tuesday, January 29, 2013 2:53 AM Heikki Linnakangas wrote:
On 28.01.2013 15:39, Amit Kapila wrote:
Rebased the patch as per HEAD.
I don't like the way heap_delta_encode has intimate knowledge of how
the lz compression works. It feels like a violent punch through the
abstraction layers.
Ideally, you would just pass the old and new tuple to pglz as char *,
and pglz code would find the common parts. But I guess that's too slow,
as that's what I originally suggested and you rejected that approach.
But even if that's not possible on performance grounds, we don't need
to completely blow up the abstraction. pglz can still do the encoding -
the caller just needs to pass it the attribute boundaries to consider
for matches, so that it doesn't need to scan them byte by byte.
I came up with the attached patch. I wrote it to demonstrate the API; I'm not 100% sure the result after decoding is correct.
I have checked the patch code and found a few problems.
1. The history should be the old tuple; for that, the call below needs to be changed:
	/*
	return pglz_compress_with_history((char *) oldtup->t_data, oldtup->t_len,
									  (char *) newtup->t_data, newtup->t_len,
									  offsets, noffsets, (PGLZ_Header *) encdata,
									  &strategy);
	*/
	return pglz_compress_with_history((char *) newtup->t_data, newtup->t_len,
									  (char *) oldtup->t_data, oldtup->t_len,
									  offsets, noffsets, (PGLZ_Header *) encdata,
									  &strategy);
2. The offsets array should contain the starting offset of each column. For that, the code below needs to be changed:
		offsets[noffsets++] = off;

		off = att_addlength_pointer(off, thisatt->attlen, tp + off);

		if (thisatt->attlen <= 0)
			slow = true;		/* can't use attcacheoff anymore */

		/* offsets[noffsets++] = off; */
	}
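In other words (a schematic with made-up fixed-width columns, not the actual heaptuple.c loop): each column's start offset has to be recorded before the running offset is advanced past that column's data.

/*
 * Sketch of the corrected offset collection: record the start offset
 * first, then skip over the column's data.
 */
#include <stdio.h>

int
main(void)
{
    int     collens[] = {8, 1, 25, 4};  /* made-up column lengths */
    int     offsets[4];
    int     noffsets = 0;
    int     off = 0;
    int     attnum;

    for (attnum = 0; attnum < 4; attnum++)
    {
        offsets[noffsets++] = off;      /* start of this column */
        off += collens[attnum];         /* then advance past its data */
    }

    for (attnum = 0; attnum < noffsets; attnum++)
        printf("column %d starts at offset %d\n", attnum, offsets[attnum]);
    return 0;
}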
Apart from this, some of the test cases are failing, which I need to check.
I have debugged the new code, and it appears that it will not be as
efficient as the current approach of the patch.
It needs to build a hash table for the history reference and comparison, which
can add overhead compared to the existing approach. I am collecting the
performance and WAL reduction data.
Could there be another way to make the current patch code better, so that we
don't need to change the encoding approach? I have a feeling that the new
approach might not be equally good performance-wise.
With Regards,
Amit Kapila.
On 29.01.2013 11:58, Amit Kapila wrote:
Could there be another way to make the current patch code better, so that we
don't need to change the encoding approach? I have a feeling that the new
approach might not be equally good performance-wise.
The point is that I don't want heap_delta_encode() to know the
internals of pglz compression. You could probably make my patch behave
more like yours by also passing an array of offsets in the new tuple,
and only checking for matches at those offsets.
- Heikki
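For illustration, here is a standalone sketch (assumed names, not the patch code) of that suggestion: walk the per-column start offsets of the old and new tuples in lockstep, compare each column with memcmp, and emit a history reference for an unchanged column and literal new data for a changed one. The printed COPY/ADD lines stand in for the real control-bit and tag encoding, and alignment, NULLs and varlena columns are ignored.

/*
 * Toy column-wise delta encoder driven by caller-supplied offsets.
 * Illustrative only.
 */
#include <stdio.h>
#include <string.h>

static void
toy_delta_encode(const char *hist, const int *hoff, int hlen,
                 const char *newp, const int *noff, int nlen,
                 int ncols)
{
    int         i;

    for (i = 0; i < ncols; i++)
    {
        int     hstart = hoff[i];
        int     hend = (i + 1 < ncols) ? hoff[i + 1] : hlen;
        int     nstart = noff[i];
        int     nend = (i + 1 < ncols) ? noff[i + 1] : nlen;

        if (hend - hstart == nend - nstart &&
            memcmp(hist + hstart, newp + nstart, hend - hstart) == 0)
            printf("COPY off=%d len=%d (column %d unchanged)\n",
                   hstart, hend - hstart, i);
        else
            printf("ADD  len=%d (column %d as literal new data)\n",
                   nend - nstart, i);
    }
}

int
main(void)
{
    /* two 12-byte "tuples" with three 4-byte columns; column 1 differs */
    const char  oldtup[] = "AAAABBBBCCCC";
    const char  newtup[] = "AAAAXXXXCCCC";
    int         offs[] = {0, 4, 8};

    toy_delta_encode(oldtup, offs, 12, newtup, offs, 12, 3);
    return 0;
}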
On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:
On 29.01.2013 11:58, Amit Kapila wrote:
Could there be another way to make the current patch code better, so that we
don't need to change the encoding approach? I have a feeling that the new
approach might not be equally good performance-wise.
The point is that I don't want heap_delta_encode() to know the
internals of pglz compression. You could probably make my patch behave
more like yours by also passing an array of offsets in the new tuple,
and only checking for matches at those offsets.
I think it makes sense, because if we have the offsets of both the new and old
tuples, we can internally use memcmp to compare the columns and use the same
algorithm for encoding.
I will change the patch according to this suggestion.
With Regards,
Amit Kapila.
On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:
On 29.01.2013 11:58, Amit Kapila wrote:
Could there be another way to make the current patch code better, so that we
don't need to change the encoding approach? I have a feeling that the new
approach might not be equally good performance-wise.
The point is that I don't want heap_delta_encode() to know the
internals of pglz compression. You could probably make my patch behave
more like yours by also passing an array of offsets in the new tuple,
and only checking for matches at those offsets.
I think it makes sense, because if we have the offsets of both the new and old
tuples, we can internally use memcmp to compare the columns and use the same
algorithm for encoding.
I will change the patch according to this suggestion.
I have modified the patch as per the above suggestion.
Apart from passing the new and old tuple offsets, I have also passed the bitmap
length, as we need to copy the new tuple's bitmap as-is into the Encoded WAL
Tuple.
Please see whether such an API design is okay.
I shall update the README and send the performance/WAL reduction data for the
modified patch tomorrow.
With Regards,
Amit Kapila.
Attachments:
wal_update_changes_v10.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,66 ****
--- 60,70 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
+ #include "utils/pg_lzcompress.h"
+ /* guc variable for EWT compression ratio*/
+ int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
***************
*** 69,74 ****
--- 73,80 ----
#define VARLENA_ATT_IS_PACKABLE(att) \
((att)->attstorage != 'p')
+ static void heap_get_attr_offsets (TupleDesc tupleDesc, HeapTuple Tuple,
+ int32 **offsets, int *noffsets);
/* ----------------------------------------------------------------
* misc support routines
***************
*** 617,622 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 623,766 ----
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+ /* ----------------
+ * heap_get_attr_offsets
+ *
+ * Given a tuple, extract its column starting offsets, including NULL
+ * columns. For NULL columns the offset will be the same as the next
+ * attribute's offset.
+ * ----------------
+ */
+ static void
+ heap_get_attr_offsets (TupleDesc tupleDesc, HeapTuple Tuple,
+ int32 **offsets, int *noffsets)
+ {
+ HeapTupleHeader tup = Tuple->t_data;
+ Form_pg_attribute *att = tupleDesc->attrs;
+ bool hasnulls = HeapTupleHasNulls(Tuple);
+ bits8 *bp = Tuple->t_data->t_bits; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ char *tp; /* ptr to tuple data */
+ long off; /* offset in tuple data */
+ int natts;
+ int attnum;
+
+ natts = HeapTupleHeaderGetNatts(Tuple->t_data);
+
+ *offsets = palloc(natts * sizeof(int32));
+
+ *noffsets = 0;
+
+ /* copied from heap_deform_tuple */
+ tp = (char *) tup + tup->t_hoff;
+ off = 0;
+ for (attnum = 0; attnum < natts; attnum++)
+ {
+ Form_pg_attribute thisatt = att[attnum];
+
+ if (hasnulls && att_isnull(attnum, bp))
+ {
+ slow = true; /* can't use attcacheoff anymore */
+ (*offsets)[(*noffsets)++] = off;
+ continue;
+ }
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ /*
+ * We can only cache the offset for a varlena attribute if the
+ * offset is already suitably aligned, so that there would be no
+ * pad bytes in any case: then the offset will be valid for either
+ * an aligned or unaligned value.
+ */
+ if (!slow &&
+ off == att_align_nominal(off, thisatt->attalign))
+ thisatt->attcacheoff = off;
+ else
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+
+ if (!slow)
+ thisatt->attcacheoff = off;
+ }
+
+ (*offsets)[(*noffsets)++] = off;
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+
+ }
+
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata)
+ {
+ int32 *hoffsets,
+ *newoffsets;
+ int noffsets;
+ PGLZ_Strategy strategy;
+ int32 newbitmaplen,
+ hbitmpalen;
+
+ newbitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ hbitmpalen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * Deform and get the old and new tuple column boundary offsets. Which are
+ * required for calculating delta between old and new tuples.
+ */
+ heap_get_attr_offsets(tupleDesc, oldtup, &hoffsets, &noffsets);
+ heap_get_attr_offsets(tupleDesc, newtup, &newoffsets, &noffsets);
+
+ strategy = *PGLZ_strategy_always;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_compress_with_history((char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ newoffsets, hoffsets, noffsets,
+ newbitmaplen, hbitmpalen,
+ (PGLZ_Header *) encdata, &strategy);
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+ void
+ heap_delta_decode(char *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *) encdata,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 70,75 ****
--- 70,76 ----
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+ #include "utils/pg_lzcompress.h"
/* GUC variable */
***************
*** 5765,5770 **** log_heap_update(Relation reln, Buffer oldbuf,
--- 5766,5781 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 5774,5788 **** log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 5785,5830 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if ((oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, (char *) &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 5809,5817 **** log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 5851,5862 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
! * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 6614,6620 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 6659,6668 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 6629,6635 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 6677,6683 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 6689,6695 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
--- 6737,6743 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
***************
*** 6707,6713 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 6755,6761 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 6732,6738 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 6780,6786 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 6795,6804 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 6843,6874 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the record is EWT then decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /*
! * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
! * + New data (1 byte length + variable data)+ ...
! */
! PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode((char *) encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 6814,6820 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 6884,6890 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 1209,1214 **** begin:;
--- 1209,1236 ----
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+ bool
+ XLogCheckBufferNeedsBackup(Buffer buffer)
+ {
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+ }
+
+ /*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 471,476 **** pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
--- 471,516 ----
return 0;
}
+ /* ----------
+ * pglz_find_match_with_history -
+ *
+ * Lookup the history table if the actual input stream matches
+ * another sequence of characters, starting somewhere earlier
+ * in the input buffer.
+ * ----------
+ */
+ static inline int
+ pglz_find_match_with_history(const char *input, const char *end,
+ const char *history, const char *hend, int *lenp)
+ {
+ const char *ip = input;
+ const char *hp = history;
+
+ /*
+ * Determine length of match. A better match must be larger than the
+ * best so far. And if we already have a match of 16 or more bytes,
+ * it's worth the call overhead to use memcmp() to check if this match
+ * is equal for the same size. After that we must fallback to
+ * character by character comparison to know the exact position where
+ * the diff occurred.
+ */
+ while (ip < end && hp < hend && *ip == *hp && *lenp < PGLZ_MAX_MATCH)
+ {
+ (*lenp)++;
+ ip++;
+ hp++;
+ }
+
+ /*
+ * Return match information only if it results at least in one byte
+ * reduction.
+ */
+ if (*lenp > 2)
+ return 1;
+
+ return 0;
+ }
+
/* ----------
* pglz_compress -
***************
*** 637,642 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
--- 677,879 ----
return true;
}
+ /* ----------
+ * pglz_compress_with_history
+ *
+ * Like pglz_compress, but performs delta encoding rather than compression.
+ * The references are offsets from the start of history data, rather
+ * than the current output position. 'hoffsets' and 'newoffsets' are arrays
+ * of offsets in the history and source to consider. We could scan the whole
+ * history string for possible matches, but restricting the comparison to
+ * offsets the caller knows to be interesting (attribute boundaries, when
+ * encoding tuples, for example) is a lot faster.
+ * For attributes having NULL value, the offset will be same as next attribute
+ * offset. When old tuple contains NULL and new tuple has non-NULL value,
+ * it will copy it as New Data in Encoded WAL Tuple. When new tuple has NULL
+ * value and old tuple has non-NULL value, the old tuple value will be ignored.
+ * ----------
+ */
+ bool
+ pglz_compress_with_history(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ int32 *newoffsets, int32 *hoffsets, int32 noffsets,
+ int32 newbitmaplen, int32 hbitmaplen,
+ PGLZ_Header *dest, const PGLZ_Strategy *strategy)
+ {
+ unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int i;
+ int32 need_rate;
+ const char *hp = history;
+ const char *hend = history + hlen;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ /*
+ * Save the original source size in the header.
+ */
+ dest->rawsize = slen;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ /*
+ * Copy the source directly into the output buffer up to newbitmaplen.
+ */
+ dend = source + newbitmaplen;
+ while (dp < dend)
+ {
+ if (bp - bstart >= result_max)
+ return false;
+
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through all attribute offsets; if the attribute data differs from
+ * the history at the corresponding offsets, store an [Offset,Length] tag
+ * referring to the history version up to the match, and store the changed
+ * data as New data.
+ */
+ match_off = hbitmaplen;
+ hp = history + hbitmaplen;
+ for (i = 0; i < noffsets; i++)
+ {
+ dend = source + ((i + 1 == noffsets) ? slen : newoffsets[i + 1] + newbitmaplen);
+ hend = history + ((i + 1 == noffsets) ? hlen : hoffsets[i + 1] + hbitmaplen);
+
+ MATCH_AGAIN:
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ if (pglz_find_match_with_history(dp + match_len, dend, hp + match_len,
+ hend, &match_len))
+ {
+ found_match = true;
+
+ /* Finding the maximum match across the offsets */
+ if ((i + 1 == noffsets)
+ || ((dp + match_len) < dend)
+ || ((hp + match_len < hend)))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ match_off += match_len;
+ dp += match_len;
+ hp += match_len;
+
+ if (match_len == PGLZ_MAX_MATCH)
+ {
+ match_len = 0;
+ goto MATCH_AGAIN;
+ }
+ else
+ {
+ hp = hend;
+ match_off = hend - history;
+ match_len = 0;
+ }
+ }
+ }
+ else
+ {
+ hp = hend;
+ match_off = hend - history;
+ match_len = 0;
+ }
+
+ /* copy the unmatched data to output buffer directly from source */
+ while ((dp + match_len) < dend)
+ {
+ if (bp - bstart >= result_max)
+ return false;
+
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+ #ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+ #endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ SET_VARSIZE_COMPRESSED(dest, result_size + sizeof(PGLZ_Header));
+
+ return true;
+ }
/* ----------
* pglz_decompress -
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 884,921 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 699,724 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
{
/*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
*/
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
*dp++ = *sp++;
}
--- 959,996 ----
break;
}
! if (history)
! {
! /*
! * Now we copy the bytes specified by the tag from history
! * to OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
{
! /*
! * Now we copy the bytes specified by the tag from OUTPUT
! * to OUTPUT. It is dangerous and platform dependent to
! * use memcpy() here, because the copied areas could
! * overlap extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
{
/*
! * An unset control bit means LITERAL BYTE. So we just
! * copy one from INPUT to OUTPUT.
*/
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
*dp++ = *sp++;
}
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 123,128 **** extern int CommitSiblings;
--- 123,129 ----
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+ extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
***************
*** 2382,2387 **** static struct config_int ConfigureNamesInt[] =
--- 2383,2399 ----
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 1, 99,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 147,159 **** typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 147,168 ----
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! int flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
! * page's all-visible
! * bit is cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
! * page's all-visible
! * bit is cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
! * update operation is
! * delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(int))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 687,692 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 687,697 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata);
+ extern void heap_delta_decode (char *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 261,266 **** typedef struct CheckpointStatsData
--- 261,267 ----
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+ extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 107,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
--- 107,119 ----
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+ extern bool pglz_compress_with_history(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ int32 *newoffsets, int32 *hoffsets, int32 noffsets,
+ int32 newbitmaplen, int32 hbitmaplen,
+ PGLZ_Header *dest, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+ extern void pglz_decompress_with_history(const char *source, char *dest,
+ uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
*** a/src/test/regress/expected/update.out
--- b/src/test/regress/expected/update.out
***************
*** 97,99 **** SELECT a, b, char_length(c) FROM update_test;
--- 97,169 ----
(2 rows)
DROP TABLE update_test;
+ --
+ -- Test to update continuos and non continuos columns
+ --
+ DROP TABLE IF EXISTS update_test;
+ NOTICE: table "update_test" does not exist, skipping
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+ -- update 3 continuos columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+ -- update two non continuos columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ DROP TABLE update_test;
*** a/src/test/regress/sql/update.sql
--- b/src/test/regress/sql/update.sql
***************
*** 59,61 **** UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
--- 59,128 ----
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+ --
+ -- Test to update continuos and non continuos columns
+ --
+
+ DROP TABLE IF EXISTS update_test;
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+
+ SELECT * from update_test;
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+
+ -- update 3 continuos columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+ -- update two non continuos columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+
+ SELECT * from update_test;
+ DROP TABLE update_test;
On Wednesday, January 30, 2013 8:32 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:
On 29.01.2013 11:58, Amit Kapila wrote:
Can there be another way with which current patch code can be
made
better,
so that we don't need to change the encoding approach, as I am
having
feeling that this might not be performance wise equally good.
The point is that I don't want to heap_delta_encode() to know the
internals of pglz compression. You could probably make my patchmore
like yours in behavior by also passing an array of offsets in the
new tuple to check, and only checking for matches as those offsets.I think it makes sense, because if we have offsets of both new and
old
tuple, we can internally use memcmp to compare columns and use same
algorithm for encoding.
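To illustrate that idea with a minimal sketch (the helper below is hypothetical and only for illustration, not code from either patch): with both offset arrays available, each column can be compared in place, and only the differing ones need to be emitted as new data.

/*
 * Illustrative only. Assumes PostgreSQL's c.h typedefs (int32, bool) and
 * <string.h> for memcmp(). Returns true when the attribute bytes are
 * identical in the old and new tuple versions, so the encoder can emit a
 * history reference instead of copying the data.
 */
static bool
attr_unchanged(const char *newdata, int32 newoff, int32 newlen,
               const char *olddata, int32 oldoff, int32 oldlen)
{
    return newlen == oldlen &&
           memcmp(newdata + newoff, olddata + oldoff, newlen) == 0;
}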
I will change the patch according to this suggestion.
I have modified the patch as per above suggestion.
Apart from passing new and old tuple offsets, I have passed
bitmaplength also, as we need to copy the bitmap of new tuple as it is
into Encoded WAL Tuple.
Please see if such API design is okay?
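For readers skimming the thread, the API shape in question (declarations exactly as in the attached v10 patch; the comments are added here only for orientation) is:

/* heaptuple.c: deform both tuple versions to get per-attribute offsets,
 * then hand plain byte ranges plus those offsets down to pglz. */
extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
                              HeapTuple newtup, char *encdata);

/* pg_lzcompress.c: delta-encode 'source' against 'history', considering
 * matches only at the given attribute offsets; the first newbitmaplen
 * bytes (the new tuple's null bitmap and padding) are copied as-is. */
extern bool pglz_compress_with_history(const char *source, int32 slen,
                                       const char *history, int32 hlen,
                                       int32 *newoffsets, int32 *hoffsets,
                                       int32 noffsets,
                                       int32 newbitmaplen, int32 hbitmaplen,
                                       PGLZ_Header *dest,
                                       const PGLZ_Strategy *strategy);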
I shall update the README and send the performance/WAL Reduction data
for modified patch tomorrow.
Updated patch including comments and README is attached with this mail.
This patch contain exactly same design behavior as per previous.
It takes care of API design suggestion of Heikki.
The performance data is similar, as it is not complete, I shall send that
tomorrow.
With Regards,
Amit Kapila.
Attachments:
wal_update_changes_v10.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,66 ****
--- 60,70 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
+ #include "utils/pg_lzcompress.h"
+ /* guc variable for EWT compression ratio*/
+ int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
***************
*** 69,74 ****
--- 73,80 ----
#define VARLENA_ATT_IS_PACKABLE(att) \
((att)->attstorage != 'p')
+ static void heap_get_attr_offsets(TupleDesc tupleDesc, HeapTuple Tuple,
+ int32 **offsets, int *noffsets);
/* ----------------------------------------------------------------
* misc support routines
***************
*** 617,622 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 623,775 ----
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+ /* ----------------
+ * heap_get_attr_offsets
+ *
+ * Given a heap tuple, extract each attribute's starting offset and return
+ * the result as an array of offsets.
+ * If an attribute is null, its offset will be the end offset of the
+ * previous attribute.
+ * ----------------
+ */
+ static void
+ heap_get_attr_offsets(TupleDesc tupleDesc, HeapTuple Tuple,
+ int32 **offsets, int *noffsets)
+ {
+ HeapTupleHeader tup = Tuple->t_data;
+ Form_pg_attribute *att = tupleDesc->attrs;
+ bool hasnulls = HeapTupleHasNulls(Tuple);
+ bits8 *bp = Tuple->t_data->t_bits; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ char *tp; /* ptr to tuple data */
+ long off; /* offset in tuple data */
+ int natts;
+ int attnum;
+
+ natts = HeapTupleHeaderGetNatts(Tuple->t_data);
+
+ *offsets = palloc(natts * sizeof(int32));
+
+ *noffsets = 0;
+
+ /* copied from heap_deform_tuple */
+ tp = (char *) tup + tup->t_hoff;
+ off = 0;
+ for (attnum = 0; attnum < natts; attnum++)
+ {
+ Form_pg_attribute thisatt = att[attnum];
+
+ if (hasnulls && att_isnull(attnum, bp))
+ {
+ slow = true; /* can't use attcacheoff anymore */
+ (*offsets)[(*noffsets)++] = off;
+ continue;
+ }
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ /*
+ * We can only cache the offset for a varlena attribute if the
+ * offset is already suitably aligned, so that there would be no
+ * pad bytes in any case: then the offset will be valid for either
+ * an aligned or unaligned value.
+ */
+ if (!slow &&
+ off == att_align_nominal(off, thisatt->attalign))
+ thisatt->attcacheoff = off;
+ else
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+
+ if (!slow)
+ thisatt->attcacheoff = off;
+ }
+
+ (*offsets)[(*noffsets)++] = off;
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+
+ }
+
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_Header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata)
+ {
+ int32 *hoffsets,
+ *newoffsets;
+ int noffsets;
+ PGLZ_Strategy strategy;
+ int32 newbitmaplen,
+ hbitmpalen;
+
+ /*
+ * If the lengths of the old and new tuple versions differ by more than
+ * 50%, include the new tuple as-is.
+ */
+ if ((newtup->t_len <= (oldtup->t_len >> 1))
+ || (oldtup->t_len <= (newtup->t_len >> 1)))
+ return false;
+
+ newbitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ hbitmpalen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * Deform and get the attribute offsets for old and new tuple which will
+ * be used for calculating delta between old and new tuples.
+ */
+ heap_get_attr_offsets(tupleDesc, oldtup, &hoffsets, &noffsets);
+ heap_get_attr_offsets(tupleDesc, newtup, &newoffsets, &noffsets);
+
+ strategy = *PGLZ_strategy_always;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_compress_with_history((char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ newoffsets, hoffsets, noffsets,
+ newbitmaplen, hbitmpalen,
+ (PGLZ_Header *) encdata, &strategy);
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+ void
+ heap_delta_decode(char *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *) encdata,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 70,75 ****
--- 70,76 ----
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+ #include "utils/pg_lzcompress.h"
/* GUC variable */
***************
*** 5765,5770 **** log_heap_update(Relation reln, Buffer oldbuf,
--- 5766,5781 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 5774,5788 **** log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 5785,5830 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by an UPDATE
+ * operation. Currently we do it only when both the old and new tuple
+ * versions are on the same page, because during recovery, if the page
+ * containing the old tuple is corrupt, that corruption should not cascade
+ * to other pages. Under the general assumption that for long runs most
+ * updates tend to create the new tuple version on the same page, there
+ * should not be a significant impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from a reduced WAL size.
+ */
+ if ((oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, (char *) &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 5809,5817 **** log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 5851,5862 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
! * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 6614,6620 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 6659,6668 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 6629,6635 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 6677,6683 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 6689,6695 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
--- 6737,6743 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
***************
*** 6707,6713 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 6755,6761 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 6732,6738 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 6780,6786 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 6795,6804 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 6843,6874 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the record is EWT then decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /*
! * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
! * + New data (1 byte length + variable data)+ ...
! */
! PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode((char *) encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 6814,6820 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 6884,6890 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/access/transam/README
--- b/src/backend/access/transam/README
***************
*** 665,670 **** then restart recovery. This is part of the reason for not writing a WAL
--- 665,784 ----
entry until we've successfully done the original action.
+ Encoded WAL Tuple (EWT)
+ -----------------------
+
+ Delta Encoded WAL Tuple (EWT) eliminates the need to copy the entire tuple
+ to WAL for an update operation. An EWT is constructed using pglz by comparing
+ the old and new versions of the tuple with respect to column boundaries. It
+ contains the data from the new tuple for modified columns and [Offset,Length]
+ references into the old tuple version for unchanged columns.
+
+
+ EWT Format
+ ----------
+
+ Header + Control byte + History Reference (2 - 3)bytes
+ + New data (1 byte length + variable data) + ...
+
+
+ Header:
+
+ The header is same as PGLZ_Header, which is used to store the compressed length
+ and raw length.
+
+ Control byte:
+
+ The first byte after the header tells what to do the next 8 times. We call this
+ the control byte.
+
+
+ History Reference:
+
+ A set bit in the control byte means, that a tag of 2-3 bytes follows.
+ A tag contains information to copy some bytes from old tuple version to
+ the current location in the output.
+
+ Details about 2-3 byte Tag
+ A 2 byte tag is used when the length of the history data
+ (unchanged data from the old tuple version) is less than 18.
+ A 3 byte tag is used when the length of the history data
+ (unchanged data from the old tuple version) is greater than or equal to 18.
+ The maximum length that can be represented by one Tag is 273.
+
+ Let's call the three tag bytes T1, T2 and T3. The position of the data
+ to copy is coded as an offset from the old tuple.
+
+ The offset is in the upper nibble of T1 and in T2.
+ The length is in the lower nibble of T1.
+
+ So the 16 bits of a 2 byte tag are coded as
+
+ 7---T1--0 7---T2--0
+ OOOO LLLL OOOO OOOO
+
+ This limits the offset to 1-4095 (12 bits) and the length to 3-18 (4 bits)
+ because 3 is always added to it.
+
+ In the actual implementation, the 2 byte tag's length is limited to 3-17,
+ because the value 0xF in the length nibble has special meaning. It means,
+ that the next following byte (T3) has to be added to the length value of 18.
+ That makes total limits of 1-4095 for offset and 3-273 for length.
+
+
+ New data:
+
+ An unset bit in the control byte represents modified data of the new tuple
+ version. The first byte gives the length [0-255] of the modified data,
+ followed by the modified data of that length.
+
+ 7---T1--0 7---T2--0 ...
+ LLLL LLLL DDDD DDDD ...
+
+ Data bytes repeat until the length of the new data.
+
+
+ L - Length
+ O - Offset
+ D - Data
+
+
+ Encoding Mechanism for EWT
+ --------------------------
+ Copy the bitmap data from new tuple to the EWT (Encoded WAL Tuple)
+ and loop for all attributes to find any modifications in the attributes.
+ The unmodified data is encoded as a History Reference in EWT and the
+ modified data (if NOT NULL) is encoded as New Data in EWT.
+
+ The offset values are calculated with respect to the tuple t_hoff value.
+ The maximum encoded data length is 75% (the default compression rate) of the
+ original data; if the encoded output is longer than that, the original tuple
+ (new tuple version) will be stored directly in the WAL tuple.
+
+
+ Decoding Mechanism for EWT
+ --------------------------
+ Skip the header, read one control byte, and process the next 8 items
+ (or as many as remain in the compressed input). Check each control bit;
+ if the bit is set then it is a History Reference, which means the next
+ 2-3 byte tag provides the offset and length of the history match.
+
+ Use the offset and corresponding length to copy data from old tuple
+ version to new tuple. If the control bit is unset, then it is
+ New Data Reference which means first byte contains the length [0-255]
+ of the modified data, followed by the modified data of corresponding length
+ specified in the first byte.
+
+
+ Constraints for EWT
+ --------------------
+ 1. Delta encoding is allowed only when the update places the new tuple
+ version on the same page and the buffer does not need a backup block
+ (relevant when full_page_writes is on).
+ 2. Only old tuples with length less than PGLZ_HISTORY_SIZE are allowed for
+ encoding.
+ 3. The old and new tuple versions must not vary in length by more than 50%
+ to be allowed for encoding.
+
+
Asynchronous Commit
-------------------
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 1209,1214 **** begin:;
--- 1209,1236 ----
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+ bool
+ XLogCheckBufferNeedsBackup(Buffer buffer)
+ {
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+ }
+
+ /*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 362,367 **** do { \
--- 362,391 ----
} \
} while (0)
+ /* ----------
+ * pglz_out_add -
+ *
+ * Outputs a reference tag of 1 byte with length and the new data
+ * to the destination buffer, including the appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+ do { \
+ int32 _maddlen; \
+ int32 _addtotal_len = (_len); \
+ while (_addtotal_len > 0) \
+ { \
+ _maddlen = _addtotal_len > 255 ? 255 : _addtotal_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrl <<= 1; \
+ (_buf)[0] = (unsigned char)(_maddlen); \
+ (_buf) += 1; \
+ memcpy((_buf), (_byte), _maddlen); \
+ (_buf) += _maddlen; \
+ (_byte) += _maddlen; \
+ _addtotal_len -= _maddlen; \
+ } \
+ } while (0)
/* ----------
* pglz_find_match -
***************
*** 471,476 **** pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
--- 495,539 ----
return 0;
}
+ /* ----------
+ * pglz_find_match_with_history -
+ *
+ * Check whether the actual input stream matches the given history
+ * string, and if so, determine the length of the match.
+ * ----------
+ */
+ static inline int
+ pglz_find_match_with_history(const char *input, const char *end,
+ const char *history, const char *hend, int *lenp)
+ {
+ const char *ip = input;
+ const char *hp = history;
+
+ /*
+ * Determine the length of the match by comparing byte by byte until the
+ * first difference, the end of either buffer, or PGLZ_MAX_MATCH bytes
+ * have been matched.
+ */
+ while (ip < end && hp < hend && *ip == *hp && *lenp < PGLZ_MAX_MATCH)
+ {
+ (*lenp)++;
+ ip++;
+ hp++;
+ }
+
+ /*
+ * Return match information only if it results at least in one byte
+ * reduction.
+ */
+ if (*lenp > 2)
+ return 1;
+
+ return 0;
+ }
+
/* ----------
* pglz_compress -
***************
*** 637,642 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
--- 700,895 ----
return true;
}
+ /* ----------
+ * pglz_compress_with_history
+ *
+ * Like pglz_compress, but performs delta encoding rather than compression.
+ * The references are offsets from the start of history data, rather
+ * than the current output position. 'hoffsets' and 'newoffsets' are arrays of
+ * offsets in the history and source to consider. We scan the history
+ * string for possible matches with the source string at the attribute offsets.
+ *
+ * For attributes having a NULL value, the offset will be the same as the next
+ * attribute's offset. When the old tuple contains NULL and the new tuple has a
+ * non-NULL value, the value is copied as New Data into the Encoded WAL Tuple.
+ * When the new tuple has a NULL value and the old tuple has a non-NULL value,
+ * the old tuple value is ignored.
+ * ----------
+ */
+ bool
+ pglz_compress_with_history(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ int32 *newoffsets, int32 *hoffsets, int32 noffsets,
+ int32 newbitmaplen, int32 hbitmaplen,
+ PGLZ_Header *dest, const PGLZ_Strategy *strategy)
+ {
+ unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int i,
+ len;
+ int32 need_rate;
+ const char *hp = history;
+ const char *hend = history + hlen;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encoding, as this is the maximum size of a history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ /*
+ * Save the original source size in the header.
+ */
+ dest->rawsize = slen;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ /*
+ * Copy the source directly into the output buffer up to newbitmaplen.
+ */
+ if ((bp + newbitmaplen + 2) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, newbitmaplen, dp);
+
+ /*
+ * Loop through all attributes offsets, if the attribute data differs with
+ * history referring offsets, store the [Offset,Length] reffering history
+ * version till the match and store the changed data as New data. We need
+ * to accumulate all the matched attributes till an unmatched one is
+ * found. For the last attribute if it is matched, directly store its
+ * Offset. It can be improved for accumulation of unmatched attributes.
+ */
+ match_off = hbitmaplen;
+ hp = history + hbitmaplen;
+ for (i = 0; i < noffsets; i++)
+ {
+ dend = source + ((i + 1 == noffsets) ? slen : newoffsets[i + 1] + newbitmaplen);
+ hend = history + ((i + 1 == noffsets) ? hlen : hoffsets[i + 1] + hbitmaplen);
+
+ MATCH_AGAIN:
+
+ /* If we already exceeded the maximum result size, fail. */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history. It can match maximum
+ * PGLZ_MAX_MATCH in one pass as history tag can be of 3 bytes. For
+ * match greater than PGLZ_MAX_MATCH, it need to do it in multiple
+ * passes (MATCH_AGAIN).
+ */
+ if (pglz_find_match_with_history(dp + match_len, dend, hp + match_len,
+ hend, &match_len))
+ {
+ found_match = true;
+
+ /* Finding the maximum match across the offsets */
+ if ((i + 1 == noffsets)
+ || ((dp + match_len) < dend)
+ || ((hp + match_len < hend)))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ match_off += match_len;
+ dp += match_len;
+ hp += match_len;
+
+ if (match_len == PGLZ_MAX_MATCH)
+ {
+ match_len = 0;
+ goto MATCH_AGAIN;
+ }
+ else
+ {
+ hp = hend;
+ match_off = hend - history;
+ match_len = 0;
+ }
+ }
+ }
+ else
+ {
+ hp = hend;
+ match_off = hend - history;
+ match_len = 0;
+ }
+
+ /* copy the unmatched data to output buffer directly from source */
+ len = dend - (dp + match_len);
+ if ((bp + len + 2) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, len, dp);
+ }
+
+ if (!found_match)
+ return false;
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+ #ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+ #endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ SET_VARSIZE_COMPRESSED(dest, result_size + sizeof(PGLZ_Header));
+
+ return true;
+ }
/* ----------
* pglz_decompress -
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 900,937 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 665,670 **** pglz_decompress(const PGLZ_Header *source, char *dest)
--- 941,947 ----
*/
unsigned char ctrl = *sp++;
int ctrlc;
+ int32 len;
for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
{
***************
*** 677,683 **** pglz_decompress(const PGLZ_Header *source, char *dest)
* coded as 18, another extension tag byte tells how much
* longer the match really was (0-255).
*/
- int32 len;
int32 off;
len = (sp[0] & 0x0f) + 3;
--- 954,959 ----
***************
*** 699,726 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
{
! /*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
! *dp++ = *sp++;
}
/*
--- 975,1030 ----
break;
}
! if (history)
{
! /*
! * Now we copy the bytes specified by the tag from history
! * to OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
! {
! /*
! * Now we copy the bytes specified by the tag from OUTPUT
! * to OUTPUT. It is dangerous and platform dependent to
! * use memcpy() here, because the copied areas could
! * overlap extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
{
! if (history)
! {
! len = sp[0];
! sp++;
! /*
! * Now we copy the bytes specified by the len from source
! * to OUTPUT.
! */
! memcpy(dp, sp, len);
! sp += len;
! dp += len;
! }
! else
! {
! /*
! * An unset control bit means LITERAL BYTE. So we just
! * copy one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
! }
}
/*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 123,128 **** extern int CommitSiblings;
--- 123,129 ----
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+ extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
***************
*** 2382,2387 **** static struct config_int ConfigureNamesInt[] =
--- 2383,2399 ----
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 1, 99,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 147,159 **** typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 147,168 ----
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! int flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
! * page's all-visible
! * bit is cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
! * page's all-visible
! * bit is cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
! * update operation is
! * delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(int))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 687,692 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 687,697 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata);
+ extern void heap_delta_decode(char *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 261,266 **** typedef struct CheckpointStatsData
--- 261,267 ----
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+ extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 107,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
--- 107,119 ----
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+ extern bool pglz_compress_with_history(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ int32 *newoffsets, int32 *hoffsets, int32 noffsets,
+ int32 newbitmaplen, int32 hbitmaplen,
+ PGLZ_Header *dest, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+ extern void pglz_decompress_with_history(const char *source, char *dest,
+ uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
*** a/src/test/regress/expected/update.out
--- b/src/test/regress/expected/update.out
***************
*** 97,99 **** SELECT a, b, char_length(c) FROM update_test;
--- 97,169 ----
(2 rows)
DROP TABLE update_test;
+ --
+ -- Test to update continuos and non continuos columns
+ --
+ DROP TABLE IF EXISTS update_test;
+ NOTICE: table "update_test" does not exist, skipping
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+ -- update 3 continuos columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+ -- update two non continuos columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ DROP TABLE update_test;
*** a/src/test/regress/sql/update.sql
--- b/src/test/regress/sql/update.sql
***************
*** 59,61 **** UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
--- 59,128 ----
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+ --
+ -- Test to update continuos and non continuos columns
+ --
+
+ DROP TABLE IF EXISTS update_test;
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+
+ SELECT * from update_test;
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+
+ -- update 3 continuos columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+ -- update two non continuos columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+
+ SELECT * from update_test;
+ DROP TABLE update_test;
On Thursday, January 31, 2013 6:44 PM Amit Kapila wrote:
On Wednesday, January 30, 2013 8:32 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:
On 29.01.2013 11:58, Amit Kapila wrote:
Can there be another way with which current patch code can be
made
better,
so that we don't need to change the encoding approach, as I am
having
feeling that this might not be performance wise equally good.
The point is that I don't want to heap_delta_encode() to know the
internals of pglz compression. You could probably make my patchmore
like yours in behavior by also passing an array of offsets in the
new tuple to check, and only checking for matches as thoseoffsets.
I think it makes sense, because if we have offsets of both new and
old
tuple, we can internally use memcmp to compare columns and use same
algorithm for encoding.
I will change the patch according to this suggestion.
I have modified the patch as per above suggestion.
Apart from passing new and old tuple offsets, I have passed
bitmaplength also, as we need to copy the bitmap of new tuple as it is
into Encoded WAL Tuple.
Please see if such API design is okay?
I shall update the README and send the performance/WAL Reduction data
for modified patch tomorrow.Updated patch including comments and README is attached with this mail.
This patch contain exactly same design behavior as per previous.
It takes care of API design suggestion of Heikki.The performance data is similar, as it is not complete, I shall send
that tomorrow.
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are same as my previous patch):
1. With original pgbench there is a max 7% WAL reduction with not much
performance difference.
2. With 250 record pgbench there is a max wal reduction of 35% with not much
performance difference.
3. With 500 and above record size in pgbench there is an improvement in the
performance and wal reduction both.
If the record size increases there is a gain in performance and wal size is
reduced as well.
Performance data for synchronous_commit = on is under progress, I shall post
it once it is done.
I am expecting it to be same as previous.
With Regards,
Amit Kapila.
Attachments:
On Friday, February 01, 2013 6:37 PM Amit Kapila wrote:
On Thursday, January 31, 2013 6:44 PM Amit Kapila wrote:
On Wednesday, January 30, 2013 8:32 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:
On 29.01.2013 11:58, Amit Kapila wrote:
Can there be another way with which current patch code can be
made
better,
so that we don't need to change the encoding approach, as I
am
having
feeling that this might not be performance wise equally good.
The point is that I don't want heap_delta_encode() to know
the internals of pglz compression. You could probably make my
patch more
like yours in behavior by also passing an array of offsets in
the new tuple to check, and only checking for matches at those offsets.
I think it makes sense, because if we have the offsets of both the new
and old tuple, we can internally use memcmp to compare columns and use
the same algorithm for encoding.
I will change the patch according to this suggestion. I have modified the patch as per the above suggestion.
Apart from passing the new and old tuple offsets, I have passed the
bitmap length also, as we need to copy the bitmap of the new tuple as it is
into the Encoded WAL Tuple.
Please see if such an API design is okay?
I shall update the README and send the performance/WAL reduction
data for the modified patch tomorrow. The updated patch, including comments and README, is attached with this
mail.
This patch contains exactly the same design and behavior as the previous one.
It takes care of Heikki's API design suggestion. The performance data is similar; as it is not complete, I shall send
it tomorrow. Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
1. With the original pgbench there is a max 7% WAL reduction with not much
performance difference.
2. With 250-byte records in pgbench there is a max WAL reduction of 35% with not
much performance difference.
3. With 500-byte and larger records in pgbench there is an improvement in
both performance and WAL reduction. As the record size increases there is a gain in performance and the WAL
size is reduced as well. Performance data for synchronous_commit = on is in progress; I shall
post it once it is done.
I am expecting it to be the same as before.
Please find the performance readings for synchronous_commit = on.
Each run is taken for 20 min.
Conclusions from the readings with synchronous_commit = on:
1. With the original pgbench there is a max 2% WAL reduction with not much
performance difference.
2. With 500-byte records in pgbench there is a max WAL reduction of 3% with not much
performance difference.
3. With 1800-byte records in pgbench there is both an improvement in
performance (approx 3%) and a WAL reduction (44%).
If the record size increases there is a very good reduction in WAL size.
Please provide your feedback.
With Regards,
Amit Kapila.
Attachments:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
1. With the original pgbench there is a max 7% WAL reduction with not much
performance difference.
2. With 250-byte records in pgbench there is a max WAL reduction of 35% with not
much performance difference.
3. With 500-byte and larger records in pgbench there is an improvement in
both performance and WAL reduction. As the record size increases there is a gain in performance and the WAL
size is reduced as well. Performance data for synchronous_commit = on is in progress; I shall
post it once it is done.
I am expecting it to be the same as before. Please find the performance readings for synchronous_commit = on.
Each run is taken for 20 min.
Conclusions from the readings with synchronous_commit = on:
1. With the original pgbench there is a max 2% WAL reduction with not much
performance difference.
2. With 500-byte records in pgbench there is a max WAL reduction of 3% with not much
performance difference.
3. With 1800-byte records in pgbench there is both an improvement in
performance (approx 3%) and a WAL reduction (44%). If the record size increases there is a very good reduction in WAL size.
The stats look fairly sane. I'm a little concerned about the apparent
trend of falling TPS in the patched vs original tests for the 1-client
test as record size increases, but it's only 0.0%->0.2%->0.4%, and the
0.4% case made other config changes too. Nonetheless, it might be wise
to check with really big records and see if the trend continues.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
1. With the original pgbench there is a max 7% WAL reduction with not much
performance difference.
2. With 250-byte records in pgbench there is a max WAL reduction of 35% with not
much performance difference.
3. With 500-byte and larger records in pgbench there is an improvement in
both performance and WAL reduction.
If the record size increases there is a gain in performance and the WAL
size is reduced as well. Performance data for synchronous_commit = on is in progress; I
shall post it once it is done.
I am expecting it to be the same as before. Please find the performance readings for synchronous_commit = on.
Each run is taken for 20 min.
Conclusions from the readings with synchronous_commit = on:
1. With the original pgbench there is a max 2% WAL reduction with not much
performance difference.
2. With 500-byte records in pgbench there is a max WAL reduction of 3% with not much
performance difference.
3. With 1800-byte records in pgbench there is both an improvement in
performance (approx 3%) and a WAL reduction (44%).
If the record size increases there is a very good reduction in WAL
size.
The stats look fairly sane. I'm a little concerned about the apparent
trend of falling TPS in the patched vs original tests for the 1-client
test as record size increases, but it's only 0.0%->0.2%->0.4%, and the
0.4% case made other config changes too. Nonetheless, it might be wise
to check with really big records and see if the trend continues.
For bigger records (~2000 bytes), the data goes into TOAST, for which we don't do
this optimization.
This optimization is mainly for medium-size records.
With Regards,
Amit Kapila.
On 04.03.2013 06:39, Amit Kapila wrote:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
1. With the original pgbench there is a max 7% WAL reduction with not much
performance difference.
2. With 250-byte records in pgbench there is a max WAL reduction of 35% with not
much performance difference.
3. With 500-byte and larger records in pgbench there is an improvement in
both performance and WAL reduction.
If the record size increases there is a gain in performance and the WAL
size is reduced as well. Performance data for synchronous_commit = on is in progress; I
shall post it once it is done.
I am expecting it to be the same as before. Please find the performance readings for synchronous_commit = on.
Each run is taken for 20 min.
Conclusions from the readings with synchronous_commit = on:
1. With the original pgbench there is a max 2% WAL reduction with not much
performance difference.
2. With 500-byte records in pgbench there is a max WAL reduction of 3% with not much
performance difference.
3. With 1800-byte records in pgbench there is both an improvement in
performance (approx 3%) and a WAL reduction (44%).
If the record size increases there is a very good reduction in WAL
size.
The stats look fairly sane. I'm a little concerned about the apparent
trend of falling TPS in the patched vs original tests for the 1-client
test as record size increases, but it's only 0.0%->0.2%->0.4%, and the
0.4% case made other config changes too. Nonetheless, it might be wise
to check with really big records and see if the trend continues. For bigger records (~2000 bytes), the data goes into TOAST, for which we don't do
this optimization.
This optimization is mainly for medium-size records.
I've been investigating the pglz option further, and doing
performance comparisons of the pglz approach and this patch. I'll begin
with some numbers:
unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245525360 | 9.94613695144653
two short fields, one changed | 1245536528 | 10.146910905838
two short fields, both changed | 1245523160 | 11.2332470417023
one short and one long field, no change | 1054926504 | 5.90477800369263
ten tiny fields, all changed | 1411774608 | 13.4536008834839
hundred tiny fields, all changed | 635739680 | 7.57448387145996
hundred tiny fields, half changed | 636930560 | 7.56888699531555
hundred tiny fields, half nulled | 573751120 | 6.68991994857788
Amit's wal_update_changes_v10.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1249722112 | 13.0558869838715
two short fields, one changed | 1246145408 | 12.9947438240051
two short fields, both changed | 1245951056 | 13.0262880325317
one short and one long field, no change | 678480664 | 5.70031690597534
ten tiny fields, all changed | 1328873920 | 20.0167419910431
hundred tiny fields, all changed | 638149416 | 14.4236788749695
hundred tiny fields, half changed | 635560504 | 14.8770561218262
hundred tiny fields, half nulled | 558468352 | 16.2437210083008
pglz-with-micro-optimizations-1.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245519008 | 11.6702048778534
two short fields, one changed | 1245756904 | 11.3233819007874
two short fields, both changed | 1249711088 | 11.6836447715759
one short and one long field, no change | 664741392 | 6.44810795783997
ten tiny fields, all changed | 1328085568 | 13.9679481983185
hundred tiny fields, all changed | 635974088 | 9.15514206886292
hundred tiny fields, half changed | 636309040 | 9.13769292831421
hundred tiny fields, half nulled | 496396448 | 8.77351498603821
In each test, a table is created with a large number of identical rows,
and fillfactor=50. Then a full-table UPDATE is performed, and the UPDATE
is timed. Duration is the time spent in the UPDATE (lower is better),
and wal_generated is the amount of WAL generated by the updates (lower
is better).
The summary is that Amit's patch is a small win in terms of CPU usage,
in the best case where the table has few columns, with one large column
that is not updated. In all other cases it just adds overhead. In terms
of WAL size, you get a big gain in the same best case scenario.
Attached is a different version of this patch, which uses the pglz
algorithm to spot the similarities between the old and new tuple,
instead of having explicit knowledge of where the column boundaries are.
This has the advantage that it will spot similarities, and be able to
compress, in more cases. For example, you can see a reduction in WAL
size in the "hundred tiny fields, half nulled" test case above.
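To make the approach concrete, here is a minimal round-trip sketch of the
pglz_delta_encode/pglz_delta_decode functions added by the attached patch. It
is illustrative only and not a standalone compilable program; newdata/newlen
and olddata/oldlen are hypothetical stand-ins for the new and old tuple
bodies, and the real callers are log_heap_update() and heap_xlog_update() in
the patch.

    char        encoded[BLCKSZ];    /* large enough for any heap tuple's EWT */
    char        decoded[BLCKSZ];
    uint32      enclen;
    uint32      declen;

    if (pglz_delta_encode(newdata, newlen,      /* data to encode      */
                          olddata, oldlen,      /* history = old tuple */
                          encoded, &enclen,
                          NULL))                /* default strategy    */
    {
        /* enclen bytes of EWT go into the WAL record instead of newlen bytes */
        pglz_delta_decode(encoded, enclen,
                          decoded, sizeof(decoded), &declen,
                          olddata, oldlen);
        Assert(declen == newlen);
        Assert(memcmp(decoded, newdata, newlen) == 0);
    }
    else
    {
        /* not compressible enough: the full new tuple is logged as before */
    }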
The attached patch also just adds overhead in most cases, but the
overhead is much smaller in the worst case. I think that's the right
tradeoff here - we want to avoid scenarios where performance falls off
the cliff. That said, if you usually just get a slowdown, we certainly
can't make this the default, and if we can't turn it on by default, this
probably just isn't worth it.
The attached patch contains the variable-hash-size changes I posted in
the "Optimizing pglz compressor". But in the delta encoding function, it
goes further than that, and contains some further micro-optimizations:
the hash is calculated in a rolling fashion, and it uses a specialized
version of the pglz_hist_add macro that knows that the input can't
exceed 4096 bytes. Those changes shaved off some cycles, but you could
probably do more. One idea is to only add every 10 bytes or so to the
history lookup table; that would sacrifice some compressibility for speed.
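As a rough, untested sketch of that last idea (not part of the attached
patch), the history-building loop in pglz_delta_encode() could sample
positions instead of inserting every one. The macros and variables below are
those from the patch; HIST_SAMPLE is an assumed tuning constant.

    #define HIST_SAMPLE 10          /* assumed sampling interval */

    int     sample = 0;

    pglz_hash_init(hp, hindex, a, b, c, d);
    while (hp < hend - 4)
    {
        pglz_hash_roll(hp, hindex, a, b, c, d, mask);

        /* Insert only every HIST_SAMPLE'th position into the lookup table. */
        if (sample++ % HIST_SAMPLE == 0)
            pglz_hist_add_no_recycle(hist_start, hist_entries,
                                     hist_next,
                                     hp, hend, hindex);
        hp++;                       /* as in the patch, keep the ++ out of the macro */
    }

This trades fewer candidate matches (so potentially worse compression) for a
cheaper setup phase over the old tuple.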
If you could squeeze pglz_delta_encode function to be cheap enough that
we could enable this by default, this would be pretty cool patch. Or at
least, the overhead in the cases that you get no compression needs to be
brought down, to about 2-5 % at most I think. If it can't be done
easily, I feel that this probably needs to be dropped.
PS. I haven't done much testing of WAL redo, so it's quite possible that
the encoding is actually buggy, or that decoding is slow. But I don't
think there's anything so fundamentally wrong that it would affect the
performance results much.
- Heikki
Attachments:
pglz-with-micro-optimizations-1.patch (text/x-diff)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..d6458b2 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len);
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d226726..5a9bea9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5803,6 +5805,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5812,15 +5820,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to backup the whole bolck in
+ * WAL as in that case there is no saving by reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5847,9 +5887,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6652,7 +6695,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6667,7 +6713,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6727,7 +6773,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6745,7 +6791,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6770,7 +6816,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6833,10 +6879,30 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ oldtup.t_data = oldtupdata;
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6852,7 +6918,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d960bbc..c721392 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1236,6 +1236,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..96c5c61b 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,9 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +243,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +261,40 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be use to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a,b,c,d are local variables these macros use to store state. These macros
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -276,32 +308,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * An version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+} while (0)
+
/* ----------
* pglz_out_ctrl -
@@ -372,30 +421,44 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ maxlen = PGLZ_MAX_MATCH;
+ if (input - end > maxlen)
+ maxlen = input - end;
+ if (hend && (hend - hp > maxlen))
+ maxlen = hend - hp;
/*
* Stop if the offset does not fit into our tag anymore.
*/
+ if (!hend)
+ {
thisoff = ip - hp;
if (thisoff >= 0x0fff)
break;
+ }
+ else
+ thisoff = hend - hp;
/*
* Determine length of match. A better match must be larger than the
@@ -413,7 +476,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +486,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -443,13 +506,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +534,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For
+ * a small input, the startup cost dominates. The table size must be
+ * a power of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -484,7 +570,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +586,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +643,21 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +680,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +691,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +708,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +731,200 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 a,b,c,d;
+ int32 hindex;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ if (hend - hp > PGLZ_HISTORY_SIZE)
+ hp = hend - PGLZ_HISTORY_SIZE;
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (hp < hend - 4)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the
+ * new data, too.
+ */
+ pglz_hash_roll(hp, hindex, a,b,c,d, mask);
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex);
+ hp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ while (dp < dend - 4)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp,hindex,a,b,c,d,mask);
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1028,107 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * Otherwise it contains the match length minus 3 and the
+ * upper 4 bits of the offset. The next following byte
+ * contains the lower 8 bits of the offset. If the length is
+ * coded as 18, another extension tag byte tells how much
+ * longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history
+ * to OUTPUT.
+ */
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just
+ * copy one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 1;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98149fc..0e50f6d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -123,6 +123,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2383,6 +2384,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 270924a..5a40457 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates as old
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates as new
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates as the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..1ef550b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8a65492..df178a1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuos and non continuos columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuos columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non continuos columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuos and non continuos columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuos columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non continuos columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
On 04.03.2013 06:39, Amit Kapila wrote:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
I've been investigating the pglz option further, and doing
performance comparisons of the pglz approach and this patch. I'll begin
with some numbers:
unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245525360 | 9.94613695144653
two short fields, one changed | 1245536528 | 10.146910905838
two short fields, both changed | 1245523160 | 11.2332470417023
one short and one long field, no change | 1054926504 | 5.90477800369263
ten tiny fields, all changed | 1411774608 | 13.4536008834839
hundred tiny fields, all changed | 635739680 | 7.57448387145996
hundred tiny fields, half changed | 636930560 | 7.56888699531555
hundred tiny fields, half nulled | 573751120 | 6.68991994857788
Amit's wal_update_changes_v10.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1249722112 | 13.0558869838715
two short fields, one changed | 1246145408 | 12.9947438240051
two short fields, both changed | 1245951056 | 13.0262880325317
one short and one long field, no change | 678480664 | 5.70031690597534
ten tiny fields, all changed | 1328873920 | 20.0167419910431
hundred tiny fields, all changed | 638149416 | 14.4236788749695
hundred tiny fields, half changed | 635560504 | 14.8770561218262
hundred tiny fields, half nulled | 558468352 | 16.2437210083008
pglz-with-micro-optimizations-1.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245519008 | 11.6702048778534
two short fields, one changed | 1245756904 | 11.3233819007874
two short fields, both changed | 1249711088 | 11.6836447715759
one short and one long field, no change | 664741392 | 6.44810795783997
ten tiny fields, all changed | 1328085568 | 13.9679481983185
hundred tiny fields, all changed | 635974088 | 9.15514206886292
hundred tiny fields, half changed | 636309040 | 9.13769292831421
hundred tiny fields, half nulled | 496396448 | 8.77351498603821
For some of the tests, it doesn't even execute the main part of the
compression/encoding.
The reason is that the length of the tuple is less than the strategy's minimum length, so
it returns from the below check
in function pglz_delta_encode():
if (strategy->match_size_good <= 0 ||
slen < strategy->min_input_size ||
slen > strategy->max_input_size)
return false;
The tests for which it doesn't execute encoding are below:
two short fields, no change
two short fields, one changed
two short fields, both changed
ten tiny fields, all changed
For the above cases, the difference in timings between both approaches and
the original could be because
this check is done after some processing. So I think if we check the length
in log_heap_update, then
there should not be a timing difference for the above test scenarios. I can check
that once.
This optimization helps only when the tuple length is greater than about 128~200 bytes and
up to 1800 bytes (until the tuple is toasted); otherwise it could result in
overhead without any major WAL reduction.
In fact, I think one of my initial patches had a check to perform the
optimization only if the tuple length is greater than 128 bytes.
I shall try to run both patches for cases where the tuple length is > 128~200
bytes, as this optimization has benefits in those cases.
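As an illustration only (not from any posted patch), the early length check
being discussed could look roughly like this in log_heap_update(). The symbols
other than PGLZ_MIN_DELTA_LEN are those used in the attached patch;
PGLZ_MIN_DELTA_LEN is a made-up name for the ~128-byte heuristic.

    #define PGLZ_MIN_DELTA_LEN 128      /* assumed lower bound for delta encoding */

    if (wal_update_compression_ratio != 0 &&
        oldbuf == newbuf &&
        newtup->t_len >= PGLZ_MIN_DELTA_LEN &&
        !XLogCheckBufferNeedsBackup(newbuf))
    {
        uint32      enclen;

        /* Only now pay the cost of trying to delta-encode the new tuple. */
        if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
        {
            compressed = true;
            newtupdata = buf;
            newtuplen = enclen;
        }
    }

Tuples much larger than ~2000 bytes are already shortened by TOAST, per the
discussion above, so only the lower bound needs an explicit check here.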
In each test, a table is created with a large number of identical rows,
and fillfactor=50. Then a full-table UPDATE is performed, and the
UPDATE is timed. Duration is the time spent in the UPDATE (lower is
better), and wal_generated is the amount of WAL generated by the
updates (lower is better). The summary is that Amit's patch is a small win in terms of CPU usage,
in the best case where the table has few columns, with one large column
that is not updated. In all other cases it just adds overhead. In terms
of WAL size, you get a big gain in the same best case scenario. Attached is a different version of this patch, which uses the pglz
algorithm to spot the similarities between the old and new tuple,
instead of having explicit knowledge of where the column boundaries
are.
This has the advantage that it will spot similarities, and be able to
compress, in more cases. For example, you can see a reduction in WAL
size in the "hundred tiny fields, half nulled" test case above. The attached patch also just adds overhead in most cases, but the
overhead is much smaller in the worst case. I think that's the right
tradeoff here - we want to avoid scenarios where performance falls off
the cliff. That said, if you usually just get a slowdown, we certainly
can't make this the default, and if we can't turn it on by default,
this probably just isn't worth it.
As I mentioned, for smaller tuples it can be overhead without any major
benefit in WAL reduction,
so I think before doing the encoding it should ensure that the tuple length is
greater than some threshold length.
Yes, it can miss some cases, as your test has shown for "hundred tiny
fields, half nulled",
but we might be able to safely enable it by default.
The attached patch contains the variable-hash-size changes I posted in
the "Optimizing pglz compressor". But in the delta encoding function,
it goes further than that, and contains some further micro-
optimizations:
the hash is calculated in a rolling fashion, and it uses a specialized
version of the pglz_hist_add macro that knows that the input can't
exceed 4096 bytes. Those changes shaved off some cycles, but you could
probably do more.
One idea is to only add every 10 bytes or so to the
history lookup table; that would sacrifice some compressibility for
speed.
Do you mean to say roll 10 times and then call pglz_hist_add_no_recycle,
and do the same
before pglz_find_match?
I shall try doing this for the tests.
If you could squeeze pglz_delta_encode function to be cheap enough that
we could enable this by default, this would be pretty cool patch. Or at
least, the overhead in the cases that you get no compression needs to
be brought down, to about 2-5 % at most I think. If it can't be done
easily, I feel that this probably needs to be dropped.
Agreed; though it gives a benefit in some of the cases, it should not
degrade much
in any of the other cases.
One more thing: any compression technique will have some overhead, so it
should be
used selectively rather than in every case. So in that regard, I think we
should do this
optimization only when it has a better chance of a win (for example, based on the length of the
tuple, or some other criteria
where the WAL tuple can otherwise be logged as-is). What is your opinion?
PS. I haven't done much testing of WAL redo, so it's quite possible
that the encoding is actually buggy, or that decoding is slow. But I
don't think there's anything so fundamentally wrong that it would
affect the performance results much.
I also don't think it will have any problems, but I can run some tests to
verify.
With Regards,
Amit Kapila.
On 2013-03-05 23:26:59 +0200, Heikki Linnakangas wrote:
On 04.03.2013 06:39, Amit Kapila wrote:
The stats look fairly sane. I'm a little concerned about the apparent
trend of falling TPS in the patched vs original tests for the 1-client
test as record size increases, but it's only 0.0%->0.2%->0.4%, and the
0.4% case made other config changes too. Nonetheless, it might be wise
to check with really big records and see if the trend continues.

For bigger records (~2000 bytes), the data goes into TOAST, for which we don't do
this optimization.
This optimization is mainly for medium-sized records.

I've been investigating the pglz option further, and doing performance
comparisons of the pglz approach and this patch. I'll begin with some
numbers:

unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245525360 | 9.94613695144653
two short fields, one changed | 1245536528 | 10.146910905838
two short fields, both changed | 1245523160 | 11.2332470417023
one short and one long field, no change | 1054926504 | 5.90477800369263
ten tiny fields, all changed | 1411774608 | 13.4536008834839
hundred tiny fields, all changed | 635739680 | 7.57448387145996
hundred tiny fields, half changed | 636930560 | 7.56888699531555
hundred tiny fields, half nulled | 573751120 | 6.68991994857788

Amit's wal_update_changes_v10.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1249722112 | 13.0558869838715
two short fields, one changed | 1246145408 | 12.9947438240051
two short fields, both changed | 1245951056 | 13.0262880325317
one short and one long field, no change | 678480664 | 5.70031690597534
ten tiny fields, all changed | 1328873920 | 20.0167419910431
hundred tiny fields, all changed | 638149416 | 14.4236788749695
hundred tiny fields, half changed | 635560504 | 14.8770561218262
hundred tiny fields, half nulled | 558468352 | 16.2437210083008

pglz-with-micro-optimizations-1.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245519008 | 11.6702048778534
two short fields, one changed | 1245756904 | 11.3233819007874
two short fields, both changed | 1249711088 | 11.6836447715759
one short and one long field, no change | 664741392 | 6.44810795783997
ten tiny fields, all changed | 1328085568 | 13.9679481983185
hundred tiny fields, all changed | 635974088 | 9.15514206886292
hundred tiny fields, half changed | 636309040 | 9.13769292831421
hundred tiny fields, half nulled | 496396448 | 8.77351498603821

In each test, a table is created with a large number of identical rows, and
fillfactor=50. Then a full-table UPDATE is performed, and the UPDATE is
timed. Duration is the time spent in the UPDATE (lower is better), and
wal_generated is the amount of WAL generated by the updates (lower is
better).The summary is that Amit's patch is a small win in terms of CPU usage, in
the best case where the table has few columns, with one large column that is
not updated. In all other cases it just adds overhead. In terms of WAL size,
you get a big gain in the same best case scenario.

Attached is a different version of this patch, which uses the pglz algorithm
to spot the similarities between the old and new tuple, instead of having
explicit knowledge of where the column boundaries are. This has the
advantage that it will spot similarities, and be able to compress, in more
cases. For example, you can see a reduction in WAL size in the "hundred tiny
fields, half nulled" test case above.The attached patch also just adds overhead in most cases, but the overhead
is much smaller in the worst case. I think that's the right tradeoff here -
we want to avoid scenarios where performance falls off the cliff. That said,
if you usually just get a slowdown, we certainly can't make this the
default, and if we can't turn it on by default, this probably just isn't
worth it.

The attached patch contains the variable-hash-size changes I posted in the
"Optimizing pglz compressor". But in the delta encoding function, it goes
further than that, and contains some further micro-optimizations: the hash
is calculated in a rolling fashion, and it uses a specialized version of the
pglz_hist_add macro that knows that the input can't exceed 4096 bytes. Those
changes shaved off some cycles, but you could probably do more. One idea is
to only add every 10 bytes or so to the history lookup table; that would
sacrifice some compressibility for speed.

If you could squeeze the pglz_delta_encode function to be cheap enough that we
could enable this by default, this would be a pretty cool patch. Or at least,
the overhead in the cases that you get no compression needs to be brought
down, to about 2-5 % at most I think. If it can't be done easily, I feel
that this probably needs to be dropped.
While this is exciting stuff - and I find Heikki's approach more
interesting and applicable to more cases - I think this is clearly not
9.3 material anymore. There are loads of tradeoffs here which require a
substantial amount of benchmarking, and it's not the kind of change that
can be backed out easily during 9.3's lifecycle.
And I have to say I find 2-5% performance overhead too high...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
On 04.03.2013 06:39, Amit Kapila wrote:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
I've been investigating the pglz option further, and doing
performance comparisons of the pglz approach and this patch. I'll begin
with some numbers:

unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245525360 | 9.94613695144653
two short fields, one changed | 1245536528 | 10.146910905838
two short fields, both changed | 1245523160 | 11.2332470417023
one short and one long field, no change | 1054926504 | 5.90477800369263
ten tiny fields, all changed | 1411774608 | 13.4536008834839
hundred tiny fields, all changed | 635739680 | 7.57448387145996
hundred tiny fields, half changed | 636930560 | 7.56888699531555
hundred tiny fields, half nulled | 573751120 | 6.68991994857788

Amit's wal_update_changes_v10.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1249722112 | 13.0558869838715
two short fields, one changed | 1246145408 | 12.9947438240051
two short fields, both changed | 1245951056 | 13.0262880325317
one short and one long field, no change | 678480664 | 5.70031690597534
ten tiny fields, all changed | 1328873920 | 20.0167419910431
hundred tiny fields, all changed | 638149416 | 14.4236788749695
hundred tiny fields, half changed | 635560504 | 14.8770561218262
hundred tiny fields, half nulled | 558468352 | 16.2437210083008

pglz-with-micro-optimizations-1.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245519008 | 11.6702048778534
two short fields, one changed | 1245756904 | 11.3233819007874
two short fields, both changed | 1249711088 | 11.6836447715759
one short and one long field, no change | 664741392 | 6.44810795783997
ten tiny fields, all changed | 1328085568 | 13.9679481983185
hundred tiny fields, all changed | 635974088 | 9.15514206886292
hundred tiny fields, half changed | 636309040 | 9.13769292831421
hundred tiny fields, half nulled | 496396448 | 8.77351498603821

In each test, a table is created with a large number of identical rows,
and fillfactor=50. Then a full-table UPDATE is performed, and the
UPDATE is timed. Duration is the time spent in the UPDATE (lower is
better), and wal_generated is the amount of WAL generated by the
updates (lower is better).
Based on your patch, I have tried some more optimizations:

Fixed a bug in your patch (pglz-with-micro-optimizations-2):
1. There were some problems in recovery due to the wrong length of the old
tuple being passed to decode, which I have corrected.

Approach-1 (pglz-with-micro-optimizations-2_roll10_32)
1. Moved the strategy minimum-length (32) check into log_heap_update.
2. Added the rolling-by-10 hash insertion you suggested.

Approach-2 (pglz-with-micro-optimizations-2_roll10_32_1hashkey)
1. This is done on top of the Approach-1 changes.
2. Used 1 byte of data as the hash key (see the sketch after this list).

Approach-3
(pglz-with-micro-optimizations-2_roll10_32_1hashkey_batch_literal)
1. This is done on top of the Approach-1 and Approach-2 changes.
2. Instead of copying each literal byte individually when it doesn't match the
history, copy them all in a batch.

Data for all the above approaches is in the attached file "test_readings".
(Apart from your tests, I have added one more test, "hundred tiny fields,
first 10 changed".)
Summary -
After the Approach-1 changes, CPU utilization for all tests except 2
("hundred tiny fields, all changed" and "hundred tiny fields, half changed")
is the same or lower. The best-case CPU utilization has decreased (which is
better), but the WAL reduction is slightly less (which is as per expectation,
since only every 10th position is added to the history).
The Approach-2 modifications were done to see whether the hash calculation
itself has any overhead.
Approach-2 and Approach-3 don't result in any improvement.
I have investigated the reason for the higher CPU utilization in those 2 tests:
there is nothing to compress in the new tuple, and the encoder only finds that
out after it has processed 75% (the compression-ratio threshold) of the tuple
bytes.
I think any compression algorithm has this drawback: if the data is not
compressible, it can consume time in spite of the fact that it will not be able
to compress the data.
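To show where the 75% comes from (the arithmetic follows the bail-out check in
the attached patch, assuming the default wal_update_compression_ratio of 25):

    /* Sketch: with a required 25% saving, the output budget is 75% of the input */
    int need_rate  = 25;                                 /* wal_update_compression_ratio */
    int slen       = 1000;                               /* example: 1000 bytes of tuple data */
    int result_max = (slen * (100 - need_rate)) / 100;   /* = 750 bytes of output allowed */
    /*
     * Incompressible input emits roughly one literal byte of output per input
     * byte, so the encoder only exceeds result_max, and gives up, after about
     * 750 input bytes, i.e. 75% of the tuple.
     */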
I think most updates modify only part of the tuple, which will always yield
positive results.
Apart from the above tests, I have run your patch against my old tests, and it
yields quite positive results: the WAL reduction is greater than with my patch,
and CPU utilization is about the same (or my patch is slightly better).
The results are in the attached file "pgbench_pg_lz_mod".
All the above data is for synchronous_commit = off. I can collect the data
for synchronous_commit = on, and for recovery performance.
Any further suggestions?
With Regards,
Amit Kapila.
Attachments:
pglz-with-micro-optimizations-2.patch (application/octet-stream)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5250ec7..5b69189 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5806,6 +5808,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5815,15 +5823,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+	 * We should not generate EWT when we need to back up the whole block in
+	 * WAL, as in that case there is no saving from a reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5850,9 +5890,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6655,7 +6698,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6670,7 +6716,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6730,7 +6776,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6748,7 +6794,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6773,7 +6819,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6836,10 +6882,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6855,7 +6922,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a02eebc..5075d2c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1236,6 +1236,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..96c5c61b 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,9 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +243,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +261,40 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be used to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a,b,c,d are local variables these macros use to store state. These macros
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -276,32 +308,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+} while (0)
+
/* ----------
* pglz_out_ctrl -
@@ -372,30 +421,44 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ maxlen = PGLZ_MAX_MATCH;
+ if (input - end > maxlen)
+ maxlen = input - end;
+ if (hend && (hend - hp > maxlen))
+ maxlen = hend - hp;
/*
* Stop if the offset does not fit into our tag anymore.
*/
+ if (!hend)
+ {
thisoff = ip - hp;
if (thisoff >= 0x0fff)
break;
+ }
+ else
+ thisoff = hend - hp;
/*
* Determine length of match. A better match must be larger than the
@@ -413,7 +476,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +486,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -443,13 +506,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +534,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For
+ * a small input, the startup cost dominates. The table size must be
+ * a power of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -484,7 +570,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +586,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +643,21 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +680,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +691,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +708,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +731,200 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 a,b,c,d;
+ int32 hindex;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ if (hend - hp > PGLZ_HISTORY_SIZE)
+ hp = hend - PGLZ_HISTORY_SIZE;
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (hp < hend - 4)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the
+ * new data, too.
+ */
+ pglz_hash_roll(hp, hindex, a,b,c,d, mask);
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex);
+ hp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ while (dp < dend - 4)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp,hindex,a,b,c,d,mask);
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1028,107 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * Otherwise it contains the match length minus 3 and the
+ * upper 4 bits of the offset. The next following byte
+ * contains the lower 8 bits of the offset. If the length is
+ * coded as 18, another extension tag byte tells how much
+ * longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history
+ * to OUTPUT.
+ */
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just
+ * copy one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 1;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98149fc..0e50f6d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -123,6 +123,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2383,6 +2384,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 270924a..5a40457 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates as old
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates as new
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates as the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..1ef550b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8a65492..df178a1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
pglz-with-micro-optimizations-2_roll10_32.patch (application/octet-stream)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5250ec7..4dcf164 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5806,6 +5808,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5815,15 +5823,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+	 * We should not generate EWT when we need to back up the whole block in
+	 * WAL, as in that case there is no saving from a reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (newtuplen >=32) && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5850,9 +5890,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6655,7 +6698,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6670,7 +6716,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6730,7 +6776,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6748,7 +6794,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6773,7 +6819,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6836,10 +6882,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6855,7 +6922,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a02eebc..5075d2c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1236,6 +1236,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..8aec6bd 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,9 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +243,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +261,39 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be used to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a, b, c and d are local variables that these macros use to store state. These macros
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
/* ----------
* pglz_hist_add -
@@ -276,32 +307,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+} while (0)
+
/* ----------
* pglz_out_ctrl -
@@ -372,30 +420,44 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ maxlen = PGLZ_MAX_MATCH;
+ if (input - end > maxlen)
+ maxlen = input - end;
+ if (hend && (hend - hp > maxlen))
+ maxlen = hend - hp;
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ if (!hend)
+ {
+ thisoff = ip - hp;
+ if (thisoff >= 0x0fff)
+ break;
+ }
+ else
+ thisoff = hend - hp;
/*
* Determine length of match. A better match must be larger than the
@@ -413,7 +475,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +485,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -443,13 +505,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +533,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For a
+ * small input, the startup cost dominates. The table size must be a power
+ * of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -484,7 +569,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +585,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +642,22 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
+
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +680,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +691,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +708,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +731,225 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 hindex;
+ int32 a,b,c,d;
+ int32 rollidx;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ if (hend - hp > PGLZ_HISTORY_SIZE)
+ hp = hend - PGLZ_HISTORY_SIZE;
+
+ rollidx = 0;
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (hp < hend)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the new
+ * data, too.
+ */
+ pglz_hash_roll(hp, hindex, a, b, c, d, mask);
+ if (rollidx % 10 == 0)
+ {
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex);
+ rollidx = 0;
+ }
+ hp++; /* Do not do this ++ in the line above! */
+ rollidx++;
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ rollidx = 0;
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (dp < dend)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp, hindex, a, b, c, d, mask);
+ if (rollidx % 10 == 0)
+ {
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+
+ rollidx = 0;
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+
+ rollidx++;
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1053,107 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * A set control bit means a match tag. The tag byte contains the match
+ * length minus 3 and the upper 4 bits of the offset; the following byte
+ * contains the lower 8 bits of the offset. If the length is coded as 18,
+ * another extension tag byte tells how much longer the match really
+ * was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history to
+ * OUTPUT.
+ */
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just copy
+ * one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 1;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98149fc..0e50f6d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -123,6 +123,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2383,6 +2384,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of the delta record for a WAL update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 270924a..5a40457 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
+ * page's all-visible
+ * bit was cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
+ * page's all-visible
+ * bit was cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
+ * update operation is
+ * delta-encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..1ef550b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8a65492..df178a1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non-continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
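For reviewers, here is a minimal standalone sketch (not part of either patch) of how the match tags written by pglz_out_tag are interpreted by pglz_delta_decode; the helper name decode_tag and the sample tag bytes are made up for illustration:

#include <stdio.h>

/*
 * Illustrative only: decode one pglz-style match tag the way
 * pglz_delta_decode does.  The first tag byte holds the match length minus 3
 * in its low nibble and the upper 4 bits of the offset in its high nibble;
 * the second byte holds the lower 8 bits of the offset.  A decoded length of
 * 18 (low nibble 0x0f) means one extension byte follows with up to 255 bytes
 * of additional match length.
 */
static const unsigned char *
decode_tag(const unsigned char *sp, int *len, int *off)
{
	*len = (sp[0] & 0x0f) + 3;
	*off = ((sp[0] & 0xf0) << 4) | sp[1];
	sp += 2;
	if (*len == 18)
		*len += *sp++;		/* extension byte for long matches */
	return sp;
}

int
main(void)
{
	/* tag bytes 0xF7 0x2A: length nibble 7 -> len 10, offset 0xF2A -> 3882 */
	const unsigned char tag[] = {0xF7, 0x2A};
	int		len;
	int		off;

	decode_tag(tag, &len, &off);
	printf("len=%d off=%d\n", len, off);	/* prints len=10 off=3882 */
	return 0;
}

The only difference from plain pglz_decompress is where the offset points: in the EWT case the decoder copies the matched bytes from hend - off, i.e. the offset counts back from the end of the old tuple's data rather than back from the current output position.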
pglz-with-micro-optimizations-2_roll10_32_1hashkey.patch (application/octet-stream)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* GUC variable for EWT compression ratio */
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5250ec7..4dcf164 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5806,6 +5808,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5815,15 +5823,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * An EWT can be generated for any new tuple version created by an UPDATE
+ * operation. Currently we do it only when both the old and new tuple
+ * versions are on the same page, because during recovery, if the page
+ * containing the old tuple is corrupt, that corruption should not cascade
+ * to other pages. Under the general assumption that for long runs most
+ * updates tend to create the new tuple version on the same page, this
+ * should not have a significant impact on WAL reduction or performance.
+ *
+ * We should not generate an EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (newtuplen >= 32) && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5850,9 +5890,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows,
+ * OR PG93FORMAT (if encoded): LZ header + encoded data follows.
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6655,7 +6698,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6670,7 +6716,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6730,7 +6776,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6748,7 +6794,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6773,7 +6819,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6836,10 +6882,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is an EWT, decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2-3 bytes)
+ * + New data (1 byte length + variable data) + ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6855,7 +6922,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a02eebc..5075d2c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1236,6 +1236,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but that will not cause any problem because this function is used
+ * only to decide whether an EWT is required for the WAL update record.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..2f3067f 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,9 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +243,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +261,20 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_roll() calculates the hash index for the current byte using the given mask.
+ */
+#define pglz_hash_roll(_p,hindex,_mask) \
+ do { \
+ hindex = (_p[0]) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -276,32 +288,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+} while (0)
+
/* ----------
* pglz_out_ctrl -
@@ -372,30 +401,44 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ maxlen = PGLZ_MAX_MATCH;
+ if (input - end > maxlen)
+ maxlen = input - end;
+ if (hend && (hend - hp > maxlen))
+ maxlen = hend - hp;
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ if (!hend)
+ {
+ thisoff = ip - hp;
+ if (thisoff >= 0x0fff)
+ break;
+ }
+ else
+ thisoff = hend - hp;
/*
* Determine length of match. A better match must be larger than the
@@ -413,7 +456,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +466,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -443,13 +486,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +514,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For a
+ * small input, the startup cost dominates. The table size must be a power
+ * of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -484,7 +550,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +566,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +623,22 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
+
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +661,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +672,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +689,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +712,221 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 hindex;
+ int32 rollidx;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ if (hend - hp > PGLZ_HISTORY_SIZE)
+ hp = hend - PGLZ_HISTORY_SIZE;
+
+ rollidx = 0;
+ while (hp < hend)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the new
+ * data, too.
+ */
+ pglz_hash_roll(hp, hindex, mask);
+ if (rollidx % 10 == 0)
+ {
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex);
+ rollidx = 0;
+ }
+ hp++; /* Do not do this ++ in the line above! */
+ rollidx++;
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ rollidx = 0;
+ while (dp < dend)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp, hindex, mask);
+ if (rollidx % 10 == 0)
+ {
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+
+ rollidx = 0;
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+
+ rollidx++;
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1030,107 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * A set control bit means a match tag. The tag byte contains the match
+ * length minus 3 and the upper 4 bits of the offset; the following byte
+ * contains the lower 8 bits of the offset. If the length is coded as 18,
+ * another extension tag byte tells how much longer the match really
+ * was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history to
+ * OUTPUT.
+ */
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just copy
+ * one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 1;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98149fc..0e50f6d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -123,6 +123,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2383,6 +2384,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of the delta record for a WAL update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 270924a..5a40457 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
+ * page's all-visible
+ * bit was cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
+ * page's all-visible
+ * bit was cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
+ * update operation is
+ * delta-encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..1ef550b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8a65492..df178a1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non-continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
pglz-with-micro-optimizations-2_roll10_32_1hashkey_batch_literal.patch (application/octet-stream)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5250ec7..4dcf164 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5806,6 +5808,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5815,15 +5823,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (newtuplen >=32) && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5850,9 +5890,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6655,7 +6698,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6670,7 +6716,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6730,7 +6776,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6748,7 +6794,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6773,7 +6819,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6836,10 +6882,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6855,7 +6922,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a02eebc..5075d2c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1236,6 +1236,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..93e7cd0 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,9 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +243,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +261,20 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_roll() calculates the hash index for the current byte, using the mask.
+ */
+#define pglz_hash_roll(_p,hindex,_mask) \
+ do { \
+ hindex = (_p[0]) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -276,32 +288,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+} while (0)
+
/* ----------
* pglz_out_ctrl -
@@ -372,30 +401,44 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ maxlen = PGLZ_MAX_MATCH;
+ if (input - end > maxlen)
+ maxlen = input - end;
+ if (hend && (hend - hp > maxlen))
+ maxlen = hend - hp;
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ if (!hend)
+ {
+ thisoff = ip - hp;
+ if (thisoff >= 0x0fff)
+ break;
+ }
+ else
+ thisoff = hend - hp;
/*
* Determine length of match. A better match must be larger than the
@@ -413,7 +456,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +466,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -443,13 +486,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +514,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For a
+ * small input, the startup cost dominates. The table size must be a power
+ * of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -484,7 +550,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +566,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +623,22 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
+
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +661,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +672,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +689,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +712,219 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 hindex;
+ int32 rollidx;
+ int32 literal_len = 0;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ if (hend - hp > PGLZ_HISTORY_SIZE)
+ hp = hend - PGLZ_HISTORY_SIZE;
+
+ rollidx = 0;
+ while (hp < hend)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the new
+ * data, too.
+ */
+ pglz_hash_roll(hp, hindex, mask);
+ if (rollidx % 10 == 0)
+ {
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex);
+ rollidx = 0;
+ }
+ hp++; /* Do not do this ++ in the line above! */
+ rollidx++;
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ rollidx = 0;
+ while ((dp + literal_len) < dend)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if ((bp + literal_len) - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll((dp + literal_len), hindex, mask);
+ if (rollidx % 10 == 0)
+ {
+ if (pglz_find_match(hist_start, (dp + literal_len), dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex))
+ {
+ while (literal_len > 0)
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ literal_len--;
+ }
+
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ }
+ else
+ literal_len++;
+
+ rollidx = 0;
+ }
+ else
+ literal_len++;
+
+ rollidx++;
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1028,107 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * Otherwise it contains the match length minus 3 and the
+ * upper 4 bits of the offset. The next following byte
+ * contains the lower 8 bits of the offset. If the length is
+ * coded as 18, another extension tag byte tells how much
+ * longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history to
+ * OUTPUT.
+ */
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just copy
+ * one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 1;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98149fc..0e50f6d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -123,6 +123,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2383,6 +2384,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 270924a..5a40457 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..1ef550b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8a65492..df178a1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non-continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
On Friday, March 08, 2013 9:22 PM Amit Kapila wrote:
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
On 04.03.2013 06:39, Amit Kapila wrote:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
I've been investigating the pglz option further, and doing
performance comparisons of the pglz approach and this patch. I'll
begin with some numbers:
Based on your patch, I have tried some more optimizations:
Fixed bug in your patch (pglz-with-micro-optimizations-2):
1. There were some problems in recovery due to the wrong length of the old tuple
being passed to decode, which I have corrected.
Approach-1 (pglz-with-micro-optimizations-2_roll10_32)
1. Moved the strategy minimum length (32) check into log_heap_update.
2. Added the rolling of the hash over 10 positions, as suggested by you
(see the sketch after this list).
Approach-2 (pglz-with-micro-optimizations-2_roll10_32_1hashkey)
1. Done on top of the Approach-1 changes.
2. Used 1 byte of data as the hash key.
Approach-3 (pglz-with-micro-optimizations-2_roll10_32_1hashkey_batch_literal)
1. Done on top of the Approach-1 and Approach-2 changes.
2. Instead of copying a literal byte each time it is found not to match the
history, copy all of them in a batch.
Data for all the above approaches is in the attached file "test_readings"
(apart from your tests, I have added one more test, "hundred tiny
fields, first 10 changed").
Summary -
After the Approach-1 changes, CPU utilization for all but 2 tests
("hundred tiny fields, all changed", "hundred tiny fields, half
changed") is either the same or lower. The best-case CPU utilization has
decreased (which is better), but WAL has increased a little bit
(which is as expected, due to the roll-up over 10 consecutive positions).
The Approach-2 modifications were done to see if there is any overhead in
the hash calculation.
Approach-2 and Approach-3 don't result in any improvements.
I have investigated the reason for the CPU utilization in those 2 tests: there
is nothing to compress in the new tuple, and the encoder only finds that out
after it has processed 75% (the compression ratio) of the tuple bytes.
I think any compression algorithm will have this drawback: if the data
is not compressible, it can consume time despite the fact that it
will not be able to compress the data.
I think most updates will change only part of the tuple, which will always
yield positive results.
Apart from the above tests, I have run your patch against my old tests; it
yields quite positive results. The WAL reduction is larger than with my
patch, and the CPU utilization is almost the same, or my patch is slightly
better.
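As a small worked example of that 75% figure, here is a sketch (not patch
code; the tuple length is illustrative) of the output budget the encoder
works with when wal_update_compression_ratio is at its default of 25:

#include <stdio.h>

int
main(void)
{
    int     slen = 2000;        /* new tuple data length, illustrative */
    int     need_rate = 25;     /* minimum saving required, in percent */

    /* same formula the encoder uses for its output budget */
    int     result_max = (slen * (100 - need_rate)) / 100;     /* 1500 */

    /*
     * If nothing in the new tuple matches the history, every input byte
     * becomes a literal output byte, so the budget is exceeded only after
     * roughly result_max (75% of slen) bytes have already been scanned.
     */
    printf("output budget: %d of %d bytes (need %d%% saving)\n",
           result_max, slen, need_rate);
    return 0;
}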
The results are in the attached file "pgbench_pg_lz_mod".
All of the above data is for synchronous_commit = off. I can collect the
data for synchronous_commit = on, and the recovery performance.
Data for synchronous_commit = on is as follows:
Find the data for Heikki's tests in the file "test_readings_on.txt".
The results and observations are the same as for synchronous_commit = off.
In short, Approach-1, as mentioned in the above mail, seems to be best.
Find the data for the pgbench-based tests used in my previous tests in
"pgbench_pg_lz_mod_sync_commit_on.htm".
This has been done for Heikki's original patch and Approach-1.
It shows that there is a very minor CPU dip (0.1%) in some cases and a WAL
reduction of 2~3%.
The WAL reduction is not large, because fewer operations are performed.
Recovery Performance
----------------------
pgbench org:
./pgbench -i -s 75 -F 80 postgres
./pgbench -c 4 -j 4 -T 600 postgres
pgbench 1800(rec size=1800):
./pgbench -i -s 10 -F 80 postgres
./pgbench -c 4 -j 4 -T 600 postgres
Recovery benchmark:
                    postgres org      postgres pg lz optimization
                    Recovery(sec)     Recovery(sec)
  pgbench org            11                 11
  pgbench 1800           16                 11
This shows that with your patch, recovery performance is also improved.
There is one more defect in recovery, which is fixed in the attached patch
pglz-with-micro-optimizations-3.patch.
In pglz_find_match(), the comparison was going beyond maxlen, because of
which the encoded data was not written to WAL properly.
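For clarity, a simplified sketch of the bounded comparison that the fix
enforces (this is not the patch code itself; match_length and tag_max are
illustrative names): the match must stop at whichever limit comes first -
the longest length the tag can encode, the end of the input, or the end of
the history - otherwise the reported length can run past valid data and the
emitted tag no longer matches what decode will copy back from the old tuple.

static int
match_length(const char *ip, const char *iend,
             const char *hp, const char *hend, int tag_max)
{
    int     maxlen = tag_max;
    int     len = 0;

    if (iend - ip < maxlen)
        maxlen = (int) (iend - ip);
    if (hend - hp < maxlen)
        maxlen = (int) (hend - hp);

    while (len < maxlen && ip[len] == hp[len])
        len++;

    return len;
}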
Finally, as per my work on top of your patch, the best patch will be obtained by
fixing the recovery defects and applying the Approach-1 changes.
With Regards,
Amit Kapila.
Attachments:
pglz-with-micro-optimizations-3.patch (application/octet-stream)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5250ec7..5b69189 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5806,6 +5808,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5815,15 +5823,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5850,9 +5890,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6655,7 +6698,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6670,7 +6716,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6730,7 +6776,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6748,7 +6794,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6773,7 +6819,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6836,10 +6882,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6855,7 +6922,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a02eebc..5075d2c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1236,6 +1236,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..2aa9aaf 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,9 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +243,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +261,40 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be used to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a,b,c,d are local variables these macros use to store state. These macros
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -276,32 +308,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+} while (0)
+
/* ----------
* pglz_out_ctrl -
@@ -372,28 +421,42 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ maxlen = PGLZ_MAX_MATCH;
+ if (end - input < maxlen)
+ maxlen = end - input;
+ if (hend && (hend - hp < maxlen))
+ maxlen = hend - hp;
+
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
+ if (!hend)
+ thisoff = ip - hp;
+ else
+ thisoff = hend - hp;
+
if (thisoff >= 0x0fff)
break;
@@ -413,7 +476,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +486,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -443,13 +506,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +534,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For
+ * a small input, the startup cost dominates. The table size must be
+ * a power of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -484,7 +570,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +586,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +643,21 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +680,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +691,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +708,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +731,198 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 a,b,c,d;
+ int32 hindex;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (hp < hend - 4)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the
+ * new data, too.
+ */
+ pglz_hash_roll(hp, hindex, a,b,c,d, mask);
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex);
+ hp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ while (dp < dend - 4)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp,hindex,a,b,c,d,mask);
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1026,107 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * A set control bit means COPY: it contains the match length minus 3 and the
+ * upper 4 bits of the offset. The next following byte
+ * contains the lower 8 bits of the offset. If the length is
+ * coded as 18, another extension tag byte tells how much
+ * longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history
+ * to OUTPUT.
+ */
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just
+ * copy one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 1;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98149fc..0e50f6d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -123,6 +123,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2383,6 +2384,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 270924a..5a40457 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
+ * page's all-visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
+ * page's all-visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..1ef550b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8a65492..df178a1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non-continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
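For reference, here is a minimal sketch of how the new pglz delta API declared in the patch's pg_lzcompress.h changes could be called; the buffer sizes, the helper name and the round-trip check are illustrative assumptions, not code from the patch:

#include "postgres.h"
#include "utils/pg_lzcompress.h"

/*
 * Illustrative sketch only: delta-encode a new tuple body against the old
 * one, then decode it back using the old tuple as history (roughly what
 * redo would do with the old tuple fetched from the page).  Assumes
 * old_len is at least 4 and below the 4096-byte pglz history window, and
 * that both lengths fit in the local buffers.
 */
static bool
wal_delta_roundtrip(const char *old_data, int32 old_len,
					const char *new_data, int32 new_len)
{
	char		encoded[BLCKSZ];
	char		decoded[BLCKSZ];
	uint32		enc_len;
	uint32		dec_len;

	/* Encode 'new_data' as back-references into 'old_data' plus literals. */
	if (!pglz_delta_encode(new_data, new_len, old_data, old_len,
						   encoded, &enc_len, PGLZ_strategy_default))
		return false;			/* not worth it; log the full tuple instead */

	/* Decode using the old tuple as history, as redo would. */
	pglz_delta_decode(encoded, enc_len, decoded, sizeof(decoded), &dec_len,
					  old_data, old_len);

	return dec_len == (uint32) new_len &&
		memcmp(decoded, new_data, new_len) == 0;
}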
On Wednesday, March 13, 2013 5:50 PM Amit Kapila wrote:
On Friday, March 08, 2013 9:22 PM Amit Kapila wrote:
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
On 04.03.2013 06:39, Amit Kapila wrote:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
I've been investigating the pglz option further, and doing
performance comparisons of the pglz approach and this patch. I'll
begin with some numbers:
Based on your patch, I have tried some more optimizations:
Based on the numbers provided by Daniel for compression methods, I tried the Snappy
algorithm for encoding,
and it addresses most of your concerns, in that it should not degrade
performance for the majority of cases.
postgres original:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1232916160 | 34.0338308811188
two short fields, one changed | 1232909704 | 32.8722319602966
two short fields, both changed | 1236770128 | 35.445415019989
one short and one long field, no change | 1053000144 | 23.2983899116516
ten tiny fields, all changed | 1397452584 | 40.2718069553375
hundred tiny fields, first 10 changed | 622082664 | 21.7642788887024
hundred tiny fields, all changed | 626461528 | 20.964781999588
hundred tiny fields, half changed | 621900472 | 21.6473519802094
hundred tiny fields, half nulled | 557714752 | 19.0088789463043
(9 rows)
postgres with wal encoded using snappy:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1232915128 | 34.6910920143127
two short fields, one changed | 1238902520 | 34.2287850379944
two short fields, both changed | 1233882056 | 35.3292708396912
one short and one long field, no change | 733095168 | 20.3494939804077
ten tiny fields, all changed | 1314959744 | 38.969575881958
hundred tiny fields, first 10 changed | 483275136 | 19.6973309516907
hundred tiny fields, all changed | 481755280 | 19.7665288448334
hundred tiny fields, half changed | 488693616 | 19.7246761322021
hundred tiny fields, half nulled | 483425712 | 18.6299569606781
(9 rows)
The changes are to call snappy compress and decompress for encoding and decoding
in the patch.
I am doing encoding only for tuple lengths greater than 32, as for very small tuples
encoding might not make much sense.
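As a rough sketch of that gating logic (the 32-byte threshold is from this mail and the 4096-byte limit mirrors the pglz history window; the helper name and everything else is an assumption for illustration, not code from either patch):

#include "postgres.h"

#define WAL_DELTA_MIN_TUPLE_LEN		32		/* threshold mentioned above */
#define WAL_DELTA_MAX_HISTORY		4096	/* PGLZ_HISTORY_SIZE in pg_lzcompress.c */

/*
 * Hypothetical helper: decide whether an updated tuple is worth encoding.
 * Very small new tuples are skipped because the encoding overhead would
 * dominate; old tuples outside the history window cannot be referenced.
 */
static bool
wal_update_worth_encoding(uint32 old_len, uint32 new_len)
{
	if (new_len <= WAL_DELTA_MIN_TUPLE_LEN)
		return false;
	if (old_len < 4 || old_len >= WAL_DELTA_MAX_HISTORY)
		return false;
	return true;
}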
On my machine, while using snappy compress/decompress, it was giving stack
corruption for the first 4 bytes, so I put in the fix below to proceed.
I am still looking into the reason for it.
1. snappy_compress - Increment the encoded data buffer by 4 bytes before
compression starts.
2. snappy_uncompress - Decrement the 4 bytes incremented during compress.
3. snappy_uncompressed_length - Decrement the 4 bytes incremented during
compress.
For the LZ compression patch, there was a small problem in identifying the max length,
which I have corrected in the separate patch
'pglz-with-micro-optimizations-4.patch'.
In my opinion, there are the following ways forward for this patch:
1. Use LZ compression, but provide a way for the user to avoid it
for cases where not much compression is possible.
I see this as a viable way because most updates will change only a few
columns and the rest of the data would be the same.
2. Use the snappy APIs; does anyone know of a standard snappy library?
3. Provide multiple compression methods, so that depending on usage, the user can pick
the appropriate one.
Feedback?
With Regards,
Amit Kapila.
Attachments:
snappy_algo_v1.patch  application/octet-stream; name=snappy_algo_v1.patch
*** a/src/backend/utils/adt/Makefile
--- b/src/backend/utils/adt/Makefile
***************
*** 31,37 **** OBJS = acl.o arrayfuncs.o array_selfuncs.o array_typanalyze.o \
tsquery_op.o tsquery_rewrite.o tsquery_util.o tsrank.o \
tsvector.o tsvector_op.o tsvector_parser.o \
txid.o uuid.o windowfuncs.o xml.o rangetypes_spgist.o \
! rangetypes_typanalyze.o rangetypes_selfuncs.o
like.o: like.c like_match.c
--- 31,37 ----
tsquery_op.o tsquery_rewrite.o tsquery_util.o tsrank.o \
tsvector.o tsvector_op.o tsvector_parser.o \
txid.o uuid.o windowfuncs.o xml.o rangetypes_spgist.o \
! rangetypes_typanalyze.o rangetypes_selfuncs.o snappy.o
like.o: like.c like_match.c
*** /dev/null
--- b/src/backend/utils/adt/snappy.c
***************
*** 0 ****
--- 1,1334 ----
+ /*
+ * C port of the snappy compressor from Google.
+ * This is a very fast compressor with comparable compression to lzo.
+ * Works best on 64bit little-endian, but should be good on others too.
+ * Ported by Andi Kleen.
+ * Based on snappy 1.0.3 plus some selected changes from SVN.
+ */
+
+ /*
+ * Copyright 2005 Google Inc. All Rights Reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Google Inc. nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+ #ifdef __KERNEL__
+ #include <linux/kernel.h>
+ #include <linux/module.h>
+ #include <linux/slab.h>
+ #include <linux/string.h>
+ #include <linux/snappy.h>
+ #include <linux/vmalloc.h>
+ #include <asm/unaligned.h>
+ #else
+ #include <stdbool.h>
+ #include <stddef.h>
+ #include "utils/snappy.h"
+ #include "utils/compat.h"
+ #endif
+
+ #define CRASH_UNLESS(x) BUG_ON(!(x))
+ #define CHECK(cond) CRASH_UNLESS(cond)
+ #define CHECK_LE(a, b) CRASH_UNLESS((a) <= (b))
+ #define CHECK_GE(a, b) CRASH_UNLESS((a) >= (b))
+ #define CHECK_EQ(a, b) CRASH_UNLESS((a) == (b))
+ #define CHECK_NE(a, b) CRASH_UNLESS((a) != (b))
+ #define CHECK_LT(a, b) CRASH_UNLESS((a) < (b))
+ #define CHECK_GT(a, b) CRASH_UNLESS((a) > (b))
+
+ #define UNALIGNED_LOAD16(_p) get_unaligned((u16 *)(_p))
+ #define UNALIGNED_LOAD32(_p) get_unaligned((u32 *)(_p))
+ #define UNALIGNED_LOAD64(_p) get_unaligned((u64 *)(_p))
+
+ #define UNALIGNED_STORE16(_p, _val) put_unaligned(_val, (u16 *)(_p))
+ #define UNALIGNED_STORE32(_p, _val) put_unaligned(_val, (u32 *)(_p))
+ #define UNALIGNED_STORE64(_p, _val) put_unaligned(_val, (u64 *)(_p))
+
+ #ifdef NDEBUG
+
+ #define DCHECK(cond) do {} while(0)
+ #define DCHECK_LE(a, b) do {} while(0)
+ #define DCHECK_GE(a, b) do {} while(0)
+ #define DCHECK_EQ(a, b) do {} while(0)
+ #define DCHECK_NE(a, b) do {} while(0)
+ #define DCHECK_LT(a, b) do {} while(0)
+ #define DCHECK_GT(a, b) do {} while(0)
+
+ #else
+
+ #define DCHECK(cond) CHECK(cond)
+ #define DCHECK_LE(a, b) CHECK_LE(a, b)
+ #define DCHECK_GE(a, b) CHECK_GE(a, b)
+ #define DCHECK_EQ(a, b) CHECK_EQ(a, b)
+ #define DCHECK_NE(a, b) CHECK_NE(a, b)
+ #define DCHECK_LT(a, b) CHECK_LT(a, b)
+ #define DCHECK_GT(a, b) CHECK_GT(a, b)
+
+ #endif
+
+ static inline bool is_little_endian(void)
+ {
+ #ifdef __LITTLE_ENDIAN__
+ return true;
+ #endif
+ return false;
+ }
+
+ static inline int log2_floor(u32 n)
+ {
+ return n == 0 ? -1 : 31 ^ __builtin_clz(n);
+ }
+
+ static inline int find_lsb_set_non_zero(u32 n)
+ {
+ return __builtin_ctz(n);
+ }
+
+ static inline int find_lsb_set_non_zero64(u64 n)
+ {
+ return __builtin_ctzll(n);
+ }
+
+ #define kmax32 5
+
+ /*
+ * Attempts to parse a varint32 from a prefix of the bytes in [ptr,limit-1].
+ * Never reads a character at or beyond limit. If a valid/terminated varint32
+ * was found in the range, stores it in *OUTPUT and returns a pointer just
+ * past the last byte of the varint32. Else returns NULL. On success,
+ * "result <= limit".
+ */
+ static inline const char *varint_parse32_with_limit(const char *p,
+ const char *l,
+ u32 * OUTPUT)
+ {
+ const unsigned char *ptr = (const unsigned char *)(p);
+ const unsigned char *limit = (const unsigned char *)(l);
+ u32 b, result;
+
+ if (ptr >= limit)
+ return NULL;
+ b = *(ptr++);
+ result = b & 127;
+ if (b < 128)
+ goto done;
+ if (ptr >= limit)
+ return NULL;
+ b = *(ptr++);
+ result |= (b & 127) << 7;
+ if (b < 128)
+ goto done;
+ if (ptr >= limit)
+ return NULL;
+ b = *(ptr++);
+ result |= (b & 127) << 14;
+ if (b < 128)
+ goto done;
+ if (ptr >= limit)
+ return NULL;
+ b = *(ptr++);
+ result |= (b & 127) << 21;
+ if (b < 128)
+ goto done;
+ if (ptr >= limit)
+ return NULL;
+ b = *(ptr++);
+ result |= (b & 127) << 28;
+ if (b < 16)
+ goto done;
+ return NULL; /* Value is too long to be a varint32 */
+ done:
+ *OUTPUT = result;
+ return (const char *)(ptr);
+ }
+
+ /*
+ * REQUIRES "ptr" points to a buffer of length sufficient to hold "v".
+ * EFFECTS Encodes "v" into "ptr" and returns a pointer to the
+ * byte just past the last encoded byte.
+ */
+ static inline char *varint_encode32(char *sptr, u32 v)
+ {
+ /* Operate on characters as unsigneds */
+ unsigned char *ptr = (unsigned char *)(sptr);
+ static const int B = 128;
+
+ if (v < (1 << 7)) {
+ *(ptr++) = v;
+ } else if (v < (1 << 14)) {
+ *(ptr++) = v | B;
+ *(ptr++) = v >> 7;
+ } else if (v < (1 << 21)) {
+ *(ptr++) = v | B;
+ *(ptr++) = (v >> 7) | B;
+ *(ptr++) = v >> 14;
+ } else if (v < (1 << 28)) {
+ *(ptr++) = v | B;
+ *(ptr++) = (v >> 7) | B;
+ *(ptr++) = (v >> 14) | B;
+ *(ptr++) = v >> 21;
+ } else {
+ *(ptr++) = v | B;
+ *(ptr++) = (v >> 7) | B;
+ *(ptr++) = (v >> 14) | B;
+ *(ptr++) = (v >> 21) | B;
+ *(ptr++) = v >> 28;
+ }
+ return (char *)(ptr);
+ }
+
+ struct source {
+ const char *ptr;
+ size_t left;
+ };
+
+ static inline int available(struct source *s)
+ {
+ return s->left;
+ }
+
+ static inline const char *peek(struct source *s, size_t * len)
+ {
+ *len = s->left;
+ return s->ptr;
+ }
+
+ static inline void skip(struct source *s, size_t n)
+ {
+ s->left -= n;
+ s->ptr += n;
+ }
+
+ struct sink {
+ char *dest;
+ };
+
+ static inline void append(struct sink *s, const char *data, size_t n)
+ {
+ if (data != s->dest)
+ memcpy(s->dest, data, n);
+ s->dest += n;
+ }
+
+ static inline void *sink_peek(struct sink *s, size_t n)
+ {
+ return s->dest;
+ }
+
+ struct writer {
+ char *base;
+ char *op;
+ char *op_limit;
+ };
+
+ /* Called before decompression */
+ static inline void writer_set_expected_length(struct writer *w, size_t len)
+ {
+ w->op_limit = w->op + len;
+ }
+
+ /* Called after decompression */
+ static inline bool writer_check_length(struct writer *w)
+ {
+ return w->op == w->op_limit;
+ }
+
+ /*
+ * Copy "len" bytes from "src" to "op", one byte at a time. Used for
+ * handling COPY operations where the input and output regions may
+ * overlap. For example, suppose:
+ * src == "ab"
+ * op == src + 2
+ * len == 20
+ * After IncrementalCopy(src, op, len), the result will have
+ * eleven copies of "ab"
+ * ababababababababababab
+ * Note that this does not match the semantics of either memcpy()
+ * or memmove().
+ */
+ static inline void incremental_copy(const char *src, char *op, int len)
+ {
+ DCHECK_GT(len, 0);
+ do {
+ *op++ = *src++;
+ } while (--len > 0);
+ }
+
+ /*
+ * Equivalent to IncrementalCopy except that it can write up to ten extra
+ * bytes after the end of the copy, and that it is faster.
+ *
+ * The main part of this loop is a simple copy of eight bytes at a time until
+ * we've copied (at least) the requested amount of bytes. However, if op and
+ * src are less than eight bytes apart (indicating a repeating pattern of
+ * length < 8), we first need to expand the pattern in order to get the correct
+ * results. For instance, if the buffer looks like this, with the eight-byte
+ * <src> and <op> patterns marked as intervals:
+ *
+ * abxxxxxxxxxxxx
+ * [------] src
+ * [------] op
+ *
+ * a single eight-byte copy from <src> to <op> will repeat the pattern once,
+ * after which we can move <op> two bytes without moving <src>:
+ *
+ * ababxxxxxxxxxx
+ * [------] src
+ * [------] op
+ *
+ * and repeat the exercise until the two no longer overlap.
+ *
+ * This allows us to do very well in the special case of one single byte
+ * repeated many times, without taking a big hit for more general cases.
+ *
+ * The worst case of extra writing past the end of the match occurs when
+ * op - src == 1 and len == 1; the last copy will read from byte positions
+ * [0..7] and write to [4..11], whereas it was only supposed to write to
+ * position 1. Thus, ten excess bytes.
+ */
+
+ #define kmax_increment_copy_overflow 10
+
+ static inline void incremental_copy_fast_path(const char *src, char *op,
+ int len)
+ {
+ while (op - src < 8) {
+ UNALIGNED_STORE64(op, UNALIGNED_LOAD64(src));
+ len -= op - src;
+ op += op - src;
+ }
+ while (len > 0) {
+ UNALIGNED_STORE64(op, UNALIGNED_LOAD64(src));
+ src += 8;
+ op += 8;
+ len -= 8;
+ }
+ }
+
+ static inline bool writer_append_from_self(struct writer *w, u32 offset,
+ u32 len)
+ {
+ char *op = w->op;
+ const int space_left = w->op_limit - op;
+
+ if (op - w->base <= offset - 1u) /* -1u catches offset==0 */
+ return false;
+ if (len <= 16 && offset >= 8 && space_left >= 16) {
+ /* Fast path, used for the majority (70-80%) of dynamic
+ * invocations. */
+ UNALIGNED_STORE64(op, UNALIGNED_LOAD64(op - offset));
+ UNALIGNED_STORE64(op + 8, UNALIGNED_LOAD64(op - offset + 8));
+ } else {
+ if (space_left >= len + kmax_increment_copy_overflow) {
+ incremental_copy_fast_path(op - offset, op, len);
+ } else {
+ if (space_left < len) {
+ return false;
+ }
+ incremental_copy(op - offset, op, len);
+ }
+ }
+
+ w->op = op + len;
+ return true;
+ }
+
+ static inline bool writer_append(struct writer *w, const char *ip, u32 len)
+ {
+ char *op = w->op;
+ const int space_left = w->op_limit - op;
+ if (space_left < len)
+ return false;
+ memcpy(op, ip, len);
+ w->op = op + len;
+ return true;
+ }
+
+ static inline bool writer_try_fast_append(struct writer *w, const char *ip,
+ u32 available, u32 len)
+ {
+ char *op = w->op;
+ const int space_left = w->op_limit - op;
+ if (len <= 16 && available >= 16 && space_left >= 16) {
+ /* Fast path, used for the majority (~95%) of invocations */
+ UNALIGNED_STORE64(op, UNALIGNED_LOAD64(ip));
+ UNALIGNED_STORE64(op + 8, UNALIGNED_LOAD64(ip + 8));
+ w->op = op + len;
+ return true;
+ }
+ return false;
+ }
+
+ /*
+ * Any hash function will produce a valid compressed bitstream, but a good
+ * hash function reduces the number of collisions and thus yields better
+ * compression for compressible input, and more speed for incompressible
+ * input. Of course, it doesn't hurt if the hash function is reasonably fast
+ * either, as it gets called a lot.
+ */
+ static inline u32 hash_bytes(u32 bytes, int shift)
+ {
+ u32 kmul = 0x1e35a7bd;
+ return (bytes * kmul) >> shift;
+ }
+
+ static inline u32 hash(const char *p, int shift)
+ {
+ return hash_bytes(UNALIGNED_LOAD32(p), shift);
+ }
+
+ /*
+ * Compressed data can be defined as:
+ * compressed := item* literal*
+ * item := literal* copy
+ *
+ * The trailing literal sequence has a space blowup of at most 62/60
+ * since a literal of length 60 needs one tag byte + one extra byte
+ * for length information.
+ *
+ * Item blowup is trickier to measure. Suppose the "copy" op copies
+ * 4 bytes of data. Because of a special check in the encoding code,
+ * we produce a 4-byte copy only if the offset is < 65536. Therefore
+ * the copy op takes 3 bytes to encode, and this type of item leads
+ * to at most the 62/60 blowup for representing literals.
+ *
+ * Suppose the "copy" op copies 5 bytes of data. If the offset is big
+ * enough, it will take 5 bytes to encode the copy op. Therefore the
+ * worst case here is a one-byte literal followed by a five-byte copy.
+ * I.e., 6 bytes of input turn into 7 bytes of "compressed" data.
+ *
+ * This last factor dominates the blowup, so the final estimate is:
+ */
+ size_t snappy_max_compressed_length(size_t source_len)
+ {
+ return 32 + source_len + source_len / 6;
+ }
+ EXPORT_SYMBOL(snappy_max_compressed_length);
+
+ enum {
+ LITERAL = 0,
+ COPY_1_BYTE_OFFSET = 1, /* 3 bit length + 3 bits of offset in opcode */
+ COPY_2_BYTE_OFFSET = 2,
+ COPY_4_BYTE_OFFSET = 3
+ };
+
+ static inline char *emit_literal(char *op,
+ const char *literal,
+ int len, bool allow_fast_path)
+ {
+ int n = len - 1; /* Zero-length literals are disallowed */
+
+ if (n < 60) {
+ /* Fits in tag byte */
+ *op++ = LITERAL | (n << 2);
+
+ /*
+ * The vast majority of copies are below 16 bytes, for which a
+ * call to memcpy is overkill. This fast path can sometimes
+ * copy up to 15 bytes too much, but that is okay in the
+ * main loop, since we have a bit to go on for both sides:
+ *
+ * - The input will always have kInputMarginBytes = 15 extra
+ * available bytes, as long as we're in the main loop, and
+ * if not, allow_fast_path = false.
+ * - The output will always have 32 spare bytes (see
+ * MaxCompressedLength).
+ */
+ if (allow_fast_path && len <= 16) {
+ UNALIGNED_STORE64(op, UNALIGNED_LOAD64(literal));
+ UNALIGNED_STORE64(op + 8,
+ UNALIGNED_LOAD64(literal + 8));
+ return op + len;
+ }
+ } else {
+ /* Encode in upcoming bytes */
+ char *base = op;
+ int count = 0;
+ op++;
+ while (n > 0) {
+ *op++ = n & 0xff;
+ n >>= 8;
+ count++;
+ }
+ DCHECK(count >= 1);
+ DCHECK(count <= 4);
+ *base = LITERAL | ((59 + count) << 2);
+ }
+ memcpy(op, literal, len);
+ return op + len;
+ }
+
+ static inline char *emit_copy_less_than64(char *op, int offset, int len)
+ {
+ DCHECK_LE(len, 64);
+ DCHECK_GE(len, 4);
+ DCHECK_LT(offset, 65536);
+
+ if ((len < 12) && (offset < 2048)) {
+ int len_minus_4 = len - 4;
+ DCHECK(len_minus_4 < 8); /* Must fit in 3 bits */
+ *op++ =
+ COPY_1_BYTE_OFFSET | ((len_minus_4) << 2) | ((offset >> 8)
+ << 5);
+ *op++ = offset & 0xff;
+ } else {
+ *op++ = COPY_2_BYTE_OFFSET | ((len - 1) << 2);
+ put_unaligned_le16(offset, op);
+ op += 2;
+ }
+ return op;
+ }
+
+ static inline char *emit_copy(char *op, int offset, int len)
+ {
+ /*
+ * Emit 64 byte copies but make sure to keep at least four bytes
+ * reserved
+ */
+ while (len >= 68) {
+ op = emit_copy_less_than64(op, offset, 64);
+ len -= 64;
+ }
+
+ /*
+ * Emit an extra 60 byte copy if have too much data to fit in
+ * one copy
+ */
+ if (len > 64) {
+ op = emit_copy_less_than64(op, offset, 60);
+ len -= 60;
+ }
+
+ /* Emit remainder */
+ op = emit_copy_less_than64(op, offset, len);
+ return op;
+ }
+
+ /**
+ * snappy_uncompressed_length - return length of uncompressed output.
+ * @start: compressed buffer
+ * @n: length of compressed buffer.
+ * @result: Write the length of the uncompressed output here.
+ *
+ * Returns true when successfull, otherwise false.
+ */
+ bool snappy_uncompressed_length(const char *start, size_t n, size_t * result)
+ {
+ u32 v = 0;
+
+ /* Temp fix: decrement by 4 bytes, because compress adds 4 extra bytes */
+ const char *limit = (start + 4) + (n - 4);
+ if (varint_parse32_with_limit(start, limit, &v) != NULL) {
+ *result = v;
+ return true;
+ } else {
+ return false;
+ }
+ }
+ EXPORT_SYMBOL(snappy_uncompressed_length);
+
+ #define kblock_log 15
+ #define kblock_size (1 << kblock_log)
+
+ /*
+ * This value could be halfed or quartered to save memory
+ * at the cost of slightly worse compression.
+ */
+ #define kmax_hash_table_bits 14
+ #define kmax_hash_table_size (1 << kmax_hash_table_bits)
+
+ /*
+ * Use smaller hash table when input.size() is smaller, since we
+ * fill the table, incurring O(hash table size) overhead for
+ * compression, and if the input is short, we won't need that
+ * many hash table entries anyway.
+ */
+ static u16 *get_hash_table(struct snappy_env *env, size_t input_size,
+ int *table_size)
+ {
+ int htsize = 256;
+
+ DCHECK(kmax_hash_table_size >= 256);
+ while (htsize < kmax_hash_table_size && htsize < input_size)
+ htsize <<= 1;
+ CHECK_EQ(0, htsize & (htsize - 1));
+ CHECK_LE(htsize, kmax_hash_table_size);
+
+ u16 *table;
+ table = env->hash_table;
+
+ *table_size = htsize;
+ memset(table, 0, htsize * sizeof(*table));
+ return table;
+ }
+
+ /*
+ * Return the largest n such that
+ *
+ * s1[0,n-1] == s2[0,n-1]
+ * and n <= (s2_limit - s2).
+ *
+ * Does not read *s2_limit or beyond.
+ * Does not read *(s1 + (s2_limit - s2)) or beyond.
+ * Requires that s2_limit >= s2.
+ *
+ * Separate implementation for x86_64, for speed. Uses the fact that
+ * x86_64 is little endian.
+ */
+ #if defined(__LITTLE_ENDIAN__) && BITS_PER_LONG == 64
+ static inline int find_match_length(const char *s1,
+ const char *s2, const char *s2_limit)
+ {
+ int matched = 0;
+
+ DCHECK_GE(s2_limit, s2);
+ /*
+ * Find out how long the match is. We loop over the data 64 bits at a
+ * time until we find a 64-bit block that doesn't match; then we find
+ * the first non-matching bit and use that to calculate the total
+ * length of the match.
+ */
+ while (likely(s2 <= s2_limit - 8)) {
+ if (unlikely
+ (UNALIGNED_LOAD64(s2) == UNALIGNED_LOAD64(s1 + matched))) {
+ s2 += 8;
+ matched += 8;
+ } else {
+ /*
+ * On current (mid-2008) Opteron models there
+ * is a 3% more efficient code sequence to
+ * find the first non-matching byte. However,
+ * what follows is ~10% better on Intel Core 2
+ * and newer, and we expect AMD's bsf
+ * instruction to improve.
+ */
+ u64 x =
+ UNALIGNED_LOAD64(s2) ^ UNALIGNED_LOAD64(s1 +
+ matched);
+ int matching_bits = find_lsb_set_non_zero64(x);
+ matched += matching_bits >> 3;
+ return matched;
+ }
+ }
+ while (likely(s2 < s2_limit)) {
+ if (likely(s1[matched] == *s2)) {
+ ++s2;
+ ++matched;
+ } else {
+ return matched;
+ }
+ }
+ return matched;
+ }
+ #else
+ static inline int find_match_length(const char *s1,
+ const char *s2, const char *s2_limit)
+ {
+ /* Implementation based on the x86-64 version, above. */
+ DCHECK_GE(s2_limit, s2);
+ int matched = 0;
+
+ while (s2 <= s2_limit - 4 &&
+ UNALIGNED_LOAD32(s2) == UNALIGNED_LOAD32(s1 + matched)) {
+ s2 += 4;
+ matched += 4;
+ }
+ if (is_little_endian() && s2 <= s2_limit - 4) {
+ u32 x =
+ UNALIGNED_LOAD32(s2) ^ UNALIGNED_LOAD32(s1 + matched);
+ int matching_bits = find_lsb_set_non_zero(x);
+ matched += matching_bits >> 3;
+ } else {
+ while ((s2 < s2_limit) && (s1[matched] == *s2)) {
+ ++s2;
+ ++matched;
+ }
+ }
+ return matched;
+ }
+ #endif
+
+ /*
+ * For 0 <= offset <= 4, GetU32AtOffset(UNALIGNED_LOAD64(p), offset) will
+ * equal UNALIGNED_LOAD32(p + offset). Motivation: On x86-64 hardware we have
+ * empirically found that overlapping loads such as
+ * UNALIGNED_LOAD32(p) ... UNALIGNED_LOAD32(p+1) ... UNALIGNED_LOAD32(p+2)
+ * are slower than UNALIGNED_LOAD64(p) followed by shifts and casts to u32.
+ */
+ static inline u32 get_u32_at_offset(u64 v, int offset)
+ {
+ DCHECK(0 <= offset && offset <= 4);
+ return v >> (is_little_endian()? 8 * offset : 32 - 8 * offset);
+ }
+
+ /*
+ * Flat array compression that does not emit the "uncompressed length"
+ * prefix. Compresses "input" string to the "*op" buffer.
+ *
+ * REQUIRES: "input" is at most "kBlockSize" bytes long.
+ * REQUIRES: "op" points to an array of memory that is at least
+ * "MaxCompressedLength(input.size())" in size.
+ * REQUIRES: All elements in "table[0..table_size-1]" are initialized to zero.
+ * REQUIRES: "table_size" is a power of two
+ *
+ * Returns an "end" pointer into "op" buffer.
+ * "end - op" is the compressed size of "input".
+ */
+
+ static char *compress_fragment(const char *const input,
+ const size_t input_size,
+ char *op, u16 * table, const int table_size)
+ {
+ /* "ip" is the input pointer, and "op" is the output pointer. */
+ const char *ip = input;
+ CHECK_LE(input_size, kblock_size);
+ CHECK_EQ(table_size & (table_size - 1), 0);
+ const int shift = 32 - log2_floor(table_size);
+ DCHECK_EQ(UINT_MAX >> shift, table_size - 1);
+ const char *ip_end = input + input_size;
+ const char *baseip = ip;
+ /*
+ * Bytes in [next_emit, ip) will be emitted as literal bytes. Or
+ * [next_emit, ip_end) after the main loop.
+ */
+ const char *next_emit = ip;
+
+ const int kinput_margin_bytes = 15;
+
+ if (likely(input_size >= kinput_margin_bytes)) {
+ const char *ip_limit = input + input_size -
+ kinput_margin_bytes;
+
+ u32 next_hash;
+ for (next_hash = hash(++ip, shift);;) {
+ DCHECK_LT(next_emit, ip);
+ /*
+ * The body of this loop calls EmitLiteral once and then EmitCopy one or
+ * more times. (The exception is that when we're close to exhausting
+ * the input we goto emit_remainder.)
+ *
+ * In the first iteration of this loop we're just starting, so
+ * there's nothing to copy, so calling EmitLiteral once is
+ * necessary. And we only start a new iteration when the
+ * current iteration has determined that a call to EmitLiteral will
+ * precede the next call to EmitCopy (if any).
+ *
+ * Step 1: Scan forward in the input looking for a 4-byte-long match.
+ * If we get close to exhausting the input then goto emit_remainder.
+ *
+ * Heuristic match skipping: If 32 bytes are scanned with no matches
+ * found, start looking only at every other byte. If 32 more bytes are
+ * scanned, look at every third byte, etc.. When a match is found,
+ * immediately go back to looking at every byte. This is a small loss
+ * (~5% performance, ~0.1% density) for lcompressible data due to more
+ * bookkeeping, but for non-compressible data (such as JPEG) it's a huge
+ * win since the compressor quickly "realizes" the data is incompressible
+ * and doesn't bother looking for matches everywhere.
+ *
+ * The "skip" variable keeps track of how many bytes there are since the
+ * last match; dividing it by 32 (ie. right-shifting by five) gives the
+ * number of bytes to move ahead for each iteration.
+ */
+ u32 skip = 32;
+
+ const char *next_ip = ip;
+ const char *candidate;
+ do {
+ ip = next_ip;
+ u32 hval = next_hash;
+ DCHECK_EQ(hval, hash(ip, shift));
+ u32 bytes_between_hash_lookups = skip++ >> 5;
+ next_ip = ip + bytes_between_hash_lookups;
+ if (unlikely(next_ip > ip_limit)) {
+ goto emit_remainder;
+ }
+ next_hash = hash(next_ip, shift);
+ candidate = baseip + table[hval];
+ DCHECK_GE(candidate, baseip);
+ DCHECK_LT(candidate, ip);
+
+ table[hval] = ip - baseip;
+ } while (likely(UNALIGNED_LOAD32(ip) !=
+ UNALIGNED_LOAD32(candidate)));
+
+ /*
+ * Step 2: A 4-byte match has been found. We'll later see if more
+ * than 4 bytes match. But, prior to the match, input
+ * bytes [next_emit, ip) are unmatched. Emit them as "literal bytes."
+ */
+ DCHECK_LE(next_emit + 16, ip_end);
+ op = emit_literal(op, next_emit, ip - next_emit, true);
+
+ /*
+ * Step 3: Call EmitCopy, and then see if another EmitCopy could
+ * be our next move. Repeat until we find no match for the
+ * input immediately after what was consumed by the last EmitCopy call.
+ *
+ * If we exit this loop normally then we need to call EmitLiteral next,
+ * though we don't yet know how big the literal will be. We handle that
+ * by proceeding to the next iteration of the main loop. We also can exit
+ * this loop via goto if we get close to exhausting the input.
+ */
+ u64 input_bytes = 0;
+ u32 candidate_bytes = 0;
+
+ do {
+ /*
+ * We have a 4-byte match at ip, and no need to emit any
+ * "literal bytes" prior to ip.
+ */
+ const char *base = ip;
+ int matched = 4 +
+ find_match_length(candidate + 4, ip + 4,
+ ip_end);
+ ip += matched;
+ int offset = base - candidate;
+ DCHECK_EQ(0, memcmp(base, candidate, matched));
+ op = emit_copy(op, offset, matched);
+ /*
+ * We could immediately start working at ip now, but to improve
+ * compression we first update table[Hash(ip - 1, ...)].
+ */
+ const char *insert_tail = ip - 1;
+ next_emit = ip;
+ if (unlikely(ip >= ip_limit)) {
+ goto emit_remainder;
+ }
+ input_bytes = UNALIGNED_LOAD64(insert_tail);
+ u32 prev_hash =
+ hash_bytes(get_u32_at_offset
+ (input_bytes, 0), shift);
+ table[prev_hash] = ip - baseip - 1;
+ u32 cur_hash =
+ hash_bytes(get_u32_at_offset
+ (input_bytes, 1), shift);
+ candidate = baseip + table[cur_hash];
+ candidate_bytes = UNALIGNED_LOAD32(candidate);
+ table[cur_hash] = ip - baseip;
+ } while (get_u32_at_offset(input_bytes, 1) ==
+ candidate_bytes);
+
+ next_hash =
+ hash_bytes(get_u32_at_offset(input_bytes, 2),
+ shift);
+ ++ip;
+ }
+ }
+
+ emit_remainder:
+ /* Emit the remaining bytes as a literal */
+ if (next_emit < ip_end)
+ op = emit_literal(op, next_emit, ip_end - next_emit, false);
+
+ return op;
+ }
+
+ /*
+ * -----------------------------------------------------------------------
+ * Lookup table for decompression code. Generated by ComputeTable() below.
+ * -----------------------------------------------------------------------
+ */
+
+ /* Mapping from i in range [0,4] to a mask to extract the bottom 8*i bits */
+ static const u32 wordmask[] = {
+ 0u, 0xffu, 0xffffu, 0xffffffu, 0xffffffffu
+ };
+
+ /*
+ * Data stored per entry in lookup table:
+ * Range Bits-used Description
+ * ------------------------------------
+ * 1..64 0..7 Literal/copy length encoded in opcode byte
+ * 0..7 8..10 Copy offset encoded in opcode byte / 256
+ * 0..4 11..13 Extra bytes after opcode
+ *
+ * We use eight bits for the length even though 7 would have sufficed
+ * because of efficiency reasons:
+ * (1) Extracting a byte is faster than a bit-field
+ * (2) It properly aligns copy offset so we do not need a <<8
+ */
+ static const u16 char_table[256] = {
+ 0x0001, 0x0804, 0x1001, 0x2001, 0x0002, 0x0805, 0x1002, 0x2002,
+ 0x0003, 0x0806, 0x1003, 0x2003, 0x0004, 0x0807, 0x1004, 0x2004,
+ 0x0005, 0x0808, 0x1005, 0x2005, 0x0006, 0x0809, 0x1006, 0x2006,
+ 0x0007, 0x080a, 0x1007, 0x2007, 0x0008, 0x080b, 0x1008, 0x2008,
+ 0x0009, 0x0904, 0x1009, 0x2009, 0x000a, 0x0905, 0x100a, 0x200a,
+ 0x000b, 0x0906, 0x100b, 0x200b, 0x000c, 0x0907, 0x100c, 0x200c,
+ 0x000d, 0x0908, 0x100d, 0x200d, 0x000e, 0x0909, 0x100e, 0x200e,
+ 0x000f, 0x090a, 0x100f, 0x200f, 0x0010, 0x090b, 0x1010, 0x2010,
+ 0x0011, 0x0a04, 0x1011, 0x2011, 0x0012, 0x0a05, 0x1012, 0x2012,
+ 0x0013, 0x0a06, 0x1013, 0x2013, 0x0014, 0x0a07, 0x1014, 0x2014,
+ 0x0015, 0x0a08, 0x1015, 0x2015, 0x0016, 0x0a09, 0x1016, 0x2016,
+ 0x0017, 0x0a0a, 0x1017, 0x2017, 0x0018, 0x0a0b, 0x1018, 0x2018,
+ 0x0019, 0x0b04, 0x1019, 0x2019, 0x001a, 0x0b05, 0x101a, 0x201a,
+ 0x001b, 0x0b06, 0x101b, 0x201b, 0x001c, 0x0b07, 0x101c, 0x201c,
+ 0x001d, 0x0b08, 0x101d, 0x201d, 0x001e, 0x0b09, 0x101e, 0x201e,
+ 0x001f, 0x0b0a, 0x101f, 0x201f, 0x0020, 0x0b0b, 0x1020, 0x2020,
+ 0x0021, 0x0c04, 0x1021, 0x2021, 0x0022, 0x0c05, 0x1022, 0x2022,
+ 0x0023, 0x0c06, 0x1023, 0x2023, 0x0024, 0x0c07, 0x1024, 0x2024,
+ 0x0025, 0x0c08, 0x1025, 0x2025, 0x0026, 0x0c09, 0x1026, 0x2026,
+ 0x0027, 0x0c0a, 0x1027, 0x2027, 0x0028, 0x0c0b, 0x1028, 0x2028,
+ 0x0029, 0x0d04, 0x1029, 0x2029, 0x002a, 0x0d05, 0x102a, 0x202a,
+ 0x002b, 0x0d06, 0x102b, 0x202b, 0x002c, 0x0d07, 0x102c, 0x202c,
+ 0x002d, 0x0d08, 0x102d, 0x202d, 0x002e, 0x0d09, 0x102e, 0x202e,
+ 0x002f, 0x0d0a, 0x102f, 0x202f, 0x0030, 0x0d0b, 0x1030, 0x2030,
+ 0x0031, 0x0e04, 0x1031, 0x2031, 0x0032, 0x0e05, 0x1032, 0x2032,
+ 0x0033, 0x0e06, 0x1033, 0x2033, 0x0034, 0x0e07, 0x1034, 0x2034,
+ 0x0035, 0x0e08, 0x1035, 0x2035, 0x0036, 0x0e09, 0x1036, 0x2036,
+ 0x0037, 0x0e0a, 0x1037, 0x2037, 0x0038, 0x0e0b, 0x1038, 0x2038,
+ 0x0039, 0x0f04, 0x1039, 0x2039, 0x003a, 0x0f05, 0x103a, 0x203a,
+ 0x003b, 0x0f06, 0x103b, 0x203b, 0x003c, 0x0f07, 0x103c, 0x203c,
+ 0x0801, 0x0f08, 0x103d, 0x203d, 0x1001, 0x0f09, 0x103e, 0x203e,
+ 0x1801, 0x0f0a, 0x103f, 0x203f, 0x2001, 0x0f0b, 0x1040, 0x2040
+ };
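+
+ /*
+ * Worked example (editorial illustration, not part of the original patch):
+ * for tag byte 0xf1, char_table[0xf1] == 0x0f08, which the decoder below
+ * unpacks as
+ * length = entry & 0xff = 8
+ * copy offset = entry & 0x700 = 0x700 (bits 8..10, already shifted)
+ * extra bytes = entry >> 11 = 1 (one more offset byte follows the tag)
+ * i.e. an 8-byte copy whose offset is 0x700 plus the next input byte.
+ */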
+
+ struct snappy_decompressor {
+ struct source *reader; /* Underlying source of bytes to decompress */
+ const char *ip; /* Points to next buffered byte */
+ const char *ip_limit; /* Points just past buffered bytes */
+ u32 peeked; /* Bytes peeked from reader (need to skip) */
+ bool eof; /* Hit end of input without an error? */
+ char scratch[5]; /* Temporary buffer for peekfast boundaries */
+ };
+
+ static void
+ init_snappy_decompressor(struct snappy_decompressor *d, struct source *reader)
+ {
+ d->reader = reader;
+ d->ip = NULL;
+ d->ip_limit = NULL;
+ d->peeked = 0;
+ d->eof = false;
+ }
+
+ static void exit_snappy_decompressor(struct snappy_decompressor *d)
+ {
+ skip(d->reader, d->peeked);
+ }
+
+ /*
+ * Read the uncompressed length stored at the start of the compressed data.
+ * On success, stores the length in *result and returns true.
+ * On failure, returns false.
+ */
+ static bool read_uncompressed_length(struct snappy_decompressor *d,
+ u32 * result)
+ {
+ DCHECK(d->ip == NULL); /* Must not have read anything yet */
+
+ /* Length is encoded in 1..5 bytes */
+ *result = 0;
+ u32 shift = 0;
+ while (true) {
+ if (shift >= 32)
+ return false;
+ size_t n;
+ const char *ip = peek(d->reader, &n);
+ if (n == 0)
+ return false;
+ const unsigned char c = *(const unsigned char *)(ip);
+ skip(d->reader, 1);
+ *result |= (u32) (c & 0x7f) << shift;
+ if (c < 128) {
+ break;
+ }
+ shift += 7;
+ }
+ return true;
+ }
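+
+ /*
+ * Worked example (editorial illustration, not part of the original patch):
+ * the length prefix is a little-endian base-128 varint. An uncompressed
+ * length of 300 (0x12c) is stored as the two bytes 0xac 0x02:
+ * (0xac & 0x7f) = 44 from the first byte, (0x02 << 7) = 256 from the
+ * second, and 44 + 256 = 300; the high bit of 0xac marks a continuation
+ * byte, while 0x02 < 128 ends the loop above.
+ */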
+
+ static bool refill_tag(struct snappy_decompressor *d);
+
+ /*
+ * Process all items found in the input.
+ * Returns when the input is exhausted or an error is encountered; the
+ * caller checks d->eof afterwards to distinguish the two cases.
+ */
+ static void decompress_all_tags(struct snappy_decompressor *d,
+ struct writer *writer)
+ {
+ const char *ip = d->ip;
+
+ /*
+ * We could have put this refill fragment only at the beginning of the loop.
+ * However, duplicating it at the end of each branch gives the compiler more
+ * scope to optimize the <ip_limit_ - ip> expression based on the local
+ * context, which overall increases speed.
+ */
+ #define MAYBE_REFILL() \
+ if (d->ip_limit - ip < 5) { \
+ d->ip = ip; \
+ if (!refill_tag(d)) return; \
+ ip = d->ip; \
+ }
+
+
+ MAYBE_REFILL();
+ for (;;) {
+ if (d->ip_limit - ip < 5) {
+ d->ip = ip;
+ if (!refill_tag(d))
+ return;
+ ip = d->ip;
+ }
+
+ const unsigned char c = *(const unsigned char *)(ip++);
+
+ if ((c & 0x3) == LITERAL) {
+ u32 literal_length = (c >> 2) + 1;
+ if (writer_try_fast_append(writer, ip, d->ip_limit - ip,
+ literal_length)) {
+ DCHECK_LT(literal_length, 61);
+ ip += literal_length;
+ MAYBE_REFILL();
+ continue;
+ }
+ if (unlikely(literal_length >= 61)) {
+ /* Long literal */
+ const u32 literal_ll = literal_length - 60;
+ literal_length = (get_unaligned_le32(ip) &
+ wordmask[literal_ll]) + 1;
+ ip += literal_ll;
+ }
+
+ u32 avail = d->ip_limit - ip;
+ while (avail < literal_length) {
+ if (!writer_append(writer, ip, avail))
+ return;
+ literal_length -= avail;
+ skip(d->reader, d->peeked);
+ size_t n;
+ ip = peek(d->reader, &n);
+ avail = n;
+ d->peeked = avail;
+ if (avail == 0)
+ return; /* Premature end of input */
+ d->ip_limit = ip + avail;
+ }
+ if (!writer_append(writer, ip, literal_length))
+ return;
+ ip += literal_length;
+ MAYBE_REFILL();
+ } else {
+ const u32 entry = char_table[c];
+ const u32 trailer = get_unaligned_le32(ip) &
+ wordmask[entry >> 11];
+ const u32 length = entry & 0xff;
+ ip += entry >> 11;
+
+ /*
+ * copy_offset/256 is encoded in bits 8..10.
+ * By just fetching those bits, we get
+ * copy_offset (since the bit-field starts at
+ * bit 8).
+ */
+ const u32 copy_offset = entry & 0x700;
+ if (!writer_append_from_self(writer,
+ copy_offset + trailer,
+ length))
+ return;
+ MAYBE_REFILL();
+ }
+ }
+ }
+
+ #undef MAYBE_REFILL
+
+ static bool refill_tag(struct snappy_decompressor *d)
+ {
+ const char *ip = d->ip;
+
+ if (ip == d->ip_limit) {
+ size_t n;
+ /* Fetch a new fragment from the reader */
+ skip(d->reader, d->peeked); /* All peeked bytes are used up */
+ ip = peek(d->reader, &n);
+ d->peeked = n;
+ if (n == 0) {
+ d->eof = true;
+ return false;
+ }
+ d->ip_limit = ip + n;
+ }
+
+ /* Read the tag character */
+ DCHECK_LT(ip, d->ip_limit);
+ const unsigned char c = *(const unsigned char *)(ip);
+ const u32 entry = char_table[c];
+ const u32 needed = (entry >> 11) + 1; /* +1 byte for 'c' */
+ DCHECK_LE(needed, sizeof(d->scratch));
+
+ /* Read more bytes from reader if needed */
+ u32 nbuf = d->ip_limit - ip;
+
+ if (nbuf < needed) {
+ /*
+ * Stitch together bytes from ip and reader to form the word
+ * contents. We store the needed bytes in "scratch". They
+ * will be consumed immediately by the caller since we do not
+ * read more than we need.
+ */
+ memmove(d->scratch, ip, nbuf);
+ skip(d->reader, d->peeked); /* All peeked bytes are used up */
+ d->peeked = 0;
+ while (nbuf < needed) {
+ size_t length;
+ const char *src = peek(d->reader, &length);
+ if (length == 0)
+ return false;
+ u32 to_add = min_t(u32, needed - nbuf, length);
+ memcpy(d->scratch + nbuf, src, to_add);
+ nbuf += to_add;
+ skip(d->reader, to_add);
+ }
+ DCHECK_EQ(nbuf, needed);
+ d->ip = d->scratch;
+ d->ip_limit = d->scratch + needed;
+ } else if (nbuf < 5) {
+ /*
+ * Have enough bytes, but move into scratch so that we do not
+ * read past end of input
+ */
+ memmove(d->scratch, ip, nbuf);
+ skip(d->reader, d->peeked); /* All peeked bytes are used up */
+ d->peeked = 0;
+ d->ip = d->scratch;
+ d->ip_limit = d->scratch + nbuf;
+ } else {
+ /* Pass pointer to buffer returned by reader. */
+ d->ip = ip;
+ }
+ return true;
+ }
+
+ static int internal_uncompress(struct source *r,
+ struct writer *writer, u32 max_len)
+ {
+ struct snappy_decompressor decompressor;
+ u32 uncompressed_len = 0;
+
+ init_snappy_decompressor(&decompressor, r);
+
+ if (!read_uncompressed_length(&decompressor, &uncompressed_len))
+ return -EIO;
+ /* Protect against possible DoS attack */
+ if ((u64) (uncompressed_len) > max_len)
+ return -EIO;
+
+ writer_set_expected_length(writer, uncompressed_len);
+
+ /* Process the entire input */
+ decompress_all_tags(&decompressor, writer);
+
+ exit_snappy_decompressor(&decompressor);
+ return (decompressor.eof && writer_check_length(writer)) ? 0 : -EIO;
+ }
+
+ static inline int compress(struct snappy_env *env, struct source *reader,
+ struct sink *writer)
+ {
+ int err;
+ size_t written = 0;
+ int N = available(reader);
+ char ulength[kmax32];
+ char *p = varint_encode32(ulength, N);
+
+ append(writer, ulength, p - ulength);
+ written += (p - ulength);
+
+ while (N > 0) {
+ /* Get next block to compress (without copying if possible) */
+ size_t fragment_size;
+ const char *fragment = peek(reader, &fragment_size);
+ if (fragment_size == 0) {
+ err = -EIO;
+ goto out;
+ }
+ const int num_to_read = min_t(int, N, kblock_size);
+ size_t bytes_read = fragment_size;
+
+ int pending_advance = 0;
+ if (bytes_read >= num_to_read) {
+ /* Buffer returned by reader is large enough */
+ pending_advance = num_to_read;
+ fragment_size = num_to_read;
+ }
+ else {
+ memcpy(env->scratch, fragment, bytes_read);
+ skip(reader, bytes_read);
+
+ while (bytes_read < num_to_read) {
+ fragment = peek(reader, &fragment_size);
+ size_t n =
+ min_t(size_t, fragment_size,
+ num_to_read - bytes_read);
+ memcpy(env->scratch + bytes_read, fragment, n);
+ bytes_read += n;
+ skip(reader, n);
+ }
+ DCHECK_EQ(bytes_read, num_to_read);
+ fragment = env->scratch;
+ fragment_size = num_to_read;
+ }
+ if (fragment_size < num_to_read)
+ return -EIO;
+
+ /* Get encoding table for compression */
+ int table_size;
+ u16 *table = get_hash_table(env, num_to_read, &table_size);
+
+ /* Compress input_fragment and append to dest */
+ const int max_output =
+ snappy_max_compressed_length(num_to_read);
+
+ char *dest;
+ dest = sink_peek(writer, max_output);
+ if (!dest) {
+ /*
+ * Need a scratch buffer for the output,
+ * because the byte sink doesn't have enough
+ * in one piece.
+ */
+ dest = env->scratch_output;
+ }
+ char *end = compress_fragment(fragment, fragment_size,
+ dest, table, table_size);
+ append(writer, dest, end - dest);
+ written += (end - dest);
+
+ N -= num_to_read;
+ skip(reader, pending_advance);
+ }
+
+ err = 0;
+ out:
+ return err;
+ }
+
+
+ /**
+ * snappy_compress - Compress a buffer using the snappy compressor.
+ * @env: Preallocated environment
+ * @input: Input buffer
+ * @input_length: Length of input_buffer
+ * @compressed: Output buffer for compressed data
+ * @compressed_length: The real length of the output written here.
+ *
+ * Return 0 on success, otherwise a negative error code.
+ *
+ * The output buffer must be at least
+ * snappy_max_compressed_length(input_length) bytes long.
+ *
+ * Requires a preallocated environment from snappy_init_env.
+ * The environment does not keep state over individual calls
+ * of this function, just preallocates the memory.
+ */
+ int snappy_compress(struct snappy_env *env,
+ const char *input,
+ size_t input_length,
+ char *compressed, size_t *compressed_length)
+ {
+ int err;
+ struct source reader = {
+ .ptr = input,
+ .left = input_length
+ };
+ struct sink writer = {
+ .dest = compressed
+ };
+
+ /* Temp fix: reserve the first 4 bytes of the output (skipped again in snappy_uncompress) */
+ writer.dest += 4;
+ err = compress(env, &reader, &writer);
+
+ /* Compute how many bytes were added */
+ *compressed_length = (writer.dest - compressed);
+ return err;
+ }
+ EXPORT_SYMBOL(snappy_compress);
+
+ /**
+ * snappy_uncompress - Uncompress a snappy compressed buffer
+ * @compressed: Input buffer with compressed data
+ * @n: length of compressed buffer
+ * @uncompressed: buffer for uncompressed data
+ *
+ * The uncompressed data buffer must be at least
+ * snappy_uncompressed_length(compressed) bytes long.
+ *
+ * Return 0 on success, otherwise a negative error code.
+ */
+ int snappy_uncompress(const char *compressed, size_t n, char *uncompressed)
+ {
+ /* Temp fix: skip 4 bytes, because snappy_compress() reserves 4 extra bytes at the start */
+ struct source reader = {
+ .ptr = compressed + 4,
+ .left = n - 4
+ };
+ struct writer output = {
+ .base = uncompressed,
+ .op = uncompressed
+ };
+ return internal_uncompress(&reader, &output, 0xffffffff);
+ }
+ EXPORT_SYMBOL(snappy_uncompress);
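+
+ /*
+ * Editorial usage sketch, not part of the original patch: a minimal
+ * compress/uncompress round trip through the two entry points above.
+ * It assumes the prototypes from utils/snappy.h and the vmalloc/vfree
+ * shims from utils/compat.h are in scope, and that the caller already
+ * knows the uncompressed size.
+ */
+ static int example_round_trip(const char *src, size_t srclen, char *dst)
+ {
+ struct snappy_env env;
+ size_t complen = 0;
+ int err;
+ /* 4 extra bytes for the reserved length prefix noted in snappy_compress() */
+ char *comp = vmalloc(snappy_max_compressed_length(srclen) + 4);
+
+ if (!comp)
+ return -ENOMEM;
+ err = snappy_init_env(&env);
+ if (!err) {
+ err = snappy_compress(&env, src, srclen, comp, &complen);
+ snappy_free_env(&env);
+ }
+ if (!err)
+ err = snappy_uncompress(comp, complen, dst); /* dst must hold srclen bytes */
+ vfree(comp);
+ return err;
+ }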
+
+
+ /**
+ * snappy_init_env - Allocate snappy compression environment
+ * @env: Environment to preallocate
+ *
+ * Passing multiple entries in an iovec is not allowed
+ * on the environment allocated here.
+ * Returns 0 on success, otherwise negative errno.
+ * Must run in process context.
+ */
+ int snappy_init_env(struct snappy_env *env)
+ {
+ env->hash_table = vmalloc(sizeof(u16) * kmax_hash_table_size);
+ if (!env->hash_table)
+ return -ENOMEM;
+ return 0;
+ }
+ EXPORT_SYMBOL(snappy_init_env);
+
+ /**
+ * snappy_free_env - Free a snappy compression environment
+ * @env: Environment to free.
+ *
+ * Must run in process context.
+ */
+ void snappy_free_env(struct snappy_env *env)
+ {
+ vfree(env->hash_table);
+ #ifdef SG
+ vfree(env->scratch);
+ vfree(env->scratch_output);
+ #endif
+ memset(env, 0, sizeof(struct snappy_env));
+ }
+ EXPORT_SYMBOL(snappy_free_env);
*** /dev/null
--- b/src/include/utils/compat.h
***************
*** 0 ****
--- 1,39 ----
+
+ #include <stdlib.h>
+ #include <assert.h>
+ #include <string.h>
+ #include <errno.h>
+ #include <stdbool.h>
+ #include <limits.h>
+ #include <sys/uio.h>
+
+ typedef unsigned char u8;
+ typedef unsigned short u16;
+ typedef unsigned u32;
+ typedef unsigned long long u64;
+
+ #define BUG_ON(x) assert(!(x))
+
+ #define get_unaligned(x) (*(x))
+ #define get_unaligned_le32(x) (le32toh(*(u32 *)(x)))
+ #define put_unaligned(v,x) (*(x) = (v))
+ #define put_unaligned_le16(v,x) (*(u16 *)(x) = htole16(v))
+
+ #define vmalloc(x) malloc(x)
+ #define vfree(x) free(x)
+
+ #define EXPORT_SYMBOL(x)
+
+ #define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+
+ #define likely(x) __builtin_expect((x), 1)
+ #define unlikely(x) __builtin_expect((x), 0)
+
+ #define min_t(t,x,y) ((x) < (y) ? (x) : (y))
+ #define max_t(t,x,y) ((x) > (y) ? (x) : (y))
+
+ #if __BYTE_ORDER == __LITTLE_ENDIAN
+ #define __LITTLE_ENDIAN__ 1
+ #endif
+
+ #define BITS_PER_LONG (__SIZEOF_LONG__ * 8)
*** /dev/null
--- b/src/include/utils/snappy.h
***************
*** 0 ****
--- 1,36 ----
+ #ifndef _LINUX_SNAPPY_H
+ #define _LINUX_SNAPPY_H
+
+
+ /* Only needed for compression. This preallocates the worst case */
+ struct snappy_env {
+ unsigned short *hash_table;
+ void *scratch;
+ void *scratch_output;
+ };
+
+ struct iovec;
+ int snappy_init_env(struct snappy_env *env);
+ int snappy_init_env_sg(struct snappy_env *env, bool sg);
+ void snappy_free_env(struct snappy_env *env);
+ int snappy_uncompress_iov(struct iovec *iov_in, int iov_in_len,
+ size_t input_len, char *uncompressed);
+ int snappy_uncompress(const char *compressed, size_t n, char *uncompressed);
+ int snappy_compress(struct snappy_env *env,
+ const char *input,
+ size_t input_length,
+ char *compressed,
+ size_t *compressed_length);
+ int snappy_compress_iov(struct snappy_env *env,
+ struct iovec *iov_in,
+ int iov_in_len,
+ size_t input_length,
+ struct iovec *iov_out,
+ int iov_out_len,
+ size_t *compressed_length);
+ bool snappy_uncompressed_length(const char *buf, size_t len, size_t *result);
+ size_t snappy_max_compressed_length(size_t source_len);
+
+
+
+ #endif
wal_update_snappy_concat_oldandnew_tuple_v1.patchapplication/octet-stream; name=wal_update_snappy_concat_oldandnew_tuple_v1.patchDownload
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,66 ****
--- 60,69 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/snappy.h"
+ /* GUC variable for EWT compression ratio */
+ int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
***************
*** 617,622 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 620,679 ----
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+ /* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuple versions by concatenating the
+ * old and new tuple data and compressing the result with snappy. The
+ * encoded result is stored in *encdata, its length in *enclen; the
+ * caller must provide a buffer large enough for the compressed output.
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+ {
+ struct snappy_env env;
+ int err;
+ char *oldtupdata = (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ int32 oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ char *newtupdata = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ int32 newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ char buf[2 * MaxHeapTupleSize];
+
+ memcpy(buf, oldtupdata, oldtuplen);
+ memcpy(buf + oldtuplen, newtupdata, newtuplen);
+
+ err = snappy_init_env(&env);
+ if (err)
+ return false;
+
+ err = snappy_compress(&env, buf, oldtuplen + newtuplen, encdata, (size_t *)enclen);
+ snappy_free_env(&env);
+ if (err)
+ return false;
+
+ return true;
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple from its delta-encoded WAL tuple (EWT) and the old tuple version.
+ * ----------------
+ */
+ void
+ heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+ {
+ char buf[2 * MaxHeapTupleSize];
+ int32 oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ snappy_uncompressed_length(encdata, enclen, (size_t *)&newtup->t_len);
+
+ snappy_uncompress(encdata, enclen, buf);
+ memcpy((char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ buf + oldtuplen, newtup->t_len - oldtuplen);
+ }
+
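+ /*
+ * Usage sketch (editorial illustration, not part of the original patch):
+ * log_heap_update() and heap_xlog_update() pair the two functions above
+ * roughly as follows, with 'ewt' a caller-supplied buffer (the caller
+ * uses a MaxHeapTupleSize array):
+ *
+ * uint32 enclen;
+ *
+ * if (heap_delta_encode(tupdesc, oldtup, newtup, ewt, &enclen))
+ * {
+ * (the WAL record then carries only ewt/enclen; on redo the old
+ * tuple version is read from the page and the new one rebuilt)
+ * heap_delta_decode(ewt, enclen, oldtup, newtup);
+ * }
+ *
+ * If snappy cannot be initialized or compression fails, heap_delta_encode
+ * returns false and the caller logs the full new tuple instead.
+ */
+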
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 74,79 ****
--- 74,80 ----
/* GUC variable */
bool synchronize_seqscans = true;
+ extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
***************
*** 5815,5820 **** log_heap_update(Relation reln, Buffer oldbuf,
--- 5816,5827 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 5824,5838 **** log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 5831,5878 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by an Update
+ * operation. Currently we do it only when both the old and new tuple
+ * versions are on the same page, because during recovery, if the page
+ * containing the old tuple is corrupt, that corruption should not cascade
+ * to other pages. Under the general assumption that over long runs most
+ * updates tend to create the new tuple version on the same page, there
+ * should not be a significant impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (newtuplen > 32)
+ && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 5859,5867 **** log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 5899,5910 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
! * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 6671,6677 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 6714,6723 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 6686,6692 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 6732,6738 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 6746,6752 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
--- 6792,6798 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
***************
*** 6764,6770 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 6810,6816 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 6788,6794 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 6834,6840 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 6851,6860 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 6897,6927 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the record is EWT then decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /*
! * PG93FORMAT: Header + Control byte + history reference (2-3 bytes)
! * + New data (1 byte length + variable data) + ...
! */
! oldtup.t_data = oldtupdata;
! oldtup.t_len = ItemIdGetLength(lp);
! newtup.t_data = htup;
!
! heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 6870,6876 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 6937,6943 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 1249,1254 **** begin:;
--- 1249,1276 ----
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+ bool
+ XLogCheckBufferNeedsBackup(Buffer buffer)
+ {
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+ }
+
+ /*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 124,129 **** extern char *default_tablespace;
--- 124,130 ----
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+ extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
***************
*** 2410,2415 **** static struct config_int ConfigureNamesInt[] =
--- 2411,2427 ----
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 147,159 **** typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 147,168 ----
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* the old page's
! * all-visible bit
! * was cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* the new page's
! * all-visible bit
! * was cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* the new tuple data
! * is delta encoded
! * (EWT) */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 687,692 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 687,697 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+ extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 261,266 **** typedef struct CheckpointStatsData
--- 261,267 ----
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+ extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
*** a/src/test/regress/expected/update.out
--- b/src/test/regress/expected/update.out
***************
*** 97,99 **** SELECT a, b, char_length(c) FROM update_test;
--- 97,169 ----
(2 rows)
DROP TABLE update_test;
+ --
+ -- Test to update continuous and non-continuous columns
+ --
+ DROP TABLE IF EXISTS update_test;
+ NOTICE: table "update_test" does not exist, skipping
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+ -- update 3 continuous columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+ -- update two non-continuous columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ DROP TABLE update_test;
*** a/src/test/regress/sql/update.sql
--- b/src/test/regress/sql/update.sql
***************
*** 59,61 **** UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
--- 59,128 ----
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+ --
+ -- Test to update continuous and non-continuous columns
+ --
+
+ DROP TABLE IF EXISTS update_test;
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+
+ SELECT * from update_test;
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+
+ -- update 3 continuous columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+ -- update two non-continuous columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+
+ SELECT * from update_test;
+ DROP TABLE update_test;
pglz-with-micro-optimizations-4.patchapplication/octet-stream; name=pglz-with-micro-optimizations-4.patchDownload
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* GUC variable for EWT compression ratio */
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple from its delta-encoded WAL tuple (EWT) and the old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fe56318..24c117c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5815,6 +5817,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5824,15 +5832,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by an Update
+ * operation. Currently we do it only when both the old and new tuple
+ * versions are on the same page, because during recovery, if the page
+ * containing the old tuple is corrupt, that corruption should not cascade
+ * to other pages. Under the general assumption that over long runs most
+ * updates tend to create the new tuple version on the same page, there
+ * should not be a significant impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5859,9 +5899,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6671,7 +6714,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6686,7 +6732,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6746,7 +6792,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6764,7 +6810,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6788,7 +6834,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6851,10 +6897,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2-3 bytes)
+ * + New data (1 byte length + variable data) + ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6870,7 +6937,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 07c68ad..c3a94a2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1249,6 +1249,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..35e8206 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,9 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +243,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +261,40 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be use to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a,b,c,d are local variables these macros use to store state. These macros
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
+
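+/*
+ * Usage sketch (editorial illustration, not from the original patch),
+ * mirroring how pglz_delta_encode() below drives these macros over a
+ * buffer [p, pend) with hash mask 'mask':
+ *
+ *	int32	a, b, c, d, hindex;
+ *
+ *	pglz_hash_init(p, hindex, a, b, c, d);
+ *	while (p < pend - 4)
+ *	{
+ *		pglz_hash_roll(p, hindex, a, b, c, d, mask);
+ *		(hindex is now the masked hash of the 4 bytes starting at p)
+ *		p++;
+ *	}
+ */
+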
/* ----------
* pglz_hist_add -
@@ -276,32 +308,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+} while (0)
+
/* ----------
* pglz_out_ctrl -
@@ -372,28 +421,42 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ maxlen = PGLZ_MAX_MATCH;
+ if (end - input < maxlen)
+ maxlen = end - input;
+ if (hend && (hend - hp < maxlen))
+ maxlen = hend - hp;
+
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
+ if (!hend)
+ thisoff = ip - hp;
+ else
+ thisoff = hend - hp;
+
if (thisoff >= 0x0fff)
break;
@@ -413,7 +476,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +486,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -443,13 +506,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +534,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collisions, but has a higher startup cost. For
+ * a small input, the startup cost dominates. The table size must be
+ * a power of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
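+
+/*
+ * Worked example (editorial illustration, not from the original patch):
+ * for a 300-byte input choose_hash_size() returns 2048, so only
+ * 2048 * sizeof(int16) = 4 kB of hist_start[] has to be zeroed per call,
+ * instead of the full 8192 entries (16 kB).
+ */
+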
/* ----------
* pglz_compress -
@@ -484,7 +570,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +586,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +643,21 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +680,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +691,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +708,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +731,198 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 a,b,c,d;
+ int32 hindex;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (hp < hend - 4)
+ {
+ /*
+ * TODO: It would be nice to behave as if the history and the source
+ * strings were concatenated, so that matches could also refer to the
+ * new data, not just the history.
+ */
+ pglz_hash_roll(hp, hindex, a,b,c,d, mask);
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex);
+ hp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ while (dp < dend - 4)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp,hindex,a,b,c,d,mask);
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1026,107 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * Otherwise it contains the match length minus 3 and the
+ * upper 4 bits of the offset. The next following byte
+ * contains the lower 8 bits of the offset. If the length is
+ * coded as 18, another extension tag byte tells how much
+ * longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history
+ * to OUTPUT.
+ */
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just
+ * copy one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 1;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 22ba35f..6ff6b23 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -124,6 +124,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2410,6 +2411,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index e58eae5..386277d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that old
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that new
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..1ef550b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f8f06c1..56efcac 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non-continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
On 04.03.2013 06:39, Amit Kapila wrote:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
The attached patch also just adds overhead in most cases, but the
overhead is much smaller in the worst case. I think that's the right
tradeoff here - we want to avoid scenarios where performance falls off
the cliff. That said, if you usually just get a slowdown, we certainly
can't make this the default, and if we can't turn it on by default,
this probably just isn't worth it.
The attached patch contains the variable-hash-size changes I posted in
the "Optimizing pglz compressor". But in the delta encoding function,
it goes further than that, and contains some further micro-
optimizations:
the hash is calculated in a rolling fashion, and it uses a specialized
version of the pglz_hist_add macro that knows that the input can't
exceed 4096 bytes. Those changes shaved off some cycles, but you could
probably do more. One idea is to only add every 10 bytes or so to the
history lookup table; that would sacrifice some compressibility for
speed.
If you could squeeze pglz_delta_encode function to be cheap enough that
we could enable this by default, this would be pretty cool patch. Or at
least, the overhead in the cases that you get no compression needs to
be brought down, to about 2-5 % at most I think. If it can't be done
easily, I feel that this probably needs to be dropped.
After trying some more to optimize pglz_delta_encode(), I found that if we
use the new data in the history as well, the compression results and CPU
utilization are much better.
In addition to the pglz micro-optimization changes, the following changes are
made in the modified patch:
1. The unmatched new data is also added to the history, so it can be
referenced later.
2. To incorporate this change in the LZ algorithm, one extra control bit is
needed to indicate whether the data comes from the old or the new tuple
(see the decode sketch below).
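To illustrate how this extra control bit is consumed on the decode side, here
is a minimal sketch. The helper name decode_one_item and its simplified
interface are invented for illustration only; the tag layout and the
history/new-data distinction follow pglz_delta_decode() in the attached patch,
and all output-buffer bounds checks are omitted for brevity:

#include <string.h>

/*
 * Decode one encoded item. 'ctrl' holds the current control bits, already
 * shifted so that bit 0 = literal(0)/match(1) and bit 1 = source of the
 * match (0 = earlier output, 1 = old-tuple history ending at 'hend').
 */
static const unsigned char *
decode_one_item(unsigned char ctrl, const unsigned char *sp,
                unsigned char **dpp, const char *hend)
{
    unsigned char *dp = *dpp;

    if (ctrl & 1)                       /* first bit: match vs. literal */
    {
        int     len = (sp[0] & 0x0f) + 3;
        int     off = ((sp[0] & 0xf0) << 4) | sp[1];

        sp += 2;
        if (len == 18)                  /* extended length byte follows */
            len += *sp++;

        if (ctrl & 2)                   /* second bit: source of the match */
        {
            /* copy from the old tuple (history buffer) */
            memcpy(dp, hend - off, len);
            dp += len;
        }
        else
        {
            /* copy from already-decoded new data; regions may overlap */
            while (len-- > 0)
            {
                *dp = dp[-off];
                dp++;
            }
        }
    }
    else
    {
        /* literal byte copied straight from the encoded stream */
        *dp++ = *sp++;
    }

    *dpp = dp;
    return sp;
}

The real decoder consumes four such items per control byte (two control bits
per item), checks dp against destend before every copy, and PANICs if the
encoded stream is not fully consumed.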
Performance Data
-----------------
Head code:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1232908016 | 36.3914430141449
two short fields, one changed | 1232904040 | 36.5231261253357
two short fields, both changed | 1235215048 | 37.7455959320068
one short and one long field, no change | 1051394568 | 24.418487071991
ten tiny fields, all changed | 1395189872 | 43.2316210269928
hundred tiny fields, first 10 changed | 622156848 | 21.9155580997467
hundred tiny fields, all changed | 625962056 | 22.3296411037445
hundred tiny fields, half changed | 621901128 | 21.3881061077118
hundred tiny fields, half nulled | 557708096 | 19.4633228778839
pglz-with-micro-optimization-compress-using-newdata-1:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1235992768 | 37.3365149497986
two short fields, one changed | 1240979256 | 36.897796869278
two short fields, both changed | 1236079976 | 38.4273149967194
one short and one long field, no change | 651010944 | 20.9490079879761
ten tiny fields, all changed | 1315606864 | 42.5771369934082
hundred tiny fields, first 10 changed | 459134432 | 17.4556930065155
hundred tiny fields, all changed | 456506680 | 17.8865270614624
hundred tiny fields, half changed | 454784456 | 18.0130441188812
hundred tiny fields, half nulled | 486675784 | 18.6600229740143
Observations
---------------
1. It yielded compression in more cases (see all of the "hundred tiny fields" cases).
2. CPU utilization is also better.
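As a rough worked example from the tables above, for "hundred tiny fields,
first 10 changed" the WAL generated drops from 622156848 to 459134432 bytes
(459134432 / 622156848 ≈ 0.74, i.e. about a 26% reduction), and the duration
falls from roughly 21.9 to 17.5.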
Performance data for pgbench-related scenarios is attached in the document
(pgbench_lz_opt_compress_using_newdata.htm):
1. Better reduction in WAL.
2. A TPS increase can be observed once the record size is >= 250.
3. There is a small performance penalty for a single thread (0.04~3.45), but
even when the single-thread penalty is 3.45, the TPS improvement with 8
threads is high.
Do you think it meets the conditions you have in mind for proceeding further
with this patch?
Thanks to Hari Babu for helping with the implementation of this idea and for
collecting the performance data.
With Regards,
Amit Kapila.
Attachments:
pglz-with-micro-optimization-compress-using-newdata-1.patch (application/octet-stream)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* GUC variable for EWT compression ratio */
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e88dd30..0997fe2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5827,6 +5829,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5836,15 +5844,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL as in that case there is no saving by reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5871,9 +5911,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6683,7 +6726,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6698,7 +6744,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6758,7 +6804,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6776,7 +6822,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6800,7 +6846,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6867,10 +6913,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2-3) bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6886,7 +6953,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fbc722c..b13be74 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1239,6 +1239,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..a7876e0 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,10 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
+ bool from_history; /* Is the hash entry from history buffer? */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +244,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +262,40 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be used to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a,b,c,d are local variables these macros use to store state. These macros
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -276,32 +309,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex, _from_history) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+ __myhe->from_history = _from_history; \
+} while (0)
/* ----------
* pglz_out_ctrl -
@@ -364,6 +414,49 @@ do { \
/* ----------
+ * pglz_out_tag_encode -
+ *
+ * Outputs a backward reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination/history buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+#define pglz_out_tag_encode(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_from_history) \
+do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_from_history) \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_len > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_len) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+} while (0)
+
+/* ----------
+ * pglz_out_literal_encode -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+#define pglz_out_literal_encode(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 2; \
+} while (0)
+
+/* ----------
* pglz_find_match -
*
* Lookup the history table if the actual input stream matches
@@ -372,28 +465,48 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex, bool *from_history)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
+ bool history_match = false;
+
+ *from_history = false;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ history_match = false;
+ maxlen = PGLZ_MAX_MATCH;
+ if (hent->from_history && (hend - hp < maxlen))
+ maxlen = hend - hp;
+ else if (end - input < maxlen)
+ maxlen = end - input;
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
+ if (hent->from_history)
+ {
+ history_match = true;
+ thisoff = hend - hp;
+ }
+ else
+ thisoff = ip - hp;
+
if (thisoff >= 0x0fff)
break;
@@ -413,7 +526,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +536,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -436,6 +549,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
*/
if (thislen > len)
{
+ *from_history = history_match;
len = thislen;
off = thisoff;
}
@@ -443,13 +557,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +585,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For
+ * a small input, the startup cost dominates. The table size must be
+ * a power of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -484,7 +621,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +637,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +694,23 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
+ bool from_history;
+
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +733,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex,
+ &from_history))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +745,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +762,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +785,205 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 a,b,c,d;
+ int32 hindex;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen + slen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (hp < hend - 4)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the
+ * new data, too.
+ */
+ pglz_hash_roll(hp, hindex, a,b,c,d, mask);
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex, true);
+ hp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ while (dp < dend - 4)
+ {
+ bool from_history;
+
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp,hindex,a,b,c,d,mask);
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex,
+ &from_history))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag_encode(ctrlp, ctrlb, ctrl, bp, match_len, match_off, from_history);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ dp, dend, hindex, false);
+
+ pglz_out_literal_encode(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal_encode(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1087,124 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc += 2)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * Otherwise it contains the match length minus 3 and the
+ * upper 4 bits of the offset. The next following byte
+ * contains the lower 8 bits of the offset. If the length is
+ * coded as 18, another extension tag byte tells how much
+ * longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history
+ * to OUTPUT.
+ */
+ if ((ctrl >> 1) & 1)
+ {
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * Now we copy the bytes specified by the tag from OUTPUT to
+ * OUTPUT. It is dangerous and platform dependent to use
+ * memcpy() here, because the copied areas could overlap
+ * extremely!
+ */
+ while (len--)
+ {
+ *dp = dp[-off];
+ dp++;
+ }
+ }
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just
+ * copy one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 2;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..5bcf40b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -124,6 +124,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2410,6 +2411,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 4381778..36e7dc8 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -148,12 +148,21 @@ typedef struct xl_heap_update
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that old
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that new
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 0a832e9..830349b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -689,6 +689,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index b4a75ce..032a422 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non-continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
On Friday, June 07, 2013 5:07 PM Amit Kapila wrote:
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
On 04.03.2013 06:39, Amit Kapila wrote:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
The attached patch also just adds overhead in most cases, but the
overhead is much smaller in the worst case. I think that's the right
tradeoff here - we want to avoid scenarios where performance falls off
the cliff. That said, if you usually just get a slowdown, we certainly
can't make this the default, and if we can't turn it on by default,
this probably just isn't worth it.
The attached patch contains the variable-hash-size changes I posted in
the "Optimizing pglz compressor". But in the delta encoding function,
it goes further than that, and contains some further micro-
optimizations:
the hash is calculated in a rolling fashion, and it uses a specialized
version of the pglz_hist_add macro that knows that the input can't
exceed 4096 bytes. Those changes shaved off some cycles, but you could
probably do more. One idea is to only add every 10 bytes or so to the
history lookup table; that would sacrifice some compressibility for
speed.
If you could squeeze pglz_delta_encode function to be cheap enough
that we could enable this by default, this would be pretty cool patch.
Or at least, the overhead in the cases that you get no compression
needs to be brought down, to about 2-5 % at most I think. If it can't
be done easily, I feel that this probably needs to be dropped.
After trying some more to optimize pglz_delta_encode(), I found that if
we use the new data also in the history, then the results of compression and
CPU utilization are much better.
In addition to the PGLZ micro-optimization changes, the following changes are
made in the modified patch:
1. The unmatched new data is also added to the history, so that it can be
referenced later.
2. To incorporate this change in the LZ algorithm, one extra control bit is
needed to indicate whether the data comes from the old or the new tuple
(see the sketch below).
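To make the two-control-bit layout concrete, here is a small, purely illustrative C sketch of how the decode side reads each item, following the pglz_delta_decode() loop in the attached patch; the helper name interpret_item is invented for this example and is not part of the patch.

#include <stdbool.h>

/*
 * Each encoded item consumes two bits of the current control byte
 * (the decode loop shifts with "ctrl >>= 2" after every item):
 *
 *   bit 0 = 0 : a literal byte follows; copy it to the output
 *   bit 0 = 1 : a 2-4 byte back-reference tag follows
 *   bit 1     : only meaningful for a tag; 1 = the offset counts back
 *               from the end of the old-tuple history, 0 = it counts
 *               back within the already-decoded new tuple
 */
static void
interpret_item(unsigned char ctrl, bool *is_tag, bool *from_history)
{
	*is_tag = (ctrl & 1) != 0;
	*from_history = *is_tag && ((ctrl >> 1) & 1) != 0;
}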
The patch is rebased to use the new PG LZ algorithm optimization changes
which got committed recently.
Performance Data
-----------------
Head code:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1232911016 | 35.1784930229187
two short fields, one changed | 1240322016 | 35.0436308383942
two short fields, both changed | 1235318352 | 35.4989421367645
one short and one long field, no change | 1042332336 | 23.4457180500031
ten tiny fields, all changed | 1395194136 | 41.9023628234863
hundred tiny fields, first 10 changed | 626725984 | 21.2999589443207
hundred tiny fields, all changed | 621899224 | 21.6676609516144
hundred tiny fields, half changed | 623998272 | 21.2745981216431
hundred tiny fields, half nulled | 557714088 | 19.5902800559998
pglz-with-micro-optimization-compress-using-newdata-2:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1232903384 | 35.0115969181061
two short fields, one changed | 1232906960 | 34.3333759307861
two short fields, both changed | 1232903520 | 35.7665238380432
one short and one long field, no change | 649647992 | 19.4671010971069
ten tiny fields, all changed | 1314957136 | 39.9727990627289
hundred tiny fields, first 10 changed | 458684024 | 17.8197758197784
hundred tiny fields, all changed | 461028464 | 17.3083391189575
hundred tiny fields, half changed | 456528696 | 17.1769199371338
hundred tiny fields, half nulled | 480548936 | 18.81720495224
Observations
---------------
1. It yielded compression in more cases (refer to all the "hundred tiny
fields" cases).
2. CPU utilization is also better.
Performance data for pgbench related scenarios is attached in document
(pgbench_lz_opt_compress_using_newdata-2.htm)
1. Better reduction in WAL.
2. TPS increase can be observed once the record size is >= 250.
3. There is a small performance penalty for a single thread (0.36~3.23),
but when the penalty is 3.23 for a single thread, the TPS improvement
for 8 threads is high.
Please suggest how to proceed further with this patch.
Regards,
Hari babu.
Attachments:
pglz-with-micro-optimization-compress-using-newdata-2.patchapplication/octet-stream; name=pglz-with-micro-optimization-compress-using-newdata-2.patchDownload
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1531f3b..ed51650 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5830,6 +5832,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5839,15 +5847,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+	 * We should not generate EWT when we need to back up the whole block in
+ * WAL as in that case there is no saving by reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5874,9 +5914,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6686,7 +6729,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6701,7 +6747,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6761,7 +6807,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6779,7 +6825,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6803,7 +6849,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6870,10 +6916,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6889,7 +6956,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0ce661b..306961c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1239,6 +1239,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index ae67519..a98277e 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -200,9 +200,10 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ struct PGLZ_HistEntry *next; /* links for my hash key's list */
+ struct PGLZ_HistEntry *prev;
+ uint32 hindex; /* my current hash key */
+ bool from_history; /* Is the hash entry from history buffer? */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -265,12 +266,40 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e, _mask) ( \
- ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
- ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
+#define pglz_hist_idx(_s,_e, _mask) ( \
+ ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be use to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a,b,c,d are local variables these macros use to store state. These macros
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -284,10 +313,9 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \
- int16 *__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
if (__myhe->prev == NULL) \
@@ -299,7 +327,7 @@ do { \
} \
__myhe->next = &(_he)[*__myhsp]; \
__myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
/* If there was an existing entry in this hash slot, link */ \
/* this new entry to it. However, the 0th entry in the */ \
@@ -317,6 +345,23 @@ do { \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex, _from_history) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = &(_he)[*__myhsp]; \
+ __myhe->prev = NULL; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ (_he)[(*__myhsp)].prev = __myhe; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+ __myhe->from_history = _from_history; \
+} while (0)
/* ----------
* pglz_out_ctrl -
@@ -379,6 +424,49 @@ do { \
/* ----------
+ * pglz_out_tag_encode -
+ *
+ * Outputs a backward reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination/history buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+#define pglz_out_tag_encode(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_from_history) \
+do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_from_history) \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_len > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_len) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+} while (0)
+
+/* ----------
+ * pglz_out_literal_encode -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+#define pglz_out_literal_encode(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 2; \
+} while (0)
+
+/* ----------
* pglz_find_match -
*
* Lookup the history table if the actual input stream matches
@@ -388,17 +476,21 @@ do { \
*/
static inline int
pglz_find_match(int16 *hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop, int mask)
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex, bool *from_history)
{
PGLZ_HistEntry *hent;
int16 hentno;
int32 len = 0;
int32 off = 0;
+ bool history_match = false;
+
+ *from_history = false;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hentno = hstart[pglz_hist_idx(input, end, mask)];
+ hentno = hstart[hindex];
hent = &hist_entries[hentno];
while (hent != INVALID_ENTRY_PTR)
{
@@ -406,11 +498,26 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ history_match = false;
+ maxlen = PGLZ_MAX_MATCH;
+ if (hent->from_history && (hend - hp < maxlen))
+ maxlen = hend - hp;
+ else if (end - input < maxlen)
+ maxlen = end - input;
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
+ if (hent->from_history)
+ {
+ history_match = true;
+ thisoff = hend - hp;
+ }
+ else
+ thisoff = ip - hp;
+
if (thisoff >= 0x0fff)
break;
@@ -430,7 +537,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -440,7 +547,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -453,6 +560,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
*/
if (thislen > len)
{
+ *from_history = history_match;
len = thislen;
off = thisoff;
}
@@ -466,7 +574,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -488,6 +596,29 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For
+ * a small input, the startup cost dominates. The table size must be
+ * a power of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -574,6 +705,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Experiments suggest that these hash sizes work pretty well. A large
* hash table minimizes collision, but has a higher startup cost. For
@@ -603,6 +737,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
*/
while (dp < dend)
{
+ uint32 hindex;
+ bool from_history;
+
/*
* If we already exceeded the maximum result size, fail.
*
@@ -625,8 +762,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop, mask))
+ &match_off, good_match, good_drop, NULL, hindex,
+ &from_history))
{
/*
* Create the tag and add history entries for all matched
@@ -635,9 +774,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend, mask);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -651,7 +791,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend, mask);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -674,6 +814,205 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 a,b,c,d;
+ int32 hindex;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen + slen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (hp < hend - 4)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the
+ * new data, too.
+ */
+ pglz_hash_roll(hp, hindex, a,b,c,d, mask);
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex, true);
+ hp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ while (dp < dend - 4)
+ {
+ bool from_history;
+
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp,hindex,a,b,c,d,mask);
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex,
+ &from_history))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag_encode(ctrlp, ctrlb, ctrl, bp, match_len, match_off, from_history);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ dp, dend, hindex, false);
+
+ pglz_out_literal_encode(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal_encode(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -777,3 +1116,124 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc += 2)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * Otherwise it contains the match length minus 3 and the
+ * upper 4 bits of the offset. The next following byte
+ * contains the lower 8 bits of the offset. If the length is
+ * coded as 18, another extension tag byte tells how much
+ * longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history
+ * to OUTPUT.
+ */
+ if ((ctrl >> 1) & 1)
+ {
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * Now we copy the bytes specified by the tag from OUTPUT to
+ * OUTPUT. It is dangerous and platform dependent to use
+ * memcpy() here, because the copied areas could overlap
+ * extremely!
+ */
+ while (len--)
+ {
+ *dp = dp[-off];
+ dp++;
+ }
+ }
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just
+ * copy one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 2;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3a76536..e2c42af 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -125,6 +125,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2411,6 +2412,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 4381778..36e7dc8 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -148,12 +148,21 @@ typedef struct xl_heap_update
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates as old
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates as new
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates as the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 0a832e9..830349b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -689,6 +689,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 83e5832..4e6914c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuos and non continuos columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
I can't comment on further direction for the patch, but since it was marked
as Needs Review in the CF app I took a quick look at it.
It patches and compiles clean against the current Git HEAD, and 'make
check' runs successfully.
Does it need documentation for the GUC variable
'wal_update_compression_ratio'?
__________________________________________________________________________________
*Mike Blackwell | Technical Analyst, Distribution Services/Rollout
Management | RR Donnelley*
1750 Wallace Ave | St Charles, IL 60174-3401
Office: 630.313.7818
Mike.Blackwell@rrd.com
http://www.rrdonnelley.com
On Tue, Jul 2, 2013 at 2:26 AM, Hari Babu <haribabu.kommi@huawei.com> wrote:
On 07/08/2013 02:21 PM, Mike Blackwell wrote:
I can't comment on further direction for the patch, but since it was marked
as Needs Review in the CF app I took a quick look at it.
It patches and compiles clean against the current Git HEAD, and 'make
check' runs successfully.
Does it need documentation for the GUC variable
'wal_update_compression_ratio'?
Yes.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tuesday, July 09, 2013 2:52 AM Mike Blackwell wrote:
I can't comment on further direction for the patch, but since it was marked as Needs Review in the CF app I took a quick look at it.
Thanks for looking into it.
Last time, Heikki found test scenarios where the original patch was not performing well.
He also proposed a different approach for WAL encoding and sent a modified patch which has a comparatively smaller negative performance impact, and
asked to check whether the patch can reduce the performance impact for the scenarios mentioned by him.
After that I found that with some modifications (using the new tuple data for encoding) to his approach, it eliminates the negative performance impact and
gives WAL reduction in more cases.
I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person.
The test case used is posted at the below link:
/messages/by-id/51366323.8070606@vmware.com
It patches and compiles clean against the current Git HEAD, and 'make check' runs successfully.
Does it need documentation for the GUC variable 'wal_update_compression_ratio'?
This variable has been added to test the patch with different compression ratios during development testing.
It has not been decided whether this variable will permanently be part of this patch, so currently there is no documentation for it.
However, if the decision is that it needs to be part of the patch, then documentation for it can be added.
With Regards,
Amit Kapila.
The only environment I have available at the moment is a virtual box.
That's probably not going to be very helpful for performance testing.
__________________________________________________________________________________
*Mike Blackwell | Technical Analyst, Distribution Services/Rollout
Management | RR Donnelley*
1750 Wallace Ave | St Charles, IL 60174-3401
Office: 630.313.7818
Mike.Blackwell@rrd.com
http://www.rrdonnelley.com
On Mon, Jul 8, 2013 at 11:09 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
On Wednesday, July 10, 2013 6:32 AM Mike Blackwell wrote:
The only environment I have available at the moment is a virtual box. That's probably not going to be very helpful for performance testing.
It's okay. Anyway, thanks for doing the basic testing of the patch.
With Regards,
Amit Kapila.
On 7/9/13 12:09 AM, Amit Kapila wrote:
I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person.
The testcase used is posted at below link:
/messages/by-id/51366323.8070606@vmware.com
That seems easy enough to do here; Heikki's test script is excellent.
The latest patch Hari posted on July 2 has one hunk that doesn't apply
anymore. Inside src/backend/utils/adt/pg_lzcompress.c the patch
tries to change this code:
- if (hent)
+ if (hentno != INVALID_ENTRY)
But that line looks like this now:
if (hent != INVALID_ENTRY_PTR)
Definitions of those:
#define INVALID_ENTRY 0
#define INVALID_ENTRY_PTR (&hist_entries[INVALID_ENTRY])
I'm not sure if different error handling may be needed here now due to the
commit that changed this, or if the patch wasn't referring to the right
type of error originally.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Greg,
* Greg Smith (greg@2ndQuadrant.com) wrote:
That seems easy enough to do here, Heikki's test script is
excellent. The latest patch Hari posted on July 2 has one hunk that
doesn't apply anymore now. Inside
src/backend/utils/adt/pg_lzcompress.c the patch tries to change this
code:

- if (hent)
+ if (hentno != INVALID_ENTRY)
hentno certainly doesn't make much sense here - it's only used at the top
of the function to keep things a bit cleaner when extracting the address
into hent from hist_entries:
hentno = hstart[pglz_hist_idx(input, end, mask)];
hent = &hist_entries[hentno];
Indeed, as the referenced conditional is inside the following loop:
while (hent != INVALID_ENTRY_PTR)
and, since hentno == 0 implies hent == INVALID_ENTRY_PTR, the
conditional would never fail (which is what was happening prior to
Heikki committing the fix for this, changing the conditional to what is
below).
But that line looks like this now:
if (hent != INVALID_ENTRY_PTR)
Right, this is correct - it's useful to check the new value of hent
after it's been updated by:
hent = hent->next;
and see if it's possible to drop out early.
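To make that concrete, here is a small, compilable toy sketch of the same list walk; the type and function names (HistEntry, walk_bucket) and the fixed array size are invented for the example and are not the real pglz code.

#include <stddef.h>

/* Simplified stand-in for the real pglz history entry. */
typedef struct HistEntry
{
	struct HistEntry *next;
	const char *pos;
} HistEntry;

static HistEntry hist_entries[4096 + 1];

#define INVALID_ENTRY		0
#define INVALID_ENTRY_PTR	(&hist_entries[INVALID_ENTRY])

/*
 * Walk one hash bucket the way pglz_find_match() does.  hentno is used
 * only to turn the bucket number into a pointer; after that the loop
 * advances hent, so the "drop out early" test must compare hent against
 * INVALID_ENTRY_PTR.  A test on hentno could never change its result
 * inside the loop, and a bare "if (hent)" never fails either, because in
 * the real code the lists are terminated by pointing back at
 * hist_entries[0] rather than at NULL.
 */
static const char *
walk_bucket(const short *hstart, int hindex)
{
	short		hentno = hstart[hindex];
	HistEntry  *hent = &hist_entries[hentno];
	const char *best = NULL;

	while (hent != INVALID_ENTRY_PTR)
	{
		best = hent->pos;	/* stand-in for the real match comparison */

		hent = hent->next;
		if (hent != INVALID_ENTRY_PTR)
		{
			/* real code may break out here once the match is good enough */
		}
	}
	return best;
}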
I'm not sure if different error handling may be needed here now due to
the commit that changed this, or if the patch wasn't referring to
the right type of error originally.
I've not looked at anything regarding this beyond this email, but I'm
pretty confident that the change Heikki committed was the correct one.
Thanks,
Stephen
On Friday, July 19, 2013 4:11 AM Greg Smith wrote:
On 7/9/13 12:09 AM, Amit Kapila wrote:
I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person.
The testcase used is posted at below link:
/messages/by-id/51366323.8070606@vmware.com
That seems easy enough to do here, Heikki's test script is excellent.
The latest patch Hari posted on July 2 has one hunk that doesn't apply
anymore now.
The HEAD code change from Heikki is correct.
During the patch rebase to the latest PGLZ optimization code, the above code change was missed.
Apart from that, some more changes are done in the patch; those are:
1. Corrected some comments in the code.
2. Added a validity check, as the source and history length combined cannot be more than or equal to 8192 (a rough sketch of the idea follows).
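For illustration only, here is a hedged guess at the shape of that new check; the function name delta_encode_length_ok is invented, and the 8192 limit is taken from the description above (it also matches the largest table returned by choose_hash_size() in the patch). The exact test is in the attached v3 patch.

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch, not the patch itself: refuse delta encoding when
 * the combined source and history length reaches 8192 bytes.
 */
static bool
delta_encode_length_ok(int32_t slen, int32_t hlen)
{
	return (hlen + slen) < 8192;
}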
Thanks for the review; please find the latest patch attached to this mail.
Regards,
Hari babu.
Attachments:
pglz-with-micro-optimization-compress-using-newdata-3.patchapplication/octet-stream; name=pglz-with-micro-optimization-compress-using-newdata-3.patchDownload
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..875434d 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5bcbc92..6dc362e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5844,6 +5846,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5853,15 +5861,48 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+	 * We should not generate EWT when we need to back up the whole block in
+ * WAL as in that case there is no saving by reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5888,9 +5929,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows, OR,
+ * if delta encoded, PG94FORMAT: LZ header + encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6700,7 +6744,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6715,7 +6762,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6775,7 +6822,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6793,7 +6840,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6817,7 +6864,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6884,10 +6931,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG94FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + literal byte + ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6903,7 +6971,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 96aceb9..fed305d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2308,6 +2308,28 @@ XLogRecPtrToBytePos(XLogRecPtr ptr)
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but that will not cause any problem because this function is used
+ * only to decide whether to generate an EWT for a WAL update record.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 1c129b8..cbf6064 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -120,7 +120,7 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway. The size of the hash
+ * back-pointers larger than that anyway. The size of the hash
* table is chosen based on the size of the input - a larger table
* has a larger startup cost, as it needs to be initialized to
* zero, but reduces the number of hash collisions on long inputs.
@@ -202,7 +202,8 @@ typedef struct PGLZ_HistEntry
{
struct PGLZ_HistEntry *next; /* links for my hash key's list */
struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ uint32 hindex; /* my current hash key */
+ bool from_history; /* Is the hash entry from history buffer? */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -265,12 +266,40 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e, _mask) ( \
- ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
- ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
+#define pglz_hist_idx(_s,_e, _mask) ( \
+ ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be used to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a, b, c and d are local variables that these macros use to store state. They
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -284,10 +313,9 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \
- int16 *__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
if (__myhe->prev == NULL) \
@@ -299,7 +327,7 @@ do { \
} \
__myhe->next = &(_he)[*__myhsp]; \
__myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
/* If there was an existing entry in this hash slot, link */ \
/* this new entry to it. However, the 0th entry in the */ \
@@ -317,6 +345,23 @@ do { \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. It can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex, _from_history) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = &(_he)[*__myhsp]; \
+ __myhe->prev = NULL; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ (_he)[(*__myhsp)].prev = __myhe; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+ __myhe->from_history = _from_history; \
+} while (0)
/* ----------
* pglz_out_ctrl -
@@ -379,6 +424,49 @@ do { \
/* ----------
+ * pglz_out_tag_encode -
+ *
+ * Outputs a backward reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination/history buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+#define pglz_out_tag_encode(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_from_history) \
+do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_from_history) \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_len > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_len) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+} while (0)
+
+/* ----------
+ * pglz_out_literal_encode -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+#define pglz_out_literal_encode(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 2; \
+} while (0)
+
+/* ----------
* pglz_find_match -
*
* Lookup the history table if the actual input stream matches
@@ -388,17 +476,21 @@ do { \
*/
static inline int
pglz_find_match(int16 *hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop, int mask)
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex, bool *from_history)
{
PGLZ_HistEntry *hent;
int16 hentno;
int32 len = 0;
int32 off = 0;
+ bool history_match = false;
+
+ *from_history = false;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hentno = hstart[pglz_hist_idx(input, end, mask)];
+ hentno = hstart[hindex];
hent = &hist_entries[hentno];
while (hent != INVALID_ENTRY_PTR)
{
@@ -406,11 +498,26 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ history_match = false;
+ maxlen = PGLZ_MAX_MATCH;
+ if (hent->from_history && (hend - hp < maxlen))
+ maxlen = hend - hp;
+ else if (end - input < maxlen)
+ maxlen = end - input;
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
+ if (hent->from_history)
+ {
+ history_match = true;
+ thisoff = hend - hp;
+ }
+ else
+ thisoff = ip - hp;
+
if (thisoff >= 0x0fff)
break;
@@ -430,7 +537,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -440,7 +547,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -453,6 +560,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
*/
if (thislen > len)
{
+ *from_history = history_match;
len = thislen;
off = thisoff;
}
@@ -488,6 +596,29 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For a
+ * small input, the startup cost dominates. The table size must be a power
+ * of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -574,11 +705,14 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Experiments suggest that these hash sizes work pretty well. A large
- * hash table minimizes collision, but has a higher startup cost. For
- * a small input, the startup cost dominates. The table size must be
- * a power of two.
+ * hash table minimizes collision, but has a higher startup cost. For a
+ * small input, the startup cost dominates. The table size must be a power
+ * of two.
*/
if (slen < 128)
hashsz = 512;
@@ -603,6 +737,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
*/
while (dp < dend)
{
+ uint32 hindex;
+ bool from_history;
+
/*
* If we already exceeded the maximum result size, fail.
*
@@ -625,8 +762,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop, mask))
+ &match_off, good_match, good_drop, NULL, hindex,
+ &from_history))
{
/*
* Create the tag and add history entries for all matched
@@ -635,9 +774,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend, mask);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -651,7 +791,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend, mask);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -674,6 +814,209 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 a,
+ b,
+ c,
+ d;
+ int32 hindex;
+
+ /*
+ * Delta encoding is not attempted when the combined source and history
+ * length is 2 * PGLZ_HISTORY_SIZE or more, as that is the maximum history
+ * offset that can be handled, nor when the history is shorter than 4 bytes.
+ */
+ if (((hlen + slen) >= (2 * PGLZ_HISTORY_SIZE)) || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen + slen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ pglz_hash_init(hp, hindex, a, b, c, d);
+ while (hp < hend - 4)
+ {
+ /*
+ * TODO: It would be nice to behave as if the history and the source
+ * strings were concatenated, so that matches could also reference the
+ * new data.
+ */
+ pglz_hash_roll(hp, hindex, a, b, c, d, mask);
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex, true);
+ hp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ pglz_hash_init(dp, hindex, a, b, c, d);
+ while (dp < dend - 4)
+ {
+ bool from_history;
+
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp, hindex, a, b, c, d, mask);
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex,
+ &from_history))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag_encode(ctrlp, ctrlb, ctrl, bp, match_len, match_off, from_history);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(dp, hindex, a, b, c, d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ dp, dend, hindex, false);
+
+ pglz_out_literal_encode(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal_encode(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -777,3 +1120,124 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest. Backward references may point into
+ * the output produced so far or into the supplied history buffer.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc += 2)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * A set control bit means this item is a match tag. It holds the
+ * match length minus 3 and the upper 4 bits of the offset; the
+ * next byte contains the lower 8 bits of the offset. If the
+ * length is coded as 18, another extension tag byte tells how
+ * much longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history to
+ * OUTPUT.
+ */
+ if ((ctrl >> 1) & 1)
+ {
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * Now we copy the bytes specified by the tag from OUTPUT
+ * to OUTPUT. It is dangerous and platform dependent to
+ * use memcpy() here, because the copied areas could
+ * overlap extremely!
+ */
+ while (len--)
+ {
+ *dp = dp[-off];
+ dp++;
+ }
+ }
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just copy
+ * one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 2;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2b753f8..13ef553 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -125,6 +125,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2437,6 +2438,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 4381778..36e7dc8 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -148,12 +148,21 @@ typedef struct xl_heap_update
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* old page's
+ * all-visible bit
+ * was cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* new page's
+ * all-visible bit
+ * was cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* new tuple data
+ * is delta
+ * encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 0a832e9..830349b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -689,6 +689,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 002862c..0a928d9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -262,6 +262,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non-continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
The v3 patch applies perfectly here now. Attached is a spreadsheet with
test results from two platforms, a Mac laptop and a Linux server. I
used systems with high disk speed because that seemed like a worst case
for this improvement. The actual improvement for shrinking WAL should
be even better on a system with slower disks.
There are enough problems with the "hundred tiny fields" results that I
think this is not quite ready for commit yet. More comments on that below.
This seems close though, close enough that I am planning to follow up
to see if the slow disk results are better.
Reviewing the wal-update-testsuite.sh test program, I think the only
case missing that would be useful to add is "ten tiny fields, one
changed". I think that one is interesting to highlight because it's
what benchmark programs like pgbench do very often.
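To make that concrete, here's a rough sketch of the kind of case I mean; the
table and column names are just made up for illustration, not taken from the
test suite, and pg_xlog_location_diff() is only there to read off how much WAL
the update generated:

CREATE TABLE ten_tiny (
    f1 int, f2 int, f3 int, f4 int, f5 int,
    f6 int, f7 int, f8 int, f9 int, f10 int
);
INSERT INTO ten_tiny
    SELECT g, g, g, g, g, g, g, g, g, g
    FROM generate_series(1, 10000) g;
-- Note the WAL insert location, update a single column, then note it again.
SELECT pg_current_xlog_insert_location();
UPDATE ten_tiny SET f5 = f5 + 1;
SELECT pg_current_xlog_insert_location();
-- pg_xlog_location_diff(after, before) gives the WAL bytes the UPDATE produced.
-- Repeating the UPDATE avoids measuring mostly full-page images right after a
-- checkpoint, where the delta encoding is skipped anyway.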
The GUC added by the program looks like this:
postgres=# show wal_update_compression_ratio ;
wal_update_compression_ratio
------------------------------
25
Is it possible to add a setting here that disables the feature altogether?
That always makes it easier to consider a commit, knowing people can
roll back the change if it makes performance worse. That would make
performance testing easier too. wal-update-testsuite.sh takes as long as
13 minutes, which is long enough that I'd like the easier-to-automate
comparison that disabling the GUC would allow. If that's not practical to do given the
intrusiveness of the code, it's not really necessary. I haven't looked
at the change enough to be sure how hard this is.
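For what it's worth, the wal_update_compression_ratio != 0 check in
log_heap_update suggests that setting the GUC to 0 may already act as an off
switch. If that's the intent, the comparison runs could be scripted as simply
as this untested sketch:

-- Assuming 0 disables delta encoding, per the check in log_heap_update:
SET wal_update_compression_ratio = 0;   -- feature off for this session
-- ... run the update workload, record duration and WAL volume ...
SET wal_update_compression_ratio = 25;  -- back to the patch's default
-- ... rerun the same workload and compare ...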
On the Mac, the only case that seems to have a slowdown now is "hundred
tiny fields, half nulled". It would be nice to understand just what is
going on with that one. I got some ugly results in "two short fields,
no change" too, along with a couple of other weird results, but I think
those were testing procedure issues that can be ignored. The pgbench
throttle work I did recently highlights that I can't really make a Mac
quiet/consistent for benchmarking very well. Note that I ran all of the
Mac tests with assertions on, to try and catch platform specific bugs.
The Linux ones used the default build parameters.
On Linux "hundred tiny fields, half nulled" was also by far the worst
performing one, with a >30% increase in duration despite the 14% drop in
WAL. Exactly what's going on there really needs to be investigated
before this seems safe to commit. All of the "hundred tiny fields"
cases seem pretty bad on Linux, with the rest of them running about an
11% duration increase.
This doesn't seem ready to commit for this CF, but the number of problem
cases is getting pretty small now. Now that I've gotten more familiar
with the test programs and the feature, I can run more performance tests
on this at any time really. If updates addressing the trouble cases are
ready from Amit or Hari before the next CF, send them out and I can look
at them without waiting until that one starts. This is a very promising
looking performance feature.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Attachments:
WAL-lz-v3.xlsapplication/vnd.ms-excel; name=WAL-lz-v3.xlsDownload