pglz compression performance, take two
Hi hackers!
A year ago Vladimir Leskov proposed a patch to speed up pglz compression [0]. PFA the patch with some editorialisation by me.
I saw some reports of bottlenecking in pglz WAL compression [1].
Hopefully we will soon have compression codecs developed by compression specialists; that work is going on in a nearby thread about custom compression methods.
Is it viable to keep working on pglz optimisation? The patch makes it about 1.4x faster. Or should we rely on future use of lz4/zstd and others?
Best regards, Andrey Borodin.
[0]: /messages/by-id/169163A8-C96F-4DBE-A062-7D1CECBE9E5D@yandex-team.ru
[1]: https://smalldatum.blogspot.com/2020/12/tuning-for-insert-benchmark-postgres_4.html
Attachments:
0001-Reorganize-pglz-compression-code.patch (application/octet-stream)
From e01af73387817e2bc7b511741bae56243f041751 Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Thu, 27 Jun 2019 23:18:21 +0500
Subject: [PATCH] Reorganize pglz compression code
Convert macro-functions to regular functions, use more compact hash table and some other optimizations. Total speedup about x1.4
---
src/common/pg_lzcompress.c | 460 ++++++++++++++++++++++---------------
1 file changed, 278 insertions(+), 182 deletions(-)
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index d24d4803a9..741f4d5ca0 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -118,13 +118,13 @@
* our 4K maximum look-back distance is too small.
*
* The compressor creates a table for lists of positions.
- * For each input position (except the last 3), a hash key is
+ * For each input position (except the last 4), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
* lists of likely to be at least in the first 4 characters
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
- * kept for the last 4096 input positions, since we cannot use
+ * kept for the last 4094 input positions, since we cannot use
* back-pointers larger than that anyway. The size of the hash
* table is chosen based on the size of the input - a larger table
* has a larger startup cost, as it needs to be initialized to
@@ -146,17 +146,15 @@
* used for the next tag in the output.
*
* For each subsequent entry in the history list, the "good_match"
- * is lowered by 10%. So the compressor will be more happy with
- * short matches the farer it has to go back in the history.
+ * is lowered roughly by 10%. So the compressor will be more happy
+ * with short matches the farer it has to go back in the history.
* Another "speed against ratio" preference characteristic of
* the algorithm.
*
- * Thus there are 3 stop conditions for the lookup of matches:
+ * Thus there are 2 stop conditions for the lookup of matches:
*
* - a match >= good_match is found
* - there are no more history entries to look at
- * - the next history entry is already too far back
- * to be coded into a tag.
*
* Finally the match algorithm checks that at least a match
* of 3 or more bytes has been found, because that is the smallest
@@ -172,6 +170,11 @@
*
* Jan Wieck
*
+ * Many thanks to Yann Collet for his article about unaligned
+ * memory access.
+ *
+ * Leskov Vladimir
+ *
* Copyright (c) 1999-2020, PostgreSQL Global Development Group
*
* src/common/pg_lzcompress.c
@@ -188,12 +191,57 @@
#include "common/pg_lzcompress.h"
+/**************************************
+ * CPU Feature Detection *
+ **************************************/
+/* PGLZ_FORCE_MEMORY_ACCESS
+ * By default, access to unaligned memory is controlled by `memcpy()`, which is safe and portable.
+ * Unfortunately, on some target/compiler combinations, the generated assembly is sub-optimal.
+ * The below switch allow to select different access method for improved performance.
+ * Method 0 (default) : use `memcpy()`. Safe and portable.
+ * Method 1 : direct access. This method is portable but violate C standard.
+ * It can generate buggy code on targets which assembly generation depends on alignment.
+ * But in some circumstances, it's the only known way to get the most performance (ie GCC + ARMv6)
+ * See https://fastcompression.blogspot.fr/2015/08/accessing-unaligned-memory.html for details.
+ * Prefer these methods in priority order (0 > 1)
+ */
+#ifndef PGLZ_FORCE_MEMORY_ACCESS /* can be defined externally */
+# if defined(__GNUC__) && \
+ ( defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || defined(__ARM_ARCH_6K__) \
+ || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__) )
+# define PGLZ_FORCE_MEMORY_ACCESS 1
+# endif
+#endif
+
+
+#if defined(__x86_64__)
+ typedef uint64 reg_t; /* 64-bits in x32 mode */
+#else
+ typedef size_t reg_t; /* 32-bits in x32 mode */
+#endif
+
+
+#if defined(PGLZ_FORCE_MEMORY_ACCESS) && (PGLZ_FORCE_MEMORY_ACCESS==1)
+/* lie to the compiler about data alignment; use with caution */
+
+static uint32 pglz_read32(const void* ptr) { return *(const uint32*) ptr; }
+
+#else /* safe and portable access using memcpy() */
+
+static uint32 pglz_read32(const void* ptr)
+{
+ uint32 val; memcpy(&val, ptr, sizeof(val)); return val;
+}
+
+#endif /* PGLZ_FORCE_MEMORY_ACCESS */
+
+
/* ----------
* Local definitions
* ----------
*/
#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_SIZE 4096
+#define PGLZ_HISTORY_SIZE 0x0fff - 1 /* to avoid compare in iteration */
#define PGLZ_MAX_MATCH 273
@@ -202,17 +250,16 @@
*
* Linked list for the backward history lookup
*
- * All the entries sharing a hash key are linked in a doubly linked list.
- * This makes it easy to remove an entry when it's time to recycle it
- * (because it's more than 4K positions old).
+ * All the entries sharing a hash key are linked in a singly linked list.
+ * Links are not changed during insertion in order to speed it up.
+ * Instead more complicated stop condition is used during list iteration.
* ----------
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
- const char *pos; /* my input position */
+ int16 next_id; /* links for my hash key's list */
+ uint16 hist_idx; /* my current hash key */
+ const unsigned char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -274,117 +321,105 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e, _mask) ( \
- ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
- ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
- )
+static inline uint16 pglz_hist_idx(const unsigned char* s, uint16 mask)
+{
+ return ((s[0] << 6) ^ (s[1] << 4) ^ (s[2] << 2) ^ s[3]) & mask;
+}
/* ----------
* pglz_hist_add -
*
- * Adds a new entry to the history table.
- *
- * If _recycle is true, then we are recycling a previously used entry,
- * and must first delink it from its old hashcode's linked list.
- *
- * NOTE: beware of multiple evaluations of macro's arguments, and note that
- * _hn and _recycle are modified in the macro.
+ * Adds a new entry to the history table
+ * and recalculates hash value.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \
-do { \
- int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \
- int16 *__myhsp = &(_hs)[__hindex]; \
- PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
- if (_recycle) { \
- if (__myhe->prev == NULL) \
- (_hs)[__myhe->hindex] = __myhe->next - (_he); \
- else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
- } \
- __myhe->next = &(_he)[*__myhsp]; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
- __myhe->pos = (_s); \
- /* If there was an existing entry in this hash slot, link */ \
- /* this new entry to it. However, the 0th entry in the */ \
- /* entries table is unused, so we can freely scribble on it. */ \
- /* So don't bother checking if the slot was used - we'll */ \
- /* scribble on the unused entry if it was not, but that's */ \
- /* harmless. Avoiding the branch in this critical path */ \
- /* speeds this up a little bit. */ \
- /* if (*__myhsp != INVALID_ENTRY) */ \
- (_he)[(*__myhsp)].prev = __myhe; \
- *__myhsp = _hn; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
- (_hn) = 1; \
- (_recycle) = true; \
- } \
-} while (0)
+static inline int16
+pglz_hist_add(int16 hist_next, uint16 *hist_idx, const unsigned char* s, uint16 mask)
+{
+ int16* my_hist_start = &hist_start[*hist_idx];
+ PGLZ_HistEntry* entry = &(hist_entries)[hist_next];
+ /*
+ * Initialize entry with a new value.
+ */
+ entry->next_id = *my_hist_start;
+ entry->hist_idx = *hist_idx;
+ entry->pos = s;
-/* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
-#define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
-do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
-} while (0)
+ /*
+ * Update linked list head pointer.
+ */
+ *my_hist_start = hist_next;
+
+ /*
+ * Recalculate hash value.
+ */
+ *hist_idx = ((((*hist_idx) ^ (s[0] << 6)) << 2) ^ s[4]) & mask;
+
+ /*
+ * Shift history pointer.
+ */
+ hist_next++;
+ if (hist_next == PGLZ_HISTORY_SIZE + 1)
+ {
+ hist_next = 1;
+ }
+ return hist_next;
+}
/* ----------
- * pglz_out_literal -
+ * pglz_out_tag -
*
- * Outputs a literal byte to the destination buffer including the
+ * Outputs a backward reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination buffer including the
* appropriate control bit.
* ----------
*/
-#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
-} while (0)
+static inline unsigned char*
+pglz_out_tag(unsigned char* dest_ptr, int32 match_len, int32 match_offset)
+{
+ if (match_len > 17)
+ {
+ *(dest_ptr++) = (unsigned char)((((match_offset) & 0xf00) >> 4) | 0x0f);
+ *(dest_ptr++) = (unsigned char)(((match_offset) & 0xff));
+ *(dest_ptr++) = (unsigned char)((match_len) - 18);
+ }
+ else
+ {
+ *(dest_ptr++) = (unsigned char)((((match_offset) & 0xf00) >> 4) | ((match_len) - 3));
+ *(dest_ptr++) = (unsigned char)((match_offset) & 0xff);
+ }
+ return dest_ptr;
+}
/* ----------
- * pglz_out_tag -
+ * pglz_compare -
*
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
+ * Compares two sequences of bytes
+ * and returns positiion of first mismatch.
* ----------
*/
-#define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
-} while (0)
+static inline int32
+pglz_compare(int32 len, int32 len_bound, const unsigned char* input_pos,
+ const unsigned char* hist_pos)
+{
+ while(len <= len_bound - 4 && pglz_read32(input_pos) == pglz_read32(hist_pos))
+ {
+ len += 4;
+ input_pos += 4;
+ hist_pos += 4;
+ }
+ while (len < len_bound && *input_pos == *hist_pos)
+ {
+ len++;
+ input_pos++;
+ hist_pos++;
+ }
+ return len;
+}
/* ----------
@@ -396,32 +431,40 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(int16 *hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop, int mask)
+pglz_find_match(uint16 hist_idx, const unsigned char *input, const unsigned char *end,
+ int *lenp, int *offp, int good_match, int good_drop)
{
- PGLZ_HistEntry *hent;
- int16 hentno;
+ PGLZ_HistEntry *hist_entry;
+ int16 *hist_entry_number;
int32 len = 0;
- int32 off = 0;
+ int32 offset = 0;
+ int32 cur_len = 0;
+ int32 len_bound = Min(end - input, PGLZ_MAX_MATCH);
/*
* Traverse the linked history list until a good enough match is found.
*/
- hentno = hstart[pglz_hist_idx(input, end, mask)];
- hent = &hist_entries[hentno];
- while (hent != INVALID_ENTRY_PTR)
- {
- const char *ip = input;
- const char *hp = hent->pos;
- int32 thisoff;
- int32 thislen;
+ hist_entry_number = &hist_start[hist_idx];
+ if (*hist_entry_number == INVALID_ENTRY)
+ return 0;
+ hist_entry = &hist_entries[*hist_entry_number];
+ if (hist_idx != hist_entry->hist_idx)
+ {
/*
- * Stop if the offset does not fit into our tag anymore.
+ * If current linked list head points to invalid entry
+ * then clear it to reduce a number of comparisons in future.
*/
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ *hist_entry_number = INVALID_ENTRY;
+ return 0;
+ }
+
+ while (true)
+ {
+ const unsigned char *input_pos = input;
+ const unsigned char *hist_pos = hist_entry->pos;
+ const unsigned char *my_pos;
+ int32 cur_offset = input_pos - hist_pos;
/*
* Determine length of match. A better match must be larger than the
@@ -431,58 +474,57 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
* character by character comparison to know the exact position where
* the diff occurred.
*/
- thislen = 0;
if (len >= 16)
{
- if (memcmp(ip, hp, len) == 0)
+ if (memcmp(input_pos, hist_pos, len) == 0)
{
- thislen = len;
- ip += len;
- hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
- {
- thislen++;
- ip++;
- hp++;
- }
+ offset = cur_offset;
+ len = pglz_compare(len, len_bound, input_pos + len, hist_pos + len);
}
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ if (pglz_read32(input_pos) == pglz_read32(hist_pos))
{
- thislen++;
- ip++;
- hp++;
+ cur_len = pglz_compare(4, len_bound, input_pos + 4, hist_pos + 4);
+ if (cur_len > len)
+ {
+ len = cur_len;
+ offset = cur_offset;
+ }
}
}
- /*
- * Remember this match as the best (if it is)
- */
- if (thislen > len)
- {
- len = thislen;
- off = thisoff;
- }
-
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ my_pos = hist_entry->pos;
+ hist_entry = &hist_entries[hist_entry->next_id];
/*
- * Be happy with lesser good matches the more entries we visited. But
- * no point in doing calculation if we're at end of list.
+ * If current match length is ok then stop iteration.
+ * As outdated list links are not updated during insertion process
+ * then additional stop condition should be introduced to avoid following them.
+ * If recycled entry has another hash, then iteration stops.
+ * If it has the same hash then recycled cell would break input stream
+ * position monotonicity what is checked after.
*/
- if (hent != INVALID_ENTRY_PTR)
+ if (len >= good_match || hist_idx != hist_entry->hist_idx || my_pos <= hist_entry->pos)
{
- if (len >= good_match)
- break;
- good_match -= (good_match * good_drop) / 100;
+ break;
}
+ /*
+ * Be happy with lesser good matches the more entries we visited.
+ */
+ good_match -= (good_match * good_drop) >> 7;
}
+ /*
+ * As initial compare for short matches compares 4 bytes
+ * then for the end of stream length of match should be cut
+ */
+ len = Min(len, len_bound);
+
/*
* Return match information only if it results at least in one byte
* reduction.
@@ -490,7 +532,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
if (len > 2)
{
*lenp = len;
- *offp = off;
+ *offp = offset;
return 1;
}
@@ -509,26 +551,28 @@ int32
pglz_compress(const char *source, int32 slen, char *dest,
const PGLZ_Strategy *strategy)
{
- unsigned char *bp = (unsigned char *) dest;
- unsigned char *bstart = bp;
- int hist_next = 1;
- bool hist_recycle = false;
- const char *dp = source;
- const char *dend = source + slen;
- unsigned char ctrl_dummy = 0;
- unsigned char *ctrlp = &ctrl_dummy;
- unsigned char ctrlb = 0;
- unsigned char ctrl = 0;
+ unsigned char *dest_ptr = (unsigned char *) dest;
+ unsigned char *dest_start = dest_ptr;
+ uint16 hist_next = 1;
+ uint16 hist_idx;
+ const unsigned char *src_ptr = (const unsigned char*)source;
+ const unsigned char *src_end = (const unsigned char*)source + slen;
+ const unsigned char *compress_src_end = src_end - 4;
+ unsigned char control_dummy = 0;
+ unsigned char *control_ptr = &control_dummy;
+ unsigned char control_byte = 0;
+ unsigned char control_pos = 0;
bool found_match = false;
int32 match_len;
- int32 match_off;
+ int32 match_offset;
int32 good_match;
int32 good_drop;
int32 result_size;
int32 result_max;
int32 need_rate;
int hashsz;
- int mask;
+ uint16 mask;
+
/*
* Our fallback strategy is the default.
@@ -560,6 +604,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
else if (good_drop > 100)
good_drop = 100;
+ /* We use <<7 later to calculate actual drop, so allign percents to 128 */
+ good_drop = good_drop * 128 / 100;
+
need_rate = strategy->min_comp_rate;
if (need_rate < 0)
need_rate = 0;
@@ -577,7 +624,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
result_max = (slen / 100) * (100 - need_rate);
}
else
+ {
result_max = (slen * (100 - need_rate)) / 100;
+ }
/*
* Experiments suggest that these hash sizes work pretty well. A large
@@ -603,10 +652,21 @@ pglz_compress(const char *source, int32 slen, char *dest,
*/
memset(hist_start, 0, hashsz * sizeof(int16));
+ /*
+ * Initialize INVALID_ENTRY for stopping during lookup.
+ */
+ hist_entries[INVALID_ENTRY].pos = src_end;
+ hist_entries[INVALID_ENTRY].hist_idx = hashsz;
+
+ /*
+ * Calculate initial hash value.
+ */
+ hist_idx = pglz_hist_idx(src_ptr, mask);
+
/*
* Compress the source directly into the output buffer.
*/
- while (dp < dend)
+ while (src_ptr < compress_src_end)
{
/*
* If we already exceeded the maximum result size, fail.
@@ -615,7 +675,7 @@ pglz_compress(const char *source, int32 slen, char *dest,
* bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
* allow 4 slop bytes.
*/
- if (bp - bstart >= result_max)
+ if (dest_ptr - dest_start >= result_max)
return -1;
/*
@@ -624,26 +684,35 @@ pglz_compress(const char *source, int32 slen, char *dest,
* reasonably quickly when looking at incompressible input (such as
* pre-compressed data).
*/
- if (!found_match && bp - bstart >= strategy->first_success_by)
+ if (!found_match && dest_ptr - dest_start >= strategy->first_success_by)
return -1;
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
+ }
/*
* Try to find a match in the history
*/
- if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop, mask))
+ if (pglz_find_match(hist_idx, src_ptr, compress_src_end, &match_len,
+ &match_offset, good_match, good_drop))
{
/*
* Create the tag and add history entries for all matched
* characters.
*/
- pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ control_byte |= control_pos;
+ dest_ptr = pglz_out_tag(dest_ptr, match_len, match_offset);
while (match_len--)
{
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ src_ptr++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
found_match = true;
@@ -653,21 +722,48 @@ pglz_compress(const char *source, int32 slen, char *dest,
/*
* No match found. Copy one literal byte.
*/
- pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ *(dest_ptr)++ = (unsigned char)(*src_ptr);
+ src_ptr++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
+ control_pos <<= 1;
+ }
+
+
+ while (src_ptr < src_end)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (dest_ptr - dest_start >= result_max)
+ return -1;
+
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
+ }
+ *(dest_ptr)++ = (unsigned char)(*src_ptr);
+ src_ptr++;
+ control_pos <<= 1;
}
/*
* Write out the last control byte and check that we haven't overrun the
* output size allowed by the strategy.
*/
- *ctrlp = ctrlb;
- result_size = bp - bstart;
+ *control_ptr = control_byte;
+ result_size = dest_ptr - dest_start;
if (result_size >= result_max)
return -1;
--
2.24.3 (Apple Git-128)
On 9 Dec 2020, at 12:44, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
PFA the patch with some editorialisation by me.
I saw some reports of bottlenecking in pglz WAL compression [1].
I've checked that on my machine simple test
echo "wal_compression = on" >> $PGDATA/postgresql.conf
pgbench -i -s 20 && pgbench -T 30
shows ~2-3% improvement, but the result is not very stable; the deviation is comparable to the effect. In fact, the bottleneck just shifts away from pglz, so the impact is not that measurable.
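For those who want to reproduce this, one way to see where the bottleneck moves is to profile the server with Linux perf while the pgbench run is going and look for pglz_compress in the report; something along these lines should do:
perf record -a -g -- sleep 30   # while pgbench -T 30 is running
perf report --no-children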
I've found out that the patch continues ideas from thread [0] and commit 031cc55 [1], but in a much more shotgun-surgery way.
Out of curiosity I've rerun the tests from that thread:
postgres=# with patched as (select testname, avg(seconds) patched from testresults0 group by testname),unpatched as (select testname, avg(seconds) unpatched from testresults group by testname) select * from unpatched join patched using (testname);
testname | unpatched | patched
-------------------+------------------------+------------------------
512b random | 4.5568015000000000 | 4.3512980000000000
100k random | 1.03342300000000000000 | 1.00326200000000000000
100k of same byte | 2.1689715000000000 | 2.0958155000000000
2k random | 3.1613815000000000 | 3.1861350000000000
512b text | 5.7233600000000000 | 5.3602330000000000
5k text | 1.7044835000000000 | 1.8086770000000000
(6 rows)
Results of direct calls are somewhat clearer.
Unpatched:
testname | auto
-------------------+-----------
5k text | 1100.705
512b text | 240.585
2k random | 106.865
100k random | 2.663
512b random | 145.736
100k of same byte | 13426.880
(6 rows)
Patched:
testname | auto
-------------------+----------
5k text | 767.535
512b text | 159.076
2k random | 77.126
100k random | 1.698
512b random | 95.768
100k of same byte | 6035.159
(6 rows)
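For reference, something like the following minimal driver (my own sketch, not the test extension from the referenced thread) can be used to reproduce direct-call timings; it assumes a frontend build with the usual PostgreSQL include paths, e.g. gcc -DFRONTEND -I$PGSRC/src/include bench.c pg_lzcompress.c (adjust as needed):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include "postgres_fe.h"
#include "common/pg_lzcompress.h"

int
main(void)
{
	char	   *src = malloc(100 * 1024);
	char	   *dst;
	int32		slen;
	int32		rsize = 0;
	struct timespec t0, t1;

	/* read the test payload from stdin */
	slen = (int32) fread(src, 1, 100 * 1024, stdin);
	dst = malloc(PGLZ_MAX_OUTPUT(slen));

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < 1000; i++)
		rsize = pglz_compress(src, slen, dst, PGLZ_strategy_default);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* rsize is -1 if the data was not compressible enough */
	printf("%d -> %d bytes, %.1f us per call\n", slen, rsize,
		   ((t1.tv_sec - t0.tv_sec) * 1e6 +
			(t1.tv_nsec - t0.tv_nsec) / 1e3) / 1000.0);
	return 0;
}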
Thanks!
Best regards, Andrey Borodin.
[0]: /messages/by-id/5130C914.8080106@vmware.com
[1]: https://github.com/x4m/postgres_g/commit/031cc55bbea6b3a6b67c700498a78fb1d4399476
On 12 Dec 2020, at 22:47, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I've cleaned up comments, checked that the memory alignment stuff actually makes sense for 32-bit ARM (according to Godbolt), and did some more code cleanup. PFA v2 patch.
I'm still in doubt whether I should register this patch on CF or not. I'm willing to work on this, but it's not clear whether it will hit PGv14. And I hope that for PGv15 we will have lz4 or something better for WAL compression.
Best regards, Andrey Borodin.
Attachments:
v2-0001-Reorganize-pglz-compression-code.patch (application/octet-stream)
From c9d5db5f3ddca464d3bedbcc6462a38375b58b9e Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Thu, 27 Jun 2019 23:18:21 +0500
Subject: [PATCH v2] Reorganize pglz compression code
This patch accumulates several changes:
1. Convert macro-functions to regular functions for readability
2. Use more compact hash table with uint16 indexes instead of pointers
3. Avoid prev pointer in hash table
4. Use 4-byte comparisons in a search instead of 1-byte
Total speedup about x1.4
---
src/common/pg_lzcompress.c | 474 ++++++++++++++++++++++---------------
1 file changed, 287 insertions(+), 187 deletions(-)
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index d24d4803a9..96c6367dad 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -118,13 +118,13 @@
* our 4K maximum look-back distance is too small.
*
* The compressor creates a table for lists of positions.
- * For each input position (except the last 3), a hash key is
+ * For each input position (except the last 4), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
* lists of likely to be at least in the first 4 characters
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
- * kept for the last 4096 input positions, since we cannot use
+ * kept for the last 4094 input positions, since we cannot use
* back-pointers larger than that anyway. The size of the hash
* table is chosen based on the size of the input - a larger table
* has a larger startup cost, as it needs to be initialized to
@@ -146,17 +146,15 @@
* used for the next tag in the output.
*
* For each subsequent entry in the history list, the "good_match"
- * is lowered by 10%. So the compressor will be more happy with
- * short matches the farer it has to go back in the history.
+ * is lowered roughly by 10%. So the compressor will be more happy
+ * with short matches the farer it has to go back in the history.
* Another "speed against ratio" preference characteristic of
* the algorithm.
*
- * Thus there are 3 stop conditions for the lookup of matches:
+ * Thus there are 2 stop conditions for the lookup of matches:
*
* - a match >= good_match is found
* - there are no more history entries to look at
- * - the next history entry is already too far back
- * to be coded into a tag.
*
* Finally the match algorithm checks that at least a match
* of 3 or more bytes has been found, because that is the smallest
@@ -172,6 +170,11 @@
*
* Jan Wieck
*
+ * Many thanks to Yann Collet for his article about unaligned
+ * memory access.
+ *
+ * Leskov Vladimir
+ *
* Copyright (c) 1999-2020, PostgreSQL Global Development Group
*
* src/common/pg_lzcompress.c
@@ -188,12 +191,57 @@
#include "common/pg_lzcompress.h"
+/**************************************
+ * CPU Feature Detection *
+ **************************************/
+/* PGLZ_FORCE_MEMORY_ACCESS
+ * By default, access to unaligned memory is controlled by `memcpy()`, which is safe and portable.
+ * Unfortunately, on some target/compiler combinations, the generated assembly is sub-optimal.
+ * The below switch allow to select different access method for improved performance.
+ * Method 0 (default) : use `memcpy()`. Safe and portable.
+ * Method 1 : direct access. This method is portable but violate C standard.
+ * It can generate buggy code on targets which assembly generation depends on alignment.
+ * But in some circumstances, it's the only known way to get the most performance (ie GCC + ARMv6)
+ * See https://fastcompression.blogspot.fr/2015/08/accessing-unaligned-memory.html for details.
+ * Prefer these methods in priority order (0 > 1)
+ */
+#ifndef PGLZ_FORCE_MEMORY_ACCESS /* can be defined externally */
+#if defined(__GNUC__) && \
+ ( defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || defined(__ARM_ARCH_6K__) \
+ || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__) )
+#define PGLZ_FORCE_MEMORY_ACCESS 1
+#endif
+#endif
+
+#if defined(PGLZ_FORCE_MEMORY_ACCESS) && (PGLZ_FORCE_MEMORY_ACCESS==1)
+/* lie to the compiler about data alignment; use with caution */
+
+static uint32
+pglz_read32(const void *ptr)
+{
+ return *(const uint32 *) ptr;
+}
+
+#else /* safe and portable access using memcpy() */
+
+static uint32
+pglz_read32(const void *ptr)
+{
+ uint32 val;
+
+ memcpy(&val, ptr, sizeof(val));
+ return val;
+}
+
+#endif /* PGLZ_FORCE_MEMORY_ACCESS */
+
+
/* ----------
* Local definitions
* ----------
*/
#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_SIZE 4096
+#define PGLZ_HISTORY_SIZE 0x0fff - 1 /* to avoid compare in iteration */
#define PGLZ_MAX_MATCH 273
@@ -202,17 +250,16 @@
*
* Linked list for the backward history lookup
*
- * All the entries sharing a hash key are linked in a doubly linked list.
- * This makes it easy to remove an entry when it's time to recycle it
- * (because it's more than 4K positions old).
+ * All the entries sharing a hash key are linked in a singly linked list.
+ * Links are not changed during insertion in order to speed it up.
+ * Instead more complicated stop condition is used during list iteration.
* ----------
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
- const char *pos; /* my input position */
+ int16 next_id; /* links for my hash key's list */
+ uint16 hist_idx; /* my current hash key */
+ const unsigned char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -256,11 +303,9 @@ static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
/*
- * Element 0 in hist_entries is unused, and means 'invalid'. Likewise,
- * INVALID_ENTRY_PTR in next/prev pointers mean 'invalid'.
+ * Element 0 in hist_entries is unused, and means 'invalid'.
*/
#define INVALID_ENTRY 0
-#define INVALID_ENTRY_PTR (&hist_entries[INVALID_ENTRY])
/* ----------
* pglz_hist_idx -
@@ -274,117 +319,111 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e, _mask) ( \
- ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
- ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
- )
+static inline uint16
+pglz_hist_idx(const unsigned char *s, uint16 mask)
+{
+ /*
+ * Note that we only calculate function at the beginning of compressed data.
+ * Further hash valued are computed by subtracting prev byte and adding next.
+ * For details see pglz_hist_add().
+ */
+ return ((s[0] << 6) ^ (s[1] << 4) ^ (s[2] << 2) ^ s[3]) & mask;
+}
/* ----------
* pglz_hist_add -
*
- * Adds a new entry to the history table.
- *
- * If _recycle is true, then we are recycling a previously used entry,
- * and must first delink it from its old hashcode's linked list.
- *
- * NOTE: beware of multiple evaluations of macro's arguments, and note that
- * _hn and _recycle are modified in the macro.
+ * Adds a new entry to the history table
+ * and recalculates hash value.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \
-do { \
- int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \
- int16 *__myhsp = &(_hs)[__hindex]; \
- PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
- if (_recycle) { \
- if (__myhe->prev == NULL) \
- (_hs)[__myhe->hindex] = __myhe->next - (_he); \
- else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
- } \
- __myhe->next = &(_he)[*__myhsp]; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
- __myhe->pos = (_s); \
- /* If there was an existing entry in this hash slot, link */ \
- /* this new entry to it. However, the 0th entry in the */ \
- /* entries table is unused, so we can freely scribble on it. */ \
- /* So don't bother checking if the slot was used - we'll */ \
- /* scribble on the unused entry if it was not, but that's */ \
- /* harmless. Avoiding the branch in this critical path */ \
- /* speeds this up a little bit. */ \
- /* if (*__myhsp != INVALID_ENTRY) */ \
- (_he)[(*__myhsp)].prev = __myhe; \
- *__myhsp = _hn; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
- (_hn) = 1; \
- (_recycle) = true; \
- } \
-} while (0)
+static inline int16
+pglz_hist_add(int16 hist_next, uint16 *hist_idx, const unsigned char *s, uint16 mask)
+{
+ int16 *my_hist_start = &hist_start[*hist_idx];
+ PGLZ_HistEntry *entry = &(hist_entries)[hist_next];
+ /*
+ * Initialize entry with a new value.
+ */
+ entry->next_id = *my_hist_start;
+ entry->hist_idx = *hist_idx;
+ entry->pos = s;
-/* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
-#define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
-do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
-} while (0)
+ /*
+ * Update linked list head pointer.
+ */
+ *my_hist_start = hist_next;
+
+ /*
+ * Recalculate hash value for next position. Remove current byte and add next.
+ */
+ *hist_idx = ((((*hist_idx) ^ (s[0] << 6)) << 2) ^ s[4]) & mask;
+
+ /*
+ * Shift history pointer.
+ */
+ hist_next++;
+ if (hist_next == PGLZ_HISTORY_SIZE + 1)
+ {
+ hist_next = 1;
+ }
+ return hist_next;
+}
/* ----------
- * pglz_out_literal -
+ * pglz_out_tag -
*
- * Outputs a literal byte to the destination buffer including the
+ * Outputs a backward reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination buffer including the
* appropriate control bit.
* ----------
*/
-#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
-} while (0)
+static inline unsigned char *
+pglz_out_tag(unsigned char *dest_ptr, int32 match_len, int32 match_offset)
+{
+ if (match_len > 17)
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | 0x0f);
+ *(dest_ptr++) = (unsigned char) (((match_offset) & 0xff));
+ *(dest_ptr++) = (unsigned char) ((match_len) - 18);
+ }
+ else
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | ((match_len) - 3));
+ *(dest_ptr++) = (unsigned char) ((match_offset) & 0xff);
+ }
+ return dest_ptr;
+}
/* ----------
- * pglz_out_tag -
+ * pglz_compare -
*
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
+ * Compares two sequences of bytes
+ * and returns positiion of first mismatch.
* ----------
*/
-#define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
-} while (0)
+static inline int32
+pglz_compare(int32 len, int32 len_bound, const unsigned char *input_pos,
+ const unsigned char *hist_pos)
+{
+ while (len <= len_bound - 4 && pglz_read32(input_pos) == pglz_read32(hist_pos))
+ {
+ len += 4;
+ input_pos += 4;
+ hist_pos += 4;
+ }
+ while (len < len_bound && *input_pos == *hist_pos)
+ {
+ len++;
+ input_pos++;
+ hist_pos++;
+ }
+ return len;
+}
/* ----------
@@ -396,32 +435,40 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(int16 *hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop, int mask)
+pglz_find_match(uint16 hist_idx, const unsigned char *input, const unsigned char *end,
+ int *lenp, int *offp, int good_match, int good_drop)
{
- PGLZ_HistEntry *hent;
- int16 hentno;
+ PGLZ_HistEntry *hist_entry;
+ int16 *hist_entry_number;
int32 len = 0;
- int32 off = 0;
+ int32 offset = 0;
+ int32 cur_len = 0;
+ int32 len_bound = Min(end - input, PGLZ_MAX_MATCH);
/*
* Traverse the linked history list until a good enough match is found.
*/
- hentno = hstart[pglz_hist_idx(input, end, mask)];
- hent = &hist_entries[hentno];
- while (hent != INVALID_ENTRY_PTR)
- {
- const char *ip = input;
- const char *hp = hent->pos;
- int32 thisoff;
- int32 thislen;
+ hist_entry_number = &hist_start[hist_idx];
+ if (*hist_entry_number == INVALID_ENTRY)
+ return 0;
+ hist_entry = &hist_entries[*hist_entry_number];
+ if (hist_idx != hist_entry->hist_idx)
+ {
/*
- * Stop if the offset does not fit into our tag anymore.
+ * If current linked list head points to invalid entry then clear it
+ * to reduce a number of comparisons in future.
*/
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ *hist_entry_number = INVALID_ENTRY;
+ return 0;
+ }
+
+ while (true)
+ {
+ const unsigned char *input_pos = input;
+ const unsigned char *hist_pos = hist_entry->pos;
+ const unsigned char *my_pos;
+ int32 cur_offset = input_pos - hist_pos;
/*
* Determine length of match. A better match must be larger than the
@@ -431,58 +478,58 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
* character by character comparison to know the exact position where
* the diff occurred.
*/
- thislen = 0;
if (len >= 16)
{
- if (memcmp(ip, hp, len) == 0)
+ if (memcmp(input_pos, hist_pos, len) == 0)
{
- thislen = len;
- ip += len;
- hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
- {
- thislen++;
- ip++;
- hp++;
- }
+ offset = cur_offset;
+ len = pglz_compare(len, len_bound, input_pos + len, hist_pos + len);
}
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ if (pglz_read32(input_pos) == pglz_read32(hist_pos))
{
- thislen++;
- ip++;
- hp++;
+ cur_len = pglz_compare(4, len_bound, input_pos + 4, hist_pos + 4);
+ if (cur_len > len)
+ {
+ len = cur_len;
+ offset = cur_offset;
+ }
}
}
- /*
- * Remember this match as the best (if it is)
- */
- if (thislen > len)
- {
- len = thislen;
- off = thisoff;
- }
-
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ my_pos = hist_entry->pos;
+ hist_entry = &hist_entries[hist_entry->next_id];
/*
- * Be happy with lesser good matches the more entries we visited. But
- * no point in doing calculation if we're at end of list.
+ * If current match length is ok then stop iteration. As outdated list
+ * links are not updated during insertion process then additional stop
+ * condition should be introduced to avoid following them. If recycled
+ * entry has another hash, then iteration stops. If it has the same
+ * hash then recycled cell would break input stream position
+ * monotonicity what is checked after.
*/
- if (hent != INVALID_ENTRY_PTR)
+ if (len >= good_match || hist_idx != hist_entry->hist_idx || my_pos <= hist_entry->pos)
{
- if (len >= good_match)
- break;
- good_match -= (good_match * good_drop) / 100;
+ break;
}
+
+ /*
+ * Be happy with lesser good matches the more entries we visited.
+ */
+ good_match -= (good_match * good_drop) >> 7;
}
+ /*
+ * As initial compare for short matches compares 4 bytes then for the end
+ * of stream length of match should be cut
+ */
+ len = Min(len, len_bound);
+
/*
* Return match information only if it results at least in one byte
* reduction.
@@ -490,7 +537,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
if (len > 2)
{
*lenp = len;
- *offp = off;
+ *offp = offset;
return 1;
}
@@ -509,26 +556,28 @@ int32
pglz_compress(const char *source, int32 slen, char *dest,
const PGLZ_Strategy *strategy)
{
- unsigned char *bp = (unsigned char *) dest;
- unsigned char *bstart = bp;
- int hist_next = 1;
- bool hist_recycle = false;
- const char *dp = source;
- const char *dend = source + slen;
- unsigned char ctrl_dummy = 0;
- unsigned char *ctrlp = &ctrl_dummy;
- unsigned char ctrlb = 0;
- unsigned char ctrl = 0;
+ unsigned char *dest_ptr = (unsigned char *) dest;
+ unsigned char *dest_start = dest_ptr;
+ uint16 hist_next = 1;
+ uint16 hist_idx;
+ const unsigned char *src_ptr = (const unsigned char *) source;
+ const unsigned char *src_end = (const unsigned char *) source + slen;
+ const unsigned char *compress_src_end = src_end - 4;
+ unsigned char control_dummy = 0;
+ unsigned char *control_ptr = &control_dummy;
+ unsigned char control_byte = 0;
+ unsigned char control_pos = 0;
bool found_match = false;
int32 match_len;
- int32 match_off;
+ int32 match_offset;
int32 good_match;
int32 good_drop;
int32 result_size;
int32 result_max;
int32 need_rate;
int hashsz;
- int mask;
+ uint16 mask;
+
/*
* Our fallback strategy is the default.
@@ -560,6 +609,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
else if (good_drop > 100)
good_drop = 100;
+ /* We use <<7 later to calculate actual drop, so allign percents to 128 */
+ good_drop = good_drop * 128 / 100;
+
need_rate = strategy->min_comp_rate;
if (need_rate < 0)
need_rate = 0;
@@ -577,7 +629,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
result_max = (slen / 100) * (100 - need_rate);
}
else
+ {
result_max = (slen * (100 - need_rate)) / 100;
+ }
/*
* Experiments suggest that these hash sizes work pretty well. A large
@@ -603,10 +657,21 @@ pglz_compress(const char *source, int32 slen, char *dest,
*/
memset(hist_start, 0, hashsz * sizeof(int16));
+ /*
+ * Initialize INVALID_ENTRY for stopping during lookup.
+ */
+ hist_entries[INVALID_ENTRY].pos = src_end;
+ hist_entries[INVALID_ENTRY].hist_idx = hashsz;
+
+ /*
+ * Calculate initial hash value.
+ */
+ hist_idx = pglz_hist_idx(src_ptr, mask);
+
/*
* Compress the source directly into the output buffer.
*/
- while (dp < dend)
+ while (src_ptr < compress_src_end)
{
/*
* If we already exceeded the maximum result size, fail.
@@ -615,7 +680,7 @@ pglz_compress(const char *source, int32 slen, char *dest,
* bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
* allow 4 slop bytes.
*/
- if (bp - bstart >= result_max)
+ if (dest_ptr - dest_start >= result_max)
return -1;
/*
@@ -624,27 +689,36 @@ pglz_compress(const char *source, int32 slen, char *dest,
* reasonably quickly when looking at incompressible input (such as
* pre-compressed data).
*/
- if (!found_match && bp - bstart >= strategy->first_success_by)
+ if (!found_match && dest_ptr - dest_start >= strategy->first_success_by)
return -1;
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
+ }
+
/*
* Try to find a match in the history
*/
- if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop, mask))
+ if (pglz_find_match(hist_idx, src_ptr, compress_src_end, &match_len,
+ &match_offset, good_match, good_drop))
{
/*
* Create the tag and add history entries for all matched
* characters.
*/
- pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ control_byte |= control_pos;
+ dest_ptr = pglz_out_tag(dest_ptr, match_len, match_offset);
while (match_len--)
{
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ src_ptr++;
}
found_match = true;
}
@@ -653,21 +727,47 @@ pglz_compress(const char *source, int32 slen, char *dest,
/*
* No match found. Copy one literal byte.
*/
- pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ }
+ control_pos <<= 1;
+ }
+
+
+ while (src_ptr < src_end)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (dest_ptr - dest_start >= result_max)
+ return -1;
+
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
}
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ control_pos <<= 1;
}
/*
* Write out the last control byte and check that we haven't overrun the
* output size allowed by the strategy.
*/
- *ctrlp = ctrlb;
- result_size = bp - bstart;
+ *control_ptr = control_byte;
+ result_size = dest_ptr - dest_start;
if (result_size >= result_max)
return -1;
--
2.24.3 (Apple Git-128)
On 12/26/20 8:06 AM, Andrey Borodin wrote:
On 12 Dec 2020, at 22:47, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I've cleaned up comments, checked that the memory alignment stuff actually makes sense for 32-bit ARM (according to Godbolt), and did some more code cleanup. PFA v2 patch.
I'm still in doubt whether I should register this patch on CF or not. I'm willing to work on this, but it's not clear whether it will hit PGv14. And I hope that for PGv15 we will have lz4 or something better for WAL compression.
I'd suggest registering it, otherwise people are much less likely to
give you feedback. I don't see why it couldn't land in PG14.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
On 12/26/20 8:06 AM, Andrey Borodin wrote:
I'm still in doubt whether I should register this patch on CF or not. I'm willing to work on this, but it's not clear whether it will hit PGv14. And I hope that for PGv15 we will have lz4 or something better for WAL compression.
I'd suggest registering it, otherwise people are much less likely to
give you feedback. I don't see why it couldn't land in PG14.
Even if lz4 or something else shows up, the existing code will remain
important for TOAST purposes. It would be years before we lose interest
in it.
regards, tom lane
On Sat, Dec 26, 2020 at 12:06:59PM +0500, Andrey Borodin wrote:
On 12 Dec 2020, at 22:47, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I've cleaned up comments, checked that the memory alignment stuff actually makes sense for 32-bit ARM (according to Godbolt), and did some more code cleanup. PFA v2 patch.
I'm still in doubt whether I should register this patch on CF or not. I'm willing to work on this, but it's not clear whether it will hit PGv14. And I hope that for PGv15 we will have lz4 or something better for WAL compression.
Thanks for registering it.
There are some typos in the current patch:
farer (further: but it's not your typo)
positiion
reduce a => reduce the
monotonicity what => monotonicity, which
lesser good => less good
allign: align
This comment I couldn't understand:
+ * As initial compare for short matches compares 4 bytes then for the end
+ * of stream length of match should be cut
--
Justin
Thanks for looking into this, Justin!
On 30 Dec 2020, at 09:39, Justin Pryzby <pryzby@telsasoft.com> wrote:
There's some typos in the current patch;
farer (further: but it's not your typo)
positiion
reduce a => reduce the
monotonicity what => monotonicity, which
lesser good => less good
allign: align
Fixed.
This comment I couldn't understand:
+ * As initial compare for short matches compares 4 bytes then for the end
+ * of stream length of match should be cut
I've reworded comments.
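For the archives, my reading of what that comment is getting at (an illustration of the code path, not the patch's wording): len_bound = Min(end - input, PGLZ_MAX_MATCH) can drop below 4 near the end of the compressible region, while the short-match branch always starts from a whole 4-byte probe. For example, with only two bytes left before compress_src_end, len_bound is 2, yet a matching pglz_read32() probe enters pglz_compare(4, 2, ...) with len = 4, which returns 4 and credits bytes beyond the bound; the final len = Min(len, len_bound) cuts that back to 2, which then fails the len > 2 test, so a literal is emitted instead of an oversized tag.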
Best regards, Andrey Borodin.
Attachments:
v3-0001-Reorganize-pglz-compression-code.patch (application/octet-stream)
From 73d5335339f3712bf684b98e58b6dc3239df16f7 Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Thu, 27 Jun 2019 23:18:21 +0500
Subject: [PATCH v3] Reorganize pglz compression code
This patch accumulates several changes:
1. Convert macro-functions to regular functions for readability
2. Use more compact hash table with uint16 indexes instead of pointers
3. Avoid prev pointer in hash table
4. Use 4-byte comparisons in a search instead of 1-byte
Total speedup about x1.4
---
src/common/pg_lzcompress.c | 476 ++++++++++++++++++++++---------------
1 file changed, 289 insertions(+), 187 deletions(-)
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index d24d4803a9..d1dbb2631a 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -118,13 +118,13 @@
* our 4K maximum look-back distance is too small.
*
* The compressor creates a table for lists of positions.
- * For each input position (except the last 3), a hash key is
+ * For each input position (except the last 4), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
* lists of likely to be at least in the first 4 characters
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
- * kept for the last 4096 input positions, since we cannot use
+ * kept for the last 4094 input positions, since we cannot use
* back-pointers larger than that anyway. The size of the hash
* table is chosen based on the size of the input - a larger table
* has a larger startup cost, as it needs to be initialized to
@@ -146,17 +146,15 @@
* used for the next tag in the output.
*
* For each subsequent entry in the history list, the "good_match"
- * is lowered by 10%. So the compressor will be more happy with
- * short matches the farer it has to go back in the history.
+ * is lowered roughly by 10%. So the compressor will be more happy
+ * with short matches the further it has to go back in the history.
* Another "speed against ratio" preference characteristic of
* the algorithm.
*
- * Thus there are 3 stop conditions for the lookup of matches:
+ * Thus there are 2 stop conditions for the lookup of matches:
*
* - a match >= good_match is found
* - there are no more history entries to look at
- * - the next history entry is already too far back
- * to be coded into a tag.
*
* Finally the match algorithm checks that at least a match
* of 3 or more bytes has been found, because that is the smallest
@@ -172,6 +170,11 @@
*
* Jan Wieck
*
+ * Many thanks to Yann Collet for his article about unaligned
+ * memory access.
+ *
+ * Leskov Vladimir
+ *
* Copyright (c) 1999-2020, PostgreSQL Global Development Group
*
* src/common/pg_lzcompress.c
@@ -188,12 +191,57 @@
#include "common/pg_lzcompress.h"
+/**************************************
+ * CPU Feature Detection *
+ **************************************/
+/* PGLZ_FORCE_MEMORY_ACCESS
+ * By default, access to unaligned memory is controlled by `memcpy()`, which is safe and portable.
+ * Unfortunately, on some target/compiler combinations, the generated assembly is sub-optimal.
+ * The below switch allow to select different access method for improved performance.
+ * Method 0 (default) : use `memcpy()`. Safe and portable.
+ * Method 1 : direct access. This method is portable but violate C standard.
+ * It can generate buggy code on targets which assembly generation depends on alignment.
+ * But in some circumstances, it's the only known way to get the most performance (ie GCC + ARMv6)
+ * See https://fastcompression.blogspot.fr/2015/08/accessing-unaligned-memory.html for details.
+ * Prefer these methods in priority order (0 > 1)
+ */
+#ifndef PGLZ_FORCE_MEMORY_ACCESS /* can be defined externally */
+#if defined(__GNUC__) && \
+ ( defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || defined(__ARM_ARCH_6K__) \
+ || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__) )
+#define PGLZ_FORCE_MEMORY_ACCESS 1
+#endif
+#endif
+
+#if defined(PGLZ_FORCE_MEMORY_ACCESS) && (PGLZ_FORCE_MEMORY_ACCESS==1)
+/* lie to the compiler about data alignment; use with caution */
+
+static uint32
+pglz_read32(const void *ptr)
+{
+ return *(const uint32 *) ptr;
+}
+
+#else /* safe and portable access using memcpy() */
+
+static uint32
+pglz_read32(const void *ptr)
+{
+ uint32 val;
+
+ memcpy(&val, ptr, sizeof(val));
+ return val;
+}
+
+#endif /* PGLZ_FORCE_MEMORY_ACCESS */
+
+
/* ----------
* Local definitions
* ----------
*/
#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_SIZE 4096
+#define PGLZ_HISTORY_SIZE 0x0fff - 1 /* to avoid compare in iteration */
#define PGLZ_MAX_MATCH 273
@@ -202,17 +250,16 @@
*
* Linked list for the backward history lookup
*
- * All the entries sharing a hash key are linked in a doubly linked list.
- * This makes it easy to remove an entry when it's time to recycle it
- * (because it's more than 4K positions old).
+ * All the entries sharing a hash key are linked in a singly linked list.
+ * Links are not changed during insertion in order to speed it up.
+ * Instead more complicated stop condition is used during list iteration.
* ----------
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
- const char *pos; /* my input position */
+ int16 next_id; /* links for my hash key's list */
+ uint16 hist_idx; /* my current hash key */
+ const unsigned char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -256,11 +303,9 @@ static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
/*
- * Element 0 in hist_entries is unused, and means 'invalid'. Likewise,
- * INVALID_ENTRY_PTR in next/prev pointers mean 'invalid'.
+ * Element 0 in hist_entries is unused, and means 'invalid'.
*/
#define INVALID_ENTRY 0
-#define INVALID_ENTRY_PTR (&hist_entries[INVALID_ENTRY])
/* ----------
* pglz_hist_idx -
@@ -274,117 +319,111 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e, _mask) ( \
- ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
- ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
- )
+static inline uint16
+pglz_hist_idx(const unsigned char *s, uint16 mask)
+{
+ /*
+ * Note that we only calculate function at the beginning of compressed data.
+ * Further hash valued are computed by subtracting prev byte and adding next.
+ * For details see pglz_hist_add().
+ */
+ return ((s[0] << 6) ^ (s[1] << 4) ^ (s[2] << 2) ^ s[3]) & mask;
+}
/* ----------
* pglz_hist_add -
*
- * Adds a new entry to the history table.
- *
- * If _recycle is true, then we are recycling a previously used entry,
- * and must first delink it from its old hashcode's linked list.
- *
- * NOTE: beware of multiple evaluations of macro's arguments, and note that
- * _hn and _recycle are modified in the macro.
+ * Adds a new entry to the history table
+ * and recalculates hash value.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \
-do { \
- int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \
- int16 *__myhsp = &(_hs)[__hindex]; \
- PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
- if (_recycle) { \
- if (__myhe->prev == NULL) \
- (_hs)[__myhe->hindex] = __myhe->next - (_he); \
- else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
- } \
- __myhe->next = &(_he)[*__myhsp]; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
- __myhe->pos = (_s); \
- /* If there was an existing entry in this hash slot, link */ \
- /* this new entry to it. However, the 0th entry in the */ \
- /* entries table is unused, so we can freely scribble on it. */ \
- /* So don't bother checking if the slot was used - we'll */ \
- /* scribble on the unused entry if it was not, but that's */ \
- /* harmless. Avoiding the branch in this critical path */ \
- /* speeds this up a little bit. */ \
- /* if (*__myhsp != INVALID_ENTRY) */ \
- (_he)[(*__myhsp)].prev = __myhe; \
- *__myhsp = _hn; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
- (_hn) = 1; \
- (_recycle) = true; \
- } \
-} while (0)
+static inline int16
+pglz_hist_add(int16 hist_next, uint16 *hist_idx, const unsigned char *s, uint16 mask)
+{
+ int16 *my_hist_start = &hist_start[*hist_idx];
+ PGLZ_HistEntry *entry = &(hist_entries)[hist_next];
+ /*
+ * Initialize entry with a new value.
+ */
+ entry->next_id = *my_hist_start;
+ entry->hist_idx = *hist_idx;
+ entry->pos = s;
-/* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
-#define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
-do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
-} while (0)
+ /*
+ * Update linked list head pointer.
+ */
+ *my_hist_start = hist_next;
+
+ /*
+ * Recalculate hash value for next position. Remove current byte and add next.
+ */
+ *hist_idx = ((((*hist_idx) ^ (s[0] << 6)) << 2) ^ s[4]) & mask;
+
+ /*
+ * Shift history pointer.
+ */
+ hist_next++;
+ if (hist_next == PGLZ_HISTORY_SIZE + 1)
+ {
+ hist_next = 1;
+ }
+ return hist_next;
+}
/* ----------
- * pglz_out_literal -
+ * pglz_out_tag -
*
- * Outputs a literal byte to the destination buffer including the
+ * Outputs a backward reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination buffer including the
* appropriate control bit.
* ----------
*/
-#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
-} while (0)
+static inline unsigned char *
+pglz_out_tag(unsigned char *dest_ptr, int32 match_len, int32 match_offset)
+{
+ if (match_len > 17)
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | 0x0f);
+ *(dest_ptr++) = (unsigned char) (((match_offset) & 0xff));
+ *(dest_ptr++) = (unsigned char) ((match_len) - 18);
+ }
+ else
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | ((match_len) - 3));
+ *(dest_ptr++) = (unsigned char) ((match_offset) & 0xff);
+ }
+ return dest_ptr;
+}
/* ----------
- * pglz_out_tag -
+ * pglz_compare -
*
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
+ * Compares two sequences of bytes
+ * and returns position of first mismatch.
* ----------
*/
-#define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
-} while (0)
+static inline int32
+pglz_compare(int32 len, int32 len_bound, const unsigned char *input_pos,
+ const unsigned char *hist_pos)
+{
+ while (len <= len_bound - 4 && pglz_read32(input_pos) == pglz_read32(hist_pos))
+ {
+ len += 4;
+ input_pos += 4;
+ hist_pos += 4;
+ }
+ while (len < len_bound && *input_pos == *hist_pos)
+ {
+ len++;
+ input_pos++;
+ hist_pos++;
+ }
+ return len;
+}
/* ----------
@@ -396,32 +435,40 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(int16 *hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop, int mask)
+pglz_find_match(uint16 hist_idx, const unsigned char *input, const unsigned char *end,
+ int *lenp, int *offp, int good_match, int good_drop)
{
- PGLZ_HistEntry *hent;
- int16 hentno;
+ PGLZ_HistEntry *hist_entry;
+ int16 *hist_entry_number;
int32 len = 0;
- int32 off = 0;
+ int32 offset = 0;
+ int32 cur_len = 0;
+ int32 len_bound = Min(end - input, PGLZ_MAX_MATCH);
/*
* Traverse the linked history list until a good enough match is found.
*/
- hentno = hstart[pglz_hist_idx(input, end, mask)];
- hent = &hist_entries[hentno];
- while (hent != INVALID_ENTRY_PTR)
- {
- const char *ip = input;
- const char *hp = hent->pos;
- int32 thisoff;
- int32 thislen;
+ hist_entry_number = &hist_start[hist_idx];
+ if (*hist_entry_number == INVALID_ENTRY)
+ return 0;
+ hist_entry = &hist_entries[*hist_entry_number];
+ if (hist_idx != hist_entry->hist_idx)
+ {
/*
- * Stop if the offset does not fit into our tag anymore.
+ * If current linked list head points to invalid entry then clear it
+ * to reduce the number of comparisons in future.
*/
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ *hist_entry_number = INVALID_ENTRY;
+ return 0;
+ }
+
+ while (true)
+ {
+ const unsigned char *input_pos = input;
+ const unsigned char *hist_pos = hist_entry->pos;
+ const unsigned char *my_pos;
+ int32 cur_offset = input_pos - hist_pos;
/*
* Determine length of match. A better match must be larger than the
@@ -431,58 +478,60 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
* character by character comparison to know the exact position where
* the diff occurred.
*/
- thislen = 0;
if (len >= 16)
{
- if (memcmp(ip, hp, len) == 0)
+ if (memcmp(input_pos, hist_pos, len) == 0)
{
- thislen = len;
- ip += len;
- hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
- {
- thislen++;
- ip++;
- hp++;
- }
+ offset = cur_offset;
+ len = pglz_compare(len, len_bound, input_pos + len, hist_pos + len);
}
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ /*
+ * Start search for short matches by comparing 4 bytes
+ */
+ if (pglz_read32(input_pos) == pglz_read32(hist_pos))
{
- thislen++;
- ip++;
- hp++;
+ cur_len = pglz_compare(4, len_bound, input_pos + 4, hist_pos + 4);
+ if (cur_len > len)
+ {
+ len = cur_len;
+ offset = cur_offset;
+ }
}
}
- /*
- * Remember this match as the best (if it is)
- */
- if (thislen > len)
- {
- len = thislen;
- off = thisoff;
- }
-
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ my_pos = hist_entry->pos;
+ hist_entry = &hist_entries[hist_entry->next_id];
/*
- * Be happy with lesser good matches the more entries we visited. But
- * no point in doing calculation if we're at end of list.
+ * If the current match length is good enough, stop the iteration. Since
+ * outdated list links are not updated during insertion, an additional
+ * stop condition is needed to avoid following them: if a recycled
+ * entry has a different hash, the iteration stops; if it has the same
+ * hash, the recycled cell would break the monotonicity of input stream
+ * positions, which is checked afterwards.
*/
- if (hent != INVALID_ENTRY_PTR)
+ if (len >= good_match || hist_idx != hist_entry->hist_idx || my_pos <= hist_entry->pos)
{
- if (len >= good_match)
- break;
- good_match -= (good_match * good_drop) / 100;
+ break;
}
+
+ /*
+ * Be happy with less good matches the more entries we visited.
+ */
+ good_match -= (good_match * good_drop) >> 7;
}
+ /*
+ * The found match can be slightly longer than necessary; bound it by len_bound.
+ */
+ len = Min(len, len_bound);
+
/*
* Return match information only if it results at least in one byte
* reduction.
@@ -490,7 +539,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
if (len > 2)
{
*lenp = len;
- *offp = off;
+ *offp = offset;
return 1;
}
@@ -509,26 +558,28 @@ int32
pglz_compress(const char *source, int32 slen, char *dest,
const PGLZ_Strategy *strategy)
{
- unsigned char *bp = (unsigned char *) dest;
- unsigned char *bstart = bp;
- int hist_next = 1;
- bool hist_recycle = false;
- const char *dp = source;
- const char *dend = source + slen;
- unsigned char ctrl_dummy = 0;
- unsigned char *ctrlp = &ctrl_dummy;
- unsigned char ctrlb = 0;
- unsigned char ctrl = 0;
+ unsigned char *dest_ptr = (unsigned char *) dest;
+ unsigned char *dest_start = dest_ptr;
+ uint16 hist_next = 1;
+ uint16 hist_idx;
+ const unsigned char *src_ptr = (const unsigned char *) source;
+ const unsigned char *src_end = (const unsigned char *) source + slen;
+ const unsigned char *compress_src_end = src_end - 4;
+ unsigned char control_dummy = 0;
+ unsigned char *control_ptr = &control_dummy;
+ unsigned char control_byte = 0;
+ unsigned char control_pos = 0;
bool found_match = false;
int32 match_len;
- int32 match_off;
+ int32 match_offset;
int32 good_match;
int32 good_drop;
int32 result_size;
int32 result_max;
int32 need_rate;
int hashsz;
- int mask;
+ uint16 mask;
+
/*
* Our fallback strategy is the default.
@@ -560,6 +611,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
else if (good_drop > 100)
good_drop = 100;
+ /* We use <<7 later to calculate actual drop, so align percents to 128 */
+ good_drop = good_drop * 128 / 100;
+
need_rate = strategy->min_comp_rate;
if (need_rate < 0)
need_rate = 0;
@@ -577,7 +631,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
result_max = (slen / 100) * (100 - need_rate);
}
else
+ {
result_max = (slen * (100 - need_rate)) / 100;
+ }
/*
* Experiments suggest that these hash sizes work pretty well. A large
@@ -603,10 +659,21 @@ pglz_compress(const char *source, int32 slen, char *dest,
*/
memset(hist_start, 0, hashsz * sizeof(int16));
+ /*
+ * Initialize INVALID_ENTRY for stopping during lookup.
+ */
+ hist_entries[INVALID_ENTRY].pos = src_end;
+ hist_entries[INVALID_ENTRY].hist_idx = hashsz;
+
+ /*
+ * Calculate initial hash value.
+ */
+ hist_idx = pglz_hist_idx(src_ptr, mask);
+
/*
* Compress the source directly into the output buffer.
*/
- while (dp < dend)
+ while (src_ptr < compress_src_end)
{
/*
* If we already exceeded the maximum result size, fail.
@@ -615,7 +682,7 @@ pglz_compress(const char *source, int32 slen, char *dest,
* bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
* allow 4 slop bytes.
*/
- if (bp - bstart >= result_max)
+ if (dest_ptr - dest_start >= result_max)
return -1;
/*
@@ -624,27 +691,36 @@ pglz_compress(const char *source, int32 slen, char *dest,
* reasonably quickly when looking at incompressible input (such as
* pre-compressed data).
*/
- if (!found_match && bp - bstart >= strategy->first_success_by)
+ if (!found_match && dest_ptr - dest_start >= strategy->first_success_by)
return -1;
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
+ }
+
/*
* Try to find a match in the history
*/
- if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop, mask))
+ if (pglz_find_match(hist_idx, src_ptr, compress_src_end, &match_len,
+ &match_offset, good_match, good_drop))
{
/*
* Create the tag and add history entries for all matched
* characters.
*/
- pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ control_byte |= control_pos;
+ dest_ptr = pglz_out_tag(dest_ptr, match_len, match_offset);
while (match_len--)
{
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ src_ptr++;
}
found_match = true;
}
@@ -653,21 +729,47 @@ pglz_compress(const char *source, int32 slen, char *dest,
/*
* No match found. Copy one literal byte.
*/
- pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ }
+ control_pos <<= 1;
+ }
+
+
+ while (src_ptr < src_end)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (dest_ptr - dest_start >= result_max)
+ return -1;
+
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
}
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ control_pos <<= 1;
}
/*
* Write out the last control byte and check that we haven't overrun the
* output size allowed by the strategy.
*/
- *ctrlp = ctrlb;
- result_size = bp - bstart;
+ *control_ptr = control_byte;
+ result_size = dest_ptr - dest_start;
if (result_size >= result_max)
return -1;
--
2.24.3 (Apple Git-128)
@cfbot: rebased
Attachments:
0001-Reorganize-pglz-compression-code.patchtext/x-diff; charset=us-asciiDownload
From 03fec5d2587cf34a1d1a75d7afdcfbad9cb7ec68 Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Thu, 27 Jun 2019 23:18:21 +0500
Subject: [PATCH] Reorganize pglz compression code
This patch accumulates several changes:
1. Convert macro-functions to regular functions for readability
2. Use more compact hash table with uint16 indexes instead of pointers
3. Avoid prev pointer in hash table
4. Use 4-byte comparisons in a search instead of 1-byte
Total speedup about x1.4
---
src/common/pg_lzcompress.c | 475 ++++++++++++++++++++++---------------
1 file changed, 288 insertions(+), 187 deletions(-)
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index fdd527f757..ffa4fe549e 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -118,13 +118,13 @@
* our 4K maximum look-back distance is too small.
*
* The compressor creates a table for lists of positions.
- * For each input position (except the last 3), a hash key is
+ * For each input position (except the last 4), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
* lists of likely to be at least in the first 4 characters
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
- * kept for the last 4096 input positions, since we cannot use
+ * kept for the last 4094 input positions, since we cannot use
* back-pointers larger than that anyway. The size of the hash
* table is chosen based on the size of the input - a larger table
* has a larger startup cost, as it needs to be initialized to
@@ -146,17 +146,15 @@
* used for the next tag in the output.
*
* For each subsequent entry in the history list, the "good_match"
- * is lowered by 10%. So the compressor will be more happy with
- * short matches the farer it has to go back in the history.
+ * is lowered roughly by 10%. So the compressor will be more happy
+ * with short matches the further it has to go back in the history.
* Another "speed against ratio" preference characteristic of
* the algorithm.
*
- * Thus there are 3 stop conditions for the lookup of matches:
+ * Thus there are 2 stop conditions for the lookup of matches:
*
* - a match >= good_match is found
* - there are no more history entries to look at
- * - the next history entry is already too far back
- * to be coded into a tag.
*
* Finally the match algorithm checks that at least a match
* of 3 or more bytes has been found, because that is the smallest
@@ -173,6 +171,10 @@
* Jan Wieck
*
* Copyright (c) 1999-2021, PostgreSQL Global Development Group
+ * Many thanks to Yann Collet for his article about unaligned
+ * memory access.
+ *
+ * Leskov Vladimir
*
* src/common/pg_lzcompress.c
* ----------
@@ -188,12 +190,57 @@
#include "common/pg_lzcompress.h"
+/**************************************
+ * CPU Feature Detection *
+ **************************************/
+/* PGLZ_FORCE_MEMORY_ACCESS
+ * By default, access to unaligned memory is controlled by `memcpy()`, which is safe and portable.
+ * Unfortunately, on some target/compiler combinations, the generated assembly is sub-optimal.
+ * The switch below allows selecting a different access method for improved performance.
+ * Method 0 (default) : use `memcpy()`. Safe and portable.
+ * Method 1 : direct access. This method is portable but violates the C standard.
+ * It can generate buggy code on targets whose assembly generation depends on alignment.
+ * But in some circumstances, it's the only known way to get the most performance (i.e. GCC + ARMv6)
+ * See https://fastcompression.blogspot.fr/2015/08/accessing-unaligned-memory.html for details.
+ * Prefer these methods in priority order (0 > 1)
+ */
+#ifndef PGLZ_FORCE_MEMORY_ACCESS /* can be defined externally */
+#if defined(__GNUC__) && \
+ ( defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || defined(__ARM_ARCH_6K__) \
+ || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__) )
+#define PGLZ_FORCE_MEMORY_ACCESS 1
+#endif
+#endif
+
+#if defined(PGLZ_FORCE_MEMORY_ACCESS) && (PGLZ_FORCE_MEMORY_ACCESS==1)
+/* lie to the compiler about data alignment; use with caution */
+
+static uint32
+pglz_read32(const void *ptr)
+{
+ return *(const uint32 *) ptr;
+}
+
+#else /* safe and portable access using memcpy() */
+
+static uint32
+pglz_read32(const void *ptr)
+{
+ uint32 val;
+
+ memcpy(&val, ptr, sizeof(val));
+ return val;
+}
+
+#endif /* PGLZ_FORCE_MEMORY_ACCESS */
+
+
/* ----------
* Local definitions
* ----------
*/
#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_SIZE 4096
+#define PGLZ_HISTORY_SIZE 0x0fff - 1 /* to avoid compare in iteration */
#define PGLZ_MAX_MATCH 273
@@ -202,17 +249,16 @@
*
* Linked list for the backward history lookup
*
- * All the entries sharing a hash key are linked in a doubly linked list.
- * This makes it easy to remove an entry when it's time to recycle it
- * (because it's more than 4K positions old).
+ * All the entries sharing a hash key are linked in a singly linked list.
+ * Links are not changed during insertion in order to speed it up.
+ * Instead a more complicated stop condition is used during list iteration.
* ----------
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
- const char *pos; /* my input position */
+ int16 next_id; /* links for my hash key's list */
+ uint16 hist_idx; /* my current hash key */
+ const unsigned char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -256,11 +302,9 @@ static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
/*
- * Element 0 in hist_entries is unused, and means 'invalid'. Likewise,
- * INVALID_ENTRY_PTR in next/prev pointers mean 'invalid'.
+ * Element 0 in hist_entries is unused, and means 'invalid'.
*/
#define INVALID_ENTRY 0
-#define INVALID_ENTRY_PTR (&hist_entries[INVALID_ENTRY])
/* ----------
* pglz_hist_idx -
@@ -274,117 +318,111 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e, _mask) ( \
- ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
- ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
- )
+static inline uint16
+pglz_hist_idx(const unsigned char *s, uint16 mask)
+{
+ /*
+ * Note that we only calculate this function at the beginning of the compressed data.
+ * Further hash values are computed by subtracting the previous byte and adding the next one.
+ * For details see pglz_hist_add().
+ */
+ return ((s[0] << 6) ^ (s[1] << 4) ^ (s[2] << 2) ^ s[3]) & mask;
+}
/* ----------
* pglz_hist_add -
*
- * Adds a new entry to the history table.
- *
- * If _recycle is true, then we are recycling a previously used entry,
- * and must first delink it from its old hashcode's linked list.
- *
- * NOTE: beware of multiple evaluations of macro's arguments, and note that
- * _hn and _recycle are modified in the macro.
+ * Adds a new entry to the history table
+ * and recalculates hash value.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \
-do { \
- int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \
- int16 *__myhsp = &(_hs)[__hindex]; \
- PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
- if (_recycle) { \
- if (__myhe->prev == NULL) \
- (_hs)[__myhe->hindex] = __myhe->next - (_he); \
- else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
- } \
- __myhe->next = &(_he)[*__myhsp]; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
- __myhe->pos = (_s); \
- /* If there was an existing entry in this hash slot, link */ \
- /* this new entry to it. However, the 0th entry in the */ \
- /* entries table is unused, so we can freely scribble on it. */ \
- /* So don't bother checking if the slot was used - we'll */ \
- /* scribble on the unused entry if it was not, but that's */ \
- /* harmless. Avoiding the branch in this critical path */ \
- /* speeds this up a little bit. */ \
- /* if (*__myhsp != INVALID_ENTRY) */ \
- (_he)[(*__myhsp)].prev = __myhe; \
- *__myhsp = _hn; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
- (_hn) = 1; \
- (_recycle) = true; \
- } \
-} while (0)
+static inline int16
+pglz_hist_add(int16 hist_next, uint16 *hist_idx, const unsigned char *s, uint16 mask)
+{
+ int16 *my_hist_start = &hist_start[*hist_idx];
+ PGLZ_HistEntry *entry = &(hist_entries)[hist_next];
+ /*
+ * Initialize entry with a new value.
+ */
+ entry->next_id = *my_hist_start;
+ entry->hist_idx = *hist_idx;
+ entry->pos = s;
-/* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
-#define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
-do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
-} while (0)
+ /*
+ * Update linked list head pointer.
+ */
+ *my_hist_start = hist_next;
+
+ /*
+ * Recalculate hash value for next position. Remove current byte and add next.
+ */
+ *hist_idx = ((((*hist_idx) ^ (s[0] << 6)) << 2) ^ s[4]) & mask;
+
+ /*
+ * Shift history pointer.
+ */
+ hist_next++;
+ if (hist_next == PGLZ_HISTORY_SIZE + 1)
+ {
+ hist_next = 1;
+ }
+ return hist_next;
+}
/* ----------
- * pglz_out_literal -
+ * pglz_out_tag -
*
- * Outputs a literal byte to the destination buffer including the
+ * Outputs a backward reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination buffer including the
* appropriate control bit.
* ----------
*/
-#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
-} while (0)
+static inline unsigned char *
+pglz_out_tag(unsigned char *dest_ptr, int32 match_len, int32 match_offset)
+{
+ if (match_len > 17)
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | 0x0f);
+ *(dest_ptr++) = (unsigned char) (((match_offset) & 0xff));
+ *(dest_ptr++) = (unsigned char) ((match_len) - 18);
+ }
+ else
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | ((match_len) - 3));
+ *(dest_ptr++) = (unsigned char) ((match_offset) & 0xff);
+ }
+ return dest_ptr;
+}
/* ----------
- * pglz_out_tag -
+ * pglz_compare -
*
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
+ * Compares two sequences of bytes
+ * and returns position of first mismatch.
* ----------
*/
-#define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
-} while (0)
+static inline int32
+pglz_compare(int32 len, int32 len_bound, const unsigned char *input_pos,
+ const unsigned char *hist_pos)
+{
+ while (len <= len_bound - 4 && pglz_read32(input_pos) == pglz_read32(hist_pos))
+ {
+ len += 4;
+ input_pos += 4;
+ hist_pos += 4;
+ }
+ while (len < len_bound && *input_pos == *hist_pos)
+ {
+ len++;
+ input_pos++;
+ hist_pos++;
+ }
+ return len;
+}
/* ----------
@@ -396,32 +434,40 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(int16 *hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop, int mask)
+pglz_find_match(uint16 hist_idx, const unsigned char *input, const unsigned char *end,
+ int *lenp, int *offp, int good_match, int good_drop)
{
- PGLZ_HistEntry *hent;
- int16 hentno;
+ PGLZ_HistEntry *hist_entry;
+ int16 *hist_entry_number;
int32 len = 0;
- int32 off = 0;
+ int32 offset = 0;
+ int32 cur_len = 0;
+ int32 len_bound = Min(end - input, PGLZ_MAX_MATCH);
/*
* Traverse the linked history list until a good enough match is found.
*/
- hentno = hstart[pglz_hist_idx(input, end, mask)];
- hent = &hist_entries[hentno];
- while (hent != INVALID_ENTRY_PTR)
- {
- const char *ip = input;
- const char *hp = hent->pos;
- int32 thisoff;
- int32 thislen;
+ hist_entry_number = &hist_start[hist_idx];
+ if (*hist_entry_number == INVALID_ENTRY)
+ return 0;
+ hist_entry = &hist_entries[*hist_entry_number];
+ if (hist_idx != hist_entry->hist_idx)
+ {
/*
- * Stop if the offset does not fit into our tag anymore.
+ * If current linked list head points to invalid entry then clear it
+ * to reduce the number of comparisons in future.
*/
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ *hist_entry_number = INVALID_ENTRY;
+ return 0;
+ }
+
+ while (true)
+ {
+ const unsigned char *input_pos = input;
+ const unsigned char *hist_pos = hist_entry->pos;
+ const unsigned char *my_pos;
+ int32 cur_offset = input_pos - hist_pos;
/*
* Determine length of match. A better match must be larger than the
@@ -431,58 +477,60 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
* character by character comparison to know the exact position where
* the diff occurred.
*/
- thislen = 0;
if (len >= 16)
{
- if (memcmp(ip, hp, len) == 0)
+ if (memcmp(input_pos, hist_pos, len) == 0)
{
- thislen = len;
- ip += len;
- hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
- {
- thislen++;
- ip++;
- hp++;
- }
+ offset = cur_offset;
+ len = pglz_compare(len, len_bound, input_pos + len, hist_pos + len);
}
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ /*
+ * Start search for short matches by comparing 4 bytes
+ */
+ if (pglz_read32(input_pos) == pglz_read32(hist_pos))
{
- thislen++;
- ip++;
- hp++;
+ cur_len = pglz_compare(4, len_bound, input_pos + 4, hist_pos + 4);
+ if (cur_len > len)
+ {
+ len = cur_len;
+ offset = cur_offset;
+ }
}
}
- /*
- * Remember this match as the best (if it is)
- */
- if (thislen > len)
- {
- len = thislen;
- off = thisoff;
- }
-
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ my_pos = hist_entry->pos;
+ hist_entry = &hist_entries[hist_entry->next_id];
/*
- * Be happy with lesser good matches the more entries we visited. But
- * no point in doing calculation if we're at end of list.
+ * If the current match length is good enough, stop the iteration. Since
+ * outdated list links are not updated during insertion, an additional
+ * stop condition is needed to avoid following them: if a recycled
+ * entry has a different hash, the iteration stops; if it has the same
+ * hash, the recycled cell would break the monotonicity of input stream
+ * positions, which is checked afterwards.
*/
- if (hent != INVALID_ENTRY_PTR)
+ if (len >= good_match || hist_idx != hist_entry->hist_idx || my_pos <= hist_entry->pos)
{
- if (len >= good_match)
- break;
- good_match -= (good_match * good_drop) / 100;
+ break;
}
+
+ /*
+ * Be happy with less good matches the more entries we visited.
+ */
+ good_match -= (good_match * good_drop) >> 7;
}
+ /*
+ * The found match can be slightly longer than necessary; bound it by len_bound.
+ */
+ len = Min(len, len_bound);
+
/*
* Return match information only if it results at least in one byte
* reduction.
@@ -490,7 +538,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
if (len > 2)
{
*lenp = len;
- *offp = off;
+ *offp = offset;
return 1;
}
@@ -509,26 +557,28 @@ int32
pglz_compress(const char *source, int32 slen, char *dest,
const PGLZ_Strategy *strategy)
{
- unsigned char *bp = (unsigned char *) dest;
- unsigned char *bstart = bp;
- int hist_next = 1;
- bool hist_recycle = false;
- const char *dp = source;
- const char *dend = source + slen;
- unsigned char ctrl_dummy = 0;
- unsigned char *ctrlp = &ctrl_dummy;
- unsigned char ctrlb = 0;
- unsigned char ctrl = 0;
+ unsigned char *dest_ptr = (unsigned char *) dest;
+ unsigned char *dest_start = dest_ptr;
+ uint16 hist_next = 1;
+ uint16 hist_idx;
+ const unsigned char *src_ptr = (const unsigned char *) source;
+ const unsigned char *src_end = (const unsigned char *) source + slen;
+ const unsigned char *compress_src_end = src_end - 4;
+ unsigned char control_dummy = 0;
+ unsigned char *control_ptr = &control_dummy;
+ unsigned char control_byte = 0;
+ unsigned char control_pos = 0;
bool found_match = false;
int32 match_len;
- int32 match_off;
+ int32 match_offset;
int32 good_match;
int32 good_drop;
int32 result_size;
int32 result_max;
int32 need_rate;
int hashsz;
- int mask;
+ uint16 mask;
+
/*
* Our fallback strategy is the default.
@@ -560,6 +610,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
else if (good_drop > 100)
good_drop = 100;
+ /* We use <<7 later to calculate actual drop, so align percents to 128 */
+ good_drop = good_drop * 128 / 100;
+
need_rate = strategy->min_comp_rate;
if (need_rate < 0)
need_rate = 0;
@@ -577,7 +630,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
result_max = (slen / 100) * (100 - need_rate);
}
else
+ {
result_max = (slen * (100 - need_rate)) / 100;
+ }
/*
* Experiments suggest that these hash sizes work pretty well. A large
@@ -603,10 +658,21 @@ pglz_compress(const char *source, int32 slen, char *dest,
*/
memset(hist_start, 0, hashsz * sizeof(int16));
+ /*
+ * Initialize INVALID_ENTRY for stopping during lookup.
+ */
+ hist_entries[INVALID_ENTRY].pos = src_end;
+ hist_entries[INVALID_ENTRY].hist_idx = hashsz;
+
+ /*
+ * Calculate initial hash value.
+ */
+ hist_idx = pglz_hist_idx(src_ptr, mask);
+
/*
* Compress the source directly into the output buffer.
*/
- while (dp < dend)
+ while (src_ptr < compress_src_end)
{
/*
* If we already exceeded the maximum result size, fail.
@@ -615,7 +681,7 @@ pglz_compress(const char *source, int32 slen, char *dest,
* bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
* allow 4 slop bytes.
*/
- if (bp - bstart >= result_max)
+ if (dest_ptr - dest_start >= result_max)
return -1;
/*
@@ -624,27 +690,36 @@ pglz_compress(const char *source, int32 slen, char *dest,
* reasonably quickly when looking at incompressible input (such as
* pre-compressed data).
*/
- if (!found_match && bp - bstart >= strategy->first_success_by)
+ if (!found_match && dest_ptr - dest_start >= strategy->first_success_by)
return -1;
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
+ }
+
/*
* Try to find a match in the history
*/
- if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop, mask))
+ if (pglz_find_match(hist_idx, src_ptr, compress_src_end, &match_len,
+ &match_offset, good_match, good_drop))
{
/*
* Create the tag and add history entries for all matched
* characters.
*/
- pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ control_byte |= control_pos;
+ dest_ptr = pglz_out_tag(dest_ptr, match_len, match_offset);
while (match_len--)
{
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ src_ptr++;
}
found_match = true;
}
@@ -653,21 +728,47 @@ pglz_compress(const char *source, int32 slen, char *dest,
/*
* No match found. Copy one literal byte.
*/
- pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ }
+ control_pos <<= 1;
+ }
+
+
+ while (src_ptr < src_end)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (dest_ptr - dest_start >= result_max)
+ return -1;
+
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
}
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ control_pos <<= 1;
}
/*
* Write out the last control byte and check that we haven't overrun the
* output size allowed by the strategy.
*/
- *ctrlp = ctrlb;
- result_size = bp - bstart;
+ *control_ptr = control_byte;
+ result_size = dest_ptr - dest_start;
if (result_size >= result_max)
return -1;
--
2.17.0
On Jan 22, 2021, at 07:48, Justin Pryzby <pryzby@telsasoft.com> wrote:
@cfbot: rebased
<0001-Reorganize-pglz-compression-code.patch>
Thanks!
I'm experimenting with TPC-C over PostgreSQL 13 on a production-like cluster in the cloud. Overall performance is IO-bound, but compression is burning a lot of energy too (according to perf top). The cluster consists of 3 nodes (HA only, no standby queries) with 32 vCPUs each, 128GB RAM, sync replication, 2000 warehouses, and 240GB PGDATA.
Samples: 1M of event 'cpu-clock', 4000 Hz, Event count (approx.): 177958545079
Overhead Shared Object Symbol
18.36% postgres [.] pglz_compress
3.88% [kernel] [k] _raw_spin_unlock_irqrestore
3.39% postgres [.] hash_search_with_hash_value
3.00% [kernel] [k] finish_task_switch
2.03% [kernel] [k] copy_user_enhanced_fast_string
1.14% [kernel] [k] filemap_map_pages
1.02% postgres [.] AllocSetAlloc
0.93% postgres [.] _bt_compare
0.89% postgres [.] PinBuffer
0.82% postgres [.] SearchCatCache1
0.79% postgres [.] LWLockAttemptLock
0.78% postgres [.] GetSnapshotData
Overall, the cluster runs 862 tps (52KtpmC, though only 26KtpmC is qualified on 2K warehouses).
Thanks!
Best regards, Andrey Borodin.
On Jan 21, 2021, at 6:48 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
@cfbot: rebased
<0001-Reorganize-pglz-compression-code.patch>
Review comments.
First, I installed a build from master without this patch, created a test installation with lots of compressed text and array columns, upgraded the binaries to a build with this patch included, and tried to find problems with the data left over from the pre-patch binaries. Everything checks out. This is on little-endian macOS with an Intel Core i9, not on any of the ARM platforms that you are targeting with portions of the patch.
+/**************************************
+ * CPU Feature Detection *
+ **************************************/
+/* PGLZ_FORCE_MEMORY_ACCESS
+ * By default, access to unaligned memory is controlled by `memcpy()`, which is safe and portable.
+ * Unfortunately, on some target/compiler combinations, the generated assembly is sub-optimal.
+ * The switch below allows selecting a different access method for improved performance.
+ * Method 0 (default) : use `memcpy()`. Safe and portable.
+ * Method 1 : direct access. This method is portable but violates the C standard.
+ * It can generate buggy code on targets whose assembly generation depends on alignment.
+ * But in some circumstances, it's the only known way to get the most performance (i.e. GCC + ARMv6)
+ * See https://fastcompression.blogspot.fr/2015/08/accessing-unaligned-memory.html for details.
+ * Prefer these methods in priority order (0 > 1)
+ */
The link to blogspot.fr has a lot more information than your summary in the code comments. It might be hard to understand this comment if the blogspot article were ever to disappear. Perhaps you could include a bit more of the relevant details?
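For instance - and I'm going from memory of the article here rather than quoting it, so please double-check before lifting anything - it also discusses a packed-struct variant, which declares the data as potentially unaligned and lets the compiler pick the load sequence itself instead of lying about alignment outright. A rough sketch of that flavor, just to illustrate what the article covers (the names are mine, the syntax is GCC/Clang-specific, and none of this is in the patch):
/* Hypothetical third access method: read through a packed struct */
typedef struct
{
	uint32		u32;		/* same uint32 the rest of the file uses */
} __attribute__((packed)) pglz_packed32;
static inline uint32
pglz_read32_packed(const void *ptr)
{
	/* the packed attribute makes the compiler emit an unaligned-safe load */
	return ((const pglz_packed32 *) ptr)->u32;
}
Even a couple of sentences of that sort in the comment would leave the reader something to go on if the URL ever goes stale.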
+#ifndef PGLZ_FORCE_MEMORY_ACCESS /* can be defined externally */
+#if defined(__GNUC__) && \
+ ( defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || defined(__ARM_ARCH_6K__) \
+ || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__) )
+#define PGLZ_FORCE_MEMORY_ACCESS 1
+#endif
+#endif
I can understand wanting to set this on gcc + ARMv6, but doesn't this belong in a configure script rather than directly in the compression code?
The blogspot article indicates that the author lied about alignment to the compiler when using gcc on ARMv6, thereby generating a fast load instruction which happens to work on ARMv6. You appear to be using that same approach. Your #if defined(__GNUC__) test seems to assume that all future versions of gcc will generate the instructions that you want, and not start generating some other set of instructions. Wouldn't you at least need a configure test to verify that the version of gcc being used generates the desired assembly? Even then, you'd be banking on gcc doing the same thing for the test code and for the pglz code, which I guess might not be true. Have you considered using inline assembly instead?
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Jan 28, 2021, at 2:56 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On Jan 22, 2021, at 07:48, Justin Pryzby <pryzby@telsasoft.com> wrote:
@cfbot: rebased
<0001-Reorganize-pglz-compression-code.patch>
Thanks!
I'm experimenting with TPC-C over PostgreSQL 13 on a production-like cluster in the cloud. Overall performance is IO-bound, but compression is burning a lot of energy too (according to perf top). The cluster consists of 3 nodes (HA only, no standby queries) with 32 vCPUs each, 128GB RAM, sync replication, 2000 warehouses, and 240GB PGDATA.
Samples: 1M of event 'cpu-clock', 4000 Hz, Event count (approx.): 177958545079
Overhead Shared Object Symbol
18.36% postgres [.] pglz_compress
3.88% [kernel] [k] _raw_spin_unlock_irqrestore
3.39% postgres [.] hash_search_with_hash_value
3.00% [kernel] [k] finish_task_switch
2.03% [kernel] [k] copy_user_enhanced_fast_string
1.14% [kernel] [k] filemap_map_pages
1.02% postgres [.] AllocSetAlloc
0.93% postgres [.] _bt_compare
0.89% postgres [.] PinBuffer
0.82% postgres [.] SearchCatCache1
0.79% postgres [.] LWLockAttemptLock
0.78% postgres [.] GetSnapshotData
Overall, the cluster runs 862 tps (52KtpmC, though only 26KtpmC is qualified on 2K warehouses).
Thanks!
Robert Haas just committed Dilip Kumar's LZ4 compression, bbe0a81db69bd10bd166907c3701492a29aca294.
Is this pglz compression patch still relevant? How does the LZ4 compression compare on your hardware?
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 19, 2021 at 01:29:14PM -0700, Mark Dilger wrote:
Robert Haas just committed Dilip Kumar's LZ4 compression, bbe0a81db69bd10bd166907c3701492a29aca294.
Is this pglz compression patch still relevant? How does the LZ4 compression compare on your hardware?
I think it's still relevant, since many people may not end up with binaries
--with-lz4 (I'm thinking of cloud providers). PGLZ is what existing data uses,
and people may not want to/know to migrate to shiny new features, but they'd
like it if their queries were 20% faster after upgrading without needing to.
Also, Dilip's patch is only for TOAST compression, and pglz is also being used
for wal_compression - Andrey has a short patch to implement lz4 for that:
https://commitfest.postgresql.org/32/3015/
--
Justin
On Sat, Mar 20, 2021 at 12:19:45AM -0500, Justin Pryzby wrote:
I think it's still relevant, since many people may not end up with binaries
--with-lz4 (I'm thinking of cloud providers). PGLZ is what existing data uses,
and people may not want to/know to migrate to shiny new features, but they'd
like it if their queries were 20% faster after upgrading without needing to.
Yeah, I agree that local improvements here are relevant, particularly
as we don't enforce the rewrite of toast data already compressed with
pglz. So, we still need to stick with pglz for some time.
--
Michael
On Mar 20, 2021, at 00:35, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
On Jan 21, 2021, at 6:48 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
@cfbot: rebased
<0001-Reorganize-pglz-compression-code.patch>
Review comments.
Thanks for the review, Mark!
And sorry for such a long delay; I've been trying to figure out a way to make things less platform-dependent.
Here's what I've come up with.
We don't use pglz_read32() the way xxhash and lz4 do - we do not really need to read a 4-byte value, we only need to compare 4 bytes at once.
So, essentially, we need to compare two implementations of the 4-byte comparison:
bool
cmp_a(const void *ptr1, const void *ptr2)
{
return *(const uint32_t *) ptr1 == *(const uint32_t *) ptr2;
}
bool
cmp_b(const void *ptr1, const void *ptr2)
{
return memcmp(ptr1, ptr2, 4) == 0;
}
Variant B is more portable. Inspecting it in Godbolt's compiler explorer, I found that for GCC 7.1+ it generates assembly without a memcmp() call. For x86-64 and ARM64 the assembly of cmp_b is identical to that of cmp_a.
So I think maybe we could just stick with variant cmp_b instead of optimising for ARMv6 and similar architectures such as those found on Arduino-class devices.
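In case somebody wants to reproduce the comparison outside of Postgres, below is a rough throwaway harness (illustrative only, not part of the patch): compile it with -O2 and diff the assembly of the two helpers, or time the loops for a very crude idea of the cost on a given machine. Numbers will obviously vary by CPU and compiler.
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
/* direct-access variant (cmp_a above); not strictly conforming C for unaligned pointers */
static bool
cmp_a(const void *ptr1, const void *ptr2)
{
	return *(const uint32_t *) ptr1 == *(const uint32_t *) ptr2;
}
/* memcmp variant (cmp_b above); recent GCC expands this to a plain 4-byte load */
static bool
cmp_b(const void *ptr1, const void *ptr2)
{
	return memcmp(ptr1, ptr2, 4) == 0;
}
static double
bench(bool (*cmp) (const void *, const void *), const unsigned char *buf, size_t n)
{
	clock_t		start = clock();
	unsigned long hits = 0;
	double		ms;
	size_t		i;
	for (i = 0; i + 8 <= n; i++)
		hits += cmp(buf + i, buf + i + 4);
	ms = 1000.0 * (clock() - start) / CLOCKS_PER_SEC;
	printf("hits=%lu, ", hits);	/* keeps the loop from being optimized away */
	return ms;
}
int
main(void)
{
	static unsigned char buf[1 << 20];
	size_t		i;
	/* period-4 pattern, so every 4-byte window matches the one 4 bytes ahead */
	for (i = 0; i < sizeof(buf); i++)
		buf[i] = (unsigned char) (i & 0x3);
	printf("cmp_a: %.2f ms\n", bench(cmp_a, buf, sizeof(buf)));
	printf("cmp_b: %.2f ms\n", bench(cmp_b, buf, sizeof(buf)));
	return 0;
}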
I've benchmarked the patch with "REINDEX table pgbench_accounts" on a database created with pgbench -i at scale 100. wal_compression was on, other settings were default.
Without the patch it takes ~11055.077 ms on my machine; with the patch it takes ~9512.411 ms, a 14% speedup overall.
PFA v5.
Thanks!
Best regards, Andrey Borodin.
Attachments:
v5-0001-Reorganize-pglz-compression-code.patchapplication/octet-stream; name=v5-0001-Reorganize-pglz-compression-code.patch; x-unix-mode=0644Download
From 1e1979b3d062ec932c0856a229621f28a8831201 Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Thu, 27 Jun 2019 23:18:21 +0500
Subject: [PATCH v5] Reorganize pglz compression code
This patch accumulates several changes:
1. Convert macro-functions to regular functions for readability
2. Use more compact hash table with uint16 indexes instead of pointers
3. Avoid prev pointer in hash table
4. Use 4-byte comparisons in a search instead of 1-byte
Total speedup about x1.4
---
src/common/pg_lzcompress.c | 436 +++++++++++++++++++++----------------
1 file changed, 249 insertions(+), 187 deletions(-)
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index a30a2c2eb8..9322ba77b1 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -118,13 +118,13 @@
* our 4K maximum look-back distance is too small.
*
* The compressor creates a table for lists of positions.
- * For each input position (except the last 3), a hash key is
+ * For each input position (except the last 4), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
* lists of likely to be at least in the first 4 characters
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
- * kept for the last 4096 input positions, since we cannot use
+ * kept for the last 4094 input positions, since we cannot use
* back-pointers larger than that anyway. The size of the hash
* table is chosen based on the size of the input - a larger table
* has a larger startup cost, as it needs to be initialized to
@@ -146,17 +146,15 @@
* used for the next tag in the output.
*
* For each subsequent entry in the history list, the "good_match"
- * is lowered by 10%. So the compressor will be more happy with
- * short matches the further it has to go back in the history.
+ * is lowered roughly by 10%. So the compressor will be more happy
+ * with short matches the further it has to go back in the history.
* Another "speed against ratio" preference characteristic of
* the algorithm.
*
- * Thus there are 3 stop conditions for the lookup of matches:
+ * Thus there are 2 stop conditions for the lookup of matches:
*
* - a match >= good_match is found
* - there are no more history entries to look at
- * - the next history entry is already too far back
- * to be coded into a tag.
*
* Finally the match algorithm checks that at least a match
* of 3 or more bytes has been found, because that is the smallest
@@ -187,13 +185,12 @@
#include "common/pg_lzcompress.h"
-
/* ----------
* Local definitions
* ----------
*/
#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_SIZE 4096
+#define PGLZ_HISTORY_SIZE 0x0fff - 1 /* to avoid compare in iteration */
#define PGLZ_MAX_MATCH 273
@@ -202,17 +199,16 @@
*
* Linked list for the backward history lookup
*
- * All the entries sharing a hash key are linked in a doubly linked list.
- * This makes it easy to remove an entry when it's time to recycle it
- * (because it's more than 4K positions old).
+ * All the entries sharing a hash key are linked in a singly linked list.
+ * Links are not changed during insertion in order to speed it up.
+ * Instead a more complicated stop condition is used during list iteration.
* ----------
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
- const char *pos; /* my input position */
+ int16 next_id; /* links for my hash key's list */
+ uint16 hist_idx; /* my current hash key */
+ const unsigned char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -256,11 +252,9 @@ static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
/*
- * Element 0 in hist_entries is unused, and means 'invalid'. Likewise,
- * INVALID_ENTRY_PTR in next/prev pointers mean 'invalid'.
+ * Element 0 in hist_entries is unused, and means 'invalid'.
*/
#define INVALID_ENTRY 0
-#define INVALID_ENTRY_PTR (&hist_entries[INVALID_ENTRY])
/* ----------
* pglz_hist_idx -
@@ -274,117 +268,122 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e, _mask) ( \
- ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
- ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
- )
+static inline uint16
+pglz_hist_idx(const unsigned char *s, uint16 mask)
+{
+ /*
+ * Note that we only calculate this function at the beginning of the compressed data.
+ * Further hash values are computed by subtracting the previous byte and adding the next one.
+ * For details see pglz_hist_add().
+ */
+ return ((s[0] << 6) ^ (s[1] << 4) ^ (s[2] << 2) ^ s[3]) & mask;
+}
/* ----------
* pglz_hist_add -
*
- * Adds a new entry to the history table.
- *
- * If _recycle is true, then we are recycling a previously used entry,
- * and must first delink it from its old hashcode's linked list.
- *
- * NOTE: beware of multiple evaluations of macro's arguments, and note that
- * _hn and _recycle are modified in the macro.
+ * Adds a new entry to the history table
+ * and recalculates hash value.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \
-do { \
- int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \
- int16 *__myhsp = &(_hs)[__hindex]; \
- PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
- if (_recycle) { \
- if (__myhe->prev == NULL) \
- (_hs)[__myhe->hindex] = __myhe->next - (_he); \
- else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
- } \
- __myhe->next = &(_he)[*__myhsp]; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
- __myhe->pos = (_s); \
- /* If there was an existing entry in this hash slot, link */ \
- /* this new entry to it. However, the 0th entry in the */ \
- /* entries table is unused, so we can freely scribble on it. */ \
- /* So don't bother checking if the slot was used - we'll */ \
- /* scribble on the unused entry if it was not, but that's */ \
- /* harmless. Avoiding the branch in this critical path */ \
- /* speeds this up a little bit. */ \
- /* if (*__myhsp != INVALID_ENTRY) */ \
- (_he)[(*__myhsp)].prev = __myhe; \
- *__myhsp = _hn; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
- (_hn) = 1; \
- (_recycle) = true; \
- } \
-} while (0)
+static inline int16
+pglz_hist_add(int16 hist_next, uint16 *hist_idx, const unsigned char *s, uint16 mask)
+{
+ int16 *my_hist_start = &hist_start[*hist_idx];
+ PGLZ_HistEntry *entry = &(hist_entries)[hist_next];
+
+ /*
+ * Initialize entry with a new value.
+ */
+ entry->next_id = *my_hist_start;
+ entry->hist_idx = *hist_idx;
+ entry->pos = s;
+
+ /*
+ * Update linked list head pointer.
+ */
+ *my_hist_start = hist_next;
+
+ /*
+ * Recalculate hash value for next position. Remove current byte and add next.
+ */
+ *hist_idx = ((((*hist_idx) ^ (s[0] << 6)) << 2) ^ s[4]) & mask;
+
+ /*
+ * Shift history pointer.
+ */
+ hist_next++;
+ if (hist_next == PGLZ_HISTORY_SIZE + 1)
+ {
+ hist_next = 1;
+ }
+ return hist_next;
+}
/* ----------
- * pglz_out_ctrl -
+ * pglz_out_tag -
*
- * Outputs the last and allocates a new control byte if needed.
+ * Outputs a backward reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination buffer including the
+ * appropriate control bit.
* ----------
*/
-#define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
-do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
-} while (0)
-
+static inline unsigned char *
+pglz_out_tag(unsigned char *dest_ptr, int32 match_len, int32 match_offset)
+{
+ if (match_len > 17)
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | 0x0f);
+ *(dest_ptr++) = (unsigned char) (((match_offset) & 0xff));
+ *(dest_ptr++) = (unsigned char) ((match_len) - 18);
+ }
+ else
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | ((match_len) - 3));
+ *(dest_ptr++) = (unsigned char) ((match_offset) & 0xff);
+ }
+ return dest_ptr;
+}
/* ----------
- * pglz_out_literal -
+ * pglz_compare32 -
*
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
+ * Compares 4 bytes at the two given pointers
* ----------
*/
-#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
-} while (0)
-
+static inline bool
+pglz_compare32(const void *ptr1, const void *ptr2)
+{
+ return memcmp(ptr1, ptr2, 4) == 0;
+}
/* ----------
- * pglz_out_tag -
+ * pglz_compare -
*
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
+ * Compares two sequences of bytes
+ * and returns position of first mismatch.
* ----------
*/
-#define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
-} while (0)
+static inline int32
+pglz_compare(int32 len, int32 len_bound, const unsigned char *input_pos,
+ const unsigned char *hist_pos)
+{
+ while (len <= len_bound - 4 && pglz_compare32(input_pos, hist_pos))
+ {
+ len += 4;
+ input_pos += 4;
+ hist_pos += 4;
+ }
+ while (len < len_bound && *input_pos == *hist_pos)
+ {
+ len++;
+ input_pos++;
+ hist_pos++;
+ }
+ return len;
+}
/* ----------
@@ -396,32 +395,40 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(int16 *hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop, int mask)
+pglz_find_match(uint16 hist_idx, const unsigned char *input, const unsigned char *end,
+ int *lenp, int *offp, int good_match, int good_drop)
{
- PGLZ_HistEntry *hent;
- int16 hentno;
+ PGLZ_HistEntry *hist_entry;
+ int16 *hist_entry_number;
int32 len = 0;
- int32 off = 0;
+ int32 offset = 0;
+ int32 cur_len = 0;
+ int32 len_bound = Min(end - input, PGLZ_MAX_MATCH);
/*
* Traverse the linked history list until a good enough match is found.
*/
- hentno = hstart[pglz_hist_idx(input, end, mask)];
- hent = &hist_entries[hentno];
- while (hent != INVALID_ENTRY_PTR)
- {
- const char *ip = input;
- const char *hp = hent->pos;
- int32 thisoff;
- int32 thislen;
+ hist_entry_number = &hist_start[hist_idx];
+ if (*hist_entry_number == INVALID_ENTRY)
+ return 0;
+ hist_entry = &hist_entries[*hist_entry_number];
+ if (hist_idx != hist_entry->hist_idx)
+ {
/*
- * Stop if the offset does not fit into our tag anymore.
+ * If current linked list head points to invalid entry then clear it
+ * to reduce the number of comparisons in future.
*/
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ *hist_entry_number = INVALID_ENTRY;
+ return 0;
+ }
+
+ while (true)
+ {
+ const unsigned char *input_pos = input;
+ const unsigned char *hist_pos = hist_entry->pos;
+ const unsigned char *my_pos;
+ int32 cur_offset = input_pos - hist_pos;
/*
* Determine length of match. A better match must be larger than the
@@ -431,58 +438,60 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
* character by character comparison to know the exact position where
* the diff occurred.
*/
- thislen = 0;
if (len >= 16)
{
- if (memcmp(ip, hp, len) == 0)
+ if (memcmp(input_pos, hist_pos, len) == 0)
{
- thislen = len;
- ip += len;
- hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
- {
- thislen++;
- ip++;
- hp++;
- }
+ offset = cur_offset;
+ len = pglz_compare(len, len_bound, input_pos + len, hist_pos + len);
}
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ /*
+ * Start search for short matches by comparing 4 bytes
+ */
+ if (pglz_compare32(input_pos, hist_pos))
{
- thislen++;
- ip++;
- hp++;
+ cur_len = pglz_compare(4, len_bound, input_pos + 4, hist_pos + 4);
+ if (cur_len > len)
+ {
+ len = cur_len;
+ offset = cur_offset;
+ }
}
}
- /*
- * Remember this match as the best (if it is)
- */
- if (thislen > len)
- {
- len = thislen;
- off = thisoff;
- }
-
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ my_pos = hist_entry->pos;
+ hist_entry = &hist_entries[hist_entry->next_id];
/*
- * Be happy with lesser good matches the more entries we visited. But
- * no point in doing calculation if we're at end of list.
+ * If the current match length is good enough, stop the iteration. Since
+ * outdated list links are not updated during insertion, an additional
+ * stop condition is needed to avoid following them: if a recycled
+ * entry has a different hash, the iteration stops; if it has the same
+ * hash, the recycled cell would break the monotonicity of input stream
+ * positions, which is checked afterwards.
*/
- if (hent != INVALID_ENTRY_PTR)
+ if (len >= good_match || hist_idx != hist_entry->hist_idx || my_pos <= hist_entry->pos)
{
- if (len >= good_match)
- break;
- good_match -= (good_match * good_drop) / 100;
+ break;
}
+
+ /*
+ * Be happy with less good matches the more entries we visited.
+ */
+ good_match -= (good_match * good_drop) >> 7;
}
+ /*
+ * The found match can be slightly longer than necessary; bound it by len_bound.
+ */
+ len = Min(len, len_bound);
+
/*
* Return match information only if it results at least in one byte
* reduction.
@@ -490,7 +499,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
if (len > 2)
{
*lenp = len;
- *offp = off;
+ *offp = offset;
return 1;
}
@@ -509,26 +518,28 @@ int32
pglz_compress(const char *source, int32 slen, char *dest,
const PGLZ_Strategy *strategy)
{
- unsigned char *bp = (unsigned char *) dest;
- unsigned char *bstart = bp;
- int hist_next = 1;
- bool hist_recycle = false;
- const char *dp = source;
- const char *dend = source + slen;
- unsigned char ctrl_dummy = 0;
- unsigned char *ctrlp = &ctrl_dummy;
- unsigned char ctrlb = 0;
- unsigned char ctrl = 0;
+ unsigned char *dest_ptr = (unsigned char *) dest;
+ unsigned char *dest_start = dest_ptr;
+ uint16 hist_next = 1;
+ uint16 hist_idx;
+ const unsigned char *src_ptr = (const unsigned char *) source;
+ const unsigned char *src_end = (const unsigned char *) source + slen;
+ const unsigned char *compress_src_end = src_end - 4;
+ unsigned char control_dummy = 0;
+ unsigned char *control_ptr = &control_dummy;
+ unsigned char control_byte = 0;
+ unsigned char control_pos = 0;
bool found_match = false;
int32 match_len;
- int32 match_off;
+ int32 match_offset;
int32 good_match;
int32 good_drop;
int32 result_size;
int32 result_max;
int32 need_rate;
int hashsz;
- int mask;
+ uint16 mask;
+
/*
* Our fallback strategy is the default.
@@ -560,6 +571,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
else if (good_drop > 100)
good_drop = 100;
+ /* We use <<7 later to calculate actual drop, so align percents to 128 */
+ good_drop = good_drop * 128 / 100;
+
need_rate = strategy->min_comp_rate;
if (need_rate < 0)
need_rate = 0;
@@ -577,7 +591,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
result_max = (slen / 100) * (100 - need_rate);
}
else
+ {
result_max = (slen * (100 - need_rate)) / 100;
+ }
/*
* Experiments suggest that these hash sizes work pretty well. A large
@@ -603,10 +619,21 @@ pglz_compress(const char *source, int32 slen, char *dest,
*/
memset(hist_start, 0, hashsz * sizeof(int16));
+ /*
+ * Initialize INVALID_ENTRY for stopping during lookup.
+ */
+ hist_entries[INVALID_ENTRY].pos = src_end;
+ hist_entries[INVALID_ENTRY].hist_idx = hashsz;
+
+ /*
+ * Calculate initial hash value.
+ */
+ hist_idx = pglz_hist_idx(src_ptr, mask);
+
/*
* Compress the source directly into the output buffer.
*/
- while (dp < dend)
+ while (src_ptr < compress_src_end)
{
/*
* If we already exceeded the maximum result size, fail.
@@ -615,7 +642,7 @@ pglz_compress(const char *source, int32 slen, char *dest,
* bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
* allow 4 slop bytes.
*/
- if (bp - bstart >= result_max)
+ if (dest_ptr - dest_start >= result_max)
return -1;
/*
@@ -624,27 +651,36 @@ pglz_compress(const char *source, int32 slen, char *dest,
* reasonably quickly when looking at incompressible input (such as
* pre-compressed data).
*/
- if (!found_match && bp - bstart >= strategy->first_success_by)
+ if (!found_match && dest_ptr - dest_start >= strategy->first_success_by)
return -1;
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
+ }
+
/*
* Try to find a match in the history
*/
- if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop, mask))
+ if (pglz_find_match(hist_idx, src_ptr, compress_src_end, &match_len,
+ &match_offset, good_match, good_drop))
{
/*
* Create the tag and add history entries for all matched
* characters.
*/
- pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ control_byte |= control_pos;
+ dest_ptr = pglz_out_tag(dest_ptr, match_len, match_offset);
while (match_len--)
{
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ src_ptr++;
}
found_match = true;
}
@@ -653,21 +689,47 @@ pglz_compress(const char *source, int32 slen, char *dest,
/*
* No match found. Copy one literal byte.
*/
- pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ }
+ control_pos <<= 1;
+ }
+
+
+ while (src_ptr < src_end)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (dest_ptr - dest_start >= result_max)
+ return -1;
+
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
}
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ control_pos <<= 1;
}
/*
* Write out the last control byte and check that we haven't overrun the
* output size allowed by the strategy.
*/
- *ctrlp = ctrlb;
- result_size = bp - bstart;
+ *control_ptr = control_byte;
+ result_size = dest_ptr - dest_start;
if (result_size >= result_max)
return -1;
--
2.17.1
On Jun 27, 2021, at 3:41 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
And here's what I've come up with.
I have not tested the patch yet, but here are some quick review comments:
#define PGLZ_HISTORY_SIZE 0x0fff - 1 /* to avoid compare in iteration */
...
static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
...
if (hist_next == PGLZ_HISTORY_SIZE + 1)
These are the only uses of PGLZ_HISTORY_SIZE. Perhaps you could just define the symbol as 0x0fff and skip the -1 and +1 business?
/* ----------
* pglz_compare -
*
* Compares 4 bytes at pointers
* ----------
*/
static inline bool
pglz_compare32(const void *ptr1, const void *ptr2)
{
return memcmp(ptr1, ptr2, 4) == 0;
}
The comment function name differs from the actual function name.
Also, pglz_compare returns an offset into the string, whereas pglz_compare32 returns a boolean. This is fairly unintuitive. The "32" part of pglz_compare32 implies doing the same thing as pglz_compare but where the string is known to be 4 bytes in length. Given that pglz_compare32 is dissimilar to pglz_compare, perhaps avoid using /pglz_compare/ in its name?
/*
* Determine length of match. A better match must be larger than the
* best so far. And if we already have a match of 16 or more bytes,
* it's worth the call overhead to use memcmp()
This comment is hard to understand, given the code that follows. The first block calls memcmp(), which seems to be the function overhead you refer to. The second block calls the static inline function pglz_compare32, which internally calls memcmp(). Superficially, there seems to be a memcmp() function call either way. The difference is that in the first block's call to memcmp(), the length is a runtime value, and in the second block, it is a compile-time known value. If you are depending on the compiler to notice this distinction and optimize the second call, perhaps you can mention that explicitly? Otherwise, reading and understanding the comment takes more effort.
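For concreteness, a fixed-length memcmp(a, b, 4) == 0 is usually not a real call at all; the compiler lowers it to two 4-byte loads and an integer compare, roughly like this sketch (illustrative only, not code from the patch):

#include <string.h>
#include <stdint.h>
#include <stdbool.h>

/* roughly what the compiler emits for memcmp(a, b, 4) == 0 */
static inline bool
eq4(const unsigned char *a, const unsigned char *b)
{
	uint32_t	x;
	uint32_t	y;

	memcpy(&x, a, 4);		/* fixed-size memcpy becomes a plain 4-byte load */
	memcpy(&y, b, 4);
	return x == y;
}

A memcmp(ip, hp, len) with a runtime len, on the other hand, generally remains a genuine library call.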
I took a quick look for other places in the code that try to beat the performance of memcmp on short strings. In varlena.c, rest_of_char_same() seems to do so. We also use comparisons on NameData, which frequently contains strings shorter than 16 bytes. Is it worth sharting a static inline function that uses your optimization in other places? How confident are you that your optimization really helps?
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Jun 28, 2021, at 9:05 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
Is it worth sharting a static inline function that uses your optimization in other places?
s/sharting/sharing/
How confident are you that your optimization really helps?
By which I mean, is the optimization worth the extra branch checking if (len >= 16)? Is the 14% speedup you report dependent on this extra complexity?
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
I've looked at this patch again and did some testing. I don't have any
comments on the code (I see there are two comments from Mark after the
last version, though).
For the testing, I did a fairly simple benchmark loading either random
or compressible data into a bytea column. The tables are defined as
unlogged, the values are 1kB, 4kB and 1MB, and the total amount of data
is always 1GB. The timings are
  test            master    patched    delta
  -------------------------------------------
  random_1k        12295      12312     100%
  random_1m        12999      12984     100%
  random_4k        16881      15959      95%
  redundant_1k     12308      12348     100%
  redundant_1m     16632      14072      85%
  redundant_4k     16798      13828      82%
I ran the test on multiple x86_64 machines, and the behavior is almost
exactly the same on all of them.
This shows there's no difference for 1kB (expected, because this does
not exceed the ~2kB TOAST threshold). For random data the difference is
pretty negligible in general, although it's a bit strange that the 4kB
values take longer than the 1MB ones.
For redundant (highly compressible) values, there's a quite significant
speedup of 15-18%. Real-world data is likely somewhere in between, but
the speedup is still pretty nice.
Andrey, can you update the patch per Mark's review? I'll do my best to
get it committed sometime in this CF.
Attached are the two scripts used for generating / testing (you'll have
to fix some hardcoded paths, but they're simple otherwise).
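If it helps, the compression step itself can also be timed in isolation with a small standalone harness along these lines (just an illustrative sketch, not one of the attached scripts; the buffer size and fill pattern are arbitrary, and it needs to be linked against libpgcommon):

#include "postgres_fe.h"
#include "common/pg_lzcompress.h"

#include <time.h>

int
main(void)
{
	int32		slen = 1024 * 1024;		/* one 1MB value */
	char	   *src = malloc(slen);
	char	   *dst = malloc(slen + 8);	/* pglz needs a few slop bytes */
	clock_t		start;
	int32		clen;

	/* highly redundant input, similar in spirit to the "redundant" tests */
	for (int i = 0; i < slen; i++)
		src[i] = "abcabcab"[i % 8];

	start = clock();
	clen = pglz_compress(src, slen, dst, PGLZ_strategy_default);

	printf("compressed %d -> %d bytes in %.3f ms\n",
		   slen, clen, 1000.0 * (clock() - start) / CLOCKS_PER_SEC);
	return 0;
}

(With incompressible random input pglz_compress gives up early and returns -1, so this mainly exercises the compressible case.)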
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thanks for the review, Mark! Sorry it took so long to reply on my side.
On 28 June 2021, at 21:05, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
#define PGLZ_HISTORY_SIZE 0x0fff - 1 /* to avoid compare in iteration */
...
static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
...
if (hist_next == PGLZ_HISTORY_SIZE + 1)
These are the only uses of PGLZ_HISTORY_SIZE. Perhaps you could just defined the symbol as 0x0fff and skip the -1 and +1 business?
Fixed.
/* ----------
* pglz_compare -
*
* Compares 4 bytes at pointers
* ----------
*/
static inline bool
pglz_compare32(const void *ptr1, const void *ptr2)
{
return memcmp(ptr1, ptr2, 4) == 0;
}
The comment function name differs from the actual function name.
Also, pglz_compare returns an offset into the string, whereas pglz_compare32 returns a boolean. This is fairly unintuitive. The "32" part of pglz_compare32 implies doing the same thing as pglz_compare but where the string is known to be 4 bytes in length. Given that pglz_compare32 is dissimilar to pglz_compare, perhaps avoid using /pglz_compare/ in its name?
I've removed pglz_compare32 entirely. It was a simple substitution for memcmp().
/*
* Determine length of match. A better match must be larger than the
* best so far. And if we already have a match of 16 or more bytes,
* it's worth the call overhead to use memcmp()
This comment is hard to understand, given the code that follows. The first block calls memcmp(), which seems to be the function overhead you refer to. The second block calls the static inline function pglz_compare32, which internally calls memcmp(). Superficially, there seems to be a memcmp() function call either way. The difference is that in the first block's call to memcmp(), the length is a runtime value, and in the second block, it is a compile-time known value. If you are depending on the compiler to notice this distinction and optimize the second call, perhaps you can mention that explicitly? Otherwise, reading and understanding the comment takes more effort.
I've updated the comment for the second branch with the fixed-size memcmp(). Frankly, I'm not sure "if (memcmp(input_pos, hist_pos, 4) == 0)" is worth the complexity; the internals of "pglz_compare(0, len_bound, input_pos + 0, hist_pos + 0);" would execute almost the same instructions.
I took a quick look for other places in the code that try to beat the performance of memcmp on short strings. In varlena.c, rest_of_char_same() seems to do so. We also use comparisons on NameData, which frequently contains strings shorter than 16 bytes. Is it worth sharting a static inline function that uses your optimization in other places? How confident are you that your optimization really helps?
Honestly, I do not know. The overall patch effect comes from stacking up many small optimizations. They have a net effect, but each one is too noisy to measure independently. That's mostly why it took me so long to figure out what to reply.
On 5 November 2021, at 01:47, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Andrey, can you update the patch per Mark's review? I'll do my best to get it committed sometime in this CF.
Cool! Here's the patch.
Best regards, Andrey Borodin.
Attachments:
v6-0001-Reorganize-pglz-compression-code.patch (application/octet-stream)
From 63bee75540d8b362ac30d592df0ee686c92a7c38 Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Thu, 27 Jun 2019 23:18:21 +0500
Subject: [PATCH v6] Reorganize pglz compression code
This patch accumulates several changes:
1. Convert macro-functions to regular functions for readability
2. Use more compact hash table with uint16 indexes instead of pointers
3. Avoid prev pointer in hash table
4. Use 4-byte comparisons in a search instead of 1-byte
Total speedup about x1.4
---
src/common/pg_lzcompress.c | 432 +++++++++++++++++++++----------------
1 file changed, 242 insertions(+), 190 deletions(-)
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index 72e6a7ea61..88b65e049a 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -118,13 +118,13 @@
* our 4K maximum look-back distance is too small.
*
* The compressor creates a table for lists of positions.
- * For each input position (except the last 3), a hash key is
+ * For each input position (except the last 4), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
* lists of likely to be at least in the first 4 characters
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
- * kept for the last 4096 input positions, since we cannot use
+ * kept for the last 4094 input positions, since we cannot use
* back-pointers larger than that anyway. The size of the hash
* table is chosen based on the size of the input - a larger table
* has a larger startup cost, as it needs to be initialized to
@@ -146,17 +146,15 @@
* used for the next tag in the output.
*
* For each subsequent entry in the history list, the "good_match"
- * is lowered by 10%. So the compressor will be more happy with
- * short matches the further it has to go back in the history.
+ * is lowered roughly by 10%. So the compressor will be more happy
+ * with short matches the further it has to go back in the history.
* Another "speed against ratio" preference characteristic of
* the algorithm.
*
- * Thus there are 3 stop conditions for the lookup of matches:
+ * Thus there are 2 stop conditions for the lookup of matches:
*
* - a match >= good_match is found
* - there are no more history entries to look at
- * - the next history entry is already too far back
- * to be coded into a tag.
*
* Finally the match algorithm checks that at least a match
* of 3 or more bytes has been found, because that is the smallest
@@ -187,13 +185,12 @@
#include "common/pg_lzcompress.h"
-
/* ----------
* Local definitions
* ----------
*/
#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_SIZE 4096
+#define PGLZ_HISTORY_SIZE 0x0fff /* to avoid compare in iteration */
#define PGLZ_MAX_MATCH 273
@@ -202,17 +199,16 @@
*
* Linked list for the backward history lookup
*
- * All the entries sharing a hash key are linked in a doubly linked list.
- * This makes it easy to remove an entry when it's time to recycle it
- * (because it's more than 4K positions old).
+ * All the entries sharing a hash key are linked in a singly linked list.
+ * Links are not changed during insertion in order to speed it up.
+ * Instead more complicated stop condition is used during list iteration.
* ----------
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
- const char *pos; /* my input position */
+ int16 next_id; /* links for my hash key's list */
+ uint16 hist_idx; /* my current hash key */
+ const unsigned char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -253,14 +249,12 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* ----------
*/
static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
/*
- * Element 0 in hist_entries is unused, and means 'invalid'. Likewise,
- * INVALID_ENTRY_PTR in next/prev pointers mean 'invalid'.
+ * Element 0 in hist_entries is unused, and means 'invalid'.
*/
#define INVALID_ENTRY 0
-#define INVALID_ENTRY_PTR (&hist_entries[INVALID_ENTRY])
/* ----------
* pglz_hist_idx -
@@ -274,90 +268,58 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e, _mask) ( \
- ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
- ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
- )
+static inline uint16
+pglz_hist_idx(const unsigned char *s, uint16 mask)
+{
+ /*
+ * Note that we only calculate this function at the beginning of compressed data.
+ * Further hash values are computed by subtracting the previous byte and adding the next.
+ * For details see pglz_hist_add().
+ */
+ return ((s[0] << 6) ^ (s[1] << 4) ^ (s[2] << 2) ^ s[3]) & mask;
+}
/* ----------
* pglz_hist_add -
*
- * Adds a new entry to the history table.
- *
- * If _recycle is true, then we are recycling a previously used entry,
- * and must first delink it from its old hashcode's linked list.
- *
- * NOTE: beware of multiple evaluations of macro's arguments, and note that
- * _hn and _recycle are modified in the macro.
+ * Adds a new entry to the history table
+ * and recalculates hash value.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \
-do { \
- int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \
- int16 *__myhsp = &(_hs)[__hindex]; \
- PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
- if (_recycle) { \
- if (__myhe->prev == NULL) \
- (_hs)[__myhe->hindex] = __myhe->next - (_he); \
- else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
- } \
- __myhe->next = &(_he)[*__myhsp]; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
- __myhe->pos = (_s); \
- /* If there was an existing entry in this hash slot, link */ \
- /* this new entry to it. However, the 0th entry in the */ \
- /* entries table is unused, so we can freely scribble on it. */ \
- /* So don't bother checking if the slot was used - we'll */ \
- /* scribble on the unused entry if it was not, but that's */ \
- /* harmless. Avoiding the branch in this critical path */ \
- /* speeds this up a little bit. */ \
- /* if (*__myhsp != INVALID_ENTRY) */ \
- (_he)[(*__myhsp)].prev = __myhe; \
- *__myhsp = _hn; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
- (_hn) = 1; \
- (_recycle) = true; \
- } \
-} while (0)
+static inline int16
+pglz_hist_add(int16 hist_next, uint16 *hist_idx, const unsigned char *s, uint16 mask)
+{
+ int16 *my_hist_start = &hist_start[*hist_idx];
+ PGLZ_HistEntry *entry = &(hist_entries)[hist_next];
+ /*
+ * Initialize entry with a new value.
+ */
+ entry->next_id = *my_hist_start;
+ entry->hist_idx = *hist_idx;
+ entry->pos = s;
-/* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
-#define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
-do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
-} while (0)
+ /*
+ * Update linked list head pointer.
+ */
+ *my_hist_start = hist_next;
+ /*
+ * Recalculate hash value for next position. Remove current byte and add next.
+ */
+ *hist_idx = ((((*hist_idx) ^ (s[0] << 6)) << 2) ^ s[4]) & mask;
-/* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
-#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
-} while (0)
+ /*
+ * Shift history pointer.
+ */
+ hist_next++;
+ if (hist_next == PGLZ_HISTORY_SIZE)
+ {
+ hist_next = 1;
+ }
+ return hist_next;
+}
/* ----------
@@ -368,23 +330,48 @@ do { \
* appropriate control bit.
* ----------
*/
-#define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
-} while (0)
+static inline unsigned char *
+pglz_out_tag(unsigned char *dest_ptr, int32 match_len, int32 match_offset)
+{
+ if (match_len > 17)
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | 0x0f);
+ *(dest_ptr++) = (unsigned char) (((match_offset) & 0xff));
+ *(dest_ptr++) = (unsigned char) ((match_len) - 18);
+ }
+ else
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | ((match_len) - 3));
+ *(dest_ptr++) = (unsigned char) ((match_offset) & 0xff);
+ }
+ return dest_ptr;
+}
+
+/* ----------
+ * pglz_compare -
+ *
+ * Compares two sequences of bytes
+ * and returns position of first mismatch.
+ * ----------
+ */
+static inline int32
+pglz_compare(int32 len, int32 len_bound, const unsigned char *input_pos,
+ const unsigned char *hist_pos)
+{
+ while (len <= len_bound - 4 && memcmp(input_pos, hist_pos, 4) == 0)
+ {
+ len += 4;
+ input_pos += 4;
+ hist_pos += 4;
+ }
+ while (len < len_bound && *input_pos == *hist_pos)
+ {
+ len++;
+ input_pos++;
+ hist_pos++;
+ }
+ return len;
+}
/* ----------
@@ -396,32 +383,40 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(int16 *hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop, int mask)
+pglz_find_match(uint16 hist_idx, const unsigned char *input, const unsigned char *end,
+ int *lenp, int *offp, int good_match, int good_drop)
{
- PGLZ_HistEntry *hent;
- int16 hentno;
+ PGLZ_HistEntry *hist_entry;
+ int16 *hist_entry_number;
int32 len = 0;
- int32 off = 0;
+ int32 offset = 0;
+ int32 cur_len = 0;
+ int32 len_bound = Min(end - input, PGLZ_MAX_MATCH);
/*
* Traverse the linked history list until a good enough match is found.
*/
- hentno = hstart[pglz_hist_idx(input, end, mask)];
- hent = &hist_entries[hentno];
- while (hent != INVALID_ENTRY_PTR)
- {
- const char *ip = input;
- const char *hp = hent->pos;
- int32 thisoff;
- int32 thislen;
+ hist_entry_number = &hist_start[hist_idx];
+ if (*hist_entry_number == INVALID_ENTRY)
+ return 0;
+ hist_entry = &hist_entries[*hist_entry_number];
+ if (hist_idx != hist_entry->hist_idx)
+ {
/*
- * Stop if the offset does not fit into our tag anymore.
+ * If current linked list head points to invalid entry then clear it
+ * to reduce the number of comparisons in future.
*/
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ *hist_entry_number = INVALID_ENTRY;
+ return 0;
+ }
+
+ while (true)
+ {
+ const unsigned char *input_pos = input;
+ const unsigned char *hist_pos = hist_entry->pos;
+ const unsigned char *my_pos;
+ int32 cur_offset = input_pos - hist_pos;
/*
* Determine length of match. A better match must be larger than the
@@ -431,58 +426,62 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
* character by character comparison to know the exact position where
* the diff occurred.
*/
- thislen = 0;
if (len >= 16)
{
- if (memcmp(ip, hp, len) == 0)
+ if (memcmp(input_pos, hist_pos, len) == 0)
{
- thislen = len;
- ip += len;
- hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
- {
- thislen++;
- ip++;
- hp++;
- }
+ offset = cur_offset;
+ len = pglz_compare(len, len_bound, input_pos + len, hist_pos + len);
}
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ /*
+ * Start search for short matches by comparing 4 bytes.
+ * We expect that compiler will substitute memcmp() with fixed
+ * length by optimal 4-bytes comparison it knows.
+ */
+ if (memcmp(input_pos, hist_pos, 4) == 0)
{
- thislen++;
- ip++;
- hp++;
+ cur_len = pglz_compare(4, len_bound, input_pos + 4, hist_pos + 4);
+ if (cur_len > len)
+ {
+ len = cur_len;
+ offset = cur_offset;
+ }
}
}
- /*
- * Remember this match as the best (if it is)
- */
- if (thislen > len)
- {
- len = thislen;
- off = thisoff;
- }
-
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ my_pos = hist_entry->pos;
+ hist_entry = &hist_entries[hist_entry->next_id];
/*
- * Be happy with lesser good matches the more entries we visited. But
- * no point in doing calculation if we're at end of list.
+ * If current match length is ok then stop iteration. As outdated list
+ * links are not updated during insertion process then additional stop
+ * condition should be introduced to avoid following them. If recycled
+ * entry has another hash, then iteration stops. If it has the same
+ * hash then recycled cell would break input stream position
+ * monotonicity, which is checked after.
*/
- if (hent != INVALID_ENTRY_PTR)
+ if (len >= good_match || hist_idx != hist_entry->hist_idx || my_pos <= hist_entry->pos)
{
- if (len >= good_match)
- break;
- good_match -= (good_match * good_drop) / 100;
+ break;
}
+
+ /*
+ * Be happy with less good matches the more entries we visited.
+ */
+ good_match -= (good_match * good_drop) >> 7;
}
+ /*
+ * found match can be slightly more than necessary, bound it with len_bound
+ */
+ len = Min(len, len_bound);
+
/*
* Return match information only if it results at least in one byte
* reduction.
@@ -490,7 +489,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
if (len > 2)
{
*lenp = len;
- *offp = off;
+ *offp = offset;
return 1;
}
@@ -509,26 +508,28 @@ int32
pglz_compress(const char *source, int32 slen, char *dest,
const PGLZ_Strategy *strategy)
{
- unsigned char *bp = (unsigned char *) dest;
- unsigned char *bstart = bp;
- int hist_next = 1;
- bool hist_recycle = false;
- const char *dp = source;
- const char *dend = source + slen;
- unsigned char ctrl_dummy = 0;
- unsigned char *ctrlp = &ctrl_dummy;
- unsigned char ctrlb = 0;
- unsigned char ctrl = 0;
+ unsigned char *dest_ptr = (unsigned char *) dest;
+ unsigned char *dest_start = dest_ptr;
+ uint16 hist_next = 1;
+ uint16 hist_idx;
+ const unsigned char *src_ptr = (const unsigned char *) source;
+ const unsigned char *src_end = (const unsigned char *) source + slen;
+ const unsigned char *compress_src_end = src_end - 4;
+ unsigned char control_dummy = 0;
+ unsigned char *control_ptr = &control_dummy;
+ unsigned char control_byte = 0;
+ unsigned char control_pos = 0;
bool found_match = false;
int32 match_len;
- int32 match_off;
+ int32 match_offset;
int32 good_match;
int32 good_drop;
int32 result_size;
int32 result_max;
int32 need_rate;
int hashsz;
- int mask;
+ uint16 mask;
+
/*
* Our fallback strategy is the default.
@@ -560,6 +561,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
else if (good_drop > 100)
good_drop = 100;
+ /* We use <<7 later to calculate actual drop, so align percents to 128 */
+ good_drop = good_drop * 128 / 100;
+
need_rate = strategy->min_comp_rate;
if (need_rate < 0)
need_rate = 0;
@@ -577,7 +581,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
result_max = (slen / 100) * (100 - need_rate);
}
else
+ {
result_max = (slen * (100 - need_rate)) / 100;
+ }
/*
* Experiments suggest that these hash sizes work pretty well. A large
@@ -603,10 +609,21 @@ pglz_compress(const char *source, int32 slen, char *dest,
*/
memset(hist_start, 0, hashsz * sizeof(int16));
+ /*
+ * Initialize INVALID_ENTRY for stopping during lookup.
+ */
+ hist_entries[INVALID_ENTRY].pos = src_end;
+ hist_entries[INVALID_ENTRY].hist_idx = hashsz;
+
+ /*
+ * Calculate initial hash value.
+ */
+ hist_idx = pglz_hist_idx(src_ptr, mask);
+
/*
* Compress the source directly into the output buffer.
*/
- while (dp < dend)
+ while (src_ptr < compress_src_end)
{
/*
* If we already exceeded the maximum result size, fail.
@@ -615,7 +632,7 @@ pglz_compress(const char *source, int32 slen, char *dest,
* bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
* allow 4 slop bytes.
*/
- if (bp - bstart >= result_max)
+ if (dest_ptr - dest_start >= result_max)
return -1;
/*
@@ -624,27 +641,36 @@ pglz_compress(const char *source, int32 slen, char *dest,
* reasonably quickly when looking at incompressible input (such as
* pre-compressed data).
*/
- if (!found_match && bp - bstart >= strategy->first_success_by)
+ if (!found_match && dest_ptr - dest_start >= strategy->first_success_by)
return -1;
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
+ }
+
/*
* Try to find a match in the history
*/
- if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop, mask))
+ if (pglz_find_match(hist_idx, src_ptr, compress_src_end, &match_len,
+ &match_offset, good_match, good_drop))
{
/*
* Create the tag and add history entries for all matched
* characters.
*/
- pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ control_byte |= control_pos;
+ dest_ptr = pglz_out_tag(dest_ptr, match_len, match_offset);
while (match_len--)
{
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ src_ptr++;
}
found_match = true;
}
@@ -653,21 +679,47 @@ pglz_compress(const char *source, int32 slen, char *dest,
/*
* No match found. Copy one literal byte.
*/
- pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ }
+ control_pos <<= 1;
+ }
+
+
+ while (src_ptr < src_end)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (dest_ptr - dest_start >= result_max)
+ return -1;
+
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
}
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ control_pos <<= 1;
}
/*
* Write out the last control byte and check that we haven't overrun the
* output size allowed by the strategy.
*/
- *ctrlp = ctrlb;
- result_size = bp - bstart;
+ *control_ptr = control_byte;
+ result_size = dest_ptr - dest_start;
if (result_size >= result_max)
return -1;
--
2.24.3 (Apple Git-128)
On Friday, 5 November 2021 at 14:51, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Thanks for the review Mark! Sorry it took too long to reply on my side.
On 28 June 2021, at 21:05, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
#define PGLZ_HISTORY_SIZE 0x0fff - 1 /* to avoid compare in iteration */
...
static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
...
if (hist_next == PGLZ_HISTORY_SIZE + 1)
These are the only uses of PGLZ_HISTORY_SIZE. Perhaps you could just defined the symbol as 0x0fff and skip the -1 and +1 business?
Fixed.
/* ----------
* pglz_compare -
*
* Compares 4 bytes at pointers
* ----------
*/
static inline bool
pglz_compare32(const void *ptr1, const void *ptr2)
{
return memcmp(ptr1, ptr2, 4) == 0;
}
The comment function name differs from the actual function name.
Also, pglz_compare returns an offset into the string, whereas pglz_compare32 returns a boolean. This is fairly unintuitive. The "32" part of pglz_compare32 implies doing the same thing as pglz_compare but where the string is known to be 4 bytes in length. Given that pglz_compare32 is dissimilar to pglz_compare, perhaps avoid using /pglz_compare/ in its name?
I've removed pglz_compare32 entirely. It was a simple substitution for memcmp().
/*
* Determine length of match. A better match must be larger than the
* best so far. And if we already have a match of 16 or more bytes,
* it's worth the call overhead to use memcmp()
This comment is hard to understand, given the code that follows. The first block calls memcmp(), which seems to be the function overhead you refer to. The second block calls the static inline function pglz_compare32, which internally calls memcmp(). Superficially, there seems to be a memcmp() function call either way. The difference is that in the first block's call to memcmp(), the length is a runtime value, and in the second block, it is a compile-time known value. If you are depending on the compiler to notice this distinction and optimize the second call, perhaps you can mention that explicitly? Otherwise, reading and understanding the comment takes more effort.
I've updated comment for second branch with fixed-size memcmp(). Frankly, I'm not sure "if (memcmp(input_pos, hist_pos, 4) == 0)" worth the complexity, internals of "pglz_compare(0, len_bound, input_pos + 0, hist_pos + 0);" would do almost same instructions.
I took a quick look for other places in the code that try to beat the performance of memcmp on short strings. In varlena.c, rest_of_char_same() seems to do so. We also use comparisons on NameData, which frequently contains strings shorter than 16 bytes. Is it worth sharting a static inline function that uses your optimization in other places? How confident are you that your optimization really helps?
Honestly, I do not know. The overall patch effect consists of stacking up many small optimizations. They have a net effect, but are too noisy to measure independently. That's mostly the reason why I didn't know what to reply for so long.
On 5 November 2021, at 01:47, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Andrey, can you update the patch per Mark's review? I'll do my best to get it committed sometime in this CF.
Cool! Here's the patch.
Hi!
This patch is marked as "Ready for Committer" in the current commitfest [1]https://commitfest.postgresql.org/40/2897/
but has seen no further activity for more than a year. Given that it's on its
10th commitfest, it would be useful to clarify its status one way or the other.
[1]: https://commitfest.postgresql.org/40/2897/
Thanks
Ian Barwick
Hi,
I took a look at the v6 patch, with the intention to get it committed. I
have a couple minor comments:
1) For PGLZ_HISTORY_SIZE it uses literal 0x0fff, with the explanation:
/* to avoid compare in iteration */
which I think means intent to use this value as a bit mask, but then the
only place using PGLZ_HISTORY_SIZE does
if (hist_next == PGLZ_HISTORY_SIZE) ...
i.e. a comparison. Maybe I misunderstand the comment, though.
2) PGLZ_HistEntry was modified and replaces links (pointers) with
indexes, but the comments still talk about "links", so maybe that needs
to be changed. Also, I wonder why next_id is int16 while hist_idx is
uint16 (and also why id vs. idx)?
3) minor formatting of comments
4) the comment in pglz_find_match about traversing the history seems too
early - it's before handling invalid entries and cleanup, but it does
not talk about that at all, and the actual while loop is after that.
Attached is v6 in 0001 (verbatim), with the review comments in 0002.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-Reorganize-pglz-compression-code-v6.patch (text/x-patch)
From 5aee164e5d6ec716e607751d179755a35969a333 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Sun, 27 Nov 2022 15:02:27 +0100
Subject: [PATCH 1/2] Reorganize pglz compression code
This patch accumulates several changes:
1. Convert macro-functions to regular functions for readability
2. Use more compact hash table with uint16 indexes instead of pointers
3. Avoid prev pointer in hash table
4. Use 4-byte comparisons in a search instead of 1-byte
Total speedup about x1.4
---
src/common/pg_lzcompress.c | 432 +++++++++++++++++++++----------------
1 file changed, 242 insertions(+), 190 deletions(-)
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index 9de5d5a0d43..248545a108e 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -118,13 +118,13 @@
* our 4K maximum look-back distance is too small.
*
* The compressor creates a table for lists of positions.
- * For each input position (except the last 3), a hash key is
+ * For each input position (except the last 4), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
* lists of likely to be at least in the first 4 characters
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
- * kept for the last 4096 input positions, since we cannot use
+ * kept for the last 4094 input positions, since we cannot use
* back-pointers larger than that anyway. The size of the hash
* table is chosen based on the size of the input - a larger table
* has a larger startup cost, as it needs to be initialized to
@@ -146,17 +146,15 @@
* used for the next tag in the output.
*
* For each subsequent entry in the history list, the "good_match"
- * is lowered by 10%. So the compressor will be more happy with
- * short matches the further it has to go back in the history.
+ * is lowered roughly by 10%. So the compressor will be more happy
+ * with short matches the further it has to go back in the history.
* Another "speed against ratio" preference characteristic of
* the algorithm.
*
- * Thus there are 3 stop conditions for the lookup of matches:
+ * Thus there are 2 stop conditions for the lookup of matches:
*
* - a match >= good_match is found
* - there are no more history entries to look at
- * - the next history entry is already too far back
- * to be coded into a tag.
*
* Finally the match algorithm checks that at least a match
* of 3 or more bytes has been found, because that is the smallest
@@ -187,13 +185,12 @@
#include "common/pg_lzcompress.h"
-
/* ----------
* Local definitions
* ----------
*/
#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_SIZE 4096
+#define PGLZ_HISTORY_SIZE 0x0fff /* to avoid compare in iteration */
#define PGLZ_MAX_MATCH 273
@@ -202,17 +199,16 @@
*
* Linked list for the backward history lookup
*
- * All the entries sharing a hash key are linked in a doubly linked list.
- * This makes it easy to remove an entry when it's time to recycle it
- * (because it's more than 4K positions old).
+ * All the entries sharing a hash key are linked in a singly linked list.
+ * Links are not changed during insertion in order to speed it up.
+ * Instead more complicated stop condition is used during list iteration.
* ----------
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
- const char *pos; /* my input position */
+ int16 next_id; /* links for my hash key's list */
+ uint16 hist_idx; /* my current hash key */
+ const unsigned char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -253,14 +249,12 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* ----------
*/
static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
/*
- * Element 0 in hist_entries is unused, and means 'invalid'. Likewise,
- * INVALID_ENTRY_PTR in next/prev pointers mean 'invalid'.
+ * Element 0 in hist_entries is unused, and means 'invalid'.
*/
#define INVALID_ENTRY 0
-#define INVALID_ENTRY_PTR (&hist_entries[INVALID_ENTRY])
/* ----------
* pglz_hist_idx -
@@ -274,90 +268,58 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e, _mask) ( \
- ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
- ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
- )
+static inline uint16
+pglz_hist_idx(const unsigned char *s, uint16 mask)
+{
+ /*
+ * Note that we only calculate this function at the beginning of compressed data.
+ * Further hash values are computed by subtracting the previous byte and adding the next.
+ * For details see pglz_hist_add().
+ */
+ return ((s[0] << 6) ^ (s[1] << 4) ^ (s[2] << 2) ^ s[3]) & mask;
+}
/* ----------
* pglz_hist_add -
*
- * Adds a new entry to the history table.
- *
- * If _recycle is true, then we are recycling a previously used entry,
- * and must first delink it from its old hashcode's linked list.
- *
- * NOTE: beware of multiple evaluations of macro's arguments, and note that
- * _hn and _recycle are modified in the macro.
+ * Adds a new entry to the history table
+ * and recalculates hash value.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \
-do { \
- int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \
- int16 *__myhsp = &(_hs)[__hindex]; \
- PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
- if (_recycle) { \
- if (__myhe->prev == NULL) \
- (_hs)[__myhe->hindex] = __myhe->next - (_he); \
- else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
- } \
- __myhe->next = &(_he)[*__myhsp]; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
- __myhe->pos = (_s); \
- /* If there was an existing entry in this hash slot, link */ \
- /* this new entry to it. However, the 0th entry in the */ \
- /* entries table is unused, so we can freely scribble on it. */ \
- /* So don't bother checking if the slot was used - we'll */ \
- /* scribble on the unused entry if it was not, but that's */ \
- /* harmless. Avoiding the branch in this critical path */ \
- /* speeds this up a little bit. */ \
- /* if (*__myhsp != INVALID_ENTRY) */ \
- (_he)[(*__myhsp)].prev = __myhe; \
- *__myhsp = _hn; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
- (_hn) = 1; \
- (_recycle) = true; \
- } \
-} while (0)
+static inline int16
+pglz_hist_add(int16 hist_next, uint16 *hist_idx, const unsigned char *s, uint16 mask)
+{
+ int16 *my_hist_start = &hist_start[*hist_idx];
+ PGLZ_HistEntry *entry = &(hist_entries)[hist_next];
+ /*
+ * Initialize entry with a new value.
+ */
+ entry->next_id = *my_hist_start;
+ entry->hist_idx = *hist_idx;
+ entry->pos = s;
-/* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
-#define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
-do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
-} while (0)
+ /*
+ * Update linked list head pointer.
+ */
+ *my_hist_start = hist_next;
+ /*
+ * Recalculate hash value for next position. Remove current byte and add next.
+ */
+ *hist_idx = ((((*hist_idx) ^ (s[0] << 6)) << 2) ^ s[4]) & mask;
-/* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
-#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
-} while (0)
+ /*
+ * Shift history pointer.
+ */
+ hist_next++;
+ if (hist_next == PGLZ_HISTORY_SIZE)
+ {
+ hist_next = 1;
+ }
+ return hist_next;
+}
/* ----------
@@ -368,23 +330,48 @@ do { \
* appropriate control bit.
* ----------
*/
-#define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
-} while (0)
+static inline unsigned char *
+pglz_out_tag(unsigned char *dest_ptr, int32 match_len, int32 match_offset)
+{
+ if (match_len > 17)
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | 0x0f);
+ *(dest_ptr++) = (unsigned char) (((match_offset) & 0xff));
+ *(dest_ptr++) = (unsigned char) ((match_len) - 18);
+ }
+ else
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | ((match_len) - 3));
+ *(dest_ptr++) = (unsigned char) ((match_offset) & 0xff);
+ }
+ return dest_ptr;
+}
+
+/* ----------
+ * pglz_compare -
+ *
+ * Compares two sequences of bytes
+ * and returns position of first mismatch.
+ * ----------
+ */
+static inline int32
+pglz_compare(int32 len, int32 len_bound, const unsigned char *input_pos,
+ const unsigned char *hist_pos)
+{
+ while (len <= len_bound - 4 && memcmp(input_pos, hist_pos, 4) == 0)
+ {
+ len += 4;
+ input_pos += 4;
+ hist_pos += 4;
+ }
+ while (len < len_bound && *input_pos == *hist_pos)
+ {
+ len++;
+ input_pos++;
+ hist_pos++;
+ }
+ return len;
+}
/* ----------
@@ -396,32 +383,40 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(int16 *hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop, int mask)
+pglz_find_match(uint16 hist_idx, const unsigned char *input, const unsigned char *end,
+ int *lenp, int *offp, int good_match, int good_drop)
{
- PGLZ_HistEntry *hent;
- int16 hentno;
+ PGLZ_HistEntry *hist_entry;
+ int16 *hist_entry_number;
int32 len = 0;
- int32 off = 0;
+ int32 offset = 0;
+ int32 cur_len = 0;
+ int32 len_bound = Min(end - input, PGLZ_MAX_MATCH);
/*
* Traverse the linked history list until a good enough match is found.
*/
- hentno = hstart[pglz_hist_idx(input, end, mask)];
- hent = &hist_entries[hentno];
- while (hent != INVALID_ENTRY_PTR)
- {
- const char *ip = input;
- const char *hp = hent->pos;
- int32 thisoff;
- int32 thislen;
+ hist_entry_number = &hist_start[hist_idx];
+ if (*hist_entry_number == INVALID_ENTRY)
+ return 0;
+ hist_entry = &hist_entries[*hist_entry_number];
+ if (hist_idx != hist_entry->hist_idx)
+ {
/*
- * Stop if the offset does not fit into our tag anymore.
+ * If current linked list head points to invalid entry then clear it
+ * to reduce the number of comparisons in future.
*/
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ *hist_entry_number = INVALID_ENTRY;
+ return 0;
+ }
+
+ while (true)
+ {
+ const unsigned char *input_pos = input;
+ const unsigned char *hist_pos = hist_entry->pos;
+ const unsigned char *my_pos;
+ int32 cur_offset = input_pos - hist_pos;
/*
* Determine length of match. A better match must be larger than the
@@ -431,58 +426,62 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
* character by character comparison to know the exact position where
* the diff occurred.
*/
- thislen = 0;
if (len >= 16)
{
- if (memcmp(ip, hp, len) == 0)
+ if (memcmp(input_pos, hist_pos, len) == 0)
{
- thislen = len;
- ip += len;
- hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
- {
- thislen++;
- ip++;
- hp++;
- }
+ offset = cur_offset;
+ len = pglz_compare(len, len_bound, input_pos + len, hist_pos + len);
}
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ /*
+ * Start search for short matches by comparing 4 bytes.
+ * We expect that compiler will substitute memcmp() with fixed
+ * length by optimal 4-bytes comparison it knows.
+ */
+ if (memcmp(input_pos, hist_pos, 4) == 0)
{
- thislen++;
- ip++;
- hp++;
+ cur_len = pglz_compare(4, len_bound, input_pos + 4, hist_pos + 4);
+ if (cur_len > len)
+ {
+ len = cur_len;
+ offset = cur_offset;
+ }
}
}
- /*
- * Remember this match as the best (if it is)
- */
- if (thislen > len)
- {
- len = thislen;
- off = thisoff;
- }
-
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ my_pos = hist_entry->pos;
+ hist_entry = &hist_entries[hist_entry->next_id];
/*
- * Be happy with lesser good matches the more entries we visited. But
- * no point in doing calculation if we're at end of list.
+ * If current match length is ok then stop iteration. As outdated list
+ * links are not updated during insertion process then additional stop
+ * condition should be introduced to avoid following them. If recycled
+ * entry has another hash, then iteration stops. If it has the same
+ * hash then recycled cell would break input stream position
+ * monotonicity, which is checked after.
*/
- if (hent != INVALID_ENTRY_PTR)
+ if (len >= good_match || hist_idx != hist_entry->hist_idx || my_pos <= hist_entry->pos)
{
- if (len >= good_match)
- break;
- good_match -= (good_match * good_drop) / 100;
+ break;
}
+
+ /*
+ * Be happy with less good matches the more entries we visited.
+ */
+ good_match -= (good_match * good_drop) >> 7;
}
+ /*
+ * found match can be slightly more than necessary, bound it with len_bound
+ */
+ len = Min(len, len_bound);
+
/*
* Return match information only if it results at least in one byte
* reduction.
@@ -490,7 +489,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
if (len > 2)
{
*lenp = len;
- *offp = off;
+ *offp = offset;
return 1;
}
@@ -509,26 +508,28 @@ int32
pglz_compress(const char *source, int32 slen, char *dest,
const PGLZ_Strategy *strategy)
{
- unsigned char *bp = (unsigned char *) dest;
- unsigned char *bstart = bp;
- int hist_next = 1;
- bool hist_recycle = false;
- const char *dp = source;
- const char *dend = source + slen;
- unsigned char ctrl_dummy = 0;
- unsigned char *ctrlp = &ctrl_dummy;
- unsigned char ctrlb = 0;
- unsigned char ctrl = 0;
+ unsigned char *dest_ptr = (unsigned char *) dest;
+ unsigned char *dest_start = dest_ptr;
+ uint16 hist_next = 1;
+ uint16 hist_idx;
+ const unsigned char *src_ptr = (const unsigned char *) source;
+ const unsigned char *src_end = (const unsigned char *) source + slen;
+ const unsigned char *compress_src_end = src_end - 4;
+ unsigned char control_dummy = 0;
+ unsigned char *control_ptr = &control_dummy;
+ unsigned char control_byte = 0;
+ unsigned char control_pos = 0;
bool found_match = false;
int32 match_len;
- int32 match_off;
+ int32 match_offset;
int32 good_match;
int32 good_drop;
int32 result_size;
int32 result_max;
int32 need_rate;
int hashsz;
- int mask;
+ uint16 mask;
+
/*
* Our fallback strategy is the default.
@@ -560,6 +561,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
else if (good_drop > 100)
good_drop = 100;
+ /* We use <<7 later to calculate actual drop, so align percents to 128 */
+ good_drop = good_drop * 128 / 100;
+
need_rate = strategy->min_comp_rate;
if (need_rate < 0)
need_rate = 0;
@@ -577,7 +581,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
result_max = (slen / 100) * (100 - need_rate);
}
else
+ {
result_max = (slen * (100 - need_rate)) / 100;
+ }
/*
* Experiments suggest that these hash sizes work pretty well. A large
@@ -603,10 +609,21 @@ pglz_compress(const char *source, int32 slen, char *dest,
*/
memset(hist_start, 0, hashsz * sizeof(int16));
+ /*
+ * Initialize INVALID_ENTRY for stopping during lookup.
+ */
+ hist_entries[INVALID_ENTRY].pos = src_end;
+ hist_entries[INVALID_ENTRY].hist_idx = hashsz;
+
+ /*
+ * Calculate initial hash value.
+ */
+ hist_idx = pglz_hist_idx(src_ptr, mask);
+
/*
* Compress the source directly into the output buffer.
*/
- while (dp < dend)
+ while (src_ptr < compress_src_end)
{
/*
* If we already exceeded the maximum result size, fail.
@@ -615,7 +632,7 @@ pglz_compress(const char *source, int32 slen, char *dest,
* bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
* allow 4 slop bytes.
*/
- if (bp - bstart >= result_max)
+ if (dest_ptr - dest_start >= result_max)
return -1;
/*
@@ -624,27 +641,36 @@ pglz_compress(const char *source, int32 slen, char *dest,
* reasonably quickly when looking at incompressible input (such as
* pre-compressed data).
*/
- if (!found_match && bp - bstart >= strategy->first_success_by)
+ if (!found_match && dest_ptr - dest_start >= strategy->first_success_by)
return -1;
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
+ }
+
/*
* Try to find a match in the history
*/
- if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop, mask))
+ if (pglz_find_match(hist_idx, src_ptr, compress_src_end, &match_len,
+ &match_offset, good_match, good_drop))
{
/*
* Create the tag and add history entries for all matched
* characters.
*/
- pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ control_byte |= control_pos;
+ dest_ptr = pglz_out_tag(dest_ptr, match_len, match_offset);
while (match_len--)
{
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ src_ptr++;
}
found_match = true;
}
@@ -653,21 +679,47 @@ pglz_compress(const char *source, int32 slen, char *dest,
/*
* No match found. Copy one literal byte.
*/
- pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ }
+ control_pos <<= 1;
+ }
+
+
+ while (src_ptr < src_end)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (dest_ptr - dest_start >= result_max)
+ return -1;
+
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
}
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ control_pos <<= 1;
}
/*
* Write out the last control byte and check that we haven't overrun the
* output size allowed by the strategy.
*/
- *ctrlp = ctrlb;
- result_size = bp - bstart;
+ *control_ptr = control_byte;
+ result_size = dest_ptr - dest_start;
if (result_size >= result_max)
return -1;
--
2.38.1
0002-review-v6.patch (text/x-patch)
From add94b106300b9b28631a4f6f82fb85d68967d0e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Sun, 27 Nov 2022 16:26:47 +0100
Subject: [PATCH 2/2] review
---
src/common/pg_lzcompress.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index 248545a108e..bc3388d481f 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -185,11 +185,13 @@
#include "common/pg_lzcompress.h"
+
/* ----------
* Local definitions
* ----------
*/
#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
+/* review: but we only do "==" comparison anyway, so why? also could say 4095 */
#define PGLZ_HISTORY_SIZE 0x0fff /* to avoid compare in iteration */
#define PGLZ_MAX_MATCH 273
@@ -206,6 +208,7 @@
*/
typedef struct PGLZ_HistEntry
{
+ /* review: the comment is obsolete, there are indexes, not links) */
int16 next_id; /* links for my hash key's list */
uint16 hist_idx; /* my current hash key */
const unsigned char *pos; /* my input position */
@@ -283,8 +286,7 @@ pglz_hist_idx(const unsigned char *s, uint16 mask)
/* ----------
* pglz_hist_add -
*
- * Adds a new entry to the history table
- * and recalculates hash value.
+ * Adds a new entry to the history table and recalculates hash value.
* ----------
*/
static inline int16
@@ -350,8 +352,7 @@ pglz_out_tag(unsigned char *dest_ptr, int32 match_len, int32 match_offset)
/* ----------
* pglz_compare -
*
- * Compares two sequences of bytes
- * and returns position of first mismatch.
+ * Compares two sequences of bytes and returns position of first mismatch.
* ----------
*/
static inline int32
@@ -395,6 +396,9 @@ pglz_find_match(uint16 hist_idx, const unsigned char *input, const unsigned char
/*
* Traverse the linked history list until a good enough match is found.
+ *
+ * review: misplaced comment - this belongs to the while loop further
+ * down, here we should explain handling of invalid entries and cleanup
*/
hist_entry_number = &hist_start[hist_idx];
if (*hist_entry_number == INVALID_ENTRY)
@@ -581,9 +585,7 @@ pglz_compress(const char *source, int32 slen, char *dest,
result_max = (slen / 100) * (100 - need_rate);
}
else
- {
result_max = (slen * (100 - need_rate)) / 100;
- }
/*
* Experiments suggest that these hash sizes work pretty well. A large
--
2.38.1
On 11/27/22 17:02, Tomas Vondra wrote:
Hi,
I took a look at the v6 patch, with the intention to get it committed. I
have a couple minor comments:
1) For PGLZ_HISTORY_SIZE it uses literal 0x0fff, with the explanation:
/* to avoid compare in iteration */
which I think means intent to use this value as a bit mask, but then the
only place using PGLZ_HISTORY_SIZE does
if (hist_next == PGLZ_HISTORY_SIZE) ...
i.e. a comparison. Maybe I misunderstand the comment, though.
2) PGLZ_HistEntry was modified and replaces links (pointers) with
indexes, but the comments still talk about "links", so maybe that needs
to be changed. Also, I wonder why next_id is int16 while hist_idx is
uint16 (and also why id vs. idx)?3) minor formatting of comments
4) the comment in pglz_find_match about traversing the history seems too
early - it's before handling invalid entries and cleanup, but it does
not talk about that at all, and the actual while loop is after that.Attached is v6 in 0001 (verbatim), with the review comments in 0002.
BTW I've switched this to WoA (Waiting on Author), but the comments should be trivial to
resolve and to get it committed.
Also, I still see roughly 15-20% improvement on some compression-heavy
tests, as reported before. Which is nice.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Tomas,
On Sun, Nov 27, 2022 at 8:02 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
1) For PGLZ_HISTORY_SIZE it uses literal 0x0fff, with the explanation:
/* to avoid compare in iteration */
which I think means intent to use this value as a bit mask, but then the
only place using PGLZ_HISTORY_SIZE does

    if (hist_next == PGLZ_HISTORY_SIZE) ...
i.e. a comparison. Maybe I misunderstand the comment, though.
As far as I recollect, it's a leftover from an attempt to optimize the
code into a branchless version, i.e. instead of

    if (hist_next >= PGLZ_HISTORY_SIZE)
        hist_next = 1;

use something like

    hist_next = hist_next & PGLZ_HISTORY_SIZE;

But the optimization did not show any measurable impact and was not
rolled back cleanly.
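To make the difference concrete, here is a minimal standalone sketch of the
two wrap variants (the names and the HISTORY_SIZE constant are illustrative,
not the actual patch code):

    #include <stdio.h>

    #define HISTORY_SIZE 4096               /* must be a power of 2 */

    /* Branching wrap, as in the patch: slot 0 is the reserved invalid entry. */
    static int wrap_branching(int hist_next)
    {
        hist_next++;
        if (hist_next == HISTORY_SIZE)
            hist_next = 1;
        return hist_next;
    }

    /* Masked "branchless" wrap: one AND instead of a compare-and-branch.
     * Note it can yield 0, the reserved invalid slot, so it would not be a
     * drop-in replacement. */
    static int wrap_masked(int hist_next)
    {
        return (hist_next + 1) & (HISTORY_SIZE - 1);
    }

    int main(void)
    {
        printf("%d %d\n", wrap_branching(HISTORY_SIZE - 1),
               wrap_masked(HISTORY_SIZE - 1));
        /* prints "1 0" */
        return 0;
    }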
2) PGLZ_HistEntry was modified and replaces links (pointers) with
indexes, but the comments still talk about "links", so maybe that needs
to be changed.
The offsets still form a "linked list"... however, I removed some
mentions of pointers, since they are not pointers anymore.
Also, I wonder why next_id is int16 while hist_idx is
uint16 (and also why id vs. idx)?
+1. I'd call them next and hash.
int16 next; /* instead of next_id */
uint16 hash; /* instead of hist_idx */
What do you think?
hist_idx comes from the function name... I'm not sure how far renaming
should go here.
3) minor formatting of comments
4) the comment in pglz_find_match about traversing the history seems too
early - it's before handling invalid entries and cleanup, but it does
not talk about that at all, and the actual while loop is after that.
Yes, this seems right to me.
PFA review fixes (step 1 is unchanged).
I did not include the next_id -> next and hist_idx -> hash renames.
Thank you!
Best regards, Andrey Borodin.
Attachments:
v7-0002-Review-fixes.patch (application/octet-stream)
From f1a73bf30b52dde4c635a876eae69050f2a781d8 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <xformmm@amazon.com>
Date: Sun, 27 Nov 2022 10:37:43 -0800
Subject: [PATCH v7 2/2] Review fixes
---
src/common/pg_lzcompress.c | 24 +++++++++++-------------
1 file changed, 11 insertions(+), 13 deletions(-)
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index 248545a108..cbc4dd1ad2 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -124,7 +124,7 @@
* lists of likely to be at least in the first 4 characters
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
- * kept for the last 4094 input positions, since we cannot use
+ * kept for the last 4095 input positions, since we cannot use
* back-pointers larger than that anyway. The size of the hash
* table is chosen based on the size of the input - a larger table
* has a larger startup cost, as it needs to be initialized to
@@ -190,7 +190,7 @@
* ----------
*/
#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_SIZE 0x0fff /* to avoid compare in iteration */
+#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -200,13 +200,13 @@
* Linked list for the backward history lookup
*
* All the entries sharing a hash key are linked in a singly linked list.
- * Links are not changed during insertion in order to speed it up.
+ * Indexes are not changed during insertion in order to speed it up.
* Instead more complicated stop condition is used during list iteration.
* ----------
*/
typedef struct PGLZ_HistEntry
{
- int16 next_id; /* links for my hash key's list */
+ int16 next_id; /* index of next entry with same hash */
uint16 hist_idx; /* my current hash key */
const unsigned char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -283,8 +283,7 @@ pglz_hist_idx(const unsigned char *s, uint16 mask)
/* ----------
* pglz_hist_add -
*
- * Adds a new entry to the history table
- * and recalculates hash value.
+ * Adds a new entry to the history table and recalculates hash value.
* ----------
*/
static inline int16
@@ -301,7 +300,7 @@ pglz_hist_add(int16 hist_next, uint16 *hist_idx, const unsigned char *s, uint16
entry->pos = s;
/*
- * Update linked list head pointer.
+ * Update linked list head index.
*/
*my_hist_start = hist_next;
@@ -311,7 +310,7 @@ pglz_hist_add(int16 hist_next, uint16 *hist_idx, const unsigned char *s, uint16
*hist_idx = ((((*hist_idx) ^ (s[0] << 6)) << 2) ^ s[4]) & mask;
/*
- * Shift history pointer.
+ * Shift history index.
*/
hist_next++;
if (hist_next == PGLZ_HISTORY_SIZE)
@@ -350,8 +349,7 @@ pglz_out_tag(unsigned char *dest_ptr, int32 match_len, int32 match_offset)
/* ----------
* pglz_compare -
*
- * Compares two sequences of bytes
- * and returns position of first mismatch.
+ * Compares two sequences of bytes and returns position of first mismatch.
* ----------
*/
static inline int32
@@ -393,9 +391,6 @@ pglz_find_match(uint16 hist_idx, const unsigned char *input, const unsigned char
int32 cur_len = 0;
int32 len_bound = Min(end - input, PGLZ_MAX_MATCH);
- /*
- * Traverse the linked history list until a good enough match is found.
- */
hist_entry_number = &hist_start[hist_idx];
if (*hist_entry_number == INVALID_ENTRY)
return 0;
@@ -411,6 +406,9 @@ pglz_find_match(uint16 hist_idx, const unsigned char *input, const unsigned char
return 0;
}
+ /*
+ * Traverse the linked history list until a good enough match is found.
+ */
while (true)
{
const unsigned char *input_pos = input;
--
2.37.0 (Apple Git-136)
v7-0001-Reorganize-pglz-compression-code.patch (application/octet-stream)
From 3d8420a9609275f0c5d8888955e4ffe018c245bf Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Thu, 27 Jun 2019 23:18:21 +0500
Subject: [PATCH v7 1/2] Reorganize pglz compression code
This patch accumulates several changes:
1. Convert macro-functions to regular functions for readability
2. Use more compact hash table with uint16 indexes instead of pointers
3. Avoid prev pointer in hash table
4. Use 4-byte comparisons in a search instead of 1-byte
Total speedup about x1.4
---
src/common/pg_lzcompress.c | 432 +++++++++++++++++++++----------------
1 file changed, 242 insertions(+), 190 deletions(-)
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index 9de5d5a0d4..248545a108 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -118,13 +118,13 @@
* our 4K maximum look-back distance is too small.
*
* The compressor creates a table for lists of positions.
- * For each input position (except the last 3), a hash key is
+ * For each input position (except the last 4), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
* lists of likely to be at least in the first 4 characters
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
- * kept for the last 4096 input positions, since we cannot use
+ * kept for the last 4094 input positions, since we cannot use
* back-pointers larger than that anyway. The size of the hash
* table is chosen based on the size of the input - a larger table
* has a larger startup cost, as it needs to be initialized to
@@ -146,17 +146,15 @@
* used for the next tag in the output.
*
* For each subsequent entry in the history list, the "good_match"
- * is lowered by 10%. So the compressor will be more happy with
- * short matches the further it has to go back in the history.
+ * is lowered roughly by 10%. So the compressor will be more happy
+ * with short matches the further it has to go back in the history.
* Another "speed against ratio" preference characteristic of
* the algorithm.
*
- * Thus there are 3 stop conditions for the lookup of matches:
+ * Thus there are 2 stop conditions for the lookup of matches:
*
* - a match >= good_match is found
* - there are no more history entries to look at
- * - the next history entry is already too far back
- * to be coded into a tag.
*
* Finally the match algorithm checks that at least a match
* of 3 or more bytes has been found, because that is the smallest
@@ -187,13 +185,12 @@
#include "common/pg_lzcompress.h"
-
/* ----------
* Local definitions
* ----------
*/
#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_SIZE 4096
+#define PGLZ_HISTORY_SIZE 0x0fff /* to avoid compare in iteration */
#define PGLZ_MAX_MATCH 273
@@ -202,17 +199,16 @@
*
* Linked list for the backward history lookup
*
- * All the entries sharing a hash key are linked in a doubly linked list.
- * This makes it easy to remove an entry when it's time to recycle it
- * (because it's more than 4K positions old).
+ * All the entries sharing a hash key are linked in a singly linked list.
+ * Links are not changed during insertion in order to speed it up.
+ * Instead more complicated stop condition is used during list iteration.
* ----------
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
- const char *pos; /* my input position */
+ int16 next_id; /* links for my hash key's list */
+ uint16 hist_idx; /* my current hash key */
+ const unsigned char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -253,14 +249,12 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* ----------
*/
static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
/*
- * Element 0 in hist_entries is unused, and means 'invalid'. Likewise,
- * INVALID_ENTRY_PTR in next/prev pointers mean 'invalid'.
+ * Element 0 in hist_entries is unused, and means 'invalid'.
*/
#define INVALID_ENTRY 0
-#define INVALID_ENTRY_PTR (&hist_entries[INVALID_ENTRY])
/* ----------
* pglz_hist_idx -
@@ -274,90 +268,58 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e, _mask) ( \
- ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
- ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
- )
+static inline uint16
+pglz_hist_idx(const unsigned char *s, uint16 mask)
+{
+ /*
+ * Note that we only calculate function at the beginning of compressed data.
+ * Further hash valued are computed by subtracting prev byte and adding next.
+ * For details see pglz_hist_add().
+ */
+ return ((s[0] << 6) ^ (s[1] << 4) ^ (s[2] << 2) ^ s[3]) & mask;
+}
/* ----------
* pglz_hist_add -
*
- * Adds a new entry to the history table.
- *
- * If _recycle is true, then we are recycling a previously used entry,
- * and must first delink it from its old hashcode's linked list.
- *
- * NOTE: beware of multiple evaluations of macro's arguments, and note that
- * _hn and _recycle are modified in the macro.
+ * Adds a new entry to the history table
+ * and recalculates hash value.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \
-do { \
- int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \
- int16 *__myhsp = &(_hs)[__hindex]; \
- PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
- if (_recycle) { \
- if (__myhe->prev == NULL) \
- (_hs)[__myhe->hindex] = __myhe->next - (_he); \
- else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
- } \
- __myhe->next = &(_he)[*__myhsp]; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
- __myhe->pos = (_s); \
- /* If there was an existing entry in this hash slot, link */ \
- /* this new entry to it. However, the 0th entry in the */ \
- /* entries table is unused, so we can freely scribble on it. */ \
- /* So don't bother checking if the slot was used - we'll */ \
- /* scribble on the unused entry if it was not, but that's */ \
- /* harmless. Avoiding the branch in this critical path */ \
- /* speeds this up a little bit. */ \
- /* if (*__myhsp != INVALID_ENTRY) */ \
- (_he)[(*__myhsp)].prev = __myhe; \
- *__myhsp = _hn; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
- (_hn) = 1; \
- (_recycle) = true; \
- } \
-} while (0)
+static inline int16
+pglz_hist_add(int16 hist_next, uint16 *hist_idx, const unsigned char *s, uint16 mask)
+{
+ int16 *my_hist_start = &hist_start[*hist_idx];
+ PGLZ_HistEntry *entry = &(hist_entries)[hist_next];
+ /*
+ * Initialize entry with a new value.
+ */
+ entry->next_id = *my_hist_start;
+ entry->hist_idx = *hist_idx;
+ entry->pos = s;
-/* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
-#define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
-do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
-} while (0)
+ /*
+ * Update linked list head pointer.
+ */
+ *my_hist_start = hist_next;
+ /*
+ * Recalculate hash value for next position. Remove current byte and add next.
+ */
+ *hist_idx = ((((*hist_idx) ^ (s[0] << 6)) << 2) ^ s[4]) & mask;
-/* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
-#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
-} while (0)
+ /*
+ * Shift history pointer.
+ */
+ hist_next++;
+ if (hist_next == PGLZ_HISTORY_SIZE)
+ {
+ hist_next = 1;
+ }
+ return hist_next;
+}
/* ----------
@@ -368,23 +330,48 @@ do { \
* appropriate control bit.
* ----------
*/
-#define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
-} while (0)
+static inline unsigned char *
+pglz_out_tag(unsigned char *dest_ptr, int32 match_len, int32 match_offset)
+{
+ if (match_len > 17)
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | 0x0f);
+ *(dest_ptr++) = (unsigned char) (((match_offset) & 0xff));
+ *(dest_ptr++) = (unsigned char) ((match_len) - 18);
+ }
+ else
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | ((match_len) - 3));
+ *(dest_ptr++) = (unsigned char) ((match_offset) & 0xff);
+ }
+ return dest_ptr;
+}
+
+/* ----------
+ * pglz_compare -
+ *
+ * Compares two sequences of bytes
+ * and returns position of first mismatch.
+ * ----------
+ */
+static inline int32
+pglz_compare(int32 len, int32 len_bound, const unsigned char *input_pos,
+ const unsigned char *hist_pos)
+{
+ while (len <= len_bound - 4 && memcmp(input_pos, hist_pos, 4) == 0)
+ {
+ len += 4;
+ input_pos += 4;
+ hist_pos += 4;
+ }
+ while (len < len_bound && *input_pos == *hist_pos)
+ {
+ len++;
+ input_pos++;
+ hist_pos++;
+ }
+ return len;
+}
/* ----------
@@ -396,32 +383,40 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(int16 *hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop, int mask)
+pglz_find_match(uint16 hist_idx, const unsigned char *input, const unsigned char *end,
+ int *lenp, int *offp, int good_match, int good_drop)
{
- PGLZ_HistEntry *hent;
- int16 hentno;
+ PGLZ_HistEntry *hist_entry;
+ int16 *hist_entry_number;
int32 len = 0;
- int32 off = 0;
+ int32 offset = 0;
+ int32 cur_len = 0;
+ int32 len_bound = Min(end - input, PGLZ_MAX_MATCH);
/*
* Traverse the linked history list until a good enough match is found.
*/
- hentno = hstart[pglz_hist_idx(input, end, mask)];
- hent = &hist_entries[hentno];
- while (hent != INVALID_ENTRY_PTR)
- {
- const char *ip = input;
- const char *hp = hent->pos;
- int32 thisoff;
- int32 thislen;
+ hist_entry_number = &hist_start[hist_idx];
+ if (*hist_entry_number == INVALID_ENTRY)
+ return 0;
+ hist_entry = &hist_entries[*hist_entry_number];
+ if (hist_idx != hist_entry->hist_idx)
+ {
/*
- * Stop if the offset does not fit into our tag anymore.
+ * If current linked list head points to invalid entry then clear it
+ * to reduce the number of comparisons in future.
*/
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ *hist_entry_number = INVALID_ENTRY;
+ return 0;
+ }
+
+ while (true)
+ {
+ const unsigned char *input_pos = input;
+ const unsigned char *hist_pos = hist_entry->pos;
+ const unsigned char *my_pos;
+ int32 cur_offset = input_pos - hist_pos;
/*
* Determine length of match. A better match must be larger than the
@@ -431,58 +426,62 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
* character by character comparison to know the exact position where
* the diff occurred.
*/
- thislen = 0;
if (len >= 16)
{
- if (memcmp(ip, hp, len) == 0)
+ if (memcmp(input_pos, hist_pos, len) == 0)
{
- thislen = len;
- ip += len;
- hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
- {
- thislen++;
- ip++;
- hp++;
- }
+ offset = cur_offset;
+ len = pglz_compare(len, len_bound, input_pos + len, hist_pos + len);
}
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ /*
+ * Start search for short matches by comparing 4 bytes.
+ * We expect that compiler will substitute memcmp() with fixed
+ * length by optimal 4-bytes comparison it knows.
+ */
+ if (memcmp(input_pos, hist_pos, 4) == 0)
{
- thislen++;
- ip++;
- hp++;
+ cur_len = pglz_compare(4, len_bound, input_pos + 4, hist_pos + 4);
+ if (cur_len > len)
+ {
+ len = cur_len;
+ offset = cur_offset;
+ }
}
}
- /*
- * Remember this match as the best (if it is)
- */
- if (thislen > len)
- {
- len = thislen;
- off = thisoff;
- }
-
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ my_pos = hist_entry->pos;
+ hist_entry = &hist_entries[hist_entry->next_id];
/*
- * Be happy with lesser good matches the more entries we visited. But
- * no point in doing calculation if we're at end of list.
+ * If current match length is ok then stop iteration. As outdated list
+ * links are not updated during insertion process then additional stop
+ * condition should be introduced to avoid following them. If recycled
+ * entry has another hash, then iteration stops. If it has the same
+ * hash then recycled cell would break input stream position
+ * monotonicity, which is checked after.
*/
- if (hent != INVALID_ENTRY_PTR)
+ if (len >= good_match || hist_idx != hist_entry->hist_idx || my_pos <= hist_entry->pos)
{
- if (len >= good_match)
- break;
- good_match -= (good_match * good_drop) / 100;
+ break;
}
+
+ /*
+ * Be happy with less good matches the more entries we visited.
+ */
+ good_match -= (good_match * good_drop) >> 7;
}
+ /*
+ * found match can be slightly more than necessary, bound it with len_bound
+ */
+ len = Min(len, len_bound);
+
/*
* Return match information only if it results at least in one byte
* reduction.
@@ -490,7 +489,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
if (len > 2)
{
*lenp = len;
- *offp = off;
+ *offp = offset;
return 1;
}
@@ -509,26 +508,28 @@ int32
pglz_compress(const char *source, int32 slen, char *dest,
const PGLZ_Strategy *strategy)
{
- unsigned char *bp = (unsigned char *) dest;
- unsigned char *bstart = bp;
- int hist_next = 1;
- bool hist_recycle = false;
- const char *dp = source;
- const char *dend = source + slen;
- unsigned char ctrl_dummy = 0;
- unsigned char *ctrlp = &ctrl_dummy;
- unsigned char ctrlb = 0;
- unsigned char ctrl = 0;
+ unsigned char *dest_ptr = (unsigned char *) dest;
+ unsigned char *dest_start = dest_ptr;
+ uint16 hist_next = 1;
+ uint16 hist_idx;
+ const unsigned char *src_ptr = (const unsigned char *) source;
+ const unsigned char *src_end = (const unsigned char *) source + slen;
+ const unsigned char *compress_src_end = src_end - 4;
+ unsigned char control_dummy = 0;
+ unsigned char *control_ptr = &control_dummy;
+ unsigned char control_byte = 0;
+ unsigned char control_pos = 0;
bool found_match = false;
int32 match_len;
- int32 match_off;
+ int32 match_offset;
int32 good_match;
int32 good_drop;
int32 result_size;
int32 result_max;
int32 need_rate;
int hashsz;
- int mask;
+ uint16 mask;
+
/*
* Our fallback strategy is the default.
@@ -560,6 +561,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
else if (good_drop > 100)
good_drop = 100;
+ /* We use <<7 later to calculate actual drop, so align percents to 128 */
+ good_drop = good_drop * 128 / 100;
+
need_rate = strategy->min_comp_rate;
if (need_rate < 0)
need_rate = 0;
@@ -577,7 +581,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
result_max = (slen / 100) * (100 - need_rate);
}
else
+ {
result_max = (slen * (100 - need_rate)) / 100;
+ }
/*
* Experiments suggest that these hash sizes work pretty well. A large
@@ -603,10 +609,21 @@ pglz_compress(const char *source, int32 slen, char *dest,
*/
memset(hist_start, 0, hashsz * sizeof(int16));
+ /*
+ * Initialize INVALID_ENTRY for stopping during lookup.
+ */
+ hist_entries[INVALID_ENTRY].pos = src_end;
+ hist_entries[INVALID_ENTRY].hist_idx = hashsz;
+
+ /*
+ * Calculate initial hash value.
+ */
+ hist_idx = pglz_hist_idx(src_ptr, mask);
+
/*
* Compress the source directly into the output buffer.
*/
- while (dp < dend)
+ while (src_ptr < compress_src_end)
{
/*
* If we already exceeded the maximum result size, fail.
@@ -615,7 +632,7 @@ pglz_compress(const char *source, int32 slen, char *dest,
* bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
* allow 4 slop bytes.
*/
- if (bp - bstart >= result_max)
+ if (dest_ptr - dest_start >= result_max)
return -1;
/*
@@ -624,27 +641,36 @@ pglz_compress(const char *source, int32 slen, char *dest,
* reasonably quickly when looking at incompressible input (such as
* pre-compressed data).
*/
- if (!found_match && bp - bstart >= strategy->first_success_by)
+ if (!found_match && dest_ptr - dest_start >= strategy->first_success_by)
return -1;
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
+ }
+
/*
* Try to find a match in the history
*/
- if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop, mask))
+ if (pglz_find_match(hist_idx, src_ptr, compress_src_end, &match_len,
+ &match_offset, good_match, good_drop))
{
/*
* Create the tag and add history entries for all matched
* characters.
*/
- pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ control_byte |= control_pos;
+ dest_ptr = pglz_out_tag(dest_ptr, match_len, match_offset);
while (match_len--)
{
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ src_ptr++;
}
found_match = true;
}
@@ -653,21 +679,47 @@ pglz_compress(const char *source, int32 slen, char *dest,
/*
* No match found. Copy one literal byte.
*/
- pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ }
+ control_pos <<= 1;
+ }
+
+
+ while (src_ptr < src_end)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (dest_ptr - dest_start >= result_max)
+ return -1;
+
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
}
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ control_pos <<= 1;
}
/*
* Write out the last control byte and check that we haven't overrun the
* output size allowed by the strategy.
*/
- *ctrlp = ctrlb;
- result_size = bp - bstart;
+ *control_ptr = control_byte;
+ result_size = dest_ptr - dest_start;
if (result_size >= result_max)
return -1;
--
2.37.0 (Apple Git-136)
On Sun, Nov 27, 2022 at 10:43 AM Andrey Borodin <amborodin86@gmail.com> wrote:
PFA review fixes (step 1 is unchanged).
Hello! Please find attached v8.
Changes are mostly cosmetic:
1. The two patches from the previous message were squashed together
2. I tried to write a better commit message
Thanks!
Best regards, Andrey Borodin.
Attachments:
v8-0001-Reorganize-pglz-compression-code.patch (application/octet-stream)
From 22b41e1e2d181278436ac9709e001d4a98065ff3 Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Thu, 27 Jun 2019 23:18:21 +0500
Subject: [PATCH v8] Reorganize pglz compression code
This patch accumulates several changes to pglz compression:
1. Convert macro-functions to regular functions for readability
2. Use more compact hash table with uint16 indexes instead of pointers
3. Avoid prev pointer in hash table
4. Use 4-byte comparisons during a search instead of 1-byte comparisons
Author: Andrey Borodin <amborodin@acm.org>
Author: Vladimir Leskov <vladimirlesk@gmail.com>
Reviewed-By: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-By: Mark Dilger <mark.dilger@enterprisedb.com>
Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/FEF3DC5E-4BC4-44E1-8DEB-DADC67046FE3%40yandex-team.ru
---
src/common/pg_lzcompress.c | 432 +++++++++++++++++++++----------------
1 file changed, 241 insertions(+), 191 deletions(-)
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index f14c89fae4..bc85327af0 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -118,13 +118,13 @@
* our 4K maximum look-back distance is too small.
*
* The compressor creates a table for lists of positions.
- * For each input position (except the last 3), a hash key is
+ * For each input position (except the last 4), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
* lists of likely to be at least in the first 4 characters
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
- * kept for the last 4096 input positions, since we cannot use
+ * kept for the last 4095 input positions, since we cannot use
* back-pointers larger than that anyway. The size of the hash
* table is chosen based on the size of the input - a larger table
* has a larger startup cost, as it needs to be initialized to
@@ -146,17 +146,15 @@
* used for the next tag in the output.
*
* For each subsequent entry in the history list, the "good_match"
- * is lowered by 10%. So the compressor will be more happy with
- * short matches the further it has to go back in the history.
+ * is lowered roughly by 10%. So the compressor will be more happy
+ * with short matches the further it has to go back in the history.
* Another "speed against ratio" preference characteristic of
* the algorithm.
*
- * Thus there are 3 stop conditions for the lookup of matches:
+ * Thus there are 2 stop conditions for the lookup of matches:
*
* - a match >= good_match is found
* - there are no more history entries to look at
- * - the next history entry is already too far back
- * to be coded into a tag.
*
* Finally the match algorithm checks that at least a match
* of 3 or more bytes has been found, because that is the smallest
@@ -187,7 +185,6 @@
#include "common/pg_lzcompress.h"
-
/* ----------
* Local definitions
* ----------
@@ -202,17 +199,16 @@
*
* Linked list for the backward history lookup
*
- * All the entries sharing a hash key are linked in a doubly linked list.
- * This makes it easy to remove an entry when it's time to recycle it
- * (because it's more than 4K positions old).
+ * All the entries sharing a hash key are linked in a singly linked list.
+ * Indexes are not changed during insertion in order to speed it up.
+ * Instead more complicated stop condition is used during list iteration.
* ----------
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
- const char *pos; /* my input position */
+ int16 next_id; /* index of next entry with same hash */
+ uint16 hist_idx; /* my current hash key */
+ const unsigned char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -253,14 +249,12 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* ----------
*/
static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
/*
- * Element 0 in hist_entries is unused, and means 'invalid'. Likewise,
- * INVALID_ENTRY_PTR in next/prev pointers mean 'invalid'.
+ * Element 0 in hist_entries is unused, and means 'invalid'.
*/
#define INVALID_ENTRY 0
-#define INVALID_ENTRY_PTR (&hist_entries[INVALID_ENTRY])
/* ----------
* pglz_hist_idx -
@@ -274,90 +268,57 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e, _mask) ( \
- ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
- ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
- )
+static inline uint16
+pglz_hist_idx(const unsigned char *s, uint16 mask)
+{
+ /*
+ * Note that we only calculate function at the beginning of compressed data.
+ * Further hash valued are computed by subtracting prev byte and adding next.
+ * For details see pglz_hist_add().
+ */
+ return ((s[0] << 6) ^ (s[1] << 4) ^ (s[2] << 2) ^ s[3]) & mask;
+}
/* ----------
* pglz_hist_add -
*
- * Adds a new entry to the history table.
- *
- * If _recycle is true, then we are recycling a previously used entry,
- * and must first delink it from its old hashcode's linked list.
- *
- * NOTE: beware of multiple evaluations of macro's arguments, and note that
- * _hn and _recycle are modified in the macro.
+ * Adds a new entry to the history table and recalculates hash value.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \
-do { \
- int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \
- int16 *__myhsp = &(_hs)[__hindex]; \
- PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
- if (_recycle) { \
- if (__myhe->prev == NULL) \
- (_hs)[__myhe->hindex] = __myhe->next - (_he); \
- else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
- } \
- __myhe->next = &(_he)[*__myhsp]; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
- __myhe->pos = (_s); \
- /* If there was an existing entry in this hash slot, link */ \
- /* this new entry to it. However, the 0th entry in the */ \
- /* entries table is unused, so we can freely scribble on it. */ \
- /* So don't bother checking if the slot was used - we'll */ \
- /* scribble on the unused entry if it was not, but that's */ \
- /* harmless. Avoiding the branch in this critical path */ \
- /* speeds this up a little bit. */ \
- /* if (*__myhsp != INVALID_ENTRY) */ \
- (_he)[(*__myhsp)].prev = __myhe; \
- *__myhsp = _hn; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
- (_hn) = 1; \
- (_recycle) = true; \
- } \
-} while (0)
+static inline int16
+pglz_hist_add(int16 hist_next, uint16 *hist_idx, const unsigned char *s, uint16 mask)
+{
+ int16 *my_hist_start = &hist_start[*hist_idx];
+ PGLZ_HistEntry *entry = &(hist_entries)[hist_next];
+ /*
+ * Initialize entry with a new value.
+ */
+ entry->next_id = *my_hist_start;
+ entry->hist_idx = *hist_idx;
+ entry->pos = s;
-/* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
-#define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
-do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
-} while (0)
+ /*
+ * Update linked list head index.
+ */
+ *my_hist_start = hist_next;
+ /*
+ * Recalculate hash value for next position. Remove current byte and add next.
+ */
+ *hist_idx = ((((*hist_idx) ^ (s[0] << 6)) << 2) ^ s[4]) & mask;
-/* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
-#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
-} while (0)
+ /*
+ * Shift history index.
+ */
+ hist_next++;
+ if (hist_next == PGLZ_HISTORY_SIZE)
+ {
+ hist_next = 1;
+ }
+ return hist_next;
+}
/* ----------
@@ -368,23 +329,47 @@ do { \
* appropriate control bit.
* ----------
*/
-#define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
-} while (0)
+static inline unsigned char *
+pglz_out_tag(unsigned char *dest_ptr, int32 match_len, int32 match_offset)
+{
+ if (match_len > 17)
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | 0x0f);
+ *(dest_ptr++) = (unsigned char) (((match_offset) & 0xff));
+ *(dest_ptr++) = (unsigned char) ((match_len) - 18);
+ }
+ else
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | ((match_len) - 3));
+ *(dest_ptr++) = (unsigned char) ((match_offset) & 0xff);
+ }
+ return dest_ptr;
+}
+
+/* ----------
+ * pglz_compare -
+ *
+ * Compares two sequences of bytes and returns position of first mismatch.
+ * ----------
+ */
+static inline int32
+pglz_compare(int32 len, int32 len_bound, const unsigned char *input_pos,
+ const unsigned char *hist_pos)
+{
+ while (len <= len_bound - 4 && memcmp(input_pos, hist_pos, 4) == 0)
+ {
+ len += 4;
+ input_pos += 4;
+ hist_pos += 4;
+ }
+ while (len < len_bound && *input_pos == *hist_pos)
+ {
+ len++;
+ input_pos++;
+ hist_pos++;
+ }
+ return len;
+}
/* ----------
@@ -396,32 +381,40 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(int16 *hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop, int mask)
+pglz_find_match(uint16 hist_idx, const unsigned char *input, const unsigned char *end,
+ int *lenp, int *offp, int good_match, int good_drop)
{
- PGLZ_HistEntry *hent;
- int16 hentno;
+ PGLZ_HistEntry *hist_entry;
+ int16 *hist_entry_number;
int32 len = 0;
- int32 off = 0;
+ int32 offset = 0;
+ int32 cur_len = 0;
+ int32 len_bound = Min(end - input, PGLZ_MAX_MATCH);
+
+ hist_entry_number = &hist_start[hist_idx];
+ if (*hist_entry_number == INVALID_ENTRY)
+ return 0;
+
+ hist_entry = &hist_entries[*hist_entry_number];
+ if (hist_idx != hist_entry->hist_idx)
+ {
+ /*
+ * If current linked list head points to invalid entry then clear it
+ * to reduce the number of comparisons in future.
+ */
+ *hist_entry_number = INVALID_ENTRY;
+ return 0;
+ }
/*
* Traverse the linked history list until a good enough match is found.
*/
- hentno = hstart[pglz_hist_idx(input, end, mask)];
- hent = &hist_entries[hentno];
- while (hent != INVALID_ENTRY_PTR)
+ while (true)
{
- const char *ip = input;
- const char *hp = hent->pos;
- int32 thisoff;
- int32 thislen;
-
- /*
- * Stop if the offset does not fit into our tag anymore.
- */
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ const unsigned char *input_pos = input;
+ const unsigned char *hist_pos = hist_entry->pos;
+ const unsigned char *my_pos;
+ int32 cur_offset = input_pos - hist_pos;
/*
* Determine length of match. A better match must be larger than the
@@ -431,58 +424,62 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
* character by character comparison to know the exact position where
* the diff occurred.
*/
- thislen = 0;
if (len >= 16)
{
- if (memcmp(ip, hp, len) == 0)
+ if (memcmp(input_pos, hist_pos, len) == 0)
{
- thislen = len;
- ip += len;
- hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
- {
- thislen++;
- ip++;
- hp++;
- }
+ offset = cur_offset;
+ len = pglz_compare(len, len_bound, input_pos + len, hist_pos + len);
}
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ /*
+ * Start search for short matches by comparing 4 bytes.
+ * We expect that compiler will substitute memcmp() with fixed
+ * length by optimal 4-bytes comparison it knows.
+ */
+ if (memcmp(input_pos, hist_pos, 4) == 0)
{
- thislen++;
- ip++;
- hp++;
+ cur_len = pglz_compare(4, len_bound, input_pos + 4, hist_pos + 4);
+ if (cur_len > len)
+ {
+ len = cur_len;
+ offset = cur_offset;
+ }
}
}
- /*
- * Remember this match as the best (if it is)
- */
- if (thislen > len)
- {
- len = thislen;
- off = thisoff;
- }
-
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ my_pos = hist_entry->pos;
+ hist_entry = &hist_entries[hist_entry->next_id];
/*
- * Be happy with lesser good matches the more entries we visited. But
- * no point in doing calculation if we're at end of list.
+ * If current match length is ok then stop iteration. As outdated list
+ * links are not updated during insertion process then additional stop
+ * condition should be introduced to avoid following them. If recycled
+ * entry has another hash, then iteration stops. If it has the same
+ * hash then recycled cell would break input stream position
+ * monotonicity, which is checked after.
*/
- if (hent != INVALID_ENTRY_PTR)
+ if (len >= good_match || hist_idx != hist_entry->hist_idx || my_pos <= hist_entry->pos)
{
- if (len >= good_match)
- break;
- good_match -= (good_match * good_drop) / 100;
+ break;
}
+
+ /*
+ * Be happy with less good matches the more entries we visited.
+ */
+ good_match -= (good_match * good_drop) >> 7;
}
+ /*
+ * found match can be slightly more than necessary, bound it with len_bound
+ */
+ len = Min(len, len_bound);
+
/*
* Return match information only if it results at least in one byte
* reduction.
@@ -490,7 +487,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
if (len > 2)
{
*lenp = len;
- *offp = off;
+ *offp = offset;
return 1;
}
@@ -509,26 +506,28 @@ int32
pglz_compress(const char *source, int32 slen, char *dest,
const PGLZ_Strategy *strategy)
{
- unsigned char *bp = (unsigned char *) dest;
- unsigned char *bstart = bp;
- int hist_next = 1;
- bool hist_recycle = false;
- const char *dp = source;
- const char *dend = source + slen;
- unsigned char ctrl_dummy = 0;
- unsigned char *ctrlp = &ctrl_dummy;
- unsigned char ctrlb = 0;
- unsigned char ctrl = 0;
+ unsigned char *dest_ptr = (unsigned char *) dest;
+ unsigned char *dest_start = dest_ptr;
+ uint16 hist_next = 1;
+ uint16 hist_idx;
+ const unsigned char *src_ptr = (const unsigned char *) source;
+ const unsigned char *src_end = (const unsigned char *) source + slen;
+ const unsigned char *compress_src_end = src_end - 4;
+ unsigned char control_dummy = 0;
+ unsigned char *control_ptr = &control_dummy;
+ unsigned char control_byte = 0;
+ unsigned char control_pos = 0;
bool found_match = false;
int32 match_len;
- int32 match_off;
+ int32 match_offset;
int32 good_match;
int32 good_drop;
int32 result_size;
int32 result_max;
int32 need_rate;
int hashsz;
- int mask;
+ uint16 mask;
+
/*
* Our fallback strategy is the default.
@@ -560,6 +559,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
else if (good_drop > 100)
good_drop = 100;
+ /* We use <<7 later to calculate actual drop, so align percents to 128 */
+ good_drop = good_drop * 128 / 100;
+
need_rate = strategy->min_comp_rate;
if (need_rate < 0)
need_rate = 0;
@@ -577,7 +579,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
result_max = (slen / 100) * (100 - need_rate);
}
else
+ {
result_max = (slen * (100 - need_rate)) / 100;
+ }
/*
* Experiments suggest that these hash sizes work pretty well. A large
@@ -603,10 +607,21 @@ pglz_compress(const char *source, int32 slen, char *dest,
*/
memset(hist_start, 0, hashsz * sizeof(int16));
+ /*
+ * Initialize INVALID_ENTRY for stopping during lookup.
+ */
+ hist_entries[INVALID_ENTRY].pos = src_end;
+ hist_entries[INVALID_ENTRY].hist_idx = hashsz;
+
+ /*
+ * Calculate initial hash value.
+ */
+ hist_idx = pglz_hist_idx(src_ptr, mask);
+
/*
* Compress the source directly into the output buffer.
*/
- while (dp < dend)
+ while (src_ptr < compress_src_end)
{
/*
* If we already exceeded the maximum result size, fail.
@@ -615,7 +630,7 @@ pglz_compress(const char *source, int32 slen, char *dest,
* bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
* allow 4 slop bytes.
*/
- if (bp - bstart >= result_max)
+ if (dest_ptr - dest_start >= result_max)
return -1;
/*
@@ -624,27 +639,36 @@ pglz_compress(const char *source, int32 slen, char *dest,
* reasonably quickly when looking at incompressible input (such as
* pre-compressed data).
*/
- if (!found_match && bp - bstart >= strategy->first_success_by)
+ if (!found_match && dest_ptr - dest_start >= strategy->first_success_by)
return -1;
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
+ }
+
/*
* Try to find a match in the history
*/
- if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop, mask))
+ if (pglz_find_match(hist_idx, src_ptr, compress_src_end, &match_len,
+ &match_offset, good_match, good_drop))
{
/*
* Create the tag and add history entries for all matched
* characters.
*/
- pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ control_byte |= control_pos;
+ dest_ptr = pglz_out_tag(dest_ptr, match_len, match_offset);
while (match_len--)
{
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ src_ptr++;
}
found_match = true;
}
@@ -653,21 +677,47 @@ pglz_compress(const char *source, int32 slen, char *dest,
/*
* No match found. Copy one literal byte.
*/
- pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ }
+ control_pos <<= 1;
+ }
+
+
+ while (src_ptr < src_end)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (dest_ptr - dest_start >= result_max)
+ return -1;
+
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
}
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ control_pos <<= 1;
}
/*
* Write out the last control byte and check that we haven't overrun the
* output size allowed by the strategy.
*/
- *ctrlp = ctrlb;
- result_size = bp - bstart;
+ *control_ptr = control_byte;
+ result_size = dest_ptr - dest_start;
if (result_size >= result_max)
return -1;
--
2.32.0 (Apple Git-132)
On Fri, Jan 6, 2023 at 10:02 PM Andrey Borodin <amborodin86@gmail.com> wrote:
Hello! Please find attached v8.
I got some interesting feedback from users of the patch.
There was an oversight that frequently yielded results 1, 2 or 3 bytes
longer than expected.
Looking closer, I found that the correctness of the last 3-byte tail is
checked in two places. PFA a fix for this. The previously compressed data
was correct, but in some cases a few bytes longer than the result of the
current pglz implementation.
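For context, the reorganized code compresses in two loops: the main loop
stops four bytes before the end of the input (the hash needs that much
lookahead), and a short trailing loop emits the remaining bytes as literals.
A stripped-down sketch of that shape (illustrative only; control-byte
bookkeeping and the output-size checks are omitted):

    static int
    compress_sketch(const unsigned char *src, int slen, unsigned char *dest)
    {
        const unsigned char *src_end = src + slen;
        const unsigned char *compress_src_end = src_end - 4;    /* hash lookahead */
        unsigned char *dest_ptr = dest;

        while (src < compress_src_end)
        {
            /* try to find a match in the history; emit a tag or a literal */
            *dest_ptr++ = *src++;       /* only the literal path shown here */
        }

        /* Tail that the hash cannot cover: always emitted as literal bytes. */
        while (src < src_end)
            *dest_ptr++ = *src++;

        return (int) (dest_ptr - dest);
    }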
Thanks!
Best regards, Andrey Borodin.
Attachments:
v9-0001-Reorganize-pglz-compression-code.patch (application/octet-stream)
From 5de9a714a9586ee2fc0259fb527ca016f451efcc Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Thu, 27 Jun 2019 23:18:21 +0500
Subject: [PATCH v9 1/2] Reorganize pglz compression code
This patch accumulates several changes to pglz compression:
1. Convert macro-functions to regular functions for readability
2. Use more compact hash table with uint16 indexes instead of pointers
3. Avoid prev pointer in hash table
4. Use 4-byte comparisons during a search instead of 1-byte comparisons
Author: Andrey Borodin <amborodin@acm.org>
Author: Vladimir Leskov <vladimirlesk@gmail.com>
Reviewed-By: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-By: Mark Dilger <mark.dilger@enterprisedb.com>
Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/FEF3DC5E-4BC4-44E1-8DEB-DADC67046FE3%40yandex-team.ru
---
src/common/pg_lzcompress.c | 432 +++++++++++++++++++++----------------
1 file changed, 241 insertions(+), 191 deletions(-)
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index f14c89fae4..bc85327af0 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -118,13 +118,13 @@
* our 4K maximum look-back distance is too small.
*
* The compressor creates a table for lists of positions.
- * For each input position (except the last 3), a hash key is
+ * For each input position (except the last 4), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
* lists of likely to be at least in the first 4 characters
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
- * kept for the last 4096 input positions, since we cannot use
+ * kept for the last 4095 input positions, since we cannot use
* back-pointers larger than that anyway. The size of the hash
* table is chosen based on the size of the input - a larger table
* has a larger startup cost, as it needs to be initialized to
@@ -146,17 +146,15 @@
* used for the next tag in the output.
*
* For each subsequent entry in the history list, the "good_match"
- * is lowered by 10%. So the compressor will be more happy with
- * short matches the further it has to go back in the history.
+ * is lowered roughly by 10%. So the compressor will be more happy
+ * with short matches the further it has to go back in the history.
* Another "speed against ratio" preference characteristic of
* the algorithm.
*
- * Thus there are 3 stop conditions for the lookup of matches:
+ * Thus there are 2 stop conditions for the lookup of matches:
*
* - a match >= good_match is found
* - there are no more history entries to look at
- * - the next history entry is already too far back
- * to be coded into a tag.
*
* Finally the match algorithm checks that at least a match
* of 3 or more bytes has been found, because that is the smallest
@@ -187,7 +185,6 @@
#include "common/pg_lzcompress.h"
-
/* ----------
* Local definitions
* ----------
@@ -202,17 +199,16 @@
*
* Linked list for the backward history lookup
*
- * All the entries sharing a hash key are linked in a doubly linked list.
- * This makes it easy to remove an entry when it's time to recycle it
- * (because it's more than 4K positions old).
+ * All the entries sharing a hash key are linked in a singly linked list.
+ * Indexes are not changed during insertion in order to speed it up.
+ * Instead more complicated stop condition is used during list iteration.
* ----------
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
- const char *pos; /* my input position */
+ int16 next_id; /* index of next entry with same hash */
+ uint16 hist_idx; /* my current hash key */
+ const unsigned char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -253,14 +249,12 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* ----------
*/
static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
/*
- * Element 0 in hist_entries is unused, and means 'invalid'. Likewise,
- * INVALID_ENTRY_PTR in next/prev pointers mean 'invalid'.
+ * Element 0 in hist_entries is unused, and means 'invalid'.
*/
#define INVALID_ENTRY 0
-#define INVALID_ENTRY_PTR (&hist_entries[INVALID_ENTRY])
/* ----------
* pglz_hist_idx -
@@ -274,90 +268,57 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e, _mask) ( \
- ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
- ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
- )
+static inline uint16
+pglz_hist_idx(const unsigned char *s, uint16 mask)
+{
+ /*
+ * Note that we only calculate function at the beginning of compressed data.
+ * Further hash valued are computed by subtracting prev byte and adding next.
+ * For details see pglz_hist_add().
+ */
+ return ((s[0] << 6) ^ (s[1] << 4) ^ (s[2] << 2) ^ s[3]) & mask;
+}
/* ----------
* pglz_hist_add -
*
- * Adds a new entry to the history table.
- *
- * If _recycle is true, then we are recycling a previously used entry,
- * and must first delink it from its old hashcode's linked list.
- *
- * NOTE: beware of multiple evaluations of macro's arguments, and note that
- * _hn and _recycle are modified in the macro.
+ * Adds a new entry to the history table and recalculates hash value.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \
-do { \
- int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \
- int16 *__myhsp = &(_hs)[__hindex]; \
- PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
- if (_recycle) { \
- if (__myhe->prev == NULL) \
- (_hs)[__myhe->hindex] = __myhe->next - (_he); \
- else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
- } \
- __myhe->next = &(_he)[*__myhsp]; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
- __myhe->pos = (_s); \
- /* If there was an existing entry in this hash slot, link */ \
- /* this new entry to it. However, the 0th entry in the */ \
- /* entries table is unused, so we can freely scribble on it. */ \
- /* So don't bother checking if the slot was used - we'll */ \
- /* scribble on the unused entry if it was not, but that's */ \
- /* harmless. Avoiding the branch in this critical path */ \
- /* speeds this up a little bit. */ \
- /* if (*__myhsp != INVALID_ENTRY) */ \
- (_he)[(*__myhsp)].prev = __myhe; \
- *__myhsp = _hn; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
- (_hn) = 1; \
- (_recycle) = true; \
- } \
-} while (0)
+static inline int16
+pglz_hist_add(int16 hist_next, uint16 *hist_idx, const unsigned char *s, uint16 mask)
+{
+ int16 *my_hist_start = &hist_start[*hist_idx];
+ PGLZ_HistEntry *entry = &(hist_entries)[hist_next];
+ /*
+ * Initialize entry with a new value.
+ */
+ entry->next_id = *my_hist_start;
+ entry->hist_idx = *hist_idx;
+ entry->pos = s;
-/* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
-#define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
-do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
-} while (0)
+ /*
+ * Update linked list head index.
+ */
+ *my_hist_start = hist_next;
+ /*
+ * Recalculate hash value for next position. Remove current byte and add next.
+ */
+ *hist_idx = ((((*hist_idx) ^ (s[0] << 6)) << 2) ^ s[4]) & mask;
-/* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
-#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
-} while (0)
+ /*
+ * Shift history index.
+ */
+ hist_next++;
+ if (hist_next == PGLZ_HISTORY_SIZE)
+ {
+ hist_next = 1;
+ }
+ return hist_next;
+}
/* ----------
@@ -368,23 +329,47 @@ do { \
* appropriate control bit.
* ----------
*/
-#define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
-} while (0)
+static inline unsigned char *
+pglz_out_tag(unsigned char *dest_ptr, int32 match_len, int32 match_offset)
+{
+ if (match_len > 17)
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | 0x0f);
+ *(dest_ptr++) = (unsigned char) (((match_offset) & 0xff));
+ *(dest_ptr++) = (unsigned char) ((match_len) - 18);
+ }
+ else
+ {
+ *(dest_ptr++) = (unsigned char) ((((match_offset) & 0xf00) >> 4) | ((match_len) - 3));
+ *(dest_ptr++) = (unsigned char) ((match_offset) & 0xff);
+ }
+ return dest_ptr;
+}
+
+/* ----------
+ * pglz_compare -
+ *
+ * Compares two sequences of bytes and returns position of first mismatch.
+ * ----------
+ */
+static inline int32
+pglz_compare(int32 len, int32 len_bound, const unsigned char *input_pos,
+ const unsigned char *hist_pos)
+{
+ while (len <= len_bound - 4 && memcmp(input_pos, hist_pos, 4) == 0)
+ {
+ len += 4;
+ input_pos += 4;
+ hist_pos += 4;
+ }
+ while (len < len_bound && *input_pos == *hist_pos)
+ {
+ len++;
+ input_pos++;
+ hist_pos++;
+ }
+ return len;
+}
/* ----------
@@ -396,32 +381,40 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(int16 *hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop, int mask)
+pglz_find_match(uint16 hist_idx, const unsigned char *input, const unsigned char *end,
+ int *lenp, int *offp, int good_match, int good_drop)
{
- PGLZ_HistEntry *hent;
- int16 hentno;
+ PGLZ_HistEntry *hist_entry;
+ int16 *hist_entry_number;
int32 len = 0;
- int32 off = 0;
+ int32 offset = 0;
+ int32 cur_len = 0;
+ int32 len_bound = Min(end - input, PGLZ_MAX_MATCH);
+
+ hist_entry_number = &hist_start[hist_idx];
+ if (*hist_entry_number == INVALID_ENTRY)
+ return 0;
+
+ hist_entry = &hist_entries[*hist_entry_number];
+ if (hist_idx != hist_entry->hist_idx)
+ {
+ /*
+ * If current linked list head points to invalid entry then clear it
+ * to reduce the number of comparisons in future.
+ */
+ *hist_entry_number = INVALID_ENTRY;
+ return 0;
+ }
/*
* Traverse the linked history list until a good enough match is found.
*/
- hentno = hstart[pglz_hist_idx(input, end, mask)];
- hent = &hist_entries[hentno];
- while (hent != INVALID_ENTRY_PTR)
+ while (true)
{
- const char *ip = input;
- const char *hp = hent->pos;
- int32 thisoff;
- int32 thislen;
-
- /*
- * Stop if the offset does not fit into our tag anymore.
- */
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ const unsigned char *input_pos = input;
+ const unsigned char *hist_pos = hist_entry->pos;
+ const unsigned char *my_pos;
+ int32 cur_offset = input_pos - hist_pos;
/*
* Determine length of match. A better match must be larger than the
@@ -431,58 +424,62 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
* character by character comparison to know the exact position where
* the diff occurred.
*/
- thislen = 0;
if (len >= 16)
{
- if (memcmp(ip, hp, len) == 0)
+ if (memcmp(input_pos, hist_pos, len) == 0)
{
- thislen = len;
- ip += len;
- hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
- {
- thislen++;
- ip++;
- hp++;
- }
+ offset = cur_offset;
+ len = pglz_compare(len, len_bound, input_pos + len, hist_pos + len);
}
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+		/*
+		 * Start the search for short matches by comparing 4 bytes. We expect
+		 * the compiler to replace this fixed-length memcmp() with the optimal
+		 * 4-byte comparison it knows.
+		 */
+ if (memcmp(input_pos, hist_pos, 4) == 0)
{
- thislen++;
- ip++;
- hp++;
+ cur_len = pglz_compare(4, len_bound, input_pos + 4, hist_pos + 4);
+ if (cur_len > len)
+ {
+ len = cur_len;
+ offset = cur_offset;
+ }
}
}
- /*
- * Remember this match as the best (if it is)
- */
- if (thislen > len)
- {
- len = thislen;
- off = thisoff;
- }
-
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ my_pos = hist_entry->pos;
+ hist_entry = &hist_entries[hist_entry->next_id];
/*
- * Be happy with lesser good matches the more entries we visited. But
- * no point in doing calculation if we're at end of list.
+	 * If the current match length is good enough, stop iterating. Outdated
+	 * list links are not updated during insertion, so additional stop
+	 * conditions are needed to avoid following them: if a recycled entry
+	 * has a different hash, iteration stops; if it has the same hash, the
+	 * recycled cell would break the monotonicity of input stream positions,
+	 * which the last condition checks.
*/
- if (hent != INVALID_ENTRY_PTR)
+ if (len >= good_match || hist_idx != hist_entry->hist_idx || my_pos <= hist_entry->pos)
{
- if (len >= good_match)
- break;
- good_match -= (good_match * good_drop) / 100;
+ break;
}
+
+ /*
+ * Be happy with less good matches the more entries we visited.
+ */
+ good_match -= (good_match * good_drop) >> 7;
}
+	/*
+	 * The match found can be slightly longer than necessary; bound it by len_bound.
+	 */
+ len = Min(len, len_bound);
+
/*
* Return match information only if it results at least in one byte
* reduction.
@@ -490,7 +487,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
if (len > 2)
{
*lenp = len;
- *offp = off;
+ *offp = offset;
return 1;
}
@@ -509,26 +506,28 @@ int32
pglz_compress(const char *source, int32 slen, char *dest,
const PGLZ_Strategy *strategy)
{
- unsigned char *bp = (unsigned char *) dest;
- unsigned char *bstart = bp;
- int hist_next = 1;
- bool hist_recycle = false;
- const char *dp = source;
- const char *dend = source + slen;
- unsigned char ctrl_dummy = 0;
- unsigned char *ctrlp = &ctrl_dummy;
- unsigned char ctrlb = 0;
- unsigned char ctrl = 0;
+ unsigned char *dest_ptr = (unsigned char *) dest;
+ unsigned char *dest_start = dest_ptr;
+ uint16 hist_next = 1;
+ uint16 hist_idx;
+ const unsigned char *src_ptr = (const unsigned char *) source;
+ const unsigned char *src_end = (const unsigned char *) source + slen;
+ const unsigned char *compress_src_end = src_end - 4;
+ unsigned char control_dummy = 0;
+ unsigned char *control_ptr = &control_dummy;
+ unsigned char control_byte = 0;
+ unsigned char control_pos = 0;
bool found_match = false;
int32 match_len;
- int32 match_off;
+ int32 match_offset;
int32 good_match;
int32 good_drop;
int32 result_size;
int32 result_max;
int32 need_rate;
int hashsz;
- int mask;
+ uint16 mask;
+
/*
* Our fallback strategy is the default.
@@ -560,6 +559,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
else if (good_drop > 100)
good_drop = 100;
+ /* We use <<7 later to calculate actual drop, so align percents to 128 */
+ good_drop = good_drop * 128 / 100;
+
need_rate = strategy->min_comp_rate;
if (need_rate < 0)
need_rate = 0;
@@ -577,7 +579,9 @@ pglz_compress(const char *source, int32 slen, char *dest,
result_max = (slen / 100) * (100 - need_rate);
}
else
+ {
result_max = (slen * (100 - need_rate)) / 100;
+ }
/*
* Experiments suggest that these hash sizes work pretty well. A large
@@ -603,10 +607,21 @@ pglz_compress(const char *source, int32 slen, char *dest,
*/
memset(hist_start, 0, hashsz * sizeof(int16));
+ /*
+ * Initialize INVALID_ENTRY for stopping during lookup.
+ */
+ hist_entries[INVALID_ENTRY].pos = src_end;
+ hist_entries[INVALID_ENTRY].hist_idx = hashsz;
+
+ /*
+ * Calculate initial hash value.
+ */
+ hist_idx = pglz_hist_idx(src_ptr, mask);
+
/*
* Compress the source directly into the output buffer.
*/
- while (dp < dend)
+ while (src_ptr < compress_src_end)
{
/*
* If we already exceeded the maximum result size, fail.
@@ -615,7 +630,7 @@ pglz_compress(const char *source, int32 slen, char *dest,
* bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
* allow 4 slop bytes.
*/
- if (bp - bstart >= result_max)
+ if (dest_ptr - dest_start >= result_max)
return -1;
/*
@@ -624,27 +639,36 @@ pglz_compress(const char *source, int32 slen, char *dest,
* reasonably quickly when looking at incompressible input (such as
* pre-compressed data).
*/
- if (!found_match && bp - bstart >= strategy->first_success_by)
+ if (!found_match && dest_ptr - dest_start >= strategy->first_success_by)
return -1;
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
+ }
+
/*
* Try to find a match in the history
*/
- if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop, mask))
+ if (pglz_find_match(hist_idx, src_ptr, compress_src_end, &match_len,
+ &match_offset, good_match, good_drop))
{
/*
* Create the tag and add history entries for all matched
* characters.
*/
- pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ control_byte |= control_pos;
+ dest_ptr = pglz_out_tag(dest_ptr, match_len, match_offset);
while (match_len--)
{
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ src_ptr++;
}
found_match = true;
}
@@ -653,21 +677,47 @@ pglz_compress(const char *source, int32 slen, char *dest,
/*
* No match found. Copy one literal byte.
*/
- pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
- pglz_hist_add(hist_start, hist_entries,
- hist_next, hist_recycle,
- dp, dend, mask);
- dp++; /* Do not do this ++ in the line above! */
- /* The macro would do it four times - Jan. */
+ hist_next = pglz_hist_add(hist_next, &hist_idx, src_ptr, mask);
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ }
+ control_pos <<= 1;
+ }
+
+
+ while (src_ptr < src_end)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (dest_ptr - dest_start >= result_max)
+ return -1;
+
+ /*
+ * Refresh control byte if needed.
+ */
+ if ((control_pos & 0xff) == 0)
+ {
+ *(control_ptr) = control_byte;
+ control_ptr = (dest_ptr)++;
+ control_byte = 0;
+ control_pos = 1;
}
+ *(dest_ptr)++ = (unsigned char) (*src_ptr);
+ src_ptr++;
+ control_pos <<= 1;
}
/*
* Write out the last control byte and check that we haven't overrun the
* output size allowed by the strategy.
*/
- *ctrlp = ctrlb;
- result_size = bp - bstart;
+ *control_ptr = control_byte;
+ result_size = dest_ptr - dest_start;
if (result_size >= result_max)
return -1;
--
2.32.0 (Apple Git-132)
v9-0002-Fix-oversight-for-compression-of-last-4-bytes.patch (application/octet-stream)
From 124ce421d039831a0f8c9016047ca10b0da52162 Mon Sep 17 00:00:00 2001
From: "Andrey M. Borodin" <x4mmm@flight.local>
Date: Sun, 5 Feb 2023 10:28:04 -0800
Subject: [PATCH v9 2/2] Fix oversight for compression of last 4 bytes
---
src/common/pg_lzcompress.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index bc85327af0..aea262dad2 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -512,7 +512,7 @@ pglz_compress(const char *source, int32 slen, char *dest,
uint16 hist_idx;
const unsigned char *src_ptr = (const unsigned char *) source;
const unsigned char *src_end = (const unsigned char *) source + slen;
- const unsigned char *compress_src_end = src_end - 4;
+ const unsigned char *compress_src_end = src_end;
unsigned char control_dummy = 0;
unsigned char *control_ptr = &control_dummy;
unsigned char control_byte = 0;
--
2.32.0 (Apple Git-132)
On 2/5/23 19:36, Andrey Borodin wrote:
On Fri, Jan 6, 2023 at 10:02 PM Andrey Borodin <amborodin86@gmail.com> wrote:
Hello! Please find attached v8.
I got some interesting feedback from some patch users.
There was an oversight that frequently yielded results that are 1, 2, or
3 bytes longer than expected.
Looking closer I found that the correctness of the last 3-byte tail is
checked in two places. PFA a fix for this. Previously the compressed data
was correct, however in some cases a few bytes longer than the result of
the current pglz implementation.
Thanks. What were the consequences of the issue? Lower compression
ratio, or did we then fail to decompress the data (or would current pglz
implementation fail to decompress it)?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Feb 5, 2023 at 5:51 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
On 2/5/23 19:36, Andrey Borodin wrote:
On Fri, Jan 6, 2023 at 10:02 PM Andrey Borodin <amborodin86@gmail.com> wrote:
Hello! Please find attached v8.
I got some interesting feedback from some patch users.
There was an oversight that frequently yielded results that are 1, 2, or
3 bytes longer than expected.
Looking closer I found that the correctness of the last 3-byte tail is
checked in two places. PFA a fix for this. Previously the compressed data
was correct, however in some cases a few bytes longer than the result of
the current pglz implementation.
Thanks. What were the consequences of the issue? Lower compression
ratio, or did we then fail to decompress the data (or would current pglz
implementation fail to decompress it)?
The data was decompressed fine. But extension tests (Citus's columnar
engine) hard-coded a lot of compression ratio stuff.
And there is still 1 more test where the optimized version produces 1 byte
longer output. I'm trying to find it, but with no success yet.
There are known and documented cases when the optimized pglz version would
do so: the good_match threshold lowered without the exact 10% division, and
the memcmp() over 4 bytes. But even with both disabled, the 1-byte-longer
compression result persists... The problem is that the length changes after
deleting some data, so the compression of that particular sequence seems to
happen somewhere far away.
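To make the difference concrete, here is a throwaway sketch (not part of the patch; 128 and 10 are just the default strategy's match_size_good and match_size_drop, if I read the defaults right) that compares the exact percent-based drop of the current code with the >>7 approximation the patch uses after rescaling good_drop to a base of 128:

#include <stdio.h>

int
main(void)
{
	int		exact = 128;	/* good_match, current code */
	int		approx = 128;	/* good_match, patched code */
	int		good_drop = 10;	/* percent */
	int		drop128 = good_drop * 128 / 100;	/* = 12 after rescaling */
	int		i;

	for (i = 1; i <= 5; i++)
	{
		exact -= (exact * good_drop) / 100;	/* current: exact percent drop */
		approx -= (approx * drop128) >> 7;	/* patch: shift-based drop */
		printf("after %d history entries: exact=%d approx=%d\n",
			   i, exact, approx);
	}
	return 0;
}

The two thresholds drift apart by a few units after a handful of entries, which is enough to accept a slightly different match now and then. That explains the documented cases, but apparently not this last byte.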
It was funny at the beginning - to hunt for 1 byte. But the weekend is
ending, and it seems that byte slipped from me again...
Best regards, Andrey Borodin.
On 2/6/23 03:00, Andrey Borodin wrote:
On Sun, Feb 5, 2023 at 5:51 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
On 2/5/23 19:36, Andrey Borodin wrote:
On Fri, Jan 6, 2023 at 10:02 PM Andrey Borodin <amborodin86@gmail.com> wrote:
Hello! Please find attached v8.
I got some interesting feedback from some patch users.
There was an oversight that frequently yielded results that are 1, 2, or
3 bytes longer than expected.
Looking closer I found that the correctness of the last 3-byte tail is
checked in two places. PFA a fix for this. Previously the compressed data
was correct, however in some cases a few bytes longer than the result of
the current pglz implementation.
Thanks. What were the consequences of the issue? Lower compression
ratio, or did we then fail to decompress the data (or would current pglz
implementation fail to decompress it)?
The data was decompressed fine. But extension tests (Citus's columnar
engine) hard-coded a lot of compression ratio stuff.
OK. Not sure I'd blame the patch for these failures, as long as
the result is still correct and can be decompressed. I'm not aware of a
specification of what the compression must (not) produce.
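The only contract I'd personally check is the round trip, i.e. something like this quick sketch (my own throwaway harness, assuming it is built within the PostgreSQL tree so postgres_fe.h and common/pg_lzcompress.h are available, and assuming the in-tree pglz_decompress() signature with the check_complete flag):

#include "postgres_fe.h"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "common/pg_lzcompress.h"

/* Returns true when compress + decompress reproduces the input exactly. */
static bool
pglz_roundtrip_ok(const char *src, int32 slen)
{
	char	   *comp = malloc(PGLZ_MAX_OUTPUT(slen));
	char	   *back = malloc(slen);
	int32		clen;
	int32		rlen;
	bool		ok = true;

	clen = pglz_compress(src, slen, comp, PGLZ_strategy_always);
	if (clen >= 0)
	{
		rlen = pglz_decompress(comp, clen, back, slen, true);
		ok = (rlen == slen && memcmp(src, back, slen) == 0);
		printf("slen=%d clen=%d ok=%d\n", slen, clen, (int) ok);
	}
	free(comp);
	free(back);
	return ok;
}

Whether clen comes out a byte or two larger than what the current implementation produces wouldn't matter for this check.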
And there is still 1 more test where the optimized version produces 1 byte
longer output. I'm trying to find it, but with no success yet.
There are known and documented cases when the optimized pglz version would
do so: the good_match threshold lowered without the exact 10% division, and
the memcmp() over 4 bytes. But even with both disabled, the 1-byte-longer
compression result persists... The problem is that the length changes after
deleting some data, so the compression of that particular sequence seems to
happen somewhere far away.
It was funny at the beginning - to hunt for 1 byte. But the weekend is
ending, and it seems that byte slipped from me again...
I wonder what that means for the patch. I haven't investigated this at
all, but it seems as if the optimization means we fail to find a match,
producing a tad larger output. That may still be a good tradeoff, as
long as the output is correct (assuming it doesn't break some promise
regarding expected output).
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 6, 2023 at 11:57 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
I wonder what that means for the patch. I haven't investigated this at
all, but it seems as if the optimization means we fail to find a match,
producing a tad larger output. That may still be a good tradeoff, as
long as the output is correct (assuming it doesn't break some promise
regarding expected output).
Yes, the patch produces correct results, and does so faster. And it keeps
the compression ratio the same except for one odd case.
The only problem is that I do not understand _why_ it happens in that odd
case. And so far I have failed to extract the inputs/outputs of that odd
case, because it is buried under so many layers of abstraction and affects
only late stats.
Maybe the problem is not in compression at all...
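One thing I'm considering is a temporary shim that runs both compressors on every datum and dumps the first input where the sizes diverge. Rough sketch only - pglz_compress_old() is a hypothetical copy of the unpatched compressor kept around for the comparison, and this assumes a backend-only debugging build where elog() and palloc() are available:

static int32
pglz_compress_compare(const char *source, int32 slen, char *dest,
					  const PGLZ_Strategy *strategy)
{
	char	   *tmp = palloc(PGLZ_MAX_OUTPUT(slen));
	int32		len_new = pglz_compress(source, slen, dest, strategy);
	/* pglz_compress_old(): hypothetical copy of the unpatched compressor */
	int32		len_old = pglz_compress_old(source, slen, tmp, strategy);

	if (len_new != len_old)
	{
		/* Dump the raw input so the odd case can be replayed standalone. */
		FILE	   *f = fopen("/tmp/pglz_odd_case.bin", "wb");

		if (f)
		{
			fwrite(source, 1, slen, f);
			fclose(f);
		}
		elog(LOG, "pglz mismatch: slen=%d old=%d new=%d",
			 slen, len_old, len_new);
	}
	pfree(tmp);
	return len_new;
}

That would at least pin down the exact datum without having to dig through the columnar engine's layers.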
Best regards, Andrey Borodin.
Hi,
On 2023-02-05 10:36:39 -0800, Andrey Borodin wrote:
On Fri, Jan 6, 2023 at 10:02 PM Andrey Borodin <amborodin86@gmail.com> wrote:
Hello! Please find attached v8.
I got some interesting feedback from some patch users.
There was an oversight that frequently yielded results that are 1, 2, or
3 bytes longer than expected.
Looking closer I found that the correctness of the last 3-byte tail is
checked in two places. PFA a fix for this. Previously the compressed data
was correct, however in some cases a few bytes longer than the result of
the current pglz implementation.
This version fails on cfbot, due to address sanitizer:
https://cirrus-ci.com/task/4921632586727424
https://api.cirrus-ci.com/v1/artifact/task/4921632586727424/log/src/test/regress/log/initdb.log
performing post-bootstrap initialization ... =================================================================
==15991==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61e000002ee0 at pc 0x558e1b847b16 bp 0x7ffd35782f70 sp 0x7ffd35782f68
READ of size 1 at 0x61e000002ee0 thread T0
#0 0x558e1b847b15 in pglz_hist_add /tmp/cirrus-ci-build/src/common/pg_lzcompress.c:310
#1 0x558e1b847b15 in pglz_compress /tmp/cirrus-ci-build/src/common/pg_lzcompress.c:680
#2 0x558e1aa86ef0 in pglz_compress_datum /tmp/cirrus-ci-build/src/backend/access/common/toast_compression.c:65
#3 0x558e1aa87af2 in toast_compress_datum /tmp/cirrus-ci-build/src/backend/access/common/toast_internals.c:68
#4 0x558e1ac22989 in toast_tuple_try_compression /tmp/cirrus-ci-build/src/backend/access/table/toast_helper.c:234
#5 0x558e1ab6af24 in heap_toast_insert_or_update /tmp/cirrus-ci-build/src/backend/access/heap/heaptoast.c:197
#6 0x558e1ab4a2a6 in heap_update /tmp/cirrus-ci-build/src/backend/access/heap/heapam.c:3533
...
Independent of this failure, I'm worried about the cost/benefit analysis of a
pglz change that changes this much at once. It's quite hard to review.
Andres
On 2/7/23 21:18, Andres Freund wrote:
Hi,
On 2023-02-05 10:36:39 -0800, Andrey Borodin wrote:
On Fri, Jan 6, 2023 at 10:02 PM Andrey Borodin <amborodin86@gmail.com> wrote:
Hello! Please find attached v8.
I got some interesting feedback from some patch users.
There was an oversight that frequently yielded results that are 1, 2, or
3 bytes longer than expected.
Looking closer I found that the correctness of the last 3-byte tail is
checked in two places. PFA a fix for this. Previously the compressed data
was correct, however in some cases a few bytes longer than the result of
the current pglz implementation.
This version fails on cfbot, due to address sanitizer:
https://cirrus-ci.com/task/4921632586727424
https://api.cirrus-ci.com/v1/artifact/task/4921632586727424/log/src/test/regress/log/initdb.log
performing post-bootstrap initialization ... =================================================================
==15991==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61e000002ee0 at pc 0x558e1b847b16 bp 0x7ffd35782f70 sp 0x7ffd35782f68
READ of size 1 at 0x61e000002ee0 thread T0
#0 0x558e1b847b15 in pglz_hist_add /tmp/cirrus-ci-build/src/common/pg_lzcompress.c:310
#1 0x558e1b847b15 in pglz_compress /tmp/cirrus-ci-build/src/common/pg_lzcompress.c:680
#2 0x558e1aa86ef0 in pglz_compress_datum /tmp/cirrus-ci-build/src/backend/access/common/toast_compression.c:65
#3 0x558e1aa87af2 in toast_compress_datum /tmp/cirrus-ci-build/src/backend/access/common/toast_internals.c:68
#4 0x558e1ac22989 in toast_tuple_try_compression /tmp/cirrus-ci-build/src/backend/access/table/toast_helper.c:234
#5 0x558e1ab6af24 in heap_toast_insert_or_update /tmp/cirrus-ci-build/src/backend/access/heap/heaptoast.c:197
#6 0x558e1ab4a2a6 in heap_update /tmp/cirrus-ci-build/src/backend/access/heap/heapam.c:3533
...
Yeah, and valgrind seems to hit the same issue (it's not labeled as
buffer overflow, but it seems to be exactly the same place):
==380682== Invalid read of size 1
==380682== at 0xBCEAAB: pglz_hist_add (pg_lzcompress.c:310)
==380682== by 0xBCF130: pglz_compress (pg_lzcompress.c:670)
==380682== by 0x4A911F: pglz_compress_datum (toast_compression.c:65)
==380682== by 0x4A97E2: toast_compress_datum (toast_internals.c:68)
==380682== by 0x54CCA4: toast_tuple_try_compression (toast_helper.c:234)
==380682== by 0x4FFC33: heap_toast_insert_or_update (heaptoast.c:197)
==380682== by 0x4ED498: heap_update (heapam.c:3624)
==380682== by 0x4EE023: simple_heap_update (heapam.c:4060)
==380682== by 0x5B1B2B: CatalogTupleUpdateWithInfo (indexing.c:329)
==380682== by 0x65C3AB: update_attstats (analyze.c:1741)
==380682== by 0x65A054: do_analyze_rel (analyze.c:602)
==380682== by 0x659405: analyze_rel (analyze.c:261)
==380682== by 0x70A162: vacuum (vacuum.c:523)
==380682== by 0x8DF8F7: autovacuum_do_vac_analyze (autovacuum.c:3155)
==380682== by 0x8DE74A: do_autovacuum (autovacuum.c:2473)
==380682== by 0x8DD49E: AutoVacWorkerMain (autovacuum.c:1716)
==380682== by 0x8DD097: StartAutoVacWorker (autovacuum.c:1494)
==380682== by 0x8EA5B2: StartAutovacuumWorker (postmaster.c:5481)
==380682== by 0x8EA10A: process_pm_pmsignal (postmaster.c:5192)
==380682== by 0x8E6121: ServerLoop (postmaster.c:1770)
==380682== Address 0xe722c78 is 103,368 bytes inside a recently
re-allocated block of size 131,072 alloc'd
==380682== at 0x48457AB: malloc (vg_replace_malloc.c:393)
==380682== by 0xB95423: AllocSetAlloc (aset.c:929)
==380682== by 0xBA2B6C: palloc (mcxt.c:1224)
==380682== by 0x4A0962: heap_copytuple (heaptuple.c:687)
==380682== by 0x73A2BB: tts_buffer_heap_copy_heap_tuple
(execTuples.c:842)
==380682== by 0x658E42: ExecCopySlotHeapTuple (tuptable.h:464)
==380682== by 0x65B288: acquire_sample_rows (analyze.c:1261)
==380682== by 0x659E42: do_analyze_rel (analyze.c:536)
==380682== by 0x659405: analyze_rel (analyze.c:261)
==380682== by 0x70A162: vacuum (vacuum.c:523)
==380682== by 0x8DF8F7: autovacuum_do_vac_analyze (autovacuum.c:3155)
==380682== by 0x8DE74A: do_autovacuum (autovacuum.c:2473)
==380682== by 0x8DD49E: AutoVacWorkerMain (autovacuum.c:1716)
==380682== by 0x8DD097: StartAutoVacWorker (autovacuum.c:1494)
==380682== by 0x8EA5B2: StartAutovacuumWorker (postmaster.c:5481)
==380682== by 0x8EA10A: process_pm_pmsignal (postmaster.c:5192)
==380682== by 0x8E6121: ServerLoop (postmaster.c:1770)
==380682== by 0x8E5B54: PostmasterMain (postmaster.c:1463)
==380682== by 0x7A806C: main (main.c:200)
==380682==
The place allocating the buffer changes over time, but the first part
(invalid read) seems to be exactly the same.
FWIW I did run previous versions using valgrind, so this has to be due to
some recent change.
Independent of this failure, I'm worried about the cost/benefit analysis of a
pglz change that changes this much at once. It's quite hard to review.
I agree.
I think I managed to understand what the patch does during the review,
but it was much harder than it should have been - it'd definitely be better
to have this split into smaller parts, somehow. Interestingly enough the
commit message actually says this:
This patch accumulates several changes to pglz compression:
1. Convert macro-functions to regular functions for readability
2. Use more compact hash table with uint16 indexes instead of pointers
3. Avoid prev pointer in hash table
4. Use 4-byte comparisons during a search instead of 1-byte
comparisons
Which I think is a pretty good recipe for how to split the patch. (And we
also need a better commit message, or at least a proposal.)
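For anyone who hasn't read the whole diff, items 2 and 3 roughly boil down to something like this (field names taken from the posted patch, but the exact declarations are from memory, so treat it as a sketch rather than the literal code):

/*
 * History entries are linked by 16-bit indexes into a fixed array instead
 * of next/prev pointers, and each entry remembers its hash slot so a stale
 * link can be detected without maintaining back-pointers.
 */
typedef struct PGLZ_HistEntry
{
	int16		next_id;	/* next entry with the same hash, or INVALID_ENTRY */
	uint16		hist_idx;	/* hash slot this entry was filed under */
	const unsigned char *pos;	/* position of the entry in the input */
} PGLZ_HistEntry;

static int16	hist_start[PGLZ_MAX_HISTORY_LISTS];	/* list heads, by hash */
static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];

That also explains the extra stop conditions in pglz_find_match(): without prev pointers, recycled entries can't be unlinked eagerly, so the lookup has to detect stale links by checking hist_idx and the monotonicity of positions.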
This'd probably also help when investigating the extra byte issue,
discussed yesterday. (Assuming it's not related to the invalid access
reported by valgrind / asan).
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2023-02-08 11:16:47 +0100, Tomas Vondra wrote:
On 2/7/23 21:18, Andres Freund wrote:
Independent of this failure, I'm worried about the cost/benefit analysis of a
pglz change that changes this much at once. It's quite hard to review.
I agree.
I think I managed to understand what the patch does during the review,
but it was much harder than it should have been - it'd definitely be better
to have this split into smaller parts, somehow. Interestingly enough the
commit message actually says this:
This patch accumulates several changes to pglz compression:
1. Convert macro-functions to regular functions for readability
2. Use more compact hash table with uint16 indexes instead of pointers
3. Avoid prev pointer in hash table
4. Use 4-byte comparisons during a search instead of 1-byte
comparisons
Which I think is a pretty good recipe for how to split the patch. (And we
also need a better commit message, or at least a proposal.)
This'd probably also help when investigating the extra byte issue,
discussed yesterday. (Assuming it's not related to the invalid access
reported by valgrind / asan).
Due to the sanitizer changes, and this feedback, I'm marking the entry as
waiting on author.
Greetings,
Andres Freund
On Mon, Feb 13, 2023 at 4:03 PM Andres Freund <andres@anarazel.de> wrote:
Due to the sanitizer changes, and this feedback, I'm marking the entry as
waiting on author.
Thanks Andres! Yes, I plan to make another attempt to refactor this
patch on the weekend. If this attempt fails, I think we should just
reject it and I'll get back to this during summer.
Best regards, Andrey Borodin.