More work on SortSupport for text - strcoll() and strxfrm() caching
Since apparently we're back to development work, I thought it was time
to share a patch implementing a few additional simple tricks to make
sorting text under a non-C locale even faster than in 9.5. These
techniques are mostly effective when values are physically clustered
together. This might be because there is a physical/logical
correlation, but cases involving any kind of clustering of values are
helped significantly.
Caching
======
The basic idea is that we cache strxfrm() blobs. Separately, we
exploit temporal locality and clustering of values by caching the
result of the most recent strcoll()-resolved comparison performed. The
strxfrm() technique helps a lot with low cardinality single attribute
sorts if we can avoid most strxfrm() work. On the other hand,
strcoll() comparison caching particularly helps with multi-attribute
sorts where there is a low to moderate cardinality secondary attribute
and low cardinality leading attribute. The master branch will still
opportunistically take the equality memcmp() fastpath plenty of times
for that second attribute, but there are no abbreviated keys to help
when that doesn't work out (because it isn't the leading attribute).
Regressions
==========
The patch still helps with strcoll() avoidance when the ordering of a
moderate cardinality attribute is totally random, but it helps much
less there. I have not seen a regression for any case. I'm expecting
someone to ask me to do something with the program I wrote last year,
to prove the opportunistic memcmp() equality fastpath for text is
"free" [1]/messages/by-id/5415A843.3070602@vmware.com. This patch has exactly the same tension as last year's
memcmp() equality one [2]Commit e246b3d6: I add something opportunistic, that in
general might consistently not work out at all in some cases, and on
the face of it implies extra costs for those cases -- costs which must
be paid every single time. So as with the opportunistic memcmp()
equality thing, the *actual* overhead for cases that do not benefit
must be virtually zero for the patch to be worthwhile. That is the
standard that I expect that this patch will be held to, too.
Benchmark
=========
The query that I've been trying this out with is a typical rollup
query, using my "cities" sample data [3]http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/data/cities.dump (this is somewhat, although
not perfectly correlated on (country, province) before sorting):
postgres=# select country, province, count(*) from cities group by
rollup (country, province);
country | province
| count
-------------------------------------+---------------------------------------+--------
Afghanistan | Badaẖšan
| 5
Afghanistan | Bādgīs
| 2
Afghanistan | Baġlān
| 5
Afghanistan | Balẖ
| 6
Afghanistan | Bāmiyān
| 3
Afghanistan | Farāh
| 3
Afghanistan | Fāryāb
| 4
Afghanistan | Ġawr
| 3
*** SNIP *****
...
Zimbabwe | Manicaland
| 22
Zimbabwe | Mashonaland Central
| 13
Zimbabwe | Mashonaland East
| 9
Zimbabwe | Mashonaland West
| 21
Zimbabwe | Masvingo
| 11
Zimbabwe | Matabeleland North
| 8
Zimbabwe | Matabeleland South
| 14
Zimbabwe | Midlands
| 14
Zimbabwe | [null]
| 116
[null] | [null]
| 317102
(3529 rows)
With master, this takes about 525ms when it stabilizes after a few
runs on my laptop. With the patch, it takes about 405ms. That's almost
a 25% reduction in total run time. If I perform a more direct test of
sort performance against this data with minimal non-sorting overhead,
I see a reduction of as much as 30% in total query runtime (I chose
this rollup query because it is obviously representative of the real
world).
If this data is *perfectly* correlated (e.g. because someone ran
CLUSTER) and some sort can use the dubious "bubble sort best case"
path [4]Commit a3f0b3d6 -- Peter Geoghegan that we added to qsort back in 2006, the improvement still
hold up at ~20%, I've found.
Performance of the "C" locale
---------------------------------------
For this particular rollup query, my 25% improvement leaves the
collated text sort perhaps marginally faster than an equivalent query
that uses the "C" locale (with or without the patch applied). It's
hard to be sure that that effect is real -- many trials are needed --
but it's reasonable to speculate that it's possible to sometimes beat
the "C" locale because of factors like final abbreviated key
cardinality.
It's easy to *contrive* a case where the "C" locale is beaten even
with 9.5 -- just sort a bunch of strings (that are abbreviated), that
look something like this:
"``..,,``..AAA"
"``..,,``..CCC"
"``..,,``..ZZZ"
"``..,,``..BBB"
Anyway, this avoidance of strxfrm() work I've introduced makes it
possible that abbreviated keys could make a strxfrm() locale-based
sort beat the "C" locale fair-and-square with a realistic dataset and
specific realistic query. That would be pretty nice, because that
can't be too far from optimal, and these cases are not uncommon.
A further idea -- unsigned integer comparisons
===================================
I've also changed text abbreviated keys to compare as unsigned
integers. On my Thinkpad laptop (which, of course, has an Intel CPU),
this makes a noticeable difference. memcmp() may be fast, but an
unsigned integer comparison is even faster (not sure if a big-endian
machine can have the existing memcmp() call optimized away, so that
effectively the same thing happens automatically).
Maybe other platforms benefit less, but it's very hard to imagine it
ever costing us anything.
[1]: /messages/by-id/5415A843.3070602@vmware.com
[2]: Commit e246b3d6
[3]: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/data/cities.dump
[4]: Commit a3f0b3d6 -- Peter Geoghegan
--
Peter Geoghegan
Attachments:
0002-Add-several-text-sorting-optimizations.patchtext/x-patch; charset=US-ASCII; name=0002-Add-several-text-sorting-optimizations.patchDownload
From 1528a27f9a8331003a82cd645aabd876882b85ca Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Fri, 3 Jul 2015 13:48:08 -0700
Subject: [PATCH 2/2] Add several text sorting optimizations
Cache the results of strxfrm() calls made by the text abbreviation
routine, and strcoll() calls made by the text comparison routine (text
SortSupport authoritative comparator only).
Testing shows that the cost of doing this is immeasurably small in cases
where the optimization does not help, while the benefits in cases where
the optimization is effective can be quite significant.
Separately, arrange for text abbreviated keys to be represented as
unsigned integers, rather than a char array. This is somewhat faster on
at least some platforms, since integer comparisons are now used rather
than memcmp() during sorting proper.
---
src/backend/utils/adt/varlena.c | 118 +++++++++++++++++++++++++++++++++++-----
1 file changed, 105 insertions(+), 13 deletions(-)
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 2fbbf54..7a4eed7 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -61,6 +61,9 @@ typedef struct
char *buf2; /* 2nd string, or abbreviation strxfrm() buf */
int buflen1;
int buflen2;
+ int last_len1; /* Length of last buf1 string/strxfrm() blob */
+ int last_len2; /* Length of last buf2 string/strxfrm() blob */
+ int last_returned; /* Last comparison result (cache) */
bool collate_c;
hyperLogLogState abbr_card; /* Abbreviated key cardinality state */
hyperLogLogState full_card; /* Full key cardinality state */
@@ -71,6 +74,23 @@ typedef struct
} TextSortSupport;
/*
+ * Macro generating new representation of char array packed into Datum, making
+ * it a sizeof(Datum)-wide uint (the C type Datum is an unsigned integer).
+ * Comparisons of this representation should behave equivalently to a memcmp()
+ * of the Datum's entire original contents. This offers a measurable
+ * performance boost on some systems.
+ */
+#ifdef WORDS_BIGENDIAN
+#define string_uint(x) (x)
+#else /* !WORDS_BIGENDIAN */
+#if SIZEOF_DATUM == 8
+#define string_uint(x) BSWAP64(x)
+#else /* SIZEOF_DATUM != 8 */
+#define string_uint(x) BSWAP32(x)
+#endif /* SIZEOF_DATUM == 8 */
+#endif /* WORDS_BIGENDIAN */
+
+/*
* This should be large enough that most strings will fit, but small enough
* that we feel comfortable putting it on the stack
*/
@@ -1831,9 +1851,20 @@ btsortsupport_worker(SortSupport ssup, Oid collid)
tss->buflen1 = TEXTBUFLEN;
tss->buf2 = palloc(TEXTBUFLEN);
tss->buflen2 = TEXTBUFLEN;
+ /* Start with invalid values */
+ tss->last_len1 = -1;
+ tss->last_len2 = -1;
#ifdef HAVE_LOCALE_T
tss->locale = locale;
#endif
+ /*
+ * To avoid somehow confusing a strxfrm() blob and an original string
+ * within bttextfastcmp_locale(), initialize last returned cache to a
+ * sentinel value. A platform-specific actual strcoll() return value
+ * of INT_MIN seems unlikely, but if that occurs it will not cause
+ * wrong answers.
+ */
+ tss->last_returned = INT_MIN;
tss->collate_c = collate_c;
ssup->ssup_extra = tss;
@@ -1896,6 +1927,7 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
{
text *arg1 = DatumGetTextPP(x);
text *arg2 = DatumGetTextPP(y);
+ bool arg1_match;
TextSortSupport *tss = (TextSortSupport *) ssup->ssup_extra;
/* working state */
@@ -1914,6 +1946,11 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
/* Fast pre-check for equality, as discussed in varstr_cmp() */
if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
{
+ /*
+ * No change in buf1 or buf2 contents, so avoid changing last_len1 or
+ * last_len2. Existing contents of buffers might still be used by next
+ * call.
+ */
result = 0;
goto done;
}
@@ -1931,10 +1968,50 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
tss->buf2 = MemoryContextAlloc(ssup->ssup_cxt, tss->buflen2);
}
- memcpy(tss->buf1, a1p, len1);
- tss->buf1[len1] = '\0';
- memcpy(tss->buf2, a2p, len2);
- tss->buf2[len2] = '\0';
+ /*
+ * Attempt to re-use buffers across calls. Also, avoid work in the event
+ * of clustered together identical items by exploiting temporal locality.
+ * This works well with divide-and-conquer, comparison-based sorts like
+ * quicksort and mergesort.
+ *
+ * With quicksort, there is, in general, a pretty strong chance that the
+ * same buffer contents can be used repeatedly for pivot items -- early
+ * pivot items will account for large number of total comparisons, since
+ * they must be compared against many (possibly all other) items.
+ */
+ arg1_match = true;
+ if (len1 != tss->last_len1 || memcmp(tss->buf1, a1p, len1) != 0)
+ {
+ arg1_match = false;
+ memcpy(tss->buf1, a1p, len1);
+ tss->buf1[len1] = '\0';
+ tss->last_len1 = len1;
+ }
+
+ /*
+ * While it is worth going to the trouble of trying to re-use buffer
+ * contents across calls, ideally that will lead to entirely avoiding a
+ * strcoll() call by using a cached return value.
+ *
+ * This optimization can work well again and again for the same set of
+ * clustered together identical attributes; as they're relocated to new
+ * subpartitions, only one strcoll() is required for each pivot (in respect
+ * of that clump of identical values, at least). Similarly, the final
+ * N-way merge of a mergesort can be effectively accelerated if each run
+ * has its own locally clustered values.
+ */
+ if (len2 != tss->last_len2 || memcmp(tss->buf2, a2p, len2) != 0)
+ {
+ memcpy(tss->buf2, a2p, len2);
+ tss->buf2[len2] = '\0';
+ tss->last_len2 = len2;
+ }
+ else if (arg1_match && tss->last_returned != INT_MIN)
+ {
+ /* Use result cached following last actual strcoll() call */
+ result = tss->last_returned;
+ goto done;
+ }
#ifdef HAVE_LOCALE_T
if (tss->locale)
@@ -1951,6 +2028,8 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
if (result == 0)
result = strcmp(tss->buf1, tss->buf2);
+ /* Cache result value, perhaps saving a costly strcoll() in next call */
+ tss->last_returned = result;
done:
/* We can't afford to leak memory here. */
if (PointerGetDatum(arg1) != x)
@@ -1967,25 +2046,24 @@ done:
static int
bttextcmp_abbrev(Datum x, Datum y, SortSupport ssup)
{
- char *a = (char *) &x;
- char *b = (char *) &y;
- int result;
-
- result = memcmp(a, b, sizeof(Datum));
-
/*
- * When result = 0, the core system will call bttextfastcmp_c() or
+ * When 0 is returned, the core system will call bttextfastcmp_c() or
* bttextfastcmp_locale(). Even a strcmp() on two non-truncated strxfrm()
* blobs cannot indicate *equality* authoritatively, for the same reason
* that there is a strcoll() tie-breaker call to strcmp() in varstr_cmp().
*/
- return result;
+ if (x > y)
+ return 1;
+ else if (x == y)
+ return 0;
+ else
+ return -1;
}
/*
* Conversion routine for sortsupport. Converts original text to abbreviated
* key representation. Our encoding strategy is simple -- pack the first 8
- * bytes of a strxfrm() blob into a Datum.
+ * bytes of a strxfrm() blob into a Datum, and treat it as an unsigned integer.
*/
static Datum
bttext_abbrev_convert(Datum original, SortSupport ssup)
@@ -2033,9 +2111,19 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
tss->buf1 = palloc(tss->buflen1);
}
+ /* Might be able to reuse strxfrm() blob from last call */
+ if (tss->last_len1 == len &&
+ memcmp(tss->buf1, authoritative_data, len) == 0)
+ {
+ memcpy(pres, tss->buf2, Min(sizeof(Datum), tss->last_len2));
+ /* No change affecting cardinality, so no hashing required */
+ goto done;
+ }
+
/* Just like strcoll(), strxfrm() expects a NUL-terminated string */
memcpy(tss->buf1, authoritative_data, len);
tss->buf1[len] = '\0';
+ tss->last_len1 = len;
for (;;)
{
@@ -2047,6 +2135,7 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
#endif
bsize = strxfrm(tss->buf2, tss->buf1, tss->buflen2);
+ tss->last_len2 = bsize;
if (bsize < tss->buflen2)
break;
@@ -2104,6 +2193,9 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
addHyperLogLog(&tss->abbr_card, hash);
+done:
+ /* Represent Datum packed with string as uint */
+ res = string_uint(res);
/* Don't leak memory here */
if (PointerGetDatum(authoritative) != original)
pfree(authoritative);
--
1.9.1
0001-Add-BSWAP64-byte-swapping-macro.patchtext/x-patch; charset=US-ASCII; name=0001-Add-BSWAP64-byte-swapping-macro.patchDownload
From 10ad5a5975306e2ed9ba7f98a116536d606444b3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Mon, 29 Jun 2015 23:53:05 -0400
Subject: [PATCH 1/2] Add BSWAP64() byte-swapping macro
This includes detection of __builtin_bswap64() built-in availability,
which is preferred. BSWAP64() can be used to convert big-endian
unsigned integers to little-endian unsigned integers, and vice-versa.
BSWAP32() is moved to c.h, alongside the new BSWAP64() macro.
---
config/c-compiler.m4 | 18 ++++++++++++++++++
configure | 24 ++++++++++++++++++++++++
configure.in | 1 +
src/include/c.h | 23 +++++++++++++++++++++++
src/include/pg_config.h.in | 3 +++
src/include/pg_config.h.win32 | 3 +++
src/include/port/pg_crc32c.h | 10 ----------
7 files changed, 72 insertions(+), 10 deletions(-)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 050bfa5..f1b7045 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -248,6 +248,24 @@ fi])# PGAC_C_BUILTIN_BSWAP32
+# PGAC_C_BUILTIN_BSWAP64
+# -------------------------
+# Check if the C compiler understands __builtin_bswap64(),
+# and define HAVE__BUILTIN_BSWAP64 if so.
+AC_DEFUN([PGAC_C_BUILTIN_BSWAP64],
+[AC_CACHE_CHECK(for __builtin_bswap64, pgac_cv__builtin_bswap64,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static unsigned long int x = __builtin_bswap64(0xaabbccdd);]
+)],
+[pgac_cv__builtin_bswap64=yes],
+[pgac_cv__builtin_bswap64=no])])
+if test x"$pgac_cv__builtin_bswap64" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_BSWAP64, 1,
+ [Define to 1 if your compiler understands __builtin_bswap64.])
+fi])# PGAC_C_BUILTIN_BSWAP64
+
+
+
# PGAC_C_BUILTIN_CONSTANT_P
# -------------------------
# Check if the C compiler understands __builtin_constant_p(),
diff --git a/configure b/configure
index 0407c4f..b921d63 100755
--- a/configure
+++ b/configure
@@ -10481,6 +10481,30 @@ if test x"$pgac_cv__builtin_bswap32" = xyes ; then
$as_echo "#define HAVE__BUILTIN_BSWAP32 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_bswap64" >&5
+$as_echo_n "checking for __builtin_bswap64... " >&6; }
+if ${pgac_cv__builtin_bswap64+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+static unsigned long int x = __builtin_bswap64(0xaabbccdd);
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+ pgac_cv__builtin_bswap64=yes
+else
+ pgac_cv__builtin_bswap64=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_bswap64" >&5
+$as_echo "$pgac_cv__builtin_bswap64" >&6; }
+if test x"$pgac_cv__builtin_bswap64" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_BSWAP64 1" >>confdefs.h
+
+fi
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_constant_p" >&5
$as_echo_n "checking for __builtin_constant_p... " >&6; }
if ${pgac_cv__builtin_constant_p+:} false; then :
diff --git a/configure.in b/configure.in
index 1de41a2..77cb477 100644
--- a/configure.in
+++ b/configure.in
@@ -1237,6 +1237,7 @@ PGAC_C_FUNCNAME_SUPPORT
PGAC_C_STATIC_ASSERT
PGAC_C_TYPES_COMPATIBLE
PGAC_C_BUILTIN_BSWAP32
+PGAC_C_BUILTIN_BSWAP64
PGAC_C_BUILTIN_CONSTANT_P
PGAC_C_BUILTIN_UNREACHABLE
PGAC_C_VA_ARGS
diff --git a/src/include/c.h b/src/include/c.h
index 92c5202..9874e39 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -963,6 +963,29 @@ typedef NameData *Name;
#define STATUS_FOUND (1)
#define STATUS_WAITING (2)
+/* byte swapping macros */
+#ifdef HAVE__BUILTIN_BSWAP32
+#define BSWAP32(x) __builtin_bswap32(x)
+#else
+#define BSWAP32(x) (((x << 24) & 0xff000000) | \
+ ((x << 8) & 0x00ff0000) | \
+ ((x >> 8) & 0x0000ff00) | \
+ ((x >> 24) & 0x000000ff))
+#endif /* HAVE__BUILTIN_BSWAP32 */
+
+#ifdef HAVE__BUILTIN_BSWAP64
+#define BSWAP64(x) __builtin_bswap64(x)
+#else
+#define BSWAP64(x) (StaticAssertExpr(sizeof(x) == 8, "argument not 8 bytes"), \
+ ( (x << 56) & 0xff00000000000000UL ) | \
+ ( (x << 40) & 0x00ff000000000000UL ) | \
+ ( (x << 24) & 0x0000ff0000000000UL ) | \
+ ( (x << 8) & 0x000000ff00000000UL ) | \
+ ( (x >> 8) & 0x00000000ff000000UL ) | \
+ ( (x >> 24) & 0x0000000000ff0000UL ) | \
+ ( (x >> 40) & 0x000000000000ff00UL ) | \
+ ( (x >> 56) & 0x00000000000000ffUL ))
+#endif /* HAVE__BUILTIN_BSWAP64 */
/*
* Append PG_USED_FOR_ASSERTS_ONLY to definitions of variables that are only
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 5688f75..90c0e5f 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -666,6 +666,9 @@
/* Define to 1 if your compiler understands __builtin_bswap32. */
#undef HAVE__BUILTIN_BSWAP32
+/* Define to 1 if your compiler understands __builtin_bswap64. */
+#undef HAVE__BUILTIN_BSWAP64
+
/* Define to 1 if your compiler understands __builtin_constant_p. */
#undef HAVE__BUILTIN_CONSTANT_P
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index 22bbb91..4d484d1 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -520,6 +520,9 @@
/* Define to 1 if your compiler understands __builtin_bswap32. */
/* #undef HAVE__BUILTIN_BSWAP32 */
+/* Define to 1 if your compiler understands __builtin_bswap64. */
+/* #undef HAVE__BUILTIN_BSWAP64 */
+
/* Define to 1 if your compiler understands __builtin_constant_p. */
/* #undef HAVE__BUILTIN_CONSTANT_P */
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index c925c56..e5389aa 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -71,16 +71,6 @@ extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len)
#define COMP_CRC32C(crc, data, len) \
((crc) = pg_comp_crc32c_sb8((crc), (data), (len)))
#ifdef WORDS_BIGENDIAN
-
-#ifdef HAVE__BUILTIN_BSWAP32
-#define BSWAP32(x) __builtin_bswap32(x)
-#else
-#define BSWAP32(x) (((x << 24) & 0xff000000) | \
- ((x << 8) & 0x00ff0000) | \
- ((x >> 8) & 0x0000ff00) | \
- ((x >> 24) & 0x000000ff))
-#endif
-
#define FIN_CRC32C(crc) ((crc) = BSWAP32(crc) ^ 0xFFFFFFFF)
#else
#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
--
1.9.1
On Fri, Jul 3, 2015 at 8:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
Since apparently we're back to development work, I thought it was time
to share a patch implementing a few additional simple tricks to make
sorting text under a non-C locale even faster than in 9.5. These
techniques are mostly effective when values are physically clustered
together. This might be because there is a physical/logical
correlation, but cases involving any kind of clustering of values are
helped significantly.
Interesting work.
Some comments:
1. My biggest gripe with this patch is that the comments are not easy
to understand. For example:
+ * Attempt to re-use buffers across calls. Also, avoid work in the event
+ * of clustered together identical items by exploiting temporal locality.
+ * This works well with divide-and-conquer, comparison-based sorts like
+ * quicksort and mergesort.
+ *
+ * With quicksort, there is, in general, a pretty strong chance that the
+ * same buffer contents can be used repeatedly for pivot items -- early
+ * pivot items will account for large number of total comparisons, since
+ * they must be compared against many (possibly all other) items.
Well, what I would have written is something like: "We're likely to be
asked to compare the same strings repeatedly, and memcmp() is so much
cheaper than memcpy() that it pays to attempt a memcmp() in the hopes
of avoiding a memcpy(). This doesn't seem to slow things down
measurably even if it doesn't work out very often."
+ * While it is worth going to the trouble of trying to re-use buffer
+ * contents across calls, ideally that will lead to entirely avoiding a
+ * strcoll() call by using a cached return value.
+ *
+ * This optimization can work well again and again for the same set of
+ * clustered together identical attributes; as they're relocated to new
+ * subpartitions, only one strcoll() is required for each pivot (in respect
+ * of that clump of identical values, at least). Similarly, the final
+ * N-way merge of a mergesort can be effectively accelerated if each run
+ * has its own locally clustered values.
And here I would have written something like: "If we're comparing the
same two strings that we compared last time, we can return the same
answer without calling strcoll() again. This is more likely than it
seems, because quicksort compares the same pivot against many values,
and some of those values might be duplicates."
Of course everybody may prefer something different here; I'm just
telling you what I think.
2. I believe the change to bttextcmp_abbrev() should be pulled out
into a separate patch and committed separately. That part seems like
a slam dunk.
3. What is the worst case for the strxfrm()-reuse stuff? I suppose
it's the case where we have many strings that are long, all
equal-length, and all different, but only in the last few characters.
Then the memcmp() is as expensive as possible but never works out.
How does the patch hold up in that case?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Aug 4, 2015 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Interesting work.
Thanks.
1. My biggest gripe with this patch is that the comments are not easy
to understand.
Of course everybody may prefer something different here; I'm just
telling you what I think.
I have struggled with trying to put just the right amount of
exposition on the theory behind a particular optimization in source
code comments, and things like that. Since no one is telling me that I
need to write more, clearly I don't have the balance right yet. To a
certain extent, it is a matter of personal style, but I'll try and be
more terse.
2. I believe the change to bttextcmp_abbrev() should be pulled out
into a separate patch and committed separately. That part seems like
a slam dunk.
Makes sense.
3. What is the worst case for the strxfrm()-reuse stuff? I suppose
it's the case where we have many strings that are long, all
equal-length, and all different, but only in the last few characters.
Then the memcmp() is as expensive as possible but never works out.
How does the patch hold up in that case?
I haven't tested it. I'll get around to it at some point in the next
couple of weeks. I imagine that it's exactly the same as the memcmp()
equality thing because of factors like speculative execution, and the
fact that we need both strings in cache anyway. It's almost exactly
the same story, although unlike the memcmp() opportunistic equality
pre-check thing, this check happens only n times, not n log n times.
I'm quite sure that the cost needs to be virtually zero to go ahead
with the idea. I think it probably is. Note that like the memcmp()
thing, we check string length first, before a memcmp().
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Aug 4, 2015 at 1:30 PM, Peter Geoghegan <pg@heroku.com> wrote:
2. I believe the change to bttextcmp_abbrev() should be pulled out
into a separate patch and committed separately. That part seems like
a slam dunk.Makes sense.
BTW, I want to put the string_uint() macro in a common header now. It
can be used for other types. I've written a SortSupport + abbreviated
keys patch for the UUID type which will use it, too, so that it too
can use simple unsigned integer comparisons within its abbreviated
comparator. I haven't posted the UUID patch yet only because everyone
is up to their ears in my sorting patches.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Aug 4, 2015 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Some comments:
I attach a new version of the patch series that incorporates all your
feedback. The patch series is now made cumulative in a way that makes
it easy for someone to independently commit the unsigned integer
comparison optimization for text, and nothing else. The macro that
uses is in a dedicated header now, because I have another patch
(SortSupport for the UUID type) that uses the same optimization for
the same reason. It seems like something that will probably end up
with a third or forth client before too long, so I think the byte swap
macro wrapper belongs in sortsupport.h.
BTW, I think that in practice the merge phase of a tape sort isn't
much helped by comparison caching, contrary to comments appearing in
the original version. The heap data structure used by polyphase merge
has bad properties around locality (both temporal and spatial). I'm
thinking about independently addressing that problem. I now make no
claims about it in this patch.
--
Peter Geoghegan
Attachments:
0003-Add-two-text-sort-caching-optimizations.patchtext/x-patch; charset=US-ASCII; name=0003-Add-two-text-sort-caching-optimizations.patchDownload
From f9a95e0b930c43a3d5c424df5cd5b3ddb2302763 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Sat, 3 Oct 2015 16:11:02 -0700
Subject: [PATCH 3/4] Add two text sort caching optimizations
Firstly, cache strxfrm() blobs across calls made to the text SortSupport
abbreviation routine. Secondly, cache strcoll() results across calls to
the text comparison routine (SortSupport authoritative comparator only).
The cost of doing this should be immeasurably small in cases where the
optimization does not help, while the benefits in cases where the
optimization is effective should be quite significant.
---
src/backend/utils/adt/varlena.c | 75 ++++++++++++++++++++++++++++++++++++++---
1 file changed, 71 insertions(+), 4 deletions(-)
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 63ca437..da55a84 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -61,6 +61,9 @@ typedef struct
char *buf2; /* 2nd string, or abbreviation strxfrm() buf */
int buflen1;
int buflen2;
+ int last_len1; /* Length of last buf1 string/strxfrm() blob */
+ int last_len2; /* Length of last buf2 string/strxfrm() blob */
+ int last_returned; /* Last comparison result (cache) */
bool collate_c;
hyperLogLogState abbr_card; /* Abbreviated key cardinality state */
hyperLogLogState full_card; /* Full key cardinality state */
@@ -1831,9 +1834,20 @@ btsortsupport_worker(SortSupport ssup, Oid collid)
tss->buflen1 = TEXTBUFLEN;
tss->buf2 = palloc(TEXTBUFLEN);
tss->buflen2 = TEXTBUFLEN;
+ /* Start with invalid values */
+ tss->last_len1 = -1;
+ tss->last_len2 = -1;
#ifdef HAVE_LOCALE_T
tss->locale = locale;
#endif
+ /*
+ * To avoid somehow confusing a strxfrm() blob and an original string
+ * within bttextfastcmp_locale(), initialize last returned cache to a
+ * sentinel value. A platform-specific actual strcoll() return value
+ * of INT_MIN seems unlikely, but if that occurs it will not cause
+ * wrong answers.
+ */
+ tss->last_returned = INT_MIN;
tss->collate_c = collate_c;
ssup->ssup_extra = tss;
@@ -1896,6 +1910,7 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
{
text *arg1 = DatumGetTextPP(x);
text *arg2 = DatumGetTextPP(y);
+ bool arg1_match;
TextSortSupport *tss = (TextSortSupport *) ssup->ssup_extra;
/* working state */
@@ -1914,6 +1929,11 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
/* Fast pre-check for equality, as discussed in varstr_cmp() */
if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
{
+ /*
+ * No change in buf1 or buf2 contents, so avoid changing last_len1 or
+ * last_len2. Existing contents of buffers might still be used by next
+ * call.
+ */
result = 0;
goto done;
}
@@ -1931,10 +1951,43 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
tss->buf2 = MemoryContextAlloc(ssup->ssup_cxt, tss->buflen2);
}
- memcpy(tss->buf1, a1p, len1);
- tss->buf1[len1] = '\0';
- memcpy(tss->buf2, a2p, len2);
- tss->buf2[len2] = '\0';
+ /*
+ * We're likely to be asked to compare the same strings repeatedly, and
+ * memcmp() is so much cheaper than strcoll() that it pays to try to cache
+ * comparisons, even though in general there is no reason to think that
+ * that will work out (every text datum may be unique). Caching does not
+ * slow things down measurably when it doesn't work out, and can speed
+ * things up by rather a lot when it does. In part, this is because the
+ * memcmp() compares data from cachelines that are needed in L1 cache even
+ * when the last comparison's result cannot be reused.
+ */
+ arg1_match = true;
+ if (len1 != tss->last_len1 || memcmp(tss->buf1, a1p, len1) != 0)
+ {
+ arg1_match = false;
+ memcpy(tss->buf1, a1p, len1);
+ tss->buf1[len1] = '\0';
+ tss->last_len1 = len1;
+ }
+
+ /*
+ * If we're comparing the same two strings as last time, we can return the
+ * same answer without calling strcoll() again. This is more likely than
+ * it seems (at least with moderate to low cardinality sets), because
+ * quicksort compares the same pivot against many values.
+ */
+ if (len2 != tss->last_len2 || memcmp(tss->buf2, a2p, len2) != 0)
+ {
+ memcpy(tss->buf2, a2p, len2);
+ tss->buf2[len2] = '\0';
+ tss->last_len2 = len2;
+ }
+ else if (arg1_match && tss->last_returned != INT_MIN)
+ {
+ /* Use result cached following last actual strcoll() call */
+ result = tss->last_returned;
+ goto done;
+ }
#ifdef HAVE_LOCALE_T
if (tss->locale)
@@ -1951,6 +2004,8 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
if (result == 0)
result = strcmp(tss->buf1, tss->buf2);
+ /* Cache result, perhaps saving an expensive strcoll() call next time */
+ tss->last_returned = result;
done:
/* We can't afford to leak memory here. */
if (PointerGetDatum(arg1) != x)
@@ -2033,9 +2088,19 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
tss->buf1 = palloc(tss->buflen1);
}
+ /* Might be able to reuse strxfrm() blob from last call */
+ if (tss->last_len1 == len &&
+ memcmp(tss->buf1, authoritative_data, len) == 0)
+ {
+ memcpy(pres, tss->buf2, Min(sizeof(Datum), tss->last_len2));
+ /* No change affecting cardinality, so no hashing required */
+ goto done;
+ }
+
/* Just like strcoll(), strxfrm() expects a NUL-terminated string */
memcpy(tss->buf1, authoritative_data, len);
tss->buf1[len] = '\0';
+ tss->last_len1 = len;
for (;;)
{
@@ -2047,6 +2112,7 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
#endif
bsize = strxfrm(tss->buf2, tss->buf1, tss->buflen2);
+ tss->last_len2 = bsize;
if (bsize < tss->buflen2)
break;
@@ -2104,6 +2170,7 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
addHyperLogLog(&tss->abbr_card, hash);
+done:
/* Byteswap on little-endian machines: */
res = ABBREV_STRING_UINT(res);
/* Don't leak memory here */
--
1.9.1
0002-Use-unsigned-integer-abbreviated-keys-for-text.patchtext/x-patch; charset=US-ASCII; name=0002-Use-unsigned-integer-abbreviated-keys-for-text.patchDownload
From 5847c49dcf4f98adb0d3bfc16a43fb3135c852b4 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Fri, 3 Jul 2015 13:48:08 -0700
Subject: [PATCH 2/4] Use unsigned integer abbreviated keys for text
Arrange for text abbreviated keys to be represented such that a simple
3-way unsigned integer comparison using the new representation is
correct. This is somewhat faster on at least some platforms, since
unsigned integer comparisons are now used rather than memcmp() during
sorting proper. On little-endian machines, this requires a byteswap at
the end of each call to conversion routine (big-endian machines use the
same bitwise representation as before).
---
src/backend/utils/adt/varlena.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 2fbbf54..63ca437 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1967,25 +1967,25 @@ done:
static int
bttextcmp_abbrev(Datum x, Datum y, SortSupport ssup)
{
- char *a = (char *) &x;
- char *b = (char *) &y;
- int result;
-
- result = memcmp(a, b, sizeof(Datum));
-
/*
- * When result = 0, the core system will call bttextfastcmp_c() or
+ * When 0 is returned, the core system will call bttextfastcmp_c() or
* bttextfastcmp_locale(). Even a strcmp() on two non-truncated strxfrm()
* blobs cannot indicate *equality* authoritatively, for the same reason
* that there is a strcoll() tie-breaker call to strcmp() in varstr_cmp().
*/
- return result;
+ if (x > y)
+ return 1;
+ else if (x == y)
+ return 0;
+ else
+ return -1;
}
/*
* Conversion routine for sortsupport. Converts original text to abbreviated
* key representation. Our encoding strategy is simple -- pack the first 8
- * bytes of a strxfrm() blob into a Datum.
+ * bytes of a strxfrm() blob into a Datum (on little-endian machines, the 8
+ * bytes are stored in reverse order), and treat it as an unsigned integer.
*/
static Datum
bttext_abbrev_convert(Datum original, SortSupport ssup)
@@ -2104,6 +2104,8 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
addHyperLogLog(&tss->abbr_card, hash);
+ /* Byteswap on little-endian machines: */
+ res = ABBREV_STRING_UINT(res);
/* Don't leak memory here */
if (PointerGetDatum(authoritative) != original)
pfree(authoritative);
--
1.9.1
0001-Add-BSWAP64-byte-swapping-macro.patchtext/x-patch; charset=US-ASCII; name=0001-Add-BSWAP64-byte-swapping-macro.patchDownload
From b3005f3bce8c96191eb5e523ea2480766b496bd1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Mon, 29 Jun 2015 23:53:05 -0400
Subject: [PATCH 1/4] Add BSWAP64() byte-swapping macro
This includes detection of __builtin_bswap64() built-in availability,
which BSWAP64() often becomes no more than a simple wrapper for.
BSWAP64() can be used to convert 64-bit big-endian unsigned integers to
64-bit little-endian unsigned integers, and vice-versa.
Add another macro, ABBREV_STRING_UINT(), that allows string-like types
to represent abbreviated keys as unsigned integers (ABBREV_STRING_UINT()
will use BSWAP64() on 64-bit little-endian platforms). In future
commits, certain types can be made to use simple 3-way unsigned integer
comparisons (sizeof(Datum)-wide unsigned integers) within their
abbreviated key comparators; they need only first use
ABBREV_STRING_UINT() within their SortSupport conversion routines to
alter the abbreviated representation and make this safe on all supported
platforms.
---
config/c-compiler.m4 | 18 ++++++++++++++
configure | 24 +++++++++++++++++++
configure.in | 1 +
src/include/c.h | 22 +++++++++++++++++
src/include/pg_config.h.in | 3 +++
src/include/pg_config.h.win32 | 3 +++
src/include/port/pg_crc32c.h | 10 --------
src/include/utils/sortsupport.h | 52 +++++++++++++++++++++++++++++++++++++++++
8 files changed, 123 insertions(+), 10 deletions(-)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 9feec0c..550d034 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -214,6 +214,24 @@ fi])# PGAC_C_BUILTIN_BSWAP32
+# PGAC_C_BUILTIN_BSWAP64
+# -------------------------
+# Check if the C compiler understands __builtin_bswap64(),
+# and define HAVE__BUILTIN_BSWAP64 if so.
+AC_DEFUN([PGAC_C_BUILTIN_BSWAP64],
+[AC_CACHE_CHECK(for __builtin_bswap64, pgac_cv__builtin_bswap64,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static unsigned long int x = __builtin_bswap64(0xaabbccddeeff0011);]
+)],
+[pgac_cv__builtin_bswap64=yes],
+[pgac_cv__builtin_bswap64=no])])
+if test x"$pgac_cv__builtin_bswap64" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_BSWAP64, 1,
+ [Define to 1 if your compiler understands __builtin_bswap64.])
+fi])# PGAC_C_BUILTIN_BSWAP64
+
+
+
# PGAC_C_BUILTIN_CONSTANT_P
# -------------------------
# Check if the C compiler understands __builtin_constant_p(),
diff --git a/configure b/configure
index 99ef10f..b771a83 100755
--- a/configure
+++ b/configure
@@ -11259,6 +11259,30 @@ if test x"$pgac_cv__builtin_bswap32" = xyes ; then
$as_echo "#define HAVE__BUILTIN_BSWAP32 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_bswap64" >&5
+$as_echo_n "checking for __builtin_bswap64... " >&6; }
+if ${pgac_cv__builtin_bswap64+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+static unsigned long int x = __builtin_bswap64(0xaabbccddeeff0011);
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+ pgac_cv__builtin_bswap64=yes
+else
+ pgac_cv__builtin_bswap64=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_bswap64" >&5
+$as_echo "$pgac_cv__builtin_bswap64" >&6; }
+if test x"$pgac_cv__builtin_bswap64" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_BSWAP64 1" >>confdefs.h
+
+fi
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_constant_p" >&5
$as_echo_n "checking for __builtin_constant_p... " >&6; }
if ${pgac_cv__builtin_constant_p+:} false; then :
diff --git a/configure.in b/configure.in
index 4143451..b5868b0 100644
--- a/configure.in
+++ b/configure.in
@@ -1317,6 +1317,7 @@ PGAC_C_FUNCNAME_SUPPORT
PGAC_C_STATIC_ASSERT
PGAC_C_TYPES_COMPATIBLE
PGAC_C_BUILTIN_BSWAP32
+PGAC_C_BUILTIN_BSWAP64
PGAC_C_BUILTIN_CONSTANT_P
PGAC_C_BUILTIN_UNREACHABLE
PGAC_C_VA_ARGS
diff --git a/src/include/c.h b/src/include/c.h
index 8163b00..9be41dc 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -943,6 +943,28 @@ typedef NameData *Name;
#define STATUS_FOUND (1)
#define STATUS_WAITING (2)
+/* byte swapping macros */
+#ifdef HAVE__BUILTIN_BSWAP32
+#define BSWAP32(x) __builtin_bswap32(x)
+#else
+#define BSWAP32(x) (((x << 24) & 0xff000000) | \
+ ((x << 8) & 0x00ff0000) | \
+ ((x >> 8) & 0x0000ff00) | \
+ ((x >> 24) & 0x000000ff))
+#endif /* HAVE__BUILTIN_BSWAP32 */
+
+#ifdef HAVE__BUILTIN_BSWAP64
+#define BSWAP64(x) __builtin_bswap64(x)
+#else
+#define BSWAP64(x) (((x << 56) & 0xff00000000000000UL) | \
+ ((x << 40) & 0x00ff000000000000UL) | \
+ ((x << 24) & 0x0000ff0000000000UL) | \
+ ((x << 8) & 0x000000ff00000000UL) | \
+ ((x >> 8) & 0x00000000ff000000UL) | \
+ ((x >> 24) & 0x0000000000ff0000UL) | \
+ ((x >> 40) & 0x000000000000ff00UL) | \
+ ((x >> 56) & 0x00000000000000ffUL))
+#endif /* HAVE__BUILTIN_BSWAP64 */
/*
* Append PG_USED_FOR_ASSERTS_ONLY to definitions of variables that are only
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index dda73a8..a20e0cd 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -660,6 +660,9 @@
/* Define to 1 if your compiler understands __builtin_bswap32. */
#undef HAVE__BUILTIN_BSWAP32
+/* Define to 1 if your compiler understands __builtin_bswap64. */
+#undef HAVE__BUILTIN_BSWAP64
+
/* Define to 1 if your compiler understands __builtin_constant_p. */
#undef HAVE__BUILTIN_CONSTANT_P
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index 6f7a773..8566065 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -508,6 +508,9 @@
/* Define to 1 if your compiler understands __builtin_bswap32. */
/* #undef HAVE__BUILTIN_BSWAP32 */
+/* Define to 1 if your compiler understands __builtin_bswap64. */
+/* #undef HAVE__BUILTIN_BSWAP64 */
+
/* Define to 1 if your compiler understands __builtin_constant_p. */
/* #undef HAVE__BUILTIN_CONSTANT_P */
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index c925c56..e5389aa 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -71,16 +71,6 @@ extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len)
#define COMP_CRC32C(crc, data, len) \
((crc) = pg_comp_crc32c_sb8((crc), (data), (len)))
#ifdef WORDS_BIGENDIAN
-
-#ifdef HAVE__BUILTIN_BSWAP32
-#define BSWAP32(x) __builtin_bswap32(x)
-#else
-#define BSWAP32(x) (((x << 24) & 0xff000000) | \
- ((x << 8) & 0x00ff0000) | \
- ((x >> 8) & 0x0000ff00) | \
- ((x >> 24) & 0x000000ff))
-#endif
-
#define FIN_CRC32C(crc) ((crc) = BSWAP32(crc) ^ 0xFFFFFFFF)
#else
#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
index 4258630..a8bedfd 100644
--- a/src/include/utils/sortsupport.h
+++ b/src/include/utils/sortsupport.h
@@ -267,6 +267,58 @@ ApplySortAbbrevFullComparator(Datum datum1, bool isNull1,
return compare;
}
+/*
+ * Make unsigned integer (Datum) 3-way comparisons safe for string-like types.
+ *
+ * For some types, their abbreviated representations could reasonably be built
+ * such that comparisons are binary string comparisons. For example, for a
+ * string type, an abbreviated comparator like this might be used:
+ *
+ * static int
+ * btstringcmp_abbrev(Datum x, Datum y, SortSupport ssup)
+ * {
+ * return memcmp((char *) &x, (char *) &y, sizeof(Datum));
+ * }
+ *
+ * Perhaps the conversion routine packs NUL bytes used as sentinel values
+ * representing termination in the event of short strings too, so short strings
+ * require no special treatment (assuming NUL bytes are generally disallowed).
+ * There might be slightly different specifics for each type, but this basic
+ * pattern recurs relatively often.
+ *
+ * On big-endian machines, compilers might be able to optimize this to just a
+ * few CPU instructions (involving unsigned integer comparisons). Opclass
+ * authors should not rely on that, though. Instead, they should use the
+ * ABBREV_STRING_UINT() macro within their conversion routine to make sure that
+ * it's safe to use a 3-way comparator that performs simple unsigned integer
+ * comparisons explicitly. The abbreviated comparator should then look like
+ * this instead:
+ *
+ * static int
+ * btstringcmp_abbrev(Datum x, Datum y, SortSupport ssup)
+ * {
+ * if (x > y)
+ * return 1;
+ * else if (x == y)
+ * return 0;
+ * else
+ * return -1;
+ * }
+ *
+ * Apart from ensuring even cheaper comparisons on big-endian machines, this
+ * technique also works on little-endian machines, since a byteswap can be
+ * performed within ABBREV_STRING_UINT().
+ */
+#ifdef WORDS_BIGENDIAN
+#define ABBREV_STRING_UINT(x) (x)
+#else /* !WORDS_BIGENDIAN */
+#if SIZEOF_DATUM == 8
+#define ABBREV_STRING_UINT(x) BSWAP64(x)
+#else /* SIZEOF_DATUM != 8 */
+#define ABBREV_STRING_UINT(x) BSWAP32(x)
+#endif /* SIZEOF_DATUM == 8 */
+#endif /* WORDS_BIGENDIAN */
+
/* Other functions in utils/sort/sortsupport.c */
extern void PrepareSortSupportComparisonShim(Oid cmpFunc, SortSupport ssup);
extern void PrepareSortSupportFromOrderingOp(Oid orderingOp, SortSupport ssup);
--
1.9.1
On Sun, Oct 4, 2015 at 2:17 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Aug 4, 2015 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Some comments:
I attach a new version of the patch series that incorporates all your
feedback. The patch series is now made cumulative in a way that makes
it easy for someone to independently commit the unsigned integer
comparison optimization for text, and nothing else. The macro that
uses is in a dedicated header now, because I have another patch
(SortSupport for the UUID type) that uses the same optimization for
the same reason. It seems like something that will probably end up
with a third or forth client before too long, so I think the byte swap
macro wrapper belongs in sortsupport.h.
Reviewing 0001, I'm happy to see us add bswap64, but I'm not sure we
should put it in c.h, because that's included by absolutely
everything. How about putting it in a new #include inside src/port,
like src/port/pg_bswap.h? Then pg_crc.h can include that, but other
things can, too.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Oct 6, 2015 at 1:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Reviewing 0001, I'm happy to see us add bswap64, but I'm not sure we
should put it in c.h, because that's included by absolutely
everything. How about putting it in a new #include inside src/port,
like src/port/pg_bswap.h? Then pg_crc.h can include that, but other
things can, too.
I guess I imagined that bswap64() was fundamental infrastructure, but
on second thought that's not actually in evidence -- it is not already
needed in plenty of other places. So yeah, works for me.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Oct 6, 2015 at 4:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Oct 6, 2015 at 1:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Reviewing 0001, I'm happy to see us add bswap64, but I'm not sure we
should put it in c.h, because that's included by absolutely
everything. How about putting it in a new #include inside src/port,
like src/port/pg_bswap.h? Then pg_crc.h can include that, but other
things can, too.I guess I imagined that bswap64() was fundamental infrastructure, but
on second thought that's not actually in evidence -- it is not already
needed in plenty of other places. So yeah, works for me.
If you would care to revise the patch accordingly, I will commit it
(barring objections from others, of course).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Oct 6, 2015 at 1:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I guess I imagined that bswap64() was fundamental infrastructure, but
on second thought that's not actually in evidence -- it is not already
needed in plenty of other places. So yeah, works for me.If you would care to revise the patch accordingly, I will commit it
(barring objections from others, of course).
Sure. It might take me a couple of days to get around to it, though --
things are a bit hectic here.
Thanks
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Oct 6, 2015 at 4:26 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Oct 6, 2015 at 1:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I guess I imagined that bswap64() was fundamental infrastructure, but
on second thought that's not actually in evidence -- it is not already
needed in plenty of other places. So yeah, works for me.If you would care to revise the patch accordingly, I will commit it
(barring objections from others, of course).Sure. It might take me a couple of days to get around to it, though --
things are a bit hectic here.
I know the feeling.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Oct 6, 2015 at 1:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
If you would care to revise the patch accordingly, I will commit it
(barring objections from others, of course).
Here is a revision of 0001-*, with both BSWAP32() and BSWAP64() in a
new header, src/port/pg_bswap.h.
No revisions were required to any other patch in the patch series to
make this work, and so I only include a revised 0001-*.
--
Peter Geoghegan
Attachments:
0001-Provide-for-unsigned-comparisons-of-abbreviated-keys.patchtext/x-patch; charset=US-ASCII; name=0001-Provide-for-unsigned-comparisons-of-abbreviated-keys.patchDownload
From f671bbccc1eb2a7ecb5c64925630a5593f888a9d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Mon, 29 Jun 2015 23:53:05 -0400
Subject: [PATCH 1/4] Provide for unsigned comparisons of abbreviated keys
Add ABBREV_STRING_UINT() macro to allow string-like types to represent
abbreviated keys as unsigned integers. ABBREV_STRING_UINT() will use
the new BSWAP64() byte swapping macro (also introduced in this commit)
on 64-bit little-endian platforms, or the existing BSWAP32() macro on
32-bit platforms.
In future commits, certain types with abbreviation support will call
ABBREV_STRING_UINT() to perform a final conversion process on their
string-like abbreviated keys. This makes it safe and portable to
implement the abbreviated key comparator as a simple 3-way unsigned
integer comparator, which must also be changed. By using an
exceptionally cheap abbreviated comparator, a significant boost in sort
performance can be seen for these types on at least some platforms.
---
config/c-compiler.m4 | 18 ++++++++++++++
configure | 24 +++++++++++++++++++
configure.in | 1 +
src/include/pg_config.h.in | 3 +++
src/include/pg_config.h.win32 | 3 +++
src/include/port/pg_bswap.h | 46 +++++++++++++++++++++++++++++++++++
src/include/port/pg_crc32c.h | 12 ++--------
src/include/utils/sortsupport.h | 53 +++++++++++++++++++++++++++++++++++++++++
8 files changed, 150 insertions(+), 10 deletions(-)
create mode 100644 src/include/port/pg_bswap.h
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 9feec0c..550d034 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -214,6 +214,24 @@ fi])# PGAC_C_BUILTIN_BSWAP32
+# PGAC_C_BUILTIN_BSWAP64
+# -------------------------
+# Check if the C compiler understands __builtin_bswap64(),
+# and define HAVE__BUILTIN_BSWAP64 if so.
+AC_DEFUN([PGAC_C_BUILTIN_BSWAP64],
+[AC_CACHE_CHECK(for __builtin_bswap64, pgac_cv__builtin_bswap64,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static unsigned long int x = __builtin_bswap64(0xaabbccddeeff0011);]
+)],
+[pgac_cv__builtin_bswap64=yes],
+[pgac_cv__builtin_bswap64=no])])
+if test x"$pgac_cv__builtin_bswap64" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_BSWAP64, 1,
+ [Define to 1 if your compiler understands __builtin_bswap64.])
+fi])# PGAC_C_BUILTIN_BSWAP64
+
+
+
# PGAC_C_BUILTIN_CONSTANT_P
# -------------------------
# Check if the C compiler understands __builtin_constant_p(),
diff --git a/configure b/configure
index 99ef10f..b771a83 100755
--- a/configure
+++ b/configure
@@ -11259,6 +11259,30 @@ if test x"$pgac_cv__builtin_bswap32" = xyes ; then
$as_echo "#define HAVE__BUILTIN_BSWAP32 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_bswap64" >&5
+$as_echo_n "checking for __builtin_bswap64... " >&6; }
+if ${pgac_cv__builtin_bswap64+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+static unsigned long int x = __builtin_bswap64(0xaabbccddeeff0011);
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+ pgac_cv__builtin_bswap64=yes
+else
+ pgac_cv__builtin_bswap64=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_bswap64" >&5
+$as_echo "$pgac_cv__builtin_bswap64" >&6; }
+if test x"$pgac_cv__builtin_bswap64" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_BSWAP64 1" >>confdefs.h
+
+fi
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_constant_p" >&5
$as_echo_n "checking for __builtin_constant_p... " >&6; }
if ${pgac_cv__builtin_constant_p+:} false; then :
diff --git a/configure.in b/configure.in
index 4143451..b5868b0 100644
--- a/configure.in
+++ b/configure.in
@@ -1317,6 +1317,7 @@ PGAC_C_FUNCNAME_SUPPORT
PGAC_C_STATIC_ASSERT
PGAC_C_TYPES_COMPATIBLE
PGAC_C_BUILTIN_BSWAP32
+PGAC_C_BUILTIN_BSWAP64
PGAC_C_BUILTIN_CONSTANT_P
PGAC_C_BUILTIN_UNREACHABLE
PGAC_C_VA_ARGS
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index dda73a8..a20e0cd 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -660,6 +660,9 @@
/* Define to 1 if your compiler understands __builtin_bswap32. */
#undef HAVE__BUILTIN_BSWAP32
+/* Define to 1 if your compiler understands __builtin_bswap64. */
+#undef HAVE__BUILTIN_BSWAP64
+
/* Define to 1 if your compiler understands __builtin_constant_p. */
#undef HAVE__BUILTIN_CONSTANT_P
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index 6f7a773..8566065 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -508,6 +508,9 @@
/* Define to 1 if your compiler understands __builtin_bswap32. */
/* #undef HAVE__BUILTIN_BSWAP32 */
+/* Define to 1 if your compiler understands __builtin_bswap64. */
+/* #undef HAVE__BUILTIN_BSWAP64 */
+
/* Define to 1 if your compiler understands __builtin_constant_p. */
/* #undef HAVE__BUILTIN_CONSTANT_P */
diff --git a/src/include/port/pg_bswap.h b/src/include/port/pg_bswap.h
new file mode 100644
index 0000000..6555942
--- /dev/null
+++ b/src/include/port/pg_bswap.h
@@ -0,0 +1,46 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_bswap.h
+ * Byte swapping.
+ *
+ * Macros for reversing the byte order of 32-bit and 64-bit unsigned integers.
+ * For example, 0xAABBCCDD becomes 0xDDCCBBAA. These are just wrappers for
+ * built-in functions provided by the compiler where support exists.
+ *
+ * Note that the GCC built-in functions __builtin_bswap32() and
+ * __builtin_bswap64() are documented as accepting single arguments of type
+ * uint32_t and uint64_t respectively (these are also the respective return
+ * types). Use caution when using these wrapper macros with signed integers.
+ *
+ * Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ * src/include/port/pg_bswap.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_BSWAP_H
+#define PG_BSWAP_H
+
+#ifdef HAVE__BUILTIN_BSWAP32
+#define BSWAP32(x) __builtin_bswap32(x)
+#else
+#define BSWAP32(x) (((x << 24) & 0xff000000) | \
+ ((x << 8) & 0x00ff0000) | \
+ ((x >> 8) & 0x0000ff00) | \
+ ((x >> 24) & 0x000000ff))
+#endif /* HAVE__BUILTIN_BSWAP32 */
+
+#ifdef HAVE__BUILTIN_BSWAP64
+#define BSWAP64(x) __builtin_bswap64(x)
+#else
+#define BSWAP64(x) (((x << 56) & 0xff00000000000000UL) | \
+ ((x << 40) & 0x00ff000000000000UL) | \
+ ((x << 24) & 0x0000ff0000000000UL) | \
+ ((x << 8) & 0x000000ff00000000UL) | \
+ ((x >> 8) & 0x00000000ff000000UL) | \
+ ((x >> 24) & 0x0000000000ff0000UL) | \
+ ((x >> 40) & 0x000000000000ff00UL) | \
+ ((x >> 56) & 0x00000000000000ffUL))
+#endif /* HAVE__BUILTIN_BSWAP64 */
+
+#endif /* PG_BSWAP_H */
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index c925c56..35589c0 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -33,6 +33,8 @@
#ifndef PG_CRC32C_H
#define PG_CRC32C_H
+#include "port/pg_bswap.h"
+
typedef uint32 pg_crc32c;
/* The INIT and EQ macros are the same for all implementations. */
@@ -71,16 +73,6 @@ extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len)
#define COMP_CRC32C(crc, data, len) \
((crc) = pg_comp_crc32c_sb8((crc), (data), (len)))
#ifdef WORDS_BIGENDIAN
-
-#ifdef HAVE__BUILTIN_BSWAP32
-#define BSWAP32(x) __builtin_bswap32(x)
-#else
-#define BSWAP32(x) (((x << 24) & 0xff000000) | \
- ((x << 8) & 0x00ff0000) | \
- ((x >> 8) & 0x0000ff00) | \
- ((x >> 24) & 0x000000ff))
-#endif
-
#define FIN_CRC32C(crc) ((crc) = BSWAP32(crc) ^ 0xFFFFFFFF)
#else
#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
index 4258630..ae1291b 100644
--- a/src/include/utils/sortsupport.h
+++ b/src/include/utils/sortsupport.h
@@ -53,6 +53,7 @@
#define SORTSUPPORT_H
#include "access/attnum.h"
+#include "port/pg_bswap.h"
#include "utils/relcache.h"
typedef struct SortSupportData *SortSupport;
@@ -267,6 +268,58 @@ ApplySortAbbrevFullComparator(Datum datum1, bool isNull1,
return compare;
}
+/*
+ * Make unsigned integer (Datum) 3-way comparisons safe for string-like types.
+ *
+ * For some types, their abbreviated representations could reasonably be built
+ * such that comparisons are binary string comparisons. For example, for a
+ * string type, an abbreviated comparator like this might be used:
+ *
+ * static int
+ * btstringcmp_abbrev(Datum x, Datum y, SortSupport ssup)
+ * {
+ * return memcmp((char *) &x, (char *) &y, sizeof(Datum));
+ * }
+ *
+ * Perhaps the conversion routine packs NUL bytes used as sentinel values
+ * representing termination in the event of short strings too, so short strings
+ * require no special treatment (assuming NUL bytes are generally disallowed).
+ * There might be slightly different specifics for each type, but this basic
+ * pattern recurs relatively often.
+ *
+ * On big-endian machines, compilers might be able to optimize this to just a
+ * few CPU instructions (involving unsigned integer comparisons). Opclass
+ * authors should not rely on that, though. Instead, they should use the
+ * ABBREV_STRING_UINT() macro within their conversion routine to make sure that
+ * it's safe to use a 3-way comparator that performs simple unsigned integer
+ * comparisons explicitly. The abbreviated comparator should then look like
+ * this instead:
+ *
+ * static int
+ * btstringcmp_abbrev(Datum x, Datum y, SortSupport ssup)
+ * {
+ * if (x > y)
+ * return 1;
+ * else if (x == y)
+ * return 0;
+ * else
+ * return -1;
+ * }
+ *
+ * Apart from ensuring even cheaper comparisons on big-endian machines, this
+ * technique also works on little-endian machines, since a byteswap can be
+ * performed within ABBREV_STRING_UINT().
+ */
+#ifdef WORDS_BIGENDIAN
+#define ABBREV_STRING_UINT(x) (x)
+#else /* !WORDS_BIGENDIAN */
+#if SIZEOF_DATUM == 8
+#define ABBREV_STRING_UINT(x) BSWAP64(x)
+#else /* SIZEOF_DATUM != 8 */
+#define ABBREV_STRING_UINT(x) BSWAP32(x)
+#endif /* SIZEOF_DATUM == 8 */
+#endif /* WORDS_BIGENDIAN */
+
/* Other functions in utils/sort/sortsupport.c */
extern void PrepareSortSupportComparisonShim(Oid cmpFunc, SortSupport ssup);
extern void PrepareSortSupportFromOrderingOp(Oid orderingOp, SortSupport ssup);
--
1.9.1
On Wed, Oct 7, 2015 at 8:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Oct 6, 2015 at 1:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
If you would care to revise the patch accordingly, I will commit it
(barring objections from others, of course).Here is a revision of 0001-*, with both BSWAP32() and BSWAP64() in a
new header, src/port/pg_bswap.h.No revisions were required to any other patch in the patch series to
make this work, and so I only include a revised 0001-*.
Great. I've committed that, minus the sortsupport.h changes which I
think should be part of 0002, and which in any case I'd like to
discuss a bit more. It seems to me that (1) ABBREV_STRING_UINT isn't
a great name for this and (2) the comment is awfully long for the
thing to which it refers. I suggest that we instead call it
DatumToBigEndian(), put it pg_bswap.h, and change the comments to
something like this:
/*
* Rearrange the bytes of a Datum into big-endian order.
*
* One possible application of this macro is to make comparisons
cheaper. An integer
* comparison of the new Datums will return the same result as a memcmp() on the
* original Datums, but the integer comparison should be much cheaper.
*/
The specific way that this is used by various sortsupport routines can
be adequately explained in the comments for those routines.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Oct 8, 2015 at 10:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
It seems to me that (1) ABBREV_STRING_UINT isn't
a great name for this and (2) the comment is awfully long for the
thing to which it refers. I suggest that we instead call it
DatumToBigEndian(), put it pg_bswap.h, and change the comments to
something like this:/*
* Rearrange the bytes of a Datum into big-endian order.
*
* One possible application of this macro is to make comparisons
cheaper. An integer
* comparison of the new Datums will return the same result as a memcmp() on the
* original Datums, but the integer comparison should be much cheaper.
*/The specific way that this is used by various sortsupport routines can
be adequately explained in the comments for those routines.
This is pretty clearly something specific to SortSupport. I'm not
opposed to changing the name and making the comments more terse along
those lines, but I think it should live in sortsupport.h. The macro
byteswaps datums on little-endian platforms only, which seems very
specific.
I think that we're going to have SortSupport with abbreviation for
UUIDs and bytea at some point, and maybe character(n). Centralizing
information about this to sortsupport.h makes sense to me.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Oct 8, 2015 at 2:07 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Thu, Oct 8, 2015 at 10:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
It seems to me that (1) ABBREV_STRING_UINT isn't
a great name for this and (2) the comment is awfully long for the
thing to which it refers. I suggest that we instead call it
DatumToBigEndian(), put it pg_bswap.h, and change the comments to
something like this:/*
* Rearrange the bytes of a Datum into big-endian order.
*
* One possible application of this macro is to make comparisons
cheaper. An integer
* comparison of the new Datums will return the same result as a memcmp() on the
* original Datums, but the integer comparison should be much cheaper.
*/The specific way that this is used by various sortsupport routines can
be adequately explained in the comments for those routines.This is pretty clearly something specific to SortSupport. I'm not
opposed to changing the name and making the comments more terse along
those lines, but I think it should live in sortsupport.h. The macro
byteswaps datums on little-endian platforms only, which seems very
specific.I think that we're going to have SortSupport with abbreviation for
UUIDs and bytea at some point, and maybe character(n). Centralizing
information about this to sortsupport.h makes sense to me.
I'm not convinced. Doesn't this exact same concept get used for
over-the-wire communication between BE and LE machines? There, this
operation is spelled htonl/ntohl. Some systems even have htonll, but
I'm sure there are still a bunch that don't.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Oct 8, 2015 at 11:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I'm not convinced. Doesn't this exact same concept get used for
over-the-wire communication between BE and LE machines? There, this
operation is spelled htonl/ntohl. Some systems even have htonll, but
I'm sure there are still a bunch that don't.
I continue to disagree with that. The spelling of the macro that you
propose suggests that this process occurs at a relatively high level
of abstraction, which is misleading. Datums that have abbreviated key
bytes packed into them are in general kind of special. All the same,
here is a revision of the patch series along those lines. I'll also
have to update the UUID patch to independently note the same issues.
I should point out that I did not call the macro DatumToBigEndian(),
because it's actually the other way around. I called it
DatumToLittleEndian(), since the unsigned integer comparator would
work correctly on big-endian systems without calling any new macro
(which is of course why the new macro does nothing on big-endian
systems). We start off with a big endian Datum/unsigned integer on all
platforms, and then we byteswap it to make it a little-endian unsigned
integer if and when that's required (i.e. only on little-endian
systems).
--
Peter Geoghegan
Attachments:
0002-Add-two-text-sort-caching-optimizations.patchtext/x-patch; charset=US-ASCII; name=0002-Add-two-text-sort-caching-optimizations.patchDownload
From ddc2bce56f8d375a22d7a635dd402ad1e507b85c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Sat, 3 Oct 2015 16:11:02 -0700
Subject: [PATCH 2/3] Add two text sort caching optimizations
Firstly, cache strxfrm() blobs across calls made to the text SortSupport
abbreviation routine. Secondly, cache strcoll() results across calls to
the text comparison routine (SortSupport authoritative comparator only).
The cost of doing this should be immeasurably small in cases where the
optimization does not help, while the benefits in cases where the
optimization is effective should be quite significant.
---
src/backend/utils/adt/varlena.c | 75 ++++++++++++++++++++++++++++++++++++++---
1 file changed, 71 insertions(+), 4 deletions(-)
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index fadd827..d814502 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -62,6 +62,9 @@ typedef struct
char *buf2; /* 2nd string, or abbreviation strxfrm() buf */
int buflen1;
int buflen2;
+ int last_len1; /* Length of last buf1 string/strxfrm() blob */
+ int last_len2; /* Length of last buf2 string/strxfrm() blob */
+ int last_returned; /* Last comparison result (cache) */
bool collate_c;
hyperLogLogState abbr_card; /* Abbreviated key cardinality state */
hyperLogLogState full_card; /* Full key cardinality state */
@@ -1832,9 +1835,20 @@ btsortsupport_worker(SortSupport ssup, Oid collid)
tss->buflen1 = TEXTBUFLEN;
tss->buf2 = palloc(TEXTBUFLEN);
tss->buflen2 = TEXTBUFLEN;
+ /* Start with invalid values */
+ tss->last_len1 = -1;
+ tss->last_len2 = -1;
#ifdef HAVE_LOCALE_T
tss->locale = locale;
#endif
+ /*
+ * To avoid somehow confusing a strxfrm() blob and an original string
+ * within bttextfastcmp_locale(), initialize last returned cache to a
+ * sentinel value. A platform-specific actual strcoll() return value
+ * of INT_MIN seems unlikely, but if that occurs it will not cause
+ * wrong answers.
+ */
+ tss->last_returned = INT_MIN;
tss->collate_c = collate_c;
ssup->ssup_extra = tss;
@@ -1897,6 +1911,7 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
{
text *arg1 = DatumGetTextPP(x);
text *arg2 = DatumGetTextPP(y);
+ bool arg1_match;
TextSortSupport *tss = (TextSortSupport *) ssup->ssup_extra;
/* working state */
@@ -1915,6 +1930,11 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
/* Fast pre-check for equality, as discussed in varstr_cmp() */
if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
{
+ /*
+ * No change in buf1 or buf2 contents, so avoid changing last_len1 or
+ * last_len2. Existing contents of buffers might still be used by next
+ * call.
+ */
result = 0;
goto done;
}
@@ -1932,10 +1952,43 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
tss->buf2 = MemoryContextAlloc(ssup->ssup_cxt, tss->buflen2);
}
- memcpy(tss->buf1, a1p, len1);
- tss->buf1[len1] = '\0';
- memcpy(tss->buf2, a2p, len2);
- tss->buf2[len2] = '\0';
+ /*
+ * We're likely to be asked to compare the same strings repeatedly, and
+ * memcmp() is so much cheaper than strcoll() that it pays to try to cache
+ * comparisons, even though in general there is no reason to think that
+ * that will work out (every text datum may be unique). Caching does not
+ * slow things down measurably when it doesn't work out, and can speed
+ * things up by rather a lot when it does. In part, this is because the
+ * memcmp() compares data from cachelines that are needed in L1 cache even
+ * when the last comparison's result cannot be reused.
+ */
+ arg1_match = true;
+ if (len1 != tss->last_len1 || memcmp(tss->buf1, a1p, len1) != 0)
+ {
+ arg1_match = false;
+ memcpy(tss->buf1, a1p, len1);
+ tss->buf1[len1] = '\0';
+ tss->last_len1 = len1;
+ }
+
+ /*
+ * If we're comparing the same two strings as last time, we can return the
+ * same answer without calling strcoll() again. This is more likely than
+ * it seems (at least with moderate to low cardinality sets), because
+ * quicksort compares the same pivot against many values.
+ */
+ if (len2 != tss->last_len2 || memcmp(tss->buf2, a2p, len2) != 0)
+ {
+ memcpy(tss->buf2, a2p, len2);
+ tss->buf2[len2] = '\0';
+ tss->last_len2 = len2;
+ }
+ else if (arg1_match && tss->last_returned != INT_MIN)
+ {
+ /* Use result cached following last actual strcoll() call */
+ result = tss->last_returned;
+ goto done;
+ }
#ifdef HAVE_LOCALE_T
if (tss->locale)
@@ -1952,6 +2005,8 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
if (result == 0)
result = strcmp(tss->buf1, tss->buf2);
+ /* Cache result, perhaps saving an expensive strcoll() call next time */
+ tss->last_returned = result;
done:
/* We can't afford to leak memory here. */
if (PointerGetDatum(arg1) != x)
@@ -2034,9 +2089,19 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
tss->buf1 = palloc(tss->buflen1);
}
+ /* Might be able to reuse strxfrm() blob from last call */
+ if (tss->last_len1 == len &&
+ memcmp(tss->buf1, authoritative_data, len) == 0)
+ {
+ memcpy(pres, tss->buf2, Min(sizeof(Datum), tss->last_len2));
+ /* No change affecting cardinality, so no hashing required */
+ goto done;
+ }
+
/* Just like strcoll(), strxfrm() expects a NUL-terminated string */
memcpy(tss->buf1, authoritative_data, len);
tss->buf1[len] = '\0';
+ tss->last_len1 = len;
for (;;)
{
@@ -2048,6 +2113,7 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
#endif
bsize = strxfrm(tss->buf2, tss->buf1, tss->buflen2);
+ tss->last_len2 = bsize;
if (bsize < tss->buflen2)
break;
@@ -2105,6 +2171,7 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
addHyperLogLog(&tss->abbr_card, hash);
+done:
/*
* Byteswap on little-endian machines.
*
--
1.9.1
0001-Use-unsigned-integer-abbreviated-keys-for-text.patchtext/x-patch; charset=US-ASCII; name=0001-Use-unsigned-integer-abbreviated-keys-for-text.patchDownload
From 48493d8806f11e92b5d585b217e0e13b69ecdeb0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Fri, 3 Jul 2015 13:48:08 -0700
Subject: [PATCH 1/3] Use unsigned integer abbreviated keys for text
Add DatumToLittleEndian() macro to allow string-like types to represent
abbreviated keys as unsigned integers. DatumToLittleEndian() will
perform a byteswap on little-endian platforms only.
Using the new macro, arrange for text abbreviated keys to be represented
such that a simple 3-way unsigned integer comparison using the new
representation is correct. Replace the comparator along those lines.
This is faster on at least some platforms, since unsigned integer
comparisons are now used rather than memcmp() during sorting proper.
---
src/backend/utils/adt/varlena.c | 28 +++++++++++++++++++---------
src/include/port/pg_bswap.h | 22 ++++++++++++++++++++++
2 files changed, 41 insertions(+), 9 deletions(-)
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 2fbbf54..fadd827 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -26,6 +26,7 @@
#include "libpq/pqformat.h"
#include "miscadmin.h"
#include "parser/scansup.h"
+#include "port/pg_bswap.h"
#include "regex/regex.h"
#include "utils/builtins.h"
#include "utils/bytea.h"
@@ -1967,25 +1968,25 @@ done:
static int
bttextcmp_abbrev(Datum x, Datum y, SortSupport ssup)
{
- char *a = (char *) &x;
- char *b = (char *) &y;
- int result;
-
- result = memcmp(a, b, sizeof(Datum));
-
/*
- * When result = 0, the core system will call bttextfastcmp_c() or
+ * When 0 is returned, the core system will call bttextfastcmp_c() or
* bttextfastcmp_locale(). Even a strcmp() on two non-truncated strxfrm()
* blobs cannot indicate *equality* authoritatively, for the same reason
* that there is a strcoll() tie-breaker call to strcmp() in varstr_cmp().
*/
- return result;
+ if (x > y)
+ return 1;
+ else if (x == y)
+ return 0;
+ else
+ return -1;
}
/*
* Conversion routine for sortsupport. Converts original text to abbreviated
* key representation. Our encoding strategy is simple -- pack the first 8
- * bytes of a strxfrm() blob into a Datum.
+ * bytes of a strxfrm() blob into a Datum (on little-endian machines, the 8
+ * bytes are stored in reverse order), and treat it as an unsigned integer.
*/
static Datum
bttext_abbrev_convert(Datum original, SortSupport ssup)
@@ -2104,6 +2105,15 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
addHyperLogLog(&tss->abbr_card, hash);
+ /*
+ * Byteswap on little-endian machines.
+ *
+ * This is needed so that bttextcmp_abbrev() (an unsigned integer 3-way
+ * comparator) works correctly on all platforms. If this was skipped, then
+ * the comparator would have to call memcmp() with a pair of pointers to
+ * the first byte of each abbreviated key, which is slower.
+ */
+ res = DatumToLittleEndian(res);
/* Don't leak memory here */
if (PointerGetDatum(authoritative) != original)
pfree(authoritative);
diff --git a/src/include/port/pg_bswap.h b/src/include/port/pg_bswap.h
index 6555942..ac060ea 100644
--- a/src/include/port/pg_bswap.h
+++ b/src/include/port/pg_bswap.h
@@ -43,4 +43,26 @@
((x >> 56) & 0x00000000000000ffUL))
#endif /* HAVE__BUILTIN_BSWAP64 */
+/*
+ * Rearrange the bytes of a Datum into little-endian order from big-endian
+ * order. On big-endian machines, this does nothing at all. Note that the C
+ * type Datum is an unsigned integer type on all platforms.
+ *
+ * One possible application of the DatumToLittleEndian() macro is to make
+ * bitwise comparisons cheaper. A simple 3-way comparison of Datums
+ * transformed by the macro (based on native, unsigned comparisons) will return
+ * the same result as a memcmp() of the corresponding original Datums, but can
+ * be much cheaper. It's generally safe to do this on big-endian systems
+ * without any special transformation occurring first.
+ */
+#ifdef WORDS_BIGENDIAN
+#define DatumToLittleEndian(x) (x)
+#else /* !WORDS_BIGENDIAN */
+#if SIZEOF_DATUM == 8
+#define DatumToLittleEndian(x) BSWAP64(x)
+#else /* SIZEOF_DATUM != 8 */
+#define DatumToLittleEndian(x) BSWAP32(x)
+#endif /* SIZEOF_DATUM == 8 */
+#endif /* WORDS_BIGENDIAN */
+
#endif /* PG_BSWAP_H */
--
1.9.1
On Thu, Oct 8, 2015 at 8:20 PM, Peter Geoghegan <pg@heroku.com> wrote:
I should point out that I did not call the macro DatumToBigEndian(),
because it's actually the other way around. I called it
DatumToLittleEndian(), since the unsigned integer comparator would
work correctly on big-endian systems without calling any new macro
(which is of course why the new macro does nothing on big-endian
systems). We start off with a big endian Datum/unsigned integer on all
platforms, and then we byteswap it to make it a little-endian unsigned
integer if and when that's required (i.e. only on little-endian
systems).
Hmm. But then this doesn't seem to make much sense:
+ * Rearrange the bytes of a Datum into little-endian order from big-endian
+ * order. On big-endian machines, this does nothing at all.
Rearranging bytes into little-endian order ought to be a no-op on a
little-endian machine; and rearranging them into big-endian order
ought to be a no-op on a big-endian machine.
Thinking about this a bit more, it seems like the situation we're in
here is that the input datum is always going to be big-endian.
Regardless of what the machine's integer format is, the sortsupport
abbreviator is going to output a Datum where the most significant byte
is the first one stored in the datum. We want to convert that Datum
to one that has *native* endianness. So maybe we should call this
DatumBigEndianToNative or something like that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Oct 9, 2015 at 11:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Hmm. But then this doesn't seem to make much sense:
+ * Rearrange the bytes of a Datum into little-endian order from big-endian + * order. On big-endian machines, this does nothing at all.Rearranging bytes into little-endian order ought to be a no-op on a
little-endian machine; and rearranging them into big-endian order
ought to be a no-op on a big-endian machine.
I think that that's very clearly implied anyway.
Thinking about this a bit more, it seems like the situation we're in
here is that the input datum is always going to be big-endian.
Regardless of what the machine's integer format is, the sortsupport
abbreviator is going to output a Datum where the most significant byte
is the first one stored in the datum. We want to convert that Datum
to one that has *native* endianness. So maybe we should call this
DatumBigEndianToNative or something like that.
I'd be fine with DatumBigEndianToNative() -- I agree that that's
slightly better.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Oct 9, 2015 at 2:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Fri, Oct 9, 2015 at 11:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Hmm. But then this doesn't seem to make much sense:
+ * Rearrange the bytes of a Datum into little-endian order from big-endian + * order. On big-endian machines, this does nothing at all.Rearranging bytes into little-endian order ought to be a no-op on a
little-endian machine; and rearranging them into big-endian order
ought to be a no-op on a big-endian machine.I think that that's very clearly implied anyway.
Thinking about this a bit more, it seems like the situation we're in
here is that the input datum is always going to be big-endian.
Regardless of what the machine's integer format is, the sortsupport
abbreviator is going to output a Datum where the most significant byte
is the first one stored in the datum. We want to convert that Datum
to one that has *native* endianness. So maybe we should call this
DatumBigEndianToNative or something like that.I'd be fine with DatumBigEndianToNative() -- I agree that that's
slightly better.
OK, committed that way.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Oct 9, 2015 at 12:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
OK, committed that way.
Thank you.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Oct 9, 2015 at 12:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
OK, committed that way.
Just for the record, with the same "cities" table as my original post
to this thread, this query:
select count(distinct(city)) from cities;
Goes from taking about 296ms (once it stabilizes), to about 265ms
(once it stabilizes) following today's commit of just the unsigned
integer comparison patch. I've shaved just over 10% off the duration
of this representative sort-heavy query (against a 9.5 baseline),
which is nice.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Oct 9, 2015 at 3:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Fri, Oct 9, 2015 at 12:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
OK, committed that way.
Thank you.
You're welcome. After some study and experimentation, I've committed
the other part also.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Oct 9, 2015 at 4:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
You're welcome. After some study and experimentation, I've committed
the other part also.
Fantastic. I guess the precedent of the 9.5 text equality fast path
made this discussion way smoother than last time, since essentially
the same principle applies.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Oct 9, 2015 at 7:49 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Fri, Oct 9, 2015 at 4:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
You're welcome. After some study and experimentation, I've committed
the other part also.Fantastic. I guess the precedent of the 9.5 text equality fast path
made this discussion way smoother than last time, since essentially
the same principle applies.
I think that is true. I spent some time thinking about whether the
way you used INT_MIN as a sentinel value should be changed around
somehow, but ultimately I decided that it wasn't too bad and that
suggesting something else would be pointless kibitzing. I also tried
to think of scenarios in which this would lose, and I'm not totally
convinced that there aren't any, but I'm convinced that, if they
exist, I don't know what they are. Since the patch did deliver a
small improvement on my test cases and on yours, I think we might as
well have it in the tree. If some pathological scenario shows up
where it turns out to hurt, we can always fix it then, or revert if it
need be.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Oct 9, 2015 at 5:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think that is true. I spent some time thinking about whether the
way you used INT_MIN as a sentinel value should be changed around
somehow, but ultimately I decided that it wasn't too bad and that
suggesting something else would be pointless kibitzing. I also tried
to think of scenarios in which this would lose, and I'm not totally
convinced that there aren't any, but I'm convinced that, if they
exist, I don't know what they are. Since the patch did deliver a
small improvement on my test cases and on yours, I think we might as
well have it in the tree. If some pathological scenario shows up
where it turns out to hurt, we can always fix it then, or revert if it
need be.
That seems very reasonable.
I noticed that there is still one comment that I really should have
removed as part of this work. The comment didn't actually add any new
information for 9.5, but is now obsolete. Attached patch removes it
entirely.
--
Peter Geoghegan
Attachments:
remove_text_comment.patchtext/x-patch; charset=US-ASCII; name=remove_text_comment.patchDownload
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index d545c34..3978b1e 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -2056,10 +2056,6 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
int len;
uint32 hash;
- /*
- * Abbreviated key representation is a pass-by-value Datum that is treated
- * as a char array by the specialized comparator bttextcmp_abbrev().
- */
pres = (char *) &res;
/* memset(), so any non-overwritten bytes are NUL */
memset(pres, 0, sizeof(Datum));
On Sat, Oct 10, 2015 at 12:32 PM, Peter Geoghegan <pg@heroku.com> wrote:
I noticed that there is still one comment that I really should have
removed as part of this work.
I also noticed that I failed to reset the last_returned strcoll()
cache variable as part of an abbreviation call, despite the fact that
tapesort may freely interleave conversions with comparisons, while
reusing buf1 and buf2 both as scratch space for strxfrm() blobs, as
well as for storing strings to be compared with strcoll(). I suggest
that the attached patch also be applied to fix this issue.
--
Peter Geoghegan
Attachments:
reset_last_returned.patchtext/x-patch; charset=US-ASCII; name=reset_last_returned.patchDownload
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index d545c34..9725a5f 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -2173,6 +2173,12 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
done:
/*
+ * Set last_returned to sentinel value. Comparisons that attempt to reuse
+ * last_returned may be interleaved with calls here.
+ */
+ tss->last_returned = INT_MIN;
+
+ /*
* Byteswap on little-endian machines.
*
* This is needed so that bttextcmp_abbrev() (an unsigned integer 3-way
On Mon, Oct 12, 2015 at 12:47 AM, Peter Geoghegan <pg@heroku.com> wrote:
I also noticed that I failed to reset the last_returned strcoll()
cache variable as part of an abbreviation call, despite the fact that
tapesort may freely interleave conversions with comparisons, while
reusing buf1 and buf2 both as scratch space for strxfrm() blobs, as
well as for storing strings to be compared with strcoll(). I suggest
that the attached patch also be applied to fix this issue.
I think that I jumped the gun with this fix, because theoretically you
can still get the same problem in the opposite direction -- an
original string treated as a strxfrm() blob when the cache is
consulted.
I'll consider a more comprehensive fix.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Oct 12, 2015 at 3:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Mon, Oct 12, 2015 at 12:47 AM, Peter Geoghegan <pg@heroku.com> wrote:
I also noticed that I failed to reset the last_returned strcoll()
cache variable as part of an abbreviation call, despite the fact that
tapesort may freely interleave conversions with comparisons, while
reusing buf1 and buf2 both as scratch space for strxfrm() blobs, as
well as for storing strings to be compared with strcoll(). I suggest
that the attached patch also be applied to fix this issue.I think that I jumped the gun with this fix, because theoretically you
can still get the same problem in the opposite direction -- an
original string treated as a strxfrm() blob when the cache is
consulted.I'll consider a more comprehensive fix.
This is not the first time I've committed one of your patches and
promptly received a whole series of emails with fixes for what went
into that commit, despite the fact that I have not and did not change
the relevant parts of the patch while committing. Since that makes
more work for me, I am not a huge fan, and the practical effect is to
subtract from the amount of time that could otherwise have been spent
reviewing your next patch (or someone else's). In this case, I think
the best thing for me to do right now is wait to commit anything
further until you have had a chance to go over this and come up with a
fix or set of fixes that you think are completely, 100% ready to go;
or else until you get to the point of wanting feedback before
proceeding further. In general, I would appreciate if this sort of
post-commit self-review could be done pre-commit.
I apologize for the fact that this email probably sounds grouchy.
Please try to read it in the most positive light possible (and don't
shoot me).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Oct 12, 2015 at 4:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
In this case, I think
the best thing for me to do right now is wait to commit anything
further until you have had a chance to go over this and come up with a
fix or set of fixes that you think are completely, 100% ready to go;
or else until you get to the point of wanting feedback before
proceeding further. In general, I would appreciate if this sort of
post-commit self-review could be done pre-commit.
This came to me while I was making dinner last night - I was not
actually poring over the code on my computer. I don't know why I
thought of it then and not at any earlier point, but it seems
reasonable to suppose it had something to do with perspective, or
being able to see a bigger picture that is difficult to keep in mind
during initial development and self review. I was mostly thinking
about external sorting at the time, and not this patch.
I cannot do that on demand. It isn't a matter of effort. When I come
back from vacation at the end of the month, I may be able to do
somewhat better for a few months.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Oct 12, 2015 at 12:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
I'll consider a more comprehensive fix.
I attach a revised fix that considers the problem of misinterpreting
the contents of the buffers in both directions.
Thanks
--
Peter Geoghegan
Attachments:
0001-Correctly-distinguish-the-contents-of-text-buffers.patchtext/x-patch; charset=US-ASCII; name=0001-Correctly-distinguish-the-contents-of-text-buffers.patchDownload
From c2d4558df3ce93b939fa319dd6e0735400833df0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Mon, 12 Oct 2015 17:13:28 -0700
Subject: [PATCH] Correctly distinguish the contents of text buffers
Avoid confusing an original string with a strxfrm() buffer, or
vice-versa. This could previously occur in the event of interleaving of
authoritative comparisons with conversions to abbreviated keys. Such
interleaving is possible with external sorts.
---
src/backend/utils/adt/varlena.c | 35 +++++++++++++++++++++++++++--------
1 file changed, 27 insertions(+), 8 deletions(-)
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index d545c34..c8574f0 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -64,7 +64,7 @@ typedef struct
int buflen2;
int last_len1; /* Length of last buf1 string/strxfrm() blob */
int last_len2; /* Length of last buf2 string/strxfrm() blob */
- int last_returned; /* Last comparison result (cache) */
+ int last_returned; /* Last comparison result/caching sentinel */
bool collate_c;
hyperLogLogState abbr_card; /* Abbreviated key cardinality state */
hyperLogLogState full_card; /* Full key cardinality state */
@@ -1843,10 +1843,17 @@ btsortsupport_worker(SortSupport ssup, Oid collid)
#endif
/*
* To avoid somehow confusing a strxfrm() blob and an original string
- * within bttextfastcmp_locale(), initialize last returned cache to a
+ * within bttextfastcmp_locale(), initialize last_returned cache to a
* sentinel value. A platform-specific actual strcoll() return value
- * of INT_MIN seems unlikely, but if that occurs it will not cause
- * wrong answers.
+ * of INT_MIN seems unlikely, but will be immediately changed to -1 to
+ * remove the possibility of wrong answers.
+ *
+ * Comparisons that attempt to reuse last_returned may be interleaved
+ * with conversion calls. Frequently, conversions and comparisons are
+ * batched into two distinct phases, but the correctness of caching
+ * cannot hinge upon this. For comparison caching we only trust the
+ * cache when last_returned != INT_MIN, while for strxfrm() caching we
+ * only trust the cache when last_returned == INT_MIN.
*/
tss->last_returned = INT_MIN;
tss->collate_c = collate_c;
@@ -1931,9 +1938,9 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
{
/*
- * No change in buf1 or buf2 contents, so avoid changing last_len1 or
- * last_len2. Existing contents of buffers might still be used by next
- * call.
+ * No change in buf1 or buf2 contents, so avoid changing last_len1,
+ * last_len2, or last_returned. Existing contents of buffers might
+ * still be used by next call.
*/
result = 0;
goto done;
@@ -2005,7 +2012,12 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
if (result == 0)
result = strcmp(tss->buf1, tss->buf2);
- /* Cache result, perhaps saving an expensive strcoll() call next time */
+ /*
+ * Cache result, perhaps saving an expensive strcoll() call next time.
+ * Interpret a strcoll() return value that happens to be the sentinel value
+ * INT_MIN as -1.
+ */
+ result = Max(-1, result);
tss->last_returned = result;
done:
/* We can't afford to leak memory here. */
@@ -2091,6 +2103,7 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
/* Might be able to reuse strxfrm() blob from last call */
if (tss->last_len1 == len &&
+ tss->last_returned == INT_MIN &&
memcmp(tss->buf1, authoritative_data, len) == 0)
{
memcpy(pres, tss->buf2, Min(sizeof(Datum), tss->last_len2));
@@ -2173,6 +2186,12 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
done:
/*
+ * Set last_returned to sentinel value to indicate that buffers store
+ * strxfrm() original string and blob
+ */
+ tss->last_returned = INT_MIN;
+
+ /*
* Byteswap on little-endian machines.
*
* This is needed so that bttextcmp_abbrev() (an unsigned integer 3-way
--
1.9.1
On Wed, Oct 14, 2015 at 7:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Mon, Oct 12, 2015 at 12:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
I'll consider a more comprehensive fix.
I attach a revised fix that considers the problem of misinterpreting
the contents of the buffers in both directions.
I wonder if it wouldn't be better to just add a separate Boolean
indicating exactly the thing we care about. This doesn't seem
particularly easy to understand and verify.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Oct 14, 2015 at 4:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I wonder if it wouldn't be better to just add a separate Boolean
indicating exactly the thing we care about. This doesn't seem
particularly easy to understand and verify.
I'm not really sure that that's an improvement. But I defer to you.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Oct 14, 2015 at 7:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Wed, Oct 14, 2015 at 4:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I wonder if it wouldn't be better to just add a separate Boolean
indicating exactly the thing we care about. This doesn't seem
particularly easy to understand and verify.I'm not really sure that that's an improvement. But I defer to you.
Would you be willing to try coding it up that way so we can see how it looks?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Oct 15, 2015 at 9:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Would you be willing to try coding it up that way so we can see how it looks?
I attach a patch that does it that way.
Note that I will be away for until late this month. Do not expect a
response from me before then, unless it's truly urgent.
Thanks
--
Peter Geoghegan
Attachments:
0001-Correctly-distinguish-the-contents-of-text-buffers.patchtext/x-patch; charset=US-ASCII; name=0001-Correctly-distinguish-the-contents-of-text-buffers.patchDownload
From 22936e4d69e74e8e744053024d7feea63d7cea56 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Mon, 12 Oct 2015 17:13:28 -0700
Subject: [PATCH] Correctly distinguish the contents of text buffers
Avoid confusing an original string with a strxfrm() buffer, or
vice-versa. This could previously occur in the event of interleaving of
authoritative comparisons with conversions to abbreviated keys. Such
interleaving is possible with external sorts.
---
src/backend/utils/adt/varlena.c | 29 +++++++++++++++++++++--------
1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index d545c34..83523f7 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -65,6 +65,7 @@ typedef struct
int last_len1; /* Length of last buf1 string/strxfrm() blob */
int last_len2; /* Length of last buf2 string/strxfrm() blob */
int last_returned; /* Last comparison result (cache) */
+ bool cache_blob; /* Does buf2 contain strxfrm() blob, etc? */
bool collate_c;
hyperLogLogState abbr_card; /* Abbreviated key cardinality state */
hyperLogLogState full_card; /* Full key cardinality state */
@@ -1838,17 +1839,26 @@ btsortsupport_worker(SortSupport ssup, Oid collid)
/* Start with invalid values */
tss->last_len1 = -1;
tss->last_len2 = -1;
+ /* Initialize */
+ tss->last_returned = 0;
#ifdef HAVE_LOCALE_T
tss->locale = locale;
#endif
/*
- * To avoid somehow confusing a strxfrm() blob and an original string
- * within bttextfastcmp_locale(), initialize last returned cache to a
- * sentinel value. A platform-specific actual strcoll() return value
- * of INT_MIN seems unlikely, but if that occurs it will not cause
- * wrong answers.
+ * To avoid somehow confusing a strxfrm() blob and an original string,
+ * constantly keep track of the variety of data that buf1 and buf2
+ * currently contain.
+ *
+ * Comparisons may be interleaved with conversion calls. Frequently,
+ * conversions and comparisons are batched into two distinct phases,
+ * but the correctness of caching cannot hinge upon this. For
+ * comparison caching, buffer state is only trusted if cache_blob is
+ * found set to false, whereas strxfrm() caching only trusts the state
+ * when cache_blob is found set to true.
+ *
+ * Arbitrarily initialize cache_blob to true.
*/
- tss->last_returned = INT_MIN;
+ tss->cache_blob = true;
tss->collate_c = collate_c;
ssup->ssup_extra = tss;
@@ -1983,7 +1993,7 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
tss->buf2[len2] = '\0';
tss->last_len2 = len2;
}
- else if (arg1_match && tss->last_returned != INT_MIN)
+ else if (arg1_match && !tss->cache_blob)
{
/* Use result cached following last actual strcoll() call */
result = tss->last_returned;
@@ -2006,6 +2016,7 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
result = strcmp(tss->buf1, tss->buf2);
/* Cache result, perhaps saving an expensive strcoll() call next time */
+ tss->cache_blob = false;
tss->last_returned = result;
done:
/* We can't afford to leak memory here. */
@@ -2090,7 +2101,7 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
}
/* Might be able to reuse strxfrm() blob from last call */
- if (tss->last_len1 == len &&
+ if (tss->last_len1 == len && tss->cache_blob &&
memcmp(tss->buf1, authoritative_data, len) == 0)
{
memcpy(pres, tss->buf2, Min(sizeof(Datum), tss->last_len2));
@@ -2171,6 +2182,8 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
addHyperLogLog(&tss->abbr_card, hash);
+ /* Cache result, perhaps saving an expensive strxfrm() call next time */
+ tss->cache_blob = true;
done:
/*
* Byteswap on little-endian machines.
--
1.9.1
On Sun, Oct 18, 2015 at 2:52 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Thu, Oct 15, 2015 at 9:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Would you be willing to try coding it up that way so we can see how it looks?
I attach a patch that does it that way.
That looks good to me. I've committed this, and the other patch to
remove the obsolete comment.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers