Patch for SortSupport implementation on inet/cdir

Started by Brandur Leachalmost 7 years ago13 messages

brandur@mutelight.org

almost 7 years ago

1 attachment(s)

Hi list,

I've attached a patch that implements SortSupport for the
inet/cidr types. It has the effect of typically reducing
the time taken to sort these types by ~50-60% (as measured
by `SELECT COUNT(DISTINCT ...)` which will carry over to
common operations like index creation, `ORDER BY`, and
`DISTINCT`.

As with other SortSupport implementations, this one
proposes an abbreviated key design that packs in as much
sorting-relevant information into the available datum as
possible while remaining faithful to the existing sorting
rules for the types. The key format depends on datum size
and whether working with IPv4 or IPv6. In most cases that
involves storing as many netmask bits as we can fit, but we
can do something a little more complete with IPv4 on 64-bit
— because inet data is small and the datum is large, we
can store enough information for a successful key-only
comparison in the majority of cases. Precise details
including background and bit-level diagrams are included in
the comments of the attached patch.

I've tried to take a variety of steps to build confidence
that the proposed SortSupport-based sort is correct:

1. It passes the existing inet regression suite (which was
pretty complete already).

2. I added a few new test cases the suite, specifically
trying to target edge cases like the minimums and
maximums of each abbreviated key bit "boundary". The new
cases were run against the master branch to double-check
that they're right.

3. I've added a variety of assertions throughout the
implementation to make sure that each step is seeing
inputs with expected parameters, and only manipulates
the bits that it's supposed to be manipulating.

4. I built large random data sets (10M rows) and sorted
them against a development build to try and trigger the
aforementioned assertions.

5. I built an index on 10M values and ran amcheck against
it.

6. I wrote some unit tests to verify that the bit-level
representation of the new abbreviated keys are indeed
what we expect. They're available here [3]. I didn't
include them in the patch because I couldn't find an
easy way to produce an expected `.out` file for a 32-bit
machine (experiments building with `setarch` didn't
succeed). They might be overkill anyway. I can continue
to pursue getting them working if reviewers think they'd
be useful.

My benchmarking methodology and script is available here
[1]: , and involves gathering statistics for 100 `count(distinct ...)` queries at various data sizes. I've saved the results I got on my machine here [2].
`count(distinct ...)` queries at various data sizes. I've
saved the results I got on my machine here [2].

Hat tip to Peter Geoghegan for advising on an appropriate
abbreviated key design and helping answer many questions
along the way.

Brandur

Attachments:

SortSupport-for-inet-cidr-types-v1.patchapplication/octet-stream; name=SortSupport-for-inet-cidr-types-v1.patchDownload

From 675555068fd50be727a0515601a279e7871d1f3c Mon Sep 17 00:00:00 2001
From: Brandur <brandur@brandur.org>
Date: Thu, 24 Jan 2019 08:34:51 -0800
Subject: [PATCH] SortSupport for inet/cidr types

Implements SortSupport for the inet/cidr types in Postgres by devising
an abbreviated key representation for them that will faithfully
reproduce their existing sorting rules. This has the effect of typically
reducing the time taken for inet/cidr sorts by ~50-60%, and should show
a good improvement for the vast majority of real-world data sets.
---
 src/backend/utils/adt/network.c    | 428 +++++++++++++++++++++++++++++++++++++
 src/include/catalog/pg_amproc.dat  |   3 +
 src/include/catalog/pg_proc.dat    |   3 +
 src/test/regress/expected/inet.out | 104 +++++++++
 src/test/regress/sql/inet.sql      |  53 +++++
 5 files changed, 591 insertions(+)

diff --git a/src/backend/utils/adt/network.c b/src/backend/utils/adt/network.c
index 5af7f4e..0afd6f5 100644
--- a/src/backend/utils/adt/network.c
+++ b/src/backend/utils/adt/network.c
@@ -15,17 +15,40 @@
 #include "access/hash.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
+#include "lib/hyperloglog.h"
 #include "libpq/libpq-be.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "utils/builtins.h"
+#include "utils/guc.h"
 #include "utils/inet.h"
+#include "utils/sortsupport.h"
 
+/* A few constants for the width in bits of certain values in inet/cidr
+ * abbreviated keys. */
+#if SIZEOF_DATUM == 8
+#define ABBREV_BITS_INET4_NETMASK_SIZE 6
+#define ABBREV_BITS_INET4_SUBNET 25
+#endif
 
 static int32 network_cmp_internal(inet *a1, inet *a2);
 static bool addressOK(unsigned char *a, int bits, int family);
 static inet *internal_inetpl(inet *ip, int64 addend);
 
+/* sortsupport for inet/cidr */
+typedef struct
+{
+	int64		input_count;	/* number of non-null values seen */
+	bool		estimating;		/* true if estimating cardinality */
+
+	hyperLogLogState abbr_card; /* cardinality estimator */
+}			network_sortsupport_state;
+
+static int	network_cmp_internal(inet *a1, inet *a2);
+static int	network_fast_cmp(Datum x, Datum y, SortSupport ssup);
+static int	network_cmp_abbrev(Datum x, Datum y, SortSupport ssup);
+static bool network_abbrev_abort(int memtupcount, SortSupport ssup);
+static Datum network_abbrev_convert(Datum original, SortSupport ssup);
 
 /*
  * Common INET/CIDR input routine
@@ -390,6 +413,411 @@ network_cmp(PG_FUNCTION_ARGS)
 }
 
 /*
+ *	SortSupport implementation functions.
+ */
+
+/*
+ * SortSupport strategy function. Populates a SortSupport struct with the
+ * information necessary to use comparison by abbreviated keys.
+ */
+Datum
+network_sortsupport(PG_FUNCTION_ARGS)
+{
+	SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
+
+	ssup->comparator = network_fast_cmp;
+	ssup->ssup_extra = NULL;
+
+	if (ssup->abbreviate)
+	{
+		network_sortsupport_state *uss;
+		MemoryContext oldcontext;
+
+		oldcontext = MemoryContextSwitchTo(ssup->ssup_cxt);
+
+		uss = palloc(sizeof(network_sortsupport_state));
+		uss->input_count = 0;
+		uss->estimating = true;
+		initHyperLogLog(&uss->abbr_card, 10);
+
+		ssup->ssup_extra = uss;
+
+		ssup->comparator = network_cmp_abbrev;
+		ssup->abbrev_converter = network_abbrev_convert;
+		ssup->abbrev_abort = network_abbrev_abort;
+		ssup->abbrev_full_comparator = network_fast_cmp;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * SortSupport authoritative comparison function. Pulls two inet structs from
+ * the heap and runs a standard comparison on them.
+ */
+static int
+network_fast_cmp(Datum x, Datum y, SortSupport ssup)
+{
+	inet	   *arg1 = DatumGetInetPP(x);
+	inet	   *arg2 = DatumGetInetPP(y);
+
+	return network_cmp_internal(arg1, arg2);
+}
+
+/*
+ * SortSupport abbreviated key comparison function. Compares two inet/cidr
+ * values addresses quickly by treating them like integers, and without having
+ * to go to the heap. This works because we've packed them in a form that can
+ * support this comparison in network_abbrev_convert.
+ */
+static int
+network_cmp_abbrev(Datum x, Datum y, SortSupport ssup)
+{
+	if (x > y)
+		return 1;
+	else if (x == y)
+		return 0;
+	else
+		return -1;
+}
+
+/*
+ * Callback for estimating effectiveness of abbreviated key optimization.
+ *
+ * We pay no attention to the cardinality of the non-abbreviated data, because
+ * there is no equality fast-path within authoritative inet comparator.
+ */
+static bool
+network_abbrev_abort(int memtupcount, SortSupport ssup)
+{
+	network_sortsupport_state *uss = ssup->ssup_extra;
+	double		abbr_card;
+
+	if (memtupcount < 10000 || uss->input_count < 10000 || !uss->estimating)
+		return false;
+
+	abbr_card = estimateHyperLogLog(&uss->abbr_card);
+
+	/*
+	 * If we have >100k distinct values, then even if we were sorting many
+	 * billion rows we'd likely still break even, and the penalty of undoing
+	 * that many rows of abbrevs would probably not be worth it. At this point
+	 * we stop counting because we know that we're now fully committed.
+	 */
+	if (abbr_card > 100000.0)
+	{
+#ifdef TRACE_SORT
+		if (trace_sort)
+			elog(LOG,
+				 "network_abbrev: estimation ends at cardinality %f"
+				 " after " INT64_FORMAT " values (%d rows)",
+				 abbr_card, uss->input_count, memtupcount);
+#endif
+		uss->estimating = false;
+		return false;
+	}
+
+	/*
+	 * Target minimum cardinality is 1 per ~2k of non-null inputs. 0.5 row
+	 * fudge factor allows us to abort earlier on genuinely pathological data
+	 * where we've had exactly one abbreviated value in the first 2k
+	 * (non-null) rows.
+	 */
+	if (abbr_card < uss->input_count / 2000.0 + 0.5)
+	{
+#ifdef TRACE_SORT
+		if (trace_sort)
+			elog(LOG,
+				 "network_abbrev: aborting abbreviation at cardinality %f"
+				 " below threshold %f after " INT64_FORMAT " values (%d rows)",
+				 abbr_card, uss->input_count / 2000.0 + 0.5, uss->input_count,
+				 memtupcount);
+#endif
+		return true;
+	}
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "network_abbrev: cardinality %f after " INT64_FORMAT
+			 " values (%d rows)", abbr_card, uss->input_count, memtupcount);
+#endif
+
+	return false;
+}
+
+/*
+ * SortSupport conversion routine. Converts original inet/cidr representations
+ * to abbreviated keys . The inet/cidr types are pass-by-reference, so is an
+ * optimization so that sorting routines don't have to pull full values from
+ * the heap to compare.
+ *
+ * All inet values have three major components (and take for example the
+ * address 1.2.3.4/24):
+ *
+ *     * A network, or netmasked bits (1.2.3.0).
+ *     * A netmask size (/24).
+ *     * A subnet, or bits outside of the netmask (0.0.0.4).
+ *
+ * cidr values are the same except that with only the first two components --
+ * all their subnet bits *must* be zero (1.2.3.0/24).
+ *
+ * IPv4 and IPv6 are identical in this makeup, with the difference being that
+ * IPv4 addresses have a maximum of 32 bits compared to IPv6's 64 bits, so in
+ * IPv6 each part may be larger.
+ *
+ * inet/cdir types compare using these sorting rules. If inequality is detected
+ * at any step, comparison is done. If any rule is a tie, the algorithm drops
+ * through to the next to break it:
+ *
+ *     1. IPv4 always appears before IPv6.
+ *     2. Just bits in the network are compared.
+ *     3. Netmask size is compared.
+ *     4. All bits are compared (having made it here, we know that both
+ *        netmasked bits and netmask size are equal, so we're in effect only
+ *        comparing subnet bits).
+ *
+ * When generating abbreviated keys for SortSupport, we pack in as much
+ * information as we can while ensuring that when comparing those keys as
+ * integers, the rules above will be respected.
+ *
+ * In most cases, that means just the most significant of the netmasked bits
+ * are retained, but in the case of IPv4 addresses on a 64-bit machine, we can
+ * hold almost all relevant information for comparisons to the degree that we
+ * almost never have to fall back to authoritative comparison. In all cases,
+ * there will be enough data present for the abbreviated keys to perform very
+ * well in all except the very worst of circumstances (and in those, key
+ * abbreviation will probably be aborted as we detect low cardinality).
+ *
+ * IPv4
+ * ----
+ *
+ * 32-bit machine:
+ *
+ * Start with 1 bit for the IP family (IPv4 or IPv6; this bit is present in
+ * every case below) followed by all but 1 of the netmasked bits. If those are
+ * all equal, the system will have to fall back to non-abbreviated comparison.
+ *
+ * +----------+---------------------+
+ * | 1 bit IP |   31 bits network   |     (up to 1 bit
+ * |  family  |     (truncated)     |   network omitted)
+ * +----------+---------------------+
+ *
+ * 64-bit machine:
+ *
+ * The most interesting case: we have space to store all netmasked bits,
+ * followed by the netmask size, followed by 25 bits of the subnet. We lay the
+ * bits out carefully this way so that the entire value can be compared as an
+ * integer while still adhering to the inet/cidr sorting rules stated above.
+ *
+ * 25 bits is more than enough to store most subnets (all but /6 or smaller),
+ * so in the vast majority of cases these keys should hold enough information
+ * to save trips to the heap.
+ *
+ * +----------+-----------------------+--------------+--------------------+
+ * | 1 bit IP |    32 bits network    |    6 bits    |   25 bits subnet   |
+ * |  family  |        (full)         | network size |    (truncated)     |
+ * +----------+-----------------------+--------------+--------------------+
+ *
+ * IPv6
+ * ----
+ *
+ * 32-bit machine:
+ *
+ * +----------+---------------------+
+ * | 1 bit IP |   31 bits network   |    (up to 97 bits
+ * |  family  |     (truncated)     |   network omitted)
+ * +----------+---------------------+
+ *
+ * 64-bit machine:
+ *
+ * +----------+---------------------------------+
+ * | 1 bit IP |         63 bits network         |    (up to 65 bits
+ * |  family  |           (truncated)           |   network omitted)
+ * +----------+---------------------------------+
+ *
+ */
+static Datum
+network_abbrev_convert(Datum original, SortSupport ssup)
+{
+	network_sortsupport_state *uss = ssup->ssup_extra;
+	inet	   *authoritative = DatumGetInetPP(original);
+	Datum		res;
+	Datum		ipaddr_datum,
+				network,
+				subnet,
+				subnet_bitmask;
+	int			datum_subnet_size;
+
+	/*
+	 * If another IP family is ever added, we'll need to redesign the key
+	 * abbreviation strategy.
+	 */
+	Assert(ip_family(authoritative) == PGSQL_AF_INET ||
+		   ip_family(authoritative) == PGSQL_AF_INET6);
+
+	res = (Datum) 0;
+	if (ip_family(authoritative) == PGSQL_AF_INET6)
+	{
+		/* Shift a 1 over to the datum's most significant bit. */
+		res = ((Datum) 1) << (SIZEOF_DATUM * BITS_PER_BYTE - 1);
+	}
+
+	/*
+	 * Create an integer representation of the IP address by taking its first
+	 * 4 or 8 bytes. We take 8 bytes of an IPv6 address on a 64-bit machine
+	 * and 4 bytes on a 32-bit. Always take all 4 bytes of an IPv4 address.
+	 *
+	 * We're consuming an array of char, so make sure to byteswap on little
+	 * endian systems (an inet's IP array emulates big endian in that the
+	 * first byte is always the most significant).
+	 */
+	if (ip_family(authoritative) == PGSQL_AF_INET6)
+	{
+		ipaddr_datum = *((Datum *) ip_addr(authoritative));
+		ipaddr_datum = DatumBigEndianToNative(ipaddr_datum);
+	}
+	else
+	{
+		uint32		ipaddr_datum32 = *((uint32 *) ip_addr(authoritative));
+#ifndef WORDS_BIGENDIAN
+		ipaddr_datum = pg_bswap32(ipaddr_datum32);
+#endif
+	}
+
+	/*
+	 * Number of bits in subnet. e.g. An IPv4 that's /24 is 32 - 24 = 8.
+	 *
+	 * However, only some of the bits may have made it into the fixed sized
+	 * datum, so take the smallest number between bits in the subnet and bits
+	 * in the datum which are not part of the network.
+	 */
+	datum_subnet_size = Min(ip_maxbits(authoritative) - ip_bits(authoritative),
+							SIZEOF_DATUM * BITS_PER_BYTE - ip_bits(authoritative));
+
+	/* we may have ended up with < 0 for a large netmask size */
+	if (datum_subnet_size <= 0)
+	{
+		/* the network occupies the entirety `ipaddr_datum` */
+		network = ipaddr_datum;
+		subnet = (Datum) 0;
+	}
+	else
+	{
+		/*
+		 * This shift creates a power of two like `0010 0000`, and subtracts
+		 * one to create a bitmask for an IP's subnet bits like `0001 1111`.
+		 *
+		 * Note that `datum_subnet_mask` may be == 0, in which case we'll
+		 * generate a 0 bitmask and `subnet` will also come out as 0.
+		 */
+		subnet_bitmask = (((Datum) 1) << datum_subnet_size) - 1;
+
+		/* and likewise, use the mask's complement to get the netmask bits */
+		network = ipaddr_datum & ~subnet_bitmask;
+
+		/* bitwise AND the IP and bitmask to extract just the subnet bits */
+		subnet = ipaddr_datum & subnet_bitmask;
+	}
+
+#if SIZEOF_DATUM == 8
+
+	if (ip_family(authoritative) == PGSQL_AF_INET6)
+	{
+		/*
+		 * IPv6 on a 64-bit machine: keep the most significant 63 netmasked
+		 * bits.
+		 */
+		res |= network >> 1;
+	}
+	else
+	{
+		/*
+		 * IPv4 on a 64-bit machine: keep all 32 netmasked bits, netmask size,
+		 * and most significant 25 subnet bits (see diagram above for more
+		 * detail).
+		 */
+
+		Datum		network_shifted;
+		Datum		netmask_size_and_subnet = 0;
+
+		/*
+		 * an IPv4 netmask size has a maximum value of 32, which takes 6 bits
+		 * to contain (0x20 in hex)
+		 */
+		Datum		netmask_size = (Datum) ip_bits(authoritative);
+
+		Assert(netmask_size <= ip_maxbits(authoritative));
+		netmask_size_and_subnet |= netmask_size << ABBREV_BITS_INET4_SUBNET;
+
+		/*
+		 * if we have more than 25 subnet bits of information, shift it down
+		 * to the available size
+		 */
+		if (datum_subnet_size > ABBREV_BITS_INET4_SUBNET)
+		{
+			subnet >>= datum_subnet_size - ABBREV_BITS_INET4_SUBNET;
+		}
+		netmask_size_and_subnet |= subnet;
+
+		/*
+		 * There's a fair bit of scary bit manipulation in this function. This
+		 * is an extra check that verifies that that nothing outside of the
+		 * least signifiant 31 bits is set.
+		 *
+		 * (PG_UINT32_MAX >> 1) = 2^31 - 1 = 31 bits of 1s
+		 */
+		Assert((netmask_size_and_subnet | (PG_UINT32_MAX >> 1)) == (PG_UINT32_MAX >> 1));
+
+		/*
+		 * Shift left 31 bits: 6 bits netmask size + 25 subnet bits.
+		 *
+		 * And assert the opposite as above: no bits should be set in the
+		 * least significant 31 positions where we store netmask size and
+		 * subnet.
+		 */
+		network_shifted = network << (ABBREV_BITS_INET4_NETMASK_SIZE + ABBREV_BITS_INET4_SUBNET);
+		Assert((network_shifted & ~((Datum) PG_UINT32_MAX >> 1)) == network_shifted);
+
+		res |= network_shifted | netmask_size_and_subnet;
+	}
+
+#else							/* SIZEOF_DATUM != 8 */
+
+	/*
+	 * 32-bit machine: keep the most significant 31 netmasked bits in both
+	 * IPv4 and IPv6.
+	 */
+	res |= network >> 1;
+
+#endif
+
+	uss->input_count += 1;
+
+	/*
+	 * Cardinality estimation. The estimate uses uint32, so on a 64-bit
+	 * machine, XOR the two 32-bit halves together to produce slightly more
+	 * entropy.
+	 */
+	if (uss->estimating)
+	{
+		uint32		tmp;
+
+#if SIZEOF_DATUM == 8
+		tmp = (uint32) res ^ (uint32) ((uint64) res >> 32);
+#else							/* SIZEOF_DATUM != 8 */
+		tmp = (uint32) res;
+#endif
+
+		addHyperLogLog(&uss->abbr_card, DatumGetUInt32(hash_uint32(tmp)));
+	}
+
+	return res;
+}
+
+/*
  *	Boolean ordering tests.
  */
 Datum
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 020b741..dfe0ee6 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -93,6 +93,9 @@
   amproc => 'in_range(float4,float4,float8,bool,bool)' },
 { amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
   amprocrighttype => 'inet', amprocnum => '1', amproc => 'network_cmp' },
+{ amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
+  amprocrighttype => 'inet', amprocnum => '2',
+  amproc => 'network_sortsupport' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
   amprocrighttype => 'int2', amprocnum => '1', amproc => 'btint2cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 3ecc2e1..106f762 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3898,6 +3898,9 @@
 { oid => '3551',
   proname => 'network_overlap', prorettype => 'bool',
   proargtypes => 'inet inet', prosrc => 'network_overlap' },
+{ oid => '3424', descr => 'sort support',
+  proname => 'network_sortsupport', prorettype => 'void',
+  proargtypes => 'internal', prosrc => 'network_sortsupport' },
 
 # inet/cidr functions
 { oid => '598', descr => 'abbreviated display of inet value',
diff --git a/src/test/regress/expected/inet.out b/src/test/regress/expected/inet.out
index be9427e..ff7409b 100644
--- a/src/test/regress/expected/inet.out
+++ b/src/test/regress/expected/inet.out
@@ -790,3 +790,107 @@ SELECT inet_merge(c, i) FROM INET_TBL WHERE inet_same_family(c, i);
  ::/24
 (17 rows)
 
+-- tests to specifically check some of the potentially sharp edges when it
+-- comes to SortSupport implementation
+SELECT i FROM (VALUES
+  ('0.0.0.0/0'::inet),
+  ('0.0.0.1/0'::inet),
+  ('192.168.1.4/0'::inet),
+  ('192.168.1.5/0'::inet),
+  ('255.0.0.0/0'::inet),
+  ('255.255.255.255/0'::inet),
+  ('0.0.0.0/1'::inet),
+  ('0.0.0.0/32'::inet),
+  ('0.0.0.1/1'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/1'::inet),
+  ('192.168.1.3/1'::inet),
+  ('192.168.1.1/5'::inet),
+  ('192.168.1.255/5'::inet),
+  ('192.168.1.1/6'::inet),
+  ('192.168.1.255/6'::inet),
+  ('192.168.1.1/23'::inet),
+  ('192.168.1.2/23'::inet),
+  ('192.168.1.3/23'::inet),
+  ('192.168.1.0/24'::inet),
+  ('192.168.1.0/25'::inet),
+  ('255.255.255.255/16'::inet),
+  ('255.255.255.255/31'::inet),
+  ('255.255.255.254/32'::inet),
+  ('255.255.255.255/32'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/0'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/128'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0001/128'::inet),
+  ('::1:ffff:ffff:ffff:ffff/128'::inet),
+  ('::2:ffff:ffff:ffff:ffff/128'::inet),
+  ('127::1'::inet),
+  ('127::2'::inet),
+  ('8000:0000:0000:0000:0000:0000:0000:0000/1'::inet),
+  ('ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70'::inet),
+  ('ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44'::inet),
+  ('ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69'::inet),
+  ('ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73'::inet),
+  ('ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89'::inet),
+  ('ffff:8dd0:646:694c:7c16:7e35:6a26:171/104'::inet),
+  ('ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121'::inet),
+  ('ffff:90e7:e744:664:a93:8efe:1f25:7663/122'::inet),
+  ('ffff:9597:c69c:8b24:57a:8639:ec78:6026/111'::inet),
+  ('ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88'::inet),
+  ('ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23'::inet),
+  ('ffff:ffff:ffff:fffd::/128'::inet),
+  ('ffff:ffff:ffff:fffe::/128'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/128'::inet)
+) AS i ORDER BY i;
+                      i                       
+----------------------------------------------
+ (0.0.0.0/0)
+ (0.0.0.1/0)
+ (192.168.1.4/0)
+ (192.168.1.5/0)
+ (255.0.0.0/0)
+ (255.255.255.255/0)
+ (255.255.255.255/0)
+ (0.0.0.0/1)
+ (0.0.0.1/1)
+ (0.0.0.0)
+ (192.168.1.3/1)
+ (255.255.255.255/1)
+ (192.168.1.1/5)
+ (192.168.1.255/5)
+ (192.168.1.1/6)
+ (192.168.1.255/6)
+ (192.168.1.1/23)
+ (192.168.1.2/23)
+ (192.168.1.3/23)
+ (192.168.1.0/24)
+ (192.168.1.0/25)
+ (255.255.255.255/16)
+ (255.255.255.255/31)
+ (255.255.255.254)
+ (255.255.255.255)
+ (::/0)
+ (ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0)
+ (::)
+ (::1)
+ (::1:ffff:ffff:ffff:ffff)
+ (::2:ffff:ffff:ffff:ffff)
+ (127::1)
+ (127::2)
+ (8000::/1)
+ (ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70)
+ (ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44)
+ (ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69)
+ (ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73)
+ (ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89)
+ (ffff:8dd0:646:694c:7c16:7e35:6a26:171/104)
+ (ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121)
+ (ffff:90e7:e744:664:a93:8efe:1f25:7663/122)
+ (ffff:9597:c69c:8b24:57a:8639:ec78:6026/111)
+ (ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88)
+ (ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23)
+ (ffff:ffff:ffff:fffd::)
+ (ffff:ffff:ffff:fffe::)
+ (ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff)
+(48 rows)
+
diff --git a/src/test/regress/sql/inet.sql b/src/test/regress/sql/inet.sql
index 880e115..1a623bc 100644
--- a/src/test/regress/sql/inet.sql
+++ b/src/test/regress/sql/inet.sql
@@ -146,3 +146,56 @@ INSERT INTO INET_TBL (c, i) VALUES ('10', '10::/8');
 SELECT inet_merge(c, i) FROM INET_TBL;
 -- fix it by inet_same_family() condition
 SELECT inet_merge(c, i) FROM INET_TBL WHERE inet_same_family(c, i);
+
+-- tests to specifically check some of the potentially sharp edges when it
+-- comes to SortSupport implementation
+SELECT i FROM (VALUES
+  ('0.0.0.0/0'::inet),
+  ('0.0.0.1/0'::inet),
+  ('192.168.1.4/0'::inet),
+  ('192.168.1.5/0'::inet),
+  ('255.0.0.0/0'::inet),
+  ('255.255.255.255/0'::inet),
+  ('0.0.0.0/1'::inet),
+  ('0.0.0.0/32'::inet),
+  ('0.0.0.1/1'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/1'::inet),
+  ('192.168.1.3/1'::inet),
+  ('192.168.1.1/5'::inet),
+  ('192.168.1.255/5'::inet),
+  ('192.168.1.1/6'::inet),
+  ('192.168.1.255/6'::inet),
+  ('192.168.1.1/23'::inet),
+  ('192.168.1.2/23'::inet),
+  ('192.168.1.3/23'::inet),
+  ('192.168.1.0/24'::inet),
+  ('192.168.1.0/25'::inet),
+  ('255.255.255.255/16'::inet),
+  ('255.255.255.255/31'::inet),
+  ('255.255.255.254/32'::inet),
+  ('255.255.255.255/32'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/0'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/128'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0001/128'::inet),
+  ('::1:ffff:ffff:ffff:ffff/128'::inet),
+  ('::2:ffff:ffff:ffff:ffff/128'::inet),
+  ('127::1'::inet),
+  ('127::2'::inet),
+  ('8000:0000:0000:0000:0000:0000:0000:0000/1'::inet),
+  ('ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70'::inet),
+  ('ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44'::inet),
+  ('ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69'::inet),
+  ('ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73'::inet),
+  ('ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89'::inet),
+  ('ffff:8dd0:646:694c:7c16:7e35:6a26:171/104'::inet),
+  ('ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121'::inet),
+  ('ffff:90e7:e744:664:a93:8efe:1f25:7663/122'::inet),
+  ('ffff:9597:c69c:8b24:57a:8639:ec78:6026/111'::inet),
+  ('ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88'::inet),
+  ('ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23'::inet),
+  ('ffff:ffff:ffff:fffd::/128'::inet),
+  ('ffff:ffff:ffff:fffe::/128'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/128'::inet)
+) AS i ORDER BY i;
-- 
2.3.6

Brandur Leach

brandur@mutelight.org

almost 7 years ago

In reply to: Brandur Leach (#1)

1 attachment(s)

Re: Patch for SortSupport implementation on inet/cdir

Attached a V2 patch: identical to V1 except rebased and
with a new OID selected.

Attachments:

SortSupport-for-inet-cidr-types-v2.patchapplication/octet-stream; name=SortSupport-for-inet-cidr-types-v2.patchDownload

From 5fee1ce3f9ea78dd9a226a44748c83be04159423 Mon Sep 17 00:00:00 2001
From: Brandur <brandur@brandur.org>
Date: Thu, 24 Jan 2019 08:34:51 -0800
Subject: [PATCH] SortSupport for inet/cidr types

Implements SortSupport for the inet/cidr types in Postgres by devising
an abbreviated key representation for them that will faithfully
reproduce their existing sorting rules. This has the effect of typically
reducing the time taken for inet/cidr sorts by ~50-60%, and should show
a good improvement for the vast majority of real-world data sets.
---
 src/backend/utils/adt/network.c    | 428 +++++++++++++++++++++++++++++
 src/include/catalog/pg_amproc.dat  |   3 +
 src/include/catalog/pg_proc.dat    |   3 +
 src/test/regress/expected/inet.out | 104 +++++++
 src/test/regress/sql/inet.sql      |  53 ++++
 5 files changed, 591 insertions(+)

diff --git a/src/backend/utils/adt/network.c b/src/backend/utils/adt/network.c
index 5af7f4e046..0afd6f5dbd 100644
--- a/src/backend/utils/adt/network.c
+++ b/src/backend/utils/adt/network.c
@@ -15,17 +15,40 @@
 #include "access/hash.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
+#include "lib/hyperloglog.h"
 #include "libpq/libpq-be.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "utils/builtins.h"
+#include "utils/guc.h"
 #include "utils/inet.h"
+#include "utils/sortsupport.h"
 
+/* A few constants for the width in bits of certain values in inet/cidr
+ * abbreviated keys. */
+#if SIZEOF_DATUM == 8
+#define ABBREV_BITS_INET4_NETMASK_SIZE 6
+#define ABBREV_BITS_INET4_SUBNET 25
+#endif
 
 static int32 network_cmp_internal(inet *a1, inet *a2);
 static bool addressOK(unsigned char *a, int bits, int family);
 static inet *internal_inetpl(inet *ip, int64 addend);
 
+/* sortsupport for inet/cidr */
+typedef struct
+{
+	int64		input_count;	/* number of non-null values seen */
+	bool		estimating;		/* true if estimating cardinality */
+
+	hyperLogLogState abbr_card; /* cardinality estimator */
+}			network_sortsupport_state;
+
+static int	network_cmp_internal(inet *a1, inet *a2);
+static int	network_fast_cmp(Datum x, Datum y, SortSupport ssup);
+static int	network_cmp_abbrev(Datum x, Datum y, SortSupport ssup);
+static bool network_abbrev_abort(int memtupcount, SortSupport ssup);
+static Datum network_abbrev_convert(Datum original, SortSupport ssup);
 
 /*
  * Common INET/CIDR input routine
@@ -389,6 +412,411 @@ network_cmp(PG_FUNCTION_ARGS)
 	PG_RETURN_INT32(network_cmp_internal(a1, a2));
 }
 
+/*
+ *	SortSupport implementation functions.
+ */
+
+/*
+ * SortSupport strategy function. Populates a SortSupport struct with the
+ * information necessary to use comparison by abbreviated keys.
+ */
+Datum
+network_sortsupport(PG_FUNCTION_ARGS)
+{
+	SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
+
+	ssup->comparator = network_fast_cmp;
+	ssup->ssup_extra = NULL;
+
+	if (ssup->abbreviate)
+	{
+		network_sortsupport_state *uss;
+		MemoryContext oldcontext;
+
+		oldcontext = MemoryContextSwitchTo(ssup->ssup_cxt);
+
+		uss = palloc(sizeof(network_sortsupport_state));
+		uss->input_count = 0;
+		uss->estimating = true;
+		initHyperLogLog(&uss->abbr_card, 10);
+
+		ssup->ssup_extra = uss;
+
+		ssup->comparator = network_cmp_abbrev;
+		ssup->abbrev_converter = network_abbrev_convert;
+		ssup->abbrev_abort = network_abbrev_abort;
+		ssup->abbrev_full_comparator = network_fast_cmp;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * SortSupport authoritative comparison function. Pulls two inet structs from
+ * the heap and runs a standard comparison on them.
+ */
+static int
+network_fast_cmp(Datum x, Datum y, SortSupport ssup)
+{
+	inet	   *arg1 = DatumGetInetPP(x);
+	inet	   *arg2 = DatumGetInetPP(y);
+
+	return network_cmp_internal(arg1, arg2);
+}
+
+/*
+ * SortSupport abbreviated key comparison function. Compares two inet/cidr
+ * values addresses quickly by treating them like integers, and without having
+ * to go to the heap. This works because we've packed them in a form that can
+ * support this comparison in network_abbrev_convert.
+ */
+static int
+network_cmp_abbrev(Datum x, Datum y, SortSupport ssup)
+{
+	if (x > y)
+		return 1;
+	else if (x == y)
+		return 0;
+	else
+		return -1;
+}
+
+/*
+ * Callback for estimating effectiveness of abbreviated key optimization.
+ *
+ * We pay no attention to the cardinality of the non-abbreviated data, because
+ * there is no equality fast-path within authoritative inet comparator.
+ */
+static bool
+network_abbrev_abort(int memtupcount, SortSupport ssup)
+{
+	network_sortsupport_state *uss = ssup->ssup_extra;
+	double		abbr_card;
+
+	if (memtupcount < 10000 || uss->input_count < 10000 || !uss->estimating)
+		return false;
+
+	abbr_card = estimateHyperLogLog(&uss->abbr_card);
+
+	/*
+	 * If we have >100k distinct values, then even if we were sorting many
+	 * billion rows we'd likely still break even, and the penalty of undoing
+	 * that many rows of abbrevs would probably not be worth it. At this point
+	 * we stop counting because we know that we're now fully committed.
+	 */
+	if (abbr_card > 100000.0)
+	{
+#ifdef TRACE_SORT
+		if (trace_sort)
+			elog(LOG,
+				 "network_abbrev: estimation ends at cardinality %f"
+				 " after " INT64_FORMAT " values (%d rows)",
+				 abbr_card, uss->input_count, memtupcount);
+#endif
+		uss->estimating = false;
+		return false;
+	}
+
+	/*
+	 * Target minimum cardinality is 1 per ~2k of non-null inputs. 0.5 row
+	 * fudge factor allows us to abort earlier on genuinely pathological data
+	 * where we've had exactly one abbreviated value in the first 2k
+	 * (non-null) rows.
+	 */
+	if (abbr_card < uss->input_count / 2000.0 + 0.5)
+	{
+#ifdef TRACE_SORT
+		if (trace_sort)
+			elog(LOG,
+				 "network_abbrev: aborting abbreviation at cardinality %f"
+				 " below threshold %f after " INT64_FORMAT " values (%d rows)",
+				 abbr_card, uss->input_count / 2000.0 + 0.5, uss->input_count,
+				 memtupcount);
+#endif
+		return true;
+	}
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "network_abbrev: cardinality %f after " INT64_FORMAT
+			 " values (%d rows)", abbr_card, uss->input_count, memtupcount);
+#endif
+
+	return false;
+}
+
+/*
+ * SortSupport conversion routine. Converts original inet/cidr representations
+ * to abbreviated keys . The inet/cidr types are pass-by-reference, so is an
+ * optimization so that sorting routines don't have to pull full values from
+ * the heap to compare.
+ *
+ * All inet values have three major components (and take for example the
+ * address 1.2.3.4/24):
+ *
+ *     * A network, or netmasked bits (1.2.3.0).
+ *     * A netmask size (/24).
+ *     * A subnet, or bits outside of the netmask (0.0.0.4).
+ *
+ * cidr values are the same except that with only the first two components --
+ * all their subnet bits *must* be zero (1.2.3.0/24).
+ *
+ * IPv4 and IPv6 are identical in this makeup, with the difference being that
+ * IPv4 addresses have a maximum of 32 bits compared to IPv6's 64 bits, so in
+ * IPv6 each part may be larger.
+ *
+ * inet/cdir types compare using these sorting rules. If inequality is detected
+ * at any step, comparison is done. If any rule is a tie, the algorithm drops
+ * through to the next to break it:
+ *
+ *     1. IPv4 always appears before IPv6.
+ *     2. Just bits in the network are compared.
+ *     3. Netmask size is compared.
+ *     4. All bits are compared (having made it here, we know that both
+ *        netmasked bits and netmask size are equal, so we're in effect only
+ *        comparing subnet bits).
+ *
+ * When generating abbreviated keys for SortSupport, we pack in as much
+ * information as we can while ensuring that when comparing those keys as
+ * integers, the rules above will be respected.
+ *
+ * In most cases, that means just the most significant of the netmasked bits
+ * are retained, but in the case of IPv4 addresses on a 64-bit machine, we can
+ * hold almost all relevant information for comparisons to the degree that we
+ * almost never have to fall back to authoritative comparison. In all cases,
+ * there will be enough data present for the abbreviated keys to perform very
+ * well in all except the very worst of circumstances (and in those, key
+ * abbreviation will probably be aborted as we detect low cardinality).
+ *
+ * IPv4
+ * ----
+ *
+ * 32-bit machine:
+ *
+ * Start with 1 bit for the IP family (IPv4 or IPv6; this bit is present in
+ * every case below) followed by all but 1 of the netmasked bits. If those are
+ * all equal, the system will have to fall back to non-abbreviated comparison.
+ *
+ * +----------+---------------------+
+ * | 1 bit IP |   31 bits network   |     (up to 1 bit
+ * |  family  |     (truncated)     |   network omitted)
+ * +----------+---------------------+
+ *
+ * 64-bit machine:
+ *
+ * The most interesting case: we have space to store all netmasked bits,
+ * followed by the netmask size, followed by 25 bits of the subnet. We lay the
+ * bits out carefully this way so that the entire value can be compared as an
+ * integer while still adhering to the inet/cidr sorting rules stated above.
+ *
+ * 25 bits is more than enough to store most subnets (all but /6 or smaller),
+ * so in the vast majority of cases these keys should hold enough information
+ * to save trips to the heap.
+ *
+ * +----------+-----------------------+--------------+--------------------+
+ * | 1 bit IP |    32 bits network    |    6 bits    |   25 bits subnet   |
+ * |  family  |        (full)         | network size |    (truncated)     |
+ * +----------+-----------------------+--------------+--------------------+
+ *
+ * IPv6
+ * ----
+ *
+ * 32-bit machine:
+ *
+ * +----------+---------------------+
+ * | 1 bit IP |   31 bits network   |    (up to 97 bits
+ * |  family  |     (truncated)     |   network omitted)
+ * +----------+---------------------+
+ *
+ * 64-bit machine:
+ *
+ * +----------+---------------------------------+
+ * | 1 bit IP |         63 bits network         |    (up to 65 bits
+ * |  family  |           (truncated)           |   network omitted)
+ * +----------+---------------------------------+
+ *
+ */
+static Datum
+network_abbrev_convert(Datum original, SortSupport ssup)
+{
+	network_sortsupport_state *uss = ssup->ssup_extra;
+	inet	   *authoritative = DatumGetInetPP(original);
+	Datum		res;
+	Datum		ipaddr_datum,
+				network,
+				subnet,
+				subnet_bitmask;
+	int			datum_subnet_size;
+
+	/*
+	 * If another IP family is ever added, we'll need to redesign the key
+	 * abbreviation strategy.
+	 */
+	Assert(ip_family(authoritative) == PGSQL_AF_INET ||
+		   ip_family(authoritative) == PGSQL_AF_INET6);
+
+	res = (Datum) 0;
+	if (ip_family(authoritative) == PGSQL_AF_INET6)
+	{
+		/* Shift a 1 over to the datum's most significant bit. */
+		res = ((Datum) 1) << (SIZEOF_DATUM * BITS_PER_BYTE - 1);
+	}
+
+	/*
+	 * Create an integer representation of the IP address by taking its first
+	 * 4 or 8 bytes. We take 8 bytes of an IPv6 address on a 64-bit machine
+	 * and 4 bytes on a 32-bit. Always take all 4 bytes of an IPv4 address.
+	 *
+	 * We're consuming an array of char, so make sure to byteswap on little
+	 * endian systems (an inet's IP array emulates big endian in that the
+	 * first byte is always the most significant).
+	 */
+	if (ip_family(authoritative) == PGSQL_AF_INET6)
+	{
+		ipaddr_datum = *((Datum *) ip_addr(authoritative));
+		ipaddr_datum = DatumBigEndianToNative(ipaddr_datum);
+	}
+	else
+	{
+		uint32		ipaddr_datum32 = *((uint32 *) ip_addr(authoritative));
+#ifndef WORDS_BIGENDIAN
+		ipaddr_datum = pg_bswap32(ipaddr_datum32);
+#endif
+	}
+
+	/*
+	 * Number of bits in subnet. e.g. An IPv4 that's /24 is 32 - 24 = 8.
+	 *
+	 * However, only some of the bits may have made it into the fixed sized
+	 * datum, so take the smallest number between bits in the subnet and bits
+	 * in the datum which are not part of the network.
+	 */
+	datum_subnet_size = Min(ip_maxbits(authoritative) - ip_bits(authoritative),
+							SIZEOF_DATUM * BITS_PER_BYTE - ip_bits(authoritative));
+
+	/* we may have ended up with < 0 for a large netmask size */
+	if (datum_subnet_size <= 0)
+	{
+		/* the network occupies the entirety `ipaddr_datum` */
+		network = ipaddr_datum;
+		subnet = (Datum) 0;
+	}
+	else
+	{
+		/*
+		 * This shift creates a power of two like `0010 0000`, and subtracts
+		 * one to create a bitmask for an IP's subnet bits like `0001 1111`.
+		 *
+		 * Note that `datum_subnet_mask` may be == 0, in which case we'll
+		 * generate a 0 bitmask and `subnet` will also come out as 0.
+		 */
+		subnet_bitmask = (((Datum) 1) << datum_subnet_size) - 1;
+
+		/* and likewise, use the mask's complement to get the netmask bits */
+		network = ipaddr_datum & ~subnet_bitmask;
+
+		/* bitwise AND the IP and bitmask to extract just the subnet bits */
+		subnet = ipaddr_datum & subnet_bitmask;
+	}
+
+#if SIZEOF_DATUM == 8
+
+	if (ip_family(authoritative) == PGSQL_AF_INET6)
+	{
+		/*
+		 * IPv6 on a 64-bit machine: keep the most significant 63 netmasked
+		 * bits.
+		 */
+		res |= network >> 1;
+	}
+	else
+	{
+		/*
+		 * IPv4 on a 64-bit machine: keep all 32 netmasked bits, netmask size,
+		 * and most significant 25 subnet bits (see diagram above for more
+		 * detail).
+		 */
+
+		Datum		network_shifted;
+		Datum		netmask_size_and_subnet = 0;
+
+		/*
+		 * an IPv4 netmask size has a maximum value of 32, which takes 6 bits
+		 * to contain (0x20 in hex)
+		 */
+		Datum		netmask_size = (Datum) ip_bits(authoritative);
+
+		Assert(netmask_size <= ip_maxbits(authoritative));
+		netmask_size_and_subnet |= netmask_size << ABBREV_BITS_INET4_SUBNET;
+
+		/*
+		 * if we have more than 25 subnet bits of information, shift it down
+		 * to the available size
+		 */
+		if (datum_subnet_size > ABBREV_BITS_INET4_SUBNET)
+		{
+			subnet >>= datum_subnet_size - ABBREV_BITS_INET4_SUBNET;
+		}
+		netmask_size_and_subnet |= subnet;
+
+		/*
+		 * There's a fair bit of scary bit manipulation in this function. This
+		 * is an extra check that verifies that that nothing outside of the
+		 * least signifiant 31 bits is set.
+		 *
+		 * (PG_UINT32_MAX >> 1) = 2^31 - 1 = 31 bits of 1s
+		 */
+		Assert((netmask_size_and_subnet | (PG_UINT32_MAX >> 1)) == (PG_UINT32_MAX >> 1));
+
+		/*
+		 * Shift left 31 bits: 6 bits netmask size + 25 subnet bits.
+		 *
+		 * And assert the opposite as above: no bits should be set in the
+		 * least significant 31 positions where we store netmask size and
+		 * subnet.
+		 */
+		network_shifted = network << (ABBREV_BITS_INET4_NETMASK_SIZE + ABBREV_BITS_INET4_SUBNET);
+		Assert((network_shifted & ~((Datum) PG_UINT32_MAX >> 1)) == network_shifted);
+
+		res |= network_shifted | netmask_size_and_subnet;
+	}
+
+#else							/* SIZEOF_DATUM != 8 */
+
+	/*
+	 * 32-bit machine: keep the most significant 31 netmasked bits in both
+	 * IPv4 and IPv6.
+	 */
+	res |= network >> 1;
+
+#endif
+
+	uss->input_count += 1;
+
+	/*
+	 * Cardinality estimation. The estimate uses uint32, so on a 64-bit
+	 * machine, XOR the two 32-bit halves together to produce slightly more
+	 * entropy.
+	 */
+	if (uss->estimating)
+	{
+		uint32		tmp;
+
+#if SIZEOF_DATUM == 8
+		tmp = (uint32) res ^ (uint32) ((uint64) res >> 32);
+#else							/* SIZEOF_DATUM != 8 */
+		tmp = (uint32) res;
+#endif
+
+		addHyperLogLog(&uss->abbr_card, DatumGetUInt32(hash_uint32(tmp)));
+	}
+
+	return res;
+}
+
 /*
  *	Boolean ordering tests.
  */
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 020b7413cc..dfe0ee6a2c 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -93,6 +93,9 @@
   amproc => 'in_range(float4,float4,float8,bool,bool)' },
 { amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
   amprocrighttype => 'inet', amprocnum => '1', amproc => 'network_cmp' },
+{ amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
+  amprocrighttype => 'inet', amprocnum => '2',
+  amproc => 'network_sortsupport' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
   amprocrighttype => 'int2', amprocnum => '1', amproc => 'btint2cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 93e3e16f01..b18995db55 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3898,6 +3898,9 @@
 { oid => '3551',
   proname => 'network_overlap', prorettype => 'bool',
   proargtypes => 'inet inet', prosrc => 'network_overlap' },
+{ oid => '3435', descr => 'sort support',
+  proname => 'network_sortsupport', prorettype => 'void',
+  proargtypes => 'internal', prosrc => 'network_sortsupport' },
 
 # inet/cidr functions
 { oid => '598', descr => 'abbreviated display of inet value',
diff --git a/src/test/regress/expected/inet.out b/src/test/regress/expected/inet.out
index be9427eb6b..ff7409b1b6 100644
--- a/src/test/regress/expected/inet.out
+++ b/src/test/regress/expected/inet.out
@@ -790,3 +790,107 @@ SELECT inet_merge(c, i) FROM INET_TBL WHERE inet_same_family(c, i);
  ::/24
 (17 rows)
 
+-- tests to specifically check some of the potentially sharp edges when it
+-- comes to SortSupport implementation
+SELECT i FROM (VALUES
+  ('0.0.0.0/0'::inet),
+  ('0.0.0.1/0'::inet),
+  ('192.168.1.4/0'::inet),
+  ('192.168.1.5/0'::inet),
+  ('255.0.0.0/0'::inet),
+  ('255.255.255.255/0'::inet),
+  ('0.0.0.0/1'::inet),
+  ('0.0.0.0/32'::inet),
+  ('0.0.0.1/1'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/1'::inet),
+  ('192.168.1.3/1'::inet),
+  ('192.168.1.1/5'::inet),
+  ('192.168.1.255/5'::inet),
+  ('192.168.1.1/6'::inet),
+  ('192.168.1.255/6'::inet),
+  ('192.168.1.1/23'::inet),
+  ('192.168.1.2/23'::inet),
+  ('192.168.1.3/23'::inet),
+  ('192.168.1.0/24'::inet),
+  ('192.168.1.0/25'::inet),
+  ('255.255.255.255/16'::inet),
+  ('255.255.255.255/31'::inet),
+  ('255.255.255.254/32'::inet),
+  ('255.255.255.255/32'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/0'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/128'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0001/128'::inet),
+  ('::1:ffff:ffff:ffff:ffff/128'::inet),
+  ('::2:ffff:ffff:ffff:ffff/128'::inet),
+  ('127::1'::inet),
+  ('127::2'::inet),
+  ('8000:0000:0000:0000:0000:0000:0000:0000/1'::inet),
+  ('ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70'::inet),
+  ('ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44'::inet),
+  ('ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69'::inet),
+  ('ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73'::inet),
+  ('ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89'::inet),
+  ('ffff:8dd0:646:694c:7c16:7e35:6a26:171/104'::inet),
+  ('ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121'::inet),
+  ('ffff:90e7:e744:664:a93:8efe:1f25:7663/122'::inet),
+  ('ffff:9597:c69c:8b24:57a:8639:ec78:6026/111'::inet),
+  ('ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88'::inet),
+  ('ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23'::inet),
+  ('ffff:ffff:ffff:fffd::/128'::inet),
+  ('ffff:ffff:ffff:fffe::/128'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/128'::inet)
+) AS i ORDER BY i;
+                      i                       
+----------------------------------------------
+ (0.0.0.0/0)
+ (0.0.0.1/0)
+ (192.168.1.4/0)
+ (192.168.1.5/0)
+ (255.0.0.0/0)
+ (255.255.255.255/0)
+ (255.255.255.255/0)
+ (0.0.0.0/1)
+ (0.0.0.1/1)
+ (0.0.0.0)
+ (192.168.1.3/1)
+ (255.255.255.255/1)
+ (192.168.1.1/5)
+ (192.168.1.255/5)
+ (192.168.1.1/6)
+ (192.168.1.255/6)
+ (192.168.1.1/23)
+ (192.168.1.2/23)
+ (192.168.1.3/23)
+ (192.168.1.0/24)
+ (192.168.1.0/25)
+ (255.255.255.255/16)
+ (255.255.255.255/31)
+ (255.255.255.254)
+ (255.255.255.255)
+ (::/0)
+ (ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0)
+ (::)
+ (::1)
+ (::1:ffff:ffff:ffff:ffff)
+ (::2:ffff:ffff:ffff:ffff)
+ (127::1)
+ (127::2)
+ (8000::/1)
+ (ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70)
+ (ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44)
+ (ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69)
+ (ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73)
+ (ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89)
+ (ffff:8dd0:646:694c:7c16:7e35:6a26:171/104)
+ (ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121)
+ (ffff:90e7:e744:664:a93:8efe:1f25:7663/122)
+ (ffff:9597:c69c:8b24:57a:8639:ec78:6026/111)
+ (ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88)
+ (ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23)
+ (ffff:ffff:ffff:fffd::)
+ (ffff:ffff:ffff:fffe::)
+ (ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff)
+(48 rows)
+
diff --git a/src/test/regress/sql/inet.sql b/src/test/regress/sql/inet.sql
index 880e115360..1a623bc296 100644
--- a/src/test/regress/sql/inet.sql
+++ b/src/test/regress/sql/inet.sql
@@ -146,3 +146,56 @@ INSERT INTO INET_TBL (c, i) VALUES ('10', '10::/8');
 SELECT inet_merge(c, i) FROM INET_TBL;
 -- fix it by inet_same_family() condition
 SELECT inet_merge(c, i) FROM INET_TBL WHERE inet_same_family(c, i);
+
+-- tests to specifically check some of the potentially sharp edges when it
+-- comes to SortSupport implementation
+SELECT i FROM (VALUES
+  ('0.0.0.0/0'::inet),
+  ('0.0.0.1/0'::inet),
+  ('192.168.1.4/0'::inet),
+  ('192.168.1.5/0'::inet),
+  ('255.0.0.0/0'::inet),
+  ('255.255.255.255/0'::inet),
+  ('0.0.0.0/1'::inet),
+  ('0.0.0.0/32'::inet),
+  ('0.0.0.1/1'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/1'::inet),
+  ('192.168.1.3/1'::inet),
+  ('192.168.1.1/5'::inet),
+  ('192.168.1.255/5'::inet),
+  ('192.168.1.1/6'::inet),
+  ('192.168.1.255/6'::inet),
+  ('192.168.1.1/23'::inet),
+  ('192.168.1.2/23'::inet),
+  ('192.168.1.3/23'::inet),
+  ('192.168.1.0/24'::inet),
+  ('192.168.1.0/25'::inet),
+  ('255.255.255.255/16'::inet),
+  ('255.255.255.255/31'::inet),
+  ('255.255.255.254/32'::inet),
+  ('255.255.255.255/32'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/0'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/128'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0001/128'::inet),
+  ('::1:ffff:ffff:ffff:ffff/128'::inet),
+  ('::2:ffff:ffff:ffff:ffff/128'::inet),
+  ('127::1'::inet),
+  ('127::2'::inet),
+  ('8000:0000:0000:0000:0000:0000:0000:0000/1'::inet),
+  ('ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70'::inet),
+  ('ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44'::inet),
+  ('ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69'::inet),
+  ('ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73'::inet),
+  ('ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89'::inet),
+  ('ffff:8dd0:646:694c:7c16:7e35:6a26:171/104'::inet),
+  ('ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121'::inet),
+  ('ffff:90e7:e744:664:a93:8efe:1f25:7663/122'::inet),
+  ('ffff:9597:c69c:8b24:57a:8639:ec78:6026/111'::inet),
+  ('ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88'::inet),
+  ('ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23'::inet),
+  ('ffff:ffff:ffff:fffd::/128'::inet),
+  ('ffff:ffff:ffff:fffe::/128'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/128'::inet)
+) AS i ORDER BY i;
-- 
2.20.1

Edmund Horner

ejrh00@gmail.com

almost 7 years ago

In reply to: Brandur Leach (#1)

Re: Patch for SortSupport implementation on inet/cdir

On Sat, 9 Feb 2019 at 04:48, Brandur Leach <brandur@mutelight.org> wrote:

I've attached a patch that implements SortSupport for the
inet/cidr types. It has the effect of typically reducing
the time taken to sort these types by ~50-60% (as measured
by `SELECT COUNT(DISTINCT ...)` which will carry over to
common operations like index creation, `ORDER BY`, and
`DISTINCT`.

Hi Brandur,

I had a look at this. Your V2 patch applies cleanly, and the code was
straightforward and well commented. I appreciate the big comment at the
top of network_abbrev_convert explaining how you encode the data.

The tests pass. I ran a couple of large scale tests myself and didn't find
any problems. Sorting a million random inets in work_mem = 256MB goes from
roughty 3670ms to 1620ms with the SortSupport, which is pretty impressive.
(But that's in my debug build, so not a serious benchmark.)

An interesting thing about sorting IPv4 inets on 64-bit machines is that
when the inets are the same, the abbreviated comparitor will return 0 which
is taken by the sorting machinery to mean "the datums are the same up to
this point, so you need to call the full comparitor" -- but, in this case,
0 means "the datums truly are the same, no need to call the full
comparitor". Since the full comparitor assumes its arguments to be the
original (typically pass-by-reference) datums, you can't do it there.
You'd need to add another optional comparitor to be called after the
abbreviated one. In inet's case on a 64-bit machine, it would look at the
abbreviated datums and if they're both in the IPv4 family, would return 0
(knowing that the abbreviated comparitor has already done the real work).
I have no reason to believe this particular optimisation is worth anything
much, though; it's outside the scope of this patch, besides.

I have some comments on the comments:

network.c:552
* SortSupport conversion routine. Converts original inet/cidr
representations
* to abbreviated keys . The inet/cidr types are pass-by-reference, so is an
* optimization so that sorting routines don't have to pull full values from
* the heap to compare.

Looks like you have an extra space before the "." on line 553. And
abbreviated keys being an optimisation for pass-by-reference types can be
taken for granted, so I think the last sentence is redundant.

network.c::567
* IPv4 and IPv6 are identical in this makeup, with the difference being that
* IPv4 addresses have a maximum of 32 bits compared to IPv6's 64 bits, so in
* IPv6 each part may be larger.

IPv6's addresses are 128 bit. I'm not sure sure if "maximum" is accurate,
or whether you should just say "IPv4 addresses have 32 bits".

network.c::571
* inet/cdir types compare using these sorting rules. If inequality is
detected
* at any step, comparison is done. If any rule is a tie, the algorithm drops
* through to the next to break it:

When you say "comparison is done" it sounds like more comparing is going to
be done, but what I think you mean is that comparison is finished.

[...]

My benchmarking methodology and script is available here
[1], and involves gathering statistics for 100
`count(distinct ...)` queries at various data sizes. I've
saved the results I got on my machine here [2].

I didn't see any links for [1], [2] and [3] in your email.

Finally, there's a duplicate CF entry:
https://commitfest.postgresql.org/22/1990/ .

Since you're updating https://commitfest.postgresql.org/22/1991/ , I
suggest you mark 1990 as Withdrawn to avoid confusion. If there's a way to
remove it from the CF list, that would be even better.

Edmund

Andres Freund

andres@anarazel.de

almost 7 years ago

In reply to: Edmund Horner (#3)

Re: Patch for SortSupport implementation on inet/cdir

Hi Brandur,

On 2019-02-09 20:12:53 +1300, Edmund Horner wrote:

I had a look at this. Your V2 patch applies cleanly, and the code was
straightforward and well commented. I appreciate the big comment at the
top of network_abbrev_convert explaining how you encode the data.

I've marked the CF entry as waiting-on-author, due to this review.

Brandur, unfortunately this patch has only been submitted to the last
commitfest for v12. Our policy is that all nontrivial patches should be
submitted to at least one earlier commitfest. Therefore I unfortunately
think that v12 is not a realistic target, sorry :(

Greetings,

Andres Freund

David Steele

david@pgmasters.net

almost 7 years ago

In reply to: Andres Freund (#4)

Re: Re: Patch for SortSupport implementation on inet/cdir

On 2/16/19 4:13 AM, Andres Freund wrote:

On 2019-02-09 20:12:53 +1300, Edmund Horner wrote:

I had a look at this. Your V2 patch applies cleanly, and the code was
straightforward and well commented. I appreciate the big comment at the
top of network_abbrev_convert explaining how you encode the data.

I've marked the CF entry as waiting-on-author, due to this review.

Brandur, unfortunately this patch has only been submitted to the last
commitfest for v12. Our policy is that all nontrivial patches should be
submitted to at least one earlier commitfest. Therefore I unfortunately
think that v12 is not a realistic target, sorry :(

I agree, and since there has been no response from the author I have
pushed the target version to PG13.

Regards,
--
-David
david@pgmasters.net

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Edmund Horner (#3)

Re: Patch for SortSupport implementation on inet/cdir

On Fri, Feb 8, 2019 at 11:13 PM Edmund Horner <ejrh00@gmail.com> wrote:

I have some comments on the comments:

Seems reasonable to me.

Where are we on this? I'd like to get the patch committed soon.

--
Peter Geoghegan

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Brandur Leach (#2)

1 attachment(s)

Re: Patch for SortSupport implementation on inet/cdir

On Fri, Feb 8, 2019 at 10:20 AM Brandur Leach <brandur@mutelight.org> wrote:

Attached a V2 patch: identical to V1 except rebased and
with a new OID selected.

Attached is a revised version that I came up with, based on your v2.

I found this part of your approach confusing:

+   /*
+    * Number of bits in subnet. e.g. An IPv4 that's /24 is 32 - 24 = 8.
+    *
+    * However, only some of the bits may have made it into the fixed sized
+    * datum, so take the smallest number between bits in the subnet and bits
+    * in the datum which are not part of the network.
+    */
+   datum_subnet_size = Min(ip_maxbits(authoritative) - ip_bits(authoritative),
+                           SIZEOF_DATUM * BITS_PER_BYTE - ip_bits(authoritative));

The way that you put a Min() on the subnet size potentially constrains
the size of the bitmask used for the network component of the
abbreviated key (the component that comes immediately after the
ipfamily status bit). Why not just let the bitmask be a bitmask,
without bringing SIZEOF_DATUM into it? Doing it that way allowed for a
more streamlined approach, with significantly fewer special cases. I'm
not sure whether or not your approach had bugs, but I didn't like the
way you sometimes did a straight "network = ipaddr_datum" assignment
without masking.

I really liked your diagrams, but much of the text that went with them
either seemed redundant (it described established rules about how the
underlying types sort), or seemed to discuss things that were better
discussed next to the relevant network_abbrev_convert() code.

Thoughts?
--
Peter Geoghegan

Attachments:

v3-0001-Add-sort-support-for-inet-cidr-opfamily.patchapplication/octet-stream; name=v3-0001-Add-sort-support-for-inet-cidr-opfamily.patchDownload

From 47eea6b280c24b3908a696fddb048e5e5cacd2f5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 26 Jul 2019 09:13:31 -0700
Subject: [PATCH v3] Add sort support for inet/cidr opfamily.

Add sort support for inet and cidr, including support for abbreviated
keys.  Testing has shown that this reduces the time taken to sort medium
to large inet/cidr inputs by ~50-60% in realistic cases.

Author: Brandur Leach
Reviewed-By: Peter Geoghegan, Edmund Horner
---
 src/backend/utils/adt/network.c    | 328 +++++++++++++++++++++++++++++
 src/include/catalog/pg_amproc.dat  |   3 +
 src/include/catalog/pg_proc.dat    |   3 +
 src/test/regress/expected/inet.out | 104 +++++++++
 src/test/regress/sql/inet.sql      |  53 +++++
 5 files changed, 491 insertions(+)

diff --git a/src/backend/utils/adt/network.c b/src/backend/utils/adt/network.c
index 5e980cf23f..aa19fabeb3 100644
--- a/src/backend/utils/adt/network.c
+++ b/src/backend/utils/adt/network.c
@@ -16,6 +16,7 @@
 #include "catalog/pg_opfamily.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
+#include "lib/hyperloglog.h"
 #include "libpq/libpq-be.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
@@ -24,12 +25,36 @@
 #include "nodes/supportnodes.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/guc.h"
 #include "utils/hashutils.h"
 #include "utils/inet.h"
 #include "utils/lsyscache.h"
+#include "utils/sortsupport.h"
 
+/*
+ * An IPv4 netmask size is a value in the range of 0 - 32, which is
+ * represented with 6 bits in inet/cidr abbreviated keys where possible.
+ *
+ * An IPv4 inet/cidr abbreviated key can use up to 25 bits for subnet
+ * component.
+ */
+#define ABBREV_BITS_INET4_NETMASK_SIZE	6
+#define ABBREV_BITS_INET4_SUBNET		25
+
+/* sortsupport for inet/cidr */
+typedef struct
+{
+	int64		input_count;	/* number of non-null values seen */
+	bool		estimating;		/* true if estimating cardinality */
+
+	hyperLogLogState abbr_card; /* cardinality estimator */
+} network_sortsupport_state;
 
 static int32 network_cmp_internal(inet *a1, inet *a2);
+static int network_fast_cmp(Datum x, Datum y, SortSupport ssup);
+static int network_cmp_abbrev(Datum x, Datum y, SortSupport ssup);
+static bool network_abbrev_abort(int memtupcount, SortSupport ssup);
+static Datum network_abbrev_convert(Datum original, SortSupport ssup);
 static List *match_network_function(Node *leftop,
 									Node *rightop,
 									int indexarg,
@@ -405,6 +430,309 @@ network_cmp(PG_FUNCTION_ARGS)
 	PG_RETURN_INT32(network_cmp_internal(a1, a2));
 }
 
+/*
+ * SortSupport strategy routine
+ */
+Datum
+network_sortsupport(PG_FUNCTION_ARGS)
+{
+	SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
+
+	ssup->comparator = network_fast_cmp;
+	ssup->ssup_extra = NULL;
+
+	if (ssup->abbreviate)
+	{
+		network_sortsupport_state *uss;
+		MemoryContext oldcontext;
+
+		oldcontext = MemoryContextSwitchTo(ssup->ssup_cxt);
+
+		uss = palloc(sizeof(network_sortsupport_state));
+		uss->input_count = 0;
+		uss->estimating = true;
+		initHyperLogLog(&uss->abbr_card, 10);
+
+		ssup->ssup_extra = uss;
+
+		ssup->comparator = network_cmp_abbrev;
+		ssup->abbrev_converter = network_abbrev_convert;
+		ssup->abbrev_abort = network_abbrev_abort;
+		ssup->abbrev_full_comparator = network_fast_cmp;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * SortSupport comparison func
+ */
+static int
+network_fast_cmp(Datum x, Datum y, SortSupport ssup)
+{
+	inet	   *arg1 = DatumGetInetPP(x);
+	inet	   *arg2 = DatumGetInetPP(y);
+
+	return network_cmp_internal(arg1, arg2);
+}
+
+/*
+ * Abbreviated key comparison func
+ */
+static int
+network_cmp_abbrev(Datum x, Datum y, SortSupport ssup)
+{
+	if (x > y)
+		return 1;
+	else if (x == y)
+		return 0;
+	else
+		return -1;
+}
+
+/*
+ * Callback for estimating effectiveness of abbreviated key optimization.
+ *
+ * We pay no attention to the cardinality of the non-abbreviated data, because
+ * there is no equality fast-path within authoritative inet comparator.
+ */
+static bool
+network_abbrev_abort(int memtupcount, SortSupport ssup)
+{
+	network_sortsupport_state *uss = ssup->ssup_extra;
+	double		abbr_card;
+
+	if (memtupcount < 10000 || uss->input_count < 10000 || !uss->estimating)
+		return false;
+
+	abbr_card = estimateHyperLogLog(&uss->abbr_card);
+
+	/*
+	 * If we have >100k distinct values, then even if we were sorting many
+	 * billion rows we'd likely still break even, and the penalty of undoing
+	 * that many rows of abbrevs would probably not be worth it. At this point
+	 * we stop counting because we know that we're now fully committed.
+	 */
+	if (abbr_card > 100000.0)
+	{
+#ifdef TRACE_SORT
+		if (trace_sort)
+			elog(LOG,
+				 "network_abbrev: estimation ends at cardinality %f"
+				 " after " INT64_FORMAT " values (%d rows)",
+				 abbr_card, uss->input_count, memtupcount);
+#endif
+		uss->estimating = false;
+		return false;
+	}
+
+	/*
+	 * Target minimum cardinality is 1 per ~2k of non-null inputs. 0.5 row
+	 * fudge factor allows us to abort earlier on genuinely pathological data
+	 * where we've had exactly one abbreviated value in the first 2k
+	 * (non-null) rows.
+	 */
+	if (abbr_card < uss->input_count / 2000.0 + 0.5)
+	{
+#ifdef TRACE_SORT
+		if (trace_sort)
+			elog(LOG,
+				 "network_abbrev: aborting abbreviation at cardinality %f"
+				 " below threshold %f after " INT64_FORMAT " values (%d rows)",
+				 abbr_card, uss->input_count / 2000.0 + 0.5, uss->input_count,
+				 memtupcount);
+#endif
+		return true;
+	}
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "network_abbrev: cardinality %f after " INT64_FORMAT
+			 " values (%d rows)", abbr_card, uss->input_count, memtupcount);
+#endif
+
+	return false;
+}
+
+/*
+ * SortSupport conversion routine.  Converts original inet/cidr representation
+ * to abbreviated key representation that works with simple 3-way unsigned int
+ * comparisons.  The network_cmp_internal() rules for sorting inet/cidr datums
+ * are followed by abbreviated comparisons by an encoding scheme that
+ * conditions keys through careful use of padding.
+ *
+ * IPv4
+ * ----
+ *
+ * 4 byte datums:
+ *
+ * Start with 1 bit for the IP family (IPv4 or IPv6; this bit is present in
+ * every case below) followed by all but 1 of the netmasked bits.
+ *
+ * +----------+---------------------+
+ * | 1 bit IP |   31 bits network   |     (1 bit network
+ * |  family  |     (truncated)     |      omitted)
+ * +----------+---------------------+
+ *
+ * 8 byte datums:
+ *
+ * We have space to store all netmasked bits, followed by the netmask size,
+ * followed by 25 bits of the subnet (25 bits is usually more than enough in
+ * practice). cidr datums always have all-zero subnet bits.
+ *
+ * +----------+-----------------------+--------------+--------------------+
+ * | 1 bit IP |    32 bits network    |    6 bits    |   25 bits subnet   |
+ * |  family  |        (full)         | network size |    (truncated)     |
+ * +----------+-----------------------+--------------+--------------------+
+ *
+ * IPv6
+ * ----
+ *
+ * 4 byte datums:
+ *
+ * +----------+---------------------+
+ * | 1 bit IP |   31 bits network   |    (up to 97 bits
+ * |  family  |     (truncated)     |   network omitted)
+ * +----------+---------------------+
+ *
+ * 8 byte datums:
+ *
+ * +----------+---------------------------------+
+ * | 1 bit IP |         63 bits network         |    (up to 65 bits
+ * |  family  |           (truncated)           |   network omitted)
+ * +----------+---------------------------------+
+ */
+static Datum
+network_abbrev_convert(Datum original, SortSupport ssup)
+{
+	network_sortsupport_state *uss = ssup->ssup_extra;
+	inet	   *authoritative = DatumGetInetPP(original);
+	Datum		res,
+				ipaddr_datum,
+				subnet_bitmask,
+				network;
+	int			subnet_size;
+
+	Assert(ip_family(authoritative) == PGSQL_AF_INET ||
+		   ip_family(authoritative) == PGSQL_AF_INET6);
+
+	/*
+	 * Get an unsigned integer representation of datum's ippaddr component.
+	 * This may just be a prefix consisting of the `SIZEOF_DATUM` most
+	 * significant bytes
+	 */
+	if (ip_family(authoritative) == PGSQL_AF_INET)
+	{
+		uint32		ipaddr_datum32 = *((uint32 *) ip_addr(authoritative));
+
+		/* Must byteswap on little-endian machines */
+#ifndef WORDS_BIGENDIAN
+		ipaddr_datum = pg_bswap32(ipaddr_datum32);
+#endif
+
+		/* Initialize result without setting ipfamily bit */
+		res = (Datum) 0;
+	}
+	else
+	{
+		ipaddr_datum = *((Datum *) ip_addr(authoritative));
+
+		/* Must byteswap on little-endian machines */
+		ipaddr_datum = DatumBigEndianToNative(ipaddr_datum);
+
+		/* Initialize result with ipfamily (most significant) bit set */
+		res = ((Datum) 1) << (SIZEOF_DATUM * BITS_PER_BYTE - 1);
+	}
+
+	/*
+	 * ipaddr_datum must be "split in two": high order bits go in "network"
+	 * component of abbreviated key (often with zeroed bits at the end due to
+	 * masking), while low order bits go in "subnet" component when there is
+	 * space for one.  This is accomplished by generating a temp datum subnet
+	 * bitmask, which we may reuse later when generating the subnet bits
+	 * themselves.  (Note that subnet bits are only used with IPv4 datums on
+	 * platforms where datum is 8 bytes.)
+	 *
+	 * The number of bits in subnet is used to generate datum subnet bitmask.
+	 * For example, with a /24 IPv4 datum there are 8 subnet bits (32 - 24),
+	 * so final subnet bitmask is '1111 1111'.
+	 */
+	subnet_size = ip_maxbits(authoritative) - ip_bits(authoritative);
+	Assert(subnet_size >= 0);
+	subnet_bitmask = (((Datum) 1) << subnet_size) - 1;
+	network = ipaddr_datum & ~subnet_bitmask;
+
+#if SIZEOF_DATUM == 8
+	if (ip_family(authoritative) == PGSQL_AF_INET)
+	{
+		/*
+		 * IPv4 with 8 byte datums: keep all 32 netmasked bits, netmask size,
+		 * and most significant 25 subnet bits
+		 */
+		Datum	netmask_size = (Datum) ip_bits(authoritative);
+		Datum	subnet;
+
+		/* Shift left 31 bits: 6 bits netmask size + 25 subnet bits */
+		network <<= (ABBREV_BITS_INET4_NETMASK_SIZE +
+					 ABBREV_BITS_INET4_SUBNET);
+
+		/* Shift size to make room for subnet bits at the end */
+		netmask_size <<= ABBREV_BITS_INET4_SUBNET;
+		/* Don't shift subnet bits (unless we discard below) */
+		subnet = ipaddr_datum & subnet_bitmask;
+
+		/*
+		 * If we have more than 25 subnet bits, we can't fit everything.
+		 * Shift subnet down to avoid clobbering bits that are only supposed
+		 * to be used for netmask_size.
+		 *
+		 * Discarding the least significant subnet bits like this is correct
+		 * because abbreviated comparisons that are resolved at the subnet
+		 * level must have had equal subnet sizes in order to get that far.
+		 * Comparisons consistently compare like with like.
+		 */
+		if (subnet_size > ABBREV_BITS_INET4_SUBNET)
+			subnet >>= subnet_size - ABBREV_BITS_INET4_SUBNET;
+
+		/*
+		 * Assemble the final abbreviated key without clobbering the ipfamily
+		 * bit that must remain unset
+		 */
+		res |= network | netmask_size | subnet;
+	}
+	else
+#endif
+	{
+		/*
+		 * 4 byte datums, or IPv6 with 8 byte datums: Use as many of the
+		 * netmasked bits as will fit in final abbreviated key.  Avoid
+		 * clobbering the ipfamily bit that was set earlier.
+		 */
+		res |= network >> 1;
+	}
+
+	uss->input_count += 1;
+
+	/* Hash abbreviated key */
+	if (uss->estimating)
+	{
+		uint32		tmp;
+
+#if SIZEOF_DATUM == 8
+		tmp = (uint32) res ^ (uint32) ((uint64) res >> 32);
+#else							/* SIZEOF_DATUM != 8 */
+		tmp = (uint32) res;
+#endif
+
+		addHyperLogLog(&uss->abbr_card, DatumGetUInt32(hash_uint32(tmp)));
+	}
+
+	return res;
+}
+
 /*
  *	Boolean ordering tests.
  */
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 5567b7ebad..5e705019b4 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -93,6 +93,9 @@
   amproc => 'in_range(float4,float4,float8,bool,bool)' },
 { amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
   amprocrighttype => 'inet', amprocnum => '1', amproc => 'network_cmp' },
+{ amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
+  amprocrighttype => 'inet', amprocnum => '2',
+  amproc => 'network_sortsupport' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
   amprocrighttype => 'int2', amprocnum => '1', amproc => 'btint2cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0902dce5f1..58311d31e3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3954,6 +3954,9 @@
 { oid => '3551',
   proname => 'network_overlap', prorettype => 'bool',
   proargtypes => 'inet inet', prosrc => 'network_overlap' },
+{ oid => '8190', descr => 'sort support',
+  proname => 'network_sortsupport', prorettype => 'void',
+  proargtypes => 'internal', prosrc => 'network_sortsupport' },
 
 # inet/cidr functions
 { oid => '598', descr => 'abbreviated display of inet value',
diff --git a/src/test/regress/expected/inet.out b/src/test/regress/expected/inet.out
index 24202376f1..961808cde1 100644
--- a/src/test/regress/expected/inet.out
+++ b/src/test/regress/expected/inet.out
@@ -845,3 +845,107 @@ SELECT inet_merge(c, i) FROM INET_TBL WHERE inet_same_family(c, i);
  ::/24
 (17 rows)
 
+-- tests to specifically check some of the potentially sharp edges when it
+-- comes to SortSupport implementation
+SELECT i FROM (VALUES
+  ('0.0.0.0/0'::inet),
+  ('0.0.0.1/0'::inet),
+  ('192.168.1.4/0'::inet),
+  ('192.168.1.5/0'::inet),
+  ('255.0.0.0/0'::inet),
+  ('255.255.255.255/0'::inet),
+  ('0.0.0.0/1'::inet),
+  ('0.0.0.0/32'::inet),
+  ('0.0.0.1/1'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/1'::inet),
+  ('192.168.1.3/1'::inet),
+  ('192.168.1.1/5'::inet),
+  ('192.168.1.255/5'::inet),
+  ('192.168.1.1/6'::inet),
+  ('192.168.1.255/6'::inet),
+  ('192.168.1.1/23'::inet),
+  ('192.168.1.2/23'::inet),
+  ('192.168.1.3/23'::inet),
+  ('192.168.1.0/24'::inet),
+  ('192.168.1.0/25'::inet),
+  ('255.255.255.255/16'::inet),
+  ('255.255.255.255/31'::inet),
+  ('255.255.255.254/32'::inet),
+  ('255.255.255.255/32'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/0'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/128'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0001/128'::inet),
+  ('::1:ffff:ffff:ffff:ffff/128'::inet),
+  ('::2:ffff:ffff:ffff:ffff/128'::inet),
+  ('127::1'::inet),
+  ('127::2'::inet),
+  ('8000:0000:0000:0000:0000:0000:0000:0000/1'::inet),
+  ('ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70'::inet),
+  ('ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44'::inet),
+  ('ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69'::inet),
+  ('ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73'::inet),
+  ('ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89'::inet),
+  ('ffff:8dd0:646:694c:7c16:7e35:6a26:171/104'::inet),
+  ('ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121'::inet),
+  ('ffff:90e7:e744:664:a93:8efe:1f25:7663/122'::inet),
+  ('ffff:9597:c69c:8b24:57a:8639:ec78:6026/111'::inet),
+  ('ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88'::inet),
+  ('ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23'::inet),
+  ('ffff:ffff:ffff:fffd::/128'::inet),
+  ('ffff:ffff:ffff:fffe::/128'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/128'::inet)
+) AS i ORDER BY i;
+                      i                       
+----------------------------------------------
+ (0.0.0.0/0)
+ (0.0.0.1/0)
+ (192.168.1.4/0)
+ (192.168.1.5/0)
+ (255.0.0.0/0)
+ (255.255.255.255/0)
+ (255.255.255.255/0)
+ (0.0.0.0/1)
+ (0.0.0.1/1)
+ (0.0.0.0)
+ (192.168.1.3/1)
+ (255.255.255.255/1)
+ (192.168.1.1/5)
+ (192.168.1.255/5)
+ (192.168.1.1/6)
+ (192.168.1.255/6)
+ (192.168.1.1/23)
+ (192.168.1.2/23)
+ (192.168.1.3/23)
+ (192.168.1.0/24)
+ (192.168.1.0/25)
+ (255.255.255.255/16)
+ (255.255.255.255/31)
+ (255.255.255.254)
+ (255.255.255.255)
+ (::/0)
+ (ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0)
+ (::)
+ (::1)
+ (::1:ffff:ffff:ffff:ffff)
+ (::2:ffff:ffff:ffff:ffff)
+ (127::1)
+ (127::2)
+ (8000::/1)
+ (ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70)
+ (ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44)
+ (ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69)
+ (ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73)
+ (ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89)
+ (ffff:8dd0:646:694c:7c16:7e35:6a26:171/104)
+ (ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121)
+ (ffff:90e7:e744:664:a93:8efe:1f25:7663/122)
+ (ffff:9597:c69c:8b24:57a:8639:ec78:6026/111)
+ (ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88)
+ (ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23)
+ (ffff:ffff:ffff:fffd::)
+ (ffff:ffff:ffff:fffe::)
+ (ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff)
+(48 rows)
+
diff --git a/src/test/regress/sql/inet.sql b/src/test/regress/sql/inet.sql
index bbfa9d37df..1bd6219d8f 100644
--- a/src/test/regress/sql/inet.sql
+++ b/src/test/regress/sql/inet.sql
@@ -156,3 +156,56 @@ INSERT INTO INET_TBL (c, i) VALUES ('10', '10::/8');
 SELECT inet_merge(c, i) FROM INET_TBL;
 -- fix it by inet_same_family() condition
 SELECT inet_merge(c, i) FROM INET_TBL WHERE inet_same_family(c, i);
+
+-- tests to specifically check some of the potentially sharp edges when it
+-- comes to SortSupport implementation
+SELECT i FROM (VALUES
+  ('0.0.0.0/0'::inet),
+  ('0.0.0.1/0'::inet),
+  ('192.168.1.4/0'::inet),
+  ('192.168.1.5/0'::inet),
+  ('255.0.0.0/0'::inet),
+  ('255.255.255.255/0'::inet),
+  ('0.0.0.0/1'::inet),
+  ('0.0.0.0/32'::inet),
+  ('0.0.0.1/1'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/1'::inet),
+  ('192.168.1.3/1'::inet),
+  ('192.168.1.1/5'::inet),
+  ('192.168.1.255/5'::inet),
+  ('192.168.1.1/6'::inet),
+  ('192.168.1.255/6'::inet),
+  ('192.168.1.1/23'::inet),
+  ('192.168.1.2/23'::inet),
+  ('192.168.1.3/23'::inet),
+  ('192.168.1.0/24'::inet),
+  ('192.168.1.0/25'::inet),
+  ('255.255.255.255/16'::inet),
+  ('255.255.255.255/31'::inet),
+  ('255.255.255.254/32'::inet),
+  ('255.255.255.255/32'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/0'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/128'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0001/128'::inet),
+  ('::1:ffff:ffff:ffff:ffff/128'::inet),
+  ('::2:ffff:ffff:ffff:ffff/128'::inet),
+  ('127::1'::inet),
+  ('127::2'::inet),
+  ('8000:0000:0000:0000:0000:0000:0000:0000/1'::inet),
+  ('ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70'::inet),
+  ('ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44'::inet),
+  ('ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69'::inet),
+  ('ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73'::inet),
+  ('ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89'::inet),
+  ('ffff:8dd0:646:694c:7c16:7e35:6a26:171/104'::inet),
+  ('ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121'::inet),
+  ('ffff:90e7:e744:664:a93:8efe:1f25:7663/122'::inet),
+  ('ffff:9597:c69c:8b24:57a:8639:ec78:6026/111'::inet),
+  ('ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88'::inet),
+  ('ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23'::inet),
+  ('ffff:ffff:ffff:fffd::/128'::inet),
+  ('ffff:ffff:ffff:fffe::/128'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/128'::inet)
+) AS i ORDER BY i;
-- 
2.17.1

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#7)

Re: Patch for SortSupport implementation on inet/cdir

On Fri, Jul 26, 2019 at 6:58 PM Peter Geoghegan <pg@bowt.ie> wrote:

I found this part of your approach confusing:
+   /*
+    * Number of bits in subnet. e.g. An IPv4 that's /24 is 32 - 24 = 8.
+    *
+    * However, only some of the bits may have made it into the fixed sized
+    * datum, so take the smallest number between bits in the subnet and bits
+    * in the datum which are not part of the network.
+    */
+   datum_subnet_size = Min(ip_maxbits(authoritative) - ip_bits(authoritative),
+                           SIZEOF_DATUM * BITS_PER_BYTE - ip_bits(authoritative));
The way that you put a Min() on the subnet size potentially constrains
the size of the bitmask used for the network component of the
abbreviated key (the component that comes immediately after the
ipfamily status bit). Why not just let the bitmask be a bitmask,
without bringing SIZEOF_DATUM into it? Doing it that way allowed for a
more streamlined approach, with significantly fewer special cases. I'm
not sure whether or not your approach had bugs, but I didn't like the
way you sometimes did a straight "network = ipaddr_datum" assignment
without masking.

I guess that the idea here was to prevent masking on ipv6 addresses,
though not on ipv4 addresses. Obviously we're only dealing with a
prefix with ipv6 addresses, whereas we usually have the whole raw
ipaddr with ipv4. Not sure if I'm doing the right thing there in v3,
even though the tests pass. In any case, this will need to be a lot
clearer in the final version.

--
Peter Geoghegan

Brandur Leach

brandur@mutelight.org

over 6 years ago

In reply to: Peter Geoghegan (#8)

1 attachment(s)

Re: Patch for SortSupport implementation on inet/cdir

Thanks the follow ups on this one Edmund/Peter!

I've attached a new V4 variant of the patch based on
Peter's V3, mostly containing comment amendments and a few
other minor stylistic fixes.

An interesting thing about sorting IPv4 inets on 64-bit machines is that

when the inets are the same, the abbreviated comparitor will return 0 which
is taken by the sorting machinery to mean "the datums are the same up to
this point, so you need to call the full comparitor" — but, in this case, 0
means "the datums truly are the same, no need to call the full
comparitor". Since the full comparitor assumes its arguments to be the
original (typically pass-by-reference) datums, you can't do it there.
You'd need to add another optional comparitor to be called after the
abbreviated one. In inet's case on a 64-bit machine, it would look at the
abbreviated datums and if they're both in the IPv4 family, would return 0
(knowing that the abbreviated comparitor has already done the real work).
I have no reason to believe this particular optimisation is worth anything
much, though; it's outside the scope of this patch, besides.

Edmund: thanks a lot for the really quick review turn
around, and apologies for not following up sooner!

Agreed that this change is out-of-scope, but it could be an
interesting improvement. You'd have similar potential speed
improvements for other SortSupport data types like uuid,
strings (short ones), and macaddr. Low cardinality data
sets would probably benefit the most.

When you say "comparison is done" it sounds like more comparing is going

to be done, but what I think you mean is that comparison is finished.

Peter's version of my patch ended up stripping out and/or
changing some of my original comments, so most of them are
fixed by virtue of that. And agreed about the ambiguity in
wording above — I changed "done" to "finished".

I didn't see any links for [1], [2] and [3] in your email.

Good catch. Here are the original footnote links:

[1]: https://github.com/brandur/inet-sortsupport-test
[2]: https://github.com/brandur/inet-sortsupport-test/blob/master/results.md
https://github.com/brandur/inet-sortsupport-test/blob/master/results.md
[3]: https://github.com/brandur/postgres/compare/master...brandur-inet-sortsupport-unit#diff-a28824d1339d3bb74bb0297c60140dd1
https://github.com/brandur/postgres/compare/master...brandur-inet-sortsupport-unit#diff-a28824d1339d3bb74bb0297c60140dd1

The way that you put a Min() on the subnet size potentially constrains
the size of the bitmask used for the network component of the
abbreviated key (the component that comes immediately after the
ipfamily status bit). Why not just let the bitmask be a bitmask,
without bringing SIZEOF_DATUM into it? Doing it that way allowed for a
more streamlined approach, with significantly fewer special cases.

Peter: thanks a lot for the very thorough look and revised
version! Generally agreed that fewer special cases is good,
but I was also trying to make sure that we didn't
compromise the code's understandability by optimizing for
fewer special cases above everything else (especially for
this sort of thing where tricky bit manipulation is
involved).

But I traced through your variant and it looks fine to me
(still looks correct, and readability is still good). I've
pulled most of your changes into V4.

I'm not sure whether or not your approach had bugs, but I
didn't like the way you sometimes did a straight "network
= ipaddr_datum" assignment without masking.

For what it's worth, I believe this did work, even if it
did depend on being within that one branch of code. Agreed
though that avoiding it (as in the new version) is more
hygienic.

I really liked your diagrams, but much of the text that went with them
either seemed redundant (it described established rules about how the
underlying types sort), or seemed to discuss things that were better
discussed next to the relevant network_abbrev_convert() code.

Thanks! I kept most of your changes, but resurrected some
of my original introductory text. The fine points of the
code's implementation are intricate enough that I think
having some background included is useful to new entrants;
specifically:

1. Norms for naming the different "parts" (network, size,
subnet) of an inet/cidr value aren't broadly
well-established, so defining what they are before
showing diagrams is helpful.

2. Having examples of each part (`1.2.3.0`, `/24`,
`0.0.0.4`) helps mentally cement them.

3. I know that it's in the PG documentation, but the rules
for sorting inet/cidr are not very intuitive. Spending a
few lines re-iterating them so that they don't have to
be cross-referenced elsewhere is worth the space.

Anyway, once again appreciate the extreme attention to
detail on these reviews — this level of rigor would be a
very rare find in projects outside of the Postgres
community!

Brandur

Attachments:

v4-0001-SortSupport-for-inet-cidr-types.patchapplication/octet-stream; name=v4-0001-SortSupport-for-inet-cidr-types.patchDownload

From 8f904b59f91c2bb9a550ee8377e46f4a36c69b11 Mon Sep 17 00:00:00 2001
From: Brandur <brandur@brandur.org>
Date: Sun, 28 Jul 2019 10:57:07 -0700
Subject: [PATCH v4] SortSupport for inet/cidr types

Implements SortSupport for the inet/cidr types in Postgres by devising
an abbreviated key representation for them that will faithfully
reproduce their existing sorting rules. This has the effect of typically
reducing the time taken for inet/cidr sorts by ~50-60%, and should show
a good improvement for the vast majority of real-world data sets.
---
 src/backend/utils/adt/network.c    | 361 +++++++++++++++++++++++++++++++++++++
 src/include/catalog/pg_amproc.dat  |   3 +
 src/include/catalog/pg_proc.dat    |   3 +
 src/test/regress/expected/inet.out | 104 +++++++++++
 src/test/regress/sql/inet.sql      |  53 ++++++
 5 files changed, 524 insertions(+)

diff --git a/src/backend/utils/adt/network.c b/src/backend/utils/adt/network.c
index 5e980cf..3c0d97e 100644
--- a/src/backend/utils/adt/network.c
+++ b/src/backend/utils/adt/network.c
@@ -16,6 +16,7 @@
 #include "catalog/pg_opfamily.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
+#include "lib/hyperloglog.h"
 #include "libpq/libpq-be.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
@@ -24,12 +25,36 @@
 #include "nodes/supportnodes.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/guc.h"
 #include "utils/hashutils.h"
 #include "utils/inet.h"
 #include "utils/lsyscache.h"
+#include "utils/sortsupport.h"
 
+/*
+ * An IPv4 netmask size is a value in the range of 0 - 32, which is
+ * represented with 6 bits in inet/cidr abbreviated keys where possible.
+ *
+ * An IPv4 inet/cidr abbreviated key can use up to 25 bits for subnet
+ * component.
+ */
+#define ABBREV_BITS_INET4_NETMASK_SIZE	6
+#define ABBREV_BITS_INET4_SUBNET		25
+
+/* sortsupport for inet/cidr */
+typedef struct
+{
+	int64		input_count;	/* number of non-null values seen */
+	bool		estimating;		/* true if estimating cardinality */
+
+	hyperLogLogState abbr_card; /* cardinality estimator */
+}			network_sortsupport_state;
 
 static int32 network_cmp_internal(inet *a1, inet *a2);
+static int	network_fast_cmp(Datum x, Datum y, SortSupport ssup);
+static int	network_cmp_abbrev(Datum x, Datum y, SortSupport ssup);
+static bool network_abbrev_abort(int memtupcount, SortSupport ssup);
+static Datum network_abbrev_convert(Datum original, SortSupport ssup);
 static List *match_network_function(Node *leftop,
 									Node *rightop,
 									int indexarg,
@@ -406,6 +431,342 @@ network_cmp(PG_FUNCTION_ARGS)
 }
 
 /*
+ * SortSupport strategy routine
+ */
+Datum
+network_sortsupport(PG_FUNCTION_ARGS)
+{
+	SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
+
+	ssup->comparator = network_fast_cmp;
+	ssup->ssup_extra = NULL;
+
+	if (ssup->abbreviate)
+	{
+		network_sortsupport_state *uss;
+		MemoryContext oldcontext;
+
+		oldcontext = MemoryContextSwitchTo(ssup->ssup_cxt);
+
+		uss = palloc(sizeof(network_sortsupport_state));
+		uss->input_count = 0;
+		uss->estimating = true;
+		initHyperLogLog(&uss->abbr_card, 10);
+
+		ssup->ssup_extra = uss;
+
+		ssup->comparator = network_cmp_abbrev;
+		ssup->abbrev_converter = network_abbrev_convert;
+		ssup->abbrev_abort = network_abbrev_abort;
+		ssup->abbrev_full_comparator = network_fast_cmp;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * SortSupport comparison func
+ */
+static int
+network_fast_cmp(Datum x, Datum y, SortSupport ssup)
+{
+	inet	   *arg1 = DatumGetInetPP(x);
+	inet	   *arg2 = DatumGetInetPP(y);
+
+	return network_cmp_internal(arg1, arg2);
+}
+
+/*
+ * Abbreviated key comparison func
+ */
+static int
+network_cmp_abbrev(Datum x, Datum y, SortSupport ssup)
+{
+	if (x > y)
+		return 1;
+	else if (x == y)
+		return 0;
+	else
+		return -1;
+}
+
+/*
+ * Callback for estimating effectiveness of abbreviated key optimization.
+ *
+ * We pay no attention to the cardinality of the non-abbreviated data, because
+ * there is no equality fast-path within authoritative inet comparator.
+ */
+static bool
+network_abbrev_abort(int memtupcount, SortSupport ssup)
+{
+	network_sortsupport_state *uss = ssup->ssup_extra;
+	double		abbr_card;
+
+	if (memtupcount < 10000 || uss->input_count < 10000 || !uss->estimating)
+		return false;
+
+	abbr_card = estimateHyperLogLog(&uss->abbr_card);
+
+	/*
+	 * If we have >100k distinct values, then even if we were sorting many
+	 * billion rows we'd likely still break even, and the penalty of undoing
+	 * that many rows of abbrevs would probably not be worth it. At this point
+	 * we stop counting because we know that we're now fully committed.
+	 */
+	if (abbr_card > 100000.0)
+	{
+#ifdef TRACE_SORT
+		if (trace_sort)
+			elog(LOG,
+				 "network_abbrev: estimation ends at cardinality %f"
+				 " after " INT64_FORMAT " values (%d rows)",
+				 abbr_card, uss->input_count, memtupcount);
+#endif
+		uss->estimating = false;
+		return false;
+	}
+
+	/*
+	 * Target minimum cardinality is 1 per ~2k of non-null inputs. 0.5 row
+	 * fudge factor allows us to abort earlier on genuinely pathological data
+	 * where we've had exactly one abbreviated value in the first 2k
+	 * (non-null) rows.
+	 */
+	if (abbr_card < uss->input_count / 2000.0 + 0.5)
+	{
+#ifdef TRACE_SORT
+		if (trace_sort)
+			elog(LOG,
+				 "network_abbrev: aborting abbreviation at cardinality %f"
+				 " below threshold %f after " INT64_FORMAT " values (%d rows)",
+				 abbr_card, uss->input_count / 2000.0 + 0.5, uss->input_count,
+				 memtupcount);
+#endif
+		return true;
+	}
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "network_abbrev: cardinality %f after " INT64_FORMAT
+			 " values (%d rows)", abbr_card, uss->input_count, memtupcount);
+#endif
+
+	return false;
+}
+
+/*
+ * SortSupport conversion routine. Converts original inet/cidr representation
+ * to abbreviated key representation that works with simple 3-way unsigned int
+ * comparisons. The network_cmp_internal() rules for sorting inet/cidr datums
+ * are followed by abbreviated comparisons by an encoding scheme that
+ * conditions keys through careful use of padding.
+ *
+ * Some background: inet values have three major components (take for example
+ * the address 1.2.3.4/24):
+ *
+ *     * A network, or netmasked bits (1.2.3.0).
+ *     * A netmask size (/24).
+ *     * A subnet, or bits outside of the netmask (0.0.0.4).
+ *
+ * cidr values are the same except that with only the first two components --
+ * all their subnet bits *must* be zero (1.2.3.0/24).
+ *
+ * IPv4 and IPv6 are identical in this makeup, with the difference being that
+ * IPv4 addresses have a maximum of 32 bits compared to IPv6's 64 bits, so in
+ * IPv6 each part may be larger.
+ *
+ * inet/cdir types compare using these sorting rules. If inequality is detected
+ * at any step, comparison is finished. If any rule is a tie, the algorithm
+ * drops through to the next to break it:
+ *
+ *     1. IPv4 always appears before IPv6.
+ *     2. Network bits are compared.
+ *     3. Netmask size is compared.
+ *     4. All bits are compared (having made it here, we know that both
+ *        netmasked bits and netmask size are equal, so we're in effect only
+ *        comparing subnet bits).
+ *
+ * When generating abbreviated keys for SortSupport, we pack as much as we can
+ * into a datum while ensuring that when comparing those keys as integers,
+ * these rules will be respected. Exact contents depend on IP family and datum
+ * size.
+ *
+ * IPv4
+ * ----
+ *
+ * 4 byte datums:
+ *
+ * Start with 1 bit for the IP family (IPv4 or IPv6; this bit is present in
+ * every case below) followed by all but 1 of the netmasked bits.
+ *
+ * +----------+---------------------+
+ * | 1 bit IP |   31 bits network   |     (1 bit network
+ * |  family  |     (truncated)     |      omitted)
+ * +----------+---------------------+
+ *
+ * 8 byte datums:
+ *
+ * We have space to store all netmasked bits, followed by the netmask size,
+ * followed by 25 bits of the subnet (25 bits is usually more than enough in
+ * practice). cidr datums always have all-zero subnet bits.
+ *
+ * +----------+-----------------------+--------------+--------------------+
+ * | 1 bit IP |    32 bits network    |    6 bits    |   25 bits subnet   |
+ * |  family  |        (full)         | network size |    (truncated)     |
+ * +----------+-----------------------+--------------+--------------------+
+ *
+ * IPv6
+ * ----
+ *
+ * 4 byte datums:
+ *
+ * +----------+---------------------+
+ * | 1 bit IP |   31 bits network   |    (up to 97 bits
+ * |  family  |     (truncated)     |   network omitted)
+ * +----------+---------------------+
+ *
+ * 8 byte datums:
+ *
+ * +----------+---------------------------------+
+ * | 1 bit IP |         63 bits network         |    (up to 65 bits
+ * |  family  |           (truncated)           |   network omitted)
+ * +----------+---------------------------------+
+ */
+static Datum
+network_abbrev_convert(Datum original, SortSupport ssup)
+{
+	network_sortsupport_state *uss = ssup->ssup_extra;
+	inet	   *authoritative = DatumGetInetPP(original);
+	Datum		res,
+				ipaddr_datum,
+				subnet_bitmask,
+				network;
+	int			subnet_size;
+
+	Assert(ip_family(authoritative) == PGSQL_AF_INET ||
+		   ip_family(authoritative) == PGSQL_AF_INET6);
+
+	/*
+	 * Get an unsigned integer representation of datum's ipaddr component.
+	 * This may just be a prefix consisting of the `SIZEOF_DATUM` most
+	 * significant bytes
+	 */
+	if (ip_family(authoritative) == PGSQL_AF_INET)
+	{
+		uint32		ipaddr_datum32 = *((uint32 *) ip_addr(authoritative));
+
+		/* Must byteswap on little-endian machines */
+#ifndef WORDS_BIGENDIAN
+		ipaddr_datum = pg_bswap32(ipaddr_datum32);
+#endif
+
+		/*
+		 * Initialize result without setting ipfamily bit (a leading zero bit
+		 * implies IPv4)
+		 */
+		res = (Datum) 0;
+	}
+	else
+	{
+		ipaddr_datum = *((Datum *) ip_addr(authoritative));
+
+		/* Must byteswap on little-endian machines */
+		ipaddr_datum = DatumBigEndianToNative(ipaddr_datum);
+
+		/* Initialize result with ipfamily (most significant) bit set */
+		res = ((Datum) 1) << (SIZEOF_DATUM * BITS_PER_BYTE - 1);
+	}
+
+	/*
+	 * ipaddr_datum must be "split in two": high order bits go in "network"
+	 * component of abbreviated key (often with zeroed bits at the end due to
+	 * masking), while low order bits go in "subnet" component when there is
+	 * space for one.  This is accomplished by generating a temp datum subnet
+	 * bitmask, which we may reuse later when generating the subnet bits
+	 * themselves.  (Note that subnet bits are only used with IPv4 datums on
+	 * platforms where datum is 8 bytes.)
+	 *
+	 * The number of bits in subnet is used to generate a datum subnet
+	 * bitmask. For example, with a /24 IPv4 datum there are 8 subnet bits (32
+	 * - 24), so the final subnet bitmask is '1111 1111'.
+	 */
+	subnet_size = ip_maxbits(authoritative) - ip_bits(authoritative);
+	Assert(subnet_size >= 0);
+	subnet_bitmask = (((Datum) 1) << subnet_size) - 1;
+	network = ipaddr_datum & ~subnet_bitmask;
+
+#if SIZEOF_DATUM == 8
+	if (ip_family(authoritative) == PGSQL_AF_INET)
+	{
+		/*
+		 * IPv4 with 8 byte datums: keep all 32 netmasked bits, netmask size,
+		 * and most significant 25 subnet bits
+		 */
+		Datum		netmask_size = (Datum) ip_bits(authoritative);
+		Datum		subnet;
+
+		/* Shift left 31 bits: 6 bits netmask size + 25 subnet bits */
+		network <<= (ABBREV_BITS_INET4_NETMASK_SIZE +
+					 ABBREV_BITS_INET4_SUBNET);
+
+		/* Shift size to make room for subnet bits at the end */
+		netmask_size <<= ABBREV_BITS_INET4_SUBNET;
+
+		/* Extract subnet bits without shifting them */
+		subnet = ipaddr_datum & subnet_bitmask;
+
+		/*
+		 * If we have more than 25 subnet bits, we can't fit everything. Shift
+		 * subnet down to avoid clobbering bits that are only supposed to be
+		 * used for netmask_size.
+		 *
+		 * Discarding the least significant subnet bits like this is correct
+		 * because abbreviated comparisons that are resolved at the subnet
+		 * level must have had equal subnet sizes in order to get that far.
+		 */
+		if (subnet_size > ABBREV_BITS_INET4_SUBNET)
+			subnet >>= subnet_size - ABBREV_BITS_INET4_SUBNET;
+
+		/*
+		 * Assemble the final abbreviated key without clobbering the ipfamily
+		 * bit that must remain a zero.
+		 */
+		res |= network | netmask_size | subnet;
+	}
+	else
+#endif
+	{
+		/*
+		 * 4 byte datums, or IPv6 with 8 byte datums: Use as many of the
+		 * netmasked bits as will fit in final abbreviated key.  Avoid
+		 * clobbering the ipfamily bit that was set earlier.
+		 */
+		res |= network >> 1;
+	}
+
+	uss->input_count += 1;
+
+	/* Hash abbreviated key */
+	if (uss->estimating)
+	{
+		uint32		tmp;
+
+#if SIZEOF_DATUM == 8
+		tmp = (uint32) res ^ (uint32) ((uint64) res >> 32);
+#else							/* SIZEOF_DATUM != 8 */
+		tmp = (uint32) res;
+#endif
+
+		addHyperLogLog(&uss->abbr_card, DatumGetUInt32(hash_uint32(tmp)));
+	}
+
+	return res;
+}
+
+/*
  *	Boolean ordering tests.
  */
 Datum
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 5567b7e..5e70501 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -93,6 +93,9 @@
   amproc => 'in_range(float4,float4,float8,bool,bool)' },
 { amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
   amprocrighttype => 'inet', amprocnum => '1', amproc => 'network_cmp' },
+{ amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
+  amprocrighttype => 'inet', amprocnum => '2',
+  amproc => 'network_sortsupport' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
   amprocrighttype => 'int2', amprocnum => '1', amproc => 'btint2cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0902dce..58311d3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3954,6 +3954,9 @@
 { oid => '3551',
   proname => 'network_overlap', prorettype => 'bool',
   proargtypes => 'inet inet', prosrc => 'network_overlap' },
+{ oid => '8190', descr => 'sort support',
+  proname => 'network_sortsupport', prorettype => 'void',
+  proargtypes => 'internal', prosrc => 'network_sortsupport' },
 
 # inet/cidr functions
 { oid => '598', descr => 'abbreviated display of inet value',
diff --git a/src/test/regress/expected/inet.out b/src/test/regress/expected/inet.out
index 2420237..961808c 100644
--- a/src/test/regress/expected/inet.out
+++ b/src/test/regress/expected/inet.out
@@ -845,3 +845,107 @@ SELECT inet_merge(c, i) FROM INET_TBL WHERE inet_same_family(c, i);
  ::/24
 (17 rows)
 
+-- tests to specifically check some of the potentially sharp edges when it
+-- comes to SortSupport implementation
+SELECT i FROM (VALUES
+  ('0.0.0.0/0'::inet),
+  ('0.0.0.1/0'::inet),
+  ('192.168.1.4/0'::inet),
+  ('192.168.1.5/0'::inet),
+  ('255.0.0.0/0'::inet),
+  ('255.255.255.255/0'::inet),
+  ('0.0.0.0/1'::inet),
+  ('0.0.0.0/32'::inet),
+  ('0.0.0.1/1'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/1'::inet),
+  ('192.168.1.3/1'::inet),
+  ('192.168.1.1/5'::inet),
+  ('192.168.1.255/5'::inet),
+  ('192.168.1.1/6'::inet),
+  ('192.168.1.255/6'::inet),
+  ('192.168.1.1/23'::inet),
+  ('192.168.1.2/23'::inet),
+  ('192.168.1.3/23'::inet),
+  ('192.168.1.0/24'::inet),
+  ('192.168.1.0/25'::inet),
+  ('255.255.255.255/16'::inet),
+  ('255.255.255.255/31'::inet),
+  ('255.255.255.254/32'::inet),
+  ('255.255.255.255/32'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/0'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/128'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0001/128'::inet),
+  ('::1:ffff:ffff:ffff:ffff/128'::inet),
+  ('::2:ffff:ffff:ffff:ffff/128'::inet),
+  ('127::1'::inet),
+  ('127::2'::inet),
+  ('8000:0000:0000:0000:0000:0000:0000:0000/1'::inet),
+  ('ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70'::inet),
+  ('ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44'::inet),
+  ('ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69'::inet),
+  ('ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73'::inet),
+  ('ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89'::inet),
+  ('ffff:8dd0:646:694c:7c16:7e35:6a26:171/104'::inet),
+  ('ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121'::inet),
+  ('ffff:90e7:e744:664:a93:8efe:1f25:7663/122'::inet),
+  ('ffff:9597:c69c:8b24:57a:8639:ec78:6026/111'::inet),
+  ('ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88'::inet),
+  ('ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23'::inet),
+  ('ffff:ffff:ffff:fffd::/128'::inet),
+  ('ffff:ffff:ffff:fffe::/128'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/128'::inet)
+) AS i ORDER BY i;
+                      i                       
+----------------------------------------------
+ (0.0.0.0/0)
+ (0.0.0.1/0)
+ (192.168.1.4/0)
+ (192.168.1.5/0)
+ (255.0.0.0/0)
+ (255.255.255.255/0)
+ (255.255.255.255/0)
+ (0.0.0.0/1)
+ (0.0.0.1/1)
+ (0.0.0.0)
+ (192.168.1.3/1)
+ (255.255.255.255/1)
+ (192.168.1.1/5)
+ (192.168.1.255/5)
+ (192.168.1.1/6)
+ (192.168.1.255/6)
+ (192.168.1.1/23)
+ (192.168.1.2/23)
+ (192.168.1.3/23)
+ (192.168.1.0/24)
+ (192.168.1.0/25)
+ (255.255.255.255/16)
+ (255.255.255.255/31)
+ (255.255.255.254)
+ (255.255.255.255)
+ (::/0)
+ (ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0)
+ (::)
+ (::1)
+ (::1:ffff:ffff:ffff:ffff)
+ (::2:ffff:ffff:ffff:ffff)
+ (127::1)
+ (127::2)
+ (8000::/1)
+ (ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70)
+ (ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44)
+ (ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69)
+ (ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73)
+ (ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89)
+ (ffff:8dd0:646:694c:7c16:7e35:6a26:171/104)
+ (ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121)
+ (ffff:90e7:e744:664:a93:8efe:1f25:7663/122)
+ (ffff:9597:c69c:8b24:57a:8639:ec78:6026/111)
+ (ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88)
+ (ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23)
+ (ffff:ffff:ffff:fffd::)
+ (ffff:ffff:ffff:fffe::)
+ (ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff)
+(48 rows)
+
diff --git a/src/test/regress/sql/inet.sql b/src/test/regress/sql/inet.sql
index bbfa9d3..1bd6219 100644
--- a/src/test/regress/sql/inet.sql
+++ b/src/test/regress/sql/inet.sql
@@ -156,3 +156,56 @@ INSERT INTO INET_TBL (c, i) VALUES ('10', '10::/8');
 SELECT inet_merge(c, i) FROM INET_TBL;
 -- fix it by inet_same_family() condition
 SELECT inet_merge(c, i) FROM INET_TBL WHERE inet_same_family(c, i);
+
+-- tests to specifically check some of the potentially sharp edges when it
+-- comes to SortSupport implementation
+SELECT i FROM (VALUES
+  ('0.0.0.0/0'::inet),
+  ('0.0.0.1/0'::inet),
+  ('192.168.1.4/0'::inet),
+  ('192.168.1.5/0'::inet),
+  ('255.0.0.0/0'::inet),
+  ('255.255.255.255/0'::inet),
+  ('0.0.0.0/1'::inet),
+  ('0.0.0.0/32'::inet),
+  ('0.0.0.1/1'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/1'::inet),
+  ('192.168.1.3/1'::inet),
+  ('192.168.1.1/5'::inet),
+  ('192.168.1.255/5'::inet),
+  ('192.168.1.1/6'::inet),
+  ('192.168.1.255/6'::inet),
+  ('192.168.1.1/23'::inet),
+  ('192.168.1.2/23'::inet),
+  ('192.168.1.3/23'::inet),
+  ('192.168.1.0/24'::inet),
+  ('192.168.1.0/25'::inet),
+  ('255.255.255.255/16'::inet),
+  ('255.255.255.255/31'::inet),
+  ('255.255.255.254/32'::inet),
+  ('255.255.255.255/32'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/0'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/128'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0001/128'::inet),
+  ('::1:ffff:ffff:ffff:ffff/128'::inet),
+  ('::2:ffff:ffff:ffff:ffff/128'::inet),
+  ('127::1'::inet),
+  ('127::2'::inet),
+  ('8000:0000:0000:0000:0000:0000:0000:0000/1'::inet),
+  ('ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70'::inet),
+  ('ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44'::inet),
+  ('ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69'::inet),
+  ('ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73'::inet),
+  ('ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89'::inet),
+  ('ffff:8dd0:646:694c:7c16:7e35:6a26:171/104'::inet),
+  ('ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121'::inet),
+  ('ffff:90e7:e744:664:a93:8efe:1f25:7663/122'::inet),
+  ('ffff:9597:c69c:8b24:57a:8639:ec78:6026/111'::inet),
+  ('ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88'::inet),
+  ('ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23'::inet),
+  ('ffff:ffff:ffff:fffd::/128'::inet),
+  ('ffff:ffff:ffff:fffe::/128'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/128'::inet)
+) AS i ORDER BY i;
-- 
2.3.6

#10

Brandur Leach

brandur@mutelight.org

over 6 years ago

In reply to: Brandur Leach (#9)

1 attachment(s)

Re: Patch for SortSupport implementation on inet/cdir

And a slightly amended version of the last patch with a bug
fixed where IPv4 abbreviated keys were were not being
initialized correctly on big-endian machines.

Attachments:

v5-0001-SortSupport-for-inet-cidr-types.patchapplication/octet-stream; name=v5-0001-SortSupport-for-inet-cidr-types.patchDownload

From 19b6de698ade5f12c5058764b6906e2472a44322 Mon Sep 17 00:00:00 2001
From: Brandur <brandur@brandur.org>
Date: Mon, 29 Jul 2019 20:09:43 -0700
Subject: [PATCH v5] SortSupport for inet/cidr types

Implements SortSupport for the inet/cidr types in Postgres by devising
an abbreviated key representation for them that will faithfully
reproduce their existing sorting rules. This has the effect of typically
reducing the time taken for inet/cidr sorts by ~50-60%, and should show
a good improvement for the vast majority of real-world data sets.
---
 src/backend/utils/adt/network.c    | 369 +++++++++++++++++++++++++++++++++++++
 src/include/catalog/pg_amproc.dat  |   3 +
 src/include/catalog/pg_proc.dat    |   3 +
 src/test/regress/expected/inet.out | 104 +++++++++++
 src/test/regress/sql/inet.sql      |  53 ++++++
 5 files changed, 532 insertions(+)

diff --git a/src/backend/utils/adt/network.c b/src/backend/utils/adt/network.c
index 5e980cf23f..12e9f29b66 100644
--- a/src/backend/utils/adt/network.c
+++ b/src/backend/utils/adt/network.c
@@ -16,6 +16,7 @@
 #include "catalog/pg_opfamily.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
+#include "lib/hyperloglog.h"
 #include "libpq/libpq-be.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
@@ -24,12 +25,36 @@
 #include "nodes/supportnodes.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/guc.h"
 #include "utils/hashutils.h"
 #include "utils/inet.h"
 #include "utils/lsyscache.h"
+#include "utils/sortsupport.h"
 
+/*
+ * An IPv4 netmask size is a value in the range of 0 - 32, which is
+ * represented with 6 bits in inet/cidr abbreviated keys where possible.
+ *
+ * An IPv4 inet/cidr abbreviated key can use up to 25 bits for subnet
+ * component.
+ */
+#define ABBREV_BITS_INET4_NETMASK_SIZE	6
+#define ABBREV_BITS_INET4_SUBNET		25
+
+/* sortsupport for inet/cidr */
+typedef struct
+{
+	int64		input_count;	/* number of non-null values seen */
+	bool		estimating;		/* true if estimating cardinality */
+
+	hyperLogLogState abbr_card; /* cardinality estimator */
+}			network_sortsupport_state;
 
 static int32 network_cmp_internal(inet *a1, inet *a2);
+static int	network_fast_cmp(Datum x, Datum y, SortSupport ssup);
+static int	network_cmp_abbrev(Datum x, Datum y, SortSupport ssup);
+static bool network_abbrev_abort(int memtupcount, SortSupport ssup);
+static Datum network_abbrev_convert(Datum original, SortSupport ssup);
 static List *match_network_function(Node *leftop,
 									Node *rightop,
 									int indexarg,
@@ -405,6 +430,350 @@ network_cmp(PG_FUNCTION_ARGS)
 	PG_RETURN_INT32(network_cmp_internal(a1, a2));
 }
 
+/*
+ * SortSupport strategy routine
+ */
+Datum
+network_sortsupport(PG_FUNCTION_ARGS)
+{
+	SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
+
+	ssup->comparator = network_fast_cmp;
+	ssup->ssup_extra = NULL;
+
+	if (ssup->abbreviate)
+	{
+		network_sortsupport_state *uss;
+		MemoryContext oldcontext;
+
+		oldcontext = MemoryContextSwitchTo(ssup->ssup_cxt);
+
+		uss = palloc(sizeof(network_sortsupport_state));
+		uss->input_count = 0;
+		uss->estimating = true;
+		initHyperLogLog(&uss->abbr_card, 10);
+
+		ssup->ssup_extra = uss;
+
+		ssup->comparator = network_cmp_abbrev;
+		ssup->abbrev_converter = network_abbrev_convert;
+		ssup->abbrev_abort = network_abbrev_abort;
+		ssup->abbrev_full_comparator = network_fast_cmp;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * SortSupport comparison func
+ */
+static int
+network_fast_cmp(Datum x, Datum y, SortSupport ssup)
+{
+	inet	   *arg1 = DatumGetInetPP(x);
+	inet	   *arg2 = DatumGetInetPP(y);
+
+	return network_cmp_internal(arg1, arg2);
+}
+
+/*
+ * Abbreviated key comparison func
+ */
+static int
+network_cmp_abbrev(Datum x, Datum y, SortSupport ssup)
+{
+	if (x > y)
+		return 1;
+	else if (x == y)
+		return 0;
+	else
+		return -1;
+}
+
+/*
+ * Callback for estimating effectiveness of abbreviated key optimization.
+ *
+ * We pay no attention to the cardinality of the non-abbreviated data, because
+ * there is no equality fast-path within authoritative inet comparator.
+ */
+static bool
+network_abbrev_abort(int memtupcount, SortSupport ssup)
+{
+	network_sortsupport_state *uss = ssup->ssup_extra;
+	double		abbr_card;
+
+	if (memtupcount < 10000 || uss->input_count < 10000 || !uss->estimating)
+		return false;
+
+	abbr_card = estimateHyperLogLog(&uss->abbr_card);
+
+	/*
+	 * If we have >100k distinct values, then even if we were sorting many
+	 * billion rows we'd likely still break even, and the penalty of undoing
+	 * that many rows of abbrevs would probably not be worth it. At this point
+	 * we stop counting because we know that we're now fully committed.
+	 */
+	if (abbr_card > 100000.0)
+	{
+#ifdef TRACE_SORT
+		if (trace_sort)
+			elog(LOG,
+				 "network_abbrev: estimation ends at cardinality %f"
+				 " after " INT64_FORMAT " values (%d rows)",
+				 abbr_card, uss->input_count, memtupcount);
+#endif
+		uss->estimating = false;
+		return false;
+	}
+
+	/*
+	 * Target minimum cardinality is 1 per ~2k of non-null inputs. 0.5 row
+	 * fudge factor allows us to abort earlier on genuinely pathological data
+	 * where we've had exactly one abbreviated value in the first 2k
+	 * (non-null) rows.
+	 */
+	if (abbr_card < uss->input_count / 2000.0 + 0.5)
+	{
+#ifdef TRACE_SORT
+		if (trace_sort)
+			elog(LOG,
+				 "network_abbrev: aborting abbreviation at cardinality %f"
+				 " below threshold %f after " INT64_FORMAT " values (%d rows)",
+				 abbr_card, uss->input_count / 2000.0 + 0.5, uss->input_count,
+				 memtupcount);
+#endif
+		return true;
+	}
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "network_abbrev: cardinality %f after " INT64_FORMAT
+			 " values (%d rows)", abbr_card, uss->input_count, memtupcount);
+#endif
+
+	return false;
+}
+
+/*
+ * SortSupport conversion routine. Converts original inet/cidr representation
+ * to abbreviated key representation that works with simple 3-way unsigned int
+ * comparisons. The network_cmp_internal() rules for sorting inet/cidr datums
+ * are followed by abbreviated comparisons by an encoding scheme that
+ * conditions keys through careful use of padding.
+ *
+ * Some background: inet values have three major components (take for example
+ * the address 1.2.3.4/24):
+ *
+ *     * A network, or netmasked bits (1.2.3.0).
+ *     * A netmask size (/24).
+ *     * A subnet, or bits outside of the netmask (0.0.0.4).
+ *
+ * cidr values are the same except that with only the first two components --
+ * all their subnet bits *must* be zero (1.2.3.0/24).
+ *
+ * IPv4 and IPv6 are identical in this makeup, with the difference being that
+ * IPv4 addresses have a maximum of 32 bits compared to IPv6's 64 bits, so in
+ * IPv6 each part may be larger.
+ *
+ * inet/cdir types compare using these sorting rules. If inequality is detected
+ * at any step, comparison is finished. If any rule is a tie, the algorithm
+ * drops through to the next to break it:
+ *
+ *     1. IPv4 always appears before IPv6.
+ *     2. Network bits are compared.
+ *     3. Netmask size is compared.
+ *     4. All bits are compared (having made it here, we know that both
+ *        netmasked bits and netmask size are equal, so we're in effect only
+ *        comparing subnet bits).
+ *
+ * When generating abbreviated keys for SortSupport, we pack as much as we can
+ * into a datum while ensuring that when comparing those keys as integers,
+ * these rules will be respected. Exact contents depend on IP family and datum
+ * size.
+ *
+ * IPv4
+ * ----
+ *
+ * 4 byte datums:
+ *
+ * Start with 1 bit for the IP family (IPv4 or IPv6; this bit is present in
+ * every case below) followed by all but 1 of the netmasked bits.
+ *
+ * +----------+---------------------+
+ * | 1 bit IP |   31 bits network   |     (1 bit network
+ * |  family  |     (truncated)     |      omitted)
+ * +----------+---------------------+
+ *
+ * 8 byte datums:
+ *
+ * We have space to store all netmasked bits, followed by the netmask size,
+ * followed by 25 bits of the subnet (25 bits is usually more than enough in
+ * practice). cidr datums always have all-zero subnet bits.
+ *
+ * +----------+-----------------------+--------------+--------------------+
+ * | 1 bit IP |    32 bits network    |    6 bits    |   25 bits subnet   |
+ * |  family  |        (full)         | network size |    (truncated)     |
+ * +----------+-----------------------+--------------+--------------------+
+ *
+ * IPv6
+ * ----
+ *
+ * 4 byte datums:
+ *
+ * +----------+---------------------+
+ * | 1 bit IP |   31 bits network   |    (up to 97 bits
+ * |  family  |     (truncated)     |   network omitted)
+ * +----------+---------------------+
+ *
+ * 8 byte datums:
+ *
+ * +----------+---------------------------------+
+ * | 1 bit IP |         63 bits network         |    (up to 65 bits
+ * |  family  |           (truncated)           |   network omitted)
+ * +----------+---------------------------------+
+ */
+static Datum
+network_abbrev_convert(Datum original, SortSupport ssup)
+{
+	network_sortsupport_state *uss = ssup->ssup_extra;
+	inet	   *authoritative = DatumGetInetPP(original);
+	Datum		res,
+				ipaddr_datum,
+				subnet_bitmask,
+				network;
+	int			subnet_size;
+
+	Assert(ip_family(authoritative) == PGSQL_AF_INET ||
+		   ip_family(authoritative) == PGSQL_AF_INET6);
+
+
+	/*
+	 * Get an unsigned integer representation of the IP address by taking its
+	 * first 4 or 8 bytes. Always take all 4 bytes of an IPv4 address. Take
+	 * the first 8 bytes of an IPv6 address with an 8 byte datum and 4 bytes
+	 * otherwise.
+	 *
+	 * We're consuming an array of char, so make sure to byteswap on little
+	 * endian systems (an inet's IP array emulates big endian in that the
+	 * first byte is always the most significant).
+	 */
+	if (ip_family(authoritative) == PGSQL_AF_INET)
+	{
+		uint32		ipaddr_datum32 = *((uint32 *) ip_addr(authoritative));
+
+		/* Must byteswap on little-endian machines */
+#ifndef WORDS_BIGENDIAN
+		ipaddr_datum32 = pg_bswap32(ipaddr_datum32);
+#endif
+
+		ipaddr_datum = ipaddr_datum32;
+
+		/*
+		 * Initialize result without setting ipfamily bit (a leading zero bit
+		 * implies IPv4)
+		 */
+		res = (Datum) 0;
+	}
+	else
+	{
+		ipaddr_datum = *((Datum *) ip_addr(authoritative));
+
+		/* Must byteswap on little-endian machines */
+		ipaddr_datum = DatumBigEndianToNative(ipaddr_datum);
+
+		/* Initialize result with ipfamily (most significant) bit set */
+		res = ((Datum) 1) << (SIZEOF_DATUM * BITS_PER_BYTE - 1);
+	}
+
+	/*
+	 * ipaddr_datum must be "split in two": high order bits go in "network"
+	 * component of abbreviated key (often with zeroed bits at the end due to
+	 * masking), while low order bits go in "subnet" component when there is
+	 * space for one.  This is accomplished by generating a temp datum subnet
+	 * bitmask, which we may reuse later when generating the subnet bits
+	 * themselves.  (Note that subnet bits are only used with IPv4 datums on
+	 * platforms where datum is 8 bytes.)
+	 *
+	 * The number of bits in subnet is used to generate a datum subnet
+	 * bitmask. For example, with a /24 IPv4 datum there are 8 subnet bits (32
+	 * - 24), so the final subnet bitmask is '1111 1111'.
+	 */
+	subnet_size = ip_maxbits(authoritative) - ip_bits(authoritative);
+	Assert(subnet_size >= 0);
+	subnet_bitmask = (((Datum) 1) << subnet_size) - 1;
+	network = ipaddr_datum & ~subnet_bitmask;
+
+#if SIZEOF_DATUM == 8
+	if (ip_family(authoritative) == PGSQL_AF_INET)
+	{
+		/*
+		 * IPv4 with 8 byte datums: keep all 32 netmasked bits, netmask size,
+		 * and most significant 25 subnet bits
+		 */
+		Datum		netmask_size = (Datum) ip_bits(authoritative);
+		Datum		subnet;
+
+		/* Shift left 31 bits: 6 bits netmask size + 25 subnet bits */
+		network <<= (ABBREV_BITS_INET4_NETMASK_SIZE +
+					 ABBREV_BITS_INET4_SUBNET);
+
+		/* Shift size to make room for subnet bits at the end */
+		netmask_size <<= ABBREV_BITS_INET4_SUBNET;
+
+		/* Extract subnet bits without shifting them */
+		subnet = ipaddr_datum & subnet_bitmask;
+
+		/*
+		 * If we have more than 25 subnet bits, we can't fit everything. Shift
+		 * subnet down to avoid clobbering bits that are only supposed to be
+		 * used for netmask_size.
+		 *
+		 * Discarding the least significant subnet bits like this is correct
+		 * because abbreviated comparisons that are resolved at the subnet
+		 * level must have had equal subnet sizes in order to get that far.
+		 */
+		if (subnet_size > ABBREV_BITS_INET4_SUBNET)
+			subnet >>= subnet_size - ABBREV_BITS_INET4_SUBNET;
+
+		/*
+		 * Assemble the final abbreviated key without clobbering the ipfamily
+		 * bit that must remain a zero.
+		 */
+		res |= network | netmask_size | subnet;
+	}
+	else
+#endif
+	{
+		/*
+		 * 4 byte datums, or IPv6 with 8 byte datums: Use as many of the
+		 * netmasked bits as will fit in final abbreviated key.  Avoid
+		 * clobbering the ipfamily bit that was set earlier.
+		 */
+		res |= network >> 1;
+	}
+
+	uss->input_count += 1;
+
+	/* Hash abbreviated key */
+	if (uss->estimating)
+	{
+		uint32		tmp;
+
+#if SIZEOF_DATUM == 8
+		tmp = (uint32) res ^ (uint32) ((uint64) res >> 32);
+#else							/* SIZEOF_DATUM != 8 */
+		tmp = (uint32) res;
+#endif
+
+		addHyperLogLog(&uss->abbr_card, DatumGetUInt32(hash_uint32(tmp)));
+	}
+
+	return res;
+}
+
 /*
  *	Boolean ordering tests.
  */
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 5567b7ebad..5e705019b4 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -93,6 +93,9 @@
   amproc => 'in_range(float4,float4,float8,bool,bool)' },
 { amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
   amprocrighttype => 'inet', amprocnum => '1', amproc => 'network_cmp' },
+{ amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
+  amprocrighttype => 'inet', amprocnum => '2',
+  amproc => 'network_sortsupport' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
   amprocrighttype => 'int2', amprocnum => '1', amproc => 'btint2cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0902dce5f1..58311d31e3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3954,6 +3954,9 @@
 { oid => '3551',
   proname => 'network_overlap', prorettype => 'bool',
   proargtypes => 'inet inet', prosrc => 'network_overlap' },
+{ oid => '8190', descr => 'sort support',
+  proname => 'network_sortsupport', prorettype => 'void',
+  proargtypes => 'internal', prosrc => 'network_sortsupport' },
 
 # inet/cidr functions
 { oid => '598', descr => 'abbreviated display of inet value',
diff --git a/src/test/regress/expected/inet.out b/src/test/regress/expected/inet.out
index 24202376f1..961808cde1 100644
--- a/src/test/regress/expected/inet.out
+++ b/src/test/regress/expected/inet.out
@@ -845,3 +845,107 @@ SELECT inet_merge(c, i) FROM INET_TBL WHERE inet_same_family(c, i);
  ::/24
 (17 rows)
 
+-- tests to specifically check some of the potentially sharp edges when it
+-- comes to SortSupport implementation
+SELECT i FROM (VALUES
+  ('0.0.0.0/0'::inet),
+  ('0.0.0.1/0'::inet),
+  ('192.168.1.4/0'::inet),
+  ('192.168.1.5/0'::inet),
+  ('255.0.0.0/0'::inet),
+  ('255.255.255.255/0'::inet),
+  ('0.0.0.0/1'::inet),
+  ('0.0.0.0/32'::inet),
+  ('0.0.0.1/1'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/1'::inet),
+  ('192.168.1.3/1'::inet),
+  ('192.168.1.1/5'::inet),
+  ('192.168.1.255/5'::inet),
+  ('192.168.1.1/6'::inet),
+  ('192.168.1.255/6'::inet),
+  ('192.168.1.1/23'::inet),
+  ('192.168.1.2/23'::inet),
+  ('192.168.1.3/23'::inet),
+  ('192.168.1.0/24'::inet),
+  ('192.168.1.0/25'::inet),
+  ('255.255.255.255/16'::inet),
+  ('255.255.255.255/31'::inet),
+  ('255.255.255.254/32'::inet),
+  ('255.255.255.255/32'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/0'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/128'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0001/128'::inet),
+  ('::1:ffff:ffff:ffff:ffff/128'::inet),
+  ('::2:ffff:ffff:ffff:ffff/128'::inet),
+  ('127::1'::inet),
+  ('127::2'::inet),
+  ('8000:0000:0000:0000:0000:0000:0000:0000/1'::inet),
+  ('ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70'::inet),
+  ('ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44'::inet),
+  ('ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69'::inet),
+  ('ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73'::inet),
+  ('ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89'::inet),
+  ('ffff:8dd0:646:694c:7c16:7e35:6a26:171/104'::inet),
+  ('ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121'::inet),
+  ('ffff:90e7:e744:664:a93:8efe:1f25:7663/122'::inet),
+  ('ffff:9597:c69c:8b24:57a:8639:ec78:6026/111'::inet),
+  ('ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88'::inet),
+  ('ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23'::inet),
+  ('ffff:ffff:ffff:fffd::/128'::inet),
+  ('ffff:ffff:ffff:fffe::/128'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/128'::inet)
+) AS i ORDER BY i;
+                      i                       
+----------------------------------------------
+ (0.0.0.0/0)
+ (0.0.0.1/0)
+ (192.168.1.4/0)
+ (192.168.1.5/0)
+ (255.0.0.0/0)
+ (255.255.255.255/0)
+ (255.255.255.255/0)
+ (0.0.0.0/1)
+ (0.0.0.1/1)
+ (0.0.0.0)
+ (192.168.1.3/1)
+ (255.255.255.255/1)
+ (192.168.1.1/5)
+ (192.168.1.255/5)
+ (192.168.1.1/6)
+ (192.168.1.255/6)
+ (192.168.1.1/23)
+ (192.168.1.2/23)
+ (192.168.1.3/23)
+ (192.168.1.0/24)
+ (192.168.1.0/25)
+ (255.255.255.255/16)
+ (255.255.255.255/31)
+ (255.255.255.254)
+ (255.255.255.255)
+ (::/0)
+ (ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0)
+ (::)
+ (::1)
+ (::1:ffff:ffff:ffff:ffff)
+ (::2:ffff:ffff:ffff:ffff)
+ (127::1)
+ (127::2)
+ (8000::/1)
+ (ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70)
+ (ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44)
+ (ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69)
+ (ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73)
+ (ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89)
+ (ffff:8dd0:646:694c:7c16:7e35:6a26:171/104)
+ (ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121)
+ (ffff:90e7:e744:664:a93:8efe:1f25:7663/122)
+ (ffff:9597:c69c:8b24:57a:8639:ec78:6026/111)
+ (ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88)
+ (ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23)
+ (ffff:ffff:ffff:fffd::)
+ (ffff:ffff:ffff:fffe::)
+ (ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff)
+(48 rows)
+
diff --git a/src/test/regress/sql/inet.sql b/src/test/regress/sql/inet.sql
index bbfa9d37df..1bd6219d8f 100644
--- a/src/test/regress/sql/inet.sql
+++ b/src/test/regress/sql/inet.sql
@@ -156,3 +156,56 @@ INSERT INTO INET_TBL (c, i) VALUES ('10', '10::/8');
 SELECT inet_merge(c, i) FROM INET_TBL;
 -- fix it by inet_same_family() condition
 SELECT inet_merge(c, i) FROM INET_TBL WHERE inet_same_family(c, i);
+
+-- tests to specifically check some of the potentially sharp edges when it
+-- comes to SortSupport implementation
+SELECT i FROM (VALUES
+  ('0.0.0.0/0'::inet),
+  ('0.0.0.1/0'::inet),
+  ('192.168.1.4/0'::inet),
+  ('192.168.1.5/0'::inet),
+  ('255.0.0.0/0'::inet),
+  ('255.255.255.255/0'::inet),
+  ('0.0.0.0/1'::inet),
+  ('0.0.0.0/32'::inet),
+  ('0.0.0.1/1'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/1'::inet),
+  ('192.168.1.3/1'::inet),
+  ('192.168.1.1/5'::inet),
+  ('192.168.1.255/5'::inet),
+  ('192.168.1.1/6'::inet),
+  ('192.168.1.255/6'::inet),
+  ('192.168.1.1/23'::inet),
+  ('192.168.1.2/23'::inet),
+  ('192.168.1.3/23'::inet),
+  ('192.168.1.0/24'::inet),
+  ('192.168.1.0/25'::inet),
+  ('255.255.255.255/16'::inet),
+  ('255.255.255.255/31'::inet),
+  ('255.255.255.254/32'::inet),
+  ('255.255.255.255/32'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/0'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/128'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0001/128'::inet),
+  ('::1:ffff:ffff:ffff:ffff/128'::inet),
+  ('::2:ffff:ffff:ffff:ffff/128'::inet),
+  ('127::1'::inet),
+  ('127::2'::inet),
+  ('8000:0000:0000:0000:0000:0000:0000:0000/1'::inet),
+  ('ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70'::inet),
+  ('ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44'::inet),
+  ('ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69'::inet),
+  ('ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73'::inet),
+  ('ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89'::inet),
+  ('ffff:8dd0:646:694c:7c16:7e35:6a26:171/104'::inet),
+  ('ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121'::inet),
+  ('ffff:90e7:e744:664:a93:8efe:1f25:7663/122'::inet),
+  ('ffff:9597:c69c:8b24:57a:8639:ec78:6026/111'::inet),
+  ('ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88'::inet),
+  ('ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23'::inet),
+  ('ffff:ffff:ffff:fffd::/128'::inet),
+  ('ffff:ffff:ffff:fffe::/128'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/128'::inet)
+) AS i ORDER BY i;
-- 
2.16.2

#11

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#8)

1 attachment(s)

Re: Patch for SortSupport implementation on inet/cdir

On Fri, Jul 26, 2019 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote:

I guess that the idea here was to prevent masking on ipv6 addresses,
though not on ipv4 addresses. Obviously we're only dealing with a
prefix with ipv6 addresses, whereas we usually have the whole raw
ipaddr with ipv4. Not sure if I'm doing the right thing there in v3,
even though the tests pass. In any case, this will need to be a lot
clearer in the final version.

This turned out to be borked for certain IPv6 cases, as suspected.
Attached is a revised v6, which fixes the issue by adding the explicit
handling needed when ipaddr_datum is just a prefix of the full ipaddr
from the authoritative representation. Also made sure that the tests
will catch issues like this. Separately, it occurred to me that it's
probably not okay to do straight type punning of the ipaddr unsigned
char array to a Datum on alignment-picky platforms. Using a memcpy()
seems like the right approach, which is what we do in the latest
revision.

I accepted almost all of Brandur's comment revisions from v5 for v6.

I'm probably going to commit this tomorrow morning Pacific time. Do
you have any final input on the testing, Brandur? I would like to hear
your thoughts on the possibility of edge cases that still don't have
coverage. The tests will break if the new "if (ip_bits(authoritative)
== 0)" branch is removed, but only at one exact point. I'm pretty sure
that there are no remaining subtleties like that one, but two heads
are better than one.

--
Peter Geoghegan

Attachments:

v6-0001-Add-sort-support-for-inet-cidr-opfamily.patchapplication/octet-stream; name=v6-0001-Add-sort-support-for-inet-cidr-opfamily.patchDownload

From af224551e4ad599138c8ec541eb58dc4a1fd5f8c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 26 Jul 2019 09:13:31 -0700
Subject: [PATCH v6] Add sort support for inet/cidr opfamily.

Add sort support for inet and cidr, including support for abbreviated
keys.  Testing has shown that this reduces the time taken to sort medium
to large inet/cidr inputs by ~50-60% in realistic cases.

Author: Brandur Leach
Reviewed-By: Peter Geoghegan, Edmund Horner
Discussion: https://postgr.es/m/CABR_9B-PQ8o2MZNJ88wo6r-NxW2EFG70M96Wmcgf99G6HUQ3sw@mail.gmail.com
---
 src/backend/utils/adt/network.c    | 387 +++++++++++++++++++++++++++++
 src/include/catalog/pg_amproc.dat  |   3 +
 src/include/catalog/pg_proc.dat    |   3 +
 src/test/regress/expected/inet.out | 123 +++++++++
 src/test/regress/sql/inet.sql      |  62 +++++
 5 files changed, 578 insertions(+)

diff --git a/src/backend/utils/adt/network.c b/src/backend/utils/adt/network.c
index 5e980cf23f..d047908500 100644
--- a/src/backend/utils/adt/network.c
+++ b/src/backend/utils/adt/network.c
@@ -16,6 +16,7 @@
 #include "catalog/pg_opfamily.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
+#include "lib/hyperloglog.h"
 #include "libpq/libpq-be.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
@@ -24,12 +25,37 @@
 #include "nodes/supportnodes.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/guc.h"
 #include "utils/hashutils.h"
 #include "utils/inet.h"
 #include "utils/lsyscache.h"
+#include "utils/sortsupport.h"
 
 
+/*
+ * An IPv4 netmask size is a value in the range of 0 - 32, which is
+ * represented with 6 bits in inet/cidr abbreviated keys where possible.
+ *
+ * An IPv4 inet/cidr abbreviated key can use up to 25 bits for subnet
+ * component.
+ */
+#define ABBREV_BITS_INET4_NETMASK_SIZE	6
+#define ABBREV_BITS_INET4_SUBNET		25
+
+/* sortsupport for inet/cidr */
+typedef struct
+{
+	int64		input_count;	/* number of non-null values seen */
+	bool		estimating;		/* true if estimating cardinality */
+
+	hyperLogLogState abbr_card; /* cardinality estimator */
+} network_sortsupport_state;
+
 static int32 network_cmp_internal(inet *a1, inet *a2);
+static int	network_fast_cmp(Datum x, Datum y, SortSupport ssup);
+static int	network_cmp_abbrev(Datum x, Datum y, SortSupport ssup);
+static bool network_abbrev_abort(int memtupcount, SortSupport ssup);
+static Datum network_abbrev_convert(Datum original, SortSupport ssup);
 static List *match_network_function(Node *leftop,
 									Node *rightop,
 									int indexarg,
@@ -405,6 +431,367 @@ network_cmp(PG_FUNCTION_ARGS)
 	PG_RETURN_INT32(network_cmp_internal(a1, a2));
 }
 
+/*
+ * SortSupport strategy routine
+ */
+Datum
+network_sortsupport(PG_FUNCTION_ARGS)
+{
+	SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
+
+	ssup->comparator = network_fast_cmp;
+	ssup->ssup_extra = NULL;
+
+	if (ssup->abbreviate)
+	{
+		network_sortsupport_state *uss;
+		MemoryContext oldcontext;
+
+		oldcontext = MemoryContextSwitchTo(ssup->ssup_cxt);
+
+		uss = palloc(sizeof(network_sortsupport_state));
+		uss->input_count = 0;
+		uss->estimating = true;
+		initHyperLogLog(&uss->abbr_card, 10);
+
+		ssup->ssup_extra = uss;
+
+		ssup->comparator = network_cmp_abbrev;
+		ssup->abbrev_converter = network_abbrev_convert;
+		ssup->abbrev_abort = network_abbrev_abort;
+		ssup->abbrev_full_comparator = network_fast_cmp;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * SortSupport comparison func
+ */
+static int
+network_fast_cmp(Datum x, Datum y, SortSupport ssup)
+{
+	inet	   *arg1 = DatumGetInetPP(x);
+	inet	   *arg2 = DatumGetInetPP(y);
+
+	return network_cmp_internal(arg1, arg2);
+}
+
+/*
+ * Abbreviated key comparison func
+ */
+static int
+network_cmp_abbrev(Datum x, Datum y, SortSupport ssup)
+{
+	if (x > y)
+		return 1;
+	else if (x == y)
+		return 0;
+	else
+		return -1;
+}
+
+/*
+ * Callback for estimating effectiveness of abbreviated key optimization.
+ *
+ * We pay no attention to the cardinality of the non-abbreviated data, because
+ * there is no equality fast-path within authoritative inet comparator.
+ */
+static bool
+network_abbrev_abort(int memtupcount, SortSupport ssup)
+{
+	network_sortsupport_state *uss = ssup->ssup_extra;
+	double		abbr_card;
+
+	if (memtupcount < 10000 || uss->input_count < 10000 || !uss->estimating)
+		return false;
+
+	abbr_card = estimateHyperLogLog(&uss->abbr_card);
+
+	/*
+	 * If we have >100k distinct values, then even if we were sorting many
+	 * billion rows we'd likely still break even, and the penalty of undoing
+	 * that many rows of abbrevs would probably not be worth it. At this point
+	 * we stop counting because we know that we're now fully committed.
+	 */
+	if (abbr_card > 100000.0)
+	{
+#ifdef TRACE_SORT
+		if (trace_sort)
+			elog(LOG,
+				 "network_abbrev: estimation ends at cardinality %f"
+				 " after " INT64_FORMAT " values (%d rows)",
+				 abbr_card, uss->input_count, memtupcount);
+#endif
+		uss->estimating = false;
+		return false;
+	}
+
+	/*
+	 * Target minimum cardinality is 1 per ~2k of non-null inputs. 0.5 row
+	 * fudge factor allows us to abort earlier on genuinely pathological data
+	 * where we've had exactly one abbreviated value in the first 2k
+	 * (non-null) rows.
+	 */
+	if (abbr_card < uss->input_count / 2000.0 + 0.5)
+	{
+#ifdef TRACE_SORT
+		if (trace_sort)
+			elog(LOG,
+				 "network_abbrev: aborting abbreviation at cardinality %f"
+				 " below threshold %f after " INT64_FORMAT " values (%d rows)",
+				 abbr_card, uss->input_count / 2000.0 + 0.5, uss->input_count,
+				 memtupcount);
+#endif
+		return true;
+	}
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "network_abbrev: cardinality %f after " INT64_FORMAT
+			 " values (%d rows)", abbr_card, uss->input_count, memtupcount);
+#endif
+
+	return false;
+}
+
+/*
+ * SortSupport conversion routine. Converts original inet/cidr representation
+ * to abbreviated key representation that works with simple 3-way unsigned int
+ * comparisons. The network_cmp_internal() rules for sorting inet/cidr datums
+ * are followed by abbreviated comparisons by an encoding scheme that
+ * conditions keys through careful use of padding.
+ *
+ * Some background: inet values have three major components (take for example
+ * the address 1.2.3.4/24):
+ *
+ *     * A network, or netmasked bits (1.2.3.0).
+ *     * A netmask size (/24).
+ *     * A subnet, or bits outside of the netmask (0.0.0.4).
+ *
+ * cidr values are the same except that with only the first two components --
+ * all their subnet bits *must* be zero (1.2.3.0/24).
+ *
+ * IPv4 and IPv6 are identical in this makeup, with the difference being that
+ * IPv4 addresses have a maximum of 32 bits compared to IPv6's 64 bits, so in
+ * IPv6 each part may be larger.
+ *
+ * inet/cdir types compare using these sorting rules. If inequality is detected
+ * at any step, comparison is finished. If any rule is a tie, the algorithm
+ * drops through to the next to break it:
+ *
+ *     1. IPv4 always appears before IPv6.
+ *     2. Network bits are compared.
+ *     3. Netmask size is compared.
+ *     4. All bits are compared (having made it here, we know that both
+ *        netmasked bits and netmask size are equal, so we're in effect only
+ *        comparing subnet bits).
+ *
+ * When generating abbreviated keys for SortSupport, we pack as much as we can
+ * into a datum while ensuring that when comparing those keys as integers,
+ * these rules will be respected. Exact contents depend on IP family and datum
+ * size.
+ *
+ * IPv4
+ * ----
+ *
+ * 4 byte datums:
+ *
+ * Start with 1 bit for the IP family (IPv4 or IPv6; this bit is present in
+ * every case below) followed by all but 1 of the netmasked bits.
+ *
+ * +----------+---------------------+
+ * | 1 bit IP |   31 bits network   |     (1 bit network
+ * |  family  |     (truncated)     |      omitted)
+ * +----------+---------------------+
+ *
+ * 8 byte datums:
+ *
+ * We have space to store all netmasked bits, followed by the netmask size,
+ * followed by 25 bits of the subnet (25 bits is usually more than enough in
+ * practice). cidr datums always have all-zero subnet bits.
+ *
+ * +----------+-----------------------+--------------+--------------------+
+ * | 1 bit IP |    32 bits network    |    6 bits    |   25 bits subnet   |
+ * |  family  |        (full)         | network size |    (truncated)     |
+ * +----------+-----------------------+--------------+--------------------+
+ *
+ * IPv6
+ * ----
+ *
+ * 4 byte datums:
+ *
+ * +----------+---------------------+
+ * | 1 bit IP |   31 bits network   |    (up to 97 bits
+ * |  family  |     (truncated)     |   network omitted)
+ * +----------+---------------------+
+ *
+ * 8 byte datums:
+ *
+ * +----------+---------------------------------+
+ * | 1 bit IP |         63 bits network         |    (up to 65 bits
+ * |  family  |           (truncated)           |   network omitted)
+ * +----------+---------------------------------+
+ */
+static Datum
+network_abbrev_convert(Datum original, SortSupport ssup)
+{
+	network_sortsupport_state *uss = ssup->ssup_extra;
+	inet	   *authoritative = DatumGetInetPP(original);
+	Datum		res,
+				ipaddr_datum,
+				subnet_bitmask,
+				network;
+	int			subnet_size;
+
+	Assert(ip_family(authoritative) == PGSQL_AF_INET ||
+		   ip_family(authoritative) == PGSQL_AF_INET6);
+
+	/*
+	 * Get an unsigned integer representation of the IP address by taking its
+	 * first 4 or 8 bytes. Always take all 4 bytes of an IPv4 address. Take
+	 * the first 8 bytes of an IPv6 address with an 8 byte datum and 4 bytes
+	 * otherwise.
+	 *
+	 * We're consuming an array of char, so make sure to byteswap on little
+	 * endian systems (an inet's IP array emulates big endian in that the
+	 * first byte is always the most significant).
+	 */
+	if (ip_family(authoritative) == PGSQL_AF_INET)
+	{
+		uint32		ipaddr_datum32;
+
+		memcpy(&ipaddr_datum32, ip_addr(authoritative), sizeof(uint32));
+
+		/* Must byteswap on little-endian machines */
+#ifndef WORDS_BIGENDIAN
+		ipaddr_datum = pg_bswap32(ipaddr_datum32);
+#else
+		ipaddr_datum = ipaddr_datum32;
+#endif
+
+		/* Initialize result without setting ipfamily bit */
+		res = (Datum) 0;
+	}
+	else
+	{
+		memcpy(&ipaddr_datum, ip_addr(authoritative), sizeof(Datum));
+
+		/* Must byteswap on little-endian machines */
+		ipaddr_datum = DatumBigEndianToNative(ipaddr_datum);
+
+		/* Initialize result with ipfamily (most significant) bit set */
+		res = ((Datum) 1) << (SIZEOF_DATUM * BITS_PER_BYTE - 1);
+	}
+
+	/*
+	 * ipaddr_datum must be "split": high order bits go in "network" component
+	 * of abbreviated key (often with zeroed bits at the end due to masking),
+	 * while low order bits go in "subnet" component when there is space for
+	 * one. This is often accomplished by generating a temp datum subnet
+	 * bitmask, which we may reuse later when generating the subnet bits
+	 * themselves.  (Note that subnet bits are only used with IPv4 datums on
+	 * platforms where datum is 8 bytes.)
+	 *
+	 * The number of bits in subnet is used to generate a datum subnet
+	 * bitmask. For example, with a /24 IPv4 datum there are 8 subnet bits
+	 * (since 32 - 24 is 8), so the final subnet bitmask is B'1111 1111'. We
+	 * need explicit handling for cases where the ipaddr bits cannot all fit
+	 * in a datum, though (otherwise we'd incorrectly mask the network
+	 * component with IPv6 values).
+	 */
+	subnet_size = ip_maxbits(authoritative) - ip_bits(authoritative);
+	Assert(subnet_size >= 0);
+	if (ip_bits(authoritative) == 0)
+	{
+		/* Fit as many ipaddr bits as possible into subnet */
+		subnet_bitmask = ((Datum) 0) - 1;
+		network = 0;
+	}
+	else if (ip_bits(authoritative) < SIZEOF_DATUM * BITS_PER_BYTE)
+	{
+		/* Split ipaddr bits between network and subnet */
+		subnet_bitmask = (((Datum) 1) << subnet_size) - 1;
+		network = ipaddr_datum & ~subnet_bitmask;
+	}
+	else
+	{
+		/* Fit as many ipaddr bits as possible into network */
+		subnet_bitmask = 0;		/* Unused, but be tidy */
+		network = ipaddr_datum;
+	}
+
+#if SIZEOF_DATUM == 8
+	if (ip_family(authoritative) == PGSQL_AF_INET)
+	{
+		/*
+		 * IPv4 with 8 byte datums: keep all 32 netmasked bits, netmask size,
+		 * and most significant 25 subnet bits
+		 */
+		Datum		netmask_size = (Datum) ip_bits(authoritative);
+		Datum		subnet;
+
+		/* Shift left 31 bits: 6 bits netmask size + 25 subnet bits */
+		network <<= (ABBREV_BITS_INET4_NETMASK_SIZE +
+					 ABBREV_BITS_INET4_SUBNET);
+
+		/* Shift size to make room for subnet bits at the end */
+		netmask_size <<= ABBREV_BITS_INET4_SUBNET;
+
+		/* Extract subnet bits without shifting them */
+		subnet = ipaddr_datum & subnet_bitmask;
+
+		/*
+		 * If we have more than 25 subnet bits, we can't fit everything. Shift
+		 * subnet down to avoid clobbering bits that are only supposed to be
+		 * used for netmask_size.
+		 *
+		 * Discarding the least significant subnet bits like this is correct
+		 * because abbreviated comparisons that are resolved at the subnet
+		 * level must have had equal subnet sizes in order to get that far.
+		 */
+		if (subnet_size > ABBREV_BITS_INET4_SUBNET)
+			subnet >>= subnet_size - ABBREV_BITS_INET4_SUBNET;
+
+		/*
+		 * Assemble the final abbreviated key without clobbering the ipfamily
+		 * bit that must remain a zero.
+		 */
+		res |= network | netmask_size | subnet;
+	}
+	else
+#endif
+	{
+		/*
+		 * 4 byte datums, or IPv6 with 8 byte datums: Use as many of the
+		 * netmasked bits as will fit in final abbreviated key. Avoid
+		 * clobbering the ipfamily bit that was set earlier.
+		 */
+		res |= network >> 1;
+	}
+
+	uss->input_count += 1;
+
+	/* Hash abbreviated key */
+	if (uss->estimating)
+	{
+		uint32		tmp;
+
+#if SIZEOF_DATUM == 8
+		tmp = (uint32) res ^ (uint32) ((uint64) res >> 32);
+#else							/* SIZEOF_DATUM != 8 */
+		tmp = (uint32) res;
+#endif
+
+		addHyperLogLog(&uss->abbr_card, DatumGetUInt32(hash_uint32(tmp)));
+	}
+
+	return res;
+}
+
 /*
  *	Boolean ordering tests.
  */
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 5567b7ebad..5e705019b4 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -93,6 +93,9 @@
   amproc => 'in_range(float4,float4,float8,bool,bool)' },
 { amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
   amprocrighttype => 'inet', amprocnum => '1', amproc => 'network_cmp' },
+{ amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
+  amprocrighttype => 'inet', amprocnum => '2',
+  amproc => 'network_sortsupport' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
   amprocrighttype => 'int2', amprocnum => '1', amproc => 'btint2cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0902dce5f1..58311d31e3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3954,6 +3954,9 @@
 { oid => '3551',
   proname => 'network_overlap', prorettype => 'bool',
   proargtypes => 'inet inet', prosrc => 'network_overlap' },
+{ oid => '8190', descr => 'sort support',
+  proname => 'network_sortsupport', prorettype => 'void',
+  proargtypes => 'internal', prosrc => 'network_sortsupport' },
 
 # inet/cidr functions
 { oid => '598', descr => 'abbreviated display of inet value',
diff --git a/src/test/regress/expected/inet.out b/src/test/regress/expected/inet.out
index 24202376f1..10f5dc8255 100644
--- a/src/test/regress/expected/inet.out
+++ b/src/test/regress/expected/inet.out
@@ -845,3 +845,126 @@ SELECT inet_merge(c, i) FROM INET_TBL WHERE inet_same_family(c, i);
  ::/24
 (17 rows)
 
+-- Test inet sortsupport with a variety of boundary inputs:
+SELECT a FROM (VALUES
+  ('0.0.0.0/0'::inet),
+  ('0.0.0.0/1'::inet),
+  ('0.0.0.0/32'::inet),
+  ('0.0.0.1/0'::inet),
+  ('0.0.0.1/1'::inet),
+  ('192.168.1.0/24'::inet),
+  ('192.168.1.0/25'::inet),
+  ('192.168.1.1/23'::inet),
+  ('192.168.1.1/5'::inet),
+  ('192.168.1.1/6'::inet),
+  ('192.168.1.2/23'::inet),
+  ('192.168.1.255/5'::inet),
+  ('192.168.1.255/6'::inet),
+  ('192.168.1.3/1'::inet),
+  ('192.168.1.3/23'::inet),
+  ('192.168.1.4/0'::inet),
+  ('192.168.1.5/0'::inet),
+  ('255.0.0.0/0'::inet),
+  ('255.255.000.000/0'::inet),
+  ('255.255.000.000/0'::inet),
+  ('255.255.000.000/15'::inet),
+  ('255.255.000.000/16'::inet),
+  ('255.255.255.254/32'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/1'::inet),
+  ('255.255.255.255/16'::inet),
+  ('255.255.255.255/16'::inet),
+  ('255.255.255.255/31'::inet),
+  ('255.255.255.255/32'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/0'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/128'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0001/128'::inet),
+  ('::1:ffff:ffff:ffff:ffff/128'::inet),
+  ('::2:ffff:ffff:ffff:ffff/128'::inet),
+  ('::4:3:2:1/24'::inet),
+  ('10:23::f1/64'::inet),
+  ('10:23::f1/65'::inet),
+  ('10:23::ffff'::inet),
+  ('127::1'::inet),
+  ('127::2'::inet),
+  ('8000:0000:0000:0000:0000:0000:0000:0000/1'::inet),
+  ('ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70'::inet),
+  ('ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44'::inet),
+  ('ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69'::inet),
+  ('ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73'::inet),
+  ('ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89'::inet),
+  ('ffff:8dd0:646:694c:7c16:7e35:6a26:171/104'::inet),
+  ('ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121'::inet),
+  ('ffff:90e7:e744:664:a93:8efe:1f25:7663/122'::inet),
+  ('ffff:9597:c69c:8b24:57a:8639:ec78:6026/111'::inet),
+  ('ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88'::inet),
+  ('ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23'::inet),
+  ('ffff:ffff:ffff:fffd::/128'::inet),
+  ('ffff:ffff:ffff:fffe::/128'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/128'::inet)
+) AS i(a) ORDER BY a;
+                     a                      
+--------------------------------------------
+ 0.0.0.0/0
+ 0.0.0.1/0
+ 192.168.1.4/0
+ 192.168.1.5/0
+ 255.0.0.0/0
+ 255.255.0.0/0
+ 255.255.0.0/0
+ 255.255.255.255/0
+ 255.255.255.255/0
+ 255.255.255.255/0
+ 0.0.0.0/1
+ 0.0.0.1/1
+ 0.0.0.0
+ 192.168.1.3/1
+ 255.255.255.255/1
+ 192.168.1.1/5
+ 192.168.1.255/5
+ 192.168.1.1/6
+ 192.168.1.255/6
+ 192.168.1.1/23
+ 192.168.1.2/23
+ 192.168.1.3/23
+ 192.168.1.0/24
+ 192.168.1.0/25
+ 255.255.0.0/15
+ 255.255.0.0/16
+ 255.255.255.255/16
+ 255.255.255.255/16
+ 255.255.255.255/31
+ 255.255.255.254
+ 255.255.255.255
+ ::/0
+ ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0
+ ::4:3:2:1/24
+ ::
+ ::1
+ ::1:ffff:ffff:ffff:ffff
+ ::2:ffff:ffff:ffff:ffff
+ 10:23::f1/64
+ 10:23::f1/65
+ 10:23::ffff
+ 127::1
+ 127::2
+ 8000::/1
+ ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70
+ ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44
+ ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69
+ ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73
+ ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89
+ ffff:8dd0:646:694c:7c16:7e35:6a26:171/104
+ ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121
+ ffff:90e7:e744:664:a93:8efe:1f25:7663/122
+ ffff:9597:c69c:8b24:57a:8639:ec78:6026/111
+ ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88
+ ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23
+ ffff:ffff:ffff:fffd::
+ ffff:ffff:ffff:fffe::
+ ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff
+(58 rows)
+
diff --git a/src/test/regress/sql/inet.sql b/src/test/regress/sql/inet.sql
index bbfa9d37df..40d49ea281 100644
--- a/src/test/regress/sql/inet.sql
+++ b/src/test/regress/sql/inet.sql
@@ -156,3 +156,65 @@ INSERT INTO INET_TBL (c, i) VALUES ('10', '10::/8');
 SELECT inet_merge(c, i) FROM INET_TBL;
 -- fix it by inet_same_family() condition
 SELECT inet_merge(c, i) FROM INET_TBL WHERE inet_same_family(c, i);
+
+-- Test inet sortsupport with a variety of boundary inputs:
+SELECT a FROM (VALUES
+  ('0.0.0.0/0'::inet),
+  ('0.0.0.0/1'::inet),
+  ('0.0.0.0/32'::inet),
+  ('0.0.0.1/0'::inet),
+  ('0.0.0.1/1'::inet),
+  ('192.168.1.0/24'::inet),
+  ('192.168.1.0/25'::inet),
+  ('192.168.1.1/23'::inet),
+  ('192.168.1.1/5'::inet),
+  ('192.168.1.1/6'::inet),
+  ('192.168.1.2/23'::inet),
+  ('192.168.1.255/5'::inet),
+  ('192.168.1.255/6'::inet),
+  ('192.168.1.3/1'::inet),
+  ('192.168.1.3/23'::inet),
+  ('192.168.1.4/0'::inet),
+  ('192.168.1.5/0'::inet),
+  ('255.0.0.0/0'::inet),
+  ('255.255.000.000/0'::inet),
+  ('255.255.000.000/0'::inet),
+  ('255.255.000.000/15'::inet),
+  ('255.255.000.000/16'::inet),
+  ('255.255.255.254/32'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/0'::inet),
+  ('255.255.255.255/1'::inet),
+  ('255.255.255.255/16'::inet),
+  ('255.255.255.255/16'::inet),
+  ('255.255.255.255/31'::inet),
+  ('255.255.255.255/32'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/0'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0000/128'::inet),
+  ('0000:0000:0000:0000:0000:0000:0000:0001/128'::inet),
+  ('::1:ffff:ffff:ffff:ffff/128'::inet),
+  ('::2:ffff:ffff:ffff:ffff/128'::inet),
+  ('::4:3:2:1/24'::inet),
+  ('10:23::f1/64'::inet),
+  ('10:23::f1/65'::inet),
+  ('10:23::ffff'::inet),
+  ('127::1'::inet),
+  ('127::2'::inet),
+  ('8000:0000:0000:0000:0000:0000:0000:0000/1'::inet),
+  ('ffff:83e7:f118:57dc:6093:6d92:689d:58cf/70'::inet),
+  ('ffff:84b0:4775:536e:c3ed:7116:a6d6:34f0/44'::inet),
+  ('ffff:8566:f84:5867:47f1:7867:d2ba:8a1a/69'::inet),
+  ('ffff:8883:f028:7d2:4d68:d510:7d6b:ac43/73'::inet),
+  ('ffff:8ae8:7c14:65b3:196:8e4a:89ae:fb30/89'::inet),
+  ('ffff:8dd0:646:694c:7c16:7e35:6a26:171/104'::inet),
+  ('ffff:8eef:cbf:700:eda3:ae32:f4b4:318b/121'::inet),
+  ('ffff:90e7:e744:664:a93:8efe:1f25:7663/122'::inet),
+  ('ffff:9597:c69c:8b24:57a:8639:ec78:6026/111'::inet),
+  ('ffff:9e86:79ea:f16e:df31:8e4d:7783:532e/88'::inet),
+  ('ffff:a0c7:82d3:24de:f762:6e1f:316d:3fb2/23'::inet),
+  ('ffff:ffff:ffff:fffd::/128'::inet),
+  ('ffff:ffff:ffff:fffe::/128'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/0'::inet),
+  ('ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/128'::inet)
+) AS i(a) ORDER BY a;
-- 
2.17.1

#12

Brandur Leach

brandur@mutelight.org

over 6 years ago

In reply to: Peter Geoghegan (#11)

Re: Patch for SortSupport implementation on inet/cdir

Thanks Peter. V6 is pretty uncontroversial by me — the new
conditional ladder broken cleanly into cases of (1) all
subnet, (2) network/subnet mix, and (3) all network is a
little more verbose, but all in all makes things easier to
reason about.

Do you have any final input on the testing, Brandur? I
would like to hear your thoughts on the possibility of
edge cases that still don't have coverage. The tests will
break if the new "if (ip_bits(authoritative) == 0)"
branch is removed, but only at one exact point. I'm
pretty sure that there are no remaining subtleties like
that one, but two heads are better than one.

(Discussed a little offline already, but) the new, more
exhaustive test suite combined with your approach of
synthesizing many random values and comparing before and
after sorting seems sufficient to me. The approach was
meant to suss out any remaining edge cases, and seems to
have done its job.

Brandur

On Wed, Jul 31, 2019 at 10:49 AM Peter Geoghegan <pg@bowt.ie> wrote:

Show quoted text

On Fri, Jul 26, 2019 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote:

I guess that the idea here was to prevent masking on ipv6 addresses,
though not on ipv4 addresses. Obviously we're only dealing with a
prefix with ipv6 addresses, whereas we usually have the whole raw
ipaddr with ipv4. Not sure if I'm doing the right thing there in v3,
even though the tests pass. In any case, this will need to be a lot
clearer in the final version.

This turned out to be borked for certain IPv6 cases, as suspected.
Attached is a revised v6, which fixes the issue by adding the explicit
handling needed when ipaddr_datum is just a prefix of the full ipaddr
from the authoritative representation. Also made sure that the tests
will catch issues like this. Separately, it occurred to me that it's
probably not okay to do straight type punning of the ipaddr unsigned
char array to a Datum on alignment-picky platforms. Using a memcpy()
seems like the right approach, which is what we do in the latest
revision.

I accepted almost all of Brandur's comment revisions from v5 for v6.

I'm probably going to commit this tomorrow morning Pacific time. Do
you have any final input on the testing, Brandur? I would like to hear
your thoughts on the possibility of edge cases that still don't have
coverage. The tests will break if the new "if (ip_bits(authoritative)
== 0)" branch is removed, but only at one exact point. I'm pretty sure
that there are no remaining subtleties like that one, but two heads
are better than one.

--
Peter Geoghegan

#13

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Brandur Leach (#12)

Re: Patch for SortSupport implementation on inet/cdir

On Thu, Aug 1, 2019 at 8:34 AM Brandur Leach <brandur@mutelight.org> wrote:

Thanks Peter. V6 is pretty uncontroversial by me — the new
conditional ladder broken cleanly into cases of (1) all
subnet, (2) network/subnet mix, and (3) all network is a
little more verbose, but all in all makes things easier to
reason about.

Pushed.

Thanks!
--
Peter Geoghegan