mcvstats serialization code is still shy of a load

Started by Tom Laneover 6 years ago19 messages
#1Tom Lane
tgl@sss.pgh.pa.us

I'm seeing a reproducible bus error here:

#0 0x00417420 in statext_mcv_serialize (mcvlist=0x62223450, stats=Variable "stats" is not available.
)
at mcv.c:785
785 memcpy(ITEM_BASE_FREQUENCY(item, ndims), &mcvitem->base_frequency, sizeof(double));

What appears to be happening is that since ITEM_BASE_FREQUENCY is defined as

#define ITEM_BASE_FREQUENCY(item,ndims) ((double *) (ITEM_FREQUENCY(item, ndims) + 1))

the compiler is assuming that the first argument to memcpy is
double-aligned, and it is generating code that depends on that being
true, and of course it isn't true and kaboom.

You can *not* cast something to an aligned pointer type if it's not
actually certain to be aligned suitably for that type. In this example,
even if you wrote "(char *)" in front of this, it wouldn't save you;
the compiler would still be entitled to believe that the intermediate
cast value meant something. The casts in the underlying macros
ITEM_FREQUENCY and so on are equally unsafe.

(For the record, this is with gcc 4.2.1 on OpenBSD/hppa 6.4.)

regards, tom lane

#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#1)
Re: mcvstats serialization code is still shy of a load

On Tue, Jun 25, 2019 at 11:52:28PM -0400, Tom Lane wrote:

I'm seeing a reproducible bus error here:

#0 0x00417420 in statext_mcv_serialize (mcvlist=0x62223450, stats=Variable "stats" is not available.
)
at mcv.c:785
785 memcpy(ITEM_BASE_FREQUENCY(item, ndims), &mcvitem->base_frequency, sizeof(double));

What appears to be happening is that since ITEM_BASE_FREQUENCY is defined as

#define ITEM_BASE_FREQUENCY(item,ndims) ((double *) (ITEM_FREQUENCY(item, ndims) + 1))

the compiler is assuming that the first argument to memcpy is
double-aligned, and it is generating code that depends on that being
true, and of course it isn't true and kaboom.

You can *not* cast something to an aligned pointer type if it's not
actually certain to be aligned suitably for that type. In this example,
even if you wrote "(char *)" in front of this, it wouldn't save you;
the compiler would still be entitled to believe that the intermediate
cast value meant something. The casts in the underlying macros
ITEM_FREQUENCY and so on are equally unsafe.

OK. So the solution is to ditch the casts altogether, and then do plain
pointer arithmetics like this:

#define ITEM_INDEXES(item) (item)
#define ITEM_NULLS(item,ndims) (ITEM_INDEXES(item) + (ndims))
#define ITEM_FREQUENCY(item,ndims) (ITEM_NULLS(item, ndims) + (ndims))
#define ITEM_BASE_FREQUENCY(item,ndims) (ITEM_FREQUENCY(item, ndims) + sizeof(double))

Or is that still relying on alignment, somehow?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#2)
Re: mcvstats serialization code is still shy of a load

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Tue, Jun 25, 2019 at 11:52:28PM -0400, Tom Lane wrote:

You can *not* cast something to an aligned pointer type if it's not
actually certain to be aligned suitably for that type.

OK. So the solution is to ditch the casts altogether, and then do plain
pointer arithmetics like this:

#define ITEM_INDEXES(item) (item)
#define ITEM_NULLS(item,ndims) (ITEM_INDEXES(item) + (ndims))
#define ITEM_FREQUENCY(item,ndims) (ITEM_NULLS(item, ndims) + (ndims))
#define ITEM_BASE_FREQUENCY(item,ndims) (ITEM_FREQUENCY(item, ndims) + sizeof(double))

Or is that still relying on alignment, somehow?

No, constructs like a char* pointer plus n times sizeof(something) should
be safe.

regards, tom lane

#4Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#2)
1 attachment(s)
Re: mcvstats serialization code is still shy of a load

On Wed, Jun 26, 2019 at 09:49:46AM +0200, Tomas Vondra wrote:

On Tue, Jun 25, 2019 at 11:52:28PM -0400, Tom Lane wrote:

I'm seeing a reproducible bus error here:

#0 0x00417420 in statext_mcv_serialize (mcvlist=0x62223450, stats=Variable "stats" is not available.
)
at mcv.c:785
785 memcpy(ITEM_BASE_FREQUENCY(item, ndims), &mcvitem->base_frequency, sizeof(double));

What appears to be happening is that since ITEM_BASE_FREQUENCY is defined as

#define ITEM_BASE_FREQUENCY(item,ndims) ((double *) (ITEM_FREQUENCY(item, ndims) + 1))

the compiler is assuming that the first argument to memcpy is
double-aligned, and it is generating code that depends on that being
true, and of course it isn't true and kaboom.

You can *not* cast something to an aligned pointer type if it's not
actually certain to be aligned suitably for that type. In this example,
even if you wrote "(char *)" in front of this, it wouldn't save you;
the compiler would still be entitled to believe that the intermediate
cast value meant something. The casts in the underlying macros
ITEM_FREQUENCY and so on are equally unsafe.

OK. So the solution is to ditch the casts altogether, and then do plain
pointer arithmetics like this:

#define ITEM_INDEXES(item) (item)
#define ITEM_NULLS(item,ndims) (ITEM_INDEXES(item) + (ndims))
#define ITEM_FREQUENCY(item,ndims) (ITEM_NULLS(item, ndims) + (ndims))
#define ITEM_BASE_FREQUENCY(item,ndims) (ITEM_FREQUENCY(item, ndims) + sizeof(double))

Or is that still relying on alignment, somehow?

Attached is a patch that should (hopefully) fix this. It essentially
treats the item as (char *) and does all pointer arithmetics without any
additional casts. So there are no intermediate casts.

I have no way to test this, so I may either wait for you to test this
first, or push and wait. It seems to fail only on a very small number of
buildfarm animals, so having a confirmation would be nice.

The fix keeps the binary format as is, so the serialized MCV items are
max-aligned. That means we can access the uint16 indexes directly, but we
need to copy the rest of the fields (because those may not be aligned). In
hindsight that seems a bit silly, we might as well copy everything, not
care about the alignment and maybe save a few more bytes. But that would
require catversion bump. OTOH we may beed to do that anyway, to fix the
pg_mcv_list_items() signature (as discussed in the other MCV thread).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

mcv-serialization-fix.patchtext/plain; charset=us-asciiDownload
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 5fe61ea0a4..01ea05b486 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -57,10 +57,10 @@
 /*
  * Macros for convenient access to parts of a serialized MCV item.
  */
-#define ITEM_INDEXES(item)			((uint16 *) (item))
-#define ITEM_NULLS(item,ndims)		((bool *) (ITEM_INDEXES(item) + (ndims)))
-#define ITEM_FREQUENCY(item,ndims)	((double *) (ITEM_NULLS(item, ndims) + (ndims)))
-#define ITEM_BASE_FREQUENCY(item,ndims)	((double *) (ITEM_FREQUENCY(item, ndims) + 1))
+#define ITEM_INDEXES(item)			((char *) item)
+#define ITEM_NULLS(item,ndims)		(ITEM_INDEXES(item) + (ndims) * sizeof(uint16))
+#define ITEM_FREQUENCY(item,ndims)	(ITEM_NULLS(item, ndims) + (ndims) * sizeof(bool))
+#define ITEM_BASE_FREQUENCY(item,ndims)	(ITEM_FREQUENCY(item, ndims) + sizeof(double))
 
 /*
  * Used to compute size of serialized MCV list representation.
@@ -751,6 +751,7 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 	for (i = 0; i < mcvlist->nitems; i++)
 	{
 		MCVItem    *mcvitem = &mcvlist->items[i];
+		uint16	   *indexes = (uint16 *) ITEM_INDEXES(item);
 
 		/* don't write beyond the allocated space */
 		Assert(ptr <= raw + total_length - itemsize);
@@ -773,10 +774,10 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 			Assert(value != NULL);	/* serialization or deduplication error */
 
 			/* compute index within the array */
-			ITEM_INDEXES(item)[dim] = (uint16) (value - values[dim]);
+			indexes[dim] = (uint16) (value - values[dim]);
 
 			/* check the index is within expected bounds */
-			Assert(ITEM_INDEXES(item)[dim] < info[dim].nvalues);
+			Assert(indexes[dim] < info[dim].nvalues);
 		}
 
 		/* copy NULL and frequency flags into the item */
@@ -1087,10 +1088,13 @@ statext_mcv_deserialize(bytea *data)
 		item->isnull = (bool *) isnullptr;
 		isnullptr += MAXALIGN(sizeof(bool) * ndims);
 
+		/*
+		 * To access indexes, we can just point to the right place (we can
+		 * do this, because we ensure alignment during serialization).
+		 */
+		indexes = (uint16 *) ITEM_INDEXES(ptr);
 
-		/* just point to the right place */
-		indexes = ITEM_INDEXES(ptr);
-
+		/* The remaining fields may not be aligned, though. */
 		memcpy(item->isnull, ITEM_NULLS(ptr, ndims), sizeof(bool) * ndims);
 		memcpy(&item->frequency, ITEM_FREQUENCY(ptr, ndims), sizeof(double));
 		memcpy(&item->base_frequency, ITEM_BASE_FREQUENCY(ptr, ndims), sizeof(double));
#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#4)
Re: mcvstats serialization code is still shy of a load

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

Attached is a patch that should (hopefully) fix this. It essentially
treats the item as (char *) and does all pointer arithmetics without any
additional casts. So there are no intermediate casts.

This passes the eyeball test, and it also allows my OpenBSD/hppa
installation to get through the core regression tests, so I think
it's good as far as it goes. Please push.

However ... nosing around in mcv.c, I noticed that the next macro:

/*
* Used to compute size of serialized MCV list representation.
*/
#define MinSizeOfMCVList \
(VARHDRSZ + sizeof(uint32) * 3 + sizeof(AttrNumber))

#define SizeOfMCVList(ndims,nitems) \
(MAXALIGN(MinSizeOfMCVList + sizeof(Oid) * (ndims)) + \
MAXALIGN((ndims) * sizeof(DimensionInfo)) + \
MAXALIGN((nitems) * ITEM_SIZE(ndims)))

is both woefully underdocumented and completely at variance with
reality. It doesn't seem to be accounting for the actual data values.
No doubt this is why it's not used in the places where it'd matter;
the tests that do use it are testing much weaker conditions than they
should.

The fix keeps the binary format as is, so the serialized MCV items are
max-aligned. That means we can access the uint16 indexes directly, but we
need to copy the rest of the fields (because those may not be aligned). In
hindsight that seems a bit silly, we might as well copy everything, not
care about the alignment and maybe save a few more bytes.

I think that part of the problem here is that the way this code is
written, "maxaligned" is no such thing. What you're actually maxaligning
seems to be the offset from the start of the data area of a varlena value,
which is generally going to be a maxaligned palloc result plus 4 bytes.
So "aligned" double values are actually guaranteed to be on odd word
boundaries not even ones.

What's more, it's difficult to convince oneself that the maxaligns done
in different parts of the code are all enforcing the same choices about
which substructures get pseudo-maxaligned and which don't, because the
logic doesn't line up very well.

If we do need another catversion bump before v12, I'd vote for ripping
out Every Single One of the "maxalign" operations in this code, just
on the grounds of code simplicity and bug reduction.

regards, tom lane

#6Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#5)
Re: mcvstats serialization code is still shy of a load

On Wed, Jun 26, 2019 at 11:26:21AM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

Attached is a patch that should (hopefully) fix this. It essentially
treats the item as (char *) and does all pointer arithmetics without any
additional casts. So there are no intermediate casts.

This passes the eyeball test, and it also allows my OpenBSD/hppa
installation to get through the core regression tests, so I think
it's good as far as it goes. Please push.

However ... nosing around in mcv.c, I noticed that the next macro:

/*
* Used to compute size of serialized MCV list representation.
*/
#define MinSizeOfMCVList \
(VARHDRSZ + sizeof(uint32) * 3 + sizeof(AttrNumber))

#define SizeOfMCVList(ndims,nitems) \
(MAXALIGN(MinSizeOfMCVList + sizeof(Oid) * (ndims)) + \
MAXALIGN((ndims) * sizeof(DimensionInfo)) + \
MAXALIGN((nitems) * ITEM_SIZE(ndims)))

is both woefully underdocumented and completely at variance with
reality. It doesn't seem to be accounting for the actual data values.
No doubt this is why it's not used in the places where it'd matter;
the tests that do use it are testing much weaker conditions than they
should.

I agree about the macro being underdocumented, but AFAICS it's used
correctly to check the expected length. It can't include the data values
directly, because that's variable amount of data - and it's encoded in not
yet verified part of the data.

So this only includes parts with known lengths, and then the code does
this:

for (dim = 0; dim < ndims; dim++)
{
...
expected_size += MAXALIGN(info[dim].nbytes);
}

and uses that to check the actual length.

if (VARSIZE_ANY(data) != expected_size)
elog(ERROR, ...);

That being said, maybe this is unnecessarily defensive and we should just
trust the values not being corrupted. So if we get pg_mcv_list value, we'd
simply assume it's OK.

The fix keeps the binary format as is, so the serialized MCV items are
max-aligned. That means we can access the uint16 indexes directly, but we
need to copy the rest of the fields (because those may not be aligned). In
hindsight that seems a bit silly, we might as well copy everything, not
care about the alignment and maybe save a few more bytes.

I think that part of the problem here is that the way this code is
written, "maxaligned" is no such thing. What you're actually maxaligning
seems to be the offset from the start of the data area of a varlena value,
which is generally going to be a maxaligned palloc result plus 4 bytes.
So "aligned" double values are actually guaranteed to be on odd word
boundaries not even ones.

I don't think so. The pointers should be maxaligned with respect to the
whole varlena value, which is what 'raw' points to. At least that was the
intent of code like this:

raw = palloc0(total_length);

...

/* the header may not be exactly aligned, so make sure it is */
ptr = raw + MAXALIGN(ptr - raw);

If it's not like that in some place, it's a bug.

What's more, it's difficult to convince oneself that the maxaligns done
in different parts of the code are all enforcing the same choices about
which substructures get pseudo-maxaligned and which don't, because the
logic doesn't line up very well.

Not sure. If there's a way to make it clearer, I'm ready to do the work.
Unfortunately it's hard for me to judge that, because I've spent so much
time on that code that it seems fairly clear to me.

If we do need another catversion bump before v12, I'd vote for ripping
out Every Single One of the "maxalign" operations in this code, just
on the grounds of code simplicity and bug reduction.

Hmmm, OK. The original reason to keep the parts aligned was to be able to
reference the parts directly during processing. If we get rid of the
alignment, we'll have to memcpy everything during deserialization. But
if it makes the code simpler, it might be worth it - this part of the
code was clearly the weakest part of the patch.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#6)
Re: mcvstats serialization code is still shy of a load

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Wed, Jun 26, 2019 at 11:26:21AM -0400, Tom Lane wrote:

#define SizeOfMCVList(ndims,nitems) \
is both woefully underdocumented and completely at variance with
reality. It doesn't seem to be accounting for the actual data values.

I agree about the macro being underdocumented, but AFAICS it's used
correctly to check the expected length. It can't include the data values
directly, because that's variable amount of data - and it's encoded in not
yet verified part of the data.

Well, it should have some other name then. Or *at least* a comment.
It's unbelievably misleading as it stands.

That being said, maybe this is unnecessarily defensive and we should just
trust the values not being corrupted.

No, I'm on board with checking the lengths. I just don't like how
hard it is to discern what's being checked.

I think that part of the problem here is that the way this code is
written, "maxaligned" is no such thing. What you're actually maxaligning
seems to be the offset from the start of the data area of a varlena value,

I don't think so. The pointers should be maxaligned with respect to the
whole varlena value, which is what 'raw' points to.

[ squint ... ] OK, I think I misread this:

statext_mcv_deserialize(bytea *data)
{
...
/* pointer to the data part (skip the varlena header) */
ptr = VARDATA_ANY(data);
raw = (char *) data;

I think this is confusing in itself --- I read it as "raw = (char *) ptr"
and I think most other people would assume that too based on the order
of operations. It'd read better as

/* remember start of datum for maxalign reference */
raw = (char *) data;

/* pointer to the data part (skip the varlena header) */
ptr = VARDATA_ANY(data);

Another problem with this code is that it flat doesn't work for
non-4-byte-header varlenas: it'd do the alignment differently than the
serialization side did. That's okay given that the two extant call sites
are guaranteed to pass detoasted datums. But using VARDATA_ANY gives a
completely misleading appearance of being ready to deal with short-header
varlenas, and heaven forbid there should be any comment to discourage
future coders from trying. So really what I'd like to see here is

/* remember start of datum for maxalign reference */
raw = (char *) data;

/* alignment logic assumes full-size datum header */
Assert(VARATT_IS_4B(data));

/* pointer to the data part (skip the varlena header) */
ptr = VARDATA_ANY(data);

Or, of course, this could all go away if we got rid of the
bogus maxaligning...

regards, tom lane

#8Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#7)
1 attachment(s)
Re: mcvstats serialization code is still shy of a load

On Wed, Jun 26, 2019 at 12:31:13PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Wed, Jun 26, 2019 at 11:26:21AM -0400, Tom Lane wrote:

#define SizeOfMCVList(ndims,nitems) \
is both woefully underdocumented and completely at variance with
reality. It doesn't seem to be accounting for the actual data values.

I agree about the macro being underdocumented, but AFAICS it's used
correctly to check the expected length. It can't include the data values
directly, because that's variable amount of data - and it's encoded in not
yet verified part of the data.

Well, it should have some other name then. Or *at least* a comment.
It's unbelievably misleading as it stands.

True.

That being said, maybe this is unnecessarily defensive and we should just
trust the values not being corrupted.

No, I'm on board with checking the lengths. I just don't like how
hard it is to discern what's being checked.

Understood.

I think that part of the problem here is that the way this code is
written, "maxaligned" is no such thing. What you're actually maxaligning
seems to be the offset from the start of the data area of a varlena value,

I don't think so. The pointers should be maxaligned with respect to the
whole varlena value, which is what 'raw' points to.

[ squint ... ] OK, I think I misread this:

statext_mcv_deserialize(bytea *data)
{
...
/* pointer to the data part (skip the varlena header) */
ptr = VARDATA_ANY(data);
raw = (char *) data;

I think this is confusing in itself --- I read it as "raw = (char *) ptr"
and I think most other people would assume that too based on the order
of operations. It'd read better as

/* remember start of datum for maxalign reference */
raw = (char *) data;

/* pointer to the data part (skip the varlena header) */
ptr = VARDATA_ANY(data);

Yeah, that'd have been better.

Another problem with this code is that it flat doesn't work for
non-4-byte-header varlenas: it'd do the alignment differently than the
serialization side did. That's okay given that the two extant call sites
are guaranteed to pass detoasted datums. But using VARDATA_ANY gives a
completely misleading appearance of being ready to deal with short-header
varlenas, and heaven forbid there should be any comment to discourage
future coders from trying. So really what I'd like to see here is

/* remember start of datum for maxalign reference */
raw = (char *) data;

/* alignment logic assumes full-size datum header */
Assert(VARATT_IS_4B(data));

/* pointer to the data part (skip the varlena header) */
ptr = VARDATA_ANY(data);

Or, of course, this could all go away if we got rid of the
bogus maxaligning...

OK. Attached is a patch ditching the alignment in serialized data. I've
ditched the macros to access parts of serialized data, and everything
gets copied.

The main complication is with varlena values, which may or may not have
4B headers (for now there's the PG_DETOAST_DATUM call, but as you
mentioned we may want to remove it in the future). So I've stored the
length as uint32 separately, followed by the full varlena value (thanks
to that the deserialization is simpler). Not sure if that's the best
solution, though, because this way we store the length twice.

I've kept the alignment in the deserialization code, because there it
allows us to allocate the whole value as a single chunk, which I think
is useful (I admit I don't have any measurements to demonstrate that).
But if we decide to rework this later, we can - it's just in-memory
representation, not on-disk.

Is this roughly what you had in mind?

FWIW I'm sure some of the comments are stale and/or need clarification,
but it's a bit too late over here, so I'll look into that tomorrow.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

mcv-serialization-rework.patchtext/plain; charset=us-asciiDownload
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 5fe61ea0a4..93c774ea1c 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -52,15 +52,7 @@
  *	 ndim * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double)
  */
 #define ITEM_SIZE(ndims)	\
-	MAXALIGN((ndims) * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double))
-
-/*
- * Macros for convenient access to parts of a serialized MCV item.
- */
-#define ITEM_INDEXES(item)			((uint16 *) (item))
-#define ITEM_NULLS(item,ndims)		((bool *) (ITEM_INDEXES(item) + (ndims)))
-#define ITEM_FREQUENCY(item,ndims)	((double *) (ITEM_NULLS(item, ndims) + (ndims)))
-#define ITEM_BASE_FREQUENCY(item,ndims)	((double *) (ITEM_FREQUENCY(item, ndims) + 1))
+	((ndims) * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double))
 
 /*
  * Used to compute size of serialized MCV list representation.
@@ -68,10 +60,15 @@
 #define MinSizeOfMCVList		\
 	(VARHDRSZ + sizeof(uint32) * 3 + sizeof(AttrNumber))
 
+/*
+ * Size of the serialized MCV list, exluding the space needed for
+ * deduplicated per-dimension values. That's because the macro is
+ * used in places when it's not yet safe to access the data.
+ */
 #define SizeOfMCVList(ndims,nitems)	\
-	(MAXALIGN(MinSizeOfMCVList + sizeof(Oid) * (ndims)) + \
-	 MAXALIGN((ndims) * sizeof(DimensionInfo)) + \
-	 MAXALIGN((nitems) * ITEM_SIZE(ndims)))
+	((MinSizeOfMCVList + sizeof(Oid) * (ndims)) + \
+	 ((ndims) * sizeof(DimensionInfo)) + \
+	 ((nitems) * ITEM_SIZE(ndims)))
 
 static MultiSortSupport build_mss(VacAttrStats **stats, int numattrs);
 
@@ -491,19 +488,16 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 	int			i;
 	int			dim;
 	int			ndims = mcvlist->ndimensions;
-	int			itemsize = ITEM_SIZE(ndims);
 
 	SortSupport ssup;
 	DimensionInfo *info;
 
 	Size		total_length;
 
-	/* allocate the item just once */
-	char	   *item = palloc0(itemsize);
-
 	/* serialized items (indexes into arrays, etc.) */
-	char	   *raw;
+	bytea	   *raw;
 	char	   *ptr;
+	char	   *endptr PG_USED_FOR_ASSERTS_ONLY;
 
 	/* values per dimension (and number of non-NULL values) */
 	Datum	  **values = (Datum **) palloc0(sizeof(Datum *) * ndims);
@@ -597,8 +591,17 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 		 */
 		info[dim].nvalues = ndistinct;
 
-		if (info[dim].typlen > 0)	/* fixed-length data types */
+		if (info[dim].typbyval)	/* by-value data types */
+		{
+			info[dim].nbytes = info[dim].nvalues * info[dim].typlen;
+			/* we copy the data into the item, so we don't need extra space */
+			info[dim].nbytes_aligned = 0;
+		}
+		else if (info[dim].typlen > 0)		/* fixed-length by-ref */
+		{
 			info[dim].nbytes = info[dim].nvalues * info[dim].typlen;
+			info[dim].nbytes_aligned = info[dim].nvalues * MAXALIGN(info[dim].typlen);
+		}
 		else if (info[dim].typlen == -1)	/* varlena */
 		{
 			info[dim].nbytes = 0;
@@ -609,7 +612,11 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 				values[dim][i] = PointerGetDatum(PG_DETOAST_DATUM(values[dim][i]));
 
 				len = VARSIZE_ANY(values[dim][i]);
-				info[dim].nbytes += MAXALIGN(len);
+				info[dim].nbytes += sizeof(uint32);	/* length */
+				info[dim].nbytes += len;			/* value (with header) */
+
+				/* space needed for max-aligned copies (deserialization) */
+				info[dim].nbytes_aligned += MAXALIGN(len);
 			}
 		}
 		else if (info[dim].typlen == -2)	/* cstring */
@@ -619,11 +626,15 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 			{
 				Size		len;
 
-				/* c-strings include terminator, so +1 byte */
 				values[dim][i] = PointerGetDatum(PG_DETOAST_DATUM(values[dim][i]));
 
+				/* c-strings include terminator, so +1 byte */
 				len = strlen(DatumGetCString(values[dim][i])) + 1;
-				info[dim].nbytes += MAXALIGN(len);
+				info[dim].nbytes += sizeof(uint32);	/* length */
+				info[dim].nbytes += len;			/* value */
+
+				/* space needed for max-aligned copies (deserialization) */
+				info[dim].nbytes_aligned += MAXALIGN(len);
 			}
 		}
 
@@ -635,34 +646,33 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 	 * Now we can finally compute how much space we'll actually need for the
 	 * whole serialized MCV list (varlena header, MCV header, dimension info
 	 * for each attribute, deduplicated values and items).
-	 *
-	 * The header fields are copied one by one, so that we don't need any
-	 * explicit alignment (we copy them while deserializing). All fields after
-	 * this need to be properly aligned, for direct access.
 	 */
-	total_length = MAXALIGN(VARHDRSZ + (3 * sizeof(uint32))
-							+ sizeof(AttrNumber) + (ndims * sizeof(Oid)));
+	total_length = (3 * sizeof(uint32))			/* magic + type + nitems */
+					+ sizeof(AttrNumber)		/* ndimensions */
+					+ (ndims * sizeof(Oid));	/* types attributes */
 
 	/* dimension info */
-	total_length += MAXALIGN(ndims * sizeof(DimensionInfo));
+	total_length += ndims * sizeof(DimensionInfo);
 
 	/* add space for the arrays of deduplicated values */
 	for (i = 0; i < ndims; i++)
-		total_length += MAXALIGN(info[i].nbytes);
+		total_length += info[i].nbytes;
 
 	/*
-	 * And finally the items (no additional alignment needed, we start at
-	 * proper alignment and the itemsize formula uses MAXALIGN)
+	 * And finally account for the items (those are fixed-length, thanks to
+	 * replacing values with uint16 indexes into the deduplicated arrays).
 	 */
-	total_length += mcvlist->nitems * itemsize;
+	total_length += mcvlist->nitems * ITEM_SIZE(dim);
 
 	/*
 	 * Allocate space for the whole serialized MCV list (we'll skip bytes, so
 	 * we set them to zero to make the result more compressible).
 	 */
-	raw = palloc0(total_length);
-	SET_VARSIZE(raw, total_length);
+	raw = (bytea *) palloc0(VARHDRSZ + total_length);
+	SET_VARSIZE(raw, VARHDRSZ + total_length);
+
 	ptr = VARDATA(raw);
+	endptr = ptr + total_length;
 
 	/* copy the MCV list header fields, one by one */
 	memcpy(ptr, &mcvlist->magic, sizeof(uint32));
@@ -680,12 +690,9 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 	memcpy(ptr, mcvlist->types, sizeof(Oid) * ndims);
 	ptr += (sizeof(Oid) * ndims);
 
-	/* the header may not be exactly aligned, so make sure it is */
-	ptr = raw + MAXALIGN(ptr - raw);
-
-	/* store information about the attributes */
+	/* store information about the attributes (data amounts, ...) */
 	memcpy(ptr, info, sizeof(DimensionInfo) * ndims);
-	ptr += MAXALIGN(sizeof(DimensionInfo) * ndims);
+	ptr += sizeof(DimensionInfo) * ndims;
 
 	/* Copy the deduplicated values for all attributes to the output. */
 	for (dim = 0; dim < ndims; dim++)
@@ -723,17 +730,27 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 			}
 			else if (info[dim].typlen == -1)	/* varlena */
 			{
-				int			len = VARSIZE_ANY(value);
+				uint32		len = VARSIZE_ANY(value);
 
+				/* copy the length */
+				memcpy(ptr, &len, sizeof(uint32));
+				ptr += sizeof(uint32);
+
+				/* value (with the varlena header) */
 				memcpy(ptr, DatumGetPointer(value), len);
-				ptr += MAXALIGN(len);
+				ptr += len;
 			}
 			else if (info[dim].typlen == -2)	/* cstring */
 			{
-				Size		len = strlen(DatumGetCString(value)) + 1;	/* terminator */
+				uint32		len = (uint32) strlen(DatumGetCString(value)) + 1;
+
+				/* copy the length */
+				memcpy(ptr, &len, sizeof(uint32));
+				ptr += sizeof(uint32);
 
+				/* value */
 				memcpy(ptr, DatumGetCString(value), len);
-				ptr += MAXALIGN(len);
+				ptr += len;
 			}
 
 			/* no underflows or overflows */
@@ -742,62 +759,65 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 
 		/* we should get exactly nbytes of data for this dimension */
 		Assert((ptr - start) == info[dim].nbytes);
-
-		/* make sure the pointer is aligned correctly after each dimension */
-		ptr = raw + MAXALIGN(ptr - raw);
 	}
 
 	/* Serialize the items, with uint16 indexes instead of the values. */
 	for (i = 0; i < mcvlist->nitems; i++)
 	{
 		MCVItem    *mcvitem = &mcvlist->items[i];
+		int			itemlen = ITEM_SIZE(dim);
 
 		/* don't write beyond the allocated space */
-		Assert(ptr <= raw + total_length - itemsize);
+		Assert(ptr <= (endptr - itemlen));
+
+		/* copy NULL and frequency flags into the serialized MCV */
+		memcpy(ptr, mcvitem->isnull, sizeof(bool) * ndims);
+		ptr += sizeof(bool) * ndims;
+
+		memcpy(ptr, &mcvitem->frequency, sizeof(double));
+		ptr += sizeof(double);
 
-		/* reset the item (we only allocate it once and reuse it) */
-		memset(item, 0, itemsize);
+		memcpy(ptr, &mcvitem->base_frequency, sizeof(double));
+		ptr += sizeof(double);
 
+		/* store the indexes last */
 		for (dim = 0; dim < ndims; dim++)
 		{
+			uint16		index = 0;
 			Datum	   *value;
 
 			/* do the lookup only for non-NULL values */
-			if (mcvitem->isnull[dim])
-				continue;
+			if (!mcvitem->isnull[dim])
+			{
+				value = (Datum *) bsearch_arg(&mcvitem->values[dim], values[dim],
+											  info[dim].nvalues, sizeof(Datum),
+											  compare_scalars_simple, &ssup[dim]);
 
-			value = (Datum *) bsearch_arg(&mcvitem->values[dim], values[dim],
-										  info[dim].nvalues, sizeof(Datum),
-										  compare_scalars_simple, &ssup[dim]);
+				Assert(value != NULL);	/* serialization or deduplication error */
 
-			Assert(value != NULL);	/* serialization or deduplication error */
+				/* compute index within the deduplicated array */
+				index = (uint16) (value - values[dim]);
 
-			/* compute index within the array */
-			ITEM_INDEXES(item)[dim] = (uint16) (value - values[dim]);
+				/* check the index is within expected bounds */
+				Assert(index < info[dim].nvalues);
+			}
 
-			/* check the index is within expected bounds */
-			Assert(ITEM_INDEXES(item)[dim] < info[dim].nvalues);
+			/* copy the index into the serialized MCV */
+			memcpy(ptr, &index, sizeof(uint16));
+			ptr += sizeof(uint16);
 		}
 
-		/* copy NULL and frequency flags into the item */
-		memcpy(ITEM_NULLS(item, ndims), mcvitem->isnull, sizeof(bool) * ndims);
-		memcpy(ITEM_FREQUENCY(item, ndims), &mcvitem->frequency, sizeof(double));
-		memcpy(ITEM_BASE_FREQUENCY(item, ndims), &mcvitem->base_frequency, sizeof(double));
-
-		/* copy the serialized item into the array */
-		memcpy(ptr, item, itemsize);
-
-		ptr += itemsize;
+		/* make sure we don't overflow the allocated value */
+		Assert(ptr <= endptr);
 	}
 
 	/* at this point we expect to match the total_length exactly */
-	Assert((ptr - raw) == total_length);
+	Assert(ptr == endptr);
 
-	pfree(item);
 	pfree(values);
 	pfree(counts);
 
-	return (bytea *) raw;
+	return raw;
 }
 
 /*
@@ -816,6 +836,7 @@ statext_mcv_deserialize(bytea *data)
 	MCVList    *mcvlist;
 	char	   *raw;
 	char	   *ptr;
+	char	   *endptr PG_USED_FOR_ASSERTS_ONLY;
 
 	int			ndims,
 				nitems;
@@ -849,8 +870,9 @@ statext_mcv_deserialize(bytea *data)
 	mcvlist = (MCVList *) palloc0(offsetof(MCVList, items));
 
 	/* pointer to the data part (skip the varlena header) */
-	ptr = VARDATA_ANY(data);
 	raw = (char *) data;
+	ptr = VARDATA_ANY(raw);
+	endptr = (char *) raw + VARSIZE_ANY(data);
 
 	/* get the header and perform further sanity checks */
 	memcpy(&mcvlist->magic, ptr, sizeof(uint32));
@@ -909,12 +931,11 @@ statext_mcv_deserialize(bytea *data)
 	memcpy(mcvlist->types, ptr, sizeof(Oid) * ndims);
 	ptr += (sizeof(Oid) * ndims);
 
-	/* ensure alignment of the pointer (after the header fields) */
-	ptr = raw + MAXALIGN(ptr - raw);
-
 	/* Now it's safe to access the dimension info. */
-	info = (DimensionInfo *) ptr;
-	ptr += MAXALIGN(ndims * sizeof(DimensionInfo));
+	info = palloc(ndims * sizeof(DimensionInfo));
+
+	memcpy(info, ptr, ndims * sizeof(DimensionInfo));
+	ptr += (ndims * sizeof(DimensionInfo));
 
 	/* account for the value arrays */
 	for (dim = 0; dim < ndims; dim++)
@@ -926,7 +947,7 @@ statext_mcv_deserialize(bytea *data)
 		Assert(info[dim].nvalues >= 0);
 		Assert(info[dim].nbytes >= 0);
 
-		expected_size += MAXALIGN(info[dim].nbytes);
+		expected_size += info[dim].nbytes;
 	}
 
 	/*
@@ -955,15 +976,18 @@ statext_mcv_deserialize(bytea *data)
 		map[dim] = (Datum *) palloc(sizeof(Datum) * info[dim].nvalues);
 
 		/* space needed for a copy of data for by-ref types */
-		if (!info[dim].typbyval)
-			datalen += MAXALIGN(info[dim].nbytes);
+		datalen += info[dim].nbytes_aligned;
 	}
 
 	/*
-	 * Now resize the MCV list so that the allocation includes all the data
+	 * Now resize the MCV list so that the allocation includes all the data.
+	 *
 	 * Allocate space for a copy of the data, as we can't simply reference the
-	 * original data - it may disappear while we're still using the MCV list,
-	 * e.g. due to catcache release. Only needed for by-ref types.
+	 * serialized data - it's not aligned properly, and it may disappear while
+	 * we're still using the MCV list, e.g. due to catcache release.
+	 *
+	 * We do care about alignment here, because we will allocate all the pieces
+	 * at once, but then use pointers to different parts.
 	 */
 	mcvlen = MAXALIGN(offsetof(MCVList, items) + (sizeof(MCVItem) * nitems));
 
@@ -1024,7 +1048,7 @@ statext_mcv_deserialize(bytea *data)
 
 					/* just point into the array */
 					map[dim][i] = PointerGetDatum(dataptr);
-					dataptr += info[dim].typlen;
+					dataptr += MAXALIGN(info[dim].typlen);
 				}
 			}
 			else if (info[dim].typlen == -1)
@@ -1032,10 +1056,13 @@ statext_mcv_deserialize(bytea *data)
 				/* varlena */
 				for (i = 0; i < info[dim].nvalues; i++)
 				{
-					Size		len = VARSIZE_ANY(ptr);
+					uint32		len;
+
+					memcpy(&len, ptr, sizeof(uint32));
+					ptr += sizeof(uint32);
 
 					memcpy(dataptr, ptr, len);
-					ptr += MAXALIGN(len);
+					ptr += len;
 
 					/* just point into the array */
 					map[dim][i] = PointerGetDatum(dataptr);
@@ -1047,10 +1074,13 @@ statext_mcv_deserialize(bytea *data)
 				/* cstring */
 				for (i = 0; i < info[dim].nvalues; i++)
 				{
-					Size		len = (strlen(ptr) + 1);	/* don't forget the \0 */
+					uint32		len;
+
+					memcpy(&len, ptr, sizeof(uint32));
+					ptr += sizeof(uint32);
 
 					memcpy(dataptr, ptr, len);
-					ptr += MAXALIGN(len);
+					ptr += len;
 
 					/* just point into the array */
 					map[dim][i] = PointerGetDatum(dataptr);
@@ -1067,9 +1097,6 @@ statext_mcv_deserialize(bytea *data)
 
 		/* check we consumed input data for this dimension exactly */
 		Assert(ptr == (start + info[dim].nbytes));
-
-		/* ensure proper alignment of the data */
-		ptr = raw + MAXALIGN(ptr - raw);
 	}
 
 	/* we should have also filled the MCV list exactly */
@@ -1078,7 +1105,6 @@ statext_mcv_deserialize(bytea *data)
 	/* deserialize the MCV items and translate the indexes to Datums */
 	for (i = 0; i < nitems; i++)
 	{
-		uint16	   *indexes = NULL;
 		MCVItem    *item = &mcvlist->items[i];
 
 		item->values = (Datum *) valuesptr;
@@ -1087,27 +1113,35 @@ statext_mcv_deserialize(bytea *data)
 		item->isnull = (bool *) isnullptr;
 		isnullptr += MAXALIGN(sizeof(bool) * ndims);
 
+		memcpy(item->isnull, ptr, sizeof(bool) * ndims);
+		ptr += sizeof(bool) * ndims;
 
-		/* just point to the right place */
-		indexes = ITEM_INDEXES(ptr);
+		memcpy(&item->frequency, ptr, sizeof(double));
+		ptr += sizeof(double);
 
-		memcpy(item->isnull, ITEM_NULLS(ptr, ndims), sizeof(bool) * ndims);
-		memcpy(&item->frequency, ITEM_FREQUENCY(ptr, ndims), sizeof(double));
-		memcpy(&item->base_frequency, ITEM_BASE_FREQUENCY(ptr, ndims), sizeof(double));
+		memcpy(&item->base_frequency, ptr, sizeof(double));
+		ptr += sizeof(double);
 
-		/* translate the values */
+		/* finally translate the indexes (for non-NULL only) */
 		for (dim = 0; dim < ndims; dim++)
-			if (!item->isnull[dim])
-				item->values[dim] = map[dim][indexes[dim]];
+		{
+			uint16	index;
+
+			memcpy(&index, ptr, sizeof(uint16));
+			ptr += sizeof(uint16);
 
-		ptr += ITEM_SIZE(ndims);
+			if (item->isnull[dim])
+				continue;
+
+			item->values[dim] = map[dim][index];
+		}
 
 		/* check we're not overflowing the input */
-		Assert(ptr <= (char *) raw + VARSIZE_ANY(data));
+		Assert(ptr <= endptr);
 	}
 
 	/* check that we processed all the data */
-	Assert(ptr == raw + VARSIZE_ANY(data));
+	Assert(ptr == endptr);
 
 	/* release the buffers used for mapping */
 	for (dim = 0; dim < ndims; dim++)
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index fb03c52f50..6778746db1 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -36,6 +36,7 @@ typedef struct DimensionInfo
 {
 	int			nvalues;		/* number of deduplicated values */
 	int			nbytes;			/* number of bytes (serialized) */
+	int			nbytes_aligned;	/* deserialized data with alignment */
 	int			typlen;			/* pg_type.typlen */
 	bool		typbyval;		/* pg_type.typbyval */
 } DimensionInfo;
#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#8)
Re: mcvstats serialization code is still shy of a load

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

OK. Attached is a patch ditching the alignment in serialized data. I've
ditched the macros to access parts of serialized data, and everything
gets copied.

I lack energy to actually read this patch right now, and I don't currently
have an opinion about whether it's worth another catversion bump to fix
this stuff in v12. But I did test the patch, and I can confirm it gets
through the core regression tests on hppa (both gaur's host environment
with gcc 3.4.6, and the OpenBSD installation with gcc 4.2.1).

regards, tom lane

#10Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#9)
Re: mcvstats serialization code is still shy of a load

On Thu, Jun 27, 2019 at 12:04:30AM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

OK. Attached is a patch ditching the alignment in serialized data. I've
ditched the macros to access parts of serialized data, and everything
gets copied.

I lack energy to actually read this patch right now, and I don't currently
have an opinion about whether it's worth another catversion bump to fix
this stuff in v12. But I did test the patch, and I can confirm it gets
through the core regression tests on hppa (both gaur's host environment
with gcc 3.4.6, and the OpenBSD installation with gcc 4.2.1).

Thanks for running it through regression tests, that alone is a very
useful piece of information for me.

As for the catversion bump - I'd probably vote to do it. Not just because
of this serialization stuff, but to fix the pg_mcv_list_items function.
It's not something I'm very enthusiastic about (kinda embarassed about it,
really), but it seems better than shipping something that we'll need to
rework in PG13.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#10)
1 attachment(s)
Re: mcvstats serialization code is still shy of a load

On Thu, Jun 27, 2019 at 01:26:32PM +0200, Tomas Vondra wrote:

On Thu, Jun 27, 2019 at 12:04:30AM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

OK. Attached is a patch ditching the alignment in serialized data. I've
ditched the macros to access parts of serialized data, and everything
gets copied.

I lack energy to actually read this patch right now, and I don't currently
have an opinion about whether it's worth another catversion bump to fix
this stuff in v12. But I did test the patch, and I can confirm it gets
through the core regression tests on hppa (both gaur's host environment
with gcc 3.4.6, and the OpenBSD installation with gcc 4.2.1).

Thanks for running it through regression tests, that alone is a very
useful piece of information for me.

As for the catversion bump - I'd probably vote to do it. Not just because
of this serialization stuff, but to fix the pg_mcv_list_items function.
It's not something I'm very enthusiastic about (kinda embarassed about it,
really), but it seems better than shipping something that we'll need to
rework in PG13.

Attached is a slightly improved version of the serialization patch. The
main difference is that when serializing varlena values, the previous
patch version stored

length (uint32) + full varlena (incl. the header)

which is kinda redundant, because the varlena stores the length too. So
now it only stores the length + data, without the varlena header. I
don't think there's a better way to store varlena values without
enforcing alignment (which is what happens in current master).

There's one additional change I failed to mention before - I had to add
another field to DimensionInfo, tracking how much space will be needed
for deserialized data. This is needed because the deserialization
allocates the whole MCV as a single chunk of memory, to reduce palloc
overhead. It could parse the data twice (first to determine the space,
then to actually parse it), this allows doing just a single pass. Which
seems useful for large MCV lists, but maybe it's not worth it?

Barring objections I'll commit this together with the pg_mcv_list_items
fix, posted in a separate thread. Of course, this requires catversion
bump - an alternative would be to keep enforcing the alignment, but
tweak the macros to work on all platforms without SIGBUS.

Considering how troublesome this serialiation part of the patch turner
out to be, I'm not really sure by anything at this point. So I'd welcome
thoughts about the proposed changes.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

mcv-serialization-rework-v2.patchtext/plain; charset=us-asciiDownload
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 5fe61ea0a4..256728a02a 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -52,15 +52,7 @@
  *	 ndim * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double)
  */
 #define ITEM_SIZE(ndims)	\
-	MAXALIGN((ndims) * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double))
-
-/*
- * Macros for convenient access to parts of a serialized MCV item.
- */
-#define ITEM_INDEXES(item)			((uint16 *) (item))
-#define ITEM_NULLS(item,ndims)		((bool *) (ITEM_INDEXES(item) + (ndims)))
-#define ITEM_FREQUENCY(item,ndims)	((double *) (ITEM_NULLS(item, ndims) + (ndims)))
-#define ITEM_BASE_FREQUENCY(item,ndims)	((double *) (ITEM_FREQUENCY(item, ndims) + 1))
+	((ndims) * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double))
 
 /*
  * Used to compute size of serialized MCV list representation.
@@ -68,10 +60,16 @@
 #define MinSizeOfMCVList		\
 	(VARHDRSZ + sizeof(uint32) * 3 + sizeof(AttrNumber))
 
+/*
+ * Size of the serialized MCV list, exluding the space needed for
+ * deduplicated per-dimension values. The macro is meant to be used
+ * when it's not yet safe to access the serialized info about amount
+ * of data for each column.
+ */
 #define SizeOfMCVList(ndims,nitems)	\
-	(MAXALIGN(MinSizeOfMCVList + sizeof(Oid) * (ndims)) + \
-	 MAXALIGN((ndims) * sizeof(DimensionInfo)) + \
-	 MAXALIGN((nitems) * ITEM_SIZE(ndims)))
+	((MinSizeOfMCVList + sizeof(Oid) * (ndims)) + \
+	 ((ndims) * sizeof(DimensionInfo)) + \
+	 ((nitems) * ITEM_SIZE(ndims)))
 
 static MultiSortSupport build_mss(VacAttrStats **stats, int numattrs);
 
@@ -491,19 +489,16 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 	int			i;
 	int			dim;
 	int			ndims = mcvlist->ndimensions;
-	int			itemsize = ITEM_SIZE(ndims);
 
 	SortSupport ssup;
 	DimensionInfo *info;
 
 	Size		total_length;
 
-	/* allocate the item just once */
-	char	   *item = palloc0(itemsize);
-
 	/* serialized items (indexes into arrays, etc.) */
-	char	   *raw;
+	bytea	   *raw;
 	char	   *ptr;
+	char	   *endptr PG_USED_FOR_ASSERTS_ONLY;
 
 	/* values per dimension (and number of non-NULL values) */
 	Datum	  **values = (Datum **) palloc0(sizeof(Datum *) * ndims);
@@ -597,8 +592,30 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 		 */
 		info[dim].nvalues = ndistinct;
 
-		if (info[dim].typlen > 0)	/* fixed-length data types */
+		if (info[dim].typbyval)	/* by-value data types */
+		{
 			info[dim].nbytes = info[dim].nvalues * info[dim].typlen;
+
+			/*
+			 * We copy the data into the MCV item during deserialization, so
+			 * we don't need to allocate any extra space.
+			*/
+			info[dim].nbytes_aligned = 0;
+		}
+		else if (info[dim].typlen > 0)		/* fixed-length by-ref */
+		{
+			/*
+			 * We don't care about alignment in the serialized data, so we
+			 * pack the data as much as possible. But we also track how much
+			 * data will be needed after deserialization, and in that case
+			 * we need to account for alignment of each item.
+			 *
+			 * Note: As the items are fixed-length, we could easily compute
+			 * this during deserialization, but we do it here anyway.
+			 */
+			info[dim].nbytes = info[dim].nvalues * info[dim].typlen;
+			info[dim].nbytes_aligned = info[dim].nvalues * MAXALIGN(info[dim].typlen);
+		}
 		else if (info[dim].typlen == -1)	/* varlena */
 		{
 			info[dim].nbytes = 0;
@@ -606,10 +623,24 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 			{
 				Size		len;
 
+				/*
+				 * For varlena values, we detoast the values and store the
+				 * length and data separately. We don't bother with alignment
+				 * here, which means that during deserialization we need to
+				 * copy the fields and only access the copies.
+				 */
 				values[dim][i] = PointerGetDatum(PG_DETOAST_DATUM(values[dim][i]));
 
-				len = VARSIZE_ANY(values[dim][i]);
-				info[dim].nbytes += MAXALIGN(len);
+				/* serialized length (uint32 length + data) */
+				len = VARSIZE_ANY_EXHDR(values[dim][i]);
+				info[dim].nbytes += sizeof(uint32);	/* length */
+				info[dim].nbytes += len;			/* value (no header) */
+
+				/*
+				 * During deserialization we'll build regular varlena values
+				 * with full headers, and we need to align them properly.
+				 */
+				info[dim].nbytes_aligned += MAXALIGN(VARHDRSZ + len);
 			}
 		}
 		else if (info[dim].typlen == -2)	/* cstring */
@@ -619,11 +650,20 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 			{
 				Size		len;
 
-				/* c-strings include terminator, so +1 byte */
-				values[dim][i] = PointerGetDatum(PG_DETOAST_DATUM(values[dim][i]));
+				/*
+				 * For cstring, we do similar thing as for varlena - first we
+				 * store the length as uint32 and then the data. We don't care
+				 * about alignment, which means that during deserialization we
+				 * need to copy the fields and only access the copies.
+				 */
 
+				/* c-strings include terminator, so +1 byte */
 				len = strlen(DatumGetCString(values[dim][i])) + 1;
-				info[dim].nbytes += MAXALIGN(len);
+				info[dim].nbytes += sizeof(uint32);	/* length */
+				info[dim].nbytes += len;			/* value */
+
+				/* space needed for properly aligned deserialized copies */
+				info[dim].nbytes_aligned += MAXALIGN(len);
 			}
 		}
 
@@ -635,34 +675,33 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 	 * Now we can finally compute how much space we'll actually need for the
 	 * whole serialized MCV list (varlena header, MCV header, dimension info
 	 * for each attribute, deduplicated values and items).
-	 *
-	 * The header fields are copied one by one, so that we don't need any
-	 * explicit alignment (we copy them while deserializing). All fields after
-	 * this need to be properly aligned, for direct access.
 	 */
-	total_length = MAXALIGN(VARHDRSZ + (3 * sizeof(uint32))
-							+ sizeof(AttrNumber) + (ndims * sizeof(Oid)));
+	total_length = (3 * sizeof(uint32))			/* magic + type + nitems */
+					+ sizeof(AttrNumber)		/* ndimensions */
+					+ (ndims * sizeof(Oid));	/* attribute types */
 
 	/* dimension info */
-	total_length += MAXALIGN(ndims * sizeof(DimensionInfo));
+	total_length += ndims * sizeof(DimensionInfo);
 
 	/* add space for the arrays of deduplicated values */
 	for (i = 0; i < ndims; i++)
-		total_length += MAXALIGN(info[i].nbytes);
+		total_length += info[i].nbytes;
 
 	/*
-	 * And finally the items (no additional alignment needed, we start at
-	 * proper alignment and the itemsize formula uses MAXALIGN)
+	 * And finally account for the items (those are fixed-length, thanks to
+	 * replacing values with uint16 indexes into the deduplicated arrays).
 	 */
-	total_length += mcvlist->nitems * itemsize;
+	total_length += mcvlist->nitems * ITEM_SIZE(dim);
 
 	/*
 	 * Allocate space for the whole serialized MCV list (we'll skip bytes, so
 	 * we set them to zero to make the result more compressible).
 	 */
-	raw = palloc0(total_length);
-	SET_VARSIZE(raw, total_length);
+	raw = (bytea *) palloc0(VARHDRSZ + total_length);
+	SET_VARSIZE(raw, VARHDRSZ + total_length);
+
 	ptr = VARDATA(raw);
+	endptr = ptr + total_length;
 
 	/* copy the MCV list header fields, one by one */
 	memcpy(ptr, &mcvlist->magic, sizeof(uint32));
@@ -680,12 +719,9 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 	memcpy(ptr, mcvlist->types, sizeof(Oid) * ndims);
 	ptr += (sizeof(Oid) * ndims);
 
-	/* the header may not be exactly aligned, so make sure it is */
-	ptr = raw + MAXALIGN(ptr - raw);
-
-	/* store information about the attributes */
+	/* store information about the attributes (data amounts, ...) */
 	memcpy(ptr, info, sizeof(DimensionInfo) * ndims);
-	ptr += MAXALIGN(sizeof(DimensionInfo) * ndims);
+	ptr += sizeof(DimensionInfo) * ndims;
 
 	/* Copy the deduplicated values for all attributes to the output. */
 	for (dim = 0; dim < ndims; dim++)
@@ -723,17 +759,27 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 			}
 			else if (info[dim].typlen == -1)	/* varlena */
 			{
-				int			len = VARSIZE_ANY(value);
+				uint32		len = VARSIZE_ANY_EXHDR(value);
+
+				/* copy the length */
+				memcpy(ptr, &len, sizeof(uint32));
+				ptr += sizeof(uint32);
 
-				memcpy(ptr, DatumGetPointer(value), len);
-				ptr += MAXALIGN(len);
+				/* data from the varlena value (without the header) */
+				memcpy(ptr, VARDATA(DatumGetPointer(value)), len);
+				ptr += len;
 			}
 			else if (info[dim].typlen == -2)	/* cstring */
 			{
-				Size		len = strlen(DatumGetCString(value)) + 1;	/* terminator */
+				uint32		len = (uint32) strlen(DatumGetCString(value)) + 1;
 
+				/* copy the length */
+				memcpy(ptr, &len, sizeof(uint32));
+				ptr += sizeof(uint32);
+
+				/* value */
 				memcpy(ptr, DatumGetCString(value), len);
-				ptr += MAXALIGN(len);
+				ptr += len;
 			}
 
 			/* no underflows or overflows */
@@ -742,62 +788,65 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 
 		/* we should get exactly nbytes of data for this dimension */
 		Assert((ptr - start) == info[dim].nbytes);
-
-		/* make sure the pointer is aligned correctly after each dimension */
-		ptr = raw + MAXALIGN(ptr - raw);
 	}
 
 	/* Serialize the items, with uint16 indexes instead of the values. */
 	for (i = 0; i < mcvlist->nitems; i++)
 	{
 		MCVItem    *mcvitem = &mcvlist->items[i];
+		int			itemlen = ITEM_SIZE(dim);
 
 		/* don't write beyond the allocated space */
-		Assert(ptr <= raw + total_length - itemsize);
+		Assert(ptr <= (endptr - itemlen));
+
+		/* copy NULL and frequency flags into the serialized MCV */
+		memcpy(ptr, mcvitem->isnull, sizeof(bool) * ndims);
+		ptr += sizeof(bool) * ndims;
 
-		/* reset the item (we only allocate it once and reuse it) */
-		memset(item, 0, itemsize);
+		memcpy(ptr, &mcvitem->frequency, sizeof(double));
+		ptr += sizeof(double);
 
+		memcpy(ptr, &mcvitem->base_frequency, sizeof(double));
+		ptr += sizeof(double);
+
+		/* store the indexes last */
 		for (dim = 0; dim < ndims; dim++)
 		{
+			uint16		index = 0;
 			Datum	   *value;
 
 			/* do the lookup only for non-NULL values */
-			if (mcvitem->isnull[dim])
-				continue;
+			if (!mcvitem->isnull[dim])
+			{
+				value = (Datum *) bsearch_arg(&mcvitem->values[dim], values[dim],
+											  info[dim].nvalues, sizeof(Datum),
+											  compare_scalars_simple, &ssup[dim]);
 
-			value = (Datum *) bsearch_arg(&mcvitem->values[dim], values[dim],
-										  info[dim].nvalues, sizeof(Datum),
-										  compare_scalars_simple, &ssup[dim]);
+				Assert(value != NULL);	/* serialization or deduplication error */
 
-			Assert(value != NULL);	/* serialization or deduplication error */
+				/* compute index within the deduplicated array */
+				index = (uint16) (value - values[dim]);
 
-			/* compute index within the array */
-			ITEM_INDEXES(item)[dim] = (uint16) (value - values[dim]);
+				/* check the index is within expected bounds */
+				Assert(index < info[dim].nvalues);
+			}
 
-			/* check the index is within expected bounds */
-			Assert(ITEM_INDEXES(item)[dim] < info[dim].nvalues);
+			/* copy the index into the serialized MCV */
+			memcpy(ptr, &index, sizeof(uint16));
+			ptr += sizeof(uint16);
 		}
 
-		/* copy NULL and frequency flags into the item */
-		memcpy(ITEM_NULLS(item, ndims), mcvitem->isnull, sizeof(bool) * ndims);
-		memcpy(ITEM_FREQUENCY(item, ndims), &mcvitem->frequency, sizeof(double));
-		memcpy(ITEM_BASE_FREQUENCY(item, ndims), &mcvitem->base_frequency, sizeof(double));
-
-		/* copy the serialized item into the array */
-		memcpy(ptr, item, itemsize);
-
-		ptr += itemsize;
+		/* make sure we don't overflow the allocated value */
+		Assert(ptr <= endptr);
 	}
 
 	/* at this point we expect to match the total_length exactly */
-	Assert((ptr - raw) == total_length);
+	Assert(ptr == endptr);
 
-	pfree(item);
 	pfree(values);
 	pfree(counts);
 
-	return (bytea *) raw;
+	return raw;
 }
 
 /*
@@ -816,6 +865,7 @@ statext_mcv_deserialize(bytea *data)
 	MCVList    *mcvlist;
 	char	   *raw;
 	char	   *ptr;
+	char	   *endptr PG_USED_FOR_ASSERTS_ONLY;
 
 	int			ndims,
 				nitems;
@@ -849,8 +899,9 @@ statext_mcv_deserialize(bytea *data)
 	mcvlist = (MCVList *) palloc0(offsetof(MCVList, items));
 
 	/* pointer to the data part (skip the varlena header) */
-	ptr = VARDATA_ANY(data);
 	raw = (char *) data;
+	ptr = VARDATA_ANY(raw);
+	endptr = (char *) raw + VARSIZE_ANY(data);
 
 	/* get the header and perform further sanity checks */
 	memcpy(&mcvlist->magic, ptr, sizeof(uint32));
@@ -909,12 +960,11 @@ statext_mcv_deserialize(bytea *data)
 	memcpy(mcvlist->types, ptr, sizeof(Oid) * ndims);
 	ptr += (sizeof(Oid) * ndims);
 
-	/* ensure alignment of the pointer (after the header fields) */
-	ptr = raw + MAXALIGN(ptr - raw);
-
 	/* Now it's safe to access the dimension info. */
-	info = (DimensionInfo *) ptr;
-	ptr += MAXALIGN(ndims * sizeof(DimensionInfo));
+	info = palloc(ndims * sizeof(DimensionInfo));
+
+	memcpy(info, ptr, ndims * sizeof(DimensionInfo));
+	ptr += (ndims * sizeof(DimensionInfo));
 
 	/* account for the value arrays */
 	for (dim = 0; dim < ndims; dim++)
@@ -926,7 +976,7 @@ statext_mcv_deserialize(bytea *data)
 		Assert(info[dim].nvalues >= 0);
 		Assert(info[dim].nbytes >= 0);
 
-		expected_size += MAXALIGN(info[dim].nbytes);
+		expected_size += info[dim].nbytes;
 	}
 
 	/*
@@ -955,15 +1005,18 @@ statext_mcv_deserialize(bytea *data)
 		map[dim] = (Datum *) palloc(sizeof(Datum) * info[dim].nvalues);
 
 		/* space needed for a copy of data for by-ref types */
-		if (!info[dim].typbyval)
-			datalen += MAXALIGN(info[dim].nbytes);
+		datalen += info[dim].nbytes_aligned;
 	}
 
 	/*
-	 * Now resize the MCV list so that the allocation includes all the data
+	 * Now resize the MCV list so that the allocation includes all the data.
+	 *
 	 * Allocate space for a copy of the data, as we can't simply reference the
-	 * original data - it may disappear while we're still using the MCV list,
-	 * e.g. due to catcache release. Only needed for by-ref types.
+	 * serialized data - it's not aligned properly, and it may disappear while
+	 * we're still using the MCV list, e.g. due to catcache release.
+	 *
+	 * We do care about alignment here, because we will allocate all the pieces
+	 * at once, but then use pointers to different parts.
 	 */
 	mcvlen = MAXALIGN(offsetof(MCVList, items) + (sizeof(MCVItem) * nitems));
 
@@ -1024,7 +1077,7 @@ statext_mcv_deserialize(bytea *data)
 
 					/* just point into the array */
 					map[dim][i] = PointerGetDatum(dataptr);
-					dataptr += info[dim].typlen;
+					dataptr += MAXALIGN(info[dim].typlen);
 				}
 			}
 			else if (info[dim].typlen == -1)
@@ -1032,13 +1085,21 @@ statext_mcv_deserialize(bytea *data)
 				/* varlena */
 				for (i = 0; i < info[dim].nvalues; i++)
 				{
-					Size		len = VARSIZE_ANY(ptr);
+					uint32		len;
 
-					memcpy(dataptr, ptr, len);
-					ptr += MAXALIGN(len);
+					/* read the uint32 length */
+					memcpy(&len, ptr, sizeof(uint32));
+					ptr += sizeof(uint32);
+
+					/* the length is data-only */
+					SET_VARSIZE(dataptr, len + VARHDRSZ);
+					memcpy(VARDATA(dataptr), ptr, len);
+					ptr += len;
 
 					/* just point into the array */
 					map[dim][i] = PointerGetDatum(dataptr);
+
+					/* skip to place of the next deserialized value */
 					dataptr += MAXALIGN(len);
 				}
 			}
@@ -1047,10 +1108,13 @@ statext_mcv_deserialize(bytea *data)
 				/* cstring */
 				for (i = 0; i < info[dim].nvalues; i++)
 				{
-					Size		len = (strlen(ptr) + 1);	/* don't forget the \0 */
+					uint32		len;
+
+					memcpy(&len, ptr, sizeof(uint32));
+					ptr += sizeof(uint32);
 
 					memcpy(dataptr, ptr, len);
-					ptr += MAXALIGN(len);
+					ptr += len;
 
 					/* just point into the array */
 					map[dim][i] = PointerGetDatum(dataptr);
@@ -1067,9 +1131,6 @@ statext_mcv_deserialize(bytea *data)
 
 		/* check we consumed input data for this dimension exactly */
 		Assert(ptr == (start + info[dim].nbytes));
-
-		/* ensure proper alignment of the data */
-		ptr = raw + MAXALIGN(ptr - raw);
 	}
 
 	/* we should have also filled the MCV list exactly */
@@ -1078,7 +1139,6 @@ statext_mcv_deserialize(bytea *data)
 	/* deserialize the MCV items and translate the indexes to Datums */
 	for (i = 0; i < nitems; i++)
 	{
-		uint16	   *indexes = NULL;
 		MCVItem    *item = &mcvlist->items[i];
 
 		item->values = (Datum *) valuesptr;
@@ -1087,27 +1147,35 @@ statext_mcv_deserialize(bytea *data)
 		item->isnull = (bool *) isnullptr;
 		isnullptr += MAXALIGN(sizeof(bool) * ndims);
 
+		memcpy(item->isnull, ptr, sizeof(bool) * ndims);
+		ptr += sizeof(bool) * ndims;
 
-		/* just point to the right place */
-		indexes = ITEM_INDEXES(ptr);
+		memcpy(&item->frequency, ptr, sizeof(double));
+		ptr += sizeof(double);
 
-		memcpy(item->isnull, ITEM_NULLS(ptr, ndims), sizeof(bool) * ndims);
-		memcpy(&item->frequency, ITEM_FREQUENCY(ptr, ndims), sizeof(double));
-		memcpy(&item->base_frequency, ITEM_BASE_FREQUENCY(ptr, ndims), sizeof(double));
+		memcpy(&item->base_frequency, ptr, sizeof(double));
+		ptr += sizeof(double);
 
-		/* translate the values */
+		/* finally translate the indexes (for non-NULL only) */
 		for (dim = 0; dim < ndims; dim++)
-			if (!item->isnull[dim])
-				item->values[dim] = map[dim][indexes[dim]];
+		{
+			uint16	index;
+
+			memcpy(&index, ptr, sizeof(uint16));
+			ptr += sizeof(uint16);
 
-		ptr += ITEM_SIZE(ndims);
+			if (item->isnull[dim])
+				continue;
+
+			item->values[dim] = map[dim][index];
+		}
 
 		/* check we're not overflowing the input */
-		Assert(ptr <= (char *) raw + VARSIZE_ANY(data));
+		Assert(ptr <= endptr);
 	}
 
 	/* check that we processed all the data */
-	Assert(ptr == raw + VARSIZE_ANY(data));
+	Assert(ptr == endptr);
 
 	/* release the buffers used for mapping */
 	for (dim = 0; dim < ndims; dim++)
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index fb03c52f50..6778746db1 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -36,6 +36,7 @@ typedef struct DimensionInfo
 {
 	int			nvalues;		/* number of deduplicated values */
 	int			nbytes;			/* number of bytes (serialized) */
+	int			nbytes_aligned;	/* deserialized data with alignment */
 	int			typlen;			/* pg_type.typlen */
 	bool		typbyval;		/* pg_type.typbyval */
 } DimensionInfo;
#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#11)
1 attachment(s)
Re: mcvstats serialization code is still shy of a load

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

Attached is a slightly improved version of the serialization patch.

I reviewed this patch, and tested it on hppa and ppc. I found one
serious bug: in the deserialization varlena case, you need

-					dataptr += MAXALIGN(len);
+					dataptr += MAXALIGN(len + VARHDRSZ);

(approx. line 1100 in mcv.c). Without this, the output data is corrupt,
plus the Assert a few lines further down about dataptr having been
advanced by the correct amount fires. (On one machine I tested on,
that happened during the core regression tests. The other machine
got through regression, but trying to do "select * from pg_stats_ext;"
afterwards exhibited the crash. I didn't investigate closely, but
I suspect the difference has to do with different MAXALIGN values,
4 and 8 respectively.)

The attached patch (a delta atop your v2) corrects that plus some
cosmetic issues.

If we're going to push this, it would be considerably less complicated
to do so before v12 gets branched --- not long after that, there will be
catversion differences to cope with. I'm planning to make the branch
tomorrow (Monday), probably ~1500 UTC. Just sayin'.

regards, tom lane

Attachments:

mcv-serialization-delta.patchtext/x-diff; charset=us-ascii; name=mcv-serialization-delta.patchDownload
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 256728a..cfe7e54 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -40,13 +40,14 @@
  * stored in a separate array (deduplicated, to minimize the size), and
  * so the serialized items only store uint16 indexes into that array.
  *
- * Each serialized item store (in this order):
+ * Each serialized item stores (in this order):
  *
  * - indexes to values	  (ndim * sizeof(uint16))
  * - null flags			  (ndim * sizeof(bool))
  * - frequency			  (sizeof(double))
  * - base_frequency		  (sizeof(double))
  *
+ * There is no alignment padding within an MCV item.
  * So in total each MCV item requires this many bytes:
  *
  *	 ndim * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double)
@@ -61,7 +62,7 @@
 	(VARHDRSZ + sizeof(uint32) * 3 + sizeof(AttrNumber))
 
 /*
- * Size of the serialized MCV list, exluding the space needed for
+ * Size of the serialized MCV list, excluding the space needed for
  * deduplicated per-dimension values. The macro is meant to be used
  * when it's not yet safe to access the serialized info about amount
  * of data for each column.
@@ -619,6 +620,7 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 		else if (info[dim].typlen == -1)	/* varlena */
 		{
 			info[dim].nbytes = 0;
+			info[dim].nbytes_aligned = 0;
 			for (i = 0; i < info[dim].nvalues; i++)
 			{
 				Size		len;
@@ -646,6 +648,7 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 		else if (info[dim].typlen == -2)	/* cstring */
 		{
 			info[dim].nbytes = 0;
+			info[dim].nbytes_aligned = 0;
 			for (i = 0; i < info[dim].nvalues; i++)
 			{
 				Size		len;
@@ -743,7 +746,7 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 				 * assumes little endian behavior.  store_att_byval does
 				 * almost what we need, but it requires properly aligned
 				 * buffer - the output buffer does not guarantee that. So we
-				 * simply use a static Datum variable (which guarantees proper
+				 * simply use a local Datum variable (which guarantees proper
 				 * alignment), and then copy the value from it.
 				 */
 				store_att_byval(&tmp, value, info[dim].typlen);
@@ -759,14 +762,14 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats)
 			}
 			else if (info[dim].typlen == -1)	/* varlena */
 			{
-				uint32		len = VARSIZE_ANY_EXHDR(value);
+				uint32		len = VARSIZE_ANY_EXHDR(DatumGetPointer(value));
 
 				/* copy the length */
 				memcpy(ptr, &len, sizeof(uint32));
 				ptr += sizeof(uint32);
 
 				/* data from the varlena value (without the header) */
-				memcpy(ptr, VARDATA(DatumGetPointer(value)), len);
+				memcpy(ptr, VARDATA_ANY(DatumGetPointer(value)), len);
 				ptr += len;
 			}
 			else if (info[dim].typlen == -2)	/* cstring */
@@ -1100,7 +1103,7 @@ statext_mcv_deserialize(bytea *data)
 					map[dim][i] = PointerGetDatum(dataptr);
 
 					/* skip to place of the next deserialized value */
-					dataptr += MAXALIGN(len);
+					dataptr += MAXALIGN(len + VARHDRSZ);
 				}
 			}
 			else if (info[dim].typlen == -2)
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 6778746..8fc5419 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -36,7 +36,7 @@ typedef struct DimensionInfo
 {
 	int			nvalues;		/* number of deduplicated values */
 	int			nbytes;			/* number of bytes (serialized) */
-	int			nbytes_aligned;	/* deserialized data with alignment */
+	int			nbytes_aligned;	/* size of deserialized data with alignment */
 	int			typlen;			/* pg_type.typlen */
 	bool		typbyval;		/* pg_type.typbyval */
 } DimensionInfo;
#13Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#12)
Re: mcvstats serialization code is still shy of a load

On Sun, Jun 30, 2019 at 08:30:33PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

Attached is a slightly improved version of the serialization patch.

I reviewed this patch, and tested it on hppa and ppc. I found one
serious bug: in the deserialization varlena case, you need

-					dataptr += MAXALIGN(len);
+					dataptr += MAXALIGN(len + VARHDRSZ);

(approx. line 1100 in mcv.c). Without this, the output data is corrupt,
plus the Assert a few lines further down about dataptr having been
advanced by the correct amount fires. (On one machine I tested on,
that happened during the core regression tests. The other machine
got through regression, but trying to do "select * from pg_stats_ext;"
afterwards exhibited the crash. I didn't investigate closely, but
I suspect the difference has to do with different MAXALIGN values,
4 and 8 respectively.)

The attached patch (a delta atop your v2) corrects that plus some
cosmetic issues.

Thanks.

If we're going to push this, it would be considerably less complicated
to do so before v12 gets branched --- not long after that, there will be
catversion differences to cope with. I'm planning to make the branch
tomorrow (Monday), probably ~1500 UTC. Just sayin'.

Unfortunately, I was travelling on Sunday and was quite busy on Monday, so
I've been unable to push this before the branching :-(

I'll push by the end of this week, once I get home.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#13)
Re: mcvstats serialization code is still shy of a load

On Tue, Jul 02, 2019 at 10:38:29AM +0200, Tomas Vondra wrote:

On Sun, Jun 30, 2019 at 08:30:33PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

Attached is a slightly improved version of the serialization patch.

I reviewed this patch, and tested it on hppa and ppc. I found one
serious bug: in the deserialization varlena case, you need

-					dataptr += MAXALIGN(len);
+					dataptr += MAXALIGN(len + VARHDRSZ);

(approx. line 1100 in mcv.c). Without this, the output data is corrupt,
plus the Assert a few lines further down about dataptr having been
advanced by the correct amount fires. (On one machine I tested on,
that happened during the core regression tests. The other machine
got through regression, but trying to do "select * from pg_stats_ext;"
afterwards exhibited the crash. I didn't investigate closely, but
I suspect the difference has to do with different MAXALIGN values,
4 and 8 respectively.)

The attached patch (a delta atop your v2) corrects that plus some
cosmetic issues.

Thanks.

If we're going to push this, it would be considerably less complicated
to do so before v12 gets branched --- not long after that, there will be
catversion differences to cope with. I'm planning to make the branch
tomorrow (Monday), probably ~1500 UTC. Just sayin'.

Unfortunately, I was travelling on Sunday and was quite busy on Monday, so
I've been unable to push this before the branching :-(

I'll push by the end of this week, once I get home.

I've pushed the fix (along with the pg_mcv_list_item fix) into master,
hopefully the buildfarm won't be upset about it.

I was about to push into REL_12_STABLE, when I realized that maybe we
need to do something about the catversion first. REL_12_STABLE is still
on 201906161, while master got to 201907041 thanks to commit
7b925e12703. Simply cherry-picking the commits would get us to
201907052 in both branches, but that'd be wrong as the catalogs do
differ. I suppose this is what you meant by "catversion differences to
cope with".

I suppose this is not the first time this happened - how did we deal
with it in the past? I guess we could use some "past" non-conflicting
catversion number in the REL_12_STABLE branch (say, 201907030x) but
maybe that'd be wrong?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#14)
Re: mcvstats serialization code is still shy of a load

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I was about to push into REL_12_STABLE, when I realized that maybe we
need to do something about the catversion first. REL_12_STABLE is still
on 201906161, while master got to 201907041 thanks to commit
7b925e12703. Simply cherry-picking the commits would get us to
201907052 in both branches, but that'd be wrong as the catalogs do
differ. I suppose this is what you meant by "catversion differences to
cope with".

Yeah, exactly.

My recommendation is to use 201907051 on v12 and 201907052
on master (or whatever is $today for you). They need to be
different now that the branches' catalog histories have diverged,
and it seems to me that the back branch should be "older".

regards, tom lane

#16Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#15)
Re: mcvstats serialization code is still shy of a load

On Fri, Jul 5, 2019, 03:28 Tom Lane <tgl@sss.pgh.pa.us> wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I was about to push into REL_12_STABLE, when I realized that maybe we
need to do something about the catversion first. REL_12_STABLE is still
on 201906161, while master got to 201907041 thanks to commit
7b925e12703. Simply cherry-picking the commits would get us to
201907052 in both branches, but that'd be wrong as the catalogs do
differ. I suppose this is what you meant by "catversion differences to
cope with".

Yeah, exactly.

My recommendation is to use 201907051 on v12 and 201907052
on master (or whatever is $today for you). They need to be
different now that the branches' catalog histories have diverged,
and it seems to me that the back branch should be "older".

regards, tom lane

Unfortunately, master is already using both 201907051 and 201907052 (two of
the patches I pushed touched the catalog), so we can't quite do exactly
that. We need to use 201907042 and 201907043 or something preceding 201907041
(which is the extra catversion on master).

At this point there's no perfect sequence, thanks to the extra commit on
master, so REL_12_STABLE can't be exactly "older" :-(

Barring objections, I'll go ahead with 201907042+201907043 later today,
before someone pushes another catversion-bumping patch.

regards

Show quoted text
#17Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#16)
Re: mcvstats serialization code is still shy of a load

On Fri, Jul 05, 2019 at 10:36:59AM +0200, Tomas Vondra wrote:

On Fri, Jul 5, 2019, 03:28 Tom Lane <tgl@sss.pgh.pa.us> wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I was about to push into REL_12_STABLE, when I realized that maybe we
need to do something about the catversion first. REL_12_STABLE is still
on 201906161, while master got to 201907041 thanks to commit
7b925e12703. Simply cherry-picking the commits would get us to
201907052 in both branches, but that'd be wrong as the catalogs do
differ. I suppose this is what you meant by "catversion differences to
cope with".

Yeah, exactly.

My recommendation is to use 201907051 on v12 and 201907052
on master (or whatever is $today for you). They need to be
different now that the branches' catalog histories have diverged,
and it seems to me that the back branch should be "older".

regards, tom lane

Unfortunately, master is already using both 201907051 and 201907052 (two of
the patches I pushed touched the catalog), so we can't quite do exactly
that. We need to use 201907042 and 201907043 or something preceding 201907041
(which is the extra catversion on master).

At this point there's no perfect sequence, thanks to the extra commit on
master, so REL_12_STABLE can't be exactly "older" :-(

Barring objections, I'll go ahead with 201907042+201907043 later today,
before someone pushes another catversion-bumping patch.

I've pushed the REL_12_STABLE backpatches too, now. I've ended up using
201907031 and 201907032 - those values precede the first catversion bump
in master (201907041), so the back branch looks "older". And there's a
bit of slack for additional bumps (if the unlikely case we need them).

We might have "fixed" this by backpatching the commit with the extra
catversion bump (7b925e12) but the commit seems a bit too large for
that. It's fairly isolated though. But it seems like a bad practice.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#17)
Re: mcvstats serialization code is still shy of a load

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I've pushed the REL_12_STABLE backpatches too, now. I've ended up using
201907031 and 201907032 - those values precede the first catversion bump
in master (201907041), so the back branch looks "older". And there's a
bit of slack for additional bumps (if the unlikely case we need them).

FWIW, I don't think there's a need for every catversion on the back branch
to look older than any catversion on HEAD. The requirement so far as the
core code is concerned is only for non-equality. Now, extension code does
often do something like "if catversion >= xxx", but in practice they're
only concerned about numbers used by released versions. HEAD's catversion
will be strictly greater than v12's soon enough, even if you had made it
not so today. So I think sticking to today's-date-with-some-N is better
than artificially assigning other dates.

What's done is done, and there's no need to change it, but now you
know what to do next time.

We might have "fixed" this by backpatching the commit with the extra
catversion bump (7b925e12) but the commit seems a bit too large for
that. It's fairly isolated though. But it seems like a bad practice.

Yeah, that approach flies in the face of the notion of feature freeze.

regards, tom lane

#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#18)
Re: mcvstats serialization code is still shy of a load

On 2019-Jul-05, Tom Lane wrote:

FWIW, I don't think there's a need for every catversion on the back branch
to look older than any catversion on HEAD. The requirement so far as the
core code is concerned is only for non-equality. Now, extension code does
often do something like "if catversion >= xxx", but in practice they're
only concerned about numbers used by released versions.

pg_upgrade also uses >= catversion comparison for a couple of things. I
don't think it affects this case, but it's worth keeping in mind.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services