Coccinelle for PostgreSQL development [4/N]: correcting palloc() use

Started by Mats Kindahlabout 1 year ago3 messages
#1Mats Kindahl
mats@timescale.com
1 attachment(s)

This is the first example semantic patch and shows how to capture and fix a
common problem.

If you use an palloc() to allocate memory for an object (or an array of
objects) and by mistake type something like:

StringInfoData *info = palloc(sizeof(StringInfoData*));

You will not allocate enough memory for storing the object. This semantic
patch catches any cases where you are either allocating an array of objects
or a single object that do not have corret types in this sense, more
precisely, it captures assignments to a variable of type T* where palloc()
uses sizeof(T) either alone or with a single expression (assuming this is
an array count).

The semantic patch is overzealous in the sense that allocation to a "const
char **" expects a "sizeof(const char *)" and it cannot deal with typedefs
that introduce aliases (these two can be seen in the patch). Although the
sizes of these are the same, and Coccinelle do not have a good system for
comparing types, it might be better to just follow the convention of always
using the type "T*" for any "palloc(sizeof(T))" since it makes automated
checking easier and is a small inconvenience; especially considering that
coccicheck can easily fix this for you. It also simplifies other automated
checking to follow this convention.

We don't really have any real bugs as a result from this, but we have one
case where an allocation of "sizeof(LLVMBasicBlockRef*)" is allocated to an
"LLVMBasicBlockRef*", which strictly speaking is not correct (it should be
"sizeof(LLVMBasicBlockRef)"). However, since they are both pointers, there
is no risk of incorrect allocation size.
--
Best wishes,
Mats Kindahl, Timescale

Attachments:

0004-Add-semantic-patch-for-sizeof-using-palloc.v1.patchtext/x-patch; charset=US-ASCII; name=0004-Add-semantic-patch-for-sizeof-using-palloc.v1.patchDownload
From eedf99b78a0e63b4de6db14617091f18adbd0d83 Mon Sep 17 00:00:00 2001
From: Mats Kindahl <mats@kindahl.net>
Date: Sun, 5 Jan 2025 19:26:47 +0100
Subject: Add semantic patch for sizeof() using palloc()

If palloc() is used to allocate elements of type T it should be assigned to a
variable of type T* or risk indexes out of bounds. This semantic patch checks
that allocations to variables of type T* are using sizeof(T) when allocating
memory using palloc().
---
 cocci/palloc_sizeof.cocci            | 40 ++++++++++++++++++++++++++++
 contrib/btree_gist/btree_utils_var.c |  2 +-
 src/backend/jit/llvm/llvmjit_expr.c  |  5 ++--
 src/backend/utils/adt/tsquery.c      |  2 +-
 4 files changed, 44 insertions(+), 5 deletions(-)
 create mode 100644 cocci/palloc_sizeof.cocci

diff --git a/cocci/palloc_sizeof.cocci b/cocci/palloc_sizeof.cocci
new file mode 100644
index 00000000000..7e79139fed4
--- /dev/null
+++ b/cocci/palloc_sizeof.cocci
@@ -0,0 +1,40 @@
+virtual report
+virtual context
+virtual patch
+
+@r1 depends on report || context@
+type T1 != {void};
+idexpression T1 *I;
+type T2 != T1;
+position p;
+expression E;
+identifier func = {palloc, palloc0};
+@@
+(
+* I = func@p(sizeof(T2))
+|
+* I = func@p(E * sizeof(T2))
+)
+
+@script:python depends on report@
+T1 << r1.T1;
+T2 << r1.T2;
+I << r1.I;
+p << r1.p;
+@@
+coccilib.report.print_report(p[0], f"'{I}' has type '{T1}*' but 'sizeof({T2})' is used to allocate memory")
+
+@depends on patch@
+type T1 != {void};
+idexpression T1 *I;
+type T2 != T1;
+expression E;
+identifier func = {palloc, palloc0};
+@@
+(
+- I = func(sizeof(T2))
++ I = func(sizeof(T1))
+|
+- I = func(E * sizeof(T2))
++ I = func(E * sizeof(T1))
+)
diff --git a/contrib/btree_gist/btree_utils_var.c b/contrib/btree_gist/btree_utils_var.c
index d9df2356cd1..36937795e90 100644
--- a/contrib/btree_gist/btree_utils_var.c
+++ b/contrib/btree_gist/btree_utils_var.c
@@ -475,7 +475,7 @@ gbt_var_picksplit(const GistEntryVector *entryvec, GIST_SPLITVEC *v,
 	v->spl_nleft = 0;
 	v->spl_nright = 0;
 
-	sv = palloc(sizeof(bytea *) * (maxoff + 1));
+	sv = palloc((maxoff + 1) * sizeof(GBT_VARKEY *));
 
 	/* Sort entries */
 
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index c533f552540..c582fd89c67 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -618,8 +618,7 @@ llvm_compile_expr(ExprState *state)
 						LLVMBuildStore(b, l_sbool_const(1), v_resnullp);
 
 						/* create blocks for checking args, one for each */
-						b_checkargnulls =
-							palloc(sizeof(LLVMBasicBlockRef *) * op->d.func.nargs);
+						b_checkargnulls = palloc(op->d.func.nargs * sizeof(LLVMBasicBlockRef));
 						for (int argno = 0; argno < op->d.func.nargs; argno++)
 							b_checkargnulls[argno] =
 								l_bb_before_v(b_nonull, "b.%d.isnull.%d", opno,
@@ -2409,7 +2408,7 @@ llvm_compile_expr(ExprState *state)
 					v_nullsp = l_ptr_const(nulls, l_ptr(TypeStorageBool));
 
 					/* create blocks for checking args */
-					b_checknulls = palloc(sizeof(LLVMBasicBlockRef *) * nargs);
+					b_checknulls = palloc(nargs * sizeof(LLVMBasicBlockRef));
 					for (int argno = 0; argno < nargs; argno++)
 					{
 						b_checknulls[argno] =
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 0366c2a2acd..aa7a296149c 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -1241,7 +1241,7 @@ tsqueryrecv(PG_FUNCTION_ARGS)
 		elog(ERROR, "invalid size of tsquery");
 
 	/* Allocate space to temporarily hold operand strings */
-	operands = palloc(size * sizeof(char *));
+	operands = palloc(size * sizeof(const char *));
 
 	/* Allocate space for all the QueryItems. */
 	len = HDRSIZETQ + sizeof(QueryItem) * size;
-- 
2.43.0

#2Peter Eisentraut
peter@eisentraut.org
In reply to: Mats Kindahl (#1)
Re: Coccinelle for PostgreSQL development [4/N]: correcting palloc() use

On 07.01.25 20:49, Mats Kindahl wrote:

This is the first example semantic patch and shows how to capture and
fix a common problem.

If you use an palloc() to allocate memory for an object (or an array of
objects) and by mistake type something like:

    StringInfoData *info = palloc(sizeof(StringInfoData*));

You will not allocate enough memory for storing the object. This
semantic patch catches any cases where you are either allocating an
array of objects or a single object that do not have corret types in
this sense, more precisely, it captures assignments to a variable of
type T* where palloc() uses sizeof(T) either alone or with a single
expression (assuming this is an array count).

The semantic patch is overzealous in the sense that allocation to a
"const char **" expects a "sizeof(const char *)" and it cannot deal with
typedefs that introduce aliases (these two can be seen in the patch).
Although the sizes of these are the same, and Coccinelle do not have a
good system for comparing types, it might be better to just follow the
convention of always using the type "T*" for any "palloc(sizeof(T))"
since it makes automated checking easier and is a small inconvenience;
especially considering that coccicheck can easily fix this for you. It
also simplifies other automated checking to follow this convention.

I think this kind of thing is better addressed with a static analyzer or
some heavier compiler annotations and warnings or something like that.
That kind of tool would have the right level of information to decide
whether some code is correct. Having a text matching tool do it is just
never going to be complete, and the problems you mention show that this
tool fundamentally doesn't understand the code, and just knows how to
parse the code a little better than grep or sed.

I do like the idea of exploring this tool for facilitating code
rewriting or semantic patching. But I'm suspicious about using it for
enforcing coding styles or patterns.

#3Mats Kindahl
mats@timescale.com
In reply to: Peter Eisentraut (#2)
Re: Coccinelle for PostgreSQL development [4/N]: correcting palloc() use

On Wed, Jan 8, 2025 at 10:02 PM Peter Eisentraut <peter@eisentraut.org>
wrote:

On 07.01.25 20:49, Mats Kindahl wrote:

This is the first example semantic patch and shows how to capture and
fix a common problem.

If you use an palloc() to allocate memory for an object (or an array of
objects) and by mistake type something like:

StringInfoData *info = palloc(sizeof(StringInfoData*));

You will not allocate enough memory for storing the object. This
semantic patch catches any cases where you are either allocating an
array of objects or a single object that do not have corret types in
this sense, more precisely, it captures assignments to a variable of
type T* where palloc() uses sizeof(T) either alone or with a single
expression (assuming this is an array count).

The semantic patch is overzealous in the sense that allocation to a
"const char **" expects a "sizeof(const char *)" and it cannot deal with
typedefs that introduce aliases (these two can be seen in the patch).
Although the sizes of these are the same, and Coccinelle do not have a
good system for comparing types, it might be better to just follow the
convention of always using the type "T*" for any "palloc(sizeof(T))"
since it makes automated checking easier and is a small inconvenience;
especially considering that coccicheck can easily fix this for you. It
also simplifies other automated checking to follow this convention.

I think this kind of thing is better addressed with a static analyzer or
some heavier compiler annotations and warnings or something like that.
That kind of tool would have the right level of information to decide
whether some code is correct. Having a text matching tool do it is just
never going to be complete, and the problems you mention show that this
tool fundamentally doesn't understand the code, and just knows how to
parse the code a little better than grep or sed.

Thanks for the comments Peter.

On one hand, I agree with you. Coccinelle understands the *structure* of
the code (e.g., what is a type and what is an identifier, where functions
start and end, etc.), but not the *semantics* of the code (e.g., if two
types are different).

On the other hand, however, development practice is also about following
common coding patterns to avoid problems and improve readability. In this
sense, having strange situations in the code that just happen to work makes
it more difficult for developers as well as for automation tools. For
example, allocating memory for elements of sizeof(T**) and assigning it to
a T** variable will not be caught by something like ASAN (because the sizes
match), and cannot trigger a fault (again, because the sizes match), but
correcting this would still make the life easier for both developers and
static analysis tools.

That said, if your only concern is the "const int *" vs. "int *" thing (but
not the T** vs T* thing), it is possible to deal with this by either adding
Python code to the Coccinelle script to just remove all "const" since they
do not affect the size or detect that one of the types has "const" in it
and just ignore this situation and not suggest to patch it.

I do like the idea of exploring this tool for facilitating code
rewriting or semantic patching. But I'm suspicious about using it for
enforcing coding styles or patterns.

The case we're discussing here is one such case where I think that
enforcing a coding style might not be a good idea, but I think it was
valuable to raise the issue and get opinions. (This is why I mentioned the
problem with the patch being overzealous.)

The real value here is not specific patches though. It is having a tool
like Coccinelle for detecting and correcting problematic code as part of
the development practice since I think it would speed up development and
review and also improve the quality of the code.
--
Best wishes,
Mats Kindahl, Timescale