PoC: VALGRIND_MAKE_MEM_NOACCESS for dynamic shared memory

Started by Tomas Vondra · about 7 hours ago · 3 messages · hackers
#1Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

A couple of days ago we got a report regarding an incorrect shmem size
calculation in btree [1], leading to a buffer overflow / memory
corruption. Which became a much less common issue in our code, thanks to
valgrind and similar tools. But it took me a while to realize valgrind
won't catch this because we only instrument private memory (palloc et
al), while shmem is left alone.

I was wondering if it's feasible to improve this. Attached is a trivial
patch that adjusts shm_toc.c to add a couple NOACCESS bytes after each
entry in the segment. It seems to do the trick - with this we get a
reasonable report (for the reproducer provided in the bug report, before
it got fixed by 748d871b7c) from valgrind, with invalid accesses. See
the attached .log for an example. It's much better than the confusing
crashes due to corrupted state.

There's an issue, though. It seems the valgrind memory markings are not
shared between processes. The leader sets the shm_toc up, marks the
ranges as NOACCESS, and then checks it while accessing the memory. But
the parallel workers don't seem to see this, and so will produce no
reports. I'm assuming this is the case, because all the reports come
from the leader, never from the workers. Maybe there's a different
explanation (e.g. maybe it's just the leader touching the memory?).

Still, it seems useful even with this limitation. If the leader
participates, it will produce a nice report (unless the worker crashes
first, because of the corrupted state). But we also have

debug_parallel_query = regress

in which case everything should run in a single process, and work just
fine. Maybe that's good enough.

An alternative would be to do mprotect(), but unfortunately it requires
page-aligned ranges, which makes it somewhat useless for small overflows
of a couple bytes (like here). It should work cross-process, I think.

FWIW the PoC patch adds a 32-byte chunk, not just a single byte. This is
intentional, because if the state is an array, it's quite possible the
invalid access steps over a fair number of bytes. I'm actually thinking
it should be even larger.

This modifies just the dynamic shmem, but maybe we should do this even
for the "regular" shmem allocated at start. Issues in that would likely
cause crashes pretty quick (unlike the btree issue, which requires a
somewhat special reproducer), but a nice valgrind report helps.

regards

[1]: /messages/by-id/CAGCUe0Lwk3C0qdkBa+OLpYc7yXwW=pbaz8Sju4xMXEQAmyp+5g@mail.gmail.com

--
Tomas Vondra

Attachments:

0001-valgrind-add-NOACCESS-sentinels-for-dynamic-shmem.patch (text/x-patch, +22/-2)
valgrind.log (text/x-log)
#2Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#1)
Re: PoC: VALGRIND_MAKE_MEM_NOACCESS for dynamic shared memory

Hi,

On 2026-05-04 10:18:31 +0200, Tomas Vondra wrote:

A couple of days ago we got a report regarding an incorrect shmem size
calculation in btree [1], leading to a buffer overflow / memory
corruption. Which became a much less common issue in our code, thanks to
valgrind and similar tools. But it took me a while to realize valgrind
won't catch this because we only instrument private memory (palloc et
al), while shmem is left alone.

I was wondering if it's feasible to improve this. Attached is a trivial
patch that adjusts shm_toc.c to add a couple NOACCESS bytes after each
entry in the segment. It seems to do the trick - with this we get a
reasonable report (for the reproducer provided in the bug report, before
it got fixed by 748d871b7c) from valgrind, with invalid accesses. See
the attached .log for an example. It's much better than the confusing
crashes due to corrupted state.

There's an issue, though. It seems the valgrind memory markings are not
shared between processes. The leader sets the shm_toc up, marks the
ranges as NOACCESS, and then checks it while accessing the memory. But
the parallel workers don't seem to see this, and so will produce no
reports. I'm assuming this is the case, because all the reports come
from the leader, never from the workers. Maybe there's a different
explanation (e.g. maybe it's just the leader touching the memory?).

I assume the issue is just that the workers don't have the NOACCESS markers? I
think you'd need to do them in every process using the shm_toc. Either by
doing it in shm_toc_attach() or in shm_toc_lookup().

An alternative would be to do mprotect(), but unfortunately it requires
page-aligned ranges, which makes it somewhat useless for small overflows
of a couple bytes (like here). It should work cross-process, I think.

It doesn't work across processes. Every process has its own mprotect "view".

FWIW the PoC patch adds a 32-byte chunk, not just a single byte. This is
intentional, because if the state is an array, it's quite possible the
invalid access steps over a fair number of bytes. I'm actually thinking
it should be even larger.

This modifies just the dynamic shmem, but maybe we should do this even
for the "regular" shmem allocated at start. Issues in that would likely
cause crashes pretty quick (unlike the btree issue, which requires a
somewhat special reproducer), but a nice valgrind report helps.

I can tell you from experience that no, it's not necessarily quickly caught.
So yes, I think we should definitely do that.

/*
* Allocate shared memory from a segment managed by a table of contents.
*
@@ -119,9 +127,19 @@ shm_toc_allocate(shm_toc *toc, Size nbytes)
}
vtoc->toc_allocated_bytes += nbytes;

+#ifdef USE_VALGRIND
+	vtoc->toc_allocated_bytes += NUM_NOACCESS_BYTES;
+#endif
+
SpinLockRelease(&toc->toc_mutex);
-	return ((char *) toc) + (total_bytes - allocated_bytes - nbytes);
+#ifdef USE_VALGRIND
+	/* make the bytes at the end no-access */
+	VALGRIND_MAKE_MEM_NOACCESS(((char *) toc) + (total_bytes - allocated_bytes - NUM_NOACCESS_BYTES),
+							   NUM_NOACCESS_BYTES);
+#endif
+
+	return ((char *) toc) + (total_bytes - allocated_bytes - nbytes - NUM_NOACCESS_BYTES);
}

The size is already rounded up by that point:

/*
* Make sure request is well-aligned. XXX: MAXALIGN is not enough,
* because atomic ops might need a wider alignment. We don't have a
* proper definition for the minimum to make atomic ops safe, but
* BUFFERALIGN ought to be enough.
*/
nbytes = BUFFERALIGN(nbytes);

Which means that you'll often have up to a 32-byte pad at the end of the
allocation already (ALIGNOF_BUFFER = 32). I don't care about the waste,
but the ALIGNOF_BUFFER padding will often prevent detecting smaller
out-of-bounds accesses.

Greetings,

Andres Freund

#3Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#2)
Re: PoC: VALGRIND_MAKE_MEM_NOACCESS for dynamic shared memory

On 5/4/26 15:56, Andres Freund wrote:

Hi,

On 2026-05-04 10:18:31 +0200, Tomas Vondra wrote:

A couple of days ago we got a report regarding an incorrect shmem size
calculation in btree [1], leading to a buffer overflow / memory
corruption. Which became a much less common issue in our code, thanks to
valgrind and similar tools. But it took me a while to realize valgrind
won't catch this because we only instrument private memory (palloc et
al), while shmem is left alone.

I was wondering if it's feasible to improve this. Attached is a trivial
patch that adjusts shm_toc.c to add a couple NOACCESS bytes after each
entry in the segment. It seems to do the trick - with this we get a
reasonable report (for the reproducer provided in the bug report, before
it got fixed by 748d871b7c) from valgrind, with invalid accesses. See
the attached .log for an example. It's much better than the confusing
crashes due to corrupted state.

There's an issue, though. It seems the valgrind memory markings are not
shared between processes. The leader sets the shm_toc up, marks the
ranges as NOACCESS, and then checks it while accessing the memory. But
the parallel workers don't seem to see this, and so will produce no
reports. I'm assuming this is the case, because all the reports come
from the leader, never from the workers. Maybe there's a different
explanation (e.g. maybe it's just the leader touching the memory?).

I assume the issue is just that the workers don't have the NOACCESS markers? I
think you'd need to do them in every process using the shm_toc. Either by
doing it in shm_toc_attach() or in shm_toc_lookup().

Right, but it'd also need to know how long the entry is. Which AFAIK it
does not. We don't want to redo the calculation in every worker, but we
might add a "len" field to the toc entry.

An alternative would be to do mprotect(), but unfortunately it requires
page-aligned ranges, which makes it somewhat useless for small overflows
of a couple bytes (like here). It should work cross-process, I think.

It doesn't work across processes. Every process has its own mprotect "view".

Ah, OK. So I guessed wrong, and it has the same issue as NOACCESS.

FWIW the PoC patch adds a 32-byte chunk, not just a single byte. This is
intentional, because if the state is an array, it's quite possible the
invalid access steps over a fair number of bytes. I'm actually thinking
it should be even larger.

This modifies just the dynamic shmem, but maybe we should do this even
for the "regular" shmem allocated at start. Issues in that would likely
cause crashes pretty quick (unlike the btree issue, which requires a
somewhat special reproducer), but a nice valgrind report helps.

I can tell you from experience that no, it's not necessarily quickly caught.
So yes, I think we should definitely do that.

Understood.

/*
* Allocate shared memory from a segment managed by a table of contents.
*
@@ -119,9 +127,19 @@ shm_toc_allocate(shm_toc *toc, Size nbytes)
}
vtoc->toc_allocated_bytes += nbytes;

+#ifdef USE_VALGRIND
+	vtoc->toc_allocated_bytes += NUM_NOACCESS_BYTES;
+#endif
+
SpinLockRelease(&toc->toc_mutex);
-	return ((char *) toc) + (total_bytes - allocated_bytes - nbytes);
+#ifdef USE_VALGRIND
+	/* make the bytes at the end no-access */
+	VALGRIND_MAKE_MEM_NOACCESS(((char *) toc) + (total_bytes - allocated_bytes - NUM_NOACCESS_BYTES),
+							   NUM_NOACCESS_BYTES);
+#endif
+
+	return ((char *) toc) + (total_bytes - allocated_bytes - nbytes - NUM_NOACCESS_BYTES);
}

The size is already rounded up by that point:

/*
* Make sure request is well-aligned. XXX: MAXALIGN is not enough,
* because atomic ops might need a wider alignment. We don't have a
* proper definition for the minimum to make atomic ops safe, but
* BUFFERALIGN ought to be enough.
*/
nbytes = BUFFERALIGN(nbytes);

Which means that you'll often have up to a 32-byte pad at the end of the
allocation already (ALIGNOF_BUFFER = 32). I don't care about the waste,
but the ALIGNOF_BUFFER padding will often prevent detecting smaller
out-of-bounds accesses.

Good point. I forgot about this alignment rounding. I don't care about
the waste either (it'd be a bit silly in a valgrind build anyway), but
not catching smaller mistakes would not be great.

regards

--
Tomas Vondra