Fix locking issue with fixed-size stats template in injection_points

Started by Michael Paquier4 months ago8 messages
#1Michael Paquier
michael@paquier.xyz
1 attachment(s)

Hi all,

While doing some concurrency benchmarking with injection_points stats
enabled in a server, I have been able to trigger an assertion failure
in pgstat_begin_changecount_write():
#4 0x0000564917fdc816 in ExceptionalCondition
(conditionName=0x7f60af0dc5c8 "(*cc & 1) == 0",
fileName=0x7f60af0dc598
"../../../../src/include/utils/pgstat_internal.h",
lineNumber=831) at assert.c:65
#5 0x00007f60af0da3ef in pgstat_begin_changecount_write
(cc=0x7f60acbb8e10) at
../../../../src/include/utils/pgstat_internal.h:831
#6 0x00007f60af0db1d9 in pgstat_report_inj_fixed (numattach=0,
numdetach=0, numrun=1, numcached=0, numloaded=0) at
injection_stats_fixed.c:155
#7 0x00007f60af0d8b5c in injection_points_run (fcinfo=0x564931b66588)
at injection_points.c:429

This can be reproduced as follows. First, postgresql.conf:
shared_preload_libraries = 'injection_points'
injection_points.stats = on

Then something like the following command:
$ cat create_inj.sql
\set id random(1,100000)
select injection_points_attach('popo:id', 'notice');
select injection_points_run('popo:id');
select injection_points_detach('popo:id');
$ pgbench -n -T 300 -f create_inj.sql -c 10

The failure is not surprising, because the stats reports can happen in
a concurrent fashion when a point is run for example, contrary to
other fixed-sized stats kind where the reports are only done by a
single process (archiver, bgwriter, checkpointer). So this is just a
matter of acquiring a lock that was forgotten, to make sure that the
changes are consistent. Far from critical as this is template code,
still embarrassing.

Thoughts or comments?
--
Michael

Attachments:

inj-stats-lock.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/test/modules/injection_points/injection_stats_fixed.c b/src/test/modules/injection_points/injection_stats_fixed.c
index bc54c79d190b..74c35fcbfa71 100644
--- a/src/test/modules/injection_points/injection_stats_fixed.c
+++ b/src/test/modules/injection_points/injection_stats_fixed.c
@@ -152,6 +152,8 @@ pgstat_report_inj_fixed(uint32 numattach,
 
 	stats_shmem = pgstat_get_custom_shmem_data(PGSTAT_KIND_INJECTION_FIXED);
 
+	LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
 	pgstat_begin_changecount_write(&stats_shmem->changecount);
 	stats_shmem->stats.numattach += numattach;
 	stats_shmem->stats.numdetach += numdetach;
@@ -159,6 +161,8 @@ pgstat_report_inj_fixed(uint32 numattach,
 	stats_shmem->stats.numcached += numcached;
 	stats_shmem->stats.numloaded += numloaded;
 	pgstat_end_changecount_write(&stats_shmem->changecount);
+
+	LWLockRelease(&stats_shmem->lock);
 }
 
 /*
#2Chao Li
li.evan.chao@gmail.com
In reply to: Michael Paquier (#1)
Re: Fix locking issue with fixed-size stats template in injection_points

On Sep 29, 2025, at 08:48, Michael Paquier <michael@paquier.xyz> wrote:

The failure is not surprising, because the stats reports can happen in
a concurrent fashion when a point is run for example, contrary to
other fixed-sized stats kind where the reports are only done by a
single process (archiver, bgwriter, checkpointer). So this is just a
matter of acquiring a lock that was forgotten, to make sure that the
changes are consistent. Far from critical as this is template code,
still embarrassing.

Thoughts or comments?

I saw pg_state_begin_changecount_write() is called multiple places, as you mention, for example bgwriter. But there are not the same lock acquired in other places, for example, in bgwriter:

void
pgstat_report_bgwriter(void)
{
PgStatShared_BgWriter *stats_shmem = &pgStatLocal.shmem->bgwriter;

Assert(!pgStatLocal.shmem->is_shutdown);
pgstat_assert_is_up();

/*
* This function can be called even if nothing at all has happened. In
* this case, avoid unnecessarily modifying the stats entry.
*/
if (pg_memory_is_all_zeros(&PendingBgWriterStats,
sizeof(struct PgStat_BgWriterStats)))
return;

pgstat_begin_changecount_write(&stats_shmem->changecount);

#define BGWRITER_ACC(fld) stats_shmem->stats.fld += PendingBgWriterStats.fld
BGWRITER_ACC(buf_written_clean);
BGWRITER_ACC(maxwritten_clean);
BGWRITER_ACC(buf_alloc);
#undef BGWRITER_ACC

pgstat_end_changecount_write(&stats_shmem->changecount);

Only adding the lock in pg_report_inj_fixed() won’t prevent the race conditions from bgwriter. So I wonder, do we need to add the same lock in the other places?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#3Michael Paquier
michael@paquier.xyz
In reply to: Chao Li (#2)
Re: Fix locking issue with fixed-size stats template in injection_points

On Mon, Sep 29, 2025 at 09:46:05AM +0800, Chao Li wrote:

I saw pg_state_begin_changecount_write() is called multiple places,
as you mention, for example bgwriter.

I've mentioned that in my first email, and put in details:
- pgstat_report_bgwriter() is called once, by the bgwriter.
- pgstat_report_checkpointer() is called three time, all by the
checkpointer.
- pgstat_report_archiver() is called twice, all by pgarch.c.

So all of them don't have a problem, two calls cannot happen
concurrently.
--
Michael

#4Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#1)
Re: Fix locking issue with fixed-size stats template in injection_points

Hi,

On Mon, Sep 29, 2025 at 09:48:44AM +0900, Michael Paquier wrote:

Then something like the following command:
$ cat create_inj.sql
\set id random(1,100000)
select injection_points_attach('popo:id', 'notice');
select injection_points_run('popo:id');
select injection_points_detach('popo:id');
$ pgbench -n -T 300 -f create_inj.sql -c 10

The failure is not surprising, because the stats reports can happen in
a concurrent fashion when a point is run for example, contrary to
other fixed-sized stats kind where the reports are only done by a
single process (archiver, bgwriter, checkpointer). So this is just a
matter of acquiring a lock that was forgotten, to make sure that the
changes are consistent. Far from critical as this is template code,
still embarrassing.

Thoughts or comments?

Patch LGTM.

Remark: I like the "popo" prefix in your test ;-)

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#5Chao Li
li.evan.chao@gmail.com
In reply to: Michael Paquier (#3)
Re: Fix locking issue with fixed-size stats template in injection_points

On Sep 29, 2025, at 14:27, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Sep 29, 2025 at 09:46:05AM +0800, Chao Li wrote:

I saw pg_state_begin_changecount_write() is called multiple places,
as you mention, for example bgwriter.

I've mentioned that in my first email, and put in details:
- pgstat_report_bgwriter() is called once, by the bgwriter.
- pgstat_report_checkpointer() is called three time, all by the
checkpointer.
- pgstat_report_archiver() is called twice, all by pgarch.c.

So all of them don't have a problem, two calls cannot happen
concurrently.
--
Michael

Thanks for the clarification. Then the patch looks good to me.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#6wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Chao Li (#5)
Re: Fix locking issue with fixed-size stats template in injection_points

HI
This patch looks good to me.

The issue is clear: unlike other fixed-size stats kinds (archiver,
bgwriter, checkpointer), the injection_points stats can be updated
concurrently by multiple backends. Without synchronization, this can lead
to inconsistent changecount state and assertion failures in
pgstat_begin_changecount_write(), as shown in your reproduction.

Thanks

On Mon, Sep 29, 2025 at 3:29 PM Chao Li <li.evan.chao@gmail.com> wrote:

Show quoted text

On Sep 29, 2025, at 14:27, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Sep 29, 2025 at 09:46:05AM +0800, Chao Li wrote:

I saw pg_state_begin_changecount_write() is called multiple places,
as you mention, for example bgwriter.

I've mentioned that in my first email, and put in details:
- pgstat_report_bgwriter() is called once, by the bgwriter.
- pgstat_report_checkpointer() is called three time, all by the
checkpointer.
- pgstat_report_archiver() is called twice, all by pgarch.c.

So all of them don't have a problem, two calls cannot happen
concurrently.
--
Michael

Thanks for the clarification. Then the patch looks good to me.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#7Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#4)
Re: Fix locking issue with fixed-size stats template in injection_points

On Mon, Sep 29, 2025 at 06:52:31AM +0000, Bertrand Drouvot wrote:

Remark: I like the "popo" prefix in your test ;-)

It's that or popopop. "glop" and "pas-glop" were my other candidates,
but I doubt we can use them freely, and you may be the only one around
here to know what this refers to. :D
--
Michael

#8Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#7)
Re: Fix locking issue with fixed-size stats template in injection_points

On Mon, Sep 29, 2025 at 05:01:37PM +0900, Michael Paquier wrote:

It's that or popopop. "glop" and "pas-glop" were my other candidates,
but I doubt we can use them freely, and you may be the only one around
here to know what this refers to. :D

Putting that aside, fixed down to v18.
--
Michael