64-bit wait_event and introduction of 32-bit wait_event_arg
Hi all,
We were debating internally whether moving to a 64-bit wait_event
would be an acceptable idea (Robert's primary concern is that it may
carry too little information), but I already had code to demo this, so
let's discuss it further. After checking that 64-bit integer math has
the same performance characteristics as 32-bit math, at least on
x86_64, I've converted our wait_event_info (32-bit today) to 64 bits
while trying to use pg atomics, then used some bit masking to expose
the lower 32 bits as a new wait_event_arg, with a few simple demos. The
idea is to encode some specific (limited, but useful!) information into
the wait event variable itself, so that we gain an additional 32 bits
of space for details alongside the wait event, to help with assessing
some wait-event-related problems. This probably comes without any
performance impact, at least on the reasonable platforms used in
production today (that is, the ones with
PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY). Intended use pattern: if I were
chasing a specific wait_event-related problem, I could extract certain
info straight from wait_event_arg, which is much easier than drilling
into other, more advanced views (if that information is exposed at
all; often it's not).
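To make the layout concrete, here is a minimal sketch of the packing described above. It assumes the existing 32-bit wait_event_info moves into the upper half of the 64-bit word and wait_event_arg occupies the lower half. All helper names (wait_event_pack(), demo_report_wait_start(), ...) are made up for illustration, and C11 <stdatomic.h> stands in for the pg atomics API, so this is not necessarily how the patch does it.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: bits 63..32 = classic wait_event_info, bits 31..0 = wait_event_arg. */
#define WAIT_EVENT_ARG_MASK UINT64_C(0x00000000FFFFFFFF)

/* Stand-in for the per-backend wait event slot (hypothetical name). */
static _Atomic uint64_t demo_wait_event_slot;

static inline uint64_t
wait_event_pack(uint32_t wait_event_info, uint32_t arg)
{
    return ((uint64_t) wait_event_info << 32) | (uint64_t) arg;
}

static inline uint32_t
wait_event_unpack_info(uint64_t packed)
{
    return (uint32_t) (packed >> 32);
}

static inline uint32_t
wait_event_unpack_arg(uint64_t packed)
{
    return (uint32_t) (packed & WAIT_EVENT_ARG_MASK);
}

/* A single relaxed 8-byte store publishes the event and its argument together. */
static inline void
demo_report_wait_start(uint32_t wait_event_info, uint32_t arg)
{
    atomic_store_explicit(&demo_wait_event_slot,
                          wait_event_pack(wait_event_info, arg),
                          memory_order_relaxed);
}

/* Reader side, e.g. what a pg_stat_activity-like view would do. */
static inline uint64_t
demo_read_wait_event(void)
{
    return atomic_load_explicit(&demo_wait_event_slot, memory_order_relaxed);
}
```

On platforms where an aligned 8-byte store is single-copy atomic, a reader should never observe a new event paired with a stale argument, which is the property the PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY remark relies on.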
Q0) Key question: does that sound like a good idea to pursue further
or are there any blockers to it?
Sample demos are included in the patch. Depending on the specific
wait_event, wait_event_arg could be:
1. PgSleep could show the time since it was launched (the simplest
thing one can imagine; or maybe we could show the time left instead?):
  pid  |  backend_type  | wait_event_type | wait_event | wait_event_arg | query
-------+----------------+-----------------+------------+----------------+-------
 78317 | client backend | Timeout         | PgSleep    |             10 | select 'imagine complex stuff here dozes of kB SQL text query,procedures, functions' as s, pg_sleep(10) as embedded_internally;
2. Passing the exact OID of the relation we are waiting on (here pid
82242 was running "alter table p3 add ..", but it is waiting for the
backend that executed "lock table p3 in exclusive mode;"). We can
decode wait_event_arg straight into the relation (p3):
postgres=# select pid, backend_type, wait_event_type, wait_event,
wait_event_arg, wait_event_arg::regclass, query from pg_stat_activity
where state = 'active' and (wait_event_type, wait_event) = ('Lock',
'relation');
  pid  |  backend_type  | wait_event_type | wait_event | wait_event_arg | wait_event_arg | query
-------+----------------+-----------------+------------+----------------+----------------+--------------------------------
 82242 | client backend | Lock            | relation   |          16467 | p3             | alter table p3 add id3 bigint;
3. IPC/SyncRep (SyncRepWaitForLSN()) could report the PID of the
slowest walsender. This is useful when multiple walsenders are
involved, to pinpoint where you might be slow/stuck:
  pid   | application_name | wait_event_type |  wait_event   | wait_event_arg | q
--------+------------------+-----------------+---------------+----------------+------------------------------------------
 120318 | pgbench          | IPC             | SyncRep       |         119689 | INSERT INTO child (parent_id, payload)
 120319 | pgbench          | IPC             | SyncRep       |         119689 | INSERT INTO child (parent_id, payload)
 120320 | pgbench          | IPC             | SyncRep       |         120248 | INSERT INTO child (parent_id, payload)
 120321 | pgbench          | IPC             | SyncRep       |         119689 | INSERT INTO child (parent_id, payload)
 119689 | walreceiver2     | Activity        | WalSenderMain |                | START_REPLICATION 0/DC000000 TIMELINE 1
 120248 | walreceiver      | Activity        | WalSenderMain |                | START_REPLICATION 0/E2000000 TIMELINE 1
(then you would basically query pg_stat_replication for pid = 119689,
as it seems to be the slowest one here)
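As a rough illustration of how such a PID could be picked, here is a sketch that assumes "slowest" means "lowest flushed LSN". The DemoWalSender struct and demo_slowest_walsender_pid() are hypothetical; the real walsender bookkeeping and the demo patch may use a different criterion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one walsender's progress. */
typedef struct DemoWalSender
{
    uint32_t pid;        /* walsender PID, 0 = unused slot */
    uint64_t flush_lsn;  /* last LSN the standby reported as flushed */
} DemoWalSender;

/* Return the PID of the walsender with the lowest flush LSN, or 0 if none. */
static uint32_t
demo_slowest_walsender_pid(const DemoWalSender *senders, size_t n)
{
    uint32_t slowest_pid = 0;
    uint64_t lowest_lsn = 0;
    int      found = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (senders[i].pid == 0)
            continue;           /* skip unused slots */
        if (!found || senders[i].flush_lsn < lowest_lsn)
        {
            lowest_lsn = senders[i].flush_lsn;
            slowest_pid = senders[i].pid;
            found = 1;
        }
    }
    return slowest_pid;
}
```

The returned PID fits in 32 bits, so it could be stored directly as wait_event_arg for the waiting backend.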
4. DataFile could report the fd (yes, it can differ from backend to
backend [due to the fd cache], but it's a demo; oid/relationNumber
would probably be better, but that's not fast to do :). Although we
already have dboid, there's also the tablespace, and I don't know how
we could squeeze RelFileNumber together with the tablespace in there;
possibly we could just use the tablespace Oid there too):
  pid  |  backend_type  | wait_event_type |  wait_event  | wait_event_arg | query
-------+----------------+-----------------+--------------+----------------+------------------------------------------------------------
 77467 | client backend | IO              | DataFileRead |              8 | SELECT abalance FROM pgbench_accounts WHERE aid = 8657837;
 77470 | client backend | IO              | DataFileRead |             11 | SELECT abalance FROM pgbench_accounts WHERE aid = 6840630;
5. (Challenging for me) MultiXact wait events: with wait_event_arg we
could report where stuff is really waiting. Right now it's a little
guesswork, but with the 0002 concept:
dbmultixact=# select wait_event_type, wait_event, wait_event_arg,
count(*) from pg_stat_activity where state='active' group by
wait_event_type, wait_event, wait_event_arg order by 4 desc limit 5;
 wait_event_type |     wait_event      | wait_event_arg | count
-----------------+---------------------+----------------+-------
 LWLock          | BufferContent       |                |   365
 Lock            | tuple               |          16494 |    42
 LWLock          | MultiXactOffsetSLRU |          16494 |    13
 Lock            | transactionid       |                |    10
 LWLock          | MultiXactOffsetSLRU |                |     9
dbmultixact=# select pid, query, wait_event_type,
wait_event,wait_event_arg from pg_stat_activity where wait_event =
'MultiXactMemberSLRU';
  pid  |                               query                                | wait_event_type |     wait_event      | wait_event_arg
-------+--------------------------------------------------------------------+-----------------+---------------------+----------------
 99864 | INSERT INTO users (loc_id, fname) VALUES (2,'Testing User-2-002'); | LWLock          | MultiXactMemberSLRU |          16494
dbmultixact=# select 16494::regclass;
regclass
-----------
locations
dbmultixact=# \d users
[..]
"users_loc_id_fkey" FOREIGN KEY (loc_id) REFERENCES locations(loc_id)
The end user would learn what exactly is stored in wait_event_arg
(depending on the main wait_event) from the docs (probably from some
table). Probably every wait_event could be enhanced with some
information this way.
Quick performance crosscheck of 0001 alone: /usr/pgsql19/bin/pgbench
-c 4 -P 1 -T 30 -S postgres:
master: tps = 121020.723246 (without initial connection time)
patched: tps = 121802.527000 (without initial connection time)
Q1) Because we compile without -Wconversion, I was wondering whether
we need a safe/strict uint64 struct-like type that would catch errors
when something like the uint64 returned by WaitEventExtensionNew() is
used externally by extensions as uint32. Because we do NOT have
-Wtruncation [too verbose?], any uint64 return value will be silently
cast to uint32 in extensions, without any warning. That may cause
hangs during tests: tests often wait for some wait event to show up,
but it won't.
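One possible shape for such a strict type, as a sketch: wrapping the value in a one-member struct makes any implicit conversion to uint32 a compile error rather than a silent truncation. The names WaitEventInfo64, make_wait_event_info64() and demo_wait_event_extension_new() are hypothetical, and the packing (event in the upper 32 bits) is an assumption.

```c
#include <stdint.h>

/* Hypothetical strict wrapper: a struct never converts implicitly to an integer. */
typedef struct WaitEventInfo64
{
    uint64_t value;
} WaitEventInfo64;

static inline WaitEventInfo64
make_wait_event_info64(uint32_t wait_event_info, uint32_t arg)
{
    WaitEventInfo64 w = { ((uint64_t) wait_event_info << 32) | (uint64_t) arg };

    return w;
}

/* Stand-in for an API that hands a wait event to an extension. */
static inline WaitEventInfo64
demo_wait_event_extension_new(void)
{
    return make_wait_event_info64(0x07000000u, 0);
}

/*
 * With a plain uint64 return type, this compiles silently and truncates:
 *     uint32_t ev = some_function_returning_uint64();
 * With the wrapper, the equivalent line is rejected by the compiler:
 *     uint32_t ev = demo_wait_event_extension_new();   // incompatible types
 */
```

The cost is that every call site has to go through `.value` or accessor helpers, which is exactly the kind of churn that makes old extension code fail loudly at build time instead of hanging in tests.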
Q2) 0002: Please ignore the quality of 0002; I did not want to sink
more time into the MultiXact stuff, especially if the main concept
gets shot down. The main problem is how to get the RelFileNumber of
the Relation that is facing MultiXact back into
LWLockReportWaitStart(). Here I just wanted to see how much rework
would be necessary (passing variables, modifying the API and so on).
In short: it introduces LWLockAcquire() as a fallback to
LWLockAcquireExt(.. RelFileNumber r), but it still gets pretty nasty
soon, sadly, and lots of stuff needs to be dumb-adjusted. I would like
to point out that I'm a complete multixact/heapam noob, so it is
surely a very dumb way of passing that info, in way too many places.
Another thing we could do is have some "static uint32
lwlock_relation" inside lwlock and just set it there (and reset it)
once from within heap*.c or similar, so that all dependent LWLock
routines would OR it in (== so it would be visible as wait_event_arg)
and we would get the RelFileNumber for all operations involved there
(at least for LWLocks).
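The "static uint32 lwlock_relation" alternative could look roughly like the sketch below. Everything prefixed with demo_ is hypothetical, the wait event slot is a plain variable rather than real backend state, and the event is assumed to sit in the upper 32 bits.

```c
#include <stdint.h>

/* Context set by the caller (e.g. heap code); 0 means "no relation context". */
static uint32_t lwlock_relation = 0;

/* Stand-in for the backend's 64-bit wait event slot. */
static uint64_t demo_my_wait_event_info = 0;

static inline void
demo_lwlock_set_relation(uint32_t relfilenumber)
{
    lwlock_relation = relfilenumber;
}

static inline void
demo_lwlock_reset_relation(void)
{
    lwlock_relation = 0;
}

/*
 * Simplified analogue of LWLockReportWaitStart(): whatever context is
 * currently set gets ORed into the lower 32 bits, so no LWLockAcquire()
 * signature has to change.
 */
static inline void
demo_lwlock_report_wait_start(uint32_t wait_event_info)
{
    demo_my_wait_event_info =
        ((uint64_t) wait_event_info << 32) | (uint64_t) lwlock_relation;
}
```

The trade-off is the usual one for ambient state: call sites stay untouched, but every code path that sets the context must also reset it, including error paths.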
While thinking about cons, the only one I could come up with is that
once we expose something as 32 bits, if a following major release
changes some internal structure/data type to be a bit heavier, it
couldn't be exposed like that anymore (think of e.g. 64-bit OIDs?)
Any help, opinions, ideas and code/co-authors are more than welcome.
-J.
[1]: Disassembly of a stock binary taken from PGDG on Ubuntu/Debian
x86_64 (so as used by real users) shows use of the 64-bit (rax)
register, e.g. in AT&T mnemonics:
lea 0x6a4c1c(%rip), %rax // load the address of my_wait_event_info into %rax
mov (%rax), %rax         // dereference it (the pointer is put back into %rax)
mov %ebx, (%rax)         // write the 32-bit value of %ebx to the address held in %rax
Or a different, but very similar example with Intel mnemonics:
lea rax,[rip+0x865b09]
mov rax,QWORD PTR [rax] // notice it's already RAX and quadword
mov DWORD PTR [rax],0x0
[2]: x86_64 Linux, operations on eax vs rax; that's 1.00646 under
non-ideal conditions.
Benchmarking int32_t (32-bit) additions... Operations/Second (int32_t): 3.37e+07
Benchmarking int64_t (64-bit) additions... Operations/Second (int64_t): 3.38e+07
On 08/12/2025 11:54, Jakub Wartak wrote:
While thinking about cons, the only cons that I could think of is that
when we would be exposing something as 32-bits , then if the following
major release changes some internal structure/data type to be a bit
more heavy, it couldn't be exposed anymore like that (think of e.g.
64-bit OIDs?)
Any help, opinions, ideas and code/co-authors are more than welcome.
Expanding it to 64 bit seems fine as far as performance is concerned. I
think the difficult and laborious part is to design the facilities to
make use of it. For example, if you encode a table OID in it, how do
you interpret that when you're looking at pg_stat_activity? A new
pg_explain_wait_event(bigint waitevent) that returns a text
representation of the event perhaps? Wait events can be defined in
extensions; how does an extension plug into this facility?
Inevitably, the extra 32 bits won't be enough to expose everything that
you might want to expose. Should we already think about what to do then?
For lock waits, for example, should we have another array in shared
memory with more details, and just store an offset into that array in
the extra wait event bits, for example? (We already have pg_locks, but
let's imagine we didn't. How would you design it in a green field scenario?)
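A sketch of that "offset into a side array" idea, purely for illustration: the detail struct, the array size, and the demo_ helpers are all assumptions, the array is an ordinary static array rather than shared memory, and synchronization between writer and reader is ignored.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_DETAILS 64

/* Hypothetical detail record, too large to fit into 32 bits by itself. */
typedef struct DemoLockWaitDetail
{
    uint32_t dboid;
    uint32_t reloid;
    uint32_t blocker_pid;
    uint8_t  lockmode;
} DemoLockWaitDetail;

/* Stand-in for an array that would live in shared memory. */
static DemoLockWaitDetail demo_details[DEMO_MAX_DETAILS];

/* Store the details in a slot and return a wait event that points at it. */
static uint64_t
demo_publish_lock_wait(uint32_t wait_event_info, uint32_t slot,
                       DemoLockWaitDetail detail)
{
    uint32_t idx = slot % DEMO_MAX_DETAILS;

    demo_details[idx] = detail;
    return ((uint64_t) wait_event_info << 32) | (uint64_t) idx;
}

/* Reader side: follow the index stored in the lower 32 bits. */
static const DemoLockWaitDetail *
demo_lookup_detail(uint64_t packed)
{
    uint32_t idx = (uint32_t) (packed & 0xFFFFFFFFu);

    return idx < DEMO_MAX_DETAILS ? &demo_details[idx] : NULL;
}
```

A real design would also have to deal with a reader seeing a slot that has already been reused, which this sketch does not attempt.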
- Heikki
Hi,
On Mon, Dec 08, 2025 at 12:12:27PM +0200, Heikki Linnakangas wrote:
On 08/12/2025 11:54, Jakub Wartak wrote:
Thanks for working on this!
While thinking about cons, the only cons that I could think of is that
when we would be exposing something as 32-bits , then if the following
major release changes some internal structure/data type to be a bit
more heavy, it couldn't be exposed anymore like that (think of e.g.
64-bit OIDs?)
Any help, opinions, ideas and code/co-authors are more than welcome.
Expanding it to 64 bit seems fine as far as performance is concerned. I
think the difficult and laborious part is to design the facilities to make
use of it. For example, if you encode a table OID in it, how do you
interpret that when you're looking at pg_stat_activity? A new
pg_explain_wait_event(bigint waitevent) that returns a text representation
of the event perhaps?
I worked on something similar in the past (see [1]) and ended up providing the extra
information that way:
   pid   | wait_event_type |  wait_event  |                            infos
---------+-----------------+--------------+-------------------------------------------------------------
 2560105 | IO              | DataFileRead | {"blocknum" : "9272", "dbnode" : "5", "relnode" : "16407"}
 2560135 | IO              | WalSync      | {"segno" : "1", "tli" : "1"}
 2560138 | IO              | DataFileRead | {"blocknum" : "78408", "dbnode" : "5", "relnode" : "16399"}
The "descriptions" were added in wait_event_names.txt, for example,
+DATA_FILE_READ "Waiting for a read from a relation data file." "blocknum" "dbnode" "relnode"
and the json was built only at query time. Maybe that could be an option to expose
the values and the descriptions in the same field.
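Building the JSON only at query time could look something like the sketch below: field names come from a static description (as in the wait_event_names.txt line above) and only the raw values are stored while waiting. The function name and signature are hypothetical and not taken from the referenced patch; field names are assumed to be trusted and are not escaped.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Render {"name" : "value", ...} into buf.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int
demo_wait_event_infos_json(char *buf, size_t buflen,
                           const char *const *names,
                           const uint64_t *values, int nfields)
{
    size_t used = 0;
    int    n;

    n = snprintf(buf, buflen, "{");
    if (n < 0 || (size_t) n >= buflen)
        return -1;
    used = (size_t) n;

    for (int i = 0; i < nfields; i++)
    {
        n = snprintf(buf + used, buflen - used, "%s\"%s\" : \"%llu\"",
                     i > 0 ? ", " : "", names[i],
                     (unsigned long long) values[i]);
        if (n < 0 || (size_t) n >= buflen - used)
            return -1;
        used += (size_t) n;
    }

    n = snprintf(buf + used, buflen - used, "}");
    if (n < 0 || (size_t) n >= buflen - used)
        return -1;
    used += (size_t) n;

    return (int) used;
}
```

The point of the split is that the hot path only stores integers; the string formatting cost is paid by whoever queries the view.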
[1]: /messages/by-id/aIIeX7p2cKUO6KTa@ip-10-97-1-34.eu-west-3.compute.internal
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi Heikki, thanks for having a look!
On Mon, Dec 8, 2025 at 11:12 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 08/12/2025 11:54, Jakub Wartak wrote:
While thinking about cons, the only cons that I could think of is that
when we would be exposing something as 32-bits , then if the following
major release changes some internal structure/data type to be a bit
more heavy, it couldn't be exposed anymore like that (think of e.g.
64-bit OIDs?)
Any help, opinions, ideas and code/co-authors are more than welcome.
Expanding it to 64 bit seems fine as far as performance is concerned. I
think the difficult and laborious part is to design the facilities to
make use of it.
Right, I'm very interested in hearing what could be added there/what
people want (bonus points if that is causing some performance issues
today and we do not have the area covered and exposing that would fit
in 32-bits ;) )
For example, if you encode a table OID in it, how do
you interpret that when you're looking at pg_stat_activity? A new
pg_explain_wait_event(bigint waitevent) that returns a text
representation of the event perhaps?
Well, initially I was thinking about just leaving it as is (bigint),
with the interpretation left to the operator (based on the docs).
That's not yet part of the patch, because I still don't know if the
idea is worth developing further. Technically the wait_event_arg value
is sometimes going to be an OID, sometimes a pid (like in the SyncRep
case), most often probably a reason_code (of the wait), and sometimes
maybe even a hash of something to make it fit. So yeah, I think we
could. I like the idea of having a built-in
pg_explain_wait_event_argument(bigint)::text that could add a hint
about what the argument really shows without looking at the docs. The
question is what it should return: a simple ::text like
'reason'/'pid'/'OID', or something more descriptive in English? And
wouldn't English-only output be a problem for translators?
The alternative would be to just have a table in the docs (for a
start?) explaining the meaning. In practice you would hunt for a
specific wait_event, or have one big CASE WHEN/ELSE SQL query to
interpret the values properly.
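For the "simple ::text" variant, the lookup could be as small as a static table, sketched below. The function is keyed by wait event name rather than by the bigint value to keep the example short, the function and struct names are made up, and the table entries just mirror the demos from the first mail.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping from a wait event to the kind of value in wait_event_arg. */
typedef struct DemoWaitEventArgDesc
{
    const char *wait_event;
    const char *arg_kind;
} DemoWaitEventArgDesc;

static const DemoWaitEventArgDesc demo_arg_descs[] = {
    {"PgSleep", "elapsed time"},
    {"relation", "relation OID"},
    {"SyncRep", "walsender PID"},
    {"DataFileRead", "file descriptor"},
};

/* Return a short description of the argument, or NULL if the event carries none. */
static const char *
demo_explain_wait_event_argument(const char *wait_event)
{
    size_t n = sizeof(demo_arg_descs) / sizeof(demo_arg_descs[0]);

    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(demo_arg_descs[i].wait_event, wait_event) == 0)
            return demo_arg_descs[i].arg_kind;
    }
    return NULL;
}
```

Returning short machine-like tokens instead of English sentences would also sidestep most of the translation question.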
Wait events can be defined in extensions; how does an extension plug into this facility?
I have not given extensions a lot of thought or coverage yet, but the
answer is probably: they don't seem to plug heavily into this. I
think an extension could just use WaitEventExtensionNew() /
pgstat_report_wait_start() as usual and later logically OR in some
32-bit number; the interpretation of wait_event_arg would then have to
be provided by the extension itself (via its docs), I guess. Would
that approach be acceptable, or did you have some other idea? Maybe,
with your idea of having pg_explain_wait_event_argument(), we would
have to alter WaitEventExtensionNew(const char *wait_event_name) and
add something like 'const char *wait_event_arg_description' there?
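A toy version of such a registration call, only to show the shape of the API change being floated. demo_ext_wait_event_register() is a hypothetical stand-in and not the real WaitEventExtensionNew(); the registry is a fixed static array and the strings are assumed to outlive it.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_EXT_EVENTS 16

typedef struct DemoExtWaitEvent
{
    const char *name;
    const char *arg_description;  /* what wait_event_arg means for this event */
} DemoExtWaitEvent;

static DemoExtWaitEvent demo_ext_events[DEMO_MAX_EXT_EVENTS];
static uint32_t demo_ext_event_count = 0;

/* Register an event; returns a 1-based id, or 0 when the table is full. */
static uint32_t
demo_ext_wait_event_register(const char *wait_event_name,
                             const char *wait_event_arg_description)
{
    if (demo_ext_event_count >= DEMO_MAX_EXT_EVENTS)
        return 0;
    demo_ext_events[demo_ext_event_count].name = wait_event_name;
    demo_ext_events[demo_ext_event_count].arg_description =
        wait_event_arg_description;
    return ++demo_ext_event_count;
}

/* What an explain function could return for an extension-defined event. */
static const char *
demo_ext_arg_description(uint32_t id)
{
    if (id == 0 || id > demo_ext_event_count)
        return NULL;
    return demo_ext_events[id - 1].arg_description;
}
```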
Inevitably, the extra 32 bits won't be enough to expose everything that
you might want to expose. Should we already think about what to do then?
Well, I wanted to stick to exposing only stuff that will _always_ fit
into 32 bits. If additional/more detailed instrumentation is
necessary, then a separate monitoring/observability subsystem should
probably be built for that specific use case. So if some information
can grow beyond 32 bits, it should not be encoded into wait_event_arg,
just to avoid debating performance regressions for any additional
wait-event infrastructure. I simply do not want to open a can of
worms: Bertrand tried that in [1], and I don't want this thread to
follow the same route, where Andres and Robert expressed their
concerns earlier. One of the key open questions for me is whether we
would like to continue the earlier 56-bit [2] / 64-bit
OID/RelFileNode attempt(s). If the project wants to continue with
that, then we probably couldn't express a relation id as a 32-bit
wait_event_arg, or maybe I am missing something. (Of course, we could
hash a potential 64-bit OID back into 32 bits one day, but that sounds
like a hack, doesn't it?)
For lock waits, for example, should we have another array in shared
memory with more details, and just store an offset into that array in
the extra wait event bits, for example? (we already have pg_locks, but
let's imagine we didn't. How would you design it in a green field scenario?)
If we didn't have pg_locks, I would probably stick with encoding the
mode, maybe mode|granted|fastpath (assuming OIDs are no-go).
Some brainstorming and other crazy(?) ideas for how we could expose
some intrinsic PG behavior:
- writing while reading (AKA setting hint bits) - could be exposed as
a reason_code for write-like wait events? (e.g. for IO/WALWrite we
could encode a reason_code?)
- same as above (hint bits), but for CLOG/SLRU and maybe others too?
Maybe we could expose which SLRU exactly we are reading/writing in
IO/SLRU_READ|WRITE waits and further encode some "reason" there too?
- still for IO/WALWrite, we could also add another reason_code bit
meaning: are we writing a full FPI or not? (that would make
wait_event_arg for IO/WALWrite a bitmap: e.g. writing_FPI |
writing_hintbits)
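The bitmap variant from the last bullet could be sketched as follows. The flag names and bit positions are invented for the example, and the event is again assumed to live in the upper 32 bits.

```c
#include <stdint.h>

/* Hypothetical reason bits for IO/WALWrite; positions are arbitrary. */
enum
{
    DEMO_WAL_REASON_FPI = 1 << 0,        /* writing a full-page image */
    DEMO_WAL_REASON_HINT_BITS = 1 << 1   /* write triggered by hint-bit setting */
};

/* Combine the wait event with a bitmap of reasons in the lower 32 bits. */
static inline uint64_t
demo_walwrite_wait_event(uint32_t wait_event_info, uint32_t reasons)
{
    return ((uint64_t) wait_event_info << 32) | (uint64_t) reasons;
}

/* Test a single reason bit on the reader side. */
static inline int
demo_wait_has_reason(uint64_t packed, uint32_t reason)
{
    return ((uint32_t) (packed & 0xFFFFFFFFu) & reason) != 0;
}
```

On the SQL side the same check would be a bitwise AND on wait_event_arg, so a docs table listing the bits per wait event would be enough to interpret it.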
-J.
[1]: /messages/by-id/lt6n664ijbmfftnuv3bgvt47q7kjz4tflu4kg3ingv6njjtvld@kesknxnidemo
[2]: /messages/by-id/CA+TgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg@mail.gmail.com
On Mon, Dec 8, 2025 at 12:27 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
On Mon, Dec 08, 2025 at 12:12:27PM +0200, Heikki Linnakangas wrote:
On 08/12/2025 11:54, Jakub Wartak wrote:
Thanks for working on this!
While thinking about cons, the only cons that I could think of is that
when we would be exposing something as 32-bits , then if the following
major release changes some internal structure/data type to be a bit
more heavy, it couldn't be exposed anymore like that (think of e.g.
64-bit OIDs?)
Any help, opinions, ideas and code/co-authors are more than welcome.
Expanding it to 64 bit seems fine as far as performance is concerned. I
think the difficult and laborious part is to design the facilities to make
use of it. For example, if you encode a table OID in it, how do you
interpret that when you're looking at pg_stat_activity? A new
pg_explain_wait_event(bigint waitevent) that returns a text representation
of the event perhaps?
I worked on something similar in the past (see [1]) and ended up providing the extra
information that way:
   pid   | wait_event_type |  wait_event  |                            infos
---------+-----------------+--------------+-------------------------------------------------------------
 2560105 | IO              | DataFileRead | {"blocknum" : "9272", "dbnode" : "5", "relnode" : "16407"}
 2560135 | IO              | WalSync      | {"segno" : "1", "tli" : "1"}
 2560138 | IO              | DataFileRead | {"blocknum" : "78408", "dbnode" : "5", "relnode" : "16399"}
The "descriptions" were added in wait_event_names.txt, for example,
+DATA_FILE_READ "Waiting for a read from a relation data file." "blocknum" "dbnode" "relnode"
and the json was built only at query time. Maybe that could be an option to expose
the values and the descriptions in the same field.
[1]: /messages/by-id/aIIeX7p2cKUO6KTa@ip-10-97-1-34.eu-west-3.compute.internal
Hi Bertrand, thanks for the link. I've responded to Heikki, but your
thread (and the feedback you got) was one of the reasons not to add
anything there beyond just using a more powerful (longer) register
for free.
-J.
Hi,
On Tue, Dec 09, 2025 at 10:14:19AM +0100, Jakub Wartak wrote:
On Mon, Dec 8, 2025 at 12:27 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
   pid   | wait_event_type |  wait_event  |                            infos
---------+-----------------+--------------+-------------------------------------------------------------
 2560105 | IO              | DataFileRead | {"blocknum" : "9272", "dbnode" : "5", "relnode" : "16407"}
 2560135 | IO              | WalSync      | {"segno" : "1", "tli" : "1"}
 2560138 | IO              | DataFileRead | {"blocknum" : "78408", "dbnode" : "5", "relnode" : "16399"}
The "descriptions" were added in wait_event_names.txt, for example,
+DATA_FILE_READ "Waiting for a read from a relation data file." "blocknum" "dbnode" "relnode"
Hi Bertrand, thanks for the link. I've responded to Heikki, but your
thread (and the feedback you got) was one of the reasons not to add
anything there beyond just using a more powerful (longer) register
for free.
Yeah exactly. My message in the current thread was only about proposing a way
to add a text description to the extra value.
Regards,
--
Bertrand Drouvot