64-bit wait_event and introduction of 32-bit wait_event_arg

Started by Jakub Wartak, 3 months ago, 13 messages

jakub.wartak@enterprisedb.com

3 months ago

Hi all,

We were debating internally if making transition to 64-bit wait_event
would be an acceptable idea (Robert's primary concern is that it may
be too limited info), but I had code to demo this, so let's just
discuss it further: After ensuring that 64-bit int math has the same
performance characteristics as 32-bit math, at least on x86_64, I've
converted our wait_event_info (32-bit today) to 64 bits while trying
to use pg atomics, then used some bit-masking voodoo and got the lower
32 bits exposed as a new wait_event_arg with some dumb demos. The idea is
to encode some specific (limited, but useful!) information into the
wait event variable itself, so we could gain access to additional
32-bit of space for details along with wait-event itself to help
assessment of some wait-event-related problems. This seems to probably
come without any performance impact, at least on reasonable platforms
used today in production (for ones with
PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY that is). Intended use pattern:
if I were chasing a certain specific wait_event-related problem, I
could extract certain info straight from wait_event_arg, making it
much easier than even drilling into other more advanced views (if
that's information exposed at all, often it's not).
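
A minimal sketch of the bit masking described above, assuming the existing 32-bit wait_event_info moves into the upper half of the 64-bit value and wait_event_arg occupies the lower 32 bits. The function and macro names are hypothetical and are not taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout (illustration only):
 *   upper 32 bits = today's wait_event_info (class + event id)
 *   lower 32 bits = the new wait_event_arg
 */
#define WAIT_EVENT_ARG_BITS 32
#define WAIT_EVENT_ARG_MASK UINT64_C(0xFFFFFFFF)

/* Build the 64-bit value from the event and its argument. */
static inline uint64_t
wait_event_pack(uint32_t wait_event_info, uint32_t arg)
{
    return ((uint64_t) wait_event_info << WAIT_EVENT_ARG_BITS) | arg;
}

/* OR an argument into a value whose argument half is still empty. */
static inline uint64_t
wait_event_or_arg(uint64_t packed, uint32_t arg)
{
    assert((packed & WAIT_EVENT_ARG_MASK) == 0);
    return packed | arg;
}

/* Recover the original 32-bit wait event. */
static inline uint32_t
wait_event_get_info(uint64_t packed)
{
    return (uint32_t) (packed >> WAIT_EVENT_ARG_BITS);
}

/* Recover the 32-bit argument. */
static inline uint32_t
wait_event_get_arg(uint64_t packed)
{
    return (uint32_t) (packed & WAIT_EVENT_ARG_MASK);
}
```

With this layout a reader of the shared value can split it with one shift and one mask, so the event and its argument are always observed together in a single 64-bit read.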

Q0) Key question: does that sound like a good idea to pursue further
or are there any blockers to it?

Sample demos included in patch, depending on the specific wait_event,
wait_event_arg could be:

1. PgSleep could show time since it was launched (simplest thing one
can imagine, or we could think about time left maybe too?):

2. Passing exact relation oid on where we are waiting for (here pid
82242 was doing "alter table p3 add.." , but it's waiting for the
backend that executed "lock table p3 in exclusive mode;"). We can
decode wait_event right into relation (p3)

(then you would basically query pg_stat_replication for pid = 119689
as it seems to be the slowest one here)

4. DataFile could report fd (yes, it can differ from backend to
backend [due to fd cache], but it's demo, probably it would be better
with oid/relationNumber, but it's not fast to do that :) and although
we have dboid already, there's tablespace and dunno how we could
squeeze RelFileNumber with tablespace there, possibly we could just
use tablespace Oid there too)

5. (Challenging for me) Multixact wait events - with wait_event_arg,
we could report where stuff is really waiting; right now it's a
little guesswork, but with the 0002 concept:

dbmultixact=# select pid, query, wait_event_type, wait_event, wait_event_arg
from pg_stat_activity where wait_event = 'MultiXactMemberSLRU';
  pid  | query                                                              | wait_event_type | wait_event          | wait_event_arg
-------+--------------------------------------------------------------------+-----------------+---------------------+----------------
 99864 | INSERT INTO users (loc_id, fname) VALUES (2,'Testing User-2-002'); | LWLock          | MultiXactMemberSLRU | 16494

dbmultixact=# select 16494::regclass;
regclass
-----------
locations
dbmultixact=# \d users
[..]
"users_loc_id_fkey" FOREIGN KEY (loc_id) REFERENCES locations(loc_id)

The knowledge (for the end user) what is stored exactly in
wait_event_arg (depending on main wait_event) would be coming from
docs (probably some table). Probably each different wait_event could
be enhanced by some information.

Quick performance crosscheck of 0001 alone: /usr/pgsql19/bin/pgbench
-c 4 -P 1 -T 30 -S postgres:
master: tps = 121020.723246 (without initial connection time)
patched: tps = 121802.527000 (without initial connection time)

Q1) Because we compile without -Wconversion, I was wondering whether we
need a safe/strict uint64 struct-like type that would catch
errors when something like the uint64 returned from WaitEventExtensionNew()
is used externally by extensions as uint32. (Because we do NOT
have -Wtruncation [too verbose?], any uint64 return value
will be cast silently to uint32 in extensions without any warning.
That may cause hangs during tests -- often tests wait for some
wait event to show up, but it won't.)
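
One way the "strict uint64 struct-like type" from Q1 could look, as a sketch only. The type and helper names below are made up for illustration; the point is that a struct cannot be implicitly converted to an integer, so a silent narrowing to uint32 becomes a compile error:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper type; the name is illustrative only. */
typedef struct WaitEventInfo64
{
    uint64_t value;
} WaitEventInfo64;

static inline WaitEventInfo64
wait_event_info64_make(uint64_t v)
{
    WaitEventInfo64 w = {v};

    return w;
}

/* Explicit, checked narrowing for callers that really want 32 bits. */
static inline uint32_t
wait_event_info64_to_u32(WaitEventInfo64 w)
{
    assert(w.value <= UINT32_MAX);
    return (uint32_t) w.value;
}

/*
 * The following would be rejected by the compiler, which is the goal:
 *
 *     uint32_t ev = wait_event_info64_make(42);
 */
```

The cost of this approach is that every call site has to unwrap the value explicitly, which is more churn than a plain typedef.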

Q2) 0002: Please ignore the 0002 quality, I did not want to sink more
time into MultiXact stuff, especially if the main concept would be
shot down. The main problem is how one can get RelFileNumber about
Relation that faces MultiXact back into LWLockReportWaitStart(). Here
I just wanted to see how much rework would be necessary (passing
variables, modifying API and so on) - in short: it introduces
LWLockAcquire() as fallback to LWLockAcquireExt(.. RelFileNumber r) ,
but still gets pretty nasty soon sadly, lots of stuff needs to be
dumb-adjusted. I would like to point out that I'm a complete
multixact/heapam noob, so it is a very dumb way of passing that info
for sure, in way too many places. Another thing we could do is
maybe have some "static uint32 lwlock_relation" inside
lwlock and just set it there (and reset it) once from within
heap*.c or similar, so then all dependent LWLock routines would OR it
in (== so it would be visible as wait_event_arg) and we would get the
involved RelFileNumber for all operations involved there (at least for
LWLocks).
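
The "static uint32 lwlock_relation" alternative could be sketched roughly as follows. This is a simplified stand-alone model, not code from 0002: the setter/resetter names are hypothetical, and it assumes the event id sits in the upper 32 bits so the context can be ORed into the lower half:

```c
#include <assert.h>
#include <stdint.h>

/* Backend-local context, set by the caller before taking LWLocks. */
static uint32_t lwlock_relation = 0;

/* Hypothetical helpers a heap*.c caller would use around its work. */
static inline void
lwlock_set_relation(uint32_t relfilenumber)
{
    lwlock_relation = relfilenumber;
}

static inline void
lwlock_reset_relation(void)
{
    lwlock_relation = 0;
}

/*
 * Value a wait-start report could publish: the event (upper half)
 * ORed with whatever relation context is currently set (lower half).
 */
static inline uint64_t
lwlock_wait_event_value(uint64_t wait_event_info)
{
    assert((wait_event_info & UINT64_C(0xFFFFFFFF)) == 0);
    return wait_event_info | lwlock_relation;
}
```

Compared with threading a RelFileNumber through every LWLockAcquire() caller, this keeps the API unchanged, but it relies on every caller remembering to reset the context.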

While thinking about cons, the only cons that I could think of is that
when we would be exposing something as 32-bits , then if the following
major release changes some internal structure/data type to be a bit
more heavy, it couldn't be exposed anymore like that (think of e.g.
64-bit OIDs?)

Any help, opinions, ideas and code/co-authors are more than welcome.

-J.

[1]: Disassembly picture of stock binary taken from PGDG on
Ubuntu/Debian x86_64 (so as used by real users), shows use of 64-bit
(rax) register, e.g. in AT&T mnemonics:
lea 0x6a4c1c(%rip), %rax // load into %rax the address of my_wait_event_info
mov (%rax), %rax // dereference ptr %rax (and put it back into rax)
mov %ebx, (%rax) // write 32-bit value of ebx to the address held in 64-bit rax

Or a different, but very similar, example with Intel mnemonics:
lea rax,[rip+0x865b09]
mov rax,QWORD PTR [rax] // notice it's already RAX and quadword
mov DWORD PTR [rax],0x0

[2]: x86_64 linux, operations on eax vs rax, that's 1.00646 under
non-ideal conditions.
Benchmarking int32_t (32-bit) additions... Operations/Second (int32_t): 3.37e+07
Benchmarking int64_t (64-bit) additions... Operations/Second (int64_t): 3.38e+07

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

3 months ago

In reply to: Jakub Wartak (#1)

Re: 64-bit wait_event and introduction of 32-bit wait_event_arg

On 08/12/2025 11:54, Jakub Wartak wrote:

While thinking about cons, the only cons that I could think of is that
when we would be exposing something as 32-bits , then if the following
major release changes some internal structure/data type to be a bit
more heavy, it couldn't be exposed anymore like that (think of e.g.
64-bit OIDs?)

Any help, opinions, ideas and code/co-authors are more than welcome.

Expanding it to 64 bit seems fine as far as performance is concerned. I
think the difficult and laborious part is to design the facilities to
make use of it. For example, if you encode a table OID in it, how do
you interpret that when you're looking at pg_stat_activity? A new
pg_explain_wait_event(bigint waitevent) that returns a text
representation of the event perhaps? Wait events can be defined in
extensions; how does an extension plug into this facility?

Inevitably, the extra 32 bits won't be enough to expose everything that
you might want to expose. Should we already think about what to do then?
For lock waits, for example, should we have another array in shared
memory with more details, and just store an offset into that array in
the extra wait event bits, for example? (We already have pg_locks, but
let's imagine we didn't. How would you design it in a green field scenario?)
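
The "array in shared memory plus an offset in the extra bits" idea could look something like the sketch below. Everything here is an assumption made for illustration: the struct fields, the slot count, and the use of an ordinary static array in place of real shared memory:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAIT_DETAIL_SLOTS 8     /* arbitrary; e.g. one slot per backend */

/* Details too large for 32 bits live in a side array. */
typedef struct WaitDetail
{
    uint32_t dboid;
    uint32_t reloid;
    uint32_t blocknum;
} WaitDetail;

static WaitDetail wait_details[WAIT_DETAIL_SLOTS];

/* Fill a slot, then publish the event with the slot index as its arg. */
static inline uint64_t
wait_event_with_detail(uint32_t wait_event_info, uint32_t slot, WaitDetail d)
{
    assert(slot < WAIT_DETAIL_SLOTS);
    wait_details[slot] = d;
    return ((uint64_t) wait_event_info << 32) | slot;
}

/* Reader side: follow the index stored in the lower 32 bits. */
static inline const WaitDetail *
wait_event_lookup_detail(uint64_t published)
{
    uint32_t slot = (uint32_t) (published & UINT64_C(0xFFFFFFFF));

    return slot < WAIT_DETAIL_SLOTS ? &wait_details[slot] : NULL;
}
```

A real design would also have to decide how a reader gets a consistent view of the slot while the owning backend is updating it; this sketch ignores that entirely.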

- Heikki

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

3 months ago

In reply to: Heikki Linnakangas (#2)

Re: 64-bit wait_event and introduction of 32-bit wait_event_arg

Hi,

On Mon, Dec 08, 2025 at 12:12:27PM +0200, Heikki Linnakangas wrote:

On 08/12/2025 11:54, Jakub Wartak wrote:

Thanks for working on this!

While thinking about cons, the only cons that I could think of is that
when we would be exposing something as 32-bits , then if the following
major release changes some internal structure/data type to be a bit
more heavy, it couldn't be exposed anymore like that (think of e.g.
64-bit OIDs?)

Any help, opinions, ideas and code/co-authors are more than welcome.

Expanding it to 64 bit seems fine as far as performance is concerned. I
think the difficult and laborious part is to design the facilities to make
use of it. For example, if you encode a table OID in it, how do you
interpret that when you're looking at pg_stat_activity? A new
pg_explain_wait_event(bigint waitevent) that returns a text representation
of the event perhaps?

I worked on something similar in the past (see [1]) and ended up providing the extra
information that way:

pid | wait_event_type | wait_event | infos
---------+-----------------+--------------+-------------------------------------------------------------
2560105 | IO | DataFileRead | {"blocknum" : "9272", "dbnode" : "5", "relnode" : "16407"}
2560135 | IO | WalSync | {"segno" : "1", "tli" : "1"}
2560138 | IO | DataFileRead | {"blocknum" : "78408", "dbnode" : "5", "relnode" : "16399"}

The "descriptions" were added in wait_event_names.txt, for example,

+DATA_FILE_READ "Waiting for a read from a relation data file." "blocknum" "dbnode" "relnode"

and the json was built only at query time. Maybe that could be an option to expose
the values and the descriptions in the same field.

[1]: /messages/by-id/aIIeX7p2cKUO6KTa@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Jakub Wartak

jakub.wartak@enterprisedb.com

3 months ago

In reply to: Heikki Linnakangas (#2)

Re: 64-bit wait_event and introduction of 32-bit wait_event_arg

Hi Heikki, thanks for having a look!

On Mon, Dec 8, 2025 at 11:12 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 08/12/2025 11:54, Jakub Wartak wrote:

While thinking about cons, the only cons that I could think of is that
when we would be exposing something as 32-bits , then if the following
major release changes some internal structure/data type to be a bit
more heavy, it couldn't be exposed anymore like that (think of e.g.
64-bit OIDs?)

Any help, opinions, ideas and code/co-authors are more than welcome.

Expanding it to 64 bit seems fine as far as performance is concerned. I
think the difficult and laborious part is to design the facilities to
make use of it.

Right, I'm very interested in hearing what could be added there/what
people want (bonus points if that is causing some performance issues
today and we do not have the area covered and exposing that would fit
in 32-bits ;) )

For example, if you encode a table OID in it, how do
you interpret that when you're looking at pg_stat_activity? A new
pg_explain_wait_event(bigint waitevent) that returns a text
representation of the event perhaps?

Well I was thinking initially just about leaving it as that (bigint),
and the interpretation would have to be provided by the operator
himself (based on docs) - not yet part of patch, because I still don't
know if the idea is worth developing further. Technically the
wait_event_arg value sometimes is going to be some OID, sometimes pid
(like in SyncRep case), most often probably it could be reason_code
(of the wait), sometimes maybe even some hash of something to make it
fit? So yeah I think we could. I like the idea of having
pg_explain_wait_event_argument(bigint)::text built-in that could add
some additional hint to what the argument really shows without looking
at the docs. Question what it should return, simple ::text like
'reason'/'pid'/'OID' or something more descriptive in English and
wouldn't English only output be a problem for translators?

The alternative would be just to have a table inside docs (for a
start?) to explain the meaning. In practice you would hunt for a
specific wait_event or have some big CASE WHEN/ELSE SQL query
to interpret the values properly.
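
As a rough illustration of the short-tag variant of pg_explain_wait_event_argument() discussed above, a lookup table could map each wait event name to an untranslated tag. The function name and the table are hypothetical; the three entries just mirror demos mentioned earlier in this thread:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct WaitEventArgDesc
{
    const char *wait_event;     /* name as shown in pg_stat_activity */
    const char *arg_kind;       /* short tag, not meant for translation */
} WaitEventArgDesc;

/* Illustrative entries only. */
static const WaitEventArgDesc wait_event_arg_descs[] = {
    {"PgSleep", "elapsed time"},
    {"SyncRep", "pid"},
    {"DataFileRead", "fd"},
};

/* Return the tag describing what wait_event_arg holds for this event. */
static const char *
explain_wait_event_arg(const char *wait_event)
{
    size_t n = sizeof(wait_event_arg_descs) / sizeof(wait_event_arg_descs[0]);

    assert(wait_event != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(wait_event_arg_descs[i].wait_event, wait_event) == 0)
            return wait_event_arg_descs[i].arg_kind;
    }
    return "unknown";
}
```

Short fixed tags like these sidestep the translation question, at the price of being less descriptive than full English sentences.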

Wait events can be defined in extensions; how does an extension plug into this facility?

I have not given extensions a lot of thought or coverage yet, but the
answer is probably like: well, they don't seem to plug heavily into
this, but I think one could in extension just use
WaitEventExtensionNew() / pgstat_report_wait_start() as usual and
later logically OR some 32-bit number, however the interpretation of
the wait_event_arg would have to be provided by the extension itself
(via docs) I guess. Would that approach be acceptable, or were You
having some other idea? Maybe with Your idea of having
pg_explain_wait_event_argument(), then we would have to alter to
WaitEventExtensionNew(const char *wait_event_name) and add something
like 'const char *wait_event_arg_description' there?

Inevitably, the extra 32 bits won't be enough to expose everything that
you might want to expose. Should we already think about what to do then?

Well I wanted to stick to exposing only stuff that will _always_ fit
32-bits. If additional/more detailed instrumentation would be
necessary then separate monitoring/observability/variables/subsystem
probably should be built for that specific use case. So if that
information can become over 32-bit, it should not be encoded into
wait_event_arg, just to avoid debating performance regressions for any
other additional wait-event infrastructure. I simply do not want to
open a can of worms: see Bertrand tried that in [1], but I don't want
this $thread to follow that route where Andres and Robert expressed
their concerns earlier. E.g. one of the key questions is that I'm
somehow lost if we would like to continue the earlier 56-bit [2] /
64-bit OID/RelFileNode attempt(s). If the project wants to continue
with that, then probably we couldn't express ::relation id as 32-bit
wait_event_arg or maybe I am missing something. (ofc, we could hash
potential 64-bit OID back into 32-bit OID one day, but it sounds like
a hack, doesn't it?)

For lock waits, for example, should we have another array in shared
memory with more details, and just store an offset into that array in
the extra wait event bits, for example? (we already have pg_locks, but
let's imagine we didn't. How would you design it in a green field scenario?)

If we didn't have pg_locks, I would probably stick with encoding the
mode, maybe mode|granted|fastpath (assuming OIDs are no-go).

Some brainstorming and other crazy(?) ideas how we could expose some
intrinsic PG behavior:
- writing while reading (AKA setting hint bits) - could be exposed as
reason_code for write-like wait events? (e.g. for IO/WALWrite we could
encode reason_code?)
- same as above (hint bits), but for CLOG/SLRU, and also for others?
Maybe we could expose which SLRU exactly we are reading/writing in
IO/SLRU_READ|WRITE waits and further encode some "reason" there too?
- still for IO/WALWrite, we could also add another reason_code bit
meaning: are we writing a full-page image (FPI) or not? (that would make
wait_event_arg for IO/WALWrite a bitmap: e.g. writing_FPI |
writing_hintbits)
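
The bitmap idea for IO/WALWrite could be sketched like this. The flag names and bit positions are invented for the example; only the "writing_FPI | writing_hintbits" shape comes from the list above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reason bits carried in wait_event_arg for IO/WALWrite. */
#define WALWRITE_REASON_FPI       (UINT32_C(1) << 0)
#define WALWRITE_REASON_HINTBITS  (UINT32_C(1) << 1)
#define WALWRITE_REASON_ALL \
    (WALWRITE_REASON_FPI | WALWRITE_REASON_HINTBITS)

/* Compose the argument from the individual reasons. */
static inline uint32_t
walwrite_reason_arg(bool writing_fpi, bool writing_hintbits)
{
    uint32_t arg = 0;

    if (writing_fpi)
        arg |= WALWRITE_REASON_FPI;
    if (writing_hintbits)
        arg |= WALWRITE_REASON_HINTBITS;
    return arg;
}

/* Test one known reason bit in a decoded wait_event_arg. */
static inline bool
walwrite_reason_has(uint32_t arg, uint32_t reason_bit)
{
    assert(reason_bit != 0 && (reason_bit & ~WALWRITE_REASON_ALL) == 0);
    return (arg & reason_bit) != 0;
}
```

On the SQL side such a value could be decoded with the integer bitwise AND operator, e.g. wait_event_arg & 1 for the first bit.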

-J.

[1]: /messages/by-id/lt6n664ijbmfftnuv3bgvt47q7kjz4tflu4kg3ingv6njjtvld@kesknxnidemo
[2]: /messages/by-id/CA+TgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg@mail.gmail.com

Jakub Wartak

jakub.wartak@enterprisedb.com

3 months ago

In reply to: Bertrand Drouvot (#3)

Re: 64-bit wait_event and introduction of 32-bit wait_event_arg

On Mon, Dec 8, 2025 at 12:27 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Mon, Dec 08, 2025 at 12:12:27PM +0200, Heikki Linnakangas wrote:

On 08/12/2025 11:54, Jakub Wartak wrote:

Thanks for working on this!

While thinking about cons, the only cons that I could think of is that
when we would be exposing something as 32-bits , then if the following
major release changes some internal structure/data type to be a bit
more heavy, it couldn't be exposed anymore like that (think of e.g.
64-bit OIDs?)

Any help, opinions, ideas and code/co-authors are more than welcome.

Expanding it to 64 bit seems fine as far as performance is concerned. I
think the difficult and laborious part is to design the facilities to make
use of it. For example, if you encode a table OID in it, how do you
interpret that when you're looking at pg_stat_activity? A new
pg_explain_wait_event(bigint waitevent) that returns a text representation
of the event perhaps?

I worked on something similar in the past (see [1]) and ended up providing the extra
information that way:

pid | wait_event_type | wait_event | infos
---------+-----------------+--------------+-------------------------------------------------------------
2560105 | IO | DataFileRead | {"blocknum" : "9272", "dbnode" : "5", "relnode" : "16407"}
2560135 | IO | WalSync | {"segno" : "1", "tli" : "1"}
2560138 | IO | DataFileRead | {"blocknum" : "78408", "dbnode" : "5", "relnode" : "16399"}

The "descriptions" were added in wait_event_names.txt, for example,

+DATA_FILE_READ "Waiting for a read from a relation data file." "blocknum" "dbnode" "relnode"

and the json was built only at query time. Maybe that could be an option to expose
the values and the descriptions in the same field.

[1]: /messages/by-id/aIIeX7p2cKUO6KTa@ip-10-97-1-34.eu-west-3.compute.internal

Hi Bertrand, thanks for the link. I've responded to Heikki, but Your
thread (and the feedback You got) was one of the reasons not to
add anything there beyond just using a more powerful (longer) register
for free.

-J.

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

3 months ago

In reply to: Jakub Wartak (#5)

Re: 64-bit wait_event and introduction of 32-bit wait_event_arg

Hi,

On Tue, Dec 09, 2025 at 10:14:19AM +0100, Jakub Wartak wrote:

On Mon, Dec 8, 2025 at 12:27 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

pid | wait_event_type | wait_event | infos
---------+-----------------+--------------+-------------------------------------------------------------
2560105 | IO | DataFileRead | {"blocknum" : "9272", "dbnode" : "5", "relnode" : "16407"}
2560135 | IO | WalSync | {"segno" : "1", "tli" : "1"}
2560138 | IO | DataFileRead | {"blocknum" : "78408", "dbnode" : "5", "relnode" : "16399"}

The "descriptions" were added in wait_event_names.txt, for example,

+DATA_FILE_READ "Waiting for a read from a relation data file." "blocknum" "dbnode" "relnode"

Hi Bertrand, thanks for the link. I've responded to Heikki, but Your
thread (and the feedback You got) was one of the reasons not to
add anything there beyond just using a more powerful (longer) register
for free.

Yeah exactly. My message in the current thread was only about proposing a way
to add a text description to the extra value.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Jakub Wartak

jakub.wartak@enterprisedb.com

about 2 months ago

In reply to: Jakub Wartak (#4)

Re: 64-bit wait_event and introduction of 32-bit wait_event_arg

On Tue, Dec 9, 2025 at 10:11 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

Hi Heikki, thanks for having a look!

On Mon, Dec 8, 2025 at 11:12 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 08/12/2025 11:54, Jakub Wartak wrote:

While thinking about cons, the only cons that I could think of is that
when we would be exposing something as 32-bits , then if the following
major release changes some internal structure/data type to be a bit
more heavy, it couldn't be exposed anymore like that (think of e.g.
64-bit OIDs?)

Any help, opinions, ideas and code/co-authors are more than welcome.

Expanding it to 64 bit seems fine as far as performance is concerned. I
think the difficult and laborious part is to design the facilities to
make use of it.

Right, I'm very interested in hearing what could be added there/what
people want (bonus points if that is causing some performance issues
today and we do not have the area covered and exposing that would fit
in 32-bits ;) )

OK, so v3 is attached. Changes in v3:
- added proper RelFileNumber as wait_event_arg for DataFileRead/Write/etc
waits instead of simply using "filedescriptor" as wait_event_arg
- cfbot complained hard on win32 due to lack of support of uint64 for enums
("warning C4309: 'initializing': truncation of constant value"), so I've
tried two ways to force the enum into 64-bit ints instead of just the
default (32-bit int). However, neither trick seems to help the MSVC case:
a) `typedef enum : uint64_t` causes "error C2332: 'enum': missing tag name"
b) putting `PG_WAIT_ACTIVITY_MAX = 0xFFFFFFFFFFFFFFFFULL` at the end of
the enum also doesn't work
so I had to get rid of enum{} and stick to #defines to make cfbot happy there

- pass RelFileNumber/tablespaceId as wait_event_arg for recovery conflict waits
(earlier you would get that information only from the log, but here we pinpoint
the exact RelFileNumber for which startup is waiting), e.g. use case demo: we run
some long analytical query on standby (while read/write pgbench is hitting the
primary hard and we run without hot_standby_feedback):

s1) "SELECT count(*) FROM pgbench_accounts a CROSS JOIN pgbench_accounts b;"

postgres=# select relname from pg_class where relfilenode = 16427;
relname
------------------
pgbench_branches

s1) after some time (max_standby_streaming_delay) we get:
ERROR: canceling statement due to conflict with recovery

- added description of wait_event_arg to wait event infrastructure
(pg_wait_events view and docs)

- if there's high I/O on SLRU we can get data from pg_stat_slru, however
previously one couldn't pinpoint exactly which SLRU type affects which
backend, so I thought I'd add the class of SLRU to IO/SLRU{Read,Write} as
wait_event_arg to make it easier on multitenant DBs, e.g. it shows:

postgres=# select waiteventarg_description from pg_wait_events where name='SlruRead';
waiteventarg_description
---------------------------------------------------------------------------------------
SlruType: unknown(0), [..] multixactoffset (5), multixactmembers(6), serialializable(7)

-- \d will show FK (so we connect the dots with less ambiguity about
FK <-> multixacts):
postgres=# \d+ users
[..]
Foreign-key constraints:
"fk1" FOREIGN KEY (loc_id) REFERENCES locations(loc_id)

postgres=# \d+ locations
[..]
Referenced by:
TABLE "users" CONSTRAINT "fk1" FOREIGN KEY (loc_id) REFERENCES
locations(loc_id)
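
Regarding the MSVC enum problem in the v3 change list above: a sketch of what "stick to #defines" can look like, using UINT64_C() so the constants are at least 64 bits wide regardless of how the compiler sizes enums. The DEMO_ names and the bit layout are assumptions made for this example, not the patch's definitions:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout: the old 32-bit class/event value is shifted into
 * the upper half, leaving the lower 32 bits free for wait_event_arg.
 */
#define DEMO_WAIT_CLASS_ACTIVITY  (UINT64_C(0x05000000) << 32)

/* Build an event constant within a class; event_id uses 24 bits here. */
static inline uint64_t
demo_wait_event(uint64_t wait_class, uint32_t event_id)
{
    assert(event_id <= UINT32_C(0x00FFFFFF));
    return wait_class | ((uint64_t) event_id << 32);
}
```

The trade-off is losing the enum's type name and compiler-checked switch coverage, which is presumably why enums were used in the first place.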

For example, if you encode a table OID in it, how do
you interpret that when you're looking at pg_stat_activity? A new
pg_explain_wait_event(bigint waitevent) that returns a text
representation of the event perhaps?

Well I was thinking initially[..irrelevant, so snipped out]

Right, so v3 has a built-in self-description of wait_event_arg in
pg_wait_events (and the docs contain such details too)

[..]

Inevitably, the extra 32 bits won't be enough to expose everything that
you might want to expose. Should we already think about what to do then?

Well I wanted to stick to exposing only stuff that will _always_ fit
32-bits. If additional/more detailed instrumentation would be
necessary then separate monitoring/observability/variables/subsystem
probably should be built for that specific use case. So if that
information can become over 32-bit, it should not be encoded into
wait_event_arg, just to avoid debating performance regressions for any
other additional wait-event infrastructure. I simply do not want to
open a can of worms: see Bertrand tried that in [1], but I don't want
this $thread to follow that route where Andres and Robert expressed
their concerns earlier. E.g. one of the key questions is that I'm
somehow lost if we would like to continue the earlier 56-bit [2] /
64-bit OID/RelFileNode attempt(s). If the project wants to continue
with that, then probably we couldn't express ::relation id as 32-bit
wait_event_arg or maybe I am missing something. (ofc, we could hash
potential 64-bit OID back into 32-bit OID one day, but it sounds like
a hack, doesn't it?)

Questions:

1. Question about 56-bit relfilenode idea [1] (05d4cbf9b6ba, reverted by
a448e49bcbe): can I assume that it is dead in the water and can I assume
that >> 33-bits RelFileNode is not going to happen?
(if my 64-bit wait_events with 32-bits for wait_events_args use
RelFileNode -- that makes it incompatible)

2. Please ignore the 0002 quality (multixact), but I would be grateful for
feedback on whether extending MultiXact routines like this (to carry
RelFileNumber) is OK or not. And if not, what would be a better way to
pass through such information?

-J.

[1]: /messages/by-id/CA+TgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg@mail.gmail.com

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

about 2 months ago

In reply to: Jakub Wartak (#7)

Re: 64-bit wait_event and introduction of 32-bit wait_event_arg

Hi,

On Fri, Jan 09, 2026 at 11:34:09AM +0100, Jakub Wartak wrote:

On Tue, Dec 9, 2025 at 10:11 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

Hi Heikki, thanks for having a look!

On Mon, Dec 8, 2025 at 11:12 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 08/12/2025 11:54, Jakub Wartak wrote:

While thinking about cons, the only cons that I could think of is that
when we would be exposing something as 32-bits , then if the following
major release changes some internal structure/data type to be a bit
more heavy, it couldn't be exposed anymore like that (think of e.g.
64-bit OIDs?)

Any help, opinions, ideas and code/co-authors are more than welcome.

Expanding it to 64 bit seems fine as far as performance is concerned. I
think the difficult and laborious part is to design the facilities to
make use of it.

Right, I'm very interested in hearing what could be added there/what
people want (bonus points if that is causing some performance issues
today and we do not have the area covered and exposing that would fit
in 32-bits ;) )

OK, so v3 is attached. Changes in v3:

Thanks for the new version!

It looks like it needs a rebase. Also, FWIW, a quick scan shows a
number of "XXX" comments and commented-out elog calls (probably used during
your own debugging?).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Jakub Wartak

jakub.wartak@enterprisedb.com

about 2 months ago

In reply to: Bertrand Drouvot (#8)

Re: 64-bit wait_event and introduction of 32-bit wait_event_arg

On Wed, Jan 14, 2026 at 9:38 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Fri, Jan 09, 2026 at 11:34:09AM +0100, Jakub Wartak wrote:

On Tue, Dec 9, 2025 at 10:11 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

Hi Heikki, thanks for having a look!

On Mon, Dec 8, 2025 at 11:12 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 08/12/2025 11:54, Jakub Wartak wrote:

While thinking about cons, the only cons that I could think of is that
when we would be exposing something as 32-bits , then if the following
major release changes some internal structure/data type to be a bit
more heavy, it couldn't be exposed anymore like that (think of e.g.
64-bit OIDs?)

Any help, opinions, ideas and code/co-authors are more than welcome.

Expanding it to 64 bit seems fine as far as performance is concerned. I
think the difficult and laborious part is to design the facilities to
make use of it.

Right, I'm very interested in hearing what could be added there/what
people want (bonus points if that is causing some performance issues
today and we do not have the area covered and exposing that would fit
in 32-bits ;) )

OK, so v3 is attached. Changes in v3:

Thanks for the new version!

It looks like it needs a rebase. Also, FWIW, a quick scan shows a
number of "XXX" comments and commented-out elog calls (probably used during
your own debugging?).

Yes, indeed, that's intentional right now - it's more like a draft
rather than something that should be polished.

To be honest I would like to avoid sinking more time into it if the
idea itself gets shot down or there is opposition, e.g. due to concerns
about exposing 32-bit relfilenodes that way (see that 56-bit relfilenode
idea).

-J.

#10

Jakub Wartak

jakub.wartak@enterprisedb.com

23 days ago

In reply to: Jakub Wartak (#9)

Re: 64-bit wait_event and introduction of 32-bit wait_event_arg

On Wed, Jan 14, 2026 at 9:56 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

On Wed, Jan 14, 2026 at 9:38 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Fri, Jan 09, 2026 at 11:34:09AM +0100, Jakub Wartak wrote:

On Tue, Dec 9, 2025 at 10:11 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

Hi Heikki, thanks for having a look!

On Mon, Dec 8, 2025 at 11:12 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 08/12/2025 11:54, Jakub Wartak wrote:

While thinking about cons, the only cons that I could think of is that
when we would be exposing something as 32-bits , then if the following
major release changes some internal structure/data type to be a bit
more heavy, it couldn't be exposed anymore like that (think of e.g.
64-bit OIDs?)

Any help, opinions, ideas and code/co-authors are more than welcome.

Expanding it to 64 bit seems fine as far as performance is concerned. I
think the difficult and laborious part is to design the facilities to
make use of it.

Right, I'm very interested in hearing what could be added there/what
people want (bonus points if that is causing some performance issues
today and we do not have the area covered and exposing that would fit
in 32-bits ;) )

OK, so v3 is attached. Changes in v3:

Thanks for the new version!

It looks like it needs a rebase. Also, FWIW, a quick scan shows a
number of "XXX" comments and commented-out elog calls (probably used during
your own debugging?).

Yes, indeed, that's intentional right now - it's more like a draft
rather than something that should be polished.

To be honest I would like to avoid sinking more time into it if the
idea itself gets shot down or there is opposition, e.g. due to concerns
about exposing 32-bit relfilenodes that way (see that 56-bit relfilenode
idea).

Goodafter gentlemen,

I was considering marking this as Rejected/RwF and giving up, because
RelFileNodes could become > 32 bits, which kinda goes against the
main intention of this patch (showing the relations involved
in some complex LWLock/MultiXact performance scenarios).

In offline discussions with Andres and Robert I've learned that:
1. there's still room that RelFileNodes could become 56-bits one day
2. introducing another uint64 just for wait_events_arg is a no-go zone
due to performance concerns.
3. exposing something like "relfilenode % (2^32)" is seen as a hack and could
cause issues (problems with interpretation/conflicts in future when
RelFileNode would be bigger)

Anyway, today this WIP/PoC patchset gives:

Summary of changes since previous version:

- Removed all relfilenode id references including
ProcSleep()->WaitLatch(..PG_WAIT_LOCK | locktag_field2 );
as we cannot take locktag_type_field2 (which maps to reloid, set by
SET_LOCKTAG_RELATION)

- In pgstat_report_wait_end() change volatile direct set to zero with
more proper: pg_atomic_write_u64(..,0);

- separated patch for SyncRepWaitForLSN() as I have plenty of performance
concerns there (with abnormally high max_wal_senders). I could reduce those
spinlocks happen not more often than every N iterations as today
there is a full scan
under spinlocks every time the latch is reset, but how often to do this
scan then?

- added exposing Buffer# (one can lookup relation via pg_buffercache),
idea by Andres, it seems to work (simulated with fetching from cursor):

postgres=# select
    pg_filenode_relation(0, relfilenode)::regclass,
    pinning_backends
  from pg_buffercache where bufferid = 225;

 pg_filenode_relation | pinning_backends
----------------------+------------------
 pin_test             |                2

- added exposing Timeout/SpinDelay, not sure if that would be helpful
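
To illustrate the pgstat_report_wait_end() item above: a minimal sketch of publishing and clearing the 64-bit word with a single atomic store each. It uses C11 <stdatomic.h> as a stand-in for pg_atomic_write_u64(); DemoProc and the demo_* functions are hypothetical names for illustration, not code from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-backend slot holding the wait event. */
typedef struct DemoProc
{
    _Atomic uint64_t wait_event_info;   /* 0 means "not waiting" */
} DemoProc;

/* Publish the wait event and its 32-bit argument with one 64-bit store. */
static void
demo_report_wait_start(DemoProc *proc, uint64_t wait_event_info)
{
    assert(wait_event_info != 0);       /* 0 is reserved for "not waiting" */
    atomic_store_explicit(&proc->wait_event_info, wait_event_info,
                          memory_order_relaxed);
}

/* Clear the slot; event and argument disappear together. */
static void
demo_report_wait_end(DemoProc *proc)
{
    atomic_store_explicit(&proc->wait_event_info, 0, memory_order_relaxed);
}

/* What a monitoring reader would do: one 64-bit load, then decode. */
static uint64_t
demo_read_wait_event(DemoProc *proc)
{
    return atomic_load_explicit(&proc->wait_event_info, memory_order_relaxed);
}
```

Because the event and its argument travel in one word, a reader using a single 64-bit load sees either the old pair or the new pair, never a mix of the two.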

What's left:
- Earlier Heikki raised the question "Wait events can be defined in extensions;
how does an extension plug into this facility?" - that's still unanswered.
I think they could just OR 32-bit value themselves, but maybe we could
just provide a way to plug into pg_get_wait_events().waiteventarg_description?
- docs
- of course it could be extended with some reporting if one finds further
ideas

-J.

#11

Michael Paquier

michael@paquier.xyz

22 days ago

In reply to: Jakub Wartak (#10)

Re: 64-bit wait_event and introduction of 32-bit wait_event_arg

On Thu, Feb 12, 2026 at 01:42:23PM +0100, Jakub Wartak wrote:

What's left:
- Earlier Heikki raised the question "Wait events can be defined in extensions;
how does an extension plug into this facility?" - that's still unanswered.

Reserving the full 8 bytes to the callers of WaitEventExtensionNew()
and WaitEventInjectionPointNew() would be an error, because we would
forever lock down the possibility for extensions to set at will the 4
extra bytes that become available when setting some extra data in
parallel of a wait event name. The results of these routines should
still be 4 bytes for the "static" part of the wait event names, not 8.

I think they could just OR 32-bit value themselves, but maybe we could
just provide a way to plug into pg_get_wait_events().waiteventarg_description?

The value provided back to pg_stat_activity would be a 4-byte integer
under this design, whose interpretation is up to the client, I guess,
with a filter based on the wait event name found (likely a CASE/ELSE
to force casts back to a text value at the end in most cases?). That
may be annoying for client applications, though, but perhaps
acceptable as this provides extra information with a single atomic
write.

At the end, the way these 4 extra bytes can be set by extensions is an
API problem for me, and I suspect that the correct way to extend
things, on top of forcing the use of 4 bytes for the ID of the fixed
event ID (perhaps just define a type here anyway?), would be to patch
the most popular APIs that extensions currently use to let them set
the value they want for the extra 4 bytes. The first choice that
comes into mind here is the family of WaitLatchOrSocket() APIs, that
could have an extra argument with a uint32 for the extra data. That's
a popular one among extension developers.
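
A rough sketch of the API shape described above, where the fixed event ID stays 4 bytes and the extra data is a separate uint32 parameter, so callers never do the bit arithmetic themselves. DemoWaitEventId, demo_compose_wait_event() and demo_wait_latch() are hypothetical names, and the "upper half = event, lower half = extra data" layout is an assumption for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical dedicated type for the fixed part: class | event id, 4 bytes. */
typedef uint32_t DemoWaitEventId;

/* Stand-in for the backend's 64-bit wait event slot. */
static uint64_t demo_last_reported;

/* Combine the fixed 4-byte event ID with 4 bytes of extra data. */
static uint64_t
demo_compose_wait_event(DemoWaitEventId id, uint32_t extra)
{
    return ((uint64_t) id << 32) | extra;
}

/*
 * Hypothetical wrapper in the spirit of the WaitLatchOrSocket() family:
 * the extra data is its own argument instead of being OR-ed in by callers.
 */
static int
demo_wait_latch(long timeout_ms, DemoWaitEventId id, uint32_t extra)
{
    (void) timeout_ms;                  /* unused in this sketch */
    assert(id != 0);                    /* 0 would mean "not waiting" */
    demo_last_reported = demo_compose_wait_event(id, extra);
    /* A real implementation would block here and clear the slot afterwards. */
    return 0;
}
```

With this shape, an extension that passes only its registered 4-byte ID and a zero extra value behaves as before, and the types of the two parameters make the split explicit at every call site.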

By the way, patch 0001 includes a log file from pg_plan_advice with
some information I suspect you did not intend to send..
--
Michael

#12

Jakub Wartak

jakub.wartak@enterprisedb.com

19 days ago

In reply to: Michael Paquier (#11)

Re: 64-bit wait_event and introduction of 32-bit wait_event_arg

On Fri, Feb 13, 2026 at 5:46 AM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Feb 12, 2026 at 01:42:23PM +0100, Jakub Wartak wrote:

Hi Michael, thanks for taking a look at this!

What's left:
- Earlier Heikki raised the question "Wait events can be defined in extensions;
how does an extension plug into this facility?" - that's still unanswered.

Reserving the full 8 bytes to the callers of WaitEventExtensionNew()
and WaitEventInjectionPointNew() would be an error, because we would
forever lock down the possibility for extensions to set at will the 4
extra bytes that become available when setting some extra data in
parallel of a wait event name. The results of these routines should
still be 4 bytes for the "static" part of the wait event names, not 8.

While trying to understand this, I found an incorrect assumption in the
patch on my side that might have misled you: I had overlooked the fact that
all custom wait events land in the PG_WAIT_EXTENSION class, so in
WaitEventCustomNew() we need to shift the eventId obtained from nextId++
into the proper place in the 64-bit word (this leaves the lower 32 bits,
reserved for wait_event_arg, as zero in the value returned by those routines)

.. and then, well, the registration of the new custom wait event is something
distinct from starting the waiting on it, right? I mean the extension
could:

// it will contain PG_WAIT_EXTENSION | shifted_eventId_from_nextId
// + zero on all 32-bit lower bits (for wait_event_arg)
uint64 custom_wait_event = WaitEventExtensionNew("custom");
libpqsrv_connect(..,custom_wait_event);
..
// and we can still add 32-bit argument there
libpqsrv_connect(..,custom_wait_event | some_wait_event_arg);

Since the earlier code was probably misleading, could I ask you to
re-evaluate the above concern based on the fixed code?
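
The registration-versus-waiting split can be sketched as below. This is a minimal illustration, assuming the class and event ID occupy the upper 32 bits and wait_event_arg the lower 32 bits; the DEMO_* constants and demo_* helpers are hypothetical and only mirror the idea, not the patch itself.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed class bits for the extension class, as in the 32-bit scheme. */
#define DEMO_PG_WAIT_EXTENSION  UINT64_C(0x07000000)
/* Lower half of the 64-bit word, reserved for wait_event_arg. */
#define DEMO_ARG_MASK           UINT64_C(0xFFFFFFFF)

static uint16_t demo_next_id = 1;       /* stand-in for nextId */

/* Registration: class | event id, shifted up; the argument half stays zero. */
static uint64_t
demo_wait_event_extension_new(void)
{
    uint64_t event = DEMO_PG_WAIT_EXTENSION | demo_next_id++;

    return event << 32;
}

/* Waiting: OR a 32-bit argument into a registered event. */
static uint64_t
demo_with_arg(uint64_t wait_event, uint32_t arg)
{
    assert((wait_event & DEMO_ARG_MASK) == 0);  /* must come from registration */
    return wait_event | arg;
}

/* Decoding, as a monitoring view would do it. */
static uint32_t
demo_wait_event_arg(uint64_t info)
{
    return (uint32_t) (info & DEMO_ARG_MASK);
}

static uint64_t
demo_wait_event_only(uint64_t info)
{
    return info & ~DEMO_ARG_MASK;
}
```

The value returned at registration time can be reused for every wait, with a different argument OR-ed in each time, and stripping the lower half always gives back the registered event.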

I think they could just OR 32-bit value themselves, but maybe we could
just provide a way to plug into pg_get_wait_events().waiteventarg_description?

The value provided back to pg_stat_activity would be a 4-byte integer
under this design, whose interpretation is up to the client, I guess,
with a filter based on the wait event name found (likely a CASE/ELSE
to force casts back to a text value at the end in most cases?). That
may be annoying for client applications, though, but perhaps
acceptable as this provides extra information with a single atomic
write.

Yes, today we just return the lower half of the 64 bits as
pg_stat_activity.wait_event_arg::integer. It can be interpreted
by the SQL query using CASE/ELSE if necessary (or just by a human looking
directly at pg_get_wait_events().waiteventarg_description -- that is what
is implemented today for the IO/Slru* wait events and their wait_event_arg),
e.g.:

postgres=# select type, substring(name, 1, 20) wait,
    waiteventarg_description as desc
  from pg_get_wait_events() where type = 'IO' and name like 'Slru%';
 type |     wait      | desc
------+---------------+------
 IO   | SlruFlushSync | SlruType: unknown(0), notify(1), clog(2), subtrans(3), committs(4), multixactoffset (5), multixactmembers(6), serialializable(7)
 IO   | SlruRead      | SlruType: unknown(0), notify(1), clog(2), subtrans(3), committs(4), multixactoffset (5), multixactmembers(6), serialializable(7)
[..]

so one could of course write something like this to interpret it:
SELECT *, CASE WHEN wait_event_arg=0 THEN 'unknown'
               WHEN wait_event_arg=1 THEN 'notify'
               [..]
               ELSE 'other'
          END AS slru_type
FROM pg_stat_activity WHERE wait_event LIKE 'Slru%';
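
A client tool written in C could do the same decoding as the CASE expression with a lookup table. This is only a sketch: the names follow the waiteventarg_description text quoted above, and demo_slru_type_names and demo_slru_type_name() are hypothetical helpers, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * SlruType values 0..7 as listed in waiteventarg_description.
 * (The quoted output spells the last one "serialializable".)
 */
static const char *const demo_slru_type_names[] = {
    "unknown", "notify", "clog", "subtrans", "committs",
    "multixactoffset", "multixactmembers", "serializable"
};

static_assert(sizeof(demo_slru_type_names) / sizeof(demo_slru_type_names[0]) == 8,
              "one name per SlruType value 0..7");

/* Map a wait_event_arg of an IO/Slru* wait event to a readable name. */
static const char *
demo_slru_type_name(uint32_t wait_event_arg)
{
    size_t n = sizeof(demo_slru_type_names) / sizeof(demo_slru_type_names[0]);

    /* Anything outside the known range plays the role of ELSE 'other'. */
    return wait_event_arg < n ? demo_slru_type_names[wait_event_arg] : "other";
}
```

The table only makes sense after filtering on the wait event name, since the same integer means something different for every other wait event.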

At the end, the way these 4 extra bytes can be set by extensions is an
API problem for me, and I suspect that the correct way to extend
things, on top of forcing the use of 4 bytes for the ID of the fixed
event ID (perhaps just define a type here anyway?), would be to patch
the most popular APIs that extensions currently use to let them set
the value they want for the extra 4 bytes. The first choice that
comes into mind here is the family of WaitLatchOrSocket() APIs, that
could have an extra argument with a uint32 for the extra data. That's
a popular one among extension developers.

Right now in the patchset, WaitLatchOrSocket() and WaitLatch() take a
uint64 argument. I've done that to minimize the patchset footprint, so
that e.g. I can still do:
- rc = WaitLatch(MyLatch, ..., WAIT_EVENT_SYNC_REP);
+ rc = WaitLatch(MyLatch, ..., WAIT_EVENT_SYNC_REP | wait_event_arg_pid);
(a uint64 carrying the class and event, OR-ed with a 32-bit value in the
lower half, e.g. a pid here) without many unnecessary internal changes,
as WaitLatch() is used heavily.

BTW: note that all of the WAIT_EVENT_* values are now 64-bit, which is
what allows it to work that way.

One problem with that approach is that extensions calling
WaitLatchOrSocket()/WaitLatch() would have to be altered (extension code
can still pass a uint32 silently - it will compile, but won't show a
proper wait event at runtime).

Sadly this won't be detected without -Wconversion (the uint32 is silently
converted to uint64), and I don't have any other idea how the silent
conversion could be detected during extension compilation, as even
`#pragma GCC diagnostic warning "-Wconversion"` won't be effective in
3rd party code using this.
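
The failure mode can be reproduced in a few lines. The sketch below assumes the class and event ID live in the upper 32 bits; the demo_* names are hypothetical. The runtime check is only one possible idea for catching unconverted callers in assert-enabled builds, and it relies on the further assumption that every valid event has a nonzero upper half.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: upper 32 bits = class | event id, lower 32 bits = arg. */
static uint32_t
demo_event_part(uint64_t info)
{
    return (uint32_t) (info >> 32);
}

static uint32_t
demo_arg_part(uint64_t info)
{
    return (uint32_t) info;
}

/*
 * A nonzero value with an empty upper half looks like an old-style 32-bit
 * wait event that was widened at the call site.
 */
static int
demo_looks_like_legacy_value(uint64_t info)
{
    return info != 0 && demo_event_part(info) == 0;
}

/* Stand-in for an API that takes the new 64-bit value. */
static uint64_t
demo_report_checked(uint64_t wait_event_info)
{
    assert(!demo_looks_like_legacy_value(wait_event_info));
    return wait_event_info;
}
```

Assigning an old-style value such as 0x07000001 to a uint64 compiles without complaint, but the event half decodes as zero and the old class and ID end up in the argument half, which is why the wait event shown at runtime is wrong.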

By the way, patch 0001 includes a log file from pg_plan_advice with
some information I suspect you did not intend to send..

Oops, fixed, thanks.

New version attached; it also fixes the dblink bug where proper wait
events were not shown (a missing uint32 to uint64 change).

Probably one remaining thing is the API around pg_get_wait_events(). To
me it looks like:
- extensions have no way to pass a description (it's just the static
"Waiting for custom wait event \"%s\" defined by extension module")
- therefore they also lack a way to pass waiteventarg_description

I think there are two ways forward:
a) either display static text for waiteventarg_description like:
"consult extension documentation for details"

b) or we could add a more extended way to initialize:
WaitEventExtensionNew(
const char *wait_event_name,
const char *wait_event_desc,
const char *wait_event_arg_desc);

but I seriously doubt it would be used by anyone; maybe I'm wrong.
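
For illustration, options (a) and (b) can coexist: a registration call that accepts optional descriptions, with the static text used when an extension passes none. The sketch below uses a hypothetical fixed-size registry (DemoCustomWaitEvent, demo_register_custom_event()) and says nothing about how the real shared-memory registration would store the strings.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_CUSTOM_EVENTS  16
#define DEMO_DEFAULT_ARG_DESC   "consult extension documentation for details"

/* Hypothetical registry entry for one custom wait event. */
typedef struct DemoCustomWaitEvent
{
    const char *name;
    const char *desc;       /* may be NULL */
    const char *arg_desc;   /* may be NULL */
} DemoCustomWaitEvent;

static DemoCustomWaitEvent demo_events[DEMO_MAX_CUSTOM_EVENTS];
static size_t demo_event_count = 0;

/* Option (b): extended registration; returns the index of the new entry. */
static size_t
demo_register_custom_event(const char *name, const char *desc,
                           const char *arg_desc)
{
    assert(name != NULL);
    assert(demo_event_count < DEMO_MAX_CUSTOM_EVENTS);
    demo_events[demo_event_count] = (DemoCustomWaitEvent) {name, desc, arg_desc};
    return demo_event_count++;
}

/* Option (a) as the fallback when no argument description was supplied. */
static const char *
demo_wait_event_arg_description(size_t idx)
{
    assert(idx < demo_event_count);
    return demo_events[idx].arg_desc != NULL
        ? demo_events[idx].arg_desc
        : DEMO_DEFAULT_ARG_DESC;
}
```

Extensions that never bother with descriptions would keep calling the registration with NULL and still get a sensible row in pg_get_wait_events().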

-J.

#13

Jakub Wartak

jakub.wartak@enterprisedb.com

5 days ago

In reply to: Jakub Wartak (#12)

Re: 64-bit wait_event and introduction of 32-bit wait_event_arg

v6 attached, rebased due to 412f78c66eedbe9.

-J.
