Enhancing Memory Context Statistics Reporting

Started by Rahila Syed, over 1 year ago. 110 messages.
#1Rahila Syed
rahilasyed90@gmail.com

Hi,

PostgreSQL provides following capabilities for reporting memory contexts
statistics.
1. pg_get_backend_memory_contexts(); [1]
2. pg_log_backend_memory_contexts(pid); [2]

[1] provides a view of memory context statistics for a local backend,
while [2] prints the memory context statistics of any backend or auxiliary
process to the PostgreSQL logs. Although [1] offers detailed statistics,
it is limited to the local backend, restricting its use to PostgreSQL
client backends only.
On the other hand, [2] provides the statistics for all backends but
logs them in a file,
which may not be convenient for quick access.

I propose enhancing memory context statistics reporting by combining these
capabilities and offering a view of memory statistics for all PostgreSQL
backends
and auxiliary processes.

Attached is a patch that implements this functionality. It introduces a SQL
function
that takes the PID of a backend as an argument, returning a set of records,
each containing statistics for a single memory context. The underlying C
function
sends a signal to the backend and waits for it to publish its memory
context statistics
before returning them to the user. The publishing backend copies these
statistics
during the next CHECK_FOR_INTERRUPTS call.

This approach facilitates on-demand publication of memory statistics
for a specific backend, rather than collecting them at regular intervals.
Since past memory context statistics may no longer be relevant,
there is little value in retaining historical data. Any collected
statistics
can be discarded once read by the client backend.

A fixed-size shared memory block, currently accommodating 30 records,
is used to store the statistics. This number was chosen arbitrarily,
as it covers all parent contexts at level 1 (i.e., direct children of the
top memory context)
based on my tests.
Further experiments are needed to determine the optimal number
for summarizing memory statistics.

Any additional statistics that exceed the shared memory capacity
are written to a file per backend in the PG_TEMP_FILES_DIR. The client
backend
first reads from the shared memory, and if necessary, retrieves the
remaining data from the file,
combining everything into a unified view. The files are cleaned up
automatically
if a backend crashes or during server restarts.

The statistics are reported in a breadth-first search order of the memory
context tree,
with parent contexts reported before their children. This provides a
cumulative summary
before diving into the details of each child context's consumption.

The rationale behind the shared memory chunk is to ensure that the
majority of contexts, which are the direct children of TopMemoryContext,
fit into memory.
This allows a client to request a summary of memory statistics,
which can be served from memory without the overhead of file access,
unless necessary.

A publishing backend signals waiting client backends using a condition
variable when it has finished writing its statistics to memory.
The client backend checks whether the statistics belong to the requested
backend.
If not, it continues waiting on the condition variable, timing out after 2
minutes.
This timeout is an arbitrary choice, and further work is required to
determine
a more practical value.

All backends use the same memory space to publish their statistics.
Before publishing, a backend checks whether the previous statistics have
been
successfully read by a client using a shared flag, "in_use."
This flag is set by the publishing backend and cleared by the client
backend once the data is read. If a backend cannot publish due to shared
memory being occupied, it exits the interrupt processing code,
and the client backend times out with a warning.

Please find below an example query to fetch memory contexts from the backend
with pid '116292'. The second argument, 'get_summary', is 'false',
indicating a request for statistics of all the contexts.

postgres=#
select * FROM pg_get_remote_backend_memory_contexts('116292', false) LIMIT
2;
-[ RECORD 1 ]-+----------------------
name | TopMemoryContext
ident |
type | AllocSet
path | {0}
total_bytes | 97696
total_nblocks | 5
free_bytes | 15376
free_chunks | 11
used_bytes | 82320
pid | 116292
-[ RECORD 2 ]-+----------------------
name | RowDescriptionContext
ident |
type | AllocSet
path | {0,1}
total_bytes | 8192
total_nblocks | 1
free_bytes | 6912
free_chunks | 0
used_bytes | 1280
pid | 116292

TODO:
1. Determine the behaviour when the statistics don't fit in one file.

[1]: PostgreSQL: Re: Creating a function for exposing memory usage of
backend process
https://www.postgresql.org/message-id/0a768ae1-1703-59c7-86cc-7068ff5e318c%40oss.nttdata.com

[2]: PostgreSQL: Re: Get memory contexts of an arbitrary backend process
https://www.postgresql.org/message-id/bea016ad-d1a7-f01d-a7e8-01106a1de77f%40oss.nttdata.com

Thank you,
Rahila Syed

Attachments:

0001-Function-to-report-memory-context-stats-of-any-backe.patch (application/octet-stream, +741 -13)
#2Michael Paquier
michael@paquier.xyz
In reply to: Rahila Syed (#1)
Re: Enhancing Memory Context Statistics Reporting

On Mon, Oct 21, 2024 at 11:54:21PM +0530, Rahila Syed wrote:

On the other hand, [2] provides the statistics for all backends but logs
them in a file, which may not be convenient for quick access.

To be precise, pg_log_backend_memory_contexts() pushes the memory
context stats to LOG_SERVER_ONLY or stderr, hence this is appended to
the server logs.

A fixed-size shared memory block, currently accommodating 30 records,
is used to store the statistics. This number was chosen arbitrarily,
as it covers all parent contexts at level 1 (i.e., direct children of the
top memory context)
based on my tests.
Further experiments are needed to determine the optimal number
for summarizing memory statistics.

+ * Statistics are shared via fixed shared memory which
+ * can hold statistics for 29 contexts. The rest of the
[...]
+   MemoryContextInfo memctx_infos[30]; 
[...]
+       memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo)); 
[...]
+   size = add_size(size, mul_size(30, sizeof(MemoryContextInfo)));
[...]
+   memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo)); 
[...]
+       memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo)); 

This number is tied to MemoryContextState added by the patch. Sounds
like this would be better as a constant properly defined rather than
hardcoded in all these places. This would make the upper-bound more
easily switchable in the patch.

+   Datum       path[128];
+   char        type[128];
[...]
+   char        name[1024];
+   char        ident[1024];
+   char        type[128];
+   Datum       path[128];

Again, constants. Why these values? You may want to use more
#defines here.

Any additional statistics that exceed the shared memory capacity
are written to a file per backend in the PG_TEMP_FILES_DIR. The client
backend
first reads from the shared memory, and if necessary, retrieves the
remaining data from the file,
combining everything into a unified view. The files are cleaned up
automatically
if a backend crashes or during server restarts.

Is the addition of the file to write any remaining stats really that
useful? This comes with a heavy cost in the patch with the "in_use"
flag, the various tweaks around the LWLock release/acquire protecting
the shmem area and the extra cleanup steps required after even a clean
restart. That's a lot of facility for this kind of information.
Another thing that may be worth considering is to put this information
in a DSM per the variable-size nature of the information, perhaps cap
it to a max to make the memory footprint cheaper, and avoid all
on-disk footprint because we don't need it to begin with as this is
information that makes sense only while the server is running.

Also, why the single-backend limitation? One could imagine a shared
memory area indexed similarly to pgproc entries, that includes
auxiliary processes as much as backends, so as it can be possible to
get more memory footprints through SQL for more than one single
process at one moment in time. If each backend has its own area of
shmem to deal with, they could use a shared LWLock on the shmem area
with an extra spinlock while the context data is dumped into memory as
the copy is short-lived. Each one of them could save the information
in a DSM created only when a dump of the shmem is requested for a
given PID, for example.
--
Michael

#3torikoshia
torikoshia@oss.nttdata.com
In reply to: Rahila Syed (#1)
Re: Enhancing Memory Context Statistics Reporting

On 2024-10-22 03:24, Rahila Syed wrote:

Hi,

PostgreSQL provides following capabilities for reporting memory
contexts statistics.
1. pg_get_backend_memory_contexts(); [1]
2. pg_log_backend_memory_contexts(pid); [2]

[1] provides a view of memory context statistics for a local backend,
while [2] prints the memory context statistics of any backend or
auxiliary
process to the PostgreSQL logs. Although [1] offers detailed
statistics,
it is limited to the local backend, restricting its use to PostgreSQL
client backends only.
On the other hand, [2] provides the statistics for all backends but
logs them in a file,
which may not be convenient for quick access.

I propose enhancing memory context statistics reporting by combining
these
capabilities and offering a view of memory statistics for all
PostgreSQL backends
and auxiliary processes.

Thanks for working on this!

I originally tried to develop something like your proposal in [2], but
there were some difficulties and settled down to implement
pg_log_backend_memory_contexts().

Attached is a patch that implements this functionality. It introduces
a SQL function
that takes the PID of a backend as an argument, returning a set of
records,
each containing statistics for a single memory context. The underlying
C function
sends a signal to the backend and waits for it to publish its memory
context statistics
before returning them to the user. The publishing backend copies
these statistics
during the next CHECK_FOR_INTERRUPTS call.

I remember waiting for dumping memory contexts stats could cause trouble
considering some erroneous cases.

For example, just after the target process finished dumping stats,
pg_get_remote_backend_memory_contexts() caller is terminated before
reading the stats, calling pg_get_remote_backend_memory_contexts() has
no response any more:

[session1]$ psql
(40699)=#

$ kill -s SIGSTOP 40699

[session2] psql
(40866)=# select * FROM
pg_get_remote_backend_memory_contexts('40699', false); -- waiting

$ kill -s SIGSTOP 40866

$ kill -s SIGCONT 40699

[session3] psql
(47656) $ select pg_terminate_backend(40866);

$ kill -s SIGCONT 40866 -- session2 terminated

[session3] (47656)=# select * FROM
pg_get_remote_backend_memory_contexts('47656', false); -- no response

It seems the reason is memCtxState->in_use is true and
memCtxState->proc_id is 40699.
We can continue to use pg_get_remote_backend_memory_contexts() after
specifying 40699, but it'd be hard to understand for users.

This approach facilitates on-demand publication of memory statistics
for a specific backend, rather than collecting them at regular
intervals.
Since past memory context statistics may no longer be relevant,
there is little value in retaining historical data. Any collected
statistics
can be discarded once read by the client backend.

A fixed-size shared memory block, currently accommodating 30 records,
is used to store the statistics. This number was chosen arbitrarily,
as it covers all parent contexts at level 1 (i.e., direct children of
the top memory context)
based on my tests.
Further experiments are needed to determine the optimal number
for summarizing memory statistics.

Any additional statistics that exceed the shared memory capacity
are written to a file per backend in the PG_TEMP_FILES_DIR. The client
backend
first reads from the shared memory, and if necessary, retrieves the
remaining data from the file,
combining everything into a unified view. The files are cleaned up
automatically
if a backend crashes or during server restarts.

The statistics are reported in a breadth-first search order of the
memory context tree,
with parent contexts reported before their children. This provides a
cumulative summary
before diving into the details of each child context's consumption.

The rationale behind the shared memory chunk is to ensure that the
majority of contexts which are the direct children of
TopMemoryContext,
fit into memory
This allows a client to request a summary of memory statistics,
which can be served from memory without the overhead of file access,
unless necessary.

A publishing backend signals waiting client backends using a condition
variable when it has finished writing its statistics to memory.
The client backend checks whether the statistics belong to the
requested backend.
If not, it continues waiting on the condition variable, timing out
after 2 minutes.
This timeout is an arbitrary choice, and further work is required to
determine
a more practical value.

All backends use the same memory space to publish their statistics.
Before publishing, a backend checks whether the previous statistics
have been
successfully read by a client using a shared flag, "in_use."
This flag is set by the publishing backend and cleared by the client
backend once the data is read. If a backend cannot publish due to
shared
memory being occupied, it exits the interrupt processing code,
and the client backend times out with a warning.

Please find below an example query to fetch memory contexts from the
backend
with id '106114'. Second argument -'get_summary' is 'false',
indicating a request for statistics of all the contexts.

postgres=#
select * FROM pg_get_remote_backend_memory_contexts('116292', false)
LIMIT 2;
-[ RECORD 1 ]-+----------------------
name | TopMemoryContext
ident |
type | AllocSet
path | {0}
total_bytes | 97696
total_nblocks | 5
free_bytes | 15376
free_chunks | 11
used_bytes | 82320
pid | 116292
-[ RECORD 2 ]-+----------------------
name | RowDescriptionContext
ident |
type | AllocSet
path | {0,1}
total_bytes | 8192
total_nblocks | 1
free_bytes | 6912
free_chunks | 0
used_bytes | 1280
pid | 116292

32d3ed8165f821f introduced 1-based path to pg_backend_memory_contexts,
but pg_get_remote_backend_memory_contexts() seems to have 0-base path.

pg_backend_memory_contexts has "level" column, but
pg_get_remote_backend_memory_contexts doesn't.

Are there any reasons for these?

TODO:
1. Determine the behaviour when the statistics don't fit in one file.

[1] PostgreSQL: Re: Creating a function for exposing memory usage of
backend process [1]

[2] PostgreSQL: Re: Get memory contexts of an arbitrary backend
process [2]

Thank you,
Rahila Syed

Links:
------
[1]
https://www.postgresql.org/message-id/0a768ae1-1703-59c7-86cc-7068ff5e318c%40oss.nttdata.com
[2]
https://www.postgresql.org/message-id/bea016ad-d1a7-f01d-a7e8-01106a1de77f%40oss.nttdata.com

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

#4Rahila Syed
rahilasyed90@gmail.com
In reply to: Michael Paquier (#2)
Re: Enhancing Memory Context Statistics Reporting

Hi Michael,

Thank you for the review.

On Tue, Oct 22, 2024 at 12:18 PM Michael Paquier <michael@paquier.xyz>
wrote:

On Mon, Oct 21, 2024 at 11:54:21PM +0530, Rahila Syed wrote:

On the other hand, [2] provides the statistics for all backends but logs
them in a file, which may not be convenient for quick access.

To be precise, pg_log_backend_memory_contexts() pushes the memory
context stats to LOG_SERVER_ONLY or stderr, hence this is appended to
the server logs.

A fixed-size shared memory block, currently accommodating 30 records,
is used to store the statistics. This number was chosen arbitrarily,
as it covers all parent contexts at level 1 (i.e., direct children of
the top memory context)
based on my tests.
Further experiments are needed to determine the optimal number
for summarizing memory statistics.

+ * Statistics are shared via fixed shared memory which
+ * can hold statistics for 29 contexts. The rest of the
[...]
+   MemoryContextInfo memctx_infos[30];
[...]
+       memset(&memCtxState->memctx_infos, 0, 30 *
sizeof(MemoryContextInfo));
[...]
+   size = add_size(size, mul_size(30, sizeof(MemoryContextInfo)));
[...]
+   memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo));
[...]
+       memset(&memCtxState->memctx_infos, 0, 30 *
sizeof(MemoryContextInfo));

This number is tied to MemoryContextState added by the patch. Sounds
like this would be better as a constant properly defined rather than
hardcoded in all these places. This would make the upper-bound more
easily switchable in the patch.

Makes sense. Fixed in the attached patch.

+   Datum       path[128];
+   char        type[128];
[...]
+   char        name[1024];
+   char        ident[1024];
+   char        type[128];
+   Datum       path[128];

Again, constants. Why these values? You may want to use more
#defines here.

I added the #defines for these in the attached patch.

Size of the path array should match the number of levels in the memory
context tree and type is a constant string.

For the name and ident, I have used the existing #define
MEMORY_CONTEXT_IDENT_DISPLAY_SIZE as the size limit.

Any additional statistics that exceed the shared memory capacity
are written to a file per backend in the PG_TEMP_FILES_DIR. The client
backend
first reads from the shared memory, and if necessary, retrieves the
remaining data from the file,
combining everything into a unified view. The files are cleaned up
automatically
if a backend crashes or during server restarts.

Is the addition of the file to write any remaining stats really that
useful? This comes with a heavy cost in the patch with the "in_use"
flag, the various tweaks around the LWLock release/acquire protecting
the shmem area and the extra cleanup steps required after even a clean
restart. That's a lot of facility for this kind of information.

The rationale behind using the file is to cater to the unbounded
number of memory contexts.
The "in_use" flag is used to govern the access to shared memory
as I am reserving enough memory for only one backend.
It ensures that another backend does not overwrite the statistics
in the shared memory, before it is read by a client backend.

Another thing that may be worth considering is to put this information
in a DSM per the variable-size nature of the information, perhaps cap
it to a max to make the memory footprint cheaper, and avoid all
on-disk footprint because we don't need it to begin with as this is
information that makes sense only while the server is running.

Thank you for the suggestion. I will look into using DSMs, especially
if there is a way to limit the statistics dump, while still providing a
user with enough information to debug memory consumption.

In this draft, I preferred using a file over DSMs, as a file can provide
ample space for dumping a large number of memory context statistics
without the risk of DSM creation failure due to insufficient memory.

Also, why the single-backend limitation?

To reduce the memory footprint, the shared memory is
created for only one backend.
Each backend has to wait for the previous operation
to finish before it can write.

I think a good use case for this would be a background process
periodically running the monitoring function on each of the
backends sequentially to fetch the statistics.
This way there will be little contention for shared memory.

In case shared memory is not available, a backend immediately
returns from the interrupt handler without blocking its normal
operations.

One could imagine a shared
memory area indexed similarly to pgproc entries, that includes
auxiliary processes as much as backends, so as it can be possible to
get more memory footprints through SQL for more than one single
process at one moment in time. If each backend has its own area of
shmem to deal with, they could use a shared LWLock on the shmem area
with an extra spinlock while the context data is dumped into memory as
the copy is short-lived. Each one of them could save the information
in a DSM created only when a dump of the shmem is requested for a
given PID, for example.

I agree that such an infrastructure would be useful for fetching memory
statistics concurrently without significant synchronization overhead.
However, a drawback of this approach is reserving shared
memory slots up to MAX_BACKENDS without utilizing them
when no concurrent monitoring is happening.
As you mentioned, creating a DSM on the fly when a dump
request is received could help avoid over-allocating shared memory.
I will look into this suggestion.

Thank you for your feedback!

Rahila Syed

Attachments:

0002-Function-to-report-memory-context-stats-of-any-backe.patch (application/octet-stream, +743 -13)
#5Rahila Syed
rahilasyed90@gmail.com
In reply to: torikoshia (#3)
Re: Enhancing Memory Context Statistics Reporting

Hi Torikoshia,

Thank you for reviewing the patch!

On Wed, Oct 23, 2024 at 9:28 AM torikoshia <torikoshia@oss.nttdata.com>
wrote:

On 2024-10-22 03:24, Rahila Syed wrote:

Hi,

PostgreSQL provides following capabilities for reporting memory
contexts statistics.
1. pg_get_backend_memory_contexts(); [1]
2. pg_log_backend_memory_contexts(pid); [2]

[1] provides a view of memory context statistics for a local backend,
while [2] prints the memory context statistics of any backend or
auxiliary
process to the PostgreSQL logs. Although [1] offers detailed
statistics,
it is limited to the local backend, restricting its use to PostgreSQL
client backends only.
On the other hand, [2] provides the statistics for all backends but
logs them in a file,
which may not be convenient for quick access.

I propose enhancing memory context statistics reporting by combining
these
capabilities and offering a view of memory statistics for all
PostgreSQL backends
and auxiliary processes.

Thanks for working on this!

I originally tried to develop something like your proposal in [2], but
there were some difficulties and settled down to implement
pg_log_backend_memory_contexts().

Yes. I am revisiting this problem :)

Attached is a patch that implements this functionality. It introduces
a SQL function
that takes the PID of a backend as an argument, returning a set of
records,
each containing statistics for a single memory context. The underlying
C function
sends a signal to the backend and waits for it to publish its memory
context statistics
before returning them to the user. The publishing backend copies
these statistics
during the next CHECK_FOR_INTERRUPTS call.

I remember waiting for dumping memory contexts stats could cause trouble
considering some erroneous cases.

For example, just after the target process finished dumping stats,
pg_get_remote_backend_memory_contexts() caller is terminated before
reading the stats, calling pg_get_remote_backend_memory_contexts() has
no response any more:

[session1]$ psql
(40699)=#

$ kill -s SIGSTOP 40699

[session2] psql
(40866)=# select * FROM
pg_get_remote_backend_memory_contexts('40699', false); -- waiting

$ kill -s SIGSTOP 40866

$ kill -s SIGCONT 40699

[session3] psql
(47656) $ select pg_terminate_backend(40866);

$ kill -s SIGCONT 40866 -- session2 terminated

[session3] (47656)=# select * FROM
pg_get_remote_backend_memory_contexts('47656', false); -- no response

It seems the reason is memCtxState->in_use is true and
memCtxState->proc_id is 40699.
We can continue to use pg_get_remote_backend_memory_contexts() after
specifying 40699, but it'd be hard to understand for users.

Thanks for testing and reporting. While I am not able to reproduce this
problem, I think this may be happening because the requesting
backend/caller is terminated before it gets a chance to mark
memCtxState->in_use as false.

In this case memCtxState->in_use should be marked as
'false' possibly during the processing of ProcDiePending in
ProcessInterrupts().

This approach facilitates on-demand publication of memory statistics
for a specific backend, rather than collecting them at regular
intervals.
Since past memory context statistics may no longer be relevant,
there is little value in retaining historical data. Any collected
statistics
can be discarded once read by the client backend.

A fixed-size shared memory block, currently accommodating 30 records,
is used to store the statistics. This number was chosen arbitrarily,
as it covers all parent contexts at level 1 (i.e., direct children of
the top memory context)
based on my tests.
Further experiments are needed to determine the optimal number
for summarizing memory statistics.

Any additional statistics that exceed the shared memory capacity
are written to a file per backend in the PG_TEMP_FILES_DIR. The client
backend
first reads from the shared memory, and if necessary, retrieves the
remaining data from the file,
combining everything into a unified view. The files are cleaned up
automatically
if a backend crashes or during server restarts.

The statistics are reported in a breadth-first search order of the
memory context tree,
with parent contexts reported before their children. This provides a
cumulative summary
before diving into the details of each child context's consumption.

The rationale behind the shared memory chunk is to ensure that the
majority of contexts which are the direct children of
TopMemoryContext,
fit into memory
This allows a client to request a summary of memory statistics,
which can be served from memory without the overhead of file access,
unless necessary.

A publishing backend signals waiting client backends using a condition
variable when it has finished writing its statistics to memory.
The client backend checks whether the statistics belong to the
requested backend.
If not, it continues waiting on the condition variable, timing out
after 2 minutes.
This timeout is an arbitrary choice, and further work is required to
determine
a more practical value.

All backends use the same memory space to publish their statistics.
Before publishing, a backend checks whether the previous statistics
have been successfully read by a client using a shared flag, "in_use."
This flag is set by the publishing backend and cleared by the client
backend once the data is read. If a backend cannot publish due to
shared memory being occupied, it exits the interrupt processing code,
and the client backend times out with a warning.
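
In pseudo-C, the "in_use" handshake looks roughly like the following
(the lwlock and condition-variable signalling are omitted; the struct
and function names are illustrative, not taken from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Single shared slot that all publishers compete for (illustrative). */
typedef struct
{
    bool in_use;    /* set by the publisher, cleared by the reader */
    int  proc_id;   /* pid whose stats currently occupy the slot */
} MemCtxSlot;

/* Publisher side: claim the slot, or give up without blocking. */
static bool
try_publish(MemCtxSlot *slot, int my_pid)
{
    if (slot->in_use)
        return false;       /* previous stats not consumed yet: bail out */
    slot->in_use = true;
    slot->proc_id = my_pid;
    return true;
}

/* Client side: consume stats only if they belong to the requested pid. */
static bool
try_consume(MemCtxSlot *slot, int wanted_pid)
{
    if (!slot->in_use || slot->proc_id != wanted_pid)
        return false;       /* keep waiting on the condition variable */
    slot->in_use = false;   /* release the slot for the next publisher */
    return true;
}
```

The failure mode discussed later in the thread corresponds to a client
that never reaches try_consume(), leaving in_use stuck at true.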

Please find below an example query to fetch memory contexts from the
backend with pid '116292'. The second argument, 'get_summary', is
'false', indicating a request for statistics of all the contexts.

postgres=#
select * FROM pg_get_remote_backend_memory_contexts('116292', false)
LIMIT 2;
-[ RECORD 1 ]-+----------------------
name | TopMemoryContext
ident |
type | AllocSet
path | {0}
total_bytes | 97696
total_nblocks | 5
free_bytes | 15376
free_chunks | 11
used_bytes | 82320
pid | 116292
-[ RECORD 2 ]-+----------------------
name | RowDescriptionContext
ident |
type | AllocSet
path | {0,1}
total_bytes | 8192
total_nblocks | 1
free_bytes | 6912
free_chunks | 0
used_bytes | 1280
pid | 116292

32d3ed8165f821f introduced 1-based path to pg_backend_memory_contexts,
but pg_get_remote_backend_memory_contexts() seems to have a 0-based path.

Right. I will change it to match this commit.

pg_backend_memory_contexts has "level" column, but
pg_get_remote_backend_memory_contexts doesn't.

Are there any reasons for these?

No particular reason, I can add this column as well.

Thank you,
Rahila Syed

#6torikoshia
torikoshia@oss.nttdata.com
In reply to: Rahila Syed (#5)
Re: Enhancing Memory Context Statistics Reporting

On 2024-10-24 14:59, Rahila Syed wrote:

Hi Torikoshia,

Thank you for reviewing the patch!

On Wed, Oct 23, 2024 at 9:28 AM torikoshia
<torikoshia@oss.nttdata.com> wrote:

On 2024-10-22 03:24, Rahila Syed wrote:

Hi,

PostgreSQL provides the following capabilities for reporting memory
context statistics.
1. pg_get_backend_memory_contexts(); [1]
2. pg_log_backend_memory_contexts(pid); [2]

[1] provides a view of memory context statistics for a local backend,
while [2] prints the memory context statistics of any backend or
auxiliary process to the PostgreSQL logs. Although [1] offers detailed
statistics, it is limited to the local backend, restricting its use to
PostgreSQL client backends only. On the other hand, [2] provides the
statistics for all backends but logs them in a file, which may not be
convenient for quick access.

I propose enhancing memory context statistics reporting by combining
these capabilities and offering a view of memory statistics for all
PostgreSQL backends and auxiliary processes.

Thanks for working on this!

I originally tried to develop something like your proposal in [2], but
there were some difficulties, and I settled down to implementing
pg_log_backend_memory_contexts().

Yes. I am revisiting this problem :)

Attached is a patch that implements this functionality. It introduces
a SQL function that takes the PID of a backend as an argument,
returning a set of records, each containing statistics for a single
memory context. The underlying C function sends a signal to the
backend and waits for it to publish its memory context statistics
before returning them to the user. The publishing backend copies
these statistics during the next CHECK_FOR_INTERRUPTS call.

I remember waiting for dumping memory contexts stats could cause
trouble
considering some erroneous cases.

For example, just after the target process finished dumping stats, if
the pg_get_remote_backend_memory_contexts() caller is terminated before
reading the stats, subsequent calls to
pg_get_remote_backend_memory_contexts() get no response any more:

[session1]$ psql
(40699)=#

$ kill -s SIGSTOP 40699

[session2] psql
(40866)=# select * FROM
pg_get_remote_backend_memory_contexts('40699', false); -- waiting

$ kill -s SIGSTOP 40866

$ kill -s SIGCONT 40699

[session3] psql
(47656) $ select pg_terminate_backend(40866);

$ kill -s SIGCONT 40866 -- session2 terminated

[session3] (47656)=# select * FROM
pg_get_remote_backend_memory_contexts('47656', false); -- no response

It seems the reason is that memCtxState->in_use is still true and
memCtxState->proc_id is 40699.
We can continue to use pg_get_remote_backend_memory_contexts() after
specifying 40699, but it'd be hard to understand for users.

Thanks for testing and reporting. While I am not able to reproduce this
problem, I think this may be happening because the requesting
backend/caller is terminated before it gets a chance to mark
memCtxState->in_use as false.

Yeah, when I attached a debugger to 47656 when it was waiting on
pg_get_remote_backend_memory_contexts('47656', false),
memCtxState->in_use was true as you suspected:

(lldb) p memCtxState->in_use
(bool) $1 = true
(lldb) p memCtxState->proc_id
(int) $2 = 40699
(lldb) p pid
(int) $3 = 47656

In this case memCtxState->in_use should be marked as
'false' possibly during the processing of ProcDiePending in
ProcessInterrupts().

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

#7Nitin Jadhav
nitinjadhavpostgres@gmail.com
In reply to: Rahila Syed (#4)
Re: Enhancing Memory Context Statistics Reporting

Hi Rahila,

I’ve spent some time reviewing the patch, and the review is still
ongoing. Here are the comments I’ve found so far.

1.
The tests are currently missing. Could you please add them?

2.
I have some concerns regarding the function name
‘pg_get_remote_backend_memory_contexts’. Specifically, the term
‘remote’ doesn’t seem appropriate to me. The function retrieves data
from other processes running on the same machine, which might give the
impression that it deals with processes on different machines. This
could be misleading or unclear in this context. The argument ‘pid’
already indicates that it can get data from different processes.
Additionally, the term ‘backend’ also seems inappropriate since we are
obtaining data from processes that are different from backend
processes.

3.

+ Datum values[10];
+ bool nulls[10];

Please consider #defining the column count, or you could reuse the
existing one ‘PG_GET_BACKEND_MEMORY_CONTEXTS_COLS’.

4.

if (context_id <= 28)
if (context_id == 29)
if (context_id < 29)

#define these

5.

for (MemoryContext cur_context = cur; cur_context != NULL;
     cur_context = cur_context->parent)
{
MemoryContextId *cur_entry;

cur_entry = hash_search(context_id_lookup, &cur_context, HASH_FIND, &found);

if (!found)
{
elog(LOG, "hash table corrupted, can't construct path value");
break;
}
path = lcons_int(cur_entry->context_id, path);
}

Similar code already exists in PutMemoryContextsStatsTupleStore().
Could you create a separate function to handle this?

6.

/*
* Shared memory is full, release lock and write to file from next
* iteration
*/
context_id++;
if (context_id == 29)
{

What if there are exactly 29 entries in the memory context? In that
case, creating the file would be unnecessary.
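
One way to avoid that corner case is to make the spill decision depend
on an actual overflow rather than on the last slot being filled, with
the capacity behind a named constant as point 4 suggests. A minimal
sketch, with hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical named constant for the hardcoded 29. */
#define MEMCTX_SHMEM_CAPACITY 29

/*
 * Spill to a file only when the contexts actually exceed the
 * shared-memory capacity; a process with exactly MEMCTX_SHMEM_CAPACITY
 * contexts then never creates an unnecessary file.
 */
static bool
needs_spill_file(int ncontexts)
{
    return ncontexts > MEMCTX_SHMEM_CAPACITY;
}
```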

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

On Wed, Oct 23, 2024 at 10:20 AM Rahila Syed <rahilasyed90@gmail.com> wrote:

Hi Michael,

Thank you for the review.

On Tue, Oct 22, 2024 at 12:18 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Oct 21, 2024 at 11:54:21PM +0530, Rahila Syed wrote:

On the other hand, [2] provides the statistics for all backends but logs
them in a file, which may not be convenient for quick access.

To be precise, pg_log_backend_memory_contexts() pushes the memory
context stats to LOG_SERVER_ONLY or stderr, hence this is appended to
the server logs.

A fixed-size shared memory block, currently accommodating 30 records,
is used to store the statistics. This number was chosen arbitrarily,
as it covers all parent contexts at level 1 (i.e., direct children of the
top memory context)
based on my tests.
Further experiments are needed to determine the optimal number
for summarizing memory statistics.

+ * Statistics are shared via fixed shared memory which
+ * can hold statistics for 29 contexts. The rest of the
[...]
+   MemoryContextInfo memctx_infos[30];
[...]
+       memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo));
[...]
+   size = add_size(size, mul_size(30, sizeof(MemoryContextInfo)));
[...]
+   memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo));
[...]
+       memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo));

This number is tied to MemoryContextState added by the patch. Sounds
like this would be better as a constant properly defined rather than
hardcoded in all these places. This would make the upper-bound more
easily switchable in the patch.

Makes sense. Fixed in the attached patch.

+   Datum       path[128];
+   char        type[128];
[...]
+   char        name[1024];
+   char        ident[1024];
+   char        type[128];
+   Datum       path[128];

Again, constants. Why these values? You may want to use more
#defines here.

I added the #defines for these in the attached patch.
Size of the path array should match the number of levels in the memory
context tree and type is a constant string.

For the name and ident, I have used the existing #define
MEMORY_CONTEXT_IDENT_DISPLAY_SIZE as the size limit.

Any additional statistics that exceed the shared memory capacity
are written to a file per backend in the PG_TEMP_FILES_DIR. The client
backend
first reads from the shared memory, and if necessary, retrieves the
remaining data from the file,
combining everything into a unified view. The files are cleaned up
automatically
if a backend crashes or during server restarts.

Is the addition of the file to write any remaining stats really that
useful? This comes with a heavy cost in the patch with the "in_use"
flag, the various tweaks around the LWLock release/acquire protecting
the shmem area and the extra cleanup steps required after even a clean
restart. That's a lot of facility for this kind of information.

The rationale behind using the file is to cater to the unbounded
number of memory contexts.
The "in_use" flag is used to govern the access to shared memory
as I am reserving enough memory for only one backend.
It ensures that another backend does not overwrite the statistics
in the shared memory, before it is read by a client backend.

Another thing that may be worth considering is to put this information
in a DSM per the variable-size nature of the information, perhaps cap
it to a max to make the memory footprint cheaper, and avoid all
on-disk footprint because we don't need it to begin with as this is
information that makes sense only while the server is running.

Thank you for the suggestion. I will look into using DSMs especially
if there is a way to limit the statistics dump, while still providing a user
with enough information to debug memory consumption.

In this draft, I preferred using a file over DSMs, as a file can provide
ample space for dumping a large number of memory context statistics
without the risk of DSM creation failure due to insufficient memory.

Also, why the single-backend limitation?

To reduce the memory footprint, the shared memory is
created for only one backend.
Each backend has to wait for previous operation
to finish before it can write.

I think a good use case for this would be a background process
periodically running the monitoring function on each of the
backends sequentially to fetch the statistics.
This way there will be little contention for shared memory.

In case a shared memory is not available, a backend immediately
returns from the interrupt handler without blocking its normal
operations.

One could imagine a shared
memory area indexed similarly to pgproc entries, that includes
auxiliary processes as much as backends, so as it can be possible to
get more memory footprints through SQL for more than one single
process at one moment in time. If each backend has its own area of
shmem to deal with, they could use a shared LWLock on the shmem area
with an extra spinlock while the context data is dumped into memory as
the copy is short-lived. Each one of them could save the information
in a DSM created only when a dump of the shmem is requested for a
given PID, for example.

I agree that such an infrastructure would be useful for fetching memory
statistics concurrently without significant synchronization overhead.
However, a drawback of this approach is reserving shared
memory slots up to MAX_BACKENDS without utilizing them
when no concurrent monitoring is happening.
As you mentioned, creating a DSM on the fly when a dump
request is received could help avoid over-allocating shared memory.
I will look into this suggestion

Thank you for your feedback!

Rahila Syed

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Rahila Syed (#1)
Re: Enhancing Memory Context Statistics Reporting

On 2024-Oct-21, Rahila Syed wrote:

I propose enhancing memory context statistics reporting by combining
these capabilities and offering a view of memory statistics for all
PostgreSQL backends and auxiliary processes.

Sounds good.

A fixed-size shared memory block, currently accommodating 30 records,
is used to store the statistics.

Hmm, would it make sense to use dynamic shared memory for this? The
publishing backend could dsm_create one DSM chunk of the exact size that
it needs, pass the dsm_handle to the consumer, and then have it be
destroyed once it's been read. That way you don't have to define an
arbitrary limit of any size. (Maybe you could keep a limit to how much
is published in shared memory and spill the rest to disk, but I think
such a limit should be very high[1], so that it's unlikely to take
effect in normal cases.)

[1]: This is very arbitrary of course, but 1 MB gives enough room for
some 7000 contexts, which should cover normal cases.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Find a bug in a program, and fix it, and the program will work today.
Show the program how to find and fix a bug, and the program
will work forever" (Oliver Silfridge)

#9Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#8)
Re: Enhancing Memory Context Statistics Reporting

Hi,

On 2024-10-26 16:14:25 +0200, Alvaro Herrera wrote:

A fixed-size shared memory block, currently accommodating 30 records,
is used to store the statistics.

Hmm, would it make sene to use dynamic shared memory for this?

+1

The publishing backend could dsm_create one DSM chunk of the exact size that
it needs, pass the dsm_handle to the consumer, and then have it be destroy
once it's been read.

I'd probably just make it a dshash table or such, keyed by the pid, pointing
to a dsa allocation with the stats.

That way you don't have to define an arbitrary limit
of any size. (Maybe you could keep a limit to how much is published in
shared memory and spill the rest to disk, but I think such a limit should be
very high[1], so that it's unlikely to take effect in normal cases.)

[1] This is very arbitrary of course, but 1 MB gives enough room for
some 7000 contexts, which should cover normal cases.

Agreed. I can see a point in a limit for extreme cases, but spilling to disk
doesn't seem particularly useful.

Greetings,

Andres Freund

#10Rahila Syed
rahilasyed90@gmail.com
In reply to: Alvaro Herrera (#8)
Re: Enhancing Memory Context Statistics Reporting

Hi,

Thank you for the review.

Hmm, would it make sense to use dynamic shared memory for this? The
publishing backend could dsm_create one DSM chunk of the exact size that
it needs, pass the dsm_handle to the consumer, and then have it be
destroyed once it's been read. That way you don't have to define an
arbitrary limit of any size. (Maybe you could keep a limit to how much
is published in shared memory and spill the rest to disk, but I think
such a limit should be very high[1], so that it's unlikely to take
effect in normal cases.)

[1] This is very arbitrary of course, but 1 MB gives enough room for
some 7000 contexts, which should cover normal cases.

I used one DSA area per process to share statistics. Currently,
the size limit for each DSA is 16 MB, which can accommodate
approximately 6,700 MemoryContextInfo structs. Any additional
statistics will spill over to a file. I opted for DSAs over DSMs to
enable memory reuse by freeing segments for subsequent
statistics copies of the same backend, without needing to
recreate DSMs for each request.

The dsa_handle for each process is stored in an array,
indexed by the procNumber, within the shared memory.
The maximum size of this array is defined as the sum of
MaxBackends and the number of auxiliary processes.

As requested earlier, I have renamed the function to
pg_get_process_memory_contexts(pid, get_summary).
Suggestions for a better name are welcome.
When the get_summary argument is set to true, the function provides
statistics for memory contexts up to level 2—that is, the
top memory context and all its children.

Please find attached a rebased patch that includes these changes.
I will work on adding a test for the function and some code refactoring
suggestions.

Thank you,
Rahila Syed

Attachments:

v2-0001-Function-to-report-memory-context-stats-of-any-backe.patch
#11Michael Paquier
michael@paquier.xyz
In reply to: Rahila Syed (#10)
Re: Enhancing Memory Context Statistics Reporting

On Wed, Nov 13, 2024 at 01:00:52PM +0530, Rahila Syed wrote:

I used one DSA area per process to share statistics. Currently,
the size limit for each DSA is 16 MB, which can accommodate
approximately 6,700 MemoryContextInfo structs. Any additional
statistics will spill over to a file. I opted for DSAs over DSMs to
enable memory reuse by freeing segments for subsequent
statistics copies of the same backend, without needing to
recreate DSMs for each request.

Already mentioned previously at [1] and echoing with some surrounding
arguments, but I'd suggest to keep it simple and just remove entirely
the part of the patch where the stats information gets spilled into
disk. With more than 6000-ish context information available with a
hard limit in place, there should be plenty enough to know what's
going on anyway.

[1]: /messages/by-id/ZxdKx0DywUTAvkEF@paquier.xyz
--
Michael

#12Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#11)
Re: Enhancing Memory Context Statistics Reporting

On 2024-Nov-14, Michael Paquier wrote:

Already mentioned previously at [1] and echoing with some surrounding
arguments, but I'd suggest to keep it simple and just remove entirely
the part of the patch where the stats information gets spilled into
disk. With more than 6000-ish context information available with a
hard limit in place, there should be plenty enough to know what's
going on anyway.

Functionally-wise I don't necessarily agree with _removing_ the spill
code, considering that production systems with thousands of tables would
easily reach that number of contexts (each index gets its own index info
context, each regexp gets its own memcxt); and I don't think silently
omitting a fraction of people's memory situation (or erroring out if the
case is hit) is going to make us any friends.

That said, it worries me that we choose a shared memory size so large
that it becomes impractical to hit the spill-to-disk code in regression
testing. Maybe we can choose a much smaller limit size when
USE_ASSERT_CHECKING is enabled, and use a test that hits that number?
That way, we know the code is being hit and tested, without imposing a
huge memory consumption on test machines.
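
A build-dependent capacity along these lines could look like the
following sketch (constant name and values are illustrative, not from
the patch):

```c
#include <assert.h>

/*
 * Sketch only: shrink the shared-memory limit in assert-enabled builds
 * so that the spill path is cheap to hit from the regression tests.
 */
#ifdef USE_ASSERT_CHECKING
#define MEMCTX_SHMEM_MAX_CONTEXTS 16    /* tiny: tests hit the spill path */
#else
#define MEMCTX_SHMEM_MAX_CONTEXTS 7000  /* ~1 MB of context records */
#endif
```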

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)

#13Rahila Syed
rahilasyed90@gmail.com
In reply to: Alvaro Herrera (#12)
Re: Enhancing Memory Context Statistics Reporting

Hi,

On Thu, Nov 14, 2024 at 5:18 PM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

On 2024-Nov-14, Michael Paquier wrote:

Already mentioned previously at [1] and echoing with some surrounding
arguments, but I'd suggest to keep it simple and just remove entirely
the part of the patch where the stats information gets spilled into
disk. With more than 6000-ish context information available with a
hard limit in place, there should be plenty enough to know what's
going on anyway.

Functionally-wise I don't necessarily agree with _removing_ the spill
code, considering that production systems with thousands of tables would
easily reach that number of contexts (each index gets its own index info
context, each regexp gets its own memcxt); and I don't think silently
omitting a fraction of people's memory situation (or erroring out if the
case is hit) is going to make us any friends.

While I agree that removing the spill-to-file logic will simplify the code,
I also understand the rationale for retaining it to ensure completeness.
To achieve both completeness and avoid writing to a file, I can consider
displaying the numbers for the remaining contexts as a cumulative total
at the end of the output.

Something like follows:
```
postgres=# select * from pg_get_process_memory_contexts('237244', false);
name                         | ident | type     | path  | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes | pid
------------------------------+-------+----------+-------+-------------+---------------+------------+-------------+------------+--------
TopMemoryContext             |       | AllocSet | {0}   |       97696 |             5 |      14288 |          11 |      83408 | 237244
search_path processing cache |       | AllocSet | {0,1} |        8192 |             1 |       5328 |           7 |       2864 | 237244

Remaining contexts total: 23456 bytes (total_bytes), 12345 (used_bytes), 11111 (free_bytes)
```

That said, it worries me that we choose a shared memory size so large
that it becomes impractical to hit the spill-to-disk code in regression
testing. Maybe we can choose a much smaller limit size when
USE_ASSERT_CHECKING is enabled, and use a test that hits that number?
That way, we know the code is being hit and tested, without imposing a
huge memory consumption on test machines.

Makes sense. I will look into writing such a test, if we finalize the
approach
of spill-to-file.

Please find attached a rebased and updated patch with a basic test
and some fixes. Kindly let me know your thoughts.

Thank you,
Rahila Syed

Attachments:

v3-0001-Function-to-report-memory-context-stats-of-any-backe.patch
#14Rahila Syed
rahilasyed90@gmail.com
In reply to: Rahila Syed (#13)
Re: Enhancing Memory Context Statistics Reporting

Hi,

To achieve both completeness and avoid writing to a file, I can consider
displaying the numbers for the remaining contexts as a cumulative total
at the end of the output.

Please find attached an updated patch with this change. The file
previously used to store spilled statistics has been removed. Instead,
a cumulative total of the remaining/spilled context statistics is now
stored in the DSM segment, which is displayed as follows.

postgres=# select * from pg_get_process_memory_contexts('352966', false);
name             | ident | type     | path   | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes | pid
------------------+-------+----------+--------+-------------+---------------+------------+-------------+------------+--------
TopMemoryContext |       | AllocSet | {0}    |       97696 |             5 |      14288 |          11 |      83408 | 352966
.
.
.
MdSmgr           |       | AllocSet | {0,18} |        8192 |             1 |       7424 |           0 |        768 | 352966
Remaining Totals |       |          |        |     1756016 |           188 |     658584 |         132 |    1097432 | 352966
(7129 rows)

I believe this serves as a good compromise between completeness
and avoiding the overhead of file handling. However, I am open to
reintroducing file handling if displaying the complete statistics of the
remaining contexts prove to be more important.
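
The folding of spilled contexts into the single "Remaining Totals" row
can be sketched as follows (the struct is a cut-down illustration, not
the patch's actual MemoryContextInfo, and `sum_remaining` is a
hypothetical helper):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative shape only; the real struct carries more fields. */
typedef struct
{
    int64_t total_bytes;
    int64_t free_bytes;
    int64_t used_bytes;
} CtxStats;

/*
 * Fold the stats of contexts that do not fit in the first `cap` slots
 * into a single cumulative "Remaining Totals" record.
 */
static CtxStats
sum_remaining(const CtxStats *all, int n, int cap)
{
    CtxStats rest = { 0, 0, 0 };

    for (int i = cap; i < n; i++)
    {
        rest.total_bytes += all[i].total_bytes;
        rest.free_bytes += all[i].free_bytes;
        rest.used_bytes += all[i].used_bytes;
    }
    return rest;
}
```

This keeps the report complete in aggregate while bounding the number
of individually reported contexts.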

All the known bugs in the patch have been fixed.

In summary, one DSA per PostgreSQL process is used to share
the statistics of that process. A DSA is created by the first client
backend that requests memory context statistics, and it is pinned
for all future requests to that process.
A handle to this DSA is shared between the client and the publishing
process using fixed shared memory. The fixed shared memory consists
of an array of size MaxBackends + auxiliary processes, indexed
by procno. Each element in this array is less than 100 bytes in size.

A PostgreSQL process uses a condition variable to signal a waiting client
backend once it has finished publishing the statistics. If, for some
reason,
the signal is not sent, the waiting client backend will time out.

When statistics of a local backend are requested, this function returns
the following WARNING and exits, since this case can be handled by an
existing function which doesn't require a DSA.

WARNING: cannot return statistics for local backend
HINT: Use pg_get_backend_memory_contexts instead

Looking forward to your review.

Thank you,
Rahila Syed

Attachments:

v4-Function-to-report-memory-context-stats-of-a-process.patch
#15Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rahila Syed (#14)
Re: Enhancing Memory Context Statistics Reporting

On Wed, Nov 20, 2024 at 2:39 PM Rahila Syed <rahilasyed90@gmail.com> wrote:

Hi,

In summary, one DSA per PostgreSQL process is used to share
the statistics of that process. A DSA is created by the first client
backend that requests memory context statistics, and it is pinned
for all future requests to that process.
A handle to this DSA is shared between the client and the publishing
process using fixed shared memory. The fixed shared memory consists
of an array of size MaxBackends + auxiliary processes, indexed
by procno. Each element in this array is less than 100 bytes in size.

A PostgreSQL process uses a condition variable to signal a waiting client
backend once it has finished publishing the statistics. If, for some reason,
the signal is not sent, the waiting client backend will time out.

How does the process know that the client backend has finished reading
stats and it can be refreshed? What happens, if the next request for
memory context stats comes before first requester has consumed the
statistics it requested?

Does the shared memory get deallocated when the backend which
allocated it exits?

When statistics of the local backend are requested, this function returns the
following WARNING and exits, since that case is handled by an existing
function which doesn't require a DSA.

WARNING: cannot return statistics for local backend
HINT: Use pg_get_backend_memory_contexts instead

How about using pg_get_backend_memory_contexts() for both - local as
well as other backend? Let PID argument default to NULL which would
indicate local backend, otherwise some other backend?

--
Best Wishes,
Ashutosh Bapat

#16Rahila Syed
rahilasyed90@gmail.com
In reply to: Ashutosh Bapat (#15)
Re: Enhancing Memory Context Statistics Reporting

Hi,

How does the process know that the client backend has finished reading
stats and it can be refreshed? What happens if the next request for
memory context stats comes before the first requester has consumed the
statistics it requested?

A process that's copying its statistics does not need to know that.
Whenever it receives a signal to copy statistics, it goes ahead and
copies the latest statistics to the DSA after acquiring an exclusive
lwlock.

A requestor takes a lock before it starts consuming the statistics.
If the next request comes while the first requestor is consuming the
statistics, the publishing process will wait on lwlock to be released
by the consuming process before it can write the statistics.
If the next request arrives before the first requester begins consuming
the statistics, the publishing process will acquire the lock and overwrite
the earlier statistics with the most recent ones.
As a result, both the first and second requesters will consume the
updated statistics.

Does the shared memory get deallocated when the backend which
allocated it exits?

Memory in the DSA is allocated by a postgres process and deallocated
by the client backend for each request. Both the publishing postgres process
and the client backend detach from the DSA at the end of each request.
However, the DSM segment(s) persist even after all the processes exit
and are only destroyed upon a server restart. Each DSA is associated
with the procNumber of a postgres process and
can be re-used by any future process with the same procNumber.

When statistics of a local backend is requested, this function returns
the following WARNING and exits, since this can be handled by an existing
function which doesn't require a DSA.

WARNING: cannot return statistics for local backend
HINT: Use pg_get_backend_memory_contexts instead

How about using pg_get_backend_memory_contexts() for both - local as
well as other backend? Let PID argument default to NULL which would
indicate local backend, otherwise some other backend?

I don't see much value in combining the two, especially since with
pg_get_process_memory_contexts() we can query both the postgres
backend and a background process; the name pg_get_backend_memory_context()
would be inaccurate, and I am not sure whether a change to rename the
existing function would be welcome.

Please find an updated patch which fixes an issue seen in CI runs.

Thank you,
Rahila Syed

Attachments:

v5-Function-to-report-memory-context-stats-of-a-process.patch (application/octet-stream, +668 -13)
#17Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rahila Syed (#16)
Re: Enhancing Memory Context Statistics Reporting

On Fri, Nov 22, 2024 at 6:33 PM Rahila Syed <rahilasyed90@gmail.com> wrote:

Hi,

How does the process know that the client backend has finished reading
stats and it can be refreshed? What happens, if the next request for
memory context stats comes before first requester has consumed the
statistics it requested?

A process that's copying its statistics does not need to know that.
Whenever it receives a signal to copy statistics, it goes ahead and
copies the latest statistics to the DSA after acquiring an exclusive
lwlock.

A requestor takes a lock before it starts consuming the statistics.
If the next request comes while the first requestor is consuming the
statistics, the publishing process will wait on lwlock to be released
by the consuming process before it can write the statistics.
If the next request arrives before the first requester begins consuming
the statistics, the publishing process will acquire the lock and overwrite
the earlier statistics with the most recent ones.
As a result, both the first and second requesters will consume the
updated statistics.

IIUC, the publisher and the consumer processes both use the same
LWLock. The publisher acquires an exclusive lock. Does the consumer
acquire a SHARED lock?

The publisher process might be in a transaction, processing a query or
doing something else. Having to wait for an LWLock may affect its
performance. This will become even more visible if the client backend
is trying to diagnose a slow running query. Have we tried to measure
how long the publisher might have to wait for an LWLock while the
consumer is consuming statistics, or what the impact of this wait is?

When statistics of a local backend is requested, this function returns the following
WARNING and exits, since this can be handled by an existing function which
doesn't require a DSA.

WARNING: cannot return statistics for local backend
HINT: Use pg_get_backend_memory_contexts instead

How about using pg_get_backend_memory_contexts() for both - local as
well as other backend? Let PID argument default to NULL which would
indicate local backend, otherwise some other backend?

I don't see much value in combining the two, specially since with
pg_get_process_memory_contexts() we can query both the postgres
backend and a background process, the name pg_get_backend_memory_context()
would be inaccurate and I am not sure whether a change to rename the
existing function would be welcome.

Having two separate functions for the same functionality isn't a
friendly user interface.

I played a bit with pg_terminate_backend(), which is another function
dealing with backends, to understand (a) what it does to its own
backend and (b) which processes are considered backends.

1. pg_terminate_backend() allows terminating the backend from which
it is fired.
#select pid, application_name, backend_type, pg_terminate_backend(pid)
from pg_stat_activity;
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

2. It considers autovacuum launcher and logical replication launcher
as postgres backends but not checkpointer, background writer and
walwriter.
#select pid, application_name, backend_type, pg_terminate_backend(pid)
from pg_stat_activity where pid <> pg_backend_pid();
WARNING: PID 644887 is not a PostgreSQL backend process
WARNING: PID 644888 is not a PostgreSQL backend process
WARNING: PID 644890 is not a PostgreSQL backend process
pid | application_name | backend_type | pg_terminate_backend
--------+------------------+------------------------------+----------------------
645636 | | autovacuum launcher | t
645677 | | logical replication launcher | t
644887 | | checkpointer | f
644888 | | background writer | f
644890 | | walwriter | f
(5 rows)

In that sense you are correct that pg_get_backend_memory_context()
should not provide context information of WAL writer process for
example. But pg_get_process_memory_contexts() would be expected to
provide its own memory context information instead of redirecting to
another function through a WARNING. It could do that redirection
itself. That will also prevent the functions' output format going out
of sync.

--
Best Wishes,
Ashutosh Bapat

#18Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Rahila Syed (#16)
Re: Enhancing Memory Context Statistics Reporting

Hi,

I took a quick look at the patch today. Overall, I think this would be
very useful, I've repeatedly needed to inspect why a backend uses so
much memory, and I ended up triggering MemoryContextStats() from gdb.
This would be more convenient / safer. So +1 to the patch intent.

A couple review comments:

1) I read through the thread, and in general I agree with the reasoning
for removing the file part - it seems perfectly fine to just dump as
much as we can fit into a buffer, and then summarize the rest. But do we
need to invent a "new" limit here? The other places logging memory
contexts do something like this:

MemoryContextStatsDetail(TopMemoryContext, 100, 100, false);

Which means we only print the 100 memory contexts at the top, and that's
it. Wouldn't that give us a reasonable memory limit too?

2) I see the function got renamed to pg_get_process_memory_contexts(),
but the comment still says pg_get_remote_backend_memory_contexts().

3) I don't see any SGML docs for this new function. I was a bit unsure
what the "summary" argument is meant to do. The comment does not explain
that either.

4) I wonder if the function needs to return PID. I mean, the caller
knows which PID it is for, so it seems rather unnecessary.

5) In the "summary" mode, it might be useful to include info about how
many child contexts were aggregated. It's useful to know whether there
was 1 child or 10000 children. In the regular (non-summary) mode it'd
always be "1", probably, but maybe it'd interact with the limit in (1).
Not sure.

6) I feel a bit uneasy about the whole locking / communication scheme.
In particular, I'm worried about lockups, deadlocks, etc. So I decided
to do a trivial stress-test - just run the new function through pgbench
with many clients.

The memstats.sql script does just this:

SELECT * FROM pg_get_process_memory_contexts(
(SELECT pid FROM pg_stat_activity
WHERE pid != pg_backend_pid()
ORDER BY random() LIMIT 1)
, false);

where the inner query just picks a PID for some other backend, and asks
for memory context stats for that.

And just run it like this on a scale 1 pgbench database:

pgbench -n -f memstats.sql -c 10 test

And it gets stuck *immediately*. I've seen it to wait for other client
backends and auxiliary processes like autovacuum launcher.

This is an absolutely idle system; there's no reason why a process would
not respond almost immediately. I wonder if e.g. the autovacuum launcher
may not be handling these requests, or whether client backends can wait in
a cycle. IIRC condition variables are not covered by the deadlock detector,
so that would be an issue. But maybe I remember wrong?

7) I've also seen this error:

pgbench: error: client 6 script 0 aborted in command 0 query 0: \
ERROR: can't attach the same segment more than once

I haven't investigated it, but it seems like a problem handling errors,
where we fail to detach from a segment after a timeout. I may be wrong,
but it might be related to this:

I opted for DSAs over DSMs to enable memory reuse by freeing
segments for subsequent statistics copies of the same backend,
without needing to recreate DSMs for each request.

I feel like this might be a premature optimization - I don't have a
clear idea how expensive it is to create DSM per request, but my
intuition is that it's cheaper than processing the contexts and
generating the info.

I'd just remove that, unless someone demonstrates it really matters. I
don't really worry about how expensive it is to process a request
(within reason, of course) - it will happen only very rarely. It's more
important to make sure there's no overhead when no one asks the backend
for memory context info, and simplicity.

Also, how expensive is it to just keep the DSA "just in case"? Imagine
someone asks for the memory context info once - isn't it a waste to still
keep the DSA? I don't recall how many resources that could be.

I don't have a clear opinion on that, I'm more asking for opinions.

8) Two minutes seems pretty arbitrary, and also quite high. If a timeout
is necessary, I think it should not be hard-coded.

regards

--
Tomas Vondra

#19Rahila Syed
rahilasyed90@gmail.com
In reply to: Tomas Vondra (#18)
Re: Enhancing Memory Context Statistics Reporting

Hi Tomas,

Thank you for the review.

1) I read through the thread, and in general I agree with the reasoning
for removing the file part - it seems perfectly fine to just dump as
much as we can fit into a buffer, and then summarize the rest. But do we
need to invent a "new" limit here? The other places logging memory
contexts do something like this:

MemoryContextStatsDetail(TopMemoryContext, 100, 100, false);

Which means we only print the 100 memory contexts at the top, and that's
it. Wouldn't that give us a reasonable memory limit too?

I think this prints more than 100 memory contexts, since 100 denotes the
max_level, and contexts at each level could have up to 100 children. This
limit seems much higher than what I am currently storing in the DSA, which
is approx. 7000 contexts. I will verify this again.

2) I see the function got renamed to pg_get_process_memory_contexts(),
bu the comment still says pg_get_remote_backend_memory_contexts().

Fixed

3) I don't see any SGML docs for this new function. I was a bit unsure
what the "summary" argument is meant to do. The comment does not explain
that either.

Added docs.

The intention behind adding a summary argument is to report statistics for
contexts at levels 0 and 1, i.e. TopMemoryContext and its immediate children.

4) I wonder if the function needs to return PID. I mean, the caller
knows which PID it is for, so it seems rather unnecessary.

Perhaps it can be used to ascertain that the information indeed belongs to
the requested pid.

5) In the "summary" mode, it might be useful to include info about how
many child contexts were aggregated. It's useful to know whether there
was 1 child or 10000 children. In the regular (non-summary) mode it'd
always be "1", probably, but maybe it'd interact with the limit in (1).
Not sure.

Sure, I will add this in the next iteration.

6) I feel a bit uneasy about the whole locking / communication scheme.
In particular, I'm worried about lockups, deadlocks, etc. So I decided
to do a trivial stress-test - just run the new function through pgbench
with many clients.

The memstats.sql script does just this:

SELECT * FROM pg_get_process_memory_contexts(
(SELECT pid FROM pg_stat_activity
WHERE pid != pg_backend_pid()
ORDER BY random() LIMIT 1)
, false);

where the inner query just picks a PID for some other backend, and asks
for memory context stats for that.

And just run it like this on a scale 1 pgbench database:

pgbench -n -f memstats.sql -c 10 test

And it gets stuck *immediately*. I've seen it to wait for other client
backends and auxiliary processes like autovacuum launcher.

This is absolutely idle system, there's no reason why a process would
not respond almost immediately.

In my reproduction, this issue occurred because the process was terminated
while the requesting backend was waiting on the condition variable to be
signaled by it. I don’t see any solution other than having the waiting
client backend time out using ConditionVariableTimedSleep.

In the patch, since the timeout was set to a high value, pgbench ended up
stuck waiting for the timeout to occur. The failure happens less frequently
after I added an additional check for the process's existence, but it cannot
be entirely avoided. This is because a process can terminate after we check
for its existence but before it signals the client. In such cases, the
client will not receive any signal.

I wonder if e.g. autovacuum launcher may not be handling these requests,
or what if client backends can wait in a cycle.

Did not see a cyclic wait in client backends due to the pgbench stress test.

7) I've also seen this error:

pgbench: error: client 6 script 0 aborted in command 0 query 0: \
ERROR: can't attach the same segment more than once

I haven't investigated it, but it seems like a problem handling errors,
where we fail to detach from a segment after a timeout.

Thanks for the hint, fixed by adding a missing call to dsa_detach after
the timeout.

I opted for DSAs over DSMs to enable memory reuse by freeing
segments for subsequent statistics copies of the same backend,
without needing to recreate DSMs for each request.

I feel like this might be a premature optimization - I don't have a
clear idea how expensive it is to create DSM per request, but my
intuition is that it's cheaper than processing the contexts and
generating the info.

I'd just remove that, unless someone demonstrates it really matters. I
don't really worry about how expensive it is to process a request
(within reason, of course) - it will happen only very rarely. It's more
important to make sure there's no overhead when no one asks the backend
for memory context info, and simplicity.

Also, how expensive it is to just keep the DSA "just in case"? Imagine
someone asks for the memory context info once - isn't it a was to still
keep the DSA? I don't recall how much resources could that be.

I don't have a clear opinion on that, I'm more asking for opinions.

Imagining a tool that periodically queries the backends for statistics,
it would be beneficial to avoid recreating the DSAs for each call.
Currently, DSAs of size 1MB per process
(i.e., a maximum of 1MB * (MaxBackends + auxiliary processes))
would be created and pinned for subsequent reporting. This size does
not seem excessively high, even for approx. 100 backends and
auxiliary processes.

8) Two minutes seems pretty arbitrary, and also quite high. If a timeout
is necessary, I think it should not be hard-coded.

Not sure which is the ideal value. I changed it to 15 secs and added a
#define for now. Something that gives enough time for the process to respond
but does not hold up the client for too long would be ideal. 15 secs seems
not to be enough for the GitHub CI tests, which fail with a timeout error
at this setting.

PFA an updated patch with the above changes.

Attachments:

v6-Function-to-report-memory-context-stats-of-any-backe.patch (application/octet-stream, +699 -13)
#20Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Rahila Syed (#19)
Re: Enhancing Memory Context Statistics Reporting

On 11/29/24 00:23, Rahila Syed wrote:

Hi Tomas,

Thank you for the review.

1) I read through the thread, and in general I agree with the reasoning
for removing the file part - it seems perfectly fine to just dump as
much as we can fit into a buffer, and then summarize the rest. But do we
need to invent a "new" limit here? The other places logging memory
contexts do something like this:

   MemoryContextStatsDetail(TopMemoryContext, 100, 100, false);

Which means we only print the 100 memory contexts at the top, and that's
it. Wouldn't that give us a reasonable memory limit too?

I think this prints more than 100 memory contexts, since 100 denotes the
max_level and contexts at each level could have up to 100 children. This
limit seems much higher than what I am currently storing in DSA which is
approx. 7000 contexts. I will verify this again.
 

Yeah, you may be right. I don't remember what exactly that limit does.

2) I see the function got renamed to pg_get_process_memory_contexts(),
but the comment still says pg_get_remote_backend_memory_contexts().

Fixed 

3) I don't see any SGML docs for this new function. I was a bit unsure
what the "summary" argument is meant to do. The comment does not explain
that either.

Added docs.
Intention behind adding a summary argument is to report statistics of
contexts at level 0 and 1, i.e. TopMemoryContext and its immediate children.

OK

4) I wonder if the function needs to return PID. I mean, the caller
knows which PID it is for, so it seems rather unnecessary.

Perhaps it can be used to ascertain that the information indeed belongs to
the requested pid.

I find that a bit ... suspicious. By this logic we'd include the input
parameters in every result, but we don't. So why is this case different?

5) In the "summary" mode, it might be useful to include info about how
many child contexts were aggregated. It's useful to know whether there
was 1 child or 10000 children. In the regular (non-summary) mode it'd
always be "1", probably, but maybe it'd interact with the limit in (1).
Not sure.

Sure,  I will add this in the next iteration. 

OK

6) I feel a bit uneasy about the whole locking / communication scheme.
In particular, I'm worried about lockups, deadlocks, etc. So I decided
to do a trivial stress-test - just run the new function through pgbench
with many clients.

The memstats.sql script does just this:

  SELECT * FROM pg_get_process_memory_contexts(
    (SELECT pid FROM pg_stat_activity
      WHERE pid != pg_backend_pid()
      ORDER BY random() LIMIT 1)
    , false);

where the inner query just picks a PID for some other backend, and asks
for memory context stats for that.

And just run it like this on a scale 1 pgbench database:

  pgbench -n -f memstats.sql -c 10 test

And it gets stuck *immediately*. I've seen it to wait for other client
backends and auxiliary processes like autovacuum launcher.

This is absolutely idle system, there's no reason why a process would
not respond almost immediately.

 
In my reproduction, this issue occurred because the process was terminated
while the requesting backend was waiting on the condition variable to be
signaled by it. I don’t see any solution other than having the waiting
client backend time out using ConditionVariableTimedSleep.

In the patch, since the timeout was set to a high value, pgbench ended up
stuck waiting for the timeout to occur. The failure happens less frequently
after I added an additional check for the process's existence, but it cannot
be entirely avoided. This is because a process can terminate after we check
for its existence but before it signals the client. In such cases, the
client will not receive any signal.

Hmmm, I see. I guess there's no way to know if a process responds to us,
but I guess it should be possible to wake up regularly and check if the
process still exists? Wouldn't that solve the case you mentioned?

I wonder if e.g. autovacuum launcher may
not be handling these requests, or what if client backends can wait in a
cycle.

 
Did not see a cyclic wait in client backends due to the pgbench stress test.
 

Not sure, but if I modify the query to only request memory contexts from
non-client processes, i.e.

SELECT * FROM pg_get_process_memory_contexts(
(SELECT pid FROM pg_stat_activity
WHERE pid != pg_backend_pid()
AND backend_type != 'client backend'
ORDER BY random() LIMIT 1)
, false);

then it gets stuck and reports this:

pgbench -n -f select.sql -c 4 -T 10 test
pgbench (18devel)
WARNING: Wait for 105029 process to publish stats timed out, ...

But process 105029 still very much exists, and it's the checkpointer:

$ ps ax | grep 105029
105029 ? Ss 0:00 postgres: checkpointer

OTOH if I modify the script to only look at client backends, and wait
until the processes get "stuck" (i.e. waiting on the condition variable,
consuming 0% CPU), I get this:

$ pgbench -n -f select.sql -c 4 -T 10 test
pgbench (18devel)
WARNING: Wait for 107146 process to publish stats timed out, try again
WARNING: Wait for 107144 process to publish stats timed out, try again
WARNING: Wait for 107147 process to publish stats timed out, try again
transaction type: select.sql
...

but when it gets 'stuck', most of the processes are still very much
running (but waiting for contexts from some other process). In the above
example I see this:

107144 ? Ss 0:02 postgres: user test [local] SELECT
107145 ? Ss 0:01 postgres: user test [local] SELECT
107147 ? Ss 0:02 postgres: user test [local] SELECT

So yes, 107146 seems to be gone. But why would that block getting info
from 107144 and 107147?

Maybe that's acceptable, but couldn't this be an issue with short-lived
connections, making it hard to implement the kind of automated collection
of stats that you envision? If it hits this kind of timeout often, it'll
make it hard to reliably collect info. No?

  > I opted for DSAs over DSMs to enable memory reuse by freeing
  > segments for subsequent statistics copies of the same backend,
  > without needing to recreate DSMs for each request.

I feel like this might be a premature optimization - I don't have a
clear idea how expensive it is to create DSM per request, but my
intuition is that it's cheaper than processing the contexts and
generating the info.

I'd just remove that, unless someone demonstrates it really matters. I
don't really worry about how expensive it is to process a request
(within reason, of course) - it will happen only very rarely. It's more
important to make sure there's no overhead when no one asks the backend
for memory context info, and simplicity.

Also, how expensive is it to just keep the DSA "just in case"? Imagine
someone asks for the memory context info once - isn't it a waste to still
keep the DSA? I don't recall how many resources that could be.

I don't have a clear opinion on that, I'm more asking for opinions.

  
Imagining a tool that periodically queries the backends for statistics, 
it would be beneficial to avoid recreating the DSAs for each call.

I think it would be nice if you backed this with some numbers. I mean,
how expensive is it to create/destroy the DSA? How does it compare to
the other stuff this function needs to do?

Currently, DSAs of size 1MB per process
(i.e., a maximum of 1MB * (MaxBackends + auxiliary processes))
would be created and pinned for subsequent reporting. This size does
not seem excessively high, even for approx 100 backends and
auxiliary processes.

That seems like a pretty substantial amount of memory reserved for each
connection. IMHO the benefits would have to be pretty significant to
justify this, especially considering it's kept "forever", even if you
run the function only once per day.

8) Two minutes seems pretty arbitrary, and also quite high. If a timeout
is necessary, I think it should not be hard-coded.

Not sure which is the ideal value. Changed it to 15 secs and added a
#define as of now.
Something that gives enough time for the process to respond but
does not hold up the client for too long would be ideal. 15 secs seem to
be not enough for github CI tests, which fail with timeout error with
this setting.

PFA an updated patch with the above changes.

Why not make this a parameter of the function? With some sensible
default, but easy to override.

regards

--
Tomas Vondra

#21Rahila Syed
rahilasyed90@gmail.com
In reply to: Tomas Vondra (#20)
#22Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Rahila Syed (#21)
#23Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Rahila Syed (#21)
#24torikoshia
torikoshia@oss.nttdata.com
In reply to: Rahila Syed (#21)
#25Rahila Syed
rahilasyed90@gmail.com
In reply to: Tomas Vondra (#22)
#26Rahila Syed
rahilasyed90@gmail.com
In reply to: torikoshia (#24)
#27Fujii Masao
masao.fujii@gmail.com
In reply to: Rahila Syed (#26)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Rahila Syed (#26)
#29Rahila Syed
rahilasyed90@gmail.com
In reply to: Fujii Masao (#27)
#30Fujii Masao
masao.fujii@gmail.com
In reply to: Rahila Syed (#29)
#31Rahila Syed
rahilasyed90@gmail.com
In reply to: Tomas Vondra (#28)
#32Rahila Syed
rahilasyed90@gmail.com
In reply to: Tomas Vondra (#28)
#33Fujii Masao
masao.fujii@gmail.com
In reply to: Rahila Syed (#32)
#34Rahila Syed
rahilasyed90@gmail.com
In reply to: Fujii Masao (#33)
#35Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Rahila Syed (#34)
#36Rahila Syed
rahilasyed90@gmail.com
In reply to: Tomas Vondra (#35)
#37Rahila Syed
rahilasyed90@gmail.com
In reply to: Tomas Vondra (#35)
#38torikoshia
torikoshia@oss.nttdata.com
In reply to: Rahila Syed (#37)
#39Rahila Syed
rahilasyed90@gmail.com
In reply to: torikoshia (#38)
#40Rahila Syed
rahilasyed90@gmail.com
In reply to: Rahila Syed (#39)
#41Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Rahila Syed (#40)
#42Rahila Syed
rahilasyed90@gmail.com
In reply to: Tomas Vondra (#41)
#43Daniel Gustafsson
daniel@yesql.se
In reply to: Rahila Syed (#42)
#44Rahila Syed
rahilasyed90@gmail.com
In reply to: Daniel Gustafsson (#43)
#45Rahila Syed
rahilasyed90@gmail.com
In reply to: Rahila Syed (#44)
#46Alexander Korotkov
aekorotkov@gmail.com
In reply to: Rahila Syed (#45)
#47Rahila Syed
rahilasyed90@gmail.com
In reply to: Alexander Korotkov (#46)
#48Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rahila Syed (#47)
#49Rahila Syed
rahilasyed90@gmail.com
In reply to: Ashutosh Bapat (#48)
#50Daniel Gustafsson
daniel@yesql.se
In reply to: Rahila Syed (#49)
#51Rahila Syed
rahilasyed90@gmail.com
In reply to: Daniel Gustafsson (#50)
#52Daniel Gustafsson
daniel@yesql.se
In reply to: Rahila Syed (#51)
#53Daniel Gustafsson
daniel@yesql.se
In reply to: Daniel Gustafsson (#52)
#54Rahila Syed
rahilasyed90@gmail.com
In reply to: Daniel Gustafsson (#53)
#55Daniel Gustafsson
daniel@yesql.se
In reply to: Rahila Syed (#54)
#56Andres Freund
andres@anarazel.de
In reply to: Daniel Gustafsson (#55)
#57Rahila Syed
rahilasyed90@gmail.com
In reply to: Andres Freund (#56)
#58Andres Freund
andres@anarazel.de
In reply to: Rahila Syed (#57)
#59Rahila Syed
rahilasyed90@gmail.com
In reply to: Andres Freund (#58)
#60Daniel Gustafsson
daniel@yesql.se
In reply to: Andres Freund (#56)
#61Andres Freund
andres@anarazel.de
In reply to: Daniel Gustafsson (#60)
#62Rahila Syed
rahilasyed90@gmail.com
In reply to: Andres Freund (#61)
#63Daniel Gustafsson
daniel@yesql.se
In reply to: Rahila Syed (#62)
#64Daniel Gustafsson
daniel@yesql.se
In reply to: Daniel Gustafsson (#63)
#65Fujii Masao
masao.fujii@gmail.com
In reply to: Daniel Gustafsson (#64)
#66Daniel Gustafsson
daniel@yesql.se
In reply to: Fujii Masao (#65)
#67Daniel Gustafsson
daniel@yesql.se
In reply to: Fujii Masao (#65)
#68Fujii Masao
masao.fujii@gmail.com
In reply to: Daniel Gustafsson (#67)
#69Rahila Syed
rahilasyed90@gmail.com
In reply to: Fujii Masao (#68)
#70Peter Eisentraut
peter_e@gmx.net
In reply to: Rahila Syed (#69)
#71Daniel Gustafsson
daniel@yesql.se
In reply to: Peter Eisentraut (#70)
#72Rahila Syed
rahilasyed90@gmail.com
In reply to: Daniel Gustafsson (#71)
#73Rahila Syed
rahilasyed90@gmail.com
In reply to: Rahila Syed (#72)
#74Rahila Syed
rahilasyed90@gmail.com
In reply to: Rahila Syed (#73)
#75torikoshia
torikoshia@oss.nttdata.com
In reply to: Rahila Syed (#74)
#76Rahila Syed
rahilasyed90@gmail.com
In reply to: torikoshia (#75)
#77torikoshia
torikoshia@oss.nttdata.com
In reply to: Rahila Syed (#76)
#78Rahila Syed
rahilasyed90@gmail.com
In reply to: torikoshia (#77)
#79torikoshia
torikoshia@oss.nttdata.com
In reply to: Rahila Syed (#78)
#80Rahila Syed
rahilasyed90@gmail.com
In reply to: torikoshia (#79)
#81torikoshia
torikoshia@oss.nttdata.com
In reply to: Rahila Syed (#80)
#82Rahila Syed
rahilasyed90@gmail.com
In reply to: torikoshia (#81)
#83Rahila Syed
rahilasyed90@gmail.com
In reply to: Rahila Syed (#82)
#84Daniel Gustafsson
daniel@yesql.se
In reply to: Rahila Syed (#83)
#85Rahila Syed
rahilasyed90@gmail.com
In reply to: Daniel Gustafsson (#84)
#86Rahila Syed
rahilasyed90@gmail.com
In reply to: Rahila Syed (#85)
#87torikoshia
torikoshia@oss.nttdata.com
In reply to: Rahila Syed (#86)
#88Daniel Gustafsson
daniel@yesql.se
In reply to: Rahila Syed (#86)
#89Rahila Syed
rahilasyed90@gmail.com
In reply to: torikoshia (#87)
#90Rahila Syed
rahilasyed90@gmail.com
In reply to: Daniel Gustafsson (#88)
#91Rahila Syed
rahilasyed90@gmail.com
In reply to: Rahila Syed (#90)
#92Rahila Syed
rahilasyed90@gmail.com
In reply to: Rahila Syed (#91)
#93Rahila Syed
rahilasyed90@gmail.com
In reply to: Rahila Syed (#92)
#94Daniel Gustafsson
daniel@yesql.se
In reply to: Rahila Syed (#93)
#95Robert Haas
robertmhaas@gmail.com
In reply to: Rahila Syed (#93)
#96Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#95)
#97Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#96)
#98Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#97)
#99Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#98)
#100Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#99)
#101Rahila Syed
rahilasyed90@gmail.com
In reply to: Robert Haas (#95)
#102Robert Haas
robertmhaas@gmail.com
In reply to: Rahila Syed (#101)
#103Rahila Syed
rahilasyed90@gmail.com
In reply to: Robert Haas (#102)
#104Rahila Syed
rahilasyed90@gmail.com
In reply to: Rahila Syed (#103)
#105Chao Li
li.evan.chao@gmail.com
In reply to: Rahila Syed (#104)
#106Rahila Syed
rahilasyed90@gmail.com
In reply to: Chao Li (#105)
#107Rahila Syed
rahilasyed90@gmail.com
In reply to: Chao Li (#105)
#108Daniel Gustafsson
daniel@yesql.se
In reply to: Rahila Syed (#107)
#109Rahila Syed
rahilasyed90@gmail.com
In reply to: Daniel Gustafsson (#108)
#110Rahila Syed
rahilasyed90@gmail.com
In reply to: Rahila Syed (#109)