Get memory contexts of an arbitrary backend process

Started by torikoshia · over 5 years ago · 59 messages · hackers
#1 torikoshia
torikoshia@oss.nttdata.com

Hi,

After commit 3e98c0bafb28de, we can display the usage of the
memory contexts using pg_backend_memory_contexts system
view.

However, its target is limited to the process attached to
the current session.

As discussed in the thread[1], it'll be useful to make it
possible to get the memory contexts of an arbitrary backend
process.

Attached PoC patch makes pg_get_backend_memory_contexts()
display memory contexts of the process with the specified PID.

=# -- PID of the target process is 17051
=# SELECT * FROM pg_get_backend_memory_contexts(17051) ;
         name          | ident |      parent      | level | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes
-----------------------+-------+------------------+-------+-------------+---------------+------------+-------------+------------
 TopMemoryContext      |       |                  |     0 |       68720 |             5 |      16816 |          16 |      51904
 RowDescriptionContext |       | TopMemoryContext |     1 |        8192 |             1 |       6880 |           0 |       1312
 MessageContext        |       | TopMemoryContext |     1 |       65536 |             4 |      19912 |           1 |      45624
...

It doesn't display contexts of all the backends but only
the contexts of the specified process.
I think it would be enough because I suppose this function
is used after investigations using ps command or other OS
level utilities.

The rough idea of implementation is like below:

1. send a signal to the specified process
2. signaled process dumps its memory contexts to a file
3. read the dumped file and display it to the user

Any thoughts?

[1]: /messages/by-id/72a656e0f71d0860161e0b3f67e4d771@oss.nttdata.com

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

Attachments:

0001-Enabled-pg_get_backend_memory_contexts-to-collect-ar.patch (text/x-diff, +364 −22)
#2 Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: torikoshia (#1)
Re: Get memory contexts of an arbitrary backend process

Hi,

On Mon, Aug 31, 2020 at 8:22 PM torikoshia <torikoshia@oss.nttdata.com> wrote:

As discussed in the thread[1], it'll be useful to make it
possible to get the memory contexts of an arbitrary backend
process.

+1

Attached PoC patch makes pg_get_backend_memory_contexts()
display memory contexts of the process with the specified PID.

Thanks, it's a very good patch for discussion.

It doesn't display contexts of all the backends but only
the contexts of specified process.

or we can "SELECT (pg_get_backend_memory_contexts(pid)).* FROM
pg_stat_activity WHERE ...",
so I don't think it's a big deal.

The rough idea of implementation is like below:

1. send a signal to the specified process
2. signaled process dumps its memory contexts to a file
3. read the dumped file and display it to the user

I agree with the overview of the idea.
Here are some comments and questions.

- Currently, "the signal transmission for dumping memory information"
and "the read & output of dump information"
are on the same interface, but I think it would be better to separate them.
How about providing the following three types of functions for users?
- send a signal to the specified pid
- check the status of the signal sent and received
- read the dumped information
- How about managing the status of signal send/receive and dump
operations in a shared hash or something similar?
Sending and receiving signals, dumping memory information, and
referencing dump information all work asynchronously.
Therefore, it would be good to have management information to check
the status of each process.
A simple idea is that:
- when sending a signal to dump to a PID, first record the following
information into the shared hash:
pid (specified pid)
loc (dump location, currently might be ASAP)
recv (did the pid process receive the signal? initially false)
dumped (did the pid process dump its memory information? initially false)
- the specified process receives the signal, updates the status in the
shared hash, then dumps at the specified location.
- the specified process finishes dumping its memory information, and
updates the status in the shared hash.
- Does it allow one process to output multiple dump files?
It appears to be a specification to overwrite at present, but I
thought it would be good to be able to generate
multiple dump files in different phases (e.g., planning phase and
execution phase) in the future.
- How is the dump file cleaned up?

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

#3 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Kasahara Tatsuhito (#2)
Re: Get memory contexts of an arbitrary backend process

Hi

On Mon, Aug 31, 2020 at 17:03, Kasahara Tatsuhito <
kasahara.tatsuhito@gmail.com> wrote:

Hi,

On Mon, Aug 31, 2020 at 8:22 PM torikoshia <torikoshia@oss.nttdata.com>
wrote:

As discussed in the thread[1], it'll be useful to make it
possible to get the memory contexts of an arbitrary backend
process.

+1

Attached PoC patch makes pg_get_backend_memory_contexts()
display memory contexts of the process with the specified PID.

Thanks, it's a very good patch for discussion.

It doesn't display contexts of all the backends but only
the contexts of specified process.

or we can "SELECT (pg_get_backend_memory_contexts(pid)).* FROM
pg_stat_activity WHERE ...",
so I don't think it's a big deal.

The rough idea of implementation is like below:

1. send a signal to the specified process
2. signaled process dumps its memory contexts to a file
3. read the dumped file and display it to the user

I agree with the overview of the idea.
Here are some comments and questions.

- Currently, "the signal transmission for dumping memory information"
and "the read & output of dump information "
are on the same interface, but I think it would be better to separate
them.
How about providing the following three types of functions for users?
- send a signal to specified pid
- check the status of the signal sent and received
- read the dumped information
- How about managing the status of signal send/receive and dump
operations on a shared hash or others ?
Sending and receiving signals, dumping memory information, and
referencing dump information all work asynchronously.
Therefore, it would be good to have management information to check
the status of each process.
A simple idea is that ..
- send a signal to dump to a PID, it first record following
information into the shared hash.
pid (specified pid)
loc (dump location, currently might be ASAP)
recv (did the pid process receive a signal? first false)
dumped (did the pid process dump a mem information? first false)
- specified process receive the signal, update the status in the
shared hash, then dumped at specified location.
- specified process finish dump mem information, update the status
in the shared hash.
- Does it allow one process to output multiple dump files?
It appears to be a specification to overwrite at present, but I
thought it would be good to be able to generate
multiple dump files in different phases (e.g., planning phase and
execution phase) in the future.
- How is the dump file cleaned up?

For a very long time there has been similar discussion about taking
session query and session execution plans from other sessions.

I am not sure how necessary the information in a memory dump is, but I am
sure that taking the current execution plan and the complete text of the
current query is pretty necessary information.

It would be great if this infrastructure could also be used for any
debugging purpose.

Regards

Pavel


Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

#4 Andres Freund
andres@anarazel.de
In reply to: torikoshia (#1)
Re: Get memory contexts of an arbitrary backend process

Hi,

On 2020-08-31 20:22:18 +0900, torikoshia wrote:

After commit 3e98c0bafb28de, we can display the usage of the
memory contexts using pg_backend_memory_contexts system
view.

However, its target is limited to the process attached to
the current session.

As discussed in the thread[1], it'll be useful to make it
possible to get the memory contexts of an arbitrary backend
process.

Attached PoC patch makes pg_get_backend_memory_contexts()
display memory contexts of the process with the specified PID.

Awesome!

It doesn't display contexts of all the backends but only
the contexts of specified process.
I think it would be enough because I suppose this function
is used after investigations using ps command or other OS
level utilities.

It can be used as a building block if all are needed. Getting the
infrastructure right is the big thing here, I think. Adding more
detailed views on top of that data later is easier.

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a2d61302f9..88fb837ecd 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -555,10 +555,10 @@ REVOKE ALL ON pg_shmem_allocations FROM PUBLIC;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
CREATE VIEW pg_backend_memory_contexts AS
-    SELECT * FROM pg_get_backend_memory_contexts();
+    SELECT * FROM pg_get_backend_memory_contexts(-1);

-1 is odd. Why not use NULL or even 0?

+	else
+	{
+		int			rc;
+		int parent_len = strlen(parent);
+		int name_len = strlen(name);
+
+		/*
+		 * write out the current memory context information.
+		 * Since some elements of values are reusable, we write it out.

Not sure what the second comment line here is supposed to mean?

+		 */
+		fputc('D', fpout);
+		rc = fwrite(values, sizeof(values), 1, fpout);
+		rc = fwrite(nulls, sizeof(nulls), 1, fpout);
+
+		/* write out information which is not resuable from serialized values */

s/resuable/reusable/

+		rc = fwrite(&name_len, sizeof(int), 1, fpout);
+		rc = fwrite(name, name_len, 1, fpout);
+		rc = fwrite(&idlen, sizeof(int), 1, fpout);
+		rc = fwrite(clipped_ident, idlen, 1, fpout);
+		rc = fwrite(&level, sizeof(int), 1, fpout);
+		rc = fwrite(&parent_len, sizeof(int), 1, fpout);
+		rc = fwrite(parent, parent_len, 1, fpout);
+		(void) rc;				/* we'll check for error with ferror */
+
+	}

This format is not descriptive. How about serializing to json or
something? Or at least having field names?

Alternatively, build the same tuple we build for the SRF, and serialize
that. Then there's basically no conversion needed.

@@ -117,6 +157,8 @@ PutMemoryContextsStatsTupleStore(Tuplestorestate *tupstore,
Datum
pg_get_backend_memory_contexts(PG_FUNCTION_ARGS)
{
+ int pid = PG_GETARG_INT32(0);
+
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
@@ -147,11 +189,258 @@ pg_get_backend_memory_contexts(PG_FUNCTION_ARGS)

MemoryContextSwitchTo(oldcontext);

-	PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
-								TopMemoryContext, NULL, 0);
+	if (pid == -1)
+	{
+		/*
+		 * Since pid -1 indicates target is the local process, simply
+		 * traverse memory contexts.
+		 */
+		PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
+								TopMemoryContext, "", 0, NULL);
+	}
+	else
+	{
+		/*
+		 * Send signal for dumping memory contexts to the target process,
+		 * and read the dumped file.
+		 */
+		FILE	   *fpin;
+		char		dumpfile[MAXPGPATH];
+
+		SendProcSignal(pid, PROCSIG_DUMP_MEMORY, InvalidBackendId);
+
+		snprintf(dumpfile, sizeof(dumpfile), "pg_memusage/%d", pid);
+
+		while (true)
+		{
+			CHECK_FOR_INTERRUPTS();
+
+			pg_usleep(10000L);
+

Need better signalling back/forth here.

+/*
+ * dump_memory_contexts
+ * 	Dumping local memory contexts to a file.
+ * 	This function does not delete the file as it is intended to be read by
+ * 	another process.
+ */
+static void
+dump_memory_contexts(void)
+{
+	FILE	   *fpout;
+	char		tmpfile[MAXPGPATH];
+	char		dumpfile[MAXPGPATH];
+
+	snprintf(tmpfile, sizeof(tmpfile), "pg_memusage/%d.tmp", MyProcPid);
+	snprintf(dumpfile, sizeof(dumpfile), "pg_memusage/%d", MyProcPid);
+
+	/*
+	 * Open a temp file to dump the current memory context.
+	 */
+	fpout = AllocateFile(tmpfile, PG_BINARY_W);
+	if (fpout == NULL)
+	{
+		ereport(LOG,
+				(errcode_for_file_access(),
+				 errmsg("could not write temporary memory context file \"%s\": %m",
+						tmpfile)));
+		return;
+	}

Probably should be opened with O_CREAT | O_TRUNC?

Greetings,

Andres Freund

#5 torikoshia
torikoshia@oss.nttdata.com
In reply to: Pavel Stehule (#3)
Re: Get memory contexts of an arbitrary backend process

On 2020-09-01 03:29, Pavel Stehule wrote:

Hi

On Mon, Aug 31, 2020 at 17:03, Kasahara Tatsuhito
<kasahara.tatsuhito@gmail.com> wrote:

Hi,

On Mon, Aug 31, 2020 at 8:22 PM torikoshia
<torikoshia@oss.nttdata.com> wrote:

As discussed in the thread[1], it'll be useful to make it
possible to get the memory contexts of an arbitrary backend
process.

+1

Attached PoC patch makes pg_get_backend_memory_contexts()
display memory contexts of the process with the specified PID.

Thanks, it's a very good patch for discussion.

It doesn't display contexts of all the backends but only
the contexts of specified process.

or we can "SELECT (pg_get_backend_memory_contexts(pid)).* FROM
pg_stat_activity WHERE ...",
so I don't think it's a big deal.

The rough idea of implementation is like below:

1. send a signal to the specified process
2. signaled process dumps its memory contexts to a file
3. read the dumped file and display it to the user

I agree with the overview of the idea.
Here are some comments and questions.

Thanks for the comments!

- Currently, "the signal transmission for dumping memory
information"
and "the read & output of dump information "
are on the same interface, but I think it would be better to
separate them.
How about providing the following three types of functions for
users?
- send a signal to specified pid
- check the status of the signal sent and received
- read the dumped information

Is this for future extensibility to make it possible to get
other information like the current execution plan which was
suggested by Pavel?

If so, I agree with considering extensibility, but I'm not
sure whether it's necessary to provide these types of
functions for 'users'.

- How about managing the status of signal send/receive and dump
operations on a shared hash or others ?
Sending and receiving signals, dumping memory information, and
referencing dump information all work asynchronously.
Therefore, it would be good to have management information to
check
the status of each process.
A simple idea is that ..
- send a signal to dump to a PID, it first record following
information into the shared hash.
pid (specified pid)
loc (dump location, currently might be ASAP)
recv (did the pid process receive a signal? first false)
dumped (did the pid process dump a mem information? first
false)
- specified process receive the signal, update the status in the
shared hash, then dumped at specified location.
- specified process finish dump mem information, update the
status
in the shared hash.

Adding management information on shared memory seems necessary
when we want to have more controls over dumping like 'dump
location' or any other information such as 'current execution
plan'.
I'm going to consider this.

- Does it allow one process to output multiple dump files?
It appears to be a specification to overwrite at present, but I
thought it would be good to be able to generate
multiple dump files in different phases (e.g., planning phase and
execution phase) in the future.
- How is the dump file cleaned up?

For a very long time there has been similar discussion about taking
session query and session execution plans from other sessions.

I am not sure how necessary information is in the memory dump, but I
am sure so taking the current execution plan and complete text of the
current query is pretty necessary information.

but can be great so this infrastructure can be used for any debugging
purpose.

Thanks!
It would be good if some part of this effort could serve as
infrastructure for other debugging.
It may be hard, but I will keep your comment in mind.

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION


Regards

Pavel

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com


#6 torikoshia
torikoshia@oss.nttdata.com
In reply to: Andres Freund (#4)
Re: Get memory contexts of an arbitrary backend process

Thanks for reviewing!

I'm going to modify the patch according to your comments.

On 2020-09-01 10:54, Andres Freund wrote:

Hi,

On 2020-08-31 20:22:18 +0900, torikoshia wrote:

After commit 3e98c0bafb28de, we can display the usage of the
memory contexts using pg_backend_memory_contexts system
view.

However, its target is limited to the process attached to
the current session.

As discussed in the thread[1], it'll be useful to make it
possible to get the memory contexts of an arbitrary backend
process.

Attached PoC patch makes pg_get_backend_memory_contexts()
display memory contexts of the process with the specified PID.

Awesome!

It doesn't display contexts of all the backends but only
the contexts of specified process.
I think it would be enough because I suppose this function
is used after investigations using ps command or other OS
level utilities.

It can be used as a building block if all are needed. Getting the
infrastructure right is the big thing here, I think. Adding more
detailed views on top of that data later is easier.

diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
index a2d61302f9..88fb837ecd 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -555,10 +555,10 @@ REVOKE ALL ON pg_shmem_allocations FROM PUBLIC;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
CREATE VIEW pg_backend_memory_contexts AS
-    SELECT * FROM pg_get_backend_memory_contexts();
+    SELECT * FROM pg_get_backend_memory_contexts(-1);

-1 is odd. Why not use NULL or even 0?

+	else
+	{
+		int			rc;
+		int parent_len = strlen(parent);
+		int name_len = strlen(name);
+
+		/*
+		 * write out the current memory context information.
+		 * Since some elements of values are reusable, we write it out.

Not sure what the second comment line here is supposed to mean?

+		 */
+		fputc('D', fpout);
+		rc = fwrite(values, sizeof(values), 1, fpout);
+		rc = fwrite(nulls, sizeof(nulls), 1, fpout);
+
+		/* write out information which is not resuable from serialized 
values */

s/resuable/reusable/

+		rc = fwrite(&name_len, sizeof(int), 1, fpout);
+		rc = fwrite(name, name_len, 1, fpout);
+		rc = fwrite(&idlen, sizeof(int), 1, fpout);
+		rc = fwrite(clipped_ident, idlen, 1, fpout);
+		rc = fwrite(&level, sizeof(int), 1, fpout);
+		rc = fwrite(&parent_len, sizeof(int), 1, fpout);
+		rc = fwrite(parent, parent_len, 1, fpout);
+		(void) rc;				/* we'll check for error with ferror */
+
+	}

This format is not descriptive. How about serializing to json or
something? Or at least having field names?

Alternatively, build the same tuple we build for the SRF, and serialize
that. Then there's basically no conversion needed.

@@ -117,6 +157,8 @@ PutMemoryContextsStatsTupleStore(Tuplestorestate
*tupstore,
Datum
pg_get_backend_memory_contexts(PG_FUNCTION_ARGS)
{
+ int pid = PG_GETARG_INT32(0);
+
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
@@ -147,11 +189,258 @@
pg_get_backend_memory_contexts(PG_FUNCTION_ARGS)

MemoryContextSwitchTo(oldcontext);

-	PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
-								TopMemoryContext, NULL, 0);
+	if (pid == -1)
+	{
+		/*
+		 * Since pid -1 indicates target is the local process, simply
+		 * traverse memory contexts.
+		 */
+		PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
+								TopMemoryContext, "", 0, NULL);
+	}
+	else
+	{
+		/*
+		 * Send signal for dumping memory contexts to the target process,
+		 * and read the dumped file.
+		 */
+		FILE	   *fpin;
+		char		dumpfile[MAXPGPATH];
+
+		SendProcSignal(pid, PROCSIG_DUMP_MEMORY, InvalidBackendId);
+
+		snprintf(dumpfile, sizeof(dumpfile), "pg_memusage/%d", pid);
+
+		while (true)
+		{
+			CHECK_FOR_INTERRUPTS();
+
+			pg_usleep(10000L);
+

Need better signalling back/forth here.

Do you mean I should also send another signal back from the dumped
process to the caller of pg_get_backend_memory_contexts()
when it finishes dumping?

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

+/*
+ * dump_memory_contexts
+ * 	Dumping local memory contexts to a file.
+ * 	This function does not delete the file as it is intended to be 
read by
+ * 	another process.
+ */
+static void
+dump_memory_contexts(void)
+{
+	FILE	   *fpout;
+	char		tmpfile[MAXPGPATH];
+	char		dumpfile[MAXPGPATH];
+
+	snprintf(tmpfile, sizeof(tmpfile), "pg_memusage/%d.tmp", MyProcPid);
+	snprintf(dumpfile, sizeof(dumpfile), "pg_memusage/%d", MyProcPid);
+
+	/*
+	 * Open a temp file to dump the current memory context.
+	 */
+	fpout = AllocateFile(tmpfile, PG_BINARY_W);
+	if (fpout == NULL)
+	{
+		ereport(LOG,
+				(errcode_for_file_access(),
+				 errmsg("could not write temporary memory context file \"%s\": 
%m",
+						tmpfile)));
+		return;
+	}

Probably should be opened with O_CREAT | O_TRUNC?

Greetings,

Andres Freund

#7 Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: torikoshia (#5)
Re: Get memory contexts of an arbitrary backend process

Hi,

On Thu, Sep 3, 2020 at 3:38 PM torikoshia <torikoshia@oss.nttdata.com> wrote:

- Currently, "the signal transmission for dumping memory
information"
and "the read & output of dump information "
are on the same interface, but I think it would be better to
separate them.
How about providing the following three types of functions for
users?
- send a signal to specified pid
- check the status of the signal sent and received
- read the dumped information

Is this for future extensibility to make it possible to get
other information like the current execution plan which was
suggested by Pavel?

Yes, but it's not only for future expansion, but also for the
usability and the stability of this feature.
For example, if you want to read one dumped file multiple times and analyze it,
you will want the ability to just read the dump.
Moreover, when it takes a long time from receiving the signal to the
dump output, or the dump output itself takes a long time, users can
investigate where the time is spent if each step is separated.

If so, I agree with considering extensibility, but I'm not
sure it's necessary whether providing these types of
functions for 'users'.

Of course, it is possible and may be necessary to provide a wrapped
sequence of processes
from sending a signal to reading dump files.
But IMO, some users would like to perform the signal transmission,
state management and
dump file reading processes separately.

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kasahara Tatsuhito (#7)
Re: Get memory contexts of an arbitrary backend process

Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> writes:

Yes, but it's not only for future expansion, but also for the
usability and the stability of this feature.
For example, if you want to read one dumped file multiple times and analyze it,
you will want the ability to just read the dump.

If we design it to make that possible, how are we going to prevent disk
space leaks from never-cleaned-up dump files?

regards, tom lane

#9 Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: Tom Lane (#8)
Re: Get memory contexts of an arbitrary backend process

On Fri, Sep 4, 2020 at 2:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> writes:

Yes, but it's not only for future expansion, but also for the
usability and the stability of this feature.
For example, if you want to read one dumped file multiple times and analyze it,
you will want the ability to just read the dump.

If we design it to make that possible, how are we going to prevent disk
space leaks from never-cleaned-up dump files?

In my opinion, along with features such as a view that allows us to see a
list of dumped files,
it would be good to have a function that simply deletes the dump
files associated with a specific PID,
or deletes all dump files.
Some files may be dumped with unexpected delays, so I think a
cleanup feature will be necessary.
(Also, as with pgsql_tmp files, it might be better to delete dump files
when PostgreSQL starts.)

Or should we try to delete the dump file as soon as we can read it?

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

#10 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kasahara Tatsuhito (#9)
Re: Get memory contexts of an arbitrary backend process

On Fri, Sep 04, 2020 at 11:47:30AM +0900, Kasahara Tatsuhito wrote:

On Fri, Sep 4, 2020 at 2:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> writes:

Yes, but it's not only for future expansion, but also for the
usability and the stability of this feature.
For example, if you want to read one dumped file multiple times and analyze it,
you will want the ability to just read the dump.

If we design it to make that possible, how are we going to prevent disk
space leaks from never-cleaned-up dump files?

In my thought, with features such as a view that allows us to see a
list of dumped files,
it would be better to have a function that simply deletes the dump
files associated with a specific PID,
or to delete all dump files.
Some files may be dumped with unexpected delays, so I think the
cleaning feature will be necessary.
( Also, as the pgsql_tmp file, it might better to delete dump files
when PostgreSQL start.)

Or should we try to delete the dump file as soon as we can read it?

IMO making the cleanup a responsibility of the users (e.g. by exposing
the list of dumped files through a view and expecting users to delete
them in some way) is rather fragile.

I don't quite see what's the point of designing it this way. It was
suggested this improves stability and usability of this feature, but
surely making it unnecessarily complex contradicts both points?

IMHO if the user needs to process the dump repeatedly, what's preventing
him/her from storing it in a file, or something like that? At that point
it's clear it's up to them to remove the file. So I suggest to keep the
feature as simple as possible - hand the dump over and delete.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11 torikoshia
torikoshia@oss.nttdata.com
In reply to: Tomas Vondra (#10)
Re: Get memory contexts of an arbitrary backend process

On 2020-09-04 21:46, Tomas Vondra wrote:

On Fri, Sep 04, 2020 at 11:47:30AM +0900, Kasahara Tatsuhito wrote:

On Fri, Sep 4, 2020 at 2:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> writes:

Yes, but it's not only for future expansion, but also for the
usability and the stability of this feature.
For example, if you want to read one dumped file multiple times and analyze it,
you will want the ability to just read the dump.

If we design it to make that possible, how are we going to prevent
disk
space leaks from never-cleaned-up dump files?

In my thought, with features such as a view that allows us to see a
list of dumped files,
it would be better to have a function that simply deletes the dump
files associated with a specific PID,
or to delete all dump files.
Some files may be dumped with unexpected delays, so I think the
cleaning feature will be necessary.
( Also, as the pgsql_tmp file, it might better to delete dump files
when PostgreSQL start.)

Or should we try to delete the dump file as soon as we can read it?

IMO making the cleanup a responsibility of the users (e.g. by exposing
the list of dumped files through a view and expecting users to delete
them in some way) is rather fragile.

I don't quite see what's the point of designing it this way. It was
suggested this improves stability and usability of this feature, but
surely making it unnecessarily complex contradicts both points?

IMHO if the user needs to process the dump repeatedly, what's
preventing
him/her from storing it in a file, or something like that? At that
point
it's clear it's up to them to remove the file. So I suggest to keep the
feature as simple as possible - hand the dump over and delete.

+1.
If there are no other objections, I'm going to accept this
suggestion.

Regards

#12 Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: torikoshia (#11)
Re: Get memory contexts of an arbitrary backend process

Hi,

On Thu, Sep 10, 2020 at 8:53 PM torikoshia <torikoshia@oss.nttdata.com> wrote:

On 2020-09-04 21:46, Tomas Vondra wrote:

On Fri, Sep 04, 2020 at 11:47:30AM +0900, Kasahara Tatsuhito wrote:

On Fri, Sep 4, 2020 at 2:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> writes:

Yes, but it's not only for future expansion, but also for the
usability and the stability of this feature.
For example, if you want to read one dumped file multiple times and analyze it,
you will want the ability to just read the dump.

If we design it to make that possible, how are we going to prevent
disk
space leaks from never-cleaned-up dump files?

In my thought, with features such as a view that allows us to see a
list of dumped files,
it would be better to have a function that simply deletes the dump
files associated with a specific PID,
or to delete all dump files.
Some files may be dumped with unexpected delays, so I think the
cleaning feature will be necessary.
( Also, as the pgsql_tmp file, it might better to delete dump files
when PostgreSQL start.)

Or should we try to delete the dump file as soon as we can read it?

IMO making the cleanup a responsibility of the users (e.g. by exposing
the list of dumped files through a view and expecting users to delete
them in some way) is rather fragile.

I don't quite see what's the point of designing it this way. It was
suggested this improves stability and usability of this feature, but
surely making it unnecessarily complex contradicts both points?

IMHO if the user needs to process the dump repeatedly, what's
preventing
him/her from storing it in a file, or something like that? At that
point
it's clear it's up to them to remove the file. So I suggest to keep the
feature as simple as possible - hand the dump over and delete.

Yeah, it might be better to avoid making the user responsible for removal.

I think it's fine to have an interface to delete in an emergency, but
I agree that
users shouldn't be made aware of the existence or deletion of dump
files, basically.

+1.
If there are no other objections, I'm going to accept this
suggestion.

So +1

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

#13 Michael Paquier
michael@paquier.xyz
In reply to: Kasahara Tatsuhito (#12)
Re: Get memory contexts of an arbitrary backend process

On Thu, Sep 10, 2020 at 09:11:21PM +0900, Kasahara Tatsuhito wrote:

I think it's fine to have an interface to delete in an emergency, but
I agree that
users shouldn't be made aware of the existence or deletion of dump
files, basically.

Per the CF bot, the number of tests needs to be tweaked, because we
test each entry filtered out with is_deeply(), meaning that the number
of tests needs to be updated to reflect that if the filtered list is
changed:
t/010_pg_basebackup.pl ... 104/109 # Looks like you planned 109 tests but ran 110.
t/010_pg_basebackup.pl ... Dubious, test returned 255 (wstat 65280, 0xff00)
All 109 subtests passed

Simple enough to fix.
--
Michael

#14 torikoshia
torikoshia@oss.nttdata.com
In reply to: Michael Paquier (#13)
Re: Get memory contexts of an arbitrary backend process

Hi,

Thanks for all your comments, I updated the patch.

On Tue, Sep 1, 2020 at 12:03 AM Kasahara Tatsuhito
<kasahara.tatsuhito@gmail.com> wrote:

- How about managing the status of signal send/receive and dump
operations on a shared hash or others?
Sending and receiving signals, dumping memory information, and
referencing dump information all work asynchronously.
Therefore, it would be good to have management information to check
the status of each process.
A simple idea is that ..
- send a signal to dump to a PID; it first records the following
information into the shared hash.
pid (specified pid)
loc (dump location, currently might be ASAP)
recv (did the pid process receive a signal? first false)
dumped (did the pid process dump the mem information? first false)
- the specified process receives the signal, updates the status in
the shared hash, then dumps at the specified location.
- the specified process finishes dumping the mem information, and
updates the status in the shared hash.

I added a shared hash table consisting of minimal members,
mainly for managing whether the file is dumped or not.
Some members like 'loc' seem useful in the future, but I
haven't added them since they're not essential at this point.

On 2020-09-01 10:54, Andres Freund wrote:

CREATE VIEW pg_backend_memory_contexts AS
-    SELECT * FROM pg_get_backend_memory_contexts();
+    SELECT * FROM pg_get_backend_memory_contexts(-1);

-1 is odd. Why not use NULL or even 0?

Changed it from -1 to NULL.

+             rc = fwrite(&name_len, sizeof(int), 1, fpout);
+             rc = fwrite(name, name_len, 1, fpout);
+             rc = fwrite(&idlen, sizeof(int), 1, fpout);
+             rc = fwrite(clipped_ident, idlen, 1, fpout);
+             rc = fwrite(&level, sizeof(int), 1, fpout);
+             rc = fwrite(&parent_len, sizeof(int), 1, fpout);
+             rc = fwrite(parent, parent_len, 1, fpout);
+             (void) rc;                              /* we'll check for error with ferror */
+
+     }

This format is not descriptive. How about serializing to json or
something? Or at least having field names?

Added field names for each value.
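The revised dump format can be sketched in plain C as label-plus-length-prefixed fields, checked once with ferror() in the spirit of the patch (a standalone illustration only; the actual on-disk layout of the patch may differ):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write one labeled, length-prefixed field: "<label>\0" <len> <bytes>. */
static int
write_field(FILE *fp, const char *label, const char *val)
{
    int len = (int) strlen(val);

    if (fwrite(label, strlen(label) + 1, 1, fp) != 1)
        return 0;
    if (fwrite(&len, sizeof(int), 1, fp) != 1)
        return 0;
    if (len > 0 && fwrite(val, len, 1, fp) != 1)
        return 0;
    return ferror(fp) == 0;     /* single error check, as in the patch */
}

/* Read back a field written by write_field; returns the payload length. */
static int
read_field(FILE *fp, char *label, size_t labelsz, char *val, size_t valsz)
{
    int c, i = 0, len;

    while ((c = fgetc(fp)) != EOF && c != '\0' && i < (int) labelsz - 1)
        label[i++] = (char) c;
    label[i] = '\0';
    if (fread(&len, sizeof(int), 1, fp) != 1)
        return -1;
    if (len >= (int) valsz || (len > 0 && fread(val, len, 1, fp) != 1))
        return -1;
    val[len] = '\0';
    return len;
}
```

Labeling each value makes the file self-describing, which is cheaper than full JSON but still debuggable with od or a small reader.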

+             while (true)
+             {
+                     CHECK_FOR_INTERRUPTS();
+
+                     pg_usleep(10000L);
+

Need better signalling back/forth here.

Added a shared hash table that has a flag for managing whether the file
is dumped or not.
I modified it to use this flag.

+     /*
+      * Open a temp file to dump the current memory context.
+      */
+     fpout = AllocateFile(tmpfile, PG_BINARY_W);
+     if (fpout == NULL)
+     {
+             ereport(LOG,
+                             (errcode_for_file_access(),
+                              errmsg("could not write temporary memory context file \"%s\": %m",
+                                             tmpfile)));
+             return;
+     }

Probably should be opened with O_CREAT | O_TRUNC?

AllocateFile() calls fopen() and AFAIU fopen() with mode "w" corresponds
to open() with "O_WRONLY | O_CREAT | O_TRUNC".

Do you mean I should not use fopen() here?
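That understanding matches the C standard: mode "w"/"wb" always creates a missing file and truncates an existing one, so no explicit O_CREAT | O_TRUNC is needed. A minimal standalone check (the path is illustrative):

```c
#include <assert.h>
#include <stdio.h>

/* Write 'data' with mode "wb" and return the resulting file size.
 * An existing, longer file must end up truncated to strlen(data). */
static long
write_wb(const char *path, const char *data)
{
    FILE *fp = fopen(path, "wb");   /* implies O_WRONLY | O_CREAT | O_TRUNC */
    long  size;

    if (fp == NULL)
        return -1;
    fputs(data, fp);
    fclose(fp);

    /* Reopen and measure to confirm the truncation happened. */
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    fseek(fp, 0, SEEK_END);
    size = ftell(fp);
    fclose(fp);
    return size;
}
```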

On 2020-09-24 13:01, Michael Paquier wrote:

On Thu, Sep 10, 2020 at 09:11:21PM +0900, Kasahara Tatsuhito wrote:

I think it's fine to have an interface to delete in an emergency, but
I agree that
users shouldn't be made aware of the existence or deletion of dump
files, basically.

Per the CF bot, the number of tests needs to be tweaked, because we
test each entry filtered out with is_deeply(), meaning that the number
of tests needs to be updated to reflect that if the filtered list is
changed:
t/010_pg_basebackup.pl ... 104/109 # Looks like you planned 109 tests
but ran 110.
t/010_pg_basebackup.pl ... Dubious, test returned 255 (wstat 65280,
0xff00)
All 109 subtests passed

Simple enough to fix.

Incremented the number of tests.

Any thoughts?

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

Attachments:

0002-Enabled-pg_get_backend_memory_contexts-to-collect-ar.patchtext/x-diff; name=0002-Enabled-pg_get_backend_memory_contexts-to-collect-ar.patchDownload+448-28
#15Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: torikoshia (#14)
Re: Get memory contexts of an arbitrary backend process

Hi,

On Fri, Sep 25, 2020 at 4:28 PM torikoshia <torikoshia@oss.nttdata.com> wrote:

Thanks for all your comments, I updated the patch.

Thanks for updating the patch.
I did a brief test and code review.

I added a shared hash table consisted of minimal members
mainly for managing whether the file is dumped or not.
Some members like 'loc' seem useful in the future, but I
haven't added them since it's not essential at this point.

Yes, that would be good.

+        /*
+         * Since we allow only one session can request to dump memory context at
+         * the same time, check whether the dump files already exist.
+         */
+        while (stat(dumpfile, &stat_tmp) == 0 || stat(tmpfile, &stat_tmp) == 0)
+        {
+            pg_usleep(1000000L);
+        }

If pg_get_backend_memory_contexts() is executed by two or more
sessions at the same time, it cannot be run exclusively in this way.
Currently it seems to cause a crash when doing so.
This is easy to reproduce and can be done as follows.

[session-1]
BEGIN;
LOCK TABLE t1;
[Session-2]
BEGIN;
LOCK TABLE t1; <- waiting
[Session-3]
select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
[Session-4]
select * FROM pg_get_backend_memory_contexts(<pid of session-2>);

If you issue commit or abort at session-1, you will get SEGV.

Instead of checking for the existence of the file, it might be better
to use a hash (mcxtdumpHash) entry with LWLock.
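The underlying problem is a classic check-then-act race: both sessions can observe "no dump file yet" and proceed. At the filesystem level, stat()-then-create is equally non-atomic, while open() with O_CREAT | O_EXCL lets exactly one caller win; the hash-entry-plus-LWLock approach plays that same role in shared memory. A generic sketch (not patch code; names are made up):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Non-atomic: two concurrent callers can both pass the existence check
 * before either creates the file, so both "claim" it. */
static int
try_claim_stat(const char *path)
{
    if (access(path, F_OK) == 0)
        return 0;                   /* someone else holds it... maybe */
    int fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return 0;
    close(fd);
    return 1;
}

/* Atomic: O_EXCL guarantees exactly one caller creates the file;
 * everyone else gets EEXIST. */
static int
try_claim_excl(const char *path)
{
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return 0;                   /* EEXIST: someone beat us to it */
    close(fd);
    return 1;
}
```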

+        if (proc == NULL)
+        {
+            ereport(WARNING,
+                    (errmsg("PID %d is not a PostgreSQL server process", dst_pid)));
+            return (Datum) 1;
+        }

Shouldn't it clear the hash entry before return?

+        /* Wait until target process finished dumping file. */
+        while (!entry->is_dumped)
+        {
+            CHECK_FOR_INTERRUPTS();
+            pg_usleep(10000L);
+        }

If the target is killed and exits before dumping the memory
information, you're in an infinite loop here.
So how about making sure that any process that has to stop before
doing a memory dump changes the status of the hash (entry->is_dumped)
before stopping and the caller detects it?
I'm not sure it's best or not, but you might want to use something
like the on_shmem_exit callback.

In the current design, if the caller stops processing before reading
the dumped file, you will have an orphaned file.
It looks like the following.

[session-1]
BEGIN;
LOCK TABLE t1;
[Session-2]
BEGIN;
LOCK TABLE t1; <- waiting
[Session-3]
select * FROM pg_get_backend_memory_contexts(<pid of session-2>);

If you cancel or terminate the session-3, then issue commit or abort
at session-1, you will get orphaned files in pg_memusage.

So if you allow only one session to request a dump file, it could
call pg_memusage_reset() before sending the signal in this function.

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

#16torikoshia
torikoshia@oss.nttdata.com
In reply to: Kasahara Tatsuhito (#15)
Re: Get memory contexts of an arbitrary backend process

On Thu, Oct 1, 2020 at 4:06 PM Kasahara Tatsuhito
<kasahara.tatsuhito@gmail.com> wrote:
Hi,

On Fri, Sep 25, 2020 at 4:28 PM torikoshia <torikoshia@oss.nttdata.com>
wrote:

Thanks for all your comments, I updated the patch.

Thanks for updating the patch.
I did a brief test and code review.

Thanks for your tests and review!

I added a shared hash table consisted of minimal members
mainly for managing whether the file is dumped or not.
Some members like 'loc' seem useful in the future, but I
haven't added them since it's not essential at this point.

Yes, that would be good.

+        /*
+         * Since we allow only one session can request to dump memory context at
+         * the same time, check whether the dump files already exist.
+         */
+        while (stat(dumpfile, &stat_tmp) == 0 || stat(tmpfile, &stat_tmp) == 0)
+        {
+            pg_usleep(1000000L);
+        }

If pg_get_backend_memory_contexts() is executed by two or more
sessions at the same time, it cannot be run exclusively in this way.
Currently it seems to cause a crash when doing so.
This is easy to reproduce and can be done as follows.

[session-1]
BEGIN;
LOCK TABLE t1;
  [Session-2]
  BEGIN;
  LOCK TABLE t1; <- waiting
    [Session-3]
    select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
      [Session-4]
      select * FROM pg_get_backend_memory_contexts(<pid of session-2>);

If you issue commit or abort at session-1, you will get SEGV.

Instead of checking for the existence of the file, it might be better
to use a hash (mcxtdumpHash) entry with LWLock.

Thanks!
Added a LWLock and changed the way from checking the file existence
to finding the hash entry.

+        if (proc == NULL)
+        {
+            ereport(WARNING,
+                    (errmsg("PID %d is not a PostgreSQL server process", dst_pid)));
+            return (Datum) 1;
+        }

Shouldn't it clear the hash entry before return?

Yeah. added codes for removing the entry.

+        /* Wait until target process finished dumping file. */
+        while (!entry->is_dumped)
+        {
+            CHECK_FOR_INTERRUPTS();
+            pg_usleep(10000L);
+        }

If the target is killed and exits before dumping the memory
information, you're in an infinite loop here.
So how about making sure that any process that has to stop before
doing a memory dump changes the status of the hash (entry->is_dumped)
before stopping and the caller detects it?
I'm not sure it's best or not, but you might want to use something
like the on_shmem_exit callback.

Thanks for your idea!
Added a callback to change the status of the hash table entry.

Although I think it's necessary to remove this callback when it finished
processing memory dumping, on_shmem_exit() does not seem to have such
a function.
I used before_shmem_exit() since it has a corresponding function to
remove registered callback.
If it's inappropriate, I'm going to add a function removing the
registered callback of on_shmem_exit().

In the current design, if the caller stops processing before reading
the dumped file, you will have an orphaned file.
It looks like the following.

[session-1]
BEGIN;
LOCK TABLE t1;
  [Session-2]
  BEGIN;
  LOCK TABLE t1; <- waiting
    [Session-3]
    select * FROM pg_get_backend_memory_contexts(<pid of session-2>);

If you cancel or terminate the session-3, then issue commit or abort
at session-1, you will get orphaned files in pg_memusage.

So if you allow only one session can request to dump file, it could
call pg_memusage_reset() before send the signal in this function.

Although I'm going to allow only one session per target process,
I'd like to allow running multiple pg_get_backend_memory_contexts()
calls whose target processes are different.

Instead of calling pg_memusage_reset(), I added a callback for
cleaning up orphaned files and the elements of the hash table
using before_shmem_exit() through PG_ENSURE_ERROR_CLEANUP() and
PG_END_ENSURE_ERROR_CLEANUP().

I chose PG_ENSURE_ERROR_CLEANUP() and PG_END_ENSURE_ERROR_CLEANUP()
here since it can handle not only termination but also cancellation.

Any thoughts?

--
Atsushi Torikoshi

Attachments:

0003-Enabled-pg_get_backend_memory_contexts-to-collect.patchtext/x-diff; name=0003-Enabled-pg_get_backend_memory_contexts-to-collect.patchDownload+642-27
#17Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: torikoshia (#16)
Re: Get memory contexts of an arbitrary backend process

Wait...

Attachments: 0003-Enabled-pg_get_backend_memory_contexts-to-collect.patch

For a moment I thought that the number was a patch version, but the
predecessors are 0002-Enabled..collect.patch and 0001-(same
name). It's not mandatory, but we usually do it as follows, which is
the way of git.

v1-0001-Enabled...collect.patch
v2-0001-Enabled...collect.patch

The vn prefix is added by the -v option of git format-patch.

At Thu, 22 Oct 2020 21:32:00 +0900, torikoshia <torikoshia@oss.nttdata.com> wrote in

I added a shared hash table consisted of minimal members
mainly for managing whether the file is dumped or not.
Some members like 'loc' seem useful in the future, but I
haven't added them since it's not essential at this point.

Yes, that would be good.
+        /*
+         * Since we allow only one session can request to dump memory context at
+         * the same time, check whether the dump files already exist.
+         */
+        while (stat(dumpfile, &stat_tmp) == 0 || stat(tmpfile, &stat_tmp) == 0)
+        {
+            pg_usleep(1000000L);
+        }
If pg_get_backend_memory_contexts() is executed by two or more
sessions at the same time, it cannot be run exclusively in this way.
Currently it seems to cause a crash when doing so.
This is easy to reproduce and can be done as follows.
[session-1]
BEGIN;
LOCK TABLE t1;
  [Session-2]
  BEGIN;
  LOCK TABLE t1; <- waiting
    [Session-3]
    select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
      [Session-4]
      select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
If you issue commit or abort at session-1, you will get SEGV.
Instead of checking for the existence of the file, it might be better
to use a hash (mcxtdumpHash) entry with LWLock.

Thanks!
Added a LWLock and changed the way from checking the file existence
to finding the hash entry.

+        if (proc == NULL)
+        {
+            ereport(WARNING,
+                    (errmsg("PID %d is not a PostgreSQL server process", dst_pid)));
+            return (Datum) 1;
+        }
Shouldn't it clear the hash entry before return?

Yeah. added codes for removing the entry.

+		entry = AddEntryToMcxtdumpHash(dst_pid);
+
+		/* Check whether the target process is PostgreSQL backend process. */
+		/* TODO: Check also whether backend or not. */
+		proc = BackendPidGetProc(dst_pid);
+
+		if (proc == NULL)
+		{
+			ereport(WARNING,
+					(errmsg("PID %d is not a PostgreSQL server process", dst_pid)));
+
+			LWLockAcquire(McxtDumpHashLock, LW_EXCLUSIVE);
+
+			if (hash_search(mcxtdumpHash, &dst_pid, HASH_REMOVE, NULL) == NULL)
+				elog(WARNING, "hash table corrupted");
+
+			LWLockRelease(McxtDumpHashLock);
+
+			return (Datum) 1;
+		}

Why do you enter a useless entry and then remove it immediately?

+		PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) Int32GetDatum(dst_pid));
+		{
+			SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY, InvalidBackendId);

"PROCSIG_DUMP_MEMORY" is somewhat misleading. How about
"PROCSIG_DUMP_MEMCXT" or "PROCSIG_DUMP_MEMORY_CONTEXT"?

I thought that the hash table would prevent multiple requestors from
making a request at once, but the patch doesn't seem to do that.

+			/* Wait until target process finished dumping file. */
+			while (entry->dump_status == MCXTDUMPSTATUS_NOTYET)

This needs an LWLock. And this could read the entry after it is reused
by another backend if the dumper process is gone. That isn't likely to
happen, but theoretically the other backend may set it to
MCXTDUMPSTATUS_NOTYET in between two successive checks on the member.

+	/*
+	 * Make dump file ends with 'D'.
+	 * This is checked by the caller when reading the file.
+	 */
+	fputc('E', fpout);

Which is right?

+	fputc('E', fpout);
+
+	CHECK_FOR_INTERRUPTS();

This means that the process accepts another request and rewrite the
file even while the first requester is reading it. And, the file can
be removed by the first requestor before the second requestor can read
it.

+	mcxtdumpHash = ShmemInitHash("mcxtdump hash",
+							   SHMEM_MEMCONTEXT_SIZE,
+							   SHMEM_MEMCONTEXT_SIZE,

The space needed for this hash doesn't seem to be secured. The hash is
sized to 64 entries, so pg_get_backend_memory_contexts() may fail with
ERROR "out of shared memory".

The hash is used only to check whether the dump file is complete and
whether it ended with an error. If we need only those, a shared byte
array with the length of max_backends is sufficient.
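A per-backend array along those lines might look like the following sketch (the struct, field names, and status values are assumptions for illustration; in PostgreSQL it would live in shared memory behind an LWLock):

```c
#include <assert.h>
#include <string.h>

#define MAX_BACKENDS_DEMO 64

typedef enum McxtDumpStatus
{
    MCXTDUMP_IDLE = 0,          /* no request outstanding */
    MCXTDUMP_REQUESTED,         /* requestor signaled the dumper */
    MCXTDUMP_DONE,              /* dump file complete, ready to read */
    MCXTDUMP_ERROR              /* dumper failed or exited early */
} McxtDumpStatus;

typedef struct McxtDumpShmem
{
    int            requestor[MAX_BACKENDS_DEMO]; /* requestor backend id */
    McxtDumpStatus status[MAX_BACKENDS_DEMO];    /* indexed by dumper id */
    /* in PostgreSQL this would also hold an LWLock */
} McxtDumpShmem;

/* Record a request against dumper 'target' on behalf of 'requestor';
 * only one outstanding request per target is allowed. */
static int
request_dump(McxtDumpShmem *shm, int target, int requestor)
{
    if (shm->status[target] != MCXTDUMP_IDLE)
        return 0;
    shm->requestor[target] = requestor;
    shm->status[target] = MCXTDUMP_REQUESTED;
    return 1;
}
```

Storing the requestor's id in the array also gives the dumper a way to signal completion back without polling, which is the gap noted above.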

+		PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) Int32GetDatum(dst_pid));
+		{
+			SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY, InvalidBackendId);
+
+			/* Wait until target process finished dumping file. */
+			while (entry->dump_status == MCXTDUMPSTATUS_NOTYET)
+			{
+				CHECK_FOR_INTERRUPTS();
+				pg_usleep(10000L);
+			}
+		}
+		PG_END_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) Int32GetDatum(dst_pid));
+
+		if (entry->dump_status == MCXTDUMPSTATUS_ERROR)
+		{
..
+			if (stat(tmpfile, &stat_tmp) == 0)
+				unlink(tmpfile);
+			if (stat(dumpfile, &stat_tmp) == 0)
+				unlink(dumpfile);
...
+			return (Datum) 0;
+		}
+
+		/* Read values from the dumped file and put them on tuplestore. */
+		PutDumpedValuesOnTuplestore(dumpfile, tupstore, tupdesc, dst_pid);

This means that if the function gets sigint before the dumper creates
the file, the dumper can leave a dump file?

+        /* Wait until target process finished dumping file. */
+        while (!entry->is_dumped)
+        {
+            CHECK_FOR_INTERRUPTS();
+            pg_usleep(10000L);
+        }
If the target is killed and exits before dumping the memory
information, you're in an infinite loop here.
So how about making sure that any process that has to stop before
doing a memory dump changes the status of the hash (entry->is_dumped)
before stopping and the caller detects it?
I'm not sure it's best or not, but you might want to use something
like the on_shmem_exit callback.

Thanks for your idea!
Added a callback to change the status of the hash table entry.

Although I think it's necessary to remove this callback when it
finished
processing memory dumping, on_shmem_exit() does not seem to have such
a function.
I used before_shmem_exit() since it has a corresponding function to
remove registered callback.
If it's inappropriate, I'm going to add a function removing the
registered callback of on_shmem_exit().

This seems to leave a file for the pid.

In the current design, if the caller stops processing before reading
the dumped file, you will have an orphaned file.
It looks like the following.
[session-1]
BEGIN;
LOCK TABLE t1;
  [Session-2]
  BEGIN;
  LOCK TABLE t1; <- waiting
    [Session-3]
    select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
If you cancel or terminate the session-3, then issue commit or abort
at session-1, you will get orphaned files in pg_memusage.
So if you allow only one session can request to dump file, it could
call pg_memusage_reset() before send the signal in this function.

Although I'm going to allow only one session per one target process,
I'd like to allow running multiple pg_get_backend_memory_contexts()
which target process is different.

Instead of calling pg_memusage_reset(), I added a callback for
cleaning up orphaned files and the elements of the hash table
using before_shmem_exit() through PG_ENSURE_ERROR_CLEANUP() and
PG_END_ENSURE_ERROR_CLEANUP().

I chose PG_ENSURE_ERROR_CLEANUP() and PG_END_ENSURE_ERROR_CLEANUP()
here since it can handle not only termination but also cancellation.

Any thoughts?

+/*
+ * pg_memusage_reset
+ *	 	Remove the memory context dump files.
+ */
+void
+pg_memusage_reset(int pid)

The function name seems to represent something different from what it
does.

I think we might need to step-back to basic design of this feature
since this patch seems to have unhandled corner cases that are
difficult to find.

- Dump file lifecycle or state-transition of the dumper

Currently the lifecycle of a dump file, or the state-transition of
the dumper process doesn't seem to be well defined.

The file is created only by the dumper.
If the requestor reads the completed file, the reader removes it.

If the dumper receives a cancel request, the dumper removes it if
any.

Of course dumper removes it if it fails to complete the file.

- Better way of signalling between the requestor and the dumper

I think there's no doubt about request signal.

About the complete signal, currently the requestor polls on a flag
in a hash entry. I'm wondering if we can get rid of polling. The
problem on doing that is the lack of a means for a dumper to know
the requestor. We need to store requestor pid or backend id in the
shared hash entry.

By the way, about the shared hash entry: it uses about 70kB for only 64
entries, so it seems less efficient than a shared array that has
MaxBackends entries. If we used the following array on shared memory,

struct hoge
{
BackendId requestor[MAX_BACKENDS];
int status[MAX_BACKENDS];
LWLock lock;
};

This array has the size of 24 * MaxBackends + 16. 24kB for 1000
backends. It could be on dsm since this feature is not used
commonly.

- The way to cancel a request already made. (or management of the dump
state transition.)

Cancellation seems to contain some race conditions. But basically
that could be done by sending a request signal after setting the
hoge.requestor above to some special value, not needing the third
type of signal. The special value should be different from the
initial state, which signals that the process is accepting a new
request.

As a whole, that would look like the following?

------------------------------------------------------------
Successful request.

Requestor dumper state
[idle] initial
[request] -------------------> requestor pid/backendid
-signal->
[dumping]
<-signal-[done]
[read]
[done] --------------------> initial

------------------------------------------------------------
On failure, the dumper signals with setting state to initial.
[request] -------------------> requestor pid/backendid
-signal->
[dumping]
[failed] initial
<-signal-
(some other requestor might come meantime.)
<sees that the requestor is not me>
[failed]

------------------------------------------------------------
If the requestor wants to cancel the request, it sets the state to
'cancel' then signal.

Requestor dumper state
[idle] initial
[request] -------------------> cancel
<if canceled. clean up>
[dumping]
<if canceled. clean up>
<-signal-[done]

-signal-><try to clean up>

Other aspects to consider?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#18torikoshia
torikoshia@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#17)
Re: Get memory contexts of an arbitrary backend process

On 2020-10-23 13:46, Kyotaro Horiguchi wrote:

Wait...

Attachments:
0003-Enabled-pg_get_backend_memory_contexts-to-collect.patch

For a moment I thought that the number was a patch version, but the
predecessors are 0002-Enabled..collect.patch and 0001-(same
name). It's not mandatory, but we usually do it as follows, which is
the way of git.

v1-0001-Enabled...collect.patch
v2-0001-Enabled...collect.patch

The vn prefix is added by the -v option of git format-patch.

Sorry for the confusion. I'll follow that way next time.

At Thu, 22 Oct 2020 21:32:00 +0900, torikoshia
<torikoshia@oss.nttdata.com> wrote in

I added a shared hash table consisted of minimal members
mainly for managing whether the file is dumped or not.
Some members like 'loc' seem useful in the future, but I
haven't added them since it's not essential at this point.

Yes, that would be good.
+        /*
+         * Since we allow only one session can request to dump memory context at
+         * the same time, check whether the dump files already exist.
+         */
+        while (stat(dumpfile, &stat_tmp) == 0 || stat(tmpfile, &stat_tmp) == 0)
+        {
+            pg_usleep(1000000L);
+        }
If pg_get_backend_memory_contexts() is executed by two or more
sessions at the same time, it cannot be run exclusively in this way.
Currently it seems to cause a crash when doing so.
This is easy to reproduce and can be done as follows.
[session-1]
BEGIN;
LOCK TABLE t1;
  [Session-2]
  BEGIN;
  LOCK TABLE t1; <- waiting
    [Session-3]
    select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
      [Session-4]
      select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
If you issue commit or abort at session-1, you will get SEGV.
Instead of checking for the existence of the file, it might be better
to use a hash (mcxtdumpHash) entry with LWLock.

Thanks!
Added a LWLock and changed the way from checking the file existence
to finding the hash entry.

+        if (proc == NULL)
+        {
+            ereport(WARNING,
+                    (errmsg("PID %d is not a PostgreSQL server process", dst_pid)));
+            return (Datum) 1;
+        }
Shouldn't it clear the hash entry before return?

Yeah. added codes for removing the entry.

+		entry = AddEntryToMcxtdumpHash(dst_pid);
+
+		/* Check whether the target process is PostgreSQL backend process. */
+		/* TODO: Check also whether backend or not. */
+		proc = BackendPidGetProc(dst_pid);
+
+		if (proc == NULL)
+		{
+			ereport(WARNING,
+					(errmsg("PID %d is not a PostgreSQL server process", dst_pid)));
+
+			LWLockAcquire(McxtDumpHashLock, LW_EXCLUSIVE);
+
+			if (hash_search(mcxtdumpHash, &dst_pid, HASH_REMOVE, NULL) == NULL)
+				elog(WARNING, "hash table corrupted");
+
+			LWLockRelease(McxtDumpHashLock);
+
+			return (Datum) 1;
+		}

Why do you enter a useless entry and then remove it immediately?

Do you mean I should check the process existence first
since it enables us to skip entering hash entries?

+		PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) Int32GetDatum(dst_pid));
+		{
+			SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY, InvalidBackendId);

"PROCSIG_DUMP_MEMORY" is somewhat misleading. How about
"PROCSIG_DUMP_MEMCXT" or "PROCSIG_DUMP_MEMORY_CONTEXT"?

I'll go with "PROCSIG_DUMP_MEMCXT".

I thought that the hash table would prevent multiple requestors from
making a request at once, but the patch doesn't seem to do that.

+			/* Wait until target process finished dumping file. */
+			while (entry->dump_status == MCXTDUMPSTATUS_NOTYET)

This needs an LWLock. And this could read the entry after it is reused
by another backend if the dumper process is gone. That isn't likely to
happen, but theoretically the other backend may set it to
MCXTDUMPSTATUS_NOTYET in between two successive checks on the member.

Thanks for your notification.
I'll use an LWLock.

+	/*
+	 * Make dump file ends with 'D'.
+	 * This is checked by the caller when reading the file.
+	 */
+	fputc('E', fpout);

Which is right?

Sorry, the comment was wrong..

+	fputc('E', fpout);
+
+	CHECK_FOR_INTERRUPTS();

This means that the process accepts another request and rewrite the
file even while the first requester is reading it. And, the file can
be removed by the first requestor before the second requestor can read
it.

I added CHECK_FOR_INTERRUPTS() here to make dump cancellation
possible; however, considering your indication, I need to think
about a way to handle only the dump cancellation.

+	mcxtdumpHash = ShmemInitHash("mcxtdump hash",
+							   SHMEM_MEMCONTEXT_SIZE,
+							   SHMEM_MEMCONTEXT_SIZE,
The space needed for this hash doesn't seem to be secured. The hash is
sized to 64 entries, so pg_get_backend_memory_contexts() may fail with
ERROR "out of shared memory".

The hash is used only to check if the dump file is completed and if
ended with error. If we need only those, an shared byte array with the
length of max_backend is sufficient.

There was a comment that controlling the dump location may be a
good idea, but it also seems possible to include the location
information in the struct hoge you suggested below.

On Tue, Sep 1, 2020 at 12:03 AM Kasahara Tatsuhito <[hidden email]>
wrote:
| - send a signal to dump to a PID, it first record following
information into the shared hash.
| pid (specified pid)
| loc (dump location, currently might be ASAP)

+		PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) Int32GetDatum(dst_pid));
+		{
+			SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY, InvalidBackendId);
+
+			/* Wait until target process finished dumping file. */
+			while (entry->dump_status == MCXTDUMPSTATUS_NOTYET)
+			{
+				CHECK_FOR_INTERRUPTS();
+				pg_usleep(10000L);
+			}
+		}
+		PG_END_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) Int32GetDatum(dst_pid));
+
+		if (entry->dump_status == MCXTDUMPSTATUS_ERROR)
+		{
..
+			if (stat(tmpfile, &stat_tmp) == 0)
+				unlink(tmpfile);
+			if (stat(dumpfile, &stat_tmp) == 0)
+				unlink(dumpfile);
...
+			return (Datum) 0;
+		}
+
+		/* Read values from the dumped file and put them on tuplestore. */
+		PutDumpedValuesOnTuplestore(dumpfile, tupstore, tupdesc, dst_pid);

This means that if the function gets sigint before the dumper creates
the file, the dumper can leave a dump file?

In that case, the requestor removes the corresponding hash entry in the
callback McxtReqKill; then the dumper, which cannot find the hash entry,
does not dump a file.

However, I've now noticed that when the requestor gets sigint just
after the dumper check, the dump file remains.

In the current design, only the requestor can remove the dump file,
but it seems necessary to allow the dumper to remove it.

+        /* Wait until target process finished dumping file. */
+        while (!entry->is_dumped)
+        {
+            CHECK_FOR_INTERRUPTS();
+            pg_usleep(10000L);
+        }
If the target is killed and exits before dumping the memory
information, you're in an infinite loop here.
So how about making sure that any process that has to stop before
doing a memory dump changes the status of the hash (entry->is_dumped)
before stopping and the caller detects it?
I'm not sure it's best or not, but you might want to use something
like the on_shmem_exit callback.

Thanks for your idea!
Added a callback to change the status of the hash table entry.

Although I think it's necessary to remove this callback when it
finished
processing memory dumping, on_shmem_exit() does not seem to have such
a function.
I used before_shmem_exit() since it has a corresponding function to
remove registered callback.
If it's inappropriate, I'm going to add a function removing the
registered callback of on_shmem_exit().

This seems to leave a file for the pid.

As mentioned above, there can be a chance to remain files.

In the current design, if the caller stops processing before reading
the dumped file, you will have an orphaned file.
It looks like the following.
[session-1]
BEGIN;
LOCK TABLE t1;
  [Session-2]
  BEGIN;
  LOCK TABLE t1; <- waiting
    [Session-3]
    select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
If you cancel or terminate the session-3, then issue commit or abort
at session-1, you will get orphaned files in pg_memusage.
So if you allow only one session can request to dump file, it could
call pg_memusage_reset() before send the signal in this function.

Although I'm going to allow only one session per one target process,
I'd like to allow running multiple pg_get_backend_memory_contexts()
which target process is different.

Instead of calling pg_memusage_reset(), I added a callback for
cleaning up orphaned files and the elements of the hash table
using before_shmem_exit() through PG_ENSURE_ERROR_CLEANUP() and
PG_END_ENSURE_ERROR_CLEANUP().

I chose PG_ENSURE_ERROR_CLEANUP() and PG_END_ENSURE_ERROR_CLEANUP()
here since it can handle not only termination but also cancellation.

Any thoughts?

+/*
+ * pg_memusage_reset
+ *	 	Remove the memory context dump files.
+ */
+void
+pg_memusage_reset(int pid)

The function name seems to represent something different from what it
does.

Yeah, it looks like it actually resets the memory.
I'll rename it to remove_memcxt_file or something.

I think we might need to step-back to basic design of this feature
since this patch seems to have unhandled corner cases that are
difficult to find.

Agreed and thanks for writing it down below.

- Dump file lifecycle or state-transition of the dumper

Currently the lifecycle of a dump file, or the state-transition of
the dumper process doesn't seem to be well defined.

The file is created only by the dumper.
If the requestor reads the completed file, the reader removes it.

If the dumper receives a cancel request, it removes the file if one
exists.

Of course dumper removes it if it fails to complete the file.

- Better way of signalling between the requestor and the dumper

I think there's no doubt about the request signal.

About the complete signal, currently the requestor polls on a flag
in a hash entry. I'm wondering if we can get rid of polling. The
problem on doing that is the lack of a means for a dumper to know
the requestor. We need to store requestor pid or backend id in the
shared hash entry.

Agreed to get rid of polling.

BTW, as far as I read latch.h, it seems common to use a latch instead
of pg_usleep() to wait until a signal arrives.
I'm now thinking about using a latch here, which would remove the
polling.

By the way, about the shared hash entry: it uses about 70kB for only 64
entries, so it seems less efficient than a shared array that has
MaxBackends entries. If we used the following array in shared memory,

struct hoge
{
	BackendId	requestor[MAX_BACKENDS];
	int			status[MAX_BACKENDS];
	LWLock		lock;
};

If the requestor's id is 5 and dumper's id is 10,
is this struct used like below?

- hoge.requestor[10] = 5
- Both status[5] and status[10] change like "request", "idle" or "done"

This array has the size of 24 * MaxBackends + 16. 24kB for 1000
backends. It could be on dsm since this feature is not used
commonly.

Sorry, but I'm not sure about this calculation.
Do you mean 16.24kB for 1000 backends?

Regarding dsm, do you imagine using hoge on dsm?

- The way to cancel a request already made. (or management of the dump
state transition.)

Cancellation seems to contain some race conditions. But basically
that could be done by sending a request signal after setting the
hoge.requestor above to some special value, not needing the third

I imagined hoge.requestor was set to requestor's backendid.
Isn't it hoge.status?

type of signal. The special value should be different from the
initial state, which signals that the process is accepting a new
request.

As a whole, would that look like the following?

------------------------------------------------------------
Successful request.

Requestor                dumper              state
[idle]                                       initial
[request]  --------------------------------> requestor pid/backendid
  -signal->
                         [dumping]
  <-signal-              [done]
[read]
[done]     --------------------------------> initial

------------------------------------------------------------
On failure, the dumper signals after setting the state back to initial.

[request]  --------------------------------> requestor pid/backendid
  -signal->
                         [dumping]
                         [failed]            initial
  <-signal-
(some other requestor might come in the meantime.)
<sees that the requestor is not me>

Is this "some other requestor might come" case relevant to the dumper's failure?

Regards,

--
Atsushi Torikoshi


[failed]

------------------------------------------------------------
If the requestor wants to cancel the request, it sets the state to
'cancel' then signals.

Requestor                dumper              state
[idle]                                       initial
[request]  --------------------------------> cancel
                         <if canceled, clean up>
                         [dumping]
                         <if canceled, clean up>
  <-signal-              [done]

  -signal->              <try to clean up>

Other aspects to consider?

regards.

#19Georgios Kokolatos
gkokolatos@protonmail.com
In reply to: torikoshia (#18)
Re: Get memory contexts of an arbitrary backend process

Hi,

I noticed that this patch fails on the cfbot.
For this, I changed the status to: 'Waiting on Author'.

Cheers,
//Georgios

The new status of this patch is: Waiting on Author

#20torikoshia
torikoshia@oss.nttdata.com
In reply to: Georgios Kokolatos (#19)
Re: Get memory contexts of an arbitrary backend process

On 2020-10-28 15:32, torikoshia wrote:

On 2020-10-23 13:46, Kyotaro Horiguchi wrote:

I think we might need to step-back to basic design of this feature
since this patch seems to have unhandled corner cases that are
difficult to find.

I've written out the basic design below and attached the
corresponding patch.

# Communication flow between the dumper and the requestor
- (1) When requesting a memory context dump, the requestor adds an entry
to the shared memory. The entry manages the dump state and is initially
set to 'REQUESTING'.
- (2) The requestor sends a signal to the dumper and waits on its latch.
- (3) The dumper looks into the corresponding shared memory entry and
changes its state to 'DUMPING'.
- (4) When the dumper completes dumping, it changes the state to
'DONE' and sets the requestor's latch.
- (5) The requestor reads the dump file and shows it to the user.
Finally, the requestor removes the dump file and resets the shared
memory entry.

# Query cancellation
- When the requestor cancels dumping, e.g. signaling using ctrl-C, the
requestor changes the status of the shared memory entry to 'CANCELING'.
- The dumper checks the status when it tries to change the state to
'DONE' at (4); if the state is 'CANCELING', it removes the dump file
and resets the shared memory entry.

# Cleanup dump file and the shared memory entry
- In the normal case, the requestor removes the dump file and resets the
shared memory entry as described in (5).
- When something like query cancellation or process termination
happens on the dumper after (1) and before (3), in other words, the
state is 'REQUESTING', the requestor does the cleanup.
- When something happens on the dumper or the requestor after (3) and
before (4), in other words, the state is 'DUMPING', the dumper does the
cleanup. Specifically, if the requestor cancels the query, it just
changes the state to 'CANCELING' and the dumper notices it and cleans up
things later. OTOH, when the dumper fails to dump, it cleans up the dump
file and deletes the entry on the shared memory.
- When something happens on the requestor after (4), i.e., the state
is 'DONE', the requestor does the cleanup.
- In the case of receiving SIGKILL or power failure, all dump files
are removed in the crash recovery process.

Although there was a suggestion that the shared memory hash
table should be changed to a more efficient structure,
I haven't done it in this patch.
I think it can be treated separately, so I'm going to work
on it later.

On 2020-11-11 00:07, Georgios Kokolatos wrote:

Hi,

I noticed that this patch fails on the cfbot.
For this, I changed the status to: 'Waiting on Author'.

Cheers,
//Georgios

The new status of this patch is: Waiting on Author

Thanks for your notification; I've updated the patch.
Changed the status to: 'Waiting on Author'.

Regards,

--
Atsushi Torikoshi

Attachments:

v4-0001-Enabled-pg_get_backend_memory_contexts-to-collect.patchtext/x-diff; name=v4-0001-Enabled-pg_get_backend_memory_contexts-to-collect.patchDownload+697-29
#21Fujii Masao
masao.fujii@gmail.com
In reply to: torikoshia (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#21)
#23torikoshia
torikoshia@oss.nttdata.com
In reply to: Tom Lane (#22)
#24torikoshia
torikoshia@oss.nttdata.com
In reply to: torikoshia (#23)
#25Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: torikoshia (#24)
#26torikoshia
torikoshia@oss.nttdata.com
In reply to: Kasahara Tatsuhito (#25)
#27torikoshia
torikoshia@oss.nttdata.com
In reply to: torikoshia (#26)
#28torikoshia
torikoshia@oss.nttdata.com
In reply to: torikoshia (#27)
#29torikoshia
torikoshia@oss.nttdata.com
In reply to: torikoshia (#28)
#30Fujii Masao
masao.fujii@gmail.com
In reply to: torikoshia (#29)
#31torikoshia
torikoshia@oss.nttdata.com
In reply to: Fujii Masao (#30)
#32Fujii Masao
masao.fujii@gmail.com
In reply to: torikoshia (#31)
#33torikoshia
torikoshia@oss.nttdata.com
In reply to: Fujii Masao (#32)
#34Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: torikoshia (#33)
#35torikoshia
torikoshia@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#34)
#36Fujii Masao
masao.fujii@gmail.com
In reply to: torikoshia (#35)
#37torikoshia
torikoshia@oss.nttdata.com
In reply to: Fujii Masao (#36)
#38Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: torikoshia (#37)
#39Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#38)
#40Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#39)
#41Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#39)
#42Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#40)
#43Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#42)
#44Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#43)
#45Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#44)
#46torikoshia
torikoshia@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#45)
#47Fujii Masao
masao.fujii@gmail.com
In reply to: torikoshia (#46)
#48torikoshia
torikoshia@oss.nttdata.com
In reply to: Fujii Masao (#47)
#49Fujii Masao
masao.fujii@gmail.com
In reply to: torikoshia (#48)
#50torikoshia
torikoshia@oss.nttdata.com
In reply to: Fujii Masao (#49)
#51Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: torikoshia (#50)
#52Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#51)
#53torikoshia
torikoshia@oss.nttdata.com
In reply to: Fujii Masao (#52)
#54Zhihong Yu
zyu@yugabyte.com
In reply to: torikoshia (#53)
#55Fujii Masao
masao.fujii@gmail.com
In reply to: Zhihong Yu (#54)
#56torikoshia
torikoshia@oss.nttdata.com
In reply to: Fujii Masao (#55)
#57Fujii Masao
masao.fujii@gmail.com
In reply to: torikoshia (#56)
#58torikoshia
torikoshia@oss.nttdata.com
In reply to: Fujii Masao (#57)
#59Fujii Masao
masao.fujii@gmail.com
In reply to: torikoshia (#58)