Avoid memory leaks during base backups

Started by Bharath Rupireddyover 3 years ago32 messageshackers

bharath.rupireddyforpostgres@gmail.com

over 3 years ago

Hi,

Postgres currently can leak memory if a failure occurs during base
backup in do_pg_backup_start() or do_pg_backup_stop() or
perform_base_backup(). The palloc'd memory such as backup_state or
tablespace_map in xlogfuncs.c or basebackup.c or tablespaceinfo or the
memory that gets allocated by bbsink_begin_backup() in
perform_base_backup() or any other, is left-out which may cause memory
bloat on the server eventually. To experience this issue, run
pg_basebackup with --label name longer than 1024 characters and
observe memory with watch command, the memory usage goes up.

It looks like the memory leak issue has been there for quite some
time, discussed in [1]/messages/by-id/Yyq15ekNzjZecwMW@paquier.xyz.

I'm proposing a patch that leverages the error callback mechanism and
memory context. The design of the patch is as follows:
1) pg_backup_start() and pg_backup_stop() - the error callback frees
up the backup_state and tablespace_map variables allocated in
TopMemoryContext. We don't need a separate memory context here because
do_pg_backup_start() and do_pg_backup_stop() don't return any
dynamically created memory for now. We can choose to create a separate
memory context for the future changes that may come, but now it is not
required.
2) perform_base_backup() - a new memory context has been created that
gets deleted by the callback upon error.

The error callbacks are typically called for all the elevels, but we
need to free up the memory only when elevel is >= ERROR or ==
COMMERROR. The COMMERROR is a common scenario because the server can
close the connection to the client or vice versa in which case the
base backup fails. For all other elevels like WARNING, NOTICE, DEBUGX,
INFO etc. we don't free up the memory.

I'm attaching v1 patch herewith.

Thoughts?

[1]: /messages/by-id/Yyq15ekNzjZecwMW@paquier.xyz

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Bharath Rupireddy (#1)

Re: Avoid memory leaks during base backups

Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:

Postgres currently can leak memory if a failure occurs during base
backup in do_pg_backup_start() or do_pg_backup_stop() or
perform_base_backup(). The palloc'd memory such as backup_state or
tablespace_map in xlogfuncs.c or basebackup.c or tablespaceinfo or the
memory that gets allocated by bbsink_begin_backup() in
perform_base_backup() or any other, is left-out which may cause memory
bloat on the server eventually. To experience this issue, run
pg_basebackup with --label name longer than 1024 characters and
observe memory with watch command, the memory usage goes up.

It looks like the memory leak issue has been there for quite some
time, discussed in [1].

I'm proposing a patch that leverages the error callback mechanism and
memory context.

This ... seems like inventing your own shape of wheel. The
normal mechanism for preventing this type of leak is to put the
allocations in a memory context that can be reset or deallocated
in mainline code at the end of the operation. I do not think that
having an errcontext callback with side-effects like deallocating
memory is even remotely safe, and it's certainly a first-order
abuse of that mechanism.

regards, tom lane

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

over 3 years ago

In reply to: Tom Lane (#2)

Re: Avoid memory leaks during base backups

On Mon, Sep 26, 2022 at 7:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm proposing a patch that leverages the error callback mechanism and
memory context.

This ... seems like inventing your own shape of wheel. The
normal mechanism for preventing this type of leak is to put the
allocations in a memory context that can be reset or deallocated
in mainline code at the end of the operation.

Yes, that's the typical way and the patch attached does it for
perform_base_backup(). What happens if we allocate some memory in the
new memory context and error-out before reaching the end of operation?
How do we deallocate such memory?
Backup related code has simple-to-generate-error paths in between and
memory can easily be leaked.

Are you suggesting to use sigsetjmp or some other way to prevent memory leaks?

I do not think that
having an errcontext callback with side-effects like deallocating
memory is even remotely safe, and it's certainly a first-order
abuse of that mechanism.

Are you saying that the error callback might deallocate the memory
that may be needed later in the error processing?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 3 years ago

In reply to: Bharath Rupireddy (#3)

Re: Avoid memory leaks during base backups

At Tue, 27 Sep 2022 11:33:56 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

On Mon, Sep 26, 2022 at 7:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

This ... seems like inventing your own shape of wheel. The
normal mechanism for preventing this type of leak is to put the
allocations in a memory context that can be reset or deallocated
in mainline code at the end of the operation.

Yes, that's the typical way and the patch attached does it for
perform_base_backup(). What happens if we allocate some memory in the
new memory context and error-out before reaching the end of operation?
How do we deallocate such memory?

Whoever directly or indirectly catches the exception can do that. For
example, SendBaseBackup() seems to catch execptions from
perform_base_backup(). bbsinc_cleanup() is already resides there.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Michael Paquier

michael@paquier.xyz

over 3 years ago

In reply to: Kyotaro Horiguchi (#4)

Re: Avoid memory leaks during base backups

On Tue, Sep 27, 2022 at 05:32:26PM +0900, Kyotaro Horiguchi wrote:

At Tue, 27 Sep 2022 11:33:56 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

On Mon, Sep 26, 2022 at 7:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

This ... seems like inventing your own shape of wheel. The
normal mechanism for preventing this type of leak is to put the
allocations in a memory context that can be reset or deallocated
in mainline code at the end of the operation.

Yes, that's the typical way and the patch attached does it for
perform_base_backup(). What happens if we allocate some memory in the
new memory context and error-out before reaching the end of operation?
How do we deallocate such memory?

Whoever directly or indirectly catches the exception can do that. For
example, SendBaseBackup() seems to catch execptions from
perform_base_backup(). bbsinc_cleanup() is already resides there.

Even with that, what's the benefit in using an extra memory context in
basebackup.c? backup_label and tablespace_map are mentioned upthread,
but we have a tight control of these, and they should be allocated in
the memory context created for replication commands (grep for
"Replication command context") anyway. Using a dedicated memory
context for the SQL backup functions under TopMemoryContext could be
interesting, on the other hand..
--
Michael

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

over 3 years ago

In reply to: Michael Paquier (#5)

Re: Avoid memory leaks during base backups

On Wed, Sep 28, 2022 at 9:46 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Sep 27, 2022 at 05:32:26PM +0900, Kyotaro Horiguchi wrote:

At Tue, 27 Sep 2022 11:33:56 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

On Mon, Sep 26, 2022 at 7:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

This ... seems like inventing your own shape of wheel. The
normal mechanism for preventing this type of leak is to put the
allocations in a memory context that can be reset or deallocated
in mainline code at the end of the operation.

Yes, that's the typical way and the patch attached does it for
perform_base_backup(). What happens if we allocate some memory in the
new memory context and error-out before reaching the end of operation?
How do we deallocate such memory?

Whoever directly or indirectly catches the exception can do that. For
example, SendBaseBackup() seems to catch execptions from
perform_base_backup(). bbsinc_cleanup() is already resides there.

Even with that, what's the benefit in using an extra memory context in
basebackup.c? backup_label and tablespace_map are mentioned upthread,
but we have a tight control of these, and they should be allocated in
the memory context created for replication commands (grep for
"Replication command context") anyway. Using a dedicated memory
context for the SQL backup functions under TopMemoryContext could be
interesting, on the other hand..

I had the same opinion. Here's what I think - for backup functions, we
can have the new memory context child of TopMemoryContext and for
perform_base_backup(), we can have the memory context child of
CurrentMemoryContext. With PG_TRY()-PG_FINALLY()-PG_END_TRY(), we can
delete those memory contexts upon ERRORs. This approach works for us
since backup-related code doesn't have any FATALs.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Bharath Rupireddy (#6)

Re: Avoid memory leaks during base backups

Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:

I had the same opinion. Here's what I think - for backup functions, we
can have the new memory context child of TopMemoryContext and for
perform_base_backup(), we can have the memory context child of
CurrentMemoryContext. With PG_TRY()-PG_FINALLY()-PG_END_TRY(), we can
delete those memory contexts upon ERRORs. This approach works for us
since backup-related code doesn't have any FATALs.

Not following your last point here? A process exiting on FATAL
does not especially need to clean up its memory allocations first.
Which is good, because "backup-related code doesn't have any FATALs"
seems like an assertion with a very short half-life.

regards, tom lane

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

over 3 years ago

In reply to: Tom Lane (#7)

Re: Avoid memory leaks during base backups

On Wed, Sep 28, 2022 at 10:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:

I had the same opinion. Here's what I think - for backup functions, we
can have the new memory context child of TopMemoryContext and for
perform_base_backup(), we can have the memory context child of
CurrentMemoryContext. With PG_TRY()-PG_FINALLY()-PG_END_TRY(), we can
delete those memory contexts upon ERRORs. This approach works for us
since backup-related code doesn't have any FATALs.

Not following your last point here? A process exiting on FATAL
does not especially need to clean up its memory allocations first.
Which is good, because "backup-related code doesn't have any FATALs"
seems like an assertion with a very short half-life.

You're right. My bad. For FATALs, we don't need to clean the memory as
the process itself exits.

* Note: an ereport(FATAL) will not be caught by this construct; control will
* exit straight through proc_exit().

/*
* Perform error recovery action as specified by elevel.
*/
if (elevel == FATAL)
{

/*
* Do normal process-exit cleanup, then return exit code 1 to indicate
* FATAL termination. The postmaster may or may not consider this
* worthy of panic, depending on which subprocess returns it.
*/
proc_exit(1);
}

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 3 years ago

In reply to: Michael Paquier (#5)

Re: Avoid memory leaks during base backups

At Wed, 28 Sep 2022 13:16:36 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Tue, Sep 27, 2022 at 05:32:26PM +0900, Kyotaro Horiguchi wrote:

At Tue, 27 Sep 2022 11:33:56 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

On Mon, Sep 26, 2022 at 7:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

This ... seems like inventing your own shape of wheel. The
normal mechanism for preventing this type of leak is to put the
allocations in a memory context that can be reset or deallocated
in mainline code at the end of the operation.

Yes, that's the typical way and the patch attached does it for
perform_base_backup(). What happens if we allocate some memory in the
new memory context and error-out before reaching the end of operation?
How do we deallocate such memory?

Whoever directly or indirectly catches the exception can do that. For
example, SendBaseBackup() seems to catch execptions from
perform_base_backup(). bbsinc_cleanup() is already resides there.

Even with that, what's the benefit in using an extra memory context in
basebackup.c? backup_label and tablespace_map are mentioned upthread,
but we have a tight control of these, and they should be allocated in
the memory context created for replication commands (grep for
"Replication command context") anyway. Using a dedicated memory
context for the SQL backup functions under TopMemoryContext could be
interesting, on the other hand..

If I understand you correctly, my point was the usage of error
callbacks. I meant that we can release that tangling memory blocks in
SendBaseBackup() even by directly pfree()ing then NULLing the pointer,
if the pointer were file-scoped static.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#10

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

over 3 years ago

In reply to: Bharath Rupireddy (#6)

Re: Avoid memory leaks during base backups

On Wed, Sep 28, 2022 at 10:09 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Here's what I think - for backup functions, we
can have the new memory context child of TopMemoryContext and for
perform_base_backup(), we can have the memory context child of
CurrentMemoryContext. With PG_TRY()-PG_FINALLY()-PG_END_TRY(), we can
delete those memory contexts upon ERRORs. This approach works for us
since backup-related code doesn't have any FATALs.

Thoughts?

I'm attaching the v2 patch designed as described above. Please review it.

I've added an entry in CF - https://commitfest.postgresql.org/40/3915/

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#11

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Bharath Rupireddy (#10)

Re: Avoid memory leaks during base backups

On Wed, Sep 28, 2022 at 5:30 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I'm attaching the v2 patch designed as described above. Please review it.

I've added an entry in CF - https://commitfest.postgresql.org/40/3915/

This looks odd to me. In the case of a regular backend, the
sigsetjmp() handler in src/backend/tcop/postgres.c is responsible for
cleaning up memory. It calls AbortCurrentTransaction() which will call
CleanupTransaction() which will call AtCleanup_Memory() which will
block away TopTransactionContext. I think we ought to do something
analogous here, and we almost do already. Some walsender commands are
going to be SQL commands and some aren't. For those that aren't, the
same block calls WalSndErrorCleanup() which does similar kinds of
cleanup, including in some situations calling WalSndResourceCleanup()
which cleans up the resource owner in cases where we have a resource
owner without a transaction. I feel like we ought to be trying to tie
the cleanup into WalSndErrorCleanup() or WalSndResourceCleanup() based
on having the memory context that we ought to be blowing away stored
in a global variable, rather than using a try/catch block.

Like, maybe there's a function EstablishWalSenderMemoryContext() that
commands can call before allocating memory that shouldn't survive an
error. And it's deleted after each command if it exists, or if an
error occurs then WalSndErrorCleanup() deletes it.

--
Robert Haas
EDB: http://www.enterprisedb.com

#12

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

over 3 years ago

In reply to: Robert Haas (#11)

Re: Avoid memory leaks during base backups

On Wed, Sep 28, 2022 at 8:46 PM Robert Haas <robertmhaas@gmail.com> wrote:

I feel like we ought to be trying to tie
the cleanup into WalSndErrorCleanup() or WalSndResourceCleanup() based
on having the memory context that we ought to be blowing away stored
in a global variable, rather than using a try/catch block.

Okay, I got rid of the try-catch block. I added two clean up callbacks
(one for SQL backup functions or on-line backup, another for base
backup) that basically delete the respective memory contexts and reset
the file-level variables, they get called from PostgresMain()'s error
handling code.

Like, maybe there's a function EstablishWalSenderMemoryContext() that
commands can call before allocating memory that shouldn't survive an
error. And it's deleted after each command if it exists, or if an
error occurs then WalSndErrorCleanup() deletes it.

I don't think we need any of the above. I've used file-level variables
to hold memory contexts, allocating them whenever needed and cleaning
them up either at the end of backup operation or upon error.

Please review the attached v3 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#13

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Bharath Rupireddy (#12)

Re: Avoid memory leaks during base backups

On Thu, Sep 29, 2022 at 4:29 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Please review the attached v3 patch.

template1=# select * from pg_backup_start('sdgkljsdgkjdsg', true);
pg_backup_start
-----------------
0/2000028
(1 row)

template1=# select 1/0;
ERROR: division by zero
template1=# select * from pg_backup_stop();
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
!?>

--
Robert Haas
EDB: http://www.enterprisedb.com

#14

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

over 3 years ago

In reply to: Robert Haas (#13)

Re: Avoid memory leaks during base backups

On Thu, Sep 29, 2022 at 7:05 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 29, 2022 at 4:29 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Please review the attached v3 patch.

template1=# select * from pg_backup_start('sdgkljsdgkjdsg', true);
pg_backup_start
-----------------
0/2000028
(1 row)

template1=# select 1/0;
ERROR: division by zero
template1=# select * from pg_backup_stop();
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
!?>

Thanks! I used a variable to define the scope to clean up the backup
memory context for SQL functions/on-line backup. We don't have this
problem in case of base backup because we don't give control in
between start and stop backup in perform_base_backup().

Please review the v4 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#15

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

over 3 years ago

In reply to: Bharath Rupireddy (#14)

Re: Avoid memory leaks during base backups

On Thu, Sep 29, 2022 at 10:38 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Please review the v4 patch.

I used valgrind for testing. Without patch, there's an obvious memory
leak [1]==00:00:01:36.306 145709== VALGRINDERROR-BEGIN ==00:00:01:36.306 145709== 24 bytes in 1 blocks are still reachable in loss record 122 of 511 ==00:00:01:36.306 145709== at 0x98E501: palloc (mcxt.c:1170) ==00:00:01:36.306 145709== by 0x9C1795: makeStringInfo (stringinfo.c:45) ==00:00:01:36.306 145709== by 0x2DE22A: pg_backup_start (xlogfuncs.c:96) ==00:00:01:36.306 145709== by 0x4D2DB6: ExecMakeTableFunctionResult (execSRF.c:234) ==00:00:01:36.306 145709== by 0x4F08DA: FunctionNext (nodeFunctionscan.c:95) ==00:00:01:36.306 145709== by 0x4D48EA: ExecScanFetch (execScan.c:133) ==00:00:01:36.306 145709== by 0x4D4963: ExecScan (execScan.c:182) ==00:00:01:36.306 145709== by 0x4F0C84: ExecFunctionScan (nodeFunctionscan.c:270) ==00:00:01:36.306 145709== by 0x4D0255: ExecProcNodeFirst (execProcnode.c:464) ==00:00:01:36.306 145709== by 0x4C32D4: ExecProcNode (executor.h:259) ==00:00:01:36.306 145709== by 0x4C619C: ExecutePlan (execMain.c:1636) ==00:00:01:36.306 145709== by 0x4C3A0F: standard_ExecutorRun (execMain.c:363) ==00:00:01:36.306 145709== ==00:00:01:36.306 145709== VALGRINDERROR-END, with patch no memory leak.

I used ALLOCSET_START_SMALL_SIZES instead of ALLOCSET_DEFAULT_SIZES
for backup memory context so that it can start small and grow if
required.

I'm attaching v5 patch, please review it further.

[1]: ==00:00:01:36.306 145709== VALGRINDERROR-BEGIN ==00:00:01:36.306 145709== 24 bytes in 1 blocks are still reachable in loss record 122 of 511 ==00:00:01:36.306 145709== at 0x98E501: palloc (mcxt.c:1170) ==00:00:01:36.306 145709== by 0x9C1795: makeStringInfo (stringinfo.c:45) ==00:00:01:36.306 145709== by 0x2DE22A: pg_backup_start (xlogfuncs.c:96) ==00:00:01:36.306 145709== by 0x4D2DB6: ExecMakeTableFunctionResult (execSRF.c:234) ==00:00:01:36.306 145709== by 0x4F08DA: FunctionNext (nodeFunctionscan.c:95) ==00:00:01:36.306 145709== by 0x4D48EA: ExecScanFetch (execScan.c:133) ==00:00:01:36.306 145709== by 0x4D4963: ExecScan (execScan.c:182) ==00:00:01:36.306 145709== by 0x4F0C84: ExecFunctionScan (nodeFunctionscan.c:270) ==00:00:01:36.306 145709== by 0x4D0255: ExecProcNodeFirst (execProcnode.c:464) ==00:00:01:36.306 145709== by 0x4C32D4: ExecProcNode (executor.h:259) ==00:00:01:36.306 145709== by 0x4C619C: ExecutePlan (execMain.c:1636) ==00:00:01:36.306 145709== by 0x4C3A0F: standard_ExecutorRun (execMain.c:363) ==00:00:01:36.306 145709== ==00:00:01:36.306 145709== VALGRINDERROR-END
==00:00:01:36.306 145709== VALGRINDERROR-BEGIN
==00:00:01:36.306 145709== 24 bytes in 1 blocks are still reachable in
loss record 122 of 511
==00:00:01:36.306 145709== at 0x98E501: palloc (mcxt.c:1170)
==00:00:01:36.306 145709== by 0x9C1795: makeStringInfo (stringinfo.c:45)
==00:00:01:36.306 145709== by 0x2DE22A: pg_backup_start (xlogfuncs.c:96)
==00:00:01:36.306 145709== by 0x4D2DB6: ExecMakeTableFunctionResult
(execSRF.c:234)
==00:00:01:36.306 145709== by 0x4F08DA: FunctionNext (nodeFunctionscan.c:95)
==00:00:01:36.306 145709== by 0x4D48EA: ExecScanFetch (execScan.c:133)
==00:00:01:36.306 145709== by 0x4D4963: ExecScan (execScan.c:182)
==00:00:01:36.306 145709== by 0x4F0C84: ExecFunctionScan
(nodeFunctionscan.c:270)
==00:00:01:36.306 145709== by 0x4D0255: ExecProcNodeFirst
(execProcnode.c:464)
==00:00:01:36.306 145709== by 0x4C32D4: ExecProcNode (executor.h:259)
==00:00:01:36.306 145709== by 0x4C619C: ExecutePlan (execMain.c:1636)
==00:00:01:36.306 145709== by 0x4C3A0F: standard_ExecutorRun (execMain.c:363)
==00:00:01:36.306 145709==
==00:00:01:36.306 145709== VALGRINDERROR-END

==00:00:01:36.334 145709== VALGRINDERROR-BEGIN
==00:00:01:36.334 145709== 1,024 bytes in 1 blocks are still reachable
in loss record 426 of 511
==00:00:01:36.334 145709== at 0x98E501: palloc (mcxt.c:1170)
==00:00:01:36.334 145709== by 0x9C17CF: initStringInfo (stringinfo.c:63)
==00:00:01:36.334 145709== by 0x9C17A5: makeStringInfo (stringinfo.c:47)
==00:00:01:36.334 145709== by 0x2DE22A: pg_backup_start (xlogfuncs.c:96)
==00:00:01:36.334 145709== by 0x4D2DB6: ExecMakeTableFunctionResult
(execSRF.c:234)
==00:00:01:36.334 145709== by 0x4F08DA: FunctionNext (nodeFunctionscan.c:95)
==00:00:01:36.334 145709== by 0x4D48EA: ExecScanFetch (execScan.c:133)
==00:00:01:36.334 145709== by 0x4D4963: ExecScan (execScan.c:182)
==00:00:01:36.334 145709== by 0x4F0C84: ExecFunctionScan
(nodeFunctionscan.c:270)
==00:00:01:36.334 145709== by 0x4D0255: ExecProcNodeFirst
(execProcnode.c:464)
==00:00:01:36.334 145709== by 0x4C32D4: ExecProcNode (executor.h:259)
==00:00:01:36.334 145709== by 0x4C619C: ExecutePlan (execMain.c:1636)
==00:00:01:36.334 145709==
==00:00:01:36.334 145709== VALGRINDERROR-END

==00:00:01:36.335 145709== VALGRINDERROR-BEGIN
==00:00:01:36.335 145709== 1,096 bytes in 1 blocks are still reachable
in loss record 431 of 511
==00:00:01:36.335 145709== at 0x98E766: palloc0 (mcxt.c:1201)
==00:00:01:36.335 145709== by 0x2DE152: pg_backup_start (xlogfuncs.c:81)
==00:00:01:36.335 145709== by 0x4D2DB6: ExecMakeTableFunctionResult
(execSRF.c:234)
==00:00:01:36.335 145709== by 0x4F08DA: FunctionNext (nodeFunctionscan.c:95)
==00:00:01:36.335 145709== by 0x4D48EA: ExecScanFetch (execScan.c:133)
==00:00:01:36.335 145709== by 0x4D4963: ExecScan (execScan.c:182)
==00:00:01:36.335 145709== by 0x4F0C84: ExecFunctionScan
(nodeFunctionscan.c:270)
==00:00:01:36.335 145709== by 0x4D0255: ExecProcNodeFirst
(execProcnode.c:464)
==00:00:01:36.335 145709== by 0x4C32D4: ExecProcNode (executor.h:259)
==00:00:01:36.335 145709== by 0x4C619C: ExecutePlan (execMain.c:1636)
==00:00:01:36.335 145709== by 0x4C3A0F: standard_ExecutorRun (execMain.c:363)
==00:00:01:36.335 145709== by 0x4C37FA: ExecutorRun (execMain.c:307)
==00:00:01:36.335 145709==
==00:00:01:36.335 145709== VALGRINDERROR-END

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#16

Cary Huang

cary.huang@highgo.ca

over 3 years ago

In reply to: Bharath Rupireddy (#15)

Re: Avoid memory leaks during base backups

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested

Hello

I applied your v5 patch on the current master and run valgrind on it while doing a basebackup with simulated error. No memory leak related to backup is observed. Regression is also passing

thank you

Cary Huang
HighGo Software Canada

#17

Michael Paquier

michael@paquier.xyz

over 3 years ago

In reply to: Cary Huang (#16)

Re: Avoid memory leaks during base backups

On Fri, Oct 14, 2022 at 09:56:31PM +0000, Cary Huang wrote:

I applied your v5 patch on the current master and run valgrind on it
while doing a basebackup with simulated error. No memory leak
related to backup is observed. Regression is also passing.

Echoing with what I mentioned upthread in [1]/messages/by-id/YzPKpKEk/JMjhWEz@paquier.xyz -- Michael, I don't quite
understand why this patch needs to touch basebackup.c, walsender.c
and postgres.c. In the case of a replication command processed by a
WAL sender, memory allocations happen in the memory context created
for replication commands, which is itself, as far as I understand, the
message memory context when we get a 'Q' message for a simple query.
Why do we need more code for a cleanup that should be already
happening? Am I missing something obvious?

xlogfuncs.c, by storing stuff in the TopMemoryContext of the process
running the SQL commands pg_backup_start/stop() is different, of
course. Perhaps the point of centralizing the base backup context in
xlogbackup.c makes sense, but my guess that it makes more sense to
keep that with the SQL functions as these are the only ones in need of
a cleanup, coming down to the fact that the start and stop functions
happen in different queries, aka these are not bind to a message
context.

[1]: /messages/by-id/YzPKpKEk/JMjhWEz@paquier.xyz -- Michael
--
Michael

#18

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

over 3 years ago

In reply to: Michael Paquier (#17)

Re: Avoid memory leaks during base backups

On Mon, Oct 17, 2022 at 1:09 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Oct 14, 2022 at 09:56:31PM +0000, Cary Huang wrote:

I applied your v5 patch on the current master and run valgrind on it
while doing a basebackup with simulated error. No memory leak
related to backup is observed. Regression is also passing.

Echoing with what I mentioned upthread in [1], I don't quite
understand why this patch needs to touch basebackup.c, walsender.c
and postgres.c. In the case of a replication command processed by a
WAL sender, memory allocations happen in the memory context created
for replication commands, which is itself, as far as I understand, the
message memory context when we get a 'Q' message for a simple query.
Why do we need more code for a cleanup that should be already
happening? Am I missing something obvious?

[1]: /messages/by-id/YzPKpKEk/JMjhWEz@paquier.xyz

My bad, I missed that. You are right. We have "Replication command
context" as a child of "MessageContext" memory context for base backup
that gets cleaned upon error in PostgresMain() [1](gdb) bt #0 MemoryContextDelete (context=0x558b7cd0de50) at mcxt.c:378 #1 0x0000558b7c655733 in MemoryContextDeleteChildren (context=0x558b7ccda8c0) at mcxt.c:430 #2 0x0000558b7c65546d in MemoryContextReset (context=0x558b7ccda8c0) at mcxt.c:309 #3 0x0000558b7c43b5cd in PostgresMain (dbname=0x558b7cd11fb8 "", username=0x558b7ccd6298 "ubuntu") at postgres.c:4358 #4 0x0000558b7c364a88 in BackendRun (port=0x558b7cd09620) at postmaster.c:4482 #5 0x0000558b7c36431b in BackendStartup (port=0x558b7cd09620) at postmaster.c:4210 #6 0x0000558b7c3603be in ServerLoop () at postmaster.c:1804 #7 0x0000558b7c35fb1b in PostmasterMain (argc=3, argv=0x558b7ccd4200) at postmaster.c:1476 #8 0x0000558b7c229a0e in main (argc=3, argv=0x558b7ccd4200) at main.c:197.

xlogfuncs.c, by storing stuff in the TopMemoryContext of the process
running the SQL commands pg_backup_start/stop() is different, of
course. Perhaps the point of centralizing the base backup context in
xlogbackup.c makes sense, but my guess that it makes more sense to
keep that with the SQL functions as these are the only ones in need of
a cleanup, coming down to the fact that the start and stop functions
happen in different queries, aka these are not bind to a message
context.

Yes, they're not related to "MessageContext" memory context.

Please see the attached v6 patch that deals with memory leaks for
backup SQL-callable functions.

[1]: (gdb) bt #0 MemoryContextDelete (context=0x558b7cd0de50) at mcxt.c:378 #1 0x0000558b7c655733 in MemoryContextDeleteChildren (context=0x558b7ccda8c0) at mcxt.c:430 #2 0x0000558b7c65546d in MemoryContextReset (context=0x558b7ccda8c0) at mcxt.c:309 #3 0x0000558b7c43b5cd in PostgresMain (dbname=0x558b7cd11fb8 "", username=0x558b7ccd6298 "ubuntu") at postgres.c:4358 #4 0x0000558b7c364a88 in BackendRun (port=0x558b7cd09620) at postmaster.c:4482 #5 0x0000558b7c36431b in BackendStartup (port=0x558b7cd09620) at postmaster.c:4210 #6 0x0000558b7c3603be in ServerLoop () at postmaster.c:1804 #7 0x0000558b7c35fb1b in PostmasterMain (argc=3, argv=0x558b7ccd4200) at postmaster.c:1476 #8 0x0000558b7c229a0e in main (argc=3, argv=0x558b7ccd4200) at main.c:197
(gdb) bt
#0 MemoryContextDelete (context=0x558b7cd0de50) at mcxt.c:378
#1 0x0000558b7c655733 in MemoryContextDeleteChildren
(context=0x558b7ccda8c0) at mcxt.c:430
#2 0x0000558b7c65546d in MemoryContextReset (context=0x558b7ccda8c0)
at mcxt.c:309
#3 0x0000558b7c43b5cd in PostgresMain (dbname=0x558b7cd11fb8 "",
username=0x558b7ccd6298 "ubuntu")
at postgres.c:4358
#4 0x0000558b7c364a88 in BackendRun (port=0x558b7cd09620) at postmaster.c:4482
#5 0x0000558b7c36431b in BackendStartup (port=0x558b7cd09620) at
postmaster.c:4210
#6 0x0000558b7c3603be in ServerLoop () at postmaster.c:1804
#7 0x0000558b7c35fb1b in PostmasterMain (argc=3, argv=0x558b7ccd4200)
at postmaster.c:1476
#8 0x0000558b7c229a0e in main (argc=3, argv=0x558b7ccd4200) at main.c:197

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#19

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Bharath Rupireddy (#18)

Re: Avoid memory leaks during base backups

On Wed, Oct 19, 2022 at 3:04 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Echoing with what I mentioned upthread in [1], I don't quite
understand why this patch needs to touch basebackup.c, walsender.c
and postgres.c. In the case of a replication command processed by a
WAL sender, memory allocations happen in the memory context created
for replication commands, which is itself, as far as I understand, the
message memory context when we get a 'Q' message for a simple query.
Why do we need more code for a cleanup that should be already
happening? Am I missing something obvious?

My bad, I missed that. You are right. We have "Replication command
context" as a child of "MessageContext" memory context for base backup
that gets cleaned upon error in PostgresMain() [1].

Well this still touches postgres.c. And I still think it's an awful
lot of machinery for a pretty trivial problem. As a practical matter,
nobody is going to open a connection and sit there and try to start a
backup over and over again on the same connection. And even if someone
wrote a client that did that -- why? -- they'd have to be awfully
persistent to leak any amount of memory that would actually matter. So
it is not insane to just think of ignoring this problem entirely.

But if we want to fix it, I think we should do it in some more
localized way. One option is to just have do_pg_start_backup() blow
away any old memory context before it allocates any new memory, and
forget about releasing anything in PostgresMain(). That means memory
could remain allocated after a failure until you next retry the
operation, but I don't think that really matters. It's not a lot of
memory; we just don't want it to accumulate across many repetitions.
Another option, perhaps, is to delete some memory context from within
the TRY/CATCH block if non-NULL, although that wouldn't work nicely if
it might blow away the data we need to generate the error message. A
third option is to do something useful inside WalSndErrorCleanup() or
WalSndResourceCleanup() as I suggested previously.

I'm not exactly sure what the right solution is here, but I think you
need to put more thought into how to make the code look simple,
elegant, and non-invasive.

--
Robert Haas
EDB: http://www.enterprisedb.com

#20

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

over 3 years ago

In reply to: Robert Haas (#19)

Re: Avoid memory leaks during base backups

On Wed, Oct 19, 2022 at 8:10 PM Robert Haas <robertmhaas@gmail.com> wrote:

Well this still touches postgres.c. And I still think it's an awful
lot of machinery for a pretty trivial problem. As a practical matter,
nobody is going to open a connection and sit there and try to start a
backup over and over again on the same connection. And even if someone
wrote a client that did that -- why? -- they'd have to be awfully
persistent to leak any amount of memory that would actually matter. So
it is not insane to just think of ignoring this problem entirely.

I understand that the amount of memory allocated by pg_backup_start()
is small compared to the overall RAM, however, I don't think we can
ignore the problem and let postgres cause memory leaks.

But if we want to fix it, I think we should do it in some more
localized way.

Agreed.

One option is to just have do_pg_start_backup() blow
away any old memory context before it allocates any new memory, and
forget about releasing anything in PostgresMain(). That means memory
could remain allocated after a failure until you next retry the
operation, but I don't think that really matters. It's not a lot of
memory; we just don't want it to accumulate across many repetitions.

This seems reasonable to me.

Another option, perhaps, is to delete some memory context from within
the TRY/CATCH block if non-NULL, although that wouldn't work nicely if
it might blow away the data we need to generate the error message.

Right.

A third option is to do something useful inside WalSndErrorCleanup() or
WalSndResourceCleanup() as I suggested previously.

These functions will not be called for SQL-callable backup functions
pg_backup_start() and pg_backup_stop(). And the memory leak problem
we're trying to solve is for SQL-callable functions, but not for
basebaskups as they already have a memory context named "Replication
command context" that gets deleted in PostgresMain().

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#21

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

over 3 years ago

In reply to: Bharath Rupireddy (#20)

#22

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Bharath Rupireddy (#21)

#23