Logical Replication Memory Allocation Error - "invalid memory alloc request size"

Started by Max Madden · 11 months ago · 5 messages · general

maxmmadden@gmail.com

11 months ago

Hello,

I'm encountering a consistent issue with PostgreSQL 15 logical replication
and would appreciate any guidance on debugging or resolving this problem.

*Setup:*
- Source: PostgreSQL 15.x
- Target: PostgreSQL 15.x
- Replication: Logical replication using publication/subscription (pgoutput)
- Tables: 3 tables (details below)

*Table Details:*
- Table 1: ~1,300 records, 7 columns, no large objects
- Table 2: ~100,000 records, 7 columns, no large objects
- Table 3: ~100,000 records, 17 columns, no large objects

*Problem:*

The initial snapshot and data copy complete successfully for all tables.
However, anywhere from 5 minutes to 2 hours after the initial sync, the
subscription consistently fails with memory allocation errors like:

```
2025-06-10 14:14:56.800 UTC [299] ERROR: could not receive data from WAL
stream: ERROR: invalid memory alloc request size 1238451248
2025-06-10 14:14:56.805 UTC [1] LOG: background worker "logical replication
worker" (PID 299) exited with exit code 1
```

This occurs whether I replicate all 3 tables together or individually.

My initial hypothesis is that large transactions are creating WAL segments
that exceed memory limits when sent to the subscriber. However, I haven't
been able to confirm this / find the cause.

*Questions:*
1. What's the best approach to debug this memory allocation issue?
2. Are there specific PostgreSQL settings I should check?
3. How can I identify if large transactions are indeed the root cause?

*Additional Context:*
- This happens consistently across multiple replication attempts
- The error size varies but is always requesting > 1GB
- No custom logical replication settings currently applied
- Subscriber machine has 256 GB of RAM and Ubuntu 20.04
- Can recreate it on different machines
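
A note on the "> 1GB" observation: PostgreSQL's ordinary allocator rejects any single request larger than `MaxAllocSize` (`0x3fffffff` bytes, one byte under 1 GiB), regardless of how much RAM the machine has. The nested `ERROR:` inside "could not receive data from WAL stream" also suggests the failing allocation happens on the sending side and is relayed to the subscriber. As a minimal sketch, assuming log lines shaped like the excerpt above, the requested sizes can be pulled out and compared against that limit. The function name `oversized_requests` is illustrative, not part of any PostgreSQL tool.

```python
import re

# PostgreSQL's MaxAllocSize (src/include/utils/memutils.h): 1 GiB minus 1 byte.
MAX_ALLOC_SIZE = 0x3FFFFFFF

# Matches the error text shown in the log excerpt and captures the requested size.
ALLOC_ERROR = re.compile(r"invalid memory alloc request size (\d+)")


def oversized_requests(log_text):
    """Return every requested size (bytes) found in 'invalid memory alloc' errors."""
    return [int(m.group(1)) for m in ALLOC_ERROR.finditer(log_text)]


# Sample line copied from the report above (joined onto one line).
sample = (
    "2025-06-10 14:14:56.800 UTC [299] ERROR: could not receive data from WAL "
    "stream: ERROR: invalid memory alloc request size 1238451248"
)

for size in oversized_requests(sample):
    over_by = size - MAX_ALLOC_SIZE
    print(f"{size} bytes ({size / 2**30:.2f} GiB), over the limit by {over_by} bytes")
```

Because the limit applies per allocation, adding memory to either machine would not be expected to change the outcome.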

I should also mention that we're operating in a managed environment on
DigitalOcean, which means we don't have direct access to the WAL logs on
the publisher node. This is why the log information above is limited. I
understand this constraint makes it more difficult to provide help, but I
would really appreciate any insights or suggestions you might have.

Thanks,

Max

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

11 months ago

In reply to: Max Madden (#1)

RE: Logical Replication Memory Allocation Error - "invalid memory alloc request size"

Dear Max,

Thanks for the report.

The initial snapshot and data copy complete successfully for all tables. However, anywhere from 5
minutes to 2 hours after the initial sync, the subscription consistently fails with memory allocation errors like:

```
2025-06-10 14:14:56.800 UTC [299] ERROR: could not receive data from WAL stream: ERROR: invalid memory alloc request size 1238451248
2025-06-10 14:14:56.805 UTC [1] LOG: background worker "logical replication worker" (PID 299) exited with exit code 1
```

I think this is a known postgres bug which has also been reported at [1]. We are discussing
how to fix it. Typically this can happen when there are lots of concurrent transactions
and they contain DDL. IIUC there is no good workaround for now: no parameter setting can
avoid the failure. You can only reduce the number of such transactions.

I would be happy if you could apply the patch posted at [1] and confirm that it solves the issue,
but that seems difficult because you are in a managed environment.
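
Since the trigger described above is many concurrent transactions, one rough way to gauge exposure from a managed environment is to sample how many client sessions on the publisher are inside a transaction at a given moment. The following is only a sketch under stated assumptions: `conn` is any Python DB-API connection to the publisher (for example one opened with a PostgreSQL driver), and the function name is hypothetical. The query uses only the standard `pg_stat_activity` view. Without sufficient privileges, other users' sessions may show NULL in `xact_start`, so the count can be an underestimate.

```python
# Counts client backends that currently have a transaction open,
# excluding the session that runs this query.
OPEN_TXN_SQL = """
SELECT count(*)
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND backend_type = 'client backend'
  AND pid <> pg_backend_pid()
"""


def count_open_transactions(conn):
    """Return the number of other client sessions that are inside a transaction.

    `conn` is assumed to be a DB-API connection to the publisher.
    """
    cur = conn.cursor()
    try:
        cur.execute(OPEN_TXN_SQL)
        return cur.fetchone()[0]
    finally:
        cur.close()
```

Sampling this periodically and comparing it with the times of the failures would show whether the errors line up with bursts of concurrent activity. It does not by itself tell which transactions contain DDL.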

[1]: /messages/by-id/CALDaNm0TaTPuza7Fa+DRMzL+mqK3+7RVEvFiRoDJbU2vkJESwg@mail.gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Max Madden

maxmmadden@gmail.com

11 months ago

In reply to: Hayato Kuroda (Fujitsu) (#2)

Re: Logical Replication Memory Allocation Error - "invalid memory alloc request size"

Hi Hayato,

Thank you for your reply.

We have rewritten as many of our transactions as possible to avoid using
temporary tables, and so far, that seems to have resolved the problem.
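
For anyone doing a similar rewrite, a simple way to find candidate statements is to scan the application's SQL for temporary-table DDL. The sketch below is illustrative only: the function name and the sample table names (`orders`, `staging_rows`) are hypothetical, and the naive split on `;` would mis-handle semicolons inside string literals or function bodies. It also does not catch `SELECT ... INTO TEMP`.

```python
import re

# Matches CREATE [GLOBAL | LOCAL] {TEMP | TEMPORARY} TABLE, case-insensitively.
TEMP_TABLE_DDL = re.compile(
    r"\bCREATE\s+(?:(?:GLOBAL|LOCAL)\s+)?(?:TEMP|TEMPORARY)\s+TABLE\b",
    re.IGNORECASE,
)


def statements_creating_temp_tables(sql_text):
    """Return the statements in sql_text that create a temporary table."""
    # Naive statement split; adequate for simple scripts only.
    statements = [s.strip() for s in sql_text.split(";")]
    return [s for s in statements if TEMP_TABLE_DDL.search(s)]


# Hypothetical transaction of the kind that might be rewritten.
sample_sql = """
BEGIN;
CREATE TEMP TABLE staging_rows AS SELECT * FROM orders WHERE status = 'new';
UPDATE orders SET status = 'seen' WHERE id IN (SELECT id FROM staging_rows);
COMMIT;
"""

for stmt in statements_creating_temp_tables(sample_sql):
    print(stmt)
```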

Thank you for your help.

Many thanks,

Max

On Wed, Jun 11, 2025 at 3:31 AM Hayato Kuroda (Fujitsu) <
kuroda.hayato@fujitsu.com> wrote:


Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

11 months ago

In reply to: Max Madden (#3)

RE: Logical Replication Memory Allocation Error - "invalid memory alloc request size"

Dear Max,

We have rewritten as many of our transactions as possible to avoid using
temporary tables, and so far, that seems to have resolved the problem.

Good to know. We will try to fix this as soon as possible.

Sorry for the inconvenience.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Masahiko Sawada

sawada.mshk@gmail.com

10 months ago

In reply to: Hayato Kuroda (Fujitsu) (#4)

Re: Logical Replication Memory Allocation Error - "invalid memory alloc request size"

On Wed, Jun 11, 2025 at 7:36 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Max,

We have rewritten as many of our transactions as possible to avoid using
temporary tables, and so far, that seems to have resolved the problem.

Good to know. We will try to fix this as soon as possible.

I pushed the fix for this issue [1].

Regards,

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d87d07b7ad3b782cb74566cd771ecdb2823adf6a

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com