Limit memory usage by postgres_fdw batches
Hi.
We had some real cases when client set rather big batch_size on server
level, but for some foreign table, containing large documents, it was
inadequate and lead to OOM killer intervention. You can argue that
batch_size can be set on foreign table level, but it can still be not
flexible enough, when tuple size varies. I suppose this case is also
takes place for fetch_size. Issue here is that we can't somehow limit
size of data (versus number of rows) while fetching from cursor. But we
can use tuple store to preserve fetched results, so that they spill out
to the disk.
I'm attaching two patches which try to fix issues with possible huge
memory usage by postgres_fdw batches.
With fetched tuples we still can't use only tuplestore, as ctids are not
preserved, and so have to store them separately.
The reproducer for insert is simple.
create extension postgres_fdw ;
create server loopback foreign data wrapper postgres_fdw options (dbname
'postgres', port '5432', batch_size '100', fetch_size '100');
create table base_table(i int, s bytea);
create foreign table foreign_table (i int, s bytea) server loopback
options(table_name 'base_table');
create user mapping for public server loopback ;
insert into foreign_table select i,
pg_read_binary_file('/some/big/file') from generate_series(1,1000) i;
will easily grow backend RSS to several gigabytes.
The first patch fixes this problem.
The second patch alleviates the second issue - SELECT * queries also can
grow backend memory to several GBs. Still memory usage can peak (on my
toy examples) up to 3-4 GB, but at least it seams 1-2 GB less than
non-patched version.
--
Best regards,
Alexander Pyhalov,
Postgres Professional
Attachments:
v1-0001-Limit-batch_size-for-foreign-insert-with-work_mem.patchtext/x-diff; name=v1-0001-Limit-batch_size-for-foreign-insert-with-work_mem.patchDownload+35-4
v1-0002-Use-tuplestore-in-PgFdwScanState-scan-state-to-limit.patchtext/x-diff; name=v1-0002-Use-tuplestore-in-PgFdwScanState-scan-state-to-limit.patchDownload+66-9
Hi Alexander,
On 12/26/25 11:54, Alexander Pyhalov wrote:
Hi.
We had some real cases when client set rather big batch_size on server
level, but for some foreign table, containing large documents, it was
inadequate and lead to OOM killer intervention. You can argue that
batch_size can be set on foreign table level, but it can still be not
flexible enough, when tuple size varies.
I agree this can be an issue with large tuples. It'd be good to consider
the size, not just the number of tuples.
I think you modified the right places, but I think there are two or
three issues we should improve:
1) It calls estimate_batch_length() for every ExecInsert() call, i.e.
for every tuples. And it walks all tuples up to that point, which makes
it O(N^2). I haven't measured how significant it is, but AFAICS we could
track the current size of the batch fairly easily, and use that.
2) work_mem is in kilobytes, while batch_len is in bytes, so the
comparison (batch_len > work_mem) is not quite right. I'll probably fire
every time, preventing any batching.
3) Isn't this consider the size of the new tuple in batch_len? Imagine
the tuples are 99% of the work_mem limit. We add the first one. When
adding the next one we check the current batch is below work_mem, and so
we proceed to add the second tuple. Now the batch is 1.98% of the limit.
I think it should work like this:
1) batch_len + tup_len < work_mem => add tuple to batch
2) tup_len < work_mem => flush batch, add tuple to batch
3) tup_len => work_mem => flush batch, insert tuple directly
What bothers me a little bit is that this is per relation. AFAICS when
inserting into a partitioned table with multiple foreign partitions,
each partition will have a separate limit. I wonder if we could do
better and have some sort of "global" limit for the whole insert.
But that's not the fault of this patch, of course.
I suppose this case is also
takes place for fetch_size. Issue here is that we can't somehow limit
size of data (versus number of rows) while fetching from cursor. But we
can use tuple store to preserve fetched results, so that they spill out
to the disk.
Perhaps. Seems like a separate issue. I haven't looked very closely, but
do we want to use the tuplestore always, or just when the tuples get too
large? It might even be on batch-by-batch, I guess. With fetch_size=1
it's hardly useful, right? Is the tuplestore management measurable?
I'm attaching two patches which try to fix issues with possible huge
memory usage by postgres_fdw batches.
With fetched tuples we still can't use only tuplestore, as ctids are not
preserved, and so have to store them separately.The reproducer for insert is simple.
create extension postgres_fdw ;
create server loopback foreign data wrapper postgres_fdw options (dbname
'postgres', port '5432', batch_size '100', fetch_size '100');
create table base_table(i int, s bytea);
create foreign table foreign_table (i int, s bytea) server loopback
options(table_name 'base_table');
create user mapping for public server loopback ;insert into foreign_table select i, pg_read_binary_file('/some/big/
file') from generate_series(1,1000) i;will easily grow backend RSS to several gigabytes.
The first patch fixes this problem.The second patch alleviates the second issue - SELECT * queries also can
grow backend memory to several GBs. Still memory usage can peak (on my
toy examples) up to 3-4 GB, but at least it seams 1-2 GB less than non-
patched version.
How large are the tuples? How much higher was this RSS than the
theoretical minimum?
regards
--
Tomas Vondra
Tomas Vondra писал(а) 2025-12-27 16:18:
Hi Alexander,
On 12/26/25 11:54, Alexander Pyhalov wrote:
Hi.
We had some real cases when client set rather big batch_size on server
level, but for some foreign table, containing large documents, it was
inadequate and lead to OOM killer intervention. You can argue that
batch_size can be set on foreign table level, but it can still be not
flexible enough, when tuple size varies.I agree this can be an issue with large tuples. It'd be good to
consider
the size, not just the number of tuples.I think you modified the right places, but I think there are two or
three issues we should improve:1) It calls estimate_batch_length() for every ExecInsert() call, i.e.
for every tuples. And it walks all tuples up to that point, which makes
it O(N^2). I haven't measured how significant it is, but AFAICS we
could
track the current size of the batch fairly easily, and use that.2) work_mem is in kilobytes, while batch_len is in bytes, so the
comparison (batch_len > work_mem) is not quite right. I'll probably
fire
every time, preventing any batching.3) Isn't this consider the size of the new tuple in batch_len? Imagine
the tuples are 99% of the work_mem limit. We add the first one. When
adding the next one we check the current batch is below work_mem, and
so
we proceed to add the second tuple. Now the batch is 1.98% of the
limit.I think it should work like this:
1) batch_len + tup_len < work_mem => add tuple to batch
2) tup_len < work_mem => flush batch, add tuple to batch
3) tup_len => work_mem => flush batch, insert tuple directlyWhat bothers me a little bit is that this is per relation. AFAICS when
inserting into a partitioned table with multiple foreign partitions,
each partition will have a separate limit. I wonder if we could do
better and have some sort of "global" limit for the whole insert.But that's not the fault of this patch, of course.
Hi. Thank you for feedback.
I've fixed the first two issues in the first patch.
I've tried to implement switching from batch to usual foreign insert
(it's in the second patch). However, I don't like it. It changes the
resultRelInfo->ri_NumSlots interface. Now it can be 0 when BatchInsert
was followed by foreign insert, and ExecPendingInserts() should handle
this. Also we can't longer keep fact that BatchInsert() was performed
private to ExecInsert(), as we have to determine if
estate->es_insert_pending_result_relations and
estate->es_insert_pending_modifytables lists were modified between
ExecInsert() calls. In theory we could look at
list_member_ptr(estate->es_insert_pending_result_relations,
resultRelInfo), but I don't know if iterating over this list per each
tuple is a good idea.
I suppose this case is also
takes place for fetch_size. Issue here is that we can't somehow limit
size of data (versus number of rows) while fetching from cursor. But
we
can use tuple store to preserve fetched results, so that they spill
out
to the disk.Perhaps. Seems like a separate issue. I haven't looked very closely,
but
do we want to use the tuplestore always, or just when the tuples get
too
large? It might even be on batch-by-batch, I guess. With fetch_size=1
it's hardly useful, right? Is the tuplestore management measurable?
I haven't noticed significant impact on simple pgbench tests. I will
check different cases with
fetch_size 1 - for now I don't understand if we could trigger OOM for
example with async foreign
scan of partitioned table with many foreign partitions - now we would
have to store each tuple
we get in the memory, perhaps, saving it to the store until it's needed
by Append is not so bad idea.
I'm attaching two patches which try to fix issues with possible huge
memory usage by postgres_fdw batches.
With fetched tuples we still can't use only tuplestore, as ctids are
not
preserved, and so have to store them separately.The reproducer for insert is simple.
create extension postgres_fdw ;
create server loopback foreign data wrapper postgres_fdw options
(dbname
'postgres', port '5432', batch_size '100', fetch_size '100');
create table base_table(i int, s bytea);
create foreign table foreign_table (i int, s bytea) server loopback
options(table_name 'base_table');
create user mapping for public server loopback ;insert into foreign_table select i, pg_read_binary_file('/some/big/
file') from generate_series(1,1000) i;will easily grow backend RSS to several gigabytes.
The first patch fixes this problem.The second patch alleviates the second issue - SELECT * queries also
can
grow backend memory to several GBs. Still memory usage can peak (on my
toy examples) up to 3-4 GB, but at least it seams 1-2 GB less than
non-
patched version.How large are the tuples? How much higher was this RSS than the
theoretical minimum?
Sorry, have lost the exact reproducer. Memory consumption of course
depends on tuple size and fetch size.
Remade some simple tests.
select * from pg_foreign_server ;
oid | srvname | srvowner | srvfdw | srvtype | srvversion | srvacl |
srvoptions
-------+----------+----------+--------+---------+------------+--------+-----------------------------------------------------------
16403 | loopback | 10 | 16392 | | | |
{dbname=postgres,port=5432,batch_size=100,fetch_size=100}
(1 row)
select count(*) from foreign_table ;
count
-------
1100
(1 row)
\d+ base_table
Table "public.base_table"
Column | Type | Collation | Nullable | Default | Storage |
Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
i | integer | | | | plain |
| |
s | bytea | | | | extended |
| |
Access method: heap
select pg_size_pretty(pg_total_relation_size('base_table')) ;
pg_size_pretty
----------------
5495 MB
(1 row)
postgres=# select
pg_size_pretty(pg_total_relation_size('base_table')/1100) ;
pg_size_pretty
----------------
5116 kB
(1 row)
Backend, executing
"select i, length(s) from foreign_table "
query takes up to 1.4 GB RSS on master and up to 1 GB on patched
version.
P.S. Going to be on vacation for the first half of January.
--
Best regards,
Alexander Pyhalov,
Postgres Professional
Attachments:
v2-0001-Limit-batch_size-for-foreign-insert-with-work_mem.patchtext/x-diff; name=v2-0001-Limit-batch_size-for-foreign-insert-with-work_mem.patchDownload+13-4
v2-0002-Fall-back-to-foreign-insert-when-tuple-is-too-big.patchtext/x-diff; name=v2-0002-Fall-back-to-foreign-insert-when-tuple-is-too-big.patchDownload+96-64
v2-0003-Use-tuplestore-in-PgFdwScanState-scan-state-to-limit.patchtext/x-diff; name=v2-0003-Use-tuplestore-in-PgFdwScanState-scan-state-to-limit.patchDownload+66-9
Hi.
I've looked at the third patch more and found some evident issues.
1) While using tuplestore we get too much garbage from tuple conversion,
which was not cleared properly. Tried to fix it, but now we come to the
second problem.
2) While receiving tuples we've already allocated memory in libpq, so it
can be too late to do something. In simple examples I can see libpq
result sets which have several GB size (and likely are not limited by
anything).
I've used the attached script to create some tables.
select create_partitioned_rel('t1', 128, true, 1);
select create_partitioned_rel('t2', 4, true, 100);
insert into t1 select i, pg_read_binary_file('/some/100mb/file') from
generate_series(1,128) i;
insert into t2 select i, pg_read_binary_file('/some/100mb/file') from
generate_series(1,128) i;
And with simple queries like
select sum(length(s)) from TABLE;
we can see that backend can easily consume up to several GBs of RAM.
For now I start thinking we need some form of FETCH, which stops
fetching data based on batch size...
--
Best regards,
Alexander Pyhalov,
Postgres Professional
Attachments:
v3-0003-WIP-Use-tuplestore-in-PgFdwScanState-scan-state-to-l.patchtext/x-diff; name=v3-0003-WIP-Use-tuplestore-in-PgFdwScanState-scan-state-to-l.patchDownload+74-9
create_test_example.sqltext/plain; name=create_test_example.sqlDownload
v3-0001-Limit-batch_size-for-foreign-insert-with-work_mem.patchtext/x-diff; name=v3-0001-Limit-batch_size-for-foreign-insert-with-work_mem.patchDownload+13-4
v3-0002-Fall-back-to-foreign-insert-when-tuple-is-too-big.patchtext/x-diff; name=v3-0002-Fall-back-to-foreign-insert-when-tuple-is-too-big.patchDownload+96-64
Alexander Pyhalov писал(а) 2026-01-13 13:44:
For now I start thinking we need some form of FETCH, which stops
fetching data based on batch size...
Hi.
To limit memory consumption, we actually have to retreive less data. And
we can do it only on the side of the foreign server. I've rewritten the
third patch. We introduce a new parameter - cursor_fetch_limit, which is
set by postgres_fdw. When it is set, fetching limited count of records
from the cursor is also limited by memory consumed by the records. Of
course, record size is some estimation (for example, we don't know what
out function will do).
This works as expected - in my tests with tables of large records,
backends, executing selects, were always restricted by about 2 GB of RAM
overall (without patch memory consumption easily grows up to 8 GB).
However, now when we got less tuples from executor, than expected, we
should recheck, if these are all tuples we can get. I've introduced
es_eof EState field to signal that there's no more tuples. Don't know if
it's the best way.
--
Best regards,
Alexander Pyhalov,
Postgres Professional
Attachments:
v4-0001-Limit-batch_size-for-foreign-insert-with-work_mem.patchtext/x-diff; name=v4-0001-Limit-batch_size-for-foreign-insert-with-work_mem.patchDownload+13-4
v4-0002-Fall-back-to-foreign-insert-when-tuple-is-too-big.patchtext/x-diff; name=v4-0002-Fall-back-to-foreign-insert-when-tuple-is-too-big.patchDownload+96-64
v4-0003-Introduce-cursor_fetch_limit-parameter-and-use-it-in.patchtext/x-diff; name=v4-0003-Introduce-cursor_fetch_limit-parameter-and-use-it-in.patchDownload+304-13
Alexander Pyhalov писал(а) 2026-01-23 09:18:
Alexander Pyhalov писал(а) 2026-01-13 13:44:
For now I start thinking we need some form of FETCH, which stops
fetching data based on batch size...Hi.
To limit memory consumption, we actually have to retreive less data.
And we can do it only on the side of the foreign server. I've rewritten
the third patch. We introduce a new parameter - cursor_fetch_limit,
which is set by postgres_fdw. When it is set, fetching limited count of
records from the cursor is also limited by memory consumed by the
records. Of course, record size is some estimation (for example, we
don't know what out function will do).This works as expected - in my tests with tables of large records,
backends, executing selects, were always restricted by about 2 GB of
RAM overall (without patch memory consumption easily grows up to 8 GB).
However, now when we got less tuples from executor, than expected, we
should recheck, if these are all tuples we can get. I've introduced
es_eof EState field to signal that there's no more tuples. Don't know
if it's the best way.
The issue which worried me is that we could change behavior for cursors,
which don't expect this (for example, if these cursors are used in
functions, which are executed via FDW). So I added new cursor option and
used it in postgres_fdw.
Also rearranged patches. While working with test cases I found that we
should avoid batch insert for big tuples, as it has significant
overhead. Switching from (limited) batch insert to foreign insert
allowed to decrease memory consumption in 3 times in some tests.
--
Best regards,
Alexander Pyhalov,
Postgres Professional
Attachments:
v5-0001-Introduce-function-to-estimate-tuple-size.patchtext/x-diff; name=v5-0001-Introduce-function-to-estimate-tuple-size.patchDownload+57-1
v5-0002-Limit-batch_size-for-foreign-insert-with-work_mem.patchtext/x-diff; name=v5-0002-Limit-batch_size-for-foreign-insert-with-work_mem.patchDownload+107-65
v5-0003-Introduce-memory-limited-cursors-and-use-them-in-pos.patchtext/x-diff; name=v5-0003-Introduce-memory-limited-cursors-and-use-them-in-pos.patchDownload+275-19
Hi.
Attaching new version of patches. Besides some bit-size fixes, last
series of patches managed to limit MOVE operations. Fixed this.
--
Best regards,
Alexander Pyhalov,
Postgres Professional