Increased work_mem for "logical replication tablesync worker" only?

Started by Dmitry Koterov · 11 months ago · 10 messages
#1Dmitry Koterov
dmitry.koterov@gmail.com

Hi.

Trying to monitor perf during the initial tablesync phase (COPY) right
after CREATE SUBSCRIPTION, I noticed that the size
of 17/main/base/pgsql_tmp on the destination node grows (tens of gigabytes)
as the COPY command (running internally on the publisher) progresses. Then
at the end (when "EXPLAIN SELECT 1 FROM tbl" on the destination shows an
approximate row count equal to the number of rows on the source node), it
hangs for several minutes, and then 17/main/base/pgsql_tmp empties, and the
subscription progresses.

It seems like if I increase work_mem to several GB, then the growth
of 17/main/base/pgsql_tmp becomes less significant.

Questions:

1. Are there some diagnostic commands that would allow me to figure out
what is in those tmp files? Why does the subscriber create those tmp files
and not just write directly to the data files and WAL? (The table has 2
bytea columns, i.e. it's TOASTed for sure.)
2. Is there a way to set work_mem only for the "logical replication
tablesync worker"? I don't want it set that high for all connections, but
for the tablesync worker it's fine to use a huge value (I have
max_sync_workers_per_subscription=1, so there is no more than one such
process in the system).
3. Is this work_mem consideration relevant at all? Maybe it's a red herring?
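For question 2, a per-role override is one possible sketch: tablesync
workers run with the subscription owner's settings, so if the subscription
is owned by a dedicated role (hypothetically named repl_owner below), a
role-level work_mem would apply only to those workers:

```sql
-- Hypothetical dedicated role that owns the subscription; tablesync
-- workers pick up per-role settings from the subscription's owner.
ALTER ROLE repl_owner SET work_mem = '2GB';

-- Verify the per-role override is recorded:
SELECT rolname, rolconfig FROM pg_roles WHERE rolname = 'repl_owner';
```

This assumes no other sessions log in as that role; otherwise they would
inherit the large work_mem too.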

Thanks!

#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Dmitry Koterov (#1)
Re: Increased work_mem for "logical replication tablesync worker" only?

On Sun, Feb 2, 2025 at 5:13 PM Dmitry Koterov <dmitry.koterov@gmail.com> wrote:

Trying to monitor perf during the initial tablesync phase (COPY) right after CREATE SUBSCRIPTION. I noticed that the size of 17/main/base/pgsql_tmp on the destination node grows (tens of gigabytes) as the COPY command (running internally on the publisher) progresses. Then in the end (when its "EXPLAIN SELECT 1 FROM tbl" on the destination shows the approximate number of rows equals to the number of rows on the source node) it hangs for several minutes, and then 17/main/base/pgsql_tmp empties, and the subscription progresses.

It seems like if I increase work_mem to several GB, then the growth of 17/main/base/pgsql_tmp becomes less significant.

Questions:

1. Are there some diagnostics commands that would allow me to figure out what is in those tmp files? Why does the subscriber create those tmp files and not just write directly to the data files and WAL? (The table has 2 bytea columns, i.e. it's TOASTed for sure.)

We do write spill files (ending with '.spill') if the changes are
large. Can you please share the names of the tmp files to avoid any
assumptions?

--
With Regards,
Amit Kapila.

#3Dmitry Koterov
dmitry.koterov@gmail.com
In reply to: Amit Kapila (#2)
Re: Increased work_mem for "logical replication tablesync worker" only?

Hi.

1. Those are temp files on the destination node (where the logical
subscription exists and tablesync worker runs), not on the source. On the
source, it’s all clear.

2. There is no “spill” suffix/substring in the file names. I tried to look
at the content of these temp files, and I saw some text fragments from the
original table’s text column there. I.e., it looks like, for some reason,
the stream received from the source node’s COPY command goes to those temp
files (at least partially).

3. I ran several more experiments; increasing work_mem to several GB (for
the role which the tablesync worker uses when copying) definitely helps
with the temp files.

Thanks!


#4Dmitry Koterov
dmitry.koterov@gmail.com
In reply to: Dmitry Koterov (#3)
Re: Increased work_mem for "logical replication tablesync worker" only?

Here is the list of tmp files:

postgres@pg-101a:~/17/main/base/pgsql_tmp$ ls -la
total 5422297
drwx------ 2 postgres postgres 9 Feb 3 04:08 .
drwx------ 8 postgres postgres 8 Jan 29 01:27 ..
-rw------- 1 postgres postgres 1073741824 Feb 3 04:05 pgsql_tmp196534.0
-rw------- 1 postgres postgres 1073741824 Feb 3 04:05 pgsql_tmp196534.1
-rw------- 1 postgres postgres 1073741824 Feb 3 04:05 pgsql_tmp196534.2
-rw------- 1 postgres postgres 1073741824 Feb 3 04:06 pgsql_tmp196534.3
-rw------- 1 postgres postgres 1073741824 Feb 3 04:07 pgsql_tmp196534.4
-rw------- 1 postgres postgres 1073741824 Feb 3 04:08 pgsql_tmp196534.5
-rw------- 1 postgres postgres 819396608 Feb 3 04:08 pgsql_tmp196534.6

With work_mem=4GB, all those files *on the destination node* seemed to
appear immediately at 4GB size and kept growing from then on while COPY
progressed on the source node (i.e. it looked like PG tried hard to utilize
work_mem, but after reaching the limit, still dumped everything to
pgsql_tmp).

The table structure being copied (just 1 index there):

CREATE TABLE mytable (
id bigint NOT NULL PRIMARY KEY,
snippet bytea,
title bytea,
updated_at timestamp with time zone,
rich_snippet bytea
);

Directories sizes on the destination node while tablesync is working (it's
copied in to an almost empty database):

$ watch du -sh 17/main/base/* 17/main/pg_wal
2.2M 17/main/base/1
14G 17/main/base/16385
2.3M 17/main/base/16387
2.2M 17/main/base/4
2.3M 17/main/base/5
12G 17/main/base/pgsql_tmp
6.3G 17/main/pg_wal

So the question is: why does it use temp files at all? Why not just write
directly to WAL+data?
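One way to attribute those files to specific statements (a sketch; a
log_temp_files threshold of 0 kB logs every temp file together with the
statement that produced it, and pg_stat_database keeps cumulative
counters):

```sql
-- Log every temporary file, regardless of size, to the server log:
ALTER SYSTEM SET log_temp_files = 0;
SELECT pg_reload_conf();

-- Cumulative per-database temp-file counters since the last stats reset:
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_size
FROM pg_stat_database
WHERE datname = current_database();
```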


#5Dmitry Koterov
dmitry.koterov@gmail.com
In reply to: Dmitry Koterov (#4)
Re: Increased work_mem for "logical replication tablesync worker" only?

What's also suspicious is that on the destination node, after copying
finishes,

# explain select 1 from mytable;
Seq Scan on mytable (cost=0.00..6821514.12 rows=250410012 width=4)

# SELECT relname, n_live_tup FROM pg_stat_user_tables WHERE relname =
'mytable';
   relname   | n_live_tup
-------------+------------
 mytable     |  150342468

Notice the discrepancy between how many rows EXPLAIN thinks there are in
the table and how many rows pg_stat_user_tables reports (250M vs. 150M).

On the source node, those two numbers are almost the same. It's only on the
destination that they differ, right after the copying. (There are not a lot
of writes happening to this table on the source while copying, BTW.)

Maybe that could hint at why temp files are used?
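The two numbers can be put side by side (a sketch; pg_class.reltuples is
the value the planner's EXPLAIN row estimate is derived from):

```sql
-- Planner-visible estimate (pg_class.reltuples) vs. the stats
-- collector's live-tuple count for the same table:
SELECT c.relname,
       c.reltuples::bigint AS planner_estimate,
       s.n_live_tup        AS stats_live_tuples
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relname = 'mytable';
```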


#6Amit Kapila
amit.kapila16@gmail.com
In reply to: Dmitry Koterov (#4)
Re: Increased work_mem for "logical replication tablesync worker" only?

On Mon, Feb 3, 2025 at 5:51 PM Dmitry Koterov <dmitry.koterov@gmail.com> wrote:

Here is the list of tmp files:

postgres@pg-101a:~/17/main/base/pgsql_tmp$ ls -la
total 5422297
drwx------ 2 postgres postgres 9 Feb 3 04:08 .
drwx------ 8 postgres postgres 8 Jan 29 01:27 ..
-rw------- 1 postgres postgres 1073741824 Feb 3 04:05 pgsql_tmp196534.0
-rw------- 1 postgres postgres 1073741824 Feb 3 04:05 pgsql_tmp196534.1
-rw------- 1 postgres postgres 1073741824 Feb 3 04:05 pgsql_tmp196534.2
-rw------- 1 postgres postgres 1073741824 Feb 3 04:06 pgsql_tmp196534.3
-rw------- 1 postgres postgres 1073741824 Feb 3 04:07 pgsql_tmp196534.4
-rw------- 1 postgres postgres 1073741824 Feb 3 04:08 pgsql_tmp196534.5
-rw------- 1 postgres postgres 819396608 Feb 3 04:08 pgsql_tmp196534.6

With work_mem=4GB, all those files on the destination node seemed to appear immediately with 4GB size and keep growing since then, while COPY progresses on the source node (i.e. it looked like PG tried hard to utilize work_mem, but after reaching the limit, dumped everything to pgsql_tmp still).

The table structure being copied (just 1 index there):

CREATE TABLE mytable (
id bigint NOT NULL PRIMARY KEY,
snippet bytea,
title bytea,
updated_at timestamp with time zone,
rich_snippet bytea
);

Do you have any indexes on the table? I am not aware of a direct copy to
a table from tablesync using tmp files.

--
With Regards,
Amit Kapila.

#7Amit Kapila
amit.kapila16@gmail.com
In reply to: Dmitry Koterov (#5)
Re: Increased work_mem for "logical replication tablesync worker" only?

On Mon, Feb 3, 2025 at 6:19 PM Dmitry Koterov <dmitry.koterov@gmail.com> wrote:

What's also suspicious is that on the destination node, after copying finishes,

# explain select 1 from mytable;
Seq Scan on mytable (cost=0.00..6821514.12 rows=250410012 width=4)

# SELECT relname, n_live_tup FROM pg_stat_user_tables WHERE relname = 'mytable';
relname | estimated_rows
-------------+----------------
mytable | 150342468

Notice the discrepancy between how many rows EXPLAIN thinks there is in the table, and how many rows pg_stat_user_tables thinks about it (250M vs. 150M).

Is it possible that ANALYZE has not yet run on the table on the
subscriber? Can you try running ANALYZE on the subscriber before trying
the above two statements?
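That check can be sketched as follows (using the table name from earlier
in the thread):

```sql
-- Refresh planner statistics, then re-check the planner's estimate:
ANALYZE mytable;
SELECT reltuples::bigint AS planner_estimate
FROM pg_class
WHERE relname = 'mytable';
```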

--
With Regards,
Amit Kapila.

#8Dmitry Koterov
dmitry.koterov@gmail.com
In reply to: Amit Kapila (#6)
Re: Increased work_mem for "logical replication tablesync worker" only?

Only one index, the primary key.

ChatGPT says that temp files may be used when bytea columns (or heavily
TOASTed ones) are involved, but it can’t explain why. To me, it’s hard to
believe that on the destination, the COPY stream receiver puts almost all
of the received data into temp files, then waits for COPY to finish, and
only then moves those temp files to the data area, because of the large
transaction or something. Would be cool to understand the reason.


#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Dmitry Koterov (#8)
Re: Increased work_mem for "logical replication tablesync worker" only?

On Wed, Feb 5, 2025 at 9:50 AM Dmitry Koterov <dmitry.koterov@gmail.com> wrote:

Only one index, the primary key.

ChatGPT tells that temp files may be used, when bytea columns (or heavily toasted ones) are utilized, but it can’t explain, why. To me, it’s hard to believe that on the destination, the COPY stream receiver puts almost all of the received data to temp files, then waits for COPY to finish, and only then moves that temp files to data area, because of the large transaction or something. Would be cool to understand the reason.

One idea is to temporarily drop the index before the copy and see what
happens. Is it possible for you to share a self-contained test case?
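A sketch of that experiment (it assumes the primary key got the default
constraint name mytable_pkey):

```sql
-- On the subscriber, before creating the subscription:
ALTER TABLE mytable DROP CONSTRAINT mytable_pkey;

-- ... let tablesync copy the data ...

-- Re-add the primary key afterwards (builds the index in one pass):
ALTER TABLE mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);
```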

--
With Regards,
Amit Kapila.

#10vignesh C
vignesh21@gmail.com
In reply to: Dmitry Koterov (#4)
Re: Increased work_mem for "logical replication tablesync worker" only?

On Mon, 3 Feb 2025 at 17:51, Dmitry Koterov <dmitry.koterov@gmail.com> wrote:

Here is the list of tmp files:

postgres@pg-101a:~/17/main/base/pgsql_tmp$ ls -la
total 5422297
drwx------ 2 postgres postgres 9 Feb 3 04:08 .
drwx------ 8 postgres postgres 8 Jan 29 01:27 ..
-rw------- 1 postgres postgres 1073741824 Feb 3 04:05 pgsql_tmp196534.0
-rw------- 1 postgres postgres 1073741824 Feb 3 04:05 pgsql_tmp196534.1
-rw------- 1 postgres postgres 1073741824 Feb 3 04:05 pgsql_tmp196534.2
-rw------- 1 postgres postgres 1073741824 Feb 3 04:06 pgsql_tmp196534.3
-rw------- 1 postgres postgres 1073741824 Feb 3 04:07 pgsql_tmp196534.4
-rw------- 1 postgres postgres 1073741824 Feb 3 04:08 pgsql_tmp196534.5
-rw------- 1 postgres postgres 819396608 Feb 3 04:08 pgsql_tmp196534.6

With work_mem=4GB, all those files on the destination node seemed to appear immediately with 4GB size and keep growing since then, while COPY progresses on the source node (i.e. it looked like PG tried hard to utilize work_mem, but after reaching the limit, dumped everything to pgsql_tmp still).

The table structure being copied (just 1 index there):

CREATE TABLE mytable (
id bigint NOT NULL PRIMARY KEY,
snippet bytea,
title bytea,
updated_at timestamp with time zone,
rich_snippet bytea
);

I prepared a logical replication setup with the following:
CREATE TABLE mytable (
id bigint NOT NULL PRIMARY KEY,
snippet bytea default decode(repeat('1234567890',10000),'hex'),
title bytea default decode(repeat('1234567890',10000),'hex'),
updated_at timestamp with time zone default 'now',
rich_snippet bytea default decode(repeat('1234567890',10000),'hex')
);

-- Specify data only for the first column; the rest of the columns will
use default values
insert into mytable values(generate_series(1,100000));

These did not create temp files; I had used a low work_mem
configuration, i.e. 64kB. I have enabled "log_temp_files = 0" to
confirm that no temp files were created.
Generally, these temp files are created when we use ORDER BY, DISTINCT,
hash joins, hash-based aggregation, memoize nodes, other hash-based
processing, or merge joins.
Ex: The following creates temp files:
select * from mytable order by 2,3,4,5;
2025-02-07 15:10:25.409 IST [586494] LOG: temporary file: path
"base/pgsql_tmp/pgsql_tmp586494.0", size 65265664
2025-02-07 15:10:25.464 IST [586495] LOG: temporary file: path
"base/pgsql_tmp/pgsql_tmp586495.0", size 60481536
2025-02-07 15:10:25.514 IST [586492] LOG: temporary file: path
"base/pgsql_tmp/pgsql_tmp586492.0", size 54730752

Does the table have any triggers that could invoke another statement?
Could you provide more details or simulate a scenario using a table
like the one above with insert operations? That would be really
helpful.
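For reference, a minimal end-to-end skeleton of such a test (the
publication/subscription names and the connection string are
placeholders):

```sql
-- On the publisher:
CREATE PUBLICATION mypub FOR TABLE mytable;

-- On the subscriber (the table must already exist with the same
-- definition); the connection string below is a placeholder:
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher-host dbname=mydb user=repl'
    PUBLICATION mypub;

-- Watch initial tablesync progress on the subscriber
-- (srsubstate: i = initialize, d = data copy, r = ready):
SELECT srsubstate, srrelid::regclass FROM pg_subscription_rel;
```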

Regards,
Vignesh