BUG #16691: Autovacuum stops processing certain databases until PostgreSQL is restarted
The following bug has been logged on the website:
Bug reference: 16691
Logged by: Max Vikharev
Email address: bm.kinder@gmail.com
PostgreSQL version: 12.3
Operating system: Ubuntu 16.04.5 LTS
Description:
Hi,
Previously I reported BUG #16620, but there has been no answer to my
follow-up mails, so I have to open this new report.
Currently we have to restart PostgreSQL 12.3 every 5-7 days to keep it
working properly on a 5-terabyte cluster with ~50 databases. The problem is
that autovacuum stops processing certain databases and cycles on only one...
until we restart the cluster.
We built a trigger and now detect the exact situation when it happens. And
it happens every 5-7 days.
As a result PostgreSQL loses up-to-date statistics, we see ineffective
plans, we need to restart PostgreSQL, and we also have to run VACUUM FULL
explicitly and frequently. All of this is a huge headache.
Please help us investigate the root of the problem.
On Thu, 29 Oct 2020 at 22:44, PG Bug reporting form <noreply@postgresql.org> wrote:
> The problem is that autovacuum stops processing certain databases and
> cycles on only one... until we restart the cluster. [...]
Please check open transactions. A restart closes all sessions and all
transactions.
Did you check pg_stat_activity for sessions in state "idle in
transaction" before the restart?
Regards
Pavel
Hi Pavel,
> Did you check pg_stat_activity for sessions in state "idle in
> transaction" before the restart?
Of course, we checked all the widely known possible causes:
1. We monitor long transactions, and there are no active transactions or
open sessions in "idle in transaction" state. We also have
statement_timeout set.
2. There are no active transactions on the standby, and
hot_standby_feedback = off.
3. We do not use prepared transactions.
4. There are no replication slots.
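For reference, these checks boil down to queries along these lines (a
sketch using standard catalog views; adapt as needed):

-- Sessions holding back the xmin horizon (long or idle-in-transaction).
SELECT pid, state, xact_start, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;

-- Prepared transactions left open.
SELECT gid, prepared, owner, database FROM pg_prepared_xacts;

-- Replication slots pinning old xids.
SELECT slot_name, active, xmin, catalog_xmin FROM pg_replication_slots;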
On Fri, 30 Oct 2020 at 08:23, Max Vikharev <bm.kinder@gmail.com> wrote:
> Of course, we checked all the widely known possible causes: [...]
Maybe there is some lock leak - that would probably be a Postgres bug, but
unfortunately it can only be detected in your environment, unless you are
able to prepare a reproducer test.
Can you attach to the autovacuum process with gdb and see what it is doing?
OK, next time it happens I'll do this debugging.
Is there anything more I can collect?
Last time it happened I enabled DEBUG logging; the attachment (or
https://gist.github.com/mvikharev/62e2aa690cd3d256fa1880d44b169313) shows
what the autovacuum process was doing.
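For illustration, logging like this can be enabled with something along
these lines (a minimal sketch, assuming superuser access; both settings
exist in PostgreSQL 12 and take effect on reload):

ALTER SYSTEM SET log_min_messages = 'debug1';
ALTER SYSTEM SET log_autovacuum_min_duration = 0;  -- log every autovacuum run
SELECT pg_reload_conf();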
Attachments:
autovacuum-hags.txt (text/plain)
On Fri, 30 Oct 2020 at 10:13, Max Vikharev <bm.kinder@gmail.com> wrote:
> Is there anything more I can collect? [...]
You can check pg_locks to see whether the tables are locked.
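For example, a query along these lines (a sketch) shows lock requests that
are not being granted:

SELECT locktype, relation::regclass, pid, mode, granted
FROM pg_locks
WHERE NOT granted;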
On Thu, Oct 29, 2020 at 5:44 PM PG Bug reporting form <
noreply@postgresql.org> wrote:
> The problem is that autovacuum stops processing certain databases and
> cycles on only one... until we restart the cluster. [...]
There is a known problem (
/messages/by-id/CAMkU=1yE4YyCC00W_GcNoOZ4X2qxF7x5DUAR_kMt-Ta=YPyFPQ@mail.gmail.com)
that looks something like this, but restarting the database should not fix
it (or at least, not for long), and it should stop on its own once the one
database being vacuumed for wraparound has finished doing so. Is it
vacuuming the same table over and over again, or is it finishing that table
and moving to another in the same database, or what?
Cheers,
Jeff
Hi,
Thank you for the response.
> finishing that table and moving to another in the same database
Yes, that case. Vacuum processes relations in one database in a cycle.
Max Vikharev <bm.kinder@gmail.com> writes:
>> finishing that table and moving to another in the same database
> Yes, that case. Vacuum processes relations in one database in a cycle.
It sounds like it's failing to advance that database's datfrozenxid
and/or datminmxid. You might try to figure out which table is holding
those values back by comparing pg_class.relfrozenxid and relminmxid
to those pg_database columns. Then look to see if autovacuum is
failing on that table, and/or try a manual vacuum to see what happens.
regards, tom lane
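For illustration, Tom's comparison can be written as a query along these
lines, run inside the affected database (a sketch over the documented
catalog columns):

SELECT c.oid::regclass AS rel, c.relfrozenxid, c.relminmxid
FROM pg_class c, pg_database d
WHERE d.datname = current_database()
  AND c.relkind IN ('r', 'm', 't')
  AND (c.relfrozenxid = d.datfrozenxid OR c.relminmxid = d.datminmxid);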
We are facing the problem right now.
In the logs we see that autovacuum is processing mostly one table in the
database, in a 1-second cycle:
automatic vacuum of table "DATABASE.pg_catalog.pg_shdepend"
We ran strace on "postgres: autovacuum launcher". It reads from
global/679366931 in a loop:
open("global/679366931", O_RDWR) = 6
lseek(6, 0, SEEK_END) = 16384
kill(11798, SIGUSR1) = 0
open("pg_stat_tmp/global.stat", O_RDONLY) = 7
fstat(7, {st_mode=S_IFREG|0600, st_size=11292, ...}) = 0
read(7, "\235\274\245\1\312\177n\322.V\2\0'\v\0\0\0\0\0\0\4U\0\0\0\0\0\0u\326\313\26"...,
4096) = 4096
read(7, "iZ\272\17\344\0\0\0w\337\334}\343\0\0\0000\362\255\0\0\0\0\0\265\241\332\0\0\0\0\0"...,
4096) = 4096
read(7, "\0\0\0\0\0\0\0\0\0\0\0\0U\2\0\0\0\0\0\0\0\0B\277\0\0\0\0\1\0\0\0"...,
4096) = 3100
close(7) = 0
sendto(11, "\1\0\0\0
\0\0\0\6\270v\322.V\2\0\346\26o\322.V\2\0\0\0\0\0\1\0\0\0", 32, 0,
NULL, 0) = 32
select(0, NULL, NULL, NULL, {0, 10000}) = ? ERESTARTNOHAND (To be
restarted if no handler)
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=12052, si_uid=114} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
open("pg_stat_tmp/global.stat", O_RDONLY) = 7
fstat(7, {st_mode=S_IFREG|0600, st_size=11292, ...}) = 0
read(7, "\235\274\245\1z\270v\322.V\2\0'\v\0\0\0\0\0\0\4U\0\0\0\0\0\0u\326\313\26"...,
4096) = 4096
read(7, "iZ\272\17\344\0\0\0w\337\334}\343\0\0\0000\362\255\0\0\0\0\0\265\241\332\0\0\0\0\0"...,
4096) = 4096
read(7, "\0\0\0\0\0\0\0\0\0\0\0\0U\2\0\0\0\0\0\0\0\0B\277\0\0\0\0\1\0\0\0"...,
4096) = 3100
close(7) = 0
open("pg_stat_tmp/global.stat", O_RDONLY) = 7
fstat(7, {st_mode=S_IFREG|0600, st_size=11292, ...}) = 0
read(7, "\235\274\245\1z\270v\322.V\2\0'\v\0\0\0\0\0\0\4U\0\0\0\0\0\0u\326\313\26"...,
4096) = 4096
read(7, "iZ\272\17\344\0\0\0w\337\334}\343\0\0\0000\362\255\0\0\0\0\0\265\241\332\0\0\0\0\0"...,
4096) = 4096
read(7, "\0\0\0\0\0\0\0\0\0\0\0\0U\2\0\0\0\0\0\0\0\0B\277\0\0\0\0\1\0\0\0"...,
4096) = 3100
close(7) = 0
close(6) = 0
open("global/679366931", O_RDWR) = 6
lseek(6, 0, SEEK_END) = 16384
open("pg_stat_tmp/global.stat", O_RDONLY) = 7
fstat(7, {st_mode=S_IFREG|0600, st_size=11292, ...}) = 0
read(7, "\235\274\245\1z\270v\322.V\2\0'\v\0\0\0\0\0\0\4U\0\0\0\0\0\0u\326\313\26"...,
4096) = 4096
read(7, "iZ\272\17\344\0\0\0w\337\334}\343\0\0\0000\362\255\0\0\0\0\0\265\241\332\0\0\0\0\0"...,
4096) = 4096
read(7, "\0\0\0\0\0\0\0\0\0\0\0\0U\2\0\0\0\0\0\0\0\0B\277\0\0\0\0\1\0\0\0"...,
4096) = 3100
close(7) = 0
open("pg_stat_tmp/global.stat", O_RDONLY) = 7
fstat(7, {st_mode=S_IFREG|0600, st_size=11292, ...}) = 0
read(7, "\235\274\245\1z\270v\322.V\2\0'\v\0\0\0\0\0\0\4U\0\0\0\0\0\0u\326\313\26"...,
4096) = 4096
read(7, "iZ\272\17\344\0\0\0w\337\334}\343\0\0\0000\362\255\0\0\0\0\0\265\241\332\0\0\0\0\0"...,
4096) = 4096
read(7, "\0\0\0\0\0\0\0\0\0\0\0\0U\2\0\0\0\0\0\0\0\0B\277\0\0\0\0\1\0\0\0"...,
4096) = 3100
close(7) = 0
epoll_create1(EPOLL_CLOEXEC) = 7
epoll_ctl(7, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLERR|EPOLLHUP,
{u32=772071960, u64=94859419771416}}) = 0
epoll_ctl(7, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP,
{u32=772071984, u64=94859419771440}}) = 0
close(7) = 0
epoll_create1(EPOLL_CLOEXEC) = 7
epoll_ctl(7, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLERR|EPOLLHUP,
{u32=772071960, u64=94859419771416}}) = 0
epoll_ctl(7, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP,
{u32=772071984, u64=94859419771440}}) = 0
epoll_wait(7, 0x56462e04e260, 1, 554) = -1 EINTR (Interrupted system call)
lsof
postgres 12202 postgres 6u REG 8,3
16384 47329946 /var/lib/postgresql/12/alytics_main/global/679366931
postgres 30607 postgres 239u REG 8,3
16384 47329946 /var/lib/postgresql/12/alytics_main/global/679366931
DATABASE=# SELECT pg_relation_filepath('pg_database');
pg_relation_filepath
----------------------
global/679366931
> It sounds like it's failing to advance that database's datfrozenxid
> and/or datminmxid. You might try to figure out which table is holding
> those values back by comparing pg_class.relfrozenxid and relminmxid
> to those pg_database columns. Then look to see if autovacuum is
> failing on that table, and/or try a manual vacuum to see what happens.
We found tables with pg_database.datfrozenxid = pg_class.relfrozenxid or
pg_database.datminmxid = pg_class.relminmxid. We executed VACUUM VERBOSE on
those tables and it completed correctly: the output showed 0 dead tuples in
all tables, and there were no ERROR/WARNING-level log entries.
In the PostgreSQL logs we see that after the pg_toast tables were vacuumed,
autovacuum started to process the next tables:
2020-11-03 12:54:31 MSK DATABASE [local] [38649]: [2] STATEMENT: vacuum verbose pg_toast.pg_toast_687578781;
2020-11-03 12:55:24 MSK DATABASE [local] [38649]: [4] STATEMENT: vacuum verbose pg_toast.pg_toast_687578787;
2020-11-03 12:56:10 MSK DATABASE [local] [38649]: [6] STATEMENT: vacuum verbose pg_toast.pg_toast_687578793;
2020-11-03 12:56:32 MSK DATABASE [local] [38649]: [8] STATEMENT: vacuum verbose pg_toast.pg_toast_687578799;
2020-11-03 12:59:50 MSK DATABASE [33266]: [2] LOG: automatic vacuum of table "DATABASE.public.another_table": index scans: 1
But autovacuum still processes only the one database, in a cycle. That is,
there are tables in other databases that should be processed, and there are
free autovacuum workers, but the queued tables are not processed.
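For what it's worth, what the autovacuum workers are doing at any given
moment can be checked with a query like this (a sketch; the
pg_stat_progress_vacuum view exists since PostgreSQL 9.6):

SELECT pid, datname, relid, phase
FROM pg_stat_progress_vacuum;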
Hi, dear PostgreSQL developers,
We are still facing this issue. We restart the cluster every 5-7 days. Any
help, please?
Hi everybody,
Finally we discovered that the problem is related to
/messages/by-id/0A3221C70F24FB45833433255569204D1F8A4DC6@G01JPEXMBYT05
Autovacuum hangs on a database that contains orphaned temp tables. When I
deleted the schema with the orphaned temporary tables, autovacuum resumed
working without a restart.
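For illustration, the cleanup amounts to a statement of this shape, with a
hypothetical schema name (the numeric suffix of pg_temp_NN is
installation-specific; verify the schema is truly orphaned before dropping):

DROP SCHEMA pg_temp_123 CASCADE;  -- pg_temp_123 is a placeholder name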
On Tue, Dec 29, 2020 at 02:17:51PM +0300, Max Vikharev wrote:
> Finally we discovered that the problem is related to
> /messages/by-id/0A3221C70F24FB45833433255569204D1F8A4DC6@G01JPEXMBYT05
> Autovacuum hangs on a database that contains orphaned temp tables. When
> I deleted the schema with the orphaned temporary tables, autovacuum
> resumed working without a restart.
Just wondering: are you sure this instance is running 12.3? The removal of
orphaned temp tables should be more aggressive there, as autovacuum also
tracks whether a specific temp schema OID is in use by a backend, and the
initialization of a temp namespace for a backend cleans up temp objects
left behind by past connections that crashed and orphaned them (see around
RemoveTempRelations() in InitTempTableNamespace()).
--
Michael
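For reference, relations living in temporary namespaces can be listed with
a query along these lines (a sketch; whether a given pg_temp_N schema is
orphaned still has to be checked against the live backends):

SELECT n.nspname, c.oid::regclass AS rel, c.relfrozenxid, c.relminmxid
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relpersistence = 't';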