BUG #11264: Auto vacuum wraparound job blocking everything

Started by Dinesh Bhandary · over 11 years ago · 24 messages · bugs
#1 Dinesh Bhandary
dbhandary@switchfly.com

The following bug has been logged on the website:

Bug reference: 11264
Logged by: Dinesh Bhandary
Email address: dbhandary@switchfly.com
PostgreSQL version: 9.3.5
Operating system: Centos 6.5
Description:

We recently upgraded from PostgreSQL 9.1 to PostgreSQL 9.3.5 using
pg_upgrade. We are running streaming replication. The database was
functioning without any issues until today, when we discovered an
autovacuum wraparound job on one of our tables that was blocking
everything. There were no waiting jobs when we queried pg_stat_activity,
but incoming requests would just hang, so the database was unusable.

We tried to disable autovacuum for the table in question and tried to kill
the autovacuum job, but that did not work. We also had pg_dump running for
several hours against that particular table while the autovacuum was
running. We tried killing the pg_dump job, but that did not succeed either.

We eventually had to force-shutdown the database, disable autovacuum, and
increase the vacuum freeze parameters so that the autovacuum wraparound job
won't trigger again until we figure out what the issue is.
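
The stopgap looked roughly like this (a sketch from memory, not the exact
commands we ran):

-- per-table switch we tried first (9.3 storage parameter):
ALTER TABLE settings.room_contract_service_code SET (autovacuum_enabled = false);

# postgresql.conf changes applied after the forced shutdown:
autovacuum = off                        # temporary, until we understand the problem
autovacuum_freeze_max_age = 500000000   # default 200000000; keeps the wraparound vacuum from retriggering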

I am including a debug trace here. It seems like autovacuum is running in
an infinite loop. I will be happy to provide more info if necessary.

(gdb) bt
#0 0x00007fb9740eb5b3 in __select_nocancel () from /lib64/libc.so.6
#1 0x000000000075ad7a in pg_usleep ()
#2 0x00000000004ae2d4 in GetMultiXactIdMembers ()
#3 0x000000000047f6cc in heap_prepare_freeze_tuple ()
#4 0x000000000057d036 in lazy_vacuum_rel ()
#5 0x000000000057b346 in ?? ()
#6 0x000000000057b67c in vacuum ()
#7 0x000000000060cf8a in ?? ()
#8 0x000000000060d426 in ?? ()
#9 0x000000000060d4f6 in StartAutoVacWorker ()
#10 0x0000000000617d42 in ?? ()
#11 <signal handler called>
#12 0x00007fb9740eb5b3 in __select_nocancel () from /lib64/libc.so.6
#13 0x00000000006191fd in PostmasterMain ()
#14 0x00000000005b5e60 in main ()
(gdb)

AUTOVACUUM:

datid            | 16423
datname          | jetblue
pid              | 21972
usesysid         | 10
usename          | postgres
application_name |
client_addr      |
client_hostname  |
client_port      |
backend_start    | 2014-08-24 18:39:38.134514+00
xact_start       | 2014-08-24 18:39:38.161612+00
query_start      | 2014-08-24 18:39:38.161612+00
state_change     | 2014-08-24 18:39:38.161614+00
waiting          | f
state            | active
query            | autovacuum: VACUUM settings.room_contract_service_code (to prevent wraparound)

COPY:

jetblue=# select * from pg_stat_activity where query ilike '%copy%';
datid            | 16423
datname          | jetblue
pid              | 30246
usesysid         | 20350
usename          | m84
application_name | pg_dump
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2014-08-25 03:48:59.807689+00
xact_start       | 2014-08-25 03:48:59.810871+00
query_start      | 2014-08-25 04:26:14.95753+00
state_change     | 2014-08-25 04:26:14.957534+00
waiting          | f
state            | active
query            | COPY settings.room_contract_service_code (room_contract_service_code, room_contract_service_description, room_contract_service_code_id, supplierid, inactive) TO stdout;

datid            | 16423
datname          | jetblue
pid              | 31999
usesysid         | 20350
usename          | m84
application_name |
client_addr      | 10.33.11.11
client_hostname  |
client_port      | 40735
backend_start    | 2014-08-25 10:38:02.944083+00
xact_start       | 2014-08-25 10:38:02.964562+00
query_start      | 2014-08-25 10:38:22.805859+00
state_change     | 2014-08-25 10:38:22.805861+00
waiting          | f
state            | active
query            | COPY settings.room_contract_service_code (room_contract_service_code, room_contract_service_description, room_contract_service_code_id, supplierid, inactive) TO stdout;

(gdb) bt
#0 0x00007fb9740f5187 in semop () from /lib64/libc.so.6
#1 0x0000000000609397 in PGSemaphoreLock ()
#2 0x000000000064e871 in LWLockAcquire ()
#3 0x000000000047e1a9 in ?? ()
#4 0x000000000047e66a in ?? ()
#5 0x000000000047f0a6 in heap_getnext ()
#6 0x00000000005343e6 in ?? ()
#7 0x000000000053521a in DoCopy ()
#8 0x000000000066112d in standard_ProcessUtility ()
#9 0x00007fb96c4df261 in ?? () from /usr/pgsql-9.3/lib/pg_stat_statements.so
#10 0x000000000065db97 in ?? ()
#11 0x000000000065eb2d in ?? ()
#12 0x000000000065f223 in PortalRun ()
#13 0x000000000065b91e in ?? ()
#14 0x000000000065d048 in PostgresMain ()
#15 0x0000000000619c36 in PostmasterMain ()
#16 0x00000000005b5e60 in main ()

SELECTS:

datid            | 16423
datname          | jetblue
pid              | 15599
usesysid         | 20354
usename          | developers
application_name |
client_addr      | 10.33.11.11
client_hostname  |
client_port      | 60741
backend_start    | 2014-08-25 20:20:38.985486+00
xact_start       | 2014-08-25 20:39:13.426937+00
query_start      | 2014-08-25 20:39:13.497362+00
state_change     | 2014-08-25 20:39:13.497365+00
waiting          | f
state            | active
query            | SELECT room_contract_service_code, room_contract_service_description, room_contract_service_code_id
                 | FROM room_contract_service_code
                 | WHERE room_contract_service_code.supplierid = $1
                 | AND NOT inactive
                 | ORDER by room_contract_service_description
(gdb) bt
#0 0x00007fb9740f5187 in semop () from /lib64/libc.so.6
#1 0x0000000000609397 in PGSemaphoreLock ()
#2 0x000000000064e871 in LWLockAcquire ()
#3 0x000000000048d05d in index_fetch_heap ()
#4 0x000000000048d22e in index_getnext ()
#5 0x0000000000595e96 in ?? ()
#6 0x000000000058aabe in ExecScan ()
#7 0x0000000000583c08 in ExecProcNode ()
#8 0x000000000059c4e9 in ExecSort ()
#9 0x0000000000583b28 in ExecProcNode ()
#10 0x0000000000582a42 in standard_ExecutorRun ()
#11 0x00007fb96c4df4db in ?? () from /usr/pgsql-9.3/lib/pg_stat_statements.so
#12 0x000000000065df67 in ?? ()
#13 0x000000000065f191 in PortalRun ()
#14 0x000000000065d3ed in PostgresMain ()
#15 0x0000000000619c36 in PostmasterMain ()
#16 0x00000000005b5e60 in main ()
(gdb)


#2 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dinesh Bhandary (#1)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

dbhandary@switchfly.com wrote:

I am including a debug trace here. It seems like autovacuum is running in
an infinite loop. I will be happy to provide more info if necessary.

(gdb) bt
#0 0x00007fb9740eb5b3 in __select_nocancel () from /lib64/libc.so.6
#1 0x000000000075ad7a in pg_usleep ()
#2 0x00000000004ae2d4 in GetMultiXactIdMembers ()
#3 0x000000000047f6cc in heap_prepare_freeze_tuple ()
#4 0x000000000057d036 in lazy_vacuum_rel ()

Can you please provide pg_controldata output?
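
For example (binary path assumed from the PGDG packages your backtraces
suggest):

$ /usr/pgsql-9.3/bin/pg_controldata /path/to/data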

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#3 Dinesh Bhandary
dbhandary@switchfly.com
In reply to: Alvaro Herrera (#2)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

Hi Alvaro,

Here is the output of pg_controldata. We also looked at the latest 9.3.5 fix for pg_upgrade and ran the query that checks whether we have to remove $PGDATA/pg_multixact/offsets/0000, but the query returned false.

pg_control version number: 937
Catalog version number: 201306121
Database system identifier: 6037642065780369403
Database cluster state: in production
pg_control last modified: Tue 26 Aug 2014 02:40:19 PM UTC
Latest checkpoint location: 1054/17E0BE30
Prior checkpoint location: 1054/15B2DEB8
Latest checkpoint's REDO location: 1054/16EC6028
Latest checkpoint's REDO WAL file: 000000010000105400000016
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/1004399358
Latest checkpoint's NextOID: 1303270531
Latest checkpoint's NextMultiXactId: 23431
Latest checkpoint's NextMultiOffset: 5678
Latest checkpoint's oldestXID: 801997459
Latest checkpoint's oldestXID's DB: 16428
Latest checkpoint's oldestActiveXID: 1004399358
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 16423
Time of latest checkpoint: Tue 26 Aug 2014 02:37:48 PM UTC
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current max_connections setting: 1024
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0

Thanks
Dinesh


#4 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dinesh Bhandary (#1)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

dbhandary@switchfly.com wrote:

We recently upgraded from PostgreSQL 9.1 to PostgreSQL 9.3.5 using
pg_upgrade. We are running streaming replication. The database was
functioning without any issues until today, when we discovered an
autovacuum wraparound job on one of our tables that was blocking
everything. There were no waiting jobs when we queried pg_stat_activity,
but incoming requests would just hang, so the database was unusable.

Can you please paste

select oid::regclass, relfrozenxid, relminmxid from pg_class where relname = 'room_contract_service_code';
select datname, datfrozenxid, datminmxid from pg_database where datname = 'jetblue';

(gdb) bt
#0 0x00007fb9740eb5b3 in __select_nocancel () from /lib64/libc.so.6
#1 0x000000000075ad7a in pg_usleep ()
#2 0x00000000004ae2d4 in GetMultiXactIdMembers ()
#3 0x000000000047f6cc in heap_prepare_freeze_tuple ()
#4 0x000000000057d036 in lazy_vacuum_rel ()

This stall in GetMultiXactIdMembers should only occur when somebody is
writing the multixact immediately following. It certainly should not
sleep for long. I'm not sure what's happening here but this is probably
where the problem is. While waiting, it holds the buffer content lock
in exclusive mode.

Can you install debug symbols and pageinspect? I'm curious why this is
stalling. If the tuple has a FOR SHARE or FOR UPDATE marking from
before the upgrade, it shouldn't be inquiring the multixact members at
all. I'm curious about

Latest checkpoint's NextMultiXactId: 23431
Latest checkpoint's NextMultiOffset: 5678
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 16423

It says the oldest multi is 1, so the database should not have any
values between 1 and 23431 that correspond to pg_upgrade'd multixact
values ... so what is going on here? Unless the recent mucking with
pg_upgrade to handle multixacts got something wrong.
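
As for pageinspect, something like this would show the raw tuple headers
vacuum is choking on (a sketch; block 0 chosen arbitrarily, you'd want the
block the worker is actually stuck on):

CREATE EXTENSION pageinspect;
-- a tuple whose xmax is a multixact has the HEAP_XMAX_IS_MULTI bit (0x1000) set in t_infomask
SELECT lp, t_xmin, t_xmax, t_infomask
FROM heap_page_items(get_raw_page('settings.room_contract_service_code', 0));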

datid            | 16423
datname          | jetblue
pid              | 31999
usename          | m84
client_addr      | 10.33.11.11
client_port      | 40735
backend_start    | 2014-08-25 10:38:02.944083+00
xact_start       | 2014-08-25 10:38:02.964562+00
query_start      | 2014-08-25 10:38:22.805859+00
state_change     | 2014-08-25 10:38:22.805861+00
waiting          | f
state            | active
query            | COPY settings.room_contract_service_code (room_contract_service_code, room_contract_service_description, room_contract_service_code_id, supplierid, inactive) TO stdout;

(gdb) bt
#0 0x00007fb9740f5187 in semop () from /lib64/libc.so.6
#1 0x0000000000609397 in PGSemaphoreLock ()
#2 0x000000000064e871 in LWLockAcquire ()
#3 0x000000000047e1a9 in ?? ()
#4 0x000000000047e66a in ?? ()
#5 0x000000000047f0a6 in heap_getnext ()
#6 0x00000000005343e6 in ?? ()
#7 0x000000000053521a in DoCopy ()
#8 0x000000000066112d in standard_ProcessUtility ()
#9 0x00007fb96c4df261 in ?? () from /usr/pgsql-9.3/lib/pg_stat_statements.so

This one is stalling on the buffer content lock trying to read the
table to dump it.

SELECTS:

query |
| SELECT room_contract_service_code,
room_contract_service_description, room_contract_service_code_id
| FROM room_contract_service_code
| WHERE room_contract_service_code.supplierid = $1
| AND NOT inactive
| ORDER by room_contract_service_description
(gdb) bt
#0 0x00007fb9740f5187 in semop () from /lib64/libc.so.6
#1 0x0000000000609397 in PGSemaphoreLock ()
#2 0x000000000064e871 in LWLockAcquire ()
#3 0x000000000048d05d in index_fetch_heap ()
#4 0x000000000048d22e in index_getnext ()

And this one is stalling in the same buffer lock most likely.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#4)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

Alvaro Herrera wrote:

(gdb) bt
#0 0x00007fb9740eb5b3 in __select_nocancel () from /lib64/libc.so.6
#1 0x000000000075ad7a in pg_usleep ()
#2 0x00000000004ae2d4 in GetMultiXactIdMembers ()
#3 0x000000000047f6cc in heap_prepare_freeze_tuple ()
#4 0x000000000057d036 in lazy_vacuum_rel ()

This stall in GetMultiXactIdMembers should only occur when somebody is
writing the multixact immediately following. It certainly should not
sleep for long. I'm not sure what's happening here but this is probably
where the problem is. While waiting, it holds the buffer content lock
in exclusive mode.

Can you install debug symbols and pageinspect?

If you have debug symbols, please attach to this process, then do in
GDB:
bt full
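
On CentOS that is roughly (package names assumed; debuginfo-install comes
from yum-utils):

$ yum install -y yum-utils
$ debuginfo-install postgresql93-server
$ gdb -p <pid of the stuck autovacuum worker>
(gdb) bt full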

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Dinesh Bhandary
dbhandary@switchfly.com
In reply to: Alvaro Herrera (#4)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

Hi Alvaro,

We did a manual VACUUM and REINDEX of room_contract_service_code and they finished quickly. We still have autovacuum disabled.

1) select oid::regclass, relfrozenxid, relminmxid from pg_class where relname = 'room_contract_service_code';
   select datname, datfrozenxid, datminmxid from pg_database where datname = 'jetblue';

jetblue=# select oid::regclass, relfrozenxid, relminmxid from pg_class where relname = 'room_contract_service_code'
jetblue-# ;
oid | relfrozenxid | relminmxid
----------------------------+--------------+------------
room_contract_service_code | 953567787 | 20784
(1 row)

jetblue=# select datname, datfrozenxid, datminmxid from pg_database where datname = 'jetblue';
datname | datfrozenxid | datminmxid
---------+--------------+------------
jetblue | 870700471 | 1
(1 row)

2) Latest checkpoint's NextMultiXactId: 23431
Latest checkpoint's NextMultiOffset: 5678
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 16423

It says the oldest multi is 1, so the database should not have any values between 1 and 23431 that correspond to pg_upgrade'd multixact values ... so what is going on here? Unless the recent mucking with pg_upgrade to handle multixacts got something wrong.

We used postgres 9.3.4 to pg_upgrade from 9.1. We upgraded the binaries to 9.3.5 just last week. Would this cause any issues with the MultiXactId issue?

Thanks
Dinesh


#7 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dinesh Bhandary (#6)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

Dinesh Bhandary wrote:

We used postgres 9.3.4 to pg_upgrade from 9.1. We upgraded the
binaries to 9.3.5 just last week. Would this cause any issues with the
MultiXactId issue?

So you ran 200 million write transactions in one week?

Do you still have the 9.1 data from before the upgrade? If so, would
you paste pg_controldata output from that?
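
For scale, comparing txid_current() on the running 9.3 cluster against the
old cluster's "Latest checkpoint's NextXID" would show roughly how many
xids have been consumed since the upgrade:

select txid_current();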

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#8 Dinesh Bhandary
dbhandary@switchfly.com
In reply to: Alvaro Herrera (#7)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

Here is the pg_controldata output from 9.1. Thanks. Dinesh

$ /usr/pgsql-9.1/bin/pg_controldata data
pg_control version number: 903
Catalog version number: 201105231
Database system identifier: 5720710153987476471
Database cluster state: in production
pg_control last modified: Sat 19 Jul 2014 07:05:08 AM GMT
Latest checkpoint location: FC0/25643788
Prior checkpoint location: FC0/169F2238
Latest checkpoint's REDO location: FC0/1D919830
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/950483638
Latest checkpoint's NextOID: 1146307754
Latest checkpoint's NextMultiXactId: 20783
Latest checkpoint's NextMultiOffset: 42669
Latest checkpoint's oldestXID: 750608965
Latest checkpoint's oldestXID's DB: 16411
Latest checkpoint's oldestActiveXID: 0
Time of latest checkpoint: Sat 19 Jul 2014 07:02:38 AM GMT
Minimum recovery ending location: 0/0
Backup start location: 0/0
Current wal_level setting: minimal
Current max_connections setting: 1024
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 256
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value


#9 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dinesh Bhandary (#8)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

Dinesh Bhandary wrote:

Here is the pg_controldata output from 9.1. Thanks. Dinesh

$ /usr/pgsql-9.1/bin/pg_controldata data

Time of latest checkpoint: Sat 19 Jul 2014 07:02:38 AM GMT

Ah, so you upgraded a month ago, not one week ago, right?

Latest checkpoint's NextMultiXactId: 20783

I'm betting this is the problem. The 9.3 cluster should have
oldestMulti=20783, but instead it has oldestMulti=1. Either pg_upgrade
is misreading the nextMulti value from the old cluster, or pg_resetxlog
is failing to set oldestMulti appropriately ... or something funny
happened to pg_control in the meantime. I just ran a quick test, and I
see that pg_upgrade is setting oldestMulti correctly (to the old
cluster's NextMultiXactId value) in the new cluster when upgrading from
9.1.
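
You can see the mismatch directly by comparing the two control files, e.g.
(paths assumed):

$ /usr/pgsql-9.1/bin/pg_controldata /path/to/9.1/data | grep NextMultiXactId
$ /usr/pgsql-9.3/bin/pg_controldata /path/to/9.3/data | grep -E 'NextMultiXactId|oldestMultiXid'

A correctly upgraded cluster should show the old cluster's NextMultiXactId
(20783 here) as its oldestMultiXid, not 1.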

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#10 Dinesh Bhandary
dbhandary@switchfly.com
In reply to: Alvaro Herrera (#9)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

Yes, it was upgraded a month ago using postgresql 9.3.4 binaries. It was
just patched up to 9.3.5 last week.

I checked all the other servers, which we upgraded using postgres 9.3.5
binaries; on those, oldestMultiXid is other than 1 and seems to have picked
up the right number. But the server in question was upgraded using postgres
9.3.4 binaries. This sounds suspicious. Could it be that pg_upgrade in
9.3.4 is buggy?

How do we fix the current issue with this one server? Is there an easy fix?
Thanks.
Dinesh


#11 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dinesh Bhandary (#10)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

Dinesh Bhandary wrote:

Yes, it was upgraded a month ago using postgresql 9.3.4 binaries. It was
just patched up to 9.3.5 last week.

Ok, that makes a lot more sense.

I checked all the other servers, which we upgraded using postgres 9.3.5
binaries; on those, oldestMultiXid is other than 1 and seems to have picked
up the right number. But the server in question was upgraded using postgres
9.3.4 binaries. This sounds suspicious. Could it be that pg_upgrade in
9.3.4 is buggy?

Well, yes, 9.3.4 had a bug fixed by this commit:

Author: Bruce Momjian <bruce@momjian.us>
Branch: master [a61daa14d] 2014-07-02 15:29:38 -0400
Branch: REL9_4_STABLE [b446a384b] 2014-07-02 15:29:38 -0400
Branch: REL9_3_STABLE Release: REL9_3_5 [3d2e18510] 2014-07-02 15:29:38 -0400

pg_upgrade: preserve database and relation minmxid values

Also set these values for pre-9.3 old clusters that don't have values to
preserve.

Analysis by Alvaro

Backpatch through 9.3

How do we fix the current issue with this one server? Is there an easy fix?
Thanks.

As far as I am aware, you should
UPDATE pg_database SET datminmxid=20783

and that should fix it. The oldestMulti value in pg_control would get
updated by itself some time later. If you experience stalls before
oldestMulti fixes itself, you could stop the server (cleanly!) and then
pg_resetxlog -m x,y where x is the correct nextMulti value from
pg_controldata and y is 20783.
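
Concretely, something along these lines (a sketch; re-check "Latest
checkpoint's NextMultiXactId" with pg_controldata after the clean shutdown
rather than reusing the value below, since it keeps advancing):

-- as superuser; pg_database is shared, so running this once is enough:
UPDATE pg_database SET datminmxid = 20783;

# only if stalls persist before oldestMulti fixes itself:
$ pg_ctl -D $PGDATA stop -m fast        # clean shutdown
$ pg_resetxlog -m 23431,20783 $PGDATA   # 23431 = NextMultiXactId from your pg_controldata output
$ pg_ctl -D $PGDATA start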

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#12 Dinesh Bhandary
dbhandary@switchfly.com
In reply to: Dinesh Bhandary (#1)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

Hi Alvaro,

Thank you so much for jumping onto this issue. I greatly appreciate it.

We will be testing the fix you mentioned in a test environment first. I
will let you know when we are done with testing; then we can update the
production db.

Dinesh


#13 Dinesh Bhandary
dbhandary@switchfly.com
In reply to: Alvaro Herrera (#11)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

and that should fix it. The oldestMulti value in pg_control would get updated by itself some time later. If you experience stalls before oldestMulti fixes itself, you could stop the server (cleanly!) and then pg_resetxlog -m x,y where x is the correct nextMulti value from pg_controldata and y is 20783.

Are you referring to the NextMultiXactId of the current pg_controldata (from postgres 9.3.5) for x?
There is also NextMultiOffset; not sure if you meant to reset this. Please confirm.

Thanks,
Dinesh


#14 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dinesh Bhandary (#13)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

Dinesh Bhandary wrote:

and that should fix it. The oldestMulti value in pg_control would get updated by itself some time later. If you experience stalls before oldestMulti fixes itself, you could stop the server (cleanly!) and then pg_resetxlog -m x,y where x is the correct nextMulti value from pg_controldata and y is 20783.

Are you referring to the NextMultiXactId of the current pg_controldata (from postgres 9.3.5) for x?
There is also NextMultiOffset; not sure if you meant to reset this. Please confirm.

No, leave the offset alone. I meant that the "x" value (i.e. the part
before the comma) should stay unchanged; you only need to change the
"oldest" value, which is "y" (part after the comma).

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#15 Dinesh Bhandary
dbhandary@switchfly.com
In reply to: Alvaro Herrera (#14)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

I think the fix is working. Now I am able to kill the COPY job while the
autovacuum job is running on that table.

Since I ran pg_resetxlog on the primary, do I need to do that on the
slave too? Please let me know.

Thanks.
Dinesh


#16 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dinesh Bhandary (#15)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

Dinesh Bhandary wrote:

I think the fix is working. Now I am able to kill the COPY job while the
autovacuum job is running on that table.

I hope this killing of autovacuum jobs is not a continuing operational
practice. You need whole-table vacuums of all your tables once in a
while. (Note you can increase the "freeze_table_age" parameters if you
want to be forced to do that less frequently. See the docs.)
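
The relevant knobs, with their 9.3 defaults, look like this in
postgresql.conf:

vacuum_freeze_table_age = 150000000     # VACUUM scans the whole table (not just the dirty pages) past this age
autovacuum_freeze_max_age = 200000000   # autovacuum forces an anti-wraparound VACUUM past this age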

Since I ran pg_resetxlog on the primary, do I need to do that on the
slave too? Please let me know.

No, the slave should fix itself once the checkpoint record is replayed.
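
To make that happen sooner you can force a checkpoint on the primary and
then check the standby, e.g.:

CHECKPOINT;                                        -- on the primary
$ pg_controldata $PGDATA | grep oldestMultiXid     # on the standby, after replay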

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#17 Dinesh Bhandary
dbhandary@switchfly.com
In reply to: Alvaro Herrera (#16)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

Hi Alvaro,

I was killing it to reproduce the bug. Once it is fixed we have no need to
kill these jobs.
We still have this production server running with autovacuum disabled until
we apply the fix during the next maintenance window. Is running vacuum
manually safe, or might we run into the same issue as with the autovacuum
wraparound job? I am thinking that with max_freeze set to 500 million, a
manual vacuum won't freeze xids, and it may be OK to run. But I am not
completely sure, and I'm worried about the same issue happening in
production even with a manual vacuum. Please let me know.

Thanks a lot for your help. Much appreciated.
Dinesh


#18 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dinesh Bhandary (#17)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

Dinesh Bhandary wrote:

Hi Alvaro,

I was killing it to reproduce the bug. Once it is fixed we have no need to
kill these jobs.
We still have this production server running with autovacuum disabled until
we apply the fix during the next maintenance window. Is running vacuum
manually safe, or might we run into the same issue as with the autovacuum
wraparound job? I am thinking that with max_freeze set to 500 million, a
manual vacuum won't freeze xids, and it may be OK to run. But I am not
completely sure, and I'm worried about the same issue happening in
production even with a manual vacuum. Please let me know.

Yes, that seems correct. I think you're only running into
for-wraparound vacuums because the relfrozenxid for that table is above
the 200 million that IIRC is the default for max_freeze_age. You can
just set it to the maximum of 2000 million; the only downside is disk
space used by pg_clog and pg_multixact.
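
To see which tables are due, something like:

select oid::regclass, age(relfrozenxid) from pg_class where relkind = 'r' order by 2 desc limit 10;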

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#19 Dinesh Bhandary
dbhandary@switchfly.com
In reply to: Alvaro Herrera (#18)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

So what you are saying is that a vacuum without the wraparound condition
won't cause the blocking issue we experienced.

If that's the case I can just turn autovacuum back on (which can be changed
on the fly); I have already bumped the max_freeze settings up to 500 million
for now. We will change that to a lower number in the next maintenance
window.

Thanks
Dinesh


#20 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dinesh Bhandary (#19)
Re: BUG #11264: Auto vacuum wraparound job blocking everything

Dinesh Bhandary wrote:

So what you are saying is that a vacuum without the wraparound condition
won't cause the blocking issue we experienced.

Yep, that's correct. You're only seeing an issue because vacuum (manual
or auto) is trying to freeze the tuple. If the tuple is not considered
old enough to need freezing it will not cause the system to stall. Once
you get the oldestMulti value fixed, the freezing routine will be able
to look at the tuple, notice that the multixact in it is older than the
upgrade, and decide it can just rip it out without trying to "resolve"
it.

If that's the case I can just turn autovacuum back on (which can be changed
on the fly); I have already bumped the max_freeze settings up to 500 million
for now. We will change that to a lower number in the next maintenance
window.

Right -- AFAICS that should work.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#21 daveg
daveg@sonic.net
In reply to: Dinesh Bhandary (#1)
#22 daveg
daveg@sonic.net
In reply to: daveg (#21)
#23 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: daveg (#21)
#24 Bernhard Schrader
bernhard.schrader@innogames.de
In reply to: Alvaro Herrera (#23)