Errors on missing pg_subtrans/ files with 9.3
G'day list. Didn't get any interest in pgsql-general, thought I'd try
my luck here, which perhaps would be more fitting in case I've
stumbled upon an edge case issue or something...
I've recently upgraded a number of servers from PostgreSQL 9.2.5 to
9.3.1 and have started getting the following errors every couple of
hours along with some failed transactions. I have been unable to track
down any sort of rhyme or reason for the errors yet, so I figured I'd
check with the mailing list to see if I've potentially hit into a
potential bug. The errors look like the following:
ERROR: could not access status of transaction 4179979
Could not open file "pg_subtrans/003F": No such file or directory.
Indeed, pg_subtrans/003F does not exist on the file system.
I have found similar errors mentioned a number of years ago, in a
thread starting here, but couldn't find any real resolution.
/messages/by-id/20111109150225.076c2f38@list.ru
A few details:
- the new database cluster was created by dumping global settings
using pg_dumpall and then dumping each database individually. Several
of the databases within this cluster have PostGIS installed and I
wanted to do a full PostGIS upgrade, which necessitated individual
dumps.
- this is the only database in a cluster of 14 databases that is
exhibiting the problem and it is sporadic at best. This database is
fairly write-intensive, and has been up since November 6th. In that
time, we've had 17 such errors.
- the pg_subtrans files mentioned in the log file do not exist on the
file system.
- I have rebuilt the cluster several times, cleaned out the
pg_statistic table, and ran `VACUUM FULL ANALYZE` on every database in
case that was a factor (based on some similar threads from the
PostgreSQL 8.4 days), but there was no effect and the errors still
occur.
Anyone have any suggestions or ideas? Thankfully these are dev systems
so I have some room to experiment and can post some more details as
necessary.
The following are the log file entries from one of the errors. Some
details have been sanitized for privacy reasons, but the overall lines
themselves are accurate.
Nov 10 10:14:02 dev-server postgres[29835]: [4-1] user=dev,db=dev
ERROR: could not access status of transaction 4179979
Nov 10 10:14:02 dev-server postgres[29835]: [4-2] user=dev,db=dev
DETAIL: Could not open file "pg_subtrans/003F": No such file or
directory.
Nov 10 10:14:02 dev-server postgres[29835]: [4-3] user=dev,db=dev
CONTEXT: SQL statement "SELECT 1 FROM ONLY
"typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1
FOR KEY SHARE OF x"
Nov 10 10:14:02 dev-server postgres[29835]: [4-4] user=dev,db=dev
STATEMENT: insert into raw (url, id) values ($1, $2)
Nov 10 10:14:02 dev-server postgres[29839]: [4-1] user=dev,db=dev
ERROR: could not access status of transaction 4179979
Nov 10 10:14:02 dev-server postgres[29839]: [4-2] user=dev,db=dev
DETAIL: Could not open file "pg_subtrans/003F": No such file or
directory.
Nov 10 10:14:02 dev-server postgres[29839]: [4-3] user=dev,db=dev
CONTEXT: SQL statement "SELECT 1 FROM ONLY
"typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1
FOR KEY SHARE OF x"
Nov 10 10:14:02 dev-server postgres[29839]: [4-4] user=dev,db=dev
STATEMENT: update listings set value_a=$1 where id=$2
Nov 10 10:14:02 dev-server postgres[29827]: [4-1] user=dev,db=dev
ERROR: could not access status of transaction 4179979
Nov 10 10:14:02 dev-server postgres[29827]: [4-2] user=dev,db=dev
DETAIL: Could not open file "pg_subtrans/003F": No such file or
directory.
Nov 10 10:14:02 dev-server postgres[29827]: [4-3] user=dev,db=dev
CONTEXT: SQL statement "SELECT 1 FROM ONLY
"typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1
FOR KEY SHARE OF x"
I can provide additional details like my postgresql.conf if it would
help, although it's fairly standard outside of tuning as provided by
pgtune.
Cheers and thanks, List.
Hi,
On 2013-11-12 10:56:55 -0500, J Smith wrote:
> G'day list. Didn't get any interest in pgsql-general, thought I'd try
> my luck here, which perhaps would be more fitting in case I've
> stumbled upon an edge case issue or something...
Normally the bug report form / the -bugs mailing list is the right one
for things like this.
> ERROR: could not access status of transaction 4179979
> Could not open file "pg_subtrans/003F": No such file or directory.
A couple of questions:
* Did you have any OS, hardware crashes?
* Do you run with fsync=off or full_page_writes=off?
* Could you post the pg_controldata output?
* Is it always the same xid?
* Does SELECT count(*) FROM pg_prepared_xacts; return 0?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
* J Smith (dark.panda+lists@gmail.com) wrote:
> I've recently upgraded a number of servers from PostgreSQL 9.2.5 to
> 9.3.1 and have started getting the following errors every couple of
> hours along with some failed transactions.
How was this upgrade done? If you used pg_upgrade, what version of the
pg_upgrade code did you use? As I recall, there was a bug in older
versions which could exhibit in this way..
http://wiki.postgresql.org/wiki/20110408pg_upgrade_fix
Thanks,
Stephen
On 2013-11-12 11:25:03 -0500, Stephen Frost wrote:
> * J Smith (dark.panda+lists@gmail.com) wrote:
> > I've recently upgraded a number of servers from PostgreSQL 9.2.5 to
> > 9.3.1 and have started getting the following errors every couple of
> > hours along with some failed transactions.
> How was this upgrade done? If you used pg_upgrade, what version of the
> pg_upgrade code did you use? As I recall, there was a bug in older
> versions which could exhibit in this way..
He referred to using pg_dumpall/pg_dump. But that bug was erroring out
on pg_clog, not pg_subtrans, right?
Getting such errors on pg_subtrans indicates corruption or a bug
somewhere; we don't do such lookups if the to-be-looked-up xid is older
than TransactionXmin, and we zero pg_subtrans at restarts.
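For reference, the guard looks roughly like this (paraphrased from
memory from src/backend/access/transam/subtrans.c, not verbatim):

/* Sketch of SubTransGetTopmostTransaction(): pg_subtrans is consulted
 * only for xids >= TransactionXmin; older pages may have been
 * truncated away, and the whole directory is zeroed at startup. */
TransactionId
SubTransGetTopmostTransaction(TransactionId xid)
{
    TransactionId parentXid = xid,
                  previousXid = xid;

    /* Can't ask about stuff that might not be around anymore */
    Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));

    while (TransactionIdIsValid(parentXid))
    {
        previousXid = parentXid;
        if (TransactionIdPrecedes(parentXid, TransactionXmin))
            break;
        parentXid = SubTransGetParent(parentXid);   /* pg_subtrans read */
    }

    Assert(TransactionIdIsValid(previousXid));
    return previousXid;
}

So a lookup for an xid below TransactionXmin should never reach the
pg_subtrans files in the first place.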
My gut feeling is that it's related to foreign key locks doing
XactLockTableWait()s on too-old transactions or something like that.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 12, 2013 at 11:25 AM, Stephen Frost <sfrost@snowman.net> wrote:
> How was this upgrade done? If you used pg_upgrade, what version of the
> pg_upgrade code did you use? As I recall, there was a bug in older
> versions which could exhibit in this way..
As mentioned in the original post...
"the new database cluster was created by dumping global settings
using pg_dumpall and then dumping each database individually."
We haven't been able to use pg_upgrade as we rely heavily on PostGIS
and do hard upgrades via pg_dump and postgis_restore.pl when we
upgrade.
Cheers
G'day Andres.
On Tue, Nov 12, 2013 at 11:13 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-11-12 10:56:55 -0500, J Smith wrote:
> > G'day list. Didn't get any interest in pgsql-general, thought I'd try
> > my luck here, which perhaps would be more fitting in case I've
> > stumbled upon an edge case issue or something...
> Normally the bug report form / the -bugs mailing list is the right one
> for things like this.
Ah, okay. I can redirect there if that's the better location.
(Assuming it's a bug and not a misconfiguration on my part, although
the configuration on these servers has been consistent since 9.0, and
we've upgraded through each of the 9.x series while merging together
new settings accordingly.)
> > ERROR: could not access status of transaction 4179979
> > Could not open file "pg_subtrans/003F": No such file or directory.
> A couple of questions:
> * Did you have any OS, hardware crashes?
Haven't had any issues on this machine.
> * Do you run with fsync=off or full_page_writes=off?
Both fsync and full_page_writes are in their default "on"
configurations. I'll post our full postgresql.conf below, as it is
quite short.
> * Could you post the pg_controldata output?
Sure, posted below.
> * Is it always the same xid?
Seems to be different every time.
> * Does SELECT count(*) FROM pg_prepared_xacts; return 0?
Yes it does.
Here's the postgresql.conf and pg_controldata dump.
# postgresql.conf
listen_addresses = '*'
wal_level = hot_standby
checkpoint_completion_target = 0.7
max_wal_senders = 3
wal_keep_segments = 64
logging_collector = off
datestyle = 'iso, mdy'
timezone = 'Canada/Eastern'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
default_statistics_target = 100
maintenance_work_mem = 1GB
effective_cache_size = 44GB
work_mem = 320MB
wal_buffers = 4MB
checkpoint_segments = 64
shared_buffers = 15GB
max_connections = 200
# pg_controldata
pg_control version number: 937
Catalog version number: 201306121
Database system identifier: 5942827484423487452
Database cluster state: in production
pg_control last modified: Tue 12 Nov 2013 11:39:07 AM EST
Latest checkpoint location: 30/50F7A180
Prior checkpoint location: 30/4914C428
Latest checkpoint's REDO location: 30/4CB2C450
Latest checkpoint's REDO WAL file: 00000001000000300000004C
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/5138174
Latest checkpoint's NextOID: 3574036
Latest checkpoint's NextMultiXactId: 5762623
Latest checkpoint's NextMultiOffset: 11956824
Latest checkpoint's oldestXID: 1673
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 5138174
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Time of latest checkpoint: Tue 12 Nov 2013 11:35:37 AM EST
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current max_connections setting: 200
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Cheers and thanks for the help all.
* Andres Freund (andres@2ndquadrant.com) wrote:
> He referred to using pg_dumpall/pg_dump. But that bug was erroring out
> on pg_clog, not pg_subtrans, right?
Yeah, that was pg_clog. Obviously responded before really looking at
it. :)
> My gut feeling is that it's related to foreign key locks doing
> XactLockTableWait()s on too-old transactions or something like that.
Ugh, that wouldn't be good..
Thanks,
Stephen
* J Smith (dark.panda+lists@gmail.com) wrote:
> We haven't been able to use pg_upgrade as we rely heavily on PostGIS
> and do hard upgrades via pg_dump and postgis_restore.pl when we
> upgrade.
Did you also upgrade to PostGIS 2.x as part of this..? Seems like it'd
be unrelated, but one never knows. Any chance you could distill this
down into a small test case which exhibits the problem? I'm guessing
'no', but figured I'd ask anyway..
Thanks,
Stephen
Hi,
On 2013-11-12 11:46:19 -0500, J Smith wrote:
> > * Does SELECT count(*) FROM pg_prepared_xacts; return 0?
> Yes it does.
Could you show the output? Do you actually use prepared xacts actively?
Do you actively use row level locking? Is there high concurrency in that
environment? In short, could you describe your database usage a bit?
Did you restart the database since getting the errors?
> Latest checkpoint's NextXID: 0/5138174
> Latest checkpoint's NextOID: 3574036
> Latest checkpoint's NextMultiXactId: 5762623
> Latest checkpoint's NextMultiOffset: 11956824
> Latest checkpoint's oldestXID: 1673
> Latest checkpoint's oldestXID's DB: 1
> Latest checkpoint's oldestActiveXID: 5138174
> Latest checkpoint's oldestMultiXid: 1
Interesting. Pretty high multixact usage.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 12, 2013 at 11:54 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Did you also upgrade to PostGIS 2.x as part of this..? Seems like it'd
> be unrelated, but one never knows. Any chance you could distill this
> down into a small test case which exhibits the problem? I'm guessing
> 'no', but figured I'd ask anyway..
Yeah, we upgraded to PostGIS 2.1 as part of the upgrade. In the case
of the specific database that is exhibiting the problem, we only
really use PostGIS for storage and don't do a lot of spatial querying,
as this database is basically for an aggregation service we run. In
the case of the transactions that appear to be causing the problems,
we don't even use geometry fields at all.
Cheers
On Tue, Nov 12, 2013 at 11:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > * Does SELECT count(*) FROM pg_prepared_xacts; return 0?
> > Yes it does.
> Could you show the output? Do you actually use prepared xacts actively?
jay:dev@jagger=# select * from pg_prepared_xacts;
transaction | gid | prepared | owner | database
-------------+-----+----------+-------+----------
(0 rows)
We don't actively use prepared transactions, no. We use a lot of
prepared statements, but no two-phase committing.
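(For clarity on the distinction: illustrative SQL only, using the
listings table from the logs above.)

-- Prepared statement: a server-side parsed/planned query. This is what
-- we use heavily, via Hibernate.
PREPARE get_listing (integer) AS
    SELECT * FROM listings WHERE id = $1;
EXECUTE get_listing(42);

-- Prepared transaction: two-phase commit, which we don't use at all.
-- With max_prepared_transactions = 0, as on this cluster, the PREPARE
-- TRANSACTION step would simply error out.
BEGIN;
UPDATE listings SET value_a = 'x' WHERE id = 42;
PREPARE TRANSACTION 'batch-42';   -- would then appear in pg_prepared_xacts
COMMIT PREPARED 'batch-42';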
> Do you actively use row level locking?
Not to my knowledge, no. I've enabled statement logging for a little
while so we can get a snapshot of what our average queries look like,
but I'm not seeing any PREPARE TRANSACTION or explicit LOCK
statements.
> Is there high concurrency in that environment? In short, could you
> describe your database usage a bit?
This particular database is used for an aggregation service. There are
a handful of Hibernate-based Java processes that manipulate and massage
the data from a number of data feeds we receive 24-7. From a quick
grepping of the statement log file, we captured the following over the
course of approximately 13 minutes:
167452 SELECTs
72388 UPDATEs
6782 INSERTs
2 DELETEs
> Did you restart the database since getting the errors?
Yes, we've restarted several times.
> > Latest checkpoint's NextMultiXactId: 5762623
> Interesting. Pretty high multixact usage.
Part of our process involves collecting several records and
aggregating them within a single transaction. There will usually be a
handful of SELECTs to fetch the data, a bit of processing in the Java
backend, then a set of either UPDATEs or INSERTs into two tables to
store the results of the processing, or in rarer cases a couple of
DELETEs.
Our Java guy would have more details, as I just have the output of the
log files and a high-level view of how our backend processes work, but
this is the general gist of things as I understand it.
Cheers
Looks like we got another set of errors overnight. Here's the log file
from the errors. (Log file scrubbed slightly to remove private data,
but still representative of the problem I believe.)
Nov 13 05:34:34 dev postgres[6084]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6084]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6084]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6084]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6076]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6076]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6076]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6076]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6087]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6087]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6087]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6087]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6086]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6086]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6086]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6086]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6088]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6088]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6088]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6088]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6085]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6085]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6085]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6085]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Several processes all seemed to hit the problem at the same moment,
and all of them refer to the same transaction ID. Again, the file
pg_subtrans/0060 doesn't exist, and the only file that does exist is
pg_subtrans/005A, which appears to be a zeroed-out file 245760 bytes in
length.
Still don't have a clue as to how I can reproduce the problem. In
every case the error occurred in a manner identical to those shown in
the log above; the only difference is whether the failing statement was
an UPDATE on table_X or an INSERT on table_Y.
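For what it's worth, the FOR KEY SHARE query in the CONTEXT lines isn't
one of ours; as far as I can tell it's the internal query the
foreign-key trigger runs against the referenced table, so presumably a
foreign key along these lines is involved (hypothetical sketch; the
real column and table layout may differ):

CREATE SCHEMA IF NOT EXISTS typhon;

CREATE TABLE typhon.collection_batches (id integer PRIMARY KEY);

CREATE TABLE listings (
    id             integer PRIMARY KEY,
    -- inserting or updating a row carrying this FK makes the RI trigger
    -- run: SELECT 1 FROM ONLY "typhon"."collection_batches" x
    --   WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x
    batch_id       integer REFERENCES typhon.collection_batches (id),
    deactivated_at timestamptz,
    lock_version   integer
);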
Not sure what direction I should head to now. Perhaps some aggressive
logging would help, so we can see the queries surrounding the
problems? I could reconfigure things to capture all statements and set
up monit or something to send an alert when the problem resurfaces,
for instance.
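Concretely, something like this in postgresql.conf, to be reverted once
we catch one (assuming the disk can keep up with the statement volume):

# temporary, aggressive logging while chasing the pg_subtrans errors
log_statement = 'all'
log_line_prefix = '%m %p %u %d %v %x '  # timestamp, pid, user, db, virtual xid, xid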
Cheers all.
On Wed, Nov 13, 2013 at 12:29 PM, J Smith <dark.panda+lists@gmail.com> wrote:
> Looks like we got another set of errors overnight. Here's the log file
> from the errors. (Log file scrubbed slightly to remove private data,
> but still representative of the problem I believe.)
> [snip: same six log entries as quoted in full above]
> Not sure what direction I should head to now. Perhaps some aggressive
> logging would help, so we can see the queries surrounding the
> problems? I could reconfigure things to capture all statements and set
> up monit or something to send an alert when the problem resurfaces,
> for instance.
I think what would help the most is if you could arrange to obtain a
stack backtrace at the point when the error is thrown. Maybe put a
long sleep call in just before the error happens, and when it gets
stuck there, attach gdb and run bt full.
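Something along these lines, placed just before the ereport() that
raises the error (a sketch; the sleep length is arbitrary):

/* Debugging aid, not for production: park the backend before the error
 * is raised so a debugger can attach.  pg_usleep() and MyProcPid come
 * from miscadmin.h. */
elog(LOG, "about to raise SLRU error, attach gdb to pid %d", MyProcPid);
pg_usleep(600 * 1000000L);  /* ~10 minutes: gdb -p <pid>, then "bt full" */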
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Nov 15, 2013 at 3:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think what would help the most is if you could arrange to obtain a
> stack backtrace at the point when the error is thrown. Maybe put a
> long sleep call in just before the error happens, and when it gets
> stuck there, attach gdb and run bt full.
That could potentially be doable. Perhaps I could use something like
google-coredumper or something similar to have a core dump generated
if the error comes up? Part of the problem is that the error is so
sporadic that it's going to be tough to say when the next one will
occur. For instance, we haven't changed our load on the server, yet
the error hasn't occurred since Nov 13, 15:01. I'd also like to avoid
blocking on the server with sleep or anything like that unless
absolutely necessary, as there are other services we have in
development that are using other databases on this cluster. (I can as
a matter of last resort, of course, but if google-coredumper can do
the job I'd like to give that a shot first.)
Any hints on where I could insert something like this? Should I try
putting it into the section of elog.c dealing with ENOENT errors, or
try to find a spot closer to where the file itself is being opened? I
haven't looked at Postgres internals for a while now so I'm not quite
sure of the best location for this sort of thing.
Cheers
On Fri, Nov 15, 2013 at 4:01 PM, J Smith <dark.panda+lists@gmail.com> wrote:
> On Fri, Nov 15, 2013 at 3:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > I think what would help the most is if you could arrange to obtain a
> > stack backtrace at the point when the error is thrown.
> [snip]
> Any hints on where I could insert something like this? Should I try
> putting it into the section of elog.c dealing with ENOENT errors, or
> try to find a spot closer to where the file itself is being opened? I
> haven't looked at Postgres internals for a while now so I'm not quite
> sure of the best location for this sort of thing.
I'd look for the specific ereport() call that's firing, and put it
just before that.
(note that setting the error verbosity to 'verbose' will give you the
file and line number where the error is happening, which is useful if
the message can be generated from more than one place)
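That is, in postgresql.conf (or via SET as a superuser):

log_error_verbosity = verbose  # adds SQLSTATE plus source file, function, and line to each logged error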
I'm not familiar with google-coredumper but it sounds like a promising
technique.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Alright, we'll look into doing that heading into the weekend.
Interestingly, we haven't experienced the issue since our main Java
developer made some modifications to our backend system. I'm not
entirely sure what the changes entail except that it's a one-liner
that involves re-SELECTing a table during a transaction. We'll
roll back this change and re-compile Postgres with google-coredumper
and let it run over the weekend and see where we stand.
Cheers
On Tue, Nov 19, 2013 at 9:14 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'd look for the specific ereport() call that's firing, and put it
> just before that.
> (note that setting the error verbosity to 'verbose' will give you the
> file and line number where the error is happening, which is useful if
> the message can be generated from more than one place)
On Tue, Nov 19, 2013 at 10:16 AM, J Smith <dark.panda+lists@gmail.com> wrote:
> We'll roll back this change and re-compile Postgres with
> google-coredumper and let it run over the weekend and see where we
> stand.
Okay, I've patched our Postgres install and added in a call for
coredumper to output a core dump. We'll let our processes run for the
next few days and hopefully we'll get something we can look at. I've
attached the patch just in case anyone would like to make sure I've
done it in a sane way and that I've inserted the call in the correct
location.
Cheers
Attachments:
postgresql-coredumper.patch (application/octet-stream, +15/-1)
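The gist of it, roughly (a sketch rather than the attachment itself,
assuming google-coredumper's WriteCoreDump() API):

#include <google/coredumper.h>  /* WriteCoreDump(); link with -lcoredumper */

/* Sketch: in SlruReportIOError() (access/transam/slru.c), just before
 * the ereport(ERROR, ...) that produces "could not access status of
 * transaction", write a core file without killing the backend. */
{
    char corepath[MAXPGPATH];

    snprintf(corepath, sizeof(corepath), "/tmp/pg_slru.%d.core", MyProcPid);
    if (WriteCoreDump(corepath) != 0)
        elog(LOG, "WriteCoreDump(\"%s\") failed: %m", corepath);
}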
coredumper worked like a charm. Useful tool, that is... although as a
bit of advice, I'd try not to run it on Postgres if your various
memory settings are tweaked towards production use -- the core dump
that was captured on my server weighed in at 16 GB.
Anyways, I've attached both the log entries that caused the error as
well as a backtrace using `bt full`.
Cheers, hope it helps.
Attachments:
postgresql-backtrace.txt (text/plain; charset=UTF-8)
Hi,
On 2013-11-24 16:56:26 -0500, J Smith wrote:
> coredumper worked like a charm. Useful tool, that is... although as a
> bit of advice, I'd try not to run it on Postgres if your various
> memory settings are tweaked towards production use -- the core dump
> that was captured on my server weighed in at 16 GB.
> Nov 23 14:38:32 dev postgres[23810]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 13514992
> Nov 23 14:38:32 dev postgres[23810]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/00CE": Success.
> Nov 23 14:38:32 dev postgres[23810]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "dev"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Ok, this is helpful. Do you have rather long-running transactions? The
transaction that does the foreign key checks has an xid of 10260613,
while the row that's getting checked has 13514992.
#4 0x0000000000635dc7 in XactLockTableWait (xid=13514992) at lmgr.c:501
tag = {locktag_field1 = 13514992, locktag_field2 = 0, locktag_field3 = 0, locktag_field4 = 0, locktag_type = 4 '\004', locktag_lockmethodid = 1 '\001'}
#5 0x0000000000482223 in heap_lock_updated_tuple_rec (rel=0x2b20f050a8d0, tuple=<value optimized out>, ctid=<value optimized out>, xid=10260613, mode=LockTupleKeyShare) at heapam.c:4847
I am not sure whether that's the origin of the problem but at the very
least it seems to me that heap_lock_updated_tuple_rec() is missing
several important pieces:
a) do the priorXmax == xmin dance to check we're still following the
same ctid chain (see the sketch below). Currently we could easily
stumble across completely unrelated tuples if a chain element aborted
and got vacuumed.
b) Check whether a chain element actually aborted - currently we're
only doing that in the HEAP_KEYS_UPDATED case, but that seems wrong (we
can't check for committed, though!).
c) (reported separately as well) cope with failure of heap_fetch() to
return anything.
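For (a), something like the chain-following check done elsewhere (e.g.
in EvalPlanQualFetch); a sketch only, with the local variable names
heap_lock_updated_tuple_rec() would use:

/* Before trusting the fetched chain member, verify its xmin matches
 * the xmax (priorXmax) of the previous member; otherwise an aborted
 * and vacuumed member has let an unrelated tuple slip into the chain. */
if (TransactionIdIsValid(priorXmax) &&
    !TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(mytup.t_data)))
{
    UnlockReleaseBuffer(buf);
    return HeapTupleMayBeUpdated;  /* chain broken; nothing left to lock */
}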
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Nov 25, 2013 at 6:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Ok, this is helpful. Do you have rather long-running transactions? The
> transaction that does the foreign key checks has an xid of 10260613,
> while the row that's getting checked has 13514992.
We did have some long-running transactions, yes. We refactored a bit
and removed them and the problem ceased on our end. We ended up
reverting our changes for the sake of running this experiment over the
weekend and the errors returned. We've since restored our fix and
haven't had any problems since, so yeah, long-running transactions
appear to be involved.
We can continue to experiment if you have any additional tests you'd
like us to run. We may have to keep the experiments to running over
the weekend, but they're definitely do-able.
Cheers