Schema-only dump omits constraints and triggers
Hi,
we are using "pg_dump -s" (schema-only) to copy the structure of a template/prototype database when setting up several dozen Fedora boxes. The dump worked fine until very recently. Now, on new machines being introduced into the network, it consistently fails to dump constraints and triggers - pg_dump seems to simply skip over them. Otherwise the dump appears complete; the schema/table/column layout is all there. We first suspected a problem on the template server, but the oddity is that a dump made with the Windows version of pgAdmin3 comes out complete.
The command we use is:
/usr/bin/pg_dump -h <SERVER_IP> -p <SERVER_PORT> -U <SERVER_USER> -F p -N '*somename*' -N 'somename2' -N 'somename3' -N 'somename4' -N 'somename5' -T '*somename6*' -s -v -f /some/dir/schemacopy.sql <DATABASE_NAME>
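For reference, a quick way to see whether the generated file contains any trigger or constraint definitions at all (on a complete plain-format dump both counts should be non-zero):
grep -c 'CREATE TRIGGER' /some/dir/schemacopy.sql
grep -c 'ADD CONSTRAINT' /some/dir/schemacopy.sql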
The dump is made by a script we consider stable, and it has not changed in a long while. We also could not pin down any other difference between the systems where it previously worked and the ones where it now fails: the operating system (Fedora 16) is the same, the hardware is the same, and the template database server is the same. It does not matter whether we run the script on an up-to-date system or on an outdated off-the-liveCD installation, so it is most probably not update-related. The server (as a system) is sometimes under quite heavy load, so the problem might be resource-related - either now or at some earlier point.
Searching through the archives, I have only stumbled upon a post from 2003 about a similar issue (http://archives.postgresql.org/pgsql-admin/2003-08/msg00239.php) which might be connected; however, since the reporter gave up quickly, the issue remained unsolved.
How can we dig into this further? What might be happening?
Best regards,
Marek Kielar
On 07/26/2012 04:09 PM, Marek Kielar wrote:
Postgres version?
Is there more than one version of PG on the machine?
Does the dump made with pgAdmin use the same parameters?
Any errors in the logs on either the dump or restore side?
--
Adrian Klaver
adrian.klaver@gmail.com
Hi again,
I'm sorry about the missing version information - I concentrated so much on describing the problem correctly that I forgot the basics. The version at the client locations is 9.1.4 (though this turns out to be irrelevant now); the server is 9.0.4.
We found what the problem was. Another problem stems from it, however. Please read on.
To add to the information already provided: we keep a two-way backup of the template database. One copy is WAL replication; the other is londiste (SkyTools) replication with a periodic complete copy. As it turned out, the "stable" script uses not, as we remembered, the actual template database, but the londiste-replicated one, which was due to make its next complete copy a few days ago. That copy did not complete, however: the schema/table/column structure transferred, but the constraints and triggers did not get through, because the machine ran out of hard drive space. Digging into it, we found that the space was not taken up by files visible in the filesystem; it was filled with deleted files that the PostgreSQL server was still holding open, probably for quite a while. After restarting the server, many, many gigabytes suddenly became available on disk. And this is the new problem: the server has quite a high throughput, and that is probably what causes the "leakage". How can we force the server to let go of such files? Or is it an actual leak that needs to be investigated?
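For anyone hitting the same symptom, a sketch of how the held-open space can be spotted without a restart (assuming the cluster lives on its own mount point, /data in our case; lsof's +L1 selects open files whose link count is below one, i.e. deleted, and -a ANDs it with the filesystem):
# all deleted-but-still-open files on the filesystem
lsof -a +L1 /data
# rough total of bytes they pin down; column 7 is SIZE/OFF in default lsof
# output, and a file held open by several backends is counted more than once
lsof -a +L1 /data | awk 'NR > 1 { sum += $7 } END { print sum }'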
On a side note: obviously the Windows dump came out alright because it was taken from the proper database, not from the replicated copy.
Best regards,
Marek Kielar
On 07/27/2012 09:32 AM, Marek Kielar wrote:
What were the deleted files?
WAL, Logs, other?
What type of WAL replication are you doing?
Streaming, log shipping, etc?
What are your settings for the WAL replication?
In particular wal_keep_segments ?
Is the WAL replication actually working?
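For the last two, on 9.0 something like this would show it - the locations should advance and stay close to each other (host placeholders hypothetical):
psql -h <PRIMARY> -c "SELECT pg_current_xlog_location();"
psql -h <STANDBY> -c "SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"
psql -h <PRIMARY> -c "SHOW wal_keep_segments;"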
--
Adrian Klaver
adrian.klaver@gmail.com
Hi,
On 28 July 2012 at 1:10, Adrian Klaver <adrian.klaver@gmail.com> wrote:
What were the deleted files?
WAL, Logs, other?
What type of WAL replication are you doing?
Streaming, log shipping, etc?
What are your settings for the WAL replication?
In particular wal_keep_segments ?
Is the WAL replication actually working?
at this time - a couple of days after the restart - the clog of deleted-but-open files has not re-formed yet. Thus, I am unable to tell you what the files were; we did not pay that much attention to it at the time. There were some WAL files among them, but I cannot tell what the overall makeup was. I will provide this information when possible.
The WAL replication is streaming replication with a hot standby server. The servers have a direct connection to one another. The configuration is appended below.
The replication is working fine.
The primary server also has a single mostly-"idle" transaction from each client node - a trait of an "always online" client application. Immediately after a primary server restart this does not add much overhead, though. I cannot tell whether, in the long run, it might cause the problem.
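An open transaction would show up with a non-NULL xact_start; on 9.0 something like this lists the sessions (column names as in 9.0, renamed in later releases):
psql -c "SELECT procpid, usename, xact_start, current_query FROM pg_stat_activity ORDER BY xact_start;"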
Configuration follows:
- primary server postgresql.conf - WAL section
- standby server postgresql.conf - WAL section
- standby server recovery.conf
The primary server WAL configuration is:
#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------
# - Settings -
wal_level = hot_standby # minimal, archive, or hot_standby
# (change requires restart)
#fsync = on # turns forced synchronization on or off
#synchronous_commit = on # immediate fsync at commit
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
# fdatasync (default on Linux)
# fsync
# fsync_writethrough
# open_sync
#full_page_writes = on # recover from partial page writes
wal_buffers = 1MB # min 32kB
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
#commit_delay = 0 # range 0-100000, in microseconds
#commit_siblings = 5 # range 1-1000
# - Checkpoints -
#checkpoint_segments = 30 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
# - Archiving -
archive_mode = on # allows archiving to be done
# (change requires restart)
archive_command = 'rsync %p <CUT>/%f' # command to use to archive a logfile segment
archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
# - Streaming Replication -
max_wal_senders = 5 # max number of walsender processes
# (change requires restart)
#wal_sender_delay = 200ms # walsender cycle time, 1-10000 milliseconds
wal_keep_segments = 32 # in logfile segments, 16MB each; 0 disables
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
#hot_standby = off # "on" allows queries during recovery
# (change requires restart)
#max_standby_archive_delay = 30s # max delay before canceling queries
# when reading WAL from archive;
# -1 allows indefinite delay
#max_standby_streaming_delay = 30s # max delay before canceling queries
# when reading streaming WAL;
# -1 allows indefinite delay
The standby server WAL configuration is:
#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------
# - Settings -
#wal_level = minimal # minimal, archive, or hot_standby
# (change requires restart)
#fsync = on # turns forced synchronization on or off
#synchronous_commit = on # immediate fsync at commit
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
# fdatasync (default on Linux)
# fsync
# fsync_writethrough
# open_sync
#full_page_writes = on # recover from partial page writes
#wal_buffers = 64kB # min 32kB
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
#commit_delay = 0 # range 0-100000, in microseconds
#commit_siblings = 5 # range 1-1000
# - Checkpoints -
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
# - Archiving -
#archive_mode = on # allows archiving to be done
# (change requires restart)
#archive_command = 'cp %p /backup/repl/%f' # command to use to archive a logfile segment
#archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
# - Streaming Replication -
#max_wal_senders = 5 # max number of walsender processes
# (change requires restart)
#wal_sender_delay = 200ms # walsender cycle time, 1-10000 milliseconds
#wal_keep_segments = 32 # in logfile segments, 16MB each; 0 disables
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
hot_standby = on # "on" allows queries during recovery
# (change requires restart)
#max_standby_archive_delay = 30s # max delay before canceling queries
# when reading WAL from archive;
# -1 allows indefinite delay
#max_standby_streaming_delay = 30s # max delay before canceling queries
# when reading streaming WAL;
# -1 allows indefinite delay
The standby server recovery.conf file:
standby_mode = 'on'
primary_conninfo = 'host=<CUT> port=<CUT> user=<CUT>'
trigger_file = '<CUT>/repl_trigger'
restore_command = 'rsync <CUT>/%f "%p"'
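As a side note: with archive_mode = on, the rsync'ed segments accumulate at the archive target indefinitely unless something prunes them, which can eat disk space on its own. On 9.0, one sketch of a fix is to let the standby prune them in recovery.conf via the contrib pg_archivecleanup tool (archive directory placeholder hypothetical):
archive_cleanup_command = 'pg_archivecleanup <ARCHIVE_DIR> %r'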
Hi,
to complement the information from the previous message:
The clog of open files has somewhat re-formed - a full lsof listing (filtered to unique files) for the postmaster processes on the database mount is here:
http://BillionUploads.com/ya9kjv78t9es/postmaster_files_sorted.csv.html
The consecutive commands were issued within minutes of each other, and their outputs differ slightly.
Some totals / aggregates (bytes):
df   – /data                     83 141 382 144
du   – /data                     29 170 365 801
lsof – /data                     75 348 037 632
lsof – /data/base                74 975 969 280
lsof – /data/base (deleted)      53 769 936 896
lsof – /data/pg_xlog                369 098 752
lsof – /data/pg_xlog (deleted)      201 326 592
lsof – /data/global                   2 965 504
It is clear that the server processes are holding on to most of these files, keeping them from actually being deleted.
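(Roughly, totals like these can be produced with something along these lines - column 5 is TYPE and column 7 is SIZE/OFF in default lsof output; duplicate entries for the same file still need to be filtered out, e.g. by the NODE column:)
lsof /data | awk '$5 == "REG" { sum += $7 } END { print sum }'
lsof /data | awk '$5 == "REG" && /deleted/ { sum += $7 } END { print sum }'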
On 08/06/2012 05:08 AM, Marek Kielar wrote:
The clog of open files has somewhat re-formed - a full lsof listing (filtered to unique files) for the postmaster processes on the database mount is here:
http://BillionUploads.com/ya9kjv78t9es/postmaster_files_sorted.csv.html
FYI, you might want to consider using some other site for uploads. The above one
is sort of scary and leads you down all sorts of false paths.
It is clear that the server processes are holding on to most of these files, keeping them from actually being deleted.
Well, the nature of database data files is that they expand and/or contract as
needed. Unless you are getting rid of the actual objects they refer to,
they will not be deleted. The WAL files in pg_xlog are a different
matter, but in the listing you sent they seem to be reasonable. Off the
top of my head, there are a couple of things that can cause data files
to expand unnecessarily:
1) Autovacuum is not aggressive enough.
2) There are open transactions keeping old tuples from being removed.
From previous posts, you mentioned a 'permanent' connection to the
database. Are you sure it is not holding an open transaction?
The pg_locks view would be a good place to start:
http://www.postgresql.org/docs/9.1/interactive/view-pg-locks.html
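For instance, something like this (9.0/9.1 column names; pg_stat_activity.procpid became pid in later releases) pairs each lock holder with its transaction start time and current query:
psql -c "SELECT l.pid, l.mode, l.granted, a.xact_start, a.current_query FROM pg_locks l JOIN pg_stat_activity a ON a.procpid = l.pid ORDER BY a.xact_start;"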
--
Adrian Klaver
adrian.klaver@gmail.com
On 6 August 2012 at 17:00, Adrian Klaver <adrian.klaver@gmail.com> wrote:
FYI, you might want to consider using some other site for uploads. The above one is sort of scary and leads you down all sorts of false paths.
Sorry about that - it was the first time I had used a hosting service, and I picked one (from a list on Wikipedia) that didn't require creating an account. I guess NoScript spoils you by keeping you out of trouble, but it also leads to wrong assumptions. I'll try to choose better next time.
1) Autovacuum is not aggressive enough.
2) There are open transactions keeping old tuples from being removed. From previous posts, you mentioned a 'permanent' connection to the database. Are you sure it is not holding an open transaction?
The pg_locks view would be a good place to start:
http://www.postgresql.org/docs/9.1/interactive/view-pg-locks.html
1) Running an ordinary VACUUM FULL ANALYZE and REINDEX (through pgAdmin3) on all the databases in the cluster, including the "postgres" database, did not release a substantial amount of disk space - maybe a few percent of the total.
2) There do not seem to be any long-running transactions, even though the same PIDs do recur for a while (which seems reasonable, since the connections are kept open):
postgres=# SELECT * FROM pg_locks ORDER BY pid;
locktype;database;relation;page;tuple;virtualxid;transactionid;classid;objid;objsubid;virtualtransaction;pid;mode;granted
relation;11874;10985;;;;;;;;85/101738;24367;AccessShareLock;t
virtualxid;;;;;85/101738;;;;;85/101738;24367;ExclusiveLock;t
virtualxid;;;;;20/788838;;;;;20/788838;24505;ExclusiveLock;t
virtualxid;;;;;14/923780;;;;;14/923780;24621;ExclusiveLock;t
virtualxid;;;;;76/139304;;;;;76/139304;24699;ExclusiveLock;t
virtualxid;;;;;55/199999;;;;;55/199999;24703;ExclusiveLock;t
virtualxid;;;;;59/363780;;;;;59/363780;24926;ExclusiveLock;t
(7 rows)
And after some time, from a fresh invocation of psql (to release the previous PID):
postgres=# SELECT * FROM pg_locks ORDER BY pid;
locktype;database;relation;page;tuple;virtualxid;transactionid;classid;objid;objsubid;virtualtransaction;pid;mode;granted
virtualxid;;;;;56/410614;;;;;56/410614;25105;ExclusiveLock;t
virtualxid;;;;;3/667499;;;;;3/667499;25145;ExclusiveLock;t
relation;11874;10985;;;;;;;;85/101817;25171;AccessShareLock;t
virtualxid;;;;;85/101817;;;;;85/101817;25171;ExclusiveLock;t
(4 rows)
We are again slowly approaching the point where a server restart will be needed. If/when that happens, I'll provide statistics again.
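If it comes to that, a minimal snapshot taken just before the restart should make the before/after numbers directly comparable (paths as above; -B1 asks GNU df/du for byte units):
df -B1 /data
du -s -B1 /data
lsof -a +L1 /data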
Best regards,
Marek Kielar