pg_rawdump
I spent yesterday writing a new tool, pg_rawdump (to be released as
open source in due course), which takes the table files of an arbitrary
PostgreSQL database and transforms them back into tables (including
TOAST values).
In the course of doing this (a customer needed it because he only had a
copy of those files; the pg_clog etc. directories were lost), I noticed
two problems which needlessly complicate recovery:
a. It takes some guesswork to find out which file corresponds to
which table.
b. In order to recover the table content, it is necessary to
correlate it with the original table definition, which obviously
is lost along with the rest of the catalog.
In order to simplify recovery at this point (enormously), it would
have been very helpful (at almost negligible cost) to have the name
of the table, the names of the columns, and the types of the
columns available.
Why don't we insert that data into the first page of a regular table
file, in the special data area?
I'd be willing to create a patch for that (should be pretty easy),
if nobody considers it to be a bad idea.
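To make this concrete: here is a rough sketch (not an actual patch; all
names are illustrative) of the kind of record that could go into the
special area, assuming the usual backend environment (postgres.h
provides Oid, int16 and NAMEDATALEN):

/* Sketch only: a best-effort table descriptor for the special space
 * of a heap page. */
typedef struct RawTableDesc
{
    Oid     tableOid;               /* pg_class OID of the relation */
    int16   natts;                  /* number of columns */
    char    tableName[NAMEDATALEN]; /* relation name */
    /* followed by natts repetitions of:
     *   Oid  atttypid;             -- column type OID
     *   char attname[NAMEDATALEN]; -- column name
     */
} RawTableDesc;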
--
Stephen.
On Tue, Oct 19, 2010 at 1:12 PM, Stephen R. van den Berg <srb@cuci.nl> wrote:
In order to simplify recovery at this point (enormously), it would
have been very helpful (at almost negligible cost) to have the name
of the table, the names of the columns, and the types of the
columns available.
Why don't we insert that data into the first page of a regular table
file, in the special data area?
I'd be willing to create a patch for that (should be pretty easy),
if nobody considers it to be a bad idea.
There isn't necessarily one value for these attributes. You can
rename columns, and that rename may succeed and commit or fail and
roll back. You can drop or add columns, and some rows will have the
added columns while others will not. You could even add a column,
insert some rows, then abort -- all in a transaction. So some
(aborted) rows will have extra columns that aren't even present in
the current table definition.
All this isn't to say the idea you're presenting is impossible or a
bad idea. If this meta information was only a hint for forensic
purposes and you take into account these caveats it might still be
useful. But I'm not sure how useful. I mean, you can't really decipher
everything properly without the data in the catalog -- and you have to
premise this on the idea that you've lost everything in the catalog
but not the data in other tables. Which seems like a narrow use case.
--
greg
Greg Stark wrote:
On Tue, Oct 19, 2010 at 1:12 PM, Stephen R. van den Berg <srb@cuci.nl> wrote:
In order to simplify recovery at this point (enormously), it would
have been very helpful (at almost negligible cost) to have the name
of the table, the names of the columns, and the types of the
columns available.
Why don't we insert that data into the first page of a regular table
file, in the special data area?
I'd be willing to create a patch for that (should be pretty easy),
if nobody considers it to be a bad idea.
There isn't necessarily one value for these attributes. You can
rename columns, and that rename may succeed and commit or fail and
roll back. You can drop or add columns, and some rows will have the
added columns while others will not. You could even add a column,
insert some rows, then abort -- all in a transaction. So some
(aborted) rows will have extra columns that aren't even present in
the current table definition.
True.
All this isn't to say the idea you're presenting is impossible or a
bad idea. If this meta information was only a hint for forensic
purposes and you take into account these caveats it might still be
useful.
This is exactly what I meant it for. It would contain the most
recent ALTER TABLE state (give or take some delay due to cache
flushes).
But I'm not sure how useful. I mean, you can't really decipher
everything properly without the data in the catalog -- and you have to
premise this on the idea that you've lost everything in the catalog
but not the data in other tables. Which seems like a narrow use case.
It happens more often than you'd think. My client had it, and I've
seen numerous Google hits which show the same.
--
Stephen.
Be braver. You cannot cross a chasm in two small jumps.
"Stephen R. van den Berg" <srb@cuci.nl> writes:
In order to simplify recovery at this point (enormously), it would
have been very helpful (at almost negligible cost) to have the name
of the table, the names of the columns, and the types of the
columns available.
Why don't we insert that data into the first page of a regular table
file, in the special data area?
(1) it wouldn't necessarily fit
(2) what are you going to do to maintain it during ALTER TABLE?
(3) if there are any custom types involved, you're still lost.
regards, tom lane
On Tue, Oct 19, 2010 at 6:13 PM, Stephen R. van den Berg <srb@cuci.nl> wrote:
Greg Stark wrote:
But I'm not sure how useful. I mean, you can't really decipher
everything properly without the data in the catalog -- and you have to
premise this on the idea that you've lost everything in the catalog
but not the data in other tables. Which seems like a narrow use case.
It happens more often than you'd think. My client had it, and I've
seen numerous Google hits which show the same.
It happened to us recently when a customer had disk issues, and we
were able to get the table files back through forensics, but the
control files were not in good shape, and the cluster wouldn't start.
A tool like the one Stephen is proposing would, I hope, have helped
us recover at least some, if not most, of the data.
Roberto
Tom Lane wrote:
"Stephen R. van den Berg" <srb@cuci.nl> writes:
In order to simplify recovery at this point (enormously), it would
have been very helpful (at almost negligible cost) to have the name
of the table, the names of the columns, and the types of the
columns available.
Why don't we insert that data into the first page of a regular table
file, in the special data area?
(1) it wouldn't necessarily fit
Three viable options (pick any one, depending on other criteria):
a. If it doesn't fit at first, drop the column names and try again.
b. If necessary, extend it into the special data area of the following
page (repeat until everything is stored).
c. Simply put in what fits and discard the rest.
Please note that the data in there is purely informational and
best-effort in nature. It is not required for regular operation,
and it should have close to no performance impact in the normal
use case. It is meant to help with forensics if the catalog is
damaged or lost and you still want to attempt to recover most of
the data contained in the table file.
(2) what are you going to do to maintain it during ALTER TABLE?
Simply pick a point in time at which it gets overwritten outright:
either at reception of the command or at commit time, whichever
is more convenient to implement. It's a best-effort service; it
is not a problem if we get it wrong sometimes due to wildly
overlapping ALTER TABLE/commit/rollback sequences.
(3) if there are any custom types involved, you're still lost.
Yes, this is not a complete solution; it's a valid attempt at
making forensics a lot easier in the common case, at virtually
no cost to the running system.
--
Stephen.
Life is that brief interlude between nothingness and eternity.
Roberto Mello wrote:
On Tue, Oct 19, 2010 at 6:13 PM, Stephen R. van den Berg <srb@cuci.nl> wrote:
Greg Stark wrote:
premise this on the idea that you've lost everything in the catalog
but not the data in other tables. Which seems like a narrow use case.
It happens more often than you'd think. My client had it, and I've
seen numerous Google hits which show the same.
It happened to us recently when a customer had disk issues, and we
It usually happens when there are disk issues; that's exactly what it is for.
A tool like the one Stephen is proposing would, I hope, have helped
us recover at least some, if not most, of the data.
Well, because the customer could recreate (within reason) the original
table definitions, we were able to recover all of his data (12 tables,
including some toasted/compressed).
It's just that matching table and file, and subsequently figuring out
some missing columns which may have been added/removed later, can be
rather time-consuming, and it could be made a lot easier (not necessarily
perfect) if that information were present in the first page of a file.
--
Stephen.
Life is that brief interlude between nothingness and eternity.
"Stephen R. van den Berg" <srb@cuci.nl> writes:
It's just that matching table and file, and subsequently figuring out
some missing columns which may have been added/removed later, can be
rather time-consuming, and it could be made a lot easier (not necessarily
perfect) if that information were present in the first page of a file.
So you've already moved the goalposts from what was claimed in your
prior message. If the data is not maintained (with 100% reliability)
during ALTER TABLE, how are you going to do something like "figure out
missing columns"?
I can see the potential usefulness of a self-documenting table storage
format, but this proposal isn't that; it's just an unreliable kluge.
regards, tom lane
Tom Lane wrote:
"Stephen R. van den Berg" <srb@cuci.nl> writes:
It's just that matching table and file, and subsequently figuring out
some missing columns which may have been added/removed later, can be
rather time-consuming, and it could be made a lot easier (not necessarily
perfect) if that information were present in the first page of a file.
So you've already moved the goalposts from what was claimed in your
prior message. If the data is not maintained (with 100% reliability)
during ALTER TABLE, how are you going to do something like "figure out
missing columns"?
Most ALTER TABLE operations are well thought through and rarely undone
(at least not on production databases). This means that most tables
can be restored.
I can see the potential usefulness of a self-documenting table storage
format, but this proposal isn't that; it's just an unreliable kluge.
Restoring tables/databases from table storage only is, by definition,
an unreliable kludge. I'm not opposed to making the definition storage
more robust; but since the records in the table have already lost their
relation to the pg_clog records, and it therefore *already* is uncertain
which records were deleted and/or have the wrong number of columns, it
seems a needless waste of time and energy to provide more reliable
information about the column structure.
I know for a fact that those who have lost data in such a way, faced
with the choice between having this "unreliable kludgy information"
available now or waiting months or years until a reliable solution
exists, would in every single case opt for the former and get at least
some (if not all) of their data back in a shorter amount of time.
--
Stephen.
Life is that brief interlude between nothingness and eternity.
On Wed, Oct 20, 2010 at 10:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Stephen R. van den Berg" <srb@cuci.nl> writes:
It's just that matching table and file, and subsequently figuring out
some missing columns which may have been added/removed later, can be
rather time-consuming, and it could be made a lot easier (not necessarily
perfect) if that information were present in the first page of a file.
So you've already moved the goalposts from what was claimed in your
prior message. If the data is not maintained (with 100% reliability)
during ALTER TABLE, how are you going to do something like "figure out
missing columns"?
I can see the potential usefulness of a self-documenting table storage
format, but this proposal isn't that; it's just an unreliable kluge.
If we're looking to have any sort of "out of catalog" documentation of
table storage format, what about having a new relation fork that
simply "appends" each and every change made to the table format,
including ones rolled back, etc.? Make this relation fork append only,
and dump a completely new set of metadata to it each and every ALTER
TABLE. This fork would never need to be read by PG, so a relation
fork might even be too much. All you really need is a file you can
tie to a relation and blindly append "data" to on CREATE/ALTER
statements.
Sure, it will have more information than *needed*, but I can't see it
ever growing too big, and people doing forensics rarely complain about
having *too much* information available.
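Something as simple as this would do; a minimal sketch, assuming a
hypothetical per-relation metadata file (none of these names exist in
PostgreSQL today):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* On every CREATE/ALTER TABLE, blindly append the new definition
 * (e.g. the CREATE TABLE statement pg_dump would emit) to the
 * relation's metadata file. */
static void
append_table_metadata(const char *metapath, const char *ddl_text)
{
    int fd = open(metapath, O_WRONLY | O_APPEND | O_CREAT, 0600);

    if (fd < 0)
        return;                 /* best effort: never fail the DDL */
    (void) write(fd, ddl_text, strlen(ddl_text));
    (void) write(fd, "\n", 1);
    close(fd);
}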
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
Aidan Van Dyk <aidan@highrise.ca> writes:
On Wed, Oct 20, 2010 at 10:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I can see the potential usefulness of a self-documenting table storage
format, but this proposal isn't that; it's just an unreliable kluge.
If we're looking to have any sort of "out of catalog" documentation of
table storage format, what about having a new relation fork that
simply "appends" each and every change made to the table format,
including ones rolled back, etc.?
I thought of that too, but I'm not sure if it's going to help enough.
The trouble is that the information is only tied to the table itself
via file names. In a fsck-recovery scenario, you may not have the
correct file names. (One of the multiple problems with Stephen's
proposal is that the metadata would only be physically tied to the
first segment of a large table, and unless you can trust the file
names you won't be able to match it to the other segments.)
[ thinks for a bit... ] Perhaps we could stick some sort of unique
ID into tables, which could be correlated to the same unique ID
appearing in a metadata fork.
[ thinks some more... ] Of course, at this point you have to start
asking why the metadata fork should be thought to be any more reliable
than the system catalogs.
Make this relation fork append only,
and dump a completely new set of metadata to it each and every ALTER
TABLE.
You can bet that there'll be somebody out there who whines because their
app does lots and lots of repeated ALTER TABLEs, and so the metadata fork
grows forever. I think we'd have to be a bit smarter than this.
regards, tom lane
On Wed, Oct 20, 2010 at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[ thinks for a bit... ] Perhaps we could stick some sort of unique
ID into tables, which could be correlated to the same unique ID
appearing in a metadata fork.
It would be awfully nice if the table name was in the first few bytes
of the segments ;-)
[ thinks some more... ] Of course, at this point you have to start
asking why the metadata fork should be thought to be any more reliable
than the system catalogs.
It's not. But it's about data duplication. If you've lost some
files, and others are found in lost+found, etc. Or maybe you lost
only the "main" tablespace, and you still have your big honking files
from another tablespace around, etc.
The more copies of the data around, the more chance you have of being
able to get something usable from of it.
But we already have WAL archive as an external safe backup copy of *everything*.
Maybe the cost of those extra forks/duplicates is too much. Maybe it
would be cheaper to try and parse the WAL archive, and just
specifically rebuild the system catalogs for the couple of times
people actually need this type of data?
Or maybe a query/function that ultra-paranoid types can run daily,
etc., which would show the system catalog information about table
storage format, so that they could save it safely instead of munging
GB of WAL logs when they want it...
You can bet that there'll be somebody out there who whines because their
app does lots and lots of repeated ALTER TABLEs, and so the metadata fork
grows forever. I think we'd have to be a bit smarter than this.
Well, for bad habits, we have an equally bad solution ;-)
BEGIN;
ALTER TABLE bad_design RENAME TO bad_design_too_old;
CREATE TABLE bad_design (LIKE bad_design_too_old
    INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES);
INSERT INTO bad_design SELECT * FROM bad_design_too_old;
DROP TABLE bad_design_too_old;
COMMIT;
It's all about how much extra stuff we want "around" to make
forensic/reconstruction-type work easier when people can't go to the
documented, tried, tested, "normal restore from backup/WAL".
None? Or as much as possible? And what are the tradeoffs?
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
Tom Lane wrote:
Aidan Van Dyk <aidan@highrise.ca> writes:
If we're looking to have any sort of "out of catalog" documentation of
table storage format, what about having a new relation fork that
simply "appends" each and every change made to the table format,
including ones rolled back, etc.?
I presume that means that all table changes are appended to a single
central file per database? That would be a bad idea, because in the
typical problem scenario, losing this new catalogue basically recreates
the same problem. It would be preferable to keep the information
tied in with the actual table (file) it concerns.
[ thinks for a bit... ] Perhaps we could stick some sort of unique
ID into tables, which could be correlated to the same unique ID
appearing in a metadata fork.
Ideal would be: put the table-oid inside the header of each page
(either in the official header, or in the special area).
This way even lost blocks can be correlated to the same table.
I'd still vote for the latest known table definition in the first
page. It's by no means perfect, but it will help 99% of all
salvage attempts by an order of magnitude.
[ thinks some more... ] Of course, at this point you have to start
asking why the metadata fork should be thought to be any more reliable
than the system catalogs.
Quite. Which is why I wanted the best-effort latest version of the
table description in the first page of the table file instead.
Make this relation fork append only,
and dump a completely new set of metadata to it each and every ALTER
TABLE.
You can bet that there'll be somebody out there who whines because their
app does lots and lots of repeated ALTER TABLEs, and so the metadata fork
grows forever. I think we'd have to be a bit smarter than this.
Which means we come full circle and have to conclude that doing anything
comprehensive is too invasive for normal operations; best-effort is
all a forensics operation wants or can hope for.
--
Stephen.
Life is that brief interlude between nothingness and eternity.
On Wed, Oct 20, 2010 at 5:30 PM, Stephen R. van den Berg <srb@cuci.nl> wrote:
Ideal would be: put the table-oid inside the header of each page
(either in the official header, or in the special area).
This way even lost blocks can be correlated to the same table.
I'd still vote for the latest known table definition in the first
page. It's by no means perfect, but it will help 99% of all
salvage attempts by an order of magnitude.
I don't think we should shrink the amount of usable space by 4 bytes
per block and break on-disk compatibility just to make it easier to
reconstruct corrupted tables.
I'm pretty dubious about the proposal to stuff an otherwise-useless
metapage in every heap, too. If you have many small tables, you just
doubled your disk utilization - worse than that, maybe, if some of
them are empty. If we needed a metapage anyway and had extra space to
play with, stuffing some useful forensic information in there might be
worthwhile, but I have a hard time thinking that forensics alone is a
sufficient justification for such a change.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
On Wed, Oct 20, 2010 at 5:30 PM, Stephen R. van den Berg <srb@cuci.nl> wrote:
Ideal would be: put the table-oid inside the header of each page
(either in the official header, or in the special area).
This way even lost blocks can be correlated to the same table.
I'd still vote for the latest known table definition in the first
page. It's by no means perfect, but it will help 99% of all
salvage attempts by an order of magnitude.
and break on-disk compatibility just to make it easier to
If it's inserted in the "special" area, it will not break any
compatibility.
I don't think we should shrink the amount of usable space by 4 bytes
per block
Instead of putting it in every page, it could be inserted (say) once
every megabyte (if done in the special area) to avoid excessive
overhead.
I'm pretty dubious about the proposal to stuff an otherwise-useless
metapage in every heap, too.
The information is supposed to go in the special area, so it will not
be an extra page.
If you have many small tables, you just
doubled your disk utilization
For small tables, the table description typically is small as well,
so in the common case it all will still fit in one page.
- worse than that, maybe, if some of
them are empty.
An empty table does not contain any critical information which needs
to be restored (by definition :-); so the code that inserts the
table definition in the special area could easily be instructed *not*
to write out this information unless the table actually has entries.
If we needed a metapage anyway and had extra space to
play with, stuffing some useful forensic information in there might be
worthwhile, but I have a hard time thinking that forensics alone is a
sufficient justification for such a change.
The "change" could easily be made backward compatible to all on disk
formats which support the special area pointer (I'm not sure how far
back that is).
--
Stephen.
For children with short attention spans: boomerangs that don't come back.
"Stephen R. van den Berg" <srb@cuci.nl> writes:
Robert Haas wrote:
and break on-disk compatibility just to make it easier to
If it's inserted in the "special" area, it will not break any
compatibility.
I'll tell you what I really don't like about this proposal: we discuss
some scheme or other for taking over the "special space" in heap pages
at least once a year. None of them have been particularly compelling
so far, but one may come along that is; especially given that we're now
trying to maintain on-disk compatibility across versions. So I think
the opportunity cost of assigning a use to that space is mighty high.
I don't find this idea important enough to justify foreclosing future
uses for the special space.
The real bottom line is this: if you care enough about your data to
be willing to expend a large amount of effort on manual recovery
attempts, why didn't you have a decent backup scheme in place?
There are way too many scenarios where you'll have no hope of doing
any such manual recovery anyway.
regards, tom lane
Tom Lane wrote:
"Stephen R. van den Berg" <srb@cuci.nl> writes:
If it's inserted in the "special" area, it will not break any
compatibility.
I'll tell you what I really don't like about this proposal: we discuss
some scheme or other for taking over the "special space" in heap pages
at least once a year. None of them have been particularly compelling
so far, but one may come along that is; especially given that we're now
trying to maintain on-disk compatibility across versions. So I think
the opportunity cost of assigning a use to that space is mighty high.
I don't find this idea important enough to justify foreclosing future
uses for the special space.
Well, I had (of course) thought of that, and the classical solution to
this is to specify a certain attribute-based format, in order not to
cannibalise the space and block it for other future uses.
I.e. in the special area, we could start using something like:
2-byte field length (including the length field), 1-byte identifier,
field content.
For the recovery information I'd like to reserve:
identifier: 00: table OID
01: table layout
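As a standalone sketch (illustrative names only), packing one such
field could look like this:

#include <stdint.h>
#include <string.h>

/* Append one length/identifier/content field at dst; the 2-byte
 * length includes itself and the 1-byte identifier.  The caller must
 * ensure dst has room for 3 + len bytes. */
static size_t
put_field(uint8_t *dst, uint8_t ident, const void *content, uint16_t len)
{
    uint16_t total = (uint16_t) (sizeof(total) + 1 + len);

    memcpy(dst, &total, sizeof(total));
    dst[sizeof(total)] = ident;
    memcpy(dst + sizeof(total) + 1, content, len);
    return total;
}

A reader that doesn't recognise an identifier can simply skip the field
by its length, so other future uses of the special space would not be
foreclosed.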
The real bottom line is this: if you care enough about your data to
be willing to expend a large amount of effort on manual recovery
attempts, why didn't you have a decent backup scheme in place?
Two obvious valid answers would be: stupidity and/or ignorance,
sometimes a strain of bad luck.
I know it is a sad state of affairs, but not all users of PostgreSQL
are equally knowledgeable/intelligent/responsible.
There are way too many scenarios where you'll have no hope of doing
any such manual recovery anyway.
True. It's all a matter of statistics. Judging by the number of
reports googling turns up, I'd have to conclude that the proposed
extra information would have helped more than half of them.
--
Stephen.
On Thu, Oct 21, 2010 at 11:30 AM, Stephen R. van den Berg <srb@cuci.nl> wrote:
For the recovery information I'd like to reserve:
identifier: 00: table OID
01: table layout
So here's a proposal for something that could maybe be implemented. I
think I'm leaning against this currently, as there are just too many
caveats even for this limited functionality. I'm more inclined to go
with the idea someone else proposed: dumping an audit log of all DDL
on the table or, after any DDL, dumping the CREATE TABLE statements as
pg_dump would generate them to a separate fork.
But here goes, just to give you an idea what I think is doable and how
limited it would be:
In this metadata object put:
table oid
table name
number of columns
array of typlen for those columns (-1 for varlena, and the size for
fixed-length columns)
That would have a maximum size of just over 6k, which is too large for
a BLCKSZ-4k database but fits in default 8k databases. It would have
enough information to find all the table columns, but not to
understand how to interpret the contents, nor what their names or
types are.
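For concreteness, a sketch of that object as a struct (hypothetical
names; the comment shows where the size estimate comes from):

#include <stdint.h>

#define MAX_ATTS 1600            /* PostgreSQL's MaxHeapAttributeNumber */
#define NAMELEN  64              /* PostgreSQL's NAMEDATALEN */

/* With a 4-byte length entry per column, the worst case (1600
 * columns) is 1600 * 4 = 6400 bytes for the array alone -- hence
 * "just over 6k". */
typedef struct HeapMetaDesc
{
    uint32_t tableOid;           /* pg_class OID */
    char     tableName[NAMELEN]; /* relation name */
    uint16_t natts;              /* number of columns */
    int32_t  attlen[MAX_ATTS];   /* per-column length; -1 = varlena */
} HeapMetaDesc;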
Including the types brings in a new set of complications. Even if you
assume the built-in type OIDs never change, there are going to be type
OIDs that correspond to user-defined types. Without the names of those
types, the OIDs are pretty useless.
Just the type OIDs would put you over 8k in the worst case, and the
names would put you into the realm of needing arbitrary numbers of
blocks for even average cases. Similarly, including the column names
would potentially require many blocks.
--
greg
"Stephen R. van den Berg" <srb@cuci.nl> writes:
Tom Lane wrote:
There are way too many scenarios where you'll have no hope of doing
any such manual recovery anyway.
True. It's all a matter of statistics. Judging by the number of
reports googling turns up, I'd have to conclude that the proposed
extra information would have helped more than half of them.
Uh, no, it would have helped whatever minuscule fraction of them had the
tools and the expertise to make use of the information. This should not
be confused with a magic fix-it button. If you've lost your system
catalogs you're facing a whole lot of work that will at best recover an
untrustworthy version of some of your data. Most of the data-loss
reports I've seen appeared to come from people who wouldn't be capable
of doing such recovery work even if better tools were available.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Most of the data-loss reports I've seen appeared to come from
people who wouldn't be capable of doing such recovery work even if
better tools were available.
No doubt; but the recovery work often winds up in the hands of
people with more skills than those responsible for the loss.
Whoever comes in to help with recovery is looking for every bit of
traction they can get. You'd be amazed at some of the odd straws
people can grasp to help recover data.
I don't know how big the special area is, but if enough room could
be carved out to include even the relation ID or filename, it might
be a big help to someone. I'm pretty skeptical about including
information about attributes, though.
-Kevin