[GSOC] questions about idea "rewrite pg_dump as library"

Started by 帅 · almost 13 years ago · 18 messages
#1帅
shuai900217@126.com

Hi all,

I'd like to introduce myself to the dev community. I am Shuai Fan, a student from Dalian University of Technology (DLUT for short), China, and I am interested in working on the PostgreSQL project in GSOC 2013.
I'm interested in the idea "Rewrite (add) pg_dump and pg_restore utilities as libraries (.so, .dll & .dylib)".

These days I have been talking by email with Mr. Pavel Golub, the author of this idea, and asked him some questions about it. He advised me to post the questions to this mailing list.

My question is:
There are lots of functions in "pg_dump.c". If I rewrite this file as a library, I should split "pg_dump.c" into two or more files (Mr. Pavel Golub's advice). However, some functions do have return values, e.g.

static DumpableObject *createBoundaryObjects(void);

I thought these functions would have to return their values through a function argument, by passing a pointer to it, when used from a library. But the question is: if I did so, the function prototypes would change. And Mr. Pavel Golub said "it's better to keep all functions with the same declaration", so that "we will only have one code base for both console application and library". I think he is right.
But then, how can I deal with this situation? From my point of view, I can't malloc a variable (e.g. a DumpableObject) in one library (e.g. pg_dumplib.so) and then return its pointer to another library (or an executable program). Maybe it's not safe(?), or it could lead to a memory leak(?). I'm not sure.

Do you have any suggestions?

Best wishes,
Shuai

#2Peter Eisentraut
peter_e@gmx.net
In reply to: 帅 (#1)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

On 4/10/13 10:54 AM, 帅 wrote:

I'm interested in the idea "Rewrite (add) pg_dump and pg_restore
utilities as libraries (.so, .dll & .dylib)".

The pg_dump code is a giant mess, and refactoring it as a library is
perhaps not a project for a new hacker.

Independent of that, I think the first consideration in such a project
would be, who else would be using that library? What are the use cases?
And then come up with an interface around that, and then see about
refactoring pg_dump.

I think the main use cases mentioned in connection with this idea are
usually in the direction of finer-grained control over what gets dumped
and how. But making pg_dump into a library would not necessarily
address that.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Eisentraut (#2)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

Peter Eisentraut wrote:

I think the main use cases mentioned in connection with this idea are
usually in the direction of finer-grained control over what gets dumped
and how. But making pg_dump into a library would not necessarily
address that.

There's also the matter of embedding pg_dump into other programs. For
example, calling the pg_dump executable from inside pgAdmin is a rather
messy solution to the problem.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#4Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Alvaro Herrera (#3)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

On 04/10/2013 11:02 PM, Alvaro Herrera wrote:

Peter Eisentraut wrote:

I think the main use cases mentioned in connection with this idea are
usually in the direction of finer-grained control over what gets dumped
and how. But making pg_dump into a library would not necessarily
address that.

There's also the matter of embedding pg_dump into other programs. For
example, calling the pg_dump executable from inside pgAdmin is a rather
messy solution to the problem.

Natural solution to this seems to move most of pg_dump functionality
into backend as functions, so we have pg_dump_xxx() for everything
we want to dump plus a topological sort function for getting the
objects in right order.

The main things left in pg_dump the library would be support for various
ways to format the dump results into text, tar and "dump" files.

--------------------
Hannu


#5Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Hannu Krosing (#4)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

Hannu Krosing wrote:

On 04/10/2013 11:02 PM, Alvaro Herrera wrote:

Peter Eisentraut wrote:

I think the main use cases mentioned in connection with this idea are
usually in the direction of finer-grained control over what gets dumped
and how. But making pg_dump into a library would not necessarily
address that.

There's also the matter of embedding pg_dump into other programs. For
example, calling the pg_dump executable from inside pgAdmin is a rather
messy solution to the problem.

Natural solution to this seems to move most of pg_dump functionality
into backend as functions, so we have pg_dump_xxx() for everything
we want to dump plus a topological sort function for getting the
objects in right order.

This idea doesn't work because of back-patch considerations (i.e. we
would not be able to create the functions in back branches, and so this
new style of pg_dump would only work with future server versions). So
pg_dump itself would have to retain capability to dump stuff from old
servers. This seems unlikely to fly at all, because we'd be then
effectively maintaining pg_dump in two places, both backend and the
pg_dump source code.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#5)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Hannu Krosing wrote:

Natural solution to this seems to move most of pg_dump functionality
into backend as functions, so we have pg_dump_xxx() for everything
we want to dump plus a topological sort function for getting the
objects in right order.

This idea doesn't work because of back-patch considerations (i.e. we
would not be able to create the functions in back branches, and so this
new style of pg_dump would only work with future server versions). So
pg_dump itself would have to retain capability to dump stuff from old
servers. This seems unlikely to fly at all, because we'd be then
effectively maintaining pg_dump in two places, both backend and the
pg_dump source code.

There are other issues too, in particular that most of the backend's
code tends to work on SnapshotNow time whereas pg_dump would really
prefer it was all done according to the transaction snapshot. We have
got bugs of that ilk already in pg_dump, but we shouldn't introduce a
bunch more. Doing this right would therefore mean that we'd have to
write a lot of duplicative code in the backend, ie, it's not clear that
we gain any synergy by pushing the functionality over. It might
simplify cross-backend-version issues (at least for backend versions
released after we'd rewritten all that code) but otherwise I'm afraid
it'd just be pushing the problems somewhere else.

In any case, "push it to the backend" offers no detectable help with the
core design issue here, which is figuring out what functionality needs
to be exposed with what API.

regards, tom lane


#7Pavel Golub
pavel@microolap.com
In reply to: 帅 (#1)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

Hello, 帅.

You wrote:

帅> Hi all,
帅> I'd like to introduce myself to the dev community. I am Shuai
帅> Fan, a student from Dalian University of Technology, DLUT , for
帅> short, China. And I am interested in working with PostgreSQL project in GSOC2013.
帅> I'm interested in the idea "Rewrite (add) pg_dump and
帅> pg_restore utilities as libraries (.so, .dll & .dylib)".

帅> These days, I had a talk with Mr. Pavel Golub by email, the
帅> author of this post. And asked some questions about this idea. He
帅> advised me to post the question to this mailing list.

帅> My question is:
帅> There are lots of functions in "pg_dump.c". If I rewrite this
帅> file as library. I should split "pg_dump.c" into two or more
帅> files(Mr. Pavel Golub's advice). However, some functions, do have return value. e.g.

帅> static DumpableObject *createBoundaryObjects(void);

帅> I thought, these functions must get return value through
帅> function argument by passing pointer to it, when using library.
帅> But, the question is: If I did so, function prototype may be
帅> changed. And Mr. Pavel Golub said "it's better to keep all
帅> functions with the same declaration", and so "we will only have one
帅> code base for both console application and library". I think he is right.
帅> But then, how can I deal with this situation? From my point of
帅> view, I can't malloc a variable (e.g. DumpableObject) in one
帅> library (e.g. pg_dumplib.so), and then return it's pointer to
帅> another library (or executable program). Maybe, it's not safe(?). Or
帅> has something to do with "memory leak"(?). I'm not sure.
帅>
帅> Do you have any suggestions?

From my point of view the new library should export only two
functions:

1. The execution function:

ExecStatusType PGdumpdbParams(const char * const *keywords,
const char * const *values);

The return type could be something else, but ExecStatusType seems OK for
this purpose: PGRES_TUPLES_OK for success, PGRES_FATAL_ERROR for failure.

Parameters will remain the same as the usual command-line options for
pg_dump. Thus we will have less work for existing applications, e.g.
pgAdmin.

2. Logging or notice processing function:

typedef void (*PGdumplogProcessor) (char *relname, int done, char *message);

PGdumplogProcessor PQsetNoticeProcessor(PGdumplogProcessor proc,
void *arg);

The purpose of this function is to process the output of the dump. The
first argument is a callback function which accepts information about
the current relname (or operation, or stage); done indicates how much
work has been done (for progress bars etc.); message contains some extra
information.

That's all! Only two functions. Indeed, we don't need all those
low-level dump functions like createBoundaryObjects etc. There will be
only one entry point to the whole logic. And if one wants only one
table, function or schema, the right combination of parameters should
be passed to PGdumpdbParams.

帅> Best wishes,
帅> Shuai

--
With best wishes,
Pavel mailto:pavel@gf.microolap.com


#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Pavel Golub (#7)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

Pavel Golub <pavel@microolap.com> writes:

From my point of view the new library should export only two
functions:

1. The execution function:

ExecStatusType PGdumpdbParams(const char * const *keywords,
const char * const *values);

No, this is exactly *wrong*. You might as well not bother to refactor,
if the only API the library presents is exactly equivalent to what you
could get with system("pg_dump ...").

I don't know what the right answer is, but this isn't it. Most people
who are interested in this topic are interested because they want to get
output that is different from anything pg_dump would produce on its own,
for instance applying a more complex object-selection rule than anything
pg_dump offers. Right now, the only way they can do that is lobby to
add new switch options to pg_dump. With a change like this, it'd still
be the case that they can't get what they want except by adding new
switch options to pg_dump. I don't see any advantage gained.

regards, tom lane


#9Pavel Golub
pavel@microolap.com
In reply to: Tom Lane (#8)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

Hello, Tom.

You wrote:

TL> Pavel Golub <pavel@microolap.com> writes:

From my point of view the new library should export only two
functions:

1. The execution function:

ExecStatusType PGdumpdbParams(const char * const *keywords,
const char * const *values);

TL> No, this is exactly *wrong*. You might as well not bother to refactor,
TL> if the only API the library presents is exactly equivalent to what you
TL> could get with system("pg_dump ...").

Well, yes, you're absolutely right. But couldn't this at least be a
starting point?

TL> I don't know what the right answer is, but this isn't it. Most people
TL> who are interested in this topic are interested because they want to get
TL> output that is different from anything pg_dump would produce on its own,
TL> for instance applying a more complex object-selection rule than anything
TL> pg_dump offers. Right now, the only way they can do that is lobby to
TL> add new switch options to pg_dump. With a change like this, it'd still
TL> be the case that they can't get what they want except by adding new
TL> switch options to pg_dump. I don't see any advantage gained.

TL> regards, tom lane

--
With best wishes,
Pavel mailto:pavel@gf.microolap.com


#10Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#8)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

On 04/11/2013 09:51 AM, Tom Lane wrote:

Pavel Golub <pavel@microolap.com> writes:

From my point of view the new library should export only two
functions:
1. The execution function:
ExecStatusType PGdumpdbParams(const char * const *keywords,
const char * const *values);

No, this is exactly *wrong*. You might as well not bother to refactor,
if the only API the library presents is exactly equivalent to what you
could get with system("pg_dump ...").

I don't know what the right answer is, but this isn't it. Most people
who are interested in this topic are interested because they want to get
output that is different from anything pg_dump would produce on its own,
for instance applying a more complex object-selection rule than anything
pg_dump offers. Right now, the only way they can do that is lobby to
add new switch options to pg_dump. With a change like this, it'd still
be the case that they can't get what they want except by adding new
switch options to pg_dump. I don't see any advantage gained.

Well, either they want that or they want that output more accessibly,
and without all the baggage that pg_dump necessarily brings to the
table. pg_dump does a lot of stuff that's basically designed for bulk
operations, and often what people want is a way to get, say, the
creation DDL for some object, without any locks other than the usual
locks any transaction takes. Last year I started writing a package to
provide such functions, which I called RetailDDL, and it was well
received at the conference where I talked about it, but I have not had
time since then to work on it, as JSON development among other things
has had a rather higher priority. But I think it's very well worth
doing. I think in principle having database functions for the creation
DDL of the database's own objects is a good idea.

And yes, that would mean keeping knowledge of how to produce such output
in two places - pg_dump is going to need to keep historical knowledge,
for one thing. But I can live with that.

It could be interesting to have a library that would output database
metadata in some machine readable and manipulatable format such as JSON
or XML. One thing that's annoying about the text output pg_dump produces
is that it's not at all structured, so if you want, for example, to
restore a table but to a table of a different name, or to a different
schema, then you're reduced to having to mangle the SQL by using hand
editing or regular expression matching. Something with the potential to
ease that pain would be worth having.
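
To illustrate (the format here is entirely made up, not a proposal that
exists anywhere): a structured dump entry might look something like
this, so that a restore tool could rewrite the schema or table name as
plain data instead of regex-editing SQL text:

```json
{
  "type": "table",
  "schema": "public",
  "name": "orders",
  "columns": [
    { "name": "id",     "type": "integer",     "notnull": true },
    { "name": "placed", "type": "timestamptz", "notnull": false }
  ],
  "depends_on": [ "schema public" ]
}
```

Renaming the table on restore would then just mean changing the "name"
field before regenerating the SQL.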

cheers

andrew


#11Stephen Frost
sfrost@snowman.net
In reply to: Andrew Dunstan (#10)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

* Andrew Dunstan (andrew@dunslane.net) wrote:

On 04/11/2013 09:51 AM, Tom Lane wrote:

No, this is exactly *wrong*. You might as well not bother to refactor,
if the only API the library presents is exactly equivalent to what you
could get with system("pg_dump ...").

Agreed.

Well, either they want that or they want that output more
accessibly, and without all the baggage that pg_dump necessarily
brings to the table. pg_dump does a lot of stuff that's basically
designed for bulk operations, and often what people want is a way to
get, say, the creation DDL for some object, without any locks other than
the usual locks any transaction takes.

Yes- being able to get that from a simple database function would be
very nice. I wonder if some of what's been done with the "event"
triggers would inform us about what that API should look like.

And yes, that would mean keeping knowledge of how to produce such
output in two places - pg_dump is going to need to keep historical
knowledge, for one thing. But I can live with that.

Agreed. If it doesn't live in two places, for some period, we'll never
actually have it outside of pg_dump.

It could be interesting to have a library that would output database
metadata in some machine readable and manipulatable format such as
JSON or XML.

The DB function to produce the DDL command might have options to produce
it in multiple formats..? Seems like it'd be nice to have, though
perhaps not initially.

Thanks,

Stephen

#12Michael Paquier
michael.paquier@gmail.com
In reply to: Stephen Frost (#11)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

On Fri, Apr 12, 2013 at 1:00 AM, Stephen Frost <sfrost@snowman.net> wrote:

Well, either they want that or they want that output more
accessibly, and without all the baggage that pg_dump necessarily
brings to the table. pg_dump does a lot of stuff that's basically
designed for bulk operations, and often what people want is a way to
get, say, the creation DDL for some object, without any locks than
the usual locks any transaction takes.

Yes- being able to get that from a simple database function would be
very nice. I wonder if some of what's been done with the "event"
triggers would inform us about what that API should look like.

I recall discussions about reverse engineering of a parsed query tree in
the event trigger threads, but nothing has been committed, I think.
Also, you need to consider that implementing such a reverse engineering
mechanism in core might not be a good thing for new features and
maintenance, as it would mean that those APIs would have to be changed
consistently with whatever is added on the parsing side.
It could make more sense to have such a set of functions created as a
separate project.

My 2c.
--
Michael

#13Tatsuo Ishii
ishii@postgresql.org
In reply to: Michael Paquier (#12)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

Well, either they want that or they want that output more
accessibly, and without all the baggage that pg_dump necessarily
brings to the table. pg_dump does a lot of stuff that's basically
designed for bulk operations, and often what people want is a way to
get, say, the creation DDL for some object, without any locks other than
the usual locks any transaction takes.

Yes- being able to get that from a simple database function would be
very nice. I wonder if some of what's been done with the "event"
triggers would inform us about what that API should look like.

I recall discussions about reverse engineering of a parsed query tree in
the event trigger threads, but nothing has been committed, I think.
Also, you need to consider that implementing such a reverse engineering
mechanism in core might not be a good thing for new features and
maintenance, as it would mean that those APIs would have to be changed
consistently with whatever is added on the parsing side.
It could make more sense to have such a set of functions created as a
separate project.

This may or may not be related, but...

pgpool-II already does "reverse engineering" from a parsed query. It
parses a query, generates a raw parse tree, rewrites it for certain
purposes and generates a text query. If you are interested, you could
take a look at the pgpool-II source code.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


#14Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Tom Lane (#6)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

Tom Lane <tgl@sss.pgh.pa.us> writes:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

This idea doesn't work because of back-patch considerations (i.e. we
would not be able to create the functions in back branches, and so this
new style of pg_dump would only work with future server versions). So

That is a policy question, not a technical one.

We could either add the new functions to the backend binary itself, or
provide them as an extension that pg_dump would know to install when
needed, if we decided that's OK.

My understanding is that we will need to change that policy anyway the
day we have an on-disk page format change and pg_upgrade needs to flag
the old cluster's pages with the old page version number before being
able to run, or something like that.

There are other issues too, in particular that most of the backend's
code tends to work on SnapshotNow time whereas pg_dump would really
prefer it was all done according to the transaction snapshot. We have

Would that be solved by having MVCC catalogs, or would the backend code
you're talking about not be included in that? (Which would be surprising
to me, given how much the benefits of MVCC catalogs are trumpeted, but
well.)

In any case, "push it to the backend" offers no detectable help with the
core design issue here, which is figuring out what functionality needs
to be exposed with what API.

Andrew did begin to work on that part with the "Retail DDL" project. We
know of several "getddl" implementations, and you can also have a look
at the pg_dump -Fs (split format) patch that didn't make it into 9.3,
where some API work has been done.

The need exists, and some thinking about the API to get there has
happened. Some more certainly needs to be done, granted.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


#15Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Michael Paquier (#12)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

Michael Paquier <michael.paquier@gmail.com> writes:

I recall discussions about reverse engineering of a parsed query tree in
the event trigger threads but nothing has been committed I think. Also, you

Yes. The name used in there was "Normalized Command String".

need to consider that implementing such reverse engineering mechanism in
core might not be a good thing for new features and maintenance, as it
would mean that it is necessary to change those APIs consistently with what
is added on the parsing side.

The goal is to retain the same API, which is quite simple:

function get_command_string(Node *parsetree) returns text

At the SQL level, the Node * is of datatype "internal" and you can't
forge it; you need to be given it in some way. In the Event Trigger
case we thought of a TG_PARSETREE magic variable, or maybe another
function get_current_parsetree() that only works when called from an
event trigger.

The other part of the API, of course, is how to represent the data, and
as we're talking about a Normalized Command String, there's no choice
but to issue a valid SQL command string that the server would know how
to execute and which would have the same side effects.

So of course a 9.3 and a 9.4 server equipped with that hypothetical
function would behave differently where the syntax changed. And that's
exactly why I think the best choice here is to have that code embedded
and maintained in core.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


#16Shuai Fan
shuai900217@126.com
In reply to: Andrew Dunstan (#10)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

On 04/11/2013 11:48 PM, Andrew Dunstan wrote:

It could be interesting to have a library that would output database
metadata in some machine readable and manipulatable format such as
JSON or XML. One thing that's annoying about the text output pg_dump
produces is that it's not at all structured, so if you want, for
example, to restore a table but to a table of a different name, or to
a different schema, then you're reduced to having to mangle the SQL by
using hand editing or regular expression matching. Something with the
potential to ease that pain would be worth having.

Yes. This is really interesting. The current code in pg_dump supports
4 ArchiveFormats, i.e. archCustom, archTar, archNull and archDirectory.
These formats are implementations of the "pg_backup" interface. Maybe I
could try to add two implementations, "XML" and "JSON".
It is worth mentioning that a month ago I wrote a program that parses
XML files into CSV using the library "libxercesc". Although that program
was just a hello-world, maybe I could benefit from it, because both use
the XML format, and what I need to do is try another XML library.
I had a look at JSON on the wiki. The format is a little like XML;
both of them are nested. And there are some libraries that could be
used, e.g. libjson (or json.c, or another JSON library written in C) and
libxml2 (or something else).

BTW, could this be an idea for GSOC? If so, I can have a try: add XML
and JSON output formats for pg_dump.

Thank you all for your attention.

Best regards,
Shuai


#17Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Tom Lane (#6)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

On 04/11/2013 12:17 AM, Tom Lane wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Hannu Krosing wrote:

Natural solution to this seems to move most of pg_dump functionality
into backend as functions, so we have pg_dump_xxx() for everything
we want to dump plus a topological sort function for getting the
objects in right order.

This idea doesn't work because of back-patch considerations (i.e. we
would not be able to create the functions in back branches, and so this
new style of pg_dump would only work with future server versions). So
pg_dump itself would have to retain capability to dump stuff from old
servers. This seems unlikely to fly at all, because we'd be then
effectively maintaining pg_dump in two places, both backend and the
pg_dump source code.

There are other issues too, in particular that most of the backend's
code tends to work on SnapshotNow time whereas pg_dump would really
prefer it was all done according to the transaction snapshot.

I was just thinking of moving the queries pg_dump currently uses into
UDFs, which do _not_ use the catalog cache, but will use the same SQL
to query the catalogs as pg_dump currently does, using whatever
snapshot mode is currently set.

pg_dump will still need to have the same queries for older versions of
PostgreSQL, but for new versions pg_dump can become catalog-agnostic.

And I think that we can retire pg_dump support for older PostgreSQL
versions the same way we drop support for older versions of PostgreSQL
itself.

Hannu

We have
got bugs of that ilk already in pg_dump, but we shouldn't introduce a
bunch more. Doing this right would therefore mean that we'd have to
write a lot of duplicative code in the backend, ie, it's not clear that
we gain any synergy by pushing the functionality over. It might
simplify cross-backend-version issues (at least for backend versions
released after we'd rewritten all that code) but otherwise I'm afraid
it'd just be pushing the problems somewhere else.

In any case, "push it to the backend" offers no detectable help with the
core design issue here, which is figuring out what functionality needs
to be exposed with what API.

The main things I see would be

* get_list_of_objects(object_type, pattern or namelist)
* get_sql_def_for_object(object_type, object_name)
* sort_by_dependency(list of [obj_type, obj_name])

From this you could easily construct most uses, especially if
sort_by_dependency(list of [obj_type, obj_name]) would be smart enough
to break circular dependencies, like turning two tables with mutual FKs
into table definitions without FKs plus separate constraints.

Or we could always emit constraints separately, so that the ones
depending on non-exported objects would be easy to leave out.

Maybe the dependency API analysis itself is something worth a GSOC
effort?

Hannu

regards, tom lane


#18Joel Jacobson
joel@trustly.com
In reply to: Hannu Krosing (#17)
Re: [GSOC] questions about idea "rewrite pg_dump as library"

On Fri, Apr 12, 2013 at 1:07 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:

I was just thinking of moving the queries pg_dump currently uses into
UDFs, which do _not_ use the catalog cache, but will use the same SQL
to query the catalogs as pg_dump currently does, using whatever
snapshot mode is currently set.

pg_dump will still need to have the same queries for older versions of
PostgreSQL, but for new versions pg_dump can become catalog-agnostic.

And I think that we can retire pg_dump support for older PostgreSQL
versions the same way we drop support for older versions of PostgreSQL
itself.

The main things I see would be

* get_list_of_objects(object_type, pattern or namelist)
* get_sql_def_for_object(object_type, object_name)
* sort_by_dependency(list of [obj_type, obj_name])

From this you could easily construct most uses, especially if
sort_by_dependency(list of [obj_type, obj_name]) would be smart enough
to break circular dependencies, like turning two tables with mutual FKs
into table definitions without FKs plus separate constraints.

+1

This is an excellent idea. This would allow doing all kinds of crazy things
outside of the scope of pg_dump.

Two years ago I was working on a system to version control the schema,
inside the database.
I don't know if it's a good idea or not, but one thing which bugged me a
lot was the lack of pg_get_[object type]def(oid) functions for all the
different object types.
It also turned out to be quite complicated to do the pg_depend
topological sort yourself. I managed eventually, but it was running too
slowly because I had to pass the entire contents of pg_depend to a
plperl function I wrote.

With this in place I would be motivated enough to resume my old project,
which is still online at https://github.com/gluefinance/pov if anyone is
interested.

Is it really necessary to write all the missing pg_get_[object
type]def(oid) functions in C? I think it would be quite easy to put them
together using pure SQL; you wouldn't even need PL/pgSQL.

This old view I once wrote manages to produce working CREATE and DROP
statements for most object types using SQL only:
https://github.com/gluefinance/pov/blob/master/sql/schema/pov/views/pg_depend_definitions.sql

It would also be nice to have functions which return the proper command
to DROP an object. I need that in this project in order to do schema
modifications where objects have to be dropped and recreated in a
particular order so as not to break dependencies. Perhaps there are
other use cases out there.