QSoC proposal: Rewrite pg_dump and pg_restore

Started by Alexandralmost 12 years ago9 messages
#1Alexandr
askellio@gmail.com

Hello!
Here is the text of my proposal which I've applied to GSoC.
(and link
https://docs.google.com/document/d/1s-Q4rzEysPxo-dINsk_eKFJOBoVjNYDrQ-Oh75gtYEM/edit?usp=sharing)
Any suggestions and comments are welcome.
*

PostgreSQL GSoC 2014 proposal

Project name

Rewrite (add) pg_dump and pg_restore utilities as libraries (.so, .dll &
.dylib)

Short description

pg_dump is a utility for backing up a PostgreSQL database. It makes
consistent backups even if the database is being used concurrently.
pg_restore is a utility for restoring a PostgreSQL database from an
archive created by pg_dump in one of the non-plain-text formats. I think
it will be more comforatable to use these tools as libraries.

Name: Alexander Shvidchenko

E-mail: askellio@gmail.com <mailto:askellio@gmail.com>

Location: Rostov-on-Don, Russia (UTC +04.00)

Benefits to the PostgreSQL Community

This feature will expand opportunities to work with backups of
databases. Especially, it’s important for third-party developers. It
will be easier to use the functionality of the tools in applications
when they will be libraries.

Quantifiable results

Static and dynamic libraries with the functionality of pg_dump and
pg_restore tools (.so, .dll and .dylib files)

Project Schedule

until May 31

Make code review and solve architecture questions with help of community

1 June – 30 June

Detailed implementation of libraries.

1 July – 31 July

Finish Implementation of libraires and begin testing.

1 August -15 August

Final refactoring, testing and commit.

Academic experience

I entered the university in 2013. Before entering the university I
finished the college in 2012. My graduate work in the college was the
client-server application. It was a realization of XMPP. The client was
realized in Qt. The client worked with SQLite database and the server
worked with MySQL database.

Why is PostgreSQL?

- I’m intereted in this idea and believe this project would be useful
for the community;

- PostgreSQL is a very respected community. I would be proud to be a
part of it;

- PostgreSQL is one of the best DBMS and I would like to make it better.

Links

1) PostgreSQL 9.3.3 Documentation, pg_dump

http://www.postgresql.org/docs/9.3/static/app-pgdump.html

2) PostgreSQL 9.3.3 Documentation, pg_restore

http://www.postgresql.org/docs/9.3/static/app-pgrestore.html

3) Static library

http://en.wikipedia.org/wiki/Static_library

4) Dynamic library

http://en.wikipedia.org/wiki/Dynamic-link_library

*

With best wishes,
Alexander S.

#2Robert Haas
robertmhaas@gmail.com
In reply to: Alexandr (#1)
Re: QSoC proposal: Rewrite pg_dump and pg_restore

On Tue, Mar 18, 2014 at 8:41 PM, Alexandr <askellio@gmail.com> wrote:

Rewrite (add) pg_dump and pg_restore utilities as libraries (.so, .dll &
.dylib)

This strikes me as (1) pretty vague and (2) probably too hard for a
summer project.

I mean, getting the existing binaries to build libraries that you can
call with some trivial interface that mimics the existing command-line
functionality of pg_dump might be doable, but that's not all that
interesting. What people are really going to want is a library with a
sophisticated API that lets you do interesting things
programmatically. But that's going to be hard. AFAIK, nobody's even
tried to figure out what that API should look like. Even if we had
that worked out, a non-trivial task, the pg_dump source code is a
mess, so refactoring it to provide such an API is likely to be a job
and a half.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Craig Ringer
craig@2ndquadrant.com
In reply to: Robert Haas (#2)
Re: QSoC proposal: Rewrite pg_dump and pg_restore

On 03/21/2014 09:28 AM, Robert Haas wrote:

On Tue, Mar 18, 2014 at 8:41 PM, Alexandr <askellio@gmail.com> wrote:

Rewrite (add) pg_dump and pg_restore utilities as libraries (.so, .dll &
.dylib)

This strikes me as (1) pretty vague and (2) probably too hard for a
summer project.

I mean, getting the existing binaries to build libraries that you can
call with some trivial interface that mimics the existing command-line
functionality of pg_dump might be doable, but that's not all that
interesting. What people are really going to want is a library with a
sophisticated API that lets you do interesting things
programmatically. But that's going to be hard. AFAIK, nobody's even
tried to figure out what that API should look like. Even if we had
that worked out, a non-trivial task, the pg_dump source code is a
mess, so refactoring it to provide such an API is likely to be a job
and a half.

... and still wouldn't solve one of the most frequently requested things
for pg_dump / pg_restore, which is the ability to use them *server-side*
over a regular PostgreSQL connection. It'd be useful progress toward
that, though.

Right now, we can't even get the PostgreSQL server to emit DDL for a
table, let alone do anything more sophisticated.

Here's how I think it needs to look:

- Design a useful API for pg_dump and pg_restore that is practical to
use for pg_dump and pg_restore's current tasks (fast database
dump/restore) and also useful for extracting specific objects
from the database. When designing, consider that we'll want to
expose this API or functions that use it over SQL later.

- Create a new "libpqdump" library.

- Implement the designed API in the new library, moving and
adjusting code from pg_dump / pg_restore where possible, writing
new code where not.

- Refactor (closer to rewrite) pg_dump and pg_restore to use libpqdump,
removing as much knowledge of the system catalogs etc as possible from
them.

- Make sure the result still performs OK

THEN, once that's settled in:

- Modify libpqdump to support compilation as a backend extension, with
use of the SPI for queries and use of syscaches or direct scans
where possible.

- Write a "pg_dump" extension that uses libpqdump in SPI mode
to expose its API over SQL, or at least uses it to provide SQL
functions to describe database objects. So you can dump a DB,
or a subset of it, over SQL.

After all, a "libpgdump" won't do much good for the large proportion of
PostgreSQL users who use Java/JDBC, who can't use a native library
(without hideous hacks with JNI). For the very large group who use libpq
via language-specific client interfaces like the Pg gem for Ruby,
psycopg2 for Python, DBD::Pg for Perl, etc, it'll require a lot of work
to wrap the API and maintain it. Wheras a server-side SQL-callable
interface would be useful and immediately usable for all of them.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Craig Ringer (#3)
Re: QSoC proposal: Rewrite pg_dump and pg_restore

Craig Ringer <craig@2ndquadrant.com> writes:

Here's how I think it needs to look:
[ move all the functionality to the backend ]

Of course, after you've done all that work, you've got something that is
of exactly zero use to its supposed principal use-case, pg_dump. pg_dump
will still have to support server versions that predate all these fancy
new dump functions, and that pretty much ensures that most of pg_dump's
core functionality will still be on the client side. Or, if you try to
finesse that problem by making sure the new server APIs correspond to
easily-identified pieces of pg_dump code, you'll probably end up with APIs
that nobody else wants to use :-(.

In any case, I quite agree with the sentiment that this is not a suitable
problem for a GSOC project.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Craig Ringer
craig@2ndquadrant.com
In reply to: Tom Lane (#4)
Re: QSoC proposal: Rewrite pg_dump and pg_restore

On 03/21/2014 11:09 AM, Tom Lane wrote:

Craig Ringer <craig@2ndquadrant.com> writes:

Here's how I think it needs to look:
[ move all the functionality to the backend ]

Of course, after you've done all that work, you've got something that is
of exactly zero use to its supposed principal use-case, pg_dump. pg_dump
will still have to support server versions that predate all these fancy
new dump functions, and that pretty much ensures that most of pg_dump's
core functionality will still be on the client side. Or, if you try to
finesse that problem by making sure the new server APIs correspond to
easily-identified pieces of pg_dump code, you'll probably end up with APIs
that nobody else wants to use :-(.

Yeah, that's why it's necessary to create a "libpqdump" that's usable
client-side even if you want server-side dump support.

So it's "allow the functionality to be used from the backend as well",
not just "move all the functionality to the backend".

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Marcin Mańk
marcin@maniek.info
In reply to: Tom Lane (#4)
Re: QSoC proposal: Rewrite pg_dump and pg_restore

On Fri, Mar 21, 2014 at 4:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Craig Ringer <craig@2ndquadrant.com> writes:

Here's how I think it needs to look:
[ move all the functionality to the backend ]

Of course, after you've done all that work, you've got something that is
of exactly zero use to its supposed principal use-case, pg_dump. pg_dump
will still have to support server versions that predate all these fancy
new dump functions, and that pretty much ensures that most of pg_dump's
core functionality will still be on the client side. Or, if you try to
finesse that problem by making sure the new server APIs correspond to
easily-identified pieces of pg_dump code, you'll probably end up with APIs
that nobody else wants to use :-(.

Or you should mandate that new server versions should be able to consume
_old_ pg_dump version output. This would change the recommended "when
upgrading, dump using the new pg_dump" to ""when upgrading, dump using the
old pg_dump".

This would be necessary policy going forward anyway, if most of the pg_dump
functionality was server-side, because it would be generating dumps in the
server-version dump format, not the client-version format.

'Regards
Marcin Mańk
(goes back to lurker cave...)

#7Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#4)
Re: QSoC proposal: Rewrite pg_dump and pg_restore

On Thu, Mar 20, 2014 at 11:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Craig Ringer <craig@2ndquadrant.com> writes:

Here's how I think it needs to look:
[ move all the functionality to the backend ]

Of course, after you've done all that work, you've got something that is
of exactly zero use to its supposed principal use-case, pg_dump. pg_dump
will still have to support server versions that predate all these fancy
new dump functions, and that pretty much ensures that most of pg_dump's
core functionality will still be on the client side. Or, if you try to
finesse that problem by making sure the new server APIs correspond to
easily-identified pieces of pg_dump code, you'll probably end up with APIs
that nobody else wants to use :-(.

It's worse than that. If you put all the logic in the server, then a
dump taken on an older version won't be able to quote keywords added
in the newer version. Go directly to fail.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#7)
Re: QSoC proposal: Rewrite pg_dump and pg_restore

On 03/21/2014 09:38 AM, Robert Haas wrote:

On Thu, Mar 20, 2014 at 11:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Craig Ringer <craig@2ndquadrant.com> writes:

Here's how I think it needs to look:
[ move all the functionality to the backend ]

Of course, after you've done all that work, you've got something that is
of exactly zero use to its supposed principal use-case, pg_dump. pg_dump
will still have to support server versions that predate all these fancy
new dump functions, and that pretty much ensures that most of pg_dump's
core functionality will still be on the client side. Or, if you try to
finesse that problem by making sure the new server APIs correspond to
easily-identified pieces of pg_dump code, you'll probably end up with APIs
that nobody else wants to use :-(.

It's worse than that. If you put all the logic in the server, then a
dump taken on an older version won't be able to quote keywords added
in the newer version. Go directly to fail.

Yeah. This tantalizing project has been looked at several times and
found to be a viper's nest.

What would be useful for many purposes, and is a long-standing project
of mine that I still haven't found time to make progress on, is that the
server should contain functions to produce the creation SQL for all its
own objects, free of the locks that pg_dump requires for consistency.

That would be a great SoC project, incidentally. I'd even volunteer to
mentor that one.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Dunstan (#8)
Re: QSoC proposal: Rewrite pg_dump and pg_restore

Andrew Dunstan escribi�:

What would be useful for many purposes, and is a long-standing
project of mine that I still haven't found time to make progress on,
is that the server should contain functions to produce the creation
SQL for all its own objects, free of the locks that pg_dump requires
for consistency.

Maybe you'd like my "DDL deparse" project, then. Right now it's only of
use for event triggers during DDL execution, but I don't see any strong
reason it cannot be used to reconstruct object creation commands from
only their identifying OID.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers