what would tar file FDW look like?

Started by Bear Giles over 10 years ago, 4 messages

bgiles@coyotesong.com

over 10 years ago

I'm starting to work on a tar FDW as a proxy for a much more specific FDW.
(It's the 'faster to build two and toss the first away' approach - tar lets
me get the FDW stuff nailed down before attacking the more complex
container.) It could also be useful in its own right, or as the basis for a
zip file FDW.

I have figured out one mode: an FDW mapping that takes the name of the
tarball as an option and produces a relation holding all of the metadata
for the contained files - filename, size, owner, timestamp, etc. I can use
the same approach I used for the /etc/passwd FDW for that.
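
A minimal sketch of what this metadata mode would return, written in Python with the standard `tarfile` module rather than in the C that an FDW is normally written in. The function name `tar_metadata_rows`, the row keys, and the sample archive are assumptions made for illustration; none of them come from the thread or from the passwd FDW.

```python
import io
import tarfile
from datetime import datetime, timezone


def tar_metadata_rows(fileobj):
    """Yield one dict per archive member, in archive order."""
    # "r:*" lets tarfile detect plain, gzip, bzip2 or xz archives itself.
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            yield {
                "filename": member.name,
                "size": member.size,
                # Fall back to the numeric uid when no user name is stored.
                "owner": member.uname or str(member.uid),
                "mtime": datetime.fromtimestamp(member.mtime, tz=timezone.utc),
                "is_file": member.isfile(),
            }


def build_sample_tar():
    """Build a tiny in-memory tarball to exercise the function (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in [("a.txt", b"hello"), ("dir/b.txt", b"world!")]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.uname = "bear"
            info.mtime = 0
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for row in tar_metadata_rows(build_sample_tar()):
        print(row)
```

Each dict corresponds to one row of the proposed relation. A real wrapper would convert the same fields into tuples during its scan callbacks.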

(BTW the current version is at https://github.com/beargiles/passwd-fdw.
It's skimpy on automated tests until I can figure out how to handle the
user mapping but it works.)

The problem is the second mode where I pull a single file out of the FDW.
I've identified three approaches so far:

1. A FDW mapping specific to each file. It would take the name of the
tarfile and the embedded file. Cleanest in some ways but it would be a real
pain if you're reading a tarball dynamically.

2. A user-defined function that takes the name of the tarball and file and
returns a blob. This is the traditional approach but why bother with a FDW
then? It also brings up access control issues since it requires disclosure
of the tarball name to the user. A FDW could hide that.

3. A user-defined function that takes a tar FDW and the name of a file and
returns a blob. I think this is the best approach but I don't know if I can
specify a FDW as a parameter or how to access it.

I've skimmed the existing list of FDWs but didn't find anything that can
serve as a model. The foreign-database wrappers are closest but, again,
they aren't designed for dynamic use where you want to do something with
every file in an archive / table in a foreign DB.

Is there an obvious approach? Or is it simply a bad match for FDW and
should be two standard UDFs? (One returns the metadata, the second returns
the specific file.)

Thanks,

Bear

stark@mit.edu

over 10 years ago

In reply to: Bear Giles (#1)

Re: what would tar file FDW look like?

On Mon, Aug 17, 2015 at 3:14 PM, Bear Giles <bgiles@coyotesong.com> wrote:

I'm starting to work on a tar FDW as a proxy for a much more specific FDW.
(It's the 'faster to build two and toss the first away' approach - tar lets
me get the FDW stuff nailed down before attacking the more complex
container.) It could also be useful in its own right, or as the basis for a
zip file FDW.

Hm. tar may be a bad fit where zip may be much easier. Tar has no
index or table of contents. You have to scan the entire file to find
all the members. IIRC Zip does have a table of contents at the end of
the file.

The most efficient way to process a tar file is to describe exactly
what you want to happen with each member and then process it linearly
from start to end (or until you've found the members you're looking
for). Trying to return meta info and then go looking for individual
members will be quite slow and have a large startup cost.
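
The single-pass approach described above can be sketched as follows, in Python with the standard `tarfile` module. The names `process_tar_once` and `make_tar` and the handler-dictionary shape are hypothetical, and taking the first member with a given name is a simplification (a tarball may contain the same path more than once).

```python
import io
import tarfile


def make_tar(members):
    """Build an in-memory tarball from (name, bytes) pairs (demo helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def process_tar_once(fileobj, handlers):
    """Call handlers[name](data) for each wanted member in one linear pass.

    Returns the set of requested names that were never found.
    """
    pending = set(handlers)
    if not pending:
        return pending
    # "r|*" is tarfile's streaming mode: forward-only reads, no seeking back.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile() and member.name in pending:
                handlers[member.name](tar.extractfile(member).read())
                pending.discard(member.name)
                if not pending:
                    break  # everything requested was found; stop scanning
    return pending


if __name__ == "__main__":
    found = {}
    archive = make_tar([("a", b"1"), ("b", b"22"), ("c", b"333")])
    missing = process_tar_once(
        archive,
        {"b": lambda data: found.update(b=data)},
    )
    print(found, missing)
```

The scan stops as soon as every requested member has been handled, so only a request for a missing name pays for reading the whole archive.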

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

andrew@dunslane.net

over 10 years ago

In reply to: Bear Giles (#1)

Re: what would tar file FDW look like?

On 08/17/2015 10:14 AM, Bear Giles wrote:

I'm starting to work on a tar FDW as a proxy for a much more specific
FDW. (It's the 'faster to build two and toss the first away' approach
- tar lets me get the FDW stuff nailed down before attacking the more
complex container.) It could also be useful in its own right, or as
the basis for a zip file FDW.

I have figured out that in one mode the FDW mapping that would take
the name of the tarball as an option and produce a relation that has
all of the metadata for the contained files - filename, size, owner,
timestamp, etc. I can use the same approach I used for the /etc/passwd
FDW for that.

(BTW the current version is at
https://github.com/beargiles/passwd-fdw. It's skimpy on automated
tests until I can figure out how to handle the user mapping but it works.)

The problem is the second mode where I pull a single file out of the
FDW. I've identified three approaches so far:

1. A FDW mapping specific to each file. It would take the name of the
tarfile and the embedded file. Cleanest in some ways but it would be a
real pain if you're reading a tarball dynamically.

2. A user-defined function that takes the name of the tarball and file
and returns a blob. This is the traditional approach but why bother
with a FDW then? It also brings up access control issues since it
requires disclosure of the tarball name to the user. A FDW could hide
that.

3. A user-defined function that takes a tar FDW and the name of a file
and returns a blob. I think this is the best approach but I don't know
if I can specify a FDW as a parameter or how to access it.

I've skimmed the existing list of FDW but didn't find anything that
can serve as a model. The foreign DB are closest but, again, they
aren't designed for dynamic use where you want to do something with
every file in an archive / table in a foreign DB.

Is there an obvious approach? Or is it simply a bad match for FDW and
should be two standard UDF? (One returns the metadata, the second
returns the specific file.)

I would probably do something like this:

In this mode, define a table that has <path, blob>. To get the blob for
a single file, just do "select blob from fdwtable where path =
'/path/to/foo'". Make sure you process the qual in the FDW.

e.g.

create foreign table tarblobs (path text, blob bytea)
server tarfiles options (filename '/path/to/tarball', mode 'contents');
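
To illustrate what "process the qual in the FDW" buys, here is a rough Python model of the scan behind such a table. It is a sketch under assumptions, not FDW code: `scan_tarblobs` and its `path_eq` parameter are made-up names standing in for a pushed-down `path = '...'` condition, and stopping at the first match is a simplification.

```python
import io
import tarfile


def scan_tarblobs(fileobj, path_eq=None):
    """Yield (path, blob) rows; path_eq models a pushed-down "path = ..." qual."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            if path_eq is not None and member.name != path_eq:
                continue  # qual rejects this row: skip without reading payload
            yield member.name, tar.extractfile(member).read()
            if path_eq is not None:
                return  # simplification: stop at the first matching member


def sample_archive():
    """Small in-memory tarball used only for the demonstration."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in [("path/to/foo", b"foo-bytes"), ("path/to/bar", b"bar")]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(list(scan_tarblobs(sample_archive(), path_eq="path/to/bar")))
```

Without the qual, every payload is read and returned, and the executor would filter afterwards. With it, only the matching payload is read. Member names are commonly stored without a leading slash, so the value in the qual has to match the name as stored in the archive.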

cheers

andrew


bgiles@coyotesong.com

over 10 years ago

In reply to: Greg Stark (#2)

Re: what would tar file FDW look like?

I've written readers for both from scratch. Tar isn't that bad since it's
blocked - you read the header, skip forward N blocks, continue. The hardest
part is setting up the decompression libraries if you want to support
tar.gz or tar.bz2 files.
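
The "read the header, skip forward N blocks" loop can be sketched in a few lines of Python. This assumes an uncompressed archive with plain ustar headers: it reads only the 100-byte name field and the octal size field, and deliberately ignores the ustar prefix field, GNU long names, pax extended headers and base-256 sizes. The name `walk_tar_headers` is hypothetical.

```python
import io
import tarfile

BLOCK = 512  # tar headers and payload padding both use 512-byte blocks


def walk_tar_headers(fileobj):
    """Yield (name, size, data_offset) by hopping from header to header."""
    while True:
        header = fileobj.read(BLOCK)
        if len(header) < BLOCK or header == b"\0" * BLOCK:
            return  # end-of-archive marker (or a truncated file)
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8")
        # The size field is an octal number padded with NULs and/or spaces.
        size = int(header[124:136].strip(b"\0 ") or b"0", 8)
        data_offset = fileobj.tell()
        yield name, size, data_offset
        # Skip the payload, which is padded up to a whole number of blocks.
        fileobj.seek(data_offset + -(-size // BLOCK) * BLOCK)


def sample_ustar():
    """Uncompressed ustar archive built in memory for the demonstration."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in [("a.txt", b"hello"), ("b.bin", b"x" * 600)]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for entry in walk_tar_headers(sample_ustar()):
        print(entry)
```

Because each step seeks to an absolute offset, the caller may read a member's payload between iterations without disturbing the walk. For tar.gz or tar.bz2 the same loop would have to run over a decompressed stream, where skipping forward still means decompressing the skipped bytes.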

Zip files are more complex. You have (iirc) 5 control blocks - start of
archive, start of file, end of file, start of index, end of archive, and
the information in the control block is pretty limited. That's not a huge
burden since there's support for extensions for things like the unix file
metadata. One complication is that you need to support compression from the
start.
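
For comparison with tar, the record at the very end of a zip file (presumably the "end of archive" block above) points at the index, so a reader can find the table of contents without scanning the members. A minimal Python sketch follows, assuming a single-disk archive without zip64 extensions and assuming the signature bytes do not also occur inside the archive comment. The name `read_eocd` is hypothetical.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"   # end-of-central-directory signature
EOCD_FMT = "<4s4H2LH"      # fixed 22-byte part of the record, little-endian


def read_eocd(data):
    """Return (total_entries, cd_size, cd_offset) from the end of a zip file."""
    pos = data.rfind(EOCD_SIG)  # simplification: take the last occurrence
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    (_sig, _disk, _cd_disk, _entries_on_disk, total_entries,
     cd_size, cd_offset, _comment_len) = struct.unpack_from(EOCD_FMT, data, pos)
    return total_entries, cd_size, cd_offset


def sample_zip():
    """Three-member zip built in memory for the demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in ("a.txt", "b.txt", "c.txt"):
            zf.writestr(name, name * 10)
    return buf.getvalue()


if __name__ == "__main__":
    data = sample_zip()
    total, size, offset = read_eocd(data)
    # The central directory starts with a "PK\x01\x02" file-header record.
    print(total, size, offset, data[offset:offset + 4])
```

Reading the index this way costs one seek to the end of the file plus one read of the central directory, which is the property that makes zip friendlier than tar for random access.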

Zip files support two types of encryption. There's a really weak version
that almost nobody supports and a much stronger modern version that's
subject to license restrictions. (Some people use the weak version on
embedded systems because of legal requirements to /do something/, no matter
how lame.)

There are third-party libraries, of course, but that introduces
dependencies. Both formats are simple enough to write from scratch.

I guess my bigger question is whether there's interest in either or both
for "real" use. I'm doing this as an exercise but am willing to contribute
the code if there's general interest in it.

(BTW the more complex object I'm working on is the .p12 keystore for
digital certificates and private keys. We have everything we need in the
openssl library so there's no additional third-party dependencies. I have a
minimal FDW for the digital certificate itself and am now working on a way
to access keys stored in a standard format on the filesystem instead of in
the database itself. A natural fit is a specialized archive FDW. Unlike tar
and zip it will have two payloads, the digital certificate and the
(optionally encrypted) private key. It has searchable metadata, e.g.,
finding all records with a specific subject.)

Bear

On Mon, Aug 17, 2015 at 8:29 AM, Greg Stark <stark@mit.edu> wrote:

On Mon, Aug 17, 2015 at 3:14 PM, Bear Giles <bgiles@coyotesong.com> wrote:

I'm starting to work on a tar FDW as a proxy for a much more specific FDW.
(It's the 'faster to build two and toss the first away' approach - tar lets
me get the FDW stuff nailed down before attacking the more complex
container.) It could also be useful in its own right, or as the basis for a
zip file FDW.

Hm. tar may be a bad fit where zip may be much easier. Tar has no
index or table of contents. You have to scan the entire file to find
all the members. IIRC Zip does have a table of contents at the end of
the file.

The most efficient way to process a tar file is to describe exactly
what you want to happen with each member and then process it linearly
from start to end (or until you've found the members you're looking
for). Trying to return meta info and then go looking for individual
members will be quite slow and have a large startup cost.

--
greg