Bootstrap DATA is a pita
Hi,
I've for a long while been rather annoyed about how cumbersome it
is to add catalog rows using the bootstrap format. pg_proc.h,
pg_operator.h, pg_amop.h, pg_amproc.h and a few others are especially unwieldy.
I think this needs to be improved. And while I'm not going to start
working on it tonight, I do plan to work on it if we can agree on a
design that I think is worth implementing.
The things that bug me most are:
1) When adding new rows it's rather hard to know which columns are which,
and you have to specify a lot of values you really don't care about. In
pg_proc that's especially annoying.
2) Having to assign oids for many things that don't actually need one is
bothersome and greatly increases the likelihood of conflicts. There are
some rows for which we need fixed oids (pg_type ones, for example),
but e.g. for the majority of pg_proc it's unnecessary.
3) Adding a new column to a system catalog, especially pg_proc.h,
basically requires writing a complex regex or program to modify the
header.
Therefore I propose that we add another format to generate the .bki
insert lines.
What I think we should do is to add pg_<catalog>.data files that contain
the actual data that are automatically parsed by Catalog.pm. Those
contain the rows in some to-be-decided format. I was considering using
json, but it turns out only perl 5.14 started shipping JSON::PP as part
of the standard library. So I guess it's best we just make it a big perl
array + hashes.
To address 1) we just need to make each row a hash and allow leaving out
columns that have some default value.
2) is a bit more complex. Generally many rows don't need a fixed oid at
all, and many others primarily need it to handle object descriptions. The
latter seems best solved by not making it dependent on the oid
anymore.
3) Seems primarily solved by not requiring default values to be
specified anymore. Also it should be much easier to add new values
automatically to a parseable format.
I think we'll need to generate oid #defines for some catalog contents,
but that seems solvable.
Maybe something roughly like:
# pg_type.data
CatalogData(
    'pg_type',
    [
        {
            oid => 2249,
            data => {typname => 'cstring', typlen => -2, typbyval => 1, fake => '...'},
            oiddefine => 'CSTRINGOID'
        }
    ]
);
# pg_proc.data
CatalogData(
    'pg_proc',
    [
        {
            oid => 1242,
            data => {proname => 'boolin', prorettype => 16, proargtypes => [2275], provolatile => 'i'},
            description => 'I/O',
        },
        {
            data => {proname => 'mode_final', prorettype => 2283, proargtypes => [2281, 2283]},
            description => 'aggregate final function',
        }
    ]
);
There'd need to be some logic to assign default values for columns, and
maybe even simple logic e.g. to determine arguments like pronargs based
on proargtypes.
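To make the idea concrete, here is a minimal sketch in Python (the real tool would be Perl, and the default values and the pronargs-from-proargtypes rule shown here are assumptions for illustration, not the actual catalog definitions):

```python
# Illustrative sketch: fill in defaulted and derived pg_proc columns.
# The defaults below are hypothetical stand-ins, not real catalog values.

PG_PROC_DEFAULTS = {
    "pronamespace": 11,   # hypothetical: pg_catalog namespace
    "proowner": 10,       # hypothetical: bootstrap superuser
    "provolatile": "v",
}

def complete_row(row):
    """Return a full row: explicit values win, defaults fill the rest,
    and pronargs is derived from proargtypes when omitted."""
    out = dict(PG_PROC_DEFAULTS)
    out.update(row)
    if "pronargs" not in out and "proargtypes" in out:
        out["pronargs"] = len(out["proargtypes"])
    return out

row = {"proname": "boolin", "prorettype": 16,
       "proargtypes": [2275], "provolatile": "i"}
full = complete_row(row)
```

The point is only that each row can stay a short hash; the generator, not the author, supplies everything else.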
This is far from fully thought through, but I think something very
roughly along these lines could be a remarkable improvement in the ease
of adding new catalog contents.
Comments?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 02/20/2015 03:41 PM, Andres Freund wrote:
What I think we should do is to add pg_<catalog>.data files that contain
the actual data that are automatically parsed by Catalog.pm. Those
contain the rows in some to-be-decided format. I was considering using
json, but it turns out only perl 5.14 started shipping JSON::PP as part
of the standard library. So I guess it's best we just make it a big perl
array + hashes.
What about YAML? That might have been added somewhat earlier.
Or what about just doing CSV?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 2/20/15 8:46 PM, Josh Berkus wrote:
What about YAML? That might have been added somewhat earlier.
YAML isn't included in Perl, but there is
Module::Build::YAML - Provides just enough YAML support so that
Module::Build works even if YAML.pm is not installed
which might work.
Or what about just doing CSV?
I don't think that would actually address the problems. It would just
be the same format as now with different delimiters.
I violently support this proposal.
Maybe something roughly like:
# pg_type.data
CatalogData(
    'pg_type',
    [
        {
            oid => 2249,
            data => {typname => 'cstring', typlen => -2, typbyval => 1, fake => '...'},
            oiddefine => 'CSTRINGOID'
        }
    ]
);
One concern I have with this is that in my experience different tools
and editors have vastly different ideas on how to format these kinds of
nested structures. I'd try out YAML, or even a homemade fake YAML over
this.
On 21/02/15 04:22, Peter Eisentraut wrote:
I violently support this proposal.
Maybe something roughly like:
# pg_type.data
CatalogData(
    'pg_type',
    [
        {
            oid => 2249,
            data => {typname => 'cstring', typlen => -2, typbyval => 1, fake => '...'},
            oiddefine => 'CSTRINGOID'
        }
    ]
);
One concern I have with this is that in my experience different tools
and editors have vastly different ideas on how to format these kinds of
nested structures. I'd try out YAML, or even a homemade fake YAML over
this.
+1 for the idea and +1 for YAML(-like) syntax.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2015-02-20 22:19:54 -0500, Peter Eisentraut wrote:
On 2/20/15 8:46 PM, Josh Berkus wrote:
What about YAML? That might have been added somewhat earlier.
YAML isn't included in Perl, but there is
Module::Build::YAML - Provides just enough YAML support so that
Module::Build works even if YAML.pm is not installed
I'm afraid not:
sub Load {
    shift if ($_[0] eq __PACKAGE__ || ref($_[0]) eq __PACKAGE__);
    die "not yet implemented";
}
Or what about just doing CSV?
I don't think that would actually address the problems. It would just
be the same format as now with different delimiters.
Yea, we need hierarchies and named keys.
One concern I have with this is that in my experience different tools
and editors have vastly different ideas on how to format these kinds of
nested structures. I'd try out YAML, or even a homemade fake YAML over
this.
Yes, that's a good point. I have zero desire to open-code a format
though, I think that's a bad idea. We could say we just include
Yaml::Tiny, that's what it's made for.
To allow for changing things programmatically without noise I was
wondering whether we shouldn't just load/dump the file at some point of
the build process. Then we're sure the indentation is correct and it can
be changed programmatically without requiring manual fixup of comments.
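The load/dump round-trip idea, sketched in Python with JSON standing in for whatever format is eventually chosen (the function and file layout are assumptions, and note this naive version would drop comments, which is the hard part):

```python
import json

def normalize(text):
    """Canonicalize a data file by loading it and re-dumping it with
    fixed settings, so hand edits and programmatic edits converge on
    one layout. Comments would be lost with plain JSON; a real tool
    would need a comment-preserving parser."""
    rows = json.loads(text)
    return json.dumps(rows, indent=2, sort_keys=True) + "\n"

messy = '[{"typlen": -2, "typname":"cstring","typbyval":  true}]'
canonical = normalize(messy)
```

Because the dump settings are fixed, normalizing twice is a no-op, which is exactly the property that keeps diffs quiet.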
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 02/21/2015 05:04 AM, Andres Freund wrote:
Yes, that's a good point. I have zero desire to open-code a format
though, I think that's a bad idea. We could say we just include
Yaml::Tiny, that's what it's made for.
Personally, I think I would prefer that we use JSON (and yes, there's a
JSON::Tiny module, which definitely lives up to its name).
For one thing, we've made a feature of supporting JSON, so arguably we
should eat the same dog food.
I also dislike YAML's line oriented format. I'd like to be able to add a
pg_proc entry in a handful of lines instead of 29 or more (pg_proc has
27 attributes, but some of them are arrays, and there's an oid and in
most cases a description to add as well). We could reduce that number by
defaulting some of the attributes (pronamespace, proowner and prolang,
for example) and possibly inferring others (pronargs?). Even so it's
going to take up lots of lines of vertical screen real estate. A JSON
format could be more vertically compact. The price for that is that JSON
strings have to be quoted, which I know lots of people hate.
cheers
andrew
On 02/21/2015 09:39 AM, Andrew Dunstan wrote:
On 02/21/2015 05:04 AM, Andres Freund wrote:
Yes, that's a good point. I have zero desire to open-code a format
though, I think that's a bad idea. We could say we just include
Yaml::Tiny, that's what it's made for.
Personally, I think I would prefer that we use JSON (and yes, there's
a JSON::Tiny module, which definitely lives up to its name).
For one thing, we've made a feature of supporting JSON, so arguably we
should eat the same dog food.
I also dislike YAML's line oriented format. I'd like to be able to add
a pg_proc entry in a handful of lines instead of 29 or more (pg_proc
has 27 attributes, but some of them are arrays, and there's an oid and
in most cases a description to add as well). We could reduce that
number by defaulting some of the attributes (pronamespace, proowner
and prolang, for example) and possibly inferring others (pronargs?).
Even so it's going to take up lots of lines of vertical screen real
estate. A JSON format could be more vertically compact. The price for
that is that JSON strings have to be quoted, which I know lots of
people hate.
Followup:
The YAML spec does support explicit flows like JSON, which would
overcome my objections above, but unfortunately these are not supported
by YAML::Tiny.
cheers
andrew
Andres Freund <andres@2ndquadrant.com> writes:
On 2015-02-20 22:19:54 -0500, Peter Eisentraut wrote:
On 2/20/15 8:46 PM, Josh Berkus wrote:
Or what about just doing CSV?
I don't think that would actually address the problems. It would just
be the same format as now with different delimiters.
Yea, we need hierarchies and named keys.
Yeah. One thought though is that I don't think we need the "data" layer
in your proposal; that is, I'd flatten the representation to something
more like
{
    oid => 2249,
    oiddefine => 'CSTRINGOID',
    typname => 'cstring',
    typlen => -2,
    typbyval => 1,
    ...
}
This will be easier to edit, either manually or programmatically, I think.
The code that turns it into a .bki file will need to know the exact set
of columns in each system catalog, but it would have had to know that
anyway I believe, if you're expecting it to insert default values.
Ideally the column defaults could come from BKI_ macros in the catalog/*.h
files; it would be good if we could keep those files as the One Source of
Truth for catalog schema info, even as we split out the initial data.
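A sketch of how that could look (the BKI_DEFAULT annotation shown here is hypothetical, invented for this example; the generator would scan each catalog header and collect per-column defaults):

```python
import re

# Hypothetical header fragment: a BKI_DEFAULT(...) annotation next to
# each column that has a default. The macro name and values are
# assumptions for this sketch, not existing PostgreSQL source.
HEADER = """
    NameData    proname;
    Oid         pronamespace BKI_DEFAULT(11);
    Oid         proowner BKI_DEFAULT(10);
    char        provolatile BKI_DEFAULT(v);
"""

def extract_defaults(header_text):
    """Map column name -> default value for annotated columns."""
    pat = re.compile(r'(\w+)\s+BKI_DEFAULT\(([^)]*)\)')
    return dict(pat.findall(header_text))
```

This way the header stays the one place that defines both a column's type and its default; the data file only records deviations.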
regards, tom lane
On 2015-02-21 11:34:09 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
On 2015-02-20 22:19:54 -0500, Peter Eisentraut wrote:
On 2/20/15 8:46 PM, Josh Berkus wrote:
Or what about just doing CSV?
I don't think that would actually address the problems. It would just
be the same format as now with different delimiters.
Yea, we need hierarchies and named keys.
Yeah. One thought though is that I don't think we need the "data" layer
in your proposal; that is, I'd flatten the representation to something
more like
{
    oid => 2249,
    oiddefine => 'CSTRINGOID',
    typname => 'cstring',
    typlen => -2,
    typbyval => 1,
    ...
}
I don't really like that - then stuff like oid, description, comment (?)
have to not conflict with any catalog columns. I think it's easier to
have them separate.
This will be easier to edit, either manually or programmatically I think.
The code that turns it into a .bki file will need to know the exact set
of columns in each system catalog, but it would have had to know that
anyway I believe, if you're expecting it to insert default values.
There'll need to be some awareness of columns, sure. But I think
programmatically editing the values will still be simpler if you don't
need to discern whether a key is a column or some genbki-specific value.
Ideally the column defaults could come from BKI_ macros in the catalog/*.h
files; it would be good if we could keep those files as the One Source of
Truth for catalog schema info, even as we split out the initial data.
Hm, yea.
One thing I was considering was to do the regtype and regproc lookups
directly in the tool. That'd have two advantages: 1) it'd make it
possible to refer to typenames in pg_proc, 2) It'd be much faster. Right
now most of initdb's time is doing syscache lookups during bootstrap,
because it can't use indexes... A simple hash lookup during bki
generation could lead to quite measurable savings during lookup.
We could then even rip the bootstrap code out of regtypein/regprocin...
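The lookup being proposed is just a hash built once at bki-generation time. A Python sketch (the oids shown are real pg_type values, but the structure and function names are assumptions for illustration):

```python
# Sketch: resolve regproc/regtype references by name during .bki
# generation with a plain hash lookup, instead of syscache scans
# during bootstrap.

TYPE_OIDS = {"bool": 16, "cstring": 2275, "anyelement": 2283}

def resolve_types(row, columns=("prorettype", "proargtypes")):
    """Replace type names with their oids; numeric values pass through
    unchanged, so existing oid-based entries keep working."""
    resolved = dict(row)
    for col in columns:
        val = resolved.get(col)
        if isinstance(val, str):
            resolved[col] = TYPE_OIDS[val]
        elif isinstance(val, list):
            resolved[col] = [TYPE_OIDS[v] if isinstance(v, str) else v
                             for v in val]
    return resolved

row = {"proname": "boolin", "prorettype": "bool",
       "proargtypes": ["cstring"]}
```

Each lookup is O(1) against an in-memory hash, versus the index-less sequential scans bootstrap is stuck with.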
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andrew Dunstan <andrew@dunslane.net> writes:
On 02/21/2015 09:39 AM, Andrew Dunstan wrote:
Personally, I think I would prefer that we use JSON (and yes, there's
a JSON::Tiny module, which definitely lives up to its name).
For one thing, we've made a feature of supporting JSON, so arguably we
should eat the same dog food.
We've also made a feature of supporting XML, and a lot earlier, so that
argument seems pretty weak.
My only real requirement on the format choice is that it should absolutely
not require any Perl module that's not in a bog-standard installation.
I've gotten the buildfarm code running on several ancient machines now and
in most cases getting the module dependencies dealt with was pure hell.
No non-core modules for a basic build please. I don't care whether they
are "tiny".
regards, tom lane
On 02/21/2015 11:43 AM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 02/21/2015 09:39 AM, Andrew Dunstan wrote:
Personally, I think I would prefer that we use JSON (and yes, there's
a JSON::Tiny module, which definitely lives up to its name).
For one thing, we've made a feature of supporting JSON, so arguably we
should eat the same dog food.
We've also made a feature of supporting XML, and a lot earlier, so that
argument seems pretty weak.
Fair enough
My only real requirement on the format choice is that it should absolutely
not require any Perl module that's not in a bog-standard installation.
I've gotten the buildfarm code running on several ancient machines now and
in most cases getting the module dependencies dealt with was pure hell.
No non-core modules for a basic build please. I don't care whether they
are "tiny".
The point about using the "tiny" modules is that they are so small and
self-contained they can either be reasonably shipped with our code or
embedded directly in the script that uses them, so no extra build
dependency would be created.
However, I rather like your suggestion of this:
{
    oid => 2249,
    oiddefine => 'CSTRINGOID',
    typname => 'cstring',
    typlen => -2,
    typbyval => 1,
    ...
}
which is pure perl syntax and wouldn't need any extra module, and has
the advantage over JSON that key names won't need to be quoted, making
it more readable.
cheers
andrew
On February 21, 2015 7:20:04 PM CET, Andrew Dunstan <andrew@dunslane.net> wrote:
On 02/21/2015 11:43 AM, Tom Lane wrote:
{
    oid => 2249,
    oiddefine => 'CSTRINGOID',
    typname => 'cstring',
    typlen => -2,
    typbyval => 1,
    ...
}
which is pure perl syntax and wouldn't need any extra module, and has
the advantage over JSON that key names won't need to be quoted, making
it more readable.
Yea, my original post suggested using actual perl hashes to avoid problems with the availability of libraries. So far I've not really heard a convincing alternative. Peter's problem with formatting seems to be most easily solved by rewriting the file automatically...
Andres
--
Please excuse brevity and formatting - I am writing this on my mobile phone.
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2015-02-21 17:43:09 +0100, Andres Freund wrote:
One thing I was considering was to do the regtype and regproc lookups
directly in the tool. That'd have two advantages: 1) it'd make it
possible to refer to typenames in pg_proc, 2) It'd be much faster. Right
now most of initdb's time is doing syscache lookups during bootstrap,
because it can't use indexes... A simple hash lookup during bki
generation could lead to quite measurable savings during lookup.
I've *very* quickly hacked this up. Doing this for all regproc columns
gives a consistent speedup in an assert-enabled build from ~0m3.589s to
~0m2.544s. My guess is that the relative speedup in optimized mode would
actually be even bigger, as now most of the time is spent in
AtEOXact_CatCache.
Given that pg_proc is unlikely to get any smaller and that the current
code is essentially O(lookups * #pg_proc), this alone seems to be worth
a good bit.
The same trick should also allow us to simply refer to type names in
pg_proc et al. If we had a way to denote a column being of type
relnamespace/relauthid we could replace
$row->{bki_values} =~ s/\bPGUID\b/$BOOTSTRAP_SUPERUSERID/g;
$row->{bki_values} =~ s/\bPGNSP\b/$PG_CATALOG_NAMESPACE/g;
as well.
The changes in pg_proc.h are just to demonstrate that using names
instead of oids works.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment: 0001-WIP-resolve-regtype-regproc-in-genbki.pl.patch (text/x-patch, charset=us-ascii, +128 -101)
On Sat, Feb 21, 2015 at 11:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The changes in pg_proc.h are just to demonstrate that using names
instead of oids works.
Fwiw I always thought it was strange how much of our bootstrap was
done in a large static text file. Very little of it is actually needed
for bootstrapping and we could get by with a very small set followed
by a bootstrap script written in standard SQL, not unlike how the
system views are created. It's much easier to type CREATE OPERATOR and
CREATE OPERATOR CLASS with all the symbolic names instead of having to
fill in the table.
--
greg
On Sat, Feb 21, 2015 at 11:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@2ndquadrant.com> writes:
On 2015-02-20 22:19:54 -0500, Peter Eisentraut wrote:
On 2/20/15 8:46 PM, Josh Berkus wrote:
Or what about just doing CSV?
I don't think that would actually address the problems. It would just
be the same format as now with different delimiters.
Yea, we need hierarchies and named keys.
Yeah. One thought though is that I don't think we need the "data" layer
in your proposal; that is, I'd flatten the representation to something
more like
{
    oid => 2249,
    oiddefine => 'CSTRINGOID',
    typname => 'cstring',
    typlen => -2,
    typbyval => 1,
    ...
}
Even this promises to vastly increase the number of lines in the file,
and make it harder to compare entries by grepping out some common
substring. I agree that the current format is a pain in the tail, but
pg_proc.h is >5k lines already. I don't want it to be 100k lines
instead.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-03-03 21:49:21 -0500, Robert Haas wrote:
On Sat, Feb 21, 2015 at 11:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@2ndquadrant.com> writes:
On 2015-02-20 22:19:54 -0500, Peter Eisentraut wrote:
On 2/20/15 8:46 PM, Josh Berkus wrote:
Or what about just doing CSV?
I don't think that would actually address the problems. It would just
be the same format as now with different delimiters.
Yea, we need hierarchies and named keys.
Yeah. One thought though is that I don't think we need the "data" layer
in your proposal; that is, I'd flatten the representation to something
more like
{
    oid => 2249,
    oiddefine => 'CSTRINGOID',
    typname => 'cstring',
    typlen => -2,
    typbyval => 1,
    ...
}
Even this promises to vastly increase the number of lines in the file,
and make it harder to compare entries by grepping out some common
substring. I agree that the current format is a pain in the tail, but
pg_proc.h is >5k lines already. I don't want it to be 100k lines
instead.
Do you have a better suggestion? Sure it'll be a long file, but it still
seems vastly superior to what we have now.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Even this promises to vastly increase the number of lines in the file,
and make it harder to compare entries by grepping out some common
substring. I agree that the current format is a pain in the tail, but
pg_proc.h is >5k lines already. I don't want it to be 100k lines
instead.
Do you have a better suggestion? Sure it'll be a long file, but it still
seems vastly superior to what we have now.
Not really. What had occurred to me is to try to improve the format
of the DATA lines (e.g. by allowing names to be used instead of OIDs)
but that wouldn't allow defaulted fields to be omitted, which is
certainly a big win. I wonder whether some home-grown single-line
format might be better than using a pre-existing format, but I'm not
too sure it would.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-03-04 08:47:44 -0500, Robert Haas wrote:
Even this promises to vastly increase the number of lines in the file,
and make it harder to compare entries by grepping out some common
substring. I agree that the current format is a pain in the tail, but
pg_proc.h is >5k lines already. I don't want it to be 100k lines
instead.
Do you have a better suggestion? Sure it'll be a long file, but it still
seems vastly superior to what we have now.
Not really. What had occurred to me is to try to improve the format
of the DATA lines (e.g. by allowing names to be used instead of OIDs)
That's a separate patch so far, so if we decide we only want that we can
do it.
but that wouldn't allow defaulted fields to be omitted, which is
certainly a big win. I wonder whether some home-grown single-line
format might be better than using a pre-existing format, but I'm not
too sure it would.
I can't see readability of anything being good unless the column names
are there - we just have too many columns in some of the tables. I think
having more lines is an acceptable price to pay. We can easily start to
split the files at some point if we want; that'd just be a couple lines
of code.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 3/3/15 9:49 PM, Robert Haas wrote:
Yeah. One thought though is that I don't think we need the "data" layer
in your proposal; that is, I'd flatten the representation to something
more like
{
    oid => 2249,
    oiddefine => 'CSTRINGOID',
    typname => 'cstring',
    typlen => -2,
    typbyval => 1,
    ...
}
Even this promises to vastly increase the number of lines in the file,
I think lines are cheap. Columns are much harder to deal with.
and make it harder to compare entries by grepping out some common
substring.
Could you give an example of the sort of thing you wish to do?