Re: Bootstrap DATA is a pita
Hello Hackers,
Reviving an old thread on simplifying the bootstrap process.
I'm a developer from the GPDB / HAWQ side of the world where we did some
work a while back to enable catalog definition via SQL files and we have
found it valuable from a dev perspective. The mechanism currently in those
products is a bit... convoluted: SQL is processed in Perl to create the
existing DATA statements, which are then processed as they are today in
Postgres. I wouldn't suggest this route, but having worked with both the
DATA mechanism and the SQL-based one I've certainly found SQL to be a more
convenient way of interacting with the catalog.
I'd propose:
- Keep enough of the existing bootstrap mechanism functional to get a
small, tidy core; essentially you need enough of pg_type, pg_proc, pg_class,
and pg_attribute to support the 25 types used by catalog tables, and most
everything else can be moved into SQL processing, like how system_views.sql
is handled today.
The above was largely proposed back in March and rejected based on
concerns that
1. initdb would be slower.
2. It would introduce too much special purpose bootstrap cruft into the
code.
3. Editing SQL commands is not comfortable in bulk
On 1.
I have a prototype that handles about 1000 functions (all the functions in
pg_proc.h that are not used by other catalog tables, e.g. pg_type,
pg_language, pg_range, pg_aggregate, window functions, pg_ts_parser, etc).
All of initdb can be processed in 1.53s. This compares to 1.37s with the
current bootstrap approach. So yes, this is slower, but not 'noticeably
slower' - I certainly didn't notice the 0.16s until I saw the concern and
then timed it.
On 2.
So far the amount of cruft has been:
- Enabling adding functions with specific OIDs when creating functions.
1 line changes in pg_aggregate.c, proclang.c, typecmds.c
about a dozen lines of code in functioncmds.c
3 lines changed in pg_proc.c
- Update the fmgr_internal_validator for builtin functions while the
catalog is mutable
3 lines changed in pg_proc.c
- Update how the builtin function cache is built
Some significant work in fmgr.c that honestly still needs cleanup
before it would be ready to propose as a patch that would be worthy of
committing.
- Update how builtin functions are resolved outside of bootstrap
Minor updates to dynloader for lookup of symbols within the current
executable, so far I've only done darwin.c for my prototype, this would
need to be extended to the other ports.
- Initialization of the builtin cache
2 line change in postinit.c
- Addition of a stage in initdb to process the sql directives similar in
scope to the processing of system_views.sql.
No changes needed in the parser, planner, etc. My assessment is that this
worry is not a major concern in practice with the right implementation.
On 3.
Having worked with both SQL and bki DATA directives I have personally found
the convenience of SQL outweighs the pain. In many cases changes, such as
adding a new column to pg_proc, have minimal impact on the SQL
representation and what changes are needed are often simple to implement.
E.g. accounting for COST only needs to be done for the functions that need
something other than the default value. This however is somewhat
subjective.
On the Pros side:
a. Debugging bootstrap is extremely painful, debugging once initdb has
gotten to 'postgres --single' is way easier.
b. It is easier to introduce minor issues with DATA directives than it is
when using the SQL processing used for all other user objects.
Example: currently in Postgres all builtin functions default to COST 1,
and all SQL functions default to cost 100. However the following SQL
functions included in bootstrap inexplicably are initialized with a COST of
1:
age(timestamp with time zone)
age(timestamp without time zone)
bit_length(bytea)
bit_length(text)
bit_length(bit)
date_part(text, abstime)
date_part(text, reltime)
date_part(text, date)
... and 26 other examples
c. SQL files are significantly less of a PITA (subjective opinion, but I
can say this from a perspective of experience working with both DATA
directives and SQL driven catalog definition).
If people have interest I can share my patch so far if that helps address
concerns, but if there is not interest then I'll probably leave my
prototype where it is rather than investing more effort in the proof of
concept.
Thanks,
Caleb
On Sat, Mar 7, 2015 at 5:20 PM, <pgsql-hackers-owner@postgresql.org> wrote:
Date: Sat, 7 Mar 2015 23:46:54 +0100
From: Andres Freund <andres@2ndquadrant.com>
To: Jim Nasby <Jim.Nasby@BlueTreble.com>
Cc: Stephen Frost <sfrost@snowman.net>, Robert Haas <robertmhaas@gmail.com>,
Tom Lane <tgl@sss.pgh.pa.us>, Peter Eisentraut <peter_e@gmx.net>,
Josh Berkus <josh@agliodbs.com>,
"pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
Subject: Re: Bootstrap DATA is a pita
Message-ID: <20150307224654.GC12213@awork2.anarazel.de>

On 2015-03-07 16:43:15 -0600, Jim Nasby wrote:
Semi-related... if we put some special handling in some places for
bootstrap
mode, couldn't most catalog objects be created using SQL, once we got
pg_class, pg_attributes and pg_type created? That would theoretically allow
us to drive much more of initdb with plain SQL (possibly created via
pg_dump).

Several people have now made that suggestion, but I *seriously* doubt
that we actually want to go there. The overhead of executing SQL
commands in comparison to the bki stuff is really rather
noticeable. Doing the majority of the large number of insertions via SQL
will make initdb noticeably slower. And it's already annoyingly
slow. Besides make install it's probably the thing I wait most for
during development.

That's besides the fact that SQL commands aren't actually that
comfortably editable in bulk.

Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
I'm happy working these ideas forward if there is interest.
Basic design proposal is:
- keep a minimal amount of bootstrap to avoid intrusive changes to core
components
- Add capabilities of creating objects with specific OIDs via DDL during
initdb
- Update the caching/resolution mechanism for builtin functions to be
more dynamic.
- Move as much of bootstrap as possible into SQL files and create catalog
via DDL
Feedback appreciated.
I can provide a sample patch if there is interest; roughly 500 lines of
combined diff for the needed infrastructure to support the above, not
including the modifications to pg_proc.h that would follow.
Thanks,
Caleb
On Thu, Dec 10, 2015 at 11:47 AM, Caleb Welton wrote:
Caleb Welton wrote:
I'm happy working these ideas forward if there is interest.
Basic design proposal is:
- keep a minimal amount of bootstrap to avoid intrusive changes to core
components
- Add capabilities of creating objects with specific OIDs via DDL during
initdb
- Update the caching/resolution mechanism for builtin functions to be
more dynamic.
- Move as much of bootstrap as possible into SQL files and create catalog
via DDL
I think the point we got stuck last time at was deciding on a good
format for the data coming from the DATA lines. One of the objections
raised for formats such as JSON is that it's trivial for "git merge" (or
similar tools) to make a mistake because object-end/object-start lines
are all identical. And as for the SQL-format version, the objection was
that it's hard to modify the lines en-masse when modifying the catalog
definition (new column, etc). Ideally we would like a format that can
be bulk-edited without too much trouble.
A SQL file would presumably not have the merge issue, but mass-editing
would be a pain.
Crazy idea: we could just have a CSV file which can be loaded into a
table for mass changes using regular DDL commands, then dumped back from
there into the file. We already know how to do these things, using
\copy etc. Since CSV uses one line per entry, there would be no merge
problems either (or rather: all merge problems would become conflicts,
which is what we want.)
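The round trip would presumably use \copy into a scratch table, but the kind of mass edit it enables is easy to sketch. A minimal illustration in Python (the file name, columns, and values are invented; the real master data would come from the catalog headers) of the load / bulk-edit / dump cycle for a hypothetical new column:

```python
import csv
import io

# A miniature stand-in for a master data file such as a pg_proc.csv
# (layout and column names here are invented for illustration).
original = """proname,prorettype,procost
lo_export,int4,1
age,int4,1
"""

def add_column_with_default(text, colname, default):
    """Read CSV text, append a new column filled with a default value,
    and re-serialize. One line per entry, so merges stay line-based."""
    rows = list(csv.DictReader(io.StringIO(text)))
    fields = list(rows[0].keys()) + [colname]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        row[colname] = default
        writer.writerow(row)
    return out.getvalue()

# e.g. a proleakproof-style addition: every existing row gets 'f'
edited = add_column_with_default(original, "proleakproof", "f")
```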
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Makes sense.
During my own prototyping what I did was generate the SQL statements via SQL querying the existing catalog. Way easier than hand-writing 1000+ function definitions, and not difficult to modify for future changes. As it turned out, it was very easy to adapt my existing SQL to account for some of the newer features in master.
The biggest challenge was establishing a sort order that ensures both a unique ordering and that the dependencies needed for SQL functions have been processed before trying to define them. This affects about 4 of 1000 functions based on a natural OID ordering.
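The ordering requirement described above is essentially a topological sort with OID as the tie-breaker. A small sketch (OIDs and dependencies invented) of how the unique, dependency-respecting order could be computed:

```python
from graphlib import TopologicalSorter

def emission_order(deps):
    """Dependency-respecting order, breaking ties by OID so the
    ordering is also unique and stable across runs."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    order = []
    while ts.is_active():
        ready = sorted(ts.get_ready())  # tie-break by OID
        order.extend(ready)
        for oid in ready:
            ts.done(oid)
    return order

# oid -> set of oids that must be defined first (as would be extracted
# from the bodies of LANGUAGE sql functions; values here are invented)
deps = {
    100: set(),
    101: set(),
    102: {103},   # a SQL function whose body calls function 103
    103: set(),
}
order = emission_order(deps)
```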
On Dec 11, 2015, at 11:43 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Crazy idea: we could just have a CSV file which can be loaded into a
table for mass changes using regular DDL commands, then dumped back from
there into the file. We already know how to do these things, using
\copy etc. Since CSV uses one line per entry, there would be no merge
problems either (or rather: all merge problems would become conflicts,
which is what we want.)
That's an interesting proposal. It would mean that the catalog files
stay at more or less their current semantic level (direct representations
of bootstrap catalog contents), but it does sound like a more attractive
way to perform complex edits than writing Emacs macros ;-).
You could actually do that the hard way right now, with a bit of script
to convert between DATA lines and CSV format. But if we anticipate that
becoming the standard approach, it would definitely make sense to migrate
the master copies into CSV or traditional COPY format, and teach BKI mode
to read that (or, perhaps, leave bootstrap.c alone and modify the code
that produces the .bki file).
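Such a conversion script is mostly tokenizing. A sketch in Python (the in-tree tooling around .bki generation is Perl, so this is only illustrative), using the lo_export DATA() line quoted later in the thread:

```python
import re

# The lo_export DATA() line quoted later in this thread
line = ('DATA(insert OID = 765 ( lo_export PGNSP PGUID 12 1 0 0 0 f f f f '
        't f v u 2 0 23 "26 25" _null_ _null_ _null_ _null_ _null_ '
        'lo_export _null_ _null_ _null_ ));')

def data_line_to_row(line):
    """Split one DATA(insert OID = n ( ... )); line into [oid, field, ...],
    treating a double-quoted token (e.g. "26 25") as a single field."""
    m = re.match(r'DATA\(insert OID = (\d+) \( (.*) \)\);', line)
    oid, body = m.group(1), m.group(2)
    fields = re.findall(r'"[^"]*"|\S+', body)
    return [oid] + [f.strip('"') for f in fields]

row = data_line_to_row(line)
# ','.join(row) would then be one line of the CSV master file
```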
This is somewhat orthogonal to the question of whether we want to do
things like converting noncritical operator-class definitions into
regular CREATE OPERATOR CLASS syntax. There's almost certainly going
to be some hard core of catalog entries that aren't amenable to that,
and will still need to be loaded from data files of some sort.
regards, tom lane
On Dec 11, 2015, at 1:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Crazy idea: we could just have a CSV file which can be loaded into a
table for mass changes using regular DDL commands, then dumped back from
there into the file. We already know how to do these things, using
\copy etc. Since CSV uses one line per entry, there would be no merge
problems either (or rather: all merge problems would become conflicts,
which is what we want.)

That's an interesting proposal. It would mean that the catalog files
stay at more or less their current semantic level (direct representations
of bootstrap catalog contents), but it does sound like a more attractive
way to perform complex edits than writing Emacs macros ;-).
I would be happy to work on this, if there is much chance of the community
accepting a patch. Do you think replacing the numeric Oids for functions,
operators, opclasses and such in the source files with their names would
be ok, with the SQL converting those to Oids in the output? My eyes have
gotten tired more than once trying to read header files in src/include/catalog
looking for mistakes in what largely amounts to a big table of numbers.
For example, in pg_amop.h:
/* default operators int2 */
DATA(insert ( 1976 21 21 1 s 95 403 0 ));
DATA(insert ( 1976 21 21 2 s 522 403 0 ));
DATA(insert ( 1976 21 21 3 s 94 403 0 ));
DATA(insert ( 1976 21 21 4 s 524 403 0 ));
DATA(insert ( 1976 21 21 5 s 520 403 0 ));
Would become something like:
amopfamily   amoplefttype  amoprighttype  amopstrategy  amoppurpose  amopopr  amopmethod  amopsortfamily
integer_ops  int2          int2           1             search       "<"      btree       0
integer_ops  int2          int2           2             search       "<="     btree       0
integer_ops  int2          int2           3             search       "="      btree       0
integer_ops  int2          int2           4             search       ">="     btree       0
integer_ops  int2          int2           5             search       ">"      btree       0
Note that I prefer to use tabs and a headerline, as the tabstop can be set to
line them up nicely, and the headerline allows you to see which column is
which, and what it is for. CSV is always harder for me to use that way, though
maybe that is just a matter of which editor I use (vim).
And yes, I'd need to allow the HEADER option for copying tab delimited
files, since it is currently only allowed for csv, I believe.
mark
Mark Dilger <hornschnorter@gmail.com> writes:
On Dec 11, 2015, at 1:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
That's an interesting proposal. It would mean that the catalog files
stay at more or less their current semantic level (direct representations
of bootstrap catalog contents), but it does sound like a more attractive
way to perform complex edits than writing Emacs macros ;-).
I would be happy to work on this, if there is much chance of the community
accepting a patch. Do you think replacing the numeric Oids for functions,
operators, opclasses and such in the source files with their names would
be ok, with the SQL converting those to Oids in the output?
Huh? Those files are the definition of that mapping, no? Isn't what
you're proposing circular?
regards, tom lane
I took a look at a few of the most recent bulk edit cases for pg_proc.h:
There were two this year:
* The addition of proparallel [1]
* The addition of protransform [2]
And prior to that the most recent seems to be from 2012:
* The addition of proleakproof [3]
Quick TL;DR: the changes needed for these are super simple to reflect
when generating SQL for CREATE FUNCTION statements.
Attached is the SQL that would generate function definitions prior to
proleakproof and the diffs that would be required after adding support for
proleakproof, protransform and proparallel.
Each of the diffs indicates the changes that would be needed after the new
column is added. The question of how to populate default values for the new
columns is beyond what can easily be expressed in general terms and depends
entirely on the nature of the new column.
Note: Currently I have focused on the 'pure' functions, e.g. not the
drivers of type serialization, language validation, operators, or other
object types. I would want to deal with each of those while handling the
conversion for each of those object types in turn. Additional
modifications would likely be needed for other types of functions.
[1]: https://github.com/postgres/postgres/commit/7aea8e4f2daa4b39ca9d1309a0c4aadb0f7ed81b
[2]: https://github.com/postgres/postgres/commit/8f9fe6edce358f7904e0db119416b4d1080a83aa
[3]: https://github.com/postgres/postgres/commit/cd30728fb2ed7c367d545fc14ab850b5fa2a4850
On Fri, Dec 11, 2015 at 12:55 PM, Caleb Welton <cwelton@pivotal.io> wrote:
On Dec 11, 2015, at 2:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Mark Dilger <hornschnorter@gmail.com> writes:
On Dec 11, 2015, at 1:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
That's an interesting proposal. It would mean that the catalog files
stay at more or less their current semantic level (direct representations
of bootstrap catalog contents), but it does sound like a more attractive
way to perform complex edits than writing Emacs macros ;-).

I would be happy to work on this, if there is much chance of the community
accepting a patch. Do you think replacing the numeric Oids for functions,
operators, opclasses and such in the source files with their names would
be ok, with the SQL converting those to Oids in the output?

Huh? Those files are the definition of that mapping, no? Isn't what
you're proposing circular?
No, there are far more references to Oids than there are definitions of them.
For example, the line in pg_operator.h:
DATA(insert OID = 15 ( "=" PGNSP PGUID b t t 23 20 16 416 36 int48eq eqsel eqjoinsel ));
defines 15 as the oid for the equals operator for (int8,int4) returning bool, but the
fact that 23 is the Oid for int4, 20 is the Oid for int8, and 16 is the Oid for bool
is already defined elsewhere (in pg_type.h) and need not be duplicated here.
I'm just proposing that we don't keep specifying things by number everywhere.
Once you've established the Oid for something (operator, type, function) you
should use the name everywhere else.
mark
The current semantic level is pretty low level, somewhat cumbersome, and
requires filling in values for which, most of the time, the system already
has a pretty good idea of the right defaults.
Compare:
CREATE FUNCTION lo_export(oid, text) RETURNS integer LANGUAGE internal
STRICT AS 'lo_export' WITH (OID=765);
DATA(insert OID = 765 ( lo_export PGNSP PGUID 12 1 0 0 0 f f f f t f v u
2 0 23 "26 25" _null_ _null_ _null_ _null_ _null_ lo_export _null_ _null_
_null_ ));
In the first one someone has indicated:
1. a function name,
2. two parameter type names
3. a return type
4. a language
5. null handling
6. a symbol
7. an oid
In the second case 30 separate items have been indicated, and yet both of
them will generate identical end results within the catalog.
The former is more immune to even needing modification in the event that
the catalog structure changes.
- adding proleakproof? No change needed, default value is correct
- adding protransform? No change needed, not relevant
- adding proparallel? No change needed, default value is correct
- adding procost? No change needed, default value is correct
On Fri, Dec 11, 2015 at 1:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yes, that alone without any other changes would be a marked improvement and
could be implemented in many places, pg_operator is a good example.
... but there is some circularity especially with respect to type
definitions and the functions that define those types. If you changed the
definition of prorettype into a regtype then bootstrap would try to look up
the type before the pg_type entry exists and throw a fit. That's handled
in SQL via shell types. If we wanted bootstrap to be able to handle this
then we'd have to make two passes of pg_type, the first to create the
shells and the second to handle populating the serialization functions.
Unfortunately types and functions tend to be the more volatile areas of the
catalog so this particular circularity is particularly vexing.
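At the SQL level the shell-type dance looks like a bare CREATE TYPE name; followed later by the full CREATE TYPE with its I/O functions. A sketch (type and function names invented) of a two-pass emitter that breaks the cycle this way:

```python
# types: name -> (input function, output function); names are invented
types = {'mytype': ('mytype_in', 'mytype_out')}

def emit_type_bootstrap(types):
    """Emit SQL in three steps: shell types first, then the I/O
    functions (which may now name the types in their signatures),
    then the filled-in type definitions."""
    stmts = []
    for name in types:
        stmts.append(f"CREATE TYPE {name};")  # pass 1: shell type
    for name, (fin, fout) in types.items():
        stmts.append(f"CREATE FUNCTION {fin}(cstring) RETURNS {name} "
                     f"LANGUAGE internal AS '{fin}';")
        stmts.append(f"CREATE FUNCTION {fout}({name}) RETURNS cstring "
                     f"LANGUAGE internal AS '{fout}';")
    for name, (fin, fout) in types.items():
        # pass 2: fill in the shell with its serialization functions
        stmts.append(f"CREATE TYPE {name} (INPUT = {fin}, OUTPUT = {fout});")
    return stmts

stmts = emit_type_bootstrap(types)
```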
On Fri, Dec 11, 2015 at 2:53 PM, Mark Dilger <hornschnorter@gmail.com>
wrote:
Mark Dilger <hornschnorter@gmail.com> writes:
On Dec 11, 2015, at 2:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Huh? Those files are the definition of that mapping, no? Isn't what
you're proposing circular?
No, there are far more references to Oids than there are definitions of them.
Well, you're still not being very clear, but I *think* what you're
proposing is to put a lot more smarts into the script that converts
the master source files into .bki format. That is, we might have
"=(int8,int4)" in an entry in the master source file for pg_amop, but
the script would look up that entry using the source data for pg_type
and pg_operator, and then emit a simple numeric OID into the .bki file.
(Presumably, it would know to do this because we'd redefine the
pg_amop.amopopr column as of regoperator type not plain OID.)
Yeah, that could work, though I'd be a bit concerned about the complexity
and speed of the script. Still, one doesn't usually rebuild postgres.bki
many times a day, so speed might not be a big problem.
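The lookup step can be sketched directly; the toy tables below stand in for the pg_type and pg_operator source data, with OIDs taken from the pg_amop example earlier in the thread (int2 is type 21; "=" and "<" on (int2,int2) are operators 94 and 95):

```python
import re

# Tiny lookup tables standing in for the pg_type / pg_operator
# source data that the .bki-generation script would have parsed.
type_oids = {'int2': 21}
operator_oids = {('=', 21, 21): 94, ('<', 21, 21): 95}

def resolve_operator(sym):
    """Resolve 'op(lefttype,righttype)' to the bare numeric OID that
    would be emitted into the .bki file."""
    m = re.match(r'(.+)\((\w+),(\w+)\)$', sym)
    op, left, right = m.groups()
    return operator_oids[(op, type_oids[left], type_oids[right])]
```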
This seems more or less orthogonal to the question of whether to get rid
of the DATA() lines in favor of a COPY-friendly data format. I'd suggest
treating those as separate patches.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Caleb Welton <cwelton@pivotal.io> writes:
... but there is some circularity especially with respect to type
definitions and the functions that define those types. If you changed the
definition of prorettype into a regtype then bootstrap would try to look up
the type before the pg_type entry exists and throw a fit. That's handled
in SQL via shell types. If we wanted bootstrap to be able to handle this
then we'd have to make two passes of pg_type, the first to create the
shells and the second to handle populating the serialization functions.
I think what Mark is proposing is to do the lookups while preparing the
.bki file, which would eliminate the circularity ... at the cost of having
to, essentially, reimplement regprocedure_in and friends in Perl.
If we push hard on doing the other thing that you're proposing, which is
to take as much as possible out of the pure bootstrap-data phase, then
maybe it wouldn't be worth the work to do that. Not sure.
On the other hand, I'm not very much in love with the thought of having
two different notations for "core" and "not so core" built-in function
creation. There's something to be said for keeping all the data in one
format. If we push on making the .bki creation script smarter, then in
addition to the name lookup facilities Mark envisions, we could have
things like default column values. That would take us a long way toward
the same ease-of-use as full SQL definitions. We'd still be lacking
some error checks that the SQL commands could perform; but we've
traditionally used sanity checks in the regression tests to do
cross-checking that covers more or less those same bases.
regards, tom lane
On Dec 11, 2015, at 3:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Mark Dilger <hornschnorter@gmail.com> writes:
On Dec 11, 2015, at 2:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Huh? Those files are the definition of that mapping, no? Isn't what
you're proposing circular?
No, there are far more references to Oids than there are definitions of them.
Well, you're still not being very clear, but I *think* what you're
proposing is to put a lot more smarts into the script that converts
the master source files into .bki format. That is, we might have
"=(int8,int4)" in an entry in the master source file for pg_amop, but
the script would look up that entry using the source data for pg_type
and pg_operator, and then emit a simple numeric OID into the .bki file.
(Presumably, it would know to do this because we'd redefine the
pg_amop.amopopr column as of regoperator type not plain OID.)
Yeah, that could work, though I'd be a bit concerned about the complexity
and speed of the script. Still, one doesn't usually rebuild postgres.bki
many times a day, so speed might not be a big problem.
I am proposing that each of the catalog headers that currently has DATA
lines instead have a COPY loadable file that contains the same information.
So, for pg_type.h, there would be a pg_type.dat file. All the DATA lines
would be pulled out of pg_type.h and a corresponding tab delimited row
would be written to pg_type.dat. Henceforth, if you cloned the git repository,
you'd find no DATA lines in pg_type.h, but would find a pg_type.dat file
in the src/include/catalog directory. Likewise for the other header files.
There would be some script, SQL or perl or whatever, that would convert
these .dat files into the .bki file.
Now, if we know that pg_type.dat will be processed before pg_proc.dat,
we can replace all the Oids representing datatypes in pg_proc.dat with the
names for those types, given that we already have a name <=> oid
mapping for types.
Likewise, if we know that pg_proc.dat will be processed before pg_operator.dat,
we can specify both functions and datatypes by name rather than by Oid
in that file, making it much easier to read. By the time pg_operator.dat is
read, pg_type.dat and pg_proc.dat will already have been read and processed,
so there shouldn't be ambiguity.
By the time pg_amop.dat is processed, the operators, procs, datatypes,
opfamilies and so forth would already be known. The example I gave
upthread would be easy to parse:
amopfamily amoplefttype amoprighttype amopstrategy amoppurpose amopopr amopmethod amopsortfamily
integer_ops int2 int2 1 search "<" btree 0
integer_ops int2 int2 2 search "<=" btree 0
integer_ops int2 int2 3 search "=" btree 0
integer_ops int2 int2 4 search ">=" btree 0
integer_ops int2 int2 5 search ">" btree 0
And if I came along and defined a new datatype, int384, I could add rows to
this file much more easily, as:
amopfamily amoplefttype amoprighttype amopstrategy amoppurpose amopopr amopmethod amopsortfamily
integer_ops int384 int384 1 search "<" btree 0
integer_ops int384 int384 2 search "<=" btree 0
integer_ops int384 int384 3 search "=" btree 0
integer_ops int384 int384 4 search ">=" btree 0
integer_ops int384 int384 5 search ">" btree 0
I don't see how this creates all that much complication, and I clearly see
how it makes files like pg_operator.{h,dat} and pg_amop.{h,dat} easier to read.
mark
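The name-resolution pass Mark describes could look something like the
following. This is only an illustrative sketch, written in Python for
brevity (the actual .bki generator would live in Perl alongside
Catalog.pm), and the lookup tables below hold just a tiny hand-picked
subset of catalog entries:

```python
# Phase 1: load all catalog data and build name -> OID maps.
# (Only a few real entries shown; a real version would parse the
# proposed .dat files.)
type_oids = {"bool": 16, "int8": 20, "int2": 21, "int4": 23}
oper_oids = {("<", "int2", "int2"): 95, ("=", "int2", "int2"): 94}
am_oids = {"btree": 403}
opfamily_oids = {("integer_ops", "btree"): 1976}

def resolve_amop_row(row):
    """Phase 2: translate one symbolic pg_amop row into the numeric
    OIDs that would be emitted into the .bki file."""
    family, left, right, strategy, purpose, opr, method, sortfam = row
    return (
        opfamily_oids[(family, method)],
        type_oids[left],
        type_oids[right],
        int(strategy),
        purpose,
        oper_oids[(opr, left, right)],
        am_oids[method],
        int(sortfam),
    )

row = ("integer_ops", "int2", "int2", "1", "search", "<", "btree", "0")
print(resolve_amop_row(row))  # (1976, 21, 21, 1, 'search', 95, 403, 0)
```

The point being that once every catalog's symbolic rows are in memory,
each lookup is a plain hash probe, which is roughly what Andres reports
trying later in the thread.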
On 2015-12-11 18:12:16 -0500, Tom Lane wrote:
I think what Mark is proposing is to do the lookups while preparing the
.bki file, which would eliminate the circularity ... at the cost of having
to, essentially, reimplement regprocedure_in and friends in Perl.
FWIW, I did that when this came up last. Rather interesting, because it
leads to rather noticeable speedups - currently initdb spends a
significant amount of its time doing regproc lookups. Especially
interesting because at that stage we're largely not using indexes yet, IIRC.
Andres
Mark Dilger <hornschnorter@gmail.com> writes:
Now, if we know that pg_type.dat will be processed before pg_proc.dat,
we can replace all the Oids representing datatypes in pg_proc.dat with the
names for those types, given that we already have a name <=> oid
mapping for types.
I don't think this is quite as simple as you paint it. How can you
process pg_type.dat first, when it contains pg_proc references? Doing
pg_proc first is no better, because it contains pg_type references.
I believe it's soluble, but it's going to take something more like
loading up all the data at once and then doing lookups as we write
out the .bki entries for each catalog. Fortunately, the volume of
bootstrap data is small enough that that won't be a problem on any
machine capable of running modern Postgres ...
regards, tom lane
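The load-everything-first approach resolves the pg_type/pg_proc
circularity because no lookup happens until all catalogs are in memory.
A minimal sketch (in Python rather than Perl, with made-up miniature
catalogs of one real entry each):

```python
# Phase 1: read every catalog's symbolic rows before resolving anything.
pg_type = {"bool": {"oid": 16, "typinput": "boolin"}}
pg_proc = {"boolin": {"oid": 1242, "prorettype": "bool"}}

def emit_bki():
    """Phase 2: with both catalogs in memory, the mutual references
    (pg_type.typinput -> pg_proc, pg_proc.prorettype -> pg_type)
    resolve in either order."""
    rows = []
    for name, t in pg_type.items():
        rows.append(("pg_type", t["oid"], pg_proc[t["typinput"]]["oid"]))
    for name, p in pg_proc.items():
        rows.append(("pg_proc", p["oid"], pg_type[p["prorettype"]]["oid"]))
    return rows

print(emit_bki())  # [('pg_type', 16, 1242), ('pg_proc', 1242, 16)]
```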
On 2015-12-11 19:26:38 -0500, Tom Lane wrote:
I believe it's soluble, but it's going to take something more like
loading up all the data at once and then doing lookups as we write
out the .bki entries for each catalog. Fortunately, the volume of
bootstrap data is small enough that that won't be a problem on any
machine capable of running modern Postgres ...
I think that's exactly the right approach. Just building a few perl
hashes worked well enough, in my prototype of that.
If additionally a few more plain oid fields are converted into reg*
types, the source data fields are easier to understand and the catalogs
get much nicer to query...
Andres Freund <andres@anarazel.de> writes:
On 2015-12-11 19:26:38 -0500, Tom Lane wrote:
I believe it's soluble, but it's going to take something more like
loading up all the data at once and then doing lookups as we write
out the .bki entries for each catalog. Fortunately, the volume of
bootstrap data is small enough that that won't be a problem on any
machine capable of running modern Postgres ...
I think that's exactly the right approach. Just building a few perl
hashes worked well enough, in my prototype of that.
Right. I would draw Mark's attention to src/backend/catalog/Catalog.pm
and the things that use that. Presumably all that would have to be
rewritten, but the existing code would be a useful starting point
perhaps.
regards, tom lane
On Dec 11, 2015, at 2:54 PM, Caleb Welton <cwelton@pivotal.io> wrote:
The current semantic level is pretty low level, somewhat cumbersome, and requires filling in values that, most of the time, the system has a pretty good idea how to default.
Compare:
CREATE FUNCTION lo_export(oid, text) RETURNS integer LANGUAGE internal STRICT AS 'lo_export' WITH (OID=765);
DATA(insert OID = 765 ( lo_export PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 23 "26 25" _null_ _null_ _null_ _null_ _null_ lo_export _null_ _null_ _null_ ));
I would like to hear more about this idea. Are you proposing that we use something
like the above CREATE FUNCTION format to express what is currently being expressed
with DATA statements? That is an interesting idea, though I don't know what exactly
that would look like. If you want to forward this idea, I'd be eager to hear your thoughts.
If not, I'll try to make progress with my idea of tab delimited files and such (or really,
Alvaro's idea of csv files that I only slightly corrupted).
mark
Mark Dilger <hornschnorter@gmail.com> writes:
On Dec 11, 2015, at 2:54 PM, Caleb Welton <cwelton@pivotal.io> wrote:
Compare:
CREATE FUNCTION lo_export(oid, text) RETURNS integer LANGUAGE internal STRICT AS 'lo_export' WITH (OID=765);
DATA(insert OID = 765 ( lo_export PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 23 "26 25" _null_ _null_ _null_ _null_ _null_ lo_export _null_ _null_ _null_ ));
I would like to hear more about this idea. Are you proposing that we use something
like the above CREATE FUNCTION format to express what is currently being expressed
with DATA statements?
Yes, that sort of idea has been kicked around some already, see the
archives.
That is an interesting idea, though I don't know what exactly
that would look like. If you want to forward this idea, I'd be eager to hear your thoughts.
If not, I'll try to make progress with my idea of tab delimited files and such (or really,
Alvaro's idea of csv files that I only slightly corrupted).
Personally I would like to see both approaches explored. Installing as
much as we can via SQL commands is attractive for a number of reasons;
but there is going to be an irreducible minimum amount of stuff that
has to be inserted by something close to the current bootstrapping
process. (And I'm not convinced that that "minimum amount" is going
to be very small...) So it's not impossible that we'd end up accepting
*both* types of patches, one to do more in the post-bootstrap SQL world
and one to make the bootstrap data notation less cumbersome. In any
case it would be useful to push both approaches forward some more before
we make any decisions between them.
BTW, there's another thing I'd like to see improved in this area, which is
a problem already but will get a lot worse if we push more work into the
post-bootstrap phase of initdb. That is that the post-bootstrap phase is
both inefficient and impossible to debug. If you've ever had a failure
there, you'll have seen that the backend spits out an entire SQL script
and says there's an error in it somewhere; that's because it gets the
whole per-stage script as one submission. (Try introducing a syntax error
somewhere in information_schema.sql, and you'll see what I mean.)
Breaking the stage scripts down further would help, but that is
unattractive because each one requires a fresh backend startup/shutdown,
including a full checkpoint. I'd like to see things rejiggered so that
there's only one post-bootstrap standalone backend session that performs
all the steps, but initdb feeds it just one SQL command at a time so that
errors are better localized. That should both speed up initdb noticeably
and make debugging easier.
regards, tom lane
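The one-command-at-a-time feeding Tom proposes requires initdb to split
the stage scripts itself. A deliberately naive sketch of that splitter
(in Python; it handles single-quoted strings but not dollar-quoting or
comments, which a real version would need a proper lexer for):

```python
def split_statements(script):
    """Split an SQL script on top-level semicolons so each statement
    can be submitted, and any error reported, individually."""
    stmts, buf, in_quote = [], [], False
    for ch in script:
        if ch == "'":
            in_quote = not in_quote
        if ch == ";" and not in_quote:
            stmt = "".join(buf).strip()
            if stmt:
                stmts.append(stmt)
            buf = []
        else:
            buf.append(ch)
    return stmts

script = "CREATE VIEW v AS SELECT 1; COMMENT ON VIEW v IS 'a; b';"
print(split_statements(script))
# ['CREATE VIEW v AS SELECT 1', "COMMENT ON VIEW v IS 'a; b'"]
```

With statements fed one at a time over a single backend session, a
syntax error in something like information_schema.sql would be reported
against one small statement instead of the whole per-stage script.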