unused_oids script is broken with bsd sed

Started by Stas Kelvichalmost 8 years ago27 messageshackers
Jump to latest
#1Stas Kelvich
s.kelvich@postgrespro.ru

Hi.

BSD sed in macOS doesn't understand word boundary operator "\b". So after
372728b0d49 unused_oids doesn’t match all oids in new *.dat files marking
all of them as unused.

It is possible to write more portable regexps, e.g. change "\b" to
something like "^.*{*.*", but it seems easier for feature use to just
rewrite unused_oids in perl to match duplicate_oids. Also add in-place
complain about duplicates instead of running uniq through oids array.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

0001-Rewrite-unused_oids-in-perl.patchapplication/octet-stream; name=0001-Rewrite-unused_oids-in-perl.patch; x-unix-mode=0644Download+45-31
#2Andrew Dunstan
andrew@dunslane.net
In reply to: Stas Kelvich (#1)
Re: unused_oids script is broken with bsd sed

On 04/25/2018 06:22 AM, Stas Kelvich wrote:

Hi.

BSD sed in macOS doesn't understand word boundary operator "\b". So after
372728b0d49 unused_oids doesn’t match all oids in new *.dat files marking
all of them as unused.

It is possible to write more portable regexps, e.g. change "\b" to
something like "^.*{*.*", but it seems easier for feature use to just
rewrite unused_oids in perl to match duplicate_oids. Also add in-place
complain about duplicates instead of running uniq through oids array.

+many for rewriting in perl. Do you want to have a go at that? If not I
will.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#2)
Re: unused_oids script is broken with bsd sed

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

+many for rewriting in perl. Do you want to have a go at that? If not I
will.

+1 ... I was mildly astonished that this didn't already have to happen
as part of the bootstrap data conversion effort. It's certainly not
hard to imagine future extensions to the .dat file format that would
break this script, and duplicate_oids too. I think we should rewrite
both of them to use the Catalog.pm infrastructure.

regards, tom lane

#4Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Tom Lane (#3)
Re: unused_oids script is broken with bsd sed

On 25 Apr 2018, at 17:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

+many for rewriting in perl. Do you want to have a go at that? If not I
will.

+1 ... I was mildly astonished that this didn't already have to happen
as part of the bootstrap data conversion effort. It's certainly not
hard to imagine future extensions to the .dat file format that would
break this script, and duplicate_oids too. I think we should rewrite
both of them to use the Catalog.pm infrastructure.

regards, tom lane

Hm, I attached patch in first message, but seems that my mail client again
messed with attachment. However archive caught it:

/messages/by-id/attachment/60920/0001-Rewrite-unused_oids-in-perl.patch

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#5Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Tom Lane (#3)
Re: unused_oids script is broken with bsd sed

On 25 Apr 2018, at 17:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think we should rewrite
both of them to use the Catalog.pm infrastructure.

Okay, seems reasonable. I'll put shared code in Catalog.pm and
update patch.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#6John Naylor
john.naylor@enterprisedb.com
In reply to: Stas Kelvich (#5)
Re: unused_oids script is broken with bsd sed

On 4/25/18, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

On 25 Apr 2018, at 17:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think we should rewrite
both of them to use the Catalog.pm infrastructure.

Okay, seems reasonable. I'll put shared code in Catalog.pm and
update patch.

I don't think you need any new code in Catalog.pm, I believe the
suggestion was just to use that module as a stable interface to the
data. Looking at your patch, I'll mention that we have an idiom for
extracting #define'd OID symbols, e.g.:

my $FirstBootstrapObjectId = Catalog::FindDefinedSymbol(
'access/transam.h', \@include_path, 'FirstBootstrapObjectId');

This is preferred over using awk, which would have its own portability
issues (Windows for starters).

While I'm thinking out loud, it might be worthwhile to patch genbki.pl
for the duplicate test, since they're run at the same time anyway (see
catalog/Makefile), and we've already read all the data.

-John Naylor

#7David Fetter
david@fetter.org
In reply to: John Naylor (#6)
Re: unused_oids script is broken with bsd sed

On Wed, Apr 25, 2018 at 09:55:54PM +0700, John Naylor wrote:

On 4/25/18, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

On 25 Apr 2018, at 17:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think we should rewrite
both of them to use the Catalog.pm infrastructure.

Okay, seems reasonable. I'll put shared code in Catalog.pm and
update patch.

I don't think you need any new code in Catalog.pm, I believe the
suggestion was just to use that module as a stable interface to the
data. Looking at your patch, I'll mention that we have an idiom for
extracting #define'd OID symbols, e.g.:

my $FirstBootstrapObjectId = Catalog::FindDefinedSymbol(
'access/transam.h', \@include_path, 'FirstBootstrapObjectId');

While we're improving things, there's not really a reason this program
should need to be run from any particular spot. It can detect where it
is and act on that information.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#8Andreas Karlsson
andreas.karlsson@percona.com
In reply to: David Fetter (#7)
Re: unused_oids script is broken with bsd sed

On 04/25/2018 04:59 PM, David Fetter wrote:

While we're improving things, there's not really a reason this program
should need to be run from any particular spot. It can detect where it
is and act on that information.

+1

I would appreciate this change, and if Stats does not feel like adding
it to his patch I can write a patch for this myself.

Andreas

#9Andrew Dunstan
andrew@dunslane.net
In reply to: John Naylor (#6)
Re: unused_oids script is broken with bsd sed

On 04/25/2018 10:55 AM, John Naylor wrote:

On 4/25/18, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

On 25 Apr 2018, at 17:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think we should rewrite
both of them to use the Catalog.pm infrastructure.

Okay, seems reasonable. I'll put shared code in Catalog.pm and
update patch.

I don't think you need any new code in Catalog.pm, I believe the
suggestion was just to use that module as a stable interface to the
data. Looking at your patch, I'll mention that we have an idiom for
extracting #define'd OID symbols, e.g.:

my $FirstBootstrapObjectId = Catalog::FindDefinedSymbol(
'access/transam.h', \@include_path, 'FirstBootstrapObjectId');

This is preferred over using awk, which would have its own portability
issues (Windows for starters).

While I'm thinking out loud, it might be worthwhile to patch genbki.pl
for the duplicate test, since they're run at the same time anyway (see
catalog/Makefile), and we've already read all the data.

The logic for getting the set of oids should be centralized, if not in
Catalog.pm then in a script which serves both for dup0licate_oids and
unused_oids.

Here is something I cobbled together for the latter approach. It could
probably improve by using Catalog::FindDefinedSymbol()

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

unused_oids.plapplication/x-perl; name=unused_oids.plDownload
#10Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: John Naylor (#6)
Re: unused_oids script is broken with bsd sed

On 25 Apr 2018, at 17:55, John Naylor <jcnaylor@gmail.com> wrote:

On 4/25/18, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

On 25 Apr 2018, at 17:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think we should rewrite
both of them to use the Catalog.pm infrastructure.

Okay, seems reasonable. I'll put shared code in Catalog.pm and
update patch.

I don't think you need any new code in Catalog.pm, I believe the
suggestion was just to use that module as a stable interface to the
data. Looking at your patch, I'll mention that we have an idiom for
extracting #define'd OID symbols, e.g.:

my $FirstBootstrapObjectId = Catalog::FindDefinedSymbol(
'access/transam.h', \@include_path, 'FirstBootstrapObjectId');

This is preferred over using awk, which would have its own portability
issues (Windows for starters).

While I'm thinking out loud, it might be worthwhile to patch genbki.pl
for the duplicate test, since they're run at the same time anyway (see
catalog/Makefile), and we've already read all the data.

-John Naylor

New version is attached. I've put iterator into Catalog.pm to avoid copy-pasting
it over two scripts. Also instead of greping through *.dat and *.h files I've
used parsers provided in Catalog module. That way should be more sustainable
to format changes.

Anyone who wanted to help with scripts -- feel free to rewrite.

Attachments:

0001-Rewrite-unused_oids-in-perl.patchapplication/octet-stream; name=0001-Rewrite-unused_oids-in-perl.patch; x-unix-mode=0644Download+85-49
#11John Naylor
john.naylor@enterprisedb.com
In reply to: Tom Lane (#3)
Re: unused_oids script is broken with bsd sed

On 4/25/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

+many for rewriting in perl. Do you want to have a go at that? If not I
will.

+1 ... I was mildly astonished that this didn't already have to happen
as part of the bootstrap data conversion effort. It's certainly not
hard to imagine future extensions to the .dat file format that would
break this script, and duplicate_oids too. I think we should rewrite
both of them to use the Catalog.pm infrastructure.

If we're going to use Catalog.pm for that, it seems more convenient to
expose toast and index oids directly rather than in strings formatted
specifically for the bki file, as in the attached. Thoughts?

-John Naylor

Attachments:

expose-toast-and-index-oids.patchtext/x-patch; charset=US-ASCII; name=expose-toast-and-index-oids.patchDownload+14-20
#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Naylor (#11)
Re: unused_oids script is broken with bsd sed

John Naylor <jcnaylor@gmail.com> writes:

On 4/25/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think we should rewrite
both of them to use the Catalog.pm infrastructure.

If we're going to use Catalog.pm for that, it seems more convenient to
expose toast and index oids directly rather than in strings formatted
specifically for the bki file, as in the attached. Thoughts?

Good idea, I like this better than what Stas did in his patch.

I'm a bit inclined to preserve the old variable names (toast_name,
toast_oid etc) as the hash keys because they seem more greppable
than the generic "oid", "name" that you have here. But maybe that
isn't an interesting consideration.

regards, tom lane

#13John Naylor
john.naylor@enterprisedb.com
In reply to: Stas Kelvich (#10)
Re: unused_oids script is broken with bsd sed

On 4/26/18, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

New version is attached. I've put iterator into Catalog.pm to avoid
copy-pasting
it over two scripts. Also instead of greping through *.dat and *.h files
I've
used parsers provided in Catalog module. That way should be more
sustainable
to format changes.

Anyone who wanted to help with scripts -- feel free to rewrite.

Looks pretty good. Some comments:

Just after you posted, I sent a patch that tweaks the API of
Catalog.pm for toast and index oids. If you use that API in your
patch, you can get rid of the extra bookkeeping you added for those
oids.

The original scripts skipped the relation and rowtype oids for
bootstrap catalogs. You've replaced that with two data-level tests for
pg_class plus pg_type composite types. I think the original test was a
bit cleaner in this regard.

For this type of call:

+my @oids = @{ Catalog::FindAllOidsFromHeaders(@input_files) };
+foreach my $oid (@oids)

this style is used by other callers of the module:

+my $oids = Catalog::FindAllOidsFromHeaders(@input_files);
+foreach my $oid (@{ $oids })

For those following along, these scripts still assume we're in the
catalog directory. I can hack on that part tomorrow if no one else
has.

-John Naylor

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Naylor (#13)
Re: unused_oids script is broken with bsd sed

John Naylor <jcnaylor@gmail.com> writes:

Just after you posted, I sent a patch that tweaks the API of
Catalog.pm for toast and index oids. If you use that API in your
patch, you can get rid of the extra bookkeeping you added for those
oids.

I've adjusted these two patches to play together, and pushed them.

The original scripts skipped the relation and rowtype oids for
bootstrap catalogs. You've replaced that with two data-level tests for
pg_class plus pg_type composite types. I think the original test was a
bit cleaner in this regard.

Yeah, I thought so too. Changed.

For those following along, these scripts still assume we're in the
catalog directory. I can hack on that part tomorrow if no one else
has.

I didn't touch this point.

I notice that duplicate_oids is now a good factor of 10 slower than it was
before (~0.04 sec to ~0.4 sec, on my machine). While this doesn't seem
like a problem for manual use, it seems annoying as part of the standard
build process, especially on slower buildfarm critters. I think we should
do what you said upthread and teach genbki.pl to complain about duplicate
oids, so that we can remove duplicate_oids from the build process.

I went ahead and pushed things as-is so we can confirm that duplicate_oids
has no portability issues that the buildfarm could find. But let's try
to get the build process change in place after a day or so.

regards, tom lane

#15John Naylor
john.naylor@enterprisedb.com
In reply to: Tom Lane (#14)
Re: unused_oids script is broken with bsd sed

On 4/26/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:

John Naylor <jcnaylor@gmail.com> writes:

For those following along, these scripts still assume we're in the
catalog directory. I can hack on that part tomorrow if no one else
has.

I didn't touch this point.

Patch 0002 is my attempt at this. Not sure how portable this is. It's
also clunkier than before in that you need to tell it where to find
Catalog.pm, since it seems Perl won't compile unless "use lib" points
to a string and not an expression. Suggestions welcome. I also edited
the boilerplate comments.

I notice that duplicate_oids is now a good factor of 10 slower than it was
before (~0.04 sec to ~0.4 sec, on my machine). While this doesn't seem
like a problem for manual use, it seems annoying as part of the standard
build process, especially on slower buildfarm critters.

If you're concerned about build speed, I think the generated *_d.h
headers cause a lot more files to be rebuilt than before. I'll think
about how to recover that.

I think we should
do what you said upthread and teach genbki.pl to complain about duplicate
oids, so that we can remove duplicate_oids from the build process.

I went ahead and pushed things as-is so we can confirm that duplicate_oids
has no portability issues that the buildfarm could find. But let's try
to get the build process change in place after a day or so.

Patch 0001
-moves the compile-time test into genbki.pl
-removes a usability quirk from unused_oids
-removes duplicate_oids (not sure if you intended this, but since
unused_oids has already been committed to report duplicates, we may as
well standardize on that) and cleans up after that fact
-updates the documentation.

-John Naylor

Attachments:

0001-Remove-standalone-script-duplicate_oids.patchtext/x-patch; charset=US-ASCII; name=0001-Remove-standalone-script-duplicate_oids.patchDownload+80-93
0002-Remove-requirement-to-call-unused_oids-from-its-dire.patchtext/x-patch; charset=US-ASCII; name=0002-Remove-requirement-to-call-unused_oids-from-its-dire.patchDownload+28-15
#16Andreas Karlsson
andreas.karlsson@percona.com
In reply to: John Naylor (#15)
Re: unused_oids script is broken with bsd sed

On 04/26/2018 01:37 PM, John Naylor wrote:> Patch 0002 is my attempt at
this. Not sure how portable this is. It's

also clunkier than before in that you need to tell it where to find
Catalog.pm, since it seems Perl won't compile unless "use lib" points
to a string and not an expression. Suggestions welcome. I also edited
the boilerplate comments.

What about using FindBin (http://perldoc.perl.org/FindBin.html)?

Andreas

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andreas Karlsson (#16)
Re: unused_oids script is broken with bsd sed

Andreas Karlsson <andreas@proxel.se> writes:

What about using FindBin (http://perldoc.perl.org/FindBin.html)?

A quick check suggests that that's present in the core perl
distribution pretty far back, so I think it'll work.
Question is, is it any better than John's "dirname(__FILE__)"
solution?

regards, tom lane

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#17)
Re: unused_oids script is broken with bsd sed

I wrote:

Andreas Karlsson <andreas@proxel.se> writes:

What about using FindBin (http://perldoc.perl.org/FindBin.html)?

A quick check suggests that that's present in the core perl
distribution pretty far back, so I think it'll work.
Question is, is it any better than John's "dirname(__FILE__)"
solution?

I experimented a bit, and found that actually what we seem to
need is FindBin::RealBin. That's the only one of these solutions
that gives the right answer if someone's put a symlink to
unused_oids or duplicate_oids into their PATH --- the other ones
give you the directory containing the symlink. Now, maybe that
would be an uncommon usage, but it hardly seems out of the question.

regards, tom lane

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Naylor (#15)
Re: unused_oids script is broken with bsd sed

John Naylor <jcnaylor@gmail.com> writes:

On 4/26/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I notice that duplicate_oids is now a good factor of 10 slower than it was
before (~0.04 sec to ~0.4 sec, on my machine). While this doesn't seem
like a problem for manual use, it seems annoying as part of the standard
build process, especially on slower buildfarm critters.

If you're concerned about build speed, I think the generated *_d.h
headers cause a lot more files to be rebuilt than before. I'll think
about how to recover that.

Personally, I use ccache which doesn't seem to care too much, but I agree
than in some usages, extra touches of headers would be costly. Perhaps
it's worth adding logic to avoid overwriting an existing output file
unless it changed? I'm not sure; that would cost something too.

-removes duplicate_oids (not sure if you intended this, but since
unused_oids has already been committed to report duplicates, we may as
well standardize on that) and cleans up after that fact

I don't think anyone's asked for duplicate_oids to be removed altogether,
and I'm afraid that doing so would break existing workflows. For that
matter, now that I think about it, changing the behavior of unused_oids
as we did yesterday was ill-considered as well, so I undid that part.

I've pushed the current-directory-independence change by itself so that
we can see if any buildfarm members think it's problematic. (They'll
still be testing duplicate_oids, as things stand.) Once we have some
confirmation on the FindBin stuff actually being portable, I'll push
the changes to make genbki do duplicate checks and remove duplicate_oids
from the build sequence.

regards, tom lane

#20Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#19)
Re: unused_oids script is broken with bsd sed

On Thu, Apr 26, 2018 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Personally, I use ccache which doesn't seem to care too much, but I agree
than in some usages, extra touches of headers would be costly. Perhaps
it's worth adding logic to avoid overwriting an existing output file
unless it changed? I'm not sure; that would cost something too.

It seems like a good idea to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#20)
#22Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#21)
#23Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#21)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#22)
#25Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#21)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#25)
#27John Naylor
john.naylor@enterprisedb.com
In reply to: Tom Lane (#26)