Uh-oh: documentation PDF output no longer builds in HEAD

Started by Tom Lane, about 10 years ago, 13 messages

tgl@sss.pgh.pa.us

about 10 years ago

$ cd doc/src/sgml
$ make postgres-US.pdf
... lots of crap later ...

[3253.0.51
! TeX capacity exceeded, sorry [number of strings=245828].
<to be read again>
\endgroup \set@typeset@protect
l.1879198 {1}}
\Node%
! ==> Fatal error occurred, no output PDF file produced!
Transcript written on postgres-US.log.
make: *** [postgres-US.pdf] Error 1
rm postgres-US.tex-pdf

The A4-format PDF still builds, which implies that this has something to
do with the number of pages produced. The 9.5 branch also still builds,
but it seems highly likely that it is within a few pages of failing as
well. (HEAD is only about a dozen pages longer than 9.5 in A4 format,
and presumably the difference is around that for US format.)

We ran into a very similar issue back around 9.0, and solved it with an
ugly style-sheet hack, see thread here:
/messages/by-id/1270189232.5018.7.camel@hp-laptop2.gunduz.org

As noted then, and as I reconfirmed just now, you can *not* fix this by
hacking TeX's parameters: there is a hard-wired limit of 2^18 strings
regardless of what you try to set in texmf.cnf.

Thoughts?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Andres Freund

andres@anarazel.de

about 10 years ago

In reply to: Tom Lane (#1)

Re: Uh-oh: documentation PDF output no longer builds in HEAD

On 2015-11-08 13:34:18 -0500, Tom Lane wrote:

$ cd doc/src/sgml
$ make postgres-US.pdf
... lots of crap later ...
! TeX capacity exceeded, sorry [number of strings=245828].

We ran into a very similar issue back around 9.0, and solved it with an
ugly style-sheet hack, see thread here:
/messages/by-id/1270189232.5018.7.camel@hp-laptop2.gunduz.org

As noted then, and as I reconfirmed just now, you can *not* fix this by
hacking TeX's parameters: there is a hard-wired limit of 2^18 strings
regardless of what you try to set in texmf.cnf.

Although it takes just short of forever, postgres-US.pdf seems to build
on my Debian unstable as of 8d7396e509 plus some additional docs. Does
this depend on which version of TeX you're using (plain tex, pdftex,
xetex, whatnot)?

postgres-US.log contains:
360764 strings out of 481710
2617927 string characters out of 6028023
857532 words of memory out of 5085000
252961 multiletter control sequences out of 15000+600000
101035 words of font info for 156 fonts, out of 8000000 for 9000
36 hyphenation exceptions out of 8191

So at least Debian's version of TeX seems to have worked around the
limit somehow. I found only one interesting-looking setting in the
relevant config files:

%%
%% jacking up TeX settings for the unique uses of jadetex
%%
extra_mem_bot.jadetex = 85000
extra_mem_bot.pdfjadetex = 85000
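
For anyone comparing installations, the usage summary that TeX writes at
the end of postgres-US.log can be extracted mechanically. The following is
a minimal Python sketch, not part of the PostgreSQL build; the regular
expression is an assumption based only on the two line shapes quoted
above, and parse_usage is a hypothetical helper name.

```python
import re

# Matches lines such as "360764 strings out of 481710".
# Assumption: the pattern is derived only from the log excerpt in this thread.
USAGE_RE = re.compile(r"^\s*(\d+) (strings|string characters) out of (\d+)\s*$")


def parse_usage(log_text):
    """Return {resource: (used, limit)} for the lines the pattern recognizes."""
    usage = {}
    for line in log_text.splitlines():
        m = USAGE_RE.match(line)
        if m:
            usage[m.group(2)] = (int(m.group(1)), int(m.group(3)))
    return usage


# The two relevant lines from the Debian log quoted above.
sample = """\
 360764 strings out of 481710
 2617927 string characters out of 6028023
"""

usage = parse_usage(sample)
for name, (used, limit) in usage.items():
    print(f"{name}: {used}/{limit} ({used / limit:.1%} used)")
```

Running the same helper over the logs from a failing and a succeeding
installation shows at a glance which one has the larger string pool.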

Andres


Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Andres Freund (#2)

Re: Uh-oh: documentation PDF output no longer builds in HEAD

andres@anarazel.de (Andres Freund) writes:

Although it takes just short of forever, postgres-US.pdf seems to build
on my Debian unstable as of 8d7396e509 plus some additional docs. Does
this depend on which version of TeX you're using (plain tex, pdftex,
xetex, whatnot)?

postgres-US.log contains:
360764 strings out of 481710

Interesting. They must have boosted the strings limit from 2^18 to 2^19.
According to what I've read, this is doable when compiling TeX from
source, but we can hardly expect users (or packagers) to do that if their
distribution hasn't built it that way. (The 2^18 limit I'm seeing is with
RHEL6's tex package. I'm currently downloading the Fedora rawhide package
to see if it's any better, but man that is one large package...)
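
A quick check of the numbers in this thread against the two powers of two
mentioned here. This is only arithmetic on the figures quoted above; the
reason each reported capacity sits somewhat below the power of two is not
stated in the thread, and the guess in the comment (strings already taken
by the preloaded format) is an assumption.

```python
# Figures quoted in this thread.
RHEL6_REPORTED = 245828    # "number of strings=245828" in the failing build
DEBIAN_REPORTED = 481710   # "360764 strings out of 481710" in the Debian log


def gap_below_power_of_two(reported, exponent):
    """How far the reported capacity sits below 2**exponent."""
    return 2 ** exponent - reported


# Assumption: the gap is the number of strings already consumed before the
# document is read (for example by the preloaded format file).
print(2 ** 18, gap_below_power_of_two(RHEL6_REPORTED, 18))   # 262144 16316
print(2 ** 19, gap_below_power_of_two(DEBIAN_REPORTED, 19))  # 524288 42578
```

Both reported capacities fall just under a power of two, which is at least
consistent with the 2^18 versus 2^19 explanation.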

BTW, I realized after poking around that the hack I put in back in 9.0
probably only eliminates about 5000 strings from the pool, because
it should save one string per \pagelabel entry added to the .aux
file, and there are less than 5000 such entries after a successful
build. So that was a good quick-n-dirty fix but it's really only
scratching the surface of the problem: there are ~240000 other strings
getting made somewhere. I wonder if a better answer is possible.

regards, tom lane


Andres Freund

andres@anarazel.de

about 10 years ago

In reply to: Tom Lane (#3)

Re: Uh-oh: documentation PDF output no longer builds in HEAD

On 2015-11-08 16:29:56 -0500, Tom Lane wrote:

andres@anarazel.de (Andres Freund) writes:

Although it takes just short of forever, postgres-US.pdf seems to build
on my Debian unstable as of 8d7396e509 plus some additional docs. Does
this depend on which version of TeX you're using (plain tex, pdftex,
xetex, whatnot)?

postgres-US.log contains:
360764 strings out of 481710

Interesting. They must have boosted the strings limit from 2^18 to 2^19.
According to what I've read, this is doable when compiling TeX from
source, but we can hardly expect users (or packagers) to do that if their
distribution hasn't built it that way. (The 2^18 limit I'm seeing is with
RHEL6's tex package.

Debian uses pdflatex from texlive-base (2015.20151016-1). Maybe that's
the relevant difference.

I'm currently downloading the Fedora rawhide package
to see if it's any better, but man that is one large package...)

It indeed is. I've not found any relevant patches in debian's
package. Lots of changing paths and defaults, but afaics nothing else.

BTW, I realized after poking around that the hack I put in back in 9.0
probably only eliminates about 5000 strings from the pool, because
it should save one string per \pagelabel entry added to the .aux
file, and there are less than 5000 such entries after a successful
build. So that was a good quick-n-dirty fix but it's really only
scratching the surface of the problem: there are ~240000 other strings
getting made somewhere. I wonder if a better answer is possible.

Debian's pdfjadetex package has the following comment in
README.jadetex.cfg:

* Not Labelling Elements

In some cases, it is possible for pdfjadetex to error out even with
expanded texmf.cnf settings. The sign of this is that jadetex is able
to process the file, but pdfjadetex isn't. The upstream maintainer,
Sebastian Rahtz, had this to say:

| pdfjadetex _can_ go over a string limit in TeX
| which *isn't* changeable in texmf.cnf. The workaround is to write a
| file called jadetex.cfg, containing just the line
|
| \LabelElementsfalse
|
| and see if that helps. it stops jadetex from using up a string for
| every element. If that leaves unsatisfied cross-references, try
| "jadetex" instead of "pdfjadetex", and create your PDF in
| another way (ie via Distiller).

Might be worthwhile to see whether \LabelElementsfalse makes a huge
difference.
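
For anyone who wants to try the README's suggestion, the file is a
one-liner. A minimal Python sketch follows; the helper name
write_jadetex_cfg is hypothetical, and placing the file in the directory
where the build is run so that jadetex picks it up is an assumption, not
something the README excerpt spells out.

```python
import tempfile
from pathlib import Path

# The single line the README excerpt says the file should contain.
CFG_LINE = r"\LabelElementsfalse"


def write_jadetex_cfg(build_dir):
    """Create jadetex.cfg in build_dir containing only CFG_LINE."""
    cfg = Path(build_dir) / "jadetex.cfg"
    cfg.write_text(CFG_LINE + "\n")
    return cfg


# Demonstration in a throwaway directory; a real test would target the
# directory the PDF build runs in (assumed, see above).
with tempfile.TemporaryDirectory() as tmp:
    path = write_jadetex_cfg(tmp)
    print(path.read_text(), end="")
```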


Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Andres Freund (#4)

Re: Uh-oh: documentation PDF output no longer builds in HEAD

andres@anarazel.de (Andres Freund) writes:

In some cases, it is possible for pdfjadetex to error out even with
expanded texmf.cnf settings. The sign of this is that jadetex is able
to process the file, but pdfjadetex isn't. The upstream maintainer,
Sebastian Rahtz, had this to say:

| pdfjadetex _can_ go over a string limit in TeX
| which *isn't* changeable in texmf.cnf. The workaround is to write a
| file called jadetex.cfg, containing just the line
|
| \LabelElementsfalse

Interesting. That seems to be a slightly more aggressive version of my
9.0-era hack: it effectively turns FlowObjectSetup into a no-op that won't
generate page labels at all, saving *both* of the strings it would create
not only one. However, I'm afraid that's not gonna do: it looks like it
turns a large fraction of the index entries from page numbers into "??".
And some of the table-of-contents entries as well. (It looks like maybe
only things with explicit id= entries get correct page number data with
this setting. We could maybe live with that if the tool threw an error
about missing ids; but it doesn't, it just emits "??" ...)

Curiously though, that gets us down to this:

30615 strings out of 245828
397721 string characters out of 1810780

which implies that indeed FlowObjectSetup *is* the cause of most of
the strings being entered. I'm not sure how that squares with the
observation that there are less than 5000 \pagelabel entries in the
postgres-US.aux file. Time for more digging.
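
The size of the puzzle can be put in numbers. The sketch below is plain
arithmetic on the figures quoted in this message; it assumes that the
only difference between the two runs is the \LabelElementsfalse setting,
and since the normal build overflowed the limit, the result is a lower
bound rather than an exact count.

```python
STRING_LIMIT = 245828          # capacity reported by the failing build
WITH_LABELS_DISABLED = 30615   # strings used with \LabelElementsfalse
PAGELABEL_ENTRIES_MAX = 5000   # "less than 5000 \pagelabel entries"

# The normal build ran past STRING_LIMIT, so labelling must account for at
# least this many strings.
min_label_strings = STRING_LIMIT - WITH_LABELS_DISABLED
print(min_label_strings)  # 215213

# Lower bound on strings consumed per \pagelabel entry actually written.
print(min_label_strings / PAGELABEL_ENTRIES_MAX)
```

At two strings per label, fewer than 5000 \pagelabel entries would explain
only about 10000 strings, so the bulk of the consumption has to come from
something done for objects that never get a \pagelabel entry.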

regards, tom lane


Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Tom Lane (#5)

1 attachment(s)

Re: Uh-oh: documentation PDF output no longer builds in HEAD

I wrote:

Curiously though, that gets us down to this:
30615 strings out of 245828
397721 string characters out of 1810780
which implies that indeed FlowObjectSetup *is* the cause of most of
the strings being entered. I'm not sure how that squares with the
observation that there are less than 5000 \pagelabel entries in the
postgres-US.aux file. Time for more digging.

Well, after much digging, I've found what seems a workable answer.
It turns out that the original form of FlowObjectSetup is just
unbelievably awful when it comes to handling of hyperlink anchors:
it will put a hyperlink anchor into the PDF for every "flow object",
that is, everything in the document that could possibly have a link
to it, whether or not it actually is linked to. And aside from bloating
the PDF file, it turns out that the hyperlink stuff also consumes some
control sequence names, which is why we're running out of strings.

There already is logic (probably way older than the hyperlink code)
in jadetex to avoid generating page-number labels for objects that have
no cross-references. So what I did to fix this was to piggyback on
that code: with the attached jadetex.cfg, both a page-number label
and a hyperlink anchor will be generated for all and only those flow
objects that have either a page-number reference or a hyperlink reference.
(We could try to separate those things, but then we'd need two control
sequence names not one per object for tracking purposes, and anyway many
objects will have both kinds of reference if they have either.)

This gets us down to ~135000 strings to build HEAD, and not incidentally,
the resulting PDF is about half the size it was before. I think I've
also fixed a number of formerly unexplainable broken hyperlinks in the
PDF; some are still broken, but they were that way before. (It looks
like <xref> with endterm doesn't work very well in jadetex; all the
remaining bad links seem to be associated with uses of that.)

Barring objection I'll commit this tomorrow. I'm inclined to back-patch
it at least into 9.5, maybe further, because I'm afraid we may be closer
than we realized to exceeding the strings limit in the back branches too.
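
The selection rule described above can be modelled in a few lines. This is
a toy Python model of the idea only, not the TeX macros in the attached
jadetex.cfg; the object ids and function names are made up for
illustration.

```python
def anchors_unconditional(flow_objects):
    """Original behaviour as described: one anchor per flow object."""
    return set(flow_objects)


def anchors_referenced_only(flow_objects, page_refs, link_refs):
    """Described fix: label and anchor only objects that have a
    page-number reference or a hyperlink reference."""
    referenced = set(page_refs) | set(link_refs)
    return {obj for obj in flow_objects if obj in referenced}


# Hypothetical ids, purely for illustration.
flow_objects = ["sect-a", "para-1", "para-2", "table-x", "para-3"]
page_refs = ["sect-a"]
link_refs = ["sect-a", "table-x"]

before = anchors_unconditional(flow_objects)
after = anchors_referenced_only(flow_objects, page_refs, link_refs)
print(len(before), len(after))  # 5 2
```

Using one "is referenced" flag per object, rather than separate flags for
page references and hyperlinks, mirrors the trade-off mentioned above: it
may emit an anchor that is not strictly needed, but it needs only one
tracking name per object.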

regards, tom lane

Andres Freund

andres@anarazel.de

about 10 years ago

In reply to: Tom Lane (#6)

Re: Uh-oh: documentation PDF output no longer builds in HEAD

On 2015-11-09 19:46:37 -0500, Tom Lane wrote:

Well, after much digging, I've found what seems a workable answer.
It turns out that the original form of FlowObjectSetup is just
unbelievably awful [...].

This gets us down to ~135000 strings to build HEAD, and not incidentally,
the resulting PDF is about half the size it was before. I think I've
also fixed a number of formerly unexplainable broken hyperlinks in the
PDF; some are still broken, but they were that way before. (It looks
like <xref> with endterm doesn't work very well in jadetex; all the
remaining bad links seem to be associated with uses of that.)

Nice work. On an ugly subject.

Barring objection I'll commit this tomorrow. I'm inclined to back-patch
it at least into 9.5, maybe further, because I'm afraid we may be closer
than we realized to exceeding the strings limit in the back branches too.

+1 for doing this in 9.5+. I think we will probably want this in all
branches at some point. I don't have a strong opinion on whether we want
to let this mature in 9.5 or not.

Greetings,

Andres Freund


Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Tom Lane (#6)

Re: Uh-oh: documentation PDF output no longer builds in HEAD

On Mon, Nov 9, 2015 at 7:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, after much digging, I've found what seems a workable answer.
It turns out that the original form of FlowObjectSetup is just
unbelievably awful [...].

I am in awe.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Magnus Hagander

magnus@hagander.net

about 10 years ago

In reply to: Tom Lane (#6)

Re: Uh-oh: documentation PDF output no longer builds in HEAD

On Tue, Nov 10, 2015 at 1:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, after much digging, I've found what seems a workable answer.
[...]

This gets us down to ~135000 strings to build HEAD, and not incidentally,
the resulting PDF is about half the size it was before. [...]

Impressive, indeed.

When you say it's half the size - is that half the size of the preprocessed
PDF or is it also after the stuff we do on the website PDFs using
jpdftweak? IIRC that tweak is only there to deal with the size, and
specifically it deals with "bookmarks" which sounds a lot like this...

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#10

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Magnus Hagander (#9)

Re: Uh-oh: documentation PDF output no longer builds in HEAD

Magnus Hagander <magnus@hagander.net> writes:

When you say it's half the size - is that half the size of the preprocessed
PDF or is it also after the stuff we do on the website PDFs using
jpdftweak? IIRC that tweak is only there to deal with the size, and
specifically it deals with "bookmarks" which sounds a lot like this...

I'm just looking at the size of the file produced by "make postgres-A4.pdf".

I don't know anything about jpdftweak, but if it's being used to get rid
of unreferenced hyperlink anchors, maybe we could dispense with that step
after this goes in.

regards, tom lane


#11

Magnus Hagander

magnus@hagander.net

about 10 years ago

In reply to: Tom Lane (#10)

Re: Uh-oh: documentation PDF output no longer builds in HEAD

On Tue, Nov 10, 2015 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Magnus Hagander <magnus@hagander.net> writes:

When you say it's half the size - is that half the size of the preprocessed
PDF or is it also after the stuff we do on the website PDFs using
jpdftweak? IIRC that tweak is only there to deal with the size, and
specifically it deals with "bookmarks" which sounds a lot like this...

I'm just looking at the size of the file produced by "make
postgres-A4.pdf".

I don't know anything about jpdftweak, but if it's being used to get rid
of unreferenced hyperlink anchors, maybe we could dispense with that step
after this goes in.

Yeah, that's what I was hoping. You can see how it's used in the
mk-release-pdfs script on borka. It doesn't explain why we're doing it,
that's probably in the list archives somewhere, but it does show what we
do. So it should be easy enough to see if the benefit goes away :) (it
would also be nice for build times - that jpdftweak step is very, very slow)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#12

Alvaro Herrera

alvherre@2ndquadrant.com

about 10 years ago

In reply to: Magnus Hagander (#11)

Re: Uh-oh: documentation PDF output no longer builds in HEAD

Magnus Hagander wrote:

On Tue, Nov 10, 2015 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Magnus Hagander <magnus@hagander.net> writes:

When you say it's half the size - is that half the size of the preprocessed
PDF or is it also after the stuff we do on the website PDFs using
jpdftweak? IIRC that tweak is only there to deal with the size, and
specifically it deals with "bookmarks" which sounds a lot like this...

I'm just looking at the size of the file produced by "make
postgres-A4.pdf".

I don't know anything about jpdftweak, but if it's being used to get rid
of unreferenced hyperlink anchors, maybe we could dispense with that step
after this goes in.

Yeah, that's what I was hoping. You can see how it's used in the
mk-release-pdfs script on borka. It doesn't explain why we're doing it,
that's probably in the list archives somewhere, but it does show what we
do. So it should be easy enough to see if the benefit goes away :) (it
would also be nice for build times - that jpdftweak step is very, very slow)

/messages/by-id/1284678175.2459.21.camel@hp-laptop2.gunduz.org

/messages/by-id/AANLkTi=3bkqc3ScM5Y==NPeY0_4uLFy+yGD9=GJ-NMZB@mail.gmail.com

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#13

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Magnus Hagander (#11)

Re: Uh-oh: documentation PDF output no longer builds in HEAD

Magnus Hagander <magnus@hagander.net> writes:

On Tue, Nov 10, 2015 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I don't know anything about jpdftweak, but if it's being used to get rid
of unreferenced hyperlink anchors, maybe we could dispense with that step
after this goes in.

Yeah, that's what I was hoping. You can see how it's used in the
mk-release-pdfs script on borka.

Hmm ... building current HEAD in A4 format, I get

                          HEAD      with my patch

initially generated PDF:  26.30MB   13.25MB
after jpdftweak:          7.24MB    7.24MB

Evidently, jpdftweak *is* removing unreferenced bookmarks --- the output
file sizes would not be so close to identical otherwise. But it's
evidently doing more than that. The initially generated PDFs are
fairly compressible by "gzip", while jpdftweak's outputs are not, so
I suppose that the additional savings come from applying compression.
jpdftweak's help output indicates that the "-ocs" options are selecting
aggressive compression.
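
The gzip observation can be reproduced as a rough test of whether a file
has already been compressed. A minimal Python sketch follows, using zlib
(the same DEFLATE algorithm that gzip uses); the helper names are
hypothetical and the sample data are synthetic stand-ins rather than real
PDF output.

```python
import os
import zlib


def compression_ratio(data):
    """Compressed size divided by original size (lower = more compressible)."""
    if not data:
        raise ValueError("empty input")
    return len(zlib.compress(data, 9)) / len(data)


def file_compression_ratio(path):
    """Same measurement for a file on disk, e.g. a generated PDF."""
    with open(path, "rb") as f:
        return compression_ratio(f.read())


# Synthetic stand-ins: highly repetitive text versus random bytes, which
# behave like uncompressed and already-compressed content respectively.
repetitive = b"0 0 obj << /Type /Annot >> endobj\n" * 10000
random_like = os.urandom(len(repetitive))

print(f"repetitive:  {compression_ratio(repetitive):.3f}")
print(f"random-like: {compression_ratio(random_like):.3f}")
```

A ratio close to 1 suggests the content streams are already compressed, as
appears to be the case for jpdftweak's output; a much smaller ratio
suggests there is still size to be saved by compression alone.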

I tried removing the load/save bookmarks steps from mk-release-pdfs,
but what I get is files that are a little smaller yet and again almost the
same size for either starting point; that probably means that jpdftweak's
default behavior is to strip all bookmarks :-(.

Maybe we could look around for another tool that just does PDF compression
and not the other stuff ...

regards, tom lane

