7.0.2 issues / Geocrawler
Well, I took Thomas' advice and upgraded to 7.0.2 from source.tar.gz.
For some reason, I cannot create the following index:
db_geocrawler=# DROP INDEX idx_mail_archive_list_subject;
ERROR: index "idx_mail_archive_list_subject" nonexistent
db_geocrawler=# CREATE INDEX "idx_mail_archive_list_subject" on
"tbl_mail_archive" using btree ( "fld_mail_
list" "int4_ops", "fld_mail_subject" "text_ops" );
ERROR: cannot create idx_mail_archive_list_subject
[root@geocrawler db_geocrawler]# rm -f idx_mail_archive_list_subject
That removes the physical file on disk, so I can then try to create it
again. If I then issue the SQL command, postgres accepts it and it runs
forever, never creating more than an 8192 byte file.
If you watch your process list:
[root@geocrawler db_geocrawler]# ps ax
PID TTY STAT TIME COMMAND
457 ? SW 0:00 [postmaster]
1419 ? R 1:34 [postmaster]
Eventually, the psql connection disappears from the process list and I
get strange postmaster processes running (see above).
After that, I get this error from psql:
ERROR: btree: index item size 2820 exceeds maximum 2717
Any way to tell where that item is at?
From the pgserver.log file:
DEBUG: Data Base System is starting up at Tue Jul 11 16:59:33 2000
DEBUG: Data Base System was interrupted being in production at Tue Jul
11 15:47:04 2000
DEBUG: Data Base System is in production state at Tue Jul 11 16:59:33
2000
...Doesn't give me much to go on.
I'm really at wit's end - I've spent over two days trying to rebuild
Geocrawler.
Next step is reformatting the hard disk and reinstalling postgres 6.4.2.
Tim
--
Founder - PHPBuilder.com / Geocrawler.com
Lead Developer - SourceForge
VA Linux Systems
408-542-5723
Tim Perdue wrote:
...
After that, I get this error from psql:
ERROR: btree: index item size 2820 exceeds maximum 2717
Any way to tell where that item is at?
I've been wondering at the state of the problems you've been
having with PostgreSQL and wondering why I haven't experienced
the same. I think this may very well be it. Earlier versions of
PostgreSQL allowed for the creation of indexes on fields whose
length would not permit at least two entries per index page. 95% of
the time, things would work fine, but the other 5% of the time you
would get corrupted data.
Before creating the index:
SELECT * FROM tbl_mail_archive WHERE Length(fld_mail_subject) >
2700;
will get you the list of records which cannot be indexed. You're
attempting to create a multi-key index so I would truncate (or
delete) any record whose fld_mail_subject is > 2700:
UPDATE tbl_mail_archive SET fld_mail_subject =
SubStr(fld_mail_subject, 1, 2700);
At this point, your index creation should be relatively quick
(and successful) depending upon how many rows you have. I have a
few tables with ~2 million rows that take about 5 - 10 minutes
(with fsync off, naturally) to index. I would also recommend
letting PostgreSQL determine the correct "ops":
CREATE INDEX idx_mail_archive_list_subject
ON tbl_mail_archive (fld_mail_list, fld_mail_subject);
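The whole cleanup sequence can be sketched end to end. The following is a minimal Python stand-in using an in-memory SQLite database rather than the PostgreSQL 7.0.2 instance discussed here; the table and column names come from the thread, and the 2700-character cutoff is the thread's number, not a universal constant:

```python
import sqlite3

# Stand-in for the Geocrawler table (SQLite, not PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tbl_mail_archive (
    fld_mail_list INTEGER,
    fld_mail_subject TEXT)""")
conn.executemany(
    "INSERT INTO tbl_mail_archive VALUES (?, ?)",
    [(1, "short subject"), (1, "x" * 3000)])  # one overlong subject

# Step 1: find rows whose subject would overflow a btree index page.
overlong = conn.execute(
    "SELECT COUNT(*) FROM tbl_mail_archive "
    "WHERE Length(fld_mail_subject) > 2700").fetchone()[0]

# Step 2: truncate them, as suggested above.
conn.execute(
    "UPDATE tbl_mail_archive SET fld_mail_subject = "
    "SubStr(fld_mail_subject, 1, 2700) "
    "WHERE Length(fld_mail_subject) > 2700")

# Step 3: the index creation can now succeed.
conn.execute(
    "CREATE INDEX idx_mail_archive_list_subject "
    "ON tbl_mail_archive (fld_mail_list, fld_mail_subject)")

max_len = conn.execute(
    "SELECT MAX(Length(fld_mail_subject)) FROM tbl_mail_archive"
).fetchone()[0]
print(overlong, max_len)  # 1 2700
```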
Without following the lists every day, most people wouldn't know
about this issue. I'm surprised it took so long for PostgreSQL
7.0.2 to bail on the index creation though. Is this a
particularly large table? At any rate, this is an example of a
bug which *would* allow for the kinds of corruption you've seen
in the past that has been addressed in 7.0.2, as Tom Lane crushed
them by the hundreds. If you can:
psql db_geocrawler < 6_4dump.txt
and it never bails, then you know all your data is "clean". Until
that point, any index you have on a "text" datatype is subject to
similar problems.
Hope that helps,
Mike Mascari
This is a *big* help.
Yes, the table is approx 10-12GB in size and running your length() and
update queries is going to take a lifetime, since it will require a
calculation on 4 million rows.
This doesn't address the serious performance problem I'm finding in
7.0.2 for a multi-key select/order by/limit/offset query, which I sent
in a separate email.
Tim
Mike Mascari wrote:
...
Tim Perdue wrote:
...
This doesn't address the serious performance problem I'm finding in
7.0.2 for a multi-key select/order by/limit/offset query, which I sent
in a separate email.
If I recall correctly, Marc experienced similar performance
differences with UDM search after upgrading. The optimizer was
redesigned to be smarter about using indexes with both order by
and limit. Tom Lane, of course, knows all there is to know on
this. All I can ask is standard issue precursor to optimizer
questions:
Have you VACUUM ANALYZE'd the table(s) in question?
If so, hopefully Tom Lane can comment.
Sorry I couldn't be more help,
Mike Mascari
Mike Mascari wrote:
Have you VACUUM ANALYZE'd the table(s) in question?
Yes, they've been vacuum analyze'd and re-vacuum analyze'd to death.
Also added some extra indexes that I don't really need just to see if
that helps.
Tim
On Wed, Jul 12, 2000 at 06:17:23AM -0700, Tim Perdue wrote:
Mike Mascari wrote:
Have you VACUUM ANALYZE'd the table(s) in question?
Yes, they've been vacuum analyze'd and re-vacuum analyze'd to death.
Also added some extra indexes that I don't really need just to see if
that helps.
Tim, why are you building a multikey index, especially one containing a
large text field? It's almost never a win to index a text field, unless
all the WHERE clauses that use it are either anchored to the beginning
of the field, or are equality tests (in which case, the field is really
an enumerated type, masquerading as a text field).
A multikey index is only useful for a very limited set of queries. Here's
a message from last August, where Tom Lane talks about that:
http://www.postgresql.org/mhonarc/pgsql-sql/1999-08/msg00145.html
Ross
--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005
"Ross J. Reedstrom" wrote:
Tim, why are you building a multikey index, especially one containing a
large text field? It's almost never a win to index a text field, unless
This is not a key on a text field.
The keys are:
mail_list (example, the PHP mailing list=1)
mail_year (1999)
mail_month (July=7)
Yes it is a multi-key index, and the matches are exact.
Someone else asked why I have separated these fields out from the
mail_date.
If I didn't, and I wanted to see the messages for this month, I'd have
to regex and that would overwhelm the database.
Tim
"Ross J. Reedstrom" wrote:
...
I think Tim had 2 problems. The first was tuples whose text
attributes did not permit two on the same index page. The second,
however, is that a query against the *same schema* under 6.x now
runs slower by a factor of 15 under 7.x:
"The following query is at the very heart of the site and it takes
upwards of 15-20 seconds to run now. It used to be instantaneous.
explain SELECT mailid, mail_date, mail_is_followup, mail_from,
mail_subject
FROM mail_archive WHERE mail_list=35 AND mail_year=2000
AND mail_month=1 ORDER BY mail_date DESC LIMIT 26 OFFSET 0;
NOTICE: QUERY PLAN:
Sort (cost=138.41..138.41 rows=34 width=44)
  -> Index Scan using idx_mail_archive_list_yr_mo on tbl_mail_archive
     (cost=0.00..137.55 rows=34 width=44)
EXPLAIN"
Even though he's using a multi-key index here, it is composed
entirely of integer fields. It's reducing to a simple index scan +
sort, so I don't see how the performance could drop off so
dramatically. Perhaps if we could see the EXPLAIN output with the
same query against the 6.x database we could see what's going on.
Mike Mascari
Yes it is a multi-key index, and the matches are exact.
Someone else asked why I have separated these fields out from the
mail_date. If I didn't, and I wanted to see the messages for this month,
I'd have to regex and that would overwhelm the database.
As I said in that mail, you could use a BETWEEN on the first and last
day of the month, which can use an index, and do the ORDER BY with that
index. Unfortunately, a datepart(mail_date, year to month) is probably
not understood as indexable by the optimizer.
Andreas
On Wed, 12 Jul 2000, Mike Mascari wrote:
...
If I recall correctly, Marc experienced similar performance
differences with UDM search after upgrading. ...
it was a problem with v7.0 that Tom provided a workaround for, but I'm
99% certain that the workaround was included in v7.0.1 ...
On Wed, 12 Jul 2000, Tim Perdue wrote:
Mike Mascari wrote:
Have you VACUUM ANALYZE'd the table(s) in question?
Yes, they've been vacuum analyze'd and re-vacuum analyze'd to death.
Also added some extra indexes that I don't really need just to see if
that helps.
what does EXPLAIN <query>; show and what is the QUERY itself that is so
slow?
On Wed, 12 Jul 2000, Tim Perdue wrote:
"Ross J. Reedstrom" wrote:
...
Someone else asked why I have separated these fields out from the
mail_date. If I didn't, and I wanted to see the messages for this month,
I'd have to regex and that would overwhelm the database.
if you did it as a proper date field, you can use stuff like 'date_part'
and 'date_trunc' to pull out a particular month, year, etc ...
On Wed, Jul 12, 2000 at 08:14:29AM -0700, Tim Perdue wrote:
"Ross J. Reedstrom" wrote:
Tim, why are you building a multikey index, especially one containing a
large text field? It's almost never a win to index a text field, unless ...
This is not a key on a text field.
Ah, I see, I had merged the two problems you reported together. I see now
that the 'can't create index' problem was on a different index.
Mike Mascari gave you a detailed answer to that, which you seemed to just
blow off, based on your guesstimate that it would run too long:
This is a *big* help.
Yes, the table is approx 10-12GB in size and running your length() and
update queries is going to take a lifetime, since it will require a
calculation on 4 million rows.
Mike mentioned that he's run similar index creations on 2 million rows,
and it took 5-10 minutes. I reiterate: you've got a long subject that
tripped a bug in index creation in postgresql versions < 7.0. Give his
solution a try. It's a 'clean it up once' sort of thing: I don't think
anyone's going to complain about the subject getting trimmed at ~ 2k.
The keys are:
mail_list (example, the PHP mailing list=1)
mail_year (1999)
mail_month (July=7)
Yes it is a multi-key index, and the matches are exact.
Right, as your explain output showed: the planner is picking this index
and using it. I'd guess that your time is getting lost in the sort step.
I seem to recall that Tom reworked the sort code as well, to reduce the
size of temporary sort files: perhaps you've found a corner case that is
much slower.
Do you still have the 6.X install available? EXPLAIN output from that
would be useful.
Someone else asked why I have separated these fields out from the
mail_date. If I didn't, and I wanted to see the messages for this month,
I'd have to regex and that would overwhelm the database.
That's what the date_part function is for:
reedstrm=# select now();
now
------------------------
2000-07-12 11:03:11-05
(1 row)
reedstrm=# select date_part('month', now());
date_part
-----------
7
(1 row)
reedstrm=# select date_part('year', now());
date_part
-----------
2000
(1 row)
So your query would look like:
SELECT mailid, mail_date, mail_is_followup, mail_from, mail_subject FROM
mail_archive WHERE mail_list=35 AND date_part('year',mail_date)=2000 AND
date_part('month',mail_date)=1 ORDER BY mail_date DESC LIMIT 26 OFFSET 0;
You can even build functional indices. However, since you're selecting
and sorting based on the same attribute, the time of the message, it
should be possible to build an index on mail_date, and construct a SELECT
that uses it for ordering as well as limiting the tuples returned.
You're generating the queries programmatically, from a scripting language,
right? So, the best thing would be if you could create a query that
looks something like:
SELECT mailid, mail_date, mail_is_followup, mail_from, mail_subject
FROM mail_archive WHERE mail_list=35 AND mail_date >= 'January 1, 2000'
AND mail_date < 'February 1, 2000' ORDER BY mail_date DESC LIMIT 26 OFFSET 0;
With an index on mail_date, that should do a single index scan, returning
the first 26, and stop. I'd bet a lot that it's the sort that's killing
you, since the backend has to retrieve the entire result set and sort
it to be sure it returns the first 26.
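Since the queries are generated programmatically, the month boundaries can be computed rather than hand-written. A small sketch in Python (standing in for whatever scripting language Geocrawler actually used; ISO-format bounds here, though the literal format is whatever the backend accepts):

```python
from datetime import date

def month_range(year: int, month: int) -> tuple[str, str]:
    """Return [start, end) date bounds covering one month,
    suitable for: mail_date >= start AND mail_date < end."""
    start = date(year, month, 1)
    # First day of the following month; roll December over to January.
    if month == 12:
        end = date(year + 1, 1, 1)
    else:
        end = date(year, month + 1, 1)
    return start.isoformat(), end.isoformat()

# January 2000, as in the example query above:
start, end = month_range(2000, 1)
print(start, end)  # 2000-01-01 2000-02-01
```

The half-open `>= start AND < end` form avoids having to know the last day of each month (28/29/30/31) entirely.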
You might be able to use a two key index, on mail_date, mailid. I think
you have to be careful to put key you want sorted output on first,
to ensure that the index order is presorted, and the planner know it.
Ross
On Wed, Jul 12, 2000 at 11:36:44AM -0500, Ross J. Reedstrom wrote:
You might be able to use a two key index, on mail_date, mailid. I think
you have to be careful to put key you want sorted output on first,
to ensure that the index order is presorted, and the planner know it.
Bah, I clearly need lunch: that last sentence, with better grammar:
[...] be careful to put the key you want output sorted on first,
to ensure that the index order is presorted, and that the planner
knows it.
Ross
"Ross J. Reedstrom" wrote:
Mike Mascari gave you a detailed answer to that, which you seemed to just
blow off, based on your guesstimate that it would run too long:
That is a separate issue - unrelated to this performance issue - and it
was not blown off; I was merely making a comment.
Right, as your explain output showed: the planner is picking this index
and using it. I'd guess that your time is getting lost in the sort step.
I think you're probably right. It's hard to imagine that sorting is that
much slower, but it's hard to say.
Your ideas for selecting based on the date are intriguing, however the
schema of the db was not done with that in mind. Everyone thinks I'm a
nut when I say this, but the date is stored in a char(14) field in
gregorian format: 19990101125959
So perhaps sorting a char(14) field is somehow majorly slower now.
No I don't have 6.5.3 installed anymore - it was totally fubar and
wasn't running anymore.
Tim
On Wed, 12 Jul 2000, Tim Perdue wrote:
...
I think you're probably right. It's hard to imagine that sorting is that
much slower, but it's hard to say.
just curious, but what if you remove the ORDER BY, just to test ... is it
that much faster without than with? Just want to narrow down *if* it's a
sorting issue or not, that's all ...
If it is a sorting issue, what if you raise the -S value?
Your ideas for selecting based on the date are intriguing, however the
schema of the db was not done with that in mind. Everyone thinks I'm a
nut when I say this, but the date is stored in a char(14) field in
gregorian format: 19990101125959
Perfect, that makes it a lot easier:
1. index on (mail_list, mail_date)
2. SELECT mailid, mail_date, mail_is_followup, mail_from, mail_subject
FROM mail_archive WHERE mail_list=35
AND mail_date between '20000100' and '20000199'
ORDER BY mail_list DESC, mail_date DESC LIMIT 26 OFFSET 0;
Note the appended 00 and 99 which is generic for all months.
Andreas
Zeugswetter Andreas SB wrote:
...
Perfect, that makes it a lot easier:
1. index on (mail_list, mail_date)
2. SELECT mailid, mail_date, mail_is_followup, mail_from, mail_subject
FROM mail_archive WHERE mail_list=35
AND mail_date between '20000100' and '20000199'
ORDER BY mail_list DESC, mail_date DESC LIMIT 26 OFFSET 0;
Note the appended 00 and 99 which is generic for all months.
shouldn't it be between '20000100000000' and '20000199000000'?
I've never indexed that date column, because it is likely that there are
3 million+ different dates in there - remember 4 million emails sent
over the course of 15 years are likely to have a lot of different dates,
when the hour/minute/second is attached.
You still think that will work?
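The reason the trick works is that fixed-width Gregorian strings like 19990101125959 sort lexicographically in chronological order, so a range comparison on padded bounds selects exactly one month regardless of how many distinct values exist. A quick sketch (zero-padding the lower bound and nine-padding the upper bound out to the full 14 characters of the char(14) column is one safe way to widen Andreas' 8-character bounds; the sample stamps are invented for illustration):

```python
# char(14) timestamps in the Geocrawler format: YYYYMMDDHHMMSS
stamps = [
    "19991231235959",  # last second of Dec 1999
    "20000101000000",  # first second of Jan 2000
    "20000115120000",  # mid-January
    "20000131235959",  # last second of Jan 2000
    "20000201000000",  # first second of Feb 2000
]

# Widen the month prefix '200001' to full width: zeros on the low
# bound, nines on the high bound, so every real HHMMSS value within
# the month falls inside the range.
lo = "20000100" + "000000"   # 20000100000000
hi = "20000199" + "999999"   # 20000199999999

# Plain string comparison == chronological comparison here.
january = [s for s in stamps if lo <= s <= hi]
print(january)
# ['20000101000000', '20000115120000', '20000131235959']
```

High cardinality in the column is not a problem for this kind of range scan; the index only needs to find the first entry >= lo and read forward until it passes hi.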
Tim
The Hermit Hacker wrote:
just curious, but what if you remove the ORDER BY, just to test ... is it
that much faster without than with? Just want to narrow down *if* it's a
sorting issue or not, that's all ...
Good call - it was instantaneous as it used to be.
If it is a sorting issue, what if you raise the -S value?
-S is 32768 right now
Tim
On Wed, 12 Jul 2000, Tim Perdue wrote:
The Hermit Hacker wrote:
just curious, but what if you remove the ORDER BY, just to test ... is it
that much faster without than with? Just want to narrow down *if* it's a
sorting issue or not, that's all ...
Good call - it was instantaneous as it used to be.
It takes us a while sometimes, but we eventually clue in :)
If it is a sorting issue, what if you raise the -S value?
-S is 32768 right now
how many results come back from the query? ignoring the LIMIT, that is
... see, it has to ORDER BY before the LIMIT, of course...
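The point about ORDER BY happening before LIMIT can be illustrated in miniature. This is a toy model in Python, not anything from the PostgreSQL executor: without an index on the sort key, the backend must examine the entire result set before it can return the top 26 rows:

```python
import heapq
import random

random.seed(42)
# Toy model: 10,000 messages with unique, sortable date keys.
keys = random.sample(range(10**6), 10_000)
rows = [("msg%05d" % i, k) for i, k in enumerate(keys)]

# What ORDER BY mail_date DESC LIMIT 26 must do without a usable index:
# materialize and sort the ENTIRE result set, then keep 26 rows.
full_sort = sorted(rows, key=lambda r: r[1], reverse=True)[:26]

# A top-N heap also touches every row, but keeps only 26 in memory.
# An index on the sort key avoids even that: rows stream out
# pre-sorted and the scan can stop after 26.
top_n = heapq.nlargest(26, rows, key=lambda r: r[1])

print(full_sort == top_n)  # True: same 26 rows either way
```

Either way the work is proportional to the full match count, which is why the size of the result set (ignoring the LIMIT) matters so much here.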