Serious performance problem

Started by Tille, Andreas in October 2001 · 49 messages · hackers, general

#1 Tille, Andreas
TilleA@rki.de
hackers

Hello,

I discussed a problem concerning the speed of PostgreSQL compared to
MS SQL Server at length on the pgsql-general list. The thread starts with
the message

http://fts.postgresql.org/db/mw/msg.html?mid=1035557

Now I tried a snapshot of version 7.2 and got a speed increase of
about a factor of 2. But sorry, this is really not enough. The very simple
test I pointed to in my mail is still much too slow, and the issue would
probably spoil the whole project, which should be a complete open
source solution, and would perhaps end in M$ stuff after all. I've come under
heavy pressure from my employer, who was talking about the nice world
of MS .net (while he is using MS-SQL exclusively). To make the thing
clear: the issue is the legal database of infectious diseases in Germany,
run by the Robert Koch-Institut. So the beast could be of some
importance for increasing the acceptance of PostgreSQL and Open Source
in the field of medicine, which is generally known for the money
involved. So I really hope that some skilled programmers would
be able to find a good way to solve the performance issue, perhaps by
just profiling the simple query

SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY Hauptdaten_Fall.MeldeKategorie;

to the data set I put on

http://www.foodborne-net.de/~tillea/ifsg/ifsgtest.dump.bz2

If this should take less than half a second on a modern PC, I could
continue trying more realistic queries.

I really hope that I could readjust the image of PostgreSQL in the
eyes of my M$-centered colleagues.

Kind regards

Andreas.

#2 Vsevolod Lobko
seva@sevasoft.kiev.ua
In reply to: Tille, Andreas (#1)
hackers
Re: Serious performance problem

It seems the problem is very simple :))
MS SQL can answer queries from indexes alone, without touching the actual
table at all. PostgreSQL can't.

So MS SQL avoids a sequential scan of the big table and simply scans the
index, which is already in the needed order and is very much smaller.
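The behaviour described above can be sketched with any engine that supports covering-index scans; here is a minimal illustration using SQLite from Python as a stand-in, with table and column names taken from Andreas' query (the index name is my own invention):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Hauptdaten_Fall (
    ID INTEGER PRIMARY KEY, IstAktuell INTEGER, MeldeKategorie TEXT)""")

# A two-column index containing everything the query touches lets the
# engine answer from the index alone, never visiting the table itself.
conn.execute("""CREATE INDEX idx_aktuell_kategorie
    ON Hauptdaten_Fall (IstAktuell, MeldeKategorie)""")

plan = conn.execute("""EXPLAIN QUERY PLAN
    SELECT MeldeKategorie, COUNT(ID) AS Anz
    FROM Hauptdaten_Fall WHERE IstAktuell = 20
    GROUP BY MeldeKategorie""").fetchall()
for row in plan:
    print(row[3])   # the plan reports a COVERING INDEX scan: no table access
```

This is exactly the trick being described: the scan never leaves the index, so the big table is never read.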


#3 Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Tille, Andreas (#1)
hackers
Re: Serious performance problem

Andreas -
I took a look at your problem, since I'm sort of in the field
and would like to see free solutions spread, as well.

Here's what I see: your example touches on what can be an Achilles
heel for pgsql's current statistical analyzer: selection on data fields
that have a few common values. Often the indices don't get used, since
a large fraction of the table needs to be scanned in any case. In
your example, fully 68% of the table fits the WHERE condition.

Here's some timing results on my machine:

Your dataset and query, as written:

real 0m25.272s
user 0m0.090s
sys 0m0.050s

Creating an index on meldekategorie, and forcing its use with
"set enable_seqscan = off":

real 0m14.743s
user 0m0.070s
sys 0m0.050s

Same, with index on istaktuell:

real 0m26.511s
user 0m0.050s
sys 0m0.060s

Now, with an index on both meldekategorie and istaktuell:

real 0m7.179s
user 0m0.060s
sys 0m0.030s

I think we have a winner. No, it's not sub-second, but I improved the time
by 3x just by trying some indices. Note that I _still_ had to force the
use of indices for this one. It's also the first time I've personally seen
a query/dataset that benefits this much from a two-key index.
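For anyone reproducing the timings above against the dump, the steps described would look roughly like this (the index name is my own choice, and enable_seqscan should be reset afterwards):

```sql
-- Two-column index covering both the WHERE and GROUP BY columns;
-- this is the combination that gave the ~3x improvement above.
CREATE INDEX idx_fall_aktuell_kategorie
    ON hauptdaten_fall (istaktuell, meldekategorie);

-- The planner still prefers a sequential scan for this query,
-- so force index use for the session before timing it:
SET enable_seqscan = off;

SELECT meldekategorie, count(id) AS anz
  FROM hauptdaten_fall
 WHERE istaktuell = 20
 GROUP BY meldekategorie
 ORDER BY meldekategorie;
```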

As another poster replied to you, there is a limitation in postgresql's
use of indices that arises from MVCC: even if the only data requested is
stored in the index itself, the backend must visit the actual tuple
in the table to ensure that it is 'visible' to the current transaction.

How realistic a representation of your real workload is this query? Realize
that more selective, complex queries are where pgsql shines compared to
other RDBMSs: the 'fast table scanner' type of query that you proposed as your
test doesn't really let pgsql stretch its legs. Do you have example timings
from MS-SQL or others?

Ross


--
Ross Reedstrom, Ph.D. reedstrm@rice.edu
Executive Director phone: 713-348-6166
Gulf Coast Consortium for Bioinformatics fax: 713-348-6182
Rice University MS-39
Houston, TX 77005

#4 Jean-Michel POURE
jm.poure@freesurf.fr
In reply to: Tille, Andreas (#1)
hackers, general
Re: Serious performance problem

Hello Andreas,

A possible solution would be:
CREATE TABLE foo AS
SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY
Hauptdaten_Fall.MeldeKategorie;

This is suitable if your data does not change often. To get automatic updates:
1) Define iPrecision, the precision that you need (integer).
2) Create a trigger which increments a counter when a record is updated or
inserted. When the counter reaches iPrecision, do a DROP TABLE foo + CREATE
TABLE foo AS SELECT Hauptdaten_Fall.MeldeKategorie,
Count(Hauptdaten_Fall.ID)... This will take a few seconds, but only once.
3) Run a batch script on a schedule (every hour, 4 hours, day?) so that a
human user has very little chance of being the one to reach iPrecision.

On 300.000 records, you will get instant results. There are plenty of
tricks like this one. If you employ them, you will ***never*** reach the
limits of a double Pentium III computer with U3W discs.
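A rough sketch of the counter-and-rebuild scheme in PL/pgSQL (all names here are mine, and the syntax is the modern dollar-quoted form rather than the 7.x-era quoting; treat it as an untested outline):

```sql
-- Counter bumped by every insert/update on the big table.
CREATE TABLE foo_counter (n integer);
INSERT INTO foo_counter VALUES (0);

CREATE FUNCTION bump_foo_counter() RETURNS trigger AS $$
BEGIN
    UPDATE foo_counter SET n = n + 1;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER hauptdaten_fall_bump
    AFTER INSERT OR UPDATE ON hauptdaten_fall
    FOR EACH ROW EXECUTE PROCEDURE bump_foo_counter();

-- Periodic batch job: once the counter has passed iPrecision,
-- rebuild the result table and reset the counter.
-- DROP TABLE foo;
-- CREATE TABLE foo AS SELECT meldekategorie, count(id) AS anz
--     FROM hauptdaten_fall WHERE istaktuell = 20
--     GROUP BY meldekategorie;
-- UPDATE foo_counter SET n = 0;
```

Queries then hit the small result table `foo` instead of the big one, which is where the "instant results" come from.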

If you need to answer this message, please reply on
pgsql-general@postgresql.org.

Cheers,
Jean-Michel POURE


#5 Tille, Andreas
TilleA@rki.de
In reply to: Jean-Michel POURE (#4)
hackers, general
Re: Serious performance problem

On Mon, 29 Oct 2001, Jean-Michel POURE wrote:

A possible solution would be:
CREATE TABLE foo AS
SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY
Hauptdaten_Fall.MeldeKategorie;

Sorry, this is NO solution of my problem.

On 300.000 records, you will get instant results. There are plenty of
tricks like this one. If you employ them, you will ***never*** reach the
limits of a double Pentium III computer with U3W discs.

It is really no help if I solve the speed issue of this *very simple,
zeroth-order try*. I repeat: I have plenty of queries which do much
more complicated stuff than this. This is just a rough strip-down of the
problem, fit for debugging/profiling issues of the database *server*. Simple
tricks on a simple example do not help.

If you need to answer this message, please reply on
pgsql-general@postgresql.org.

No, because ...

I discussed a problem concerning the speed of PostgreSQL compared to
MS SQL server heavily on postgres-general list. The thread starts with
message

http://fts.postgresql.org/db/mw/msg.html?mid=1035557

I did so and got explicit advice from Tom to ask here.

Consider the problem as a benchmark. I would love to see postgresql
as the winner.

Kind regards

Andreas.

#6 Tille, Andreas
TilleA@rki.de
In reply to: Ross J. Reedstrom (#3)
hackers
Re: Serious performance problem

On Mon, 29 Oct 2001, Ross J. Reedstrom wrote:

Here's what I see: your example touches on what can be an Achilles
heel for pgsql's current statistical analyzer: selection on data fields
that have a few common values. Often the indices don't get used, since
a large fraction of the table needs to be scanned in any case. In
your example, fully 68% of the table fits the WHERE condition.
...

I think we have a winner. No, it's not sub-second, but I improved the time
by 3x just by trying some indices. Note that I _still_ had to force the
use of indices for this one. It's also the first time I've personally seen
a query/dataset that benefits this much from a two-key index.

This is true for this example, and I also played with indices as you did. I
also enforced the index scan and compared it with forbidding the index scan.
The result was that on my more realistic examples both versions performed
about the same. There was no *real* difference. For sure, in this simple query
there is a difference, but the real examples showed only a 2% - 5% speed
increase (if not a slowdown when enforcing index scans!).

As another poster replied to you, there is a limitation in postgresql's
use of indices that arises from MVCC: even if the only data requested is
stored in the index itself, the backend must visit the actual tuple
in the table to ensure that it is 'visible' to the current transaction.

Any possibility to switch this off temporarily for certain queries like this,
if the programmer can make sure that it is not necessary? Just a stupid
idea from a bloody uneducated man in database engineering.

How realistic a representation of your real workload is this query? Realize
that more selective, complex queries are where pgsql shines compared to
other RDBMS: the 'fast table scanner' type query that you proposed as your
test don't really let pgsql stretch it's legs. Do you have example timings
from MS-SQL or others?

Unfortunately the four tests we did here all seemed to suffer from the
same problem. The situation is that there is a given database structure
which was developed over more than a year on MS-SQL and has an Access GUI.
Now parts of the UI should be made public via the web (I want to use Zope),
and I just imported the data and did some example queries, with the
terribly slow result.

Kind regards and thanks for your ideas

Andreas.

#7 Tille, Andreas
TilleA@rki.de
In reply to: Vsevolod Lobko (#2)
hackers
Re: Serious performance problem

On Mon, 29 Oct 2001, Vsevolod Lobko wrote:

It seems the problem is very simple :))
MS SQL can answer queries from indexes alone, without touching the actual
table at all. PostgreSQL can't.

So MS SQL avoids a sequential scan of the big table and simply scans the
index, which is already in the needed order and is very much smaller.

Hmmm, could anyone imagine a *solution* to the problem, simple or not?
I'm thinking of some switch the database programmer could use if he
really knows what he is doing.

Kind regards

Andreas.

#8 Jean-Michel POURE
jm.poure@freesurf.fr
In reply to: Tille, Andreas (#5)
hackers, general
Re: [HACKERS] Serious performance problem

CREATE TABLE foo AS
SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY
Hauptdaten_Fall.MeldeKategorie;

Sorry, this is NO solution of my problem.

Allo Andreas,

For every problem there is a solution. That is what software optimization
is all about. Do you make use of PL/pgSQL stored queries in your database?
If not, you will probably end up with terrible nested queries that will
eat up server time and power.

Best regards,
Jean-Michel POURE

#9 Hannu Krosing
hannu@tm.ee
In reply to: Tille, Andreas (#6)
hackers
Re: Serious performance problem

"Tille, Andreas" wrote:

On Mon, 29 Oct 2001, Ross J. Reedstrom wrote:

Here's what I see: your example touches on what can be an Achilles
heel for pgsql's current statistical analyzer: selection on data fields
that have a few common values. Often the indices don't get used, since
a large fraction of the table needs to be scanned in any case. In
your example, fully 68% of the table fits the WHERE condition.
...

I think we have a winner. No, it's not sub-second, but I improved the time
by 3x just by trying some indices. Note that I _still_ had to force the
use of indices for this one. It's also the first time I've personally seen
a query/dataset that benefits this much from a two-key index.

This is true for this example, and I also played with indices as you did. I
also enforced the index scan and compared it with forbidding the index scan.
The result was that on my more realistic examples both versions performed
about the same. There was no *real* difference. For sure, in this simple query
there is a difference, but the real examples showed only a 2% - 5% speed
increase (if not a slowdown when enforcing index scans!).

I studied his dataset and found that a simple count(*) on the whole table
took 1.3 sec on my Celeron 375, so I'm sure that the more complex query,
which has to visit 2/3 of the tuples, will not be able to execute in under
1 sec.

My playing with indexes / subqueries and query rewriting got the example
query (actually a functional equivalent) to run in ~5 sec with a simple
aggregate(group(indexscan)) plan, and I suspect that this is as fast as
it will get on my hardware.

It could probably soon be made to run in ~1.5 sec by using an aggregate
function that does a sequential scan and returns a rowset.

As another poster replied to you, there is a limitation in postgresql's
use of indices that arises from MVCC: even if the only data requested is
stored in the index itself, the backend must visit the actual tuple
in the table to ensure that it is 'visible' to the current transaction.

Any possibility to switch this off temporarily for certain queries like this,
if the programmer can make sure that it is not necessary? Just a stupid
idea from a bloody uneducated man in database engineering.

There have been plans to set aside a bit in the index that would mark the
deleted tuple. Unfortunately this helps only in cases where there are many
deleted tuples, and all live tuples have to be checked anyway ;(

--------------
Hannu

#10 Tille, Andreas
TilleA@rki.de
In reply to: Hannu Krosing (#9)
hackers
Re: Serious performance problem

On Mon, 29 Oct 2001, Vsevolod Lobko wrote:

It seems the problem is very simple :))
MS SQL can answer queries from indexes alone, without touching the actual
table at all. PostgreSQL can't.

So MS SQL avoids a sequential scan of the big table and simply scans the
index, which is already in the needed order and is very much smaller.

I forwarded this information to my colleague and he replied the following
(I'm translating from German into English):

hc> I expected this problem. But what is the purpose of an index: not
hc> to have to look into the table itself. Moreover, this means that the
hc> expense grows linearly with the table size - no good prospect at all
hc> (the good thing is that it is not exponential :-)).
I have to explain that we are at the *beginning* of the production process.
We expect a lot more data.

hc> In case of real index usage the expense grows only with log(n).
hc> Whatever the finer philosophy of database servers, MS-SQL-Server
hc> uses indexes consistently and so it is very fast on many queries.
hc> When performing a query on a field without an index, I get a slow
hc> table scan. This is like measuring the speed of the hard disk and
hc> the cleverness of the cache.

The consequence for my problem is now: if it is technically possible
to implement index scans without table lookups, please implement it. If
not, we just have to look for another database engine which does so,
because our application really needs the speed on this type of query.
I repeat from my initial posting: the choice of server for our
application could have importance for many projects in the field of
medicine in Germany. I really hope that there is a reasonable solution
which could perhaps strike a balance between safety and speed. For
example, I can assure in my application that the index, once created,
will stay valid, because I just want to read in a new set of data once
a day (from the MS-SQL server which collects data over the day). So
I could recreate all indices after the import, and the database is
read-only until the next cron job. Is there any chance to speed up
such applications?

Kind regards

Andreas.

#11 Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tille, Andreas (#10)
hackers
Re: Serious performance problem

I also enforced the index scan and compared with forbidding the index scan.
The result was on my more realistic examples that both versions performed
quite the same. There was no *real* difference. For sure in this simple
query there is a difference but the real examples showed only 2% - 5% speed
increase (if not slower with enforcing index scans!).

You could speed up the query somewhat if you keep the sort from
hitting the disk. A simple test here showed that you need somewhere
near sort_mem = 15000 in postgresql.conf.

Andreas

#12 Brent Verner
brent@rcfile.org
In reply to: Tille, Andreas (#10)
hackers
Re: Serious performance problem

On 30 Oct 2001 at 11:44 (+0100), Tille, Andreas wrote:
| On Mon, 29 Oct 2001, Vsevolod Lobko wrote:
|
| > It seems the problem is very simple :))
| > MS SQL can answer queries from indexes alone, without touching the
| > actual table at all. PostgreSQL can't.
| >
| > So MS SQL avoids a sequential scan of the big table and simply scans the
| > index, which is already in the needed order and is very much smaller.
| I forwarded this information to my colleague and he replied the following
| (I'm translating from German into English):
|
| hc> I expected this problem. But what is the purpose of an index: not
| hc> to have to look into the table itself. Moreover, this means that the
| hc> expense grows linearly with the table size - no good prospect at all
| hc> (the good thing is that it is not exponential :-)).
| I have to explain that we are at the *beginning* of the production process.
| We expect a lot more data.
|
| hc> In case of real index usage the expense grows only with log(n).
| hc> Whatever the finer philosophy of database servers, MS-SQL-Server
| hc> uses indexes consistently and so it is very fast on many queries.
| hc> When performing a query on a field without an index, I get a slow
| hc> table scan. This is like measuring the speed of the hard disk and
| hc> the cleverness of the cache.
|
| The consequence for my problem is now: if it is technically possible
| to implement index scans without table lookups, please implement it. If
| not, we just have to look for another database engine which does so,
| because our application really needs the speed on this type of query.
| I repeat from my initial posting: the choice of server for our
| application could have importance for many projects in the field of
| medicine in Germany. I really hope that there is a reasonable solution
| which could perhaps strike a balance between safety and speed. For
| example, I can assure in my application that the index, once created,
| will stay valid, because I just want to read in a new set of data once
| a day (from the MS-SQL server which collects data over the day). So
| I could recreate all indices after the import, and the database is
| read-only until the next cron job. Is there any chance to speed up
| such applications?

CREATE INDEX idx_meldekategorie_hauptdaten_f
ON hauptdaten_fall(meldekategorie);
CLUSTER idx_meldekategorie_hauptdaten_f ON hauptdaten_fall;

Aggregate (cost=5006.02..5018.90 rows=258 width=16)
-> Group (cost=5006.02..5012.46 rows=2575 width=16)
-> Sort (cost=5006.02..5006.02 rows=2575 width=16)
-> Seq Scan on hauptdaten_fall (cost=0.00..4860.12 rows=2575 width=16)

This looks much nicer, but is still quite slow. I'm quite sure the
slowness is in the sort(), since all queries that don't sort return
quickly. I hoped the clustered index would speed up the sort, but
that is not the case.

It _seems_ a simple optimization would be to not (re)sort the tuples
when using a clustered index.

if( the_column_to_order_by_is_clustered ){
    if( order_by_is_DESC )
        // reverse the tuples to handle DESC
}

I haven't looked at the code to see if this is even feasible, but I
do imagine there is enough info available to avoid an unnecessary
sort on the CLUSTERED index. The only problem I see with this is
if the CLUSTERed index is not kept in a CLUSTERed state as more
records are added to this table.
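The intuition above -- that rows arriving in index order need no extra sort step -- can be illustrated with an engine that implements it; a small sketch using SQLite from Python as a stand-in (table and index names are my own):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE hauptdaten_fall (
    id INTEGER PRIMARY KEY, meldekategorie TEXT)""")

def plan(sql):
    # Concatenate the detail column of EXPLAIN QUERY PLAN output.
    return " | ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT meldekategorie, COUNT(*) FROM hauptdaten_fall GROUP BY meldekategorie"

before = plan(q)   # without an index: a temp b-tree is built for the GROUP BY
conn.execute("CREATE INDEX idx_kat ON hauptdaten_fall (meldekategorie)")
after = plan(q)    # with the index: rows arrive pre-ordered, no sort step

print(before)
print(after)
```

The "before" plan contains a temp b-tree step for grouping; the "after" plan scans the index and drops that step, which is exactly the optimization being proposed.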

brent

--
"Develop your talent, man, and leave the world something. Records are
really gifts from people. To think that an artist would love you enough
to share his music with anyone is a beautiful thing." -- Duane Allman

#13 Jean-Michel POURE
jm.poure@freesurf.fr
In reply to: Tille, Andreas (#10)
hackers
Re: [HACKERS] Serious performance problem

For example I can assure in my application that the index, once created
will be valid, because I just want to read in a new set of data once
a day (from the MS-SQL Server which collects data over the day). So
I could recreate all indices after the import and the database is
readonly until the next cron job. Is there any chance to speed up
those applications?

Hello Andreas,

Is your database read-only? Good point; sorry to insist, but your problem is
software optimization. In your case the database may climb up to 200
million rows (1000 days x 200.000 rows). What are you going to do then? Buy
a 16-Itanium computer with 10 GB RAM and an MS SQL Server licence? Have a
close look at your problem: how much time does MS SQL Server take to
query 200 million rows? The problem is not in choosing MS SQL or
PostgreSQL ...

If you are adding 200.000 rows of data every day, consider using a combination
of CREATE TABLE AS to create a result table with PL/pgSQL triggers to
maintain data consistency. You will then get instant results, even on 2
billion rows, because you will always query the result table, not the
original one. Large databases are always optimized this way because, even
with smart indexes, there are things (like your problem) that need
*smart* optimization.

Do you need PL/pgSQL source code to perform a test on 2 billion rows? If
so, please email me on pgsql-general and I will send you the code.

Best regards,
Jean-Michel POURE

#14 Jean-Michel POURE
jm.poure@freesurf.fr
In reply to: Jean-Michel POURE (#8)
general
Re: [HACKERS] Serious performance problem

At 11:49 30/10/01 +0100, you wrote:

On Tue, 30 Oct 2001, Jean-Michel POURE wrote:

Other examples? Maybe I can help you.

The only thing I could send you would be the whole database structure and
the queries I performed. I'm not allowed to publish confidential medical data.
If you think that you could make serious performance tests without live
data, I would be happy to send you this stuff.

Kind regards

Hello Andreas,

Sorry, I just read your email.
The database structure is fine for me.
I don't need data. Send me a couple of queries.

Do you have a Windows desktop?
If so, please install pgAdmin II from http://www.pgadmin.org.

Kind regards,
Jean-Michel POURE

#15 Tille, Andreas
TilleA@rki.de
In reply to: Jean-Michel POURE (#8)
hackers, general
Re: [HACKERS] Serious performance problem

On Tue, 30 Oct 2001, Jean-Michel POURE wrote:

Sorry, this is NO solution of my problem.

^^^^

For every problem there is a solution. That is what software optimization

For sure, and as I said, I'm really sure that it is a server-side programming
issue (and others on the hackers list thought this as well).
But your proposal was not the solution.

is all about. Do you make use of PL/pgSQL stored queries in your database?
If not, you will probably end up with terrible nested queries that will
eat up server time and power.

Also, stored procedures are not the solution, and I see no reason to discuss
this topic here on the general list (moreover, you leave people unclear about
the original question). It was discussed here in two long threads and
belongs with the PostgreSQL programmers. I hope I do not violate Tom Lane's
privacy if I quote here one sentence from him from a private posting:
"Performance issues can be discussed in -hackers."
We are not talking about the performance of the database application but
about the server itself.

Kind regards

Andreas.

#16 Tille, Andreas
TilleA@rki.de
In reply to: Zeugswetter Andreas SB SD (#11)
hackers
Re: Serious performance problem

On Tue, 30 Oct 2001, Zeugswetter Andreas SB SD wrote:

You could somewhat speed up the query if you avoid that the sort
hits the disk. A simple test here showed, that you need somewhere
near sort_mem = 15000 in postgresql.conf.

Well, these are the usual hints from pgsql-general. I did so and
increased step by step to:

shared_buffers = 131072
sort_mem = 65536

This led to a doubling of speed in my tests, but these are settings where
adding more memory doesn't result in any further speed increase.

When I posted my question here I was talking about this "tuned"
PostgreSQL server. The default settings were even worse!

Kind regards

Andreas.

#17 Jean-Michel POURE
jm.poure@freesurf.fr
In reply to: Jean-Michel POURE (#14)
general
Re: [HACKERS] Serious performance problem

thanks for your kind offer to help!

You are welcome.

Just to make the situation clear once more: I've got hints on the
general list about how to change my database application to perform
better by using extra tables and indices. I'm also able to do some
database programming myself, even if I'm not very experienced. But
the problem is different: there is an existing database application
which has absolutely no performance problems on MS SQL server. If
I just test the simplest query and it takes orders of magnitude longer,
then something with the server is wrong.
Moreover, chances are low that changes to the existing database
application would be accepted for this reason. What if the MS-SQL
server ran into performance problems and it came to tuning the code
there, while we were already at the end of our speed optimizations on
the PostgreSQL server? There is a binary administrative decision: yes
or no. If the same database application performs this badly we can
*NOT* use PostgreSQL. The cost of development would be much higher
than a second M$-SQL server license.

Have you done tests on MS SQL with xx million rows. How much data are you
going to have in a single database?

There is a binary technical decision: yes or no. If there is a
need for at least one additional table for each query of our application
to perform in a comparable manner, then we are lost, because we are
unable to do that for each new query.

Yes. This is what multi-dimensional databases do.
This time, we will do it "by hand" in PostgreSQL.
With good naming conventions, this will be clean.

The reason for the performance gap is also clear: it is the different
use of indexes in PostgreSQL. This has the consequence of linear
scaling of performance, compared to the log(n) scaling obtained when
indexes are used strictly. So chances are high that we will run into
performance trouble again even if we could now cope with the problems.

Beware PHP native MS SQL driver do not work well.

Do you have a Windows desktop?

No, I haven't.

If so, please install pgAdmin II from http://www.pgadmin.org.

Sorry, I see no relation to the question.

This is for programming in PL/pgSQL and administrating PostgreSQL.
Without pgAdmin, you will quickly get into a mess with your objects
(tables, views, triggers).

Best regards,
Jean-Michel POURE

In reply to: Tille, Andreas (#10)
hackers
Re: Serious performance problem

AFAIK, sorting is necessary even when you have CLUSTERed a table using an index.

Somewhere in the docs I read something like "CLUSTER reorders the table on disk so that entries
close together in the index are close together on disk" (obviously written in better English ;-)

But if you INSERT a single row later, it will NOT get inserted to the right place. So
SORT is still necessary.

MAYBE, but I am not sure at all, the sort may take place in less "real" time than in case
the table was not CLUSTERed, as the table is "nearly" sorted.

Hackers, is the sorting algorithm capable of exiting at the very moment the table is
sorted, or are some extra passes always calculated?

Good luck!

Antonio

Brent Verner wrote:


On 30 Oct 2001 at 11:44 (+0100), Tille, Andreas wrote:
| On Mon, 29 Oct 2001, Vsevolod Lobko wrote:
|
| > Seems that problem is very simple :))
| > MSSql can do queries from indexes, without using actual table at all.
| > Postgresql doesn't.
| >
| > So mssql avoids sequential scanning of the big table, and simply does a scan of
| > the index, which is already in the needed order and has a very much smaller size.
| I forwarded this information to my colleague and he replied the following
| (I'm translating from German into English):
|
| hc> I expected this problem. But what is the purpose of an index: Not
| hc> to look into the table itself. Moreover this means that the expense
| hc> grows linear with the table size - no good prospect at all (the
| hc> good thing is it is not exponential :-)).
| I have to explain that we are at the *beginning* of the production process.
| We expect a lot more data.
|
| hc> In case of real index usage the expense grows only with log(n).
| hc> No matter about the better philosophy of database servers, MS-SQL-Server
| hc> has a consequent index usage and so it is very fast at many queries.
| hc> When performing a query to a field without index, I get a slow
| hc> table scan. This is like measuring the speed of the harddisk and
| hc> the cleverness of the cache.
|
| The consequence for my problem is now: If it is technically possible
| to implement index scans without table lookups please implement it. If
| not we just have to look for another database engine which does so,
| because our application really needs the speed on this type of queries.
| I repeat from my initial posting: The choice of the server for our
| application could have importance for many projects in the field of
| medicine in Germany. I really hope that there is a reasonable solution
| which perhaps could give a balance between safety and speed. For
| example I can assure in my application that the index, once created
| will be valid, because I just want to read in a new set of data once
| a day (from the MS-SQL Server which collects data over the day). So
| I could recreate all indices after the import and the database is
| readonly until the next cron job. Is there any chance to speed up
| those applications?

CREATE INDEX idx_meldekategorie_hauptdaten_f
ON hauptdaten_fall(meldekategorie);
CLUSTER idx_meldekategorie_hauptdaten_f ON hauptdaten_fall;

Aggregate  (cost=5006.02..5018.90 rows=258 width=16)
  ->  Group  (cost=5006.02..5012.46 rows=2575 width=16)
        ->  Sort  (cost=5006.02..5006.02 rows=2575 width=16)
              ->  Seq Scan on hauptdaten_fall  (cost=0.00..4860.12 rows=2575 width=16)

This looks much nicer, but is still quite slow. I'm quite sure the
slowness is in the sort(), since all queries that don't sort, return
quickly. I hoped the clustered index would speed up the sort, but
that is not the case.

It _seems_ a simple optimization would be to not (re)sort the tuples
when using a clustered index.

if (the_column_to_order_by_is_clustered) {
    /* skip the sort: tuples already come out in clustered order */
    if (order_by_is_DESC)
        reverse_the_tuples();  /* walk the clustered order backwards */
}

I haven't looked at the code to see if this is even feasible, but I
do imagine there is enough info available to avoid an unnecessary
sort on the CLUSTERED index. The only problem I see with this is
if the CLUSTERed index is not kept in a CLUSTERed state as more
records are added to this table.

brent

--
"Develop your talent, man, and leave the world something. Records are
really gifts from people. To think that an artist would love you enough
to share his music with anyone is a beautiful thing." -- Duane Allman


#19Tille, Andreas
TilleA@rki.de
In reply to: Jean-Michel POURE (#17)
general
Re: [HACKERS] Serious performance problem

On Tue, 30 Oct 2001, Jean-Michel POURE wrote:

Have you done tests on MS SQL with xx million rows. How much data are you
going to have in a single database?

No, because I will not have such a database in the next couple of years.
Why should I? But there are people who claim to do this successfully (not
verified by myself, for sure).

Yes. This is what multi-dimensional databases do.
This time, we will do it "by hand" in PostgreSQL.
With good naming conventions, this will be clean.

Am I ignorant, or why do I fail to see the relation to my original problem
that PostgreSQL performs more than 10 times slower than SQL server for a
*simple* task? I consider that unacceptable for the simple task, but you
are talking about hard tasks. I'm afraid we are talking about different
things.

using indexes strictly. So chances are high that we will run into
performance trouble again even if we now could cope with the problems.

Beware PHP native MS SQL driver do not work well.

Oh, please do not remind me of the consequences if my PostgreSQL efforts
were to fail. It would not be PHP. It would be worse. It would be the
technology of the same people who wrote MS-SQL server *shrug*. And not
only that: it would mean that *every* web project would be done with ASP,
because people think we should stick to just one web technology
(which is not wrong in principle). My choice was PostgreSQL + Zope ...

Maybe I would give MySQL a try first ...

This is for programming in PL/pgSQL and administrating PostgreSQL .
Without pgAdmin, you will quickly mess-up with objects (tables, views,
triggers).

Huh, then I hope we will get a port of it soon ...
But for the moment I have had no problems without it.

Kind regards

Andreas.

#20Tille, Andreas
TilleA@rki.de
In reply to: Antonio Fiol Bonnín (#18)
hackers
Re: Serious performance problem

On Tue, 30 Oct 2001, Antonio Fiol Bonnín wrote:

AFAIK, sorting is necessary even when you have CLUSTERed a table using an index.

Sorting is not the performance constraint in my example. Just leave out
the sorting and see what happens ...

But if you INSERT a single row later, it will NOT get inserted to the right place. So
SORT is still necessary.

Well, rearranging the database in a cron job overnight, after inserting
new data once a day, would be possible - but I doubt that it makes a big difference.
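For the record, such a nightly rearrangement could look roughly like this - a sketch only, reusing the index name Brent suggested earlier in the thread:

```sql
-- Run after the nightly import, while the database is otherwise read-only.
-- Rewrites the table in index order (note: in PostgreSQL, CLUSTER must be
-- repeated after bulk inserts, since new rows are not kept in order).
CLUSTER idx_meldekategorie_hauptdaten_f ON hauptdaten_fall;

-- Refresh planner statistics so the optimizer sees the new layout.
VACUUM ANALYZE hauptdaten_fall;
```

Whether this actually helps depends on whether the sort, and not the sequential scan, is the bottleneck - which is exactly the open question here.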

Kind regards

Andreas.
