performance issue using DBI
I'm not a DB expert, but I think that a transaction with more than 20,000 operations can't be quick anyway.
You may have to work another way, perhaps reading one line at a time, doing the operations, committing, and deleting the line from the file.
To keep track of all these operations I think Postgres must swap more than a 386 with 4 MB of RAM!
Bye
Giorgio
----- Original Message -----
From: Nicolas Nolst
To: pgsql-admin@postgresql.org ; pgsql-general@postgresql.org
Sent: Thursday, June 06, 2002 11:40 AM
Subject: [ADMIN] performance issue using DBI
Hi all,
I have developed a Perl script to populate a database with two tables: sessions
and actions.
the table actions contains the following columns: session_id, url, timestamp.
The column session_id references the table sessions.
the table sessions contains the following columns: session_id, remote_ip,
phone_type, phone_number. The column session_id is serial.
The rows of the table actions which are part of the same session share the same
session_id.
There are therefore more rows in the table actions than in the table sessions.
To fill the two tables, I first need to know if the session already exists for a
certain phone_type, a certain remote_ip and a certain phone_number:
SELECT session_id FROM sessions WHERE (phone_number = ?) AND (remote_ip = ?)
AND (phone_type = ?) ORDER BY session_id;
I also need to apply a criterion to know whether I have to add a new entry in the
table sessions or not:
SELECT (max(timestamp) + ?) < ? FROM actions WHERE (session_id = ?);
If the session already exists I add a row in the table actions with an INSERT.
If the session doesn't exist or if the criterion is true, I add a row in the
table sessions with an INSERT and then add a row with an INSERT in the table actions (I use nextval and currval).
I have put indexes on sessions(session_id), sessions(msisdn),
actions(session_id) and actions(timestamp). I process one log file of about 20000 lines every day. All the lines are processed in one transaction (autocommit set to 0).
My problem is that populating my database gets slower as the data grows,
and performance falls dramatically. I thought this would improve with
my indexes but the problem persists.
Could you please give me some clues to solve this issue.
Thanks.
Nicolas Nolst
How much data is 20,000 lines? 20KB? 100MB? You might do well to just
process this in memory and then just COPY the right data to the table.
This gets away from doing the SELECT/INSERT/SELECT/INSERT thing which is
going to be painful for batch jobs.
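The COPY suggestion can be sketched roughly like this in Perl: accumulate rows in memory, then emit them in COPY's tab-separated text format for a single "COPY actions FROM stdin". This is a minimal sketch, not the poster's script; the escaping follows COPY's text format (backslash, tab, newline), and the column order is assumed from the thread.

```perl
#!/usr/bin/perl
# Sketch: format in-memory rows as PostgreSQL COPY text-format lines.
use strict;
use warnings;

# Escape each field per COPY's text format and join with tabs.
sub copy_row {
    my @fields = map {
        my $f = $_;
        $f =~ s/\\/\\\\/g;   # backslash must be escaped first
        $f =~ s/\t/\\t/g;    # literal tab
        $f =~ s/\n/\\n/g;    # literal newline
        $f;
    } @_;
    return join("\t", @fields) . "\n";
}

# One actions row: session_id, url, timestamp.
print copy_row(42, 'http://example.com/page?a=1', '2002-06-06 11:40:00');
```

Feeding the whole file through one COPY replaces tens of thousands of per-row INSERTs with a single bulk load.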
See, the thing is that on every insert the indexes have to be updated.
The planner won't use them well until after you VACUUM ANALYZE the tables, so
they do no good inside your transaction. Drop the indexes for the load and
recreate them afterwards.
Also, leave off with your use of currval/nextval. The point of a serial
type is that the column increments each time automagically. For what you
are doing you could turn that serial into an integer, create your sequence
separately, exec nextval('yourseq') and then just use the value you
retrieved. Don't use currval since some other process might alter the
sequence between the time you call nextval and currval. Just store the
value.
I think this problem is better solved on the perl side than on the
PostgreSQL side. Consider using data structures like so. You can probably
create a better structure since you know your data and I don't.
%sessions - nested hash, session_id at the leaves:
{ remote_ip => { phone_type => { phone_number => session_id,
                                 phone_number => session_id },
                 phone_type => { phone_number => session_id,
                                 phone_number => session_id } },
  remote_ip => { phone_type => { phone_number => session_id,
                                 phone_number => session_id },
                 phone_type => { phone_number => session_id,
                                 phone_number => session_id } } }

@actions - now session_id is the array offset:
[ [ [ url, timestamp ],
    [ url, timestamp ],
    [ url, timestamp ] ],
  [ [ url, timestamp ],
    [ url, timestamp ],
    [ url, timestamp ] ],
  [ [ url, timestamp ],
    [ url, timestamp ],
    [ url, timestamp ] ] ]
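The nested-hash idea above can be flattened into something runnable. This is a minimal sketch under assumptions, not the poster's actual script: it keys sessions on the (remote_ip, phone_type, phone_number) triple joined with '|' (my choice, not from the thread) and uses the position in @actions as the session_id.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %session_of;   # "ip|type|number" => session_id
my @actions;      # $actions[$session_id] = [ [url, timestamp], ... ]
my $next_id = 0;

# Record one log line, creating a session on first sight of the triple.
sub add_log_line {
    my ($remote_ip, $phone_type, $phone_number, $url, $timestamp) = @_;
    my $key = join '|', $remote_ip, $phone_type, $phone_number;
    my $sid = exists $session_of{$key}
            ? $session_of{$key}
            : ($session_of{$key} = $next_id++);
    push @{ $actions[$sid] }, [ $url, $timestamp ];
    return $sid;
}

# Tiny demo: two hits from one phone, one from another.
add_log_line('10.0.0.1', 'nokia', '555-1234', '/a', 100);
add_log_line('10.0.0.1', 'nokia', '555-1234', '/b', 101);
add_log_line('10.0.0.2', 'sony',  '555-9999', '/c', 102);
printf "%d sessions, %d actions\n",
    scalar(keys %session_of), scalar(map { @$_ } @actions);
# prints "2 sessions, 3 actions"
```

The session-timeout criterion from the original post would slot in by also tracking the last timestamp per session and minting a new id when the gap is too large.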
On Thu, 2002-06-06 at 14:55, Joshua b. Jore wrote:
Don't use currval since some other process might alter the
sequence between the time you call nextval and currval.
This is wrong. currval() will always return the last serial assigned in
*the*same*session*. It is not affected by other users at all.
--
Oliver Elphick Oliver.Elphick@lfix.co.uk
Isle of Wight http://www.lfix.co.uk/oliver
GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839 932A 614D 4C34 3E1D 0C1C
"Therefore being justified by faith, we have peace with
God through our Lord Jesus Christ." Romans 5:1
On 6 Jun 2002 at 17:14, Oliver Elphick wrote:
On Thu, 2002-06-06 at 14:55, Joshua b. Jore wrote:
Don't use currval since some other process might alter the
sequence between the time you call nextval and currval.

This is wrong. currval() will always return the last serial assigned in
*the*same*session*. It is not affected by other users at all.
Folks, here's a short test which might help. Note the BEGIN.
$ psql testing
testing=# create sequence test;
CREATE
testing=# select currval('test');
ERROR: test.currval is not yet defined in this session
testing=# select setval('test', 1);
setval
--------
1
(1 row)
testing=# select currval('test');
currval
---------
1
(1 row)
testing=# select currval('test');
currval
---------
1
(1 row)
testing=# select currval('test');
currval
---------
1
(1 row)
testing=# select currval('test');
currval
---------
1
(1 row)
testing=#
Then, in another window, I did this:
$ psql testing
# select nextval('test');
nextval
---------
2
(1 row)
testing=# select nextval('test');
nextval
---------
3
(1 row)
testing=# select nextval('test');
nextval
---------
4
(1 row)
testing=#
Then back to the other window:
testing=# select currval('test');
currval
---------
1
(1 row)
testing=# select nextval('test');
nextval
---------
5
(1 row)
testing=#
cheers
FWIW: I always use nextval when looking for a new ID.
--
Dan Langille
If you need to get the actual "current value" without incrementing, try:
SELECT last_value FROM <sequence_name>;
This will return the last value returned from a nextval command directly
from the sequence properties in the system tables.
Oh ok. So I was thinking of the way that sequences cut across transactions
and I misinterpreted it. Thanks for the clarification.
Joshua b. Jore ; http://www.greentechnologist.org
Nicolas,
The more data your tables contain the more work
PostgreSQL needs to do. Adding indexes on your tables
makes the INSERTs slower due to the additional overhead.
Indexes speed up SELECTs but slow down INSERTs.
Therefore OLTP systems tend to have few indexes and
data warehouses many.
Instead of adding your data one line at a time I would
use a staging area to speed things up as well as set
operations instead of processing one item at a time in
a loop.
Your raw data obviously contains the following columns:
remote_ip, phone_type, phone_number, url, timestamp
So create a table raw_data with the above columns.
Truncate it before each data load.
As first step of your daily processing load your raw
data into the new table.
Then find out all unique sessions within your raw data:
CREATE TABLE unique_sessions AS
SELECT DISTINCT remote_ip, phone_type, phone_number
FROM raw_data;
Now find all the new sessions:
CREATE TABLE new_sessions AS
SELECT remote_ip, phone_type, phone_number
FROM unique_sessions
EXCEPT
SELECT remote_ip, phone_type, phone_number
FROM sessions;
Finally insert the new sessions into the sessions table:
INSERT INTO sessions
SELECT nextval('sessions_session_id_seq'), remote_ip, phone_type,
phone_number
FROM new_sessions;
Now insert into the actions table:
INSERT INTO actions
SELECT s.session_id, r.url, r.timestamp
FROM raw_data r, sessions s
WHERE s.remote_ip = r.remote_ip
AND s.phone_type = r.phone_type
AND s.phone_number = r.phone_number;
As the last step clean up the staging tables no longer
needed:
DROP TABLE unique_sessions;
DROP TABLE new_sessions;
TRUNCATE TABLE raw_data;
Regards,
Nikolaus
On Thu, Jun 06, 2002 at 12:45:09PM -0400, Lee Green wrote:
If you need to get the actual "current value" without incrementing, try:
SELECT last_value FROM <sequence_name>;
This will return the last value returned from a nextval command directly
from the sequence properties in the system tables.
Except that if you do it this way, you *will* be affected by setvals done by
other users. Use currval(), that's what it's there for.
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.
Some of my carefully thought out, and painstakingly graded SQL inserts (or
updates) are failing intermittently because I guess I didn't grade hard enough!
:)
I'm getting ERROR: messages on the console (and rightly so!) ... My question
is, how can I find out what the text of the failed query would be. That would
give me a better clue as to where my code fails. It would also take a lot less
time going through the code SQL statement by SQL statement to find the guilty
one.
I compiled --without-syslog to eke out the last bit of performance from
PostgreSQL. I have no problem recompiling with more feedback detail enabled to
resolve the problem. I just need a clue on where to start looking. The whole
logging part of PostgreSQL is still magical to me.
CG
On Tue, 11 Jun 2002 07:53:19 -0700 (PDT)
"Chris Gamache" <cgg007@yahoo.com> wrote:
I'm getting ERROR: messages on the console (and rightly so!) ... My question
is, how can I find out what the text of the failed query would be.
Set "debug_print_query = true" in postgresql.conf and send the postmaster a
SIGHUP.
I compiled --without-syslog to eke out the last bit of performance from
PostgreSQL.
I'd be surprised if that had any performance impact, particularly since
logging through syslog is disabled by default.
Cheers,
Neil
--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
Redirect the output to a logfile of your choice. There was a huge talk
about this yesterday on the list.
#!/bin/bash
HOME=/usr/local/pgsql
LOG=$HOME/logs/postgres.log
USER=darren

ulimit -SHc unlimited
su $USER -c "nohup $HOME/bin/postmaster -D $HOME/db $@ >$LOG 2>&1 </dev/null&"
The script above is what I use to start the postmaster and, as you can see,
I redirect the output to the file postgres.log. There are also different
logging levels in postgresql.conf that you can turn on.
HTH
Darren Ferguson
The solution is not as sexy as the one I was hoping for... :) I'm not one to
look a gift-horse in the mouth. Thank you for your responses, which were right
on target. I imagine I can combine the debug_print_query with the logging to a
postgres.log and just do a
tail -f postgres.log | grep -B1 ERROR:
and see where I'm going wrong!
CG