Re: random rows

Started by Ing. Roberto Andrade Fonseca · almost 25 years ago · 9 messages · general

On Thu, 26 Apr 2001, Jie Liang wrote:

How can I return N random rows from my select statement?
e.g. if my select returns a set of 1000 rows, I want to randomly
pick 100 of the 1000.

I've done something like:

prereg_iw=# select random(), nombre from asistente order by 1 limit 10;
random | nombre
----------------------+---------------
0.000214523170243261 | ALFONSO
0.000478655100091666 | OSCAR
0.000892118085591177 | JOSE LUIS
0.000972398091560415 | IGNACIO
0.00109919300354048 | NORBERTO
0.00180558208460248 | SALVADOR
0.00196880800741204 | ANDRES NOE
0.00197171233686233 | SARA MICHELLE
0.00226354412839913 | DAVID
0.00233715307076329 | RODOLFO
(10 rows)

and it works!
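The same trick works outside SQL as well: tag every row with a random key, sort on the key, and keep the first N. A minimal Python sketch of the idea behind `ORDER BY random() LIMIT n`:

```python
import random

def sample_n(rows, n):
    """Pick n rows uniformly at random by sorting on a random key,
    mirroring SELECT ... ORDER BY random() LIMIT n."""
    keyed = [(random.random(), row) for row in rows]
    keyed.sort(key=lambda pair: pair[0])
    return [row for _, row in keyed[:n]]

names = ["ALFONSO", "OSCAR", "JOSE LUIS", "IGNACIO", "NORBERTO"]
picked = sample_n(names, 3)
print(picked)
```

(In Python itself `random.sample(names, 3)` does the same job directly; the sort version is shown only because it is the shape of the SQL solution.)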

Saludos,

Roberto Andrade Fonseca
randrade@abl.com.mx

#2Jie Liang
jliang@ipinc.com
In reply to: Ing. Roberto Andrade Fonseca (#1)

thanks.

Jie LIANG

St. Bernard Software

10350 Science Center Drive
Suite 100, San Diego, CA 92121
Office:(858)320-4873

jliang@ipinc.com
www.stbernard.com
www.ipinc.com


#3Joel Burton
jburton@scw.org
In reply to: Jie Liang (#2)

On Thu, 26 Apr 2001, Jie Liang wrote:

I've done something like:

prereg_iw=# select random(), nombre from asistente order by 1 limit 10;
random | nombre
----------------------+---------------
0.000214523170243261 | ALFONSO
0.000478655100091666 | OSCAR
0.000892118085591177 | JOSE LUIS
0.000972398091560415 | IGNACIO
0.00109919300354048 | NORBERTO
0.00180558208460248 | SALVADOR
0.00196880800741204 | ANDRES NOE
0.00197171233686233 | SARA MICHELLE
0.00226354412839913 | DAVID
0.00233715307076329 | RODOLFO
(10 rows)

How totally obvious in retrospect, and how much better a solution than
the ones I offered. D'oh!

Thanks for posting it to the list.

--
Joel Burton <jburton@scw.org>
Director of Information Systems, Support Center of Washington

#4John Coers
coers@intrinsity.com
In reply to: Joel Burton (#3)
Performance with Large Volumes of Data

Hi,

I am not a "real" sysadmin or DBA, but "got stuck" doing the job and am trying to learn via a fairly
difficult problem that my group must deal with: a LARGE volume of data. I have been working
from scratch on this for about 3 weeks and have run lots of tests.

I am running PostgreSQL 7.1 on Solaris 5.7 with 1 GB of RAM, two 300 MHz processors, and a 6 GB partition.

The application COPYs a LARGE amount of data (an average of 15k rows
of 3 ints every 15 minutes or so, from 170 machines during the day and 250 at night) into a db
and then runs a query after the fact. The COPYs are done via the libpq PQputline()
subroutine. The after-the-fact query will postprocess the data, reduce its
amount and granularity, then load it into a new table. Ultimately, I will have
1 db with a table of about 250M rows and several other dbs each with tens of millions of rows...
multiple GBs of data.

Here are the options I run with postmaster:

postmaster -D /evsx/aus16/coers -o "-S 32768" -i -B 8192 -N 2

Here are my IPC params:
set shmsys:shminfo_shmmax=524288000
set shmsys:shminfo_shmmin=16
set shmsys:shminfo_shmmni=200
set shmsys:shminfo_shmseg=200
set semsys:seminfo_semmni=500
set semsys:seminfo_semmns=500
set semsys:seminfo_semmsl=500
set semsys:seminfo_semume=100

Here are my questions:

1) NUM OF CONNECTIONS: I use -N 2 because that seems to optimize performance. If I allow more connections,
the server bogs down, ultimately to a near-standstill if I allow too many. I assume
this is because the competing connections are all trying to COPY to the same database,
all but one get blocked, and the semaphore chasing slows everything down. The weird
thing is that the CPU, iowait, and swap figures in top do not rise very much. What is slowing
things down in this case? Currently, I have each client try for a connection and, if it fails,
wait rand()%4+1 seconds and then try again. This actually works pretty well, but it seems to me
that the server should be handling this and doing a more efficient job. Also, am I correct
in assuming there is no way to keep multiple COPYs to the same db & table from blocking? When
I tried to set nonblocking mode, data got dropped.
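The client-side retry loop described above (try to connect; on failure wait rand()%4+1 seconds; try again) can be sketched as follows. `fake_connect` is a hypothetical stand-in for PQconnectdb(), used only to exercise the loop:

```python
import random
import time

def connect_with_retry(connect_once, max_attempts=10, sleep=time.sleep):
    """Try to connect; on each failure wait 1..4 seconds (rand()%4+1)
    before retrying, as described in the message above."""
    for _ in range(max_attempts):
        conn = connect_once()
        if conn is not None:
            return conn
        sleep(random.randrange(4) + 1)  # 1..4 second random backoff
    raise RuntimeError("could not connect after %d attempts" % max_attempts)

# Hypothetical stand-in for PQconnectdb(): fails twice, then succeeds.
attempts = {"n": 0}
def fake_connect():
    attempts["n"] += 1
    return "conn" if attempts["n"] >= 3 else None

conn = connect_with_retry(fake_connect, sleep=lambda s: None)
```

The `sleep` parameter is injected only so the demo (and a test) need not actually wait.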

2) BOTTLENECK: I was running 2 queries on the 65M rows of data I had collected after I had finished loading.
I had not indexed the tables. Based on the top reading below, what is the bottleneck that is
slowing the query down? The same phenomenon occurs when COPYing data into the table.

last pid: 15973; load averages: 0.98, 0.92, 0.78
41 processes: 38 sleeping, 1 running, 2 on cpu
CPU states: 48.5% idle, 43.8% user, 5.2% kernel, 2.6% iowait, 0.0% swap
Memory: 1024M real, 17M free, 109M swap in use, 2781M swap free

PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
15919 postgres 1 40 0 71M 69M cpu0 30:58 22.12% postgres
15966 postgres 1 30 0 71M 69M run 12:02 23.63% postgres

3) CONFIGURATION/SETTINGS: Are my IPC params and postmaster options set right for my application?
My thinking is that I need lots of shared memory to reduce disk access. Am I missing something? Are
there any other configurable kernel params that I need to know about? What exactly will the sort
mem (-o "-S") buy me?
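For scale, here is what the flags above amount to in bytes, assuming PostgreSQL's default 8 KB block size: -B 8192 means 8192 shared buffers of 8 KB each (64 MB), well under the 500 MB shmmax shown, and -S 32768 is per-sort memory in kilobytes (32 MB):

```python
BLOCK_SIZE = 8 * 1024      # default PostgreSQL block size (8 KB)
shared_buffers = 8192      # -B 8192
sort_mem_kb = 32768        # -S 32768 (kilobytes available to each sort)
shmmax = 524288000         # shminfo_shmmax from the IPC params above

buffer_bytes = shared_buffers * BLOCK_SIZE
sort_bytes = sort_mem_kb * 1024

print(buffer_bytes // (1024 * 1024), "MB of shared buffers")
print(sort_bytes // (1024 * 1024), "MB per sort")
```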

Thanks for your patience!

--
John Coers Intrinsity, Inc.
coers@intrinsity.com Austin, Texas

#5John Coers
coers@intrinsity.com
In reply to: Joel Burton (#3)
Re: Performance with Large Volumes of Data

I recently posted this on the admin list and got no response. Could
anyone here help me?


#6John Coers
coers@intrinsity.com
In reply to: Joel Burton (#3)
Copy

Does a COPY FROM block? In case I am using the wrong terminology,
what I need to know is if I have multiple clients performing COPY FROM...PQputline()
using asynchronous connections, will I lose data? Will the server simply execute
them serially?

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Coers (#6)
Re: Copy

John Coers <coers@intrinsity.com> writes:

Does a COPY FROM block? In case I am using the wrong terminology,
what I need to know is if I have multiple clients performing COPY FROM...PQputline()
using asynchronous connections, will I lose data?

No.

Will the server simply execute them serially?

They'll be executed in parallel, just the same as if each client had
done a bunch of INSERTs wrapped in a BEGIN/END block. You'd only see
problems if clients inserted conflicting data (eg, identical keys in
a column with a UNIQUE index). You may care to read the documentation
about multi-version concurrency control and locking.

regards, tom lane
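The "INSERTs wrapped in a BEGIN/END block" pattern can be sketched with Python's stdlib sqlite3 (a different engine, so this only illustrates the batching pattern, not PostgreSQL's MVCC concurrency semantics):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (a INTEGER, b INTEGER, c INTEGER)")

rows = [(i, i * 2, i * 3) for i in range(1000)]

# One transaction around the whole batch, analogous to COPY
# or to BEGIN; INSERT ...; INSERT ...; END
with conn:  # commits once at the end, or rolls back on error
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)

count = conn.execute("SELECT count(*) FROM samples").fetchone()[0]
print(count)
```

Committing once per batch instead of once per row is what makes bulk loading cheap in either form.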

#8John Coers
coers@intrinsity.com
In reply to: Joel Burton (#3)
Re: Copy

Tom Lane wrote:

John Coers <coers@intrinsity.com> writes:

Does a COPY FROM block? In case I am using the wrong terminology,
what I need to know is if I have multiple clients performing COPY FROM...PQputline()
using asynchronous connections, will I lose data?

No.

Will the server simply execute them serially?

They'll be executed in parallel, just the same as if each client had
done a bunch of INSERTs wrapped in a BEGIN/END block. You'd only see
problems if clients inserted conflicting data (eg, identical keys in
a column with a UNIQUE index). You may care to read the documentation
about multi-version concurrency control and locking.

regards, tom lane

hmmmm... I get the best performance with N=2, forcing the clients to keep retrying until they
get a connection. If I leave the default of N=32, the machine gets bogged down
to a complete standstill by 32 COPYs of 15k rows each. The CPU doesn't seem
stressed -- what may be causing the bottleneck?

--
John Coers Intrinsity, Inc.
coers@intrinsity.com Austin, Texas

#9John Coers
coers@intrinsity.com
In reply to: Joel Burton (#3)
Re: Copy

Tom Lane wrote:

John Coers <coers@intrinsity.com> writes:

Does a COPY FROM block? In case I am using the wrong terminology,
what I need to know is if I have multiple clients performing COPY FROM...PQputline()
using asynchronous connections, will I lose data?

No.

Will the server simply execute them serially?

They'll be executed in parallel, just the same as if each client had
done a bunch of INSERTs wrapped in a BEGIN/END block. You'd only see
problems if clients inserted conflicting data (eg, identical keys in
a column with a UNIQUE index). You may care to read the documentation
about multi-version concurrency control and locking.

Do I need a PQconnectStart()/PQconnectPoll() connection to make the server process
the COPYs in parallel, or does that just affect the client-side thread? How about a
PQconnectdb() and then PQsetnonblocking()?

Can I just do a PQconnectdb()?

I am unclear on how these different connection types affect the client threads vs. the server threads.
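For what it's worth, PQconnectStart()/PQconnectPoll() only change how the *client* waits for the connection to come up; the server treats an established connection the same either way. The nonblocking-connect pattern they implement looks roughly like this with raw sockets (an illustration of the client-side mechanics only, not libpq itself; the throwaway local listener stands in for the server):

```python
import selectors
import socket

# A throwaway local listener to connect to (stands in for the server).
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
addr = listener.getsockname()

# Nonblocking connect: start it, then poll for writability, which is
# roughly what a PQconnectPoll loop waits on for the socket.
sock = socket.socket()
sock.setblocking(False)
sock.connect_ex(addr)  # returns immediately (0 or EINPROGRESS)

sel = selectors.DefaultSelector()
sel.register(sock, selectors.EVENT_WRITE)
sel.select(timeout=5)  # block here (or do other work) until it completes
connected = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0

sock.close()
listener.close()
print(connected)
```

A plain blocking PQconnectdb() simply does this waiting for you inside the call.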

--
John Coers Intrinsity, Inc.
coers@intrinsity.com Austin, Texas