110,000,000 rows

Started by John Gage, almost 16 years ago · 16 messages · general
#1 John Gage
jsmgage@numericable.fr

Please forgive this intrusion, and please ignore it, but how many
applications out there have 110,000,000 row tables? I recently
multiplied 85,000 by 1,400 and said no way, Jose.

Thanks,

John Gage

#2 Alex Thurlow
alex@blastro.com
In reply to: John Gage (#1)
Re: 110,000,000 rows

I've had many times that before and things were very slow. That's when
I partitioned it out. Luckily that table was just for reporting and
could be slow. Are you thinking you'll need that many rows and you just
don't know how to handle it? I would recommend partitioning if at all
possible.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
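
The inheritance-based approach the linked docs describe (this thread predates declarative partitioning) looks roughly like the following sketch; the table and column names here are hypothetical, not from the thread:

```sql
-- Parent table holds no rows itself; children carry the data.
CREATE TABLE measurement (
    id      bigint NOT NULL,
    logdate date   NOT NULL,
    value   numeric
);

-- One child per month, with a CHECK constraint describing its slice.
CREATE TABLE measurement_2010_05 (
    CHECK (logdate >= DATE '2010-05-01' AND logdate < DATE '2010-06-01')
) INHERITS (measurement);

-- Inserts are routed to the right child (directly, or via a trigger).
INSERT INTO measurement_2010_05 VALUES (1, DATE '2010-05-26', 42);

-- With constraint exclusion on, queries filtering on logdate
-- only scan the matching children.
SET constraint_exclusion = on;
SELECT * FROM measurement WHERE logdate = DATE '2010-05-26';
```

The docs page linked above covers the insert-routing trigger and the maintenance details this sketch leaves out.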

-Alex


On 5/26/2010 3:29 PM, John Gage wrote:

Please forgive this intrusion, and please ignore it, but how many
applications out there have 110,000,000 row tables? I recently
multiplied 85,000 by 1,400 and said no way, Jose.

Thanks,

John Gage

#3 Vick Khera
vivek@khera.org
In reply to: John Gage (#1)
Re: 110,000,000 rows

On Wed, May 26, 2010 at 4:29 PM, John Gage <jsmgage@numericable.fr> wrote:

Please forgive this intrusion, and please ignore it, but how many
applications out there have 110,000,000 row tables?  I recently multiplied
85,000 by 1,400 and said no way, Jose.

/me stands and waves hand. I have two tables in my primary app that
are on this magnitude. Each is served by 100 partitions split on mod
100 of the primary key as that worked out well for my use case.
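
A mod-based split like this can be sketched with the same inheritance-plus-CHECK machinery (hypothetical names, and only 4 partitions shown instead of 100, for brevity):

```sql
CREATE TABLE item (
    id      bigint NOT NULL,
    payload text
);

-- Each child owns the ids in one residue class of the modulus.
CREATE TABLE item_p0 (CHECK (id % 4 = 0)) INHERITS (item);
CREATE TABLE item_p1 (CHECK (id % 4 = 1)) INHERITS (item);
CREATE TABLE item_p2 (CHECK (id % 4 = 2)) INHERITS (item);
CREATE TABLE item_p3 (CHECK (id % 4 = 3)) INHERITS (item);

-- The application (or an insert trigger) targets the child directly.
INSERT INTO item_p1 VALUES (5, 'example');
```

One caveat with modulo partitioning: constraint exclusion only prunes children when the query itself includes the `id % 4 = ...` predicate, so primary-key lookups usually need the application to compute the partition.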

#4 Thom Brown
thombrown@gmail.com
In reply to: John Gage (#1)
Re: 110,000,000 rows

On 26 May 2010 21:29, John Gage <jsmgage@numericable.fr> wrote:

Please forgive this intrusion, and please ignore it, but how many
applications out there have 110,000,000 row tables?  I recently multiplied
85,000 by 1,400 and said no way, Jose.

Thanks,

John Gage

There's no reason why it can't have that many rows. There's no limit
on the number of rows for tables, only table sizes, which is limited
to 32TB. If you ever have a table that big though, you probably need
to rethink your schema.

Thom

#5 Dann Corbit
DCorbit@connx.com
In reply to: Thom Brown (#4)
Re: 110,000,000 rows

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
owner@postgresql.org] On Behalf Of Thom Brown
Sent: Wednesday, May 26, 2010 2:14 PM
To: John Gage
Cc: PostgreSQL - General
Subject: Re: [GENERAL] 110,000,000 rows

On 26 May 2010 21:29, John Gage <jsmgage@numericable.fr> wrote:

Please forgive this intrusion, and please ignore it, but how many
applications out there have 110,000,000 row tables? I recently
multiplied 85,000 by 1,400 and said no way, Jose.

Thanks,

John Gage

There's no reason why it can't have that many rows. There's no limit
on the number of rows for tables, only table sizes, which is limited
to 32TB. If you ever have a table that big though, you probably need
to rethink your schema.

There really are domains that big, where no amount of normalization or other restructuring will mitigate the problem.

Examples:
Microsoft's registered customers database (all MS products bought by any customer, including operating systems)
Tolls taken on the New Jersey road system for FY 2009
DNA data from the Human Genome Project
Protein data from the Protein Folding Project
The US Census bureau's Tiger/Line data
Online orders processed by Amazon.com
Cellular phone calls for T-Mobile for 2008
FedEx shipments worldwide in 2008

We work with mainframe data on a regular basis, and files of that size are not at all unusual. Data volumes grow exponentially over time, and we need to prepare for that. Database systems that cannot handle the volume will be supplanted by those that can.

#6 David Wilson
david.t.wilson@gmail.com
In reply to: John Gage (#1)
Re: 110,000,000 rows

On Wed, May 26, 2010 at 4:29 PM, John Gage <jsmgage@numericable.fr> wrote:

Please forgive this intrusion, and please ignore it, but how many
applications out there have 110,000,000 row tables? I recently multiplied
85,000 by 1,400 and said no way, Jose.

I've got a DB with two 500,000,000+ row tables. Performance is quite good
even on very limited hardware if you set up the tables correctly (cluster,
indices, etc) and have a reasonable usage pattern.

--
- David T. Wilson
david.t.wilson@gmail.com

#7 David Gardner
david@gardnerit.net
In reply to: John Gage (#1)
Re: 110,000,000 rows

At work I have one table with 32 million rows, not quite the size you
are talking about, but to give you an idea of the performance, the
following query returns 14,659 rows in 405ms:

SELECT * FROM farm.frame
WHERE process_start > '2010-05-26';

process_start is a timestamp without time zone column, and is covered by
an index. Rows are relatively evenly distributed over time, so the index
performs quite well.

A between select also performs well:
SELECT * FROM farm.frame
WHERE process_start
BETWEEN '2010-05-26 08:00:00'
AND '2010-05-26 09:00:00';

fetches 1,350 rows in 25ms.

I also have a summary table that is maintained by triggers, which is a
bit of denormalization, but speeds up common reporting queries.
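
A trigger-maintained summary table of this kind might look roughly like the sketch below; `farm.frame` and `process_start` come from the message above, while every other name is hypothetical. (On servers of that era, without UPSERT, the update-then-insert pattern can race under concurrent writers.)

```sql
-- Daily rollup kept in step with farm.frame by a trigger.
CREATE TABLE frame_daily_summary (
    day         date PRIMARY KEY,
    frame_count bigint NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION bump_frame_summary() RETURNS trigger AS $$
BEGIN
    -- Try to bump the existing day's counter...
    UPDATE frame_daily_summary
       SET frame_count = frame_count + 1
     WHERE day = NEW.process_start::date;
    -- ...and create the row on the first insert of the day.
    IF NOT FOUND THEN
        INSERT INTO frame_daily_summary (day, frame_count)
        VALUES (NEW.process_start::date, 1);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER frame_summary_trg
AFTER INSERT ON farm.frame
FOR EACH ROW EXECUTE PROCEDURE bump_frame_summary();
```

Reporting queries then read the small summary table instead of scanning the 32-million-row base table.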


On 22:29 Wed 26 May , John Gage wrote:

Please forgive this intrusion, and please ignore it, but how many
applications out there have 110,000,000 row tables? I recently
multiplied 85,000 by 1,400 and said no way, Jose.

Thanks,

John Gage

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8 Torsten Zühlsdorff
foo@meisterderspiele.de
In reply to: John Gage (#1)
Re: 110,000,000 rows

John Gage schrieb:

Please forgive this intrusion, and please ignore it, but how many
applications out there have 110,000,000 row tables? I recently
multiplied 85,000 by 1,400 and said no way, Jose.

I have two private applications with about 250,000,000 rows per table. I
could cluster them, but the performance is good enough.

Greetings from Germany,
Torsten

--
http://www.dddbl.de - a database layer that abstracts working with 8
different database systems, separates queries from applications, and can
automatically evaluate query results.

#9 John Gage
jsmgage@numericable.fr
In reply to: David Gardner (#7)
Re: 110,000,000 rows

Herbert Simon must be spinning in his grave...or smiling wisely. What
does a human do with a petabyte of data?

But when a desktop machine for $1700 retail has a terabyte of storage,
the unix operating system, 4 gigs of memory, and an amazing 27 inch
display, I guess hardware isn't the problem (and I know one could put
together the same machine on Linux etc. for much less).

I sort of understood that the Amazon's of the world had this amount of
data, but it looks like the phenomenon is much, much more widespread.

Thanks for the instruction. It will come in handy.

John

On May 27, 2010, at 12:18 AM, david@gardnerit.net wrote:


At work I have one table with 32 million rows, not quite the size you
are talking about, but to give you an idea of the performance, the
following query returns 14,659 rows in 405ms:

SELECT * FROM farm.frame
WHERE process_start > '2010-05-26';

process_start is a timestamp without time zone column, and is covered by
an index. Rows are relatively evenly distributed over time, so the index
performs quite well.

A between select also performs well:
SELECT * FROM farm.frame
WHERE process_start
BETWEEN '2010-05-26 08:00:00'
AND '2010-05-26 09:00:00';

fetches 1,350 rows in 25ms.

I also have a summary table that is maintained by triggers, which is a
bit of denormalization, but speeds up common reporting queries.

On 22:29 Wed 26 May , John Gage wrote:

Please forgive this intrusion, and please ignore it, but how many
applications out there have 110,000,000 row tables? I recently
multiplied 85,000 by 1,400 and said no way, Jose.

Thanks,

John Gage


#10 Massa, Harald Armin
chef@ghum.de
In reply to: Dann Corbit (#5)
Re: 110,000,000 rows

Dann,

There really are domains that big, where no amount of normalization or
other restructuring will mitigate the problem.

Examples:
Microsoft's registered customers database (all MS products bought by any
customer, including operating systems)
Tolls taken on the New Jersey road system for FY 2009
DNA data from the Human Genome Project

.....

please also think of our most risk-exposed users, the ones using Poker /
Roulette simulation and analysis software with a PostgreSQL database underneath.
There are so many rounds of Poker to play .... :)

Harald

--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
Using PostgreSQL is mostly about sleeping well at night.

#11 Nikolas Everett
nik9000@gmail.com
In reply to: Massa, Harald Armin (#10)
Re: 110,000,000 rows

I've had a reporting database with just about a billion rows. Each row
was horribly large because the legacy schema had problems. We partitioned
it out by month and it ran about 30 million rows a month. With a reasonably
large box you can get that kind of data into memory and indexes are
almost unnecessary. So long as you have constraint exclusion and a good
partition scheme you should be fine. Throw in a well designed schema and
you'll be cooking well into the tens of billions of rows.

We ran self joins of that table reasonably consistently by the way:
SELECT lhs.id, rhs.id
FROM bigtable lhs, bigtable rhs
WHERE lhs.id > rhs.id
AND '' > lhs.timestamp AND lhs.timestamp >= ''
AND '' > rhs.timestamp AND rhs.timestamp >= ''
AND lhs.timestamp = rhs.timestamp
AND lhs.foo = rhs.foo
AND lhs.bar = rhs.bar

This really liked the timestamp index and we had to be careful to only do it
for a few days at a time. It took a few minutes each go but it was
definitely doable.

Once you get this large you do have to be careful with a few things though:
*It's somewhat easy to write super long queries or updates. This can leave lots
of dead rows in your tables. Limit your longest-running queries to a day or
so. Note that queries are unlikely to take that long but updates with
massive date ranges could. SELECT COUNT(*) FROM bigtable took about 30
minutes when the server wasn't under heavy load.
*You sometimes get bad plans because:
**You don't or can't get enough statistics about a column.
**PostgreSQL doesn't capture statistics about two columns together.
PostgreSQL has no way of knowing that columnA = 'foo' implies columnB =
'bar' about 30% of the time.
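
For the first problem there is at least a per-column knob; for the second (cross-column correlation) there was no equivalent at the time. A minimal sketch, reusing the `bigtable` and `foo` names from above:

```sql
-- Collect a larger statistics sample for a skewed column so the
-- planner's selectivity estimates for foo improve.
ALTER TABLE bigtable ALTER COLUMN foo SET STATISTICS 1000;
ANALYZE bigtable;
```

Nothing comparable teaches the planner that `foo` and `bar` are correlated, so estimates for predicates on both columns remain a product of the individual selectivities.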

Nik

On Thu, May 27, 2010 at 5:58 AM, Massa, Harald Armin <chef@ghum.de> wrote:


Dann,

There really are domains that big, where no amount of normalization
or other restructuring will mitigate the problem.

Examples:
Microsoft's registered customers database (all MS products bought by any
customer, including operating systems)
Tolls taken on the New Jersey road system for FY 2009
DNA data from the Human Genome Project

.....

please also think of our most risk-exposed users, the ones using Poker /
Roulette simulation and analysis software with a PostgreSQL database underneath.
There are so many rounds of Poker to play .... :)

Harald

--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
Using PostgreSQL is mostly about sleeping well at night.

#12 Torsten Zühlsdorff
foo@meisterderspiele.de
In reply to: John Gage (#9)
Re: 110,000,000 rows

John Gage schrieb:

Herbert Simon must be spinning in his grave...or smiling wisely. What
does a human do with a petabyte of data?

For example, I have a private search engine for the sites I use most often.
Google and the others always know only part of a site; my own knows all of
it. It's a good research tool (and mirror) and supports a lot more filter
possibilities than Google. There are many great internet sites out there
which have no search of their own; after waiting for the crawl, that is no
longer a problem for me.

Another big example in my private use is a neural network for figuring out
relations between news and stock prices. Or statistical data on website
usage. Oh, and analyzing Google's behavior is also great fun with a lot of
data. Or a database for typical games like chess or poker. I also have some
databases with geo-data, or freely available data like statistics on birth
numbers in Germany, or a list of all German cities with their inhabitants
(grouped by gender), and so on.

Or computing a list of prime numbers yourself just to run some
implementation tests. Sometimes these databases just grow because you want
to see how long it takes to get x results and you forget to disable the
test after reaching the limit :D

But when a desktop machine for $1700 retail has a terabyte of storage,
the unix operating system, 4 gigs of memory, and an amazing 27 inch
display, I guess hardware isn't the problem (and I know one could put
together the same machine on Linux etc. for much less).

Yes, and for private use you can use such a desktop machine as a
database server. It can work while you're at work ;)

Greetings from Germany,
Torsten
--
http://www.dddbl.de - a database layer that abstracts working with 8
different database systems, separates queries from applications, and can
automatically evaluate query results.

#13 Thom Brown
thombrown@gmail.com
In reply to: Nikolas Everett (#11)
Re: 110,000,000 rows

On 27 May 2010 14:48, Nikolas Everett <nik9000@gmail.com> wrote:

I've had a reporting database with just about a billion rows.  Each row
was horribly large because the legacy schema had problems.  We partitioned
it out by month and it ran about 30 million rows a month.  With a reasonably
large box you can get that kind of data into memory and indexes are
almost unnecessary.  So long as you have constraint exclusion and a good
partition scheme you should be fine.  Throw in a well designed schema and
you'll be cooking well into the tens of billions of rows.
We ran self joins of that table reasonably consistently by the way:
SELECT lhs.id, rhs.id
FROM bigtable lhs, bigtable rhs
WHERE lhs.id > rhs.id
     AND '' > lhs.timestamp AND lhs.timestamp >= ''
     AND '' > rhs.timestamp AND rhs.timestamp >= ''
     AND lhs.timestamp = rhs.timestamp
     AND lhs.foo = rhs.foo
     AND lhs.bar = rhs.bar
This really liked the timestamp index and we had to be careful to only do it
for a few days at a time.  It took a few minutes each go but it was
definitely doable.
Once you get this large you do have to be careful with a few things though:
*It's somewhat easy to write super long queries or updates.  This can leave lots
of dead rows in your tables.  Limit your longest-running queries to a day or
so.  Note that queries are unlikely to take that long but updates with
massive date ranges could.  SELECT COUNT(*) FROM bigtable took about 30
minutes when the server wasn't under heavy load.
*You sometimes get bad plans because:
**You don't or can't get enough statistics about a column.
**PostgreSQL doesn't capture statistics about two columns together.
 PostgreSQL has no way of knowing that columnA = 'foo' implies columnB =
'bar' about 30% of the time.
Nik

What's that middle bit about?

AND '' > lhs.timestamp AND lhs.timestamp >= ''
AND '' > rhs.timestamp AND rhs.timestamp >= ''

Is an empty string being compared against the timestamp? What is that doing, out of curiosity?

Thom

#14 Jasen Betts
jasen@xnet.co.nz
In reply to: John Gage (#1)
Re: 110,000,000 rows

On 2010-05-26, John Gage <jsmgage@numericable.fr> wrote:

Please forgive this intrusion, and please ignore it, but how many
applications out there have 110,000,000 row tables? I recently
multiplied 85,000 by 1,400 and said no way, Jose.

census data would be one.
USA phone whitepages.
transaction records at a medium-sized bank.

you're probably going to want to partition it somehow.

#15 Steve Crawford
scrawford@pinpointresearch.com
In reply to: Jasen Betts (#14)
Re: 110,000,000 rows

On 05/31/2010 03:48 AM, Jasen Betts wrote:

On 2010-05-26, John Gage<jsmgage@numericable.fr> wrote:

Please forgive this intrusion, and please ignore it, but how many
applications out there have 110,000,000 row tables? I recently
multiplied 85,000 by 1,400 and said no way, Jose.

census data would be one.
USA phone whitepages.
transaction records at a medium-sized bank

Text messages, phone-bills, tweets, etc. I have single tables of
market-research-related data that exceed 80-million rows.

But a question to the OP: Setting aside for the moment that
85000*1400=119,000,000, not 110,000,000; what is the significance to you
of these numbers?

Cheers,
Steve

#16 John Gage
jsmgage@numericable.fr
In reply to: Steve Crawford (#15)
Re: 110,000,000 rows

I was aware that there are, in fact, many applications such as census
data or cell phone calls that would easily surpass this number.
However these applications exist in very large companies/organizations
that can throw essentially unlimited resources at the problem. One
thinks of the NSA's computers, for example.

What I was surprised to learn is the extremely common occurrence of
table sizes like this even in databases run by single individuals for
individual needs.

Putting aside the cognitive challenge posed by data like this, what
made me glad to have asked the question were the solutions provided in
the responses.

John

On Jun 1, 2010, at 10:27 PM, Steve Crawford wrote:


transaction records at a medium-sized bank

Text messages, phone-bills, tweets, etc. I have single tables of
market-research-related data that exceed 80-million rows.

But a question to the OP: Setting aside for the moment that
85000*1400=119,000,000, not 110,000,000; what is the significance to
you of these numbers?

Cheers,
Steve
