OSS database needed for testing

Started by Josh Berkus almost 23 years ago; 24 messages
#1Josh Berkus
josh@agliodbs.com

Folks,

Please pardon the cross-posting.

A small group of us on the Performance list were discussing the first steps
toward constructing a comprehensive PostgreSQL installation benchmarking
tool, mostly to compare different operating systems and file systems, but
later to be used as a foundation for a tuning wizard.

To do this, we need one or more real (not randomly generated*) medium-large
databases which are or can be BSD-licensed (data AND schema). Such a database
must have:

1) At least one "main" table with 12+ columns and 100,000+ rows (each).
2) At least 10-12 additional tables of assorted sizes, at least half of which
should have Foreign Key relationships to the main table(s) or each other.
3) At least one large text or varchar field among the various tables.

In addition, the following items would be helpful, but are not required:
4) Views, triggers, and functions built on the database
5) A query log of database activity to give us sample queries to work with.
6) Some complex data types, such as geometric, network, and/or custom data
types.

Thanks for any leads you can give me!

(* To forestall knee-jerk responses: Randomly generated data does not look or
perform the same as real data in my professional opinion, and I'm the one
writing the test scripts.)

--
-Josh Berkus
Aglio Database Solutions
San Francisco

#2Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Josh Berkus (#1)
Re: [HACKERS] OSS database needed for testing

Josh Berkus wrote:

1) At least one "main" table with 12+ columns and 100,000+ rows (each).
2) At least 10-12 additional tables of assorted sizes, at least half of which
should have Foreign Key relationships to the main table(s) or each other.
3) At least one large text or varchar field among the various tables.

In addition, the following items would be helpful, but are not required:
4) Views, triggers, and functions built on the database
5) A query log of database activity to give us sample queries to work with.
6) Some complex data types, such as geometric, network, and/or custom data
types.

Might I recommend the FCC database of transmitters? It's publicly
available via anonymous FTP, medium largish with tables running 100k to
1M+ records, and it demonstrates many interesting test cases. For example,
lat/lon spatial queries (R-tree vs. GiST) can be tested with a decent
volume. It is also a good example of the use of schemas.
Email me if you want info.

Format is pipe delimited (non quoted), and data turnover is < 1% a week.
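
For what it's worth, here is a rough sketch of the kind of bounding-box
query this data makes possible. Table and column names are invented for
illustration, not the real FCC layout; on newer PostgreSQL releases the
point type has a GiST operator class, which stands in for the old
rtree-vs-GiST comparison:

    CREATE TABLE transmitters (
        callsign  varchar(12),
        service   varchar(4),
        location  point            -- stored as (longitude, latitude)
    );

    -- GiST index over the point column
    CREATE INDEX transmitters_location_idx
        ON transmitters USING gist (location);

    -- All transmitters inside a one-degree box around San Francisco
    SELECT callsign, service
    FROM   transmitters
    WHERE  location <@ box '((-123.0,37.0),(-122.0,38.0))';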

Merlin

#3mlw
pgsql@mohawksoft.com
In reply to: Josh Berkus (#1)
Re: [HACKERS] OSS database needed for testing

I don't know that it meets your criteria, but.....

I have a set of scripts and a program that will load the US Census TigerUA
database into PostgreSQL. The thing is absolutely freak'n huge. I forget
which, but it is either 30g or 60g of data excluding indexes.

Also, if that is too much, I have a similar setup to load the FreeDB music
database, from www.freedb.org. It has roughly 670,000 entries in "cdtitles"
and 8 million entries in "cdsongs."
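
To show how that maps onto your requirements, here is a minimal sketch of
what the two tables could look like (column names are guesses, not the
actual FreeDB layout), with the foreign key relationship you asked for:

    CREATE TABLE cdtitles (
        disc_id  integer PRIMARY KEY,
        artist   text NOT NULL,
        title    text NOT NULL,
        genre    varchar(32),
        year     integer
    );

    CREATE TABLE cdsongs (
        disc_id   integer NOT NULL REFERENCES cdtitles (disc_id),
        track_no  integer NOT NULL,
        song      text NOT NULL,
        PRIMARY KEY (disc_id, track_no)
    );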

For either one, I would be willing to send you the actual DB on CD(s)
if you pay for postage and media.


#4Bruno Wolff III
bruno@wolff.to
In reply to: mlw (#3)
Re: [PERFORM] [HACKERS] OSS database needed for testing

On Thu, Apr 03, 2003 at 13:26:01 -0500,
pgsql@mohawksoft.com wrote:

I don't know that it meets your criteria, but.....

I have a set of scripts and a program that will load the US Census TigerUA
database into PostgreSQL. The thing is absolutely freak'n huge. I forget
which, but it is either 30g or 60g of data excluding indexes.

Are the data model or the loading scripts available publicly?
I have the Tiger data and a program that uses it to convert addresses
to latitude and longitude, but I don't really like the program and
was thinking about trying to load the data into a database and do
queries against the database to find locations.

#5mlw
pgsql@mohawksoft.com
In reply to: Josh Berkus (#1)
Re: [PERFORM] [HACKERS] OSS database needed for testing

Bruno Wolff III wrote:

Are the data model or the loading scripts available publicly?
I have the Tiger data and a program that uses it to convert addresses
to latitude and longitude, but I don't really like the program and
was thinking about trying to load the data into a database and do
queries against the database to find locations.

I have a set of scripts, SQL table defs, and a small C program, along with
a set of field-width files, that loads it into PostgreSQL using "copy from
stdin". It works fairly well, but takes a good long time to load it all.
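
The heart of it is just a "COPY ... FROM STDIN" fed by the preprocessing
program; a minimal sketch (the table and rows here are made up for
illustration, the real column list comes from the table defs):

    CREATE TABLE tiger_lines (
        tlid    integer PRIMARY KEY,   -- line identifier (illustrative)
        fename  text,                  -- feature name
        zipl    varchar(5),            -- ZIP code, left side
        zipr    varchar(5)             -- ZIP code, right side
    );

    -- The loader streams tab-separated rows to the backend and ends
    -- the stream with a line containing only \.
    COPY tiger_lines FROM STDIN;
    1001	MAIN ST	28711	28712
    1002	OAK AVE	28711	28711
    \.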

Should I put it in the download section of my website?

#6Bruno Wolff III
bruno@wolff.to
In reply to: mlw (#5)
Re: [PERFORM] [HACKERS] OSS database needed for testing

On Thu, Apr 03, 2003 at 17:19:13 -0500,
mlw <pgsql@mohawksoft.com> wrote:

I have a set of scripts, SQL table defs, and a small C program, along with
a set of field-width files, that loads it into PostgreSQL using "copy from
stdin". It works fairly well, but takes a good long time to load it all.

Should I put it in the download section of my website?

Yes. I would be interested in looking at it even if I don't end up doing
things exactly the same way. Taking a long time to load the data into the
database isn't a big deal for me. Reading through the Tiger (and FIPS) data
documentation, it seemed like there might be some gotchas in unusual cases,
and I am not sure the Google contest program really handled things right,
so I would like to see another implementation. I am also interested in the
data model, as that will save me some time.

#7Jeffrey D. Brower
jeff@pointhere.net
In reply to: Josh Berkus (#1)
Re: [HACKERS] OSS database needed for testing

Hi Josh,

Let me vote for the Tiger data. I used to use this database. It is public,
updated by the government, VERY useful in its own right, it works well with
the earthdistance contribution, it is a real-world database a lot of us use,
and I think you can put together some killer scripts on it.

Can I vote twice? <g>

Jeff


#8Josh Berkus
josh@agliodbs.com
In reply to: Jeffrey D. Brower (#7)
Re: [HACKERS] OSS database needed for testing

Jeff,

Let me vote for the Tiger data. I used to use this database. It is public,
updated by the government, VERY useful in its own right, it works well with
the earthdistance contribution, it is a real-world database a lot of us use,
and I think you can put together some killer scripts on it.

We'd have to use a subset of it. 30G is a little larger than anything we
want people to download as a test package.

--
Josh Berkus
Aglio Database Solutions
San Francisco

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#8)
Re: [HACKERS] OSS database needed for testing

Josh Berkus <josh@agliodbs.com> writes:

We'd have to use a subset of it. 30G is a little larger than anything we
want people to download as a test package.

Yeah, it seems a bit over the top ...

The FCC database sounded like an interesting alternative to me.

regards, tom lane

#10mlw
pgsql@mohawksoft.com
In reply to: Josh Berkus (#1)
Re: [HACKERS] OSS database needed for testing

Josh Berkus wrote:

We'd have to use a subset of it. 30G is a little larger than anything we
want people to download as a test package.

Actually, come to think of it, the TigerUA DB is in chunks. You can use
as much or as little as you want. I'll put the loader scripts on my
download page tonight.

Here is the home page for the data:
http://www.census.gov/geo/www/tiger/tigerua/ua_tgr2k.html

#11Jeffrey D. Brower
jeff@Green-Visor.US
In reply to: Josh Berkus (#1)
Re: [HACKERS] OSS database needed for testing

Absolutely. We could just use one large state or several small ones and let
folks download the whole thing if they wanted. Using that technique you
could control the size of the test quite closely and still make something
potentially quite valuable as a contribution beyond the bench.


#12Josh Berkus
josh@agliodbs.com
In reply to: Jeffrey D. Brower (#11)
Re: [PERFORM] OSS database needed for testing

Jeff, Mlw,

Absolutely. We could just use one large state or several small ones and
let folks download the whole thing if they wanted. Using that technique
you could control the size of the test quite closely and still make
something potentially quite valuable as a contribution beyond the bench.

Hold on a second. The FCC database is still a better choice because it is
more complex, with a carefully defined schema. The Tiger database would be
good for doing tests of types 1 and 3, but not for tests of types 2 and 4.

It would certainly be interesting to use the Tiger database as the basis for
an additional type of test:

6) Very Large Data Set: querying, then updating, 300+ selected rows from a
2,000,000+ row table.

... but I still see the FCC database as our best candidate for the battery of
tests 1-5.
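
As a sketch of what test 6 might look like in practice (the table and
predicate below are placeholders only): select a few hundred rows out of a
multi-million-row table, then update the same rows.

    BEGIN;

    -- Pull the target rows and lock them
    SELECT *
    FROM   big_table
    WHERE  state = 'NC'
      AND  last_checked < '2003-01-01'
    FOR UPDATE;

    -- Update the same set of rows
    UPDATE big_table
    SET    last_checked = now()
    WHERE  state = 'NC'
      AND  last_checked < '2003-01-01';

    COMMIT;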

--
Josh Berkus
Aglio Database Solutions
San Francisco

#13Jeffrey D. Brower
jeff@pointhere.net
In reply to: Josh Berkus (#1)
Re: OSS database needed for testing

I think you got me there. I have to agree with both points.

(Besides, you are the one coding this thing and I think you understand it
better than I do.)

Let me know if I can help.

Jeff


#14Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Jeffrey D. Brower (#13)
Re: [PERFORM] OSS database needed for testing

The FCC FTP site is ftp.fcc.gov.

The location of the data of interest is at
/pub/Bureaus/Wireless/Databases/uls/.

There are zip files (pipe delimited) in 'complete' and the daily change
files in 'daily'. There's lots of info in 'documentation', which includes
Excel spreadsheets of the schema. These will have to be converted to
SQL statements.

The ULS is the database system that holds the data for Fixed and Mobile
wireless services. This includes most two-way systems and point-to-multipoint
(microwave) but not broadcast (AM, FM, TV) and not advanced radio.

The database is really a database of applications. It contains
application data submitted by wireless applicants.

There are two families of tables, prefixed with 'a' and 'l'. The 'a'
tables hold application records that are pending grant by the FCC. The 'l'
tables hold records that have received licenses and may or may not be
operating.

Combined, the 'a' and 'l' zipfiles represent a specific service. For
example, 'a_micro' and 'l_micro' contain the applications and licensed
data for microwave systems. The different services have slightly
different layouts because they have different requirements.

I strongly suggest looking at LMcomm and LMpriv first. These are the
fixed land mobile systems, and 90% of the entire database. They also
have identical layouts.

There are a great many files in each zipfile, but here are the most
interesting:

hd: header data
ad: application detail
an: antenna data
lo: location data
fr: frequency data
em: emission data

There are others. I can help you write meaningful queries that are
quite complex and will require optimization techniques.
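
As a taste of the kind of query that falls out of this layout, here is a
rough sketch of a join across the header, location, and frequency data.
The column names are only illustrative; the real ones are in the schema
spreadsheets under 'documentation'.

    SELECT h.call_sign,
           l.latitude,
           l.longitude,
           f.frequency_assigned
    FROM   hd h
    JOIN   lo l ON l.license_id = h.license_id
    JOIN   fr f ON f.license_id = h.license_id
               AND f.location_number = l.location_number
    WHERE  f.frequency_assigned BETWEEN 450.0 AND 470.0;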

Merlin

#15Josh Berkus
josh@agliodbs.com
In reply to: Merlin Moncure (#14)
Re: [PERFORM] OSS database needed for testing

Merlin,

The FCC FTP site is ftp.fcc.gov.

The location of the data of interest is at
/pub/Bureaus/Wireless/Databases/uls/.

Cool. I'll tackle this in a week or two. Right now, I'm being paid to
convert a client's data and that'll keep me busy through the weekend ...

--
-Josh Berkus
Aglio Database Solutions
San Francisco

#16Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Josh Berkus (#15)
Re: [PERFORM] OSS database needed for testing

Josh Berkus wrote:

Cool. I'll tackle this in a week or two. Right now, I'm being paid to
convert a client's data and that'll keep me busy through the weekend ...

I would suggest downloading the data now. I can help get you started
with the create table statements and the import scripts. There are not
very many ways to get the data in a reasonable timeframe: the spi
functions or the copy command are a good place to start. Do not bother
with running stuff through insert queries: take my word for it, it just
won't work. Of course, if you use copy, you have to pre-format. Be
aware that you will have many gigabytes (like more than 20) of data
before you are done.

Whatever you decide to do, document the process: the difficulty of
getting large amounts of data into Postgres quickly and easily has been
a historical complaint of mine. Using MySQL, it was a snap to get the
data in, but using *that* database I really felt it couldn't handle this
much data.

I can also get you started with some example queries that should be
quite a challenge to set up to run quickly. After that, it's your
ballgame.

Merlin

#17Josh Berkus
josh@agliodbs.com
In reply to: Merlin Moncure (#16)
Re: OSS database needed for testing

Merlin,

I would suggest downloading the data now. I can help get you started

OK, downloading now.

with the create table statements and the import scripts. There are not
very many ways to get the data in a reasonable timeframe: the spi
functions or the copy command are a good place to start. Do not bother
with running stuff through insert queries: take my word for it, it just
won't work. Of course, if you use copy, you have to pre-format. Be
aware that you will have many gigabytes (like more than 20) of data
before you are done.

From my perspective, the easiest and fastest way to do this is to create the
table definitions in PostgreSQL, and then to use Perl to convert the data
format to something COPY will recognize. If you can do the create table
statements for the LM* data, I can do the Perl scripts.
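
Since the raw files are pipe delimited, one shortcut worth trying before
writing much Perl is pointing COPY at them directly with a non-default
delimiter; a minimal sketch, assuming a matching table already exists and
the rows carry no embedded pipes or backslashes (table name and file path
below are placeholders):

    -- Server-side COPY; newer option syntax shown, older releases use
    -- COPY ... FROM '...' WITH DELIMITER '|'
    COPY uls_hd FROM '/data/uls/l_LMpriv/HD.dat'
        WITH (FORMAT text, DELIMITER '|', NULL '');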

Given that the *total* data is 20G, we'll want to use a subset of it. Per
your suggestion, I am downloading the *LM* tables. I may truncate them
further if the resulting database is too large. If some of the other tables
are reference lists or child tables, please tell me and I will download them
as well.

--
-Josh Berkus
Aglio Database Solutions
San Francisco

#18Lamar Owen
lamar.owen@wgcr.org
In reply to: Merlin Moncure (#14)
Re: [PERFORM] OSS database needed for testing

On Friday 04 April 2003 11:47, Merlin Moncure wrote:

The location of the data of interest is at
/pub/Bureaus/Wireless/Databases/uls/.

wireless services. This includes most two-way systems and point-to-multipoint
(microwave) but not broadcast (AM, FM, TV) and not advanced radio.

Also check out the cdbs files (which contain the broadcast stuff as well as
more) at /pub/Bureaus/Mass_Media/Databases/cdbs/ (which I would be more
interested in doing, since I am a broadcast engineer by profession....)
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

#19Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Lamar Owen (#18)
Re: [PERFORM] OSS database needed for testing

Lamar Owen wrote:

Also check out the cdbs files (which contain the broadcast stuff as well as
more) at /pub/Bureaus/Mass_Media/Databases/cdbs/ (which I would be more
interested in doing, since I am a broadcast engineer by profession....)

Up until about 6 months ago, I worked at a company called RadioSoft.
They are a provider of high quality database, engineering, and GIS
software. The company has its roots as a source of engineering tools for
broadcast engineers. They currently offer several products and services
(including online web-based database services), some of which are based
on Postgres, some not. You might consider checking them out.

RadioSoft's flagship product, ComStudy, is the #1 tool for broadcast
engineers on the market. I happen to be intimately familiar with the
CDBS. I suggested the land mobile stuff because it is (much) bigger and
(much) more complicated, but mostly draws on the same concepts, like
HAAT, etc. You might get a kick out of this project.

The last project I did before leaving there was an online database of
directional patterns in XML format. I also made up an XML schema for
directional patterns, hoping to get some standardization in that regard.
You can see that in the free section of the RadioSoft web page.

Merlin

#20Josh Berkus
josh@agliodbs.com
In reply to: Lamar Owen (#18)
Re: [PERFORM] OSS database needed for testing

Lamar,

Also check out the cdbs files (which contain the broadcast stuff as well as
more) at /pub/Bureaus/Mass_Media/Databases/cdbs/ (which I would be more
interested in doing, since I am a broadcast engineer by profession....)

Hey, if you're willing to do the text --> postgres conversions, I'll use
whichever tables you want ...

--
-Josh Berkus
Aglio Database Solutions
San Francisco

#21Lamar Owen
lamar.owen@wgcr.org
In reply to: Merlin Moncure (#19)
Re: [PERFORM] OSS database needed for testing

On Friday 04 April 2003 14:23, Merlin Moncure wrote:

Up until about 6 months ago, I worked at a company called RadioSoft.
They are a provider of high quality database, engineering, and GIS
software. The company has its roots as a source of engineering tools for
broadcast engineers. They currently offer several products and services
(including online web-based database services), some of which are based
on Postgres, some not. You might consider checking them out.

I'm quite familiar with RadioSoft. Can't afford any of the software; familiar
with the products... :-)

I've been putting together open source tools to do much of the same stuff.
With the release of the FCC's Fortran source, I've been able to do virtually
everything I need to do.

But while the LMR dataset is larger, the MB dataset is just as varied. I'm
interested in both, however.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

#22Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Lamar Owen (#21)
Re: [PERFORM] OSS database needed for testing

Lamar Owen wrote:

I'm quite familiar with RadioSoft. Can't afford any of the software;
familiar with the products... :-)

I've been putting together open source tools to do much of the same stuff.
With the release of the FCC's Fortran source, I've been able to do virtually
everything I need to do.

But while the LMR dataset is larger, the MB dataset is just as varied. I'm
interested in both, however.

Peter's my father! At RadioSoft I worked over that same Fortran code
you talked about. We ported most of it to C. Of course, I would prefer
not to write any software that would compete with my father's company,
free or no :)
I can tell you, though, the land mobile database is much more
complicated. Getting it to run decently on PC hardware is a significant
engineering challenge. OTOH, the broadcast database isn't too bad. For
example, the last time I checked there were about 40k TV stations in the
CDBS, compared with about 4M frequencies in the land mobile private
(LMpriv) data. Also the LM database has a much wider scope of function.
The requirements for broadcast have changed (like the nature of broadcast
technology itself) very little in the last 50 years, except for the
ill-fated DTV rollout and the failed AM stereo. It's a conservative
industry.

Merlin

#23Lamar Owen
lamar.owen@wgcr.org
In reply to: Merlin Moncure (#22)
Re: [PERFORM] OSS database needed for testing

On Friday 04 April 2003 14:54, Merlin Moncure wrote:

I can tell you, though, the land mobile database is much more
complicated. Getting it to run decently on PC hardware is a significant
engineering challenge.

Then it sounds like it's a better fit for Josh's requirements.

ill-fated DTV rollout and the failed AM stereo. It's a conservative
industry.

Tell me about it. Yet we get IBOC... and the 30,000 translator apps in the
one-week window... Anyway, those topics more correctly belong to
radio-tech@broadcast.net; rather off-topic here.

I do still want to get CDBS in a PostgreSQL setup, with automatic nightly
import, at some point in time. Just probably not as quickly as Josh needs a
dataset to crank on.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

#24Josh Berkus
josh@agliodbs.com
In reply to: Lamar Owen (#23)
Re: [PERFORM] OSS database needed for testing

Lamar,

I do still want to get CDBS in a PostgreSQL setup, with automatic nightly
import, at some point in time. Just probably not as quickly as Josh needs a
dataset to crank on.

Oh, I don't know. I expect setting this up to take several weeks. And if
we do the CDBS database as part of the test pack, then you can get my help
without paying the $175/hour I normally charge <grin>.

--
-Josh Berkus
Aglio Database Solutions
San Francisco