Netflix Prize data
I signed up for the Netflix Prize (www.netflixprize.com), downloaded
their data, and imported it into PostgreSQL. Here is how I created the
table:
Table "public.ratings"
Column | Type | Modifiers
--------+---------+-----------
item | integer |
client | integer |
rating | integer |
rdate | text |
Indexes:
"ratings_client" btree (client)
"ratings_item" btree (item)
markw@snoopy:~/netflix$ time psql netflix -c "select count(*) from ratings"
count
-----------
100480507
(1 row)
real 2m6.270s
user 0m0.004s
sys 0m0.005s
The one thing I notice is that it is REAL slow. I know it is, in fact, 100
million records, but I don't think PostgreSQL is usually slow like this.
I'm going to check with some other machines to see if there is a problem
with my test machine or if something is weird about PostgreSQL and large
numbers of rows.
I tried to cluster the data along a particular index but had to cancel it
after 3 hours.
I'm using 8.1.4. The "rdate" field looks something like "2005-09-06". So
the raw data is 23 bytes; the date string will probably be rounded up to
12 bytes, so that's 24 bytes per row of data. What is the overhead per
variable? Per row?
Is there any advantage to using "varchar(10)" over "text" ?
Mark,
On 10/4/06 1:43 PM, "Mark Woodward" <pgsql@mohawksoft.com> wrote:
markw@snoopy:~/netflix$ time psql netflix -c "select count(*) from ratings"
count
-----------
100480507
(1 row)
real 2m6.270s
user 0m0.004s
sys 0m0.005s
I think you are getting about 40MB/s on your sequential scan of about 5GB of
heap data in this case. I calculate the size of the data as:
3 Integers (12 bytes), one text date field (10 bytes ?) and tuple overhead
(24 bytes) = 46 bytes per row
100 million rows x 46 bytes / row = 4.6 Gbytes
- Luke
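Luke's arithmetic can be double-checked in a few lines (a rough sketch; the 24-byte tuple overhead and 10-byte date string are his assumed figures, not measured ones):

```python
# Rough check of the per-row estimate from the thread:
# 3 integers + a text date field + assumed per-tuple overhead.
int_fields = 3 * 4        # three 4-byte integers
date_text = 10            # "2005-09-06" as text (ignoring the varlena header)
tuple_overhead = 24       # assumed per-tuple overhead
row_bytes = int_fields + date_text + tuple_overhead

rows = 100_480_507
total_bytes = rows * row_bytes
print(f"{row_bytes} bytes/row, {total_bytes / 1e9:.1f} GB")   # 46 bytes/row, 4.6 GB

# At the observed 2m6.270s for the full count(*):
scan_seconds = 2 * 60 + 6.270
print(f"{total_bytes / scan_seconds / 1e6:.0f} MB/s")         # roughly 37 MB/s
```

By this count the scan moves roughly 37 MB/s of heap data; the somewhat higher figures quoted elsewhere in the thread presumably include page and line-pointer overhead on disk.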
"Mark Woodward" <pgsql@mohawksoft.com> writes:
The one thing I notice is that it is REAL slow.
How fast is your disk? Counting on my fingers, I estimate you are
scanning the table at about 47MB/sec, which might or might not be
disk-limited...
I'm using 8.1.4. The "rdate" field looks something like: "2005-09-06"
So why aren't you storing it as type "date"?
regards, tom lane
"Mark Woodward" <pgsql@mohawksoft.com> writes:
I'm using 8.1.4. The "rdate" field looks something like: "2005-09-06" So,
the raw data is 23 bytes, the date string will probably be rounded up to
12 bytes, that's 24 bytes per row of data. What is the overhead per
variable? per row?
Is there any advantage to using "varchar(10)" over "text"?
I'll second the "use a date" comment.
But to answer the questions, text and varchar are handled identically in
almost every respect. The overhead per variable width field (like text or
varchar) is 4 bytes. The overhead per row depends on a few factors, but figure
28 bytes.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
I signed up for the Netflix Prize. (www.netflixprize.com)
and downloaded their data and have imported it into PostgreSQL.
Here is how I created the table:
I signed up as well, but have the table as follows:
CREATE TABLE rating (
movie SMALLINT NOT NULL,
person INTEGER NOT NULL,
rating SMALLINT NOT NULL,
viewed DATE NOT NULL
);
I also recommend not loading the entire file until you get further
along in the algorithm solution. :)
Not that I have time to really play with this....
--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200610041827
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
I signed up for the Netflix Prize. (www.netflixprize.com)
and downloaded their data and have imported it into PostgreSQL.
Here is how I created the table:
I signed up as well, but have the table as follows:
CREATE TABLE rating (
movie SMALLINT NOT NULL,
person INTEGER NOT NULL,
rating SMALLINT NOT NULL,
viewed DATE NOT NULL
);
I also recommend not loading the entire file until you get further
along in the algorithm solution. :)
Not that I have time to really play with this....
As luck would have it, I wrote a recommendations system based on music
ratings a few years ago.
After reading the NYT article, it seems as though one or more of the guys
behind "Net Perceptions" is either helping them or did their system, I'm
not sure. I wrote my system because Net Perceptions was too slow and did a
lousy job.
I think the notion of "communities" in general is an interesting study in
statistics, but everything I've seen in the form of bad recommendations
shows that while [N] people may share certain tastes, that doesn't
necessarily mean that what one likes the others do. This is especially
flawed with movie rentals because it is seldom a 1:1 ratio of movies to
people. There are often multiple people in a household. Also, movies are
almost always for multiple people.
Anyway, good luck! (Not better than me, of course :-)
"Mark Woodward" <pgsql@mohawksoft.com> writes:
The one thing I notice is that it is REAL slow.
How fast is your disk? Counting on my fingers, I estimate you are
scanning the table at about 47MB/sec, which might or might not be
disk-limited...
I'm using 8.1.4. The "rdate" field looks something like: "2005-09-06"
So why aren't you storing it as type "date"?
You are assuming I gave it any thought at all. :-)
I converted it to a date type (create table ratings2 as ....)
markw@snoopy:~/netflix/download$ time psql -c "select count(*) from
ratings" netflix
count
-----------
100480507
(1 row)
real 1m29.852s
user 0m0.002s
sys 0m0.005s
That's about the right increase based on the reduction in data size.
OK, I guess I am crying wolf; 47MB/sec isn't all that bad for the system.
"Greg Sabino Mullane" <greg@turnstep.com> writes:
CREATE TABLE rating (
movie SMALLINT NOT NULL,
person INTEGER NOT NULL,
rating SMALLINT NOT NULL,
viewed DATE NOT NULL
);
You would probably be better off putting the two smallints first followed by
the integer and date. Otherwise both the integer and the date field will have
an extra two bytes of padding wasting 4 bytes of space.
If you reorder the fields that way you'll be down to 28 bytes of tuple header
overhead and 12 bytes of data. There's actually another 4 bytes in the form of
the line pointer so a total of 44 bytes per record. Ie, almost 73% of the disk
i/o you're seeing is actually per-record overhead.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
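Gregory's padding numbers can be reproduced with a small sketch (assumed sizes/alignments: smallint 2-byte size and alignment, integer and date 4-byte; the 28-byte header and 4-byte line pointer are his figures):

```python
# Each field's offset is padded up to the field's alignment boundary.
def data_bytes(fields):
    """fields: list of (size, alignment) pairs in column order."""
    offset = 0
    for size, align in fields:
        offset = (offset + align - 1) // align * align   # alignment padding
        offset += size
    return offset

SMALLINT, INTEGER, DATE = (2, 2), (4, 4), (4, 4)

# Declared order: movie, person, rating, viewed -> 4 bytes of padding
print(data_bytes([SMALLINT, INTEGER, SMALLINT, DATE]))           # 16
# Smallints first: no padding at all
print(data_bytes([SMALLINT, SMALLINT, INTEGER, DATE]))           # 12
# 28-byte tuple header + 4-byte line pointer + 12 bytes of data
print(28 + 4 + data_bytes([SMALLINT, SMALLINT, INTEGER, DATE]))  # 44
```

With 12 data bytes out of 44 total, 32/44 of the I/O (about 73%) is indeed per-record overhead.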
"Greg Sabino Mullane" <greg@turnstep.com> writes:
CREATE TABLE rating (
movie SMALLINT NOT NULL,
person INTEGER NOT NULL,
rating SMALLINT NOT NULL,
viewed DATE NOT NULL
);
You would probably be better off putting the two smallints first followed by
the integer and date. Otherwise both the integer and the date field will have
an extra two bytes of padding wasting 4 bytes of space.
If you reorder the fields that way you'll be down to 28 bytes of tuple header
overhead and 12 bytes of data. There's actually another 4 bytes in the form of
the line pointer so a total of 44 bytes per record. Ie, almost 73% of the disk
i/o you're seeing is actually per-record overhead.
That's good advice; however, it is said that Netflix has greater than 64K
movies, so, while the test data may work with a smallint, I doubt the
overall system would.
The rating, however, is one char, 1~9. Would making it a char(1) buy anything?
I wonder: if I started screwing around with movie ID and rating and moved them
into one int (one byte for rating, three bytes for movie ID), that could reduce
the data size by at least half a gig.
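The packing Mark is musing about might look like this (purely illustrative; the one-byte rating / three-byte movie ID split is his, and `pack`/`unpack` are hypothetical helpers, not anything in the dataset tooling):

```python
# Hypothetical packing: movie ID in the top 3 bytes, rating in the low byte.
def pack(movie_id, rating):
    assert 0 <= movie_id < (1 << 24) and 0 <= rating < (1 << 8)
    return (movie_id << 8) | rating

def unpack(packed):
    return packed >> 8, packed & 0xFF    # (movie_id, rating)

packed = pack(54321, 4)                  # example values, not real data
assert unpack(packed) == (54321, 4)
```

Whether the packing is worth the bookkeeping is another question; Tom argues below for keeping plain ints.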
"Mark Woodward" <pgsql@mohawksoft.com> writes:
The rating, however, is one char 1~9. Would making it a char(1) buy anything?
No, that would actually hurt because of the length word for the char
field. Even if you used the "char" type, which really is only one byte,
you wouldn't win anything because of alignment issues. Personally I'd
just go for three ints and a date, rather than trying to be cute with
the rating.
regards, tom lane
"Mark Woodward" <pgsql@mohawksoft.com> writes:
The rating, however, is one char 1~9. Would making it a char(1) buy
anything?
No, that would actually hurt because of the length word for the char
field. Even if you used the "char" type, which really is only one byte,
you wouldn't win anything because of alignment issues. Personally I'd
just go for three ints and a date, rather than trying to be cute with
the rating.
Actually, the date is just days, right? I don't actually need it too much.
So, create a smallint for the date and do this: smalldate =
rdate - date '1970-01-01'. And use smallint for rating.
Column | Type | Modifiers
--------+----------+-----------
movie | integer |
client | integer |
day | smallint |
rating | smallint |
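The day-offset conversion sketched above, in Python's datetime terms (the 1970-01-01 epoch is from the message; a signed smallint holds offsets up to 32767 days, i.e. dates into 2059, so the Netflix date range fits comfortably):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def to_day(d):
    """Days since the epoch; must fit in a signed smallint."""
    days = (d - EPOCH).days
    assert 0 <= days <= 32767, "out of smallint range"
    return days

def to_date(days):
    return EPOCH + timedelta(days=days)

d = to_day(date(2005, 9, 6))
print(d)                          # 13032
assert to_date(d) == date(2005, 9, 6)
```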
Mark Woodward wrote:
I tried to cluster the data along a particular index but had to cancel it
after 3 hours.
If the data is in random order, it's faster to do
SELECT * INTO foo_sorted FROM foo ORDER BY bar
then CREATE INDEX, than to run CLUSTER.
That's because CLUSTER does a full index scan of the table, which is
slower than a seqscan + sort if the table is not already clustered.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com