optimizing a cpu-heavy query

Started by Joel Reymont (April 2011) · 6 messages · general
#1 Joel Reymont
joelr1@gmail.com

Folks,

I'm trying to optimize the following query, which computes the Kullback–Leibler (KL) divergence [1]. As you can see, the distance function operates on vectors of 150 floats.

The query takes 12 minutes to run on an otherwise idle EC2 m1.large instance with 2 million documents in the docs table, and the CPU is pegged at 100% the whole time. I need to be able to run concurrent distance queries and still use the database for other work.

I have the option of moving this distance calculation out of Postgres, but are there other options?

Is there anything clearly wrong that I'm doing here?

Would it speed things up to make the float array a custom data type backed by C code?

Thanks in advance, Joel

[1]: http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

---

CREATE DOMAIN topics AS float[150];
CREATE DOMAIN doc_id AS varchar(64);

CREATE TABLE docs
(
    id serial,
    doc_id doc_id NOT NULL PRIMARY KEY,
    topics topics NOT NULL
);

CREATE OR REPLACE FUNCTION docs_within_distance(vec topics, threshold float)
RETURNS TABLE(id doc_id, distance float) AS $$
BEGIN
    RETURN QUERY
    SELECT *
    FROM (SELECT doc_id,
                 (SELECT sum(vec[i] * ln(vec[i] / topics[i]))
                  FROM generate_subscripts(topics, 1) AS i
                  WHERE topics[i] > 0) AS distance
          FROM docs) AS tab
    WHERE tab.distance <= threshold;
END;
$$ LANGUAGE plpgsql;

--------------------------------------------------------------------------
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
---------------------+------------+---------------------------------------
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
---------------------+------------+---------------------------------------

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Reymont (#1)
Re: optimizing a cpu-heavy query

Joel Reymont <joelr1@gmail.com> writes:

I'm trying to optimize the following query that performs KL Divergence [1]. As you can see the distance function operates on vectors of 150 floats.

CREATE OR REPLACE FUNCTION docs_within_distance(vec topics, threshold float)
RETURNS TABLE(id doc_id, distance float) AS $$
BEGIN
RETURN QUERY
SELECT *
FROM (SELECT doc_id, (SELECT sum(vec[i] * ln(vec[i] / topics[i]))
FROM generate_subscripts(topics, 1) AS i
WHERE topics[i] > 0) AS distance
FROM docs) AS tab
WHERE tab.distance <= threshold;
END;
$$ LANGUAGE plpgsql;

Yikes. That sub-select is a mighty expensive way to compute the scalar
product. Push it into a sub-function that takes the two arrays and
iterates over them with a for-loop. For another couple orders of
magnitude, convert the sub-function to C code. (I don't think you need
a whole data type, just a function that does the scalar product.)

regards, tom lane

#3 Joel Reymont
joelr1@gmail.com
In reply to: Tom Lane (#2)
Re: optimizing a cpu-heavy query

Tom,

On Apr 26, 2011, at 5:00 PM, Tom Lane wrote:

For another couple orders of magnitude, convert the sub-function to C code. (I don't think you need
a whole data type, just a function that does the scalar product.)

That's a 30x speedup, from 12 minutes down to 38s. Thanks Tom!


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Reymont (#3)
Re: optimizing a cpu-heavy query

Joel Reymont <joelr1@gmail.com> writes:

On Apr 26, 2011, at 5:00 PM, Tom Lane wrote:

For another couple orders of magnitude, convert the sub-function to C code. (I don't think you need
a whole data type, just a function that does the scalar product.)

That's a 30x speedup, from 12 minutes down to 38s. Thanks Tom!

Huh, I would've bet on a lot more actually. The nodeFunctionscan and
nodeAgg code must not be as inefficient as it looks on the surface ...

regards, tom lane

#5 Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Tom Lane (#4)
Re: optimizing a cpu-heavy query

2011/4/27 Tom Lane <tgl@sss.pgh.pa.us>:

Joel Reymont <joelr1@gmail.com> writes:

On Apr 26, 2011, at 5:00 PM, Tom Lane wrote:

For another couple orders of magnitude, convert the sub-function to C code.  (I don't think you need
a whole data type, just a function that does the scalar product.)

That's a 30x speedup, from 12 minutes down to 38s. Thanks Tom!

Huh, I would've bet on a lot more actually.  The nodeFunctionscan and
nodeAgg code must not be as inefficient as it looks on the surface ...

Did you mean that, in that case, it could be optimized by collapsing those
nodes into one?

Regards,

--
Hitoshi Harada

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hitoshi Harada (#5)
Re: optimizing a cpu-heavy query

Hitoshi Harada <umi.tanuki@gmail.com> writes:

2011/4/27 Tom Lane <tgl@sss.pgh.pa.us>:

Joel Reymont <joelr1@gmail.com> writes:

That's a 30x speedup, from 12 minutes down to 38s. Thanks Tom!

Huh, I would've bet on a lot more actually.  The nodeFunctionscan and
nodeAgg code must not be as inefficient as it looks on the surface ...

Did you mean in that case you can optimize it by collapsing those
nodes into one?

No, just that the overhead of that code looks like it ought to be much
more than the cost of a few multiplications ...

regards, tom lane