X-From-Line: mailagent Tue Nov 16 16:52:32 EST 2004
Return-Path: <tgl@sss.pgh.pa.us>
Received: from po11.mit.edu [18.7.21.73]
	by stark.xeocode.com with POP3 (fetchmail-5.9.7)
	for stark@localhost (single-drop); Tue, 16 Nov 2004 16:52:39 -0500 (EST)
Received: from po11.mit.edu (po11.mit.edu [18.7.21.73])
	by po11.mit.edu (Cyrus v2.1.5) with LMTP;
	Tue, 16 Nov 2004 16:39:25 -0500
X-Sieve: CMU Sieve 2.2
Received: from pacific-carrier-annex.mit.edu by po11.mit.edu (8.12.4/4.7) id
	iAGLd0Wt005394; Tue, 16 Nov 2004 16:39:18 -0500 (EST)
Received: from sss.pgh.pa.us (sss.pgh.pa.us [66.207.139.130])
	by pacific-carrier-annex.mit.edu (8.12.4/8.9.2) with ESMTP id
	iAGLbMov013197
	for <gsstark@MIT.EDU>; Tue, 16 Nov 2004 16:37:23 -0500 (EST)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.13.1/8.13.1) with ESMTP id iAGLbLlD002762;
	Tue, 16 Nov 2004 16:37:21 -0500 (EST)
To: Greg Stark <gsstark@MIT.EDU>
cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] GiST: PickSplit and multi-attr indexes 
In-reply-to: <87r7mt7b1i.fsf@stark.xeocode.com> 
References: <419723A7.5040505@samurai.com> <27866.1100476459@sss.pgh.pa.us>
	<1100492090.23420.28.camel@localhost.localdomain>
	<7879.1100531985@sss.pgh.pa.us>
	<1100578365.23420.71.camel@localhost.localdomain>
	<17937.1100616003@sss.pgh.pa.us> <87wtwl7bqe.fsf@stark.xeocode.com>
	<87r7mt7b1i.fsf@stark.xeocode.com>
Comments: In-reply-to Greg Stark <gsstark@MIT.EDU>
	message dated "16 Nov 2004 16:12:57 -0500"
Date: Tue, 16 Nov 2004 16:37:20 -0500
X-Gnus-Mail-Source: directory:~/incoming
Message-ID: <2761.1100641040@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
X-Scanned-By: MIMEDefang 2.42
X-Spam-Score: -4.756
X-Mailagent-Processed: Tue, 16 Nov 2004 16:53:28 -0500
X-Mailagent-Processed: Job:158263 File:fm28563
X-Bogosity: No, tests=bogofilter, spamicity=0.485263, version=0.17.5
X-Spam-DCC: : 
X-Spam-Checker-Version: SpamAssassin 2.64 (2004-01-11) on stark.xeocode.com
X-Spam-Level: 
X-Spam-Status: No, hits=0.0 required=4.0 tests=none autolearn=ham version=2.64
X-Filter: mailagent [version 3.0 PL73] for gsstark@mit.edu
Lines: 22
Xref: stark.xeocode.com misc:37450

Greg Stark <gsstark@MIT.EDU> writes:
> The approach they take is to have a function which calculates an
> abstract "distance" between any two entries. There's an algorithm that
> they use to pick the split based on this distance function.

> If you abandoned "PickSplit" and instead exposed this distance
> function as the external API then the behaviour for multi-column
> indexes is clear. You calculate the distance along all the axes and
> calculate the diagonal distance.

Hmm ... the problem with that is the assumption that different opclasses
will compute similarly-scaled distances.  If opclass A generates
distances in the range (0,1e6) while B generates in the range (0,1),
combining them with Euclidean distance won't work well at all, because
A's term will swamp B's.  OTOH you can't blindly normalize, because in
some cases maybe the data is such that a massive difference in distances
is truly appropriate.
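
To make the scaling concern concrete, here is a minimal sketch, not
anything from the GiST code.  The function name and the sample numbers
are made up for illustration; it only assumes that per-column distances
are combined as a plain Euclidean norm, as proposed above.

```python
import math


def combined_distance(per_column_distances):
    """Euclidean ("diagonal") combination of per-column distances.

    Hypothetical helper for illustration only; not a GiST API.
    """
    return math.sqrt(sum(d * d for d in per_column_distances))


# Hypothetical per-column distances for two pairs of index entries.
# Column A's opclass reports distances in (0, 1e6); column B's in (0, 1).
pair_close_in_b = (500_000.0, 0.01)  # nearly identical on column B
pair_far_in_b = (500_000.0, 0.99)    # almost maximally far on column B

d_close = combined_distance(pair_close_in_b)
d_far = combined_distance(pair_far_in_b)

# Relative change in the combined distance caused by column B alone.
relative_change = abs(d_far - d_close) / d_close

print(f"close in B: {d_close!r}")
print(f"far in B:   {d_far!r}")
print(f"relative change from column B: {relative_change:.3e}")
```

With these numbers column B moves the combined distance by only a tiny
fraction, so a split algorithm driven by it would effectively ignore B.
Rescaling both columns to (0,1) would fix this case, but as noted it
would be wrong wherever the large spread in A is genuinely meaningful.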

I'm also a bit leery of the assumption that every GiST application can
reduce its PickSplit logic to Euclidean distances.

			regards, tom lane


