Re: [HACKERS] Postgres dies in the rules regression test (64-bit problem)

Started by Tom Lanealmost 27 years ago4 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

"Pedro J. Lobo" <pjlobo@euitt.upm.es> writes:

#0 replace_opid (oper=0x4015aad0) at nodeFuncs.c:95
#1 0x1201208b0 in fix_opid (clause=0x14015aaa0) at clauses.c:554

(gdb) p *((Expr *) clause)
$3 = {type = T_Expr, typeOid = 23, opType = OP_EXPR, oper = 0x4015aad0,
args = 0x14015ab30}

I don't know if ((Expr*) clause)->oper should point to itself as it seems
to do,

It shouldn't ever point to itself, but it looks to me like it's not ---
the low order bits of clause are ...aaa0 and oper is ...aad0.

but certainly its value is passed though an int variable and is
truncated.

Looks that way doesn't it :-(. I did some quick scratching around in
the sources and couldn't find any obvious mistakes of that ilk. Most of
the code that touches Oper nodes would have been exercised heavily long
before we get to the rules regression test, so I'm not sure what to think.

If someone points me to the right place to look, I can play a bit more
with gdb and try to find the cause.

The Expr node was presumably made by make_op() in
backend/parser/parse_oper.c, although the tree might have been copied at
least once using the functions in backend/nodes/copyfuncs.c. Good luck!

regards, tom lane

#2Pedro J. Lobo
pjlobo@euitt.upm.es
In reply to: Tom Lane (#1)

On Thu, 10 Jun 1999, Tom Lane wrote:

"Pedro J. Lobo" <pjlobo@euitt.upm.es> writes:

#0 replace_opid (oper=0x4015aad0) at nodeFuncs.c:95
#1 0x1201208b0 in fix_opid (clause=0x14015aaa0) at clauses.c:554

(gdb) p *((Expr *) clause)
$3 = {type = T_Expr, typeOid = 23, opType = OP_EXPR, oper = 0x4015aad0,
args = 0x14015ab30}

I don't know if ((Expr*) clause)->oper should point to itself as it seems
to do,

It shouldn't ever point to itself, but it looks to me like it's not ---
the low order bits of clause are ...aaa0 and oper is ...aad0.

Ooops, you are right! Now I am using a bigger font :-)

but certainly its value is passed though an int variable and is
truncated.

Looks that way doesn't it :-(. I did some quick scratching around in
the sources and couldn't find any obvious mistakes of that ilk. Most of
the code that touches Oper nodes would have been exercised heavily long
before we get to the rules regression test, so I'm not sure what to think.

I have also looked at the warnings that arise in the compilation process,
and haven't found any that could be related to that.

If someone points me to the right place to look, I can play a bit more
with gdb and try to find the cause.

The Expr node was presumably made by make_op() in
backend/parser/parse_oper.c, although the tree might have been copied at
least once using the functions in backend/nodes/copyfuncs.c. Good luck!

Ok, I'll let you know if/when I find something (or, more probably, when I
have more questions ;-)

Regards,

Pedro.

--
-------------------------------------------------------------------
Pedro Jos� Lobo Perea Tel: +34 91 336 78 19
Centro de C�lculo Fax: +34 91 331 92 29
E.U.I.T. Telecomunicaci�n e-mail: pjlobo@euitt.upm.es
Universidad Polit�cnica de Madrid
Ctra. de Valencia, Km. 7 E-28031 Madrid - Espa�a / Spain

#3Pedro J. Lobo
pjlobo@euitt.upm.es
In reply to: Pedro J. Lobo (#2)

On Fri, 11 Jun 1999, Pedro J. Lobo wrote:

On Thu, 10 Jun 1999, Tom Lane wrote:

"Pedro J. Lobo" <pjlobo@euitt.upm.es> writes:

If someone points me to the right place to look, I can play a bit more
with gdb and try to find the cause.

The Expr node was presumably made by make_op() in
backend/parser/parse_oper.c, although the tree might have been copied at
least once using the functions in backend/nodes/copyfuncs.c. Good luck!

Ok, I'll let you know if/when I find something (or, more probably, when I
have more questions ;-)

Well, I have found something. As you said, the node is made by make_op()
(which, btw, is in parse_node.c ;-) and later copied using copyObject().
But those functions are correct and work as expected.

I have located the point where the pointer is overwritten. Here is the
stack trace:

(gdb) where
#0 ResolveNew (info=0x140169ac0, targetlist=0x140169f70, nodePtr=0x14016aac8,
sublevels_up=0) at rewriteManip.c:719
#1 0x12013ea38 in ResolveNew (info=0x140169ac0, targetlist=0x140169f70,
nodePtr=0x14016ab38, sublevels_up=0) at rewriteManip.c:670
#2 0x12013ea58 in ResolveNew (info=0x140169ac0, targetlist=0x140169f70,
nodePtr=0x140169b80, sublevels_up=0) at rewriteManip.c:730
#3 0x12013ead0 in FixNew (info=0x140169ac0, parsetree=0x1401692d0)
at rewriteManip.c:753
#4 0x12013cc5c in fireRules (parsetree=0x1401692d0, rt_index=1,
event=CMD_UPDATE, instead_flag=0x11fffa6c0 "\001", locks=0x14016a8c0,
qual_products=0x11fffa6b8) at rewriteHandler.c:2612
#5 0x12013ce90 in RewriteQuery (parsetree=0x1401692d0,
instead_flag=0x11fffa6c0 "\001", qual_products=0x11fffa6b8)
at rewriteHandler.c:2697
#6 0x12013cf38 in deepRewriteQuery (parsetree=0x1401692d0)
at rewriteHandler.c:2742
#7 0x12013d008 in QueryRewriteOne (parsetree=0x1401692d0)
at rewriteHandler.c:2791
#8 0x12013d128 in BasicQueryRewrite (parsetree=0x1401692d0)
at rewriteHandler.c:2862
#9 0x12013d230 in QueryRewrite (parsetree=0x1401692d0)
at rewriteHandler.c:2916
#10 0x1200b3ce4 in ExplainQuery (query=0x1401692d0, verbose=0 '\000',
dest=Remote) at explain.c:68
#11 0x12015dce0 in ProcessUtility (parsetree=0x140169110, dest=Remote)
at utility.c:651
#12 0x12015a138 in pg_exec_query_dest (
query_string=0x11fffa918 "explain update rtest_v1 set a = rtest_t3.a +
20 where b = rtest_t3.b;", dest=Remote, aclOverride=0 '\000') at
postgres.c:727
#13 0x120159f80 in pg_exec_query (
query_string=0x11fffa918 "explain update rtest_v1 set a = rtest_t3.a +
20 where b = rtest_t3.b;") at postgres.c:656
#14 0x12015baa0 in PostgresMain (argc=10, argv=0x11fffee90, real_argc=9,
real_argv=0x11ffffc28) at postgres.c:1658
#15 0x12012d02c in DoBackend (port=0x1400d9a00) at postmaster.c:1628
#16 0x12012c778 in BackendStartup (port=0x1400d9a00) at postmaster.c:1373
#17 0x12012b5d8 in ServerLoop () at postmaster.c:823
#18 0x12012ae00 in PostmasterMain (argc=9, argv=0x11ffffc28)
at postmaster.c:616
#19 0x1200e0b30 in main (argc=9, argv=0x11ffffc28) at main.c:93
(gdb)

ResolveNew() is a recursive function in backend/rewrite/rewriteManip.c.
The problem appears one time that it is called with nodePtr pointing to a
Var node. This is the node:

(gdb) p *((Var *) *nodePtr)
$4 = {type = T_Var, varno = 4, varattno = 1, vartype = 23, vartypmod = -1,
varlevelsup = 0, varnoold = 4, varoattno = 1}

In the beginning of the function, there is a 'switch(nodeTag(node))'
('node' is assigned the value of '*nodePtr'), that jumps to the
corresponding 'case T_Var:' at line 695. There you see a call to
FindMatchingNew() (a static function of the same c file), that returns a
pointer to a node that, from what I have seen in that function, is
expected to be of type Expr, and indeed it is. This pointer is stored in
a variable named 'n'. Right after that, you can see this code:

if (n == NULL)
{
... some code that isn't executed because n != NULL
}
else
{
*nodePtr = copyObject(n);
((Var *) *nodePtr)->varlevelsup = this_varlevelsup;
}

Well, this *can't* be correct. 'n' points to a node of type Expr, and of
course copyObject returns a node of that type. But that node is treated in
the following line as if it were of type Var! The value of
'this_varlevelsup' is 0, and the offset of the field 'varlevelsup' of
node Var is the same as that of the field 'oper' of node Expr, and the
pointer is overwritten.

I'd bet that the copy of the Expr node is not supposed to be stored in
*nodePtr (which points to a node of type Var), but I don't know what would
be the right action. I've found the bug, now I'll let the wise ones fix it
;-)

This problem has been hidden until now (for me, at least) because of a bug
in the 'money' data type that broke the rules test before it reached that
point. That bug sowed up only when you had locale enabled and your
currency didn't use cents (like spanish pesetas). On 32-bit systems it
is likely that the pointer isn't overwritten, so the bug didn't showed up,
either.

Cheers,

Pedro.

--
-------------------------------------------------------------------
Pedro Jos� Lobo Perea Tel: +34 91 336 78 19
Centro de C�lculo Fax: +34 91 331 92 29
E.U.I.T. Telecomunicaci�n e-mail: pjlobo@euitt.upm.es
Universidad Polit�cnica de Madrid
Ctra. de Valencia, Km. 7 E-28031 Madrid - Espa�a / Spain

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Pedro J. Lobo (#3)

"Pedro J. Lobo" <pjlobo@euitt.upm.es> writes:

I have located the point where the pointer is overwritten.

*nodePtr = copyObject(n);
((Var *) *nodePtr)->varlevelsup = this_varlevelsup;

Well, this *can't* be correct. 'n' points to a node of type Expr,

Good sleuthing! It looks like on a 32-bit machine, the overwritten
area will be just past the last field that's part of Expr, and because
of the memory manager's habit of rounding up object sizes that area
just happens to be available space rather than the start of the next
object. So that's why we hadn't found it before.

I believe the copied TLE entry could be any of several other node types
besides Var and Expr, so it's probably possible to see related failures
even on a 32-bit machine.

Jan, what should really be happening here?

regards, tom lane