Do we work with LLVM 12 on s390x?

Started by Tom Lanealmost 5 years ago8 messages
#1Tom Lane
tgl@sss.pgh.pa.us

The Red Hat folk are seeing a problem with that combination:

https://bugzilla.redhat.com/show_bug.cgi?id=1940964

which boils down to

Build fails with this error:
ERROR: failed to JIT module: Added modules have incompatible data layouts: E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-a:8:16-n32:64 (module) vs E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64 (jit)

(By "build", I imagine the reporter means "regression tests")

So I was wondering if we'd tested it yet.

regards, tom lane

#2Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#1)
Re: Do we work with LLVM 12 on s390x?

Hi,

On 2021-03-19 14:03:21 -0400, Tom Lane wrote:

The Red Hat folk are seeing a problem with that combination:

https://bugzilla.redhat.com/show_bug.cgi?id=1940964

which boils down to

Build fails with this error:
ERROR: failed to JIT module: Added modules have incompatible data layouts: E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-a:8:16-n32:64 (module) vs E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64 (jit)

(By "build", I imagine the reporter means "regression tests")

So I was wondering if we'd tested it yet.

Yes, I did test it not too long ago, after Christoph Berg reported
Debian s390x failing with jit. Which made me learn a bunch of s390x
assembler and discover a bug in our code that only rarely happend (iirc
something about booleans that are not exactly 0 or 1 not testing
true)...

/messages/by-id/20201015222924.yyms42qjloydfvar@alap3.anarazel.de

I think the error above comes from a "mismatch" between the clang used
to compile bitcode, and the LLVM version linked to. Normally we're
somewhat tolerant of differences between the two, but there was an ABI
change at some point, leading to that error. IIRC I hit that, but it
vanished as soon as I used a matching libllvm and clang.

Greetings,

Andres Freund

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#2)
Re: Do we work with LLVM 12 on s390x?

Andres Freund <andres@anarazel.de> writes:

I think the error above comes from a "mismatch" between the clang used
to compile bitcode, and the LLVM version linked to. Normally we're
somewhat tolerant of differences between the two, but there was an ABI
change at some point, leading to that error. IIRC I hit that, but it
vanished as soon as I used a matching libllvm and clang.

Thanks, I passed that advice on.

regards, tom lane

#4Honza Horak
hhorak@redhat.com
In reply to: Tom Lane (#3)
Re: Do we work with LLVM 12 on s390x?

On 3/19/21 8:15 PM, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I think the error above comes from a "mismatch" between the clang used
to compile bitcode, and the LLVM version linked to. Normally we're
somewhat tolerant of differences between the two, but there was an ABI
change at some point, leading to that error. IIRC I hit that, but it
vanished as soon as I used a matching libllvm and clang.

Thanks, I passed that advice on.

regards, tom lane

Tom Stellard was so kind to look at this issue deeper with his LLVM
skills and found PostgreSQL is not actually handling the LLVM perfectly.
He's working on improving the patch, but sharing even the first attempt
with upstream seems like a good idea:

https://src.fedoraproject.org/rpms/postgresql/pull-request/29

Regards,
Honza

#5Tom Stellard
tstellar@redhat.com
In reply to: Honza Horak (#4)
1 attachment(s)
Re: Do we work with LLVM 12 on s390x?

On 4/21/21 6:40 AM, Honza Horak wrote:

On 3/19/21 8:15 PM, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I think the error above comes from a "mismatch" between the clang used
to compile bitcode, and the LLVM version linked to. Normally we're
somewhat tolerant of differences between the two, but there was an ABI
change at some point, leading to that error.  IIRC I hit that, but it
vanished as soon as I used a matching libllvm and clang.

Thanks, I passed that advice on.

            regards, tom lane

Tom Stellard was so kind to look at this issue deeper with his LLVM skills and found PostgreSQL is not actually handling the LLVM perfectly. He's working on improving the patch, but sharing even the first attempt with upstream seems like a good idea:

https://src.fedoraproject.org/rpms/postgresql/pull-request/29

I wrote a new patch based on the bug discussion[1]/messages/by-id/16971-5d004d34742a3d35@postgresql.org. It works around
the issue specifically on s390x rather than disabling specific
CPUs and features for all targets. The patch is attached.

[1]: /messages/by-id/16971-5d004d34742a3d35@postgresql.org

Show quoted text

Regards,
Honza

Attachments:

0001-jit-Workaround-potential-datalayout-mismatch-on-s390.patchtext/x-patch; charset=UTF-8; name=0001-jit-Workaround-potential-datalayout-mismatch-on-s390.patchDownload
From 0edaa982336823d4d7af8f10b91579fe0099ef3d Mon Sep 17 00:00:00 2001
From: Tom Stellard <tstellar@redhat.com>
Date: Tue, 20 Apr 2021 20:14:21 -0700
Subject: [PATCH] jit: Workaround potential datalayout mismatch on s390x

LLVM's s390x target uses a different datalayout for z13 and newer processors.
If llvmjit_types.bc is compiled to target a processor older than z13, and
then the JIT runs on a z13 or newer processor, then there will be a mismatch
in datalayouts between llvmjit_types.bc and the JIT engine.  This mismatch
causes the JIT to fail at runtime.
---
 src/backend/jit/llvm/llvmjit.c | 46 ++++++++++++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 98a27f08bf..05b6438ba8 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -776,6 +776,35 @@ llvm_compile_module(LLVMJitContext *context)
 			 errhidecontext(true)));
 }
 
+/*
+ * For the systemz target, LLVM uses a different datalayout for z13 and newer
+ * CPUs than it does for older CPUs.  This can cause a mismatch in datalayouts
+ * in the case where the llvm_types_module is compiled with a pre-z13 CPU
+ * and the JIT is running on z13 or newer.
+ * See computeDataLayout() function in
+ * llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp for information on the
+ * datalayout differences.
+ */
+static bool
+needs_systemz_workaround(void)
+{
+	bool ret = false;
+	LLVMContextRef llvm_context;
+	LLVMTypeRef vec_type;
+	LLVMTargetDataRef llvm_layoutref;
+	if (strncmp(LLVMGetTargetName(llvm_targetref), "systemz", strlen("systemz")))
+	{
+		return false;
+	}
+
+	llvm_context = LLVMGetModuleContext(llvm_types_module);
+	vec_type = LLVMVectorType(LLVMIntTypeInContext(llvm_context, 32), 4);
+	llvm_layoutref = LLVMCreateTargetData(llvm_layout);
+	ret = (LLVMABIAlignmentOfType(llvm_layoutref, vec_type) == 16);
+	LLVMDisposeTargetData(llvm_layoutref);
+	return ret;
+}
+
 /*
  * Per session initialization.
  */
@@ -785,6 +814,7 @@ llvm_session_initialize(void)
 	MemoryContext oldcontext;
 	char	   *error = NULL;
 	char	   *cpu = NULL;
+	char       *host_features = NULL;
 	char	   *features = NULL;
 	LLVMTargetMachineRef opt0_tm;
 	LLVMTargetMachineRef opt3_tm;
@@ -816,10 +846,17 @@ llvm_session_initialize(void)
 	 * features not all CPUs have (weird, huh).
 	 */
 	cpu = LLVMGetHostCPUName();
-	features = LLVMGetHostCPUFeatures();
+	features = host_features = LLVMGetHostCPUFeatures();
 	elog(DEBUG2, "LLVMJIT detected CPU \"%s\", with features \"%s\"",
 		 cpu, features);
 
+	if (needs_systemz_workaround())
+	{
+		const char *no_vector =",-vector";
+		features = malloc(sizeof(char) * (strlen(host_features) + strlen(no_vector) + 1));
+		sprintf(features, "%s%s", host_features, no_vector);
+	}
+
 	opt0_tm =
 		LLVMCreateTargetMachine(llvm_targetref, llvm_triple, cpu, features,
 								LLVMCodeGenLevelNone,
@@ -833,8 +870,13 @@ llvm_session_initialize(void)
 
 	LLVMDisposeMessage(cpu);
 	cpu = NULL;
-	LLVMDisposeMessage(features);
+	if (features != host_features)
+	{
+		free(features);
+	}
 	features = NULL;
+	LLVMDisposeMessage(host_features);
+	host_features = NULL;
 
 	/* force symbols in main binary to be loaded */
 	LLVMLoadLibraryPermanently(NULL);
-- 
2.27.0

#6Honza Horak
hhorak@redhat.com
In reply to: Tom Stellard (#5)
Re: Do we work with LLVM 12 on s390x?

On 4/22/21 6:35 PM, Tom Stellard wrote:

On 4/21/21 6:40 AM, Honza Horak wrote:

On 3/19/21 8:15 PM, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I think the error above comes from a "mismatch" between the clang used
to compile bitcode, and the LLVM version linked to. Normally we're
somewhat tolerant of differences between the two, but there was an ABI
change at some point, leading to that error.  IIRC I hit that, but it
vanished as soon as I used a matching libllvm and clang.

Thanks, I passed that advice on.

            regards, tom lane

Tom Stellard was so kind to look at this issue deeper with his LLVM
skills and found PostgreSQL is not actually handling the LLVM
perfectly. He's working on improving the patch, but sharing even the
first attempt with upstream seems like a good idea:

https://src.fedoraproject.org/rpms/postgresql/pull-request/29

I wrote a new patch based on the bug discussion[1].  It works around
the issue specifically on s390x rather than disabling specific
CPUs and features for all targets.  The patch is attached.

[1]
/messages/by-id/16971-5d004d34742a3d35@postgresql.org

Thanks, Tom, it looks good in koji build, so merging so far. We very
much appreciate your help here.

Cheers,
Honza

Show quoted text

Regards,
Honza

#7Andres Freund
andres@anarazel.de
In reply to: Tom Stellard (#5)
Re: Do we work with LLVM 12 on s390x?

Hi,

On 2021-04-22 09:35:48 -0700, Tom Stellard wrote:

On 4/21/21 6:40 AM, Honza Horak wrote:
I wrote a new patch based on the bug discussion[1]. It works around
the issue specifically on s390x rather than disabling specific
CPUs and features for all targets. The patch is attached.

Cool, this is a pretty clear improvement. There's a few minor things I'd
change to fit it into PG - do you mind if I send that to the thread at
[1]: for you to test before I push it?

+/*
+ * For the systemz target, LLVM uses a different datalayout for z13 and newer
+ * CPUs than it does for older CPUs.  This can cause a mismatch in datalayouts
+ * in the case where the llvm_types_module is compiled with a pre-z13 CPU
+ * and the JIT is running on z13 or newer.
+ * See computeDataLayout() function in
+ * llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp for information on the
+ * datalayout differences.
+ */
+static bool
+needs_systemz_workaround(void)
+{
+	bool ret = false;
+	LLVMContextRef llvm_context;
+	LLVMTypeRef vec_type;
+	LLVMTargetDataRef llvm_layoutref;
+	if (strncmp(LLVMGetTargetName(llvm_targetref), "systemz", strlen("systemz")))
+	{
+		return false;
+	}
+
+	llvm_context = LLVMGetModuleContext(llvm_types_module);
+	vec_type = LLVMVectorType(LLVMIntTypeInContext(llvm_context, 32), 4);
+	llvm_layoutref = LLVMCreateTargetData(llvm_layout);
+	ret = (LLVMABIAlignmentOfType(llvm_layoutref, vec_type) == 16);
+	LLVMDisposeTargetData(llvm_layoutref);
+	return ret;
+}

I wonder if it'd be better to compare LLVMCopyStringRepOfTargetData() of
the llvm_types_module with the one of the JIT target machine, and only
specify -vector in that case? We currently support older LLVM versions
than the one that introduced the vector specific handling for systemz,
and I don't know what'd happen if we unnecessarily specified -vector.

Greetings,

Andres Freund

#8Tom Stellard
tstellar@redhat.com
In reply to: Andres Freund (#7)
Re: Do we work with LLVM 12 on s390x?

On 4/22/21 3:25 PM, Andres Freund wrote:

Hi,

On 2021-04-22 09:35:48 -0700, Tom Stellard wrote:

On 4/21/21 6:40 AM, Honza Horak wrote:
I wrote a new patch based on the bug discussion[1]. It works around
the issue specifically on s390x rather than disabling specific
CPUs and features for all targets. The patch is attached.

Cool, this is a pretty clear improvement. There's a few minor things I'd
change to fit it into PG - do you mind if I send that to the thread at
[1] for you to test before I push it?

Sure, no problem.

+/*
+ * For the systemz target, LLVM uses a different datalayout for z13 and newer
+ * CPUs than it does for older CPUs.  This can cause a mismatch in datalayouts
+ * in the case where the llvm_types_module is compiled with a pre-z13 CPU
+ * and the JIT is running on z13 or newer.
+ * See computeDataLayout() function in
+ * llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp for information on the
+ * datalayout differences.
+ */
+static bool
+needs_systemz_workaround(void)
+{
+	bool ret = false;
+	LLVMContextRef llvm_context;
+	LLVMTypeRef vec_type;
+	LLVMTargetDataRef llvm_layoutref;
+	if (strncmp(LLVMGetTargetName(llvm_targetref), "systemz", strlen("systemz")))
+	{
+		return false;
+	}
+
+	llvm_context = LLVMGetModuleContext(llvm_types_module);
+	vec_type = LLVMVectorType(LLVMIntTypeInContext(llvm_context, 32), 4);
+	llvm_layoutref = LLVMCreateTargetData(llvm_layout);
+	ret = (LLVMABIAlignmentOfType(llvm_layoutref, vec_type) == 16);
+	LLVMDisposeTargetData(llvm_layoutref);
+	return ret;
+}

I wonder if it'd be better to compare LLVMCopyStringRepOfTargetData() of
the llvm_types_module with the one of the JIT target machine, and only
specify -vector in that case? We currently support older LLVM versions
than the one that introduced the vector specific handling for systemz,
and I don't know what'd happen if we unnecessarily specified -vector.

The problem is that you have to pass the features to LLVMCreateTargetMachine
in order to know what the data layout of the JIT target is going to be,
so the only way to make this work, would be to create the TargetMachine
with the default features, check it's datalayout, and then re-create the
TargetMachine in order to apply the workaround. Maybe that's not so bad?

The other question I had is should we #ifdef ARCH_S390x in
needs_sytemz_workaround(), so we don't need to check the target
name.

-Tom

Show quoted text

Greetings,

Andres Freund