[PATCH] Log details for client certificate failures
Hello,
(I'm cleaning up some old git branches and found this. It was helpful
when I was trying to debug failures between an NSS client and an
OpenSSL server, and it seems general enough to help for more
complicated OpenSSL-only setups as well.)
Currently, debugging client cert verification failures is mostly
limited to looking at the TLS alert code on the client side. For simple
deployments, usually it's enough to see "sslv3 alert certificate
revoked" and know exactly what needs to be fixed, but if you add any
more complexity (multiple CA layers, misconfigured CA certificates,
etc.), trying to debug what happened based on the TLS alert alone can
be an exercise in frustration.
Luckily, the server has more information about exactly what failed in
the chain, and we already have the requisite callback implemented as a
stub, so I've filled it out with error handling and added a COMMERROR
log so that a DBA can debug client failures more easily.
It ends up looking like
LOG: connection received: host=localhost port=44120
LOG: client certificate verification failed at depth 1: unable to get local issuer certificate
DETAIL: failed certificate's subject: /CN=Test CA for PostgreSQL SSL regression test client certs
LOG: could not accept SSL connection: certificate verify failed
It might be even nicer to make this available to the client, but I
think the server log is an appropriate place for this information -- an
admin might not want to advertise exactly why a client certificate has
failed verification (other than what's already available via the TLS
alert, that is), and I think more complicated failures (with
intermediate CAs, etc.) are going to need administrator intervention
anyway. So having to check the logs doesn't seem like a big hurdle.
One question/concern -- the Subject that's printed to the logs could be
pretty big (OpenSSL limits the incoming certificate chain to 100K, by
default), which introduces an avenue for intentional log spamming. Is
there an existing convention for limiting the length of log output used
for debugging? Maybe I should just hardcode a smaller limit and
truncate anything past that? Or we could just log the Common Name,
which should be limited to 64 bytes...
I'll add this to the July commitfest.
Thanks,
--Jacob
Attachments:
0001-Log-details-for-client-certificate-failures.patchtext/x-patch; name=0001-Log-details-for-client-certificate-failures.patchDownload+60-5
On 03.05.22 19:04, Jacob Champion wrote:
One question/concern -- the Subject that's printed to the logs could be
pretty big (OpenSSL limits the incoming certificate chain to 100K, by
default), which introduces an avenue for intentional log spamming. Is
there an existing convention for limiting the length of log output used
for debugging? Maybe I should just hardcode a smaller limit and
truncate anything past that? Or we could just log the Common Name,
which should be limited to 64 bytes...
The information in pg_stat_ssl is limited to NAMEDATALEN (see struct
PgBackendSSLStatus).
It might make sense to align what your patch prints to identify
certificates with what is shown in that view.
On Tue, 2022-05-03 at 21:06 +0200, Peter Eisentraut wrote:
The information in pg_stat_ssl is limited to NAMEDATALEN (see struct
PgBackendSSLStatus).It might make sense to align what your patch prints to identify
certificates with what is shown in that view.
Sure, a max length should be easy enough to do. Is there a reason to
limit to NAMEDATALEN specifically? I was under the impression that we
would rather not have had that limitation in the stats framework, if we
could have avoided it. (In particular I think NAMEDATALEN will cut off
the longest possible Common Name by just five bytes.)
Thanks,
--Jacob
On 04.05.22 01:05, Jacob Champion wrote:
On Tue, 2022-05-03 at 21:06 +0200, Peter Eisentraut wrote:
The information in pg_stat_ssl is limited to NAMEDATALEN (see struct
PgBackendSSLStatus).It might make sense to align what your patch prints to identify
certificates with what is shown in that view.Sure, a max length should be easy enough to do. Is there a reason to
limit to NAMEDATALEN specifically? I was under the impression that we
would rather not have had that limitation in the stats framework, if we
could have avoided it. (In particular I think NAMEDATALEN will cut off
the longest possible Common Name by just five bytes.)
Just saying that cutting it off appears to be acceptable. A bit more
than 63 bytes should be okay for the log.
In terms of aligning what is printed, I meant that pg_stat_ssl uses the
issuer plus serial number to identify the certificate unambiguously.
On Wed, 2022-05-04 at 15:53 +0200, Peter Eisentraut wrote:
Just saying that cutting it off appears to be acceptable. A bit more
than 63 bytes should be okay for the log.
Gotcha.
In terms of aligning what is printed, I meant that pg_stat_ssl uses the
issuer plus serial number to identify the certificate unambiguously.
Oh, that's a great idea. I'll do that too.
Thanks!
--Jacob
On Thu, 2022-05-05 at 15:12 +0000, Jacob Champion wrote:
On Wed, 2022-05-04 at 15:53 +0200, Peter Eisentraut wrote:
In terms of aligning what is printed, I meant that pg_stat_ssl uses the
issuer plus serial number to identify the certificate unambiguously.Oh, that's a great idea. I'll do that too.
v2 limits the maximum subject length and adds the serial number to the
logs.
Thanks!
--Jacob
On 13.05.22 00:36, Jacob Champion wrote:
On Thu, 2022-05-05 at 15:12 +0000, Jacob Champion wrote:
On Wed, 2022-05-04 at 15:53 +0200, Peter Eisentraut wrote:
In terms of aligning what is printed, I meant that pg_stat_ssl uses the
issuer plus serial number to identify the certificate unambiguously.Oh, that's a great idea. I'll do that too.
v2 limits the maximum subject length and adds the serial number to the
logs.
I wrote that pg_stat_ssl uses the *issuer* plus serial number to
identify a certificate. What your patch shows is the subject and the
serial number, which isn't the same thing. Let's get that sorted out
one way or the other.
Another point, your patch produces
LOG: connection received: host=localhost port=44120
LOG: client certificate verification failed at depth 1: ...
DETAIL: failed certificate had subject ...
LOG: could not accept SSL connection: certificate verify failed
I guess what we really would like is
LOG: connection received: host=localhost port=44120
LOG: could not accept SSL connection: certificate verify failed
DETAIL: client certificate verification failed at depth 1: ...
failed certificate had subject ...
But I suppose that would be very cumbersome to produce with the callback
structure provided by OpenSSL?
I'm not saying the proposed way is unacceptable, but maybe it's worth
being explicit about this tradeoff.
On 30 Jun 2022, at 10:43, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
I wrote that pg_stat_ssl uses the *issuer* plus serial number to identify a certificate. What your patch shows is the subject and the serial number, which isn't the same thing. Let's get that sorted out one way or the other.
Quick observation on this one, the string format of an issuer and serial number is defined as a “Certificate Exact Assertion” in RFC 4523.
I added this to httpd a while back:
SSL_CLIENT_CERT_RFC4523_CEA
It would be good to interoperate.
Regards,
Graham
—
On Thu, Jun 30, 2022 at 2:43 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
On 13.05.22 00:36, Jacob Champion wrote:
v2 limits the maximum subject length and adds the serial number to the
logs.I wrote that pg_stat_ssl uses the *issuer* plus serial number to
identify a certificate. What your patch shows is the subject and the
serial number, which isn't the same thing. Let's get that sorted out
one way or the other.
Sorry for the misunderstanding! v3 adds the Issuer to the logs as well.
I wanted to clarify that this "issuer" has not actually been verified,
but all I could come up with was "purported issuer" which doesn't read
well to me. "Claimed issuer"? "Alleged issuer"? Thoughts?
Another point, your patch produces
LOG: connection received: host=localhost port=44120
LOG: client certificate verification failed at depth 1: ...
DETAIL: failed certificate had subject ...
LOG: could not accept SSL connection: certificate verify failedI guess what we really would like is
LOG: connection received: host=localhost port=44120
LOG: could not accept SSL connection: certificate verify failed
DETAIL: client certificate verification failed at depth 1: ...
failed certificate had subject ...But I suppose that would be very cumbersome to produce with the callback
structure provided by OpenSSL?
I was about to say "yes, very cumbersome", but I actually think we
might be able to do that without bubbling the error up through
multiple callback layers, using SSL_set_ex_data() and friends. I'll
take a closer look.
Thanks!
--Jacob
On Thu, Jun 30, 2022 at 2:54 AM Graham Leggett <minfrin@sharp.fm> wrote:
I added this to httpd a while back:
SSL_CLIENT_CERT_RFC4523_CEA
It would be good to interoperate.
What kind of interoperation did you have in mind? Are there existing
tools that want to scrape this information for observability?
I think the CEA syntax might not be a good fit for this particular
patch: first, we haven't actually verified the certificate, so no one
should be using it to assert certificate equality (and I'm truncating
the Issuer anyway, to avoid letting someone flood the logs). Second,
this is designed to be human-readable rather than machine-readable.
Thanks,
--Jacob
On Fri, Jul 1, 2022 at 1:51 PM Jacob Champion <jchampion@timescale.com> wrote:
Sorry for the misunderstanding! v3 adds the Issuer to the logs as well.
Resending v3; I messed up the certificate diff with my gitconfig.
--Jacob
Attachments:
v3-0001-Log-details-for-client-certificate-failures.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Log-details-for-client-certificate-failures.patchDownload+200-7
On 05.07.22 18:34, Jacob Champion wrote:
On Fri, Jul 1, 2022 at 1:51 PM Jacob Champion <jchampion@timescale.com> wrote:
Sorry for the misunderstanding! v3 adds the Issuer to the logs as well.
Resending v3; I messed up the certificate diff with my gitconfig.
This patch looks pretty good to me. Some minor details:
I looked into how you decode the serial number. I have found some code
elsewhere that passed the result of X509_get_serialNumber() directly to
ASN1_INTEGER_set(). But I guess a serial number of maximum length 20
octets wouldn't fit into a 32-bit long. (There is
ASN1_INTEGER_set_int64(), but that requires OpenSSL 1.1.0.) Does that
match your understanding?
For the detail string, I think we could do something like:
DETAIL: Failed certificate data (unverified): subject '%s', serial
number %s, issuer '%s'
On Thu, Jul 7, 2022 at 2:50 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
I looked into how you decode the serial number. I have found some code
elsewhere that passed the result of X509_get_serialNumber() directly to
ASN1_INTEGER_set(). But I guess a serial number of maximum length 20
octets wouldn't fit into a 32-bit long. (There is
ASN1_INTEGER_set_int64(), but that requires OpenSSL 1.1.0.) Does that
match your understanding?
Yep. And the bit lengths of the serial numbers used in the test suite
are in the low 60s already. Many people will just randomize their
serial numbers, so I think BN_bn2dec() is the way to go.
For the detail string, I think we could do something like:
DETAIL: Failed certificate data (unverified): subject '%s', serial
number %s, issuer '%s'
Done that way in v4.
I also added an optional 0002 that bubbles the error info up to the
final ereport(ERROR), using errdetail() and errhint(). I can squash it
into 0001 if you like it, or drop it if you don't. (This approach
could be adapted to the client, too.)
Thanks!
--Jacob
Attachments:
since-v3.diff.txttext/plain; charset=US-ASCII; name=since-v3.diff.txtDownload+8-7
v4-0001-Log-details-for-client-certificate-failures.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Log-details-for-client-certificate-failures.patchDownload+201-7
v4-0002-squash-Log-details-for-client-certificate-failure.patchtext/x-patch; charset=US-ASCII; name=v4-0002-squash-Log-details-for-client-certificate-failure.patchDownload+46-23
On 01 Jul 2022, at 22:59, Jacob Champion <jchampion@timescale.com> wrote:
I added this to httpd a while back:
SSL_CLIENT_CERT_RFC4523_CEA
It would be good to interoperate.
What kind of interoperation did you have in mind? Are there existing
tools that want to scrape this information for observability?
This is for human troubleshooting.
I think the CEA syntax might not be a good fit for this particular
patch: first, we haven't actually verified the certificate, so no one
should be using it to assert certificate equality (and I'm truncating
the Issuer anyway, to avoid letting someone flood the logs). Second,
this is designed to be human-readable rather than machine-readable.
This is what a CEA looks like:
{ serialNumber 400410167207191393705333222102472642510002355884, issuer rdnSequence:”CN=Foo UK G1,O=Foo,C=UK" }
Whitespace and escaping is important above.
When troubleshooting, you want a string like the above that you can cut and paste and search for in other systems and log files. The verification status of the cert isn’t an issue at this point, you have a system in front of you where it doesn’t work when it should, and you need to know exactly what’s connecting, not what you think you’re connecting to, and you need precise data.
Please don’t invent another format, or try and truncate the data. This is a huge headache when troubleshooting.
Regards,
Graham
—
On 08.07.22 20:39, Jacob Champion wrote:
I also added an optional 0002 that bubbles the error info up to the
final ereport(ERROR), using errdetail() and errhint(). I can squash it
into 0001 if you like it, or drop it if you don't. (This approach
could be adapted to the client, too.)
I squashed those two together. I also adjusted the error message a bit
more for project style. (We can put both lines into detail.)
I had to read up on this "ex_data" API. Interesting. But I'm wondering
a bit about how the life cycle of these objects is managed. What
happens if the allocated error string is deallocated before its
containing object? Or vice versa? How do we ensure we don't
accidentally reuse the error string when the code runs again? (I guess
currently it can't?) Maybe we should avoid this and just put the
errdetail itself into a static variable that we unset once the message
is printed?
Attachments:
v5-0001-Log-details-for-client-certificate-failures.patchtext/plain; charset=UTF-8; name=v5-0001-Log-details-for-client-certificate-failures.patchDownload+230-8
On Sat, Jul 9, 2022 at 6:49 AM Graham Leggett <minfrin@sharp.fm> wrote:
Please don’t invent another format, or try and truncate the data. This is a huge headache when troubleshooting.
I hear you, and I agree that correlating these things across machines
is something we should be making easier. I'm just not convinced that
the particular format you've proposed, with a new set of rules for
quoting and escaping, needs to be part of this patch. (And I think
there are good reasons to truncate unverified cert data, so there'd
have to be clear benefits to offset the risk of opening it up.)
Searching Google for "issuer rdnSequence" comes up with mostly false
positives related to LDAP filtering and certificate dumps, and the
true positives seem to be mail threads that you've participated in. Do
many LDAP servers log certificate failures in this format by default?
(For that matter, does httpd?) The discussion at the time you added
this to httpd [1]https://lists.apache.org/thread/1665qc4mod7ppp58qk3bqc2l3wtl3lkn seemed to be making the point that this was a niche
format, suited mostly for interaction with LDAP filters -- and Kaspar
additionally pointed out that it's not a canonical format, so all of
our implementations would have to have an ad hoc agreement to choose
exactly one encoding.
If you're using randomized serial numbers, you should be able to grep
for those by themselves and successfully match many different formats,
no? To me, that seems good enough for a first patch, considering we
don't currently log any of this information.
--Jacobfi
[1]: https://lists.apache.org/thread/1665qc4mod7ppp58qk3bqc2l3wtl3lkn
On Mon, Jul 11, 2022 at 6:09 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
I squashed those two together. I also adjusted the error message a bit
more for project style. (We can put both lines into detail.)
Oh, okay. Log parsers don't have any issues with that?
I had to read up on this "ex_data" API. Interesting. But I'm wondering
a bit about how the life cycle of these objects is managed. What
happens if the allocated error string is deallocated before its
containing object? Or vice versa?
Yeah, I'm currently leaning heavily on the lack of any memory context
switches here. And I end up leaking out a pointer to the stale stack
of be_tls_open_server(), which is gross -- it works since there are no
other clients, but that could probably come back to bite us.
The ex_data API exposes optional callbacks for new/dup/free (I'm
currently setting those to NULL), so we can run custom code whenever
the SSL* is destroyed. If you'd rather the data have the same lifetime
of the SSL* object, we can switch to malloc/strdup/free (or even
OPENSSL_strdup() in later versions). But since we don't have any use
for the ex_data outside of this function, maybe we should just clear
it before we return, rather than carrying it around.
How do we ensure we don't
accidentally reuse the error string when the code runs again? (I guess
currently it can't?)
The ex_data is associated with the SSL*, not the global SSL_CTX*, so
that shouldn't be an issue. A new SSL* gets created at the start of
be_tls_open_server().
Maybe we should avoid this and just put the
errdetail itself into a static variable that we unset once the message
is printed?
If you're worried about the lifetime of the palloc'd data being too
short, does switching to a static variable help in that case?
--Jacob
On 13.07.22 01:06, Jacob Champion wrote:
I had to read up on this "ex_data" API. Interesting. But I'm wondering
a bit about how the life cycle of these objects is managed. What
happens if the allocated error string is deallocated before its
containing object? Or vice versa?Yeah, I'm currently leaning heavily on the lack of any memory context
switches here. And I end up leaking out a pointer to the stale stack
of be_tls_open_server(), which is gross -- it works since there are no
other clients, but that could probably come back to bite us.The ex_data API exposes optional callbacks for new/dup/free (I'm
currently setting those to NULL), so we can run custom code whenever
the SSL* is destroyed. If you'd rather the data have the same lifetime
of the SSL* object, we can switch to malloc/strdup/free (or even
OPENSSL_strdup() in later versions). But since we don't have any use
for the ex_data outside of this function, maybe we should just clear
it before we return, rather than carrying it around.
Concretely, I was thinking like the attached top-up patch.
The other way can surely be made to work somehow, but this seems much
simpler and with fewer questions about the details.
Attachments:
v5-0002-squash-Log-details-for-client-certificate-failure.patchtext/plain; charset=UTF-8; name=v5-0002-squash-Log-details-for-client-certificate-failure.patchDownload+5-25
On Thu, Jul 14, 2022 at 1:12 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
Concretely, I was thinking like the attached top-up patch.
The other way can surely be made to work somehow, but this seems much
simpler and with fewer questions about the details.
Ah, seeing it side-by-side helps. That's much easier, I agree.
Thanks,
--Jacob
On 14.07.22 23:09, Jacob Champion wrote:
On Thu, Jul 14, 2022 at 1:12 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:Concretely, I was thinking like the attached top-up patch.
The other way can surely be made to work somehow, but this seems much
simpler and with fewer questions about the details.Ah, seeing it side-by-side helps. That's much easier, I agree.
Committed like that.