On Wed, Feb 12, 2014 at 06:39:32PM +0900, Yasuo Ohgaki wrote:
> On Tue, Feb 11, 2014 at 4:43 PM, Solar Designer <[email protected]> wrote:
> > On Tue, Feb 11, 2014 at 07:55:41AM +0800, Tjerk Meesters wrote:
> > > This problem isn't about performance, not necessarily anyway; the primary
> > > problem that this rfc is attempting to solve is a string comparison
> > > function with the tightest possible Theta(given-length) runtime.
> >
> > Not exactly. An actual goal is not leaking info via timings, and it
> > may be achieved in a variety of ways. (A secondary goal is not leaking
> > info via memory access pattern, but it's trickier.)
>
> I think memory access pattern cannot be hidden. Code is known,
> compiler/system can be guessed nicely. Therefore, it's better to
> have code that would not depends on memory access pattern.
I think there's some misunderstanding here, but since it would be only a
secondary goal anyway let's not discuss it now.
> > > If this can be accomplished with reading double words first, followed
> > > by a byte wise suffix comparison, then great.
> >
> > It can be, but this will add code complexity and thus potential bugs.
> > I think this is not worth it in a function focused on security. Let's
> > not waste time (and increase risk) trying to implement this.
>
> I was curious about performance difference. Especially, because someone
> told that Python 3.4 uses SipHash for all ==/=== comparison. I thought
> we may have the same.
It is quite possible that comparison via optimized implementation of
SipHash is faster than byte-by-byte constant time comparison (no early
abort either way, but optimized SipHash will do larger than byte reads).
However, for ==/=== we probably want optimized comparison with larger
than byte reads and with early reject (so not timing safe). Going
Python's way is an option, but it is controversial.
I think introducing a (mostly) timing safe string comparison function is
reasonable.
> > I think we should focus mostly on implementation correctness, not speed.
>
> Correctness is mandatory. I agree.
> That's the reason why OpenBSD people use simple timingsafe_bcmp(),
> probably. It cannot be wrong.
BTW, they also keep their strlcpy()/strlcat() very simple - with byte
reads/writes - even though these are more performance-relevant.
> > This is a more general problem with your codebase, but things like use
> > of signed ints for string lengths are not great. Ditto about use of
> > potentially signed chars with implicit promotion to potentially signed
> > int (depending on whether char is signed on a given platform) and then
> > assignment to definitely signed int. Now recall that signed integer
> > overflow has undefined behavior in C. I think you got away with it this
> > time (although one has to consider both possibilities for char's
> > signedness when reviewing code like this), but chances are that you have
> > quite some instances of signed integer overflow UB in the PHP tree.
> > I think you should be using size_t and other explicitly unsigned types
> > where appropriate.
>
> The cast is not forbidden by spec AFAIK. I could be wrong, though.
I'm not saying there's a bug in this instance of the code. I'm saying
that it's outdated and risky practice, that it complicates code reviews
(have to consider what happens when char is signed and what happens when
it is unsigned), that you probably have bugs because of such practice
elsewhere in the tree, that you have artificial limitation (2 GB if int
is 32-bit), that you have non-intuitive behavior when 2 GB is hit (e.g.,
memory_limit set to 5 GB is silently treated as 1 GB, right?), and that
I don't like us spending much time polishing/optimizing one function
while it still uses the wrong data types because the rest of the tree
does. Maybe introduce this new function in a minimal and unoptimized
form first, then bring the whole tree up to modern coding practices,
then revisit improving and optimizing individual functions like this one.
Of course, moving from int to size_t (where appropriate) in a large tree
is a lot of work and it has risks of its own.
> I agree that cast to other data type is tricky. IIRC, MySQL's admin password
> was cracked due to the difference of "signed char" and "unsigned char".
I don't recall that. Got a pointer?
I do recall that pre-4.0 MySQL password hashing is ridiculously weak.
> IIRC, bcrypt had similar issue also.
OpenBSD's original did not.
My reimplementation of it did, yes. :-( An error I made in 1997 or
1998 (plus insufficient re-testing when reusing this code in 2000),
which went undetected until 2011. I think it's the worst bug I ever
had, by the mess it has caused. My lessons from it include more and
better testing (BTW, you got to introduce test vectors with 8-bit chars
for PHP's string comparison functions/operators!) and better coding
practices (less regard to portability to ancient systems, and more to
correctness on current systems).
> str_word_compare() uses cast and I suppose it does what we need.
(I haven't looked. I was merely dragged into this thread to comment on
your new/proposed function.)
> We have to verify assembler code on
> various platforms if it is okay or not.
Looking at assembly code is good, but let's not forget that correct
assembly code doesn't prove the C source code is correct. A piece of C
source that invokes UB may be compiled to correct assembly code most
of the time currently, but not necessarily always and in the future.
> > Much of my own older code has similar drawbacks/risks, some of it
> > coming from pre-C99 portability concerns. These days, I think we should
> > focus on correctness on modern systems, especially when we're talking
> > not about reusable code snippets, but about the PHP codebase, where you
> > can reasonably stipulate C99 as a requirement (perhaps you already do?)
>
> I agree.
>
> I think str_byte_compare() (Simple fixed length byte by byte compare)
> is the total winner unless str_word_compare() is proven to be a correct
> implementation.
>
> I'll vote for str_byte_compare() implementation. (OpenBSD one)
I agree, but I'm not sure which exact code version you're referring to
by str_byte_compare(). Perhaps this was mentioned in another thread
(not CC'ed to me).
I suggest that you write it in a way that the same type conversions are
invoked regardless of char signedness. (Not merely that the different
conversions happen to yield the same correct result anyway.) In other
words, cast to (unsigned char), or cast pointers to (unsigned char *).
> Could we consider to add timing safe string comparison to ==/===
> operator? It assures comparison safety of any PHP scripts, including
> existing one.
It'd be a pity to permanently lose the faster string comparisons with
early abort. There are some reasonable use cases for them. So if we
were to redefine ==/=== to mean timing-safe comparison, we'd need to
introduce a function or/and operator for faster string comparison with
early abort.
Since PHP is not a language primarily focused on security and
correctness, I think defining a new function for timing-safe comparison
is more appropriate for PHP, although having a php.ini setting turning
==/=== into timing-safe could make sense.
Alexander