[stdlib] Add to Unsafe[Mutable]RawBufferPointer implementation of _custom[Last]IndexOfEquatableElement #63128
Conversation
The benchmark needs to be added in a PR separate from the change you intend to measure; #63106 can fulfill that requirement. Please remove the benchmark from this PR. You should also add a test to exercise correctness in test/stdlib/UnsafeRawBufferPointer.swift.
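For example, a correctness test there could follow the StdlibUnittest pattern used by the existing tests in test/stdlib. This is only a rough sketch with placeholder suite/test names and byte values, and the RUN/lit directives are omitted:

import StdlibUnittest

// Sketch of a correctness test for the firstIndex(of:)/lastIndex(of:) fast paths.
var RawBufferIndexOfTests = TestSuite("UnsafeRawBufferPointerIndexOf")

RawBufferIndexOfTests.test("firstIndex(of:)/lastIndex(of:)") {
  // A buffer long enough to exercise both a word loop and a scalar tail.
  var bytes = [UInt8](repeating: 0, count: 100)
  bytes[3] = 42
  bytes[97] = 42
  bytes.withUnsafeBytes { buffer in
    expectEqual(3, buffer.firstIndex(of: 42))
    expectEqual(97, buffer.lastIndex(of: 42))
    expectNil(buffer.firstIndex(of: 7))
    expectNil(buffer.lastIndex(of: 7))
  }
}

runAllTests()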
let word: UInt = UnsafePointer<UInt>(
  bitPattern: address &+ i
)._unsafelyUnwrappedUnchecked.pointee
Would ptr.load(fromByteOffset: i, as: UInt.self) generate worse code?
Or, if we must resort to perhaps-undefined behaviour:
let word = ptr.advanced(by: i).assumingMemoryBound(to: UInt.self).pointee
For the same short instruction count, but helping the compiler understand what's going on:
let word = ptr.advanced(by: i).withMemoryRebound(
to: UInt.self, capacity: 1, { $0.pointee }
)
(80 columns is very narrow.)
Also, UInt isn't 64 bits on every platform. It should be UInt64.
All three fixes you propose generate the same code, two move instructions, on ARM. I personally find the variant ptr.load(fromByteOffset: i, as: UInt.self) the easiest to understand. I don't get why the last variant you proposed helps the compiler understand what's going on, or why we should care about that.
About UInt64: why not UInt128 then? I expect UInt to be the fastest on all platforms (32- or 64-bit). Or perhaps the opposite: UInt64 might be slower than UInt on a 32-bit platform. I don't have a 32-bit platform to benchmark this on. How should we proceed? Should I make a separate PR with a UInt64 variant, so that after benchmarking we can make our choice?
We don't have UInt128 codified, so it's not an option. However, UInt64 does exist on every 32-bit platform of interest. Since this is largely about reducing the number of loop iterations, going to UInt64 seems like a reasonable option.
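For context on the width difference (nothing PR-specific here):

// UInt tracks the platform word, so a UInt-based scan advances 8 bytes per
// iteration on 64-bit targets but only 4 on 32-bit ones; UInt64 advances 8 everywhere.
print(MemoryLayout<UInt>.stride)    // 8 on 64-bit platforms, 4 on 32-bit platforms
print(MemoryLayout<UInt64>.stride)  // always 8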
Regarding ptr.load(fromByteOffset: i, as: UInt.self): I happened to be looking at x86_64, and the load option generates two extra instructions when compared to the straight pointer dereference. 🤷🏽 The last variant I suggested is better than your original one because bouncing between Int and pointer representations confuses the compiler. Did you try using loadUnaligned? If that generates compact code, then we can eliminate the first loop that brings the pointer into alignment.
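For illustration (not taken from this PR): loadUnaligned, available since Swift 5.7, reads a word starting at any byte offset, whereas load(fromByteOffset:as:) requires the offset to be suitably aligned for the loaded type, which is why the current code needs the alignment pre-loop. A tiny self-contained example, with arbitrary buffer contents:

// Sketch: an unaligned word load at an arbitrary offset.
let bytes = [UInt8](repeating: 0x2A, count: 16)
bytes.withUnsafeBytes { raw in
  // Offset 3 is not UInt-aligned; load(fromByteOffset:as:) would require
  // alignment here, but loadUnaligned does not.
  let word = raw.loadUnaligned(fromByteOffset: 3, as: UInt.self)
  print(String(word, radix: 16))  // 2a2a2a2a2a2a2a2a on a 64-bit platform
}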
Did you try using loadUnaligned? If that generates compact code then we can eliminate the first loop that brings the pointer into alignment.
Surprisingly, loadUnaligned compiles to the same single move instruction (checked on x86_64). But if the buffer is shorter than the word width, it will segfault (will it?). So we have to branch here depending on the buffer length. Is it worth it?
It's certainly worth skipping when the buffer is too narrow: we clearly want to avoid the out-of-bounds memory access. e.g.
if count >= stride {
// loop setup
repeat {
// load word and compare
i &+= stride
} while i < endOfStride
}
// final loop
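Fleshing that shape out into something self-contained — a sketch of the approach under discussion, not this PR's actual code; it uses loadUnaligned and UInt64 per the suggestions above, and the constants are the usual byte-broadcast / "has a zero byte" bit-trick values:

// Sketch only: word-at-a-time firstIndex(of:) with a guarded word loop and a scalar tail.
func wordwiseFirstIndex(of byte: UInt8, in buffer: UnsafeRawBufferPointer) -> Int? {
  guard let base = buffer.baseAddress else { return nil }
  let stride = MemoryLayout<UInt64>.stride
  let broadcast = UInt64(byte) &* 0x0101_0101_0101_0101  // byte copied into all 8 lanes
  var i = 0
  if buffer.count >= stride {
    let endOfStride = buffer.count - stride
    repeat {
      let word = base.loadUnaligned(fromByteOffset: i, as: UInt64.self)
      let x = word ^ broadcast
      // "Has a zero byte" trick: nonzero exactly when some byte of `word` equals `byte`.
      if (x &- 0x0101_0101_0101_0101) & ~x & 0x8080_8080_8080_8080 != 0 {
        break  // the match is within these 8 bytes; locate it in the scalar loop below
      }
      i &+= stride
    } while i <= endOfStride
  }
  // Final scalar loop: handles the tail and also buffers narrower than one word.
  while i < buffer.count {
    if buffer[i] == byte { return i }
    i &+= 1
  }
  return nil
}

Calling it via bytes.withUnsafeBytes { wordwiseFirstIndex(of: 42, in: $0) } should agree with the plain firstIndex(of:) on the same buffer.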
Something to look for is that there may be a minimum number of strides for which it's worth paying the setup cost of doing a vectorized loop. I would hope that number is 1, but I wouldn't be surprised by a different result.
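If that turns out to matter, it could be a small guard in front of the word loop; the constant here is purely hypothetical and would have to come from benchmarking:

// Hypothetical tuning knob: only pay the word-loop setup cost when it will run
// at least `minimumWords` iterations; otherwise go straight to the scalar loop.
func shouldUseWordLoop(byteCount: Int) -> Bool {
  let minimumWords = 2  // placeholder value; benchmarks would decide the real one
  return byteCount >= minimumWords &* MemoryLayout<UInt64>.stride
}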
Created a new issue to describe this work: #63200
I pushed a shim we could use for …
After discussions with teammates, we believe that using …
How should we proceed to see the difference between the current stdlib implementation, the current state of this PR, and … I am ready to push …
@valeriyvan I'll trigger a benchmark run with the current state, and we can go from there. Thanks for your patience!
@swift-ci please benchmark
Preliminary benchmarking results, keeping only the new benchmarks. (We can worry about other benchmarks later.)
[Benchmark report: Performance (x86_64) at -O, -Osize, and -Onone; code size at -O, -Osize, and -swiftlibs. Table data, the "How to read the data" notes, and the hardware overview are omitted; per those notes, the tables show only performance differences larger than 8% and code-size differences larger than 1%, and the performance results (not code size) can contain noise.]
@xwu, could you please re-run the benchmark?
@swift-ci benchmark
[Benchmark results: Performance (x86_64) at -O, -Osize, and -Onone; table data omitted.]
@eeckstein, could you please re-run the benchmark once again?
@swift-ci benchmark
[Benchmark results: Performance (x86_64) at -O, -Osize, and -Onone; code size at -O and -Osize; table data omitted.]
@eeckstein, may I ask you to re-run the benchmark once again?
@swift-ci benchmark
[Benchmark results: Performance (x86_64) at -O, -Osize, and -Onone; code size at -O and -Osize; table data omitted.]
Is this an acceptable price for the improvements?
All of these APIs need availability annotations (or …
I for one would be curious how that, as well as adopting @glessard’s suggestion to try …
I don't get why these need availability annotations. Could you please explain?
Will try to tinker with …
We need availability annotations because these new functions will not be inlined (and should not be), and will become part of the …
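Sketching what such an annotation could look like inside the stdlib sources; the SwiftStdlib 9999 spelling is the placeholder used before a release number is assigned, and the stub body stands in for the real implementation:

extension UnsafeRawBufferPointer {
  // Because this function ships in the standard library binary rather than being
  // inlined into clients, it must declare which stdlib release introduces it.
  @available(SwiftStdlib 9999, *)  // placeholder until a release number is assigned
  public func _customIndexOfEquatableElement(_ element: UInt8) -> Int?? {
    return nil  // stub body; the real implementation is the word-wise search above
  }
}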
@swift-ci please benchmark
[Benchmark results: Performance (x86_64) at -O, -Osize, and -Onone; code size at -O and -Osize; table data omitted.]
Removed … Added …
Attempt to speed up firstIndex(of:) of Unsafe[Mutable]RawBufferPointer by implementing _customIndexOfEquatableElement.
Currently the benchmark of firstIndex(of:) terminates with Signals.SIGABRT: 6 and I have no idea why. In the separate branch #63106 this benchmark works normally.
Any help is appreciated.
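For context, the customization point being implemented returns a double optional. As I understand the contract (standard Collection behavior, not something specific to this PR), a caller interprets it roughly like this; the function name is illustrative:

// _customIndexOfEquatableElement returns Index?? with three meaningful cases:
//   nil              -> no fast answer; the caller falls back to a linear search
//   .some(nil)       -> searched: the element is definitely not present
//   .some(.some(i))  -> searched: the first occurrence is at index i
// Sketch of how a caller consumes that answer:
func firstIndexViaCustomization(of byte: UInt8,
                                in buffer: UnsafeRawBufferPointer) -> Int? {
  if let answer = buffer._customIndexOfEquatableElement(byte) {
    return answer  // definitive result from the fast path (found or not found)
  }
  return buffer.firstIndex(of: byte)  // generic linear-search fallback
}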