
8358032: Use crypto pmull for CRC32(C) on Ampere CPU and improve for short inputs #25609


Open
wants to merge 5 commits into
base: master

limingliu-ampere
Member

@limingliu-ampere limingliu-ampere commented Jun 3, 2025

This PR enables the use of crypto pmull for the CRC32/CRC32C intrinsics on Ampere CPUs. There is an option, UseCryptoPmullForCRC32, that enables crypto pmull, but enabling it directly on Ampere CPUs causes the following problems.

  1. There will be regressions (-14% ~ -8%) on Ampere1 when the length is 64. When the length is <= 128, both kernel_crc32_using_crc32 and kernel_crc32_using_crypto_pmull use the loop labeled CRC_by32_loop, but their implementations differ slightly, and the loop in kernel_crc32_using_crc32 is better at hiding latency on Ampere1. So this PR reuses the loop from kernel_crc32_using_crc32 in kernel_crc32_using_crypto_pmull, and does the same for the CRC32C intrinsic.

  2. The intrinsics only use crypto pmull when the length is greater than 383, although the loop in kernel_crc32_common_fold_using_crypto_pmull appears able to handle 256. If it handles 256 on Ampere1, the improvement can be as high as 110% compared with kernel_crc32_using_crc32/kernel_crc32c_using_crc32c. However, there are regressions (~-6%) on Neoverse V1 when the length is 256. So this PR introduces a new option named CryptoPmullForCRC32LowLimit. It defaults to 256, since the code can handle 256, and it is set to 384 for V1/V2 to keep the old behavior on these platforms.
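The length-based dispatch described in the two points above can be pictured with a small model. This is only a sketch: the enum and function names are hypothetical and do not appear in HotSpot, and it assumes the threshold comparison is inclusive (inputs of exactly the limit take the PMULL path), which matches "greater than 383" for the old limit of 384.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; this models the dispatch described in the
// PR text, not the actual MacroAssembler code.
enum class CpuModel { Ampere1, NeoverseV1, NeoverseV2, Other };
enum class CrcPath { ScalarCrcLoops, CryptoPmullFold };

// Default threshold per the PR description: 256 in general, 384 on
// Neoverse V1/V2 to preserve the previous behavior there.
constexpr uint32_t default_low_limit(CpuModel cpu) {
  return (cpu == CpuModel::NeoverseV1 || cpu == CpuModel::NeoverseV2) ? 384u : 256u;
}

// Assumption: inputs of at least low_limit bytes take the PMULL folding
// path; shorter inputs stay on the CRC32-instruction loops.
constexpr CrcPath select_path(uint32_t len, uint32_t low_limit) {
  return len >= low_limit ? CrcPath::CryptoPmullFold : CrcPath::ScalarCrcLoops;
}
```

Under this model a 256-byte input moves to the PMULL path on Ampere1 while staying on the scalar loops on V1/V2, which is the behavior split the new option is meant to allow.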

The performance regressions and improvements were measured with the following microbenchmarks:
org.openjdk.bench.java.util.TestCRC32.testCRC32Update
org.openjdk.bench.java.util.TestCRC32C.testCRC32CUpdate

Ran the following JTReg tests on Ampere1 and did not find problems:
test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32.java
test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8358032: Use crypto pmull for CRC32(C) on Ampere CPU and improve for short inputs (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25609/head:pull/25609
$ git checkout pull/25609

Update a local copy of the PR:
$ git checkout pull/25609
$ git pull https://git.openjdk.org/jdk.git pull/25609/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 25609

View PR using the GUI difftool:
$ git pr show -t 25609

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25609.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jun 3, 2025

👋 Welcome back lliu! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jun 3, 2025

@limingliu-ampere This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8358032: Use crypto pmull for CRC32(C) on Ampere CPU and improve for short inputs

Reviewed-by: aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 350 new commits pushed to the master branch.

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@eme64, @theRealAph) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot changed the title from "JDK-8358032" to "8358032: Use crypto pmull for CRC32/CRC32C intrinsics on Ampere CPU" Jun 3, 2025
@openjdk openjdk bot added the rfr (Pull request is ready for review) label Jun 3, 2025
@openjdk

openjdk bot commented Jun 3, 2025

@limingliu-ampere The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge

mlbridge bot commented Jun 3, 2025

Webrevs

Comment on lines 92 to 95
product(intx, CryptoPmullForCRC32LowLimit, 256, \
"Minimum size in bytes when Crypto PMULL will be used." \
"Value must be a multiple of 128.") \
range(256, max_jint) \
Contributor

This shouldn't be a general product flag.

Suggested change
- product(intx, CryptoPmullForCRC32LowLimit, 256, \
- "Minimum size in bytes when Crypto PMULL will be used." \
- "Value must be a multiple of 128.") \
- range(256, max_jint) \
+ product(intx, CryptoPmullForCRC32LowLimit, 256, DIAGNOSTIC, \
+ "Minimum size in bytes when Crypto PMULL will be used." \
+ "Value must be a multiple of 128.") \
+ range(256, max_jint) \

Contributor

@eme64 eme64 left a comment

@limingliu-ampere Thanks for working on this! 😊

Generally looks reasonable to me as a non-expert in crypto intrinsics. But we definitely need an expert to approve this in the end. I have a few comments below.

Also: it would be nice to have a sanity test where you use that new flag. It could also be an additional run in an existing test (that's probably even better). You may want to run it with a few different values, including non-multiples of 128, just to sanity check the alignment correction as well. I don't know how much runtime that would add, so that should be checked before going too crazy. Having different values for the flag helps us simulate the behavior of other hardware, for example, and that can be quite useful in general. What do you think?
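For readers unfamiliar with the alignment correction mentioned here, a minimal standalone sketch follows. It assumes the correction rounds the flag down to a multiple of 128, as the PR's `align_down` call suggests; the helper names, the `NormalizedLimit` struct, and the constants are hypothetical and not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constants: 128 is the multiple required by the flag
// description, 256 is the lower bound of its declared range.
constexpr uint32_t kLimitAlignment = 128;
constexpr uint32_t kMinLowLimit = 256;

// Round value down to a multiple of alignment (alignment must be a power of two).
constexpr uint32_t align_down_u32(uint32_t value, uint32_t alignment) {
  return value & ~(alignment - 1);
}

struct NormalizedLimit {
  uint32_t value;  // the limit that would actually be used
  bool adjusted;   // true if the requested value was not a multiple of 128
};

// Models the startup-time correction: misaligned values are rounded down
// and the caller can emit a warning when adjusted is true.
NormalizedLimit normalize_low_limit(uint32_t requested) {
  assert(requested >= kMinLowLimit);  // range check is assumed to happen earlier
  const uint32_t aligned = align_down_u32(requested, kLimitAlignment);
  return NormalizedLimit{aligned, aligned != requested};
}
```

Values such as 266 or 383 would both normalize to 256 in this model, so a sanity test with non-multiples of 128 mostly exercises the warning path rather than a distinct threshold.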

Comment on lines 92 to 95
product(intx, CryptoPmullForCRC32LowLimit, 256, DIAGNOSTIC, \
"Minimum size in bytes when Crypto PMULL will be used." \
"Value must be a multiple of 128.") \
range(256, max_jint) \
Contributor

Is it sane to have negative values? If not, use uintx... or maybe even just uint?

@@ -4332,7 +4332,7 @@ void MacroAssembler::kernel_crc32_using_crypto_pmull(Register crc, Register buf,
   Label CRC_by4_loop, CRC_by1_loop, CRC_less128, CRC_by128_pre, CRC_by32_loop, CRC_less32, L_exit;
   assert_different_registers(crc, buf, len, tmp0, tmp1, tmp2);

-  subs(tmp0, len, 384);
+  subs(tmp0, len, CryptoPmullForCRC32LowLimit);
Contributor

Would it make sense to have another alignment sanity check here? It would be both helpful to make sure nobody later breaks your assumption, and could also be helpful for the reader to see the 128 alignment immediately.

Member Author

I think the alignment does not affect the correctness here, but the value should be >= 256. So I added the corresponding assertion above.

crc32x(crc, crc, tmp2);
subs(len, len, 32);
Contributor

What is the point of these changes?

Contributor

To be more precise: converting these adjustments to post-increment operations isn't obviously an improvement on AArch64 generally. How does it help?

Member Author

According to perf, post-increment ops help to reduce the access to TLB on Ampere1 in this case.

Contributor

> According to perf, post-increment ops help to reduce the access to TLB on Ampere1 in this case.

Hmm, but it's code in a rather odd style in shared code. And from what I see, the intrinsic is only 22% of the runtime (for 128 bytes) anyway, and you're making the code larger. I certainly don't want to see this sort of thing proliferating in the intrinsics.

In general, it's up to CPU designers to make simple, straightforward code work well.

How important is this?

Contributor

On the other hand this code already exists in CRC32C, so it's simply unifying the two routines. OK, I won't object.

Member Author

> you're making the code larger.

I don't think this makes the code larger.

> How important is this?

As I mentioned in problem 1, without this change there is a regression (~-14%) on Ampere1 when handling 64 bytes. There are no obvious effects in other cases, though.

> so it's simply unifying the two routines.

Yes.

@openjdk openjdk bot added the ready (Pull request is ready to be integrated) label Jun 9, 2025
@limingliu-ampere
Member Author

/integrate

@openjdk openjdk bot added the sponsor (Pull request is ready to be sponsored) label Jun 11, 2025
@openjdk

openjdk bot commented Jun 11, 2025

@limingliu-ampere
Your change (at version df9f920) is now ready to be sponsored by a Committer.

@theRealAph
Contributor

> Generally looks reasonable to me as a non-expert in crypto intrinsics. But we definitely need an expert to approve this in the end. I have a few comments below.

I'm happy enough, and it also seems to bring improvements on Apple silicon.

However, the title is misleading: while this PR may have started as something purely Ampere-specific, it no longer is. Something like "AArch64: CRC32/CRC32 enhancements" is perhaps rather vague, but at least it's true.

@limingliu-ampere
Member Author

Ping.

@theRealAph
Contributor

> Ping.

When you reply to my last point.

@limingliu-ampere
Member Author

limingliu-ampere commented Jun 20, 2025

>> Ping.

> When you reply to my last point.

Does this mean changing the title? But I don't see how the patch helps Apple, since the patch does not affect the default behavior on Apple. There would be changes when enabling UseCryptoPmullForCRC32 for 383 bytes or smaller. So, are the improvements from this?

@theRealAph
Contributor

>>> Ping.

>> When you reply to my last point.

> Does this mean changing the title? But I don't see how the patch helps Apple, since the patch does not affect the default behavior on Apple. There would be changes when enabling UseCryptoPmullForCRC32 for 383 bytes or smaller. So, are the improvements from this?

This PR does not only enable crypto pmull for CRC32/CRC32C intrinsics on Ampere CPU, it also improves the algorithm for some cases of short arrays. I made a suggestion for a title that accurately describes this PR. Feel free to write your own accurate title, or use mine.

@openjdk openjdk bot removed the sponsor (Pull request is ready to be sponsored) and ready (Pull request is ready to be integrated) labels Jun 23, 2025
@limingliu-ampere limingliu-ampere changed the title from "8358032: Use crypto pmull for CRC32/CRC32C intrinsics on Ampere CPU" to "8358032: Use crypto pmull for CRC32(C) on Ampere CPU and improve for short inputs" Jun 23, 2025
@openjdk openjdk bot added the sponsor (Pull request is ready to be sponsored) and ready (Pull request is ready to be integrated) labels Jun 23, 2025
Contributor

@eme64 eme64 left a comment

Looks like this is progressing nicely here :)

I have 2 comments below. Once you addressed them, I'll run some internal testing, and then I can approve it :)

@@ -89,6 +89,10 @@ define_pd_global(intx, InlineSmallCode, 1000);
"Use CRC32 instructions for CRC32 computation") \
product(bool, UseCryptoPmullForCRC32, false, \
"Use Crypto PMULL instructions for CRC32 computation") \
product(uint, CryptoPmullForCRC32LowLimit, 256, DIAGNOSTIC, \
Contributor

Can you please add a test that uses this flag, and sets it to some selected values, and maybe even a random value?

Contributor

Is there already an IR test that checks for the presence of the crypto pmull? That could be good to ensure it occurs as expected and only when expected :)

Member Author

@limingliu-ampere limingliu-ampere Jun 24, 2025

There are test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32.java and TestCRC32C.java. They cover various lengths for the input, and test the intrinsics with the default value of the flag. But they do not cover different values of the flag, which I think could be covered by VM_OPTIONS. I feel that it is not suitable to add the flag in the @run tag, since it is aarch64 specific while the test is generic.
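To make the coverage question concrete, here is a sketch of what a flag-aware check could compare against: a bitwise reference CRC plus a list of input lengths that straddle a given threshold. None of this comes from the JTReg tests; the function names are hypothetical, and the polynomials are the standard reflected constants for CRC-32 and CRC-32C.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Standard reflected polynomials.
constexpr uint32_t kCrc32Poly  = 0xEDB88320u;  // CRC-32 (zlib, java.util.zip.CRC32)
constexpr uint32_t kCrc32cPoly = 0x82F63B78u;  // CRC-32C (Castagnoli)

// Bitwise reference update, one bit at a time. Slow but easy to audit.
// Like zlib's crc32(), calls can be chained by passing the previous result.
uint32_t crc_update(uint32_t crc, const uint8_t* data, size_t len, uint32_t poly) {
  crc = ~crc;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      // Shift right; XOR in the polynomial if the bit shifted out was set.
      crc = (crc >> 1) ^ (poly & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Convenience wrapper for string inputs (hypothetical helper).
uint32_t crc_of_string(const std::string& s, uint32_t poly) {
  return crc_update(0, reinterpret_cast<const uint8_t*>(s.data()), s.size(), poly);
}

// Lengths on both sides of a threshold, plus one full 128-byte block beyond it.
std::vector<size_t> boundary_lengths(size_t low_limit) {
  return {low_limit - 1, low_limit, low_limit + 1, low_limit + 127, low_limit + 128};
}
```

A test run with a particular flag value could feed buffers of each boundary length to the intrinsic and compare against `crc_update`; whether that belongs in a `@run` tag or in externally supplied VM options is the open question in this thread.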

Comment on lines +123 to +126
if (!(is_aligned(CryptoPmullForCRC32LowLimit, 128))) {
warning("CryptoPmullForCRC32LowLimit must be a multiple of 128");
CryptoPmullForCRC32LowLimit = align_down(CryptoPmullForCRC32LowLimit, 128);
}
Contributor

Can you describe somewhere why it has to be a multiple of 128? Imagine someone comes across this later, and wonders if that is just some strange implementation limitation or something more fundamental, or something very subtle.

Member Author

There are 4 kinds of loops, labeled CRC_by128_loop, CRC_by32_loop, CRC_by4_loop and CRC_by1_loop. If the flag is 266, which is 128x2+10, then for 265 bytes of input, 256 bytes are handled by CRC_by32_loop, while for 266 bytes of input, the corresponding 256 bytes are handled by CRC_by128_loop, and I think this causes an inconsistency. If CRC_by32_loop handles 256 bytes better than CRC_by128_loop on a platform, it should be used for 266 bytes as well.
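The 265-versus-266 example can be worked through with a deliberately simplified model of how an input is split across the four loops. The struct and function below are hypothetical; the model assumes that at or above the limit whole 128-byte blocks go to CRC_by128_loop and that the remainder cascades through the 32-, 4- and 1-byte loops, which ignores any prologue or epilogue details of the real kernels.

```cpp
#include <cassert>
#include <cstdint>

// Bytes handled by each loop in the simplified model.
struct LoopBytes {
  uint32_t by128;
  uint32_t by32;
  uint32_t by4;
  uint32_t by1;
};

// Hypothetical model of the loop cascade, parameterized by the flag value.
LoopBytes split_by_loop(uint32_t len, uint32_t low_limit) {
  LoopBytes r{0, 0, 0, 0};
  if (len >= low_limit) {
    r.by128 = len / 128 * 128;  // whole 128-byte blocks use the PMULL loop
    len -= r.by128;
  }
  r.by32 = len / 32 * 32;
  len -= r.by32;
  r.by4 = len / 4 * 4;
  len -= r.by4;
  r.by1 = len;
  return r;
}
```

With a limit of 266, this model sends the leading 256 bytes of a 265-byte input to the 32-byte loop but the same 256 bytes of a 266-byte input to the 128-byte loop. With a limit of 256, both inputs use the 128-byte loop for those bytes, which is the consistency argument made above.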

Labels
hotspot [email protected]; ready (Pull request is ready to be integrated); rfr (Pull request is ready for review); sponsor (Pull request is ready to be sponsored)

3 participants